How-to do Sentiment Analysis with Flair in Python

00:00:00.000 | Hi, and welcome to this video on sentiment analysis using the Flare library.

00:00:06.660 | So Flare is an incredibly simple, easy-to-use library, which contains a load of pre-built

00:00:12.880 | models for NLP that we can simply import and use to make predictions.

00:00:19.480 | So it actually allows us to use some of the most powerful models out there as well.

00:00:24.580 | So in this tutorial, we're going to be using the Distilbert model, which is based on a

00:00:29.940 | BERT, but it's a lot smaller, but almost as powerful as BERT itself.

00:00:36.420 | So we're going to go ahead and begin.

00:00:39.580 | First, if you haven't already, you need to pip install Flare.

00:00:46.700 | And alongside Flare, you are also going to need PyTorch.

00:00:50.820 | If you haven't got PyTorch installed already, you'll need to head over to the PyTorch website.

00:00:58.820 | And they give you instructions on exactly what you need to install.

00:01:03.340 | So we come down to here and we can see, okay, for me, I have Windows.

00:01:08.180 | I want to install using Conda, using Python, and then CUDA.

00:01:14.100 | So this is if you have a CUDA-enabled GPU on your machine.

00:01:19.000 | If you don't know what that means, you probably don't.

00:01:23.780 | So in that case, just click none.

00:01:26.740 | But for me, I have 10.2.

00:01:29.860 | So all we need to do is copy the command underneath here, and then we would run this in our Anaconda

00:01:40.140 | prompt.

00:01:41.140 | I already have these installed, so I'm going to go ahead and actually begin coding.

00:01:47.500 | So we're going to need to use Pandas and also Flare.

00:01:56.960 | So now we have imported Flare, we can actually import a sentiment model straight away.

00:02:04.020 | So all we need to do is we want to pass our sentiment model to a variable, which we will

00:02:12.660 | call sentiment model.

00:02:14.780 | And we just need to write Flare.models.textClassifier and load.

00:02:29.520 | And then in here, we pass the model name that we would like to load.

00:02:34.960 | And in our case, it will be the English sentiment model, which is en-sentiment.

00:02:40.940 | Okay, so now we are downloading the model.

00:02:53.220 | And in a moment, that will have downloaded and we can begin using it.

00:02:58.480 | Now obviously, we need data.

00:03:01.120 | I have downloaded some data here, which is a sentiment data set based on the, I think

00:03:11.160 | it's the IMDB Movie Reviews.

00:03:15.260 | So you can find the same data set over here.

00:03:19.000 | Okay, so it's Sentiment Analysis on Movie Reviews data set, so it's from Rotten Tomatoes.

00:03:26.200 | And you scroll down and we have the training data and test data here.

00:03:31.400 | I'm just going to use the test data, but we can use either.

00:03:35.320 | We're just going to be making predictions based on the phrase here.

00:03:43.800 | So we need to read in our data.

00:03:47.400 | So it's going to read it in as if it were a CSV file, and we will just pass a tab as

00:03:54.160 | our separator because we are actually working with a tab-separated file.

00:04:06.040 | Okay, so here, it's actually a CSV, not CSV.

00:04:21.000 | Okay, so the first thing you'll notice is that we actually have duplicates of the same

00:04:28.440 | phrase.

00:04:29.440 | That is actually just how this data set is.

00:04:34.000 | It just contains the full phrase initially.

00:04:38.000 | So this first entry here is the full phrase, and then all of these following it are actually

00:04:44.120 | parts of that phrase.

00:04:47.000 | So what we can do, so let's change it so we can actually see the full phrase first.

00:04:59.440 | Okay, so we can't really see that much more anyway, but that's fine.

00:05:14.400 | So to remove this, we just want to drop all of the duplicates whilst keeping the first

00:05:20.840 | instance of the sentence ID.

00:05:22.860 | So you see each one of these, they all have the same sentence ID.

00:05:26.960 | It's actually only the first one that we need.

00:05:29.960 | So we just drop duplicates on this column, keeping the first entry.

00:05:53.520 | Okay so we're keeping the first entry, dropping duplicates from sentence ID, and we're just

00:05:58.900 | doing this operation in place.

00:06:04.440 | Okay so now we can see each sample is now a unique entry.

00:06:10.920 | Okay so now our data is ready.

00:06:14.580 | So we need to actually first convert our text into a tokenized list using Flare.

00:06:23.720 | So Flare does this one sentence at a time.

00:06:27.740 | So if we, for example, pass Hello World into the Flare tokenizer, we will be able to see

00:06:38.940 | what it's actually doing.

00:06:49.420 | Okay so here we can see that it's split each one of these into tokens.

00:06:55.340 | So we've got Hello as a token, World as a token, and then we have also split the exclamation

00:07:01.620 | mark at the end there.

00:07:04.460 | And you can see that Flare is telling us that there are a total of three tokens there.

00:07:09.100 | So each one of our samples here will need to be processed by this Flare.data.sentence

00:07:16.140 | method before we pass it into the actual model.

00:07:22.380 | Once we do have this, so let's call this Sample as well, we will pass it to our model for

00:07:33.580 | prediction, which is really easy, all we need to do is call the predict method on the sample.

00:07:45.980 | And now this doesn't output anything, instead it actually just modifies the sentence object

00:07:53.660 | that we have produced, so it modifies Sample.

00:07:58.260 | And we can see now that our Sample, we started a sentence and we started a number of tokens,

00:08:03.060 | but we also have these additional labels, which are the predictions.

00:08:09.020 | We have the label, which is positive, which means it's a happy or it's a positive sentiment.

00:08:16.140 | And then what we have here is actually the probability or the confidence in that prediction.

00:08:25.140 | That's great, but realistically we want to be extracting these labels.

00:08:31.360 | So we're actually able to extract these by accessing the labels method.

00:08:39.580 | So we have labels here and this produces the positive and the confidence.

00:08:46.160 | To access each one of these we access index 0 followed by dot value.

00:08:58.340 | So this will give us the positive.

00:09:03.480 | And then we can also do the same to get the confidence, called score, like that.

00:09:13.400 | So what we can do now is just create a simple for loop that will go through each sample

00:09:20.180 | in our test data and assign a probability for each one.

00:09:26.960 | So we will initially create a sentiment and confidence list.

00:09:39.120 | Then we will just, as we are looping through the data, we will append our sentiment value,

00:09:44.880 | so the positive or negative, and the confidence to each one of these lists.

00:10:10.820 | So here we are first tokenizing our sentence.

00:10:18.060 | Then we are making a prediction using that tokenized sentence, which we are calling sample.

00:10:26.740 | And as we did before, we have now got this labeled sentence and we just need to extract

00:10:33.220 | the two labels that we have here.

00:10:57.820 | Okay so we can see here that one of our sentences was just blank.

00:11:02.580 | So we will add in some logic to avoid any errors there.

00:11:30.940 | Okay so looking at this, it's also whenever there's a space as well.

00:11:36.140 | So we just need to trim this, which we can do easily using the strip method.

00:11:47.820 | Okay so it took a little bit of time, but we now have our predictions.

00:11:53.180 | So what we want to do is actually add what we have here in the sentiment and confidence

00:11:59.780 | list to our data frame.

00:12:02.020 | So to do that, we just add df sentiment to create a new sentiment column and we made

00:12:13.300 | that equal to the sentiment list that we have created.

00:12:17.100 | And we also do the same for confidence as well.

00:12:29.980 | Now we can see our data frame.

00:12:35.340 | Okay so initially looking at this, it looks pretty good.

00:12:38.760 | So intermittently pleasing, but mostly routine effort.

00:12:43.460 | Occasionally negative, but basically saying it's occasionally okay, but generally nothing

00:12:49.260 | special.

00:12:50.260 | So obviously it's a negative sentiment, which is matched up to negative sentiment here.

00:12:55.660 | Here we're saying okay Kidman's the only thing that's worth watching in Birthday Girl.

00:13:01.140 | And it says another example of the sad decline of British comedies in the post-Full Monty

00:13:05.780 | world.

00:13:06.780 | Fair enough.

00:13:08.460 | Also negative.

00:13:10.220 | So this one is our first positive, once you get into it, it's relevant, the movie becomes

00:13:15.340 | a heady experience.

00:13:16.340 | Yeah, I mean it sounds pretty positive to me.

00:13:19.620 | So it's quite good.

00:13:21.720 | Even here where we're not saying anything particularly like a negative or positive word,

00:13:27.220 | we're just saying that the movie is, or the movie delivers on the performance of striking

00:13:33.380 | skill and depth, which must be pretty hard for a machine to understand and actually get

00:13:40.200 | it right.

00:13:41.400 | But looking at all these, it's doing really well.

00:13:44.420 | And I think it's really cool that we can actually do this with so little effort, and we've only

00:13:50.740 | actually written a few lines of code in reality.

00:13:54.340 | And it's producing really good, accurate results, which is really impressive to me.

00:13:59.040 | So that's it for this video, I hope it's been useful.

00:14:04.020 | And thank you for watching, and I will see you again in the next one.

How-to do Sentiment Analysis with Flair in Python

Chapters