Sentiment Analysis on ANY Length of Text With Transformers (Python)
Chapters
0:03 apply sentiment analysis to longer pieces of text
2:59 using bert for sequence classification
3:04 import the bert for sequence classification model
3:50 open the hugging face transformers models page
4:07 filter by text classification
5:38 add special tokens
6:12 add 412 padding tokens
9:49 split each of our tensors
9:58 split those into chunks of length 510
10:58 split our tensor into batches
11:16 print out the length of each one of our tensors
12:36 pass a list of all the tensors
14:22 add the cls and separator tokens
16:15 check the length of each one
18:34 print out the length of each one of those tensors
20:28 printing out the input id chunks
20:48 stack our input ids and attention mask tensors
23:50 add softmax onto the end
24:49 take a softmax function across each one of these outputs
26:25 take the argmax of the mean
00:00:00.000 |
In this video, we're going to take a look at how we can apply sentiment analysis to longer pieces 00:00:07.360 |
of text. So if you've done this sort of thing before, in particular with transformers or even 00:00:12.960 |
LSTMs or any other architecture in NLP, you will find that we have an upper limit on the number of 00:00:19.840 |
words that we can consider at once. In this tutorial, we're going to be using BERT, which 00:00:25.440 |
is a transformer model. And at max, that consumes 512 tokens. Anything beyond that is just truncated, 00:00:33.120 |
so we don't consider anything beyond that limit. Now, in a lot of cases, maybe, for example, 00:00:39.760 |
you're analyzing the sentiment of tweets, it's not a problem. But when we start to look at maybe news 00:00:46.800 |
articles or Reddit posts, they can be quite a bit longer than just 512 tokens. And when I say 00:00:55.200 |
tokens, a token typically maps to a word or punctuation. So what I want to explore in this video is how we can 00:01:03.200 |
actually remove that limitation and just consume as many tokens or words as we'd like and still get 00:01:11.280 |
an accurate sentiment score whilst considering the full length of text. And at a high level, 00:01:18.400 |
this is essentially what we are going to be doing. We're going to be taking the original tensor, 00:01:24.560 |
which is the 1361 tokens, and we're going to split it into different chunks. So we have chunk one, 00:01:30.880 |
chunk two, and chunk three here. Now, in the end we want all of these chunks 00:01:37.120 |
to be 512 tokens long. And you can see with chunk one and chunk two, they are 512 already. 00:01:46.640 |
However, of course, 1361 can't be evenly split into 512. So the final chunk will be shorter. 00:01:55.200 |
And once we have split those into chunks, we will need to add padding, and we need to add the 00:02:02.960 |
start of sequence and separator tokens. If that is new to you, then don't worry, 00:02:07.120 |
we'll explain that very soon. And then we calculate the sentiment for each one of those, 00:02:11.520 |
take the average, and then use that as a sentiment prediction for the entire text. 00:02:16.880 |
And that's essentially a high level what we're going to be doing. But that is much easier said 00:02:22.400 |
than done. So let's just jump straight into the code and I'll show you how we actually do this. 00:02:27.840 |
Okay, what we have here is a post from the investing subreddit. It's pretty long, 00:02:34.640 |
I think it's something like 1300 tokens when we tokenize it. And obviously, that is far beyond the 00:02:42.240 |
512 token limit that we have with BERT. So if we want to consider the full text, we obviously have 00:02:50.800 |
to do something different. And the first thing that I think we want to do is actually initialize our 00:02:58.000 |
model and tokenizer. Because we're using BERT for sequence classification, we will import the 00:03:05.280 |
BERT for sequence classification model or class. And we are importing that from the Transformers 00:03:12.000 |
library. So that is going to be our model class. And then we also need the tokenizer as well, 00:03:26.800 |
which is just a generic BERT tokenizer. So those two are our imports. And then we actually need to 00:03:33.520 |
initialize the tokenizer and the model. So the BERT tokenizer is pretty straightforward. 00:03:41.280 |
And then we are going from pre-trained. So we're using a pre-trained model here. 00:03:49.840 |
And if we just open the HuggingFace Transformers models page, so HuggingFace.co/models. 00:03:58.400 |
And we can head over here and we can actually search for the model that we'd like to use. 00:04:04.960 |
We're doing text classification, so we head over here and filter by text classification. 00:04:10.080 |
And then the investing subreddit is basically full of financial advice. So we really want to, 00:04:17.520 |
if possible, use a more financially savvy BERT model, which we can find with FinBERT. 00:04:25.520 |
And we have two options for FinBERT here. I'm going to go with the ProsusAI FinBERT model. 00:04:30.400 |
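As a rough sketch, assuming the ProsusAI/finbert checkpoint on the Hugging Face hub, the initialization that follows looks something like this:

from transformers import BertForSequenceClassification, BertTokenizer

# load the financial-domain FinBERT checkpoint for both the tokenizer and the model
tokenizer = BertTokenizer.from_pretrained('ProsusAI/finbert')
model = BertForSequenceClassification.from_pretrained('ProsusAI/finbert')
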
And all we actually need is this text here. We go back to our code and we'll just enter it here. 00:04:37.520 |
So ProsusAI, then the slash, and we just want all of this on the same line, like that. And we're also 00:04:45.520 |
going to be using the same model for our BERT for sequence classification. So BERT sequence 00:04:58.400 |
classification. And we do the from pre-trained ProsusAI FinBERT again. And that's all we need 00:05:06.560 |
to do to actually initialize our model and tokenizer. And now we're ready to actually 00:05:11.280 |
tokenize that input text. So when it comes to tokenizing input text, for those of you 00:05:16.960 |
that have worked with transformers before, it typically looks something like this. 00:05:22.240 |
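Roughly like the sketch below; each argument is explained next (the exact padding and truncation argument names vary a little between transformers versions, and txt is assumed to hold the post text):

tokens = tokenizer.encode_plus(
    txt,
    add_special_tokens=True,    # add [CLS], [SEP] and any padding tokens
    max_length=512,             # BERT's upper limit
    truncation=True,            # cut off anything beyond 512 tokens
    padding='max_length'        # pad shorter inputs up to 512
)
# tokens is a dict with 'input_ids', 'token_type_ids' and 'attention_mask'
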
So we write tokens or whichever variable name you'd like to use. We use tokenizer, encode plus. 00:05:34.640 |
We pass our text here. We add special tokens. So this is the CLS, 00:05:43.760 |
separator tokens, padding tokens. So anything from this list here. So all these tokens are used 00:05:55.840 |
specifically within BERT for different purposes. So we have padding token, which we use when a 00:06:01.840 |
sequence is too short. So BERT always requires that we have 512 tokens within our inputs. If we 00:06:10.160 |
are feeding in 100 tokens, then we add 412 padding tokens to fill that empty space. Unknown is just 00:06:18.720 |
when a word is unknown to BERT. And then we have the CLS token here. And this appears at the start 00:06:24.640 |
of every sequence. And the token ID for this is 101. So we'll be using this later. So it's 00:06:31.040 |
important to remember that number. And then we also have the SEP token, which indicates the 00:06:35.360 |
separator, which indicates the point between our input text and the padding. Or if there is no 00:06:42.800 |
padding, it would just indicate the end of the text. And they're the only ones that we really 00:06:47.600 |
need to be concerned about. So typically, we have those special tokens in there because BERT does 00:06:54.480 |
need them. We specify a max length, which is the 512 tokens that BERT would expect. And then we 00:07:02.320 |
say anything beyond that we want to truncate. And anything below that we want to pad up to the max 00:07:09.280 |
length. And this is typically what our tokens will look like. So now we have, it's a dictionary, 00:07:19.040 |
we have input IDs. We have these token type IDs, which we don't need to worry about. And we have 00:07:24.560 |
the attention mask. And that's typically what we would do. But in this case, we are doing things 00:07:31.680 |
slightly different. Because one, we don't want to add those special tokens immediately. Because if 00:07:36.640 |
we add those special tokens, we have a CLS or start of sequence token at the start, and we also have a 00:07:43.200 |
separator token at the end of our tensor. And we don't want that because we're going to be 00:07:50.880 |
splitting our tensor up into three smaller tensors. So we actually don't want to add those yet, 00:07:56.480 |
we're going to add those manually later. And then we also have this max length, truncation, and padding. 00:08:01.520 |
Obviously, we actually don't want to be using any of these because if we truncate our 1300 token 00:08:09.680 |
text into just 512, then that's just what we would normally do. We're not actually considering the 00:08:15.440 |
whole text, we're just considering the first 512 tokens. So clearly, we also don't want any of 00:08:21.440 |
those variables in there. In our case, we actually do something slightly different. We still use the 00:08:27.600 |
encode plus method. So tokenizer, encode plus. We also include the text. This time, we want to specify 00:08:43.440 |
that we don't want to add those special tokens. So we set that to false. And that's actually it, 00:08:51.920 |
we don't want to include any of those other arguments in there. The only extra parameter 00:08:56.800 |
that we do want to add, which we want to add whenever we're working with PyTorch, is we want 00:09:03.200 |
to add return tensors equals PT. And this just tells the tokenizer to return PyTorch tensors. 00:09:13.760 |
Whereas here, what we had are actually just simple Python lists. And if we're using TensorFlow, 00:09:21.520 |
we switch this over to TF. In our case, using PyTorch. And let's just see what that gives us. 00:09:29.840 |
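In code, that call looks roughly like this (again, txt is assumed to be the full post text):

tokens = tokenizer.encode_plus(
    txt,
    add_special_tokens=False,   # we'll add [CLS]/[SEP] to each chunk ourselves later
    return_tensors='pt'         # return PyTorch tensors instead of Python lists
)
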
Okay, so here we get a warning about the sequence length. And that's fine, because we're going to 00:09:36.400 |
deal with that later. And then in here, we can see, okay, now we have PyTorch tensors rather 00:09:42.720 |
than the list that we had before, which is great, that's what we want. Now we have that, we actually 00:09:48.720 |
want to split each of our tensors, or the input IDs and the attention mask tensors, we don't need 00:09:55.200 |
to do anything with the token type IDs, we can get rid of those. We want to split those into chunks 00:10:01.520 |
of length 510. So the reason we're using 510, rather than 512, is because at the moment, 00:10:09.280 |
we don't have our CLS and separator tokens in there. So once we do add those, that will push 00:10:15.440 |
the 510 up to 512. So to split those into those chunks, it's actually incredibly easy. So we'll 00:10:24.640 |
just write input ID chunks. And we need to access our tokens dictionary. So tokens, and then we want 00:10:35.600 |
to access the input IDs here. And you'll see here that this is actually a tensor, it's almost like 00:10:42.720 |
a list within a list. So to access that, we want to access a zero index of that. And then we're 00:10:49.920 |
just going to split, which is a PyTorch method by 510. And that is literally all we need to do 00:10:57.680 |
to split our tensor into batches. And we repeat this again, but for the mask, 00:11:07.520 |
and it just changes to attention mask. Again, we don't need the token type IDs, 00:11:14.560 |
so we can just ignore that. And then let's just print out the length of each one of our tensors 00:11:20.160 |
here. So for tensor, and input ID chunks, just print the length of it. So we can check that we 00:11:31.600 |
are actually doing this correctly. So we can see we have 510, 510. And the last one is shorter, 00:11:40.080 |
of course, like we explained before, at 325. So that's pretty ideal, that's what we want. 00:11:48.080 |
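Put together, the splitting step is roughly:

# drop the batch dimension, then split into chunks of up to 510 tokens each
input_id_chunks = tokens['input_ids'][0].split(510)
mask_chunks = tokens['attention_mask'][0].split(510)

for tensor in input_id_chunks:
    print(len(tensor))   # prints 510, 510, 325 in this example
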
And now we can move on to adding in our CLS and separator tokens. I'll just show you how this is 00:11:55.920 |
going to work. So I'm going to use a smaller tensor quickly, just as an example. 00:12:03.520 |
So we just need to also import torch. So we do that here. 00:12:19.840 |
Okay, so we have this tensor. And to add a value on either side of that, we can use the torch 00:12:26.560 |
cat method, which is for concatenating multiple tensors. In this case, we'd use torch cat. 00:12:34.160 |
And then we just pass a list of all the tensors that we would like to include here. 00:12:40.720 |
Now, we don't have a tensor for our token, so we just create it within this list. 00:12:48.480 |
And that's very easy, we just use torch tensor. 00:12:51.520 |
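As a quick standalone example (a here is just a stand-in toy tensor):

import torch

a = torch.tensor([1, 2, 3])          # stand-in for one of our chunks
b = torch.cat([
    torch.tensor([101]),             # [CLS] token ID
    a,
    torch.tensor([102])              # [SEP] token ID
])
print(b)                             # tensor([101, 1, 2, 3, 102])
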
And then if you remember before, the CLS token is the equivalent of 101, when it's converted 00:13:00.880 |
to the token ID. So that's going to come at the start of our tensor. And in the middle, 00:13:06.880 |
we have our actual tensor. And at the end, we want to append our 102 tensor, which is the 00:13:15.280 |
separator token. Okay, and we just print that out, we can see, okay, we've got 101, 00:13:24.320 |
and then we have our sequence and 102 at the end. Then after we add our CLS and separator tokens, 00:13:32.560 |
we will use the same method for our padding as well. But we want to write this logic within a 00:13:39.360 |
for loop, which will iterate through each chunk and process each one individually. So first, 00:13:46.080 |
I'm going to create a variable to define the chunk size, which is going to be 512, which is our 00:13:54.640 |
target size. And we already split our tokens into chunks up here. So we can just iterate through 00:14:04.240 |
each one of those. So we'll just go through a range of the length of the number of chunks that 00:14:12.640 |
we have, this will go 0, 1, and 2. And now we can access each chunk using the I index here. 00:14:20.640 |
So first, we want to add the CLS and separator tokens, just like we have above. So to do that, 00:14:29.520 |
we go input ID chunks, we get the current index, and then just do torch cat, which is just 00:14:41.360 |
concatenate. And then we pass a list just like we did before, which is going to be torch tensor. 00:14:50.560 |
And then in the middle, we have A, we're going to replace that with this. 00:14:58.480 |
Okay, and then we want to do the same for our attention mask. But of course, 00:15:03.040 |
in our attention mask, if we look up here, it's just full of ones. And the only two values that 00:15:09.760 |
we can actually have in our attention mask is either 1 or 0. And the reason for this is whenever 00:15:17.040 |
we have a real token that BERT needs to pay attention to, we have a 1 in this attention 00:15:24.320 |
mask. Whereas if you have a padding token, that will correspond to a 0 in this attention mask. 00:15:30.880 |
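As a small hypothetical illustration (the token IDs here are made up, not real vocabulary entries):

# a five-token input padded out with three padding tokens
input_ids      = torch.tensor([101, 7592, 2088, 999, 102, 0, 0, 0])
attention_mask = torch.tensor([  1,    1,    1,   1,   1, 0, 0, 0])
# 1 = real token BERT should attend to, 0 = padding BERT should ignore
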
And the reason for this is just so BERT doesn't process attention for the padding tokens within 00:15:38.800 |
our inputs. So it's essentially like telling BERT to just ignore the padding. So in our case here, 00:15:45.680 |
both of these are not padding tokens. So both of them should be 1. Okay, and then that gets us 00:15:54.480 |
our sequences with the CLS and separator tokens added, and the attention mask extended to match. So now we need 00:16:03.040 |
to do the padding. And realistically with padding, we're actually only going to do that for the 00:16:08.160 |
final tensor. So what we will do to make sure that we don't try and pad the other tensors 00:16:15.440 |
is just check the length of each one. First, we'll calculate the required padding length, 00:16:23.120 |
which is just going to be equal to the chunk size minus the input ID chunk at the current index. And then we want its 00:16:35.040 |
shape at index 0. So this is like taking the length of the tensor. Okay, and for chunks 1 and 2, 00:16:43.680 |
this will just be equal to 0. Whereas for the final chunk, it will not, it will be something 00:16:48.480 |
like 150 or 200. So what we want to do is say, if the pad length is greater than 0, 00:16:58.240 |
then this is where we add our padding tokens. So first, we'll do the input ID chunk. 00:17:06.160 |
And again, we're just going to use the torch concatenate method. 00:17:13.040 |
This time, we have our input ID chunk at the start. I think it's chunks, not chunk. 00:17:25.040 |
And also here, this should be mask chunks. So let's just fix that quickly. 00:17:37.440 |
Okay. And here, we first have this, and then the parts following this need to be our padding 00:17:50.560 |
tokens. And to create those, we are going to do the torch tensor again. And then in here, 00:17:59.440 |
we're going to just add one zero in a list. But then we're going to multiply that 00:18:05.200 |
by the pad length. So if the pad length is 100, this will give us a tensor that has 100 zeros 00:18:15.520 |
inside it, which is exactly what we want. And then we'll copy and paste this 00:18:22.000 |
and do the exact same thing for our masking tensor as well. 00:18:25.760 |
Okay. So now let's just print out the length of each one of those tensors. 00:18:39.280 |
So for chunk and input ID chunks, print the length of that chunk. And then we'll also just 00:18:51.760 |
print out the final chunk as well, so we can see everything is in the right place. 00:18:56.800 |
And here, so just copy. So this here needs to have an S on the end. 00:19:05.840 |
Oh, and up here. So when we first build these, so if I just print out one of them, 00:19:14.640 |
you see that the input ID chunks is actually a tuple containing three of our tensors. 00:19:24.640 |
So what we actually want to do, I'll just close that, is before we start this whole process, 00:19:34.800 |
we just want to convert them into lists so that we can actually change the values inside. Because 00:19:41.680 |
otherwise we are trying to change the values of a tuple, which we obviously can't because 00:19:46.640 |
tuples are immutable in Python, which means you can't change the values inside them. 00:19:52.560 |
So we just convert those to lists. And then we also need to add an S on here. 00:20:01.840 |
And there we go. We finally got there. So now we can see, okay, here we have 514. 00:20:09.360 |
So let me just rerun this bit here. And then rerun this. Okay. So it's because I was running 00:20:17.840 |
it twice. It was adding these twice. So now we have 512. And then we can see we have our tensor. 00:20:28.080 |
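For reference, the whole chunk-processing loop ends up looking roughly like this, with the tuple-to-list conversion included:

chunk_size = 512

# split() returns tuples, which are immutable, so convert to lists first
input_id_chunks = list(input_id_chunks)
mask_chunks = list(mask_chunks)

for i in range(len(input_id_chunks)):
    # add the [CLS] (101) and [SEP] (102) token IDs around each chunk
    input_id_chunks[i] = torch.cat([
        torch.tensor([101]), input_id_chunks[i], torch.tensor([102])
    ])
    # the new tokens are real tokens, so extend the attention mask with 1s
    mask_chunks[i] = torch.cat([
        torch.tensor([1]), mask_chunks[i], torch.tensor([1])
    ])
    # work out how much padding the chunk needs to reach 512
    pad_len = chunk_size - input_id_chunks[i].shape[0]
    if pad_len > 0:
        # pad both the input IDs and the mask with zeros
        input_id_chunks[i] = torch.cat([
            input_id_chunks[i], torch.tensor([0] * pad_len)
        ])
        mask_chunks[i] = torch.cat([
            mask_chunks[i], torch.tensor([0] * pad_len)
        ])

for chunk in input_id_chunks:
    print(len(chunk))   # 512, 512, 512 once everything is in place
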
So this is just printing out the input ID chunks. You can see here we have all these values. 00:20:33.120 |
And this is just the final one. So you can see at the bottom we have this padding. 00:20:39.040 |
If we go up here, we have our starter sequence token 101. And down here we have the end of 00:20:44.800 |
sequence separator. So now what we want to do is stack our input IDs and attention mask tensors 00:20:52.240 |
together. So we'll create input IDs. We use torch stack for that. And that's going to be input ID 00:21:03.040 |
chunks. And then we also have the attention mask that we need to create. So we do the same thing 00:21:10.480 |
there. And that is the mask chunks. And then the format that BERT expects us to be feeding in this 00:21:21.120 |
data is a dictionary where we have key value pairs. So we have a key input IDs, which will lead 00:21:30.720 |
to our input IDs tensor here. And then another one called attention mask that will have the 00:21:36.880 |
attention mask as its value. So we'll just define that here. And this is just the format that BERT 00:21:47.200 |
expects. So the input IDs. And then we have the input IDs there. And then we also have the 00:21:54.560 |
attention mask. We have the attention mask in there. Now, as well as that, BERT expects these 00:22:04.880 |
tensors to be in a particular format. So for the input IDs, it expects them to be in a long format. So we just 00:22:11.440 |
add long onto the end of there. And then for the attention mask, it expects integers. So we just add 00:22:17.600 |
int onto the end of there. And then we just print out input dict. So we can see what we are putting 00:22:24.240 |
in there. OK, great. So that is exactly the format that we need. Now we can get our outputs. So we 00:22:32.720 |
pass these into our model as keyword arguments. So we just add these two asterisk symbols. 00:22:39.280 |
That unpacks the dictionary into keyword arguments. And then in there, we pass our input dict. And this allows 00:22:47.120 |
the function to read these keywords, take them as variables, and assign these tensors to them. 00:22:54.800 |
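As one consolidated sketch (using the common F alias for torch.nn.functional), the stacking and dictionary step, the model call, and the post-processing that's walked through next look roughly like this:

import torch.nn.functional as F

# stack the processed chunks into (num_chunks, 512) tensors
input_ids = torch.stack(input_id_chunks)
attention_mask = torch.stack(mask_chunks)

# BERT expects a dict of named tensors, with input IDs as longs and the mask as ints
input_dict = {
    'input_ids': input_ids.long(),
    'attention_mask': attention_mask.int()
}

# unpack the dict as keyword arguments
outputs = model(**input_dict)

# convert the raw logits into probabilities over the three sentiment classes
# (for this FinBERT model the columns are positive, negative, neutral, as described below)
probs = F.softmax(outputs[0], dim=-1)

# average over the chunks, then pick the winning class
mean_probs = probs.mean(dim=0)
prediction = torch.argmax(mean_probs).item()
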
So there we have our outputs. You can see here that we have these logits. These are our activations 00:23:04.880 |
from the final layer of the BERT model. And you see, OK, we have these values. What we want in 00:23:11.840 |
the end is a set of probabilities. And of course, this is not a set of probabilities, because 00:23:17.680 |
probabilities we would expect to be between the values of 0 and 1. Here we have negatives. We 00:23:22.480 |
have values that are over 1. And that's not really what we would expect. So to convert these into 00:23:30.000 |
probabilities, all we need to do is apply a softmax function to them. Now softmax is essentially 00:23:39.200 |
sigmoid but applied across a set of categorical or output classes. And to implement that, we just do 00:23:47.760 |
torch.nn.functional, and then we just add softmax onto the end there. And we need to access 00:23:55.600 |
the output logits, which is in index 0 of the outputs variable. So that is just accessing 00:24:02.320 |
this tensor here. And then we access dimension minus 1. So the dimension negative 1 is just 00:24:11.840 |
accessing the final dimension of our tensor. So in this case, the logits are a 2D tensor, 00:24:19.360 |
with one row per chunk and one column per sentiment class. Because we have dimensions 0 and 1, 00:24:26.080 |
dimension minus 1 is just dimension 1, the class dimension, if that makes sense. So 00:24:35.040 |
imagine we have 0 and 1 here. If we go here and we take negative 1, we come around here to 00:24:44.000 |
the back of the list. And that is accessing the class dimension. So that is going to take a 00:24:50.560 |
softmax function across each one of these outputs. And then we can print that out. So now we have 00:25:01.440 |
our probabilities. So the outputs of the FinBERT model, these ones here in the first column are 00:25:09.520 |
all positive. So this is the prediction of the chunks having a positive sentiment. 00:25:15.200 |
These are all negative. So the prediction of the chunk having a negative sentiment. And these are 00:25:21.440 |
all neutral, so the prediction of the chunk having a neutral sentiment. So we see here, the first and second chunks 00:25:28.560 |
are both predicted to have a negative sentiment, particularly the first one. And the final one is 00:25:34.640 |
predicted to have a positive sentiment. Now if we want to get the overall prediction, all we do is 00:25:41.520 |
take the mean. So the probabilities. And we just want to take the mean. And we take that in the 00:25:47.920 |
0 dimension, which would just go from here down, take the mean of those three, take the mean of 00:25:55.040 |
these three, and take the mean of these three as well. Print it out. And you see here, negative 00:26:03.040 |
sentiment is definitely winning here. But only just, it's pretty close to the positive. So it's 00:26:08.160 |
reasonably difficult one to understand. And this is because over here, we have mostly negative, 00:26:15.360 |
kind of negative, and mostly positive. So it's a bit of a difficult one. But negative sentiment 00:26:20.320 |
does win out in the end. Now if you'd like to get the specific category that won, we'll just take 00:26:26.560 |
the argmax of the mean. And that will give us a tensor. If we want to actually get the value out 00:26:34.480 |
of that tensor, we can just add item onto the end there. And that is it. We have taken the average 00:26:40.640 |
sentiment of a pretty long piece of text. And of course, we can just use this code and iterate 00:26:47.120 |
through it for multiple long pieces of text. And it doesn't really matter how long those pieces of 00:26:52.960 |
text are. This will still work. So I hope this has been an interesting and useful video for you. 00:27:00.880 |
I've definitely enjoyed working through this and figuring it all out. 00:27:05.040 |
So thank you very much for watching. And I will see you again in the next one.