
Sentiment Analysis on ANY Length of Text With Transformers (Python)


Chapters

0:03 apply sentiment analysis to longer pieces of text
2:59 using bert for sequence classification
3:04 import the BERT for sequence classification model
3:50 open the HuggingFace Transformers models page
4:07 filter by text classification
5:38 add special tokens
6:12 add 412 padding tokens
9:49 split each of our tensors
9:58 split those into chunks of length 510
10:58 split our tensor into batches
11:16 print out the length of each one of our tensors
12:36 pass a list of all the tensors
14:22 add the cls and separator
16:15 check the length of each one
18:34 print out the length of each one of those tensors
20:28 printing out the input id chunks
20:48 stack our input ids and attention mask tensors
23:50 add softmax onto the end
24:49 take a softmax function across each one of these outputs
26:25 take the argmax of the mean

Whisper Transcript

00:00:00.000 | In this video, we're going to take a look at how we can apply sentiment analysis to longer pieces
00:00:07.360 | of text. So if you've done this sort of thing before, in particular with transformers or even
00:00:12.960 | LSTMs or any other architecture in NLP, you will find that we have an upper limit on the number of
00:00:19.840 | words that we can consider at once. In this tutorial, we're going to be using BERT, which
00:00:25.440 | is a transformer model. And at max, that consumes 512 tokens. Anything beyond that is just truncated,
00:00:33.120 | so we don't consider anything beyond that limit. Now, in a lot of cases, maybe, for example,
00:00:39.760 | you're analyzing the sentiment of tweets, it's not a problem. But when we start to look at maybe news
00:00:46.800 | articles or Reddit posts, they can be quite a bit longer than just 512 tokens. And when I say
00:00:55.200 | tokens, a token typically maps to a word or punctuation. So what I want to explore in this video is how we can
00:01:03.200 | actually remove that limitation and just consume as many tokens or words as we'd like and still get
00:01:11.280 | an accurate sentiment score whilst considering the full length of text. And at a high level,
00:01:18.400 | this is essentially what we are going to be doing. We're going to be taking the original tensor,
00:01:24.560 | which is the 1361 tokens, and we're going to split it into different chunks. So we have chunk one,
00:01:30.880 | chunk two, and chunk three here. Now, most of these chunks, or all of these chunks in the end,
00:01:37.120 | are going to be 512 tokens long. And you can see with chunk one and chunk two, they are 512 already.
00:01:46.640 | However, of course, 1361 can't be evenly split into 512. So the final chunk will be shorter.
00:01:55.200 | And once we have split those into chunks, we will need to add padding, and we need to add the
00:02:02.960 | start of sequence and separator tokens. If that is new to you, then don't worry,
00:02:07.120 | we'll explain that very soon. And then we calculate the sentiment for each one of those,
00:02:11.520 | take the average, and then use that as a sentiment prediction for the entire text.
00:02:16.880 | And that's essentially, at a high level, what we're going to be doing. But that is much easier said
00:02:22.400 | than done. So let's just jump straight into the code and I'll show you how we actually do this.
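Before the walkthrough starts, here is a rough sketch of the whole procedure in Python. It assumes tokenizer and model are the FinBERT tokenizer and classifier initialized in a moment, text is the long post, and add_cls_sep_and_padding is a hypothetical helper standing in for the chunk-preparation steps the video builds up by hand:

    import torch

    # rough outline: tokenize without special tokens, chunk, score each chunk, average
    tokens = tokenizer.encode_plus(text, add_special_tokens=False, return_tensors='pt')
    chunks = tokens['input_ids'][0].split(510)            # pieces of up to 510 token IDs
    probs = []
    for chunk in chunks:
        chunk = add_cls_sep_and_padding(chunk)            # hypothetical helper: CLS, SEP, padding
        logits = model(chunk.unsqueeze(0)).logits         # attention mask omitted in this sketch
        probs.append(torch.nn.functional.softmax(logits, dim=-1))
    overall = torch.cat(probs).mean(dim=0)                # average sentiment across the chunks
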
00:02:27.840 | Okay, what we have here is a post from the investing subreddit. It's pretty long,
00:02:34.640 | I think it's something like 1300 tokens when we tokenize it. And obviously, that is far beyond the
00:02:42.240 | 512 token limit that we have with BERT. So if we want to consider the full text, we obviously have
00:02:50.800 | to do something different. And the first thing that I think we want to do is actually initialize our
00:02:58.000 | model and tokenizer. Because we're using BERT for sequence classification, we will import the
00:03:05.280 | BERT for sequence classification model or class. And we are importing that from the Transformers
00:03:12.000 | library. So that is going to be our model class. And then we also need the tokenizer as well,
00:03:26.800 | which is just a generic BERT tokenizer. So those two are our imports. And then we actually need to
00:03:33.520 | initialize the tokenizer and the model. So the BERT tokenizer is pretty straightforward.
00:03:41.280 | And then we are going from pre-trained. So we're using a pre-trained model here.
00:03:49.840 | And if we just open the HuggingFace Transformers models page, so HuggingFace.co/models.
00:03:58.400 | And we can head over here and we can actually search for the model that we'd like to use.
00:04:04.960 | We're doing text classification, so we head over here and filter by text classification.
00:04:10.080 | And then the investing subreddit is basically full of financial advice. So we really want to,
00:04:17.520 | if possible, use a more financially savvy BERT model, which we can find with FinBERT.
00:04:25.520 | And we have two options for FinBERT here. I'm going to go with the ProsusAI FinBERT model.
00:04:30.400 | And all we actually need is this text here. We go back to our code and we'll just enter it here.
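For reference, the completed setup from this section ends up looking something like the sketch below, assuming the ProsusAI/finbert model ID chosen here:

    from transformers import BertForSequenceClassification, BertTokenizer

    # load the FinBERT tokenizer and sequence classification model from the Hugging Face hub
    tokenizer = BertTokenizer.from_pretrained('ProsusAI/finbert')
    model = BertForSequenceClassification.from_pretrained('ProsusAI/finbert')
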
00:04:37.520 | So ProsusAI, then a slash, and we just want this all on the same line, like that. And we're also
00:04:45.520 | going to be using the same model for our BERT for sequence classification. So BERT for sequence
00:04:58.400 | classification. And we do the from pre-trained ProsusAI FinBERT again. And that's all we need
00:05:06.560 | to do to actually initialize our model and tokenizer. And now we're ready to actually
00:05:11.280 | tokenize that input text. So when it comes to tokenizing input text, for those of you
00:05:16.960 | that have worked with transformers before, it typically looks something like this.
00:05:22.240 | So we write tokens or whichever variable name you'd like to use. We use tokenizer, encode plus.
00:05:34.640 | We pass our text here. We add special tokens. So these are the CLS,
00:05:43.760 | separator, and padding tokens. So anything from this list here. So all these tokens are used
00:05:55.840 | specifically within BERT for different purposes. So we have padding token, which we use when a
00:06:01.840 | sequence is too short. So BERT always requires that we have 512 tokens within our inputs. If we
00:06:10.160 | are feeding in 100 tokens, then we add 412 padding tokens to fill that empty space. Unknown is just
00:06:18.720 | when a word is unknown to BERT. And then we have the CLS token here. And this appears at the start
00:06:24.640 | of every sequence. And the token ID for this is 101. So we'll be using this later. So it's
00:06:31.040 | important to remember that number. And then we also have the SEP token, the separator,
00:06:35.360 | which indicates the point between our input text and the padding. Or if there is no
00:06:42.800 | padding, it would just indicate the end of the text. And they're the only ones that we really
00:06:47.600 | need to be concerned about. So typically, we have those special tokens in there because BERT does
00:06:54.480 | need them. We specify a max length, which is the 512 tokens that BERT would expect. And then we
00:07:02.320 | say anything beyond that we want to truncate. And anything below that we want to pad up to the max
00:07:09.280 | length. And this is typically what our tokens will look like. So now we have a dictionary:
00:07:19.040 | we have input IDs, we have these token type IDs, which we don't need to worry about, and we have
00:07:24.560 | the attention mask. And that's typically what we would do. But in this case, we are doing things
00:07:31.680 | slightly different. Because one, we don't want to add those special tokens immediately. Because if
00:07:36.640 | we add these special tokens, we have a CLS, or start of sequence, token at the start. And then we also have a
00:07:43.200 | separator token at the end of our tensor. And we don't want that because we're going to be
00:07:50.880 | splitting our tensor up into three smaller tensors. So we actually don't want to add those yet,
00:07:56.480 | we're going to add those manually later. And then we also have this max length, truncation, and padding.
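For reference, the standard single-pass call being described, which is not what we will use here, would look roughly like this, with txt standing in for the long post loaded earlier:

    # the usual approach: add CLS/SEP, truncate to 512 tokens, and pad anything shorter
    tokens = tokenizer.encode_plus(txt, add_special_tokens=True,
                                   max_length=512, truncation=True,
                                   padding='max_length')
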
00:08:01.520 | Obviously, we actually don't want to be using any of these because if we truncate our 1300 token
00:08:09.680 | text into just 512, then that's just what we would normally do. We're not actually considering the
00:08:15.440 | whole text, we're just considering the first 512 tokens. So clearly, we also don't want any of
00:08:21.440 | those variables in there. In our case, we actually do something slightly different. We still use the
00:08:27.600 | encode_plus method. So tokenizer.encode_plus. We also include the text. This time, we want to specify
00:08:43.440 | that we don't want to add those special tokens. So we set that to false. And that's actually it,
00:08:51.920 | we don't want to include any of those other arguments in there. The only extra parameter
00:08:56.800 | that we do want to add, which we want to add whenever we're working with PyTorch, is we want
00:09:03.200 | to add return tensors equals PT. And this just tells the tokenizer to return PyTorch tensors.
00:09:13.760 | Whereas here, what we had are actually just simple Python lists. And if we're using TensorFlow,
00:09:21.520 | we switch this over to TF. In our case, using PyTorch. And let's just see what that gives us.
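So the call actually used comes out as something like this sketch, with txt again being the full post text:

    # keep the raw token IDs: no special tokens, no truncation or padding, returned as PyTorch tensors
    tokens = tokenizer.encode_plus(txt, add_special_tokens=False,
                                   return_tensors='pt')
    print(tokens)
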
00:09:29.840 | Okay, so here we get a warning about the sequence length. And that's fine, because we're going to
00:09:36.400 | deal with that later. And then in here, we can see, okay, now we have PyTorch tensors rather
00:09:42.720 | than the list that we had before, which is great, that's what we want. Now we have that, we actually
00:09:48.720 | want to split each of our tensors, or the input IDs and the attention mask tensors, we don't need
00:09:55.200 | to do anything with the token type IDs, we can get rid of those. We want to split those into chunks
00:10:01.520 | of length 510. So the reason we're using 510, rather than 512, is because at the moment,
00:10:09.280 | we don't have our CLS and separator tokens in there. So once we do add those, that will push
00:10:15.440 | the 510 up to 512. So to split those into those chunks, it's actually incredibly easy. So we'll
00:10:24.640 | just write input ID chunks. And we need to access our tokens dictionary. So tokens, and then we want
00:10:35.600 | to access the input IDs here. And you'll see here that this is actually a tensor, it's almost like
00:10:42.720 | a list within a list. So to access that, we want to access a zero index of that. And then we're
00:10:49.920 | just going to use split, which is a PyTorch method, with 510. And that is literally all we need to do
00:10:57.680 | to split our tensor into batches. And we repeat this again, but for the mask,
00:11:07.520 | and it just changes to attention mask. Again, we don't need token type IDs,
00:11:14.560 | so we can just ignore that. And then let's just print out the length of each one of our tensors
00:11:20.160 | here. So for tensor in input ID chunks, just print the length of it. So we can check that we
00:11:31.600 | are actually doing this correctly. So we can see we have 510, 510. And the last one is shorter,
00:11:40.080 | of course, like we explained before, at 325. So that's pretty ideal, that's what we want.
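A sketch of the splitting and length check just described:

    # drop the outer batch dimension, then split into chunks of at most 510 token IDs
    input_id_chunks = tokens['input_ids'][0].split(510)
    mask_chunks = tokens['attention_mask'][0].split(510)

    for tensor in input_id_chunks:
        print(len(tensor))   # prints 510, 510, 325 for this particular post
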
00:11:48.080 | And now we can move on to adding in our CLS and separator tokens. I'll just show you how this is
00:11:55.920 | going to work. So I'm going to use a smaller tensor quickly, just as an example.
00:12:03.520 | So we just need to also import torch. So we do that here.
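The small example being set up here looks something like the following; the variable name a matches the one referred to a little later, and the values themselves are just illustrative:

    import torch

    # a toy tensor standing in for a chunk of token IDs
    a = torch.tensor([5432, 1234, 8763])

    # concatenate the CLS (101) and SEP (102) token IDs around it
    print(torch.cat([torch.tensor([101]), a, torch.tensor([102])]))
    # tensor([ 101, 5432, 1234, 8763,  102])
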
00:12:19.840 | Okay, so we have this tensor. And to add a value on either side of that, we can use the torch
00:12:26.560 | cat method, which is for concatenating multiple tensors. In this case, we'd use torch cat.
00:12:34.160 | And then we just pass a list of all the tensors that we would like to include here.
00:12:40.720 | Now, we don't have a tensor for our token, so we just create it within this list.
00:12:48.480 | And that's very easy, we just use torch tensor.
00:12:51.520 | And then if you remember before, the CLS token is the equivalent of 101, when it's converted
00:13:00.880 | to the token ID. So that's going to come at the start of our tensor. And in the middle,
00:13:06.880 | we have our actual tensor. And at the end, we want to append our 102 tensor, which is the
00:13:15.280 | separator token. Okay, and we just print that out, we can see, okay, we've got 101,
00:13:24.320 | and then we have our sequence and 102 at the end. Then after we add our CLS and separator tokens,
00:13:32.560 | we will use the same method for our padding as well. But we want to write this logic within a
00:13:39.360 | for loop, which will iterate through each chunk and process each one individually. So first,
00:13:46.080 | I'm going to create a variable to define the chunk size, which is going to be 512, which is our
00:13:54.640 | target size. And we already split our tokens into chunks up here. So we can just iterate through
00:14:04.240 | each one of those. So we'll just go through a range of the length of the number of chunks that
00:14:12.640 | we have, this will go 0, 1, and 2. And now we can access each chunk using the I index here.
00:14:20.640 | So first, we want to add the CLS and separator tokens, just like we have above. So to do that,
00:14:29.520 | we go input ID chunks, we get the current index, and then just do torch cat, which is just
00:14:41.360 | concatenate. And then we pass a list just like we did before, which is going to be torch tensor.
00:14:50.560 | And then in the middle, where we had a, we're going to replace that with this.
00:14:58.480 | Okay, and then we want to do the same for our attention mask. But of course,
00:15:03.040 | in our attention mask, if we look up here, it's just full of ones. And the only two values that
00:15:09.760 | we can actually have in our attention mask is either 1 or 0. And the reason for this is whenever
00:15:17.040 | we have a real token that BERT needs to pay attention to, we have a 1 in this attention
00:15:24.320 | mask. Whereas if you have a padding token, that will correspond to a 0 in this attention mask.
00:15:30.880 | And the reason for this is just so BERT doesn't process attention for the padding tokens within
00:15:38.800 | our inputs. So it's essentially like telling BERT to just ignore the padding. So in our case here,
00:15:45.680 | both of these are not padding tokens. So both of them should be 1. Okay, and then that gets us
00:15:54.480 | our sequences with the CLS and separator tokens and the attention mask values added in there. So now we need
00:16:03.040 | to do the padding. And realistically with padding, we're actually only going to do that for the
00:16:08.160 | final tensor. So what we will do to make sure that we don't try and pad the other tensors
00:16:15.440 | is just check the length of each one. First, we'll calculate the required padding length,
00:16:23.120 | which is just going to be equal to the chunk size minus the input ID chunk at the current
00:16:35.040 | index's shape at position 0. So this is like taking the length of the tensor. Okay, and for chunks 1 and 2,
00:16:43.680 | this will just be equal to 0. Whereas for the final chunk, it will not, it will be something
00:16:48.480 | like 150 or 200. So what we want to do is say, if the pad length is greater than 0,
00:16:58.240 | then this is where we add our padding tokens. So first, we'll do the input ID chunk.
00:17:06.160 | And again, we're just going to use the torch concatenate method.
00:17:13.040 | This time, we have our input ID chunk at the start. I think it's chunks, not chunk.
00:17:25.040 | And also here, this should be mask chunks. So let's just fix that quickly.
00:17:37.440 | Okay. And here, we first have this, and then the parts following this need to be our padding
00:17:50.560 | tokens. And to create those, we are going to do the torch tensor again. And then in here,
00:17:59.440 | we're going to just add one zero in a list. But then we're going to multiply that
00:18:05.200 | by the pad length. So if the pad length is 100, this will give us a tensor that has 100 zeros
00:18:15.520 | inside it, which is exactly what we want. And then we'll copy and paste this
00:18:22.000 | and do the exact same thing for our masking tensor as well.
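Putting this section together, the per-chunk loop comes out as roughly the sketch below. It includes the tuple-to-list conversion that comes up a little further on, since split() returns immutable tuples, and the variable names are guesses at what is on screen:

    chunksize = 512

    # split() returns tuples, so convert to lists to allow in-place modification
    input_id_chunks = list(input_id_chunks)
    mask_chunks = list(mask_chunks)

    for i in range(len(input_id_chunks)):
        # add the CLS (101) and SEP (102) token IDs around each chunk
        input_id_chunks[i] = torch.cat([
            torch.tensor([101]), input_id_chunks[i], torch.tensor([102])
        ])
        # the new tokens are real tokens, so the attention mask gets a 1 at each end
        mask_chunks[i] = torch.cat([
            torch.tensor([1]), mask_chunks[i], torch.tensor([1])
        ])
        # pad the final, shorter chunk up to the full 512-token length
        pad_len = chunksize - input_id_chunks[i].shape[0]
        if pad_len > 0:
            input_id_chunks[i] = torch.cat([
                input_id_chunks[i], torch.tensor([0] * pad_len)
            ])
            mask_chunks[i] = torch.cat([
                mask_chunks[i], torch.tensor([0] * pad_len)
            ])
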
00:18:25.760 | Okay. So now let's just print out the length of each one of those tensors.
00:18:39.280 | So for chunk in input ID chunks, print the length of that chunk. And then we'll also just
00:18:51.760 | print out the final chunk as well, so we can see everything is in the right place.
00:18:56.800 | And here, so just copy. So this here needs to have an S on the end.
00:19:05.840 | Oh, and up here. So when we first build these, so if I just print out one of them,
00:19:14.640 | you see that the input ID chunks is actually a tuple containing three of our tensors.
00:19:24.640 | So what we actually want to do, I'll just close that, is before we start this whole process,
00:19:34.800 | we just want to convert them into lists so that we can actually change the values inside. Because
00:19:41.680 | otherwise we are trying to change the values of a tuple, which we obviously can't because
00:19:46.640 | tuples are immutable in Python, which means you can't change the values inside them.
00:19:52.560 | So we just convert those to lists. And then we also need to add an S on here.
00:20:01.840 | And there we go. We finally got there. So now we can see, okay, here we have 514.
00:20:09.360 | So let me just rerun this bit here. And then rerun this. Okay. So it's because I was running
00:20:17.840 | it twice. It was adding these twice. So now we have 512. And then we can see we have our tensor.
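For reference, the corrected check from this part comes out as something like:

    for chunk in input_id_chunks:
        print(len(chunk))       # each chunk should now be 512 tokens long
    print(input_id_chunks[-1])  # final chunk: 101 at the start, then 102 and the zero-padding at the end
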
00:20:28.080 | So this is just printing out the input ID chunks. You can see here we have all these values.
00:20:33.120 | And this is just the final one. So you can see at the bottom we have this padding.
00:20:39.040 | If we go up here, we have our start of sequence token 101. And down here we have the end of
00:20:44.800 | sequence separator. So now what we want to do is stack our input IDs and attention mask tensors
00:20:52.240 | together. So we'll create input IDs. We use torch stack for that. And that's going to be input ID
00:21:03.040 | chunks. And then we also have the attention mask that we need to create. So we do the same thing
00:21:10.480 | there. And that is the mask chunks. And then the format that BERT expects us to be feeding in this
00:21:21.120 | data is a dictionary where we have key value pairs. So we have a key input IDs, which will lead
00:21:30.720 | to our input IDs tensor here. And then another one called attention mask that will have the
00:21:36.880 | attention mask as its value. So we'll just define that here. And this is just the format that BERT
00:21:47.200 | expects. So the input IDs. And then we have the input IDs there. And then we also have the
00:21:54.560 | attention mask. We have the attention mask in there. Now, as well as that, BERT expects these
00:22:04.880 | tensors to be in a particular format. So for the input IDs, it expects them to be in a long format. So we just
00:22:11.440 | add long onto the end of there. And then for the attention mask, it expects integers. So we just add
00:22:17.600 | int onto the end of there. And then we just print out input dict. So we can see what we are putting
00:22:24.240 | in there. OK, great. So that is exactly the format that we need. Now we can get our outputs. So we
00:22:32.720 | pass these into our model as keyword arguments. So we just add these two asterisk symbols.
00:22:39.280 | That means it's a keyword argument. And then in there, we pass our input dict. And this allows
00:22:47.120 | the function to read these keywords, take them as variables, and assign these tensors to them.
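A sketch of the stacking, input dictionary, and model call described over the last few steps:

    # stack the per-chunk tensors into shape (number_of_chunks, 512)
    input_ids = torch.stack(input_id_chunks)
    attention_mask = torch.stack(mask_chunks)

    # BERT expects a dictionary of input_ids (long) and attention_mask (int) tensors
    input_dict = {
        'input_ids': input_ids.long(),
        'attention_mask': attention_mask.int()
    }

    # unpack the dictionary as keyword arguments to the model
    outputs = model(**input_dict)
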
00:22:54.800 | So there we have our outputs. You can see here that we have these logits. These are our activations
00:23:04.880 | from the final layer of the BERT model. And you see, OK, we have these values. What we want in
00:23:11.840 | the end is a set of probabilities. And of course, this is not a set of probabilities, because
00:23:17.680 | probabilities we would expect to be between the values of 0 and 1. Here we have negatives. We
00:23:22.480 | have values that are over 1. And that's not really what we would expect. So to convert these into
00:23:30.000 | probabilities, all we need to do is apply a softmax function to them. Now softmax is essentially
00:23:39.200 | sigmoid but applied across a set of categorical or output classes. And to implement that, we just do
00:23:47.760 | torch.nn.functional, and then we just add softmax onto the end there. And we need to access
00:23:55.600 | the output logits, which is in index 0 of the outputs variable. So that is just accessing
00:24:02.320 | this tensor here. And then we access dimension minus 1. So dimension negative 1 is just
00:24:11.840 | accessing the final dimension of our tensor. So if, for example, we had a 3D tensor, this is like
00:24:19.360 | accessing the second dimension, or dimension number 2. Because when we have a 3D tensor,
00:24:26.080 | we have dimensions 0, 1, and 2, and minus 1 just wraps around to dimension 2, if that makes sense. So
00:24:35.040 | imagine we have 0, 1, and 2 here. If we take negative 1, we come around here to
00:24:44.000 | the back of the list. And that is accessing that final dimension. So that is going to take a
00:24:50.560 | softmax function across each one of these outputs. And then we can print that out. So now we have
00:25:01.440 | our probabilities. So the outputs of the FinBERT model, these ones here in the first column are
00:25:09.520 | all positive. So this is the prediction of the chunks having a positive sentiment.
00:25:15.200 | These are all negative. So the prediction of the chunk having a negative sentiment. And these are
00:25:21.440 | all neutral. So the prediction of it having a neutral sentiment. So we see here, the first and second chunks
00:25:28.560 | are both predicted to have a negative sentiment, particularly the first one. And the final one is
00:25:34.640 | predicted to have a positive sentiment. Now if we want to get the overall prediction, all we do is
00:25:41.520 | take the mean. So the probabilities. And we just want to take the mean. And we take that in the
00:25:47.920 | 0 dimension, which would just go from here down, take the mean of those three, take the mean of
00:25:55.040 | these three, and take the mean of these three as well. Print it out. And you see here, negative
00:26:03.040 | sentiment is definitely winning here. But only just, it's pretty close to the positive. So it's
00:26:08.160 | a reasonably difficult one to understand. And this is because over here, we have mostly negative,
00:26:15.360 | kind of negative, and mostly positive. So it's a bit of a difficult one. But negative sentiment
00:26:20.320 | does win out in the end. Now if you'd like to get the specific category that won, we'll just take
00:26:26.560 | the argmax of the mean. And that will give us a tensor. If we want to actually get the value out
00:26:34.480 | of that tensor, we can just add item onto the end there. And that is it. We have taken the average
00:26:40.640 | sentiment of a pretty long piece of text. And of course, we can just use this code and iterate
00:26:47.120 | through it for multiple long pieces of text. And it doesn't really matter how long those pieces of
00:26:52.960 | text are. This will still work. So I hope this has been an interesting and useful video for you.
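To wrap up, a sketch of those final steps, the softmax over the logits, the mean across chunks, and the argmax:

    # convert the per-chunk logits into probabilities over the three classes
    probs = torch.nn.functional.softmax(outputs[0], dim=-1)

    # average across the chunks, then pick the winning class
    mean_probs = probs.mean(dim=0)
    winner = torch.argmax(mean_probs).item()   # 0 = positive, 1 = negative, 2 = neutral, per the columns above
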
00:27:00.880 | I've definitely enjoyed working through this and figuring it all out.
00:27:05.040 | So thank you very much for watching. And I will see you again in the next one.