In this video, we're going to take a look at how we can apply sentiment analysis to longer pieces of text. So if you've done this sort of thing before, in particular with transformers or even LSTMs or any other architecture in NLP, you will find that we have an upper limit on the number of words that we can consider at once.
In this tutorial, we're going to be using BERT, which is a transformer model, and at max, that consumes 512 tokens. Anything beyond that is just truncated, so we don't consider anything past that limit. Now, in a lot of cases, for example if you're analyzing the sentiment of tweets, that's not a problem.
But when we start to look at news articles or Reddit posts, they can be quite a bit longer than 512 tokens. And when I say tokens, a token typically maps to a word or a piece of punctuation. So what I want to explore in this video is how we can actually remove that limitation and consume as many tokens or words as we'd like, and still get an accurate sentiment score whilst considering the full length of the text.
And at a high level, this is essentially what we are going to be doing. We're going to take the original tensor, which is 1,345 tokens long, and split it into chunks. So we have chunk one, chunk two, and chunk three here, and in the end, all of these chunks are going to be 512 tokens long.
You can see that chunk one and chunk two are 512 already. However, of course, 1,345 can't be evenly split into 512, so the final chunk will be shorter. And once we have split the text into those chunks, we will need to add padding, and we will need to add the start of sequence and separator tokens.
If that is new to you, then don't worry, we'll explain it very soon. We then calculate the sentiment for each one of those chunks, take the average, and use that as the sentiment prediction for the entire text. That's, at a high level, what we're going to be doing.
But that is much easier said than done. So let's just jump straight into the code and I'll show you how we actually do this. Okay, what we have here is a post from the investing subreddit. It's pretty long, I think it's something like 1300 tokens when we tokenize it.
And obviously, that is far beyond the 512-token limit that we have with BERT, so if we want to consider the full text, we have to do something different. The first thing we want to do is initialize our model and tokenizer. Because we're using BERT for sequence classification, we will import the BertForSequenceClassification class.
And we are importing that from the Transformers library, so that is going to be our model class. Then we also need the tokenizer, which is just the generic BertTokenizer. Those two are our imports, and then we actually need to initialize the tokenizer and the model.
The BertTokenizer is pretty straightforward: we initialize it with from_pretrained, because we're using a pre-trained model here. And if we just open the HuggingFace Transformers models page, at huggingface.co/models, we can head over and search for the model that we'd like to use.
We're doing text classification, so we filter by text classification. And the investing subreddit is basically full of financial advice, so if possible, we really want to use a more financially savvy BERT model, which we can find with FinBERT. We have two options for FinBERT here.
I'm going to go with the ProsusAI FinBERT model, and all we actually need is this text here, ProsusAI/finbert. We go back to our code and just enter it, all on the same line, like that. And we're also going to be using the same model name for our BertForSequenceClassification.
So BertForSequenceClassification, and we call from_pretrained with ProsusAI/finbert again. And that's all we need to do to actually initialize our model and tokenizer.
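As a minimal sketch, those imports and initializations look something like this:

```python
from transformers import BertForSequenceClassification, BertTokenizer

# load the pre-trained FinBERT tokenizer and classification model
tokenizer = BertTokenizer.from_pretrained('ProsusAI/finbert')
model = BertForSequenceClassification.from_pretrained('ProsusAI/finbert')
```

Now we're ready to actually tokenize that input text. When it comes to tokenizing input text, for those of you that have worked with transformers before, it typically looks something like this.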
So we write tokens, or whichever variable name you'd like to use, equals tokenizer.encode_plus, and we pass in our text. We add special tokens, meaning the CLS, SEP, and padding tokens, anything from this list here. All of these tokens are used within BERT for specific purposes.
We have the PAD token, which we use when a sequence is too short. BERT expects 512 tokens in our inputs here, so if we are feeding in 100 tokens, we add 412 padding tokens to fill that empty space. UNK is just used when a word is unknown to BERT.
Then we have the CLS token, which appears at the start of every sequence. The token ID for this is 101, so it's important to remember that number, as we'll be using it later. And then we also have the SEP token, the separator, which marks the point between our input text and the padding.
Or, if there is no padding, it just indicates the end of the text. Those are the only ones we really need to be concerned about. So typically, we include those special tokens because BERT needs them, we specify a max length of 512 tokens, which is what BERT expects, we say anything beyond that should be truncated, and anything below that should be padded up to the max length.
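A sketch of that typical call, assuming the post text is stored in a variable txt (a name used here just for illustration):

```python
# the typical approach: add special tokens, truncate and pad to 512
tokens = tokenizer.encode_plus(
    txt, add_special_tokens=True,
    max_length=512, truncation=True,
    padding='max_length'
)
```

And this is typically what our tokens will look like: a dictionary, where we have input_ids, a token_type_ids entry, which we don't need to worry about, and the attention_mask.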
That's typically what we would do. But in this case, we are doing things slightly differently. For one, we don't want to add those special tokens immediately, because if we did, we would get a CLS, or start of sequence, token and a SEP token at the start and end of our full tensor.
And we don't want that, because we're going to be splitting our tensor up into three smaller tensors, so we're going to add those tokens manually later. Then we also have the max length, truncation, and padding arguments. We don't want to use any of these either, because if we truncate our 1,300-token text down to just 512 tokens, we're back to what we would normally do:
we're not actually considering the whole text, just the first 512 tokens. So clearly, we don't want any of those arguments in there. In our case, we do something slightly different. We still use the encode_plus method, so tokenizer.encode_plus, and we still pass in our text. But this time, we specify that we don't want to add those special tokens,
so we set add_special_tokens to False. And that's almost it; we don't include any of those other arguments. The only extra parameter that we do want, which we add whenever we're working with PyTorch, is return_tensors='pt'.
And this just tells the tokenizer to return PyTorch tensors, whereas what we had before were actually just simple Python lists. If we were using TensorFlow, we would switch this over to 'tf'; in our case, we're using PyTorch.
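Putting that together, the call we've just described would look something like this (again with the post text in the hypothetical variable txt):

```python
# no special tokens, no truncation or padding; return PyTorch tensors
tokens = tokenizer.encode_plus(txt, add_special_tokens=False,
                               return_tensors='pt')
```

So let's just see what that gives us. Okay, so here we get a warning about the sequence length.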
And that's fine, because we're going to deal with that later. In here, we can see that we now have PyTorch tensors rather than the lists we had before, which is great; that's what we want. Now that we have that, we want to split our tensors, meaning the input_ids and attention_mask tensors; we don't need to do anything with the token_type_ids, so we can get rid of those.
We want to split those into chunks of length 510. The reason we're using 510, rather than 512, is that at the moment we don't have our CLS and separator tokens in there. Once we do add those, they will push the 510 up to 512. And splitting into those chunks is actually incredibly easy.
We'll just write input_id_chunks, and we need to access our tokens dictionary: tokens, and then the input_ids key. You'll see that this is actually a tensor within a tensor, almost like a list within a list, so we access the zero index of it.
And then we just call split, which is a PyTorch method, with 510. That is literally all we need to do to split our tensor into chunks. We repeat this again for the mask; it just changes to the attention_mask key. Again, we don't need the token_type_ids, so we can just ignore those.
Then let's just print out the length of each one of our tensors here: for each tensor in input_id_chunks, we print its length, so we can check that we are actually doing this correctly. And we can see we have 510, 510, and the last one is shorter, of course, like we explained before, at 325.
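For reference, a sketch of that splitting and length check:

```python
# split the 1D tensors into chunks of up to 510 tokens each
input_id_chunks = tokens['input_ids'][0].split(510)
mask_chunks = tokens['attention_mask'][0].split(510)

for tensor in input_id_chunks:
    print(len(tensor))  # 510, 510, 325
```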
So that's pretty ideal; that's what we want. Now we can move on to adding in our CLS and separator tokens. I'll just show you how this is going to work using a smaller tensor as a quick example, and for that we also need to import torch.
So we do that here. Okay, so we have this tensor. To add a value on either side of it, we can use the torch.cat method, which concatenates multiple tensors. We use torch.cat and pass a list of all the tensors that we would like to join together.
Now, we don't have a tensor for our CLS token, so we just create it within this list, which is very easy: we just use torch.tensor. And if you remember from before, the CLS token converts to the token ID 101, and that's going to come at the start of our tensor.
In the middle, we have our actual tensor, and at the end, we append a 102 tensor, which is the separator token. Okay, and if we print that out, we can see we've got 101, then our sequence, and 102 at the end. After we add our CLS and separator tokens, we will use the same method for our padding as well.
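Here's that small demo, with a standing in as the example tensor:

```python
import torch

a = torch.tensor([1, 2, 3])  # a small example tensor
# prepend the CLS token ID (101) and append the SEP token ID (102)
print(torch.cat([torch.tensor([101]), a, torch.tensor([102])]))
# -> tensor([101,   1,   2,   3, 102])
```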
But we want to write this logic within a for loop, which will iterate through each chunk and process each one individually. First, I'm going to create a variable to define the chunk size, which is going to be 512, our target size. And we already split our tokens into chunks up here.
So we can just iterate through each one of those. We'll loop through a range of the number of chunks we have, so this will go 0, 1, and 2, and we can access each chunk using the index i. First, we want to add the CLS and separator tokens, just like we did above.
To do that, we take input_id_chunks at the current index and set it to torch.cat, which is just concatenate. Then we pass a list just like we did before: a torch.tensor containing 101 at the start, and in the middle, where we had a in the example, we now put the current chunk, with the 102 tensor still at the end.
Okay, and then we want to do the same for our attention mask. But of course, if we look up here, our attention mask is just full of ones. The only two values that can appear in an attention mask are 1 and 0: whenever we have a real token that BERT needs to pay attention to, we have a 1 in the attention mask, whereas a padding token corresponds to a 0. The reason for this is so that BERT doesn't compute attention for the padding tokens within our inputs; it's essentially telling BERT to just ignore the padding. In our case here, the two tokens we're adding are both real tokens, not padding,
so both of them should be 1. Okay, and that gets us our sequences with the CLS and separator tokens added, along with the matching attention mask values. Now we need to do the padding. Realistically, we only need to pad the final tensor, so to make sure that we don't try to pad the other tensors, we'll just check the length of each one.
First, we'll calculate the required padding length, which is just the chunk size minus the input ID chunk's shape[0], which is like taking the length of the tensor. For the first two chunks, this will just be equal to 0,
whereas for the final chunk it will not; here it will be 185, since the final chunk has 327 tokens once the CLS and SEP tokens are added, and 512 minus 327 leaves 185. So what we want to do is say: if the pad length is greater than 0, then this is where we add our padding tokens. First, we'll update the input ID chunk, and again, we're just going to use the torch.cat method.
This time, we have our input ID chunk at the start. I think it's chunks, not chunk, and this here should be mask_chunks too, so let's just fix that quickly. Okay. So we first have the chunk itself, and the part following it needs to be our padding tokens. To create those, we use torch.tensor again.
In here, we put a single zero in a list, but then multiply that list by the pad length. So if the pad length is 100, this gives us a tensor with 100 zeros inside it, which is exactly what we want. Then we copy and paste this and do the exact same thing for our masking tensor as well.
Okay. Now let's just print out the length of each one of those tensors: for each chunk in input_id_chunks, we print the length of that chunk. And then we'll also print out the final chunk itself, so we can see everything is in the right place. Running this, I hit a couple of small errors: this variable here needs an S on the end, and so does this one up here. There's one more thing too: when we first build these chunks, if I print one of them out, you can see that input_id_chunks is actually a tuple containing our three tensors. We are trying to change the values of a tuple, which we can't do, because tuples are immutable in Python, meaning you can't change the values inside them. So before we start this whole process, we just want to convert both tuples into lists so that we can actually change the values inside. And there we go.
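Putting it all together, a minimal sketch of the loop we've just walked through, using the variable names from above (input_id_chunks, mask_chunks, chunksize):

```python
chunksize = 512

# split() returned tuples, which are immutable, so convert to lists first
input_id_chunks = list(input_id_chunks)
mask_chunks = list(mask_chunks)

for i in range(len(input_id_chunks)):
    # add the CLS (101) and SEP (102) token IDs around each chunk
    input_id_chunks[i] = torch.cat([
        torch.tensor([101]), input_id_chunks[i], torch.tensor([102])
    ])
    # the CLS and SEP tokens are real tokens, so their mask values are 1
    mask_chunks[i] = torch.cat([
        torch.tensor([1]), mask_chunks[i], torch.tensor([1])
    ])
    # pad the final chunk (and its mask) up to the full chunk size with zeros
    pad_len = chunksize - input_id_chunks[i].shape[0]
    if pad_len > 0:
        input_id_chunks[i] = torch.cat([
            input_id_chunks[i], torch.tensor([0] * pad_len)
        ])
        mask_chunks[i] = torch.cat([
            mask_chunks[i], torch.tensor([0] * pad_len)
        ])
```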
We finally got there. So now we can see, okay, here we have 514. Let me just rerun this bit here, and then rerun this. Okay, that was because I ran it twice, which added the special tokens twice. Now we have 512, and then we can see our tensor.
This is just printing out the input ID chunks, and you can see here we have all these values; this is just the final one. At the bottom we have the padding; if we go up here, we have our start of sequence token, 101; and down here we have the end of sequence separator, 102.
Now what we want to do is stack our input ID and attention mask tensors together. We'll create input_ids using torch.stack on input_id_chunks. Then we also have the attention mask to create, so we do the same thing there with mask_chunks. The format that BERT expects this data in is a dictionary of key-value pairs: a key input_ids, which leads to our input IDs tensor here, and another called attention_mask, which has the attention mask as its value.
So we'll just define that here; this is just the format that BERT expects. We have input_ids with the input IDs tensor, and attention_mask with the attention mask tensor. Now, as well as that, BERT expects these tensors to be in a particular format: the input IDs should be in long format, so we just add .long() onto the end there, and the attention mask should be integers, so we add .int() onto the end. Then we just print out input_dict so we can see what we are putting in there.
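A sketch of that stacking and dictionary building:

```python
# stack the chunk lists into 2D tensors and build BERT's input dictionary
input_ids = torch.stack(input_id_chunks)
attention_mask = torch.stack(mask_chunks)

input_dict = {
    'input_ids': input_ids.long(),
    'attention_mask': attention_mask.int()
}
```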
OK, great. So that is exactly the format that we need. Now we can get our outputs. We pass these into our model as keyword arguments by adding the two asterisk symbols in front of input_dict. This unpacks the dictionary, which allows the function to read the keys as parameter names and assign the tensors to them.
So there we have our outputs. You can see that we have these logits, which are the activations from the final layer of the BERT model. And you see, okay, we have these values. What we want in the end is a set of probabilities, and this is clearly not a set of probabilities, because we would expect probabilities to be between 0 and 1.
Here we have negatives, and we have values over 1, which is not what we would expect. So to convert these into probabilities, all we need to do is apply a softmax function to them. Softmax is essentially a sigmoid applied across a set of categorical output classes.
To implement that, we use torch.nn.functional and add softmax onto the end there. We need to pass in the output logits, which are at index 0 of the outputs variable; that is just accessing this tensor here. And then we pass dimension -1.
The dimension -1 just accesses the final dimension of our tensor. So imagine a tensor with dimensions 0, 1, and 2: if we take -1, we wrap around to the back and land on that final dimension, number 2. So this applies the softmax function across the class scores for each one of these outputs. And then we can print that out.
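A sketch of the model call and softmax:

```python
# run the chunks through the model and convert logits to probabilities
outputs = model(**input_dict)
probabilities = torch.nn.functional.softmax(outputs[0], dim=-1)
print(probabilities)
```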
So now we have our probabilities, the outputs of the FinBERT model. The values in the first column here are all positive scores, the prediction of each chunk having a positive sentiment. The second column holds the negative scores, the prediction of each chunk having a negative sentiment, and the third column holds the neutral scores, the prediction of a neutral sentiment.
We can see that the first and second chunks are both predicted to have a negative sentiment, particularly the first one, and the final chunk is predicted to have a positive sentiment. Now, if we want to get the overall prediction, all we do is take the mean.
So we take the probabilities and call mean in the 0 dimension, which just goes from here down: it takes the mean of the three positive scores, the mean of the three negative scores, and the mean of the three neutral scores. Print it out, and you can see that negative sentiment is definitely winning here, but only just; it's pretty close to the positive score. It's a reasonably difficult one to call, because over here we have one chunk that's strongly negative, one that's mildly negative, and one that's mostly positive. But negative sentiment does win out in the end. Now, if you'd like to get the specific category that won, we just take the argmax of the mean.
That will give us a tensor; if we want to actually get the value out of that tensor, we can just add .item() onto the end there. And that is it: we have taken the average sentiment of a pretty long piece of text. And of course, we can just take this code and iterate through it for multiple long pieces of text.
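A sketch of those final steps:

```python
# average the chunk probabilities, then pick the winning class
mean = probabilities.mean(dim=0)    # one score per sentiment class
winner = torch.argmax(mean).item()  # class index as a plain Python int
```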
And it doesn't really matter how long those pieces of text are. This will still work. So I hope this has been an interesting and useful video for you. I've definitely enjoyed working through this and figuring it all out. So thank you very much for watching. And I will see you again in the next one.