Multi-Class Language Classification With BERT in TensorFlow


Chapters

0:00 Intro
1:21 Pulling Data
1:47 Preprocessing
14:33 Data Input Pipeline
24:14 Defining Model
33:29 Model Training
35:36 Saving and Loading Models
37:37 Making Predictions

Transcript

Hi, and welcome to this video on multi-class classification using Transformers and TensorFlow. So I've done a video very similar to this before, but I missed a few of the final steps at the end, which were saving and loading models and actually making predictions with those models. So I've made this video to cover those points, as a lot of you were asking for how to actually do that.

So we're going to cover those, and we're also going to cover all the other steps that lead up to that. So if this is the first video on this that you've seen, then I'm going to take you through everything. So I'm going to take you through sourcing data from Kaggle that we'll be using, pre-processing that data.

So that's tokenization and encoding the label data. Then we're also going to be looking at setting up a TensorFlow input pipeline, building and training the model, and then we go on to the few extra things I missed before-- so the saving and loading the model, and also making those predictions.

So that's everything that we'll be covering. And what I've done is left chapters in the timeline, so you can skip ahead to whatever you need to see. So we'll jump straight into the code. So here we're just going to download the data. We're going to be using the sentiment analysis on the movie reviews data set, which you can see over here on Kaggle.

You can download it from here if you just click on train.tsv.zip, download it, and unzip it. I'm just going to do it through the Kaggle API here, and then unzip it with zipfile, which is just a little bit easier. So I'll run that. And I do have a video on the Kaggle API, so I'll put a link to that in the description.
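
For reference, that download step looks roughly like this, assuming the kaggle package is installed and authenticated (the competition slug is taken from the Kaggle page):

    from kaggle.api.kaggle_api_extended import KaggleApi
    import zipfile

    # authenticate with the Kaggle API (reads credentials from ~/.kaggle/kaggle.json)
    api = KaggleApi()
    api.authenticate()

    # download train.tsv.zip from the competition, then unzip it
    api.competition_download_file(
        'sentiment-analysis-on-movie-reviews', 'train.tsv.zip', path='./'
    )
    with zipfile.ZipFile('train.tsv.zip', 'r') as z:
        z.extractall('./')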

So we've got our data. You can see it in the left here. And I'm just going to import pandas and view what we have, reading it in with read_csv. We're using a tab-delimited file here, so we need to use the tab separator. And let's just see what we have.
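
As a rough sketch, assuming the unzipped train.tsv sits in the working directory:

    import pandas as pd

    # train.tsv is tab-delimited, so pass the tab separator explicitly
    df = pd.read_csv('train.tsv', sep='\t')
    df.head()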

OK, we have the sentiment here, and we have the phrase. And that's all we really need. So on the phrase here, we're going to be tokenizing this text to create two input tensors, our input IDs and the attention mask. Now, we're going to contain these two tensors within two NumPy arrays, each with dimensions of the length of the data frame by 512.

512 is the sequence length of our tokenized sequences when we're putting them into BERT. So when tokenizing, all we're going to do is iterate through each sample one by one and assign each tokenized sample to its own row in the respective NumPy array. And we'll first initialize those as empty zero arrays.

So we'll do that. First, we need to import NumPy. And then, like I said before, the sequence length is 512. And then our number of samples is just equal to the length of our data frame. And with that, we can initialize those empty zero arrays. So one will be Xids, which will be our token IDs.

And that is initialized with empty zeroes. And then we pass the size of that array here. So number of samples or the length of the data frame by the sequence length, which is 512. And then we can copy that. And we do the same thing for xmask, which is our attention mask.
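
Pulling that together, the array setup is something like:

    import numpy as np

    seq_len = 512          # BERT sequence length
    num_samples = len(df)  # number of rows in the data frame

    # one row per sample, one column per token position
    Xids = np.zeros((num_samples, seq_len))
    Xmask = np.zeros((num_samples, seq_len))

    print(Xids.shape)  # (156060, 512)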

And then let's just confirm we have the right shape as well. So we have 156,060 samples and 512 tokens within each sample. Now that we have initialized those two arrays, we can begin populating them with the actual tokenized text. So we're going to be using transformers for that.

And we are using BERT, so we'll import BertTokenizer. And then we just want to initialize the tokenizer. And what we'll do here is just load it from pre-trained, and I'm using bert-base-cased. And then what we can do here is we'll just loop through every phrase within the phrase column.

And we'll just enumerate that. So we've got the row number in i and the actual phrase in the phrase variable. So here, phrase. And then we want to pull out our tokens using the tokenizer. So we do encode_plus. And then we have our phrase, the max length, which is the sequence length.

So that's 512. We're going to set truncation to true. So here, if we have any text which is longer than 512 tokens, it will just truncate and cut it off at that point because we can't feed in arrays of-- or tensors of different shapes into our model. It needs to be consistent.

Likewise, if we have something that is too short, we want to pad it up to 512 tokens. And for that, we need to use padding equal to max length, which will just pad up to whatever number we pass into this max length variable here. And then I want to add special tokens.

So in BERT, we have a few special tokens: [CLS], which marks the start of a sequence; [SEP], the separator, which either separates different sequences or marks the end of a sequence; and [PAD], the padding token, which we'll be using because we set padding equal to max length.

So if we have an input which is, let's say, 410 tokens in length, both of those special tokens will be added to it, which pushes it up to 412. And then we'll add 100 padding tokens onto the end, which pushes it up to 512. So that's how that works.

So obviously, we do want to add those special tokens. Otherwise, BERT will not understand what it is reading. And then because we're using TensorFlow here, we're going to return tensors equals tf for TensorFlow. And then what we want to do here, so we've pulled out our tokens into this dictionary here.

So this will give us a dictionary. It will have different tensors in there. So it will have our input IDs and attention mask. And we want to add both of those to a row within each one of these zero arrays. So to do that, we want to say Xids.

And then this is why we have enumerated through here. So we have this i. So this tells us which row to put it in. And then we want to select the whole row. And we set that equal to the tokens' input IDs. Now, as well, we do the same for the Xmask.

But this time, rather than input IDs, we are going to set it equal to the attention mask. So let's just have a quick look at what we have in Xids now. OK, it's just zeros. So now let's run this. And now if we rerun that view, we can see, OK, now we have all these other values.
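
So the whole tokenization loop looks roughly like this (the Phrase column name is as it appears in the Kaggle file):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

    for i, phrase in enumerate(df['Phrase']):
        tokens = tokenizer.encode_plus(
            phrase,
            max_length=seq_len,
            truncation=True,
            padding='max_length',
            add_special_tokens=True,
            return_tensors='tf'
        )
        # each tokenized sample goes into its own row of the two arrays
        Xids[i, :] = tokens['input_ids']
        Xmask[i, :] = tokens['attention_mask']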

So first, this 101 is the CLS token that I mentioned before. So the start sequence token. And these zeros here, they are all padding tokens. So obviously, at the end here, we almost always have paddings unless the text is long enough to come up to this point or if it has been truncated.

But here, we can see the sort of structure that we would expect. And we see some duplication here. So we have 101. Then we have this 138, 1326. And if we look at our data here, we can see, OK, that's why. Because for the first few examples, we have segments of the full sequence here.

So there is some duplication there. So that all looks pretty good. And I think let's just have a quick look at the XMASK. So here, you can see something different. We have these ones and zeros. So XMASK is essentially like a control for the attention layer within BERT. Wherever there's a one, BERT will calculate the attention for that token.

Wherever there's a zero, it will not. So this is to avoid BERT making any kind of connection between the padding tokens and actual words. Because these padding tokens are, in reality, not there. We want BERT to just completely ignore them. And we do that by marking every padding token as a zero within the attention mask array or tensor.

Now, as well as that, we also want to encode our label data. So at the moment, we can see here we have the sentiment. And this is a set of values from 0 to 4, which represent each of the sentiment classes. So here, we have sentiment labels of 0, which is very negative, 1, somewhat negative, 2, neutral, 3, somewhat positive, and 4, positive.

So we're going to be keeping all those in there. But we're going to be one-hot encoding them. So to do that, we will first extract that data and put it into an array. We want to take df sentiment, which is the column name, and we want to take the values.

And if I just show you what that gives us, it just gives us an array. And we have values up to 4, so 0, 1, 2, 3, 4. And that's good. And now what we need to do is initialize, again, a zero array. So we'll do np.zeros. And in here, the first dimension is the number of samples, because, again, we have the same number of samples in our labels here, so the length of the data frame.

And for the second dimension, I want the array's max value plus 1. Now, this works because in our array, we have 0, 1, 2, 3, 4. So the max value of that is 4. And we have five classes here. So if we take the 4 plus 1, we get 5, which is the total number of values that we need.

And this essentially gives us the number of columns in this array. And we need one column for each different class. So that gives us what we need. And we can just check that as well. So we have labels.shape. And here, we see we have the length of the data frame or number of samples.

And we have our five classes. Now, let's have a quick look at what we're doing here. So we'll print out labels. OK, we just have zeros. Now, what we're going to do is we're going to do some fancy indexing to select each value that needs to be selected based on the index here and set that equal to 1.

So for these first three examples, the sentiment value tells us which column to set to 1. These are the column indices, so 0, 1, 2, and so on. For the first one here, column 1 gets set to 1 because the sentiment is 1. The next ones are 2s, so we select column 2 for those as well.

And then for 3 down here, we'd have a 1 here. So to do that, we need to specify both the row. So we're just going to be getting one row at a time. So all we need here is a range of numbers, which covers from 0 all the way down to 156,060, which is our length here.

So to do that, we just go np.arange, passing the number of samples. And then here, we need to select which column we want to set for each row. And that's easy because we already have it here. This is our array. So we just write the array.

And then each one of those positions that we select, we want to set equal to 1. OK, and now let's rerun this cell as well. OK, and now we can see we have those ones in the correct places.
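
As a sketch, the one-hot encoding step is something like this (Sentiment being the column name in the Kaggle file):

    # sentiment values 0-4, one per sample
    arr = df['Sentiment'].values

    # one column per class: arr.max() + 1 == 5
    labels = np.zeros((num_samples, arr.max() + 1))

    # fancy indexing: for every row, set the column given by its sentiment value to 1
    labels[np.arange(num_samples), arr] = 1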

So that's our one-hot encoding. And now what we want to do is take our data here and put it into a format that TensorFlow will be able to read. So to do that, we want to import TensorFlow first. And what we're going to do is use the TensorFlow dataset object.

So this is just an object provided by TensorFlow, which allows us to transform our data and shuffle and batch it very easily. So it just makes our life a lot easier. So it's a dataset. And we're creating this dataset from tensor slices, which means from arrays. And in here, we're going to pass a tuple of Xids, Xmask, and labels.

And to actually view what we have inside that data set, we have to use specific data set methods because we can't just print it out and view everything. It's a generator. So what we can do is just take one. And that will show us the very top sample or after we batch it, the very top batch.
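
A minimal sketch of that:

    import tensorflow as tf

    # build a dataset from the three arrays
    dataset = tf.data.Dataset.from_tensor_slices((Xids, Xmask, labels))

    # peek at the first element
    print(dataset.take(1))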

And I'll just print that out. We see here, OK, we have this TakeDataset with its shapes. And we have this tuple here. So this is one sample. And inside here, we have a tensor, which is of shape 512. So this is our very first Xids array-- or row, sorry.

So this is like doing this and getting this. So this is what this value is here. This is the size or the shape. OK, and then we have the same for the Xmask, which is the second item here in index 1. And we also have our labels as well.

You can see here. OK, and that's all good. But what we need now is to merge both of our input tensors into a single dictionary. So the reason that we do that is that when TensorFlow reads data during training, it's going to read-- or it's going to expect a tuple where it has the input in index 0 and the output or target labels in index 1.

It doesn't expect a tuple with three values. So to deal with that, what we do is we merge both of these into a dictionary, which will be contained within tuple index 0. So first, we create a map function. And what we're going to do is just apply whatever is inside this function to our data set.

And it will reformat everything in our data set to this format that we set up here. So the function takes our input IDs, the masks, and the labels. And all we want to do is return the input IDs and mask together. And we're also going to give them these key names so that we can map the correct tensor to the correct input later on in the model.

So we go input IDs. We have our attention mask, which goes to our masks. And then that is the first part of our tuple. And then we also have the labels, which is the second part. And that's all we need to do for creating that map function. And then like I said before, the dataset object makes things very easy for us.
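
So that map function is roughly:

    def map_func(input_ids, masks, labels):
        # pack the two inputs into a dict keyed by the model's input layer names
        return {'input_ids': input_ids, 'attention_mask': masks}, labels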

So to actually apply that mapping function, all we need to do is dataset.map with the map function. So now let's have a quick look at what we have, and see if the format or the shape has changed. OK, you can see now we have-- so it's all within a tuple.

This is the 1 index of that tuple. This is the 0 index of that tuple. And in the 0 index, we have a dictionary with input IDs, which goes to our input ID tensor, and attention mask, which maps to our attention mask tensor. OK, so that's great. And now what we want to do is actually batch this data.

So I'm going to use a batch size of 16. You might want to increase or decrease this, probably mostly dependent on your GPU. I have a very good GPU. So I would say this is at the upper end, the size that you want to be using. And what we do here is data set dot shuffle.

So this is going to shuffle our data set. And the value that you should input in here, I tend to go for this sort of value for this type of size data set. But if you notice that your data is not being shuffled, just increase this value. And then batch.

And then we have the batch size. OK, so split into batches of 16. So we first shuffle the data, and then we batch it. Otherwise, we would get batches, and we would end up shuffling the batches, which is not really what we want to do. We just want to actually shuffle the data within those batches.

And then we want to set drop remainder equal to true. So that is dropping the final incomplete batch. We have a batch size of 16, so if we had, for example, 33 samples, we would get two batches out of that, 16 and 16, and then we'd have one left over. And that wouldn't fit cleanly into a batch, so we would just drop it and get rid of it.
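
A sketch of the shuffle-and-batch step; the 10,000 shuffle buffer here is just an illustrative value, not one given in the video:

    batch_size = 16

    # shuffle first, then batch; drop_remainder discards the final incomplete batch
    dataset = dataset.shuffle(10000).batch(batch_size, drop_remainder=True)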

And that is what we're doing there. Then let's just have a look at what we have now. So now, see, this has changed. We still have that tuple shape where we have the labels here and the inputs here. But our actual tensor shapes have changed.

So now we have 16 samples for every tensor. OK, so that is what we want. That's good. We've got our full data set, our train data set here. And what we might want to do is split that into a training and validation set. And we can do that by setting the split here.

So we're going to use 90%. You can change this. So we're going to have 90% training data, 10% validation data. And what we need to do is just calculate the size of that split, or the size of the training data in this case. So we'll take the Xids shape. Or actually, we can just set it equal to the value that we defined up here, the number of samples.

So these are the same. Let me show you. So the number of samples and the Xids shape are the same thing. We've already defined that, so let's just go with that. And we're going to divide that by the batch size. And this gives us the new number of samples, now the number of batches, within our data set, because we've batched it here.

Now, we only want 90% of these, so we multiply it by the split. So I do need to just run that quickly. So that's our 90% mark. And when we say, in a moment, that we want this number of samples from the data set, we can't give it a float, because we can't have 0.375 of a batch.

It doesn't work. So what we need to do is cast that to an integer to round it off. So we do that here. Let's remove that and run that. So now we have our size. We can split our data set. So we're going to have the train data set, which is just going to be the data set.

And as we did up here, where we do this take method to take a certain number of samples, we do the same. But now we're going to take a lot more than just one. We're going to be taking the size that we calculated here, which is 8,700 or so.

And then for the validation data set, we kind of want to do the opposite. We want to skip that many samples. So that's exactly what we write here. We just write skip size. So we're going to skip the first 8,700 or so. And we're just going to take the final ones out of that, the final 10%.
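
Putting the split together, roughly:

    split = 0.9

    # number of batches to keep for training; cast to int, we can't take a fraction of a batch
    size = int((num_samples / batch_size) * split)

    train_ds = dataset.take(size)  # first 90% of batches
    val_ds = dataset.skip(size)    # remaining 10%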

And then we're not going to be using data set anymore. And it's quite a large object. So we can just remove that from our memory. OK, so now we're on to actually building and training our model. So we're going to be using Transformers again to load in a pre-trained BERT model.

And we're going to be using TFAutoModel, which is the TensorFlow version. And we'll set bert equal to TFAutoModel from pre-trained, and it is bert-base-cased. And note that because we're using the TensorFlow version here, the class has the TF prefix; if we got rid of this, we'd be using the PyTorch version.
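
So loading the pre-trained model is something like:

    from transformers import TFAutoModel

    # the TF prefix gives the TensorFlow weights; AutoModel without it would be the PyTorch class
    bert = TFAutoModel.from_pretrained('bert-base-cased')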

So here we're using TensorFlow. Just like we would have any other TensorFlow model, we can use the summary method to print out what we have. OK, and we see we just have this one layer, which is a BERT layer. And this is just because everything within that BERT layer is a lot more complex than just one layer, but it's all embedded within that single layer.

So we can't actually see inside of it just by printing out the summary. Now, we have BERT. And that is great. But that's kind of like the core or the engine of our model. But we need to build a frame around that based on our use case. So the first thing we need to do is we have our two input layers.

We have that one for input IDs and one for the attention mask. So first, we need to define those. So we've already imported TensorFlow earlier for the data set, so we don't need to do that again. And what we do is input IDs, and we say tf.keras.layers. And we're using an input layer here.

And the shape of that input layer is equal to the sequence length. So sequence length, and then we just add a comma here to make it a tuple. So that's the same shape as we were seeing up here. Now, we need to set a name. And we do this because, as we have seen up here, we have this dictionary.

And we need to know which input each of these input tensors are going to go into. And that's how we do it. We map input IDs to this name here, input IDs. And we'll do the same for attention mask as well in a moment. And we set the data type equal to integer 32.

And we do that because these are tokens here, and BERT expects them to be integer values. And we do the same for the mask, tf.keras.layers Input. And the shape is the sequence length again. We have the name, which is where we use attention mask to map that across. And again, it's just the same dtype, which is int32.
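
So the two input layers, as a sketch:

    input_ids = tf.keras.layers.Input(shape=(seq_len,), name='input_ids', dtype='int32')
    mask = tf.keras.layers.Input(shape=(seq_len,), name='attention_mask', dtype='int32')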

So they are our input layers. And then what we need is we need-- after the input, what do we have? We want to feed this into BERT, right? So what we're doing there is we're creating our embeddings from BERT. And what we need to do is access the transformer within that BERT object.

So to do that, for BERT, we just write bert.bert, which accesses that transformer. And in there, we want to pass our input IDs and our attention mask, which is going to be the mask, so these two input layers here. And we have two options here. We can pull out the raw activations or the raw activation tensors from the transformer model here.

And these are three-dimensional. And as I said, just take out that raw activation from the final layer of BERT. Or what they also include here is a pooled layer or pooled activations. So these are those 3D tensors pooled into 2D tensors. Now, what we're going to be using is dense layers here.

So we are expecting 2D tensors. And therefore, we want to be using this pooled layer. We could also pool it ourselves. But the pooling has already been done. So why do it again? Now, what we want to do here for our final part of this model is we need to convert these embeddings into our label predictions.

So to do that, we want two more layers. And these are going to both be densely connected layers. And for the first one, I'm going to use 1,024 units or neurons. And the activation will be ReLU. And we're passing the embeddings into that. And then our final layer, which is our labels, that is going to be the same thing again, so dense.

But this time, we just want the number of labels here. So we did calculate this before. It was the array max plus 1, right? So that is just 5. So we have 5 output labels. And what we want to do here is calculate the probability across all 5 of those output labels.

So to do that, we want to use a softmax activation function. And we just want to say we're going to call this the outputs layer, because it is our outputs. Now, that is our model architecture. Those are all of our layers, but we haven't actually put those layers into a model yet.
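
Those layers look roughly like this; depending on your transformers version, you may need to grab the pooled output via .pooler_output rather than index 1:

    # index 1 of the BERT output is the pooled (2D) activation tensor
    embeddings = bert.bert(input_ids, attention_mask=mask)[1]

    # pre-classification densely connected layer
    x = tf.keras.layers.Dense(1024, activation='relu')(embeddings)

    # output layer: one unit per sentiment class, softmax for class probabilities
    y = tf.keras.layers.Dense(5, activation='softmax', name='outputs')(x)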

At this point, they're all kind of just floating there by themselves. Obviously, they do have these connections between them, but they're not initialized into a model object yet. So we need to do that. And to do that, we go tf.keras.Model. And we also need to set the input layers here, so our inputs.

And we have two of those, so we put them in a list here, input IDs and the mask. So this is those two. Then we also need to set the outputs. And we just have one output, and that is y. So we're just setting up the boundaries of our model.

We have the inputs, and they lead to our outputs. Everything else within this is already handled: y consumes x, x consumes the embeddings, and the embeddings consume the input IDs and mask. So those connections are already set up. And let's just see what we have here. And I just realized that this is input. This should be mask.

This should be mask. Let's see what this error is. OK, so here I forgot to add this connection, so I need to add x there. OK, so now what do we have? It's a little bit messy, but-- so we have our input IDs, and we have the shape here, the attention mask.

So these are our two input layers. They lead into our BERT layer. Then we have our pre-classifying layer here, the densely connected neural net with 1,024 units. And we have our outputs, which is the softmax. And we have five of those. Now, if you would like to, what you can do if you don't have enough compute to train the BERT layer as well, you can also write this.

So we go to model.layers and select number 2, because we have 0 and 1 for the two input layers, so BERT is number 2 in there. And we can set trainable equal to false. And this would just freeze the parameters within the BERT layer and just train the other two here. But I will be keeping it trainable so BERT can be trained as well, although you don't need to.
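
Putting the model together, with the optional freeze commented out:

    model = tf.keras.Model(inputs=[input_ids, mask], outputs=y)
    model.summary()

    # optional: freeze the BERT layer (index 2, after the two input layers) if compute is limited
    # model.layers[2].trainable = False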

Training BERT as well will probably give you a small performance increase, but not a huge amount in most cases. Now, I want to set up the model training parameters. So we need an optimizer, and for this, we're going to be using Adam. We're using a pretty small learning rate of 1e-5 because we've got our BERT model in here.

We also want to set a weight decay as well. So this is Adam with a decay. And what we also want to add is a loss function. So we want to use tf.keras.losses. And because we're using categorical outputs here, we want to use categorical cross-entropy. And then we're going to set our accuracy as well.

And that is tf.keras.metrics this time. And we're using categorical accuracy for the same reason. We just need to pass accuracy in there as well. And let me change that to metrics. And then we just want to do model.compile. The optimizer is equal to our optimizer, loss to our loss.

And metrics is going to be equal to a list containing our accuracy. OK, so that's our model training parameters all set up. So the final thing to do now is train our model. To do that, we call model.fit, like we would in any TensorFlow training. And we have our training data set, which we've built already.

The validation data we'll be using is our validation data set. And we'll train that for three epochs. And that will take some time. And immediately after we train that, I'm also going to save the model to sentiment_model. This will just create a directory here and store all the files that we need for that model in there.
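
The training setup, roughly; the video also adds a decay to Adam, which I've left out here since the exact value isn't given and the argument name varies across TensorFlow versions:

    # small learning rate because we're fine-tuning BERT
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
    loss = tf.keras.losses.CategoricalCrossentropy()
    acc = tf.keras.metrics.CategoricalAccuracy('accuracy')

    model.compile(optimizer=optimizer, loss=loss, metrics=[acc])

    history = model.fit(train_ds, validation_data=val_ds, epochs=3)
    model.save('sentiment_model')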

So I will go ahead and run that. And I will see you when it's done. OK, so we've finished training our model. We got to an accuracy of 75%, still improving, and also a validation accuracy of 71% here. So inside the sentiment_model directory that we just created when we saved our model, we have everything that we need to load our model as well.

So I'm just going to show you how we can do that. What we can do is start a new file here, a new notebook. OK, so let's just close this. And what we'll do here is we need to first import TensorFlow. And after we have imported TensorFlow, we need to actually load our model.

So we use tf.keras.models.load_model. And then we're loading this from the sentiment_model directory here. And then let's just check that what we have here is what we built before. OK, great. So we can see now we have our input layers, BERT, and then we have our pre-classifier and classifier layers as well.
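
As a sketch:

    import tensorflow as tf

    model = tf.keras.models.load_model('sentiment_model')
    model.summary()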

So that's exactly what we did before. And now we can go forwards and start making predictions with this model. So before we make our predictions, we still need to convert our data or input strings into tokens. To do that, I'm going to create a function to handle it for us.

First, we are going to need to import the tokenizer from Transformers. We're using BertTokenizer just like we did before, and we initialize it with BertTokenizer from pre-trained bert-base-cased. OK, so exactly the same as what we used before. So that is our tokenizer. And all we need to do now is define a function, which we'll call prep_data.

And here we would expect a text, which is a string. And we'll return our tokens. And this is just the same as what we were doing before. OK, so we do encode_plus on our text. We set a max length, which is going to be 512, as always. But we are going to truncate anything longer than that.

And we're going to pad anything shorter than that. And we'll pad it up to the max length. And then we want to add the special tokens. That is true. And then there's one other thing that we don't need, which are the token type IDs. So token type IDs. And this is just another tensor that is returned.

And we don't need them. So we can just ignore them. And we're going to return the TensorFlow tensors. OK, and that is all we need. And now we can just return our tensors in the correct format. So the format that we need is like we used before with the input IDs.

But if you remember before, within the data set, we were working with TensorFlow Float64 tensors. So we also need to convert these, which will be integers, I believe, into that format. So to do that, we do tf.cast tokens, input IDs. And we say we want to cast that to tf.float64.

And we can copy that across. And we'll repeat the same thing, but for our attention mask. So attention mask, we'll just copy that across. And that is all we need to prepare our data.
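
So the finished prep_data function looks roughly like this:

    from transformers import BertTokenizer
    import tensorflow as tf

    tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

    def prep_data(text):
        tokens = tokenizer.encode_plus(
            text,
            max_length=512,
            truncation=True,
            padding='max_length',
            add_special_tokens=True,
            return_token_type_ids=False,
            return_tensors='tf'
        )
        # cast to float64 to match the dtype the model saw during training
        return {
            'input_ids': tf.cast(tokens['input_ids'], tf.float64),
            'attention_mask': tf.cast(tokens['attention_mask'], tf.float64)
        }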

OK, and now what we can do is just call prep_data on something like "hello world." OK, and we get all these values. And you see here, I had entered that wrong. We don't even need this necessarily, but just to be explicit, we need to add the s onto input IDs there. So rerun that, and that removes the error.

And you can see here we have our CLS token, "hello world," the separator token, followed by lots of padding. OK, so our prep_data works. And let's just put that into a variable there. And what I want to do now is get the probability by calling model.predict, and we pass test in here.

So we've already done prep_data. And let's see what we get. OK, so we have these predictions, which are not that easy to read. And we also just need to access the zero index, so we just get a simple array there. And what we can do to get the actual class from that is we'll just import NumPy as np, because we just want to take the maximum value out of all of these.

And to do that, we just do np.argmax on probs[0]. OK, so it had a neutral sentiment. But let's try something like "this movie is awful." And we should get 0, OK? And we'll run this one as well. And we'll get 4, OK? So it's working.
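
And the prediction step, as a sketch:

    import numpy as np

    probs = model.predict(prep_data('hello world'))[0]
    print(np.argmax(probs))  # index of the highest-probability class, 0-4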

So that, I know, is pretty long, but that is really everything you need from start to finish. We've preprocessed the data, set up our input data pipeline, built and trained our model, saved it, loaded it, and made predictions. So that's really everything you need. I hope this has been a useful video, and thank you very much for watching. I will see you again in the next one.