
How-to Build a Transformer for Language Classification in TensorFlow


Chapters

0:00
0:10 Six Steps
0:15 Initializing Tokenizer and Model
0:24 Encode Input Data
0:29 Build Model
0:43 Optimizer, Metrics, and Loss

Transcript

Hi, and welcome to this video on implementing transformer models with TensorFlow. We're going to go through six steps, which are downloading and preprocessing data, then initializing the HuggingFace tokenizer and model-- and by HuggingFace, I mean the Transformers framework. Then we encode the input data to get the input ID and attention mask tensors.

Then we build the full model architecture. So that is our input layers, which go into BERT, and then the output layers post-BERT. Then it's back to the normal TensorFlow process where we set up our optimizer, metrics, and loss, and then we begin training. And we will cover each one of these steps in this video.

So first, we need to actually get our data. So we're going to use the IMDB movie review data set, or it may actually be Rotten Tomatoes. I'm not sure. Now, this data set provides us with sentiment ratings from 0, which is terrible, up to 4, which is amazing. You can get the data set from Kaggle, or we can just download it using the Kaggle API, which is what we're going to do here.

Now, if you haven't used the Kaggle API before, that's fine. All you need to do is pip install kaggle, then head over to the Kaggle website, go to your account page, scroll down to (I think it's) the API section, download kaggle.json, and then place kaggle.json in the correct Kaggle folder, which will have been created when you did the pip install.

Now, if you're not sure where that is, all you need to do is import kaggle, like that. When you execute this, an OSError will appear telling you that kaggle.json is missing and which folder it needs to go in. You just go ahead and put kaggle.json in that folder, and then you are ready.

Now, we just need to initialize our API and authenticate it. And now we can use the competition_download_file method to download our data into this directory. Let's refresh up here, and we can see it. Now, obviously, it's a zip file, so we need to quickly unzip that.
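Before the unzip, the API setup and download step just described might look something like this. It is only a sketch: the competition name and file name below are assumptions based on the movie-review data, so check your own competition page.

```python
from kaggle.api.kaggle_api_extended import KaggleApi

# initialize and authenticate the API (reads ~/.kaggle/kaggle.json)
api = KaggleApi()
api.authenticate()

# download a single competition file into the current directory
# (competition and file names are assumptions, not confirmed by the video)
api.competition_download_file('sentiment-analysis-on-movie-reviews',
                              'train.tsv.zip', path='.')
```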

We can unzip it in Python, or we can do it manually.

We will just do it manually. It's easier-- well, quicker, at least. And there we go, we now have our data here. It's a tab-delimited file, so if we open it, you can see we're delimiting by the tab character. And we can see we have our phrase and our sentiment, which are the two columns we care most about.

Now, we'll use pandas to read the data. Because it's a tab-delimited file, we use read_csv and just set the separator to the tab character. Now, this dataset includes the full sentence, which we can see where the phrase ID is 1 and the sentence ID is 1. That is our full sentence, or full phrase.

And then we have lots of parts of that same phrase cut down into different pieces and given a sentiment value. Now, I mean, you can use this. But I'm going to avoid it because I'm just going to be using the training data for both the training set and the validation set.

And I don't want to pollute the validation set with very similar phrases that we have used in the training data. So we're just going to drop duplicates and keep the first element of every unique sentence ID. And here, you can see that we have now removed those duplicates or segments.
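A minimal sketch of the load-and-dedupe step just described. The file name and the column names ('train.tsv', 'SentenceId') are assumptions based on the Kaggle movie-review data, not values confirmed in the video.

```python
import pandas as pd

# tab-delimited file, so read_csv with a tab separator
df = pd.read_csv('train.tsv', sep='\t')

# keep only the first phrase (the full sentence) for each sentence ID
df = df.drop_duplicates(subset='SentenceId', keep='first')
```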

With the segments removed, we only have around 8,500 samples. Now, we need to move on to encoding our data. For that, we are going to be using the Transformers framework, which we will also be using for the transformer itself. It works by providing a tokenizer and the model itself for each transformer.

So we're going to be using BERT. And that means that we are going to import or initialize a BERT model and also the BERT tokenizer, which is already pre-built. Now, before we do this encoding, we need to figure out how long we want each sequence to be, because the encoding method also acts as our padding or truncation method.

So to do that, we will get the sequence length in words of each sentence, plot it out, and just judge by eye where a cutoff won't lose too much data. So first, we get the length of each sentence after splitting it into words.

Now, split will, by default, split on whitespace, so we get a list of words. Next, we need to visualize this, so we will use matplotlib and seaborn, just because they're super easy and quick to use. We will also set the seaborn style to make it a bit more visually appealing.

And we will also increase the figure size so we can actually see what's going on. And then we will use a distribution plot. And here, we can see the distribution of the length of each sequence in our data set. Now, we could cut it off maybe around 40 or even 50.
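Here's a rough sketch of that length check and plot, assuming the text lives in a column named 'Phrase' (an assumption carried over from the Kaggle data).

```python
import matplotlib.pyplot as plt
import seaborn as sns

# word count per phrase; str.split() splits on whitespace by default
seq_len = df['Phrase'].apply(lambda x: len(x.split()))

sns.set_style('darkgrid')       # nicer default styling
plt.figure(figsize=(16, 8))     # larger figure so we can see the tail
sns.histplot(seq_len)           # older seaborn versions used sns.distplot
plt.show()
```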

I think we'll go with 50 just so we get as much data in there as possible. So we'll set sequence length equal to 50. Now, we need to initialize our tokenizer. And before we do that, actually, we need to import it from the Transformers library. And we are getting our model from a pre-trained model.

And we're using BERT, base, cased. Now, cased here refers to whether BERT distinguishes between uppercase and lowercase characters. The alternative would be uncased, which would not distinguish between them at all-- everything would just become lowercase. But when people on the internet want to shout and seem angry, they put everything in capital letters.

So BERT can probably pick up on this and tell that someone is being dramatic or shouting at you over the internet, or whatever else. And because we are classifying sentiment here, that's probably quite important. Now that we've initialized our tokenizer, we can go on to the encoding. We use the encode_plus method, which looks like this.

So you'll see here we've just defined or hardcoded a single line, which is hello world. We are using a max length of 50. We want the encoder to truncate any text that is longer than 50 tokens. Obviously, this one will not be. But when we are feeding all of our data through it, we need this in there.

And on the other hand, we also want anything shorter than 50 to be padded with pad tokens. In this case, we will end up with 48 of these padding tokens. And here we are just telling the tokenizer to pad up to the value that we have given in the max length argument.

Now, BERT comes with several special tokens. We have the start of sequence, end of sequence, padding, unknown, and mask tokens. In order to add these in during encoding, we need to pass add_special_tokens=True. In this case, all it's gonna do is add the start of sequence token, the end of sequence token, and then all of our padding values.

Now, there are several different outputs that we can get from this encode_plus method. By default, we get the input IDs and the token type IDs. We don't really need the token type IDs, so we can tell the tokenizer not to return those. But we do need the input IDs, which we get by default, and also the attention mask tensor.

To return this, we just pass return_attention_mask=True. Finally, because we are working in TensorFlow, we need to return TensorFlow tensors. Okay, so here we have our outputs: two tensors, the input IDs and the attention mask. We input the sequence hello world, and we can see that this value here in input IDs is hello, and this is world.
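Putting the tokenizer setup and that encode_plus call together, a sketch might look like this. Note that the padding argument name varies by Transformers version: newer releases use padding='max_length', older ones used pad_to_max_length=True.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

tokens = tokenizer.encode_plus(
    'hello world',
    max_length=50,
    truncation=True,              # cut anything longer than 50 tokens
    padding='max_length',         # pad shorter sequences up to max_length
    add_special_tokens=True,      # adds [CLS], [SEP] and the [PAD] tokens
    return_token_type_ids=False,  # we don't need these
    return_attention_mask=True,   # we do need the attention mask
    return_tensors='tf'           # return TensorFlow tensors
)

print(tokens['input_ids'])
print(tokens['attention_mask'])
```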

Now, the 101 and 102 you see here are the start of sequence and end of sequence tokens used by BERT, and the remaining zeros are simply the padding tokens. We also have the attention mask, and this is used to tell BERT what tokens to calculate attention for, and which tokens to just completely ignore.

So where we have a one, that tells BERT, yep, pay attention to this. Where there's a zero, it means just ignore it. So we have zeros for every padding token because padding tokens aren't important to us, it's just padding. But then we have ones for the end of sequence and start of sequence tokens, and also hello and world, because they are actually important for BERT to pay attention to.

Now, of course, this is just one, and we need to do this for every sample in our dataset. So we'll go ahead and do that. Now, we're just gonna use a simple for loop to add each value or sequence into a NumPy array, which we will initialize now. So first import NumPy, and then initialize our two arrays.

Both of them are gonna be the same size, so it's gonna be the length of our data frame by the sequence length that we have defined, which is 50. And we do this twice. We also want one for the attention mask. And we can see the shape of our arrays here.

Now, we're just gonna use a for loop to do this. It's only a small dataset, so whatever we use doesn't really matter. We are only processing and encoding this data one time, and then we will save it and then load it from memory when we're actually training our model.
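A sketch of how that encoding loop might look, again assuming a 'Phrase' column; the loop body mirrors the encode_plus call above, just without return_tensors so the outputs are plain lists.

```python
import numpy as np

seq_len = 50
num_samples = len(df)

# pre-allocate the two arrays (int32 so they match the model inputs later)
Xids = np.zeros((num_samples, seq_len), dtype=np.int32)
Xmask = np.zeros((num_samples, seq_len), dtype=np.int32)

for i, phrase in enumerate(df['Phrase']):
    tokens = tokenizer.encode_plus(
        phrase, max_length=seq_len, truncation=True,
        padding='max_length', add_special_tokens=True,
        return_token_type_ids=False, return_attention_mask=True)
    # plain lists slot straight into the NumPy rows
    Xids[i, :] = tokens['input_ids']
    Xmask[i, :] = tokens['attention_mask']
```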

And there we have our for loop. This will go through each sequence and add each one into the respective index of our initialized zero arrays. Okay, and then here we can see our complete arrays. So at the top, we have our input IDs in Xids, and we can see each one starts with our start of sequence token, 101, followed by a few words.

Then there will be the end of sequence token somewhere in the middle there, and then at the end, we have our padding token. And then we can also see in Xmask, which is our attention mask, we have the ones to pay attention to and the zeros to ignore, which obviously correspond to the respective values in the XIDs array.

Now, for our labels, we are actually going to one-hot encode them. So at the moment, let's see what we have. So we will get values of four, one, three, two, and zero. Though you can't see it here. Now, this is a pretty good format to go straight in and one-hot encode.

So all we need to do here is create an array from the DataFrame column. And then, like we did before, we initialize an empty zero array. What we are doing here is taking the array size-- okay, let's do it a little differently.

So the array size is just the length of our DataFrame, and array.max() is the maximum value within our array, which is the number four. We then add one onto that, which says we want a zero array of 8,529 rows by five columns.

Like so. Now, at the moment, we just have an empty zero array. So we just need to add ones in the indices where we have a value. And we do that very easily like this. So we create a range of values from zero to 8,528, which is the array size here.

And then within that, we index with the array itself, because the array holds each sentiment value-- zero, four, three, two, and so on-- and this places a one at that index in each row. That produces our one-hot encoding. Like so: here, our rating was one; here, four; here, one; and so on.
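As a sketch, assuming the sentiment column is named 'Sentiment', the one-hot encoding just described looks like this:

```python
# sentiment values 0..4 straight from the DataFrame column
arr = df['Sentiment'].values

# zero array of shape (num_samples, 5): arr.max() + 1 == 5
labels = np.zeros((arr.size, arr.max() + 1))

# place a 1 at each row's sentiment index
labels[np.arange(arr.size), arr] = 1
```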

Now, I said before that typically we'd save these and then load them back in. So we are gonna save them. Obviously, we don't need to load them back in right away, but I will show you how to anyway, because going forwards, if you want to retrain, you have to do this. Otherwise, you'd have to redo all of the preprocessing.

And when you're working with bigger datasets, that will take quite some time. So first, we'll just save them. Now we've saved them, and we've removed all of them from memory, so we will not be able to access any of them. Going forwards, we are just going to load them back in, which is exactly the same process.
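A minimal sketch of that save-and-reload step; the file names are assumptions.

```python
# save the encoded arrays so we never have to re-tokenize this data
np.save('xids.npy', Xids)
np.save('xmask.npy', Xmask)
np.save('labels.npy', labels)

del Xids, Xmask, labels          # remove them from memory

# loading them back in is the same process in reverse
Xids = np.load('xids.npy')
Xmask = np.load('xmask.npy')
labels = np.load('labels.npy')
```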

We quickly write that out, and now we have our data back in. Okay, so now we need to put all of our arrays into a TensorFlow dataset object. We use a dataset object because it makes things a lot easier: we can restructure the data, shuffle, and batch it in just a few lines of code, which is a lot faster in terms of performance and also a lot faster for us to write.

It's much simpler code. So we'll import TensorFlow. Also, if you're using a GPU, you can check that it is being detected by your system with this. And there we can see I have a GPU being picked up by TensorFlow. So first, we need to restructure the data.

TensorFlow expects our data to be input as a tuple. So that tuple consists of our inputs and our target or output labels. Now, because we are using BERT, our data structure is slightly different because in the input tuple, we actually have a dictionary, and that dictionary contains a key that is input_ids, which maps to our xids array.

We also have another key, attention_mask, which maps to our xmask array. So first, we actually need to create our dataset object, which we do like so. And this just creates a generator which contains all of our data in the tuple-like format where each tuple contains one xids array, xmask array, and label array.

So we can actually view one of those, like so. And here we can see each one of our arrays: first xids, then xmask, and then labels. Now, TensorFlow expects our dataset to be in a tuple format, where the zero index of the tuple is our input values and the one index is our labels.

Now, in our case, it's also slightly different because we are using two inputs. So within that tuple, we have our input, and within that input, we have a dictionary. And that dictionary consists of a key, input_ids, which maps to our xids array, and attention_mask, which must map to our xmask array.

So to create this structure, we need to build a mapping function, like so. And here we return the format we need. So input_ids, which will map to input_ids, and attention_mask, which will map to mask. Then, because we are still expecting to use the input output tuple, we also need to add labels onto the end there.

And to apply this mapping function to our dataset object, we use the map method. So now we can view a single row in this dataset, and we can see a slightly different format: we have our input dictionary here, with input_ids and attention_mask, and then we have our output tensor here.
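Pulling the dataset creation and the mapping function together, a sketch might look like this (the GPU check from earlier is included as an optional line):

```python
import tensorflow as tf

print(tf.config.list_physical_devices('GPU'))   # optional GPU check

# generator-like dataset of (xids, xmask, labels) tuples
dataset = tf.data.Dataset.from_tensor_slices((Xids, Xmask, labels))

# restructure each sample into ({'input_ids', 'attention_mask'}, labels)
def map_func(input_ids, masks, labels):
    return {'input_ids': input_ids, 'attention_mask': masks}, labels

dataset = dataset.map(map_func)
```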

Now, I said before that this makes shuffling and batching our dataset very easy. We can do it in a single line, like this. Here we're gonna put our samples into batches of 32, and the value that I've given to the shuffle method essentially just needs to be very large.

The larger your dataset, the larger this number needs to be. I take a sample of my dataset and increase the number if I can see that it is not shuffling the dataset properly. Okay, so now we have our shuffled, batched dataset in the correct format. Now we just need to split it into our training and validation sets.

Okay, and to do that, we need to get the total size of our dataset now that it is batched. So we can do that, like so. Now, because the dataset object is a generator, we can't just take the length of it directly, so we have to convert it into a list.

Now, if you're working with a very large dataset, this is probably not the right method to use; instead, you should take the dataset size you already know and calculate the number of batches from that and the batch size. But for us, this is fine, and we can see that the dataset length is 267.

Now, if we want a 90/10 split between the training and validation sets, we simply use a split value of 0.9, and then we get our training and validation sets as two different datasets. To split the dataset, we use take and skip. We used take earlier on, and what it does is simply take the specified number of, in this case, batches, and nothing else.

Skip, on the other hand, does the opposite: it skips the specified number of batches and takes the remainder. And then at the end here, we can delete dataset if space is an issue. Okay, so now our data is ready, and we can go on to actually building our model architecture.
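Before moving on, here is a sketch of the shuffle, batch, and train/validation split just described. The shuffle buffer size and the drop_remainder choice are assumptions, not values confirmed in the video.

```python
batch_size = 32

# shuffle with a large buffer, then batch; drop_remainder avoids a
# smaller final batch (optional)
dataset = dataset.shuffle(100_000).batch(batch_size, drop_remainder=True)

# 90/10 split between training and validation sets
split = 0.9
ds_len = len(list(dataset))              # fine for a small dataset
size = int(ds_len * split)

train_ds = dataset.take(size)
val_ds = dataset.skip(size)

del dataset                              # optional, if space is an issue
```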

So first, we initialize BERT, and to do that, we need to import TFAutoModel from the Transformers library. We initialize BERT like so-- and here, remember to use the same model name that you used to initialize your tokenizer. And we can see that we have now imported BERT. So we have BERT, but we need to build a network around it as well.

The first thing we need to do is define our input layers, and of course there are two, because we have the input IDs and the attention mask. The shape is simply the sequence length that we are using. And the names here are very important: they need to match the dictionary keys that we defined in our dataset here.

So input_ids must match input_ids, and attention_mask must match attention_mask. These need to line up; otherwise, TensorFlow does not know where these inputs are going. And we do the same here for the attention mask. So those are our two input layers, and now we need to pull the embeddings from our initialized BERT model.

BERT consumes our two input layers, like so, and BERT will return two tensors to us. One of those is the last hidden state, which is what we are interested in. That is a 3D tensor, which provides all the information from the last hidden state of the BERT model.

The second tensor, which we are going to ignore, is called the pooler output. The pooler output is essentially the last hidden state run through a feed-forward (linear) layer and pooled down into a 2D tensor, which can be used for classification if you want. But we are going to pool it ourselves, so we will not be using that tensor.

Okay, so here you can experiment with adding LSTM layers, convolutional layers, or anything else, but for now, to keep things simple, we are just going to add a global max pooling layer, which will convert our 3D output tensor into a 2D tensor. Again, you could skip this and just output the pooler output tensor instead, by changing the zero to a one, but we are not going to be using that.

Okay, so up here, we just need to define the input data types as well, which I missed, and that will remove the type error. Now, we need to normalize our outputs here. This will almost always give better results when we are actually training the model.

And then following this, we go into our densely connected neural network layers, which are in charge of actually figuring out the classification of our BERT embedding outputs. We also want to add a dropout layer here, which just helps prevent overfitting, or at least too much of it.

Then we add another densely connected layer. And finally, we create our output layer, which is a densely connected layer with a softmax activation function. We use softmax here because we have five labels in the output.

So we have our five labels in the output because we one-hot encoded the values zero, one, two, three, and four. And finally, we just give it a name of outputs. Now, that is our model architecture, but we still need to tell TensorFlow what our input layers are and what our output layer is.

So to do that, we define our model, like so. To the inputs, we pass input_ids and mask, and to the outputs, we just pass y. Finally, we have our model, so we can execute that and produce a model summary to see what we have built.
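Put together, the architecture just described might be sketched like this. The dense layer sizes, dropout rate, and the use of BatchNormalization for the normalization step are assumptions rather than the exact values shown in the video, and calling the main layer via bert.bert is just one common way to wire BERT into the functional API.

```python
from transformers import TFAutoModel

bert = TFAutoModel.from_pretrained('bert-base-cased')   # same model as the tokenizer

seq_len = 50

# two input layers; the names must match the dataset dictionary keys
input_ids = tf.keras.layers.Input(shape=(seq_len,), name='input_ids', dtype='int32')
mask = tf.keras.layers.Input(shape=(seq_len,), name='attention_mask', dtype='int32')

# index 0 is the last hidden state (3D); index 1 would be the pooler output (2D)
embeddings = bert.bert(input_ids, attention_mask=mask)[0]

X = tf.keras.layers.GlobalMaxPool1D()(embeddings)    # 3D -> 2D
X = tf.keras.layers.BatchNormalization()(X)          # normalize the pooled outputs
X = tf.keras.layers.Dense(128, activation='relu')(X)  # layer sizes are assumptions
X = tf.keras.layers.Dropout(0.1)(X)                   # guard against overfitting
X = tf.keras.layers.Dense(32, activation='relu')(X)
y = tf.keras.layers.Dense(5, activation='softmax', name='outputs')(X)

model = tf.keras.Model(inputs=[input_ids, mask], outputs=y)
model.summary()
```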

Okay, and here we can see our model. Now, if we scroll down to the bottom, we can see the number of parameters in our model, and we have quite a lot-- around 108 million-- almost all of them trainable. Now, BERT is a very big model, and I wouldn't recommend training it unless you have a specific reason to.

Now, for this dataset, it's definitely overkill. So what we can do is actually go in here and we can actually freeze the BERT model by freezing the third layer at index two of our model layers. And we simply set trainable equal to false to do that. So if we now look at our model summary, we will see that the number of parameters is exactly the same, but the number of trainable parameters will have reduced by a lot.
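Freezing might look like this, assuming BERT sits at index 2 of model.layers as described (after the two input layers):

```python
# model.layers: [input_ids, attention_mask, bert, ...] -- BERT at index 2
model.layers[2].trainable = False
model.summary()   # total parameter count unchanged, trainable count drops sharply
```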

So here, we now have around 104,000 trainable parameters rather than 108 million. We can go ahead and put together our optimizer, loss, and accuracy metric, compile our model, and begin training. For our optimizer, we're just gonna use Adam with a learning rate of 0.01. For the loss, because we are using one-hot encoded outputs, we use categorical cross-entropy.

And finally, for our accuracy metric, we are also gonna use categorical accuracy, for the same reason. Then we can compile our model. And now we can actually train it: we just call model.fit with our training data and our validation set.

And I've found that for this model, we have to use a lot of epochs to actually get a good accuracy out of it. So I'm gonna train it for a total of maybe 140 epochs. Now, depending on your GPU, if you're using a GPU, this will still take quite some time.
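A sketch of the compile-and-fit step, using the optimizer, loss, metric, and epoch count just described:

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
loss = tf.keras.losses.CategoricalCrossentropy()
acc = tf.keras.metrics.CategoricalAccuracy('accuracy')

model.compile(optimizer=optimizer, loss=loss, metrics=[acc])

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=140        # a lot of epochs, as noted above
)
```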

So if you're using a CPU, it will take a very long time, so maybe reduce the number of epochs or reduce the dataset size a little further if you want. So I'm gonna start training this, and I will see you on the other side. Okay, so we have finished training.

And you can see that the accuracy over time is actually quite good. It goes up very slowly, which is why using a lot of epochs has been quite useful here. If we take a look at the accuracy at the end, it's 82%, which is not bad. But more important, I think, is the fact that it was still going up gradually.

I think with further training, this could quite easily get to about 90% on this dataset, and considering it's a very small dataset, that's pretty good. Now, if we look at the validation accuracy, we actually get a higher number, by quite a bit: it went up to 94%, almost 95%.

And I would assume that this is because within the validation set, there are more easy examples, whereas in the training set, we have some more difficult examples. But nonetheless, these are pretty good results for quickly putting together a model. It's not a particularly big model, other than the BERT encoder in the middle, but otherwise, it's a pretty straightforward, simple model.

So it's pretty cool that you can get these results on sentiment analysis in so little time. And going forward, other datasets with more data can definitely do a lot better. At the same time, we can also improve the model: we can add LSTM layers to the classifier, convolutional neural network layers, or even just more densely connected layers as well.

So there is a lot that we can actually do with this. But for now, that's everything. So I hope you enjoyed the video. I hope it's been useful to you. If you have any questions, let me know in the comments below. But otherwise, thank you for watching, and I will see you next time.