How to Build a Transformer for Language Classification in TensorFlow
Chapters
0:00 Introduction
0:10 Six Steps
0:15 Initializing Tokenizer and Model
0:24 Encode Input Data
0:29 Build Model
0:43 Optimizer, Metrics, and Loss
Hi, and welcome to this video on implementing transformer models for language classification in TensorFlow. We're going to go through six steps, starting with initializing the HuggingFace tokenizer and model (and by HuggingFace, I mean the Transformers framework). Then we encode the input data to get our input ID and attention mask tensors. Those become our input layers, which go into BERT. After that, it's back to the normal TensorFlow process, where we set up our optimizer, metrics, and loss, compile the model, and train. And we will cover each one of these steps in this video.
So we're going to use a movie review data set from Kaggle. This data set provides us with sentiment ratings from 0, which is terrible, up to 4, which is amazing.
We can download it manually from the Kaggle website, or we can just download it using the Kaggle API. Now, if you haven't used the Kaggle API before, that's fine. You just install the kaggle package, and then you need to head over to the Kaggle website, go to your account page, and scroll down to the API section, where you can create a new API token. That gives you a kaggle.json file, which needs to go in the correct Kaggle folder, which will have been created when you installed the package. Now, if you're not sure where that is, all you need to do is try calling the API anyway. When you execute this, an OSError will appear, and it will say you need kaggle.json in a particular folder, which you don't yet have there. You just go ahead and put kaggle.json in that folder. Now, we just need to initialize our API and authenticate it.
And now we can use the competition download file method, and we are going to download the data into this directory. Let's refresh up here, and we can see it. We can also do this in Python, or you can do it manually, but for now, we're just going to do it this way.
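As a rough sketch, the download step looks something like this; the competition slug and file name are assumptions based on the data format described here (it matches Kaggle's Sentiment Analysis on Movie Reviews competition), so swap in whichever competition you are actually pulling from:

```python
from kaggle.api.kaggle_api_extended import KaggleApi
import zipfile

api = KaggleApi()
api.authenticate()  # reads kaggle.json from your Kaggle config folder

# assumed competition slug and file name
api.competition_download_file(
    'sentiment-analysis-on-movie-reviews', 'train.tsv.zip', path='.'
)

# extract in Python (you can also unzip the file manually)
with zipfile.ZipFile('train.tsv.zip', 'r') as z:
    z.extractall('.')
```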
Because it's a tab-delimited file, we read it with pandas read_csv and a tab separator, and we can see we have our phrase and our sentiment columns, alongside a phrase ID and a sentence ID. The first row holds the full phrase, which we can see where the phrase ID is 1 and the sentence ID is 1, and then we have lots of parts of that same phrase cut down into different pieces, each given its own sentiment value.
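A minimal sketch of that load step, assuming the extracted file is named train.tsv and sits in the working directory:

```python
import pandas as pd

# tab-delimited file, so we pass sep='\t' to read_csv
df = pd.read_csv('train.tsv', sep='\t')
df.head()
```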
Now, we're going to be using this training data for both the training and the validation sets, and I don't want to pollute the validation set with cut-down segments of phrases that also appear in training. So we're just going to drop duplicates and keep the first element of every unique sentence ID. With the segments removed, we only have around 8,500 samples.
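That deduplication can be done with pandas drop_duplicates; the SentenceId column name is an assumption based on the walkthrough:

```python
# keep only the first (full) phrase for each sentence ID
df = df.drop_duplicates(subset='SentenceId', keep='first')
len(df)  # roughly 8,500 rows remain
```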
Now, we need to move on to encoding our data. For that, we are going to be using the Transformers framework, which we will also be using for the transformer model itself. It works by providing a tokenizer and a model class for each transformer, and that means we are going to import or initialize a BERT model and also the matching BERT tokenizer.
First, though, we need to figure out how long we want each sequence to be, because the encoding method also acts as our padding and truncation step. To do that, we will get the sequence length in words of each sentence, plot it out, and just pick a sensible cutoff. What we're going to do here is get the length of each phrase with split, which will, by default, split by spaces. We'll plot the distribution with seaborn, just because it's super easy and quick to use, and we will also set the seaborn style just to make the chart easier to read. That gives us the distribution of the length of each sequence in our data set. Now, we could cut it off maybe around 40 or even 50; I think we'll go with 50, just so we get as much data in there as possible.
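A quick sketch of that length check, assuming the text lives in a Phrase column:

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('darkgrid')  # assumed style choice

# word count per phrase, splitting on whitespace by default
word_counts = df['Phrase'].apply(lambda x: len(str(x).split()))

sns.histplot(word_counts)
plt.show()
```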
To initialize the tokenizer, we first need to import it from the Transformers library, and we are initializing it from a pre-trained model. Now, cased here refers to whether BERT distinguishes between uppercase and lowercase characters; the cased model keeps that difference, while the uncased version lowercases everything. Keeping case can be useful, because capitalization can tell you that someone is being dramatic or shouting at you. And because we are classifying sentiment here, that is information worth keeping, so we will use the cased model.
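Initializing the tokenizer then looks like this, assuming the bert-base-cased checkpoint:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
```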
So we use the encode_plus method, which looks like this. You'll see here we've just defined, or hardcoded, a single example sentence, but when we are feeding all of our data through it, every phrase will be padded or truncated to our sequence length of 50. In this example, we end up with 48 of these padding tokens. To add these in during encoding, we tell the tokenizer our max length and that we want padding up to that max length, and then it's going to add all of our padding values for us. There are a few different tensors that we can get back from this encode_plus method. The token type IDs we don't really need, so we can tell the tokenizer not to return those. Finally, because we are working in TensorFlow, we ask it to return TensorFlow tensors. In the input IDs, the special tokens at either end are the start of sequence and end of sequence tokens. In the attention mask, where there's a zero, it means just ignore that position, because padding tokens aren't important to us; then we have ones for the real word tokens and for the start and end of sequence tokens.
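A sketch of the single-sentence encoding with those settings (fixed length of 50, padding, no token type IDs, TensorFlow tensors back); the example sentence is just a placeholder:

```python
tokens = tokenizer.encode_plus(
    'hello world',               # hardcoded example sentence
    max_length=50,
    truncation=True,
    padding='max_length',
    add_special_tokens=True,
    return_token_type_ids=False,
    return_attention_mask=True,
    return_tensors='tf'
)
# tokens['input_ids'] and tokens['attention_mask'] are the two tensors we keep
```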
Now, we need to do this for every sample in our dataset. We're going to add each encoded sequence into a NumPy array, so we initialize two zero arrays whose shape is the length of our data frame by the sequence length that we have defined, which is 50. We're just going to use a for loop to do this. It isn't the fastest approach, but we are only processing and encoding this data one time, and then we will save it and load it back in whenever we need it.
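A sketch of that loop; the Phrase column name is assumed, and the tokenizer's plain list outputs drop straight into the NumPy rows:

```python
import numpy as np

seq_len = 50
num_samples = len(df)

# pre-allocate the two input arrays
Xids = np.zeros((num_samples, seq_len), dtype='int32')
Xmask = np.zeros((num_samples, seq_len), dtype='int32')

for i, phrase in enumerate(df['Phrase']):
    tokens = tokenizer.encode_plus(
        phrase, max_length=seq_len, truncation=True,
        padding='max_length', add_special_tokens=True,
        return_token_type_ids=False, return_attention_mask=True
    )
    Xids[i, :] = tokens['input_ids']
    Xmask[i, :] = tokens['attention_mask']
```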
Okay, and then here we can see our complete arrays. At the top, we have our input IDs in the Xids array, with the token IDs at the start of each row and our padding tokens at the end. Below that, we have the attention mask values in Xmask, which obviously correspond to the respective values in the input IDs.
Now, for our sentiment labels, we are actually going to one-hot encode them. In the data, we have the values 4, 1, 3, 2, and 0. What we are doing here is taking the array size, which is just the length of our DataFrame, and arr.max(), which is the maximum value within our array. Together, those say that we want a zero array with one row per sample and one column per sentiment class. At the moment, we just have an empty zero array. Then we index into it with a range from zero to 8,528, which is the array size here, and because arr holds each sentiment value, we use those values as the column indices and set those positions to one.
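As a sketch, with the Sentiment column name assumed:

```python
arr = df['Sentiment'].values

# zero array: one row per sample, one column per class (0-4 gives 5 columns)
labels = np.zeros((arr.size, arr.max() + 1))

# use the sentiment values as column indices and flip those positions to 1
labels[np.arange(arr.size), arr] = 1
```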
Now, I said before that typically we'd save these arrays to file and load them back in. Obviously, we don't strictly need to reload them here, but it's worth doing, because going forwards, if you want to retrain on this data, you can just load the arrays straight back in. Otherwise, you'd have to do all of the encoding all over again, and when you're working with bigger datasets, that can take a long time. So here we've saved each array to file, and we've just removed all of them from memory, so now we will not be able to access any of them. Going forwards, we are just going to load them back in.
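A sketch of that save, delete, and reload cycle; the file names are arbitrary choices:

```python
np.save('xids.npy', Xids)
np.save('xmask.npy', Xmask)
np.save('labels.npy', labels)

del Xids, Xmask, labels  # remove the in-memory copies

Xids = np.load('xids.npy')
Xmask = np.load('xmask.npy')
labels = np.load('labels.npy')
```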
Okay, so now we need to put all of our arrays into a TensorFlow dataset object, which is a lot faster in terms of performance when training. So first, we actually need to create our dataset object from the three arrays, and we can view one of its samples, like so: the first element comes from Xids, then Xmask, and then labels. When we feed a dataset into model.fit, TensorFlow expects our data to be input as a tuple, where the zero index of that tuple needs to be our input values and the one index of that tuple needs to be our labels. Now, in our case, it's also slightly different, because BERT takes two inputs, so the input values are a dictionary with a key input_ids, which maps to our Xids array, and attention_mask, which must map to our Xmask array. To restructure the dataset into that format, we need to build a mapping function, like so, and we also need to add labels onto the end there. Then we apply this mapping function to our dataset object with its map method. So now we can view a single row in this dataset, and we can see a slightly different format.
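A sketch of the dataset build and the mapping function described above:

```python
import tensorflow as tf

# one dataset sample per row of the three arrays
dataset = tf.data.Dataset.from_tensor_slices((Xids, Xmask, labels))

def map_func(input_ids, masks, labels):
    # restructure each sample into the (inputs, labels) tuple that fit expects,
    # with the inputs as a dict keyed by the input layer names
    return {'input_ids': input_ids, 'attention_mask': masks}, labels

dataset = dataset.map(map_func)
```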
Next, we're going to shuffle our samples and put them into batches of 32. For the shuffle buffer size, I typically take the size of my dataset as a starting point and increase the number if needed. Okay, so now we have our shuffled, batched dataset.
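Something like the following, where the 10,000 shuffle buffer is an assumed value:

```python
batch_size = 32

# shuffle, then batch; drop_remainder avoids a smaller final batch
dataset = dataset.shuffle(10000).batch(batch_size, drop_remainder=True)
```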
Now, because the dataset object is a generator, we can't just take the length of it directly, so we need another way to work out how many batches we have. If you're working with a very large dataset, counting the batches like this is probably not the right method to use, and you should instead take the size of the dataset from the source data and divide by the batch size. Then we get our training and validation sets with take and skip: take simply takes the specified number of batches for the training set, and skip steps past those same batches to give us the validation set. And then at the end here, we can delete the full dataset object, since we no longer need it.
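A sketch of that split, assuming a 90/10 ratio and using the array length to work out the batch count:

```python
split = 0.9
num_batches = int((Xids.shape[0] / batch_size) * split)

train_ds = dataset.take(num_batches)
val_ds = dataset.skip(num_batches)

del dataset  # we only need the two splits from here on
```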
Now we can go on to actually building our model architecture. To do that, we need to import TFAutoModel from Transformers and initialize it from the same pre-trained model name that you're using to initialize your tokenizer. And here, we can see that we have now imported BERT.
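Assuming the same bert-base-cased checkpoint as the tokenizer:

```python
from transformers import TFAutoModel

bert = TFAutoModel.from_pretrained('bert-base-cased')
```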
That gives us the core of the model, but we need to build a network around BERT as well. The first thing we need to do is define our input layers, and we need two, because we have the input IDs and the attention mask. Each input layer gets a name, and this needs to match up to the dictionary key that we defined in the mapping function earlier: so input_ids for the input IDs, and attention_mask for the attention mask, okay? Otherwise, TensorFlow does not know where these inputs are going. We do the same thing for the attention mask as we did for the input IDs. Then we feed both inputs into BERT, like so, and BERT will return two tensors to us. The first is a 3D tensor, which provides all the information from the last hidden state of the BERT model. The second tensor, which we are going to ignore, is the pooler output, and the pooler output is essentially the last hidden state pooled down into a 2D tensor, which can be used for classification if you want. Okay, so here you can experiment with adding LSTM layers, but we are just going to add a global max pooling layer, which will convert our 3D output tensor into a 2D tensor; alternatively, you could just use the pooler output tensor instead.
We also need to make sure the data types of our input layers are defined correctly. From the pooling layer, we go into our densely connected neural network layers, which handle the classification of our BERT embedding outputs. Then we want to add a dropout layer here; this just prevents any overfitting, or too much overfitting. Then we add another densely connected layer. And finally, we are creating our output layer, which is going to be a densely connected layer with a softmax activation. We use softmax here because we have our five one-hot encoded sentiment labels, and we just give this layer a name of outputs. Then we define the model by telling it what our input layers are and what our output layer is: to the inputs, we pass the input IDs and mask layers, and to the outputs, we pass the output layer we just created.
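Putting the architecture together might look like this; the layer sizes, activations, and dropout rate are assumptions for illustration:

```python
# two input layers, named to match the dataset dictionary keys
input_ids = tf.keras.layers.Input(shape=(50,), name='input_ids', dtype='int32')
mask = tf.keras.layers.Input(shape=(50,), name='attention_mask', dtype='int32')

# BERT returns (last hidden state, pooler output); keep the 3D first tensor
embeddings = bert(input_ids, attention_mask=mask)[0]

# collapse (batch, sequence, hidden) down to (batch, hidden)
x = tf.keras.layers.GlobalMaxPool1D()(embeddings)
x = tf.keras.layers.Dense(128, activation='relu')(x)
x = tf.keras.layers.Dropout(0.1)(x)
x = tf.keras.layers.Dense(32, activation='relu')(x)

# one output unit per sentiment class, softmax activation
y = tf.keras.layers.Dense(labels.shape[1], activation='softmax', name='outputs')(x)

model = tf.keras.Model(inputs=[input_ids, mask], outputs=y)
```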
In the model summary, we can see the number of parameters in our model, and almost all of them belong to BERT. Now, for this dataset, training all of those is definitely overkill, so we are going to freeze the BERT layer and only train the layers we added around it. We simply set trainable equal to false to do that. So here, we now have around 104,000 trainable parameters rather than 108 million trainable parameters.
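For example (the layer index here is an assumption based on the two input layers coming first in this functional model):

```python
# freeze BERT so only the pooling and dense head layers train
model.layers[2].trainable = False

model.summary()
```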
Now we can go ahead and put together our optimizer, loss, and accuracy metric, compile our model, and begin training. For our optimizer, we're just going to use Adam. For the loss, because we are using one-hot encoded labels, we use categorical cross-entropy, and we track categorical accuracy as our metric. Then we call fit with our training data and our validation set, and I'm going to train it for a total of maybe 140 epochs.
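A sketch of the compile and fit calls; the learning rate is an assumed value:

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
loss = tf.keras.losses.CategoricalCrossentropy()
acc = tf.keras.metrics.CategoricalAccuracy('accuracy')

model.compile(optimizer=optimizer, loss=loss, metrics=[acc])

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=140
)
```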
Now, depending on your GPU, if you're using a GPU, this can still take a while, and if you're using a CPU, it will take a very long time, so you may want to train for fewer epochs or reduce the data set size a little further if you want.
But then if we take a look at the accuracy at the end, it looks very good. I thought this could quite easily get to about 90% on this data set, and on the validation set we actually get a higher number by quite a bit: it went up to 94% here, almost 95%. Part of that may simply be that, within the validation set, there are more easy examples. But nonetheless, these are pretty good results. We added a few layers around BERT, but otherwise, it's a pretty straightforward, simple model, so it's pretty cool that you can get these results with relatively little work. And going forward, with other data sets with more data, and by improving the model at the same time, there is a lot that we can actually do with this.