Multi-Class Language Classification With BERT in TensorFlow
Chapters
0:00 Intro
1:21 Pulling Data
1:47 Preprocessing
14:33 Data Input Pipeline
24:14 Defining Model
33:29 Model Training
35:36 Saving and Loading Models
37:37 Making Predictions
00:00:00.000 |
Welcome to this video on multi-class classification. 00:00:06.200 |
So I've done a video very similar to this before, 00:00:12.560 |
but I missed a few extra things at the end, which were saving and loading models 00:00:15.920 |
and actually making predictions with those models. 00:00:18.560 |
So I've made this video to cover those points, as a lot of you asked about them. 00:00:28.080 |
So we're going to cover those, and we're also 00:00:31.640 |
going to cover all the other steps that lead up to that. 00:00:34.080 |
So if this is the first video on this that you've seen, 00:00:38.000 |
then I'm going to take you through everything. 00:00:39.920 |
So I'm going to take you through sourcing data 00:00:42.040 |
from Kaggle that we'll be using, pre-processing that data. 00:00:45.160 |
So that's tokenization and encoding the label data. 00:00:50.980 |
Then we're also going to be looking at setting up 00:00:53.020 |
a TensorFlow input pipeline, building and training 00:00:57.340 |
the model, and then we go on to the few extra things 00:01:02.260 |
I missed before-- so saving and loading the model, and making predictions with it. 00:01:11.260 |
And what I've done is left chapters in the timeline, 00:01:15.340 |
so you can skip ahead to whatever you need to see. 00:01:21.660 |
So here we're just going to download the data. 00:01:24.340 |
We're going to be using the sentiment analysis on the movie 00:01:27.140 |
reviews data set, which you can see over here on Kaggle. 00:01:30.260 |
You can download it from here if you just click on train.tsv.zip, 00:03:35.460 |
but I'm just going to do it through the Kaggle API, 00:01:47.980 |
so I'll put a link to that in the description. 00:01:54.980 |
And I'm just going to import pandas and view what we have. 00:02:26.180 |
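As a rough sketch of that step, assuming the extracted train.tsv file sits in the working directory, the pandas load looks something like this:

```python
import pandas as pd

# train.tsv (from train.tsv.zip on Kaggle) is tab-separated
df = pd.read_csv('train.tsv', sep='\t')
df.head()
```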
OK, we have the sentiment here, and we have the phrase. 00:02:33.860 |
We're going to be tokenizing this text to create two input 00:02:37.100 |
tensors, our input IDs and the attention mask. 00:02:41.940 |
Now, we're going to contain these two tensors within two NumPy arrays, which 00:02:47.660 |
will be of dimensions the length of the data frame by 512. 00:02:52.780 |
512 is the sequence length of our tokenized sequences. 00:03:04.940 |
We'll tokenize each phrase and assign each sample or each tokenized sample 00:03:09.460 |
to its own row in the respective NumPy array. 00:03:13.620 |
And we'll just first initialize those as empty zero arrays. 00:03:21.740 |
And then, like I said before, the sequence length is 512. 00:03:38.140 |
And with that, we can initialize those empty zero arrays. 00:03:43.420 |
So one will be xids, which will be our token IDs. 00:03:54.060 |
And then we pass the size of that array here. 00:03:58.980 |
So that's the number of samples, or the length of the data frame, by the sequence length. 00:04:13.940 |
And then let's just confirm we have the right shape as well. 00:04:23.660 |
So we have 156,000 samples and 512 tokens within each sample. 00:04:31.420 |
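A minimal sketch of those two zero arrays, using the sequence length and sample count described above:

```python
import numpy as np

seq_len = 512          # tokenized sequence length
num_samples = len(df)  # length of the data frame (156,060 rows)

# one row per sample, one column per token position
Xids = np.zeros((num_samples, seq_len))
Xmask = np.zeros((num_samples, seq_len))
Xids.shape  # (156060, 512)
```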
So now that we have initialized those two arrays, 00:04:34.140 |
we can begin populating them with the actual tokenized text. 00:04:38.980 |
So we're going to be using transformers for that. 00:04:43.660 |
And we are using BERT, so we'll import BERT tokenizer. 00:04:47.020 |
And then we just want to initialize the tokenizer. 00:04:55.060 |
And what we'll do here is just load it from pre-trained. 00:05:09.940 |
And then we'll just loop through every phrase within the phrase column. 00:05:19.700 |
So we've got the row number in i and the actual phrase 00:05:33.140 |
And then we want to pull out our tokens using the tokenizer. 00:06:06.300 |
We set a max length of 512 and truncate anything longer, because we can't pass lists
or tensors of different shapes into our model. 00:06:10.900 |
Likewise, if we have something that is too short, 00:06:18.580 |
it needs to be padded out. And for that, we need to use padding equal to max length, 00:06:23.220 |
which will just pad up to whatever number we pass as the max length. 00:06:32.300 |
So in BERT, we have a few special tokens: [CLS], which 00:06:40.620 |
means the start of a sequence, and [SEP], which means separator. 00:06:46.420 |
And that is either separating different sequences or marking the end of one. 00:06:53.780 |
And there is also [PAD], which is the padding token, which 00:06:58.900 |
we'll be using because we set padding equal to max length. 00:07:02.620 |
So if we have an input, which is maybe 412 tokens in length, 00:07:12.580 |
100 padding tokens will be added to it, which will push it up to 512. 00:07:25.100 |
So obviously, we do want to add those special tokens. 00:07:28.180 |
Otherwise, BERT will not understand what it is reading. 00:07:32.500 |
And then because we're using TensorFlow here, 00:07:35.500 |
we're going to return tensors equals tf for TensorFlow. 00:07:42.460 |
So we've pulled out our tokens into this dictionary here. 00:07:51.060 |
So it will have our input IDs and attention mask. 00:08:05.820 |
And then this is why we have enumerated through here, so we know which row to assign each sample to. 00:08:16.860 |
So for the xids row, we set that equal to the tokens' input IDs. 00:08:31.500 |
And for the xmask row, we are going to set it equal to the attention mask. 00:08:34.060 |
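Here is a sketch of that tokenization loop; the bert-base-cased checkpoint name and the 'Phrase' column name are assumptions, so swap in whichever BERT checkpoint and column you are actually using:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')  # checkpoint name assumed

for i, phrase in enumerate(df['Phrase']):
    tokens = tokenizer.encode_plus(
        phrase,
        max_length=seq_len,
        truncation=True,            # cut anything longer than 512 tokens
        padding='max_length',       # pad anything shorter up to 512
        add_special_tokens=True,    # [CLS], [SEP], [PAD]
        return_tensors='tf'
    )
    # drop the batch dimension and write each sample into its own row
    Xids[i, :] = tokens['input_ids'][0]
    Xmask[i, :] = tokens['attention_mask'][0]
```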
So let's just have a quick look at what we have in xids now. 00:08:50.460 |
And now if we rerun this, we can see, OK, now we have real token IDs in there. 00:08:56.300 |
So first, this 101 is the CLS token that I mentioned before. 00:09:04.500 |
And these zeros here, they are all padding tokens. 00:09:08.340 |
So obviously, at the end here, we almost always 00:09:13.980 |
have padding tokens, unless the text is long enough to come up to this point or it has been truncated. 00:09:18.140 |
But here, we can see the sort of structure that we would expect. 00:09:27.660 |
And if we look at our data here, we can say, OK, that's why. 00:09:42.900 |
And I think let's just have a quick look at the XMASK. 00:09:50.340 |
So XMASK is essentially like a control for the attention layer of BERT. 00:09:57.300 |
Wherever there's a one, BERT will calculate the attention 00:10:06.460 |
for that token, and wherever there's a zero, it won't. So this is to avoid BERT making any kind of connection with the padding tokens. 00:10:12.860 |
Because these padding tokens are, in reality, not there. 00:10:18.580 |
And we do that by passing every padding token 00:10:22.300 |
as a zero within the attention mask array or tensor. 00:10:32.220 |
So at the moment, we can see here we have the sentiment values, 00:10:41.860 |
which represent each of the sentiment classes. 00:10:46.980 |
So we have values of 0, which is very negative, 1, somewhat negative, 2, neutral, 3, somewhat positive, and 4, positive. 00:10:54.900 |
So we're going to be keeping all of those classes, and we need to one-hot encode them. 00:10:59.620 |
So to do that, we will first extract that data. 00:11:09.100 |
We want to pull out df['Sentiment'], which is the column name. 00:11:25.980 |
And now what we need to do is initialize, again, a zero array. 00:11:39.020 |
The first dimension is the number of samples, because, again, we have the same number of samples 00:11:41.340 |
in our labels here, so the length of the data frame. 00:11:46.220 |
And for the second dimension, I want to say the array max value plus 1. 00:11:51.500 |
Now, this works because in our array, we have 0, 1, 2, 3, 4, so the max value plus 1 is 5, 00:12:10.420 |
which is the total number of columns that we need. 00:12:18.900 |
And we need one column for each different class. 00:12:28.700 |
And here, we see we have the length of the data frame by 5. 00:12:35.500 |
Now, let's have a quick look at what we're doing here. 00:12:50.060 |
We're going to do some fancy indexing to select each value that 00:13:04.580 |
needs setting. So for each row, we will be selecting the column given by its label, making it equal to 1. 00:13:08.020 |
So for the first sample, that column will be set to 1 because we have a 1 here. 00:13:21.220 |
This next one is number 2, so we select column 2 for that row as well. 00:13:25.540 |
And then for the 3 down here, we'd have a 1 in column 3. 00:13:30.300 |
So to do that, we need to specify both the row and the column. 00:13:35.660 |
So for the rows, we're just going to be using a range of row indices, 00:13:40.580 |
which covers from 0 all the way down to 156,060. 00:13:57.900 |
And then here, we need to select which column 00:14:01.660 |
we want to set each value to or select each value for. 00:14:06.140 |
And that's easy because we already have it here. 00:14:34.020 |
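A sketch of that one-hot encoding step, assuming the label column is named 'Sentiment':

```python
# integer labels 0-4, one per sample
arr = df['Sentiment'].values

# one column per class: max label value plus one = 5 columns
labels = np.zeros((num_samples, arr.max() + 1))

# fancy indexing: for every row, set the column given by its label to 1
labels[np.arange(num_samples), arr] = 1
```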
And now what we want to do is take our data here 00:14:38.820 |
and put it into a format that TensorFlow will be able to read. 00:14:42.380 |
So to do that, we want to import TensorFlow first. 00:14:47.140 |
And what we're going to do is use the Dataset object provided 00:14:55.580 |
by TensorFlow, which just allows us to transform our data and feed it to our model more easily. 00:15:09.700 |
And we're creating this data set from tensor slices. 00:15:17.820 |
And in here, we're going to pass a tuple of XIDs, XMASK, and the labels. 00:15:25.460 |
And to actually view what we have inside that data set, we can use the take method, 00:15:32.140 |
because we can't just print it out and view everything. 00:15:45.620 |
We see here, OK, we have this TakeDataset showing the shape of each tensor. 00:16:19.420 |
So we have a 512-length tensor for the XIDs, and then we have the same for the XMASK, which is also 512. 00:16:39.220 |
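A minimal sketch of creating and peeking at that data set:

```python
import tensorflow as tf

# each element of the dataset is one (input_ids, mask, label) triple
dataset = tf.data.Dataset.from_tensor_slices((Xids, Xmask, labels))

# we can't print the whole dataset, so take a single element to inspect it
for sample in dataset.take(1):
    print(sample)
```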
But what we need now is to merge both of our input tensors into a single dictionary. 00:16:47.420 |
So the reason that we do that is that when TensorFlow reads data during training, 00:16:56.140 |
it expects the input at index 0 and the output or target labels at index 1 of each sample tuple. 00:17:11.500 |
Because we have two inputs, we merge both of these into a dictionary, which will sit at index 0. 00:17:25.980 |
So we'll write a map function, and the data set's map method will apply whatever is inside this function to our data set. 00:17:29.340 |
And it will reformat everything in our data set into that new structure. 00:17:35.500 |
So the function takes our input IDs, mask, and labels as arguments. 00:17:42.980 |
And all we want to do is return the input IDs and mask together in a dictionary. 00:17:52.300 |
And we're also going to give them these key names 00:17:57.580 |
so that we can map the correct tensor to the correct input layer later on. 00:18:06.220 |
So the input IDs go to the input_ids key, and our attention mask goes to the attention_mask key. 00:18:12.780 |
And then that is the first part of our tuple. 00:18:19.900 |
And that's all we need to do for creating that map function. 00:18:29.820 |
And then to apply it, all we need to do is call dataset.map with that map function. 00:18:37.100 |
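A sketch of that map function and how it is applied:

```python
def map_func(input_ids, masks, labels):
    # inputs go into a dictionary (index 0), labels stay as the target (index 1)
    return {'input_ids': input_ids, 'attention_mask': masks}, labels

dataset = dataset.map(map_func)
```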
So now let's have a quick look at what we have: 00:19:02.100 |
input_ids, which maps to our input IDs tensor, and attention_mask, which maps to our attention mask tensor, followed by the labels. 00:19:08.300 |
And now what we want to do is actually shuffle and batch this data. 00:19:24.620 |
We're using a batch size of 16, and I would say this is at the upper end of the size 00:19:50.540 |
we can usually fit. But if you notice that your data is not being shuffled properly, you can increase the shuffle buffer. 00:20:05.140 |
So we first shuffle the data, and then we batch it. 00:20:08.660 |
Otherwise, we would get batches, and we would end up 00:20:11.340 |
shuffling the batches, which is not really what we want to do. 00:20:15.660 |
We just want to actually shuffle the data within those batches. 00:20:35.060 |
So if we had, say, 33 samples, we would get two batches out of that, 16 and 16, and the leftover sample would be dropped. 00:20:47.940 |
And then let's just have a look at what we have now. 00:20:58.660 |
It's the same structure, where we have the labels here and the inputs here. 00:21:01.980 |
But our actual shapes, our tensor shapes, have changed, because everything is now grouped into batches of 16. 00:21:12.060 |
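A sketch of the shuffle-then-batch step; the 10,000 shuffle buffer and the drop_remainder flag are assumptions you can tune:

```python
batch_size = 16

# shuffle individual samples first, then group them into batches of 16;
# increase the buffer if the data does not look well shuffled
dataset = dataset.shuffle(10000).batch(batch_size, drop_remainder=True)
```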
We've got our full data set, our train data set here. 00:21:14.820 |
And what we might want to do is split that into a training 00:21:21.620 |
set and a validation set. And we can do that by setting the split here. 00:21:27.900 |
So we're going to use 90% training data, 10% validation data. 00:21:33.580 |
To work that out, we need the size of that split, or the size of the training 00:21:46.460 |
set. Or actually, we can just set this equal to the value 00:21:51.420 |
that we defined up here, the number of samples. 00:22:10.700 |
We've already defined that, so let's just go with that. 00:22:13.100 |
And we're going to divide that by the batch size. 00:22:18.380 |
And this gives us the number of batches within our data set. 00:22:54.300 |
Now, when we tell TensorFlow that we want this number of samples from the data set, 00:22:58.660 |
we can't give it a float, because we can't have 0.375 of a batch, so we cast it to an integer. 00:23:25.340 |
So we're going to have the train data set, which we'll create first. 00:23:29.740 |
And as we did up here, where we used this take method 00:23:32.780 |
to take a certain number of samples, we do the same. 00:23:36.620 |
But now we're going to take a lot more than just one: the number of batches 00:23:40.660 |
that we calculated here, which is 8,700 or so. 00:23:56.580 |
Then, for the validation set, we're going to skip the first 8,700 or so 00:24:00.500 |
and we're just going to take the final ones that are left 00:24:06.460 |
for validation. And then we're not going to be using the full data set anymore, so we can delete that variable. 00:24:15.100 |
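Roughly, that split looks like this:

```python
split = 0.9  # 90% train, 10% validation

# number of batches that go to training (roughly 8,700 here)
size = int((num_samples / batch_size) * split)

train_ds = dataset.take(size)  # first 90% of batches
val_ds = dataset.skip(size)    # remaining 10%

del dataset  # the unsplit dataset is no longer needed
```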
OK, so now we're on to actually building and training the model. 00:24:20.820 |
So we're going to be using Transformers again, 00:24:27.380 |
and we're going to be importing the TFAutoModel class. 00:24:35.860 |
And we'll set BERT equal to TFAutoModel from pre-trained. 00:24:56.060 |
We're using the TensorFlow version here; if we got rid of this TF prefix, we'd get the PyTorch version instead. 00:25:03.540 |
Just like we would have any other TensorFlow model, 00:25:05.700 |
we can use the summary method to print out what we have. 00:25:19.540 |
We only see a single BERT layer here; obviously, what's within that BERT layer is a lot more complex than just one 00:25:22.860 |
layer, but it's all embedded within that single layer. 00:25:34.500 |
But that's kind of like the core or the engine of our model. 00:25:43.900 |
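A sketch of loading that core BERT model; the checkpoint name is an assumption:

```python
from transformers import TFAutoModel

bert = TFAutoModel.from_pretrained('bert-base-cased')  # checkpoint name assumed
bert.summary()  # shows a single 'bert' layer wrapping the whole transformer
```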
So the first thing we need to do is we have our two input layers. 00:25:50.180 |
We have that one for input IDs and one for the attention mask. 00:26:01.740 |
We already imported TensorFlow up above for the data set, so we don't need to do that again. 00:26:04.500 |
And what we do is define input IDs, and we say tf.keras.layers.Input, passing the shape. 00:26:20.380 |
So that's the sequence length, and then we just add a comma here to make it a tuple. 00:26:23.940 |
So that's the same shape as we were seeing up here. 00:26:36.460 |
We also give the layer a name. We do this because, as we have seen up here, 00:26:42.740 |
our inputs come in as a dictionary, and we need to know which input each of these input 00:26:48.940 |
layers should read from. So we map input IDs to this name here, input_ids. 00:26:58.460 |
And we set the data type equal to integer 32. 00:27:03.220 |
And we do that because these are token IDs here, which are integers. 00:27:11.780 |
And we do the same for the mask with tf.keras.layers.Input. 00:27:25.180 |
We have the name, which is where we use our attention_mask key. 00:27:32.380 |
And again, it's just the same dtype, which is int32. 00:27:49.860 |
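A sketch of those two input layers, named to match the dictionary keys from the map function:

```python
# two input layers whose names line up with the keys produced by map_func
input_ids = tf.keras.layers.Input(shape=(512,), name='input_ids', dtype='int32')
mask = tf.keras.layers.Input(shape=(512,), name='attention_mask', dtype='int32')
```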
So next, we're creating our embeddings from BERT. 00:27:56.980 |
And what we need to do is access the transformer itself inside the model object, 00:28:08.180 |
so we write bert.bert, which accesses that transformer. 00:28:12.260 |
And in there, we want to pass our input IDs and our attention 00:28:18.420 |
mask, which is going to be the mask, so these two input layers. 00:28:25.020 |
We can pull out the raw activations or the raw 00:28:30.180 |
activation tensors from the transformer model here. 00:28:36.700 |
And as I said, we could just take out that raw activation tensor, which is 3D. 00:28:44.100 |
Or, what they also include here is a pooled version of those activations. 00:28:50.740 |
So these are those 3D tensors pooled into 2D tensors. 00:28:57.260 |
Now, what we're going to be using is dense layers here. 00:29:02.780 |
And therefore, we want to be using this pooled layer. 00:29:13.220 |
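A sketch of pulling out that pooled output:

```python
# index 1 is the pooled (2D) output; index 0 would be the raw 3D activations
embeddings = bert.bert(input_ids, attention_mask=mask)[1]
```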
Now, what we want to do here for the final part of this model 00:29:27.620 |
is add two more layers, and these are both going to be densely connected layers. 00:29:31.100 |
And for the first one, I'm going to use 1,024 units or neurons. 00:29:46.100 |
And then our final layer, which is our labels, 00:29:49.900 |
that is going to be the same thing again, so dense. 00:29:55.780 |
But this time, we just want the number of labels here, which is 5 units. 00:30:13.980 |
And what we want to do here is calculate a probability across those five output classes. 00:30:20.340 |
So to do that, we want to use a softmax activation. 00:30:28.740 |
And we're also going to call this the outputs layer, because it is our outputs. 00:30:38.540 |
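A sketch of that classification head; the ReLU activation on the first dense layer is an assumption, since only the unit counts and the softmax are stated here:

```python
# 1,024-unit hidden layer, then a 5-way softmax over the sentiment classes
x = tf.keras.layers.Dense(1024, activation='relu')(embeddings)  # activation assumed
y = tf.keras.layers.Dense(5, activation='softmax', name='outputs')(x)
```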
So those are all of our layers, but we haven't actually 00:30:43.500 |
put them together into a model yet. They're all kind of just floating there by themselves. 00:30:47.500 |
Obviously, they do have these connections between them, 00:30:49.740 |
but they're not initialized into a model object yet. 00:31:00.300 |
So we initialize the model object, and what we need to do here is set the input layers and the output layer. 00:31:25.020 |
Yeah, so we're just setting up the boundaries of our model. 00:31:30.500 |
We have the inputs, and they lead to our outputs. 00:31:33.660 |
Everything else in between is already handled, so we're good to go. 00:32:18.260 |
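Tying it together might look like this:

```python
# declare the model boundaries: two inputs in, softmax probabilities out
model = tf.keras.Model(inputs=[input_ids, mask], outputs=y)
model.summary()
```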
In the summary, we have our input IDs and mask, and we have the shape here, 00:32:33.220 |
then the BERT layer, followed by the densely connected neural net with 1,024 units. 00:32:38.420 |
And we have our outputs, which is the softmax. 00:32:51.540 |
If you don't want to train the BERT layer as well, you can also write this line to freeze it. 00:32:58.580 |
And we select layer number 2, because we have layers 0, 1, and 2, and BERT is the third. 00:33:10.700 |
That would freeze the parameters within this BERT layer and just train the other two here. 00:33:15.460 |
But I will be keeping those parameters trainable so they can be trained as well. 00:33:23.460 |
It will probably give you a small performance increase, although training will take longer. 00:33:30.460 |
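For reference, that optional freeze looks like this (left commented out here, since the BERT layer stays trainable in this walkthrough):

```python
# layers 0 and 1 are the two input layers, layer 2 is BERT;
# uncomment to freeze BERT and train only the two dense layers
# model.layers[2].trainable = False
```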
Now, I want to set up the model training parameters. 00:33:43.300 |
We're using a pretty small learning rate of 1e-5, 00:33:48.020 |
and that is because we've got our pre-trained BERT model in here. 00:34:00.460 |
And what we also want to add is a loss function. 00:34:10.860 |
And because we're using categorical outputs here, 00:34:21.660 |
we use categorical cross-entropy. And then we're going to set our accuracy as well. 00:34:28.060 |
And we're using categorical accuracy for the same reason. 00:34:32.220 |
Then, when we compile the model, we're just going to need to pass that accuracy in there as well. 00:34:57.060 |
And metrics is going to be equal to a list containing that accuracy metric. 00:35:02.820 |
OK, so that's our model training parameters all set up. 00:35:07.140 |
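A sketch of those training parameters; the Adam optimizer is an assumption, as only the 1e-5 learning rate, categorical cross-entropy loss, and categorical accuracy metric are stated here:

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)  # optimizer choice assumed
loss = tf.keras.losses.CategoricalCrossentropy()
acc = tf.keras.metrics.CategoricalAccuracy('accuracy')

model.compile(optimizer=optimizer, loss=loss, metrics=[acc])
```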
So the final thing to do now is train our model. 00:35:24.380 |
The validation data we'll be using is our validation data set from before. 00:35:40.860 |
After training, I'm also going to save that model to a folder called sentiment_model. 00:35:50.660 |
That will create the directory and store all the files that we need for that model in there. 00:36:19.140 |
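A sketch of the training and saving step; the number of epochs is an assumption:

```python
history = model.fit(train_ds, validation_data=val_ds, epochs=3)  # epoch count assumed

# writes the architecture and weights into the ./sentiment_model directory
model.save('sentiment_model')
```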
So within that sentiment_model directory that we just created when we saved our model, 00:36:22.260 |
we have everything that we need to load our model as well. 00:36:27.900 |
So I'm just going to show you how we can do that. 00:36:31.660 |
So what we can do is start a new file here, a new notebook. 00:36:48.500 |
And what we'll do here is we need to first import TensorFlow. 00:37:04.980 |
And then we're loading the model from the sentiment_model directory. 00:37:09.620 |
And then let's just check that what we have here is what we built before. 00:37:17.700 |
So we can see now we have our input layers, BERT, 00:37:23.020 |
and then we have our pre-classifier and classifier layers at the end. 00:37:31.660 |
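A minimal sketch of that loading step:

```python
import tensorflow as tf

model = tf.keras.models.load_model('sentiment_model')
model.summary()  # should show the same layers we built earlier
```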
And now we can go forward and start making predictions. 00:37:40.420 |
But before that, we still need to convert our data or input strings into token tensors. 00:37:46.860 |
To do that, I'm going to create a function to handle it for us. 00:37:52.540 |
First, we are going to need to import the tokenizer 00:37:58.300 |
We're using BERT tokenizer just like we did before. 00:38:05.180 |
And we're going to initialize the tokenizer, the BERT tokenizer, from pre-trained again. 00:38:15.580 |
OK, so exactly the same as what we used before. 00:38:24.380 |
And all we need to do is define our prep data function. 00:38:29.940 |
And here we would expect a text, which is a string. 00:38:38.380 |
And this is just the same as what we were doing before. 00:38:46.220 |
We set a max length, which is going to be 512, as always. 00:38:50.740 |
But we are going to truncate anything longer than that. 00:38:55.140 |
And we're going to pad anything shorter than that. 00:39:13.060 |
And then there's one other thing that we don't need, the token type IDs. 00:39:13.060 |
And this is just another tensor that the tokenizer returns by default, which we'll switch off. 00:39:24.060 |
And we're going to return the TensorFlow tensors. 00:39:42.380 |
And now we can just return our tensors in the correct format. 00:39:47.740 |
So the format that we need is like we used before: a dictionary with the input IDs and attention mask keys. 00:39:54.260 |
But if you remember before, within the data set, 00:39:59.020 |
we were working with TensorFlow Float64 tensors. 00:40:04.460 |
So we also need to convert these, which will be integers, into floats. 00:40:12.700 |
So to do that, we do tf.cast on the tokens' input IDs. 00:40:21.140 |
And we say we want to cast that to tf.float64. 00:40:30.340 |
And we'll repeat the same thing, but for our attention mask. 00:40:35.060 |
So attention mask, we'll just copy that across. 00:40:48.460 |
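Putting that together, the prep function might look like this sketch (checkpoint name assumed):

```python
from transformers import BertTokenizer
import tensorflow as tf

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')  # checkpoint name assumed

def prep_data(text):
    tokens = tokenizer.encode_plus(
        text,
        max_length=512,
        truncation=True,
        padding='max_length',
        add_special_tokens=True,
        return_token_type_ids=False,  # the extra tensor we don't need
        return_tensors='tf'
    )
    # cast the integer tensors to float64 to match the training data
    return {
        'input_ids': tf.cast(tokens['input_ids'], tf.float64),
        'attention_mask': tf.cast(tokens['attention_mask'], tf.float64)
    }
```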
OK, and now what we can do is just call prep data on an example string. 00:41:13.740 |
And we just need to add the S onto IDs there. 00:41:20.100 |
And you can see here we have our CLS token, the "hello world" tokens, the 00:41:25.060 |
separator token, followed by lots of padding. 00:41:42.020 |
And what I want to do now is get the probability predictions by passing that prepared data to the model's predict method. 00:42:05.060 |
And we also just need to access the zero index of that output. 00:42:12.540 |
And what we can do to get the actual class from that is use argmax, 00:42:18.660 |
because we just want to take the position of the maximum value out of those probabilities. 00:42:23.420 |
And to do that, we just do np.argmax on probs[0]. 00:42:58.500 |
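A short sketch of that prediction step, using a hypothetical "hello world" input:

```python
import numpy as np

probs = model.predict(prep_data('hello world'))  # probabilities over the 5 classes
pred = np.argmax(probs[0])                       # index of the highest probability
```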
But that is really everything you need from start to finish. 00:43:02.140 |
We've preprocessed the data, set up our input data set, and built and trained the model. 00:43:09.140 |
We've saved it, loaded it, and made predictions.