Multi-Class Language Classification With BERT in TensorFlow
Chapters
0:00 Intro
1:21 Pulling Data
1:47 Preprocessing
14:33 Data Input Pipeline
24:14 Defining Model
33:29 Model Training
35:36 Saving and Loading Models
37:37 Making Predictions
00:00:00.000 |
Welcome to this video on multi-class classification. 00:00:06.200 |
So I've done a video very similar to this before, 00:00:12.560 |
but I missed a few extra things at the end, which were saving and loading models 00:00:15.920 |
and actually making predictions with those models. 00:00:18.560 |
So I've made this video to cover those points, as a lot of you asked about them. 00:00:28.080 |
So we're going to cover those, and we're also 00:00:31.640 |
going to cover all the other steps that lead up to that. 00:00:34.080 |
So if this is the first video on this that you've seen, 00:00:38.000 |
then I'm going to take you through everything. 00:00:39.920 |
So I'm going to take you through sourcing data 00:00:42.040 |
from Kaggle that we'll be using, pre-processing that data. 00:00:45.160 |
So that's tokenization and encoding the label data. 00:00:50.980 |
Then we're also going to be looking at setting up 00:00:53.020 |
a TensorFlow input pipeline, building and training 00:00:57.340 |
the model, and then we go on to the few extra things 00:01:02.260 |
I missed before-- so saving and loading the model, and making predictions with it. 00:01:11.260 |
And what I've done is left chapters in the timeline, 00:01:15.340 |
so you can skip ahead to whatever you need to see. 00:01:21.660 |
So here we're just going to download the data. 00:01:24.340 |
We're going to be using the sentiment analysis on the movie 00:01:27.140 |
reviews data set, which you can see over here on Kaggle. 00:01:30.260 |
You can download it from here if you just click on train.tsv.zip, 00:03:35.460 |
but I'm just going to do it through the Kaggle API, 00:01:47.980 |
so I'll put a link to that in the description. 00:01:54.980 |
And I'm just going to import pandas and view what we have. 00:02:26.180 |
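As a rough sketch of that step, assuming the extracted train.tsv file sits in the working directory, the pandas load looks something like this:

```python
import pandas as pd

# train.tsv (from train.tsv.zip on Kaggle) is tab-separated
df = pd.read_csv('train.tsv', sep='\t')
df.head()
```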
OK, we have the sentiment here, and we have the phrase. 00:02:33.860 |
We're going to be tokenizing this text to create two input 00:02:37.100 |
tensors, our input IDs and the attention mask. 00:02:41.940 |
Now, we're going to contain these two tensors within two NumPy arrays, which 00:02:47.660 |
will be of dimensions the length of the data frame by 512. 00:02:52.780 |
512 is the sequence length of our tokenized sequences. 00:03:04.940 |
We'll tokenize each phrase and assign each sample or each tokenized sample 00:03:09.460 |
to its own row in the respective NumPy array. 00:03:13.620 |
And we'll just first initialize those as empty zero arrays. 00:03:21.740 |
And then, like I said before, the sequence length is 512. 00:03:38.140 |
And with that, we can initialize those empty zero arrays. 00:03:43.420 |
So one will be xids, which will be our token IDs. 00:03:54.060 |
And then we pass the size of that array here. 00:03:58.980 |
So that's the number of samples, or the length of the data frame, by the sequence length. 00:04:13.940 |
And then let's just confirm we have the right shape as well. 00:04:23.660 |
So we have 156,000 samples and 512 tokens within each sample. 00:04:31.420 |
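A minimal sketch of those two zero arrays, using the sequence length and sample count described above:

```python
import numpy as np

seq_len = 512          # tokenized sequence length
num_samples = len(df)  # length of the data frame (156,060 rows)

# one row per sample, one column per token position
Xids = np.zeros((num_samples, seq_len))
Xmask = np.zeros((num_samples, seq_len))
Xids.shape  # (156060, 512)
```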
So now that we have initialized those two arrays, 00:04:34.140 |
we can begin populating them with the actual tokenized text. 00:04:38.980 |
So we're going to be using transformers for that. 00:04:43.660 |
And we are using BERT, so we'll import BERT tokenizer. 00:04:47.020 |
And then we just want to initialize the tokenizer. 00:04:55.060 |
And what we'll do here is just load it from pre-trained. 00:05:09.940 |
And then we'll just loop through every phrase within the phrase column. 00:05:19.700 |
So we've got the row number in i and the actual phrase 00:05:33.140 |
And then we want to pull out our tokens using the tokenizer. 00:06:06.300 |
We set a max length of 512 and truncate anything longer, because we can't pass lists
or tensors of different shapes into our model. 00:06:10.900 |
Likewise, if we have something that is too short, 00:06:18.580 |
it needs to be padded out. And for that, we need to use padding equal to max length, 00:06:23.220 |
which will just pad up to whatever number we pass as the max length. 00:06:32.300 |
So in BERT, we have a few special tokens: [CLS], which 00:06:40.620 |
means the start of a sequence, and [SEP], which means separator. 00:06:46.420 |
And that is either separating different sequences or marking the end of one. 00:06:53.780 |
And there is also [PAD], which is the padding token, which 00:06:58.900 |
we'll be using because we set padding equal to max length. 00:07:02.620 |
So if we have an input, which is maybe 412 tokens in length, 00:07:12.580 |
100 padding tokens will be added to it, which will push it up to 512. 00:07:25.100 |
So obviously, we do want to add those special tokens. 00:07:28.180 |
Otherwise, BERT will not understand what it is reading. 00:07:32.500 |
And then because we're using TensorFlow here, 00:07:35.500 |
we're going to return tensors equals tf for TensorFlow. 00:07:42.460 |
So we've pulled out our tokens into this dictionary here. 00:07:51.060 |
So it will have our input IDs and attention mask. 00:08:05.820 |
And then this is why we have enumerated through here, so we know which row to assign each sample to. 00:08:16.860 |
So for the xids row, we set that equal to the tokens' input IDs. 00:08:31.500 |
And for the xmask row, we are going to set it equal to the attention mask. 00:08:34.060 |
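Here is a sketch of that tokenization loop; the bert-base-cased checkpoint name and the 'Phrase' column name are assumptions, so swap in whichever BERT checkpoint and column you are actually using:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')  # checkpoint name assumed

for i, phrase in enumerate(df['Phrase']):
    tokens = tokenizer.encode_plus(
        phrase,
        max_length=seq_len,
        truncation=True,            # cut anything longer than 512 tokens
        padding='max_length',       # pad anything shorter up to 512
        add_special_tokens=True,    # [CLS], [SEP], [PAD]
        return_tensors='tf'
    )
    # drop the batch dimension and write each sample into its own row
    Xids[i, :] = tokens['input_ids'][0]
    Xmask[i, :] = tokens['attention_mask'][0]
```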
So let's just have a quick look at what we have in xids now. 00:08:50.460 |
And now if we rerun this, we can see, OK, now we have real token IDs in there. 00:08:56.300 |
So first, this 101 is the CLS token that I mentioned before. 00:09:04.500 |
And these zeros here, they are all padding tokens. 00:09:08.340 |
So obviously, at the end here, we almost always 00:09:13.980 |
have padding tokens, unless the text is long enough to come up to this point or it has been truncated. 00:09:18.140 |
But here, we can see the sort of structure that we would expect. 00:09:27.660 |
And if we look at our data here, we can say, OK, that's why. 00:09:42.900 |
And I think let's just have a quick look at the XMASK. 00:09:50.340 |
So XMASK is essentially like a control for the attention layer of BERT. 00:09:57.300 |
Wherever there's a one, BERT will calculate the attention 00:10:06.460 |
for that token, and wherever there's a zero, it won't. So this is to avoid BERT making any kind of connection with the padding tokens. 00:10:12.860 |
Because these padding tokens are, in reality, not there. 00:10:18.580 |
And we do that by passing every padding token 00:10:22.300 |
as a zero within the attention mask array or tensor. 00:10:32.220 |
So at the moment, we can see here we have the sentiment values, 00:10:41.860 |
which represent each of the sentiment classes. 00:10:46.980 |
So we have values of 0, which is very negative, 1, somewhat negative, 2, neutral, 3, somewhat positive, and 4, positive. 00:10:54.900 |
So we're going to be keeping all of those classes, and we need to one-hot encode them. 00:10:59.620 |
So to do that, we will first extract that data. 00:11:09.100 |
We want to pull out df['Sentiment'], which is the column name. 00:11:25.980 |
And now what we need to do is initialize, again, a zero array. 00:11:39.020 |
The first dimension is the number of samples, because, again, we have the same number of samples 00:11:41.340 |
in our labels here, so the length of the data frame. 00:11:46.220 |
And for the second dimension, I want to say the array max value plus 1. 00:11:51.500 |
Now, this works because in our array, we have 0, 1, 2, 3, 4, so the max value plus 1 is 5, 00:12:10.420 |
which is the total number of columns that we need. 00:12:18.900 |
And we need one column for each different class. 00:12:28.700 |
And here, we see we have the length of the data frame by 5. 00:12:35.500 |
Now, let's have a quick look at what we're doing here. 00:12:50.060 |
We're going to do some fancy indexing to select each value that 00:13:04.580 |
needs setting. So for each row, we will be selecting the column given by its label, making it equal to 1. 00:13:08.020 |
So for the first sample, that column will be set to 1 because we have a 1 here. 00:13:21.220 |
This next one is number 2, so we select column 2 for that row as well. 00:13:25.540 |
And then for the 3 down here, we'd have a 1 in column 3. 00:13:30.300 |
So to do that, we need to specify both the row and the column. 00:13:35.660 |
So for the rows, we're just going to be using a range of row indices, 00:13:40.580 |
which covers from 0 all the way down to 156,060. 00:13:57.900 |
And then here, we need to select which column 00:14:01.660 |
we want to set each value to or select each value for. 00:14:06.140 |
And that's easy because we already have it here. 00:14:34.020 |
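A sketch of that one-hot encoding step, assuming the label column is named 'Sentiment':

```python
# integer labels 0-4, one per sample
arr = df['Sentiment'].values

# one column per class: max label value plus one = 5 columns
labels = np.zeros((num_samples, arr.max() + 1))

# fancy indexing: for every row, set the column given by its label to 1
labels[np.arange(num_samples), arr] = 1
```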
And now what we want to do is take our data here 00:14:38.820 |
and put it into a format that TensorFlow will be able to read. 00:14:42.380 |
So to do that, we want to import TensorFlow first. 00:14:47.140 |
And what we're going to do is use the Dataset object provided 00:14:55.580 |
by TensorFlow, which just allows us to transform our data and feed it to our model more easily. 00:15:09.700 |
And we're creating this data set from tensor slices. 00:15:17.820 |
And in here, we're going to pass a tuple of XIDs, XMASK, and the labels. 00:15:25.460 |
And to actually view what we have inside that data set, we can use the take method, 00:15:32.140 |
because we can't just print it out and view everything. 00:15:45.620 |
We see here, OK, we have this TakeDataset showing the shape of each tensor. 00:16:19.420 |
So we have a 512-length tensor for the XIDs, and then we have the same for the XMASK, which is also 512. 00:16:39.220 |
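A minimal sketch of creating and peeking at that data set:

```python
import tensorflow as tf

# each element of the dataset is one (input_ids, mask, label) triple
dataset = tf.data.Dataset.from_tensor_slices((Xids, Xmask, labels))

# we can't print the whole dataset, so take a single element to inspect it
for sample in dataset.take(1):
    print(sample)
```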
But what we need now is to merge both of our input tensors into a single dictionary. 00:16:47.420 |
So the reason that we do that is that when TensorFlow reads data during training, 00:16:56.140 |
it expects the input at index 0 and the output or target labels at index 1 of each sample tuple. 00:17:11.500 |
Because we have two inputs, we merge both of these into a dictionary, which will sit at index 0. 00:17:25.980 |
So we'll write a map function, and the data set's map method will apply whatever is inside this function to our data set. 00:17:29.340 |
And it will reformat everything in our data set into that new structure. 00:17:35.500 |
So the function takes our input IDs, mask, and labels as arguments. 00:17:42.980 |
And all we want to do is return the input IDs and mask together in a dictionary. 00:17:52.300 |
And we're also going to give them these key names 00:17:57.580 |
so that we can map the correct tensor to the correct input layer later on. 00:18:06.220 |
So the input IDs go to the input_ids key, and our attention mask goes to the attention_mask key. 00:18:12.780 |
And then that is the first part of our tuple. 00:18:19.900 |
And that's all we need to do for creating that map function. 00:18:29.820 |
And then to apply it, all we need to do is call dataset.map with that map function. 00:18:37.100 |
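A sketch of that map function and how it is applied:

```python
def map_func(input_ids, masks, labels):
    # inputs go into a dictionary (index 0), labels stay as the target (index 1)
    return {'input_ids': input_ids, 'attention_mask': masks}, labels

dataset = dataset.map(map_func)
```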
So now let's have a quick look at what we have: 00:19:02.100 |
input_ids, which maps to our input IDs tensor, and attention_mask, which maps to our attention mask tensor, followed by the labels. 00:19:08.300 |
And now what we want to do is actually shuffle and batch this data. 00:19:24.620 |
We're using a batch size of 16, and I would say this is at the upper end of the size 00:19:50.540 |
we can usually fit. But if you notice that your data is not being shuffled properly, you can increase the shuffle buffer. 00:20:05.140 |
So we first shuffle the data, and then we batch it. 00:20:08.660 |
Otherwise, we would get batches, and we would end up 00:20:11.340 |
shuffling the batches, which is not really what we want to do. 00:20:15.660 |
We just want to actually shuffle the data within those batches. 00:20:35.060 |
So if we had, say, 33 samples, we would get two batches out of that, 16 and 16, and the leftover sample would be dropped. 00:20:47.940 |
And then let's just have a look at what we have now. 00:20:58.660 |
It's the same structure, where we have the labels here and the inputs here. 00:21:01.980 |
But our actual shapes, our tensor shapes, have changed, because everything is now grouped into batches of 16. 00:21:12.060 |
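A sketch of the shuffle-then-batch step; the 10,000 shuffle buffer and the drop_remainder flag are assumptions you can tune:

```python
batch_size = 16

# shuffle individual samples first, then group them into batches of 16;
# increase the buffer if the data does not look well shuffled
dataset = dataset.shuffle(10000).batch(batch_size, drop_remainder=True)
```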
We've got our full data set, our train data set here. 00:21:14.820 |
And what we might want to do is split that into a training 00:21:21.620 |
set and a validation set. And we can do that by setting the split here. 00:21:27.900 |
So we're going to use 90% training data, 10% validation data. 00:21:33.580 |
To work that out, we need the size of that split, or the size of the training 00:21:46.460 |
set. Or actually, we can just set this equal to the value 00:21:51.420 |
that we defined up here, the number of samples. 00:22:10.700 |
We've already defined that, so let's just go with that. 00:22:13.100 |
And we're going to divide that by the batch size. 00:22:18.380 |
And this gives us the number of batches within our data set. 00:22:54.300 |
Now, when we tell TensorFlow that we want this number of samples from the data set, 00:22:58.660 |
we can't give it a float, because we can't have 0.375 of a batch, so we cast it to an integer. 00:23:25.340 |
So we're going to have the train data set, which we'll create first. 00:23:29.740 |
And as we did up here, where we used this take method 00:23:32.780 |
to take a certain number of samples, we do the same. 00:23:36.620 |
But now we're going to take a lot more than just one: the number of batches 00:23:40.660 |
that we calculated here, which is 8,700 or so. 00:23:56.580 |
Then, for the validation set, we're going to skip the first 8,700 or so 00:24:00.500 |
and we're just going to take the final ones that are left 00:24:06.460 |
for validation. And then we're not going to be using the full data set anymore, so we can delete that variable. 00:24:15.100 |
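Roughly, that split looks like this:

```python
split = 0.9  # 90% train, 10% validation

# number of batches that go to training (roughly 8,700 here)
size = int((num_samples / batch_size) * split)

train_ds = dataset.take(size)  # first 90% of batches
val_ds = dataset.skip(size)    # remaining 10%

del dataset  # the unsplit dataset is no longer needed
```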
OK, so now we're on to actually building and training the model. 00:24:20.820 |
So we're going to be using Transformers again, 00:24:27.380 |
and we're going to be importing the TFAutoModel class. 00:24:35.860 |
And we'll set BERT equal to TFAutoModel from pre-trained. 00:24:56.060 |
We're using the TensorFlow version here; if we got rid of this TF prefix, we'd get the PyTorch version instead. 00:25:03.540 |
Just like we would have any other TensorFlow model, 00:25:05.700 |
we can use the summary method to print out what we have. 00:25:19.540 |
We only see a single BERT layer here; obviously, what's within that BERT layer is a lot more complex than just one 00:25:22.860 |
layer, but it's all embedded within that single layer. 00:25:34.500 |
But that's kind of like the core or the engine of our model. 00:25:43.900 |
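A sketch of loading that core BERT model; the checkpoint name is an assumption:

```python
from transformers import TFAutoModel

bert = TFAutoModel.from_pretrained('bert-base-cased')  # checkpoint name assumed
bert.summary()  # shows a single 'bert' layer wrapping the whole transformer
```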
So the first thing we need to do is we have our two input layers. 00:25:50.180 |
We have that one for input IDs and one for the attention mask. 00:26:01.740 |
We already imported TensorFlow up above for the data set, so we don't need to do that again. 00:26:04.500 |
And what we do is define input IDs, and we say tf.keras.layers.Input, passing the shape. 00:26:20.380 |
So that's the sequence length, and then we just add a comma here to make it a tuple. 00:26:23.940 |
So that's the same shape as we were seeing up here. 00:26:36.460 |
We also give the layer a name. We do this because, as we have seen up here, 00:26:42.740 |
our inputs come in as a dictionary, and we need to know which input each of these input 00:26:48.940 |
layers should read from. So we map input IDs to this name here, input_ids. 00:26:58.460 |
And we set the data type equal to integer 32. 00:27:03.220 |
And we do that because these are token IDs here, which are integers. 00:27:11.780 |
And we do the same for the mask with tf.keras.layers.Input. 00:27:25.180 |
We have the name, which is where we use our attention_mask key. 00:27:32.380 |
And again, it's just the same dtype, which is int32. 00:27:49.860 |
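A sketch of those two input layers, named to match the dictionary keys from the map function:

```python
# two input layers whose names line up with the keys produced by map_func
input_ids = tf.keras.layers.Input(shape=(512,), name='input_ids', dtype='int32')
mask = tf.keras.layers.Input(shape=(512,), name='attention_mask', dtype='int32')
```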
So next, we're creating our embeddings from BERT. 00:27:56.980 |
And what we need to do is access the transformer itself inside the model object, 00:28:08.180 |
so we write bert.bert, which accesses that transformer. 00:28:12.260 |
And in there, we want to pass our input IDs and our attention 00:28:18.420 |
mask, which is going to be the mask, so these two input layers. 00:28:25.020 |
We can pull out the raw activations or the raw 00:28:30.180 |
activation tensors from the transformer model here. 00:28:36.700 |
And as I said, we could just take out that raw activation tensor, which is 3D. 00:28:44.100 |
Or, what they also include here is a pooled version of those activations. 00:28:50.740 |
So these are those 3D tensors pooled into 2D tensors. 00:28:57.260 |
Now, what we're going to be using is dense layers here. 00:29:02.780 |
And therefore, we want to be using this pooled layer. 00:29:13.220 |
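A sketch of pulling out that pooled output:

```python
# index 1 is the pooled (2D) output; index 0 would be the raw 3D activations
embeddings = bert.bert(input_ids, attention_mask=mask)[1]
```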
Now, what we want to do here for the final part of this model 00:29:27.620 |
is add two more layers, and these are both going to be densely connected layers. 00:29:31.100 |
And for the first one, I'm going to use 1,024 units or neurons. 00:29:46.100 |
And then our final layer, which is our labels, 00:29:49.900 |
that is going to be the same thing again, so dense. 00:29:55.780 |
But this time, we just want the number of labels here, which is 5 units. 00:30:13.980 |
And what we want to do here is calculate a probability across those five output classes. 00:30:20.340 |
So to do that, we want to use a softmax activation. 00:30:28.740 |
And we're also going to call this the outputs layer, because it is our outputs. 00:30:38.540 |
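A sketch of that classification head; the ReLU activation on the first dense layer is an assumption, since only the unit counts and the softmax are stated here:

```python
# 1,024-unit hidden layer, then a 5-way softmax over the sentiment classes
x = tf.keras.layers.Dense(1024, activation='relu')(embeddings)  # activation assumed
y = tf.keras.layers.Dense(5, activation='softmax', name='outputs')(x)
```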
So those are all of our layers, but we haven't actually 00:30:43.500 |
put them together into a model yet. They're all kind of just floating there by themselves. 00:30:47.500 |
Obviously, they do have these connections between them, 00:30:49.740 |
but they're not initialized into a model object yet. 00:31:00.300 |
So we initialize the model object, and what we need to do here is set the input layers and the output layer. 00:31:25.020 |
Yeah, so we're just setting up the boundaries of our model. 00:31:30.500 |
We have the inputs, and they lead to our outputs. 00:31:33.660 |
Everything else in between is already handled, so we're good to go. 00:32:18.260 |
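Tying it together might look like this:

```python
# declare the model boundaries: two inputs in, softmax probabilities out
model = tf.keras.Model(inputs=[input_ids, mask], outputs=y)
model.summary()
```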
In the summary, we have our input IDs and mask, and we have the shape here, 00:32:33.220 |
then the BERT layer, followed by the densely connected neural net with 1,024 units. 00:32:38.420 |
And we have our outputs, which is the softmax. 00:32:51.540 |
If you don't want to train the BERT layer as well, you can also write this line to freeze it. 00:32:58.580 |
And we select layer number 2, because we have layers 0, 1, and 2, and BERT is the third. 00:33:10.700 |
That would freeze the parameters within this BERT layer and just train the other two here. 00:33:15.460 |
But I will be keeping those parameters trainable so they can be trained as well. 00:33:23.460 |
It will probably give you a small performance increase, although training will take longer. 00:33:30.460 |
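For reference, that optional freeze looks like this (left commented out here, since the BERT layer stays trainable in this walkthrough):

```python
# layers 0 and 1 are the two input layers, layer 2 is BERT;
# uncomment to freeze BERT and train only the two dense layers
# model.layers[2].trainable = False
```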
Now, I want to set up the model training parameters. 00:33:43.300 |
We're using a pretty small learning rate of 1e-5, 00:33:48.020 |
and that is because we've got our pre-trained BERT model in here. 00:34:00.460 |
And what we also want to add is a loss function. 00:34:10.860 |
And because we're using categorical outputs here, 00:34:21.660 |
we use categorical cross-entropy. And then we're going to set our accuracy as well. 00:34:28.060 |
And we're using categorical accuracy for the same reason. 00:34:32.220 |
Then, when we compile the model, we're just going to need to pass that accuracy in there as well. 00:34:57.060 |
And metrics is going to be equal to a list containing that accuracy metric. 00:35:02.820 |
OK, so that's our model training parameters all set up. 00:35:07.140 |
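A sketch of those training parameters; the Adam optimizer is an assumption, as only the 1e-5 learning rate, categorical cross-entropy loss, and categorical accuracy metric are stated here:

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)  # optimizer choice assumed
loss = tf.keras.losses.CategoricalCrossentropy()
acc = tf.keras.metrics.CategoricalAccuracy('accuracy')

model.compile(optimizer=optimizer, loss=loss, metrics=[acc])
```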
So the final thing to do now is train our model. 00:35:24.380 |
The validation data we'll be using is our validation data set from before. 00:35:40.860 |
After training, I'm also going to save that model to a folder called sentiment_model. 00:35:50.660 |
That will create the directory and store all the files that we need for that model in there. 00:36:19.140 |
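A sketch of the training and saving step; the number of epochs is an assumption:

```python
history = model.fit(train_ds, validation_data=val_ds, epochs=3)  # epoch count assumed

# writes the architecture and weights into the ./sentiment_model directory
model.save('sentiment_model')
```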
So within that sentiment_model directory that we just created when we saved our model, 00:36:22.260 |
we have everything that we need to load our model as well. 00:36:27.900 |
So I'm just going to show you how we can do that. 00:36:31.660 |
So what we can do is start a new file here, a new notebook. 00:36:48.500 |
And what we'll do here is we need to first import TensorFlow. 00:37:04.980 |
And then we're loading the model from the sentiment_model directory. 00:37:09.620 |
And then let's just check that what we have here is what we built before. 00:37:17.700 |
So we can see now we have our input layers, BERT, 00:37:23.020 |
and then we have our pre-classifier and classifier layers at the end. 00:37:31.660 |
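A minimal sketch of that loading step:

```python
import tensorflow as tf

model = tf.keras.models.load_model('sentiment_model')
model.summary()  # should show the same layers we built earlier
```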
And now we can go forward and start making predictions. 00:37:40.420 |
But before that, we still need to convert our data or input strings into token tensors. 00:37:46.860 |
To do that, I'm going to create a function to handle it for us. 00:37:52.540 |
First, we are going to need to import the tokenizer 00:37:58.300 |
We're using BERT tokenizer just like we did before. 00:38:05.180 |
And we're going to initialize the tokenizer, the BERT tokenizer, from pre-trained again. 00:38:15.580 |
OK, so exactly the same as what we used before. 00:38:24.380 |
And all we need to do is define our prep data function. 00:38:29.940 |
And here we would expect a text, which is a string. 00:38:38.380 |
And this is just the same as what we were doing before. 00:38:46.220 |
We set a max length, which is going to be 512, as always. 00:38:50.740 |
But we are going to truncate anything longer than that. 00:38:55.140 |
And we're going to pad anything shorter than that. 00:39:13.060 |
And then there's one other thing that we don't need, the token type IDs. 00:39:13.060 |
And this is just another tensor that the tokenizer returns by default, which we'll switch off. 00:39:24.060 |
And we're going to return the TensorFlow tensors. 00:39:42.380 |
And now we can just return our tensors in the correct format. 00:39:47.740 |
So the format that we need is like we used before: a dictionary with the input IDs and attention mask keys. 00:39:54.260 |
But if you remember before, within the data set, 00:39:59.020 |
we were working with TensorFlow Float64 tensors. 00:40:04.460 |
So we also need to convert these, which will be integers, into floats. 00:40:12.700 |
So to do that, we do tf.cast on the tokens' input IDs. 00:40:21.140 |
And we say we want to cast that to tf.float64. 00:40:30.340 |
And we'll repeat the same thing, but for our attention mask. 00:40:35.060 |
So attention mask, we'll just copy that across. 00:40:48.460 |
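Putting that together, the prep function might look like this sketch (checkpoint name assumed):

```python
from transformers import BertTokenizer
import tensorflow as tf

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')  # checkpoint name assumed

def prep_data(text):
    tokens = tokenizer.encode_plus(
        text,
        max_length=512,
        truncation=True,
        padding='max_length',
        add_special_tokens=True,
        return_token_type_ids=False,  # the extra tensor we don't need
        return_tensors='tf'
    )
    # cast the integer tensors to float64 to match the training data
    return {
        'input_ids': tf.cast(tokens['input_ids'], tf.float64),
        'attention_mask': tf.cast(tokens['attention_mask'], tf.float64)
    }
```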
OK, and now what we can do is just call prep data on an example string. 00:41:13.740 |
And we just need to add the S onto IDs there. 00:41:20.100 |
And you can see here we have our CLS token, the "hello world" tokens, the 00:41:25.060 |
separator token, followed by lots of padding. 00:41:42.020 |
And what I want to do now is get the probability predictions by passing that prepared data to the model's predict method. 00:42:05.060 |
And we also just need to access the zero index of that output. 00:42:12.540 |
And what we can do to get the actual class from that is use argmax, 00:42:18.660 |
because we just want to take the position of the maximum value out of those probabilities. 00:42:23.420 |
And to do that, we just do np.argmax on probs[0]. 00:42:58.500 |
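A short sketch of that prediction step, using a hypothetical "hello world" input:

```python
import numpy as np

probs = model.predict(prep_data('hello world'))  # probabilities over the 5 classes
pred = np.argmax(probs[0])                       # index of the highest probability
```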
But that is really everything you need from start to finish. 00:43:02.140 |
We've preprocessed the data, set up our input data set, and built and trained the model. 00:43:09.140 |
We've saved it, loaded it, and made predictions.