How to Build a Transformer for Language Classification in TensorFlow
Chapters
0:00 Introduction
0:10 Six Steps
0:15 Initializing Tokenizer and Model
0:24 Encode Input Data
0:29 Build Model
0:43 Optimizer, Metrics, and Loss
Hi, and welcome to this video on implementing transformer models for language classification in TensorFlow. We're going to go through six steps, starting with initializing the HuggingFace tokenizer and model (and by HuggingFace, I mean the Transformers framework). Then we encode the input data to get our input ID and attention mask tensors. Those become our input layers, which go into BERT. After that, it's back to the normal TensorFlow process, where we set up our optimizer, metrics, and loss, compile the model, and train. And we will cover each one of these steps in this video.
So we're going to use a movie review data set from Kaggle. This data set provides us with sentiment ratings from 0, which is terrible, up to 4, which is amazing.
We can download it manually from the Kaggle website, or we can just download it using the Kaggle API. Now, if you haven't used the Kaggle API before, that's fine. You just install the kaggle package, and then you need to head over to the Kaggle website, go to your account page, and scroll down to the API section, where you can create a new API token. That gives you a kaggle.json file, which needs to go in the correct Kaggle folder, which will have been created when you installed the package. Now, if you're not sure where that is, all you need to do is try calling the API anyway. When you execute this, an OSError will appear, and it will say you need kaggle.json in a particular folder, which you don't yet have there. You just go ahead and put kaggle.json in that folder. Now, we just need to initialize our API and authenticate it.
And now we can use the competition download file method, and we are going to download the data into this directory. Let's refresh up here, and we can see it. We can also do this in Python, or you can do it manually, but for now, we're just going to do it this way.
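As a rough sketch, the download step looks something like this; the competition slug and file name are assumptions based on the data format described here (it matches Kaggle's Sentiment Analysis on Movie Reviews competition), so swap in whichever competition you are actually pulling from:

```python
from kaggle.api.kaggle_api_extended import KaggleApi
import zipfile

api = KaggleApi()
api.authenticate()  # reads kaggle.json from your Kaggle config folder

# assumed competition slug and file name
api.competition_download_file(
    'sentiment-analysis-on-movie-reviews', 'train.tsv.zip', path='.'
)

# extract in Python (you can also unzip the file manually)
with zipfile.ZipFile('train.tsv.zip', 'r') as z:
    z.extractall('.')
```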
Because it's a tab-delimited file, we read it with pandas read_csv and a tab separator, and we can see we have our phrase and our sentiment columns, alongside a phrase ID and a sentence ID. The first row holds the full phrase, which we can see where the phrase ID is 1 and the sentence ID is 1, and then we have lots of parts of that same phrase cut down into different pieces, each given its own sentiment value.
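A minimal sketch of that load step, assuming the extracted file is named train.tsv and sits in the working directory:

```python
import pandas as pd

# tab-delimited file, so we pass sep='\t' to read_csv
df = pd.read_csv('train.tsv', sep='\t')
df.head()
```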
Now, we're going to be using this training data for both the training and the validation sets, and I don't want to pollute the validation set with cut-down segments of phrases that also appear in training. So we're just going to drop duplicates and keep the first element of every unique sentence ID. With the segments removed, we only have around 8,500 samples.
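That deduplication can be done with pandas drop_duplicates; the SentenceId column name is an assumption based on the walkthrough:

```python
# keep only the first (full) phrase for each sentence ID
df = df.drop_duplicates(subset='SentenceId', keep='first')
len(df)  # roughly 8,500 rows remain
```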
Now, we need to move on to encoding our data. For that, we are going to be using the Transformers framework, which we will also be using for the transformer model itself. It works by providing a tokenizer and a model class for each transformer, and that means we are going to import or initialize a BERT model and also the matching BERT tokenizer.
First, though, we need to figure out how long we want each sequence to be, because the encoding method also acts as our padding and truncation step. To do that, we will get the sequence length in words of each sentence, plot it out, and just pick a sensible cutoff. What we're going to do here is get the length of each phrase with split, which will, by default, split by spaces. We'll plot the distribution with seaborn, just because it's super easy and quick to use, and we will also set the seaborn style just to make the chart easier to read. That gives us the distribution of the length of each sequence in our data set. Now, we could cut it off maybe around 40 or even 50; I think we'll go with 50, just so we get as much data in there as possible.
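A quick sketch of that length check, assuming the text lives in a Phrase column:

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('darkgrid')  # assumed style choice

# word count per phrase, splitting on whitespace by default
word_counts = df['Phrase'].apply(lambda x: len(str(x).split()))

sns.histplot(word_counts)
plt.show()
```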
To initialize the tokenizer, we first need to import it from the Transformers library, and we are initializing it from a pre-trained model. Now, cased here refers to whether BERT distinguishes between uppercase and lowercase characters; the cased model keeps that difference, while the uncased version lowercases everything. Keeping case can be useful, because capitalization can tell you that someone is being dramatic or shouting at you. And because we are classifying sentiment here, that is information worth keeping, so we will use the cased model.
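Initializing the tokenizer then looks like this, assuming the bert-base-cased checkpoint:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
```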
So we use the encode_plus method, which looks like this. You'll see here we've just defined, or hardcoded, a single example sentence, but when we are feeding all of our data through it, every phrase will be padded or truncated to our sequence length of 50. In this example, we end up with 48 of these padding tokens. To add these in during encoding, we tell the tokenizer our max length and that we want padding up to that max length, and then it's going to add all of our padding values for us. There are a few different tensors that we can get back from this encode_plus method. The token type IDs we don't really need, so we can tell the tokenizer not to return those. Finally, because we are working in TensorFlow, we ask it to return TensorFlow tensors. In the input IDs, the special tokens at either end are the start of sequence and end of sequence tokens. In the attention mask, where there's a zero, it means just ignore that position, because padding tokens aren't important to us; then we have ones for the real word tokens and for the start and end of sequence tokens.
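A sketch of the single-sentence encoding with those settings (fixed length of 50, padding, no token type IDs, TensorFlow tensors back); the example sentence is just a placeholder:

```python
tokens = tokenizer.encode_plus(
    'hello world',               # hardcoded example sentence
    max_length=50,
    truncation=True,
    padding='max_length',
    add_special_tokens=True,
    return_token_type_ids=False,
    return_attention_mask=True,
    return_tensors='tf'
)
# tokens['input_ids'] and tokens['attention_mask'] are the two tensors we keep
```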
Now, we need to do this for every sample in our dataset. We're going to add each encoded sequence into a NumPy array, so we initialize two zero arrays whose shape is the length of our data frame by the sequence length that we have defined, which is 50. We're just going to use a for loop to do this. It isn't the fastest approach, but we are only processing and encoding this data one time, and then we will save it and load it back in whenever we need it.
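A sketch of that loop; the Phrase column name is assumed, and the tokenizer's plain list outputs drop straight into the NumPy rows:

```python
import numpy as np

seq_len = 50
num_samples = len(df)

# pre-allocate the two input arrays
Xids = np.zeros((num_samples, seq_len), dtype='int32')
Xmask = np.zeros((num_samples, seq_len), dtype='int32')

for i, phrase in enumerate(df['Phrase']):
    tokens = tokenizer.encode_plus(
        phrase, max_length=seq_len, truncation=True,
        padding='max_length', add_special_tokens=True,
        return_token_type_ids=False, return_attention_mask=True
    )
    Xids[i, :] = tokens['input_ids']
    Xmask[i, :] = tokens['attention_mask']
```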
Okay, and then here we can see our complete arrays. At the top, we have our input IDs in the Xids array, with the token IDs at the start of each row and our padding tokens at the end. Below that, we have the attention mask values in Xmask, which obviously correspond to the respective values in the input IDs.
Now, for our sentiment labels, we are actually going to one-hot encode them. In the data, we have the values 4, 1, 3, 2, and 0. What we are doing here is taking the array size, which is just the length of our DataFrame, and arr.max(), which is the maximum value within our array. Together, those say that we want a zero array with one row per sample and one column per sentiment class. At the moment, we just have an empty zero array. Then we index into it with a range from zero to 8,528, which is the array size here, and because arr holds each sentiment value, we use those values as the column indices and set those positions to one.
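As a sketch, with the Sentiment column name assumed:

```python
arr = df['Sentiment'].values

# zero array: one row per sample, one column per class (0-4 gives 5 columns)
labels = np.zeros((arr.size, arr.max() + 1))

# use the sentiment values as column indices and flip those positions to 1
labels[np.arange(arr.size), arr] = 1
```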
Now, I said before that typically we'd save these arrays to file and load them back in. Obviously, we don't strictly need to reload them here, but it's worth doing, because going forwards, if you want to retrain on this data, you can just load the arrays straight back in. Otherwise, you'd have to do all of the encoding all over again, and when you're working with bigger datasets, that can take a long time. So here we've saved each array to file, and we've just removed all of them from memory, so now we will not be able to access any of them. Going forwards, we are just going to load them back in.
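A sketch of that save, delete, and reload cycle; the file names are arbitrary choices:

```python
np.save('xids.npy', Xids)
np.save('xmask.npy', Xmask)
np.save('labels.npy', labels)

del Xids, Xmask, labels  # remove the in-memory copies

Xids = np.load('xids.npy')
Xmask = np.load('xmask.npy')
labels = np.load('labels.npy')
```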
Okay, so now we need to put all of our arrays into a TensorFlow dataset object, which is a lot faster in terms of performance when training. So first, we actually need to create our dataset object from the three arrays, and we can view one of its samples, like so: the first element comes from Xids, then Xmask, and then labels. When we feed a dataset into model.fit, TensorFlow expects our data to be input as a tuple, where the zero index of that tuple needs to be our input values and the one index of that tuple needs to be our labels. Now, in our case, it's also slightly different, because BERT takes two inputs, so the input values are a dictionary with a key input_ids, which maps to our Xids array, and attention_mask, which must map to our Xmask array. To restructure the dataset into that format, we need to build a mapping function, like so, and we also need to add labels onto the end there. Then we apply this mapping function to our dataset object with its map method. So now we can view a single row in this dataset, and we can see a slightly different format.
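A sketch of the dataset build and the mapping function described above:

```python
import tensorflow as tf

# one dataset sample per row of the three arrays
dataset = tf.data.Dataset.from_tensor_slices((Xids, Xmask, labels))

def map_func(input_ids, masks, labels):
    # restructure each sample into the (inputs, labels) tuple that fit expects,
    # with the inputs as a dict keyed by the input layer names
    return {'input_ids': input_ids, 'attention_mask': masks}, labels

dataset = dataset.map(map_func)
```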
Next, we're going to shuffle our samples and put them into batches of 32. For the shuffle buffer size, I typically take the size of my dataset as a starting point and increase the number if needed. Okay, so now we have our shuffled, batched dataset.
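Something like the following, where the 10,000 shuffle buffer is an assumed value:

```python
batch_size = 32

# shuffle, then batch; drop_remainder avoids a smaller final batch
dataset = dataset.shuffle(10000).batch(batch_size, drop_remainder=True)
```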
Now, because the dataset object is a generator, we can't just take the length of it directly, so we need another way to work out how many batches we have. If you're working with a very large dataset, counting the batches like this is probably not the right method to use, and you should instead take the size of the dataset from the source data and divide by the batch size. Then we get our training and validation sets with take and skip: take simply takes the specified number of batches for the training set, and skip steps past those same batches to give us the validation set. And then at the end here, we can delete the full dataset object, since we no longer need it.
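A sketch of that split, assuming a 90/10 ratio and using the array length to work out the batch count:

```python
split = 0.9
num_batches = int((Xids.shape[0] / batch_size) * split)

train_ds = dataset.take(num_batches)
val_ds = dataset.skip(num_batches)

del dataset  # we only need the two splits from here on
```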
Now we can go on to actually building our model architecture. To do that, we need to import TFAutoModel from Transformers and initialize it from the same pre-trained model name that you're using to initialize your tokenizer. And here, we can see that we have now imported BERT.
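Assuming the same bert-base-cased checkpoint as the tokenizer:

```python
from transformers import TFAutoModel

bert = TFAutoModel.from_pretrained('bert-base-cased')
```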
That gives us the core of the model, but we need to build a network around BERT as well. The first thing we need to do is define our input layers, and we need two, because we have the input IDs and the attention mask. Each input layer gets a name, and this needs to match up to the dictionary key that we defined in the mapping function earlier: so input_ids for the input IDs, and attention_mask for the attention mask, okay? Otherwise, TensorFlow does not know where these inputs are going. We do the same thing for the attention mask as we did for the input IDs. Then we feed both inputs into BERT, like so, and BERT will return two tensors to us. The first is a 3D tensor, which provides all the information from the last hidden state of the BERT model. The second tensor, which we are going to ignore, is the pooler output, and the pooler output is essentially the last hidden state pooled down into a 2D tensor, which can be used for classification if you want. Okay, so here you can experiment with adding LSTM layers, but we are just going to add a global max pooling layer, which will convert our 3D output tensor into a 2D tensor; alternatively, you could just use the pooler output tensor instead.
We also need to make sure the data types of our input layers are defined correctly. From the pooling layer, we go into our densely connected neural network layers, which handle the classification of our BERT embedding outputs. Then we want to add a dropout layer here; this just prevents any overfitting, or too much overfitting. Then we add another densely connected layer. And finally, we are creating our output layer, which is going to be a densely connected layer with a softmax activation. We use softmax here because we have our five one-hot encoded sentiment labels, and we just give this layer a name of outputs. Then we define the model by telling it what our input layers are and what our output layer is: to the inputs, we pass the input IDs and mask layers, and to the outputs, we pass the output layer we just created.
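Putting the architecture together might look like this; the layer sizes, activations, and dropout rate are assumptions for illustration:

```python
# two input layers, named to match the dataset dictionary keys
input_ids = tf.keras.layers.Input(shape=(50,), name='input_ids', dtype='int32')
mask = tf.keras.layers.Input(shape=(50,), name='attention_mask', dtype='int32')

# BERT returns (last hidden state, pooler output); keep the 3D first tensor
embeddings = bert(input_ids, attention_mask=mask)[0]

# collapse (batch, sequence, hidden) down to (batch, hidden)
x = tf.keras.layers.GlobalMaxPool1D()(embeddings)
x = tf.keras.layers.Dense(128, activation='relu')(x)
x = tf.keras.layers.Dropout(0.1)(x)
x = tf.keras.layers.Dense(32, activation='relu')(x)

# one output unit per sentiment class, softmax activation
y = tf.keras.layers.Dense(labels.shape[1], activation='softmax', name='outputs')(x)

model = tf.keras.Model(inputs=[input_ids, mask], outputs=y)
```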
In the model summary, we can see the number of parameters in our model, and almost all of them belong to BERT. Now, for this dataset, training all of those is definitely overkill, so we are going to freeze the BERT layer and only train the layers we added around it. We simply set trainable equal to false to do that. So here, we now have around 104,000 trainable parameters rather than 108 million trainable parameters.
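For example (the layer index here is an assumption based on the two input layers coming first in this functional model):

```python
# freeze BERT so only the pooling and dense head layers train
model.layers[2].trainable = False

model.summary()
```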
Now we can go ahead and put together our optimizer, loss, and accuracy metric, compile our model, and begin training. For our optimizer, we're just going to use Adam. For the loss, because we are using one-hot encoded labels, we use categorical cross-entropy, and we track categorical accuracy as our metric. Then we call fit with our training data and our validation set, and I'm going to train it for a total of maybe 140 epochs.
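A sketch of the compile and fit calls; the learning rate is an assumed value:

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
loss = tf.keras.losses.CategoricalCrossentropy()
acc = tf.keras.metrics.CategoricalAccuracy('accuracy')

model.compile(optimizer=optimizer, loss=loss, metrics=[acc])

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=140
)
```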
Now, depending on your GPU, if you're using a GPU, this can still take a while, and if you're using a CPU, it will take a very long time, so you may want to train for fewer epochs or reduce the data set size a little further if you want.
But then if we take a look at the accuracy at the end, it looks very good. I thought this could quite easily get to about 90% on this data set, and on the validation set we actually get a higher number by quite a bit: it went up to 94% here, almost 95%. Part of that may simply be that, within the validation set, there are more easy examples. But nonetheless, these are pretty good results. We added a few layers around BERT, but otherwise, it's a pretty straightforward, simple model, so it's pretty cool that you can get these results with relatively little work. And going forward, with other data sets with more data, and by improving the model at the same time, there is a lot that we can actually do with this.