
How-to Build a Transformer for Language Classification in TensorFlow


Chapters

0:10 Six Steps
0:15 Initializing Tokenizer and Model
0:24 Encode Input Data
0:29 Build Model
0:43 Optimizer, Metrics, and Loss

Whisper Transcript

00:00:00.000 | Hi, and welcome to this video on implementing transformer models
00:00:04.840 | with TensorFlow.
00:00:06.760 | So we're going to go through six steps, which
00:00:10.960 | are downloading and preprocessing data,
00:00:14.680 | initializing the HuggingFace tokenizer and model--
00:00:17.520 | and by HuggingFace, I mean the Transformers framework.
00:00:22.440 | Then we encode input data to get input ID and attention mask tensors.
00:00:27.880 | Then we build the full model architecture.
00:00:30.880 | So that is our input layers, which go into BERT,
00:00:35.760 | and then the output layers post-BERT.
00:00:38.960 | Then it's back to the normal TensorFlow process
00:00:42.160 | where we set up our optimizer, metrics, and loss,
00:00:45.360 | and then we begin training.
00:00:47.440 | And we will cover each one of these steps in this video.
00:00:51.200 | So first, we need to actually get our data.
00:00:54.640 | So we're going to use the IMDB movie review data set,
00:00:59.680 | or it may actually be Rotten Tomatoes.
00:01:01.840 | I'm not sure.
00:01:03.840 | Now, this data set provides us with sentiment ratings
00:01:07.440 | from 0, which is terrible, up to 4, which is amazing.
00:01:11.640 | You can get the data set from Kaggle,
00:01:13.760 | or we can just download it using the Kaggle API, which
00:01:16.160 | is what we're going to do here.
00:01:18.160 | Now, if you haven't used Kaggle API before, that's fine.
00:01:21.160 | All you need to do is install Kaggle,
00:01:26.440 | and then you need to head over to the Kaggle website,
00:01:31.320 | go to your account page, scroll down to, I think,
00:01:37.120 | it's API integration, download kaggle.json,
00:01:48.120 | and then you need to place kaggle.json
00:01:51.160 | in the correct Kaggle folder, which will have been created
00:01:53.960 | when you did the pip install.
00:01:55.520 | Now, if you're not sure where that is, all you need to do
00:01:58.360 | is import Kaggle, like that.
00:02:03.840 | And when you execute this, an OSError will appear,
00:02:07.400 | and it will say you need kaggle.json, which you don't
00:02:10.880 | have, you need to put it in this folder.
00:02:12.920 | You just go ahead and put kaggle.json in that folder,
00:02:15.720 | and then you are ready.
00:02:16.960 | Now, we just need to initialize our API and authenticate it.
00:02:27.240 | And now, we can use the competition download file
00:02:35.520 | method to download our data.
00:02:39.040 | And we are going to import the data into this directory.
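A minimal sketch of this download step, assuming the Kaggle "Sentiment Analysis on Movie Reviews" competition slug and file name:

```python
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads kaggle.json from the Kaggle config folder

# Competition slug and file name are assumptions for this dataset
api.competition_download_file(
    'sentiment-analysis-on-movie-reviews', 'train.tsv.zip', path='./'
)
```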
00:02:42.640 | Now, let's refresh up here, and we can see it.
00:02:50.680 | Now, obviously, it's a zip file, so we
00:02:52.560 | need to quickly unzip that.
00:02:54.560 | We can also do this in Python, or you can do it manually.
00:02:57.160 | But for now, we will just do it manually.
00:03:10.800 | It's easier, or quicker, at least.
00:03:12.400 | And there we go.
00:03:32.280 | We now have our data here.
00:03:33.560 | It's a tab-delimited file.
00:03:37.600 | So if we open it, you can see here,
00:03:39.880 | we're delimiting by the tab character.
00:03:43.280 | And we can see we have our phrase and our sentiment,
00:03:46.160 | which are the two that we care most about.
00:03:50.440 | Now, we'll use pandas to read data.
00:04:01.280 | And because it's a tab-delimited file, we use read CSV.
00:04:06.040 | And then we just set the separator to tab.
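Reading the file with pandas might look like this (the file name is an assumption):

```python
import pandas as pd

# Tab-delimited file, so read_csv with the separator set to tab
df = pd.read_csv('train.tsv', sep='\t')
df.head()
```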
00:04:10.960 | Now, this data set has a full sentence,
00:04:15.960 | which we can see where the phrase ID is 1 and the sentence ID is 1.
00:04:20.760 | That is our full sentence or full phrase.
00:04:23.120 | And then we have lots of parts of that same phrase cut down
00:04:27.680 | into different pieces and given a sentiment value.
00:04:32.520 | Now, I mean, you can use this.
00:04:34.600 | But I'm going to avoid it because I'm just
00:04:38.040 | going to be using the training data for both the training
00:04:41.040 | set and the validation set.
00:04:42.840 | And I don't want to pollute the validation set
00:04:46.920 | with very similar phrases that we
00:04:49.240 | have used in the training data.
00:04:51.760 | So we're just going to drop duplicates and keep
00:04:54.840 | the first element of every unique sentence ID.
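A sketch of that de-duplication step (the 'SentenceId' column name is an assumption based on the dataset described):

```python
# Keep only the first row per sentence ID, i.e. the full phrase
df = df.drop_duplicates(subset='SentenceId', keep='first')
```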
00:04:58.240 | And here, you can see that we have now
00:05:09.920 | removed those duplicates or segments.
00:05:14.680 | With the segments removed, we only have 8,500.
00:05:21.000 | Now, we need to move on to encoding our data.
00:05:24.760 | So for that, we are going to be using the transformers
00:05:28.840 | framework, which we will also be using for the transformer
00:05:31.760 | itself.
00:05:32.800 | And this works by providing a tokenizer and the model
00:05:38.000 | itself for each transformer.
00:05:40.800 | So we're going to be using BERT.
00:05:42.600 | And that means that we are going to import or initialize
00:05:45.640 | a BERT model and also the BERT tokenizer, which
00:05:49.520 | is already pre-built.
00:05:52.080 | Now, before we do this encoding, we
00:05:54.640 | need to figure out how long we want each sequence to be,
00:05:59.000 | because the encoding method also acts as our padding
00:06:02.600 | or truncation method.
00:06:05.200 | So to do that, we will get the sequence length
00:06:08.160 | in words of each sentence and plot that out and just
00:06:12.680 | go by eye and say, OK, cutting off around here
00:06:16.240 | won't lose too much data.
00:06:19.000 | So first, we need to get the sequence length
00:06:21.440 | of every sentence in here.
00:06:23.360 | Now, what we're going to do here is get the length
00:06:30.320 | of each sentence split.
00:06:34.320 | Now, split will, by default, split by spaces.
00:06:38.680 | So we will get a list of words here.
00:06:41.520 | Now, we need to actually visualize this.
00:06:43.280 | So we will use matplotlib and seaborn
00:06:45.680 | just because they're super easy and quick to use.
00:06:48.480 | And then we will also set the seaborn style just
00:06:56.680 | to make it a bit more visually appealing.
00:06:59.400 | And we will also increase the figure size
00:07:06.480 | so we can actually see what's going on.
00:07:08.080 | And then we will use a distribution plot.
00:07:14.720 | And here, we can see the distribution
00:07:20.920 | of the length of each sequence in our data set.
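A sketch of the sequence-length calculation and plot (the 'Phrase' column name is an assumption; the video may use seaborn's older distplot):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Word count of each phrase (str.split defaults to splitting on whitespace)
seq_len = [len(str(phrase).split()) for phrase in df['Phrase']]

sns.set_style('darkgrid')          # nicer styling
plt.figure(figsize=(16, 8))        # larger figure
sns.histplot(seq_len, kde=True)    # distribution of sequence lengths
plt.show()
```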
00:07:26.480 | Now, we could cut it off maybe around 40 or even 50.
00:07:32.600 | I think we'll go with 50 just so we get as much data in there
00:07:36.280 | as possible.
00:07:40.800 | So we'll set sequence length equal to 50.
00:07:47.480 | Now, we need to initialize our tokenizer.
00:07:50.600 | And before we do that, actually, we
00:07:55.040 | need to import it from the Transformers library.
00:08:10.680 | And we are getting our model from a pre-trained model.
00:08:14.320 | And we're using BERT, base, cased.
00:08:19.920 | Now, cased here refers to whether BERT distinguishes
00:08:30.760 | the difference between uppercase and lowercase characters.
00:08:34.440 | The alternative would be uncased.
00:08:36.800 | And this would just not distinguish
00:08:39.000 | the difference between uppercase and lowercase.
00:08:41.440 | Everything would just become lowercase.
00:08:44.040 | But when people are on the internet
00:08:46.520 | and they want to shout and seem angry,
00:08:49.560 | people put everything in capital letters.
00:08:52.240 | So BERT can probably pick up on this
00:08:54.720 | and tell that someone is being dramatic or shouting at you
00:08:58.320 | over the internet or whatever else.
00:09:00.080 | And because we are classifying sentiment here,
00:09:02.160 | it's probably quite important.
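Initializing the tokenizer might look like this, as a sketch using the Transformers API:

```python
from transformers import BertTokenizer

# Cased model, so upper/lowercase is preserved
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
```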
00:09:07.200 | Now that we've initialized our tokenizer,
00:09:08.960 | we can go on to the encoding.
00:09:11.760 | So we use the encode plus method, which looks like this.
00:09:15.520 | So you'll see here we've just defined or hardcoded
00:09:31.960 | a single line, which is hello world.
00:09:35.000 | We are using a max length of 50.
00:09:37.480 | We want the encoder to truncate any text
00:09:41.640 | that is longer than 50 tokens.
00:09:43.000 | Obviously, this one will not be.
00:09:44.880 | But when we are feeding all of our data through it,
00:09:48.080 | we need this in there.
00:09:49.480 | And on the other hand,
00:09:53.400 | we also want anything shorter than 50
00:09:57.240 | to be padded with pad tokens.
00:10:00.280 | In this case, we will end up with 48 of these padding tokens.
00:10:03.760 | And here we are just telling the tokenizer
00:10:08.760 | to pad up to the value that we have given
00:10:13.640 | in the max length argument.
00:10:15.560 | Now BERT comes with several special tokens.
00:10:20.760 | We have the start sequence, end of sequence,
00:10:24.200 | padding, unknown, and mask tokens.
00:10:27.200 | In order to add these in during the encoding method here,
00:10:32.200 | we need to write add special tokens true.
00:10:35.760 | And in this case, all it's gonna do
00:10:41.360 | is add the start of sequence token,
00:10:44.520 | the end of sequence token,
00:10:46.200 | and then it's gonna add all of our padding values.
00:10:48.720 | Now, there are several different outputs
00:10:52.400 | that we can get from this encode plus method.
00:10:55.680 | By default, we have input IDs
00:10:57.400 | and the return token type IDs.
00:11:00.640 | Now, the token type IDs we don't really need,
00:11:03.200 | so we can tell the tokenizer to not return those.
00:11:06.680 | But we do need input IDs,
00:11:13.240 | which is fine, we get them by default,
00:11:15.800 | and also the attention mask tensor.
00:11:18.880 | To return this, we just write return
00:11:20.360 | attention mask equals true.
00:11:21.720 | Finally, because we are working in TensorFlow,
00:11:28.200 | we need to return TensorFlow tensors.
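A sketch of the encode_plus call just described; note that older Transformers versions spell the padding argument as pad_to_max_length=True instead:

```python
tokens = tokenizer.encode_plus(
    'hello world',
    max_length=50,
    truncation=True,              # cut anything longer than 50 tokens
    padding='max_length',         # pad anything shorter up to max_length
    add_special_tokens=True,      # add [CLS], [SEP], and padding tokens
    return_token_type_ids=False,  # we don't need token type IDs
    return_attention_mask=True,   # we do need the attention mask
    return_tensors='tf'           # return TensorFlow tensors
)
```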
00:11:30.480 | Okay, so here we have our outputs.
00:11:43.800 | So we get two tensors.
00:11:45.200 | One of those are the input IDs,
00:11:47.520 | and then also our attention mask here.
00:11:51.420 | So we input the sequence hello world,
00:11:54.760 | and we can see that this value here
00:11:57.400 | in input IDs is hello, and this is world.
00:12:00.920 | Now, the 101 and 102 you see here
00:12:05.320 | are the start of sequence and end of sequence tokens
00:12:09.600 | used by BERT, and the remaining zeros
00:12:12.400 | are simply the padding tokens.
00:12:14.160 | We also have the attention mask,
00:12:16.440 | and this is used to tell BERT
00:12:19.280 | what tokens to calculate attention for,
00:12:22.240 | and which tokens to just completely ignore.
00:12:24.720 | So where we have a one, that tells BERT,
00:12:28.600 | yep, pay attention to this.
00:12:30.680 | Where there's a zero, it means just ignore it.
00:12:34.620 | So we have zeros for every padding token
00:12:37.320 | because padding tokens aren't important to us,
00:12:39.600 | it's just padding.
00:12:41.480 | But then we have ones for the end of sequence
00:12:45.520 | and start of sequence tokens,
00:12:47.440 | and also hello and world,
00:12:49.320 | because they are actually important
00:12:51.520 | for BERT to pay attention to.
00:12:53.600 | Now, of course, this is just one,
00:12:56.480 | and we need to do this for every sample in our dataset.
00:13:01.480 | So we'll go ahead and do that.
00:13:02.880 | Now, we're just gonna use a simple for loop
00:13:06.040 | to add each value or sequence into a NumPy array,
00:13:11.040 | which we will initialize now.
00:13:13.360 | So first import NumPy,
00:13:19.000 | and then initialize our two arrays.
00:13:21.160 | Both of them are gonna be the same size,
00:13:27.520 | so it's gonna be the length of our data frame
00:13:30.720 | by the sequence length that we have defined, which is 50.
00:13:36.140 | And we do this twice.
00:13:44.720 | We also want one for the attention mask.
00:13:48.220 | And we can see the shape of our arrays here.
00:13:53.220 | Now, we're just gonna use a for loop to do this.
00:14:00.500 | It's only a small dataset,
00:14:01.700 | so whatever we use doesn't really matter.
00:14:05.780 | We are only processing and encoding this data one time,
00:14:09.380 | and then we will save it and then load it from memory
00:14:12.780 | when we're actually training our model.
00:14:14.740 | And there we have our for loop.
00:14:43.100 | This will go through each sequence
00:14:45.700 | and add each one of those sequences
00:14:48.100 | into the respective index
00:14:51.020 | of our initialized zero arrays here.
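A sketch of the array initialization and encoding loop described here (the 'Phrase' column name is an assumption):

```python
import numpy as np

seq_len = 50
num_samples = len(df)

# Zero arrays to hold the token IDs and attention masks
Xids = np.zeros((num_samples, seq_len))
Xmask = np.zeros((num_samples, seq_len))

for i, phrase in enumerate(df['Phrase']):
    tokens = tokenizer.encode_plus(
        phrase, max_length=seq_len, truncation=True, padding='max_length',
        add_special_tokens=True, return_token_type_ids=False,
        return_attention_mask=True, return_tensors='tf'
    )
    # Place each encoded sequence into its row of the arrays
    Xids[i, :] = tokens['input_ids']
    Xmask[i, :] = tokens['attention_mask']
```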
00:14:54.440 | Okay, and then here we can see our complete arrays.
00:15:05.700 | So at the top, we have our input IDs and XIDs,
00:15:09.460 | and we can see each one starts
00:15:11.260 | with our start of sequence token, 101,
00:15:14.340 | followed by a few words.
00:15:15.820 | Then there will be the end of sequence token
00:15:17.340 | somewhere in the middle there,
00:15:18.700 | and then at the end, we have our padding tokens.
00:15:22.180 | And then we can also see in Xmask,
00:15:24.580 | which is our attention mask,
00:15:26.060 | we have the ones to pay attention to
00:15:28.700 | and the zeros to ignore,
00:15:30.620 | which obviously correspond to the respective values
00:15:34.180 | in the XIDs array.
00:15:35.700 | Now, for our labels,
00:15:38.940 | we are actually going to one-hot encode them.
00:15:41.940 | So at the moment, let's see what we have.
00:15:45.460 | So we will get values of four, one, three, two, and zero.
00:15:55.320 | Though you can't see it here.
00:16:00.860 | Now, this is a pretty good format
00:16:03.140 | to go straight in and one-hot encode.
00:16:05.980 | So all we need to do here is create an array
00:16:09.580 | from the DataFrame column.
00:16:11.780 | And then here, like we did before,
00:16:22.380 | we are just initializing an empty zero array.
00:16:25.460 | And what we are doing here is we are taking the array size.
00:16:34.640 | So if I, okay, let's do it a little differently.
00:16:39.620 | So the array size is just the length of our DataFrame.
00:16:48.140 | And array.max is the maximum value within our array,
00:16:53.140 | which is the number four.
00:16:55.220 | And then we are adding one onto that,
00:16:56.860 | which is saying here that we want a zero array
00:16:59.840 | of 8,529 rows by five columns.
00:17:04.840 | Like so.
00:17:08.940 | Now, at the moment, we just have an empty zero array.
00:17:11.460 | So we just need to add ones in the indices
00:17:14.300 | where we have a value.
00:17:16.900 | And we do that very easily like this.
00:17:18.900 | So we create a range of values
00:17:26.860 | from zero to 8,528, which is the array size here.
00:17:31.860 | And then within that, we add array.
00:17:36.660 | Because in array, we have each sentiment value.
00:17:39.460 | So zero, four, three, two.
00:17:41.380 | And this will add a one in that index.
00:17:44.940 | So zero, three, four.
00:17:46.940 | And that produces our one-hot encoding.
00:17:50.040 | Like so.
00:17:54.420 | Here, our rating was one.
00:17:57.700 | Here, four.
00:17:59.420 | Here, one.
00:18:00.260 | And so on.
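A sketch of that one-hot encoding step (the 'Sentiment' column name is an assumption):

```python
arr = df['Sentiment'].values  # sentiment values 0-4

# Zero array of shape (num_samples, num_classes), then set a 1
# at each row's sentiment index
labels = np.zeros((arr.size, arr.max() + 1))
labels[np.arange(arr.size), arr] = 1
```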
00:18:01.080 | Now, I said before that typically we'd save these
00:18:05.100 | before loading them in.
00:18:06.780 | So we are gonna save them.
00:18:08.620 | Obviously, we don't need to load them back in.
00:18:10.940 | But I will show you how to anyway,
00:18:13.580 | because going forwards, if you want to retrain on this data,
00:18:17.500 | you can just load these back in.
00:18:19.060 | Otherwise, you'd have to do all the preprocessing over again.
00:18:21.060 | And when you're working with bigger datasets,
00:18:23.220 | that will take quite some time.
00:18:25.500 | So first, we'll just save them.
00:18:28.900 | (silence)
00:18:31.060 | So now we've saved them,
00:18:54.420 | and we've just removed all of them from memory.
00:18:57.140 | So now we will not be able to access any of them.
00:19:01.260 | Going forwards, we are just going to load them back in,
00:19:05.020 | which is exactly the same process.
00:19:06.940 | Quickly write that out.
00:19:08.140 | (silence)
00:19:10.300 | (silence)
00:19:12.460 | And now we have our data back in.
00:19:30.540 | Like so.
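Saving and reloading the arrays might look like this (file names are assumptions):

```python
# Save the encoded arrays so the preprocessing only has to run once
np.save('xids.npy', Xids)
np.save('xmask.npy', Xmask)
np.save('labels.npy', labels)

# ...and later, load them straight back in
Xids = np.load('xids.npy')
Xmask = np.load('xmask.npy')
labels = np.load('labels.npy')
```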
00:19:33.060 | Okay, so now we need to put all of our arrays
00:19:37.980 | into a TensorFlow dataset object.
00:19:40.340 | So we'll use a dataset object
00:19:43.620 | because it makes things a lot easier.
00:19:46.020 | So we can restructure the data, shuffle,
00:19:49.220 | and batch it in just a few lines of code,
00:19:51.940 | which is a lot faster in terms of performance,
00:19:55.020 | and also a lot faster for us to write down.
00:19:57.860 | It's a much simpler code.
00:19:59.460 | So we'll import TensorFlow.
00:20:05.660 | And also, if you're using a GPU,
00:20:08.300 | you can check that it is being detected
00:20:10.860 | by your system with this.
00:20:13.020 | And there we can see I have a GPU
00:20:23.820 | being picked up by TensorFlow.
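The import and GPU check might look like this:

```python
import tensorflow as tf

# Lists any GPUs TensorFlow can see (empty list on CPU-only machines)
print(tf.config.list_physical_devices('GPU'))
```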
00:20:26.260 | So first, we need to restructure the data.
00:20:28.620 | TensorFlow expects our data to be input as a tuple.
00:20:33.620 | So that tuple consists of our inputs
00:20:36.220 | and our target or output labels.
00:20:38.820 | Now, because we are using BERT,
00:20:41.820 | our data structure is slightly different
00:20:43.940 | because in the input tuple,
00:20:45.660 | we actually have a dictionary,
00:20:48.380 | and that dictionary contains a key
00:20:50.780 | that is input_ids, which maps to our xids array.
00:20:55.780 | We also have another key, attention_mask,
00:20:58.860 | which maps to our xmask array.
00:21:03.100 | So first, we actually need to create our dataset object,
00:21:06.100 | which we do like so.
00:21:08.260 | And this just creates a generator
00:21:21.820 | which contains all of our data
00:21:23.620 | in the tuple-like format
00:21:26.460 | where each tuple contains one xids array,
00:21:30.340 | xmask array, and label array.
00:21:33.900 | So we can actually view one of those, like so.
00:21:36.620 | And here we can see each one of our arrays.
00:21:43.500 | So the first one, xids, xmask, and then labels.
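Creating the dataset object and peeking at a single sample might look like this:

```python
dataset = tf.data.Dataset.from_tensor_slices((Xids, Xmask, labels))

# Each element is a (xids, xmask, labels) tuple
for sample in dataset.take(1):
    print(sample)
```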
00:21:47.620 | Now, TensorFlow expects our input data
00:21:52.660 | as a dataset to be in a tuple format.
00:21:56.220 | The zero index of that tuple needs to be our input values
00:22:00.380 | and the one index of that tuple needs to be our labels.
00:22:05.380 | Now, in our case, it's also slightly different
00:22:08.820 | because we are using two inputs.
00:22:12.220 | So within that tuple, we have our input,
00:22:14.420 | and within that input, we have a dictionary.
00:22:18.660 | And that dictionary consists of a key,
00:22:21.940 | input_ids, which maps to our xids array,
00:22:24.780 | and attention_mask, which must map to our xmask array.
00:22:30.020 | So to create this structure,
00:22:31.740 | we need to build a mapping function, like so.
00:22:35.020 | And here we return the format we need.
00:22:44.220 | So input_ids, which will map to input_ids,
00:22:49.220 | and attention_mask,
00:22:58.380 | which will map to mask.
00:23:00.140 | Then, because we are still expecting
00:23:04.060 | to use the input output tuple,
00:23:05.900 | we also need to add labels onto the end there.
00:23:08.260 | And to apply this mapping function to our dataset object,
00:23:13.780 | we use the map method.
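A sketch of the mapping function and the map call:

```python
def map_func(input_ids, masks, labels):
    # Restructure each sample into ({'input_ids': ..., 'attention_mask': ...}, labels)
    return {'input_ids': input_ids, 'attention_mask': masks}, labels

dataset = dataset.map(map_func)
```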
00:23:15.820 | So now we can view a single row in this dataset.
00:23:26.940 | (silence)
00:23:29.100 | And now we can see a slightly different format.
00:23:34.100 | So we have our input dictionary here,
00:23:37.660 | where we have input_ids and attention_mask,
00:23:40.620 | and then we have our output tensor here.
00:23:44.460 | Now, I said before that this makes shuffling
00:23:48.620 | and batching our dataset very easy.
00:23:51.340 | So we can do it in a single line, like this.
00:23:55.980 | (silence)
00:23:58.140 | So here we're gonna put our samples into batches of 32,
00:24:07.220 | and then the value that I've given
00:24:08.780 | to the shuffle method here,
00:24:10.780 | essentially just needs to be very large.
00:24:13.340 | The larger your dataset,
00:24:14.940 | the larger this number needs to be.
00:24:16.700 | Typically, I take a sample of my dataset and increase the number
00:24:22.300 | if I can see that it is not shuffling
00:24:24.980 | the dataset properly.
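The shuffle-and-batch line might look like this (the buffer size of 10,000 and drop_remainder are assumptions):

```python
batch_size = 32

# The shuffle buffer just needs to be large relative to the dataset size;
# drop_remainder avoids a smaller final batch
dataset = dataset.shuffle(10000).batch(batch_size, drop_remainder=True)
```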
00:24:26.540 | Okay, so now we have our shuffle batch dataset
00:24:32.500 | in the correct format.
00:24:33.940 | So now we just need to split it
00:24:36.340 | into our training and validation sets.
00:24:38.980 | Okay, and to do that,
00:24:41.180 | we need to get the total size of our dataset
00:24:45.260 | now that it is batched.
00:24:46.500 | So we can do that, like so.
00:24:54.940 | Now, because the dataset object is a generator,
00:24:57.860 | we can't just take the length of it directly,
00:25:00.700 | so we have to convert it into a list.
00:25:02.580 | Now, if you're working with a very large dataset,
00:25:06.820 | this is probably not the right method to use,
00:25:08.700 | and you should instead take the size of the dataset
00:25:12.460 | that you know already,
00:25:14.140 | and calculate the current length of it,
00:25:16.900 | including the batch size from that.
00:25:18.820 | But for us, this is fine,
00:25:21.860 | and we can see that dataset length is 267.
00:25:26.860 | Now, if we want a 90-10 split
00:25:30.980 | between the training and validation set,
00:25:34.020 | we simply use a split value of 0.9.
00:25:37.660 | And then we get our training and validation sets
00:25:44.420 | as two different datasets.
00:25:49.860 | To split the dataset, we use take and skip.
00:25:53.060 | We used take earlier on,
00:25:54.540 | and what it does is simply takes the specified number of,
00:25:59.380 | in this case, batches, and nothing else.
00:26:02.180 | Skip, on the other hand, does the opposite.
00:26:04.060 | It skips a specified number of batches,
00:26:06.780 | and then takes the remainder.
00:26:09.260 | (mouse clicking)
00:26:12.020 | And then at the end here, we can delete dataset
00:26:29.580 | if space is an issue.
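A sketch of that split using take and skip:

```python
split = 0.9

# Fine for a small dataset; for large datasets, compute the length
# from the known sample count and batch size instead
ds_len = len(list(dataset))
size = int(ds_len * split)

train_ds = dataset.take(size)   # first 90% of batches
val_ds = dataset.skip(size)     # the remaining batches

del dataset  # optional, if memory is a concern
```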
00:26:31.020 | Okay, so now our data is ready.
00:26:36.100 | We can go on to actually building our model architecture.
00:26:40.540 | So first, we initialize BERT,
00:26:43.060 | and to do that, we need to import TFAutoModel
00:26:46.660 | from the Transformers library.
00:26:48.380 | And we initialize BERT like so.
00:26:55.700 | So here, remember to use the same model
00:27:04.700 | that you're using to initialize your tokenizer.
00:27:07.100 | And here, we can see that we have now imported BERT.
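Initializing BERT might look like this, using the same checkpoint as the tokenizer:

```python
from transformers import TFAutoModel

bert = TFAutoModel.from_pretrained('bert-base-cased')
```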
00:27:13.940 | So we have BERT,
00:27:15.460 | but we need to build a network around BERT as well.
00:27:19.660 | First thing we need to do is define our input layers,
00:27:22.940 | and of course, there are two,
00:27:24.140 | because we have the input IDs and the attention mask.
00:27:33.740 | And the shape is simply the sequence length
00:27:36.700 | that we are using.
00:27:37.740 | And the name here is very important.
00:27:43.540 | This needs to match up to the dictionary value
00:27:47.660 | that we have defined in our dataset here.
00:27:51.100 | So input IDs, input IDs, and attention mask, okay?
00:27:56.060 | So these need to match up.
00:27:57.700 | Otherwise, TensorFlow does not know where these are going.
00:28:02.180 | (mouse clicking)
00:28:04.940 | And we do the same here,
00:28:10.740 | but we are doing this for the attention mask.
00:28:12.940 | So those are our two input layers,
00:28:25.380 | and now we need to pull the embeddings
00:28:28.180 | from our initialized BERT model.
00:28:30.780 | (mouse clicking)
00:28:32.980 | BERT consumes our two input layers,
00:28:35.780 | like so, and BERT will return two tensors to us.
00:28:45.940 | One of those is the last hidden state,
00:28:48.140 | which is what we are interested in.
00:28:49.980 | That is a 3D tensor, which provides all the information
00:28:53.980 | from the last hidden state of the BERT model.
00:28:59.180 | The second tensor that we are going to ignore
00:29:03.300 | is called the pooler output,
00:29:05.340 | and the pooler output is essentially the last hidden state
00:29:08.980 | run through a feed-forward
00:29:11.620 | or linear activation function and pooled.
00:29:14.700 | So that creates a 2D tensor,
00:29:18.180 | which can be used for classification if you want,
00:29:21.740 | but we are going to pool it ourselves,
00:29:24.020 | so we will not be using that tensor.
00:29:27.740 | (mouse clicking)
00:29:29.940 | Okay, so here you can experiment with adding LSTM layers,
00:29:34.740 | convolutional layers, or anything else,
00:29:37.780 | but for now, to keep things simple,
00:29:40.380 | we are just going to add a global max pooling layer,
00:29:43.220 | which will convert our output 3D tensor into a 2D tensor.
00:29:48.220 | Again, you could skip this,
00:29:51.180 | and you could just output the pooler output tensor,
00:29:55.220 | like this, by changing the zero to a one,
00:29:58.820 | but we are not going to be using that.
00:30:00.660 | (mouse clicking)
00:30:05.020 | Okay, so up here,
00:30:16.500 | we just need to define the input data types as well,
00:30:20.140 | which I missed.
00:30:21.060 | (mouse clicking)
00:30:23.820 | And that will remove the type error.
00:30:28.380 | Now, we need to normalize our outputs here.
00:30:34.460 | This will almost always give better results
00:30:38.060 | when we are actually training the model.
00:30:40.020 | (mouse clicking)
00:30:42.780 | And then following this,
00:30:48.980 | we will go into our Densely Connected Neural Network layers,
00:30:52.620 | which are in charge of actually figuring out
00:30:55.820 | the classification of our BERT embedding outputs.
00:30:59.660 | (mouse clicking)
00:31:18.260 | And then we want to add a dropout layer here.
00:31:21.460 | This just prevents any overfitting or too much overfitting.
00:31:26.260 | Then we add another Densely Connected Neural Network.
00:31:31.820 | And finally, we are creating our output layer,
00:31:44.620 | which is going to be a Densely Connected Neural Network
00:31:47.100 | with a Softmax activation function.
00:31:50.140 | Now, we use Softmax here because we have our three labels,
00:31:59.980 | or no, sorry, five labels in the output.
00:32:03.220 | So let me change this.
00:32:04.620 | So we have our five labels in the output
00:32:07.700 | because we have one hot encoded,
00:32:09.660 | the zero, one, two, three, and four.
00:32:16.060 | And finally, we just give it a name of outputs.
00:32:19.420 | Now, that is our model architecture,
00:32:26.420 | but we still need to tell TensorFlow
00:32:29.740 | what our input layers are and what our output layer is.
00:32:33.300 | So to do that, we define our model, like so.
00:32:37.220 | And to the inputs, we pass input IDs and mask.
00:32:44.900 | And then to the outputs, we just pass Y.
00:32:47.260 | Finally, we have our model,
00:32:53.780 | so we can actually execute that
00:32:57.340 | and produce a model summary here
00:32:59.260 | so we can see what we have built.
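Putting the whole architecture together might look roughly like this; the normalization layer choice and the dense layer sizes are assumptions, since the exact values are not stated:

```python
seq_len = 50

# Two input layers; the names must match the dataset dictionary keys
input_ids = tf.keras.layers.Input(shape=(seq_len,), name='input_ids', dtype='int32')
mask = tf.keras.layers.Input(shape=(seq_len,), name='attention_mask', dtype='int32')

# Index 0 is the last hidden state (3D); index 1 would be the pooler output (2D)
embeddings = bert(input_ids, attention_mask=mask)[0]

# Pool to 2D, normalize, then classify
X = tf.keras.layers.GlobalMaxPool1D()(embeddings)
X = tf.keras.layers.BatchNormalization()(X)
X = tf.keras.layers.Dense(128, activation='relu')(X)
X = tf.keras.layers.Dropout(0.1)(X)
X = tf.keras.layers.Dense(32, activation='relu')(X)
y = tf.keras.layers.Dense(5, activation='softmax', name='outputs')(X)

model = tf.keras.Model(inputs=[input_ids, mask], outputs=y)
model.summary()
```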
00:33:01.940 | Okay, and here we can see our model.
00:33:11.980 | Now, if we scroll down to the bottom,
00:33:13.460 | we can see the number of parameters in our model,
00:33:16.580 | and we have quite a lot, 108 million parameters.
00:33:19.700 | Almost all of them are trainable.
00:33:22.500 | Now, BERT is a very big model,
00:33:24.980 | and I wouldn't recommend training it
00:33:28.740 | unless you have a specific reason to.
00:33:33.220 | Now, for this dataset, it's definitely overkill.
00:33:37.580 | So what we can do is actually go in here
00:33:40.180 | and we can actually freeze the BERT model
00:33:43.900 | by freezing the third layer at index two
00:33:49.420 | of our model layers.
00:33:51.180 | And we simply set trainable equal to false to do that.
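Freezing BERT is then a single line (assuming BERT sits at index 2 of model.layers, after the two input layers):

```python
# Freeze the BERT layer so only the classification head is trained
model.layers[2].trainable = False
```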
00:33:55.340 | So if we now look at our model summary,
00:34:00.900 | we will see that the number of parameters
00:34:03.580 | is exactly the same,
00:34:05.140 | but the number of trainable parameters
00:34:07.340 | will have reduced by a lot.
00:34:10.140 | So here, we now have 104,000 trainable parameters
00:34:15.140 | rather than 108 million trainable parameters.
00:34:19.220 | We can go ahead and put together our optimizer,
00:34:23.820 | loss and accuracy, compile our model, and begin training.
00:34:28.020 | Now, for our optimizer, we're just gonna use Adam.
00:34:35.700 | With a learning rate of 0.01.
00:34:40.700 | For the loss, because we are using one-hot encoding
00:34:48.020 | for our outputs, we are going to use
00:34:49.740 | the categorical cross-entropy loss.
00:34:52.020 | And finally, for our accuracy,
00:35:00.380 | we are also gonna use categorical accuracy
00:35:02.980 | for the same reason.
00:35:04.980 | (silence)
00:35:07.140 | And we can compile our model.
00:35:14.900 | So here, I've just missed the R.
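The optimizer, loss, metric, and compile step might look like this:

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
loss = tf.keras.losses.CategoricalCrossentropy()       # one-hot labels
acc = tf.keras.metrics.CategoricalAccuracy('accuracy')  # one-hot labels

model.compile(optimizer=optimizer, loss=loss, metrics=[acc])
```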
00:35:32.140 | And now we can actually train our model.
00:35:35.060 | So just do model fit.
00:35:37.980 | We have our training data, our validation set.
00:35:45.420 | And I've found that for this model,
00:35:53.460 | we have to use a lot of epochs
00:35:55.740 | to actually get a good accuracy out of it.
00:35:57.940 | So I'm gonna train it for a total of maybe 140 epochs.
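And the training call might look like this:

```python
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=140  # the epoch count mentioned above
)
```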
00:36:02.940 | Now, depending on your GPU, if you're using a GPU,
00:36:08.260 | this will still take quite some time.
00:36:10.780 | So if you're using a CPU, it will take a very long time.
00:36:13.660 | So maybe reduce the number of epochs
00:36:16.580 | or reduce the data set size a little further if you want.
00:36:21.420 | So I'm gonna start training this,
00:36:23.940 | and I will see you on the other side.
00:36:26.180 | (silence)
00:36:28.180 | Okay, so we have finished training.
00:36:31.820 | And you can see that the accuracy over time
00:36:35.820 | is actually quite good.
00:36:37.700 | It goes up very slowly,
00:36:40.100 | which is why using a lot of epochs
00:36:42.100 | has been quite useful here.
00:36:43.420 | But then if we take a look at the accuracy at the end,
00:36:47.300 | it's 82%, which is not bad.
00:36:49.740 | But I think more importantly is the fact
00:36:51.460 | that it was still going up very gradually.
00:36:53.660 | I think with further training,
00:36:55.860 | this could quite easily get to about 90% on this data set,
00:36:59.340 | and considering it's a very small data set,
00:37:01.220 | that's pretty good.
00:37:03.060 | Now, if we look at the validation accuracy,
00:37:04.660 | we actually get a higher number, by quite a bit.
00:37:07.620 | So this actually went up to 94% here, almost 95.
00:37:12.620 | And I would assume that this is because
00:37:16.980 | within the validation set, there are more easy examples,
00:37:21.540 | whereas in the training set,
00:37:22.660 | we have some more difficult examples.
00:37:26.140 | But nonetheless, these are pretty good results
00:37:28.780 | for quickly putting together a model.
00:37:30.460 | It's not a particularly big model,
00:37:31.900 | other than the BERT encoder in the middle,
00:37:34.940 | but otherwise, it's a pretty straightforward, simple model.
00:37:39.300 | So it's pretty cool that you can get these results
00:37:41.860 | on sentiment analysis in so little time.
00:37:44.260 | And going forward with other data sets with more data
00:37:48.940 | can definitely do a lot better.
00:37:51.420 | At the same time, we can also improve the model.
00:37:53.900 | We can add LSTM layers to the classifier
00:37:56.820 | or convolutional neural net layers,
00:37:58.420 | or even just more densely connected
00:38:00.900 | neural network layers as well.
00:38:02.260 | So there is a lot that we can actually do with this.
00:38:04.860 | But for now, that's everything.
00:38:07.460 | So I hope you enjoyed the video.
00:38:09.300 | I hope it's been useful to you.
00:38:11.140 | If you have any questions,
00:38:12.220 | let me know in the comments below.
00:38:14.020 | But otherwise, thank you for watching,
00:38:16.180 | and I will see you next time.