Lesson 8 - Deep Learning for Coders (2020)
Chapters
0:00 Introduction
0:15 Natural Language Processing
5:10 Building a Language Model
12:18 Get Files
13:28 Word Tokenizer
16:30 Word Tokenizer Rules
17:38 SubWord Tokenizer
19:01 Setup
23:23 Numericalization
25:43 Batch
29:19 Data Loader
30:18 Independent Variables
30:36 Dependent Variables
31:09 Data Blocks
31:56 Class Methods
33:25 Language Model
35:08 Save Epoch
35:32 Save Encoder
37:55 Text Generation
42:05 Language Models
42:49 Classification
43:10 Batch Size
44:20 Pad Tokens
45:29 Text Classifier
48:31 Data Augmentation
48:54 Predicting the Next Word
51:00 Data Augmentation on Text
51:51 Generation
58:11 Creating Datasets
00:00:00.000 |
Hi everybody and welcome to Lesson 8, the last lesson of Part 1 of this course. Thanks 00:00:06.820 |
so much for sticking with us. Got a very interesting lesson today where we're going to do a dive 00:00:12.540 |
into natural language processing. And to remind you, we did see natural language processing
in Lesson 1. This was it here. We looked at a dataset where we could pass in many movie 00:00:28.160 |
reviews like so and get back probabilities that it's a positive or negative sentiment. 00:00:34.740 |
And we trained it with a very standard looking classifier trainer approach. But we haven't 00:00:41.160 |
really talked about what's going on behind the scenes there, so let's do that. And we'll 00:00:46.040 |
also learn about how to make it better. So we were getting about 93%. So 93% accuracy 00:00:52.280 |
for sentiment analysis which is actually extremely good and it only took a bit over 10 minutes. 00:00:58.560 |
But let's see if we can do better. So we're going to go to notebook number 10. And in 00:01:08.640 |
notebook number 10 we're going to start by talking about what we're going to do to train 00:01:16.640 |
an NLP classifier. So a sentiment analysis which is this movie review positive or negative 00:01:22.040 |
sentiment is just a classifier. The dependent variable is binary. And the independent variable 00:01:28.040 |
is the kind of the interesting bit. So we're going to talk about that. But before we do 00:01:33.000 |
we're going to talk about what was the pre-trained model that got used here. Because the reason 00:01:40.200 |
we got such a good result so quickly is because we're doing fine-tuning of a pre-trained model. 00:01:46.540 |
So what is this pre-trained model exactly? Well the pre-trained model is actually a pre-trained 00:01:51.440 |
language model. So what is a language model? A language model is a special kind of model 00:02:01.140 |
and it's a model where we try to predict the next word of a sentence. So for example if 00:02:09.680 |
our language model received the beginning of a sentence, its job would be to predict the word that comes next. Now the language model that we use as our pre-trained model was actually trained
on Wikipedia. So we took all the you know non-trivial sized articles in Wikipedia and 00:02:34.800 |
we built a language model which attempted to predict the next word of every sequence 00:02:40.080 |
of words in every one of those articles. And it was a neural network of course. And we 00:02:47.280 |
then take those pre-trained weights and those are the pre-trained weights that when we said 00:02:51.240 |
text classifier learner were automatically loaded in. So conceptually why would it be 00:02:57.680 |
useful to pre-train a language model? How does that help us to do sentiment analysis 00:03:02.960 |
for example? Well just like an ImageNet model has a lot of information about what pictures 00:03:10.360 |
look like and what they consist of, a language model tells us a lot about what sentences look like and what they say about the world. So for example, for a language model to be able to predict the end of the sentence "in 1998 this law was passed by President ___?", it would have to know
a whole lot of stuff. It would have to know about well how English language works in general 00:03:45.760 |
and what kind of sentences go in what places. That after the word president would usually 00:03:51.760 |
be the surname of somebody. It would need to know what country that law was passed in 00:03:57.000 |
and it would need to know what president was president of that country in what I say 1998. 00:04:03.720 |
So it'd have to know a lot about the world. It would have to know a lot about language 00:04:07.880 |
to create a really good language model is really hard. And in fact this is something 00:04:12.720 |
that people spend many many many millions of dollars on creating language models of 00:04:19.440 |
huge datasets. Our particular one doesn't take particularly long to pre-train but there's 00:04:25.960 |
no particular reason for you to pre-train one of these language models because you can 00:04:29.240 |
download them through fast AI or through other places. So what happened in lesson one is 00:04:39.600 |
we downloaded this pre-trained Wikipedia model and then we fine-tuned it so as per usual 00:04:45.520 |
we threw away the last layer which was specific for predicting the next word of Wikipedia 00:04:51.840 |
and fine-tuned the model. Initially just the last layer to learn to predict sentiment of 00:04:59.960 |
movie reviews and then as per usual then fine-tuned the rest of the model and that got us 93%. 00:05:07.720 |
Now there's a trick we can use though which is we start with this Wikipedia language model 00:05:14.000 |
and the particular subset we use is called Wikitext 103. And rather than just jumping 00:05:19.760 |
straight to a classifier which we did in lesson one we can do even better if we first of all 00:05:24.920 |
create an IMDB language model that is to say a language model that learns to predict the 00:05:30.360 |
next word of a movie review. The reason we do that is that this will help it to learn 00:05:36.640 |
about IMDB specific kind of words like it'll learn a lot more about the names of actors 00:05:42.760 |
and directors it'll learn about the kinds of words that people use in movie reviews. 00:05:48.760 |
And so if we do that first then we would hope we'll end up with a better classifier. So that's 00:05:53.240 |
what we're going to do in the first part of today's lesson. And we're going to kind of 00:05:59.880 |
do it from scratch and we're going to show you how to do a lot of the things from scratch 00:06:04.540 |
even though later we'll show you how fast AI does it all for you. So how do we build 00:06:10.240 |
a language model? So as we point out here sentences can be different lengths and documents like 00:06:16.360 |
movie reviews can be very long. So how do we go about this? Well a word is basically 00:06:26.440 |
a categorical variable and we already know how to use categorical variables as an independent 00:06:31.680 |
variable in a neural net which was we make a list of all of the possible levels of a 00:06:35.880 |
categorical variable which we call the vocab and then we replace each of those categories 00:06:42.640 |
with its index so they all become numbers. We create an initially random embedding matrix 00:06:49.320 |
for each so each row then is for one element from the vocab and then we make that the first 00:06:55.880 |
layer of a neural net. So that's what we've done a few times now and we've even created 00:07:02.540 |
our own embedding layer from scratch remember. So we can do the same thing with text right 00:07:07.540 |
we can make a list of all the possible words in in the whole corpus the whole dataset and 00:07:15.320 |
we can replace each word with its index in the vocab and create an embedding matrix. So
in order to create a list of all levels in this case a list of all possible words let's 00:07:28.080 |
first of all concatenate all the documents or the movie reviews together into one big 00:07:32.480 |
long string and split it into words okay and then our independent variable will basically 00:07:39.200 |
be that sequence starting with the first word in the long list and ending with a second 00:07:44.120 |
last and our dependent variable will be the sequence of words starting with a second word 00:07:49.200 |
and ending with a last so they're kind of offset by one so as you move through the first 00:07:54.520 |
sequence, you're then trying to predict the next word, which sits in the second sequence. That's kind of what we're doing; we'll see more detail in a moment. Now when
we create our vocab by finding all the unique words in this concatenated corpus a lot of 00:08:14.560 |
the words we see will be already in the embedding matrix already in the vocab of the pre-trained 00:08:20.920 |
Wikipedia model but there's also going to be some new ones right there might be some 00:08:26.800 |
particular actors that don't appear in Wikipedia or maybe some informal slang words and so 00:08:34.920 |
forth. So when we build our vocab and then our embedding matrix for the IMDB
language model any words that are in the vocab of the pre-trained model we'll just use them 00:08:49.520 |
as is but for new words we'll create a new random vector. So here's the process we're 00:08:58.520 |
going to have to go through first we're going to have to take our big concatenated corpus 00:09:03.940 |
and turn it into a list of tokens could be words could be characters could be substrings 00:09:10.800 |
that's called tokenization and then we'll do numericalization which is basically these 00:09:18.700 |
two steps: creating the vocab, and then replacing each word with its index in that vocab. Then we're going to need to create
a data loader that has lots of substrings lots of sequences of tokens from IMDB corpus 00:09:37.720 |
as an independent variable and the same thing offset by one as a dependent variable and 00:09:45.520 |
then we're going to have to create a language model. Now a language model is going to be 00:09:49.560 |
able to handle input lists that can be arbitrarily big or small and we're going to be using something 00:09:56.240 |
called a recurrent neural network to do this which we'll learn about later so basically 00:10:00.640 |
so far we've always assumed that everything is a fixed size a fixed input so we're going 00:10:05.840 |
to have to mix things up a little bit here and deal with architectures that can be different 00:10:10.640 |
sizes for this notebook notebook 10 we're going to kind of treat it as a black box it's 00:10:18.360 |
just going to be just a neural net and then later in the lesson we'll look at delving 00:10:23.200 |
inside what's happening in that architecture okay so let's start with the first of these 00:10:30.320 |
which is tokenization so converting a text into a list of words or a list of tokens what 00:10:36.480 |
does that mean? Is a full stop a token? What about "don't": is that a single word or two words, or should I convert it to "do not"? What about long medical words that are made up of lots of pieces of medical jargon all stuck together? What about
hyphenated words? And, really interestingly, what about something like Polish or Turkish, where you can create really long words all the time, really long words that are actually lots of separate parts concatenated together? Or languages like Japanese and Chinese that don't use spaces at all, and don't really have a well-defined idea of a word? Well, there's no right answer, but there are basically three approaches. We can
use a word-based approach which is what we use by default at the moment for English although 00:11:30.640 |
that might change which is we split a sentence on space and then there are some language specific 00:11:36.160 |
rules, for example splitting "don't" into "do" and "n't", and putting punctuation marks as separate tokens most of the time. Then, really interestingly, there are tokenizers that are subword based, and this
is where we split words into smaller parts based on the most commonly occurring substrings 00:11:54.440 |
we'll see that in a moment or the simplest character-based split a sentence into its 00:12:00.640 |
characters we're going to look at word and sub word tokenization in this notebook and 00:12:06.040 |
then if you look at the questionnaire at the end you'll be asked to create your own character 00:12:10.200 |
based tokenizer so please make sure you do that if you can it'll be a great exercise 00:12:19.280 |
so fastai doesn't invent its own tokenizers we just provide a consistent interface to 00:12:25.960 |
a range of external tokenizers because there's a lot of great tokenizers out there so you 00:12:32.480 |
can switch between different tokenizers pretty easily so let's start let's grab our IMDB data 00:12:38.360 |
set like we did in lesson one and in order to try out a tokenizer let's grab all the 00:12:44.320 |
text files so we can instead of calling get image files we'll call get text files and 00:12:51.040 |
you know to have a look at what that's doing don't forget we can even look at the source 00:12:55.640 |
code and you can see actually it's calling a more general thing called get files and 00:13:00.960 |
saying what extensions it wants right so if anything in fastai doesn't work quite the 00:13:04.400 |
way you want, and there isn't an option that works the way you want, you can always look underneath to see what we're calling, and you can call the lower-level stuff
yourself so files is now a list of files so we can grab the first one we can open it we 00:13:20.280 |
can read it have a look at the start of this review and here it is okay so at the moment 00:13:31.240 |
the default English word tokenizer we use is called spaCy which uses a pretty sophisticated 00:13:37.080 |
set of rules with special rules for particular words and URLs and so forth but we're just 00:13:45.320 |
going to go ahead and say word tokenizer which will automatically use fastai's default word 00:13:49.920 |
tokenizer currently spaCy and so if we pass a list of documents we'll just make it a list 00:13:58.320 |
of one document here to the tokenizer we just created and just grab the first since we just 00:14:04.000 |
created a list that's going to show us as you can see the tokenized version so you can 00:14:10.880 |
see here that "this movie which I just discovered at the video store has", etc., has been tokenized: it's changed "it's" into "it" and "'s", and it's put the comma in as a separate punctuation token, and so forth. Here's a little code sketch of the steps we just walked through.
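This is a minimal sketch following the pattern of notebook 10; the variable names are just for illustration.

```python
from fastai.text.all import *

# Grab the IMDB dataset and collect all the text files (get_text_files calls get_files underneath).
path = untar_data(URLs.IMDB)
files = get_text_files(path, folders=['train', 'test', 'unsup'])

# Open the first review and word-tokenize it with fastai's default word tokenizer (currently spaCy).
txt = files[0].open().read()
spacy = WordTokenizer()
toks = first(spacy([txt]))   # the tokenizer expects a collection of documents, so pass a list of one
print(coll_repr(toks, 30))   # show the first 30 tokens
```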
Let's look at a more interesting one the US dollar blah blah blah and you can see here 00:14:42.600 |
it actually knows that "U.S." is special, so it doesn't split the full stop off as a separate token for "U.S.", and it knows that "1.00" is special, so you can see there's a lot of tricky stuff
going on with spaCy to try and be as kind of thoughtful about this as possible. 00:15:00.760 |
Fastai then provides this tokenizer wrapper which provides some additional functionality 00:15:08.120 |
to any tokenizer. As you can see here, for example, the word "It", which previously started with a capital I, has been turned into lowercase "it", and a special token xxmaj has appeared in front of it. Everything starting with xx is a special fastai token, and xxmaj means that the next word originally started with a capital letter. Here's another one: "This" used to have a capital T, so we make it lowercase and add xxmaj. And xxbos means this is the start of a document. So there's a few special rules going on there
so why do we do that well if you think about it if we didn't lowercase it for instance 00:15:57.760 |
or this then the capitalized version and the lowercase version are going to be two different 00:16:03.760 |
words in the embedding matrix which probably doesn't make sense you know regardless of 00:16:08.320 |
the capitalization they probably basically mean the same thing having said that sometimes 00:16:15.720 |
the capitalization might matter so we kind of want to say all right use the same embedding 00:16:20.200 |
every time you see the word this but add some kind of marker that says that this was originally 00:16:24.640 |
capitalized okay so that's why we do it like this so there's quite a few rules you can 00:16:33.500 |
see them in text proc rules and you can see the source code here's a summary of what they 00:16:38.520 |
do but let's look at a few examples so if we use that tokenizer we created and pass 00:16:44.760 |
in, for example, this text, you can see the way it's tokenized: we get the xxbos token (beginning of stream, or beginning of string/document); this HTML entity has become a real Unicode character; we've got the xxmaj we discussed; and here "www" has been replaced by "xxrep 3 w", which means the letter w is repeated three times. So for things where you've got, you know, a hundred exclamation marks in a row, or the word "so" with fifty o's, this is a much better representation. And then you can see that all-uppercase words have been replaced with xxup followed by the lowercased word, so there's some of those rules in action. Oh, and you can also see that multiple spaces have been replaced, making everything into standard tokens. Here's a little sketch of those rules in action.
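This sketch assumes `spacy` and `txt` from the sketch above; the made-up example string is just there to exercise the rules.

```python
# Wrap the word tokenizer with fastai's Tokenizer, which adds the special rules (xxbos, xxmaj, ...).
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))

# The default preprocessing rules applied on top of the tokenizer:
defaults.text_proc_rules

# A made-up string showing xxbos, xxmaj, xxup, xxrep and whitespace handling:
coll_repr(tkn('&copy;   Fast.ai www.fast.ai/INDEX'), 31)
```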
So that's the word tokenizer. The really interesting one is the subword tokenizer. So why would you need a subword tokenizer? Well, consider for example this sentence here, spoken in Chinese, meaning "my name is Jeremy". The interesting thing about it is that there are
no spaces here right and that's because there are no spaces in Chinese and there isn't really 00:18:06.520 |
a great sense of what a word is in Chinese in this particular sentence it's fairly clear 00:18:10.920 |
what the words are but it's not always obvious sometimes the words are actually split you 00:18:16.720 |
know so some of it's at the start of a sentence and some of it's at the end so you can't really 00:18:21.160 |
do word tokenization for something like Chinese so instead we use subword tokenization which 00:18:28.000 |
is where we look at a corpus of documents and we find the most commonly occurring groups 00:18:33.200 |
of letters and those commonly occurring groups of letters become the vocab so for example 00:18:40.600 |
we would probably find that the characters meaning "my" appear often, and so do the characters meaning "name", but the characters for the Chinese version of my name wouldn't be very common at all, so they would probably appear as separate pieces. So let's look at
an example let's grab the first 2000 movie reviews and let's create the default subword 00:19:13.840 |
tokenizer which currently uses something called sentence piece that might change and now we're 00:19:20.080 |
going to use something special and very important, which is called setup. Transforms in fastai always have this special method called setup; it often doesn't do anything, but it's always there, and some transforms, like a subword tokenizer, actually need to
be set up before you can use them in other words you can't tokenize into subwords until 00:19:43.680 |
you know what the most commonly occurring groups of letters are so passing a list of 00:19:49.840 |
texts in here this list of text to set up will train the subword tokenizer it'll find 00:19:57.840 |
those commonly occurring groups of letters so having done that we can then this is just 00:20:04.080 |
for experimenting we're going to pass in some size we'll say what vocab size we want for 00:20:09.640 |
our subword tokenizer we'll set it up with our texts and then we will have a look at 00:20:15.640 |
a particular sentence so for example if we create a subword tokenizer with a thousand 00:20:21.600 |
tokens, it returns this tokenized string. Now, this kind of long underscore character is what we replace space with, because now that we're using subword tokens we kind of want to know where the words actually start and stop. And you can see here that a lot of the words are common enough sequences of letters that they get their own vocab item, whereas "discovered" wasn't common enough, so it got broken into pieces; "video" appears often enough, whereas "store" didn't, so that gets broken up too. So you get the idea. If we wanted a smaller vocab size then, as you can see, even common words like "this" no longer get their own token, although "movie" is so common that it still does. Here's a little sketch of that experiment.
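This sketch assumes `files` and `txt` from the earlier sketches; `subword` is just an illustrative helper name.

```python
# Read the first 2,000 reviews to train the subword tokenizer on.
txts = L(o.open().read() for o in files[:2000])

def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)   # currently wraps SentencePiece
    sp.setup(txts)                       # setup "trains" it: finds the most common groups of letters
    return ' '.join(first(sp([txt]))[:40])

subword(1000)    # common words get their own token; rarer ones are split into pieces
subword(10000)   # with a bigger vocab, almost every common word becomes its own token
```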
Okay, we have a question: how can we determine if the given pre-trained model, in this case WikiText-103, is suitable
enough for our downstream task if we have limited vocab overlap should we need to add 00:21:36.920 |
an additional data set to create a language model from scratch if it's in the same language 00:21:45.520 |
so if you're doing English it's always it's almost always sufficient to use Wikipedia 00:21:52.060 |
we've played around with this a lot and it was one of the key things that Sebastian Ruder 00:21:55.320 |
and I found when we created the ULM fit paper was before that time people really thought 00:22:01.160 |
you needed corpus-specific pre-trained models, but we discovered you don't, just like you don't that often need corpus-specific pre-trained vision models; ImageNet works surprisingly well across a lot of different domains. So Wikipedia has a lot of words in it, and I haven't come across an English corpus that didn't have a very high level of overlap
with Wikipedia on the other hand if you're doing ULM fit with like genomic sequences 00:22:37.920 |
or Greek or whatever then obviously you're going to need a different pre-trained model 00:22:46.000 |
so once we got to a 10,000 word vocab as you can see basically every word at least common 00:22:51.120 |
word becomes its own vocab item in the subword vocab, except, say, "discovered", which becomes "discover" plus "ed". So my guess is that subword approaches are going to become kind of the most common;
maybe they will be by the time you watch this we've got some fiddling to do to get this 00:23:14.560 |
working super well for fine-tuning but I think I know what we have to do so hopefully we'll 00:23:21.200 |
get it done pretty soon all right so after we split it into tokens the next thing to 00:23:29.880 |
do is numericalization. So let's go back to our word-tokenized text, which looks like this, and in order to numericalize we will first need to call setup. So to save a bit of time, let's create a subset of our corpus, just a couple of hundred of the reviews. Here's an example of one. We'll create our Numericalize object and call setup, and that's the thing that's going to create the vocab for us. After that we can take a look at the vocab; this is coll_repr showing us a representation of a collection, which is what the L class uses
underneath and you can see when we do this that the vocab starts with the special tokens 00:24:25.000 |
and then we start getting the English tokens in order of frequency so the default is a 00:24:33.920 |
vocab size of 60,000 so that'll be the size of your embedding matrix by default and if 00:24:40.560 |
there are more than 60,000 unique words in your corpus, then the least common ones will be replaced with a special xxunk (unknown) token. So that'll help us avoid
having a too-big embedding matrix. All right, so now we can treat the Numericalize object
which we created as if it was a function as we so often do in both fastai and pytorch 00:25:09.200 |
and when we do, it'll replace each of our tokens with numbers. So 2, for example (0, 1, 2), is the index of xxbos, beginning of stream, and 8 (0, 1, 2, 3, 4, 5, 6, 7, 8) is the index of xxmaj, the capitalized-word marker; there they are, xxbos, xxmaj, etc. And then we can convert them back by indexing into the vocab and get back what we started with. So we have now done the tokenization and the numericalization by hand; here's a small sketch of those steps.
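This sketch assumes `tkn`, `txt` and `txts` from the sketches above.

```python
toks = tkn(txt)                       # tokens for one review
toks200 = txts[:200].map(tkn)         # tokenize a couple of hundred reviews
num = Numericalize()
num.setup(toks200)                    # builds the vocab (at most 60,000 tokens by default)
print(coll_repr(num.vocab, 20))       # special tokens first, then tokens in order of frequency

nums = num(toks)[:20]                 # tokens -> indices, e.g. xxbos -> 2, xxmaj -> 8
' '.join(num.vocab[o] for o in nums)  # indices -> tokens again
```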
The next thing we need to do is to create batches;
so let's say this is the text that we want to create batches from and so if we tokenize 00:26:04.400 |
that text, it'll convert it into this. And so let's take that and write it out here: xxbos xxmaj in this chapter we will go back over the example
of classifying and then next row starts here movie reviews we studied in chapter one and 00:26:42.040 |
dig deeper under the surface full stop XXmaj first we will look at the etc okay so we've 00:26:48.040 |
taken these 90 tokens and to create a batch size of six we've broken up the text into 00:26:56.600 |
six contiguous parts each of length 15 so 1 2 3 4 5 6 and then we have 15 columns okay 00:27:06.640 |
so 6 by 15 now ideally we would just provide that to our model as a batch and if indeed 00:27:17.280 |
that was all the data we had we could just pass it in as a batch but that's not going 00:27:25.840 |
to work for imdb because imdb once we concatenate all the reviews together and then let's say 00:27:32.600 |
we want to use a batch size of 64 then we're going to have 64 rows and you know probably 00:27:42.920 |
there are a few million tokens in IMDB, so a few million divided by 64 columns across is going to be way too big to fit in our GPU. So what we're going to do then is we're going
to split up that big wide array and we're going to split it up horizontally so we'll 00:28:04.880 |
start with XXbos XXmaj in this chapter and then down here we will go back over the example 00:28:12.120 |
of classifying movie reviews we studied in chapter one and dig deeper under the surface 00:28:18.600 |
etc so this would become our first mini-batch right and so you can see what's happened is 00:28:28.640 |
the second row actually continues from what was way down here, and so we've basically treated each row as totally independent. So when we get to the second mini-batch, it's going to follow on from the first: row one of the second mini-batch joins up with row one of the first, row two of the second mini-batch joins up with row two of the first, and so on. So please look at this example
super carefully because we found that this is something that every year a lot of students 00:29:08.040 |
get confused about because it's just not what they expected to see happen right so go back 00:29:14.400 |
over this and make sure you understand what's happening in this little example so that's 00:29:20.160 |
what our mini batches are going to be so the good news is that all these fiddly steps you 00:29:26.760 |
don't have to do yourself you can just use the language model data loader or LM data 00:29:32.320 |
loader so if we take those all the tokens from the first 200 movie reviews and map them 00:29:38.160 |
through our numericalize object right so now we've got numericalized versions of all those 00:29:42.720 |
tokens and then pass them into LM data loader and then grab the first item from the data 00:29:49.860 |
loader then we have 64 by 72 why is that well 64 is the default batch size and 72 is the 00:30:02.440 |
default sequence length you see here we've got one two three four five here we used a 00:30:08.080 |
sequence length of five right so what we do in practice is we use a default sequence length 00:30:15.080 |
of 72 so if we grab the first of our independent variables and grab the first few tokens and 00:30:24.400 |
look them up in the vocab, here it is: "this movie which i just xxunk at the video store has apparently sit around for a". That's interesting: this word ("discovered") was not common enough to be in the vocab, so it shows up as the unknown token. And then if we look at the exact same thing but for the dependent variable, rather than starting with "xxbos xxmaj this movie" it starts with "xxmaj this movie", so you can see it's offset by one, which means the end, rather than being "around for a", is "for a couple". So this is exactly what we want: the dependent variable is offset by one from the independent variable. Here's a little sketch of these LMDataLoader steps.
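This sketch continues from the `num` and `toks200` objects above.

```python
nums200 = toks200.map(num)     # numericalize all 200 tokenized reviews
dl = LMDataLoader(nums200)

x, y = first(dl)
x.shape, y.shape               # torch.Size([64, 72]) each: batch size 64, sequence length 72 by default

' '.join(num.vocab[o] for o in x[0][:20])   # independent variable
' '.join(num.vocab[o] for o in y[0][:20])   # the same text, offset by one token
```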
So that's looking good, and we can now go ahead and use these ideas to try and build our even better IMDB sentiment classifier. The first step, as we discussed, will be to create the language model, but let's just go ahead and
use the fast AI built-in stuff to do it for us rather than doing all that messing around 00:31:26.520 |
manually so we can just create a data block and our blocks are it's going to be a text 00:31:32.720 |
block from folder and the items are going to be text files from these folders and we're 00:31:46.400 |
going to split things randomly and then going to turn that into data loaders with a batch 00:31:52.280 |
size of 128 and a sequence length of 80 in this case our blocks we're not just passing 00:32:01.280 |
in a class directly but we're actually passing in here a class method and that's so that 00:32:10.600 |
we can allow the tokenization for example to be saved to be cached in some path so that 00:32:19.000 |
the next time we run this it won't have to do it all from scratch; that's why we have a slightly different syntax here. Here's a little sketch of this data block.
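This sketch assumes `path` from the earlier sketches.

```python
# TextBlock.from_folder is a class method, so the (slow) tokenization can be cached inside `path`.
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

dls_lm.show_batch(max_n=2)
```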
Once we've run this, we can call show_batch, and you can see here we've got, for example, "what xxmaj i've read xxmaj death" blah blah blah, and you
can see so that's the independent variable and so the dependent variable is the same 00:32:48.000 |
thing offset by one so we don't have the what anymore but it just goes straight to xxmaj 00:32:52.880 |
I've read and then at the end this was also this and of course in the dependent variable 00:32:57.680 |
also this is so this is that offset by one just like we were hoping for show batch is 00:33:03.720 |
automatically denumericalizing it for us turning it back into strings but if we look at the 00:33:10.400 |
actual or you should look at the actual x and y to confirm that you actually see numbers 00:33:16.720 |
there. That'll be a good exercise for you: make sure that you can actually grab a mini-batch from dls_lm. So now that we've got the data loaders we can fine-tune our language
model so fine-tuning the language model means we're going to create a learner which is going 00:33:35.060 |
to learn to predict the next word of a movie review so that's our data the data loaders 00:33:41.700 |
for the language model. This is the pre-trained model; it's something called AWD-LSTM, which we'll see how to create from scratch (or something similar to it) in a moment. Dropout we'll learn about later; this parameter says how much dropout to use, which is how much regularization we want. And what metrics do we want? We know about accuracy; perplexity is not particularly
interesting so I won't discuss it but feel free to look it up if you're interested and 00:34:07.640 |
let's train with fp16 to use less memory on the GPU and for any modern GPU it'll also 00:34:15.560 |
run two or three times faster so this gray bit here has been done for us the pre-training 00:34:23.200 |
of the language model for wiki text 103 and now we're up to this bit which is fine-tuning 00:34:28.760 |
the language model for IMDB. So let's do one epoch, and as per usual, using a pre-trained model automatically calls freeze, so we don't have to. This is going to actually train only the new embeddings initially, and after ten minutes or so we get an accuracy of about 30%. So that's pretty cool: a bit under a third of the time, our model is
predicting the next word of a string so I think that's pretty cool now since this takes 00:35:09.720 |
quite a while for each epoch we might as well save it and you can save it under any name 00:35:17.760 |
you want and that's going to put it into your path into your learner's path into a model 00:35:21.720 |
subfolder and it'll give it a .pth extension for PyTorch and then later on you can load 00:35:28.160 |
that with learn.load after you create the learner. And so then we can unfreeze and train a few more epochs, and we eventually get up to an accuracy of about 34%. Here's a little sketch of all these steps.
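The learning rates, epoch counts and file names here are illustrative.

```python
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]).to_fp16()

learn.fit_one_cycle(1, 2e-2)      # the pretrained layers are frozen automatically at first
learn.save('1epoch')              # writes models/1epoch.pth under the learner's path
# learn = learn.load('1epoch')    # ...and can be re-loaded later

learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)
learn.save_encoder('finetuned')   # saves everything except the final next-word prediction layer
```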
So that's pretty great. Once we've done all that we can save the model, but actually all we really need
to do is to save the encoder what's the encoder the encoder is all of the model except for 00:35:59.280 |
the final layer oh and we're getting a thunderstorm here that could be interesting we've never 00:36:03.880 |
done a lesson with a thunderstorm before but that's the joy of teaching during COVID-19 00:36:10.920 |
you get all the sound effects. So yeah, the final layer of our language model is the bit that actually picks a particular word out, which we don't need. So when we say
save encoder it saves everything except for that final layer and that's the pre-trained 00:36:31.240 |
model we're going to use that is a pre-trained model of a language model that is fine-tuned 00:36:37.800 |
from Wikipedia fine-tuned using IMDB and doesn't contain the very last layer Rachel any questions 00:36:45.880 |
at this point do any language models attempt to provide meaning for instance I'm going 00:36:53.160 |
to the store is the opposite of I'm not going to the store or I barely understand this stuff 00:36:59.600 |
and that ball came so close to my ear I heard it whistle both contain the idea of something 00:37:04.640 |
almost happening being right on the border is there a way to indicate this kind of subtlety 00:37:09.280 |
in a language model yeah absolutely our language model will have all of that in it or hopefully 00:37:19.360 |
it will hopefully it'll learn about it we don't have to program that the whole point 00:37:22.760 |
of machine learning is it learns it for itself but when it sees a sentence like hey careful 00:37:30.400 |
that ball nearly hit me, the expectation of what word is going to come next is going to be different from the sentence hey, that ball hit me. So yeah, language models you see in practice generally tend to get really good at understanding all of these nuances of English, or whatever language they're learning about. Okay, so we have a fine-tuned language
model so the next thing we're going to do is we're going to try fine-tuning a classifier 00:38:02.760 |
but before we do, just for fun, let's look at text generation. We can write ourselves
some words like I liked this movie because and then we can create say two sentences each 00:38:17.080 |
containing say 40 words and so we can just go through those two sentences and call learn.predict 00:38:24.440 |
passing in this text and asking to predict this number of words with this amount of kind 00:38:31.040 |
of randomization and see what it comes up with I liked this movie because of its story 00:38:38.520 |
and characters the storyline was very strong very good for a sci-fi the main character 00:38:42.520 |
alucard was very well developed and brought the whole story. Second attempt: I like
this movie because I like the idea of the premise of the movie the very convenient virus 00:38:53.020 |
which well when you have to kill a few people the evil machine has to be used to protect 00:38:57.020 |
blah blah blah. So as you can see, it's done a good job of inventing language; here's a small sketch of this generation loop.
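The prompt and the counts are the ones mentioned above; the temperature value is illustrative.

```python
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75)   # temperature adds some randomization
         for _ in range(N_SENTENCES)]
print("\n".join(preds))
```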
There are more careful, I shouldn't say more sophisticated, ways to do generation from a language model; this learn.predict uses about the most basic possible one. But even with a very simple approach, you can see we can get from a fine-tuned model some pretty
authentic looking text and so in practice this is really interesting because we can 00:39:29.360 |
now you know by using the prompt you can kind of get it to generate you know appropriate 00:39:37.800 |
context appropriate text particularly if you fine-tune from a particular corpus anyway 00:39:46.360 |
that was really just a little demonstration of something we accidentally created on the 00:39:50.160 |
way of course the whole purpose of this is actually just to be a pre-trained model for 00:39:53.960 |
classification so to do that we're going to need to create another data block and this 00:39:59.880 |
time we've got two blocks not one we've got a text block again just like before but this 00:40:04.460 |
time we're going to ask fastai not to create a vocab from the unique words but using the 00:40:11.560 |
vocab that we already have from the language model because otherwise obviously there's 00:40:16.200 |
no point reusing a pre-trained model if the vocab is different the numbers would mean 00:40:20.920 |
totally different things so that's the independent variable and the dependent variable just like 00:40:26.920 |
we've used before is a category so a category block is for that as we've used many times 00:40:31.700 |
we're going to use parent label to create our dependent variable that's a function get 00:40:37.600 |
items we'll use get text files just like before and we'll split using grandparent splitter 00:40:43.160 |
as we've used before in vision. And then we'll create our data loaders with a batch size of 128 and a sequence length of 72. Here's a little sketch of this data block.
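Note the language model's vocab is passed in, and there's no is_lm=True this time; names carry over from the sketches above.

```python
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

dls_clas.show_batch(max_n=2)
```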
Calling show_batch, we can see an example of a (truncated) movie review and its category. Yes, a question: do the tokenizers use any tokenization techniques like stemming
or lemmatization or is that an outdated approach that would not be a tokenization approach 00:41:21.020 |
So stemming is something that actually removes part of each word, and we absolutely don't want to do that; that is certainly an outdated approach. In English those word parts are there for a reason:
they tell us something so we we don't like to remove anything that can give us some kind 00:41:41.560 |
of information. We used to use that kind of thing quite a bit for pre-deep-learning NLP, because we didn't really have good ways, like embedding matrices, of handling big
vocabs that just differed in the you know kind of the end of a word but nowadays we 00:42:01.000 |
definitely don't want to do that oh yeah one other difference here is previously we had 00:42:10.200 |
an is_lm=True when we said TextBlock.from_folder, to say it was a language model; we don't have that anymore because it's not a language model. Okay, now one thing with
a language model that was a bit easier was that we could concatenate all the documents 00:42:28.880 |
together and then we could, well, not exactly split them by batch size, but put them into a number of substrings based on the batch size, and that way we could ensure that every mini-batch was the same size: batch size by sequence length.
but for classification we can't do that we actually need each dependent variable label 00:42:57.280 |
to be associated with each complete movie review and we're not showing the whole movie 00:43:03.480 |
review here we've truncated it just for display purposes but we're going to use the whole 00:43:06.760 |
movie review to make our prediction now the problem is that if we're using a batch size 00:43:14.440 |
of 128 then and our movie reviews are often like 3,000 words long we could end up with 00:43:22.280 |
something that's way too big to fit into the GPU memory so how are we going to deal with 00:43:28.640 |
that? Well, again, we can split them up. So first of all let's grab a few of the movie reviews, just as a demo here, and numericalize them, and if we have a look
at the length so map the length over each you can see that they do vary a lot in length 00:43:50.760 |
Now, we can split them into sequences, and indeed we have asked for that, a sequence length of 72, but when we do so we don't even have the same number of sub-sequences: when we split each of these into 72-long sections they're going to be all different lengths.
so how do we deal with that well just like in vision we can handle different sized sequences 00:44:17.880 |
by adding padding so we're going to add a special XX pad token to every sequence in 00:44:28.680 |
a mini batch so like in this case it looks like 581 is the longest so we would add enough 00:44:33.360 |
padding tokens to make this one 581, and this one 581, and this one 581, and so forth, and then we can split them into 72-long sequences for the mini-batches and we'll be
right to go now obviously if your lengths are very different like this adding a whole 00:44:52.420 |
lot of padding is going to be super wasteful so another thing that fastai does internally 00:44:56.920 |
is it tries to shuffle the documents around so that similar length documents are in the 00:45:03.160 |
same mini batch it also randomizes them but it kind of approximately sorts them so it 00:45:09.360 |
wastes a lot less time on padding. Okay, so that is what happens, and we don't have to do any of it manually: when we call TextBlock.from_folder without is_lm=True it does all of that for us. And then we can now go ahead and create a learner. This time it's going to be a text_classifier_learner; again we're going to base it off AWD_LSTM,
pass in the data loaders we just created for metric we'll just use accuracy make it FP16 00:45:46.360 |
again and now we don't want to use a pre-trained Wikipedia model in fact there is no pre-trained 00:45:52.480 |
Wikipedia classifier because you know what you classify matters a lot so instead we load 00:46:00.000 |
the encoder so remember everything except the last layer which we saved just before 00:46:05.640 |
so we're going to load as a pre-trained model a language model for predicting the next word 00:46:11.720 |
of a movie review. So let's go ahead and fit one cycle, and again by default it will be
frozen so it's only the final layer which is the randomly added classifier layer that's 00:46:28.680 |
going to be trained. It took about 30 seconds, and look at this, we already have 93%. Here's a little sketch of these steps.
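The learning rate here is illustrative; names carry over from the sketches above.

```python
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
                                metrics=accuracy).to_fp16()
learn = learn.load_encoder('finetuned')   # the language-model encoder we saved earlier

learn.fit_one_cycle(1, 2e-2)              # only the new, randomly added classifier head is trained
```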
That's pretty similar to what we got back in lesson one, but rather than taking about 12 minutes
once all the pre-training's been done it takes about 30 seconds this is quite cool you can 00:46:47.120 |
create a language model for your kind of general area of interest and then you can create all 00:46:54.400 |
kinds of different classifiers pretty quickly. And so that's just from fine-tuning the final, randomly added layer. So now we could just unfreeze
and keep learning but something we found is for NLP it's actually better to only unfreeze 00:47:18.280 |
one layer group at a time rather than unfreezing the whole model at once. In this case the last layer has already been unfrozen automatically, and then to unfreeze the last couple of layer groups we can say freeze_to(-2) and train a little bit more, and look at this: after a bit over a minute we're already easily beating what we got in lesson one. Then freeze_to(-3) to unfreeze another few layers, and now we're up to 94%; and then finally we unfreeze the whole model and we're up to about 94.3% accuracy, and that was literally the state of the art for this very heavily studied dataset just three years ago. Here's a little sketch of this gradual unfreezing.
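This continues from the learner in the previous sketch; the discriminative learning rates are illustrative.

```python
learn.freeze_to(-2)                                   # unfreeze the last two parameter groups
learn.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2))

learn.freeze_to(-3)                                   # unfreeze one more group
learn.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3))

learn.unfreeze()                                      # finally, unfreeze the whole model
learn.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3))
```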
If you also reverse all of the reviews to make them go backwards, and train a second model on the backwards
version and then average the predictions of those two models as an ensemble you get to 00:48:14.320 |
95.1% accuracy and that was the state of the art that we actually got in the ULM fit paper 00:48:20.480 |
and it was only beaten for the first time a few months ago using a way way bigger model 00:48:26.120 |
way more data way more compute and way more data augmentation I should mention actually 00:48:34.120 |
with the data augmentation one of the cool things they did do was they actually figured 00:48:37.480 |
out also a way to even beat our 95.1 with less data as well so I should mention that 00:48:42.880 |
actually, data augmentation has, since we created the ULMFiT paper, become a really, really important approach. Any questions, Rachel? Can someone explain how
a model trained to predict the last word in the sentence can generalize to classify sentiment 00:49:01.760 |
they seem like different domains yeah that's a great question they're very different domains 00:49:07.640 |
and it's it's really amazing and basically the trick is that to be able to predict the 00:49:14.440 |
next word of a sentence you just have to know a lot of stuff about not only the language 00:49:19.640 |
but about the world so if you know let's say we wanted to finish the next word of this 00:49:27.080 |
sentence: by training a model on all the text read backwards and averaging the predictions of these two models, we can even get to 95.1% accuracy, which was the state of the art introduced by the... what? So to be able to fill in the word "ULMFiT" there, you would have to know a whole lot of stuff: that there's a thing called fine-tuning pre-trained language models, which approaches get which results, and that ULMFiT got this particular result. I mean, that would be an amazing language model that could
fill that in correctly I'm not sure that any language models can but to give you a sense 00:50:02.360 |
of like what you have to be able to do to be good at language modeling so if you're 00:50:08.760 |
going to be able to predict the next word of a sentence like wow I really love this 00:50:15.320 |
movie, I love every movie containing Meg... what? Right, maybe it's Ryan. You'd have to know about
like the fact that Meg Ryan is an actress and actresses are in movies and so forth so 00:50:29.120 |
when you know so much about English and about the world to then turn that into something 00:50:35.480 |
which recognizes that I really love this movie is a good thing rather than a bad thing is 00:50:42.400 |
just not a very big step. And as we saw, you can actually get that far by fine-tuning just the very last layer or two. So it is amazing, and I think that's super,
super cool all right another question how would you do data augmentation on text well 00:51:12.520 |
you would probably Google for unsupervised data augmentation and read this paper and 00:51:17.000 |
things that have cited it so this is the one that easily beat our IMDB result with only 00:51:24.720 |
20 labeled examples which is amazing right and so they did things like if I remember 00:51:33.080 |
correctly translate every sentence into a different language and then translate it back 00:51:39.080 |
again so you kind of get like different rewordings of the sentence that way yeah so kind of tricks 00:51:46.160 |
like that now let's go back to the generation thing so remember we saw that we can generate 00:52:05.080 |
context appropriate sentences and it's important to think about what that means in practice 00:52:10.640 |
when you can generate context appropriate sentences have a look for example at even 00:52:15.000 |
before this technology existed in 2017 the FCC asked for comments about a proposal to 00:52:24.120 |
repeal net neutrality and it turned out that less than 800,000 of the 22 million comments 00:52:36.880 |
actually appeared to be unique, and this particular person, Jeff Kao, discovered that a lot of
the submissions were slightly different to each other by kind of like picking up different 00:52:52.000 |
you know the green bit would either be citizens or people like me or Americans and the red 00:52:57.820 |
bit would be as opposed to or rather than and so forth so like and that made a big difference 00:53:05.080 |
to I believe to American policy you know here's an example of reddit conversation you're wrong 00:53:14.880 |
the defense budget is a good example of how badly the US spends money on the military 00:53:18.400 |
dot dot dot somebody else yeah but that's already happening there's a huge increase 00:53:22.220 |
in the military budget I didn't mean to sound like stop paying for the military I'm not 00:53:26.520 |
saying that we cannot pay the bills... all of these are actually created by a language model (GPT-2). And this is a very concerning thing around disinformation: never
mind fake news never mind deep fakes think about like what would happen if somebody invested 00:53:45.240 |
a few million dollars in creating a million Twitter bots and Facebook groups bots and 00:53:52.600 |
Weibo bots and made it so that 99% of the content on social networks were deep learning 00:54:03.200 |
bots and furthermore they were trained not just to optimize the next word of a sentence 00:54:08.040 |
but were trained to optimize the level of disharmony created, or the level of agreeableness for half of them and disagreeableness for the other half. You know, you could
create like a whole lot of you know just awful toxic discussion which is actually the goal 00:54:29.620 |
of a lot of propaganda outfits it's not so much to push a particular point of view but 00:54:35.620 |
to make people feel like there's no point engaging because the truth is too hard to 00:54:41.120 |
understand, or whatever. So Rachel and I are both super worried about what could happen to discourse now that we have this incredibly powerful tool, and we don't have a great sense of what to do about it. Algorithms are unlikely
to save us here if you could create a classifier which could do a good job of figuring out 00:55:08.240 |
whether something was generated by a algorithm or not then I could just use your classifier 00:55:14.120 |
as part of my training loop to train an algorithm that can actually learn to trick your classifier 00:55:20.480 |
so this is a real worry and the only solutions I've seen are those which are kind of based 00:55:26.840 |
on cryptographic signatures which is another whole can of worms which has never really 00:55:34.240 |
been properly sorted out at least not in the Western world in a privacy centric way all 00:55:41.280 |
right so yes I'll add and I'll link to this on the forums I gave a keynote at SciPy conference 00:55:49.800 |
last summer which is the scientific Python conference and went into a lot more detail 00:55:54.280 |
about the the threat that Jeremy is describing about using advanced language models to manipulate 00:56:00.960 |
public opinion and so if you want to kind of learn more about the dangers there and exactly 00:56:06.280 |
what that threat is you can find that in my SciPy keynote great thanks so much Rachel 00:56:13.400 |
so let's have a five-minute break and see you back here in five minutes 00:56:24.080 |
so we're going to finish with a kind of a segue into what will eventually be part two 00:56:32.720 |
of the course which is to go right underneath the hood and see exactly how a more complex 00:56:41.120 |
architecture works and specifically we're going to see how a new recurrent neural network 00:56:46.640 |
works do we have a question first in the previous lesson MNIST example you showed us that under 00:56:58.080 |
the hood the model was learning parts of the image like curves of a three or angles of 00:57:02.800 |
a seven is there a way to look under the hood of the language models to see if they are 00:57:07.160 |
learning rules of grammar and syntax would it be a good idea to fine-tune models with 00:57:12.600 |
examples of domain specific syntax like technical manuals or does that miss the point of having 00:57:18.920 |
the model learn for themselves yeah there are tools that allow you to kind of see what's 00:57:25.840 |
going on inside an NLP model we're not going to look at them in this part of the course 00:57:30.520 |
maybe we will in part two but certainly worth doing some research to see what you can find 00:57:35.000 |
and there's certainly PyTorch libraries you can download and play with yeah I mean I think 00:57:44.280 |
it's a perfectly good idea to incorporate some kind of technical manuals and stuff into 00:57:51.600 |
your training corpus there's actually been some recent papers on this general idea of 00:57:57.320 |
trying to kind of create some carefully curated sentences as part of your training corpus 00:58:05.680 |
it's unlikely to hurt and it could well help all right so let's have a look at RNNs now 00:58:15.880 |
when Sylvain and I started creating the RNN stuff for fast AI the first thing I did actually 00:58:25.200 |
was to create a new data set and the reason for that is I didn't find any data sets that 00:58:31.920 |
would allow for quick prototyping and really easy debugging so I made one which we call 00:58:39.400 |
human numbers, and it contains the first ten thousand numbers written out in English.
and I'm surprised at how few people create data sets I create data sets frequently you 00:58:58.360 |
know I specifically look for things that can be kind of small easy to prototype good for 00:59:03.340 |
debugging and quickly trying things out and very very few people do this even though like 00:59:09.240 |
this human numbers data set which has been so useful for us took me I don't know an hour 00:59:14.160 |
or two to create so this is definitely an underappreciated underutilized technique so 00:59:26.320 |
we can grab the human numbers data set and we can see that there's a training and a validation 00:59:31.260 |
text file we can open each of them and for now we're just going to concatenate the two 00:59:35.920 |
together into a file called lines and you can see that the contents are 1 2 3 etc and 00:59:44.920 |
so there's a new line at the end of each we can concatenate those all together and put 00:59:51.760 |
a full stop between them as so okay and then you could tokenize that by splitting on spaces 01:00:00.080 |
and so for example here's tokens 100 to 110: forty two, forty three, forty four, and
so forth so you can see I'm just using plain Python here there's not even any PyTorch certainly 01:00:12.760 |
not any fast AI to create a vocab we can just create all the unique tokens of which there 01:00:20.040 |
are 30 and then to create a lookup from so that's a lookup from a word to an ID sorry 01:00:29.320 |
from an ID to a word to go from a word to an ID we can just enumerate that and create 01:00:36.480 |
a dictionary from word to ID. So then we can numericalize our tokens by calling word-to-index on each one, and here are our tokens and the equivalent numericalized version. Here's a little sketch of all these by-hand steps.
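A minimal sketch of the human numbers preparation just described.

```python
from fastai.text.all import *

path = untar_data(URLs.HUMAN_NUMBERS)
lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())

text = ' . '.join([l.strip() for l in lines])    # "one . two . three . ..."
tokens = text.split(' ')                         # tokenize by splitting on spaces
vocab = L(*tokens).unique()                      # about 30 unique tokens
word2idx = {w: i for i, w in enumerate(vocab)}   # lookup from word to ID
nums = L(word2idx[t] for t in tokens)            # numericalized tokens
```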
so you can see in fairly small data sets when we don't have to worry about scale and speed 01:01:03.000 |
and the details of tokenization in English you can do the whole thing in just plain Python 01:01:11.760 |
the only other thing we did, to save a little bit of time, is use L — but you could 01:01:16.520 |
easily do that with the Python standard library in about the same amount of code. So hopefully 01:01:24.400 |
that gives you a good sense of what's really going on with tokenization and numericalization, all done by hand. 01:01:29.200 |
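For reference, here is a rough sketch of that by-hand pipeline in code — assuming the fastai imports used elsewhere in the course; variable names are illustrative rather than guaranteed to match the notebook exactly:

```python
# Rough sketch of the by-hand tokenization/numericalization just described.
# `L` is fastcore's list type, used here only for convenience.
from fastai.text.all import *

path = untar_data(URLs.HUMAN_NUMBERS)             # the human numbers dataset
lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())

text   = ' . '.join([l.strip() for l in lines])   # join the numbers with a full stop between them
tokens = text.split(' ')                          # tokenize by splitting on spaces

vocab    = L(*tokens).unique()                    # all unique tokens (about 30 of them)
word2idx = {w: i for i, w in enumerate(vocab)}    # lookup from word to ID
nums     = L(word2idx[t] for t in tokens)         # numericalized tokens
```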
So let's create a language model. One way to create a language model 01:01:35.520 |
would be to go through all of our tokens and let's create a range from zero to the length 01:01:41.840 |
of our tokens minus four and every three of them and so that's going to allow us to grab 01:01:47.360 |
three tokens at a time 1.2.3.4.5.6.7.8 and so forth right so here's the first three tokens 01:02:01.760 |
and then here's the fourth token and here's the second three tokens and here's the seventh 01:02:07.960 |
token and so forth so these are going to be our independent variables and this will be 01:02:13.160 |
our dependent variable so here's a super super kind of naive simple language model data set 01:02:23.520 |
for the human numbers question so we can do exactly the same thing as before but use the 01:02:30.920 |
numericalized version and create tensors this is exactly the same thing as before but now 01:02:35.600 |
numericalized and as tensors, and we can create a DataLoaders object from 01:02:45.840 |
data sets and remember these are data sets because they have a length and we can index 01:02:50.000 |
into them right and so we can just grab the first 80% of the tokens as the training set 01:02:55.800 |
the last 20% is the validation set like so batch size 64 and we're ready to go so we 01:03:04.640 |
really used very very little I mean the only pytorch we used was to create these tensors 01:03:11.360 |
and the only fast AI we used was to create the data loaders and it's just grabbing directly 01:03:16.960 |
from the data sets, so it's really not doing anything clever at all. 01:03:23.160 |
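A sketch of that naive "three tokens in, fourth token out" dataset and the DataLoaders, continuing from the previous snippet (the 80/20 split and batch size of 64 are as described; `seqs` is an illustrative name):

```python
# Independent variable: three numericalized tokens; dependent variable: the fourth.
seqs = L((tensor(nums[i:i+3]), nums[i+3])
         for i in range(0, len(nums)-4, 3))

bs  = 64
cut = int(len(seqs) * 0.8)                        # first 80% for training, last 20% for validation
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=bs, shuffle=False)
```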
So let's see if we can now create a neural network architecture which takes three numericalized 01:03:29.320 |
words at a time as input and tries to predict the fourth as the dependent variable. So here is just such 01:03:39.040 |
a language model it's a three layer neural network so we've got a linear layer here which 01:03:50.520 |
we're going to use once, twice, three times, and after each of them we call ReLU as per usual, 01:04:02.480 |
but there's a little bit more going on the first interesting thing is that rather than 01:04:08.760 |
each of these being a different linear layer we've just created one linear layer here which 01:04:18.720 |
we've reused as you can see one two three times so that's the first thing that's a bit 01:04:27.080 |
tricky and so there's a few things going on it's a bit a little bit different usual but 01:04:32.480 |
the basic idea is here we've got an embedding, an nn.Linear, another nn.Linear, and in here we've 01:04:38.600 |
used the linear layers and relu so it's very nearly a totally standard three layer neural 01:04:45.120 |
network — I guess four, really, because there's an output layer. Yes, we have a question? Sure. 01:04:51.440 |
is there a way to speed up fine-tuning the NLP model ten plus minutes per epoch slows 01:04:57.200 |
down the iterative process quite a bit any best practices or tips I can't think of any 01:05:04.760 |
obviously other than to say you don't normally need to fine-tune it that often you know the 01:05:13.400 |
work is often more at the classifier stage so yeah I tend to kind of just leave it running 01:05:20.280 |
overnight or while I have lunch or something like that. Yeah — just make 01:05:20.280 |
sure you don't sit there watching it; go and do something else. This is where it can be 01:05:24.200 |
quite handy to have a second GPU or fire up a second AWS instance or whatever so you can 01:05:37.680 |
kind of keep keep moving while something's training in the background all right so what's 01:05:46.960 |
going on here in this model to describe it we're actually going to develop a little kind 01:05:52.640 |
of pictorial representation and the pictorial representation is going to work like this 01:05:57.560 |
let's start with a simple linear model to define this pictorial representation a simple 01:06:04.520 |
linear model has an input of size batch size by number of inputs and so we're going to 01:06:10.960 |
use a rectangle to represent an input we're going to use an arrow to represent a layer 01:06:23.240 |
computation. So in this case there's going to be a matrix product — for a simple linear 01:06:26.760 |
model there'd be a matrix product; actually, sorry, this is a single hidden layer model, so there'll be 01:06:32.120 |
a matrix product followed by a ReLU. So that's what this arrow represents, and out of that 01:06:37.560 |
we're going to get some activations and so circles represent computed activations and 01:06:43.320 |
there would be we call this a hidden layer it'll be a size batch size by number of activations 01:06:48.080 |
that's its size and then to create a neural net we're going to do a second matrix product 01:06:53.920 |
and this time a softmax so the computation again represented by the arrow and then output 01:06:59.280 |
activations are a triangle so the output would be batch size by num classes so let me show 01:07:05.760 |
you the pictorial version of this so this is going to be a legend triangle is output 01:07:14.560 |
circle hidden rectangle input and here it is we're going to take the first word as an 01:07:23.440 |
input — it's going to go through a linear layer and a ReLU, and you'll notice here I've deleted 01:07:31.080 |
the details of what the operations are at this point and I've also deleted the sizes 01:07:38.520 |
so every arrow is basically just a linear layer followed by a nonlinearity so we take 01:07:44.120 |
the word one input and we put it through the layer the linear layer and the nonlinearity 01:07:53.840 |
to give us some activations so there's our first set of activations and when we put that 01:08:00.680 |
through another linear layer and nonlinearity to get some more activations and at this point 01:08:07.760 |
we get word two, and word two now goes through a linear layer and a nonlinearity, and these 01:08:18.900 |
two when two arrows together come to a circle it means that we add or concatenate either 01:08:24.360 |
is fine the two sets of activations so we'll add the set of activations from this input 01:08:32.400 |
to the set of activations from here to create a new set of activations and then we'll put 01:08:38.720 |
that through another linear layer and a ReLU, and again word three is now going to come 01:08:43.100 |
in and go through a linear layer and a ReLU, and they'll get added to create another set 01:08:46.680 |
of activations, and then they'll finally go through a final linear layer and softmax 01:08:54.600 |
to create our output activations so this is our model it's basically a standard one two 01:09:05.000 |
three four layer model but a couple of interesting things are going on the first is that we have 01:09:11.640 |
inputs coming in to later layers and get added so that's something we haven't seen before 01:09:17.240 |
and the second is all of the arrows that are the same color use the same weight matrix 01:09:22.920 |
so every time we get an input we're going to put it through a particular weight matrix 01:09:28.000 |
and every time we go from one set of activations to the next we'll put it through a different 01:09:32.160 |
weight matrix and then to go through the activations to the output we'll use a different weight 01:09:36.960 |
matrix so if we now go back to the code to go from input to hidden not surprisingly we 01:09:45.840 |
always use an embedding so in other words an embedding is the green okay and you'll 01:09:52.400 |
see we just create one embedding, and here is the first — so here's x, which is the three words; 01:09:59.080 |
so here's the first word x0 and it goes through that embedding and word 2 goes through the 01:10:04.800 |
same embedding and word 3 index number 2 goes through the same embedding and then each time 01:10:10.440 |
you see, we add it to the current set of activations. And so having got the embedding, we 01:10:18.040 |
then put it through this linear layer; and again we get the embedding, add it 01:10:24.400 |
to the activations, and put it through that linear layer; and again the same 01:10:28.960 |
thing here put it through the same linear layer so H is the orange so these set of activations 01:10:38.880 |
we call the hidden state okay and so the hidden state is why it's called H and so if you follow 01:10:47.520 |
through these steps you'll see how each of them corresponds to a step in this diagram 01:10:53.480 |
and then finally at the end we go from the hidden state to the output which is this linear 01:10:58.560 |
layer, hidden state to the output. Okay, and then we don't have the actual softmax there because 01:11:08.960 |
as you'll remember we can incorporate that directly into the loss function the cross 01:11:12.880 |
entropy loss function and using pytorch so one nice thing about this is everything we're 01:11:21.400 |
using we have previously created from scratch so there's nothing magic here we've created 01:11:25.800 |
our own embedding layer from scratch we've created our own linear layer from scratch 01:11:29.580 |
we've created our own relu from scratch we've created our own cross entropy loss from scratch 01:11:36.380 |
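To make the walkthrough concrete, here is a minimal sketch of the model being described — essentially what the book calls LMModel1 (the name and the hidden size of 64 are illustrative):

```python
# One embedding (input->hidden), one reused linear layer (hidden->hidden),
# and one output linear layer (hidden->output).
class LMModel1(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)   # green arrows in the diagram
        self.h_h = nn.Linear(n_hidden, n_hidden)      # orange arrows
        self.h_o = nn.Linear(n_hidden, vocab_sz)      # blue arrow

    def forward(self, x):
        h = F.relu(self.h_h(self.i_h(x[:, 0])))       # word 1
        h = h + self.i_h(x[:, 1])                     # add word 2's embedding
        h = F.relu(self.h_h(h))
        h = h + self.i_h(x[:, 2])                     # add word 3's embedding
        h = F.relu(self.h_h(h))
        return self.h_o(h)                            # predict word 4 (softmax lives in the loss)

learn = Learner(dls, LMModel1(len(vocab), 64),
                loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)
```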
so you can actually try building this whole thing yourself from scratch. Just in terms of 01:11:45.640 |
the nomenclature — i_h: h refers to hidden, so this is a layer that goes from 01:11:52.240 |
input to hidden this is one that goes from hidden to hidden this is one that goes from 01:11:56.640 |
hidden to output so if any of this is feeling confusing at any point go back to where we 01:12:03.360 |
actually created each one of these things from scratch and create it from scratch again make 01:12:07.680 |
sure you actually write the code so that nothing here is mysterious so why do we use the same 01:12:19.400 |
embedding matrix each time we have a new input word for an input word index 0 1 and 2 well 01:12:26.160 |
because conceptually they all represent English words you know for human numbers so why would 01:12:35.040 |
you expect them to be a different embedding they all should have the same representation 01:12:39.320 |
they all have the same meaning. Same for this hidden to hidden: each time we're basically 01:12:44.000 |
describing how to go from one token to the next of our language model, so we'd expect 01:12:49.160 |
it to be the same computation so that's basically what's going on here so having created that 01:12:58.000 |
model we can go ahead and instantiate it so we're going to have to pass in the vocab size 01:13:05.080 |
for the embedding and the number of hidden right so that's number of activations so here 01:13:13.880 |
we create the model and then we create a learner by passing in a model and our data loaders 01:13:20.440 |
and a loss function and optionally metrics and we can fit now of course this is not pre-trained 01:13:29.800 |
right, this is not an application-specific learner, so it wouldn't know what pre-trained model 01:13:34.600 |
to use so this is all random and we're getting somewhere around 45 to 50% or so accuracy 01:13:44.360 |
is that any good well you should always compare to random or not random you should always 01:13:49.520 |
compare to like the simplest model where the simplest model is like some average or something 01:13:55.080 |
so what I did is I grabbed the validation set so all the tokens put it into a Python 01:14:00.600 |
standard library counter which simply counts how many times each thing appears I found 01:14:05.520 |
that the word thousand is the most common and then I said okay what if we used seven 01:14:12.520 |
thousand one hundred and four thousand that's here and divide that by the length of the 01:14:18.600 |
tokens and we get 15% which in other words means if we always just predicted I think 01:14:23.600 |
the next word will be thousand we would get 15% accuracy but in this model we got around 01:14:30.960 |
45 to 50% accuracy so in other words our model is a lot better than the simplest possible 01:14:37.800 |
baseline so we've learned something useful that's great so the first thing we're going 01:14:43.600 |
to do is we're going to refactor this code because you can see we've got X going into 01:14:52.280 |
IH going into HH going into ReLU X going into IH going into HH going to ReLU X going into 01:14:59.040 |
IH going to HH going to ReLU how would you refactor that in Python you would of course 01:15:04.720 |
use a for loop so let's go ahead and write that again so these lines of code are identical 01:15:11.920 |
in fact these lines of code are identical as is this and we're going to instead of doing 01:15:17.080 |
all that stuff manually we create a loop that goes through three times and in each time 01:15:21.380 |
it goes IH add to our hidden HH ReLU and then at the end hidden to output so this is exactly 01:15:31.100 |
the same thing as before but it's just refactored with a for loop and we can train it again 01:15:38.040 |
and again we get the same basically 45 to 50% as you would expect because it's exactly 01:15:43.320 |
the same it's just been refactored so here's something crazy this is a recurrent neural 01:15:54.600 |
network, even though it's exactly the same as this, right — 01:16:07.720 |
it's just been refactored into a loop and so believe it or not that's actually all an 01:16:14.560 |
RNN is: an RNN is a simple refactoring of that model — that deep learning model 01:16:25.120 |
we saw; I shouldn't say linear model — a deep learning model of simple linear layers with ReLUs. 01:16:34.480 |
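In code, the refactoring looks roughly like this (the book's LMModel2, give or take):

```python
class LMModel2(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)

    def forward(self, x):
        h = 0                                   # 0 broadcasts against the first embedding
        for i in range(3):                      # same three steps as before, just in a loop
            h = h + self.i_h(x[:, i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)
```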
so let's draw our pictorial representation again so remember this was our previous pictorial 01:16:40.480 |
representation we can refactor the picture as well so instead of showing these dots separately 01:16:46.880 |
we can just take this arrow and represent it as a loop, because that's 01:16:53.680 |
all that's happening right so the word one goes through an embedding goes into this activations 01:16:59.680 |
which then just gets repeated, from 2 to n-1, where n this time — you know, 01:17:06.080 |
we've got three words — basically for each word coming in. And so we've just refactored 01:17:13.200 |
our diagram and then eventually it goes through our blue to create the output so this diagram 01:17:20.160 |
is exactly the same as this diagram just replacing the middle with that loop so that's a recurrent 01:17:28.240 |
neural net and so h remember was something that we just kept track of here h h h h h 01:17:35.360 |
h as we added each layer to it and here we just have it inside the loop we initialize 01:17:44.880 |
it as 0 which is kind of tricky and the reason we can do that is that 0 plus a tensor will 01:17:51.640 |
broadcast this 0 so that's a little neat feature that's why we don't have to make this a particular 01:17:57.980 |
size tensor to start with okay so we're going to be seeing the word hidden state a lot and 01:18:08.560 |
so it's important to remember that hidden state simply represents these activations 01:18:14.000 |
that are occurring inside our recurrent neural net and a recurrent neural net is just a refactoring 01:18:19.720 |
of a particular kind of fully connected deep model so that's it that's what an RNN is no 01:18:33.160 |
questions at this point Rachel something that's a bit weird about it though is that for every 01:18:45.080 |
batch we're setting our hidden state to 0, even though we're going through the entire 01:18:50.640 |
set of numbers the human numbers data set in order so you would think that by the time 01:18:56.580 |
you've gone like one two three you shouldn't then forget everything we've learnt when you 01:19:00.560 |
go to four five six right it would be great to actually remember where we're up to and 01:19:08.080 |
not reset the hidden state back to zero every time so we can absolutely do that we can maintain 01:19:16.640 |
the state of our RNN and here's how we would do that rather than having something called 01:19:26.120 |
H we'll call it self dot H and we'll set it to zero at the start when we first create 01:19:32.520 |
our model everything else here is the same and everything else here is the same and then 01:19:42.200 |
there's just one extra line of code here what's going on here well here's the thing if if 01:19:50.200 |
H is something which persists from batch to batch then effectively this loop is effectively 01:20:00.400 |
kind of becoming infinitely long, right? Our deep learning model therefore is effectively getting — 01:20:08.920 |
well, not infinitely deep, but as deep as the entire size of our data set, because every 01:20:13.560 |
time we're stacking new layers on top of the previous layers the reason this matters is 01:20:20.500 |
that when we then do back propagation when we then calculate the gradients we're going 01:20:24.840 |
to have to calculate the gradients all the way back through every layer going all the 01:20:31.000 |
way so by the time we get to the end of the data set we're going to be effectively back 01:20:35.480 |
propagating not just through this loop but remember self dot H was created also by the 01:20:42.400 |
previous call to forward, and the previous call to forward, and the previous call to forward, 01:20:45.920 |
so we're going to have this incredibly slow calculation of the gradients all the way back 01:20:53.840 |
to the start it's also going to use up a whole lot of memory because it's going to have to 01:20:58.880 |
store all those intermediate gradients in order to calculate them so that's a problem 01:21:05.400 |
and so the problem is easily solved by saying detach and what detach does is it basically 01:21:11.920 |
says throw away my gradient history — forget that I was calculated from some 01:21:18.240 |
other gradients so the activations are still stored but the gradient history is no longer 01:21:24.040 |
stored and so this kind of cuts off the gradient computation and so this is called truncated 01:21:31.960 |
back propagation so exactly the same lines of code as the other two models H equals zero 01:21:40.840 |
has been moved into self dot H equals zero these lines of code are identical and we've 01:21:46.800 |
added one more line of code. So the only other thing is that from time to time we might have 01:21:52.400 |
to reset self.h to zero, so I've created a method for that, and we'll see how that works shortly. 01:21:59.200 |
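Put together, the stateful version looks roughly like this (close to the book's LMModel3; the detach call is the truncation being described):

```python
class LMModel3(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = 0                              # hidden state, kept from batch to batch

    def forward(self, x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:, i])
            self.h = F.relu(self.h_h(self.h))
        out = self.h_o(self.h)
        self.h = self.h.detach()                # keep the values, drop the gradient history
        return out

    def reset(self): self.h = 0                 # called at the start of each epoch (see below)
```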
Okay — so, "back propagation"... sorry, I was using the wrong jargon: back propagation through 01:22:08.360 |
time is what we call it when we calculate the backprop going back through this 01:22:17.480 |
loop all right now we do need to make sure that the samples are seen in the correct order 01:22:24.920 |
you know given that we need to make sure that every batch connects up to the previous batch 01:22:30.960 |
so go back to notebook 10 to remind yourself of what that needs to look like but basically 01:22:35.440 |
the first batch — we see that the length of our sequences divided by the batch 01:22:42.840 |
size is 328 so the first batch will be our index number 0 then M then 2 times M and so 01:22:50.440 |
forth the second batch will be 1 M plus 1 2 times M plus 1 and so forth so the details 01:22:56.800 |
don't matter, but here's how we do that indexing. So now we can go ahead 01:23:03.800 |
and call that group_chunks function to create our training set and our validation 01:23:12.080 |
set — and certainly don't shuffle, because that would break everything in terms of the ordering. 01:23:20.200 |
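Here is roughly what that indexing amounts to, plus the DataLoaders call with shuffling turned off (the name group_chunks follows what's being described; drop_last avoids a ragged final batch):

```python
def group_chunks(ds, bs):
    m = len(ds) // bs                           # e.g. 328 in the example above
    new_ds = L()
    for i in range(m):                          # batch i is items i, i+m, i+2m, ...
        new_ds += L(ds[i + m*j] for j in range(bs))
    return new_ds

cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(
    group_chunks(seqs[:cut], bs),
    group_chunks(seqs[cut:], bs),
    bs=bs, drop_last=True, shuffle=False)       # order matters, so no shuffling
```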
and then there's one more thing we need to do which is we need to make sure that at the 01:23:25.440 |
start of each epoch we call reset because at the start of the epoch we're going back 01:23:33.160 |
to the start of our natural numbers so we need to set self.h back to 0 so something 01:23:41.680 |
that we'll learn about in part 2 is that fastai has something called callbacks and callbacks 01:23:50.720 |
are classes which allow you to basically say during the training loop I want you to call 01:23:56.760 |
some particular code and in particular this is going to call this code and so you can 01:24:09.880 |
see callbacks are very small or can be very small they're normally very small when we 01:24:13.480 |
start training it'll call reset when we start validation it'll call reset so this is each 01:24:20.120 |
epoch and when we're all finished fitting it will call reset and what does reset do 01:24:25.400 |
it does whatever you tell it to do and we told it to be self.h=0 so if you want to use 01:24:32.520 |
a callback you can simply add it to the callbacks list, cbs, when you create your learner. 01:24:32.520 |
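For reference, a callback like the one being described amounts to just this — fastai ships one called ModelResetter that does essentially the same job — and you pass it via cbs when you create the Learner:

```python
class ModelResetter(Callback):
    def before_train(self):    self.model.reset()   # start of the training phase of each epoch
    def before_validate(self): self.model.reset()   # start of the validation phase
    def after_fit(self):       self.model.reset()   # once training has finished

learn = Learner(dls, LMModel3(len(vocab), 64),
                loss_func=F.cross_entropy, metrics=accuracy,
                cbs=ModelResetter)
learn.fit_one_cycle(10, 3e-3)
```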
So now when we train, that's way better. Okay — so this is called 01:24:49.800 |
a stateful RNN: it's actually keeping the hidden state from batch to batch. 01:25:00.480 |
Now, we've still got a bit of an obvious problem here, which is that if you look back to the 01:25:07.000 |
data that we created we used these first three tokens to predict the fourth and then the 01:25:17.320 |
next three tokens to predict the seventh and then the next three tokens to predict the 01:25:24.640 |
one after, and so forth. Effectively what we would rather do, you would think, is predict every 01:25:32.480 |
word not just every fourth word it seems like we're throwing away a lot of signal here which 01:25:37.160 |
is pretty wasteful so we want to create more signal and so the way to do that would be 01:25:45.560 |
rather than putting this output stage outside 01:25:56.560 |
the loop right so this dotted area is a bit that's looped what if we put the output inside 01:26:04.360 |
the loop so in other words after every hidden state was created we immediately did a prediction 01:26:10.520 |
and so that way we could predict after every time step and our dependent variable could 01:26:15.760 |
be the entire sequence of numbers offset by one so that would give us a lot more signal 01:26:22.360 |
so we have to change our data so the dependent variable has each of the next three words 01:26:27.520 |
after each of the three inputs so instead of being just the numbers from I to I plus 01:26:34.440 |
SL as input and then I plus SL plus one as output we're going to have the entire set 01:26:42.120 |
offset by one as our dependent variable so and then we can now do exactly the same as 01:26:48.080 |
we did before to create our data loaders. And so you can now see that each sequence — 01:26:56.080 |
the independent variable — has the dependent variable as the same thing but 01:27:00.240 |
offset by one okay and then we need to modify our model very slightly this code is all exactly 01:27:09.200 |
the same as before but rather than now returning one output will create a list of outputs and 01:27:15.960 |
we'll append to that list after every element of the loop and then at the end we'll stack 01:27:22.880 |
them all up and then this is the same so it's nearly exactly the same okay just a very minor 01:27:29.080 |
change: our loss function. We need to create our own loss function, which is just 01:27:37.680 |
a cross-entropy loss, but we need to flatten things out — so the target gets flattened out, the 01:27:47.560 |
input gets flattened out — and then we can pass that as our loss function; everything else here is the same. 01:27:56.040 |
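Concretely, the changes just described look roughly like this: the dependent variable is the input sequence shifted by one, the model collects an output at every step, and the loss flattens both before calling cross-entropy. The sequence length sl isn't pinned down in the lesson at this point, so treat 16 as an assumption (it's the kind of value the book uses):

```python
sl = 16                                          # sequence length (assumed value)
seqs = L((tensor(nums[i:i+sl]), tensor(nums[i+1:i+sl+1]))
         for i in range(0, len(nums)-sl-1, sl))
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(group_chunks(seqs[:cut], bs),
                             group_chunks(seqs[cut:], bs),
                             bs=bs, drop_last=True, shuffle=False)

class LMModel4(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = 0

    def forward(self, x):
        outs = []
        for i in range(sl):
            self.h = self.h + self.i_h(x[:, i])
            self.h = F.relu(self.h_h(self.h))
            outs.append(self.h_o(self.h))        # predict after every time step
        self.h = self.h.detach()
        return torch.stack(outs, dim=1)          # shape: batch x sl x vocab_sz

    def reset(self): self.h = 0

def loss_func(inp, targ):
    # flatten predictions and targets so plain cross-entropy can handle them
    return F.cross_entropy(inp.view(-1, len(vocab)), targ.view(-1))
```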
And we can fit, and we've gone from — I can't remember — 58 to 64%, so it's 01:28:12.020 |
improved a little bit, so that's good. You know, we did find this a little flaky — sometimes 01:28:21.800 |
it would train really well sometimes it wouldn't train great but sometimes we you know we often 01:28:26.880 |
got this reasonably good answer now one problem here is although effectively we have quite 01:28:38.280 |
a deep neural net if you kind of go back to the version so this this version where we 01:28:45.400 |
have the loop in it is kind of the normal way to think about an RNN but perhaps an easier 01:28:50.400 |
way to think about it is what we call the unrolled version and the unrolled version 01:28:55.960 |
is when you look at it like this now if you unroll this stateful neural net we have you 01:29:03.360 |
know it's it is quite deep but every single one of the hidden to hidden layers uses exactly 01:29:11.560 |
the same weight matrix so really it's not really that deep at all because it can't really 01:29:19.360 |
do very sophisticated computation because it has to use the same weight matrix every 01:29:23.520 |
time, so in some ways it's not really any smarter than a plain linear model. So it would be nice 01:29:32.860 |
to try to you know create a truly deep model have multiple different layers that it can 01:29:39.760 |
go through so we can actually do that easily enough by creating something called a multi-layer 01:29:45.880 |
RNN and all we do is we basically take that diagram we just saw before and we repeat it 01:29:53.640 |
but — and this is actually a bit unclear in the diagram — the dotted arrows here are different weight matrices 01:30:02.520 |
to the non-dotted arrows here so we can have a different hidden to hidden weight matrix 01:30:09.060 |
in the kind of second set of RNN layers and a different weight matrix here for the second 01:30:16.860 |
set and so this is called a stacked RNN or a multi-layer RNN and so here's the same thing 01:30:26.640 |
in the unrolled version right so this is exactly the same thing but showing you the unrolled 01:30:32.240 |
version. Writing this out by hand maybe that's quite a good exercise or particularly this 01:30:41.120 |
one would be quite a good exercise but it's kind of tedious so we're not going to bother 01:30:44.800 |
instead we're going to use PyTorch's RNN class, and so PyTorch's RNN class is basically 01:30:51.040 |
doing exactly what we saw here right and specifically this this part here and this part here right 01:31:04.960 |
but it's nice that it also has an extra number of layers parameter that lets you tell it how 01:31:14.400 |
many to stack on top of each other. So it's important, when you start using PyTorch's 01:31:19.480 |
RNN to realize there's nothing magic going on right you're just using this refactored 01:31:26.120 |
for loop that we've already seen so we still need the input to hidden embedding this is 01:31:32.680 |
now the hidden to hidden with the loop all done for us and then this is the hidden to 01:31:38.160 |
output just as before and then this is our hidden just like before so now we don't need 01:31:45.280 |
the loop we can just call self.rnn and it does the whole loop for us we can do all the input 01:31:51.640 |
to hidden at once to save a little bit of time because thanks to the wonder of embedding 01:31:56.500 |
matrices and as per usual we have to go detach to avoid getting a super deep effective network 01:32:05.160 |
and then pass it through our output linear layer so this is exactly the same as the previous 01:32:11.760 |
model except that we have just refactored it using nn.RNN, and we said we want more than 01:32:19.160 |
one layer so let's request say two layers we still need the model reset it just like 01:32:26.040 |
before, because remember nothing's changed. And let's go ahead and fit — and, oh, it's terrible. 01:32:36.160 |
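For reference, the two-layer nn.RNN version just trained looks roughly like this. One caveat: in this sketch the hidden state is created lazily on the first forward pass so it lands on the same device as the data; the notebook allocates it up front with the batch size instead, but the idea is the same.

```python
class LMModel5(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True)  # the loop, done for us
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.n_layers, self.n_hidden = n_layers, n_hidden
        self.h = None

    def forward(self, x):
        if self.h is None:                       # lazily create the hidden state
            self.h = torch.zeros(self.n_layers, x.shape[0], self.n_hidden, device=x.device)
        res, h = self.rnn(self.i_h(x), self.h)   # all the input-to-hidden at once, then the loop
        self.h = h.detach()                      # truncated backprop through time, as before
        return self.h_o(res)

    def reset(self): self.h = None

learn = Learner(dls, LMModel5(len(vocab), 64, 2),
                loss_func=loss_func, metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)
```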
so why is it terrible well the reason it's terrible is that now we really do have a very 01:32:45.000 |
deep model and very deep models are really hard to train because we can get exploding 01:32:55.400 |
or disappearing activations so what that means is we start out with some initial state and 01:33:07.560 |
we're gradually putting it through all of these layers and all of these layers right 01:33:12.280 |
and so each time we're doing a matrix multiplication which remember is just doing a whole bunch 01:33:16.800 |
of multiplies and adds and then we multiply and add and we multiply and add and we multiply 01:33:21.800 |
and add and we multiply and add and if you do that enough times you can end up with very 01:33:28.120 |
very big results — that would be if the kinds of things we're multiplying and 01:33:33.000 |
adding by are pretty big — or very, very small results, particularly because we're 01:33:38.160 |
putting it through the same layer again and again right and why is that a problem well 01:33:46.520 |
if you multiply by 2 a few times you get 1 2 4 8 etc and after 32 steps you're already 01:33:54.840 |
at 4 billion or if you start at 1 and you multiply by half a few times after 32 steps 01:34:05.040 |
you're down to this tiny number. So a number even slightly higher or lower than 1 can 01:34:11.040 |
kind of cause an explosion or disappearance of a number and matrix multiplication is just 01:34:16.000 |
multiplying numbers and adding them up so exactly the same thing happens to matrix multiplication 01:34:20.840 |
you kind of have matrices that grow really big or grow really small and when that does 01:34:29.420 |
that you're also going to have exactly the same things happening to the gradients they'll 01:34:33.080 |
get really big really small and one of the problems here is that numbers are not stored 01:34:39.440 |
precisely in a computer they're stored using something called floating point so we stole 01:34:48.720 |
this nice diagram from this article called what you never wanted to know about floating 01:34:54.240 |
point but what we're forced to find out and here we're at this point where we're forced 01:34:57.380 |
to find out and it's basically showing us the granularity with which numbers are stored 01:35:02.720 |
and so the numbers that are further away from zero are stored much less precisely than the 01:35:09.760 |
numbers that are close to zero and so if you think about it that means that the gradients 01:35:15.960 |
further away from zero — for very big numbers — could actually become zero 01:35:27.160 |
themselves, because you could end up with two numbers that are between 01:35:33.280 |
these kind of little gradations here and you actually end up the same thing with the really 01:35:41.120 |
small numbers because they're really small numbers although they're closer together the 01:35:44.840 |
numbers that they represent are also very close together so in both cases they're kind 01:35:49.840 |
of the relative accuracy gets worse and worse so you really want to avoid this happening 01:36:02.000 |
there's a number of ways to avoid this happening and this is the same for really deep convolutional 01:36:08.760 |
neural nets or really deep kind of tabular standard tabular networks anytime you have 01:36:16.040 |
too many layers it can become difficult to train and you generally have to use like either 01:36:19.680 |
really small learning rates or you have to use special techniques that avoid exploding 01:36:27.400 |
or disappearing activations or gradients. For RNNs one of the most popular approaches 01:36:34.380 |
to this is to use an architecture called an LSTM and I am not going to go into the details 01:36:43.840 |
of an LSTM from scratch today, but it's in the book and in the notebook. But 01:36:50.920 |
the key thing to know about an LSTM — let's have a look — is that rather than just being 01:36:59.800 |
a matrix multiplication it is this which is that there are a number of linear layers that 01:37:10.120 |
it goes through and those linear layers are combined in particular ways and the way they're 01:37:15.160 |
combined which is shown in this kind of diagram here is that it basically is designed such 01:37:22.340 |
that the that there are like little mini neural networks inside the layer which decide how 01:37:32.360 |
much of the previous state is kept how much is thrown away and how much of the new state 01:37:38.560 |
is added and by letting it have little neural nets to kind of calculate each of these things 01:37:44.600 |
it allows the LSTM layer which again is shown here to decide how much of kind of how much 01:37:54.280 |
of an update to do at each time and then with that capability it basically allows it to 01:38:03.560 |
avoid kind of updating too much or updating too little. And by the way, this code you 01:38:14.600 |
can refactor — which Sylvain did here — into a much smaller amount of code, but these two are equivalent. 01:38:22.680 |
So as I said I'm not going to worry too much about the details of how this works now the 01:38:27.920 |
important thing just to know is that you can replace the matrix multiplication in an RNN 01:38:35.320 |
with this sequence of matrix multiplications and sigmoids and times and plus and when you 01:38:42.820 |
do so you will very significantly decrease the amount of gradient or activation exploding 01:38:51.680 |
explosions or disappearances. So that's called an LSTM cell and an RNN which uses this instead 01:39:00.560 |
of a matrix multiplication is called an LSTM and so you can replace NN.RNN with NN.LSTM. 01:39:11.560 |
Other than that we haven't really changed anything except that LSTMs because they have 01:39:19.720 |
more of these layers in them we actually have to make our hidden state have more layers in 01:39:27.040 |
as well but other than that we can just replace RNN with LSTM and we can call it just the same 01:39:35.960 |
way as we did before we can detach just like before but that's now a list so we have to 01:39:40.640 |
detach all of them and pop it through our output layer which is exactly as before reset 01:39:46.320 |
is just as before except it's got to loop through each one and we can fit it in exactly 01:39:52.240 |
the same way as before, and as you can see we end up with a much better result, which is great. 01:39:59.320 |
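The LSTM version is the same sketch with nn.LSTM swapped in; its state is a pair of tensors (hidden state and cell state), which is why detach and reset have to handle both:

```python
class LMModel6(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.n_layers, self.n_hidden = n_layers, n_hidden
        self.h = None                            # will hold the (hidden, cell) pair

    def forward(self, x):
        if self.h is None:
            z = torch.zeros(self.n_layers, x.shape[0], self.n_hidden, device=x.device)
            self.h = (z, z.clone())
        res, h = self.rnn(self.i_h(x), self.h)
        self.h = tuple(h_.detach() for h_ in h)  # detach both parts of the state
        return self.h_o(res)

    def reset(self): self.h = None

learn = Learner(dls, LMModel6(len(vocab), 64, 2),
                loss_func=loss_func, metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 1e-2)
```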
We have two questions. Okay, perfect. Could we somehow use regularization to try 01:40:05.480 |
to make the RNN parameters close to the identity matrix or would that cause bad results because 01:40:11.200 |
the hidden layers want to deviate from the identity during training? So we're actually 01:40:19.320 |
about to look at regularization so we will take a look. The identity matrix for those 01:40:28.320 |
that don't know don't remember is the matrix where if you multiply it by it you get exactly 01:40:33.480 |
the same thing that you started with so just like if you multiply by one you get back the 01:40:37.720 |
same number you started with. For linear algebra if you multiply by the identity matrix you 01:40:42.600 |
get the same matrix you started with and actually one quite popular approach to initializing 01:40:49.960 |
the hidden-to-hidden weight matrix is to initialize it with an identity matrix, which ensures that 01:40:55.960 |
you start with something which doesn't have gradient explosions or activation explosions. 01:41:05.560 |
There are yeah we'll have and we're about to have a look at some more regularization 01:41:09.440 |
approaches so let's wait until we do that. All right next question. Is there a way to 01:41:14.600 |
quickly check if the activations are disappearing/exploding? Absolutely just go ahead and calculate them 01:41:24.240 |
and we'll be looking at that in a lot more detail in part two but a really great exercise 01:41:30.240 |
would be to try to figure out how you can actually output the activations of each layer 01:41:37.320 |
and it would certainly be very easy to do that in the RNNs that we built ourselves 01:41:42.560 |
from scratch because we can actually see the linear layers and so you could just print 01:41:48.760 |
them out or print out some statistics or store them away or something like that. FAST AI 01:42:04.440 |
has a class called ActivationStats, which you can check out if you're interested — that's a really good way to do this specifically. 01:42:15.360 |
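If you want to try the exercise, one simple (illustrative) way to peek at the activations is a forward hook that prints summary statistics for each layer during a single forward pass:

```python
def stats_hook(module, inp, out):
    o = out[0] if isinstance(out, tuple) else out        # nn.LSTM returns a tuple
    print(f'{module.__class__.__name__:>10}: mean={o.mean().item():.3f}  std={o.std().item():.3f}')

model = LMModel6(len(vocab), 64, 2)
hooks = [m.register_forward_hook(stats_hook) for m in model.modules()
         if isinstance(m, (nn.Embedding, nn.LSTM, nn.Linear))]

x, _ = dls.one_batch()
_ = model.to(x.device)(x)                                # one forward pass prints per-layer stats
for h in hooks: h.remove()                               # clean up the hooks afterwards
```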
Okay so, yeah — regularization 01:42:30.280 |
is important we have potentially a lot of parameters and a lot of layers it would be 01:42:35.880 |
really nice if we can do the same kind of thing that we've done with our CNNs and so 01:42:43.080 |
forth which is to use more parameters but then use regularization to ensure that we 01:42:48.240 |
don't overfit and so we can certainly do that with an LSTM as well and perhaps the best 01:42:55.480 |
way to do that is to use something called dropout and dropout is not just used for RNNs 01:43:03.320 |
dropout is used all over the place but it works particularly well in RNNs. This is a 01:43:08.200 |
picture from the dropout paper, and what happens in dropout is — here's a kind of picture 01:43:16.960 |
of three fully connected layers; no, sorry, I guess it's two... one, two... yeah, no, three fully 01:43:25.480 |
connected layers and so in these two fully connected layers at the start here what we 01:43:32.280 |
could do is we could delete some of the activations at random — and so this has happened here; 01:43:39.240 |
that's what the X means, it's like deleting those activations at random — and if we 01:43:45.760 |
do so you can see we end up with a lot less computation going on and what dropout does 01:43:51.960 |
is each batch each mini batch it randomly deletes a different set of activations from 01:44:01.780 |
whatever layers you ask for that's what dropout does so basically the idea is that dropout 01:44:13.240 |
helps to generalize because if a particular activation was kind of effectively learning 01:44:22.100 |
some particular piece of input — memorizing it — then sometimes it gets randomly 01:44:28.560 |
deleted and so then suddenly it's not going to do anything useful at all so by randomly 01:44:36.000 |
deleting activations it ensures that activations can't become over specialized at doing just 01:44:42.320 |
one thing, because if it did, then the times it's randomly deleted it's not 01:44:47.880 |
going to work so here is the entire implementation of a dropout layer you pass it some value 01:44:56.280 |
P which is the probability that an activation gets deleted so we'll store that away and 01:45:02.120 |
so then in the forward you're going to get your activations now if you're not training 01:45:06.960 |
so if you're doing validation then we're not going to do dropout right but if we are training 01:45:13.260 |
then we create a mask, and the mask is a Bernoulli random variable — 01:45:24.680 |
so what a Bernoulli random variable means is it's a bunch of ones and zeros, where 01:45:29.480 |
this is the probability that we get a one which is one minus the probability we get 01:45:36.880 |
a zero and so then we just multiply that by our input so that's going to convert some 01:45:44.360 |
of the inputs into zeros which is basically deleting them so you should check out some 01:45:50.760 |
of the details, for example about why we divide by one minus p, which is described here, 01:45:55.960 |
and we do point out here that normally — I would normally in the lesson show you an 01:46:01.760 |
example of what bernoulli_ does, but of course nowadays, you know, we're getting 01:46:08.400 |
to the advanced classes and you're expected to do it yourself — so be sure to create a little 01:46:12.760 |
cell here, make sure you actually create a tensor, and then run bernoulli_ 01:46:18.120 |
on it, and make sure you see exactly what it's doing, so that you can then understand this class. 01:46:23.980 |
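For example, a quick cell along these lines shows what bernoulli_ does, followed by the from-scratch dropout layer it's used in (this mirrors the class being discussed; the rescaling by 1-p is the detail referred to here):

```python
t = torch.zeros(8)
print(t.bernoulli_(0.5))        # in place: each element becomes 1 with probability 0.5, else 0

class Dropout(Module):
    def __init__(self, p): self.p = p
    def forward(self, x):
        if not self.training: return x                        # no dropout at validation time
        mask = x.new_zeros(*x.shape).bernoulli_(1 - self.p)   # 1 = keep, 0 = delete
        return x * mask / (1 - self.p)                        # rescale so the expected value is unchanged
```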
Now of course we don't have to use this class we made ourselves — we can just use nn.Dropout — 01:46:31.800 |
but you can use this class yourself because it does the same thing so again you know we're 01:46:36.480 |
trying to make sure that we know how to build stuff from scratch this special self dot training 01:46:44.000 |
is set for every module automatically by fast.ai, based on whether or not you're in the validation 01:46:52.660 |
part of your training loop or the training part of your training loop it's also part 01:46:59.520 |
of PyTorch and in PyTorch if you're not using fast.ai you have to call the train method 01:47:04.480 |
on a module to set training to true and the eval method to set it to false for every module 01:47:09.920 |
inside some other module so that's one great approach to regularization another approach 01:47:18.340 |
which I've only seen used in recurrent neural nets is activation regularization and temporal 01:47:28.080 |
activation regularization which is very very similar to the question that we were just 01:47:32.000 |
asked. What happens with activation regularization is it looks very similar to weight decay, 01:47:42.760 |
but rather than adding some multiplier times the sum of squares of the weights we add some 01:47:52.920 |
multiplier by the sum of squares of the activations so in other words we're basically saying we're 01:48:00.280 |
not just trying to decrease the weights but decrease the total activations and then similarly 01:48:10.000 |
we can also see what's the difference between the activations from the previous time step 01:48:19.680 |
to this time step so take the difference and then again squared times some value so these 01:48:28.800 |
are two hyper parameters alpha and beta the higher they are the more regularized your 01:48:34.200 |
model and so with TAR it's going to say no layer of the LSTM should too dramatically 01:48:43.080 |
change the activations from one time step to the next and then for alpha it's saying 01:48:50.880 |
no layer of the LSTM should create too-large activations — and so it wouldn't actually 01:48:57.300 |
create these large activations or large changes unless the loss improved by enough to make it worth it. 01:49:03.180 |
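Numerically, the two penalties amount to something like this (alpha and beta are the hyperparameters just mentioned; acts stands for the LSTM's output activations, shape batch x sequence length x hidden size — fastai's RNNRegularizer callback applies this kind of penalty for you):

```python
def ar_tar_penalty(acts, alpha, beta):
    ar  = alpha * acts.float().pow(2).mean()                          # AR: keep activations small
    tar = beta  * (acts[:, 1:] - acts[:, :-1]).float().pow(2).mean()  # TAR: keep step-to-step changes small
    return ar + tar   # added to the main loss, so big activations or big changes must "pay for themselves"
```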
Okay, so there's then I think just one more thing we need to know about, which 01:49:16.320 |
is called weight tying and weight tying is a very minor change and let's have a look 01:49:22.400 |
at it here so this is the embedding we had before this is the LSTM we had before this 01:49:27.520 |
is where we're going to introduce dropout this is the hidden to output linear layer 01:49:33.000 |
we had before but we're going to add one more line of code which is the hidden to output 01:49:42.720 |
weights are actually equal to the input to hidden weights now this is not just setting 01:49:50.960 |
them once this is actually setting them so that they're a reference to the exact same 01:49:55.340 |
object in memory the exact same tensor in memory so the weights of the hidden to output 01:50:00.560 |
layer will always be identical to the weights of the input to hidden layer and this is called 01:50:06.920 |
weight tying and the reason we do this is because conceptually in a language model predicting 01:50:14.480 |
the next word is about kind of converting activations into English words or else an embedding 01:50:21.080 |
is about converting English words to activations and there's a reasonable hypothesis which 01:50:28.180 |
would be that well those are basically exactly the same computation or at least the reverse 01:50:33.000 |
of it so why shouldn't they use the same weights and it turns out lo and behold yes if you 01:50:37.920 |
use the same weights then actually it does work a little bit better so then here's our 01:50:44.000 |
forward which is to do the input to hidden do the RNN apply the dropout do the detach 01:50:52.280 |
and then apply the hidden to output which is using exactly the same weights as the input 01:50:56.720 |
to hidden. And reset is the same. We haven't created the RNN regularizer from scratch here, but 01:51:05.640 |
you can add it as a callback passing in your alpha and your beta if you call text learner 01:51:17.200 |
instead of Learner, it will add the ModelResetter and the RNNRegularizer for you — so that's 01:51:24.200 |
one of the things TextLearner does; so this code is the same as this code. 01:51:30.360 |
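Putting the whole thing together — LSTM, dropout, weight tying, and the regularizer callbacks via TextLearner — gives roughly this (again with the lazily created hidden state of the earlier sketches; the extra return values are what the RNN regularizer looks at):

```python
class LMModel7(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers, p):
        self.i_h  = nn.Embedding(vocab_sz, n_hidden)
        self.rnn  = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
        self.drop = nn.Dropout(p)
        self.h_o  = nn.Linear(n_hidden, vocab_sz)
        self.h_o.weight = self.i_h.weight         # weight tying: the very same tensor, not a copy
        self.n_layers, self.n_hidden = n_layers, n_hidden
        self.h = None

    def forward(self, x):
        if self.h is None:
            z = torch.zeros(self.n_layers, x.shape[0], self.n_hidden, device=x.device)
            self.h = (z, z.clone())
        raw, h = self.rnn(self.i_h(x), self.h)    # input-to-hidden, then the LSTM
        out = self.drop(raw)                      # dropout on the LSTM outputs
        self.h = tuple(h_.detach() for h_ in h)
        return self.h_o(out), raw, out            # extra outputs feed the AR/TAR regularizer

    def reset(self): self.h = None

learn = TextLearner(dls, LMModel7(len(vocab), 64, 2, 0.5),
                    loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(15, 1e-2, wd=0.1)             # wd adds the weight decay mentioned here
```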
So we can then train a model again — and let's also add weight decay — and look at this, we're 01:51:37.800 |
getting up close to 90% accuracy. So we've covered a lot in this lesson, but the amazing 01:51:49.640 |
thing is that we've just replicated all of the pieces in an AWD LSTM all of the pieces 01:51:58.220 |
in this state-of-the-art recurrent neural net which we've showed we could use in the previous 01:52:04.940 |
notebook to get what was until very recently state-of-the-art results for text classification 01:52:12.000 |
and far more quickly, and with far less compute and memory, than the more modern approaches 01:52:21.360 |
in the last year or so which have beaten that benchmark so this is a really efficient really 01:52:29.940 |
accurate approach and it's still the state-of-the-art in many many academic situations and it's 01:52:39.880 |
still very widely used in industry and so it's pretty cool that we've actually seen 01:52:43.680 |
how to write it from scratch. So the main thing to mention for further research is to have 01:52:52.080 |
a look at the source code for AWD LSTM in fastai, and see if you can see how the things 01:52:58.840 |
in AWD LSTM — you know, those lines of code — map to the concepts 01:53:05.520 |
that we've seen in this chapter. Rachel do we have any questions? So here we have come 01:53:14.000 |
to the conclusion of our what was originally going to be seven lessons and turned into 01:53:18.760 |
eight lessons. I hope that you've got a lot out of this, thank you for staying with us. 01:53:29.920 |
What a lot of people now do when they finish — at least people who have finished 01:53:35.000 |
previous courses is they go back to lesson one and try and repeat it but doing a lot 01:53:41.120 |
less looking at the notebooks a lot more doing stuff from scratch yourself and going deeper 01:53:48.000 |
into the assignments so that's one thing you could do next. Another thing you could do 01:53:53.760 |
next would be to pick out a Kaggle competition to enter or pick a book that you want to read 01:54:01.920 |
about deep learning or a paper and team up with some friends to do like a paper reading 01:54:09.900 |
group or a book reading group you know one of the most important things to keep the learning 01:54:14.800 |
going is to get together with other people on the learning journey. Another great way 01:54:20.800 |
to do that of course is through the forums so if you haven't been using the forums much 01:54:24.920 |
so far no problem but now might be a great time to get involved and find some projects 01:54:29.640 |
that are going on that look interesting and it's fine if you you know you don't have to 01:54:35.360 |
be an expert right obviously any of those projects the people that are already doing 01:54:39.680 |
it are going to know more about it than you do at this point because they're already doing 01:54:44.000 |
it but if you drop into a thread and say hey I would love to learn more about this how do 01:54:48.280 |
I get started or have a look at the wiki posts to find out and try things out you can start 01:54:54.800 |
getting involved in other people's projects and help them out. So yeah and of course don't 01:55:06.240 |
forget about writing so if you haven't tried writing a blog post yet maybe now is a great 01:55:10.240 |
time to do that pick something that's interesting to you especially if it's something in your 01:55:15.240 |
area of expertise at work or a hobby or something like that or specific to where you live maybe 01:55:21.120 |
you could try and build some kind of text classifier or text generator for particular 01:55:27.760 |
kinds of text that are that you know about you know that would be that would be a super 01:55:33.360 |
interesting thing to try out and be sure to share it with the folks on the forum. So there's 01:55:39.240 |
a few ideas so don't let this be the end of your learning journey you know keep keep going 01:55:45.840 |
and then come back and try part two if it's not out yet obviously you'll have to wait 01:55:51.480 |
until it is out; but if it is out, you might want to kind of spend a couple of months 01:55:56.960 |
you know really experimenting with this before you move on to part two to make sure that 01:56:01.080 |
everything in part one feels pretty solid to you. Well, thank you very much everybody 01:56:11.040 |
for your time we've really enjoyed doing this course it's been a tough course for us to 01:56:16.400 |
teach because with all this COVID-19 stuff going on at the same time I'm really glad 01:56:20.840 |
we've got through it. I'm particularly grateful to Sylvain, who has been extraordinary 01:56:28.280 |
in really making so much of this happen and particularly since I've been so busy with 01:56:34.080 |
COVID-19 stuff around masks in particular it's really a lot thanks to Sylvain that everything 01:56:41.400 |
has come together and of course to Rachel who's been here with me on on every one of 01:56:47.280 |
these lessons thank you so much and I'm looking forward to seeing you again in a future course