
Lesson 8 - Deep Learning for Coders (2020)


Chapters

0:00 Introduction
0:15 Natural Language Processing
5:10 Building a Language Model
12:18 Get Files
13:28 Word Tokenizer
16:30 Word Tokenizer Rules
17:38 SubWord Tokenizer
19:01 Setup
23:23 Numericalization
25:43 Batch
29:19 Data Loader
30:18 Independent Variables
30:36 Dependent Variables
31:09 Data Blocks
31:56 Class Methods
33:25 Language Model
35:08 Save Epoch
35:32 Save Encoder
37:55 Text Generation
42:05 Language Models
42:49 Classification
43:10 Batch Size
44:20 Pad Tokens
45:29 Text Classifier
48:31 Data Augmentation
48:54 Predicting the Next Word
51:00 Data Augmentation on Text
51:51 Generation
58:11 Creating Datasets

Whisper Transcript

00:00:00.000 | Hi everybody and welcome to Lesson 8, the last lesson of Part 1 of this course. Thanks
00:00:06.820 | so much for sticking with us. Got a very interesting lesson today where we're going to do a dive
00:00:12.540 | into natural language processing. And remind you, we did see natural language processing
00:00:18.000 | in Lesson 1. This was it here. We looked at a dataset where we could pass in many movie
00:00:28.160 | reviews like so and get back probabilities that it's a positive or negative sentiment.
00:00:34.740 | And we trained it with a very standard looking classifier trainer approach. But we haven't
00:00:41.160 | really talked about what's going on behind the scenes there, so let's do that. And we'll
00:00:46.040 | also learn about how to make it better. So we were getting about 93%. So 93% accuracy
00:00:52.280 | for sentiment analysis which is actually extremely good and it only took a bit over 10 minutes.
00:00:58.560 | But let's see if we can do better. So we're going to go to notebook number 10. And in
00:01:08.640 | notebook number 10 we're going to start by talking about what we're going to do to train
00:01:16.640 | an NLP classifier. So a sentiment analysis which is this movie review positive or negative
00:01:22.040 | sentiment is just a classifier. The dependent variable is binary. And the independent variable
00:01:28.040 | is the kind of the interesting bit. So we're going to talk about that. But before we do
00:01:33.000 | we're going to talk about what was the pre-trained model that got used here. Because the reason
00:01:40.200 | we got such a good result so quickly is because we're doing fine-tuning of a pre-trained model.
00:01:46.540 | So what is this pre-trained model exactly? Well the pre-trained model is actually a pre-trained
00:01:51.440 | language model. So what is a language model? A language model is a special kind of model
00:02:01.140 | and it's a model where we try to predict the next word of a sentence. So for example if
00:02:09.680 | our language model received "even if our language model knows the", its job would be to predict
00:02:18.000 | "basics". Now the language model that we use as our pre-trained model was actually trained
00:02:25.760 | on Wikipedia. So we took all the you know non-trivial sized articles in Wikipedia and
00:02:34.800 | we built a language model which attempted to predict the next word of every sequence
00:02:40.080 | of words in every one of those articles. And it was a neural network of course. And we
00:02:47.280 | then take those pre-trained weights and those are the pre-trained weights that when we said
00:02:51.240 | text classifier learner were automatically loaded in. So conceptually why would it be
00:02:57.680 | useful to pre-train a language model? How does that help us to do sentiment analysis
00:03:02.960 | for example? Well just like an ImageNet model has a lot of information about what pictures
00:03:10.360 | look like and what they consist of. A language model tells us a lot about
00:03:16.920 | what sentences look like and what they know about the world. So a language model for example
00:03:24.120 | if it's going to be able to predict the end of the sentence "in 1998 this law was passed
00:03:34.760 | by President ___"? So a language model, to predict that correctly, would have to know
00:03:41.840 | a whole lot of stuff. It would have to know about well how English language works in general
00:03:45.760 | and what kind of sentences go in what places. That after the word president would usually
00:03:51.760 | be the surname of somebody. It would need to know what country that law was passed in
00:03:57.000 | and it would need to know what president was president of that country in what I say 1998.
00:04:03.720 | So it'd have to know a lot about the world. It would have to know a lot about language.
00:04:07.880 | To create a really good language model is really hard. And in fact this is something
00:04:12.720 | that people spend many many many millions of dollars on creating language models of
00:04:19.440 | huge datasets. Our particular one doesn't take particularly long to pre-train but there's
00:04:25.960 | no particular reason for you to pre-train one of these language models because you can
00:04:29.240 | download them through fast AI or through other places. So what happened in lesson one is
00:04:39.600 | we downloaded this pre-trained Wikipedia model and then we fine-tuned it so as per usual
00:04:45.520 | we threw away the last layer which was specific for predicting the next word of Wikipedia
00:04:51.840 | and fine-tuned the model. Initially just the last layer to learn to predict sentiment of
00:04:59.960 | movie reviews and then as per usual then fine-tuned the rest of the model and that got us 93%.
00:05:07.720 | Now there's a trick we can use though which is we start with this Wikipedia language model
00:05:14.000 | and the particular subset we use is called Wikitext 103. And rather than just jumping
00:05:19.760 | straight to a classifier which we did in lesson one we can do even better if we first of all
00:05:24.920 | create an IMDB language model that is to say a language model that learns to predict the
00:05:30.360 | next word of a movie review. The reason we do that is that this will help it to learn
00:05:36.640 | about IMDB specific kind of words like it'll learn a lot more about the names of actors
00:05:42.760 | and directors it'll learn about the kinds of words that people use in movie reviews.
00:05:48.760 | And so if we do that first then we would hope we'll end up with a better classifier. So that's
00:05:53.240 | what we're going to do in the first part of today's lesson. And we're going to kind of
00:05:59.880 | do it from scratch and we're going to show you how to do a lot of the things from scratch
00:06:04.540 | even though later we'll show you how fast AI does it all for you. So how do we build
00:06:10.240 | a language model? So as we point out here sentences can be different lengths and documents like
00:06:16.360 | movie reviews can be very long. So how do we go about this? Well a word is basically
00:06:26.440 | a categorical variable and we already know how to use categorical variables as an independent
00:06:31.680 | variable in a neural net which was we make a list of all of the possible levels of a
00:06:35.880 | categorical variable which we call the vocab and then we replace each of those categories
00:06:42.640 | with its index so they all become numbers. We create an initially random embedding matrix
00:06:49.320 | for each so each row then is for one element from the vocab and then we make that the first
00:06:55.880 | layer of a neural net. So that's what we've done a few times now and we've even created
00:07:02.540 | our own embedding layer from scratch remember. So we can do the same thing with text right
00:07:07.540 | we can make a list of all the possible words in the whole corpus, the whole dataset, and
00:07:15.320 | we can replace each word with its index in the vocab and create an embedding matrix. So
00:07:23.080 | in order to create a list of all levels in this case a list of all possible words let's
00:07:28.080 | first of all concatenate all the documents or the movie reviews together into one big
00:07:32.480 | long string and split it into words okay and then our independent variable will basically
00:07:39.200 | be that sequence starting with the first word in the long list and ending with a second
00:07:44.120 | last and our dependent variable will be the sequence of words starting with a second word
00:07:49.200 | and ending with a last so they're kind of offset by one so as you move through the first
00:07:54.520 | sequence you're then trying to predict the next word in the next in the in the second
00:08:00.320 | part that's kind of what we're doing right we'll see more detail in a moment. Now when
00:08:06.680 | we create our vocab by finding all the unique words in this concatenated corpus a lot of
00:08:14.560 | the words we see will be already in the embedding matrix already in the vocab of the pre-trained
00:08:20.920 | Wikipedia model but there's also going to be some new ones right there might be some
00:08:26.800 | particular actors that don't appear in Wikipedia or maybe some informal slang words and so
00:08:34.920 | forth so when we build our vocab and then our embedding matrix for the IMDB
00:08:43.440 | language model any words that are in the vocab of the pre-trained model we'll just use them
00:08:49.520 | as is but for new words we'll create a new random vector. So here's the process we're
00:08:58.520 | going to have to go through first we're going to have to take our big concatenated corpus
00:09:03.940 | and turn it into a list of tokens could be words could be characters could be substrings
00:09:10.800 | that's called tokenization and then we'll do numericalization which is basically these
00:09:18.700 | two steps which is replacing each word with its index in a vocab which means we have to
00:09:23.600 | create that vocab so create the vocab and then convert then we're going to need to create
00:09:29.200 | a data loader that has lots of substrings lots of sequences of tokens from IMDB corpus
00:09:37.720 | as an independent variable and the same thing offset by one as a dependent variable and
00:09:45.520 | then we're going to have to create a language model. Now a language model is going to be
00:09:49.560 | able to handle input lists that can be arbitrarily big or small and we're going to be using something
00:09:56.240 | called a recurrent neural network to do this which we'll learn about later so basically
00:10:00.640 | so far we've always assumed that everything is a fixed size a fixed input so we're going
00:10:05.840 | to have to mix things up a little bit here and deal with architectures that can be different
00:10:10.640 | sizes for this notebook notebook 10 we're going to kind of treat it as a black box it's
00:10:18.360 | just going to be just a neural net and then later in the lesson we'll look at delving
00:10:23.200 | inside what's happening in that architecture okay so let's start with the first of these
00:10:30.320 | which is tokenization so converting a text into a list of words or a list of tokens what
00:10:36.480 | does that mean? Is a full stop a token? What about "don't" — is that a single word, or is it two
00:10:45.000 | words, or would I convert it to "do not"? What about long medical words that
00:10:51.040 | are kind of made up of lots of pieces of medical jargon that are all stuck together what about
00:10:55.880 | hyphenated words and really interestingly then what about something like Polish where
00:11:01.560 | you or Turkish where you can create really long words all the time they create really
00:11:06.400 | long words that are actually lots of separate parts or concatenated together or languages
00:11:11.360 | like Japanese and Chinese that don't use spaces at all — they don't really have a well-defined
00:11:18.560 | idea of a word. Well, there's no right answer, but there are basically three approaches we can
00:11:26.480 | use a word-based approach which is what we use by default at the moment for English although
00:11:30.640 | that might change which is we split a sentence on space and then there are some language specific
00:11:36.160 | rules for example turning don't into do and putting punctuation marks as a separate token
00:11:43.240 | most of the time. Really interestingly, there are tokenizers that are subword based, and this
00:11:49.600 | is where we split words into smaller parts based on the most commonly occurring substrings
00:11:54.440 | we'll see that in a moment or the simplest character-based split a sentence into its
00:12:00.640 | characters we're going to look at word and sub word tokenization in this notebook and
00:12:06.040 | then if you look at the questionnaire at the end you'll be asked to create your own character
00:12:10.200 | based tokenizer so please make sure you do that if you can it'll be a great exercise
00:12:19.280 | so fastai doesn't invent its own tokenizers we just provide a consistent interface to
00:12:25.960 | a range of external tokenizers because there's a lot of great tokenizers out there so you
00:12:32.480 | can switch between different tokenizers pretty easily so let's start let's grab our IMDB data
00:12:38.360 | set like we did in lesson one and in order to try out a tokenizer let's grab all the
00:12:44.320 | text files so we can instead of calling get image files we'll call get text files and
00:12:51.040 | you know to have a look at what that's doing don't forget we can even look at the source
00:12:55.640 | code and you can see actually it's calling a more general thing called get files and
00:13:00.960 | saying what extensions it wants right so if anything in fastai doesn't work quite the
00:13:04.400 | way you want and there isn't an option which works the way you want, you can
00:13:09.080 | always look underneath to see what we're calling, and you can call the lower-level stuff
00:13:13.880 | yourself so files is now a list of files so we can grab the first one we can open it we
00:13:20.280 | can read it have a look at the start of this review and here it is okay so at the moment
00:13:31.240 | the default English word tokenizer we use is called spaCy which uses a pretty sophisticated
00:13:37.080 | set of rules with special rules for particular words and URLs and so forth but we're just
00:13:45.320 | going to go ahead and say word tokenizer which will automatically use fastai's default word
00:13:49.920 | tokenizer currently spaCy and so if we pass a list of documents we'll just make it a list
00:13:58.320 | of one document here to the tokenizer we just created and just grab the first since we just
00:14:04.000 | created a list that's going to show us as you can see the tokenized version so you can
00:14:10.880 | see here that "this movie which I just discovered at the video store has" etc. It's changed "it's"
00:14:19.440 | into "it 's", and it's put a comma as a separate punctuation mark and so forth. Okay, so you
00:14:30.400 | can see how it has tokenized this review.
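Roughly, the code being walked through here looks like this — a sketch using the fastai v2 text API from notebook 10 (path, files and txt are illustrative names; exact outputs will differ):

```python
from fastai.text.all import *

# Download IMDB and list all the review text files
path = untar_data(URLs.IMDB)
files = get_text_files(path, folders=['train', 'test', 'unsup'])

# Read one review and tokenize it with the default word tokenizer (currently spaCy)
txt = files[0].open().read()
spacy = WordTokenizer()
toks = first(spacy([txt]))      # the tokenizer expects a collection of documents
print(coll_repr(toks, 30))      # show the first 30 tokens
```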
00:14:38.480 | Let's look at a more interesting one the US dollar blah blah blah and you can see here
00:14:42.600 | it actually knows that U.S. is special, so it doesn't treat the full stops in "U.S." as separate
00:14:46.840 | tokens, and it knows that 1.00 is special. So you can see there's a lot of tricky stuff
00:14:53.120 | going on with spaCy to try and be as kind of thoughtful about this as possible.
00:15:00.760 | Fastai then provides this tokenizer wrapper which provides some additional functionality
00:15:08.120 | to any tokenizer as you can see here which is for example the word it here which previously
00:15:20.400 | was a capital "It" has been turned into lowercase "it", and then a special token xxmaj has appeared
00:15:27.220 | at the front. Everything starting with xx is a special fastai token, and this one,
00:15:33.360 | xxmaj, means that the next word previously started with a capital letter. So
00:15:38.840 | here's another one: this used to be a capital T, so we make it lowercase and then add
00:15:43.920 | xxmaj. xxbos means this is the start of a document. So there's a few special rules going on there
00:15:51.800 | so why do we do that well if you think about it if we didn't lowercase it for instance
00:15:57.760 | or this then the capitalized version and the lowercase version are going to be two different
00:16:03.760 | words in the embedding matrix which probably doesn't make sense you know regardless of
00:16:08.320 | the capitalization they probably basically mean the same thing having said that sometimes
00:16:15.720 | the capitalization might matter so we kind of want to say all right use the same embedding
00:16:20.200 | every time you see the word this but add some kind of marker that says that this was originally
00:16:24.640 | capitalized okay so that's why we do it like this so there's quite a few rules you can
00:16:33.500 | see them in text proc rules and you can see the source code here's a summary of what they
00:16:38.520 | do but let's look at a few examples so if we use that tokenizer we created and pass
00:16:44.760 | in, for example, this text, you can see the way it's tokenized: we get the xxbos — beginning of
00:16:49.820 | stream, or beginning of string / beginning of document — this HTML entity has become a real
00:16:54.680 | Unicode character, we've got the xxmaj we discussed, and here "www" has been replaced by "xxrep 3 w", which
00:17:04.680 | means the letter w is repeated three times. So for things where you've got, like, you know,
00:17:11.560 | a hundred exclamation marks in a row, or the word "so" with like 50 o's, this is a much better
00:17:17.960 | representation. And then you can see all-uppercase has been replaced with xxup followed
00:17:26.960 | by the word. So there's some of those rules in action. Oh, you can also see multiple spaces
00:17:33.400 | have been replaced, you know, just making them standard tokens. So that's the word tokenizer.
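A sketch of the fastai Tokenizer wrapper and its rules as just described (assuming spacy and txt from the earlier snippet; the sample string here is made up):

```python
# Wrap the base tokenizer to add the fastai rules (xxbos, xxmaj, xxup, xxrep, ...)
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))

# The list of preprocessing rules that get applied
print(defaults.text_proc_rules)

# A made-up string showing fix_html, replace_rep, replace_all_caps etc. in action
print(coll_repr(tkn('&copy;   Fast.ai www.fast.ai/INDEX'), 31))
```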
00:17:40.000 | The really interesting one is the subword tokenizer. So why would you need
00:17:46.820 | a subword tokenizer? Well, consider for example this sentence here (spoken in Chinese) —
00:17:51.860 | it says "my name is Jeremy" — but the interesting thing about it is there's
00:17:57.900 | no spaces here, right, and that's because there are no spaces in Chinese, and there isn't really
00:18:06.520 | a great sense of what a word is in Chinese. In this particular sentence it's fairly clear
00:18:10.920 | what the words are but it's not always obvious sometimes the words are actually split you
00:18:16.720 | know so some of it's at the start of a sentence and some of it's at the end so you can't really
00:18:21.160 | do word tokenization for something like Chinese so instead we use subword tokenization which
00:18:28.000 | is where we look at a corpus of documents and we find the most commonly occurring groups
00:18:33.200 | of letters and those commonly occurring groups of letters become the vocab so for example
00:18:40.600 | we would probably find "wǒ de" would appear often, because that means "my", and
00:18:47.520 | "míngzì" for example is "name", whereas my name — a westernized name rendered in Chinese —
00:18:55.960 | wouldn't be very common at all, so it would probably appear as separate pieces. So let's look at
00:19:03.200 | an example let's grab the first 2000 movie reviews and let's create the default subword
00:19:13.840 | tokenizer, which currently uses something called SentencePiece (that might change), and now we're
00:19:20.080 | going to use something special, something very important, which is called setup. For transforms
00:19:25.960 | in fastai you can always call this special thing called setup — it often doesn't do anything,
00:19:31.640 | and it's always there — but some transforms, like a subword tokenizer, actually need to
00:19:37.960 | be set up before you can use them in other words you can't tokenize into subwords until
00:19:43.680 | you know what the most commonly occurring groups of letters are so passing a list of
00:19:49.840 | texts in here this list of text to set up will train the subword tokenizer it'll find
00:19:57.840 | those commonly occurring groups of letters so having done that we can then this is just
00:20:04.080 | for experimenting we're going to pass in some size we'll say what vocab size we want for
00:20:09.640 | our subword tokenizer we'll set it up with our texts and then we will have a look at
00:20:15.640 | a particular sentence so for example if we create a subword tokenizer with a thousand
00:20:21.600 | tokens in its vocab, it returns this tokenized string. Now this kind of long underscore thing
00:20:32.680 | is what we replace space with, because now we're using subword tokens we kind of want
00:20:36.880 | to know where the words actually start and stop. And you can see here a lot of these
00:20:42.600 | words are common enough sequences of letters that they get their own vocab item, whereas
00:20:49.520 | "discovered" wasn't common enough, so that became "dis c over ed"; right, "video" appears
00:20:58.880 | enough whereas "store" didn't, so that becomes "st or e" — so you get the idea. Right, so if we wanted
00:21:05.660 | a smaller vocab size then, as you see, even "this" doesn't become its own token, though "movie"
00:21:13.520 | is so common that it does get its own token. We have a question:
00:21:23.280 | okay how can we determine if the given pre-trained model in this case wiki text 103 is suitable
00:21:31.700 | enough for our downstream task if we have limited vocab overlap should we need to add
00:21:36.920 | an additional data set to create a language model from scratch if it's in the same language
00:21:45.520 | so if you're doing English it's almost always sufficient to use Wikipedia
00:21:52.060 | we've played around with this a lot and it was one of the key things that Sebastian Ruder
00:21:55.320 | and I found when we created the ULM fit paper was before that time people really thought
00:22:01.160 | you needed corpus-specific pre-trained models, but we discovered you don't — just like you
00:22:07.440 | don't that often need corpus-specific pre-trained vision models; ImageNet works surprisingly well
00:22:14.800 | across a lot of different domains. So Wikipedia has a lot of words in it, and it would be really
00:22:25.360 | rare — I haven't come across an English corpus that didn't have a very high level of overlap
00:22:31.040 | with Wikipedia on the other hand if you're doing ULM fit with like genomic sequences
00:22:37.920 | or Greek or whatever then obviously you're going to need a different pre-trained model
00:22:46.000 | so once we got to a 10,000 word vocab as you can see basically every word at least common
00:22:51.120 | word becomes its own vocab item in the subword vocab — except, say, "discovered", which becomes
00:22:59.000 | "discover ed". So my guess is that subword approaches are going to become kind of the most common
00:23:08.200 | maybe they will be by the time you watch this we've got some fiddling to do to get this
00:23:14.560 | working super well for fine-tuning but I think I know what we have to do so hopefully we'll
00:23:21.200 | get it done pretty soon.
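A sketch of the subword-tokenizer experiment described above (the subword helper is just a convenience for trying different vocab sizes; it mirrors what the notebook does):

```python
# Train the subword tokenizer on the first 2000 reviews
txts = L(o.open().read() for o in files[:2000])

def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)   # SentencePiece-based
    sp.setup(txts)                       # learn the most common letter groups
    return ' '.join(first(sp([txt]))[:40])

subword(1000)    # common words get their own token, rarer ones get split
subword(200)     # tiny vocab: almost everything is split into small pieces
subword(10000)   # big vocab: nearly every common word is a single token
```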
00:23:29.880 | All right, so after we split it into tokens, the next thing to do is numericalization. So let's go back to our word-tokenized text, which looks like this,
00:23:39.400 | and in order to numericalize we will first need to call setup. So to save a bit
00:23:49.200 | of time let's create a subset of our text — just a couple of hundred of the
00:23:53.240 | reviews from the corpus. So here's an example of one, and we'll
00:23:58.760 | create our Numericalize object and we will call setup, and that's the thing that's
00:24:02.860 | going to create the vocab for us. And so after that we can now take a look at the vocab — this
00:24:08.160 | is coll_repr showing us a representation of a collection; it's what the L class uses
00:24:14.960 | underneath and you can see when we do this that the vocab starts with the special tokens
00:24:25.000 | and then we start getting the English tokens in order of frequency so the default is a
00:24:33.920 | vocab size of 60,000 so that'll be the size of your embedding matrix by default and if
00:24:40.560 | there are more than 60,000 unique words in your corpus, then the least
00:24:49.240 | common ones will be replaced with a special xxunk unknown token, so that'll help us avoid
00:24:55.900 | having a too-big embedding matrix. All right, so now we can treat the Numericalize object
00:25:03.960 | which we created as if it was a function, as we so often do in both fastai and PyTorch,
00:25:09.200 | and when we do, it'll replace each of our words with numbers. So 2, for example — 0, 1, 2 — is the beginning-of-
00:25:19.240 | stream token, and 8 — 0, 1, 2, 3, 4, 5, 6, 7, 8 — is the capitalized-letter marker:
00:25:28.680 | there they are, xxbos, xxmaj, etc. Okay, and then we can convert them back by indexing into
00:25:37.360 | the vocab and get back what we started with. Okay.
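A sketch of the numericalization step just described (tkn and txts as in the earlier snippets; the vocab you get depends on the subset used):

```python
# Tokenize a couple of hundred reviews to build a vocab from
toks200 = txts[:200].map(tkn)

# setup() builds the vocab; calling the object maps tokens to integers
num = Numericalize()
num.setup(toks200)
print(coll_repr(num.vocab, 20))     # special tokens first, then words by frequency

toks = tkn(txts[0])
nums = num(toks)[:20]                            # words replaced by vocab indices
print(' '.join(num.vocab[o] for o in nums))      # ...and converted back by indexing
```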
00:25:49.920 | Right, so now we have done the tokenization and the numericalization, and so the next thing we need to do is to create batches;
00:25:56.480 | so let's say this is the text that we want to create batches from and so if we tokenize
00:26:04.400 | that text it'll convert it into this. And so
00:26:16.640 | let's take that and write it out
00:26:26.680 | here: xxbos xxmaj in this chapter we will go back over the example
00:26:35.760 | of classifying and then next row starts here movie reviews we studied in chapter one and
00:26:42.040 | dig deeper under the surface full stop XXmaj first we will look at the etc okay so we've
00:26:48.040 | taken these 90 tokens and to create a batch size of six we've broken up the text into
00:26:56.600 | six contiguous parts each of length 15 so 1 2 3 4 5 6 and then we have 15 columns okay
00:27:06.640 | so 6 by 15 now ideally we would just provide that to our model as a batch and if indeed
00:27:17.280 | that was all the data we had we could just pass it in as a batch but that's not going
00:27:25.840 | to work for imdb because imdb once we concatenate all the reviews together and then let's say
00:27:32.600 | we want to use a batch size of 64 then we're going to have 64 rows and you know probably
00:27:42.920 | there's a few million tokens of IMDB, so a few million divided by 64 across — it's going
00:27:49.280 | to be way too big to fit in our GPU. So what we're going to do then is we're going
00:27:57.320 | to split up that big wide array and we're going to split it up horizontally so we'll
00:28:04.880 | start with XXbos XXmaj in this chapter and then down here we will go back over the example
00:28:12.120 | of classifying movie reviews we studied in chapter one and dig deeper under the surface
00:28:18.600 | etc so this would become our first mini-batch right and so you can see what's happened is
00:28:28.640 | the kind of second row right actually is continuing what was like way down here and so we basically
00:28:39.540 | treated each row as totally independent. So when we predict from the second
00:28:45.600 | mini-batch, you know, the second mini-batch is going to follow from the first, and that's true for
00:28:50.080 | each row: row one in the second mini-batch will join up to row one of the first, row two
00:28:56.320 | of the second mini-batch will join up to row two of the first. So please look at this example
00:29:01.800 | super carefully because we found that this is something that every year a lot of students
00:29:08.040 | get confused about because it's just not what they expected to see happen right so go back
00:29:14.400 | over this and make sure you understand what's happening in this little example so that's
00:29:20.160 | what our mini batches are going to be so the good news is that all these fiddly steps you
00:29:26.760 | don't have to do yourself you can just use the language model data loader or LM data
00:29:32.320 | loader so if we take those all the tokens from the first 200 movie reviews and map them
00:29:38.160 | through our numericalize object right so now we've got numericalized versions of all those
00:29:42.720 | tokens and then pass them into LM data loader and then grab the first item from the data
00:29:49.860 | loader then we have 64 by 72 why is that well 64 is the default batch size and 72 is the
00:30:02.440 | default sequence length you see here we've got one two three four five here we used a
00:30:08.080 | sequence length of five right so what we do in practice is we use a default sequence length
00:30:15.080 | of 72 so if we grab the first of our independent variables and grab the first few tokens and
00:30:24.400 | look them up in the vocab, here it is: this movie, which I just — xxunk, something — at the video
00:30:29.160 | store... So that's interesting: that word ("discovered") was not common enough to be in the vocab. ...has apparently
00:30:34.080 | sit around for a... And then if we look at the exact same thing but for the dependent variable
00:30:40.960 | rather than being XXBOS XXMaj this movie it's XXMaj this movie so you can see it's offset
00:30:47.160 | by one which means the end rather than being around for a it's for a couple so this is exactly
00:30:55.280 | what we want — this is offset by one from here. So that's looking good.
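A sketch of the LMDataLoader step being described (num and toks200 from the previous snippet):

```python
# Numericalize all 200 tokenized reviews and build a language-model data loader
nums200 = toks200.map(num)
dl = LMDataLoader(nums200)

x, y = first(dl)
print(x.shape, y.shape)   # defaults: 64 (batch size) x 72 (sequence length)

# The dependent variable is the independent variable offset by one token
print(' '.join(num.vocab[o] for o in x[0][:20]))
print(' '.join(num.vocab[o] for o in y[0][:20]))
```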
00:31:10.160 | So we can now go ahead and use these ideas to try and build our even better IMDB sentiment analysis, and the first
00:31:16.360 | step will be to as we discussed create the language model but let's just go ahead and
00:31:21.300 | use the fast AI built-in stuff to do it for us rather than doing all that messing around
00:31:26.520 | manually so we can just create a data block and our blocks are it's going to be a text
00:31:32.720 | block from folder and the items are going to be text files from these folders and we're
00:31:46.400 | going to split things randomly and then going to turn that into data loaders with a batch
00:31:52.280 | size of 128 and a sequence length of 80 in this case our blocks we're not just passing
00:32:01.280 | in a class directly but we're actually passing in here a class method and that's so that
00:32:10.600 | we can allow the tokenization for example to be saved to be cached in some path so that
00:32:19.000 | the next time we run this it won't have to do it all from scratch so that's why we have
00:32:24.440 | a slightly different syntax here so once we've run this we can call show batch and so you
00:32:33.560 | can see here we've got for example what xxmaj I've read xxmaj death blah blah blah and you
00:32:44.000 | can see so that's the independent variable and so the dependent variable is the same
00:32:48.000 | thing offset by one so we don't have the what anymore but it just goes straight to xxmaj
00:32:52.880 | I've read and then at the end this was also this and of course in the dependent variable
00:32:57.680 | also this is so this is that offset by one just like we were hoping for show batch is
00:33:03.720 | automatically denumericalizing it for us turning it back into strings but if we look at the
00:33:10.400 | actual or you should look at the actual x and y to confirm that you actually see numbers
00:33:16.720 | there that'll be a good exercise for you is to make sure that you can actually grab a
00:33:20.680 | mini-batch from dls_lm.
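A sketch of the language-model DataBlock being described (folder names and the 10% random split follow the notebook; get_imdb is just a named helper):

```python
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    # is_lm=True: the dependent variable is the text offset by one token
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

dls_lm.show_batch(max_n=2)
x, y = first(dls_lm.train)   # the actual tensors are numericalized token ids
```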
00:33:30.400 | So now that we've got the data loaders we can fine-tune our language model. Fine-tuning the language model means we're going to create a learner which is going
00:33:35.060 | to learn to predict the next word of a movie review so that's our data the data loaders
00:33:41.700 | for the language model; this is the pre-trained model, something called AWD-LSTM, which
00:33:47.160 | we'll see how to create from scratch (or something similar to it) in a moment. Dropout
00:33:54.080 | we'll learn about later; this says how much dropout to use, i.e. how much regularization
00:33:57.680 | we want. And what metrics do we want? We know about accuracy; perplexity is not particularly
00:34:03.760 | interesting so I won't discuss it but feel free to look it up if you're interested and
00:34:07.640 | let's train with fp16 to use less memory on the GPU and for any modern GPU it'll also
00:34:15.560 | run two or three times faster so this gray bit here has been done for us the pre-training
00:34:23.200 | of the language model for wiki text 103 and now we're up to this bit which is fine-tuning
00:34:28.760 | the language model for imdb so let's do one epoch and as per usual the using a pre-trained
00:34:39.520 | model automatically calls freeze so we don't have to freeze so this is going to just actually
00:34:46.400 | train only the new embeddings initially and we get an accuracy after a ten minutes or
00:34:53.120 | so of 30% so that's pretty cool so about a bit under a third of the time our model is
00:34:59.680 | predicting the next word of a string so I think that's pretty cool now since this takes
00:35:09.720 | quite a while for each epoch we might as well save it and you can save it under any name
00:35:17.760 | you want and that's going to put it into your path into your learner's path into a model
00:35:21.720 | subfolder and it'll give it a .pth extension for PyTorch and then later on you can load
00:35:28.160 | that with learn.load after you create the learner and so then we can unfreeze and we
00:35:36.280 | can train a few more epochs and we eventually get up to an accuracy of 34% so that's pretty
00:35:42.960 | great so once we've done all that we can save the model but actually all we really need
00:35:51.480 | to do is to save the encoder what's the encoder the encoder is all of the model except for
00:35:59.280 | the final layer oh and we're getting a thunderstorm here that could be interesting we've never
00:36:03.880 | done a lesson with a thunderstorm before but that's the joy of teaching during COVID-19
00:36:10.920 | you get all the sound effects. So yeah, the final layer of our language model
00:36:21.040 | is the bit that actually picks a particular word out, which we don't need, so when we say
00:36:26.320 | save encoder it saves everything except for that final layer and that's the pre-trained
00:36:31.240 | model we're going to use that is a pre-trained model of a language model that is fine-tuned
00:36:37.800 | from Wikipedia, fine-tuned using IMDB, and doesn't contain the very last layer.
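A sketch of the fine-tuning steps just walked through (learning rates and epoch counts are typical values from the notebook, not exact):

```python
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]).to_fp16()

# Pre-trained models start frozen, so this mostly trains the new embeddings
learn.fit_one_cycle(1, 2e-2)
learn.save('1epoch')            # written to <learner path>/models/1epoch.pth
# learn = learn.load('1epoch')  # reload later if needed

learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)

# Save everything except the final next-word-prediction layer
learn.save_encoder('finetuned')
```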
00:36:45.880 | Rachel, any questions at this point? Do any language models attempt to provide meaning? For instance, I'm going
00:36:53.160 | to the store is the opposite of I'm not going to the store or I barely understand this stuff
00:36:59.600 | and that ball came so close to my ear I heard it whistle both contain the idea of something
00:37:04.640 | almost happening being right on the border is there a way to indicate this kind of subtlety
00:37:09.280 | in a language model yeah absolutely our language model will have all of that in it or hopefully
00:37:19.360 | it will hopefully it'll learn about it we don't have to program that the whole point
00:37:22.760 | of machine learning is it learns it for itself but when it sees a sentence like hey careful
00:37:30.400 | that ball nearly hit me the expectation of what word is going to be happen next is going
00:37:37.080 | to be different to the sentence hey that ball hit me so so yeah language models generally
00:37:44.860 | you see in practice tend to get really good at understanding all of these nuances of of
00:37:51.880 | of English or whatever language it's learning about okay so we have a fine-tuned language
00:37:57.760 | model so the next thing we're going to do is we're going to try fine-tuning a classifier
00:38:02.760 | but before we do just for fun let's look at text generation we can create write ourselves
00:38:09.680 | some words like I liked this movie because and then we can create say two sentences each
00:38:17.080 | containing say 40 words and so we can just go through those two sentences and call learn.predict
00:38:24.440 | passing in this text and asking to predict this number of words with this amount of kind
00:38:31.040 | of randomization and see what it comes up with I liked this movie because of its story
00:38:38.520 | and characters the storyline was very strong very good for a sci-fi the main character
00:38:42.520 | alucard was very well developed and brought the whole story but second attempt I like
00:38:48.120 | this movie because I like the idea of the premise of the movie the very convenient virus
00:38:53.020 | which well when you have to kill a few people the evil machine has to be used to protect
00:38:57.020 | blah blah blah. So as you can see, it's done a good job of inventing language. There are
00:39:04.440 | much — I shouldn't say more sophisticated — there are more careful ways to do generation
00:39:11.320 | from a language model; this learn.predict uses the most basic possible one. But even
00:39:17.280 | with a very simple approach you can see we can get from a fine-tuned model some pretty authentic-looking text.
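A sketch of the simple generation loop being shown (temperature controls the amount of randomization; 0.75 is just an example value):

```python
TEXT = "I liked this movie because"
N_WORDS, N_SENTENCES = 40, 2

# On a language model, learn.predict generates n_words continuing the prompt
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75)
         for _ in range(N_SENTENCES)]
print("\n".join(preds))
```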
00:39:22.020 | And so in practice this is really interesting, because we can
00:39:29.360 | now, you know, by using the prompt, kind of get it to generate, you know, appropriate
00:39:37.800 | context appropriate text particularly if you fine-tune from a particular corpus anyway
00:39:46.360 | that was really just a little demonstration of something we accidentally created on the
00:39:50.160 | way of course the whole purpose of this is actually just to be a pre-trained model for
00:39:53.960 | classification so to do that we're going to need to create another data block and this
00:39:59.880 | time we've got two blocks not one we've got a text block again just like before but this
00:40:04.460 | time we're going to ask fastai not to create a vocab from the unique words but using the
00:40:11.560 | vocab that we already have from the language model because otherwise obviously there's
00:40:16.200 | no point reusing a pre-trained model if the vocab is different the numbers would mean
00:40:20.920 | totally different things so that's the independent variable and the dependent variable just like
00:40:26.920 | we've used before is a category so a category block is for that as we've used many times
00:40:31.700 | we're going to use parent label to create our dependent variable that's a function get
00:40:37.600 | items we'll use get_text_files just like before, and we'll split using GrandparentSplitter,
00:40:43.160 | as we've used before in vision. And then we'll
00:40:48.600 | create our data loaders with a batch size of 128 and a sequence
00:40:53.200 | length of 72, and now with show_batch we can see an example of a subset of a movie review and a category.
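A sketch of the classifier DataBlock, roughly as described — note the vocab comes from the language-model DataLoaders, and there is no is_lm this time:

```python
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_items=partial(get_text_files, folders=['train', 'test']),
    get_y=parent_label,
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

dls_clas.show_batch(max_n=3)
```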
00:41:02.440 | Yes, we have a question: do the tokenizers use any tokenization techniques like stemming
00:41:12.800 | or lemmatization or is that an outdated approach that would not be a tokenization approach
00:41:21.020 | so stemming is something that actually removes the ends of words, leaving just the stem, and we absolutely don't want to do
00:41:27.040 | that — that is certainly an outdated approach. In English we have those word endings for a reason —
00:41:35.120 | they tell us something so we we don't like to remove anything that can give us some kind
00:41:41.560 | of information. We used to use that quite a bit for pre-deep-learning NLP, because
00:41:48.160 | we didn't really have good ways, like embedding matrices, of handling, you know, big
00:41:53.640 | vocabs that just differed in, you know, kind of the end of a word. But nowadays we
00:42:01.000 | definitely don't want to do that oh yeah one other difference here is previously we had
00:42:10.200 | an is_lm=True when we said TextBlock.from_folder, to say it was a language
00:42:15.080 | model; we don't have that anymore because it's not a language model. Okay, now one thing with
00:42:24.760 | a language model that was a bit easier was that we could concatenate all the documents
00:42:28.880 | together and then — not exactly split them by batch size, but — put them
00:42:35.960 | into a number of substrings based on the batch size, and that way we could
00:42:41.480 | ensure that every mini-batch was the same size: it would be batch size by sequence length,
00:42:50.200 | but for classification we can't do that we actually need each dependent variable label
00:42:57.280 | to be associated with each complete movie review and we're not showing the whole movie
00:43:03.480 | review here we've truncated it just for display purposes but we're going to use the whole
00:43:06.760 | movie review to make our prediction now the problem is that if we're using a batch size
00:43:14.440 | of 128 then and our movie reviews are often like 3,000 words long we could end up with
00:43:22.280 | something that's way too big to fit into the GPU memory so how are we going to deal with
00:43:28.640 | that well again we're going we can we can split them up so first of all let's grab a
00:43:37.440 | few of the movie reviews just to a demo here and numericalize them and if we have a look
00:43:42.680 | at the length so map the length over each you can see that they do vary a lot in length
00:43:50.760 | now we can we can split them into sequences and indeed we have asked to do that sequence
00:43:56.080 | length 72 but when we do so we're you know we don't even have the same number of sub-sequences
00:44:02.640 | when we split each of these into 72 long sections they're going to be all different lengths
00:44:11.080 | so how do we deal with that well just like in vision we can handle different sized sequences
00:44:17.880 | by adding padding so we're going to add a special XX pad token to every sequence in
00:44:28.680 | a mini batch so like in this case it looks like 581 is the longest so we would add enough
00:44:33.360 | padding tokens to make this 581 and this 581 and this 581 and so forth and then we can
00:44:39.760 | split them into sequence length into 72 long and it's in the mini batches and we'll be
00:44:46.280 | right to go now obviously if your lengths are very different like this adding a whole
00:44:52.420 | lot of padding is going to be super wasteful so another thing that fastai does internally
00:44:56.920 | is it tries to shuffle the documents around so that similar length documents are in the
00:45:03.160 | same mini batch it also randomizes them but it kind of approximately sorts them so it
00:45:09.360 | wastes a lot less time on padding. Okay, so that is what happens — we
00:45:19.240 | don't have to do any of that manually; when we call TextBlock.from_folder without the
00:45:25.760 | is_lm it does all that for us. And then we can now go ahead and create a learner; this
00:45:33.920 | time it's going to be a text_classifier_learner; again we're going to base it off AWD_LSTM,
00:45:40.640 | pass in the data loaders we just created for metric we'll just use accuracy make it FP16
00:45:46.360 | again and now we don't want to use a pre-trained Wikipedia model in fact there is no pre-trained
00:45:52.480 | Wikipedia classifier because you know what you classify matters a lot so instead we load
00:46:00.000 | the encoder so remember everything except the last layer which we saved just before
00:46:05.640 | so we're going to load as a pre-trained model a language model for predicting the next word
00:46:11.720 | of a movie review so let's go ahead and fit one cycle and again by default it will be
00:46:23.040 | frozen so it's only the final layer which is the randomly added classifier layer that's
00:46:28.680 | going to be trained it took 30 seconds and look at this we already have 93% so that's
00:46:34.720 | pretty similar to what we got back in lesson one but rather than taking about 12 minutes
00:46:42.680 | once all the pre-training's been done it takes about 30 seconds this is quite cool you can
00:46:47.120 | create a language model for your kind of general area of interest and then you can create all
00:46:54.400 | kinds of different classifiers pretty quickly. And that's just from
00:47:00.680 | fine-tuning the final, randomly added layer. So now we could just unfreeze
00:47:09.920 | and keep learning but something we found is for NLP it's actually better to only unfreeze
00:47:18.280 | one layer at a time not to unfreeze the whole model so we've in this case we've automatically
00:47:23.400 | unfrozen the last layer, and so then to unfreeze the last couple of layer groups we
00:47:28.680 | can say freeze_to(-2) and then train a little bit more — and look at this, we're already,
00:47:34.880 | after a bit over a minute, easily beating what we got in lesson one — and then
00:47:43.000 | freeze_to(-3) to unfreeze another few layers, now we're up to 94, and then finally unfreeze
00:47:48.840 | the whole model and we're up to about 94.3% accuracy and that was literally the state
00:47:56.640 | of the art for this very heavily studied dataset just three years ago.
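A sketch of the classifier fine-tuning with gradual unfreezing just described (the discriminative learning-rate slices are the notebook's usual pattern; exact values may differ):

```python
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
                                metrics=accuracy).to_fp16()
learn = learn.load_encoder('finetuned')   # the fine-tuned LM, minus its last layer

learn.fit_one_cycle(1, 2e-2)              # train just the random classifier head (~93%)

learn.freeze_to(-2)                       # unfreeze the last two parameter groups
learn.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2))

learn.freeze_to(-3)                       # unfreeze a few more layers
learn.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3))

learn.unfreeze()                          # finally train the whole model (~94.3%)
learn.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3))
```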
00:48:04.680 | If you also reverse all of the reviews to make them go backwards and train a second model on the backwards
00:48:09.360 | version and then average the predictions of those two models as an ensemble you get to
00:48:14.320 | 95.1% accuracy and that was the state of the art that we actually got in the ULM fit paper
00:48:20.480 | and it was only beaten for the first time a few months ago using a way way bigger model
00:48:26.120 | way more data way more compute and way more data augmentation I should mention actually
00:48:34.120 | with the data augmentation one of the cool things they did do was they actually figured
00:48:37.480 | out a way to even beat our 95.1 with less data as well. So I should mention that,
00:48:42.880 | actually, data augmentation has, since we created the ULMFiT paper,
00:48:47.840 | become a really, really important approach. Any questions, Rachel? Can someone explain how
00:48:56.400 | a model trained to predict the last word in the sentence can generalize to classify sentiment
00:49:01.760 | they seem like different domains yeah that's a great question they're very different domains
00:49:07.640 | and it's it's really amazing and basically the trick is that to be able to predict the
00:49:14.440 | next word of a sentence you just have to know a lot of stuff about not only the language
00:49:19.640 | but about the world so if you know let's say we wanted to finish the next word of this
00:49:27.080 | sentence by training a model on all the text read backwards and averaging the averaging
00:49:31.160 | the predictions of these two models we can even get to 95.1% accuracy which was the state
00:49:35.980 | of the art introduced by the what so to be able to fill in the word ULM fit you would
00:49:42.720 | have to know a whole lot of stuff about you know the fact that there's a thing called
00:49:47.640 | fine-tune you know pre-trained language models and which one gets which results and the ULM
00:49:53.480 | fit got this particular result I mean that would be an amazing language model that could
00:49:57.600 | fill that in correctly I'm not sure that any language models can but to give you a sense
00:50:02.360 | of like what you have to be able to do to be good at language modeling so if you're
00:50:08.760 | going to be able to predict the next word of a sentence like wow I really love this
00:50:15.320 | movie, I love every movie containing Meg — what? Right, maybe it's Ryan. You'd have to know about,
00:50:23.780 | like, the fact that Meg Ryan is an actress and actresses are in movies and so forth. So
00:50:29.120 | when you know so much about English and about the world to then turn that into something
00:50:35.480 | which recognizes that I really love this movie is a good thing rather than a bad thing is
00:50:42.400 | just not a very big step, and as we saw you can actually get that far by fine-
00:50:50.160 | tuning just the very last layer or two. So it is — it's amazing, and I think that's super
00:50:58.400 | super cool all right another question how would you do data augmentation on text well
00:51:12.520 | you would probably Google for unsupervised data augmentation and read this paper and
00:51:17.000 | things that have cited it so this is the one that easily beat our IMDB result with only
00:51:24.720 | 20 labeled examples which is amazing right and so they did things like if I remember
00:51:33.080 | correctly translate every sentence into a different language and then translate it back
00:51:39.080 | again so you kind of get like different rewordings of the sentence that way yeah so kind of tricks
00:51:46.160 | like that now let's go back to the generation thing so remember we saw that we can generate
00:52:05.080 | context appropriate sentences and it's important to think about what that means in practice
00:52:10.640 | when you can generate context appropriate sentences have a look for example at even
00:52:15.000 | before this technology existed in 2017 the FCC asked for comments about a proposal to
00:52:24.120 | repeal net neutrality and it turned out that less than 800,000 of the 22 million comments
00:52:36.880 | actually appeared to be unique, and this particular person, Jeff Kao, discovered that a lot of
00:52:44.000 | the submissions were slightly different to each other by kind of like picking up different
00:52:52.000 | you know the green bit would either be citizens or people like me or Americans and the red
00:52:57.820 | bit would be as opposed to or rather than and so forth so like and that made a big difference
00:53:05.080 | to I believe to American policy you know here's an example of reddit conversation you're wrong
00:53:14.880 | the defense budget is a good example of how badly the US spends money on the military
00:53:18.400 | dot dot dot somebody else yeah but that's already happening there's a huge increase
00:53:22.220 | in the military budget I didn't mean to sound like stop paying for the military I'm not
00:53:26.520 | saying that we cannot pay the bills that are... — all of these are actually created by a language
00:53:31.760 | model, GPT-2, and this is a very concerning thing around disinformation: never
00:53:39.640 | mind fake news never mind deep fakes think about like what would happen if somebody invested
00:53:45.240 | a few million dollars in creating a million Twitter bots and Facebook groups bots and
00:53:52.600 | Weibo bots and made it so that 99% of the content on social networks were deep learning
00:54:03.200 | bots and furthermore they were trained not just to optimize the next word of a sentence
00:54:08.040 | but were trained to optimize the level of disharmony created or the level of agreeableness
00:54:17.200 | for some of the half of them and disagreeableness for the other half of them you know you could
00:54:21.120 | create like a whole lot of you know just awful toxic discussion which is actually the goal
00:54:29.620 | of a lot of propaganda outfits it's not so much to push a particular point of view but
00:54:35.620 | to make people feel like there's no point engaging because the truth is too hard to
00:54:41.120 | understand, or whatever. So Rachel and I are both super worried about what could happen
00:54:51.560 | to discourse now that we have this incredibly powerful tool, and I'm not even sure —
00:54:57.020 | we don't have a great sense of what to do about it. Algorithms are unlikely
00:55:02.800 | to save us here: if you could create a classifier which could do a good job of figuring out
00:55:08.240 | whether something was generated by an algorithm or not, then I could just use your classifier
00:55:14.120 | as part of my training loop to train an algorithm that can actually learn to trick your classifier
00:55:20.480 | so this is a real worry and the only solutions I've seen are those which are kind of based
00:55:26.840 | on cryptographic signatures which is another whole can of worms which has never really
00:55:34.240 | been properly sorted out at least not in the Western world in a privacy centric way all
00:55:41.280 | right so yes I'll add and I'll link to this on the forums I gave a keynote at SciPy conference
00:55:49.800 | last summer which is the scientific Python conference and went into a lot more detail
00:55:54.280 | about the the threat that Jeremy is describing about using advanced language models to manipulate
00:56:00.960 | public opinion and so if you want to kind of learn more about the dangers there and exactly
00:56:06.280 | what that threat is you can find that in my SciPy keynote great thanks so much Rachel
00:56:13.400 | so let's have a five-minute break and see you back here in five minutes
00:56:24.080 | so we're going to finish with a kind of a segue into what will eventually be part two
00:56:32.720 | of the course which is to go right underneath the hood and see exactly how a more complex
00:56:41.120 | architecture works, and specifically we're going to see how a recurrent neural network
00:56:46.640 | works. Do we have a question first? In the previous lesson's MNIST example you showed us that under
00:56:58.080 | the hood the model was learning parts of the image like curves of a three or angles of
00:57:02.800 | a seven is there a way to look under the hood of the language models to see if they are
00:57:07.160 | learning rules of grammar and syntax would it be a good idea to fine-tune models with
00:57:12.600 | examples of domain specific syntax like technical manuals or does that miss the point of having
00:57:18.920 | the model learn for themselves yeah there are tools that allow you to kind of see what's
00:57:25.840 | going on inside an NLP model we're not going to look at them in this part of the course
00:57:30.520 | maybe we will in part two but certainly worth doing some research to see what you can find
00:57:35.000 | and there's certainly PyTorch libraries you can download and play with yeah I mean I think
00:57:44.280 | it's a perfectly good idea to incorporate some kind of technical manuals and stuff into
00:57:51.600 | your training corpus there's actually been some recent papers on this general idea of
00:57:57.320 | trying to kind of create some carefully curated sentences as part of your training corpus
00:58:05.680 | it's unlikely to hurt and it could well help all right so let's have a look at RNNs now
00:58:15.880 | when Sylvain and I started creating the RNN stuff for fast AI the first thing I did actually
00:58:25.200 | was to create a new data set and the reason for that is I didn't find any data sets that
00:58:31.920 | would allow for quick prototyping and really easy debugging so I made one which we call
00:58:39.400 | human numbers, and it contains the first ten thousand numbers written out in English
00:58:47.400 | and I'm surprised at how few people create data sets I create data sets frequently you
00:58:58.360 | know I specifically look for things that can be kind of small easy to prototype good for
00:59:03.340 | debugging and quickly trying things out and very very few people do this even though like
00:59:09.240 | this human numbers data set which has been so useful for us took me I don't know an hour
00:59:14.160 | or two to create so this is definitely an underappreciated underutilized technique so
00:59:26.320 | we can grab the human numbers data set and we can see that there's a training and a validation
00:59:31.260 | text file we can open each of them and for now we're just going to concatenate the two
00:59:35.920 | together into a file called lines and you can see that the contents are 1 2 3 etc and
00:59:44.920 | so there's a new line at the end of each we can concatenate those all together and put
00:59:51.760 | a full stop between them as so okay and then you could tokenize that by splitting on spaces
01:00:00.080 | and so for example here's tokens 100 to 110 new number 42 new number 43 new number 44 and
01:00:08.480 | so forth so you can see I'm just using plain Python here there's not even any PyTorch certainly
01:00:12.760 | not any fast AI to create a vocab we can just create all the unique tokens of which there
01:00:20.040 | are 30, and that gives us a lookup — sorry, that's a lookup
01:00:29.320 | from an ID to a word; to go from a word to an ID we can just enumerate that and create
01:00:36.480 | a dictionary from word to ID. So then we can numericalize our tokens by calling word
01:00:47.040 | to index on each one and so here's our tokens and here's the equivalent numericalized version
01:00:55.480 | so you can see in fairly small data sets when we don't have to worry about scale and speed
01:01:03.000 | and the details of tokenization in English you can do the whole thing in just plain Python
01:01:11.760 | the only other thing we did to save a little bit of time is use L but you could
01:01:16.520 | easily do that with the Python standard library in about the same amount of code so hopefully
01:01:24.400 | that gives you a good sense of really what's going on with tokenization and numericalization
01:01:29.200 | all done by hand so let's create a language model so one way to create a language model
01:01:35.520 | would be to go through all of our tokens and let's create a range from zero to the length
01:01:41.840 | of our tokens minus four and every three of them and so that's going to allow us to grab
01:01:47.360 | three tokens at a time 1.2.3.4.5.6.7.8 and so forth right so here's the first three tokens
01:02:01.760 | and then here's the fourth token and here's the second three tokens and here's the seventh
01:02:07.960 | token and so forth so these are going to be our independent variables and this will be
01:02:13.160 | our dependent variable so here's a super super kind of naive simple language model data set
01:02:23.520 | for the human numbers question so we can do exactly the same thing as before but use the
01:02:30.920 | numericalized version and create tensors this is exactly the same thing as before but now
01:02:35.600 | numericalized and as tensors and we can create a DataLoaders object from
01:02:45.840 | data sets and remember these are data sets because they have a length and we can index
01:02:50.000 | into them right and so we can just grab the first 80% of the tokens as the training set
01:02:55.800 | the last 20% is the validation set like so batch size 64 and we're ready to go so we
01:03:04.640 | really used very very little I mean the only pytorch we used was to create these tensors
01:03:11.360 | and the only fastai we used was to create the DataLoaders, and it's just grabbing directly from the datasets, roughly like the sketch below.
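Roughly, the dataset and DataLoaders just described look like this (a sketch, not the exact notebook code; `nums` is the numericalized token list from the earlier sketch):

```python
# Independent variable: three numericalized tokens; dependent variable: the fourth.
from torch import tensor

seqs = [(tensor(nums[i:i+3]), tensor(nums[i+3]))
        for i in range(0, len(nums)-4, 3)]

cut = int(len(seqs) * 0.8)     # first 80% as training set, last 20% as validation set
# With fastai, something like:
# from fastai.text.all import *
# dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=64, shuffle=False)
```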
01:03:16.960 | So it's really not doing anything clever at all. So let's see if we
01:03:23.160 | can now create a neural network architecture which takes three numericalized words at a
01:03:29.320 | time as input and tries to predict the fourth as dependent variable so here is just such
01:03:39.040 | a language model it's a three layer neural network so we've got a linear layer here which
01:03:50.520 | we're going to use once twice three times and after each of them we call ReLU as per usual
01:04:02.480 | but there's a little bit more going on the first interesting thing is that rather than
01:04:08.760 | each of these being a different linear layer we've just created one linear layer here which
01:04:18.720 | we've reused as you can see one two three times so that's the first thing that's a bit
01:04:27.080 | tricky and so there's a few things going on that are a little bit different than usual but
01:04:32.480 | the basic idea is here we've got an embedding, an nn.Linear, another nn.Linear, and in here we've
01:04:38.600 | used the linear layers and relu so it's very nearly a totally standard three layer neural
01:04:45.120 | network, I guess four really, because there's an output layer. (A sketch of this model in code is below.) Yes, we have a question? Sure.
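Here is a sketch of that model, close to the notebook's version; vocab_sz is the vocabulary size, n_hidden the number of activations, and the i_h / h_h / h_o names follow the nomenclature discussed a little later:

```python
import torch.nn as nn
import torch.nn.functional as F

class LMModel1(nn.Module):
    def __init__(self, vocab_sz, n_hidden):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  # input to hidden
        self.h_h = nn.Linear(n_hidden, n_hidden)     # hidden to hidden, reused three times
        self.h_o = nn.Linear(n_hidden, vocab_sz)     # hidden to output

    def forward(self, x):
        # x is a batch of three word indices; each goes through the same embedding.
        h = F.relu(self.h_h(self.i_h(x[:, 0])))
        h = h + self.i_h(x[:, 1])
        h = F.relu(self.h_h(h))
        h = h + self.i_h(x[:, 2])
        h = F.relu(self.h_h(h))
        return self.h_o(h)      # the softmax is folded into the cross-entropy loss
```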
01:04:51.440 | is there a way to speed up fine-tuning the NLP model ten plus minutes per epoch slows
01:04:57.200 | down the iterative process quite a bit any best practices or tips I can't think of any
01:05:04.760 | obviously other than to say you don't normally need to fine-tune it that often you know the
01:05:13.400 | work is often more at the classifier stage so yeah I tend to kind of just leave it running
01:05:20.280 | overnight or while I have lunch or something like that, yeah, just make
01:05:24.200 | sure you don't sit there watching it, go and do something else. This is where it can be
01:05:31.100 | quite handy to have a second GPU or fire up a second AWS instance or whatever so you can
01:05:37.680 | kind of keep moving while something's training in the background. All right, so what's
01:05:46.960 | going on here in this model to describe it we're actually going to develop a little kind
01:05:52.640 | of pictorial representation and the pictorial representation is going to work like this
01:05:57.560 | let's start with a simple linear model to define this pictorial representation a simple
01:06:04.520 | linear model has an input of size batch size by number of inputs and so we're going to
01:06:10.960 | use a rectangle to represent an input we're going to use an arrow to represent a layer
01:06:23.240 | computation so in this case there's going to be a matrix product for a simple linear
01:06:26.760 | model there'd be a matrix actually sorry this is a single hidden layer model there'll be
01:06:26.760 | a matrix product followed by a ReLU so that's what this arrow represents and out of that
01:06:37.560 | we're going to get some activations and so circles represent computed activations and
01:06:43.320 | there would be we call this a hidden layer it'll be a size batch size by number of activations
01:06:48.080 | that's its size and then to create a neural net we're going to do a second matrix product
01:06:53.920 | and this time a softmax so the computation again represented by the arrow and then output
01:06:59.280 | activations are a triangle so the output would be batch size by num classes so let me show
01:07:05.760 | you the pictorial version of this so this is going to be a legend triangle is output
01:07:14.560 | circle hidden rectangle input and here it is we're going to take the first word as an
01:07:23.440 | input it's going to go through a linear layer and a ReLU and you'll notice here I've deleted
01:07:31.080 | the details of what the operations are at this point and I've also deleted the sizes
01:07:38.520 | so every arrow is basically just a linear layer followed by a nonlinearity so we take
01:07:44.120 | the word one input and we put it through the layer the linear layer and the nonlinearity
01:07:53.840 | to give us some activations so there's our first set of activations and when we put that
01:08:00.680 | through another linear layer and nonlinearity to get some more activations and at this point
01:08:07.760 | we get word two and word two is now goes through a linear layer and a nonlinearity and these
01:08:18.900 | two, when two arrows come together at a circle it means that we add or concatenate, either
01:08:24.360 | is fine the two sets of activations so we'll add the set of activations from this input
01:08:32.400 | to the set of activations from here to create a new set of activations and then we'll put
01:08:38.720 | that through another linear layer and a ReLU and again word three is now going to come
01:08:43.100 | in and go through a linear layer and a ReLU and they'll get added to create another set
01:08:46.680 | of activations and then they'll finally go through a final linear layer and softmax
01:08:54.600 | to create our output activations so this is our model it's basically a standard one two
01:09:05.000 | three four layer model but a couple of interesting things are going on the first is that we have
01:09:11.640 | inputs coming in to later layers and get added so that's something we haven't seen before
01:09:17.240 | and the second is all of the arrows that are the same color use the same weight matrix
01:09:22.920 | so every time we get an input we're going to put it through a particular weight matrix
01:09:28.000 | and every time we go from one set of activations to the next we'll put it through a different
01:09:32.160 | weight matrix and then to go through the activations to the output we'll use a different weight
01:09:36.960 | matrix so if we now go back to the code to go from input to hidden not surprisingly we
01:09:45.840 | always use an embedding so in other words an embedding is the green arrow okay and you'll
01:09:52.400 | see we just create one embedding and here it is, so here's x which is the three words
01:09:59.080 | so here's the first word x0 and it goes through that embedding and word 2 goes through the
01:10:04.800 | same embedding and word 3 index number 2 goes through the same embedding and then each time
01:10:10.440 | you say we add it to the current set of activations and so having put the got the embedding we
01:10:18.040 | then put it through this linear layer and again we get the embedding add it
01:10:24.400 | to the activations and put it through that linear layer and again the same
01:10:28.960 | thing here put it through the same linear layer so H is the orange so this set of activations
01:10:38.880 | we call the hidden state okay and so the hidden state is why it's called H and so if you follow
01:10:47.520 | through these steps you'll see how each of them corresponds to a step in this diagram
01:10:53.480 | and then finally at the end we go from the hidden state to the output which is this linear
01:10:58.560 | layer, hidden to output, okay and then we don't have the actual softmax there because
01:11:08.960 | as you'll remember we can incorporate that directly into the loss function the cross
01:11:12.880 | entropy loss function and using pytorch so one nice thing about this is everything we're
01:11:21.400 | using we have previously created from scratch so there's nothing magic here we've created
01:11:25.800 | our own embedding layer from scratch we've created our own linear layer from scratch
01:11:29.580 | we've created our own relu from scratch we've created our own cross entropy loss from scratch
01:11:36.380 | so you can actually try building this whole thing yourself from scratch. In terms of
01:11:45.640 | the nomenclature, i_h, h_h and h_o, the h refers to hidden: i_h is a layer that goes from
01:11:52.240 | input to hidden, h_h is one that goes from hidden to hidden, and h_o is one that goes from
01:11:56.640 | hidden to output so if any of this is feeling confusing at any point go back to where we
01:12:03.360 | actually created each one of these things from scratch and create it from scratch again make
01:12:07.680 | sure you actually write the code so that nothing here is mysterious so why do we use the same
01:12:19.400 | embedding matrix each time we have a new input word for an input word index 0 1 and 2 well
01:12:26.160 | because conceptually they all represent English words you know for human numbers so why would
01:12:35.040 | you expect them to be a different embedding they all should have the same representation
01:12:39.320 | they all have the same meaning same for this hidden to hidden at each time we're basically
01:12:44.000 | describing to how to go from one token to the next of our language model so we'd expect
01:12:49.160 | it to be the same computation so that's basically what's going on here so having created that
01:12:58.000 | model we can go ahead and instantiate it so we're going to have to pass in the vocab size
01:13:05.080 | for the embedding and the number of hidden right so that's number of activations so here
01:13:13.880 | we create the model and then we create a learner by passing in a model and our data loaders
01:13:20.440 | and a loss function and optionally metrics and we can fit now of course this is not pre-trained
01:13:29.800 | right this is not a application specific learner so it wouldn't know what pre-trained model
01:13:34.600 | to use so this is all random and we're getting somewhere around 45 to 50% or so accuracy
01:13:44.360 | is that any good well you should always compare to random or not random you should always
01:13:49.520 | compare to like the simplest model where the simplest model is like some average or something
01:13:55.080 | so what I did is I grabbed the validation set so all the tokens put it into a Python
01:14:00.600 | standard library counter which simply counts how many times each thing appears I found
01:14:05.520 | that the word thousand is the most common and then I said okay what if we used seven
01:14:12.520 | thousand one hundred and four thousand that's here and divide that by the length of the
01:14:18.600 | tokens and we get 15% which in other words means if we always just predicted I think
01:14:23.600 | the next word will be thousand we would get 15% accuracy but in this model we got around
01:14:30.960 | 45 to 50% accuracy so in other words our model is a lot better than the simplest possible
01:14:37.800 | baseline, so we've learned something useful, that's great. (A rough sketch of that baseline check is below.)
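A rough sketch of that baseline check (here `valid_toks` is assumed to be the list of tokens in the validation set):

```python
from collections import Counter

counts = Counter(valid_toks)
tok, n = counts.most_common(1)[0]      # the most common token, 'thousand' here
print(tok, n / len(valid_toks))        # roughly 0.15, i.e. about 15% accuracy
```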
01:14:43.600 | So the first thing we're going to do is refactor this code, because you can see we've got X going into
01:14:52.280 | IH going into HH going into ReLU X going into IH going into HH going to ReLU X going into
01:14:59.040 | IH going to HH going to ReLU how would you refactor that in Python you would of course
01:15:04.720 | use a for loop so let's go ahead and write that again so these lines of code are identical
01:15:11.920 | in fact these lines of code are identical as is this and we're going to instead of doing
01:15:17.080 | all that stuff manually we create a loop that goes through three times and in each time
01:15:21.380 | it goes IH add to our hidden HH ReLU and then at the end hidden to output so this is exactly
01:15:31.100 | the same thing as before but it's just refactored with a for loop and we can train it again
01:15:38.040 | and again we get basically the same 45 to 50% as you would expect, because it's exactly the same, it's just been refactored. (The refactored version is sketched below.)
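The refactored version, sketched (close to the notebook's code):

```python
import torch.nn as nn
import torch.nn.functional as F

class LMModel2(nn.Module):
    def __init__(self, vocab_sz, n_hidden):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)

    def forward(self, x):
        h = 0                             # 0 broadcasts against the first embedding
        for i in range(3):
            h = h + self.i_h(x[:, i])     # add the next word's embedding
            h = F.relu(self.h_h(h))       # hidden to hidden, then the nonlinearity
        return self.h_o(h)
```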
01:15:43.320 | So here's something crazy: this is a recurrent neural
01:15:54.600 | network, even though it's exactly the same as this, right,
01:16:07.720 | it's just been refactored into a loop and so believe it or not that's actually all an
01:16:14.560 | RNN is: an RNN is a simple refactoring of that model, that deep learning linear model
01:16:25.120 | we saw, I shouldn't say linear model, deep learning model, of simple linear layers with ReLUs
01:16:34.480 | so let's draw our pictorial representation again so remember this was our previous pictorial
01:16:40.480 | representation we can refactor the picture as well so instead of showing these dots separately
01:16:46.880 | we can just take this arrow and represent it as a loop because that's
01:16:53.680 | all that's happening right so the word one goes through an embedding goes into this activations
01:16:59.680 | which then just gets repeated from 2 to the end 2 to n-1 where n at this time is you know
01:17:06.080 | we've got three words basically for each word coming in as well and so we've just refactored
01:17:13.200 | our diagram and then eventually it goes through our blue to create the output so this diagram
01:17:20.160 | is exactly the same as this diagram just replacing the middle with that loop so that's a recurrent
01:17:28.240 | neural net and so h remember was something that we just kept track of here h h h h h
01:17:35.360 | h as we added each layer to it and here we just have it inside the loop we initialize
01:17:44.880 | it as 0 which is kind of tricky and the reason we can do that is that 0 plus a tensor will
01:17:51.640 | broadcast this 0 so that's a little neat feature that's why we don't have to make this a particular
01:17:57.980 | size tensor to start with okay so we're going to be seeing the word hidden state a lot and
01:18:08.560 | so it's important to remember that hidden state simply represents these activations
01:18:14.000 | that are occurring inside our recurrent neural net and a recurrent neural net is just a refactoring
01:18:19.720 | of a particular kind of fully connected deep model so that's it that's what an RNN is no
01:18:33.160 | questions at this point Rachel something that's a bit weird about it though is that for every
01:18:45.080 | batch we're setting our hidden state to 0 even although we're going through the entire
01:18:50.640 | set of numbers the human numbers data set in order so you would think that by the time
01:18:56.580 | you've gone like one two three you shouldn't then forget everything we've learnt when you
01:19:00.560 | go to four five six right it would be great to actually remember where we're up to and
01:19:08.080 | not reset the hidden state back to zero every time so we can absolutely do that we can maintain
01:19:16.640 | the state of our RNN and here's how we would do that rather than having something called
01:19:26.120 | H we'll call it self dot H and we'll set it to zero at the start when we first create
01:19:32.520 | our model everything else here is the same and everything else here is the same and then
01:19:42.200 | there's just one extra line of code here, what's going on here? Well here's the thing: if
01:19:50.200 | h is something which persists from batch to batch then effectively this loop is
01:20:00.400 | kind of becoming infinitely long, right, our deep learning model therefore is getting,
01:20:08.920 | well not infinitely deep, but as deep as the entire size of our data set because every
01:20:13.560 | time we're stacking new layers on top of the previous layers the reason this matters is
01:20:20.500 | that when we then do back propagation when we then calculate the gradients we're going
01:20:24.840 | to have to calculate the gradients all the way back through every layer going all the
01:20:31.000 | way so by the time we get to the end of the data set we're going to be effectively back
01:20:35.480 | propagating not just through this loop but remember self dot H was created also by the
01:20:42.400 | previous call to forward, and the previous call to forward, and the previous call to forward
01:20:45.920 | so we're going to have this incredibly slow calculation of the gradients all the way back
01:20:53.840 | to the start it's also going to use up a whole lot of memory because it's going to have to
01:20:58.880 | store all those intermediate gradients in order to calculate them so that's a problem
01:21:05.400 | and so the problem is easily solved by saying detach and what detach does is it basically
01:21:11.920 | says throw away my gradient history forget that I forget that I was calculated from some
01:21:18.240 | other gradients so the activations are still stored but the gradient history is no longer
01:21:24.040 | stored and so this kind of cuts off the gradient computation and so this is called truncated
01:21:31.960 | back propagation so exactly the same lines of code as the other two models H equals zero
01:21:40.840 | has been moved into self dot H equals zero these lines of code are identical and we've
01:21:46.800 | added one more line of code so the only other thing is it from time to time we might have
01:21:52.400 | to reset self.h to zero, so I've created a method for that, and we'll see how that works shortly. (A sketch of this stateful version is below.)
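A sketch of the stateful version, close to the notebook's code: self.h persists from batch to batch, detach() truncates the gradient history, and reset() puts the hidden state back to zero.

```python
import torch.nn as nn
import torch.nn.functional as F

class LMModel3(nn.Module):
    def __init__(self, vocab_sz, n_hidden):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = 0                              # hidden state, kept between batches

    def forward(self, x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:, i])
            self.h = F.relu(self.h_h(self.h))
        out = self.h_o(self.h)
        self.h = self.h.detach()                # keep the values, drop the gradient history
        return out

    def reset(self):
        self.h = 0                              # called at the start of each epoch
```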
01:21:59.200 | Okay so, back propagation, sorry, I was using the wrong jargon, back propagation through
01:22:08.360 | time is what we call it when we calculate the back prop over going back through this
01:22:17.480 | loop all right now we do need to make sure that the samples are seen in the correct order
01:22:24.920 | you know given that we need to make sure that every batch connects up to the previous batch
01:22:30.960 | so go back to notebook 10 to remind yourself of what that needs to look like but basically
01:22:35.440 | the first batch, we see that the length of our sequences divided by the batch
01:22:42.840 | size, call it m, is 328, so the first batch will be indices 0, m, 2 times m and so
01:22:50.440 | forth; the second batch will be 1, m plus 1, 2 times m plus 1 and so forth, so the details
01:22:56.800 | don't matter, but here's how we do that indexing. So now we can go ahead
01:23:03.800 | and call that group_chunks function to create our training set and our validation
01:23:12.080 | set, and certainly don't shuffle, because that would break everything in terms of the ordering. (A sketch of group_chunks is below.)
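A rough sketch of that indexing: with m = len(dataset) // batch_size, batch i holds items i, i+m, i+2m, ..., so each batch carries on from where the matching item of the previous batch left off.

```python
def group_chunks(ds, bs):
    m = len(ds) // bs                 # number of batches
    new_ds = []
    for i in range(m):                # batch i holds items i, i+m, i+2m, ...
        new_ds += [ds[i + m*j] for j in range(bs)]
    return new_ds

cut = int(len(seqs) * 0.8)
# Roughly how it would be used with fastai (an assumption, not the exact notebook call):
# dls = DataLoaders.from_dsets(group_chunks(seqs[:cut], 64),
#                              group_chunks(seqs[cut:], 64),
#                              bs=64, drop_last=True, shuffle=False)
```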
01:23:20.200 | and then there's one more thing we need to do which is we need to make sure that at the
01:23:25.440 | start of each epoch we call reset because at the start of the epoch we're going back
01:23:33.160 | to the start of our natural numbers so we need to set self.h back to 0 so something
01:23:41.680 | that we'll learn about in part 2 is that fastai has something called callbacks and callbacks
01:23:50.720 | are classes which allow you to basically say during the training loop I want you to call
01:23:56.760 | some particular code and in particular this is going to call this code and so you can
01:24:09.880 | see callbacks are very small or can be very small they're normally very small when we
01:24:13.480 | start training it'll call reset when we start validation it'll call reset so this is each
01:24:20.120 | epoch and when we're all finished fitting it will call reset and what does reset do
01:24:25.400 | it does whatever you tell it to do and we told it to be self.h=0 so if you want to use
01:24:32.520 | a callback, you can simply add it to the callbacks list, cbs, when you create your learner, roughly as in the sketch below.
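Roughly like this (a sketch; it assumes the dls, vocab and LMModel3 from the earlier sketches, and that fastai v2 is installed):

```python
from fastai.text.all import *    # brings in Learner, ModelResetter, accuracy, ...

learn = Learner(dls, LMModel3(len(vocab), 64),
                loss_func=F.cross_entropy, metrics=accuracy,
                cbs=ModelResetter)               # calls reset() at the right times
learn.fit_one_cycle(10, 3e-3)
```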
01:24:41.960 | And so now when we train, that's way better. Okay, so we've now actually kept the state, and this is called
01:24:49.800 | a stateful RNN it's actually keeping the state keeping the hidden state from batch to batch
01:25:00.480 | now we still got a bit of a obvious problem here which is that if you look back to the
01:25:07.000 | data that we created we used these first three tokens to predict the fourth and then the
01:25:17.320 | next three tokens to predict the seventh and then the next three tokens to predict the
01:25:24.640 | one after and so forth effectively what would rather do you would think is is predict every
01:25:32.480 | word not just every fourth word it seems like we're throwing away a lot of signal here which
01:25:37.160 | is pretty wasteful so we want to create more signal and so the way to do that would be
01:25:45.560 | rather than putting this output stage outside
01:25:56.560 | the loop, right, so this dotted area is the bit that's looped, what if we put the output inside
01:26:04.360 | the loop so in other words after every hidden state was created we immediately did a prediction
01:26:10.520 | and so that way we could predict after every time step and our dependent variable could
01:26:15.760 | be the entire sequence of numbers offset by one so that would give us a lot more signal
01:26:22.360 | so we have to change our data so the dependent variable has each of the next three words
01:26:27.520 | after each of the three inputs so instead of being just the numbers from I to I plus
01:26:34.440 | SL as input and then I plus SL plus one as output we're going to have the entire set
01:26:42.120 | offset by one as our dependent variable so and then we can now do exactly the same as
01:26:48.080 | we did before to create our data loaders and so you can now see that each sequence is exactly
01:26:56.080 | the previous is the independent variable and the dependent variable the same thing but
01:27:00.240 | offset by one okay and then we need to modify our model very slightly this code is all exactly
01:27:09.200 | the same as before but rather than now returning one output will create a list of outputs and
01:27:15.960 | we'll append to that list after every element of the loop and then at the end we'll stack
01:27:22.880 | them all up and then this is the same so it's nearly exactly the same okay just a very minor
01:27:29.080 | change. For our loss function, we need to create our own loss function which is just
01:27:37.680 | a cross-entropy loss but we need to just flatten it out so the target gets flattened out the
01:27:47.560 | input gets flattened out, and so then we can pass that as our loss function. Everything else here is the same. (These changes are sketched below.)
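These changes, sketched (close to the notebook's code; sl is the sequence length, 3 in the small example above, and `vocab` and `nums` come from the earlier sketches):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import tensor

sl = 3    # sequence length; a parameter, 3 in the running example

# Dependent variable: the whole sequence, offset by one.
seqs = [(tensor(nums[i:i+sl]), tensor(nums[i+1:i+sl+1]))
        for i in range(0, len(nums)-sl-1, sl)]

class LMModel4(nn.Module):
    def __init__(self, vocab_sz, n_hidden):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = 0

    def forward(self, x):
        outs = []
        for i in range(x.shape[1]):              # step through the whole sequence
            self.h = self.h + self.i_h(x[:, i])
            self.h = F.relu(self.h_h(self.h))
            outs.append(self.h_o(self.h))        # predict after every time step
        self.h = self.h.detach()
        return torch.stack(outs, dim=1)

    def reset(self):
        self.h = 0

def loss_func(inp, targ):
    # Flatten both the predictions and the targets before cross entropy.
    return F.cross_entropy(inp.view(-1, len(vocab)), targ.view(-1))
```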
01:27:56.040 | We can fit, and we've gone from, I can't remember, 58 to 64, so it's
01:28:12.020 | improved a little bit so that's good. You know, we did find this a little flaky, sometimes
01:28:21.800 | it would train really well, sometimes it wouldn't train great, but we often
01:28:26.880 | got this reasonably good answer now one problem here is although effectively we have quite
01:28:38.280 | a deep neural net if you kind of go back to the version so this this version where we
01:28:45.400 | have the loop in it is kind of the normal way to think about an RNN but perhaps an easier
01:28:50.400 | way to think about it is what we call the unrolled version and the unrolled version
01:28:55.960 | is when you look at it like this now if you unroll this stateful neural net we have you
01:29:03.360 | know it is quite deep but every single one of the hidden to hidden layers uses exactly
01:29:11.560 | the same weight matrix so really it's not really that deep at all because it can't really
01:29:19.360 | do very sophisticated computation because it has to use the same weight matrix every
01:29:23.520 | time so in some ways it's not really any smarter than a plain linear model so it would be nice
01:29:32.860 | to try to you know create a truly deep model have multiple different layers that it can
01:29:39.760 | go through so we can actually do that easily enough by creating something called a multi-layer
01:29:45.880 | RNN and all we do is we basically take that diagram we just saw before and we repeat it
01:29:53.640 | but, and this is actually a bit unclear in the diagram, the dotted arrows here are different weight matrices
01:30:02.520 | to the non-dotted arrows here so we can have a different hidden to hidden weight matrix
01:30:09.060 | in the kind of second set of RNN layers and a different weight matrix here for the second
01:30:16.860 | set and so this is called a stacked RNN or a multi-layer RNN and so here's the same thing
01:30:26.640 | in the unrolled version right so this is exactly the same thing but showing you the unrolled
01:30:32.240 | version. Writing this out by hand maybe that's quite a good exercise or particularly this
01:30:41.120 | one would be quite a good exercise but it's kind of tedious so we're not going to bother
01:30:44.800 | instead we're going to use PyTorch's RNN class and so PyTorch's RNN class is basically
01:30:51.040 | doing exactly what we saw here right and specifically this part here and this part here right
01:31:04.960 | but it's nice that it also has an extra number of layers parameter that lets you tell it how
01:31:14.400 | many to stack on top of each other so it's important when you start using PyTorch's
01:31:19.480 | RNN to realize there's nothing magic going on right you're just using this refactored
01:31:26.120 | for loop that we've already seen so we still need the input to hidden embedding this is
01:31:32.680 | now the hidden to hidden with the loop all done for us and then this is the hidden to
01:31:38.160 | output just as before and then this is our hidden just like before so now we don't need
01:31:45.280 | the loop we can just call self.rnn and it does the whole loop for us we can do all the input
01:31:51.640 | to hidden at once to save a little bit of time because thanks to the wonder of embedding
01:31:56.500 | matrices and as per usual we have to go detach to avoid getting a super deep effective network
01:32:05.160 | and then pass it through our output linear layer so this is exactly the same as the previous
01:32:11.760 | model except that we have just refactored it using in RNN and we said we want more than
01:32:19.160 | one layer. So let's request, say, two layers. We still need the ModelResetter, just like
01:32:26.040 | before, because remember nothing's changed. And let's go ahead and fit, and oh, it's terrible. (The nn.RNN version is sketched below.)
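A sketch of the nn.RNN version just described, close to the notebook's code; bs is the batch size (an assumed value here) and n_layers the number of stacked layers:

```python
import torch
import torch.nn as nn

bs = 64   # batch size, assumed to match the DataLoaders

class LMModel5(nn.Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = torch.zeros(n_layers, bs, n_hidden)

    def forward(self, x):
        res, h = self.rnn(self.i_h(x), self.h)   # the hidden-to-hidden loop is done for us
        self.h = h.detach()
        return self.h_o(res)

    def reset(self):
        self.h.zero_()
```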
01:32:36.160 | so why is it terrible well the reason it's terrible is that now we really do have a very
01:32:45.000 | deep model and very deep models are really hard to train because we can get exploding
01:32:55.400 | or disappearing activations so what that means is we start out with some initial state and
01:33:07.560 | we're gradually putting it through all of these layers and all of these layers right
01:33:12.280 | and so each time we're doing a matrix multiplication which remember is just doing a whole bunch
01:33:16.800 | of multiplies and adds and then we multiply and add and we multiply and add and we multiply
01:33:21.800 | and add and we multiply and add and if you do that enough times you can end up with very
01:33:28.120 | very very big results, that would be if the kind of things we're multiplying and
01:33:33.000 | adding by are pretty big or very very very very small results particularly because we're
01:33:38.160 | putting it through the same layer again and again right and why is that a problem well
01:33:46.520 | if you multiply by 2 a few times you get 1 2 4 8 etc and after 32 steps you're already
01:33:54.840 | at 4 billion or if you start at 1 and you multiply by half a few times after 32 steps
01:34:05.040 | you're down to this tiny number so a number even slightly higher or lower than 1 can
01:34:11.040 | kind of cause an explosion or disappearance of a number and matrix multiplication is just
01:34:16.000 | multiplying numbers and adding them up so exactly the same thing happens to matrix multiplication
01:34:20.840 | you kind of have matrices that grow really big or grow really small and when that does
01:34:29.420 | that you're also going to have exactly the same things happening to the gradients they'll
01:34:33.080 | get really big really small and one of the problems here is that numbers are not stored
01:34:39.440 | precisely in a computer they're stored using something called floating point so we stole
01:34:48.720 | this nice diagram from this article called what you never wanted to know about floating
01:34:54.240 | point but were forced to find out, and here we are at the point where we're forced
01:34:57.380 | to find out and it's basically showing us the granularity with which numbers are stored
01:35:02.720 | and so the numbers that are further away from zero are stored much less precisely than the
01:35:09.760 | numbers that are close to zero and so if you think about it that means that the gradients
01:35:15.960 | further away from zero, for very big numbers, could actually become zero
01:35:27.160 | themselves, because you could end up with two numbers that are between
01:35:33.280 | these kind of little gradations here, and you end up with the same thing with the really
01:35:41.120 | small numbers, because for really small numbers, although they're closer together, the
01:35:44.840 | numbers that they represent are also very close together so in both cases they're kind
01:35:49.840 | of the relative accuracy gets worse and worse so you really want to avoid this happening
01:36:02.000 | there's a number of ways to avoid this happening and this is the same for really deep convolutional
01:36:08.760 | neural nets or really deep kind of tabular standard tabular networks anytime you have
01:36:16.040 | too many layers it can become difficult to train and you generally have to use like either
01:36:19.680 | really small learning rates or you have to use special techniques that avoid exploding
01:36:27.400 | or disappearing activations or gradients. For RNNs one of the most popular approaches
01:36:34.380 | to this is to use an architecture called an LSTM and I am not going to go into the details
01:36:43.840 | of an LSTM from scratch today but it's in the book and in the notebook but
01:36:50.920 | the key thing to know about an LSTM is let's have a look is that rather than just being
01:36:59.800 | a matrix multiplication it is this which is that there are a number of linear layers that
01:37:10.120 | it goes through and those linear layers are combined in particular ways and the way they're
01:37:15.160 | combined which is shown in this kind of diagram here is that it basically is designed such
01:37:22.340 | that the that there are like little mini neural networks inside the layer which decide how
01:37:32.360 | much of the previous state is kept how much is thrown away and how much of the new state
01:37:38.560 | is added and by letting it have little neural nets to kind of calculate each of these things
01:37:44.600 | it allows the LSTM layer which again is shown here to decide how much of kind of how much
01:37:54.280 | of an update to do at each time and then with that capability it basically allows it to
01:38:03.560 | avoid kind of updating too much or updating too little and by the way this code you
01:38:14.600 | can refactor, which Sylvain did here, into a much smaller amount of code but these two
01:38:20.440 | things are exactly the same thing.
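For reference, here is a from-scratch sketch of such an LSTM cell, close to the longer version in the book and notebook: four little linear layers act as gates that decide how much of the previous cell state to keep, how much new state to add, and what to expose as output.

```python
import torch
import torch.nn as nn

class LSTMCell(nn.Module):
    def __init__(self, ni, nh):
        super().__init__()
        self.forget_gate = nn.Linear(ni + nh, nh)
        self.input_gate  = nn.Linear(ni + nh, nh)
        self.cell_gate   = nn.Linear(ni + nh, nh)
        self.output_gate = nn.Linear(ni + nh, nh)

    def forward(self, inp, state):
        h, c = state
        h = torch.cat([h, inp], dim=1)
        forget = torch.sigmoid(self.forget_gate(h))       # how much of the old cell state to keep
        c = c * forget
        add = torch.sigmoid(self.input_gate(h)) * torch.tanh(self.cell_gate(h))
        c = c + add                                       # how much new state to add
        out = torch.sigmoid(self.output_gate(h))
        h = out * torch.tanh(c)                           # the new hidden state
        return h, (h, c)
```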
01:38:22.680 | So as I said I'm not going to worry too much about the details of how this works now the
01:38:27.920 | important thing just to know is that you can replace the matrix multiplication in an RNN
01:38:35.320 | with this sequence of matrix multiplications and sigmoids and times and plus and when you
01:38:42.820 | do so you will very significantly decrease the amount of gradient or activation
01:38:51.680 | explosions or disappearances. So that's called an LSTM cell and an RNN which uses this instead
01:39:00.560 | of a matrix multiplication is called an LSTM and so you can replace NN.RNN with NN.LSTM.
01:39:11.560 | Other than that we haven't really changed anything except that LSTMs because they have
01:39:19.720 | more of these layers in them we actually have to make our hidden state have more layers in
01:39:27.040 | as well but other than that we can just replace RNN with LSTM and we can call it just the same
01:39:35.960 | way as we did before we can detach just like before but that's now a list so we have to
01:39:40.640 | detach all of them and pop it through our output layer which is exactly as before reset
01:39:46.320 | is just as before except it's got to loop through each one and we can fit it in exactly
01:39:52.240 | the same way as before, and as you can see we end up with a much better result, which is great. (The LSTM version is sketched below.)
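The LSTM version, sketched (close to the notebook's code): the hidden state is now a pair of tensors, the hidden state and the cell state, so we detach and reset both.

```python
import torch
import torch.nn as nn

bs = 64   # batch size, assumed to match the DataLoaders

class LMModel6(nn.Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]

    def forward(self, x):
        res, h = self.rnn(self.i_h(x), tuple(self.h))
        self.h = [h_.detach() for h_ in h]
        return self.h_o(res)

    def reset(self):
        for h in self.h:
            h.zero_()
```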
01:39:59.320 | We have two questions. Okay perfect. Could we somehow use regularization to try
01:40:05.480 | to make the RNN parameters close to the identity matrix or would that cause bad results because
01:40:11.200 | the hidden layers want to deviate from the identity during training? So we're actually
01:40:19.320 | about to look at regularization so we will take a look. The identity matrix for those
01:40:28.320 | that don't know or don't remember is the matrix where if you multiply by it you get exactly
01:40:33.480 | the same thing that you started with so just like if you multiply by one you get back the
01:40:37.720 | same number you started with. For linear algebra if you multiply by the identity matrix you
01:40:42.600 | get the same matrix you started with and actually one quite popular approach to initializing
01:40:49.960 | the hidden-to-hidden weight matrix is to initialize with an identity matrix which ensures that
01:40:55.960 | you start with something which doesn't have gradient explosions or activation explosions.
01:41:05.560 | There are yeah we'll have and we're about to have a look at some more regularization
01:41:09.440 | approaches so let's wait until we do that. All right next question. Is there a way to
01:41:14.600 | quickly check if the activations are disappearing/exploding? Absolutely just go ahead and calculate them
01:41:24.240 | and we'll be looking at that in a lot more detail in part two but a really great exercise
01:41:30.240 | would be to try to figure out how you can actually output the activations of each layer
01:41:37.320 | and it would certainly be very easy to do that in the in the RNNs that we built ourselves
01:41:42.560 | from scratch because we can actually see the linear layers and so you could just print
01:41:48.760 | them out or print out some statistics or store them away or something like that. FAST AI
01:42:04.440 | has a class called ActivationStats which you can check out if you're interested,
01:42:15.360 | that's a really good way to do this specifically. Okay so yeah so regularization
01:42:30.280 | is important we have potentially a lot of parameters and a lot of layers it would be
01:42:35.880 | really nice if we can do the same kind of thing that we've done with our CNNs and so
01:42:43.080 | forth which is to use more parameters but then use regularization to ensure that we
01:42:48.240 | don't overfit and so we can certainly do that with an LSTM as well and perhaps the best
01:42:55.480 | way to do that is to use something called dropout and dropout is not just used for RNNs
01:43:03.320 | dropout is used all over the place but it works particularly well in RNNs. This is a
01:43:08.200 | picture from the dropout paper and what happens in dropout is, here's a kind of a picture
01:43:16.960 | of three fully connected layers, no sorry I guess it's two, one two, yeah no, three fully
01:43:25.480 | connected layers and so in these two fully connected layers at the start here what we
01:43:32.280 | could do is we could delete some of the activations at random and so this has happened here but
01:43:39.240 | X this is what X means it's like deleting those those activations at random and if we
01:43:45.760 | do so you can see we end up with a lot less computation going on and what dropout does
01:43:51.960 | is each batch each mini batch it randomly deletes a different set of activations from
01:44:01.780 | whatever layers you ask for that's what dropout does so basically the idea is that dropout
01:44:13.240 | helps to generalize because if a particular activation was kind of effectively learning
01:44:22.100 | some input some some particular piece of input memorizing it then sometimes it gets randomly
01:44:28.560 | deleted and so then suddenly it's not going to do anything useful at all so by randomly
01:44:36.000 | deleting activations it ensures that activations can't become over specialized at doing just
01:44:42.320 | one thing because then if it did then the times they're randomly deleted it's it's not
01:44:47.880 | going to work so here is the entire implementation of a dropout layer you pass it some value
01:44:56.280 | P which is the probability that an activation gets deleted so we'll store that away and
01:45:02.120 | so then in the forward you're going to get your activations now if you're not training
01:45:06.960 | so if you're doing validation then we're not going to do dropout right but if we are training
01:45:13.260 | then we create a mask and so the mask is a Bernoulli random variable
01:45:24.680 | so what does Bernoulli random variable mean, it means it's a bunch of ones and zeros where
01:45:29.480 | this is the probability that we get a one which is one minus the probability we get
01:45:36.880 | a zero and so then we just multiply that by our input so that's going to convert some
01:45:44.360 | of the inputs into zeros which is basically deleting them so you should check out some
01:45:50.760 | of the details for example about why we divide by one minus P which is described here
01:45:55.960 | and we do point out here that normally and I would normally in the lesson show you an
01:46:01.760 | example of what Bernoulli does but of course nowadays you know we're getting
01:46:08.400 | to the advanced classes you're expected to do it yourself so be sure to create a little
01:46:12.760 | cell here and make sure you actually create a tensor and then run Bernoulli underscore
01:46:18.120 | on it and make sure you see exactly what it's doing, so that then you can understand this class. (A sketch of both is below.)
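A sketch of such a from-scratch dropout layer, plus the suggested bernoulli_ experiment:

```python
import torch
import torch.nn as nn

class Dropout(nn.Module):
    def __init__(self, p):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training:
            return x                                       # no dropout at validation time
        mask = x.new(*x.shape).bernoulli_(1 - self.p)      # 1 with probability (1-p), else 0
        return x * mask.div_(1 - self.p)                   # rescale so the expected activation is unchanged

# The suggested experiment: see what bernoulli_ actually produces.
t = torch.zeros(10)
print(t.bernoulli_(0.5))     # a random mix of ones and zeros
```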
01:46:23.980 | Now of course we don't have to use this class we made ourselves, we can just use nn.Dropout,
01:46:31.800 | but you can use this class yourself because it does the same thing so again you know we're
01:46:36.480 | trying to make sure that we know how to build stuff from scratch this special self dot training
01:46:44.000 | is set for every module automatically by fastai based on whether or not you're in the validation
01:46:52.660 | part of your training loop or the training part of your training loop it's also part
01:46:59.520 | of PyTorch and in PyTorch if you're not using fast.ai you have to call the train method
01:47:04.480 | on a module to set training to true and the eval method to set it to false for every module
01:47:09.920 | inside some other module so that's one great approach to regularization another approach
01:47:18.340 | which I've only seen used in recurrent neural nets is activation regularization and temporal
01:47:28.080 | activation regularization which is very very similar to the question that we were just
01:47:32.000 | asked what happens with activation regularization is it looks a very similar to weight decay
01:47:42.760 | but rather than adding some multiplier times the sum of squares of the weights we add some
01:47:52.920 | multiplier by the sum of squares of the activations so in other words we're basically saying we're
01:48:00.280 | not just trying to decrease the weights but decrease the total activations and then similarly
01:48:10.000 | we can also see what's the difference between the activations from the previous time step
01:48:19.680 | to this time step so take the difference and then again squared times some value so these
01:48:28.800 | are two hyper parameters alpha and beta the higher they are the more regularized your
01:48:34.200 | model and so with TAR it's going to say no layer of the LSTM should too dramatically
01:48:43.080 | change the activations from one time step to the next and then for alpha it's saying
01:48:50.880 | no layer of the LSTM should create too large activations and so it wouldn't actually
01:48:57.300 | create these large activations or large changes unless the loss improved by enough to make it worth it. (A sketch of how these two penalties look is below.)
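A sketch of how those two penalties look as code; here `out` stands for the LSTM's activations with shape (batch, sequence, hidden), which is an assumption for illustration (in AWD-LSTM the AR term actually uses the dropped-out activations and the TAR term the raw ones). In practice fastai applies this for you via the RNNRegularizer callback, passing in your alpha and beta.

```python
def ar_tar_penalty(out, alpha=2.0, beta=1.0):
    ar  = alpha * out.pow(2).mean()                           # activation regularization
    tar = beta * (out[:, 1:] - out[:, :-1]).pow(2).mean()     # temporal AR: change between time steps
    return ar + tar       # added to the usual cross-entropy loss
```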
01:49:03.180 | Okay, so there's then, I think, just one more thing we need to know about, which
01:49:16.320 | is called weight tying and weight tying is a very minor change and let's have a look
01:49:22.400 | at it here so this is the embedding we had before this is the LSTM we had before this
01:49:27.520 | is where we're going to introduce dropout this is the hidden to output linear layer
01:49:33.000 | we had before but we're going to add one more line of code which is the hidden to output
01:49:42.720 | weights are actually equal to the input to hidden weights now this is not just setting
01:49:50.960 | them once this is actually setting them so that they're a reference to the exact same
01:49:55.340 | object in memory the exact same tensor in memory so the weights of the hidden to output
01:50:00.560 | layer will always be identical to the weights of the input to hidden layer and this is called
01:50:06.920 | weight tying and the reason we do this is because conceptually in a language model predicting
01:50:14.480 | the next word is about kind of converting activations into English words, while an embedding
01:50:21.080 | is about converting English words to activations and there's a reasonable hypothesis which
01:50:28.180 | would be that well those are basically exactly the same computation or at least the reverse
01:50:33.000 | of it so why shouldn't they use the same weights and it turns out lo and behold yes if you
01:50:37.920 | use the same weights then actually it does work a little bit better so then here's our
01:50:44.000 | forward which is to do the input to hidden do the RNN apply the dropout do the detach
01:50:52.280 | and then apply the hidden to output which is using exactly the same weights as the input
01:50:56.720 | to hidden, and reset is the same. We haven't created the RNN regularizer from scratch here but
01:51:05.640 | you can add it as a callback passing in your alpha and your beta if you call text learner
01:51:17.200 | instead of Learner it will add the ModelResetter and the RNNRegularizer for you, so that's
01:51:24.200 | one of the things TextLearner does. So this code is the same as this code and so
01:51:30.360 | we can then train a model again, and let's also add weight decay, and look at this, we're getting up close to 90% accuracy. (The weight-tied model and TextLearner call are sketched below.)
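A sketch close to the notebook's final model: dropout plus weight tying, where h_o shares the embedding's weight tensor. The extra outputs, the raw and dropped-out activations, are what fastai's RNN callbacks use for the AR/TAR penalties; TextLearner adds the ModelResetter and RNNRegularizer for us. The sizes (64 hidden units, 2 layers, p=0.5, wd=0.1) are example values, and dls and vocab come from the earlier sketches.

```python
import torch
import torch.nn as nn
from fastai.text.all import *

bs = 64   # batch size, assumed to match the DataLoaders

class LMModel7(nn.Module):
    def __init__(self, vocab_sz, n_hidden, n_layers, p):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
        self.drop = nn.Dropout(p)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h_o.weight = self.i_h.weight       # weight tying: the exact same tensor in memory
        self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]

    def forward(self, x):
        raw, h = self.rnn(self.i_h(x), tuple(self.h))
        out = self.drop(raw)
        self.h = [h_.detach() for h_ in h]
        return self.h_o(out), raw, out          # extra outputs for the RNN regularizer

    def reset(self):
        for h in self.h:
            h.zero_()

learn = TextLearner(dls, LMModel7(len(vocab), 64, 2, 0.5),
                    loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(15, 1e-2, wd=0.1)
```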
01:51:37.800 | So we've covered a lot in this lesson, but the amazing
01:51:49.640 | thing is that we've just replicated all of the pieces in an AWD LSTM all of the pieces
01:51:58.220 | in this state-of-the-art recurrent neural net which we've showed we could use in the previous
01:52:04.940 | notebook to get what was until very recently state-of-the-art results for text classification
01:52:12.000 | and far more quickly and with far less compute and memory than the more modern approaches
01:52:21.360 | in the last year or so which have beaten that benchmark so this is a really efficient really
01:52:29.940 | accurate approach and it's still the state-of-the-art in many many academic situations and it's
01:52:39.880 | still very widely used in industry and so it's pretty cool that we've actually seen
01:52:43.680 | how to write it from scratch so the main thing to mention the further research is to have
01:52:52.080 | a look at the source code for AWD LSTM and fast AI and see if you can see how the things
01:52:58.840 | in AWD LSTM, you know, those lines of code, map to the concepts
01:53:05.520 | that we've seen in this chapter. Rachel do we have any questions? So here we have come
01:53:14.000 | to the conclusion of our what was originally going to be seven lessons and turned into
01:53:18.760 | eight lessons. I hope that you've got a lot out of this, thank you for staying with us.
01:53:29.920 | What a lot of people now do when they finish, at least people who have finished
01:53:35.000 | previous courses, is they go back to lesson one and try and repeat it but doing a lot
01:53:41.120 | less looking at the notebooks a lot more doing stuff from scratch yourself and going deeper
01:53:48.000 | into the assignments so that's one thing you could do next. Another thing you could do
01:53:53.760 | next would be to pick out a Kaggle competition to enter or pick a book that you want to read
01:54:01.920 | about deep learning or a paper and team up with some friends to do like a paper reading
01:54:09.900 | group or a book reading group you know one of the most important things to keep the learning
01:54:14.800 | going is to get together with other people on the learning journey. Another great way
01:54:20.800 | to do that of course is through the forums so if you haven't been using the forums much
01:54:24.920 | so far no problem but now might be a great time to get involved and find some projects
01:54:29.640 | that are going on that look interesting and it's fine if you you know you don't have to
01:54:35.360 | be an expert right obviously any of those projects the people that are already doing
01:54:39.680 | it are going to know more about it than you do at this point because they're already doing
01:54:44.000 | it but if you drop into a thread and say hey I would love to learn more about this how do
01:54:48.280 | I get started or have a look at the wiki posts to find out and try things out you can start
01:54:54.800 | getting involved in other people's projects and help them out. So yeah and of course don't
01:55:06.240 | forget about writing so if you haven't tried writing a blog post yet maybe now is a great
01:55:10.240 | time to do that pick something that's interesting to you especially if it's something in your
01:55:15.240 | area of expertise at work or a hobby or something like that or specific to where you live maybe
01:55:21.120 | you could try and build some kind of text classifier or text generator for particular
01:55:27.760 | kinds of text that are that you know about you know that would be that would be a super
01:55:33.360 | interesting thing to try out and be sure to share it with the folks on the forum. So there's
01:55:39.240 | a few ideas so don't let this be the end of your learning journey you know keep keep going
01:55:45.840 | and then come back and try part two if it's not out yet obviously you'll have to wait
01:55:51.480 | until it is out, but if it is out you might want to kind of spend a couple of months
01:55:56.960 | you know really experimenting with this before you move on to part two to make sure that
01:56:01.080 | everything in part one feels pretty pretty solid to you. Well thank you very much everybody
01:56:11.040 | for your time we've really enjoyed doing this course it's been a tough course for us to
01:56:16.400 | teach because with all this COVID-19 stuff going on at the same time I'm really glad
01:56:20.840 | we've got through it I'm particularly particularly grateful to Sylvain who has been extraordinary
01:56:28.280 | in really making so much of this happen and particularly since I've been so busy with
01:56:34.080 | COVID-19 stuff around masks in particular it's really a lot thanks to Sylvain that everything
01:56:41.400 | has come together and of course to Rachel who's been here with me on on every one of
01:56:47.280 | these lessons thank you so much and I'm looking forward to seeing you again in a future course
01:56:56.200 | thanks everybody