
Lesson 8 - Deep Learning for Coders (2020)


Chapters

0:00 Introduction
0:15 Natural Language Processing
5:10 Building a Language Model
12:18 Get Files
13:28 Word Tokenizer
16:30 Word Tokenizer Rules
17:38 Subword Tokenizer
19:01 Setup
23:23 Numericalization
25:43 Batch
29:19 Data Loader
30:18 Independent Variables
30:36 Dependent Variables
31:09 Data Blocks
31:56 Class Methods
33:25 Language Model
35:08 Save Epoch
35:32 Save Encoder
37:55 Text Generation
42:05 Language Models
42:49 Classification
43:10 Batch Size
44:20 Pad Tokens
45:29 Text Classifier
48:31 Data Augmentation
48:54 Predicting the Next Word
51:00 Data Augmentation on Text
51:51 Generation
58:11 Creating Datasets

Transcript

Hi everybody, and welcome to Lesson 8, the last lesson of Part 1 of this course. Thanks so much for sticking with us. We've got a very interesting lesson today, where we're going to dive into natural language processing. As a reminder, we did see natural language processing in Lesson 1.

This was it here. We looked at a dataset where we could pass in movie reviews like this one and get back probabilities that the sentiment is positive or negative. And we trained it with a very standard-looking classifier training approach. But we haven't really talked about what's going on behind the scenes there, so let's do that.

We'll also learn how to make it better. We were getting about 93% accuracy for sentiment analysis, which is actually extremely good, and it only took a bit over 10 minutes. But let's see if we can do better. So we're going to go to notebook number 10.

In notebook number 10 we're going to start by talking about what we need to do to train an NLP classifier. Sentiment analysis (is this movie review positive or negative?) is just a classifier: the dependent variable is binary, and the independent variable is the interesting bit.

So we're going to talk about that. But before we do, let's talk about the pre-trained model that got used here, because the reason we got such a good result so quickly is that we're fine-tuning a pre-trained model. So what is this pre-trained model, exactly?

Well, the pre-trained model is actually a pre-trained language model. So what is a language model? A language model is a special kind of model where we try to predict the next word of a sentence. So, for example, if our language model received "even if our language model knows the", its job would be to predict the next word (here, "basics").

Now, the language model that we used as our pre-trained model was actually trained on Wikipedia. We took all the non-trivially sized articles in Wikipedia and built a language model which attempted to predict the next word of every sequence of words in every one of those articles.

It was a neural network, of course. We then take those pre-trained weights, and those are the weights that were automatically loaded in when we said text_classifier_learner. So conceptually, why would it be useful to pre-train a language model? How does that help us do sentiment analysis, for example?

Well, just like an ImageNet model has a lot of information about what pictures look like and what they consist of, a language model knows a lot about what sentences look like and quite a bit about the world. For example, consider a language model that is going to be able to predict the end of the sentence "In 1998 this law was passed by President ___".

To predict that correctly, a language model would have to know a whole lot of stuff. It would have to know how the English language works in general and what kinds of sentences go in what places: that after the word "President" there would usually be somebody's surname.

It would need to know what country that law was passed in, and who was president of that country in 1998. So it would have to know a lot about the world and a lot about language. Creating a really good language model is really hard.

In fact, this is something that people spend many millions of dollars on, creating language models from huge datasets. Our particular one doesn't take particularly long to pre-train, but there's no particular reason for you to pre-train one of these language models yourself, because you can download them through fastai or from other places.

So what happened in Lesson 1 is that we downloaded this pre-trained Wikipedia model and then fine-tuned it: as per usual, we threw away the last layer, which was specific to predicting the next word of Wikipedia, and fine-tuned the model, initially just the last layer, to learn to predict the sentiment of movie reviews, and then, as per usual, fine-tuned the rest of the model. That got us 93%.

There's a trick we can use, though. We start with this Wikipedia language model (the particular subset we use is called WikiText-103), and rather than jumping straight to a classifier, as we did in Lesson 1, we can do even better if we first create an IMDB language model, that is to say, a language model that learns to predict the next word of a movie review.

The reason we do that is that it helps the model learn about IMDB-specific kinds of words: it will learn a lot more about the names of actors and directors, and about the kinds of words people use in movie reviews. And if we do that first, we would hope to end up with a better classifier.

That's what we're going to do in the first part of today's lesson. We're going to do a lot of it from scratch, and show you how to do each of the steps yourself, even though later we'll show you how fastai does it all for you.

So how do we build a language model? As we point out here, sentences can be different lengths, and documents like movie reviews can be very long, so how do we go about this? Well, a word is basically a categorical variable, and we already know how to use categorical variables as independent variables in a neural net: we make a list of all the possible levels of the categorical variable (which we call the vocab), and then we replace each category with its index, so they all become numbers.

We create an initially random embedding matrix, where each row is for one element of the vocab, and we make that the first layer of a neural net. That's what we've done a few times now; remember, we've even created our own embedding layer from scratch.

So we can do the same thing with text: we make a list of all the possible words in the whole corpus (the whole dataset), replace each word with its index in the vocab, and create an embedding matrix. In order to create that list of all possible words, let's first concatenate all the documents (the movie reviews) together into one big long string and split it into words. Then our independent variable will basically be that sequence starting with the first word in the long list and ending with the second-last word, and our dependent variable will be the sequence starting with the second word and ending with the last. So they're offset by one: as you move through the first sequence, you're trying to predict the next word in the second one. That's essentially what we're doing; we'll see more detail in a moment.
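To make the offset-by-one idea concrete, here is a tiny plain-Python sketch. The toy corpus and variable names here are invented purely for illustration; they're not from the notebook:

```python
# Toy corpus, purely to illustrate the offset-by-one setup
tokens = "xxbos the movie was great xxbos the acting was terrible".split()

vocab = sorted(set(tokens))                   # all unique tokens
word2idx = {w: i for i, w in enumerate(vocab)}
nums = [word2idx[w] for w in tokens]          # numericalized corpus

x = nums[:-1]   # independent variable: first token .. second-last token
y = nums[1:]    # dependent variable: second token .. last token (shifted by one)

# At each position i, the model sees x[: i + 1] and tries to predict y[i],
# i.e. the very next token in the corpus.
```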

Now, when we create our vocab by finding all the unique words in this concatenated corpus, a lot of the words we see will already be in the vocab (and embedding matrix) of the pre-trained Wikipedia model, but there will also be some new ones: there might be particular actors who don't appear in Wikipedia, or some informal slang words, and so forth. So when we build our vocab, and then our embedding matrix, for the IMDB language model, any words that are in the vocab of the pre-trained model we'll just use as is, but for new words we'll create a new random vector.

So here's the process we're going to go through. First we take our big concatenated corpus and turn it into a list of tokens, which could be words, characters or substrings; that's called tokenization. Then we do numericalization, which is basically two steps: create the vocab, and replace each word with its index in that vocab. Then we need to create a data loader whose independent variable is lots of sequences of tokens from the IMDB corpus, with the same thing offset by one as the dependent variable. And then we need to create a language model.

Now, a language model is going to have to handle input lists that can be arbitrarily big or small, and we're going to use something called a recurrent neural network to do this, which we'll learn about later. So far we've always assumed that everything is a fixed-size input, so we're going to have to mix things up a little bit here and deal with architectures that can handle different sizes. For this notebook, notebook 10, we're going to treat the architecture as a black box: it's just going to be a neural net, and later in the lesson we'll delve inside what's happening in it.

So let's start with the first of these steps, which is tokenization: converting a text into a list of words, or a list of tokens. What does that mean? Is a full stop a token? What about "don't": is that a single word, is it two words, or should we convert it to "do not"? What about long medical words that are made up of lots of pieces of medical jargon stuck together? What about hyphenated words? And, really interestingly, what about languages like Polish or Turkish, where you can create really long words all the time, words that are actually lots of separate parts concatenated together? Or languages like Japanese and Chinese that don't use spaces at all, and don't really have a well-defined idea of a word?

There's no right answer, but there are basically three approaches. First, a word-based approach, which is what we use by default for English at the moment (although that might change): we split a sentence on spaces, and then there are some language-specific rules, for example turning "don't" into "do n't" and treating punctuation marks as separate tokens most of the time. Second, really interestingly, there are tokenizers that are subword based, where we split words into smaller parts based on the most commonly occurring substrings; we'll see that in a moment. Or third, the simplest: character-based, where we split a sentence into its characters. We're going to look at word and subword tokenization in this notebook, and if you look at the questionnaire at the end you'll be asked to create your own character-based tokenizer, so please make sure you do that if you can; it'll be a great exercise.

fastai doesn't invent its own tokenizers; we just provide a consistent interface to a range of external tokenizers, because there are a lot of great tokenizers out there, and this way you can switch between them pretty easily. So let's grab our IMDB dataset like we did in Lesson 1, and in order to try out a tokenizer, let's grab all the text files: instead of calling get_image_files, we'll call get_text_files. To have a look at what that's doing, don't forget we can look at the source code, and you can see it's actually calling a more general function called get_files and saying what extensions it wants. So if anything in fastai doesn't work quite the way you want, and there isn't an option that does what you need, you can always look underneath to see what we're calling, and call the lower-level stuff yourself.

So files is now a list of files. We can grab the first one, open it, read it, and have a look at the start of this review, and here it is. At the moment the default English word tokenizer we use is called spaCy, which uses a pretty sophisticated set of rules, with special rules for particular words and URLs and so forth. We're just going to go ahead and say WordTokenizer, which will automatically use fastai's default word tokenizer (currently spaCy). If we pass a list of documents (here we'll just make it a list of one document) to the tokenizer we just created, and grab the first result (since we just passed in a list), it shows us the tokenized version. You can see here that "This movie, which I just discovered at the video store, has" and so on has been split up: "It's" has been changed into "It 's", the comma has been made a separate punctuation token, and so forth. So you can see how it has tokenized this review.
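As a rough sketch of what those cells look like (the path and folder names follow the standard fastai IMDB layout, so treat the details as assumptions rather than an exact copy of the notebook):

```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)
files = get_text_files(path, folders=['train', 'test', 'unsup'])

txt = files[0].open().read()      # the raw text of the first review
txt[:75]

spacy = WordTokenizer()           # fastai's default English word tokenizer (spaCy under the hood)
toks = first(spacy([txt]))        # the tokenizer takes a list of documents; grab the first result
print(coll_repr(toks, 30))        # show the first 30 tokens

tkn = Tokenizer(spacy)            # fastai wrapper that adds the special xx* rules described below
print(coll_repr(tkn(txt), 30))
```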

Let's look at a more interesting one: "The U.S. dollar…". You can see here that it actually knows "U.S." is special, so it doesn't split off the full stop of "U.S." as a separate token, and it knows "1.00" is special too. So there's a lot of tricky stuff going on in spaCy to be as thoughtful about this as possible.

fastai then provides a Tokenizer wrapper which adds some additional functionality to any tokenizer, as you can see here. For example, the word "it" here, which previously started with a capital letter, has been turned into lowercase, and a special token xxmaj has appeared in front of it. Everything starting with xx is a special fastai token, and xxmaj means that the next word originally started with a capital letter. Here's another one: "this" used to have a capital T, so we make it lowercase and add xxmaj. And xxbos means this is the start of a document. So there are a few special rules going on there.

Why do we do that? Well, if we didn't lowercase "It" or "This", then the capitalized version and the lowercase version would be two different words in the embedding matrix, which probably doesn't make sense: regardless of capitalization, they basically mean the same thing. Having said that, sometimes capitalization might matter, so we want to say: use the same embedding every time you see the word "this", but add some kind of marker saying that it was originally capitalized. That's why we do it like this. There are quite a few rules; you can see them in defaults.text_proc_rules, and you can look at the source code. Here's a summary of what they do, but let's look at a few examples.

If we use the tokenizer we created and pass in, for example, this text, you can see the way it's tokenized. We get xxbos (beginning of stream, or beginning of string, in other words beginning of document), this HTML entity has become a real Unicode character, and we've got the xxmaj we discussed. Here "www" has been replaced by "xxrep 3 w", which means the letter w is repeated three times. For things where you've got, say, a hundred exclamation marks in a row, or a word with fifty o's in it, this is a much better representation. And you can see that all-uppercase words have been replaced with xxup followed by the lowercase word. So those are some of the rules in action. You can also see that multiple spaces have been collapsed, just making everything standard tokens.

So that's the word tokenizer. The really interesting one is the subword tokenizer. Why would you need a subword tokenizer? Well, consider for example this sentence here, which is Chinese for "my name is Jeremy". The interesting thing about it is that there are no spaces, because there are no spaces in Chinese, and there isn't really a great sense of what a word is in Chinese. In this particular sentence it's fairly clear what the words are, but it's not always obvious; sometimes the parts of a word are actually split up, some at the start of a sentence and some at the end. So you can't really do word tokenization for something like Chinese. Instead we use subword tokenization, where we look at a corpus of documents and find the most commonly occurring groups of characters, and those commonly occurring groups become the vocab. In this example, the characters for "my" would probably appear often, and so would the characters for "name", but my westernized name wouldn't be common at all, so it would probably be split into separate pieces.

So let's look at an example. Let's grab the first 2000 movie reviews and create the default subword tokenizer, which currently uses something called SentencePiece (that might change). And now we're going to use something very important, called setup. Transforms in fastai always have this special setup method; it often doesn't do anything, but it's always there, and some transforms, like a subword tokenizer, actually need to be set up before you can use them. In other words, you can't tokenize into subwords until you know what the most commonly occurring groups of letters are. So passing this list of texts to setup will train the subword tokenizer: it will find those commonly occurring groups of letters.

Having done that (this is just for experimenting), we can pass in whatever vocab size we want for our subword tokenizer, set it up with our texts, and then have a look at a particular sentence. For example, if we create a subword tokenizer with a thousand tokens, it returns this tokenized string. This long underscore character is what we replace space with, because now that we're using subword tokens we want to know where the words actually start and stop. You can see that a lot of words are common enough sequences of letters that they get their own vocab item, whereas "discovered" wasn't common enough, so it got split into pieces; "video" appears often enough to be its own token, whereas "store" didn't, so it gets split up too. You get the idea. If we wanted a smaller vocab size then, as you see, even "this" doesn't get its own token, while "movie" is so common that it does.

We have a question: "How can we determine if the given pre-trained model, in this case WikiText-103, is suitable enough for our downstream task? If we have limited vocab overlap, should we add an additional dataset to create a language model from scratch?" If it's in the same language, so if you're doing English, it's almost always sufficient to use Wikipedia. We've played around with this a lot, and it was one of the key things that Sebastian Ruder and I found when we created the ULMFiT paper: before that time people really thought you needed corpus-specific pre-trained models, but we discovered you don't, just as you don't that often need corpus-specific pre-trained vision models; ImageNet models work surprisingly well across a lot of different domains. Wikipedia has a lot of words in it, and I haven't come across an English corpus that didn't have a very high level of overlap with Wikipedia. On the other hand, if you're doing ULMFiT with, say, genomic sequences, or Greek, or whatever, then obviously you're going to need a different pre-trained model.

Once we get to a 10,000-word vocab, as you can see, basically every common word becomes its own item in the subword vocab, except, say, "discovered", which becomes "discover ed". My guess is that subword approaches are going to become the most common; maybe they will be by the time you watch this. We've got some fiddling to do to get this working really well for fine-tuning, but I think I know what we have to do, so hopefully we'll get it done pretty soon.
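A minimal sketch of those subword experiments, in the spirit of the notebook (it reuses `files` and `txt` from the earlier snippet; the exact vocab sizes shown are illustrative):

```python
# Read the raw text of the first 2000 reviews
txts = L(o.open().read() for o in files[:2000])

def subword(vocab_sz):
    sp = SubwordTokenizer(vocab_sz=vocab_sz)   # currently backed by SentencePiece
    sp.setup(txts)                             # setup "trains" it: finds the most common substrings
    return ' '.join(first(sp([txt]))[:40])

subword(1000)    # common words get their own token; rarer ones like "discovered" are split up
subword(10000)   # with a bigger vocab, almost every common word becomes a single token
```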
All right, so after we split the text into tokens, the next thing to do is numericalization. Let's go back to our word-tokenized text, which looks like this. In order to numericalize, we will first need to call setup, so to save a bit of time let's create a subset of our corpus, just a couple of hundred of the reviews. Here's an example of one. We create our Numericalize object and call setup, and that's the thing that's going to create the vocab for us. After that we can take a look at the vocab; this coll_repr is showing us a representation of a collection (it's what the L class uses underneath), and you can see that the vocab starts with the special tokens and then the English tokens appear in order of frequency. The default is a vocab size of 60,000, so that will be the size of your embedding matrix by default, and if there are more than 60,000 unique words in your corpus then the least common ones will be replaced with a special xxunk (unknown) token. That helps us avoid having a too-big embedding matrix.

So now we can treat the Numericalize object we created as if it were a function, as we so often do in both fastai and PyTorch, and when we do, it replaces each of our tokens with a number: 2, for example, is xxbos (count 0, 1, 2 in the vocab) and 8 is xxmaj (0, 1, 2, 3, 4, 5, 6, 7, 8). And we can convert them back by indexing into the vocab and get back what we started with. So now we have done the tokenization and the numericalization, and the next thing we need to do is create batches.
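Roughly what those numericalization cells look like, continuing from the tokenizer snippets above (`tkn` and `txts` are assumed from there):

```python
toks = tkn(txt)                        # word-tokenize one review with the fastai Tokenizer wrapper
toks200 = txts[:200].map(tkn)          # tokenize a couple of hundred reviews, enough for a quick vocab

num = Numericalize()
num.setup(toks200)                     # creates the vocab (max 60,000 by default; rare words -> xxunk)
coll_repr(num.vocab, 20)               # special tokens first, then tokens in order of frequency

nums = num(toks)[:20]                  # calling the object like a function maps tokens to indices
' '.join(num.vocab[o] for o in nums)   # indexing back into the vocab recovers the tokens
```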
Let's say this is the text that we want to create batches from. If we tokenize that text it converts into this, so let's write it out: "xxbos xxmaj in this chapter", then "we will go back over the example of classifying", and the next row starts "movie reviews we studied in chapter one and dig deeper under the surface . xxmaj first we will look at the", etc. So we've taken these 90 tokens and, to create a batch size of six, we've broken the text up into six contiguous parts, each of length 15: six rows and 15 columns, so 6 by 15.

Now, ideally we would just provide that to our model as a batch, and if that were all the data we had, we could. But that's not going to work for IMDB, because once we concatenate all the reviews together, if we want to use a batch size of 64 we'll have 64 rows, and there are probably a few million tokens in IMDB, so a few million divided by 64 across is going to be way too big to fit in our GPU. So what we do instead is split up that big wide array horizontally. We start with "xxbos xxmaj in this chapter", and then down here "we will go back over the example of classifying", "movie reviews we studied in chapter one and dig deeper under the surface", etc., and this becomes our first mini-batch.

You can see what's happened: the second row is actually continuing what was way over here, so we've basically treated each row as totally independent, and the second mini-batch follows on from the first. Row one of the second mini-batch joins up with row one of the first, and row two of the second mini-batch joins up with row two of the first. Please look at this example super carefully, because we've found that this is something a lot of students get confused about every year; it's just not what they expect to see happen. So go back over it and make sure you understand what's happening in this little example. That's what our mini-batches are going to be.

The good news is that you don't have to do all these fiddly steps yourself: you can just use the language model data loader, LMDataLoader. If we take all the tokens from the first 200 movie reviews, map them through our Numericalize object so that we've got numericalized versions of all those tokens, pass them into LMDataLoader, and then grab the first item from the data loader, we get something of shape 64 by 72. Why is that? Well, 64 is the default batch size and 72 is the default sequence length (in the little example above we used a sequence length of five, but in practice the default is 72). If we grab the first of our independent variables, take the first few tokens and look them up in the vocab, here it is: "xxbos xxmaj this movie , which i just xxunk at the video store" (so that's interesting: "discovered" was apparently not common enough to be in the vocab) "has apparently sit around for a". And if we look at the exact same thing for the dependent variable, rather than starting "xxbos xxmaj this movie" it starts "xxmaj this movie", so you can see it's offset by one, which means the end, rather than being "around for a", is "for a couple". So this is exactly what we want: it's offset by one. That's looking good.

So we can now go ahead and use these ideas to try to build our even better IMDB sentiment analysis, and the first step, as we discussed, is to create the language model. Let's just use the fastai built-in stuff to do it for us rather than doing all that messing around manually. We can create a DataBlock: our blocks will be a TextBlock from folder, the items will be the text files from these folders, we'll split things randomly, and then we turn it into data loaders with a batch size of 128 and a sequence length of 80. Note that for our blocks we're not passing in a class directly but a class method, and that's so that the tokenization can be saved, cached, in some path, so that the next time we run this it won't have to do it all from scratch. That's why we have a slightly different syntax here.

Once we've run this we can call show_batch, and you can see here we've got, for example, "what xxmaj i've read xxmaj death" and so on; that's the independent variable, and the dependent variable is the same thing offset by one, so we don't have the "what" any more, it just goes straight to "xxmaj i've read", and at the end, where this had "also this", the dependent variable has "also this is". So it's offset by one, just like we were hoping for. show_batch is automatically de-numericalizing it for us, turning it back into strings, but you should look at the actual x and y to confirm that you actually see numbers there; a good exercise is to make sure you can grab a mini-batch from dls_lm.
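A sketch of both routes, roughly following the notebook (the folder names and split fraction are the usual IMDB ones, so treat them as assumptions):

```python
# Low-level route: LMDataLoader on numericalized tokens (toks200 and num from the snippets above)
nums200 = toks200.map(num)
dl = LMDataLoader(nums200)
x, y = first(dl)
x.shape, y.shape                              # torch.Size([64, 72]) each: bs=64, seq_len=72
' '.join(num.vocab[o] for o in x[0][:20])     # independent variable
' '.join(num.vocab[o] for o in y[0][:20])     # the same text, offset by one token

# High-level route: a DataBlock that does tokenization, numericalization and batching for us
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),   # class method, so tokenization can be cached
    get_items=get_imdb,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

dls_lm.show_batch(max_n=2)
```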
So now that we've got the data loaders, we can fine-tune our language model. Fine-tuning the language model means we create a learner which is going to learn to predict the next word of a movie review. So that's our data, the data loaders for the language model; this is the pre-trained model, something called AWD-LSTM, which we'll see how to create from scratch (or something similar to it) in a moment; and this dropout parameter, which we'll learn about later, controls how much regularization we want. For metrics, we know about accuracy; perplexity is not particularly interesting, so I won't discuss it, but feel free to look it up if you're interested. And let's train with fp16 to use less memory on the GPU; on any modern GPU it will also run two or three times faster.

So this gray bit here has been done for us: the pre-training of the language model on WikiText-103. Now we're up to this bit, fine-tuning the language model on IMDB. Let's do one epoch. As per usual, using a pre-trained model automatically calls freeze, so we don't have to; this is going to train only the new embeddings initially, and after ten minutes or so we get an accuracy of 30%. That's pretty cool: a bit under a third of the time, our model is correctly predicting the next word of a string.

Since this takes quite a while for each epoch, we might as well save it. You can save it under any name you want, and it will go into your learner's path, into a models subfolder, with a .pth extension (for PyTorch). Later on you can load it with learn.load after you create the learner. Then we can unfreeze and train a few more epochs, and we eventually get up to an accuracy of 34%, which is pretty great.

Once we've done all that, we could save the model, but actually all we really need to save is the encoder. What's the encoder? The encoder is all of the model except for the final layer. (Oh, and we're getting a thunderstorm here; that could be interesting. We've never done a lesson with a thunderstorm before, but that's the joy of teaching during COVID-19: you get all the sound effects.) The final layer of our language model is the bit that actually picks a particular word out, which we don't need, so when we say save_encoder it saves everything except that final layer. And that's the pre-trained model we're going to use: a language model that was pre-trained on Wikipedia, fine-tuned on IMDB, and doesn't contain the very last layer.

Rachel, any questions at this point? "Do any language models attempt to provide meaning? For instance, 'I'm going to the store' is the opposite of 'I'm not going to the store', while 'I barely understand this stuff' and 'that ball came so close to my ear I heard it whistle' both contain the idea of something almost happening, being right on the border. Is there a way to indicate this kind of subtlety in a language model?" Yeah, absolutely; our language model will have all of that in it, or hopefully it will learn about it. We don't have to program that: the whole point of machine learning is that it learns it for itself. When the model sees a sentence like "hey, careful, that ball nearly hit me", its expectation of what word comes next is going to be different from the sentence "hey, that ball hit me". So in practice, language models tend to get really good at understanding these nuances of English, or whatever language they're learning about.

OK, so we have a fine-tuned language model. The next thing we're going to do is try fine-tuning a classifier, but before we do, just for fun, let's look at text generation. We can write ourselves some words, like "I liked this movie because", and then create, say, two sentences, each containing say 40 words: we just go through those two sentences and call learn.predict.
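A sketch of the fine-tuning cells, along the lines of the notebook (the exact learning rates, epoch counts and file names here are illustrative assumptions):

```python
learn = language_model_learner(
    dls_lm, AWD_LSTM,                      # pre-trained on WikiText-103 by default
    drop_mult=0.3,                         # scales all the dropout amounts (regularization)
    metrics=[accuracy, Perplexity()]
).to_fp16()                                # mixed precision: less GPU memory, often 2-3x faster

learn.fit_one_cycle(1, 2e-2)               # frozen at first: only the new parts train
learn.save('1epoch')                       # saved to <learner path>/models/1epoch.pth
# learn = learn.load('1epoch')             # how you'd restore it later

learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)              # fine-tune the whole language model
learn.save_encoder('finetuned')            # everything except the final (next-word) layer
```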
We pass in this text and ask it to predict this number of words, with this amount of randomization, and see what it comes up with. "I liked this movie because of its story and characters. The storyline was very strong, very good for a sci-fi. The main character alucard was very well developed and brought the whole story." And the second attempt: "I liked this movie because I liked the idea of the premise of the movie, the very convenient virus, which… well, when you have to kill a few people, the evil machine has to be used to protect…" and so on. As you can see, it has done a good job of inventing language. There are more careful, more sophisticated ways to do generation from a language model; learn.predict uses about the most basic one possible. But even with a very simple approach, you can see we get some pretty authentic-looking text from a fine-tuned model. In practice this is really interesting, because by choosing the prompt you can get it to generate context-appropriate text, particularly if you fine-tune from a particular corpus. Anyway, that was really just a little demonstration of something we accidentally created along the way; the whole purpose of this model is to be a pre-trained model for classification.

To do that, we're going to need to create another DataBlock, and this time we've got two blocks, not one. We've got a TextBlock again, just like before, but this time we ask fastai not to create a vocab from the unique words but to use the vocab we already have from the language model, because otherwise there'd obviously be no point reusing a pre-trained model: if the vocab were different, the numbers would mean totally different things. So that's the independent variable. The dependent variable, just like we've used before, is a category, so a CategoryBlock is used for that, as we've done many times. We're going to use parent_label to create our dependent variable (that's a function), get_items will use get_text_files just like before, and we'll split using GrandparentSplitter, as we've used before in vision. Then we create our data loaders with a batch size of 128 and a sequence length of 72, and with show_batch we can see an example of a subset of a movie review and a category.

Yes, a question: "Do the tokenizers use any techniques like stemming or lemmatization, or is that an outdated approach?" That would not be a tokenization approach. Stemming actually removes the ends of words, and we absolutely don't want to do that; it's certainly an outdated approach. In English, word endings are there for a reason, they tell us something, and we don't like to remove anything that can give us information. We used to use stemming quite a bit in pre-deep-learning NLP, because we didn't really have good ways, like embedding matrices, of handling big vocabs that differed only in the endings of words, but nowadays we definitely don't want to do that.

Oh yes, one other difference here: previously we had is_lm=True when we said TextBlock.from_folder, to say it was a language model. We don't have that any more, because this isn't a language model.
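A sketch of the classifier DataBlock, in the style of the notebook (folder names follow the standard IMDB layout; treat them as assumptions):

```python
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),  # reuse the LM vocab; no is_lm=True
            CategoryBlock),                                   # dependent variable: pos / neg
    get_y=parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

dls_clas.show_batch(max_n=3)
```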
Now, one thing that was a bit easier with the language model was that we could concatenate all the documents together and then split them into a number of substrings based on the batch size, and that way we could ensure that every mini-batch was the same size: batch size by sequence length. For classification we can't do that: we need each dependent variable, each label, to be associated with a complete movie review. We're not showing the whole movie review here (it's truncated just for display purposes), but we're going to use the whole review to make our prediction. The problem is that if we're using a batch size of 128, and our movie reviews are often something like 3,000 words long, we could end up with something that's way too big to fit into GPU memory. So how are we going to deal with that? Again, we can split things up.

First, let's grab a few of the movie reviews, just for a demo here, and numericalize them. If we map the length over each of them, you can see they vary a lot in length. We can split them into sequences, and indeed we have asked for a sequence length of 72, but when we split each of these into 72-long sections we don't even get the same number of sub-sequences, because they're all different lengths. So how do we deal with that? Just like in vision, we handle different sizes by adding padding: we add a special xxpad token to every sequence in a mini-batch. In this case it looks like 581 is the longest, so we would add enough padding tokens to make each of the others 581 as well, and then we can split them into 72-long pieces for the mini-batches and we'll be ready to go.

Obviously, if your lengths are very different, adding a whole lot of padding is going to be super wasteful, so another thing fastai does internally is to shuffle the documents around so that documents of similar length end up in the same mini-batch. It still randomizes them, but it approximately sorts them, so it wastes a lot less time on padding. We don't have to do any of that manually: when we call TextBlock.from_folder without is_lm, it does all of that for us.

Now we can go ahead and create a learner. This time it's a text_classifier_learner, again based on AWD-LSTM; we pass in the data loaders we just created, use accuracy as the metric, and make it fp16 again. This time we don't want a pre-trained Wikipedia model; in fact there is no pre-trained Wikipedia classifier, because what you classify matters a lot. Instead we load the encoder (remember, everything except the last layer) which we saved just before. So we're loading, as a pre-trained model, a language model for predicting the next word of a movie review.

Let's go ahead and fit one cycle. Again, by default it will be frozen, so it's only the final layer, the randomly added classifier layer, that gets trained. It took 30 seconds, and look at this: we already have 93%. That's pretty similar to what we got back in Lesson 1, but rather than taking about 12 minutes, once all the pre-training has been done it takes about 30 seconds. This is quite cool: you can create a language model for your general area of interest, and then create all kinds of different classifiers pretty quickly. And that's just from fine-tuning the final, randomly added layer.
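Roughly what that looks like in code (a sketch; the drop_mult value and learning rate are assumptions based on typical settings):

```python
learn = text_classifier_learner(
    dls_clas, AWD_LSTM,
    drop_mult=0.5,
    metrics=accuracy
).to_fp16()

learn = learn.load_encoder('finetuned')   # the IMDB-fine-tuned language model, minus its last layer

learn.fit_one_cycle(1, 2e-2)              # frozen: trains only the new classifier head (~93% here)
```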
Now we could just unfreeze and keep training, but something we found is that for NLP it's actually better to unfreeze one layer group at a time, not the whole model at once. In this case the last layer has already been trained, so to unfreeze the last couple of layer groups we can say freeze_to(-2) and train a little more, and look at this: after a bit over a minute we're already easily beating what we got in Lesson 1. Then freeze_to(-3) to unfreeze another layer group, and now we're up to 94%. Finally we unfreeze the whole model, and we're up to about 94.3% accuracy. That was literally the state of the art for this very heavily studied dataset just three years ago.

If you also reverse all of the reviews so they go backwards, train a second model on the backwards version, and then average the predictions of those two models as an ensemble, you get to 95.1% accuracy, and that was the state of the art we actually reported in the ULMFiT paper. It was only beaten for the first time a few months ago, using a way, way bigger model, way more data, way more compute, and way more data augmentation. I should mention that, with the data augmentation, one of the cool things they did was figure out a way to beat our 95.1% with less data as well; data augmentation has become a really important approach since we created the ULMFiT paper.
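A sketch of that gradual unfreezing schedule, along the lines of the notebook (the discriminative learning-rate values are the usual ones from the notebook and paper, but treat the exact numbers as assumptions):

```python
# Unfreeze the last two layer groups and train with discriminative learning rates
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2 / (2.6**4), 1e-2))

# Unfreeze one more layer group
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3 / (2.6**4), 5e-3))

# Finally, unfreeze everything
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3 / (2.6**4), 1e-3))   # ends up around 94.3% accuracy
```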
Any questions, Rachel? "Can someone explain how a model trained to predict the next word in a sentence can generalize to classify sentiment? They seem like different domains." Yeah, that's a great question; they are very different domains, and it is really amazing. Basically, the trick is that to be able to predict the next word of a sentence you have to know a lot of stuff, not only about the language but about the world. Let's say we wanted to finish this sentence: "by training a model on all the text read backwards and averaging the predictions of these two models, we can even get to 95.1% accuracy, which was the state of the art introduced by ___". To fill in the word "ULMFiT" you would have to know a whole lot: that there's such a thing as fine-tuning pre-trained language models, which approach gets which results, and that ULMFiT got this particular result. That would be an amazing language model; I'm not sure any language model could fill that in correctly, but it gives you a sense of what you'd have to know to be good at language modeling. Or if you're going to predict the next word of a sentence like "wow, I really love this movie, I love every movie containing Meg ___", maybe it's "Ryan": you'd have to know that Meg Ryan is an actress and that actresses are in movies, and so forth. When you know that much about English and about the world, turning that into something which recognizes that "I really love this movie" is a good thing rather than a bad thing is just not a very big step, and as we saw, you can get that far by fine-tuning just the very last layer or two. So it is amazing, and I think that's super, super cool.

All right, another question: "How would you do data augmentation on text?" Well, you would probably Google for "unsupervised data augmentation" and read this paper and the things that have cited it. This is the one that easily beat our IMDB result with only 20 labeled examples, which is amazing. They did things like (if I remember correctly) translating every sentence into a different language and then translating it back again, so you get different rewordings of the same sentence. Tricks like that.

Now let's go back to the generation thing. Remember, we saw that we can generate context-appropriate sentences, and it's important to think about what that means in practice. Have a look, for example, at this: even before this technology existed, in 2017 the FCC asked for comments about a proposal to repeal net neutrality, and it turned out that less than 800,000 of the 22 million comments actually appeared to be unique. Jeff Kao discovered that a lot of the submissions were only slightly different from each other: the green bit would be "citizens", or "people like me", or "Americans", and the red bit would be "as opposed to" or "rather than", and so forth. And that, I believe, made a big difference to American policy.

Here's an example of a Reddit conversation: "You're wrong. The defense budget is a good example of how badly the US spends money on the military…", somebody else: "Yeah, but that's already happening; there's a huge increase in the military budget", "I didn't mean to sound like 'stop paying for the military'; I'm not saying that we cannot pay the bills." All of these were actually created by a language model, GPT-2. This is a very concerning thing around disinformation. Never mind fake news, never mind deep fakes: think about what would happen if somebody invested a few million dollars in creating a million Twitter bots, Facebook bots and Weibo bots, and made it so that 99% of the content on social networks was from deep learning bots; and furthermore, they were trained not just to optimize the next word of a sentence, but to optimize the level of disharmony created, or agreeableness for half of them and disagreeableness for the other half. You could create a whole lot of awful, toxic discussion, which is actually the goal of a lot of propaganda outfits: it's not so much to push a particular point of view as to make people feel like there's no point engaging, because the truth is too hard to understand, or whatever.

So Rachel and I are both super worried about what could happen to discourse now that we have this incredibly powerful tool, and I'm not sure we have a great sense of what to do about it. Algorithms are unlikely to save us here: if you could create a classifier which could do a good job of figuring out whether something was generated by an algorithm or not, then I could just use your classifier as part of my training loop to train an algorithm that learns to trick your classifier. So this is a real worry, and the only solutions I've seen are those based on cryptographic signatures, which is another whole can of worms that has never really been properly sorted out, at least not in the Western world, in a privacy-centric way.

Yes, I'll add (and I'll link to this on the forums) that I gave a keynote at the SciPy conference last summer, the scientific Python conference, and went into a lot more detail about the threat Jeremy is describing, about using advanced language models to manipulate public opinion.
So if you want to learn more about the dangers there and exactly what that threat is, you can find it in my SciPy keynote. Great, thanks so much, Rachel. Let's have a five-minute break, and see you back here in five minutes.

We're going to finish with a kind of segue into what will eventually be Part 2 of the course, which is to go right underneath the hood and see exactly how a more complex architecture works; specifically, we're going to see how a recurrent neural network works. Do we have a question first? "In the previous lesson's MNIST example, you showed us that under the hood the model was learning parts of the image, like the curves of a three or the angles of a seven. Is there a way to look under the hood of the language models to see if they are learning rules of grammar and syntax? Would it be a good idea to fine-tune models with examples of domain-specific syntax, like technical manuals, or does that miss the point of having the model learn for itself?" Yeah, there are tools that allow you to see what's going on inside an NLP model. We're not going to look at them in this part of the course, maybe we will in Part 2, but it's certainly worth doing some research to see what you can find, and there are certainly PyTorch libraries you can download and play with. And I think it's a perfectly good idea to incorporate technical manuals and the like into your training corpus; there have actually been some recent papers on this general idea of including carefully curated sentences as part of your training corpus. It's unlikely to hurt, and it could well help.

All right, so let's have a look at RNNs. When Sylvain and I started creating the RNN material for fastai, the first thing I did was actually to create a new dataset, and the reason for that is that I didn't find any datasets that allowed for quick prototyping and really easy debugging. So I made one, which we call human numbers, and it contains the first 10,000 numbers written out in English. I'm surprised at how few people create datasets. I create datasets frequently; I specifically look for things that can be small, easy to prototype with, good for debugging, and quick to try things out on, and very few people do this, even though this human numbers dataset, which has been so useful for us, took me an hour or two to create. So this is definitely an underappreciated, underutilized technique.

We can grab the human numbers dataset and see that there's a training and a validation text file. We can open each of them, and for now we're just going to concatenate the two together into a list called lines. You can see that the contents are "one", "two", "three", etc., with a newline at the end of each. We concatenate them all together with a full stop between them, like so, and then we can tokenize that by splitting on spaces. So, for example, here are tokens 100 to 110: "forty two", "forty three", "forty four" and so forth, each separated by a full stop. You can see I'm just using plain Python here; there's not even any PyTorch, and certainly no fastai. To create a vocab, we just take all the unique tokens, of which there are 30, and that gives us a lookup from an ID to a word; to go from a word to an ID, we can enumerate that and create a dictionary from word to ID. Then we can numericalize our tokens by calling the word-to-index lookup on each one, and so here are our tokens and here's the equivalent numericalized version.
So you can see that for fairly small datasets, when we don't have to worry about scale and speed and the details of tokenization in English, you can do the whole thing in just plain Python. The only other thing we used, to save a little bit of time, was L, but you could easily do that with the Python standard library in about the same amount of code. Hopefully that gives you a good sense of what's really going on with tokenization and numericalization, all done by hand.

So let's create a language model. One way to create a language model dataset would be to go through all of our tokens, creating a range from zero to the length of our tokens minus four, in steps of three; that lets us grab three tokens at a time: 1 2 3, then 4 5 6, and so forth. So here are the first three tokens and then the fourth token; here are the second three tokens and then the seventh token; and so on. The groups of three will be our independent variables, and the single following token will be our dependent variable. It's a super naive, simple language model dataset for the human numbers question.

We can do exactly the same thing using the numericalized version and create tensors: this is exactly the same as before, but now numericalized and as tensors. And we can create a DataLoaders object from datasets (remember, these are datasets because they have a length and we can index into them): we just grab the first 80% of the tokens as the training set and the last 20% as the validation set, with a batch size of 64, and we're ready to go. So we've really used very, very little: the only PyTorch we used was to create these tensors, and the only fastai we used was to create the DataLoaders, which is just grabbing directly from the datasets. It's really not doing anything clever at all.
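Pulling that whole human-numbers preparation together, here is a plain-Python sketch along the lines of the notebook (variable names follow the notebook's conventions; the 80/20 split and batch size are as described above):

```python
from fastai.text.all import *

path = untar_data(URLs.HUMAN_NUMBERS)

lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())

text = ' . '.join([l.strip() for l in lines])   # "one . two . three . ..."
tokens = text.split(' ')                        # tokenize: just split on spaces

vocab = L(*tokens).unique()                     # the 30 unique tokens (index -> word)
word2idx = {w: i for i, w in enumerate(vocab)}  # word -> index lookup
nums = L(word2idx[t] for t in tokens)           # numericalized corpus

# Naive language-model dataset: three tokens as input, the fourth as the target
seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0, len(nums) - 4, 3))

cut = int(len(seqs) * 0.8)                      # first 80% for training, last 20% for validation
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=64, shuffle=False)
```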
So let's see if we can now create a neural network architecture that takes three numericalized words at a time as input and tries to predict the fourth as the dependent variable. Here is just such a language model. It's a three-layer neural network: we've got a linear layer here which we're going to use once, twice, three times, and after each of them we call ReLU as per usual. But there's a little more going on. The first interesting thing is that rather than each of these being a different linear layer, we've created just one linear layer here which we reuse, as you can see, one, two, three times. So that's the first tricky bit. There are a few things that are a little different from usual, but the basic idea is that we've got an embedding, an nn.Linear, and another nn.Linear, and in the forward method we use those linear layers and ReLU. So it's very nearly a totally standard three-layer neural network (I guess four, really, because there's an output layer).

Yes, we have a question: "Is there a way to speed up fine-tuning the NLP model? Ten-plus minutes per epoch slows down the iterative process quite a bit. Any best practices or tips?" I can't think of any, other than to say you don't normally need to fine-tune it that often; the work is usually more at the classifier stage. So I tend to just leave it running overnight, or while I have lunch, or something like that. Just make sure you don't sit there watching it: go and do something else. This is where it can be quite handy to have a second GPU, or to fire up a second AWS instance or whatever, so you can keep moving while something's training in the background.

All right, so what's going on in this model? To describe it, we're going to develop a little pictorial representation, and it works like this. Let's start with a simple linear model to define the representation. A simple linear model has an input of size batch size by number of inputs, and we use a rectangle to represent an input. We use an arrow to represent a layer computation; in this case (actually, sorry, this is a single-hidden-layer model) there'll be a matrix product followed by a ReLU, and that's what this arrow represents. Out of that we get some activations, and circles represent computed activations. We call this a hidden layer, and its size is batch size by number of activations. Then, to create a neural net, we do a second matrix product, this time followed by a softmax; the computation is again represented by the arrow, and output activations are a triangle, with size batch size by number of classes.

So here's the pictorial version, with the legend: triangle is output, circle is hidden, rectangle is input. And here it is. We take the first word as an input, and it goes through a linear layer and a ReLU. You'll notice I've deleted the details of what the operations are at this point, and I've also deleted the sizes, so every arrow is basically just a linear layer followed by a nonlinearity. We take the word-one input and put it through the linear layer and the nonlinearity to give us some activations, so there's our first set of activations. Then we put that through another linear layer and nonlinearity to get some more activations, and at this point word two arrives: word two goes through a linear layer and a nonlinearity, and when two arrows come together into a circle it means we add (or concatenate; either is fine) the two sets of activations. So we add the activations from this input to the activations from here to create a new set of activations, and then put that through another linear layer and a ReLU. Again, word three comes in, goes through a linear layer and a ReLU, and gets added to create another set of activations, and finally that goes through a last linear layer and softmax to create our output activations.

So this is our model. It's basically a standard one-two-three-four layer model, but a couple of interesting things are going on. The first is that we have inputs coming into later layers and getting added, which is something we haven't seen before. The second is that all of the arrows that are the same color use the same weight matrix: every time we get an input, we put it through one particular weight matrix; every time we go from one set of activations to the next, we put it through a different weight matrix; and to go from the activations to the output, we use yet another weight matrix.

If we now go back to the code: to go from input to hidden we, not surprisingly, always use an embedding, so the embedding is the green arrow, and you'll see we create just one embedding. Here is x, which is the three words: here's the first word, x[:,0], and it goes through that embedding; word two goes through the same embedding; and word three, index number 2, goes through the same embedding. And each time, as you see, we add it to the current set of activations.
So, having got the embedding, we put it through this linear layer; then again we get the next embedding, add it to the activations, and put it through that same linear layer; and again the same thing here, through the same linear layer. h is the orange arrow, and this set of activations is what we call the hidden state; that's why it's called h. If you follow through these steps you'll see how each of them corresponds to a step in this diagram. Then, finally, at the end we go from the hidden state to the output, which is this hidden-to-output linear layer. We don't have the actual softmax there because, as you'll remember, we can incorporate that directly into the cross-entropy loss function in PyTorch. One nice thing about this is that everything we're using we have previously created from scratch: we've created our own embedding layer from scratch, our own linear layer, our own ReLU, our own cross-entropy loss, so you can actually try building this whole thing yourself from scratch.

In terms of the nomenclature: h refers to hidden, so i_h is the layer that goes from input to hidden, h_h is the one that goes from hidden to hidden, and h_o is the one that goes from hidden to output. If any of this feels confusing at any point, go back to where we actually created each of these things from scratch and create it from scratch again; make sure you actually write the code, so that nothing here is mysterious.

Why do we use the same embedding matrix each time we have a new input word, for input word indexes 0, 1 and 2? Because conceptually they all represent English words (for human numbers), so why would you expect them to need different embeddings? They should all have the same representation; they all have the same meaning. The same goes for the hidden-to-hidden layer: each time, we're describing how to go from one token to the next in our language model, so we'd expect it to be the same computation.

So, having created that model, we can go ahead and instantiate it. We have to pass in the vocab size for the embedding and the number of hidden activations. Here we create the model, and then we create a Learner by passing in the model, our data loaders, a loss function, and optionally metrics, and we can fit. Of course, this is not pre-trained (it's not an application-specific learner, so it wouldn't know what pre-trained model to use), so it's all random, and we get somewhere around 45 to 50% accuracy. Is that any good? Well, you should always compare to the simplest model, something like an average. So what I did was grab the validation set, put all the tokens into a Python standard library Counter, which simply counts how many times each token appears, and I found that the word "thousand" is the most common: it appears 7,104 times here. Divide that by the length of the tokens and we get 15%. In other words, if we always just predicted "thousand" as the next word, we would get 15% accuracy, but this model got around 45 to 50% accuracy. So our model is a lot better than the simplest possible baseline: we've learned something useful, and that's great.
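A sketch of that model and its training, following the structure described above (layer names i_h, h_h and h_o as in the notebook; it assumes `dls` and `vocab` from the human-numbers sketch earlier, and the learning rate and epoch count are illustrative):

```python
class LMModel1(Module):
    "Predict the fourth token from the previous three, reusing the same layers at each step."
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)   # input  -> hidden (the "green" arrows)
        self.h_h = nn.Linear(n_hidden, n_hidden)      # hidden -> hidden (the "orange" arrows)
        self.h_o = nn.Linear(n_hidden, vocab_sz)      # hidden -> output (the "blue" arrow)

    def forward(self, x):
        h = F.relu(self.h_h(self.i_h(x[:, 0])))       # word 1
        h = h + self.i_h(x[:, 1])                     # add word 2's embedding to the hidden state
        h = F.relu(self.h_h(h))
        h = h + self.i_h(x[:, 2])                     # add word 3's embedding
        h = F.relu(self.h_h(h))
        return self.h_o(h)                            # softmax is folded into the loss function

learn = Learner(dls, LMModel1(len(vocab), 64),
                loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)                          # roughly 45-50% accuracy, vs ~15% baseline
```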
So the first thing we're going to do is refactor this code, because you can see we've got x going into i_h, into h_h, into ReLU; x going into i_h, into h_h, into ReLU; x going into i_h, into h_h, into ReLU. How would you refactor that in Python? You would of course use a for loop. So let's write that again: these lines of code are identical, in fact these lines of code are identical, as is this one, and instead of doing all that stuff manually we create a loop that goes through three times, and each time through it does i_h, adds to our hidden, does h_h and ReLU, and then at the end hidden to output. So this is exactly the same thing as before, just refactored with a for loop, and we can train it again, and again we get basically the same 45 to 50%, as you would expect, because it is exactly the same thing, just refactored.

And here's something crazy: this is a recurrent neural network, even though it's exactly the same as before; it's just been refactored into a loop. Believe it or not, that's actually all an RNN is: an RNN is a simple refactoring of that deep learning model (I shouldn't say linear model) of simple linear layers with ReLUs. So let's draw our pictorial representation again. Remember, this was our previous pictorial representation; we can refactor the picture as well. Instead of showing these dots separately, we can take this arrow and represent it as a loop, because that's all that's happening: word one goes through an embedding into these activations, and then the same step just gets repeated for each remaining word, from 2 to n-1, where n here is the number of words (we've got three), with each word coming in as well. So we've just refactored our diagram, and then eventually it goes through our blue arrow to create the output. This diagram is exactly the same as the previous one, just with the middle replaced by that loop. That's a recurrent neural net.

And h, remember, was something that we just kept track of as we added each layer to it; here we just have it inside the loop. We initialize it as 0, which is a little tricky: the reason we can do that is that 0 plus a tensor will broadcast the 0, and that neat feature means we don't have to make this a tensor of a particular size to start with. We're going to be seeing the term hidden state a lot, so it's important to remember that hidden state simply means these activations that are occurring inside our recurrent neural net, and a recurrent neural net is just a refactoring of a particular kind of fully connected deep model. So that's it; that's what an RNN is.
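Here's a rough sketch of that for-loop refactoring, again in plain PyTorch with illustrative names; it computes exactly the same thing as the previous sketch.

```python
import torch
import torch.nn.functional as F
from torch import nn

class LMModel2(nn.Module):
    "Same computation as LMModel1, refactored into a loop: this is an RNN."
    def __init__(self, vocab_sz, n_hidden):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)

    def forward(self, x):
        h = 0                            # 0 broadcasts against the first embedding
        for i in range(3):               # one step per input word
            h = h + self.i_h(x[:, i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)
```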
No questions at this point, Rachel? Okay. Something that's a bit weird about this model, though, is that for every batch we're setting our hidden state back to 0, even though we're going through the entire human numbers dataset in order. You would think that by the time you've gone "one, two, three" you shouldn't then forget everything you've learnt when you get to "four, five, six"; it would be great to actually remember where we're up to and not reset the hidden state back to zero every time. We can absolutely do that: we can maintain the state of our RNN. Here's how: rather than having something called h, we'll call it self.h, and we'll set it to zero once, when we first create the model. Everything else here is the same, and everything else here is the same, and then there's just one extra line of code.

What's going on with that line? Well, here's the thing: if h is something which persists from batch to batch, then effectively this loop is becoming extremely long. Our deep learning model is effectively, not infinitely deep, but as deep as the entire size of our dataset, because every time we're stacking new layers on top of the previous layers. The reason this matters is that when we then do backpropagation, when we calculate the gradients, we have to calculate them back through every layer, all the way back; by the time we get to the end of the dataset, we'd be backpropagating not just through this loop but through the self.h that was created by the previous call to forward, and the one before that, and the one before that. So we'd have an incredibly slow calculation of the gradients all the way back to the start, and it would also use up a whole lot of memory, because it has to store all those intermediate values in order to calculate the gradients.

That problem is easily solved by calling detach. What detach does is basically say: throw away my gradient history; forget that I was calculated from some other gradients. The activations are still stored, but the gradient history is no longer stored, so it cuts off the gradient computation. This is called truncated backpropagation. So these are exactly the same lines of code as the other two models: h = 0 has been moved into self.h = 0 in the constructor, these lines of code are identical, and we've added one more line of code. The only other thing is that from time to time we might have to reset self.h to zero, so I've created a method for that, and we'll see how that works shortly. And sorry, I was using the wrong jargon: backpropagation through time is what we call it when we calculate the backprop going back through this loop.

Now, we do need to make sure that the samples are seen in the correct order, given that every batch needs to connect up to the previous batch; go back to notebook 10 to remind yourself of what that needs to look like. Basically, the length of our list of sequences divided by the batch size, call it m, is 328, so the first batch will contain the sequences at index 0, m, 2m, and so forth, and the second batch will contain 1, m+1, 2m+1, and so forth. The details don't matter too much, but that's how we do the indexing, and then we can call that group_chunks function to create our training set and our validation set, and we certainly don't shuffle, because that would break the ordering entirely. Then there's one more thing we need to do, which is to make sure that at the start of each epoch we call reset, because at the start of the epoch we're going back to the start of our natural numbers, so we need to set self.h back to 0.
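As a sketch of the stateful version and the batch-ordering helper just described (plain PyTorch, illustrative names; ds is a list of (input, target) pairs and bs is the batch size):

```python
import torch
import torch.nn.functional as F
from torch import nn

class LMModel3(nn.Module):
    "Keeps its hidden state between batches; detach() truncates backprop through time."
    def __init__(self, vocab_sz, n_hidden):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = 0                        # hidden state persists across forward calls

    def forward(self, x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:, i])
            self.h = F.relu(self.h_h(self.h))
        out = self.h_o(self.h)
        self.h = self.h.detach()          # keep the values, drop the gradient history
        return out

    def reset(self):                      # called at the start of each epoch
        self.h = 0

def group_chunks(ds, bs):
    "Reorder ds so that sample i of one batch is continued by sample i of the next batch."
    m = len(ds) // bs
    return [ds[i + m * j] for i in range(m) for j in range(bs)]
```

You would then build the DataLoaders from group_chunks of the training and validation splits with shuffle=False and drop_last=True, as in the notebook.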
Something we'll learn about in part 2 is that fastai has something called callbacks. Callbacks are classes which basically let you say: during the training loop, I want you to call some particular code. In this case, when we start training it will call reset, when we start validation it will call reset (so that's each epoch), and when we're all finished fitting it will call reset. Callbacks can be, and normally are, very small. And what does reset do? Whatever you tell it to do, and we told it to set self.h = 0. If you want to use a callback, you simply add it to the callbacks list, cbs, when you create your Learner. And now when we train, that's way better. This is called a stateful RNN: it's actually keeping the hidden state from batch to batch.

Now, we've still got a fairly obvious problem, which is that if you look back at the data we created, we used the first three tokens to predict the fourth, then the next three tokens to predict the seventh, and so forth. What you'd really rather do, you would think, is predict every word, not just every fourth word; it seems like we're throwing away a lot of signal, which is pretty wasteful. So we want to create more signal, and the way to do that is, rather than putting this output stage outside the loop (this dotted area is the bit that's looped), to put the output inside the loop. In other words, after every hidden state is created, we immediately make a prediction, so we can predict after every time step, and our dependent variable can be the entire sequence of numbers offset by one. That gives us a lot more signal.

So we have to change our data so that the dependent variable has each of the next words after each of the inputs: instead of the input being the numbers from i to i+sl and the output being just the single next number, we're going to have the entire sequence offset by one as our dependent variable. Then we can do exactly the same as we did before to create our data loaders, and you can now see that each dependent variable is exactly the same thing as the independent variable, but offset by one. Then we need to modify our model very slightly: this code is all exactly the same as before, but rather than returning one output, we create a list of outputs, append to it after every element of the loop, and at the end stack them all up; this part is the same, so it's nearly exactly the same, just a very minor change. Our loss function also needs a tweak: we create our own loss function, which is just cross-entropy loss, but with the target flattened out and the input flattened out, and then we pass that as our loss function. Everything else here is the same, and we can fit, and we've gone from, I can't remember, 58 to 64%, so it's improved a little bit, which is good. We did find this a little flaky: sometimes it would train really well, sometimes it wouldn't train so well, but we often got this reasonably good answer.
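A rough sketch of that change, predicting at every time step and flattening the outputs for cross-entropy (plain PyTorch, illustrative names; sl is the sequence length used when building the data):

```python
import torch
import torch.nn.functional as F
from torch import nn

class LMModel4(nn.Module):
    "Same stateful RNN, but it outputs a prediction after every time step."
    def __init__(self, vocab_sz, n_hidden, sl=16):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h, self.sl = 0, sl

    def forward(self, x):
        outs = []
        for i in range(self.sl):
            self.h = self.h + self.i_h(x[:, i])
            self.h = F.relu(self.h_h(self.h))
            outs.append(self.h_o(self.h))    # predict after every hidden state
        self.h = self.h.detach()
        return torch.stack(outs, dim=1)      # shape: batch x sl x vocab_sz

    def reset(self): self.h = 0

def loss_func(inp, targ):
    "Flatten the batch and sequence dimensions so ordinary cross-entropy applies."
    return F.cross_entropy(inp.reshape(-1, inp.shape[-1]), targ.reshape(-1))
```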
Now, one problem here is that although effectively we have quite a deep neural net, if you go back to the version with the loop in it, that's the normal way to think about an RNN, but perhaps an easier way to think about it is what we call the unrolled version, which is when you look at it like this. If you unroll this stateful neural net, it is quite deep, but every single one of the hidden-to-hidden layers uses exactly the same weight matrix, so really it's not that deep at all. It can't really do very sophisticated computation, because it has to use the same weight matrix every time, so in some ways it's not really any smarter than a plain linear model. It would be nice to create a truly deep model, with multiple different layers that it can go through, and we can do that easily enough by creating something called a multi-layer RNN. All we do is take the diagram we just saw and repeat it; and, this is actually a bit unclear, but the dotted arrows here are different weight matrices from the non-dotted arrows, so we can have a different hidden-to-hidden weight matrix in the second set of RNN layers, and a different weight matrix here for the second set. This is called a stacked RNN, or a multi-layer RNN, and here's the same thing in the unrolled version: this is exactly the same thing, just showing you the unrolled picture.

Writing this out by hand, maybe that's quite a good exercise, or particularly this one would be quite a good exercise, but it's kind of tedious, so we're not going to bother. Instead we're going to use PyTorch's RNN class. PyTorch's RNN class is basically doing exactly what we saw here, specifically this part here and this part here, but it's nice that it also has an extra "number of layers" parameter that lets you tell it how many to stack on top of each other. It's important when you start using PyTorch's RNN to realize that there's nothing magic going on: you're just using the refactored for loop that we've already seen. So we still need the input-to-hidden embedding; this is now the hidden-to-hidden with the loop all done for us; then this is the hidden-to-output, just as before; and this is our hidden state, just like before. Now we don't need the loop, we can just call self.rnn and it does the whole loop for us, and we can do all of the input-to-hidden at once to save a little bit of time, thanks to the wonder of embedding matrices. As usual we have to call detach to avoid getting a super deep effective network, and then pass the result through our output linear layer. So this is exactly the same as the previous model, except that we've refactored it using nn.RNN and we've said we want more than one layer; let's request, say, two layers. We still need the ModelResetter callback, just like before, because remember nothing's changed. So let's go ahead and fit and... oh, it's terrible.

Why is it terrible? Well, the reason is that now we really do have a very deep model, and very deep models are really hard to train, because we can get exploding or disappearing activations. What that means is we start out with some initial state and we gradually put it through all of these layers and all of these layers, and each time we're doing a matrix multiplication, which remember is just doing a whole bunch of multiplies and adds; and then we multiply and add, and we multiply and add, and we multiply and add, and if you do that enough times you can end up with very, very big results (that's if the kinds of things we're multiplying and adding by are pretty big) or very, very small results, particularly because we're putting it through the same layer again and again. Why is that a problem? Well, if you multiply by 2 a few times you get 1, 2, 4, 8 and so on, and after 32 steps you're already at about 4 billion; or if you start at 1 and multiply by a half a few times, after 32 steps you're down to a tiny number. So a number even slightly higher or lower than 1 can cause an explosion or disappearance, and matrix multiplication is just multiplying numbers and adding them up, so exactly the same thing happens with matrices: they can grow really big or really small. And when that happens, exactly the same thing happens to the gradients: they get really big or really small.

One of the problems here is that numbers are not stored precisely in a computer; they're stored using something called floating point. We stole this nice diagram from an article called "What you never wanted to know about floating point but were forced to find out", and here we're at the point where we're forced to find out. It's basically showing us the granularity with which numbers are stored.
The numbers that are further away from zero are stored much less precisely than the numbers that are close to zero. If you think about it, that means the gradients for very big numbers could actually become zero themselves, because you could end up with two numbers that both sit between these little gradations. You end up with the same thing for really small numbers: although they're closer together, the numbers they represent are also very close together, so in both cases the relative accuracy gets worse and worse. You really want to avoid this happening. There are a number of ways to avoid it, and this is the same for really deep convolutional neural nets or really deep standard tabular networks: any time you have too many layers it can become difficult to train, and you generally have to use either really small learning rates or special techniques that avoid exploding or disappearing activations or gradients.
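For reference, the stacked model described above, using PyTorch's nn.RNN with a number-of-layers argument, looks roughly like this (plain PyTorch, illustrative names; bs is the batch size, and for simplicity the hidden state tensor is assumed to live on the same device as the model):

```python
import torch
from torch import nn

class LMModel5(nn.Module):
    "Stacked RNN: nn.RNN runs the whole loop for us, n_layers deep."
    def __init__(self, vocab_sz, n_hidden, n_layers, bs):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = torch.zeros(n_layers, bs, n_hidden)   # one hidden state per stacked layer

    def forward(self, x):
        res, h = self.rnn(self.i_h(x), self.h)   # embed all time steps at once; loop is inside nn.RNN
        self.h = h.detach()                      # truncate backprop through time
        return self.h_o(res)                     # a prediction at every time step

    def reset(self): self.h.zero_()
```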

For RNNs, one of the most popular approaches to this is to use an architecture called an LSTM. I'm not going to go into the details of building an LSTM from scratch today, but it's in the book and in the notebook. The key thing to know about an LSTM is that, rather than just being a matrix multiplication, it contains a number of linear layers which are combined in particular ways. The way they're combined, shown in this diagram, is designed so that there are little mini neural networks inside the layer which decide how much of the previous state is kept, how much is thrown away, and how much of the new state is added. By letting it have little neural nets to calculate each of these things, the LSTM layer, which again is shown here, can decide how much of an update to do at each time step, and that capability basically allows it to avoid updating too much or updating too little. By the way, this code can be refactored, which Sylvain did here, into a much smaller amount of code, but these two things are exactly the same.
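For the curious, here's a rough from-scratch sketch of one LSTM cell along the lines of the version in the book (the refactored version packs the four gates into a single linear layer; names here are illustrative):

```python
import torch
from torch import nn

class LSTMCell(nn.Module):
    "One time step of an LSTM: little gates decide what to forget, what to add, and what to output."
    def __init__(self, ni, nh):
        super().__init__()
        self.forget_gate = nn.Linear(ni + nh, nh)
        self.input_gate  = nn.Linear(ni + nh, nh)
        self.cell_gate   = nn.Linear(ni + nh, nh)
        self.output_gate = nn.Linear(ni + nh, nh)

    def forward(self, inp, state):
        h, c = state                                   # hidden state and cell state
        x = torch.cat([h, inp], dim=1)
        c = c * torch.sigmoid(self.forget_gate(x))     # how much of the old cell state to keep
        c = c + torch.sigmoid(self.input_gate(x)) * torch.tanh(self.cell_gate(x))  # how much new state to add
        h = torch.sigmoid(self.output_gate(x)) * torch.tanh(c)                     # what to output
        return h, (h, c)
```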

So, as I said, I'm not going to worry too much about the details of how this works. The important thing to know is that you can replace the matrix multiplication in an RNN with this sequence of matrix multiplications, sigmoids, multiplies and adds, and when you do so you very significantly decrease the amount of activation or gradient explosion and disappearance.

So that's called an LSTM cell, and an RNN which uses this instead of a plain matrix multiplication is called an LSTM, so you can replace nn.RNN with nn.LSTM. Other than that we haven't really changed anything, except that because LSTMs have more of these layers in them, we have to make our hidden state hold more pieces as well. We call it in just the same way as we did before; we detach just like before, but the state is now a list, so we have to detach each element; we pop it through our output layer exactly as before; reset is just as before, except it has to loop through each piece of state; and we can fit in exactly the same way as before. As you can see, we end up with a much better result, which is great.
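A rough sketch of the LSTM version (plain PyTorch, illustrative names): the only real changes from the nn.RNN model are the nn.LSTM layer and the fact that the state is now a pair of tensors, the hidden state and the cell state.

```python
import torch
from torch import nn

class LMModel6(nn.Module):
    "Same as the stacked RNN model, but with nn.LSTM; the state is now (hidden, cell)."
    def __init__(self, vocab_sz, n_hidden, n_layers, bs):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]  # hidden state and cell state

    def forward(self, x):
        res, h = self.rnn(self.i_h(x), self.h)
        self.h = [h_.detach() for h_ in h]       # detach both pieces of state
        return self.h_o(res)

    def reset(self):
        for h in self.h: h.zero_()
```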

We have two questions. Okay perfect. Could we somehow use regularization to try to make the RNN parameters close to the identity matrix or would that cause bad results because the hidden layers want to deviate from the identity during training? So we're actually about to look at regularization so we will take a look.

The identity matrix, for those who don't know or don't remember, is the matrix where if you multiply by it you get exactly the same thing you started with, just like multiplying by one gives you back the same number you started with. In linear algebra, if you multiply by the identity matrix you get the same matrix you started with. And actually, one quite popular approach to initializing the hidden-to-hidden weight matrix is to initialize it with the identity matrix, which ensures that you start with something that doesn't cause gradient explosions or activation explosions.
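If you wanted to try that, PyTorch's nn.init module can do it; here's a tiny illustrative sketch (the layer and sizes are made up for the example):

```python
import torch
from torch import nn

h_h = nn.Linear(64, 64, bias=False)   # a hidden-to-hidden layer
nn.init.eye_(h_h.weight)              # start as the identity, so h_h(x) == x initially

x = torch.randn(8, 64)
assert torch.allclose(h_h(x), x)      # multiplying by the identity leaves the activations unchanged
```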

There are, yeah, we're about to have a look at some more regularization approaches, so let's wait until we do that. All right, next question. Is there a way to quickly check if the activations are disappearing or exploding? Absolutely: just go ahead and calculate them. We'll be looking at that in a lot more detail in part two, but a really great exercise would be to figure out how you can actually output the activations of each layer. It would certainly be very easy to do that in the RNNs that we built ourselves from scratch, because we can actually see the linear layers, so you could just print them out, or print out some statistics, or store them away, or something like that.
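One simple way to do that in plain PyTorch is with forward hooks; here's a minimal illustrative sketch on a toy stack of linear layers (the same idea applies to the RNNs above):

```python
import torch
from torch import nn

def report_stats(name):
    "Make a forward hook that prints the mean and standard deviation of a layer's output."
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        print(f"{name}: mean={out.mean().item():.3f} std={out.std().item():.3f}")
    return hook

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
for i, layer in enumerate(model):
    layer.register_forward_hook(report_stats(f"layer {i}"))

model(torch.randn(8, 64))   # prints one line of statistics per layer
```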

fastai has a class called ActivationStats, which you can check out if you're interested; it's a really good way to do specifically this. Okay, so regularization is important: we have potentially a lot of parameters and a lot of layers, and it would be really nice if we could do the same kind of thing we've done with our CNNs and so forth, which is to use more parameters but then use regularization to ensure that we don't overfit. We can certainly do that with an LSTM as well, and perhaps the best way to do it is to use something called dropout. Dropout is not just used for RNNs; it's used all over the place, but it works particularly well in RNNs.

This is a picture from the dropout paper, and what happens in dropout is this: here's a picture of, let's see, three fully connected layers, and in the first two layers what we do is delete some of the activations at random. That's what has happened here; the crosses mean those activations have been deleted at random, and if we do so, you can see we end up with a lot less computation going on. What dropout does is, for each mini-batch, randomly delete a different set of activations from whatever layers you ask for. The idea is that dropout helps the model generalize, because if a particular activation was effectively learning some particular piece of the input, memorizing it, then sometimes it gets randomly deleted, and suddenly it's not doing anything useful at all. By randomly deleting activations, dropout ensures that activations can't become over-specialized at doing just one thing, because if they did, then whenever they're randomly deleted the model wouldn't work.

So here is the entire implementation of a dropout layer. You pass it some value p, which is the probability that an activation gets deleted, and we store that away. Then in the forward, you get your activations; if you're not training, that is if you're doing validation, then we don't do dropout; but if we are training, we create a mask which is a Bernoulli random variable. What does that mean? It means a bunch of ones and zeros, where 1 - p is the probability that we get a one and p is the probability that we get a zero, and then we just multiply that by our input, which converts some of the inputs into zeros, which is basically deleting them. You should check out some of the details, for example why we divide by 1 - p, which is described here. Normally in the lesson I would show you an example of what bernoulli_ does, but of course nowadays we're getting to the advanced classes and you're expected to do it yourself, so be sure to create a little cell here, create a tensor, run bernoulli_ on it, and make sure you see exactly what it's doing, so that you can then understand this class. Of course we don't have to use this class we made ourselves, we can just use nn.Dropout, but you can use this class yourself because it does the same thing; again, we're trying to make sure that we know how to build stuff from scratch.

This special self.training attribute is set automatically for every module by fastai, based on whether you're in the validation part or the training part of your training loop. It's also part of PyTorch: if you're not using fastai, you have to call the train method on a module to set training to True, and the eval method to set it to False, for every module inside some other module. So that's one great approach to regularization.
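The from-scratch dropout layer described here looks roughly like this; it behaves like nn.Dropout (names illustrative):

```python
import torch
from torch import nn

class Dropout(nn.Module):
    "Zero activations at random with probability p, during training only."
    def __init__(self, p):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training: return x                        # no dropout at validation time
        mask = x.new_empty(*x.shape).bernoulli_(1 - self.p)   # 1 with probability 1-p, else 0
        return x * mask / (1 - self.p)                        # rescale so the expected value is unchanged
```

To see what bernoulli_ does, try something like `torch.zeros(10).bernoulli_(0.5)` in a cell.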
Another approach, which I've only seen used in recurrent neural nets, is activation regularization and temporal activation regularization, and it's very similar to the question we were just asked. Activation regularization looks very similar to weight decay, but rather than adding some multiplier times the sum of squares of the weights, we add some multiplier times the sum of squares of the activations; in other words, we're not just trying to decrease the weights but to decrease the total activations. Then, similarly, temporal activation regularization looks at the difference between the activations from the previous time step and this time step: take that difference, square it, and multiply by some value. So these are two hyperparameters, alpha and beta, and the higher they are, the more regularized your model. With TAR we're saying that no layer of the LSTM should too dramatically change the activations from one time step to the next, and with alpha we're saying that no layer of the LSTM should create too-large activations; the model won't create those large activations or large changes unless the loss improves by enough to make it worth it.

There's then, I think, just one more thing we need to know about, which is called weight tying, and weight tying is a very minor change. Let's have a look at it here: this is the embedding we had before, this is the LSTM we had before, this is where we're going to introduce dropout, and this is the hidden-to-output linear layer we had before, but we're going to add one more line of code, which says that the hidden-to-output weights are actually equal to the input-to-hidden weights. This is not just setting them once; it's setting them so that they're a reference to the exact same object in memory, the exact same tensor, so the weights of the hidden-to-output layer will always be identical to the weights of the input-to-hidden layer. This is called weight tying, and the reason we do it is that, conceptually, in a language model, predicting the next word is about converting activations into English words, while an embedding is about converting English words into activations, and there's a reasonable hypothesis that those are basically the same computation, or at least the reverse of it, so why shouldn't they use the same weights? And it turns out, lo and behold, that if you use the same weights it does work a little bit better.

So here's our forward: do the input to hidden, do the RNN, apply the dropout, do the detach, and then apply the hidden to output, which is using exactly the same weights as the input to hidden; reset is the same as before. We haven't created the RNN regularizer from scratch here, but you can add it as a callback, passing in your alpha and your beta; and if you call TextLearner instead of Learner, it will add the ModelResetter and the RNNRegularizer for you, so that's one of the things TextLearner does, and this code is the same as this code. We can then train a model again, also adding weight decay, and look at this: we're getting up close to 90% accuracy.

We've covered a lot in this lesson, but the amazing thing is that we've just replicated all of the pieces in an AWD-LSTM, all of the pieces in this state-of-the-art recurrent neural net, which we showed we could use in the previous notebook to get what was until very recently state-of-the-art results for text classification, far more quickly and with far less compute and memory than the approaches of the last year or so which have beaten that benchmark. So this is a really efficient, really accurate approach; it's still the state of the art in many, many academic situations, it's still very widely used in industry, and it's pretty cool that we've actually seen how to write it from scratch.
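Putting those last pieces together, here's a rough sketch of the final model with dropout and weight tying (plain PyTorch, illustrative names). Note that the notebook's version also returns the raw and dropped-out LSTM activations so that fastai's RNNRegularizer callback can compute the AR and TAR penalties from them; that part is omitted here for brevity.

```python
import torch
from torch import nn

class LMModel7(nn.Module):
    "Stacked LSTM with dropout on its output and tied input/output weights."
    def __init__(self, vocab_sz, n_hidden, n_layers, bs, p):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
        self.drop = nn.Dropout(p)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h_o.weight = self.i_h.weight        # weight tying: the very same tensor in memory
        self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]

    def forward(self, x):
        raw, h = self.rnn(self.i_h(x), self.h)
        out = self.drop(raw)                     # dropout on the LSTM outputs
        self.h = [h_.detach() for h_ in h]
        return self.h_o(out)

    def reset(self):
        for h in self.h: h.zero_()
```

With fastai you would then train it with something like `learn = TextLearner(dls, LMModel7(len(vocab), 64, 2, bs, 0.5), loss_func=CrossEntropyLossFlat(), metrics=accuracy)` followed by `learn.fit_one_cycle(15, 1e-2, wd=0.1)`; TextLearner adds the ModelResetter and RNNRegularizer callbacks for you.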
The main thing to mention for the further research section is to have a look at the source code for AWD_LSTM in fastai and see if you can work out how those lines of code map to the concepts that we've seen in this chapter.

Rachel, do we have any questions? So here we have come to the conclusion of what was originally going to be seven lessons and turned into eight lessons. I hope that you've got a lot out of this; thank you for staying with us. What a lot of folks do when they finish, at least people who have finished previous courses, is go back to lesson one and try to repeat it, but with a lot less looking at the notebooks, a lot more doing stuff from scratch yourself, and going deeper into the assignments. So that's one thing you could do next.

Another thing you could do next would be to pick out a Kaggle competition to enter, or pick a book or a paper about deep learning that you want to read, and team up with some friends to do a paper reading group or a book reading group. One of the most important things for keeping the learning going is to get together with other people on the learning journey.

Another great way to do that, of course, is through the forums. If you haven't been using the forums much so far, no problem, but now might be a great time to get involved and find some projects that are going on that look interesting. It's fine if you're not an expert; obviously, for any of those projects, the people already doing them are going to know more about them than you do at this point, because they're already doing them. But if you drop into a thread and say hey, I would love to learn more about this, how do I get started, or have a look at the wiki posts to find out and try things out, you can start getting involved in other people's projects and help them out.

And of course, don't forget about writing. If you haven't tried writing a blog post yet, maybe now is a great time to do that. Pick something that's interesting to you, especially if it's in your area of expertise at work, or a hobby, or something specific to where you live. Maybe you could try to build some kind of text classifier or text generator for particular kinds of text that you know about; that would be a super interesting thing to try out, and be sure to share it with the folks on the forum.

So there are a few ideas. Don't let this be the end of your learning journey; keep going, and then come back and try part two. If it's not out yet, obviously you'll have to wait until it is, but if it is out, you might want to spend a couple of months really experimenting with this before you move on to part two, to make sure that everything in part one feels pretty solid to you.

Well, thank you very much, everybody, for your time. We've really enjoyed doing this course; it's been a tough course for us to teach with all this COVID-19 stuff going on at the same time, and I'm really glad we've got through it. I'm particularly grateful to Sylvain, who has been extraordinary in making so much of this happen, especially since I've been so busy with COVID-19 work, around masks in particular; it's really thanks to Sylvain that everything has come together. And of course to Rachel, who's been here with me on every one of these lessons: thank you so much. I'm looking forward to seeing you again in a future course. Thanks, everybody.