Back to Index

Lesson 10: Deep Learning Part 2 2018 - NLP Classification and Translation


Chapters

0:18 Review of Last Week
18:12 Segmentation
18:34 Feature Pyramids
18:57 NLP
21:28 Basic Path for NLP
27:22 Train Test Split
28:05 Tokenization
43:45 Building a Language Model on Wikipedia
48:07 Create an Embedding Matrix
61:24 Averaging the Weights of Embeddings
71:36 Language Model
76:49 RNN Encoder
78:07 Regularizing and Optimizing LSTM Language Models
79:10 Tie Weights
81:56 Measure Accuracy
85:26 What Is Your Ratio of Paper Reading versus Coding in a Week
88:59 Universal Sentence Encoder
99:38 Add More than One Hidden Layer
109:48 Learning Rate
111:57 Concat Pooling
121:26 Trick Number Two Is To Create Python Scripts
122:55 IMDB Scripts

Transcript

So, welcome to lesson 10, or as somebody on the forum described it, lesson 10, mod 7, which is probably a clearer way to think about this. We're going to be talking about NLP. Before we do, let's do a quick review of last week. Last week, there's quite a few people who have flown here to San Francisco for this in-person course.

I'm seeing them pretty much every day, they're working full-time on this, and quite a few of them are still struggling to understand the material from last week. So if you're finding it difficult, that's fine. One of the reasons I kind of put it up there up front is so that we've got something to concentrate on and think about and gradually work towards, so that by lesson 14, mod 7, you'll get a second crack at it.

But there's so many pieces, so hopefully you can keep developing better understanding. To understand the pieces, you'll need to understand the shapes of convolutional layer outputs, and receptive fields, and loss functions, and everything. So it's all stuff that you need to understand for all of your deep learning studies anyway.

So everything you do to develop an understanding of last week's lesson is going to help you with everything else. One key thing I wanted to mention is we started out with something which is really pretty simple, which is single object classifier, single object bounding box without a classifier, and then single object classifier and bounding box.

And anybody who's spent some time studying since lesson 8, mod 7, has got to the point where they understand this bit. Now the reason I mention this is because the bit where we go to multiple objects is actually almost identical to this, except we first have to solve the matching problem.

We end up creating far more activations than we need for our number of bounding boxes, ground truth bounding boxes, so we match each ground truth object to a subset of those activations. And once we've done that, the loss function that we then do to each matched pair is almost identical to this loss function.

So if you're feeling stuck, go back to lesson 8 and make sure you understand the data set, the data loader, and most importantly the loss function from the end of lesson 8 or the start of lesson 9. So once we've got this thing which can predict the class and bounding box for one object, we went to multiple objects by creating more activations.

We had to then deal with the matching problem, we then basically moved each of those anchor boxes in and out a little bit and around a little bit, so they tried to line up with particular ground truth objects. And we talked about how we took advantage of the convolutional nature of the network to try to have activations that had a receptive field that was similar to the ground truth object we were predicting.

And Chloe Sultan provided this fantastic picture, I guess for her own notes, but she shared it with everybody, which is lovely, to talk about what does SSD multi-head forward do line by line. And I partly wanted to show this to help you with your revision, but I also partly wanted to show this to kind of say, doing this kind of stuff is very useful for you to do, like walk through and in whatever way helps you make sure you understand something.

You can see what Chloe's done here is she's focused particularly on the dimensions of the tensor at each point in the path as we kind of gradually down-sample using these stride-2 convolutions, making sure she understands why those grid sizes happen, and then understanding how the outputs come out of those.

And so one thing you might be wondering is how did Chloe calculate these numbers? I don't know the answer, I haven't spoken to her, but obviously one approach would be to think it through from first principles. But then you want to check: am I right? And so this is where you've got to remember this pdb.set_trace idea.

So I just went in just before class, went into SSD_MultiHead.forward and entered pdb.set_trace(), and then I ran a single batch. And so I put the trace at the end, and then I could just print out the size of all of these guys. Which reminds me, last week there may have been a point where I said 21 + 4 = 26, which is not true in most universes.

And by the way, when I code I do that stuff, that's the kind of thing I do all the time. So that's why we have debuggers and know how to check things and do things in small little bits along the way. So this idea of putting a debugger inside your forward function and printing out the sizes is something which is damn super helpful.
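As a concrete illustration of that trick, here is a minimal sketch (not the actual SSD_MultiHead code, just a made-up two-layer head) of dropping a trace into a module's forward so you can inspect activation shapes for one batch:

```python
# Minimal sketch: a toy head with a trace in forward() so you can
# print activation sizes at the pdb prompt.
import pdb
import torch
import torch.nn as nn

class TinyHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(512, 256, 3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(256, 256, 3, stride=2, padding=1)

    def forward(self, x):
        x1 = self.conv1(x)
        x2 = self.conv2(x1)
        pdb.set_trace()   # at the prompt: p x.size(), x1.size(), x2.size()
        return x2

# TinyHead()(torch.randn(2, 512, 7, 7))  # run one batch to hit the trace
```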

Or you could just put a print statement here as well. So I actually don't know if that's how Chloe figured it out, but that's how I would if I was her. And then we talked about increasing k, which is the number of anchor boxes for each convolutional grid cell, which we can do with different zooms and different aspect ratios.

And so that gives us a plethora of activations, and therefore predicted bounding boxes, which we then went down to a small number using non-maximum suppression. And I'll try to remember to put a link -- there's a really interesting paper that one of our students told me about that I hadn't heard about, which is attempting to -- you know I've mentioned non-maximum suppression.

It's kind of hacky, kind of ugly, it's totally heuristic, I didn't even talk about the code because it seems kind of hideous. So somebody actually came up with a paper recently which attempts to do an end-to-end ConvNet to replace that NMS piece. So I'll put that paper up. Nobody's created a PyTorch implementation yet, so it would be an interesting project if anyone wanted to try that.

One thing I've noticed in our study groups during the week is not enough people reading papers. What we are doing in class now is implementing papers. The papers are the real ground truth. And I think from talking to people, a lot of the reason people aren't reading papers is because a lot of people don't think they're capable of reading papers, they don't think they're the kind of people that read papers.

But you are. You're here. And we started looking at a paper last week and we read the words that were in English and we largely understood them. So if you actually look through this picture from the SSD paper carefully, you'll realize that SSD_MultiHead.forward is not doing the same thing as this.

And then you might think, oh, I wonder if this is better. And my answer is probably, because SSD_MultiHead.forward was the first thing I tried just to get something out there, but between this and the YOLO version, there are probably much better ways. One thing you'll notice in particular is they use a smaller k, but they have a lot more sets of grids: 1x1, 3x3, 5x5, 10x10, 19x19 and 38x38, around 8,700 boxes in total, so a lot more than we had, so that'd be an interesting thing to experiment with.

Another thing I noticed is that we had 4x4, 2x2, 1x1, which means there's a lot of overlap like every set fits within every other set. In this case where you've got 1, 3, 5, you don't have that overlap, so it might actually make it easier to learn. So there's lots of interesting things you can play with based on stuff that's either trying to make it closer to the paper or think about other things you could try that aren't in the paper or whatever.

Perhaps the most important thing I would recommend is to put the code and the equations next to each other. Yes, Rachel? There was a question of whether you could speak about the use_clr argument in the fit function. We will get there. So put the code and the equations from the paper next to each other, and you're in one of two groups.

You're either a code person like me who's not that happy about math, in which case I start with the code and then I look at the math and I learn about how the math maps to the code and end up eventually understanding the math. Or you did a PhD in stochastic differential equations like Rachel, whatever that means, in which case you can look at the math and then learn about how the code implements the math.

But either way, unless you're one of those rare people who is equally comfortable in either world, you'll learn about one or the other. Now learning about code is pretty easy because there's documentation and we know how to look it up and so forth. Sometimes learning the math is hard because the notation might seem hard to look up, but there's actually a lot of resources.

For example, a list of mathematical symbols on Wikipedia is amazingly great. It has examples of them, explanations of what they mean, and tells you what to search for to find out more about it. Really terrific. And if you Google for math notation cheat sheet, you'll find more of these kinds of terrific resources.

So over time, you do need to learn the notation, but as you'll see from the Wikipedia page, there's not actually that much of it. Obviously there's a lot of concepts behind it, but once you know the notation you can then quickly look up the concept as it pertains to the particular thing you're studying.

Nobody learns all of math and then starts machine learning. Everybody, even top researchers I know, when they're reading a new paper will very often come to bits of math they haven't seen before and they'll have to go away and learn that bit of math. Another thing you should try doing is to recreate things that you see in the papers.

So here was the key, most important figure 1 from the focal loss paper, the RetinaNet paper. So recreate it. And very often I put these challenges up on the forums, so keep an eye on the lesson threads on the forums. I put this challenge up there and within about 3 minutes Sarada had said "done it", in Microsoft Excel naturally, along with actually a lot more information than in the original paper.

A nice thing here is that she was actually able to draw a line showing at a 0.5 ground truth probability what's the loss for different amounts of gamma, which is kind of cool. And if you want to cheat, she's also provided Python code on the forum too. I did discover a minor bug in my code last week, the way that I was flattening out the convolutional activations did not line up with how I was using them in the loss function, and fixing that actually made it quite a bit better, so my motorbikes and cows are actually in the right place.

So when you go back to the notebook, you'll see it's a little less bad than it was last time. So there's some quick coverage of what's gone before. Yes? >> Quick question, are you going to put the PowerPoint on GitHub? >> I'll put a subset of it on GitHub.

>> And then secondly, usually when we down-sample, we increase the number of filters, or depth. When we're down-sampling from 7x7 to 4x4, why are we decreasing the number from 512 to 256? Why decrease the dimension in the SSD head, is it performance related? >> 7x7 to 4x4? Oh, 7 by 7 to 4 by 4?

I guess they've got the stars and the colors. >> Oh yes, that's right, they're weird italics. >> It's because -- well, largely it's because that's kind of what the papers tend to do. We've got a number of -- well, we have a number of output paths and we kind of want each one to be the same, so we don't want each one to have a different number of filters.

And also this is what the papers did, so I was trying to match up with that, having these 256. It's a different concept because we're taking advantage of not just the last layer, but the layers before that as well. Life's easier if we make them more consistent. So we're now going to move to NLP, and so let me kind of lay out where we're going here.

We've seen a couple of times now this idea of taking a pre-trained model, in fact we've seen it in every lesson. Take a pre-trained model, rip off some stuff on the top, replace it with some new stuff, get it to do something similar. And so what we're going to do -- and so we've kind of dived in a little bit deeper to that, to say like okay, with ConvLearner.pretrained, it had a standard way of sticking stuff on the top which does a particular thing, which was classification.

And then we learned actually we can stick any PyTorch module we like on the end and have it do anything we like with a custom head. And so suddenly you discover, wow, there's some really interesting things we can do. In fact, that reminds me, Yang Lu said, well, what if we did a different kind of custom head?

And so the different custom head was, well, let's take the original pictures and rotate them and then make our dependent variable the opposite of that rotation basically and see if it can learn to unrotate it. And this is like a super useful thing, obviously. In fact, I think Google Photos nowadays has this option that it will actually automatically rotate your photos for you.

But the cool thing is, as Yang Lu shows here, you can build that network right now by doing exactly the same as our previous lesson, but your custom head is one that spits out a single number which is how much to rotate by, and your dataset has a dependent variable which is how much did you rotate by.

So you suddenly realize with this idea of a backbone plus a custom head, you can do almost anything you can think about. So today we're going to look at the same idea and say, okay, how does that apply to NLP? And then in the next lesson, we're going to go further and say, well, if NLP and computer vision kind of let you do the same basic ideas, how do we combine the two?

And we're going to learn about a model that can actually learn to find word structures from images, or images from word structures, or images from images. And that will form the basis, if you wanted to go further, of doing things like going from an image to a sentence, which is called image captioning, or going from a sentence to an image, which we'll kind of start to do: phrase to image.

And so from there, we're going to go deeper then into computer vision to think about what other kinds of things we can do with this idea of a pre-trained network plus a custom head. And so we'll look at various kinds of image enhancement, like increasing the resolution of a low-res photo to guess what was missing, or adding artistic filters on top of photos, or changing photos of horses into photos of zebras and stuff like that.

And then finally, that's going to bring us all the way back to bounding boxes again. And so to get there, we're going to first of all learn about segmentation, which is not just figuring out where a bounding box is, but figuring out what every single pixel in an image is part of.

So this pixel is part of a person, this pixel is part of a car. And then we're going to use that idea, particularly an idea called U-Net, and it turns out that this idea from U-Net we can apply to bounding boxes, where it's called feature pyramids. Everything has to have a different name in every slightly different area.

And we'll use that to hopefully get some really good results with bounding boxes. So that's kind of our path from here. So it's all going to build on each other, but take us into lots of different areas. Now, for NLP in the last part, we relied on a pretty great library called torchtext.

But as pretty great as it was, I've since then found the limitations of it too problematic to keep using it. As a lot of you complained on the forums, it's pretty damn slow. Partly because it's not doing parallel processing, and partly it's because it doesn't remember what you did last time and it does it all over again from scratch.

And then it's kind of hard to do fairly simple things, like a lot of you were trying to get into the toxic comment competition on Kaggle, which was a multi-label problem, and trying to do that with TorchText. I eventually got it working, but it took me like a week of hacking away, which is kind of ridiculous.

So to fix all these problems, we've created a new library called fastai.text. fastai.text is a replacement for the combination of torchtext and fastai.nlp. So don't use fastai.nlp anymore; that's obsolete. It's slower, it's more confusing, it's less good in every way. But there's a lot of overlap. Intentionally, a lot of the classes have the same names, a lot of the functions have the same names, but this is the non-torchtext version.

So we're going to work with IMDB again. For those of you who have forgotten, go back and check out lesson 4. Basically this is a dataset of movie reviews, and you remember we used it to find out whether we might enjoy Zombiegeddon or not, and we thought probably my kind of thing.

So we're going to use the same dataset, and by default it calls itself aclImdb, so this is just the raw dataset that you can download. And as you can see, I'm doing from fastai.text import *. There's no torchtext, and I'm not using fastai.nlp. I'm going to use pathlib as per usual.

We're going to learn about what these tags are later. So you might remember the basic path for NLP is that we have to take sentences and turn them into numbers, and there's a couple of steps to get there. So at the moment, somewhat intentionally, FastAI.Text doesn't provide that many helper functions.

It's really designed more to let you handle things in a fairly flexible way. So as you can see here, I wrote something called get_texts, which goes through each thing in classes, and these are the three classes that they have in IMDB: negative, positive, and then there's another folder, unsupervised.

That's stuff they haven't gotten around to labeling yet. So I'm just going to call that a class. And so I just go through each one of those classes, and then I just find every file in that folder with that name, and I open it up and read it and chuck it into the end of this array.

And as you can see, with Pathlib it's super easy to grab stuff and pull it in, and then the label is just whatever class I'm up to so far. So I'll go ahead and do that for the train bit, and I'll go ahead and do that for the test bit.

So there's 70,000 in train, 25,000 in test, 50,000 of the train ones are unsupervised. We won't actually be able to use them when we get to the classification piece. So I actually find this much easier than the torch text approach of having lots of layers and wrappers and stuff, because in the end reading text files is not that hard.

One thing that's always a good idea is to sort things randomly. It's useful to know this simple trick for sorting things randomly, particularly when you've got multiple things you have to sort the same way, in this case labels and texts. np.random.permutation, if you give it an integer, gives you back the integers from 0 up to but not including the number you give it, in random order.

So you can then just pass that in as an indexer to give you a list that's sorted in that random order. So in this case it's going to sort train texts and train labels in the same random way. So it's a useful little idiom to use. So now I've got my texts and my labels sorted.
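For instance, here is a minimal sketch of that idiom, with toy stand-ins for the texts and labels arrays:

```python
import numpy as np

trn_texts = np.array(['good movie', 'bad movie', 'great film'])   # toy stand-ins
trn_labels = np.array([1, 0, 1])

# One permutation, applied to both arrays, keeps texts and labels aligned.
trn_idx = np.random.permutation(len(trn_texts))
trn_texts, trn_labels = trn_texts[trn_idx], trn_labels[trn_idx]
```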

I can go ahead and create a data frame from them. Why am I doing this? The reason I'm doing this is because there is a somewhat standard approach starting to appear for text classification datasets, which is to have your training set as a CSV file with the labels first and the text of the NLP document second in a train.csv and a test.csv.

So basically it looks like this. You've got your labels and your texts. And then a file called classes.txt, which just lists the classes. I think it's somewhat standard. In a reasonably recent academic paper, Yann LeCun and a team of researchers looked at quite a few datasets and they used this format for all of them.

And so that's what I've started using as well for my recent paper. So what I've done is set things up so that if you put your data into this format, the whole notebook will work every time. So rather than having a thousand different formats and readers and writers and whatever, I've just said let's just pick a standard format, and your job (which you can do perfectly well) is to put your data into that format, which is the CSV file.
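Here is a rough sketch of writing that format out with pandas; the names trn_texts/trn_labels, val_texts/val_labels and CLAS_PATH mirror the lesson notebook but are assumptions here:

```python
import pandas as pd

# CLAS_PATH: a pathlib Path pointing at the classification folder (assumed).
col_names = ['labels', 'text']
df_trn = pd.DataFrame({'text': trn_texts, 'labels': trn_labels}, columns=col_names)
df_val = pd.DataFrame({'text': val_texts, 'labels': val_labels}, columns=col_names)

# Label first, text second, no header row.
df_trn.to_csv(CLAS_PATH/'train.csv', header=False, index=False)
df_val.to_csv(CLAS_PATH/'test.csv', header=False, index=False)
(CLAS_PATH/'classes.txt').open('w').writelines(f'{c}\n' for c in ['neg', 'pos', 'unsup'])
```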

The CSV files have no header by default. Now you'll notice at the start here that I had two different paths. One was the classification path, one was the language model path. In NLP, you'll see LM all the time. LM means language model in NLP. So the classification path is going to contain the information that we're going to use to create a sentiment analysis model.

The language model path is going to contain the information we need to create a language model. So they're a little bit different. One thing that's different is that when we create the train.csv in the classification path, we remove everything that has a label of 2, because a label of 2 means unsupervised.

We remove the unsupervised data because we can't use it in the classifier. So that means this is going to have 25,000 positive and 25,000 negative. The second difference is the labels. For the classification path, the labels are the actual labels. But for the language model, there are no labels, so we just use a bunch of zeroes.

That just makes it a little bit easier because we can use a consistent data frame format, or CSV format. Now for the language model, we can create our own validation set. So you've probably come across by now sklearn.model_selection.train_test_split, which is a really simple little function that grabs a dataset and randomly splits it into a training set and a validation set according to whatever proportion you specify.

So in this case, I can concatenate my classification training and validation together. So it's going to be 100,000 altogether, split it by 10%, so now I've got 90,000 training, 10,000 validation for my language model. So go ahead and save that. So that's the basics of getting the data into a standard format for my language model and my classifier.
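A rough sketch of that split for the language model files, under the same naming assumptions as before (LM_PATH, trn_texts, val_texts are assumed to exist):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# All texts go into the LM; labels don't matter here, so they're just zeros.
trn_texts_lm, val_texts_lm = train_test_split(
    np.concatenate([trn_texts, val_texts]), test_size=0.1)

for name, texts in [('train.csv', trn_texts_lm), ('test.csv', val_texts_lm)]:
    df = pd.DataFrame({'text': texts, 'labels': [0] * len(texts)},
                      columns=['labels', 'text'])
    df.to_csv(LM_PATH/name, header=False, index=False)
```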

So the next thing we need to do is tokenization. Tokenization means at this stage we've got for a document, for a movie review, we've got a big long string, and we want to put it into a list of tokens, which are kind of a list of words, but not quite.

For example, we probably want "don't" to become two tokens, "do" and "n't", and we probably want a full stop to be its own token. So tokenization is something that we pass off to a terrific library called spaCy, partly terrific because an Australian wrote it and partly terrific because it's good at what it does. We've put a bit of stuff on top of spaCy, but the vast majority of the work is being done by spaCy.

Before we pass it to spaCy, I've written this simple fixup function. Basically, each time I looked at a different dataset (and I've looked at about a dozen in building this), every one had different weird things that needed to be replaced. Here are all the ones I've come up with so far.

Hopefully this will help you out as well. So I un-escape all the HTML entities, and then there's a bunch more things I replace. Have a look at the result of running this on text that you put in and make sure there aren't more weird tokens in there. It's amazing how many weird things people do to text.
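Here's a cut-down sketch of what such a fixup function can look like; the actual substitutions in fastai.text are more extensive, so treat these particular replacements as examples only:

```python
import html
import re

def fixup(x):
    # Normalise a few common artifacts, then un-escape HTML entities.
    x = x.replace('<br />', ' ').replace('\\"', '"').replace('<unk>', 'u_n')
    x = re.sub(r' +', ' ', x)   # collapse runs of spaces
    return html.unescape(x)

print(fixup('This movie&#39;s great!<br />Really &amp; truly.'))
```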

So basically I've got this function called get_all, which is going to go ahead and call get_texts, and get_texts is going to do a few things, one of which is to apply that fixup that we just mentioned. So let's kind of look through this because there's some interesting things to point out.

So I'm going to use pandas to open our train.csv from the language model path, but I'm passing in an extra parameter you may not have seen before called chunksize. Python and pandas can both be pretty inefficient when it comes to storing and using text data. And so you'll see that very few people in NLP are working with large corpora, and I think part of the reason is that traditional tools have just made it really difficult: you run out of memory all the time.

So this process I'm showing you today I have used on corpora of over a billion words successfully, using this exact code. And so one of the simple tricks is to use this chunksize parameter with pandas. What that means is that pandas does not return a data frame; it returns an iterator that we can iterate through to get chunks of a data frame.

And so that's why I don't say "tok_train = get_texts" but instead I call "get_all" which loops through the data frame. But actually what it's really doing is it's looping through chunks of the data frame. So each of those chunks is basically a data frame representing a subset of the data.
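In code, that looks roughly like this; the path and the chunk size of 24,000 follow the lesson notebook but are assumptions here:

```python
from pathlib import Path
import pandas as pd

LM_PATH = Path('data/imdb_lm')   # hypothetical location of the LM csv files
chunksize = 24000

# With chunksize set, read_csv returns an iterator of DataFrames.
df_trn = pd.read_csv(LM_PATH/'train.csv', header=None, chunksize=chunksize)
for chunk in df_trn:
    labels = chunk[0].values.astype(int)
    texts = chunk[1].astype(str).values
    # ... fixup and tokenize this chunk ...
```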

"When I'm working with NLP data, many times I come across data with foreign text or characters. Is it better to discard them or keep them?" No, no, definitely keep them. And this whole process is Unicode, and I've actually used this on Chinese text. This is designed to work on pretty much anything.

In general, most of the time it's not a good idea to remove anything. Old-fashioned NLP approaches tend to do all this lemmatization and all these normalization steps to get rid of lowercase everything blah blah blah. But that's throwing away information which you don't know ahead of time whether it's useful or not.

So don't throw away information. So we go through each chunk, each of which is a data frame, and we call get_texts. get_texts is going to grab the labels and make them into ints. It's then going to grab the texts. And I'll point out a couple of things. The first is that before we include the text, we have this beginning-of-stream token, which you might remember we used way back up here.

There's nothing special about these particular strings of letters, they're just ones I figured don't appear in normal texts very often. So every text is going to start with XBOS. Why is that? Because it's often really useful for your model to know when a new text is starting. For example, if it's a language model, you're going to concatenate all the text together, and so it'd be really helpful for it to know this article is finished and a new one started, so I should probably forget some of that context now.

Ditto, quite often texts have multiple fields like a title, an abstract, and then the main document. And so by the same token, I've got this thing here which lets us actually have multiple fields in our CSV. So this process is designed to be very flexible. And again, at the start of each one, we put a special field-starts-here token, followed by the number of the field that's starting, for as many fields as we have.

Then we apply our fixup to it, and then most importantly we tokenize it, and we tokenize it using multiple processes. Tokenizing tends to be pretty slow, but we've all got multiple cores in our machines now, and some of the better machines on AWS and stuff can have dozens of cores.

Here on our university computer, we've got 56 cores. So spaCy is not very amenable to multiprocessing, but I finally figured out how to get it to work. And the good news is it's all wrapped up in this one function now. And so all you need to pass to that function is a list of things to tokenize, which each part of that list will be tokenized on a different core.

And so I've also created this function called partition_by_cores, which takes a list and splits it into sub-lists. The number of sub-lists is the number of cores that you have in your computer. So on my machine, without multiprocessing, this takes about an hour and a half, and with multiprocessing it takes about two minutes.

So it's a really handy thing to have. And now that this code's here, feel free to look inside it and take advantage of it through your own stuff. Remember, we all have multiple cores even in our laptops, and very few things in Python take advantage of it unless you make a bit of an effort to make it work.
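Here is a rough sketch of the idea (this is not fastai's actual implementation, just the shape of it): split the texts into one chunk per core, tokenize each chunk in its own process with spaCy, and stitch the results back together in order.

```python
from multiprocessing import Pool, cpu_count
import spacy

def tok_chunk(texts):
    nlp = spacy.blank('en')   # tokenizer only, no tagger/parser
    return [[t.text for t in nlp(s)] for s in texts]

def partition_by_cores(a):
    # Contiguous chunks, one per core, preserving order.
    n = cpu_count()
    sz = (len(a) + n - 1) // n
    return [a[i:i + sz] for i in range(0, len(a), sz)]

def tokenize_mp(texts):
    with Pool(cpu_count()) as p:
        results = p.map(tok_chunk, partition_by_cores(texts))
    return [doc for chunk in results for doc in chunk]   # flatten, order kept

if __name__ == '__main__':
    print(tokenize_mp(["Don't stop.", "It was great!"]))
```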

So there's a couple of tricks to get things working quickly and reliably. As it runs, it prints out how it's going. And so here's the result of the end. Beginning of stream token, beginning of field number one token, here's the tokenized text. You'll see that the punctuation is on the whole, now a separate token.

You'll see there's a few interesting little things. One is this. What's this? t_up mgm. Well, MGM was originally capitalized, but the interesting thing is that normally people either lowercase everything or they leave the case as is. Now if you leave the case as is, then "SCREW YOU" in all caps and "screw you" in lowercase are two totally different sets of tokens that have to be learned from scratch.

Or if you lowercase them all, then there's no difference at all between the two. So how do you fix this so that you get the semantic impact of "I'm shouting now!" without every single word having to learn a shouted version versus a normal version?

And so the idea I came up with, and I'm sure other people have done this too, is to come up with a unique token to mean the next thing is all uppercase. So then I lowercase it, so now whatever used to be uppercase is now lowercase, it's just one token, and then we can learn the semantic meaning of all uppercase.

And so I've done a similar thing. If you've got 29 exclamation marks in a row, we don't learn a separate token for 29 exclamation marks. Instead I put in a special token for the next thing repeats lots of times, and then I put the number 29, and then I put the exclamation mark.
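As a toy illustration of those two tricks (the real Tokenizer in fastai.text does this with regexes as part of a bigger pipeline, and its marker strings may differ):

```python
import re

def deal_caps(tok):
    # 'MGM' -> ['t_up', 'mgm']; ordinary tokens pass through unchanged.
    return ['t_up', tok.lower()] if tok.isupper() and len(tok) > 1 else [tok]

def deal_reps(s):
    # 'wow!!!!!' -> 'wow tk_rep 5 ! ': marker, count, then the character once.
    return re.sub(r'(\S)(\1{3,})',
                  lambda m: f' tk_rep {len(m.group(2)) + 1} {m.group(1)} ', s)

print(deal_caps('MGM'))       # ['t_up', 'mgm']
print(deal_reps('wow!!!!!'))  # 'wow tk_rep 5 ! '
```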

And so there's a few little tricks like that, and if you're interested in NLP, have a look at the code for Tokenizer for these little tricks that I've added in, because some of them are kind of fun. So the nice thing with doing things this way is we can now just np.save that and load it back up later.

We don't have to recalculate all this stuff each time like we tend to have to do with TorchText or a lot of other libraries. So we've now got it tokenized. The next thing we need to do is turn it into numbers, which we call numericalizing it. And the way we numericalize it is very simple.

We make a list of all the words that appear in some order, and then we replace every word with its index into that list. The list of all the tokens that appear, we call the vocabulary. So here's an example of some of the vocabulary. The Counter class in Python is very handy for this.

It basically gives us a list of unique items and their counts. So here are the 25 most common things in the vocabulary. You can see there are things like apostrophe s and double quote and end of paragraph, and also stuff like that. Generally speaking, we don't want every unique token in our vocabulary.

If it doesn't appear at least two times, then it might just be a spelling mistake or a word so rare that we can't learn anything about it anyway. Also the stuff that we're going to be learning about, at least so far in this part, gets a bit clunky once you've got a vocabulary bigger than 60,000.

Time permitting, we may look at some work I've been doing recently on handling larger vocabularies, otherwise that might have to come in a future course. But actually for classification, I've discovered that doing more than about 60,000 words doesn't seem to help anyway. So we're going to limit our vocabulary to 60,000 words, things that appear at least twice.

So here's a simple way to do that. Use Counter's .most_common, passing in the max vocab size. That'll sort it by frequency, by the way. And if it appears less often than the minimum frequency, then don't bother with it at all. So that gives us itos.

That's the same name that torchtext used. Remember it means int-to-string. So this is just the list of the unique tokens in the vocab. I'm going to insert two more tokens: a vocab item for unknown and a vocab item for padding. Then we can create the dictionary which goes in the opposite direction, so string-to-int.

And that won't cover everything because we intentionally truncated it down to 60,000 words. And so if we come across something that's not in the dictionary, we want to replace it with 0 for unknown, so we can use a defaultdict for that, with a lambda function that always returns 0.

So you can see all these things we're using that keep coming back up. So now that we've got our stoi dictionary defined, we can then just call that for every word in every sentence. And so there's our numericalized version. And so of course the nice thing is, again, we can save that step as well.
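Put together, the numericalization steps just described look roughly like this, assuming tok_trn is a list of tokenized documents (each a list of token strings); the unknown and padding token names are illustrative:

```python
import collections

tok_trn = [['xbos', 'the', 'movie', 'was', 'great'],
           ['xbos', 'the', 'movie', 'was', 'awful']]        # toy stand-in

max_vocab, min_freq = 60000, 2
freq = collections.Counter(tok for doc in tok_trn for tok in doc)

# Keep the most common tokens that appear at least min_freq times.
itos = [tok for tok, c in freq.most_common(max_vocab) if c >= min_freq]
itos.insert(0, '_pad_')
itos.insert(0, '_unk_')

# Anything not in the vocab maps to 0, the unknown token.
stoi = collections.defaultdict(lambda: 0, {tok: i for i, tok in enumerate(itos)})
trn_lm = [[stoi[tok] for tok in doc] for doc in tok_trn]
```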

So each time we get to another step, we can save it. And these are not very big files. Compared to what you get used to with images, text is generally pretty small. It's very important to also save that vocabulary, because this list of numbers means nothing unless you know what each number refers to, and that's what itos tells you.

So you save those three things, and then later on you can load them back up. So now our vocab size is 60,002, and our training language model has 90,000 documents in it. So that's the preprocessing you do. We can probably wrap a little bit more of that in little utility functions if we want to, but it's all pretty straightforward, and basically that exact code will work for any dataset you have once you've got it in that CSV format.

So here is a kind of a new insight that's not new at all, which is that we'd like to pre-train something. Like we know from lesson 4 that if we pre-train our classifier by first creating a language model, and then fine-tuning that as a classifier, that was helpful. Remember it actually got us a new state-of-the-art result.

We got the best IMDB classifier result that had ever been published, by quite a bit. Well, we're not going far enough though, because IMDB movie reviews are not that different to any other English document, compared to how different they are to a random string or even to a Chinese document.

So just like ImageNet allowed us to train things that recognize stuff that kind of looks like pictures, and we could use it on stuff that was nothing to do with ImageNet, like satellite images. Why don't we train a language model that's just like good at English, and then fine-tune it to be good at movie reviews?

So this basic insight led me to try building a language model on Wikipedia. So my friend Stephen Merity has already processed Wikipedia and pulled out a subset of it, most of the larger articles, throwing away the little stubby ones, and he calls that WikiText-103. So I grabbed WikiText-103 and I trained a language model on it.

I used exactly the same approach I'm about to show you for training an IMDB language model, but instead I trained a WikiText-103 language model. And then I saved it and I've made it available for anybody who wants to use it at this URL. So this is not a URL for WikiText-103 the documents, this is WikiText-103 the language model.

So the idea now is let's train an IMDB language model which starts from those weights. Now hopefully to you folks, this is an extremely obvious, extremely non-controversial idea because it's basically what we've done in nearly every class so far. But when I first mentioned this to people in the NLP community, I guess June/July of last year, there couldn't have been less interest.

I asked on Twitter, where a lot of the top NLP researchers are people that I follow and they follow me back, I was like "hey, what if we pre-trained a general language model?" and they're like "no, all language is different, you can't do that" or "I don't know why you would bother anyway, I've talked to people at conferences and I'm pretty sure people have tried that and it's stupid." It just went straight past them.

I guess because I am arrogant and I ignored them even though they know much more about NLP than I do and just tried it anyway and let me show you what happened. So here's how we do it. Grab the wiki text models, and if you use wget -r it'll actually recursively grab the whole directory, it's got a few things in it.

We need to make sure that our language model has exactly the same embedding size, number of hidden units and number of layers as my WikiText-103 one did, otherwise you can't load the weights in. So here's our pre-trained path, here's our pre-trained language model path, let's go ahead and torch.load in those weights from the forward WikiText-103 model.

We don't normally use torch.load, but that's the PyTorch way of grabbing a file. And it basically gives you a dictionary containing the name of the layer and a tensor of those weights, or an array of those weights. Now here's the problem: that WikiText-103 language model was built with a certain vocabulary, which was not the same as the one this model was built on.

So my number 40 was not the same as WikiText-103's number 40. So we need to map one to the other. That's very, very simple because luckily I saved the itos for the WikiText-103 vocab. So here's the list of what each word is from when I trained the WikiText-103 model, and so we can do the same defaultdict trick to map it in reverse, and I'm going to use -1 to mean that it's not in the WikiText-103 dictionary.

And so now I can just say my new set of weights is just a whole bunch of zeros with vocab size by embedding size, so we're going to create an embedding matrix. I'm then going to go through every one of the words in my IMDB vocabulary. I'm going to look it up in stoi2, so string-to-int for the WikiText-103 vocabulary, and see if that word is there.

And if that word is there, then I'm not going to get this -1, so r will be greater than or equal to 0. So in that case I will just set that row of the embedding matrix to the weight that I just looked up, which was stored inside this named element.

So these names, you can just look at this dictionary and it's pretty obvious what each name corresponds to because it looks very similar to the names that you gave it when you set up your module. So here are the encoder weights. So grab it from the encoder weights. If I don't find it, then I will use the row mean.

In other words, here is the average embedding weight across all of the wiki text 103 things. So that's pretty simple, so I'm going to end up with an embedding matrix for every word that's in both my vocabulary for IMDB and the wiki text 103 vocabulary. I will use the wiki text 103's embedding matrix weights for anything else.

I will just use whatever was the average weight from the wiki text 103 embedding matrix. And then I'll go ahead and I will replace the encoder weights with that turn into a tensor. We haven't talked much about weight tying, we might do so later, but basically the decoder, so the thing that turns the final prediction back into a word, uses exactly the same weights, so I pop it there as well.
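Sketched in code, the mapping step looks something like this; the variable names and the state-dict key names follow the lesson notebook, but treat them as assumptions rather than a guaranteed API:

```python
import collections
import numpy as np
import torch

# Assumed inputs: wgts is the torch.load-ed WikiText-103 state dict, itos2 is
# its vocab list, itos is the IMDB vocab, vs and em_sz are the IMDB vocab size
# and embedding size.
enc_wgts = wgts['0.encoder.weight'].numpy()
row_m = enc_wgts.mean(0)                       # mean embedding row
stoi2 = collections.defaultdict(lambda: -1, {w: i for i, w in enumerate(itos2)})

new_w = np.zeros((vs, em_sz), dtype=np.float32)
for i, w in enumerate(itos):
    r = stoi2[w]
    new_w[i] = enc_wgts[r] if r >= 0 else row_m   # copy if known, else the mean

wgts['0.encoder.weight'] = torch.from_numpy(new_w)
wgts['0.encoder_with_dropout.embed.weight'] = torch.from_numpy(np.copy(new_w))
wgts['1.decoder.weight'] = torch.from_numpy(np.copy(new_w))   # tied decoder
```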

And then there's a bit of a weird thing with how we do embedding dropout that ends up with a whole separate copy of them, for a reason that doesn't matter much. So we just pop the weights back where they need to go. So we now have a state dictionary of torch weights which we can load in.

So let's go ahead and create our language model. And so the basic approach we're going to use, and I'm going to look at this in more detail in a moment, but the basic approach we're going to use is I'm going to concatenate all of the documents together into a single list of tokens of length 24.998 million.

So that's going to be what I pass in as my training set. So the language model, we basically just take all our documents and just concatenate them back to back. And we're going to be continuously trying to predict what's the next word after these words. And we'll look at these details in a moment.

I'm going to set up a whole bunch of dropout. We'll look at that in detail in a moment. Once we've got a model data object, we can then grab the model from it. So that's going to give us a learner. And then as per usual, we can call learner.fit.

So we first of all, as per usual, just do a single epoch on the last layer just to get that okay. And the way I've set it up is the last layer is actually the embedding weights. Because that's obviously the thing that's going to be the most wrong, because a lot of those embedding weights didn't even exist in the vocab, so we're just going to train a single epoch of just the embedding weights.

And then we'll start doing a few epochs of the full model. And so how is that looking? Well here's lesson 4, which was our academic world's best ever result. And after 14 epochs we had a 4.23 loss. Here after 1 epoch we have a 4.12 loss. So by pre-training on Wikitext 103, in fact let's go and have a look, we kept training and training at a different rate.

Eventually we got to 4.16. So by pre-training on Wikitext 103 we have a better loss after 1 epoch than the best loss we got for the language model otherwise. Yes, Rachel? What is the Wikitext 103 model? Is it AWD LSTM again? Yeah and we're about to dig into that.

The way I trained it was literally the same lines of code that you see here, but without pre-training it on Wikitext 103. So let's take a 10-minute break, come back at 7.40 and we'll dig in and have a look at these models. Ok welcome back. Before we go back into language models and NLP classifiers, a quick discussion about something pretty new at the moment which is the FastAI doc project.

So the goal of the FastAI doc project is to create documentation that makes readers say "Wow, that's the most fantastic documentation I've ever read." And so we have some specific ideas about how to do that, but it's the same kind of idea of top-down, thoughtful, take-full advantage of the medium approach, interactive, experimental code first that we're all familiar with.

If you're interested in getting involved, the basic approach you can see in the docs directory. So this is the readme in the docs directory. In there there is, amongst other things, a transforms_template.adoc. What the hell is adoc? adoc is AsciiDoc. How many people here have come across AsciiDoc?

That's awesome. People are laughing because there's one hand up and it's somebody who was in our study group today who talked to me about AsciiDoc. AsciiDoc is the most amazing project. It's like Markdown, but it's what Markdown needs to be to create actual books, and a lot of actual books are written in AsciiDoc.

And so it's as easy to use as Markdown, but there's way more cool stuff you can do with it. In fact, here is an AsciiDoc file here, and as you'll see it looks very normal. There's headings and this is pre-formatted text, and there's lists and whatever else. It looks pretty standard, and actually I'll show you a more complete, more standard AsciiDoc example.

But you can do stuff like say put a table of contents here please. You can say colon colon means put a definition list here please. Plus means this is a continuation of the previous list item. So there's just little things that you can do which are super handy or make it slightly smaller than everything else.

So it's like turbocharged Markdown. And so this ASCII doc creates this HTML. And I didn't add any CSS or do anything myself. We literally started this project like 4 hours ago. So this is like just an example basically. And so you can see we've got a table of contents, we can jump straight to here, we've got a cross-reference we can click on to jump straight to the cross-reference.

Each method comes along with its details and so on and so forth. And to make things even easier, rather than having to know that the argument list is meant to be smaller than the main part, or how do you create a cross-reference, or how are you meant to format the arguments to the method name and list out each one of its arguments, we've created a special template where you can just write various stuff in curly brackets like "please put the arguments here, and here is an example of one argument, and here is a cross-reference, and here is a method," and so forth.

So we're in the process of documenting the documentation template that there's basically like 5 or 6 of these little curly bracket things you'll need to learn. But for you to create a documentation of a class or a method, you can just copy one that's already there and so the idea is we're going to have, it'll almost be like a book.

There'll be tables and pictures and little video segments and hyperlink throughout and all that stuff. You might be wondering what about docstrings, but actually I don't know if you've noticed, but if you look at the Python standard library and look at the docstring for example for regex compile, it's a single line.

Nearly every docstring in Python is a single line. And Python then does exactly this. They have a website containing the documentation that says like "Hey, this is what regular expressions are and this is what you need to know about them and if you want them to go faster, you'll need to use compile and here's lots of information about compile and here's the examples." It's not in the docstring.

And that's how we're doing it as well. Our docstrings will be one line, unless you need two sometimes. It's going to be very similar to Python, but even better. So everybody is welcome to help contribute to the documentation, and hopefully by the time you're watching this on the MOOC, it'll be reasonably fleshed out and we'll try to keep a list of things to do.

So I'm going to do one first. So one question that came up in the break was how does this compare to Word2Vec? And this is actually a great thing for you to spend time thinking about during the week is how does this compare to Word2Vec. I'll give you the summary now, but it's a very important conceptual difference.

The main conceptual difference is, what is Word2Vec? Word2Vec is a single embedding matrix. Each word has a vector and that's it. So in other words, it's a single layer from a pre-trained model and specifically that layer is the input layer. And also specifically that pre-trained model is a linear model that is pre-trained on something called a co-occurrence matrix.

So we have no particular reason to believe that this model has learned anything much about the English language, or that it has any particular capabilities, because it's just a single linear layer and that's it. So what's this WikiText-103 model? It's a language model. It has a 400-dimensional embedding matrix, 3 hidden layers with 1,150 activations per layer, and regularization and all of that stuff.

Tied input-output matrices: it's basically a state-of-the-art AWD LSTM. So what's the difference between a single layer of a single linear model versus a three-layer recurrent neural network? Everything. They're very different levels of capability. And so you'll see when you try using a pre-trained language model versus a Word2Vec layer, you'll get very, very different results for the vast majority of tasks.

What if the NumPy array does not fit in memory? Is it possible to write a PyTorch data loader directly from a large CSV file? It almost certainly won't come up, so I'm not going to spend time on it. These things are tiny. They're just ints. Think about how many ints you would need to run out of memory.

It's not going to happen. They don't have to fit in GPU memory, just in your memory. I've actually done another Wikipedia model, which I called GigaWiki, which was on all of Wikipedia, and even that easily fits in memory. The reason I'm not using it is because it turned out not to really help very much versus WikiText-103, but I've built a bigger model than anybody else I found in the academic literature pretty much, and it fits in memory on a single machine.

What is the idea behind averaging the weights of embeddings? They've got to be set to something. There are words that weren't there, so other options is we could leave them at 0, but that seems like a very extreme thing to do. 0 is a very extreme number. Why would it be 0?

We could set it equal to some random numbers, but if so, what would be the mean and standard deviation of those random numbers, or should it be uniform? If we just average the rest of the embeddings, then we have something that's a reasonable scale. Just to clarify, this is how you're initializing words that didn't appear in the training corpus.

Thanks, Rachel, that's right. I think you've pretty much just answered this one, but someone had asked if there's a specific advantage to creating our own pre-trained embedding over using GloVe or Word2Vec. I think I have. We're not creating a pre-trained embedding; we're creating a pre-trained model. Let's talk a little bit more.

This is a ton of stuff we've seen before, but it's changed a little bit. It's actually a lot easier than it was in Part 1, but I want to go a little bit deeper into the language model loader. So this is the language model loader, and I really hope that by now you've learned in your editor or IDE how to jump to symbols.

I don't want it to be a burden for you to find out what the source code of a language model loader is. And if it's still a burden, please go back and try and learn those keyboard shortcuts in VS Code. If your editor does not make it easy, don't use that editor anymore.

There's lots of good free editors that make this easy. So here's the source code for language model loader. It's interesting to notice that it's not doing anything particularly tricky. It's not deriving from anything at all. What makes it something that's capable of being a data loader is it's something you can iterate over.

And so specifically, here's the fit function inside fastai.model. This is where everything ends up eventually, which goes through each epoch, and then it creates an iterator from the data loader, and it just does a for loop through it. So anything you can do a for loop through can be a data loader.

And specifically, it needs to return tuples of an independent and a dependent variable, for mini-batches. So anything with a dunder iter (__iter__) method is something that can act as an iterator. And yield is a neat little Python keyword you probably should learn about if you don't already know it, but it basically spits out a thing and waits for you to ask for another thing, normally in a for loop or something.
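A minimal sketch of that point: any object whose __iter__ yields (x, y) mini-batch tuples can be looped over by a fit loop.

```python
class ToyLoader:
    """Anything you can for-loop through can act as a data loader."""
    def __init__(self, xs, ys, bs):
        self.xs, self.ys, self.bs = xs, ys, bs

    def __iter__(self):
        for i in range(0, len(self.xs), self.bs):
            yield self.xs[i:i + self.bs], self.ys[i:i + self.bs]

for x, y in ToyLoader(list(range(10)), list(range(10, 20)), bs=4):
    print(x, y)   # ([0, 1, 2, 3], [10, 11, 12, 13]) and so on
```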

So in this case, we start by initializing the language model, passing it in the numbers. So this is a numericalized, big, long list of all of our documents concatenated together. And the first thing we do is to batchify it. And this is the thing which quite a few of you got confused about last time.

If our batch size is 64 and we have 25 million numbers in our list, we are not creating items of length 64. We're not doing that. We're creating 64 items in total. So each of them is of size t/64, which is 390,000. So that's what we do here when we reshape it so that this axis here is of length 64, and then this -1 is everything else.

So that's 390,000 long. And then we transpose it. So that means that we now have 64 columns, 390,000 rows, and then what we do each time we do an iterate is we grab one batch of some sequence length, we'll look at that in a moment, but basically it's approximately equal to bptt, which we set to 70, stands for backprop through time, and we just grab that many rows.

So from i to i plus 70 rows, and then we try to predict that plus 1. So we've got 64 columns, and each of those is 1/64 of our 25 million or whatever it was, tokens, hundreds of thousands long, and we just grab 70 at a time. So each of those columns each time we grab it is going to hook up to the previous column.
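In code, the batchify step just described is roughly this; the numbers are the ones from the lesson (batch size 64, bptt 70), with a small toy token list standing in for the 25 million real ones:

```python
import numpy as np

def batchify(nums, bs=64):
    nb = len(nums) // bs                      # rows per column (~390,000 in the lesson)
    data = np.array(nums[:nb * bs]).reshape(bs, -1)
    return data.T                             # shape (nb, bs): 64 long columns

data = batchify(list(range(100000)), bs=64)
i, seq_len = 0, 70
x = data[i:i + seq_len]                       # one chunk of 70 rows
y = data[i + 1:i + 1 + seq_len]               # same chunk shifted by one: predict the next token
```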

So that's why we get this consistency, this language model. It's stateful, which is really important. Pretty much all the cool stuff in the language model is stolen from Stephen Merity's AWD LSTM, including this little trick here, which is: if we always grab 70 at a time and then we go back and do a new epoch, we're going to grab exactly the same batches every time.

There's no randomness. Now normally we shuffle our data every time we do an epoch, or every time we grab some data we grab it at random. You can't do that with a language model because this set has to join up to the previous set because it's trying to learn the sentence.

If you suddenly jump somewhere else, then that doesn't make any sense as a sentence. So Stephen's idea is to say, since we can't shuffle the order, let's instead randomly change the size, the sequence length. So basically he says, 95% of the time we'll use bptt, 70, but 5% of the time we'll use half that.

And then he says, you know what, I'm not even going to make that the sequence length, I'm going to create a normally distributed random number with that average and a standard deviation of 5, and I'll make that the sequence length. So the sequence length is 70ish, and that means every time we go through we're getting slightly different batches.
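A sketch of that sequence-length jitter, as described: usually bptt, 5% of the time half of it, then a normal perturbation with standard deviation 5.

```python
import numpy as np

def random_seq_len(bptt=70):
    base = bptt if np.random.random() < 0.95 else bptt / 2
    return max(5, int(np.random.normal(base, 5)))   # floor it so it never collapses

print([random_seq_len() for _ in range(5)])   # e.g. [68, 73, 71, 35, 66]
```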

So we've got that little bit of extra randomness. I asked Stephen Merity where he came up with this idea. Did he think of it? He was like, I think I thought of it, but it seemed so obvious that I bet I didn't think of it, which is true of every time I come up with an idea in deep learning: it always seems so obvious that you assume somebody else has thought of it. But I think he thought of it.

So this is a nice thing to look at if you're trying to do something a bit unusual with a data loader. It's like, okay, here's a simple kind of role model you can use for creating a data loader from scratch, something that spits out batches of data. So our language model loader just took in all of the documents concatenated together, along with the batch size and the bptt.

Now generally speaking, we want to create a learner, and the way we normally do that is by getting a model data object and by calling some kind of method which has various names, but often we call that method get_model. And so the idea is that the model data object has enough information to know what kind of model to give you.

So we have to create that model data object, which means we need that class, and so that's very easy to do. So here are all of the pieces. We're going to create a custom learner, a custom model data class and a custom model class. So a model data class, again, this one doesn't inherit from anything, so you really see there's almost nothing to do.

You need to tell it most importantly what's your training set, give it a data loader, what's the validation set, give it a data loader, and optionally give it a test set, plus anything else it needs to know. So it might need to know the bptt, it needs to know the number of tokens, that's the vocab size, and it needs to know what is the padding index. And so that it can save temporary files and models, model data always needs to know the path.

And so we just grab all that stuff and we dump it. And that's it, that's the entire initializer, there's no logic there at all. So then all of the work happens inside get_model. And so get_model calls something we'll look at later which just grabs a normal PyTorch NN.module architecture.

And chucks it on the GPU. Note that with PyTorch, normally we would say .cuda(). With fastai, it's better to say to_gpu(). And the reason is that if you don't have a GPU, it will leave it on the CPU, and it also provides a global variable you can set to choose whether it goes on the GPU or not.

So it's a better approach. So we wrap the model in a language model. And the language model is this. Basically a language model is a subclass of basic model. It basically almost does nothing except it defines layer groups. And so remember how when we do discriminative learning rates where different layers have different learning rates, or we freeze different amounts, we don't provide a different learning rate for every layer because there can be like a thousand layers.

We provide a different learning rate for every layer group. So when you create a custom model, you just have to override this one thing which returns a list of all of your layer groups. So in this case, my last layer group contains the last part of the model and one bit of dropout, and the rest of it, this star here, means pull this apart.

So this is basically going to be one layer per RNN layer. So that's all that is. And then finally, turn that into a learner. And so a learner you just pass in the model and it turns it into a learner. In this case we have overridden learner and the only thing we've done is to say I want the default loss function to be cross-entropy.

So this entire set of custom model, custom model data, custom learner all fits on a single screen, and they always basically look like this. So that's a little dig inside this pretty boring part of the code base. The interesting part of this code base is get_language_model.

get_language_model is actually the thing that gives us our AWD LSTM. And it contains the big, incredibly simple idea that everybody here thinks is really obvious, but that everybody in the NLP community I spoke to thought was insane: basically, every model can be thought of as a backbone plus a head, and if you pre-train the backbone and stick on a random head, you can do fine-tuning, and that's a good idea.

And so these two bits of the code are literally right next to each other. There is this bit of fastai.lm_rnn. Here's get_language_model. Here's get_rnn_classifier. get_language_model creates an RNN encoder and then creates a sequential model that sticks a linear decoder on top of it. The classifier creates an RNN encoder and then a sequential model that sticks a pooling linear classifier on top of it.
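In outline, those two factory functions are doing something like the following. This is a deliberately stripped-down sketch of the backbone-plus-head pattern, not the real fastai lm_rnn code:

```python
import torch.nn as nn

def get_language_model_sketch(encoder: nn.Module, em_sz: int, n_tok: int) -> nn.Module:
    # same backbone, language-model head: a linear decoder back to the vocabulary
    return nn.Sequential(encoder, nn.Linear(em_sz, n_tok))

def get_classifier_sketch(encoder: nn.Module, em_sz: int, n_class: int) -> nn.Module:
    # same backbone, classifier head: a (pooling) linear classifier
    return nn.Sequential(encoder, nn.Linear(em_sz, n_class))
```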

We'll see what these differences are in a moment, but you get the basic idea: they're doing pretty much the same thing. They've got this backbone, and then they're sticking a simple linear head on top. So it's worth digging in a little bit deeper and seeing what's going on here.

Yes, Rich? >> There was a question earlier about whether any of this translates to other languages. >> Yeah, this whole thing works in any language you like. >> I mean, would you have to retrain your language model on a corpus from that language? >> Absolutely. >> Okay. >> So the wikitext-103 pre-trained language model knows English.

You could use it maybe as a pre-trained start for a French or German model. Start by retraining the embedding layer from scratch. Might be helpful. Chinese, maybe not so much. But given that a language model can be trained from any unlabeled documents at all, you'd never have to do that.

Because almost every language in the world has plenty of documents. You can grab newspapers, web pages, parliamentary records, whatever. As long as you've got a few thousand documents showing somewhat normal usage of that language, you can create a language model. And so I know one of our students, whose name I'll have to look up during the week, very embarrassing, tried this approach for Thai.

He said the first model he built easily beat the previous state-of-the-art classifier. For those of you who are international fellows, this is an easy way for you to whip out a paper in which you either create the first ever classifier in your language or beat everybody else's classifier in your language, and then you can tell them that you've been a student of deep learning for six months and piss off all the academics in your country.

So here's our RNN encoder. It's just a standard nn.Module. Most of the text in it is actually just documentation, as you can see. It looks like there's more going on in it than there actually is, but really all there is is: we create an embedding layer, we create an LSTM for each layer that's been asked for, and that's it.

Everything else in it is dropout. Basically all of the interesting stuff in the AWD LSTM paper is all of the places you can put dropout. And then the forward is basically the same thing: call the embedding layer, add some dropout, go through each layer, call that RNN layer, append it to our list of outputs, add dropout, and that's about it.
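Here is a rough, simplified sketch of that structure. The real RNN encoder has the extra AWD LSTM dropout types, weight dropping and hidden-state handling, so treat this as an assumption-heavy outline rather than the real code:

```python
import torch.nn as nn

class RNNEncoderSketch(nn.Module):
    def __init__(self, n_tok, em_sz, n_hid, n_layers, dropout=0.3):
        super().__init__()
        self.emb = nn.Embedding(n_tok, em_sz)              # embedding layer
        self.rnns = nn.ModuleList([                        # one LSTM per layer asked for
            nn.LSTM(em_sz if l == 0 else n_hid,
                    n_hid if l < n_layers - 1 else em_sz)  # last layer comes back to em_sz
            for l in range(n_layers)])
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                                  # x: (seq_len, batch) of token ids
        outputs = []
        out = self.drop(self.emb(x))                       # call the embedding, add dropout
        for rnn in self.rnns:
            out, _ = rnn(out)                              # call each RNN layer
            out = self.drop(out)                           # add dropout
            outputs.append(out)                            # append to the list of outputs
        return outputs
```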

So it's really pretty straightforward. The paper you want to be reading, as I've mentioned, is the AWD LSTM paper, which is this one here, regularizing and optimizing LSTM language models, and it's well-written and pretty accessible and entirely implemented inside FastAI as well, so you can see all of the code for that paper.

And a lot of the code is shamelessly plagiarized, with Stephen's permission, from his excellent GitHub repo, AWD LSTM, and in the process I found some of his bugs as well. I even told him about them. So I'm talking increasingly about "please read the papers", so here's the paper, please read this paper, and it refers to other papers.

So for things like: why is it that the encoder weight and the decoder weight are the same? Well, it's because there's this thing called tie_weights. Inside that get_language_model there's a parameter called tie_weights; it defaults to true, and if it's true then we literally use the same weight matrix for the encoder and the decoder.

So they're literally pointing at the same block of memory. And so why is that? What's the result of it? That's one of the citations in Stephen's paper, which is also a well-written paper; you can go and look it up and learn about weight tying. So there's a lot of cool stuff in there.

So we have basically a standard RNN (the only way it's not standard is that it's just got lots more types of dropout in it), and on top of that, in a sequential model, we stick a linear decoder, which is literally half a screen of code. It's got a single linear layer, we initialize the weights to some range, we add some dropout, and that's it.
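Something along these lines, as a sketch, shows both the linear decoder and the weight tying just described; the names and the init range are assumptions:

```python
import torch.nn as nn

class LinearDecoderSketch(nn.Module):
    def __init__(self, n_tok, em_sz, dropout=0.1, tie_encoder: nn.Embedding = None):
        super().__init__()
        self.decoder = nn.Linear(em_sz, n_tok, bias=False)   # a single linear layer
        self.decoder.weight.data.uniform_(-0.1, 0.1)         # initialize to some range
        self.drop = nn.Dropout(dropout)                      # add some dropout
        if tie_encoder is not None:
            # tie_weights: the decoder and the embedding share one weight matrix,
            # literally pointing at the same block of memory
            self.decoder.weight = tie_encoder.weight

    def forward(self, x):
        return self.decoder(self.drop(x))
```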

So we've got an RNN, on top of that we stick a linear layer with dropout and we're finished. So that's the language model. So what dropout you choose matters a lot, and through a lot of experimentation I found a bunch of dropouts -- you can see here we've got each of these corresponds to a particular argument -- a bunch of dropouts that tend to work pretty well for language models.

But if you have less data for your language model, you'll need more dropout. If you have more data, you can benefit from less dropout, you don't want to regularize more than you have to. Rather than having to tune every one of these 5 things, my claim is they're already pretty good ratios to each other, so just tune this number.

I just multiply them all by something, so there's really just one number you have to tune. If you're overfitting, you'll need to increase this number; if you're underfitting, you'll need to decrease it. Other than that, these ratios actually seem pretty good.
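Concretely, a hedged sketch of that idea looks like this; the base ratios below are illustrative placeholders, not the exact values from the notebook:

```python
import numpy as np

# five dropouts in a fixed ratio to each other; tune only the multiplier
base_drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15])   # illustrative ratios
mult = 0.7                                            # the one number you actually tune
dropouti, dropout, wdrop, dropoute, dropouth = base_drops * mult
```

So one important idea, which may seem pretty minor but again is incredibly controversial, is that we should measure accuracy when we look at a language model.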

So normally in language models we look at this loss value, which is just cross-entropy loss, but specifically you nearly always take e to the power of that, which the NLP community calls perplexity. Perplexity is just e to the power of the cross-entropy. There are a lot of problems with comparing things based on cross-entropy loss. I'm not sure I've got time to go into it in detail now, but the basic problem is that it's kind of like that thing we learned about with focal loss.

Cross-entropy loss, if you're right, wants you to be really confident that you're right. So it really penalizes a model that doesn't confidently say, I'm sure it's this one, whereas accuracy doesn't care at all about how confident you are, it just cares about whether you're right. And this is much more often the thing which you care about in real life.

So this accuracy is how often we guess the next word correctly, and I just find that a much more stable number to keep track of. So that's a simple little thing that I do. So we trained for a while, and we get down to a 3.9 cross-entropy loss, and if you take e to the power of that, it kind of gives you a sense of what's happened with language models.
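To make the metrics concrete, here's a small sketch; the 3.9 is the loss just mentioned, and the accuracy function is an assumption about the shape of the predictions:

```python
import math
import torch

cross_entropy = 3.9
perplexity = math.exp(cross_entropy)        # perplexity = e^cross-entropy, roughly 49 here

def accuracy(preds: torch.Tensor, targs: torch.Tensor) -> float:
    # preds: (n, vocab_size) scores for the next word; targs: (n,) true next-word ids
    return (preds.argmax(dim=1) == targs).float().mean().item()
```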

If you look at academic papers from about 18 months ago, you'll see them talking about state-of-the-art perplexities of over 100. The rate at which our ability to understand language is improving... and I think measuring language model accuracy or perplexity is not a terrible proxy for understanding language. If I can guess what you're going to say next, I pretty much need to understand language pretty well, and also the kind of things you might talk about pretty well.

So this number has just come down so much. It's been amazing. NLP in the last 12 to 18 months. And it's going to come down a lot more. It really feels like 2011-2012 computer vision. We're just starting to understand transfer learning and fine-tuning, and these basic models are getting so much better.

So everything you thought about what NLP can and can't do is very rapidly going out of date. But there's still lots of stuff NLP is not good at, to be clear, just like in 2012 there was lots of stuff computer vision wasn't good at. But it's changing incredibly rapidly, and now is a very, very good time to be getting very good at NLP, or starting startups based on NLP, because there's a whole bunch of stuff which computers were absolutely shit at two years ago, are now not quite as good as people at, and next year will be much better than people at.

Two questions. One, what is your ratio of paper reading versus coding in a week? What do you think, Rachel? You see me. I mean, it's a lot more coding, right? It's a lot more coding. I feel like it also really varies from week to week. I feel like they're...

Like with that bounding box stuff, there was all these papers and no map through them, and so I didn't even know which one to read first, and then I'd read the citations and didn't understand any of them. So there was a few weeks of just kind of reading papers before I even knew what to start coding.

That's unusual though. Most of the time, I don't know, any time I start reading a paper, I'm always convinced that I'm not smart enough to understand it, always, regardless of the paper, and somehow eventually I do. But yeah, I try to spend as much time as I can coding.

And then the second question, is your dropout rate the same through the training or do you adjust it and the weights accordingly? I'll just say one more thing about the last bit, which is: very often, the vast majority of the time, nearly always, after I've read a paper, even after I've read the bit that says this is the problem I'm trying to solve, I'll stop there and try to implement something that I think might solve that problem. Then I'll go back and read the paper, and I'll read the little bits about how they solved those problems, and I'll be like, oh that's a good idea, and then I'll try to implement those.

And so that's why, for example, I didn't actually implement SSD. My custom head is not the same as their head. It's because I kind of read the gist of it and then I tried to create something best as I could and then go back to the papers and try to see why.

So by the time I got to the focal loss paper, I was driving myself crazy with how come I can't find small objects, how come it's always predicting background, and I read the focal loss paper and I was like, that's why! It's so much better when you deeply understand the problem they're trying to solve.

And I do find the vast majority of the time, by the time I read that bit of the paper which is like solving the problem, I'm then like, yeah but these three ideas I came up with, they didn't try. And you suddenly realize that you've got new ideas. Or else if you just implement the paper mindlessly, you tend not to have these insights about better ways to do it.

Varying dropout is really interesting and there are some recent papers actually that suggest gradually changing dropout and it was either a good idea to gradually make it smaller or to gradually make it bigger. I'm not sure which. Maybe one of us can try and find it during the week.

I haven't seen it widely used. I tried it a little bit with the most recent paper I wrote and I had some good results. I think I was gradually making it smaller but I can't remember. And then the next question is, "Am I correct in thinking that this language model is built on word embeddings?

Would it be valuable to try this with phrase or sentence embeddings?" I ask this because I saw Google's universal sentence encoder the other day. Yeah, this is much better than that. Do you see what I mean? This is not just an embedding of a sentence, this is an entire model.

An embedding by definition is a fixed thing. I think what they're asking is, well, the first question is: is this language model built on word embeddings? Right, but a sentence or phrase embedding always comes from some model that creates it, and what we've got here is a model that's trying to understand language. It's not just a phrase, it's not just a sentence, it's a whole document in the end, and it's not just an embedding: we're training through the whole thing.

So this has been a huge problem with NLP for years now, this attachment it has to embeddings. Even the paper that the community has been most excited about recently, from AI2, the Allen Institute, called ELMo, found much better results across lots of models. But again, it was an embedding.

They took a fixed model and created a fixed set of numbers which they then fed into a model. But in computer vision, we've known for years that that approach of having a fixed set of features (they're called hypercolumns in computer vision) doesn't work as well. People stopped using them 3 or 4 years ago because fine-tuning the entire model works much better.

So for those of you that have spent quite a lot of time with NLP and not much time with computer vision, you're going to have to start relearning. All that stuff you have been told about this idea that there are these things called embeddings and that you learn them ahead of time, and then you apply these fixed things, whether it be word level or phrase level or whatever level, don't do that.

You want to actually create a pre-trained model and fine-tune it end to end. You'll see some specific results. For using accuracy instead of perplexity as a metric for the model, could we work that into the loss function rather than just use it as a metric? No, you never want to do that whether it be computer vision or NLP or whatever.

It's too bumpy. So cross-entropy is fine as a loss function. And I'm not saying to use accuracy instead of it; I use it in addition. I think it's good to look at the accuracy and to look at the cross-entropy. But for your loss function, you need something nice and smooth. Accuracy doesn't work very well.

You'll see there are two different versions of save: there's save and save_encoder. save saves the whole model as per usual; save_encoder saves just the encoder. In other words, in the sequential model, it saves the encoder bit and not the decoder bit; the decoder is the bit that actually makes it into a language model, and we don't care about that in the classifier, we just care about the encoder.

So let's now create the classifier. I'm going to go through this bit pretty quickly because it's the same, but when you go back during the week and look at the code, convince yourself it's the same. We do pd.read_csv again, tokenize again, get_all again, save those tokens again. We don't create a new itos vocabulary.

We obviously want to use the same vocabulary we had in the language model because we're about to reload the same encoder. Same default dict, same way of creating our numericalized list, which as per before we can save. So that's all the same. Later on we can reload those rather than having to rebuild them.

So all of our hyperparameters are the same. We can change the dropout. Optimizer function. Pick a batch size that's as big as you can manage without running out of memory. This bit's a bit interesting; there's some fun stuff going on here. The basic idea is that for the classifier we really do want to look at a whole document.

We need to say is this document positive or negative. So we do want to shuffle the documents because we like to shuffle things. But those documents are different lengths, so if we stick them all into one batch -- this is a handy thing that fastAI does for you -- you can stick things at different lengths into a batch and it will automatically pad them, so you don't have to worry about that.

But if they're wildly different lengths, then you're going to be wasting a lot of computation time. There might be one thing there that's 2,000 words long and everything else is 50 words long, and that means you end up with a 2,000-wide tensor. That's pretty annoying. So James Bradbury, who's actually one of Stephen Merity's colleagues and the guy who came up with TorchText, came up with an idea, which was: let's sort the dataset by length-ish.

So kind of make it so the first things in the list are on the whole, shorter than the things at the end, but a little bit random as well. And so I'll show you how I implemented that. So the first thing we need is a dataset. So we have a dataset passing in the documents and their labels.

And so here's a text dataset, and it inherits from Dataset. Here is Dataset from PyTorch, and actually, Dataset doesn't do anything at all. It says you need a __getitem__ (if you don't have one, you're going to get an error) and you need a __len__ (if you don't have one, you're going to get an error).

So this is an abstract class. So we're going to pass in our x, we're going to pass in our y, and __getitem__ is going to grab the x and grab the y and return them. It couldn't be much simpler. Optionally, it could reverse it, optionally it could stick an end-of-stream token at the end, optionally it could stick a start-of-stream token at the beginning.

Optionally it could stick a start of stream at the beginning. We're not doing any of those things. So literally all we're doing is putting in an x, putting in a y, and then grab an item, we're returning the x and the y as a tuple. And the length is how long the x array is.

So that's all the dataset is. Something with a length that you can index. So to turn it into a data loader, you simply pass the dataset to the data loader constructor, and it's now going to go ahead and give you a batch of that at a time. Normally you can say shuffle=true or shuffle=false, it will decide whether to randomize it for you.

In this case though, we're actually going to pass in a sampler parameter. The sampler is a class we're going to define that tells the data loader how to shuffle. So for the validation set, we're going to define something that actually just sorts it. It just deterministically sorts it so all the shortest documents will be at the start, all the longest documents will be at the end, and that's going to minimize the amount of padding.

For the training sampler, we're going to create this thing I call a sort-ish sampler, which also sorts-ish. So this is where I really like PyTorch is that they came up with this idea for an API for their data loader where we can hook in new classes to make it behave in different ways.

So here's a sort-sampler, it's simply something which again has a length, which is the length of the data source, and it has an iterator, which is simply an iterator which goes through the data source sorted by length of the key, and I pass in as the key lambda function which returns the length.

And so for the sort-ish sampler, I won't go through the details, but it basically does the same thing with a little bit of randomness. So it's just another of these beautiful little design things in PyTorch that I discovered. I could take James Bradbury's ideas, which he had written a whole new set of classes around, and I could actually just use the inbuilt hooks inside PyTorch.
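Here's a hedged sketch of both samplers; the chunking scheme in the sort-ish version is an illustrative assumption rather than the exact fastai implementation:

```python
import numpy as np
from torch.utils.data import Sampler

class SortSamplerSketch(Sampler):
    def __init__(self, data_source, key):
        self.data_source, self.key = data_source, key   # key: e.g. lambda i: len(x[i])

    def __len__(self):
        return len(self.data_source)

    def __iter__(self):
        # deterministically sort indices by document length to minimize padding
        return iter(sorted(range(len(self.data_source)), key=self.key))

class SortishSamplerSketch(Sampler):
    def __init__(self, data_source, key, bs):
        self.data_source, self.key, self.bs = data_source, key, bs

    def __len__(self):
        return len(self.data_source)

    def __iter__(self):
        # shuffle, then sort within largish chunks, so batches are roughly
        # sorted by length but still a little bit random
        idxs = np.random.permutation(len(self.data_source))
        chunk = self.bs * 50
        sorted_chunks = [sorted(idxs[i:i + chunk], key=self.key)
                         for i in range(0, len(idxs), chunk)]
        return iter([i for ck in sorted_chunks for i in ck])
```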

You will notice that it's not actually PyTorch's data loader, it's actually fastai's data loader, but it's basically almost entirely plagiarized from PyTorch and customized in some ways to make it faster, mainly by using multithreading instead of multiprocessing. Does the pre-trained LSTM depth and BPTT need to match with the new one we are training?

No, the BPTT doesn't need to match at all. That's just how many things we look at at a time; it's got nothing to do with the architecture. So now we can call that function we just saw before, get_rnn_classifier. It's going to create exactly the same encoder, more or less, and we're going to pass in the same architectural details as before.

But this time, the head that we add on, you've got a few more things you can do. One is you can add more than one hidden layer. So this layer here says this is what the input to my classifier section, my head, is going to be. This is the output of the first layer, this is the output of the second layer, and you can add as many as you like.

So you can basically create a little multi-layered neural net classifier at the end. And so ditto, these are the dropouts to go after each of these layers. And then here are all of the AWD LSTM dropouts, which we're going to basically plagiarize that idea for our classifier. We're going to use the RNN learner, just like before.
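A hedged sketch of that head follows; the layer sizes and dropouts are illustrative (em_sz*3 is explained by the concat pooling discussed later, and c is the number of classes):

```python
import torch.nn as nn

em_sz, c = 400, 2
layers = [em_sz * 3, 50, c]    # head input -> one hidden layer -> output classes
drops = [0.4, 0.1]             # the dropouts to go after each of these layers

modules = []
for i in range(len(layers) - 1):
    modules += [nn.Linear(layers[i], layers[i + 1]), nn.Dropout(drops[i])]
    if i < len(layers) - 2:
        modules.append(nn.ReLU())          # nonlinearity between hidden layers
head = nn.Sequential(*modules)
```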

We're going to use discriminative learning rates for different layers. You can try using weight decay or not, I've been fiddling around a bit with that to see what happens. And so we start out just training the last layer and we get 92.9% accuracy, then we unfreeze one more layer, get 93.3 accuracy, and then we fine-tune the whole thing.

And after 3 epochs -- so, this was kind of the main attempt before our paper came along at using a pre-trained model. What they did is they used a pre-trained translation model, but they didn't fine-tune the whole thing; they just took the activations of the translation model. And when they tried IMDB, they got 91.8%, which we beat easily after fine-tuning only one layer.

They weren't state-of-the-art there; the state-of-the-art is 94.1%, which we beat after fine-tuning the whole thing for 3 epochs. And so by the end, we're at 94.8%, which is obviously a huge difference, because in terms of error rate, that's gone down from 5.9% to 5.2%. And then I'll tell you a simple little trick.

Go back to the start of this notebook, and reverse the order of all of the documents, and then rerun the whole thing. And when you get to the bit that says wt103, replace this fwd for forward with bwd for backward. That's a backward English language model that learns to read English backwards.

So if you redo this whole thing, put all the documents in reverse, and change this to backward, you now have a second classifier which classifies things by positive or negative sentiment based on the reverse document. If you then take the two predictions and take the average of them, you basically have a bidirectional model that you've trained each bit separately.

That gets you to 95.4% accuracy. So we've basically gone from an error of 5.9% down to 4.6%. This kind of 20% relative improvement in the state-of-the-art is almost unheard of. You have to go back to Geoffrey Hinton's ImageNet computer vision result, where they chopped 30% off the state-of-the-art. It doesn't happen very often.
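The ensembling step itself is just an average of the two sets of predictions; a quick sketch (the variable names are assumptions):

```python
import numpy as np

def ensemble(preds_fwd: np.ndarray, preds_bwd: np.ndarray) -> np.ndarray:
    # preds_*: (n_docs, n_classes) predictions from the separately trained
    # forward and backward classifiers
    return (preds_fwd + preds_bwd) / 2
```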

So you can see this idea of just using transfer learning is ridiculously powerful, but every new field thinks their field is too special and you can't do it. So it's a big opportunity for all of us. So we turned this into a paper, and when I say we, I did it with this guy, Sebastian Ruder.

You might remember his name because in lesson 5 I told you that I actually had shared lesson 4 with Sebastian because I think he's an awesome researcher who I thought might like it. I didn't know him personally at all. And much to my surprise, he actually watched the damn video.

I was like, what NLP researcher is going to watch some beginner's video? He watched the whole video and he was like, that's actually quite fantastic. Well, thank you very much, that's awesome coming from you. And he said, hey, we should turn this into a paper. And I said, I don't write papers, I don't care about papers, I'm not interested in papers, that sounds really boring.

And he said, okay, how about I write the paper for you? And I said, you can't really write a paper about this yet because you'd have to do studies to compare it to other things, they're called ablation studies to see which bits actually work. There's no rigor here, I just put in everything that came in my head and chucked it all together and it happened to work.

And it's like, okay, what if I write all the paper and do all the ablation studies, then can we write the paper? And I said, well, it's like a whole library that I haven't documented and I'm not going to yet and you don't know how it all works. He said, okay, if I write the paper and do the ablation studies and figure out from scratch how the code works without bothering you, then can we write the paper?

I was like, yeah, if you did all those things, you can write the paper. And he was like, okay. And so then two days later he comes back and he says, okay, I've done a draft with the paper. So I share this story to say like, if you're some student in Ireland and you want to do good work, don't let anybody stop you.

I did not encourage him to say the least. But in the end he was like, look, I want to do this work, I think it's going to be good and I'll figure it out. And he wrote a fantastic paper and he did the ablation studies and he figured out how fast AI works and now we're planning to write another paper together.

You've got to be a bit careful because sometimes I get messages from random people saying like, I've got lots of good ideas, can we have coffee? I can have coffee at my office any time, thank you. But it's very different to say like, hey, I took your ideas and I wrote a paper and I did a bunch of experiments and I figured out how your code works.

I added documentation to it, should we submit this to a conference? Do you see what I mean? There's nothing to stop you doing amazing work and if you do amazing work that helps somebody else, like in this case, I'm happy that we have a paper. I don't deeply care about papers, but I think it's cool that these ideas now have this rigorous study.

Let me show you what he did. He took all my code, so I'd already done all the fastai.text stuff and so on; as you've seen, it lets us work with large corpora. Sebastian is fantastically well read, and he said here's a paper that Yann LeCun and his colleagues came out with where they tried lots of different classification datasets, so I'm going to try running your code on all these datasets.

These are the data sets. Some of them had many, many hundreds of thousands of documents and they were far bigger than anything I had tried, but I thought it should work. He had a few good little ideas as we went along and so you should totally make sure you read the paper.

He said this thing that you called in the lessons differential learning rates, differential means something else. Maybe we should rename it. It's now called discriminative learning rates. This idea that we had from Part 1 where we used different learning rates for different layers, after doing some literature research, it does seem like that hasn't been done before so it's now officially a thing, discriminative learning rates.

So all these ideas, this is something we learned in Lesson 1. It now has an equation with Greek and everything. When you see an equation with Greek and everything, that doesn't necessarily mean it's more complex than anything we did in Lesson 1 because this one isn't. Again, that idea of unfreezing a layer at a time also seems to have never been done before, so it's now a thing and it's got the very clever name gradual unfreezing.

So then, as long promised, we're going to look at this: slanted triangular learning rates. This actually was not my idea. Leslie Smith, one of my favorite researchers, who you all now know about, emailed me a while ago and said: I'm so over cyclical learning rates, I don't do that anymore, I now do a slightly different version where I have one cycle which goes up quickly at the start and then slows down afterwards.

And he said I often find it works better; I tried going back over all of my old datasets and it worked better for every one I tried. So this is what the learning rate looks like. You can use it in fastai just by adding use_clr= to your fit.

This first number is the ratio between the highest learning rate and the lowest learning rate. So here this is 1/32 of that. The second number is the ratio between the first peak and the last peak. And so the basic idea is if you're doing a cycle length 10 and you want the first epoch to be the upward bit and the other 9 epochs to be the downward bit, then you would use 10.

And I find that works pretty well, that was also Leslie's suggestion, make about 1/10 of it the upward bit and about 9/10 the downward bit. Since he told me about it, maybe two days ago, he wrote this amazing paper, a disciplined approach to neural network hyperparameters, in which he described something very slightly different to this again, but the same basic idea.
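As a hedged sketch of the shape of that schedule (the formula below follows the description here: rise for a small fraction of the iterations, then decay down to 1/ratio of the peak; the exact fastai implementation may differ in details):

```python
import numpy as np

def slanted_triangular_lr(n_iters, lr_max=1e-3, ratio=32, cut_frac=0.1):
    # rise linearly for cut_frac of the iterations, then decay linearly
    # back down towards lr_max / ratio for the rest
    cut = max(1, int(n_iters * cut_frac))
    lrs = []
    for t in range(n_iters):
        p = t / cut if t < cut else 1 - (t - cut) / max(1, n_iters - cut)
        lrs.append(lr_max * (1 + p * (ratio - 1)) / ratio)
    return np.array(lrs)
```

In the notebook you get this just by passing something like use_clr=(32, 10) to fit, as described above: 32 is the ratio between the highest and lowest learning rate, and 10 means roughly a tenth of the cycle is spent going up.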

This is a must-read paper. It's got all the kinds of ideas that fastAI talks about a lot in great depth, and nobody else is talking about this stuff. It's kind of a slog, unfortunately Leslie had to go away on a trip before he really had time to edit it properly, so it's a little bit slow reading, but don't let that stop you, it's amazing.

So this triangle, this is the equation from my paper with Sebastian. Sebastian was like, "Jeremy, can you send me the math equation behind that code you wrote?" And I was like, "No, I just wrote the code, I could not turn it into math." So he figured out the math for it.

So you might have noticed that the first layer of our classifier was equal to embedding size times 3. Why times 3? Times 3 because, and again this seems to be something which people haven't done before, so a new idea: concat pooling, which is that we take the average pooling of the activations over the sequence, the max pooling of the activations over the sequence, and the final set of activations, and just concatenate them all together.

Again, this is something which we talked about in Part 1, but it doesn't seem to be in the literature before, so it's now called concat pooling, and again it's now got an equation and everything, but this is the entirety of the implementation. Pool with average, pool with max, concatenate those two along with the final sequence.
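The whole thing really is that small; a minimal sketch, assuming the encoder outputs come as a (sequence, batch, features) tensor:

```python
import torch

def concat_pooling(outputs: torch.Tensor) -> torch.Tensor:
    # outputs: (seq_len, batch, em_sz) activations from the RNN encoder
    avg_pool = outputs.mean(dim=0)         # average pooling over the sequence
    max_pool = outputs.max(dim=0)[0]       # max pooling over the sequence
    last = outputs[-1]                     # the final set of activations
    return torch.cat([last, max_pool, avg_pool], dim=1)   # (batch, em_sz * 3)
```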

So you can go through this paper and see how the fastai code implements each piece. So then, to me one of the interesting pieces is the difference between the RNN encoder, which you've already seen, and the multi-batch RNN encoder. So what's the difference there? The key difference is that the normal RNN encoder, for the language model, can just do one BPTT chunk at a time, no problem, and predict the next word.

But for the classifier, we need to do the whole document. We need to do the whole movie review before we decide if it's positive or negative. And the whole movie review can easily be 2000 words long, and I can't fit 2000 words worth of gradients in my GPU memory for every single one of my activations -- sorry, for every one of my weights.

So what do I do? And so the idea was very simple, which is: I go through my whole sequence length one batch of BPTT at a time, and I call super().forward, in other words the RNN encoder, to grab its outputs. And then I've got this maximum sequence length parameter which says, okay, as long as you're doing no more than that sequence length, start appending it to my list of outputs.

So in other words, the thing that it sends back to this pooling is only as many activations as we've asked it to keep. And so that way you can figure out what max_seq your particular GPU can handle. So it's still using the whole document, but let's say max_seq is 1,000 words and your longest document is 2,000 words long.

Then it's still going through the RNN creating state for those first 1000 words, but it's not actually going to store the activations for the backprop the first 1000, it's only going to keep the last 1000. So that means that it can't backprop the loss back to any state that was created in the first 1000 words.

Basically that's now gone. So it's a really simple piece of code, and honestly when I wrote it, I didn't spend much time thinking about it, it seems so obviously the only way that this could possibly work. But again, it seems to be a new thing, so we now have backprop through time for text classification.
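A hedged sketch of that idea, assuming the backbone takes a chunk of tokens and returns its activations (the real multi-batch encoder also carries hidden state across chunks and handles raw and dropped-out outputs separately):

```python
import torch
import torch.nn as nn

class MultiBatchEncoderSketch(nn.Module):
    def __init__(self, encoder, bptt, max_seq):
        super().__init__()
        self.encoder, self.bptt, self.max_seq = encoder, bptt, max_seq

    def forward(self, x):                            # x: (seq_len, batch) token ids
        outputs = []
        for i in range(0, x.size(0), self.bptt):
            out = self.encoder(x[i:i + self.bptt])   # run the backbone on one BPTT chunk
            if i > x.size(0) - self.max_seq:
                outputs.append(out)                  # only keep the last max_seq worth
        return torch.cat(outputs, dim=0)             # activations handed to the pooling
```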

So you can see there are lots of little pieces in this paper. So what was the result? The result was that on every single dataset we tried, we got a better result than any previous academic result for text classification. So IMDB, TREC-6, AG News, DBpedia, Yelp, all different types. And honestly, IMDB was the only one I spent any time trying to optimize the model for, so most of them we just did with whatever came out first; if we'd actually spent time on them, I think these results would be a lot better.

And the things that these are comparing to, most of them are, you'll see they're different on each table because they're optimized, these are like customized algorithms on the whole. So this is saying one simple fine-tuning algorithm can beat these really customized algorithms. And so here's one of the really cool things that Sebastian did with his ablation studies, which is I was really keen that if we were going to publish a paper we had to say why does it work.

So Sebastian went through and tried removing each of those different contributions I mentioned. What if we don't use gradual unfreezing? What if we don't use discriminative learning rates? What if instead of discriminative learning rates we use cosine annealing? What if we don't do any pre-training with Wikipedia? What if we don't do any fine-tuning?

And the really interesting one to me was what's the validation error rate on IMDB if we only use 100 training examples versus 200 versus 500? And you can see, very interestingly, the full version of this approach is nearly as accurate on just 100 training examples, like it's still very accurate versus 20,000 training examples.

Whereas if you're training from scratch on 100, it's almost random. It's what I expected; as I kind of said to Sebastian, I really think this is most beneficial when you don't have much data, and this is where fastai is most interested in contributing: small data regimes, small compute regimes, and so forth.

So he did these studies to check. So I want to show you a couple of tricks as to how you can run these kinds of studies. The first trick is something which I know you're all going to find really handy. I know you've all been annoyed when you're running something in a Jupyter notebook and you lose your internet connection for long enough that it decides you've gone away and then your session disappears and you have to start it again from scratch.

So what do you do? There's a very simple cool thing called VNC, where basically you can install on your AWS instance or Paperspace or whatever: X Windows, a lightweight window manager, a VNC server, Firefox, a terminal, and some fonts. Tack these lines onto the end of your VNC xstartup configuration file, and then run this command.

It's now running a server, and you can then run a VNC viewer on your computer and point it at your server. Specifically, what you do is use SSH port forwarding to forward port 5913 to localhost 5913. And so then you connect to port 5913 on localhost, it sends it off to port 5913 on your server, which is the VNC port because you said :13 here, and it will display an X Windows desktop.

And then you can click on the Linux start like button and click on Firefox, and you now have Firefox, and you'll see here in Firefox it says localhost because this Firefox is running on my AWS server. So you now run Firefox, you start your thing running, and then you close your VNC viewer, remembering that Firefox is like displaying on this virtual VNC display, not in a real display.

And so then later on that day, you log back into VNC viewer and it pops up again, so it's like a persistent desktop. And it's shockingly fast, it works really well. So there's trick number 1. And there's lots of different VNC servers and clients and whatever, but this one worked fine for me.

So you can see here I connect to localhost 5913. Trick number 2 is to create Python scripts. This is what we ended up doing. So I ended up creating a little Python script for Sebastian to say this is the basic steps you need to do, and now you need to create different versions for everything else, and I suggested to him that he tried using this thing called Google Fire.

What Google Fire does is: you create a function with shitloads of parameters. And so these are all the things that Sebastian wanted to try doing: different dropout amounts, different learning rates, do I use pre-training or not, do I use CLR or not, do I use discriminative learning rates or not, do I go backwards or not, blah blah blah.

So you create a function, and then you add something saying if __name__ == "__main__": fire.Fire(your_function), and you do nothing else at all. You don't have to add any metadata, any docstrings, anything at all, and you then call that script and automatically you now have a command line interface, and that's it.
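A tiny sketch of that pattern; the parameters here are illustrative stand-ins for the ones Sebastian actually varied:

```python
import fire

def train_clas(lr=1e-3, dropmult=1.0, pretrain=True, use_clr=True,
               use_discriminative=True, backwards=False):
    # every keyword argument automatically becomes a command-line flag
    print(lr, dropmult, pretrain, use_clr, use_discriminative, backwards)

if __name__ == '__main__':
    fire.Fire(train_clas)
```

Then something like python train_clas.py --lr 0.001 --backwards True just works, with no argparse boilerplate at all.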

So that's a super fantastic easy way to run lots of different variations in a terminal. And this ends up being easier if you want to do lots of variations than using a notebook because you can just have a bash script that tries all of them and spits them all out.

You'll find inside the dl2 course directory there's now something called imdb_scripts, and I've put there all of the scripts that Sebastian and I used. You'll see that because we needed to tokenize every single dataset and numericalize every dataset, train a language model on every dataset, and train a classifier on every dataset, and we had to do all of those things in a variety of different ways to compare them, we had a script for all of those things.

So you can check out and see all of the scripts that we used. When you're doing a lot of scripts and stuff, you've got different code all over the place, and eventually it might get frustrating that you don't want to symlink your fastai library again and again, but you probably don't want to pip install it either, because that version tends to be a little bit old; we move so fast you want to use the current version in git.

If you say pip install -e . from the fastai repo base, it does something quite neat: it basically creates a symlink to the fastai library inside your site-packages directory. Your site-packages directory is like your main Python library. And so if you do this, you can then access fastai from anywhere, but every time you do git pull, you've got the most recent version.

One downside of this is that it installs any updated versions of packages from pip which can confuse conda a little bit. So another alternative here is just to symlink the fastai library to your site packages library. That works just as well. And then you can use fastai again from anywhere, and it's quite handy when you want to run scripts that use fastai from different directories on your system.

So one more thing before we go, which is something you can try if you like. You don't have to tokenize words. Instead of tokenizing words, you can tokenize what are called subword units. So for example, unsupervised could be tokenized as un, super, vised; tokenizer could be tokenized as token, izer. And then you can do the same thing: a language model that works on subword units, a classifier that works on subword units, etc.

So how well does that work? I started playing with it and with not too much playing, I was getting classification results that were nearly as good as using word-level tokenization. Not quite as good, but nearly as good. I suspect with more careful thinking and playing around, maybe I could have got as good or better.

But even if I couldn't, if you create a subword unit wiki text model, then IMDB model, language model, and then classifier forwards and backwards for subword units, and then ensemble it with the forwards and backwards word-level ones, you should be able to beat us. So here's an approach you may be able to beat our state-of-the-art result.

Google has, as Sebastian told me about this particular project, a project called SentencePiece, which actually uses a neural net to figure out the optimal splitting up of words, and so you end up with a vocabulary of subword units. In my playing around, I found that creating a vocabulary of about 30,000 subword units seems to be about optimal.

So if you're interested, there's something you can try. It's a bit of a pain to install (it's C++, and it doesn't have great error messages), but it will work. There is a Python library for it, and if anybody tries this, I'm happy to help them get it working. There have been few if any experiments with ensembling subword and word-level classification, and I do think it should be the best approach.
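If you do want to try it, the Python bindings look roughly like this; the file names and options here are illustrative, and 30,000 is just the ballpark vocabulary size mentioned above:

```python
import sentencepiece as spm

# train a subword model on a plain-text corpus (one piece of text per line)
spm.SentencePieceTrainer.Train(
    '--input=corpus.txt --model_prefix=sp --vocab_size=30000')

sp = spm.SentencePieceProcessor()
sp.Load('sp.model')
print(sp.EncodeAsPieces('unsupervised tokenizer'))   # subword pieces for those words
```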

Alright, thanks everybody. Have a great week and see you next Monday.