Lesson 10: Deep Learning Part 2 2018 - NLP Classification and Translation
Chapters
0:18 Review of Last Week
18:12 Segmentation
18:34 Feature Pyramids
18:57 NLP
21:28 Basic Paths for NLP
27:22 Train Test Split
28:05 Tokenization
43:45 Building a Language Model on Wikipedia
48:07 Create an Embedding Matrix
61:24 Averaging the Weights of Embeddings
71:36 Language Model
76:49 Edit Encoder
78:07 Regularizing and Optimizing LSTM Language Models
79:10 Tie Weights
81:56 Measure Accuracy
85:26 What Is Your Ratio of Paper Reading versus Coding in a Week
88:59 Universal Sentence Encoder
99:38 Add More than One Hidden Layer
109:48 Learning Rate
111:57 Concat Pooling
121:26 Trick Number Two Is To Create Python Scripts
122:55 IMDB Scripts
00:00:00.000 |
So, welcome to lesson 10, or as somebody on the forum described it, lesson 10, mod 7, 00:00:07.600 |
which is probably a clearer way to think about this. 00:00:15.860 |
Before we get to that, let's do a quick review of last week. 00:00:21.900 |
There are quite a few people who have flown here to San Francisco for this course; 00:00:27.320 |
I'm seeing them pretty much every day, they're working full-time on this, and quite a few 00:00:31.880 |
of them are still struggling to understand the material from last week. 00:00:35.280 |
So if you're finding it difficult, that's fine. 00:00:37.640 |
One of the reasons I kind of put it up there up front is so that we've got something to 00:00:42.920 |
concentrate on and think about and gradually work towards, so that by lesson 14, mod 7, 00:00:52.960 |
But there's so many pieces, so hopefully you can keep developing better understanding. 00:00:58.360 |
To understand the pieces, you'll need to understand the shapes of convolutional layer outputs, 00:01:03.960 |
and receptive fields, and loss functions, and everything. 00:01:08.520 |
So it's all stuff that you need to understand for all of your deep learning studies anyway. 00:01:15.640 |
So everything you do to develop an understanding of last week's lesson is going to help you 00:01:22.480 |
One key thing I wanted to mention is we started out with something which is really pretty 00:01:26.440 |
simple, which is single object classifier, single object bounding box without a classifier, 00:01:34.040 |
and then single object classifier and bounding box. 00:01:38.080 |
And anybody who's spent some time studying since lesson 8, mod 7, has got to the point 00:01:49.680 |
Now the reason I mention this is because the bit where we go to multiple objects is actually 00:01:55.940 |
almost identical to this, except we first have to solve the matching problem. 00:02:00.800 |
We end up creating far more activations than we need for our number of bounding boxes, 00:02:07.280 |
ground truth bounding boxes, so we match each ground truth object to a subset of those activations. 00:02:12.920 |
And once we've done that, the loss function that we then do to each matched pair is almost 00:02:20.800 |
So if you're feeling stuck, go back to lesson 8 and make sure you understand the data set, 00:02:30.040 |
the data loader, and most importantly the loss function from the end of lesson 8 or 00:02:40.880 |
So once we've got this thing which can predict the class and bounding box for one object, 00:02:47.240 |
we went to multiple objects by creating more activations. 00:02:51.800 |
We had to then deal with the matching problem, we then basically moved each of those anchor 00:02:58.640 |
boxes in and out a little bit and around a little bit, so they tried to line up with 00:03:07.720 |
And we talked about how we took advantage of the convolutional nature of the network 00:03:13.920 |
to try to have activations that had a receptive field that was similar to the ground truth 00:03:23.360 |
And Chloe Sultan provided this fantastic picture, I guess for her own notes, but she shared 00:03:29.800 |
it with everybody, which is lovely, to talk about what does SSD multi-head forward do 00:03:37.880 |
And I partly wanted to show this to help you with your revision, but I also partly wanted 00:03:41.680 |
to show this to kind of say, doing this kind of stuff is very useful for you to do, like 00:03:48.840 |
walk through and in whatever way helps you make sure you understand something. 00:03:53.120 |
You can see what Chloe's done here is she's focused particularly on the dimensions of 00:03:59.760 |
the tensor at each point in the path as we're gradually down-sampling using these 00:04:06.440 |
stride 2 convolutions, making sure she understands why those grid sizes happen, and then understanding 00:04:16.680 |
And so one thing you might be wondering is how did Chloe calculate these numbers? 00:04:23.480 |
So I don't know the answer I haven't spoken to her, but obviously one approach would be 00:04:27.840 |
like from first principles just thinking through it. 00:04:33.040 |
And so this is where you've got to remember this pdb.settrace idea. 00:04:38.560 |
So I just went in just before class and went into SSD multi-head.forward and entered pdb.settrace, 00:04:50.040 |
And so I put the trace at the end, and then I could just print out the size of all of 00:04:58.520 |
So which reminds me, last week there may have been a point where I said 21 + 4 = 26, which 00:05:21.840 |
And by the way, when I code I do that stuff, that's the kind of thing I do all the time. 00:05:25.920 |
So that's why we have debuggers and know how to check things and do things in small little 00:05:32.500 |
So this idea of putting a debugger inside your forward function and printing out the 00:05:36.240 |
sizes is something which is damn super helpful. 00:05:40.440 |
Or you could just put a print statement here as well. 00:05:44.720 |
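As a minimal sketch of that debugging trick (this is a stand-in module, not the actual SSD_MultiHead code, and the layer sizes are made up), you can drop pdb.set_trace or a print into any forward method and read off the activation shapes:

    import pdb
    import torch
    import torch.nn as nn

    class TinyHead(nn.Module):
        # stand-in for a detection head whose activation shapes we want to inspect
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(256, 64, kernel_size=3, stride=2, padding=1)

        def forward(self, x):
            out = self.conv(x)
            pdb.set_trace()          # drops into the debugger: type out.size(), then c to continue
            # or, more simply: print(out.size())
            return out

    TinyHead()(torch.randn(1, 256, 7, 7))   # inspecting this shows torch.Size([1, 64, 4, 4])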
So I actually don't know if that's how Chloe figured it out, but that's how I would if 00:05:50.560 |
And then we talked about increasing k, which is the number of anchor boxes for each convolutional 00:05:55.360 |
grid cell, which we can do with different zooms and different aspect ratios. 00:05:59.760 |
And so that gives us a plethora of activations, and therefore predicted bounding boxes, which 00:06:07.400 |
we then went down to a small number using non-maximum suppression. 00:06:15.920 |
And I'll try to remember to put a link -- there's a really interesting paper that one of our 00:06:18.960 |
students told me about that I hadn't heard about, which is attempting to -- you know 00:06:26.320 |
It's kind of hacky, kind of ugly, it's totally heuristic, I didn't even talk about the code 00:06:34.960 |
So somebody actually came up with a paper recently which attempts to do an end-to-end 00:06:45.320 |
Nobody's created a PyTorch implementation yet, so it would be an interesting project 00:06:55.800 |
One thing I've noticed in our study groups during the week is not enough people reading 00:07:03.200 |
What we are doing in class now is implementing papers. 00:07:10.940 |
And I think from talking to people, a lot of the reason people aren't reading papers 00:07:14.360 |
is because a lot of people don't think they're capable of reading papers, they don't think 00:07:24.160 |
And we started looking at a paper last week and we read the words that were in English 00:07:32.120 |
So if you actually look through this picture from SSD carefully, you'll realize that SSD 00:07:40.800 |
multi-head dot forward is not doing the same as this. 00:07:44.880 |
And then you might think, oh, I wonder if this is better. 00:07:49.240 |
And my answer is probably, because SSD multi-head dot forward was the first thing I tried just 00:07:56.540 |
to get something out there, but between this and the YOLO version, there are probably much 00:08:05.660 |
One thing you'll notice in particular is they use a smaller K, but they have a lot more 00:08:10.600 |
sets of grids - 1x1, 3x3, 5x5, 10x10, 19x19 and 38x38, 8,700-plus boxes in total - so a lot more than 00:08:22.080 |
we had, so that'd be an interesting thing to experiment with. 00:08:25.600 |
Another thing I noticed is that we had 4x4, 2x2, 1x1, which means there's a lot of overlap 00:08:37.360 |
In this case where you've got 1, 3, 5, you don't have that overlap, so it might actually 00:08:44.120 |
So there's lots of interesting things you can play with based on stuff that's either 00:08:48.920 |
trying to make it closer to the paper or think about other things you could try that aren't 00:08:55.720 |
Perhaps the most important thing I would recommend is to put the code and the equations next 00:09:04.520 |
There was a question about whether I could speak about the use_clr (cyclic learning rate) argument. 00:09:13.920 |
So put the code and the equations from the paper next to each other. 00:09:22.560 |
You're either a code person like me who's not that happy about math, in which case I 00:09:29.480 |
start with the code and then I look at the math and I learn about how the math maps to 00:09:34.560 |
the code and end up eventually understanding the math. 00:09:38.520 |
Or you've got a PhD in stochastic differential equations like Rachel, whatever that means, in which 00:09:47.400 |
case you can look at the math and then learn about how the code implements the math. 00:09:52.400 |
But either way, unless you're one of those rare people who is equally comfortable in 00:09:56.800 |
either world, you'll learn about one or the other. 00:10:02.740 |
Now learning about code is pretty easy because there's documentation and we know how to look 00:10:09.320 |
Sometimes learning the math is hard because the notation might seem hard to look up, but 00:10:15.720 |
For example, a list of mathematical symbols on Wikipedia is amazingly great. 00:10:21.560 |
It has examples of them, explanations of what they mean, and tells you what to search for 00:10:31.880 |
And if you Google for math notation cheat sheet, you'll find more of these kinds of 00:10:40.200 |
So over time, you do need to learn the notation, but as you'll see from the Wikipedia page, 00:10:49.600 |
Obviously there's a lot of concepts behind it, but once you know the notation you can 00:10:53.800 |
then quickly look up the concept as it pertains to the particular thing you're studying. 00:10:59.160 |
Nobody learns all of math and then starts machine learning. 00:11:05.160 |
Everybody, even top researchers I know, when they're reading a new paper will very often 00:11:10.800 |
come to bits of math they haven't seen before and they'll have to go away and learn that 00:11:18.800 |
Another thing you should try doing is to recreate things that you see in the papers. 00:11:24.560 |
So here was the key most important figure 1 from the focal loss paper, the RetinaNet paper. 00:11:35.040 |
And very often I put these challenges up on the forums, so keep an eye on the lesson threads 00:11:41.760 |
during the forums, and so I put this challenge up there and within about 3 minutes Serada 00:11:46.840 |
had said "done it" in Microsoft Excel naturally along with actually a lot more information 00:11:55.280 |
A nice thing here is that she was actually able to draw a line showing at a 0.5 ground 00:12:00.600 |
truth probability what's the loss for different amounts of gamma, which is kind of cool. 00:12:07.360 |
And if you want to cheat, she's also provided Python code on the forum too. 00:12:15.760 |
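If you want to recreate that figure yourself, here is a rough sketch (my own plotting code, not the spreadsheet or the paper's): the focal loss is just -(1 - p_t)^gamma * log(p_t), and gamma = 0 gives you back plain cross entropy.

    import numpy as np
    import matplotlib.pyplot as plt

    pt = np.linspace(0.01, 1, 200)                 # predicted probability of the ground-truth class
    for gamma in [0, 0.5, 1, 2, 5]:                # the gammas shown in figure 1 of the paper
        loss = -(1 - pt) ** gamma * np.log(pt)     # focal loss; gamma=0 is ordinary cross entropy
        plt.plot(pt, loss, label=f'gamma = {gamma}')
    plt.axvline(0.5, linestyle='--')               # the 0.5 ground-truth-probability line mentioned above
    plt.xlabel('probability of ground truth class')
    plt.ylabel('loss')
    plt.legend()
    plt.show()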
I did discover a minor bug in my code last week, the way that I was flattening out the 00:12:20.920 |
convolutional activations did not line up with how I was using them in the loss function, 00:12:26.800 |
and fixing that actually made it quite a bit better, so my motorbikes and cows are actually 00:12:33.160 |
So when you go back to the notebook, you'll see it's a little less bad than it was last 00:12:41.160 |
So there's some quick coverage of what's gone before. 00:12:49.200 |
>> Quick question, are you going to put the PowerPoint on GitHub? 00:12:58.200 |
>> And then secondly, usually when we down sample, we increase the number of filters 00:13:03.900 |
When we're downsampling from 7x7 to 4x4, why are we decreasing the number of filters from 512 to 256? 00:13:23.920 |
I guess they've got the stars and the colors. 00:13:32.200 |
>> Oh yes, that's right, they're weird italics. 00:13:35.200 |
>> It's because -- well, largely it's because that's kind of what the papers tend to do. 00:13:38.640 |
We've got a number of -- well, we have a number of out paths and we kind of want each one 00:13:44.720 |
to be the same, so we don't want each one to have a different number of filters. 00:13:51.120 |
And also this is what the papers did, so I was trying to match up with that, having these 00:13:57.760 |
It's a different concept because we're taking advantage of not just the last layer, but 00:14:02.880 |
the layers before that as well. Life's easier if we make them more consistent. 00:14:12.080 |
So we're now going to move to NLP, and so let me kind of lay out where we're going here. 00:14:23.560 |
We've seen a couple of times now this idea of taking a pre-trained model, in fact we've 00:14:28.400 |
seen it in every lesson. Take a pre-trained model, rip off some stuff on the top, replace 00:14:33.280 |
it with some new stuff, get it to do something similar. 00:14:42.000 |
And so what we're going to do -- and so we've kind of dived in a little bit deeper to that, 00:14:48.440 |
to say like okay, with conv_learner.pre_trained, it had a standard way of sticking stuff on 00:14:56.960 |
the top which does a particular thing which was classification. 00:15:01.600 |
And then we learned actually we can stick any PyTorch module we like on the end and 00:15:07.640 |
have it do anything we like with a custom head. And so suddenly you discover, wow, there's 00:15:16.280 |
some really interesting things we can do. In fact, that reminds me, Yang Lu said, well, 00:15:37.960 |
what if we did a different kind of custom head? And so the different custom head was, 00:15:42.200 |
well, let's take the original pictures and rotate them and then make our dependent variable 00:15:51.480 |
the opposite of that rotation basically and see if it can learn to unrotate it. And this 00:15:57.680 |
is like a super useful thing, obviously. In fact, I think Google Photos nowadays has this 00:16:03.160 |
option that it will actually automatically rotate your photos for you. 00:16:09.200 |
But the cool thing is, as Yang Lu shows here, you can build that network right now by doing 00:16:15.000 |
exactly the same as our previous lesson, but your custom head is one that spits out a single 00:16:21.320 |
number which is how much to rotate by, and your dataset has a dependent variable which 00:16:27.120 |
is how much did you rotate by. So you suddenly realize with this idea of a backbone plus 00:16:34.040 |
a custom head, you can do almost anything you can think about. 00:16:41.920 |
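As a rough sketch of that backbone-plus-custom-head idea (this is my guess at the shape of such a model, not Yang Lu's actual code, and it assumes a reasonably recent PyTorch/torchvision), the head just has to end in a single number:

    import torch.nn as nn
    from torchvision import models

    # ResNet backbone with its classifier chopped off, plus a head that emits one number:
    # the predicted rotation. The dependent variable is how much you rotated each image by.
    backbone = nn.Sequential(*list(models.resnet34(pretrained=True).children())[:-2])
    head = nn.Sequential(
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(512, 256), nn.ReLU(),
        nn.Linear(256, 1),               # single output: how much to rotate back by
    )
    model = nn.Sequential(backbone, head)
    # train it with a regression loss, e.g. nn.MSELoss(), against the known rotation angle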
So today we're going to look at the same idea and say, okay, how does that apply to NLP? 00:16:49.640 |
And then in the next lesson, we're going to go further and say, well, if NLP and computer 00:16:56.600 |
vision kind of let you do the same basic ideas, how do we combine them two? And we're going 00:17:02.120 |
to learn about a model that can actually learn to find word structures from images, or images 00:17:11.840 |
from word structures, or images from images. And that will form the basis, if you wanted 00:17:17.860 |
to go further, of doing things like going from an image to a sentence, it's called image 00:17:23.480 |
captioning, or going from a sentence to an image, which starts to let us do phrase-to-image. 00:17:31.120 |
And so from there, we're going to go deeper then into computer vision to think about what 00:17:39.800 |
other kinds of things we can do with this idea of a pre-trained network plus a custom 00:17:44.400 |
head. And so we'll look at various kinds of image enhancement, like increasing the resolution 00:17:49.940 |
of a low-res photo to guess what was missing, or adding artistic filters on top of photos, 00:17:58.200 |
or changing photos of horses into photos of zebras and stuff like that. 00:18:04.280 |
And then finally, that's going to bring us all the way back to bounding boxes again. 00:18:11.240 |
And so to get there, we're going to first of all learn about segmentation, which is 00:18:14.480 |
not just figuring out where a bounding box is, but figuring out what every single pixel 00:18:19.640 |
in an image is part of. So this pixel is part of a person, this pixel is part of a car. 00:18:25.640 |
And then we're going to use that idea, particularly an idea called unet, and it turns out that 00:18:30.640 |
this idea from unet, we can apply to bounding boxes where it's called feature pyramids. 00:18:36.280 |
Everything has to have a different name in every slightly different area. And we'll use 00:18:41.480 |
that to hopefully get some really good results with bounding boxes. 00:18:48.240 |
So that's kind of our path from here. So it's all going to build on each other, but take 00:18:57.480 |
Now for NLP. In the last part, we relied on a pretty great library called TorchText. But as pretty 00:19:06.200 |
great as it was, I've since then found the limitations of it too problematic to keep 00:19:13.000 |
using it. As a lot of you complained on the forums, it's pretty damn slow. Partly because 00:19:21.160 |
it's not doing parallel processing, and partly it's because it doesn't remember what you did 00:19:29.440 |
last time and it does it all over again from scratch. 00:19:34.920 |
And then it's kind of hard to do fairly simple things, like a lot of you were trying to get 00:19:38.840 |
into the toxic comment competition on Kaggle, which was a multi-label problem, and trying 00:19:44.520 |
to do that with TorchText. I eventually got it working, but it took me like a week of 00:19:53.000 |
So to fix all these problems, we've created a new library called FastAI.Text. FastAI.Text 00:19:59.800 |
is a replacement for the combination of TorchText and FastAI.NLP. So don't use FastAI.NLP anymore. 00:20:10.940 |
That's obsolete. It's slower, it's more confusing, it's less good in every way, but there's a 00:20:18.880 |
lot of overlaps. Intentionally, a lot of the classes have the same names, a lot of the 00:20:24.280 |
functions have the same names, but this is the non-TorchText version. 00:20:33.480 |
So we're going to work with IMDB again. For those of you who have forgotten, go back and 00:20:38.440 |
check out lesson 4. Basically this is a data set of movie reviews, and you remember we 00:20:46.840 |
used it to find out whether we might enjoy Zombiegeddon or not, and we thought probably 00:20:56.460 |
So we're going to use the same data set, and by default it calls itself ACLIMDB, so this 00:21:03.080 |
is just the raw data set that you can download. And as you can see, I'm doing from FastAI.Text 00:21:13.560 |
import star. There's no TorchText, and I'm not using FastAI.NLP. 00:21:21.160 |
I'm going to use Pathlib as per usual. We're going to learn about what these tags are later. 00:21:27.720 |
So you might remember the basic path for NLP is that we have to take sentences and turn 00:21:37.360 |
them into numbers, and there's a couple of steps to get there. 00:21:44.280 |
So at the moment, somewhat intentionally, FastAI.Text doesn't provide that many helper functions. 00:21:54.240 |
It's really designed more to let you handle things in a fairly flexible way. So as you 00:22:00.200 |
can see here, I wrote something called get_texts, which goes through each thing in classes, 00:22:08.360 |
and these are the three classes that they have in IMDB. Negative, positive, and then 00:22:13.760 |
there's another folder, unsupervised. That's stuff they haven't gotten around to labeling yet. 00:22:20.400 |
And so I just go through each one of those classes, and then I just find every file in 00:22:27.800 |
that folder with that name, and I open it up and read it and chuck it into the end of 00:22:33.080 |
this array. And as you can see, with Pathlib it's super easy to grab stuff and pull it 00:22:41.080 |
in, and then the label is just whatever class I'm up to so far. So I'll go ahead and do that 00:22:49.120 |
for the train bit, and I'll go ahead and do that for the test bit. 00:22:54.200 |
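Roughly, the get_texts described here looks like this (a sketch along the lines of the notebook; PATH points at the unpacked aclImdb folder):

    from pathlib import Path
    import numpy as np

    CLASSES = ['neg', 'pos', 'unsup']             # the three folders: negative, positive, unsupervised

    def get_texts(path):
        texts, labels = [], []
        for idx, label in enumerate(CLASSES):
            for fname in (path / label).glob('*.*'):
                texts.append(fname.open('r', encoding='utf-8').read())
                labels.append(idx)                # the label is just whichever class we're up to
        return np.array(texts), np.array(labels)

    trn_texts, trn_labels = get_texts(PATH / 'train')
    val_texts, val_labels = get_texts(PATH / 'test')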
So there's 70,000 in train, 25,000 in test, 50,000 of the train ones are unsupervised. 00:23:00.280 |
We won't actually be able to use them when we get to the classification piece. So I actually 00:23:05.320 |
find this much easier than the torch text approach of having lots of layers and wrappers 00:23:12.440 |
and stuff, because in the end reading text files is not that hard. 00:23:20.480 |
One thing that's always a good idea is to sort things randomly. It's useful to know this 00:23:28.640 |
simple trick for sorting things randomly, particularly when you've got multiple things 00:23:32.000 |
you have to sort the same way, in this case I've got labels and texts. np.random.permutation, 00:23:38.040 |
if you give it an integer, it gives you a random list from 0 up to and not including 00:23:44.720 |
the number you give it in some random order. So you can then just pass that in as an indexer 00:23:53.480 |
to give you a list that's sorted in that random order. So in this case it's going to sort 00:23:58.240 |
train texts and train labels in the same random way. So it's a useful little idiom to use. 00:24:08.120 |
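In code, that idiom is just:

    import numpy as np

    trn_idx = np.random.permutation(len(trn_texts))   # e.g. np.random.permutation(5) -> array([3, 0, 4, 1, 2])
    trn_texts  = trn_texts[trn_idx]                   # the same random order applied to both arrays,
    trn_labels = trn_labels[trn_idx]                  # so texts and labels stay lined up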
So now I've got my texts and my labels sorted. I can go ahead and create a data frame from 00:24:13.240 |
them. Why am I doing this? The reason I'm doing this is because there is a somewhat standard 00:24:22.360 |
approach starting to appear for text classification datasets, which is to have your training set 00:24:31.480 |
as a CSV file with the labels first and the text of the NLP document second in a train.csv 00:24:45.280 |
and a test.csv. So basically it looks like this. You've got your labels and your texts. 00:24:50.880 |
And then a file called classes.txt, which just lists the classes. I think it's somewhat 00:24:57.920 |
standard. In a reasonably recent academic paper, Yann LeCun and a team of researchers looked 00:25:05.920 |
at quite a few datasets and they used this format for all of them. And so that's what 00:25:12.960 |
I've started using as well for my recent paper. So what I've done is you'll find that this 00:25:20.360 |
notebook, if you put your data into this format, the whole notebook will work every time. So 00:25:29.400 |
rather than having a thousand different classes or formats and readers and writers and whatever, 00:25:35.360 |
I've just said let's just pick a standard format, and your job - which you can 00:25:39.880 |
do perfectly well - is to put your data in that format, which is the CSV file. The CSV files 00:25:52.560 |
Now you'll notice at the start here that I had two different paths. One was the classification 00:25:58.040 |
path, one was the language model path. In NLP, you'll see LM all the time. LM means language 00:26:05.640 |
model in NLP. So the classification path is going to contain the information that we're 00:26:14.160 |
going to use to create a sentiment analysis model. The language model path is going to 00:26:19.000 |
contain the information we need to create a language model. So they're a little bit different. 00:26:23.880 |
One thing that's different is that when we create the train.csv and the classification 00:26:30.360 |
path, we remove everything that has a label of 2, because a label of 2 means unsupervised. We 00:26:40.640 |
can't use the unsupervised data for the classifier, so we remove it. So that means 00:26:47.320 |
this is going to have actually 25,000 positive, 25,000 negative. The second difference is 00:26:52.720 |
the labels. For the classification path, the labels are the actual labels. But for the 00:26:59.760 |
language model, there are no labels, so we just use a bunch of zeroes. That just makes 00:27:04.920 |
it a little bit easier because we can use a consistent data frame format or CSV format. 00:27:12.900 |
Now the language model, we can create our own validation set. So you've probably come 00:27:19.620 |
across by now sklearn.model_selection.train_test_split, which is a really simple little function 00:27:26.480 |
that grabs a data set and randomly splits it into a training set and a validation set 00:27:32.200 |
according to whatever proportion you specify. So in this case, I can catenate my classification 00:27:38.680 |
training and validation together. So it's going to be 100,000 altogether, split it by 00:27:44.000 |
10%, so now I've got 90,000 training, 10,000 validation for my language model. So go ahead 00:27:52.680 |
So that's my basic get the data in a standard format for my language model and my classifier. 00:28:04.240 |
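Putting those steps into code, a sketch might look like this (CLAS_PATH and LM_PATH stand for the classification and language model paths mentioned at the start; the column layout - labels first, text second, no header - is the standard format described above):

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split

    col_names = ['labels', 'text']
    df_trn = pd.DataFrame({'labels': trn_labels, 'text': trn_texts}, columns=col_names)
    df_val = pd.DataFrame({'labels': val_labels, 'text': val_texts}, columns=col_names)

    # classifier CSVs: drop the unsupervised reviews (label 2)
    df_trn[df_trn['labels'] != 2].to_csv(CLAS_PATH / 'train.csv', header=False, index=False)
    df_val.to_csv(CLAS_PATH / 'test.csv', header=False, index=False)

    # language model split: pool all the texts, hold out 10% as validation, labels all zero
    trn_lm_texts, val_lm_texts = train_test_split(
        np.concatenate([trn_texts, val_texts]), test_size=0.1)
    pd.DataFrame({'labels': [0] * len(trn_lm_texts), 'text': trn_lm_texts},
                 columns=col_names).to_csv(LM_PATH / 'train.csv', header=False, index=False)
    pd.DataFrame({'labels': [0] * len(val_lm_texts), 'text': val_lm_texts},
                 columns=col_names).to_csv(LM_PATH / 'test.csv', header=False, index=False)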
So the next thing we need to do is tokenization. Tokenization means at this stage we've got 00:28:12.400 |
for a document, for a movie review, we've got a big long string, and we want to put it into 00:28:17.460 |
a list of tokens, which are kind of a list of words, but not quite. For example, 'don't', 00:28:26.480 |
we want to be 'do' and 'n't', and we probably want a full stop to be a token. So tokenization is something 00:28:35.680 |
that we passed off to a terrific library called spaCy, partly terrific because an Australian 00:28:43.080 |
wrote it and partly terrific because it's good at what it does. We've put a bit of stuff 00:28:50.760 |
on top of spaCy, but the vast majority of the work is being done by spaCy. 00:28:55.560 |
Before we pass it to Spacey, I've written this simple fixup function, which is basically 00:29:03.440 |
each time I looked at a different dataset, and I've looked at about a dozen in building 00:29:06.880 |
this, everyone had different weird things that needed to be replaced. Here are all the 00:29:15.440 |
ones I've come up with so far. Hopefully this will help you out as well. So I HTML-unescape 00:29:24.320 |
all the entities, and then there's a bunch more things I replace. Have a look at the 00:29:29.320 |
result of running this on text that you put in and make sure there's not more weird tokens 00:29:34.720 |
in there. It's amazing how many weird things people do to text. 00:29:41.120 |
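Here is a trimmed-down sketch of that fixup idea (the real function has a much longer list of replacements; this just shows the shape of it):

    import re, html

    re_spaces = re.compile(r'  +')

    def fixup(x):
        # un-escape HTML entities and clean up a few of the artefacts scraped text tends to contain
        x = x.replace('<br />', '\n').replace('\\"', '"').replace('nbsp;', ' ')
        return re_spaces.sub(' ', html.unescape(x))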
So basically I've got this function called get_all, which is going to go ahead and call 00:29:47.800 |
get_texts, and get_texts is going to go ahead and do a few things, one of which is to apply 00:29:52.520 |
that fixup that we just mentioned. So let's kind of look through this because there's 00:29:58.840 |
some interesting things to point out. So I've got to use pandas to open our train.csv from 00:30:05.000 |
the language model path, but I'm passing in an extra parameter you may not have seen before 00:30:09.840 |
called chunksize. Python and pandas can both be pretty inefficient when it comes to storing 00:30:19.080 |
and using text data. And so you'll see that very few people in NLP are working with large 00:30:31.000 |
corpuses and I think part of the reason is that traditional tools have just made it really 00:30:36.960 |
difficult - you run out of memory all the time. So this process I'm showing you today 00:30:43.680 |
I have used on corpuses of over a billion words successfully using this exact code. And so 00:30:50.720 |
one of the simple tricks is to use this thing called chunksize with pandas. What that means 00:30:55.740 |
is that pandas does not return a data frame, but it returns an iterator that we can iterate 00:31:01.760 |
through chunks of a data frame. And so that's why I don't say "tok_train = get_texts" but 00:31:15.920 |
instead I call "get_all" which loops through the data frame. But actually what it's really 00:31:21.240 |
doing is it's looping through chunks of the data frame. So each of those chunks is basically 00:31:27.280 |
a data frame representing a subset of the data. 00:31:31.940 |
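A minimal sketch of that pattern (LM_PATH is the language model path from earlier; process stands in for whatever per-chunk work you do, such as the tokenization described next):

    import pandas as pd

    chunksize = 24000   # rows per chunk; pick whatever fits comfortably in memory
    # with chunksize set, read_csv returns an iterator of DataFrames rather than one big frame
    df_iter = pd.read_csv(LM_PATH / 'train.csv', header=None, chunksize=chunksize)

    for chunk in df_iter:
        # chunk is an ordinary DataFrame holding the next `chunksize` rows
        process(chunk)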
"When I'm working with NLP data, many times I come across data with foreign text or characters. 00:31:40.000 |
No, no, definitely keep them. And this whole process is Unicode, and I've actually used 00:31:45.800 |
this on Chinese text. This is designed to work on pretty much anything. In general, 00:31:55.920 |
most of the time it's not a good idea to remove anything. Old-fashioned NLP approaches tend 00:32:02.640 |
to do all this lemmatization and all these normalization steps to get rid of stuff - lowercase 00:32:08.640 |
everything, blah blah blah. But that's throwing away information which you don't know ahead 00:32:14.480 |
of time whether it's useful or not. So don't throw away information. 00:32:20.720 |
So we go through each chunk, each of which is a data frame, and we call get_texts. get_texts 00:32:26.560 |
is going to grab the labels and make them into ints. It's going to grab then the texts. 00:32:37.960 |
And I'll point out a couple of things. The first is that before we include the text, 00:32:42.320 |
we have this beginning of stream token, which you might remember we used way back up here. 00:32:49.800 |
There's nothing special about these particular strings of letters, they're just ones I figured 00:32:53.520 |
don't appear in normal texts very often. So every text is going to start with XBOS. Why 00:33:01.940 |
is that? Because it's often really useful for your model to know when a new text is starting. 00:33:08.980 |
For example, if it's a language model, you're going to concatenate all the text together, 00:33:14.280 |
and so it'd be really helpful for it to know this article is finished and a new one started, 00:33:18.200 |
so I should probably forget some of that context now. Ditto, quite often texts have multiple 00:33:27.880 |
fields like a title, an abstract, and then the main document. And so by the same token, 00:33:32.880 |
I've got this thing here which lets us actually have multiple fields in our CSV. So this process 00:33:40.080 |
is designed to be very flexible. And again, at the start of each one, we put a special 00:33:44.200 |
field starts here token, followed by the number of the field that's starting here for as many 00:33:49.920 |
fields as we have. Then we apply our fix up to it, and then most importantly we tokenize 00:33:56.400 |
it, and we tokenize it across multiple processes. So tokenizing tends to be 00:34:10.440 |
pretty slow, but we've all got multiple cores in our machines now and some of the better 00:34:15.280 |
machines on AWS and stuff can have dozens of cores. Here on our university computer, 00:34:21.200 |
we've got 56 cores. So spaCy is not very amenable to multiprocessing, but I finally figured 00:34:31.800 |
out how to get it to work. And the good news is it's all wrapped up in this one function 00:34:36.500 |
now. And so all you need to pass to that function is a list of things to tokenize, which each 00:34:43.320 |
part of that list will be tokenized on a different core. And so I've also created this function 00:34:48.520 |
called partition_by_cores, which takes a list and splits it into sub-lists. The number of 00:34:54.200 |
sub-lists is the number of cores that you have in your computer. So on my machine, without 00:35:03.280 |
multiprocessing, this takes about an hour and a half, and with multiprocessing it takes 00:35:09.280 |
about two minutes. So it's a really handy thing to have. And now that this code's here, feel 00:35:16.080 |
free to look inside it and take advantage of it through your own stuff. Remember, we 00:35:21.880 |
all have multiple cores even in our laptops, and very few things in Python take advantage 00:35:29.040 |
of it unless you make a bit of an effort to make it work. 00:35:35.040 |
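The library wraps this up for you, but here is a rough sketch of the idea using just the standard library and spaCy (this is not the fastai implementation itself; on platforms that spawn worker processes, call it from under an if __name__ == '__main__': guard):

    from concurrent.futures import ProcessPoolExecutor
    from multiprocessing import cpu_count
    import spacy

    def tokenize_chunk(texts):
        nlp = spacy.blank('en')                  # a bare spaCy tokenizer, loaded once per worker
        return [[t.text for t in nlp(txt)] for txt in texts]

    def partition_by_cores(items):
        n = cpu_count()
        sz = (len(items) + n - 1) // n           # split the list into one sub-list per core
        return [items[i:i + sz] for i in range(0, len(items), sz)]

    def tokenize_all(texts):
        with ProcessPoolExecutor(cpu_count()) as ex:
            results = ex.map(tokenize_chunk, partition_by_cores(texts))
        return [tok for chunk in results for tok in chunk]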
So there's a couple of tricks to get things working quickly and reliably. As it runs, 00:35:39.560 |
it prints out how it's going. And so here's the result of the end. Beginning of stream 00:35:47.080 |
token, beginning of field number one token, here's the tokenized text. You'll see that 00:35:53.120 |
the punctuation is on the whole, now a separate token. You'll see there's a few interesting 00:36:02.240 |
little things. One is this: what's this 't_up mgm'? Well, MGM was originally capitalized, 00:36:11.960 |
but the interesting thing is that normally people either lowercase everything or they 00:36:18.120 |
leave the case as is. Now if you leave the case as is, then 'SCREW YOU' in all caps and 'screw 00:36:26.920 |
you' in lowercase are two totally different sets of tokens that have to be learned from 00:36:32.280 |
scratch. Or if you lowercase them all, then there's no difference at all between the two. 00:36:41.200 |
So how do you fix this so that you both get the semantic impact of "I'm shouting now!" 00:36:50.240 |
but not have every single word have to learn the shouted version versus the normal version. 00:36:55.040 |
And so the idea I came up with, and I'm sure other people have done this too, is to come 00:36:59.440 |
up with a unique token to mean the next thing is all uppercase. So then I lowercase it, 00:37:06.600 |
so now whatever used to be uppercase is now lowercase, it's just one token, and then we 00:37:10.240 |
can learn the semantic meaning of all uppercase. 00:37:14.480 |
And so I've done a similar thing. If you've got 29 exclamation marks in a row, we don't 00:37:19.280 |
learn a separate token for 29 exclamation marks. Instead I put in a special token for 00:37:24.840 |
the next thing repeats lots of times, and then I put the number 29, and then I put the 00:37:32.000 |
And so there's a few little tricks like that, and if you're interested in NLP, have a look 00:37:36.120 |
at the code for Tokenizer for these little tricks that I've added in (there's a small sketch of the idea just below), because some of 00:37:45.440 |
So the nice thing with doing things this way is we can now just np.save that and load it 00:37:52.160 |
back up later. We don't have to recalculate all this stuff each time like we tend to have 00:37:57.600 |
to do with TorchText or a lot of other libraries. 00:38:02.160 |
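To give a flavour of those tricks, here is a rough sketch (my own simplification; the exact marker strings and rules in the fastai Tokenizer differ a bit):

    import re

    TOK_UP, TOK_REP = 't_up', 'tk_rep'           # marker tokens; the strings are just conventions

    def mark_caps(tokens):
        # ['SCREW', 'YOU'] -> ['t_up', 'screw', 't_up', 'you']: one shared "shouting" marker
        out = []
        for t in tokens:
            if t.isupper() and len(t) > 1:
                out.append(TOK_UP)
            out.append(t.lower())
        return out

    def mark_repeats(text):
        # '!!!!!!' -> ' tk_rep 6 ! ': a repeat marker, the count, then a single copy of the character
        return re.sub(r'(\S)(\1{3,})',
                      lambda m: f' {TOK_REP} {len(m.group(2)) + 1} {m.group(1)} ',
                      text)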
So we've now got it tokenized. The next thing we need to do is turn it into numbers, which 00:38:08.600 |
we call numericalizing it. And the way we numericalize it is very simple. We make a list of all the 00:38:14.560 |
words that appear in some order, and then we replace every word with its index into that 00:38:20.000 |
list. The list of all the tokens that appear, we call the vocabulary. 00:38:29.160 |
So here's an example of some of the vocabulary. The counter class in Python is very handy 00:38:34.200 |
for this. It basically gives us a list of unique items and their counts. So here are the 25 00:38:42.960 |
most common things in the vocabulary. You can see there are things like apostrophe s and 00:38:48.480 |
double quote and end of paragraph, and also stuff like that. 00:38:54.720 |
Generally speaking, we don't want every unique token in our vocabulary. If it doesn't appear 00:39:01.320 |
at least two times, then it might just be a spelling mistake or a junk word; we can't learn 00:39:06.560 |
anything about it if it doesn't appear that often. Also the stuff that we're going to 00:39:11.120 |
be learning about at least so far on this part gets a bit clunky once you've got a vocabulary 00:39:16.680 |
bigger than 60,000. Time permitting, we may look at some work I've been doing recently 00:39:22.520 |
on handling larger vocabularies, otherwise that might have to come in a future course. 00:39:28.600 |
But actually for classification, I've discovered that doing more than about 60,000 words doesn't 00:39:32.920 |
seem to help anyway. So we're going to limit our vocabulary to 60,000 words, things that 00:39:38.560 |
appear at least twice. So here's a simple way to do that: use the Counter's .most_common, pass 00:39:45.080 |
in the max_vocab size. That'll sort it by the frequency, by the way. And if it appears 00:39:52.360 |
less often than a minimum frequency, then don't bother with it at all. 00:39:57.000 |
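In code, that vocabulary step is roughly this (tok_trn here stands for the list of tokenized training documents from the previous step):

    from collections import Counter

    freq = Counter(tok for sent in tok_trn for tok in sent)
    freq.most_common(25)                                 # eyeball the 25 most frequent tokens

    max_vocab, min_freq = 60000, 2
    itos = [tok for tok, cnt in freq.most_common(max_vocab)
            if cnt >= min_freq]                          # keep tokens that appear at least twice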
So that gives us itos. That's the same name that TorchText used. Remember it means int 00:40:02.920 |
to string. So this is just the list of the unique tokens in the vocab. I'm going to insert 00:40:10.200 |
two more tokens, a token for unknown, a vocab item for unknown, and a vocab item for padding. 00:40:19.960 |
Then we can create the dictionary which goes in the opposite direction, so string to int. 00:40:26.760 |
And that won't cover everything because we intentionally truncated it down to 60,000 words. 00:40:33.440 |
And so if we come across something that's not in the dictionary, we want to replace 00:40:36.960 |
it with 0 for unknown, so we can use a default dict for that, with a lambda function that 00:40:43.600 |
always returns 0. So you can see all these things we're using that keep coming back up. 00:40:51.480 |
So now that we've got our s to i dictionary defined, we can then just call that for every 00:40:56.480 |
word for every sentence. And so there's our numericalized version, and there it is. And 00:41:06.920 |
so of course the nice thing is again, we can save that step as well. So each time we get 00:41:12.840 |
to another step, we can save it. And these are not very big files. Compared to what you 00:41:17.880 |
get used to with images, text is generally pretty small. Very important to also save 00:41:28.560 |
that vocabulary. Because this list of numbers means nothing, unless you know what each number 00:41:36.160 |
refers to, and that's what itos tells you. So you save those three things, and then later 00:41:42.840 |
on you can load them back up. So now our vocab size is 60,002, and our training language 00:42:00.120 |
So that's the preprocessing you do. We can probably wrap a little bit more of that in 00:42:05.520 |
little utility functions if we want to, but it's all pretty straightforward, and basically 00:42:10.400 |
that exact code will work for any dataset you have once you've got it in that CSV format. 00:42:18.240 |
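Pulling the numericalization and saving steps together, a sketch of that last stretch (the file names are just illustrative):

    import collections, pickle
    import numpy as np

    itos.insert(0, '_pad_')                              # special tokens for padding and unknown
    itos.insert(0, '_unk_')                              # so _unk_ ends up at index 0
    stoi = collections.defaultdict(lambda: 0,            # anything unseen maps to 0, i.e. _unk_
                                   {tok: i for i, tok in enumerate(itos)})

    trn_lm = np.array([[stoi[t] for t in sent] for sent in tok_trn], dtype=object)
    val_lm = np.array([[stoi[t] for t in sent] for sent in tok_val], dtype=object)

    np.save(LM_PATH / 'trn_ids.npy', trn_lm)             # reload later with np.load(..., allow_pickle=True)
    np.save(LM_PATH / 'val_ids.npy', val_lm)
    pickle.dump(itos, open(LM_PATH / 'itos.pkl', 'wb'))  # the vocab itself, so the ids stay meaningful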
So here is a kind of a new insight that's not new at all, which is that we'd like to 00:42:31.280 |
pre-train something. Like we know from lesson 4 that if we pre-train our classifier by first 00:42:39.680 |
creating a language model, and then fine-tuning that as a classifier, that was helpful. Remember 00:42:45.520 |
it actually got us a new state-of-the-art result. We got the best IMDB classifier result 00:42:50.720 |
that had ever been published. But quite a bit. Well, we're not going far enough though, 00:42:58.040 |
because IMDB movie reviews are not that different to any other English document compared to 00:43:12.000 |
how different they are to a random string or even to a Chinese document. So just like 00:43:19.940 |
ImageNet allowed us to train things that recognize stuff that kind of looks like pictures, and 00:43:26.760 |
we could use it on stuff that was nothing to do with ImageNet, like satellite images. 00:43:30.680 |
Why don't we train a language model that's just like good at English, and then fine-tune 00:43:37.400 |
it to be good at movie reviews? So this basic insight led me to try building a language 00:43:47.800 |
model on Wikipedia. So my friend Stephen Merity has already processed Wikipedia, creating a subset 00:43:58.920 |
of it - most of it, but throwing away the stupid little articles - and he calls that 00:44:08.240 |
WikiText 103. So I grabbed WikiText 103 and I trained a language model on it. I used exactly 00:44:16.640 |
the same approach I'm about to show you for training an IMDB language model, but instead 00:44:21.760 |
I trained a WikiText 103 language model. And then I saved it and I've made it available 00:44:29.640 |
for anybody who wants to use it at this URL. So this is not a URL for WikiText 103 the 00:44:36.920 |
documents; this is the WikiText 103 language model. So the idea now is let's train an IMDB 00:44:46.160 |
language model which starts with these words. 00:44:50.600 |
Now hopefully to you folks, this is an extremely obvious, extremely non-controversial idea because 00:44:58.720 |
it's basically what we've done in nearly every class so far. But when I first mentioned this 00:45:09.560 |
to people in the NLP community, I guess June/July of last year, there couldn't have been less 00:45:18.920 |
interest. I asked on Twitter, where a lot of the top Twitter researchers are people that 00:45:24.960 |
I follow and they follow me back, I was like "hey, what if we pre-trained a general language 00:45:29.800 |
model?" and they're like "no, all language is different, you can't do that" or "I don't 00:45:36.080 |
know why you would bother anyway, I've talked to people at conferences and I'm pretty sure 00:45:43.280 |
people have tried that and it's stupid." It just kind of went straight past them. I guess 00:45:56.000 |
because I am arrogant and I ignored them even though they know much more about NLP than 00:46:03.960 |
I do and just tried it anyway and let me show you what happened. 00:46:10.400 |
So here's how we do it. Grab the wiki text models, and if you use wget -r it'll actually 00:46:21.280 |
recursively grab the whole directory, it's got a few things in it. We need to make sure 00:46:27.480 |
that our language model has exactly the same embedding size, number of hidden and number 00:46:32.900 |
of layers as my wiki text one did, otherwise you can't load the weights in. So here's our 00:46:41.800 |
pre-trained path, here's our pre-trained language model path, let's go ahead and torch.load in 00:46:48.400 |
those weights from the forward wiki text 103 model. We don't normally use torch.load, but 00:46:58.440 |
that's the PyTorch way of grabbing a file. And it basically gives you a dictionary containing 00:47:07.080 |
the name of the layer and a tensor of those weights or an array of those weights. 00:47:14.760 |
Now here's the problem, that wiki text language model was built with a certain vocabulary which 00:47:21.720 |
was not the same as this one was built on. So my number 40 was not the same as wiki text 00:47:27.680 |
103 models number 40. So we need to map one to the other. That's very, very simple because 00:47:35.120 |
luckily I saved the itos for the WikiText vocab. So here's the list of what each word 00:47:44.280 |
is when I trained the wiki text 103 model, and so we can do the same default dict trick 00:47:50.520 |
to map it in reverse, and I'm going to use -1 to mean that it's not in the wiki text 00:47:56.520 |
dictionary. And so now I can just say my new set of weights is just a whole bunch of zeros 00:48:05.000 |
with vocab size by embedding size, so we're going to create an embedding matrix. I'm then 00:48:10.480 |
going to go through every one of the words in my IMDB vocabulary. I'm going to look it 00:48:17.200 |
up in stoi2, so string-to-int for the WikiText 103 vocabulary, and see if that word is 00:48:24.280 |
there. And if that word is there, then I'm not going to get this -1, so r will be greater 00:48:31.520 |
than or equal to 0. So in that case I will just set that row of the embedding matrix 00:48:36.800 |
to the weight that I just looked at, which was stored inside this named element. So these 00:48:45.520 |
names, you can just look at this dictionary and it's pretty obvious what each name corresponds 00:48:51.360 |
to because it looks very similar to the names that you gave it when you set up your module. 00:48:55.440 |
So here are the encoder weights. So grab it from the encoder weights. If I don't find it, 00:49:05.400 |
then I will use the row mean. In other words, here is the average embedding weight across 00:49:12.400 |
all of the wiki text 103 things. So that's pretty simple, so I'm going to end up with 00:49:18.560 |
an embedding matrix. For every word that's in both my vocabulary for IMDB and the WikiText 00:49:24.040 |
103 vocabulary, I will use the WikiText 103 embedding matrix weights; for anything 00:49:30.240 |
else, I will just use whatever was the average weight from the WikiText 103 embedding matrix. 00:49:36.080 |
And then I'll go ahead and replace the encoder weights with that, turned into a tensor. 00:49:43.600 |
We haven't talked much about weight tying, we might do so later, but basically the decoder, 00:49:48.500 |
so the thing that turns the final prediction back into a word, uses exactly the same weights, 00:49:56.380 |
so I pop it there as well. And then there's a bit of a weird thing with how we do embedding 00:50:01.600 |
dropout that ends up with a whole separate copy of them for a reason that doesn't matter 00:50:06.360 |
much. So we just pop the weights back where they need to go. 00:50:09.960 |
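Here is roughly what that remapping looks like in code. The sizes (400-dimensional embeddings, 1,150 hidden activations, 3 layers) are the ones given later in the lecture; the weight-dictionary key names are the ones the fastai 0.7 notebook uses, so check wgts.keys() against whatever file you actually downloaded:

    import collections, pickle
    import numpy as np
    import torch

    em_sz, nh, nl = 400, 1150, 3                         # must match the WikiText 103 model
    wgts = torch.load(PRE_PATH / 'fwd_wt103.h5',         # file name as downloaded; adjust if yours differs
                      map_location=lambda storage, loc: storage)

    itos2 = pickle.load(open(PRE_PATH / 'itos_wt103.pkl', 'rb'))     # WikiText 103 vocabulary
    stoi2 = collections.defaultdict(lambda: -1, {tok: i for i, tok in enumerate(itos2)})

    enc_wgts = wgts['0.encoder.weight'].numpy()          # pretrained embedding matrix
    row_m = enc_wgts.mean(0)                             # average embedding, for words WikiText never saw

    vs = len(itos)                                       # 60,002: the IMDB vocab built earlier
    new_w = np.zeros((vs, em_sz), dtype=np.float32)
    for i, tok in enumerate(itos):
        r = stoi2[tok]
        new_w[i] = enc_wgts[r] if r >= 0 else row_m

    wgts['0.encoder.weight'] = torch.from_numpy(new_w)
    wgts['0.encoder_with_dropout.embed.weight'] = torch.from_numpy(np.copy(new_w))
    wgts['1.decoder.weight'] = torch.from_numpy(np.copy(new_w))      # tied weights share the matrix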
So this is now a dictionary - a set of torch state - which 00:50:16.920 |
we can load in. So let's go ahead and create our language model. And so the basic approach 00:50:25.240 |
we're going to use, and I'm going to look at this in more detail in a moment, but the 00:50:27.920 |
basic approach we're going to use is I'm going to concatenate all of the documents together 00:50:38.000 |
into a single list of tokens of length 24.998 million. 00:50:47.260 |
So that's going to be what I pass in as my training set. So the language model, we basically 00:50:54.720 |
just take all our documents and just concatenate them back to back. And we're going to be continuously 00:50:59.280 |
trying to predict what's the next word after these words. And we'll look at these details 00:51:06.600 |
in a moment. I'm going to set up a whole bunch of dropout. We'll look at that in detail in 00:51:11.280 |
a moment. Once we've got a model data object, we can then grab the model from it. So that's 00:51:17.680 |
going to give us a learner. And then as per usual, we can call learner.fit. So we first 00:51:27.280 |
of all, as per usual, just do a single epoch on the last layer just to get that okay. And 00:51:34.320 |
the way I've set it up is the last layer is actually the embedding weights. Because that's 00:51:38.680 |
obviously the thing that's going to be the most wrong, because a lot of those embedding 00:51:42.020 |
weights didn't even exist in the vocab, so we're just going to train a single epoch of 00:51:47.200 |
just the embedding weights. And then we'll start doing a few epochs of the full model. 00:51:54.200 |
And so how is that looking? Well here's lesson 4, which was our academic world's best ever 00:52:02.920 |
result. And after 14 epochs we had a 4.23 loss. Here after 1 epoch we have a 4.12 loss. 00:52:19.800 |
So by pre-training on Wikitext 103, in fact let's go and have a look, we kept training 00:52:26.720 |
and training at a different rate. Eventually we got to 4.16. So by pre-training on Wikitext 00:52:32.400 |
103 we have a better loss after 1 epoch than the best loss we got for the language model 00:52:42.200 |
What is the Wikitext 103 model? Is it AWD LSTM again? 00:52:47.320 |
Yeah and we're about to dig into that. The way I trained it was literally the same lines 00:52:54.120 |
of code that you see here, but without pre-training it on Wikitext 103. 00:53:00.760 |
So let's take a 10-minute break, come back at 7.40 and we'll dig in and have a look at 00:53:08.720 |
Ok welcome back. Before we go back into language models and NLP classifiers, a quick discussion 00:53:17.280 |
about something pretty new at the moment which is the FastAI doc project. So the goal of 00:53:23.200 |
the FastAI doc project is to create documentation that makes readers say "Wow, that's the most 00:53:30.320 |
fantastic documentation I've ever read." And so we have some specific ideas about how to 00:53:37.440 |
do that, but it's the same kind of idea of top-down, thoughtful, take-full advantage 00:53:45.800 |
of the medium approach, interactive, experimental code first that we're all familiar with. 00:53:54.040 |
If you're interested in getting involved, the basic approach you can see in the docs 00:54:01.180 |
directory. So this is the readme in the docs directory. In there there is, amongst other 00:54:09.600 |
things, a transforms_template.adoc. What the hell is adoc? Adoc is ASCII doc. How many 00:54:17.600 |
people here have come across ASCII doc? That's awesome. People are laughing because there's 00:54:25.280 |
one hand up and it's somebody who was in our study group today who talked to me about ASCII 00:54:29.560 |
doc. ASCII doc is the most amazing project. It's like Markdown, but it's like what Markdown 00:54:36.280 |
needs to be to create actual books, and a lot of actual books are written in ASCII doc. 00:54:43.280 |
And so it's as easy to use as Markdown, but there's way more cool stuff you can do with 00:54:48.440 |
it. In fact, here is an ASCII doc file here, and as you'll see it looks very normal. There's 00:54:53.720 |
headings and this is pre-formatted text, and there's lists and whatever else. It looks 00:55:05.800 |
pretty standard, and actually I'll show you a more complete ASCII doc thing, a more standard 00:55:13.840 |
ASCII doc thing. But you can do stuff like say put a table of contents here please. You 00:55:20.880 |
can say colon colon means put a definition list here please. Plus means this is a continuation 00:55:28.780 |
of the previous list item. So there's just little things that you can do which are super 00:55:34.600 |
handy or make it slightly smaller than everything else. So it's like turbocharged Markdown. 00:55:43.800 |
And so this ASCII doc creates this HTML. And I didn't add any CSS or do anything myself. 00:55:52.280 |
We literally started this project like 4 hours ago. So this is like just an example basically. 00:55:58.480 |
And so you can see we've got a table of contents, we can jump straight to here, we've got a 00:56:05.920 |
cross-reference we can click on to jump straight to the cross-reference. Each method comes 00:56:11.960 |
along with its details and so on and so forth. And to make things even easier, rather than 00:56:18.380 |
having to know that the argument list is meant to be smaller than the main part, or how do 00:56:25.980 |
you create a cross-reference, or how are you meant to format the arguments to the method 00:56:32.280 |
name and list out each one of its arguments, we've created a special template where you 00:56:38.880 |
can just write various stuff in curly brackets like "please put the arguments here, and here 00:56:43.800 |
is an example of one argument, and here is a cross-reference, and here is a method," and 00:56:49.400 |
so forth. So we're in the process of documenting the documentation template; there's basically 00:56:55.440 |
like 5 or 6 of these little curly bracket things you'll need to learn. But for you to 00:56:59.760 |
create a documentation of a class or a method, you can just copy one that's already there 00:57:05.680 |
and so the idea is we're going to have, it'll almost be like a book. There'll be tables 00:57:12.320 |
and pictures and little video segments and hyperlink throughout and all that stuff. You 00:57:21.320 |
might be wondering what about docstrings, but actually I don't know if you've noticed, 00:57:25.760 |
but if you look at the Python standard library and look at the docstring for example for 00:57:31.320 |
regex compile, it's a single line. Nearly every docstring in Python is a single line. 00:57:38.040 |
And Python then does exactly this. They have a website containing the documentation that 00:57:43.080 |
says like "Hey, this is what regular expressions are and this is what you need to know about 00:57:46.940 |
them and if you want them to go faster, you'll need to use compile and here's lots of information 00:57:51.040 |
about compile and here's the examples." It's not in the docstring. And that's how we're 00:57:55.840 |
doing it as well. Our docstrings will be one line unless you need two sometimes. It's going 00:58:03.640 |
to be very similar to Python, but even better. So everybody is welcome to help contribute 00:58:11.920 |
to the documentation and hopefully by the time you're watching this on the MOOC, it'll 00:58:16.640 |
be reasonably fleshed out and we'll try to keep a list of things to do. 00:58:26.560 |
So I'm going to do one first. So one question that came up in the break was how does this 00:58:35.440 |
compare to Word2Vec? And this is actually a great thing for you to spend time thinking 00:58:41.440 |
about during the week is how does this compare to Word2Vec. I'll give you the summary now, 00:58:46.360 |
but it's a very important conceptual difference. The main conceptual difference is, what is 00:58:51.320 |
Word2Vec? Word2Vec is a single embedding matrix. Each word has a vector and that's it. So in 00:59:00.520 |
other words, it's a single layer from a pre-trained model and specifically that layer is the input 00:59:08.440 |
layer. And also specifically that pre-trained model is a linear model that is pre-trained 00:59:16.960 |
on something called a co-occurrence matrix. So we have no particular reason to believe 00:59:22.600 |
that this model has learned anything much about the English language or that it has 00:59:27.040 |
any particular capabilities because it's just a single linear layer and that's it. 00:59:34.320 |
So what's this WikiText 103 model? It's a language model. It has a 400-dimensional embedding 00:59:45.200 |
matrix, 3 hidden layers with 1,150 activations per layer, and regularization and all of that 00:59:57.560 |
stuff. Tied input-output matrices - it's basically a state-of-the-art AWD LSTM. So what's 01:00:05.920 |
the difference between a single layer of a single linear model versus a three-layer recurrent 01:00:14.800 |
neural network? Everything. They're very different levels of capability. And so you'll see when 01:00:22.200 |
you try using a pre-trained language model versus a Word2vec layer, you'll get very, 01:00:29.240 |
very different results for the vast majority of tasks. 01:00:33.360 |
What if the NumPy array does not fit in memory? Is it possible to write a PyTorch data loader 01:00:42.440 |
It almost certainly won't come up, so I'm not going to spend time on it. These things 01:00:46.200 |
are tiny. They're just ints. Think about how many ints you would need to run out of memory. 01:00:52.680 |
It's not going to happen. They don't have to fit in GPU memory, just in your memory. I've 01:00:57.880 |
actually done another Wikipedia model, which I called GigaWiki, which was on all of Wikipedia, 01:01:06.680 |
and even that easily fits in memory. The reason I'm not using it is because it turned out 01:01:10.480 |
not to really help very much versus WikiText 103, but I've built a bigger model than anybody 01:01:17.800 |
else I found in the academic literature pretty much, and it fits in memory on a single machine. 01:01:24.600 |
What is the idea behind averaging the weights of embeddings? 01:01:27.720 |
They've got to be set to something. There are words that weren't there, so other options 01:01:34.560 |
is we could leave them at 0, but that seems like a very extreme thing to do. 0 is a very 01:01:38.880 |
extreme number. Why would it be 0? We could set it equal to some random numbers, but if 01:01:46.160 |
so, what would be the mean and standard deviation of those random numbers, or should it be uniform? 01:01:50.960 |
If we just average the rest of the embeddings, then we have something that's a reasonable 01:01:57.800 |
Just to clarify, this is how you're initializing words that didn't appear in the training corpus. 01:02:03.040 |
I think you've pretty much just answered this one, but someone had asked if there's a specific 01:02:09.320 |
advantage to creating our own pre-trained embedding over using GloVe or Word2Vec. 01:02:14.520 |
I think I have. We're not creating a pre-trained embedding; we're creating a pre-trained model. 01:02:23.120 |
Let's talk a little bit more. This is a ton of stuff we've seen before, but it's changed 01:02:26.880 |
a little bit. It's actually a lot easier than it was in Part 1, but I want to go a little 01:02:38.000 |
So this is the language model loader, and I really hope that by now you've learned in 01:02:41.280 |
your editor or IDE how to jump to symbols. I don't want it to be a burden for you to 01:02:48.920 |
find out what the source code of a language model loader is. And if it's still a burden, 01:02:54.000 |
please go back and try and learn those keyboard shortcuts in VS Code. If your editor does 01:03:00.760 |
not make it easy, don't use that editor anymore. There's lots of good free editors that make 01:03:10.360 |
So here's the source code for language model loader. It's interesting to notice that it's 01:03:18.720 |
not doing anything particularly tricky. It's not deriving from anything at all. What makes 01:03:30.400 |
it something that's capable of being a data loader is it's something you can iterate over. 01:03:36.640 |
And so specifically, here's the fit function inside fastai.model. This is where everything 01:03:47.680 |
ends up eventually, which goes through each epoch, and then it creates an iterator from 01:03:52.960 |
the data loader, and it just does a for loop through it. So anything you can do a for loop 01:03:57.480 |
through can be a data loader. And specifically, it needs to return tuples of many batches, 01:04:05.800 |
an independent and dependent variable for many batches. 01:04:09.320 |
So anything with a dunder-iter method is something that can act as an iterator. And 01:04:17.600 |
yield is a neat little Python keyword you probably should learn about if you don't already 01:04:22.520 |
know it, but it basically spits out a thing and waits for you to ask for another thing, 01:04:30.720 |
So in this case, we start by initializing the language model, passing it in the numbers. 01:04:38.600 |
So this is a numericalized, big, long list of all of our documents concatenated together. 01:04:46.060 |
And the first thing we do is to batchify it. And this is the thing which quite a few of 01:04:52.280 |
you got confused about last time. If our batch size is 64 and we have 25 million numbers in 01:05:05.320 |
our list, we are not creating items of length 64. We're not doing that. We're creating 64 01:05:15.080 |
items in total. So each of them is of size t/64, which is 390,000. So that's what we 01:05:27.000 |
do here when we reshape it so that this axis here is of length 64, and then this -1 is 01:05:36.400 |
everything else. So that's 390,000 long. And then we transpose it. 01:05:44.560 |
So that means that we now have 64 columns, 390,000 rows, and then what we do each time 01:05:52.560 |
we do an iterate is we grab one batch of some sequence length, we'll look at that in a moment, 01:06:00.120 |
but basically it's approximately equal to bptt, which we set to 70, stands for backprop 01:06:10.160 |
through time, and we just grab that many rows. So from i to i plus 70 rows, and then we try 01:06:23.800 |
to predict that plus 1. So we've got 64 columns, and each of those is 1/64 of our 25 million 01:06:35.880 |
or whatever it was, tokens, hundreds of thousands long, and we just grab 70 at a time. So each 01:06:45.160 |
of those columns each time we grab it is going to hook up to the previous column. So that's 01:06:51.600 |
why we get this consistency, this language model. It's stateful, just really important. 01:06:59.600 |
Pretty much all the cool stuff in the language model is stolen from Stephen Merity's AWD 01:07:06.640 |
LSTM, including this little trick here, which is if we always grab 70 at a time and then 01:07:15.200 |
we go back and do a new epoch, we're going to grab exactly the same batches every time. 01:07:20.480 |
There's no randomness. Now normally we shuffle our data every time we do an epoch, or every 01:07:26.000 |
time we grab some data we grab it at random. You can't do that with a language model because 01:07:30.660 |
this set has to join up to the previous set because it's trying to learn the sentence. 01:07:38.120 |
If you suddenly jump somewhere else, then that doesn't make any sense as a sentence. 01:07:43.400 |
So Stephen's idea is to say, since we can't shuffle the order, let's instead randomly 01:07:51.380 |
change the size, the sequence length. So basically he says, 95% of the time we'll use bptt, 70, 01:08:02.020 |
but 5% of the time we'll use half that. And then he says, you know what, I'm not even 01:08:08.640 |
going to make that the sequence length, I'm going to create a normally distributed random 01:08:13.320 |
number with that average and a standard deviation of 5, and I'll make that the sequence length. 01:08:20.080 |
So the sequence length is 70ish, and that means every time we go through we're getting 01:08:26.600 |
slightly different batches. So we've got that little bit of extra randomness. I asked Stephen 01:08:34.420 |
Merity where he came up with this idea. Did he think of it? He was like, I think I thought 01:08:40.840 |
of it, but it seemed so obvious that I bet I didn't think of it, which is true of every 01:08:46.280 |
time I come up with an idea in deep learning, it always seems so obvious that you assume 01:08:49.640 |
somebody else has thought of it, but I think he thought of it. 01:08:54.860 |
So this is a nice thing to look at if you're trying to do something a bit unusual with 01:09:01.600 |
a data loader. It's like, okay, here's a simple kind of role model you can use as to creating 01:09:07.840 |
a data loader from scratch, something that spits out batches of data. So our language 01:09:16.200 |
model loader just took in all of the documents concatenated together along with the batch 01:09:23.960 |
Now generally speaking, we want to create a learner, and the way we normally do that 01:09:28.700 |
is by getting a model data object and by calling some kind of method which have various names, 01:09:34.360 |
but often we call that method getModel. And so the idea is that the model data object 01:09:39.920 |
has enough information to know what kind of model to give you. So we have to create that 01:09:45.720 |
model data object, which means we need that class, and so that's very easy to do. 01:09:55.860 |
So here are all of the pieces. We're going to create a custom learner, a custom model 01:09:59.900 |
data class and a custom model class. So a model data class, again, this one doesn't inherit 01:10:07.040 |
from anything, so you really see there's almost nothing to do. You need to tell it most importantly 01:10:14.440 |
what's your training set, give it a data loader, what's the validation set, give it a data 01:10:19.680 |
loader, and optionally give it a test set, plus anything else it needs to know. So it 01:10:29.040 |
might need to know the bptt, it needs to know the number of tokens, that's the vocab size, 01:10:38.240 |
it needs to know what is the padding index, and so that it can save temporary files and 01:10:45.360 |
models, model data always needs to know the path. And so we just grab all that stuff and 01:10:50.000 |
we dump it. And that's it, that's the entire initializer, there's no logic there at all. 01:10:55.920 |
So then all of the work happens inside get_model. And so get_model calls something we'll look 01:11:03.120 |
at later which just grabs a normal PyTorch nn.Module architecture and puts it on the GPU. 01:11:14.440 |
Note with PyTorch normally we would say .cuda. With fast.ai, it's better to say to_gpu. And 01:11:21.040 |
the reason is that if you don't have a GPU, it will leave it on the CPU, and it also provides 01:11:27.440 |
a global variable you can set to choose whether it goes on the GPU or not. So it's a better 01:11:35.520 |
So we wrap the model in a language model. And the language model is this. Basically 01:11:40.840 |
a language model is a subclass of basic model. It basically almost does nothing except it 01:11:48.820 |
defines layer groups. And so remember how when we do discriminative learning rates where 01:11:54.660 |
different layers have different learning rates, or we freeze different amounts, we don't provide 01:12:03.300 |
a different learning rate for every layer because there can be like a thousand layers. 01:12:07.680 |
We provide a different learning rate for every layer group. So when you create a custom model, 01:12:13.300 |
you just have to override this one thing which returns a list of all of your layer groups. 01:12:21.840 |
So in this case, my last layer group contains the last part of the model and one bit of 01:12:28.680 |
dropout, and the rest of it, this star here, means pull this apart. So this is basically 01:12:40.200 |
So that's all that is. And then finally, turn that into a learner. And so a learner you 01:12:47.520 |
just pass in the model and it turns it into a learner. In this case we have overridden 01:12:52.480 |
learner and the only thing we've done is to say I want the default loss function to be 01:12:59.160 |
cross-entropy. So this entire set of custom model, custom model data, custom learner all 01:13:07.960 |
fits on a single screen, and they always basically look like this. So that's a kind of little 01:13:15.040 |
dig inside this pretty boring part of the code base. 01:13:19.200 |
So the interesting part of this code base is getLanguageModel. GetLanguageModel is actually 01:13:24.200 |
the thing that gives us our AWD LSTM. And it actually contains the big idea, the big, incredibly 01:13:35.440 |
simple idea that everybody else here thinks it's really obvious, that everybody in the 01:13:40.280 |
NLP community I spoke to thought was insane, which is basically every model can be thought 01:13:47.720 |
of as a backbone plus a head, and if you pre-train the backbone and stick on a random head, you 01:14:00.120 |
And so these two bits of the code are literally right next to each other. There is this bit 01:14:08.520 |
of fastai.lm_rnn. Here's getLanguageModel. Here's getClassifier. getLanguageModel creates 01:14:18.000 |
an RNN encoder and then creates a sequential model that sticks on top of that a linear 01:14:24.200 |
decoder. Classifier creates an RNN encoder and then a sequential model that sticks on 01:14:30.160 |
top of that a pooling linear classifier. We'll see what these differences are in a moment, 01:14:35.440 |
but you get the basic idea. They're basically doing pretty much the same thing. They've 01:14:39.880 |
got this head and then they're sticking on a simple linear layer on top. 01:14:46.280 |
So it's worth digging in a little bit deeper and seeing what's going on here. Yes, Rich? 01:14:52.240 |
>> There was a question earlier about whether any of this translates to other languages. 01:14:59.080 |
>> Yeah, this whole thing works in any language you like. 01:15:02.800 |
>> I mean, would you have to retrain your language model on a corpus from that language? 01:15:12.920 |
>> So the wikitext-103-pre-trained-language-model knows English. You could use it maybe as 01:15:22.080 |
a pre-trained start for a French or German model. Start by retraining the embedding layer 01:15:27.840 |
from scratch. Might be helpful. Chinese, maybe not so much. But given that a language model 01:15:35.560 |
can be trained from any unlabeled documents at all, you'd never have to do that. Because 01:15:42.280 |
almost every language in the world has plenty of documents. You can grab newspapers, web 01:15:51.120 |
pages, parliamentary records, whatever. As long as you've got a few thousand documents 01:15:59.520 |
showing somewhat normal usage of that language, you can create a language model. 01:16:04.640 |
And so I know some of our students, one of our students, whose name I'll have to look 01:16:09.280 |
up afterwards, very embarrassing, tried this approach for Thai. He said the first 01:16:16.600 |
model he built easily beat the previous state-of-the-art Thai classifier. For those of you 01:16:24.160 |
that are international fellows, this is an easy way for you to whip out a paper in which 01:16:31.440 |
you either create the first ever classifier in your language or beat everybody else's 01:16:36.160 |
classifier in your language and then you can tell them that you've been a student of deep 01:16:41.080 |
learning for six months and piss off all the academics in your country. 01:16:47.160 |
So here's our RNN encoder. It's just a standard nn.Module. Most of the text in it is actually 01:16:57.280 |
just documentation, as you can see. It looks like there's more going on in it than there 01:17:03.280 |
actually is, but really all there is is we create an embedding layer, we create an LSTM 01:17:09.640 |
for each layer that's been asked for, and that's it. Everything else in it is dropout. 01:17:19.520 |
Basically all of the interesting stuff in the AWD LSTM paper is all of the places you 01:17:25.320 |
can put dropout. And then the forward is basically the same thing, right? It's call the embedding 01:17:35.240 |
layer, add some dropout, go through each layer, call that RNN layer, append it to our list 01:17:44.960 |
of outputs, add dropout, and that's about it. 01:17:54.320 |
So it's really pretty straightforward. The paper you want to be reading, as I've mentioned, 01:18:05.020 |
is the AWD LSTM paper, which is this one here, regularizing and optimizing LSTM language 01:18:10.440 |
models, and it's well-written and pretty accessible and entirely implemented inside FastAI as 01:18:20.920 |
well, so you can see all of the code for that paper. And like a lot of the code is shamelessly 01:18:29.240 |
plagiarized with Stephen's permission from his excellent GitHub repo, AWD LSTM, and the 01:18:36.880 |
process of which I fixed some of his bugs as well. I even told him about them. 01:18:46.920 |
So I'm talking increasingly about "please read the papers", so here's the paper, "please 01:18:51.320 |
read this paper", and it refers to other papers. So for things like why is it that the encoder 01:19:00.960 |
weight and the decoder weight are the same? Well, it's because there's this thing called 01:19:10.720 |
"tie_weights", this is inside that get_language model, there's a thing called "tie_weights", 01:19:21.040 |
it defaults to true, and if it's true then we literally use the same weight matrix for 01:19:32.280 |
the encoder and the decoder. So they're literally pointing at the same block of memory. And 01:19:39.160 |
so why is that? What's the result of it? That's one of the citations in Stephen's paper, which 01:19:44.920 |
is also a well-written paper, you can go and look up and learn about weight tying. 01:19:53.040 |
So we have basically a standard RNN, the only way it's not standard is it's just got lots 01:19:57.960 |
more types of dropout in it, and then a sequential model, on top of that we stick a linear decoder, 01:20:06.600 |
which is literally half the screen of code. It's got a single linear layer, we initialize 01:20:15.320 |
the weights to some range, we add some dropout, and that's it. So we've got an RNN, on top 01:20:25.040 |
of that we stick a linear layer with dropout and we're finished. So that's the language 01:20:29.880 |
model. So what dropout you choose matters a lot, and through a lot of experimentation 01:20:46.000 |
I found a bunch of dropouts -- you can see here we've got each of these corresponds to 01:20:51.840 |
a particular argument -- a bunch of dropouts that tend to work pretty well for language 01:20:56.480 |
models. But if you have less data for your language model, you'll need more dropout. If 01:21:06.680 |
you have more data, you can benefit from less dropout, you don't want to regularize more 01:21:11.200 |
than you have to. Rather than having to tune every one of these 5 things, my claim is they're 01:21:19.000 |
already pretty good ratios to each other, so just tune this number. I just multiply 01:21:24.000 |
it all by something. So there's really just one number you have to tune. If you're overfitting, 01:21:33.480 |
then you'll need to increase this number. If you're underfitting, you'll need to decrease 01:21:37.040 |
this number. Other than that, these ratios actually seem pretty good. 01:21:45.640 |
So one important idea which may seem pretty minor, but again it's incredibly controversial, 01:21:55.000 |
is that we should measure accuracy when we look at a language model. So normally in language 01:22:01.500 |
models we look at this loss value, which is just cross-entropy loss, but specifically 01:22:08.680 |
where you nearly always take e^ of that, which the NLP community calls perplexity. 01:22:22.240 |
There's a lot of problems with comparing things based on cross-entropy loss. I'm not sure 01:22:29.120 |
I've got time to go into it in detail now, but the basic problem is that it's kind of 01:22:35.400 |
like that thing we learned about focal loss. Cross-entropy loss, if you're right, it wants 01:22:40.240 |
you to be really confident that you're right. So it really penalizes a model that doesn't 01:22:46.720 |
kind of say, I'm so sure this is wrong, whereas accuracy doesn't care at all about how confident 01:22:52.360 |
you are, it just cares about whether you're right. And this is much more often the thing 01:22:56.520 |
which you care about in real life. So this accuracy is how often do we guess the next 01:23:02.760 |
word correctly. And I just find that a much more stable number to keep track of. 01:23:14.720 |
So we trained for a while, and we get down to a 3.9 cross-entropy loss, and if you go 01:23:32.160 |
e^3.9, which is about 50, that kind of gives you a sense of what's happened with language models. If you look 01:23:45.840 |
at academic papers from about 18 months ago, you'll see them talking about state-of-the-art 01:23:54.760 |
perplexities of over 100. The rate at which our ability to kind of understand language, 01:24:04.440 |
and I think measuring language model accuracy or perplexity is not a terrible proxy for 01:24:11.440 |
understanding language. If I can guess what you're going to say next, I pretty much need 01:24:16.640 |
to understand language pretty well, and also the kind of things you might talk about pretty 01:24:20.480 |
well. So this number has just come down so much. It's been amazing. NLP in the last 12 01:24:29.160 |
to 18 months. And it's going to come down a lot more. It really feels like 2011-2012 computer 01:24:35.960 |
vision. We're just starting to understand transfer learning and fine-tuning, and these 01:24:44.880 |
So everything you thought about what NLP can and can't do is very rapidly going out of date. 01:24:53.920 |
But there's still lots of stuff NLP is not good at, to be clear. Just like in 2012 there 01:24:58.560 |
was lots of stuff computer vision wasn't good at. But it's changing incredibly rapidly, 01:25:03.420 |
and now is a very, very good time to be getting very, very good at NLP or starting start-ups 01:25:10.340 |
based on NLP because there's a whole bunch of stuff which computers were absolutely shit 01:25:15.120 |
at two years ago, and now are not quite as good as people, and then next year they'll 01:25:25.140 |
Two questions. One, what is your ratio of paper reading versus coding in a week? 01:25:35.000 |
What do you think, Rachel? You see me. I mean, it's a lot more coding, right? 01:25:39.000 |
It's a lot more coding. I feel like it also really varies from week to week. I feel like 01:25:44.320 |
Like with that bounding box stuff, there was all these papers and no map through them, 01:25:54.040 |
and so I didn't even know which one to read first, and then I'd read the citations and 01:25:58.200 |
didn't understand any of them. So there was a few weeks of just kind of reading papers 01:26:02.600 |
before I even knew what to start coding. That's unusual though. Most of the time, I don't 01:26:10.560 |
know, any time I start reading a paper, I'm always convinced that I'm not smart enough 01:26:15.120 |
to understand it, always, regardless of the paper, and somehow eventually I do. But yeah, 01:26:26.880 |
And then the second question, is your dropout rate the same through the training or do you 01:26:34.680 |
I'll just say one more thing about the last bit, which is very often, like the vast majority, 01:26:42.080 |
nearly always, after I've read a paper, even after I've read the bit that says this is 01:26:49.920 |
the problem I'm trying to solve, I'll kind of stop there and try to implement something 01:26:54.080 |
that I think might solve that problem, and then I'll go back and read the paper and I'll 01:26:57.600 |
read little bits about how I solve these problem bits, and I'll be like, oh that's a good idea, 01:27:04.120 |
And so that's why, for example, I didn't actually implement SSD. My custom head is not the same 01:27:11.320 |
as their head. It's because I kind of read the gist of it and then I tried to create 01:27:15.560 |
something best as I could and then go back to the papers and try to see why. So by the 01:27:20.960 |
time I got to the focal loss paper, I was driving myself crazy with how come I can't 01:27:28.520 |
find small objects, how come it's always predicting background, and I read the focal loss paper 01:27:33.600 |
and I was like, that's why! It's so much better when you deeply understand the problem they're 01:27:42.480 |
trying to solve. And I do find the vast majority of the time, by the time I read that bit of 01:27:46.800 |
the paper which is like solving the problem, I'm then like, yeah but these three ideas I 01:27:51.720 |
came up with, they didn't try. And you suddenly realize that you've got new ideas. Or else 01:27:57.040 |
if you just implement the paper mindlessly, you tend not to have these insights about 01:28:10.120 |
Varying dropout is really interesting and there are some recent papers actually that 01:28:15.080 |
suggest gradually changing dropout and it was either a good idea to gradually make it 01:28:21.600 |
smaller or to gradually make it bigger. I'm not sure which. Maybe one of us can try and 01:28:29.200 |
find it during the week. I haven't seen it widely used. I tried it a little bit with 01:28:34.280 |
the most recent paper I wrote and I had some good results. I think I was gradually making 01:28:45.720 |
And then the next question is, "Am I correct in thinking that this language model is built 01:28:50.000 |
on word embeddings? Would it be valuable to try this with phrase or sentence embeddings?" 01:28:56.120 |
I asked this because I saw from Google the other day universal sentence encoder. 01:29:02.360 |
Yeah, this is much better than that. Do you see what I mean? This is not just an embedding 01:29:07.480 |
of a sentence, this is an entire model. An embedding by definition is like a fixed thing. 01:29:16.920 |
I think they're asking, they're saying that this language, well the first question is, 01:29:21.920 |
is this language model built on word embeddings? 01:29:24.480 |
Right, but it's not saying, a sentence or a phrase embedding is always a model that 01:29:32.160 |
creates that. We've got a model that's like trying to understand language, it's not just 01:29:39.000 |
a phrase, it's not just a sentence, it's a document in the end and it's not just an embedding, 01:29:46.960 |
So this has been a huge problem with NLP for years now is this attachment they have to 01:29:54.120 |
embeddings. So even the paper that the community has been most excited about recently from 01:30:00.280 |
AI2, the Allen Institute, called ELMO, and they found much better results across lots 01:30:07.840 |
of models. But again, it was an embedding. They took a fixed model and created a fixed 01:30:12.720 |
set of numbers which they then fed into a model. But in computer vision, we've known 01:30:19.080 |
for years that that approach of having a fixed set of features, they're called hypercolumns 01:30:26.800 |
in computer vision. People stopped using them like 3 or 4 years ago because fine-tuning 01:30:37.640 |
So for those of you that have spent quite a lot of time with NLP and not much time with 01:30:42.040 |
computer vision, you're going to have to start relearning. All that stuff you have been told 01:30:48.600 |
about this idea that there are these things called embeddings and that you learn them 01:30:53.800 |
ahead of time, and then you apply these fixed things, whether it be word level or phrase 01:31:00.120 |
level or whatever level, don't do that. You want to actually create a pre-trained model 01:31:06.840 |
and fine-tune it end to end. You'll see some specific results. 01:31:16.800 |
For using accuracy instead of perplexity as a metric for the model, could we work that 01:31:26.920 |
into the loss function rather than just use it as a metric? 01:31:30.080 |
No, you never want to do that whether it be computer vision or NLP or whatever. It's too 01:31:34.120 |
bumpy. So cross-entropy is fine as a loss function. And I'm not saying instead of I 01:31:41.040 |
use it in addition, I think it's good to look at the accuracy and to look at the cross-entropy. 01:31:47.480 |
But for your loss function, you need something nice and smooth. Accuracy doesn't work very 01:31:54.480 |
You'll see there's two different versions of save. There's save and save encoder. Save 01:32:00.040 |
saves the whole model as per usual. Save encoder saves just that bit. In other words, in the 01:32:11.520 |
sequential model, it saves just that bit and not that bit. In other words, this bit, which 01:32:18.340 |
is the bit that actually makes it into a language model, we don't care about in the classifier, 01:32:23.520 |
we just care about that bit. So let's now create the classifier. I'm going to go through this 01:32:34.280 |
bit pretty quickly because it's the same. But when you go back during the week and look 01:32:38.120 |
at the code, convince yourself it's the same. We do pd.read_csv again, tokenize again, 01:32:43.880 |
getAll again, save those tokens again. We don't create a new I2S vocabulary. We obviously 01:32:52.900 |
want to use the same vocabulary we had in the language model because we're about to reload 01:32:58.540 |
the same encoder. Same default dict, same way of creating our numericalized list, which 01:33:08.060 |
as per before we can save. So that's all the same. Later on we can reload those rather 01:33:17.000 |
So all of our hyperparameters are the same. We can change the dropout. Same optimizer function. 01:33:31.120 |
Pick a batch size that's as big as you can fit without running out of memory. This bit's 01:33:38.760 |
a bit interesting. There's some fun stuff going on here. The basic idea here is that 01:33:50.000 |
for the classifier we do really want to look at a document. We need to say is this document 01:33:57.040 |
positive or negative. So we do want to shuffle the documents because we like to shuffle things. 01:34:05.480 |
But those documents are different lengths, so if we stick them all into one batch -- this 01:34:11.960 |
is a handy thing that fastAI does for you -- you can stick things at different lengths 01:34:15.360 |
into a batch and it will automatically pad them, so you don't have to worry about that. 01:34:20.920 |
But if they're wildly different lengths, then you're going to be wasting a lot of computation 01:34:25.160 |
times. There might be one thing there that's 2,000 words long and everything else is 50 01:34:29.240 |
words long and that means you end up with a 2,000-wide tensor. That's pretty annoying. 01:34:36.000 |
So James Bradbury, who's actually one of Stephen Merity's colleagues and the guy who came up 01:34:41.480 |
with TorchText, came up with an idea which was let's sort the dataset by length-ish. 01:34:55.120 |
So kind of make it so the first things in the list are on the whole, shorter than the 01:35:03.160 |
things at the end, but a little bit random as well. 01:35:14.820 |
So the first thing we need is a dataset. So we have a dataset passing in the documents 01:35:24.880 |
and their labels. And so here's a text dataset and it inherits from dataset. Here is dataset 01:35:31.800 |
from PyTorch. And actually, dataset doesn't do anything at all. It says you need to get 01:35:38.820 |
item if you don't have one, you're going to get an error, you need a length if you don't 01:35:42.560 |
have one, you're going to get an error. So this is an abstract class. 01:35:48.640 |
So we're going to pass in our x, we're going to pass in our y, and getItem is going to 01:35:54.640 |
grab the x and grab the y and return them. It couldn't be much simpler. Optionally, it 01:36:02.920 |
could reverse it. Optionally it could stick an end of stream at the end. Optionally it 01:36:06.400 |
could stick a start of stream at the beginning. We're not doing any of those things. So literally 01:36:09.640 |
all we're doing is putting in an x, putting in a y, and then grab an item, we're returning 01:36:14.100 |
the x and the y as a tuple. And the length is how long the x array is. So that's all 01:36:22.200 |
the dataset is. Something with a length that you can index. 01:36:27.920 |
So to turn it into a data loader, you simply pass the dataset to the data loader constructor, 01:36:34.300 |
and it's now going to go ahead and give you a batch of that at a time. Normally you can 01:36:39.560 |
say shuffle=true or shuffle=false, it will decide whether to randomize it for you. In 01:36:44.920 |
this case though, we're actually going to pass in a sampler parameter. The sampler is 01:36:50.920 |
a class we're going to define that tells the data loader how to shuffle. So for the validation 01:36:59.120 |
set, we're going to define something that actually just sorts it. It just deterministically 01:37:04.440 |
sorts it so all the shortest documents will be at the start, all the longest documents 01:37:09.840 |
will be at the end, and that's going to minimize the amount of padding. 01:37:13.720 |
For the training sampler, we're going to create this thing I call a sort-ish sampler, which 01:37:22.000 |
also sorts-ish. So this is where I really like PyTorch is that they came up with this 01:37:31.600 |
idea for an API for their data loader where we can hook in new classes to make it behave 01:37:38.280 |
in different ways. So here's a sort-sampler, it's simply something which again has a length, 01:37:46.320 |
which is the length of the data source, and it has an iterator, which is simply an iterator 01:37:52.160 |
which goes through the data source sorted by length of the key, and I pass in as the 01:38:02.080 |
key lambda function which returns the length. 01:38:10.280 |
And so for the sort-ish sampler, I won't go through the details, but it basically does 01:38:16.040 |
the same thing with a little bit of randomness. So it's just another of these beautiful little 01:38:24.960 |
design things in PyTorch that I discovered. I could take James Bradbury's ideas, which 01:38:31.760 |
he had written a whole new set of classes around, and I could actually just use the 01:38:37.760 |
inbuilt hooks inside PyTorch. You will notice that it's not actually PyTorch's data loader, 01:38:46.700 |
it's actually FastAI's data loader, but it's basically almost entirely plagiarized from 01:38:51.600 |
PyTorch but customized in some ways to make it faster, mainly by using multithreading instead of multiprocessing. 01:38:58.520 |
Does the pre-trained LSTM depth and bptt need to match with the new one we are training? 01:39:07.520 |
No, the bptt doesn't need to match at all. That's just like how many things do we look 01:39:11.620 |
at at a time, it's got nothing to do with the architecture. 01:39:16.640 |
So now we can call that function we just saw before, getRNNClassifier. It's going to create 01:39:22.200 |
exactly the same encoder, more or less, and we're going to pass in the same architectural 01:39:28.720 |
details as before. But this time, the head that we add on, you've got a few more things 01:39:37.200 |
you can do. One is you can add more than one hidden layer. So this layer here says this 01:39:43.800 |
is what the input to my classifier section, my head, is going to be. This is the output 01:39:51.440 |
of the first layer, this is the output of the second layer, and you can add as many 01:39:55.720 |
as you like. So you can basically create a little multi-layered neural net classifier 01:40:00.240 |
at the end. And so ditto, these are the dropouts to go after each of these layers. And then 01:40:08.200 |
here are all of the AWD LSTM dropouts, which we're going to basically plagiarize that idea 01:40:13.780 |
for our classifier. We're going to use the RNN learner, just like before. We're going 01:40:21.860 |
to use discriminative learning rates for different layers. You can try using weight decay or not, 01:40:31.640 |
I've been fiddling around a bit with that to see what happens. And so we start out just 01:40:37.240 |
training the last layer and we get 92.9% accuracy, then we unfreeze one more layer, get 93.3 accuracy, 01:40:47.760 |
and then we fine-tune the whole thing. And after 3 epochs, so this was kind of the 01:41:07.120 |
main attempt before our paper came along at using a pre-trained model. And what they did 01:41:14.800 |
is they used a pre-trained translation model. But they didn't fine-tune the whole thing, 01:41:25.460 |
they just took the activations of the translation model. And when they tried IMDB, they got 91.8% 01:41:47.220 |
which we beat easily after only fine-tuning one layer. They weren't state-of-the-art there, 01:41:57.700 |
the state-of-the-art is 94.1, which we beat after fine-tuning the whole thing for 3 epochs. 01:42:07.300 |
And so by the end, we're at 94.8, which is obviously a huge difference because in terms 01:42:15.460 |
of error rate, that's gone down from 5.9 to 5.2, and then I'll tell you a simple little trick. Go 01:42:22.400 |
back to the start of this notebook, and reverse the order of all of the documents, and then 01:42:31.280 |
rerun the whole thing. And when you get to the bit that says wt103, replace this fwd 01:42:41.220 |
for forward with bwd for backward. That's a backward English language model that learns 01:42:47.220 |
to read English backwards. So if you redo this whole thing, put all the documents in reverse, 01:42:54.420 |
and change this to backward, you now have a second classifier which classifies things 01:42:59.300 |
by positive or negative sentiment based on the reverse document. If you then take the 01:43:07.740 |
two predictions and take the average of them, you basically have a bidirectional model that 01:43:13.020 |
you've trained each bit separately. That gets you to 95.4% accuracy. 01:43:22.900 |
So this kind of 20% reduction in error relative to the state-of-the-art is almost unheard of. You have to go back 01:43:32.020 |
to Geoffrey Hinton's ImageNet computer vision result, where they chopped 30% off the state-of-the-art error. 01:43:39.380 |
It doesn't happen very often. So you can see this idea of just use transfer learning is 01:43:47.880 |
ridiculously powerful, but every new field thinks their new field is too special and 01:43:55.140 |
you can't do it. So it's a big opportunity for all of us. 01:44:02.980 |
So we turned this into a paper, and when I say we, I did it with this guy, Sebastian 01:44:07.420 |
Ruder. You might remember his name because in lesson 5 I told you that I actually had 01:44:14.180 |
shared lesson 4 with Sebastian because I think he's an awesome researcher who I thought might 01:44:20.060 |
like it. I didn't know him personally at all. And much to my surprise, he actually watched 01:44:27.100 |
the damn video. I was like, what NLP researcher is going to watch some beginner's video? He 01:44:33.900 |
watched the whole video and he was like, that's actually quite fantastic. Well, thank you 01:44:38.740 |
very much, that's awesome coming from you. And he said, hey, we should turn this into 01:44:44.580 |
a paper. And I said, I don't write papers, I don't care about papers, I'm not interested 01:44:50.700 |
in papers, that sounds really boring. And he said, okay, how about I write the paper 01:44:58.100 |
for you? And I said, you can't really write a paper about this yet because you'd have 01:45:04.780 |
to do studies to compare it to other things, they're called ablation studies to see which 01:45:08.500 |
bits actually work. There's no rigor here, I just put in everything that came in my head 01:45:12.780 |
and chucked it all together and it happened to work. And it's like, okay, what if I write 01:45:17.380 |
all the paper and do all the ablation studies, then can we write the paper? And I said, well, 01:45:23.740 |
it's like a whole library that I haven't documented and I'm not going to yet and you don't know 01:45:31.060 |
how it all works. He said, okay, if I write the paper and do the ablation studies and 01:45:35.300 |
figure out from scratch how the code works without bothering you, then can we write the 01:45:38.860 |
paper? I was like, yeah, if you did all those things, you can write the paper. And he was 01:45:48.740 |
like, okay. And so then two days later he comes back and he says, okay, I've done a 01:45:51.580 |
draft with the paper. So I share this story to say like, if you're some student in Ireland 01:46:02.700 |
and you want to do good work, don't let anybody stop you. I did not encourage him to say the 01:46:10.940 |
least. But in the end he was like, look, I want to do this work, I think it's going to 01:46:16.300 |
be good and I'll figure it out. And he wrote a fantastic paper and he did the ablation 01:46:22.420 |
studies and he figured out how fast AI works and now we're planning to write another paper 01:46:27.420 |
together. You've got to be a bit careful because sometimes I get messages from random people 01:46:36.300 |
saying like, I've got lots of good ideas, can we have coffee? I can have coffee at my 01:46:43.980 |
office any time, thank you. But it's very different to say like, hey, I took your ideas 01:46:49.660 |
and I wrote a paper and I did a bunch of experiments and I figured out how your code works. I added 01:46:53.700 |
documentation to it, should we submit this to a conference? Do you see what I mean? There's 01:47:02.300 |
nothing to stop you doing amazing work and if you do amazing work that helps somebody 01:47:08.660 |
else, like in this case, I'm happy that we have a paper. I don't deeply care about papers, 01:47:15.700 |
but I think it's cool that these ideas now have this rigorous study. Let me show you 01:47:20.220 |
what he did. He took all my code, so I'd already done all the fastai.text stuff and so on. 01:47:29.580 |
As you've seen, it lets us work with large corpuses. Sebastian is fantastically well 01:47:36.660 |
read and he said here's a paper that Yann LeCun and guys just came out with where they tried 01:47:41.620 |
lots of different classification data sets, so I'm going to try running your code on all 01:47:46.500 |
these data sets. These are the data sets. Some of them had many, many hundreds of thousands 01:47:52.940 |
of documents and they were far bigger than anything I had tried, but I thought it should 01:47:57.620 |
work. He had a few good little ideas as we went along and so you should totally make 01:48:07.980 |
sure you read the paper. He said this thing that you called in the lessons differential 01:48:18.100 |
learning rates, differential means something else. Maybe we should rename it. It's now called 01:48:25.100 |
discriminative learning rates. This idea that we had from Part 1 where we used different 01:48:29.620 |
learning rates for different layers, after doing some literature research, it does seem 01:48:34.940 |
like that hasn't been done before so it's now officially a thing, discriminative learning 01:48:41.540 |
So all these ideas, this is something we learned in Lesson 1. It now has an equation with Greek 01:48:46.740 |
and everything. When you see an equation with Greek and everything, that doesn't necessarily 01:48:52.300 |
mean it's more complex than anything we did in Lesson 1 because this one isn't. Again, 01:48:57.420 |
that idea of unfreezing a layer at a time also seems to have never been done before, 01:49:03.540 |
so it's now a thing and it's got the very clever name gradual unfreezing. 01:49:11.180 |
So then, long promised, we're going to look at this, slanted triangular learning rates. 01:49:19.860 |
So this actually was not my idea. Leslie Smith, one of my favorite researchers who you all 01:49:25.780 |
now know about, emailed me a while ago and said I'm so over cyclical learning 01:49:31.780 |
rates, I don't do that anymore, I now do a slightly different version where I have one 01:49:35.280 |
cycle which goes up quickly at the start and then comes down slowly afterwards. And he said 01:49:40.900 |
I often find it works better, I tried going back over all of my old data sets and it worked 01:49:48.060 |
So this is what the learning rate looks like. You can use it in fastAI just by adding 01:49:53.540 |
use_clr= to your fit call. This first number is the ratio between the highest learning rate and 01:50:01.100 |
the lowest learning rate. So here this is 1/32 of that. The second number is the ratio 01:50:07.880 |
between the first peak and the last peak. And so the basic idea is if you're doing a cycle 01:50:15.340 |
length 10 and you want the first epoch to be the upward bit and the other 9 epochs to 01:50:23.700 |
be the downward bit, then you would use 10. And I find that works pretty well, that was 01:50:28.660 |
also Leslie's suggestion, make about 1/10 of it the upward bit and about 9/10 the downward bit. 01:50:36.940 |
Since he told me about it, maybe two days ago, he wrote this amazing paper, a disciplined 01:50:43.880 |
approach to neural network hyperparameters, in which he described something very slightly 01:50:49.440 |
different to this again, but the same basic idea. This is a must-read paper. It's got all 01:50:57.220 |
the kinds of ideas that fastAI talks about a lot in great depth, and nobody else is talking 01:51:05.100 |
about this stuff. It's kind of a slog, unfortunately Leslie had to go away on a trip before he 01:51:12.020 |
really had time to edit it properly, so it's a little bit slow reading, but don't let that 01:51:19.740 |
So this triangle, this is the equation from my paper with Sebastian. Sebastian was like, 01:51:24.220 |
"Jeremy, can you send me the math equation behind that code you wrote?" And I was like, 01:51:29.100 |
"No, I just wrote the code, I could not turn it into math." So he figured out the math 01:51:37.140 |
So you might have noticed the first layer of our classifier was equal to embedding size 01:51:47.820 |
times 3. Why times 3? Times 3 because, and again this seems to be something which people 01:51:54.960 |
haven't done before, so a new idea, concat pooling, which is that we take the average 01:52:04.460 |
pooling over the sequence of the activations, the max pooling of the sequence over the activations, 01:52:10.940 |
and the final set of activations and just concatenate them all together. 01:52:14.820 |
Again, this is something which we talked about in Part 1, but it doesn't seem to be in the 01:52:20.940 |
literature before, so it's now called concat pooling, and again it's now got an equation 01:52:25.940 |
and everything, but this is the entirety of the implementation. Pool with average, pool 01:52:32.580 |
with max, concatenate those two along with the final sequence. 01:52:38.460 |
So you can go through this paper and see how the fastai code implements each piece. 01:52:47.100 |
So then, to me one of the kind of interesting pieces is the difference between RNN encoder, 01:52:55.180 |
which you've already seen, and multibatch RNN encoder. So what's the difference there? 01:53:00.780 |
So the key difference is that the normal RNN encoder for the language model, we could just 01:53:05.900 |
do a bptt-sized chunk at a time, no problem, and predict the next word. 01:53:16.420 |
But for the classifier, we need to do the whole document. We need to do the whole movie 01:53:22.300 |
review before we decide if it's positive or negative. And the whole movie review can easily 01:53:26.780 |
be 2000 words long, and I can't fit 2000 words worth of gradients in my GPU memory for every 01:53:37.700 |
single one of my activations -- sorry, for every one of my weights. So what do I do? 01:53:44.700 |
And so the idea was very simple, which is I go through my whole sequence length one 01:53:52.140 |
batch of bptt at a time, and I call super.forward, so in other words the RNN encoder, to grab 01:54:07.060 |
And then I've got this maximum sequence length parameter where it says, okay, as long as you're 01:54:17.260 |
doing no more than that sequence length, then start appending it to my list of outputs. 01:54:25.380 |
So in other words, the thing that it sends back to this pooling is only as many activations 01:54:37.740 |
as we've asked it to keep. And so that way you can basically figure out what 01:54:45.540 |
max_seq your particular GPU can handle. 01:54:51.940 |
So it's still using the whole document, but let's say max_seq is 1000 words, and your longest 01:54:59.540 |
document length is 2000 words. Then it's still going through the RNN creating state for those 01:55:05.700 |
first 1000 words, but it's not actually going to store the activations for the backprop 01:55:14.180 |
for the first 1000, it's only going to keep the last 1000. 01:55:17.500 |
So that means that it can't backprop the loss back to any state that was created in the 01:55:31.680 |
So it's a really simple piece of code, and honestly when I wrote it, I didn't spend much 01:55:39.500 |
time thinking about it, it seems so obviously the only way that this could possibly work. 01:55:44.500 |
But again, it seems to be a new thing, so we now have backprop through time for text classification. 01:55:50.420 |
So you can see there's lots of little pieces in this paper. 01:55:59.020 |
So the result was on every single dataset we tried, we got a better result than any 01:56:11.460 |
So IMDB, TREC-6, AG News, DBpedia, Yelp, all different types. 01:56:20.820 |
And honestly IMDB was the only one I spent any time trying to optimize the model, so 01:56:25.660 |
like most of them we just did it like whatever came out first, so if we actually spent time 01:56:33.380 |
And the things that these are comparing to, most of them are, you'll see they're different 01:56:40.180 |
on each table because they're optimized, these are like customized algorithms on the whole. 01:56:45.500 |
So this is saying one simple fine-tuning algorithm can beat these really customized algorithms. 01:56:56.420 |
And so here's one of the really cool things that Sebastian did with his ablation studies, 01:57:02.580 |
which is I was really keen that if we were going to publish a paper we had to say why 01:57:08.980 |
So Sebastian went through and tried removing all of those different contributions I mentioned. 01:57:22.340 |
What if we don't use discriminative learning rates? 01:57:24.860 |
What if instead of discriminative learning rates we use cosine annealing? 01:57:28.900 |
What if we don't do any pre-training with Wikipedia? 01:57:40.580 |
And the really interesting one to me was what's the validation error rate on IMDB if we only 01:57:46.980 |
use 100 training examples versus 200 versus 500? 01:57:50.940 |
And you can see, very interestingly, the full version of this approach is nearly as accurate 01:58:01.140 |
on just 100 training examples, like it's still very accurate versus 20,000 training examples. 01:58:09.460 |
Whereas if you're training from scratch on 100, it's almost random. 01:58:14.540 |
So it's what I expected, as I kind of said to Sebastian. 01:58:18.660 |
I really think this is most beneficial when you don't have much data, and this is like 01:58:23.940 |
where FastAI is most interested in contributing, small data regimes, small compute regimes 01:58:33.100 |
So I want to show you a couple of tricks as to how you can run these kinds of studies. 01:58:42.940 |
The first trick is something which I know you're all going to find really handy. 01:58:49.060 |
I know you've all been annoyed when you're running something in a Jupyter notebook and 01:58:52.620 |
you lose your internet connection for long enough that it decides you've gone away and 01:58:57.740 |
then your session disappears and you have to start it again from scratch. 01:59:05.780 |
There's a very simple cool thing called VNC, where basically you can install on your AWS 01:59:13.460 |
instance or Paperspace or whatever: X Windows, a lightweight window manager, a VNC server, 01:59:28.500 |
Tack these lines onto the end of your VNC xstartup configuration file, and then run this command. 01:59:38.700 |
It's now running a server where you can then run a VNC viewer on your computer, 01:59:59.180 |
Specifically, what you do is you use SSH port forwarding to forward port 5913 to localhost 5913. 02:00:13.740 |
And so then you connect to port 5913 on localhost, send it off to port 5913 on your server, which 02:00:25.460 |
is the VNC port because you said colon 13 here, and it will display an xWindows desktop. 02:00:32.460 |
And then you can click on the Linux start like button and click on Firefox, and you 02:00:37.420 |
now have Firefox, and you'll see here in Firefox it says localhost because this Firefox is running on the server itself. 02:00:47.780 |
So you now run Firefox, you start your thing running, and then you close your VNC viewer, 02:00:53.700 |
remembering that Firefox is like displaying on this virtual VNC display, not on a real display. 02:01:00.660 |
And so then later on that day, you log back into VNC viewer and it pops up again, so it's 02:01:08.300 |
And it's shockingly fast, it works really well. 02:01:14.020 |
And there's lots of different VNC servers and clients and whatever, but this one worked 02:01:19.860 |
So you can see here I connect to localhost 5913. 02:01:34.960 |
So I ended up creating a little Python script for Sebastian to say this is the basic steps 02:01:39.960 |
you need to do, and now you need to create different versions for everything else, and 02:01:43.300 |
I suggested to him that he tried using this thing called Google Fire. 02:01:47.140 |
What Google Fire does is you create a function with shitloads of parameters. 02:01:53.100 |
And so these are all the things that Sebastian wanted to try doing. 02:01:56.100 |
Different dropout amounts, different learning rates, do I use pre-training or not, do I 02:02:00.380 |
use CLR or not, do I use discriminative learning rates or not, do I go backwards or not, blah 02:02:06.940 |
So you create a function, and then you add something saying if name equals main, fire.fire, 02:02:11.900 |
and the function name, you do nothing else at all. 02:02:14.580 |
You don't have to add any metadata, any docstrings, anything at all, and you then call that script 02:02:20.460 |
and automatically you now have a command line interface, and that's it. 02:02:27.060 |
So that's a super fantastic easy way to run lots of different variations in a terminal. 02:02:34.700 |
And this ends up being easier if you want to do lots of variations than using a notebook 02:02:40.500 |
because you can just have a bash script that tries all of them and spits them all out. 02:02:48.180 |
You'll find inside the courses/dl2 directory, there's now something called imdb_scripts, 02:02:58.040 |
and I've put there all of the scripts that Sebastian and I used. 02:03:02.780 |
So you'll see because we needed to tokenize every single dataset, we had to turn every 02:03:10.460 |
dataset and numericalize every dataset, we had to train a language model on every dataset, 02:03:15.220 |
we had to train and classify every dataset, we had to do all of those things in a variety 02:03:18.420 |
of different ways to compare them, we had a script for all of those things. 02:03:21.940 |
So you can check out and see all of the scripts that we used. 02:03:32.460 |
When you're doing a lot of scripts and stuff, you've got different code all over the place, 02:03:37.420 |
eventually it might get frustrating that you don't want to symlink your fastai library 02:03:43.340 |
again and again, but you probably don't want to pip-install it because that version tends 02:03:48.460 |
to be a little bit old, we move so fast you want to use the current version in git. 02:03:54.120 |
If you say pip install -e . from the fastai repo base, it does something quite neat which basically 02:04:03.780 |
creates a symlink to the fastai library inside your site packages directory. 02:04:15.540 |
Your site packages directory is like your main Python library. 02:04:20.900 |
And so if you do this, you can then access fastai from anywhere, but every time you do 02:04:28.060 |
git pull, you've got the most recent version. One downside of this is that it installs any 02:04:34.980 |
updated versions of packages from pip which can confuse conda a little bit. 02:04:42.120 |
So another alternative here is just to symlink the fastai library to your site packages library. 02:04:50.980 |
That works just as well. And then you can use fastai again from anywhere, and it's quite 02:04:57.740 |
handy when you want to run scripts that use fastai from different directories on your 02:05:07.420 |
So one more thing before we go, which is something you can try if you like. 02:05:17.660 |
You don't have to tokenize words. Instead of tokenizing words, you can tokenize what are called subword units. 02:05:29.140 |
And so for example, unsupervised could be tokenized as un + supervised. Tokenizer could 02:05:40.820 |
be tokenized as token + izer. And then you can do the same thing, the language model that 02:05:47.780 |
works on subword units, the classifier that works on subword units, etc. 02:05:55.740 |
So how well does that work? I started playing with it and with not too much playing, I was 02:06:04.060 |
getting classification results that were nearly as good as using word-level tokenization. 02:06:14.860 |
I suspect with more careful thinking and playing around, maybe I could have got as good or 02:06:21.060 |
better. But even if I couldn't, if you create a subword unit wiki text model, then IMDB model, 02:06:34.060 |
language model, and then classifier forwards and backwards for subword units, and then 02:06:39.340 |
ensemble it with the forwards and backwards word-level ones, you should be able to beat 02:06:46.220 |
So here's an approach you may be able to beat our state-of-the-art result. 02:06:52.340 |
Google has, as Sebastian told me about this particular project, Google has a project called 02:06:57.780 |
SentencePiece, which actually uses a neural net to figure out the optimal splitting up 02:07:05.900 |
of words, and so you end up with a vocabulary of subword units. In my playing around, I 02:07:12.700 |
found that creating a vocabulary of about 30,000 subword units seems to be about optimal. 02:07:19.940 |
So if you're interested, there's something you can try. It's a bit of a pain to install. 02:07:25.300 |
It's C++. It doesn't have great error messages. But it will work. There is a Python library 02:07:31.780 |
for it, and if anybody tries this, I'm happy to help them get it working. There's been 02:07:38.540 |
little if any experiments with ensembling subword and word-level stuff classification, 02:07:46.060 |
and I do think it should be the best approach. 02:07:48.620 |
Alright, thanks everybody. Have a great week and see you next Monday.