Lesson 11: Deep Learning Part 2 2018 - Neural Translation
Chapters
0:00 Super Convergence
1:58 One Cycle
3:57 Kubeflow
5:41 Neural Translation
9:44 Code
13:32 Basic Approach
18:18 RNN Review
19:30 Refactoring
21:39 Stacking
22:37 Training
24:48 Tokenizing
26:13 Processing
26:54 Partition
27:19 Attention Layer
32:26 End of Stream Marker
34:13 Fast Text
35:54 Python Dictionary
43:11 Data Loader
48:15 Encoder
00:00:00.000 |
I want to start pointing out a couple of the many cool things that happened this week. 00:00:07.040 |
One thing that I'm really excited about is we briefly talked about how Leslie Smith has 00:00:12.780 |
a new paper out, and basically the paper takes these previous two key papers, cyclical learning 00:00:22.120 |
rates and superconvergence, and builds on them with a number of experiments to show that 00:00:33.220 |
superconvergence lets you train models 5 times faster than previous stepwise approaches. 00:00:41.840 |
It's not 5 times faster than CLR, but it's faster than CLR as well. 00:00:46.440 |
The key is that superconvergence lets you get up to massively high learning rates, somewhere 00:01:00.640 |
So the interesting thing about superconvergence is that you actually train at those very high 00:01:10.400 |
learning rates for quite a large percentage of your epochs, and during that time the loss 00:01:16.000 |
doesn't really improve very much, but the trick is it's doing a lot of searching through 00:01:22.240 |
the space to find really generalizable areas, it seems. 00:01:28.200 |
We kind of had a lot of what we needed in fastAI to achieve this, but we're missing 00:01:32.160 |
a couple of bits, and so Sylvain Gugger has done an amazing job of fleshing out the pieces 00:01:38.180 |
that we're missing, and then confirming that he has actually achieved superconvergence 00:01:46.000 |
I think this is the first time that this has been done that I've heard of outside of Leslie Smith's own work. 00:01:51.320 |
He's got a great blog post up now on 1Cycle, which is what Leslie Smith called this approach. 00:01:59.200 |
And this is actually, it turns out, what 1Cycle looks like. 00:02:03.520 |
It's a single cyclical learning rate, but the key difference here is that the going up bit 00:02:11.400 |
is the same length as the going down bit, so you go up really slowly. 00:02:16.080 |
And then at the end, for like a tenth of the time, you then have this little bit where 00:02:23.760 |
And it's interesting, obviously this is a very easy thing to show, a very easy thing 00:02:29.240 |
Sylvain has added it to fastai; temporarily it's called use_clr_beta, but by the time you watch 00:02:37.160 |
this on the video, it'll probably be called 1cycle or something like that. 00:02:45.920 |
So that's one key piece to getting these massively high learning rates. 00:02:49.780 |
And he shows a number of experiments when you do that. 00:02:52.200 |
A second key piece is that as you do this to the learning rate, you do this to the momentum. 00:02:59.080 |
So when the learning rate's low, it's fine to have a high momentum. 00:03:02.480 |
But when the learning rate gets up really high, your momentum needs to be quite a bit lower. 00:03:10.960 |
So this is also part of what he's added to the library is this cyclical momentum. 00:03:16.040 |
And so with these two things, you can train for about a fifth of the number of epochs. 00:03:24.800 |
Then you can drop your weight decay down by about two orders of magnitude. 00:03:28.720 |
You can often remove most or all of your dropout, and so you end up with something that's trained 00:03:37.600 |
And it actually turns out that Sylvain got quite a bit better accuracy than Leslie Smith's paper. 00:03:42.880 |
His guess, I was pleased to see, is because our data augmentation defaults are better 00:03:52.280 |
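To make the shape of that schedule concrete, here is a minimal sketch of a 1cycle-style schedule (this is illustrative, not fastai's actual implementation; the function name and default hyperparameter values are assumptions), with the learning rate ramping up and back down while momentum does the opposite, plus a short annealing tail:

```python
import numpy as np

def one_cycle_schedule(n_iter, lr_max=1.0, lr_min=0.1,
                       mom_max=0.95, mom_min=0.85, tail_frac=0.1):
    """Rough 1cycle-style schedule: LR goes low -> high -> low while momentum
    mirrors it (high -> low -> high), then a short tail anneals the LR further."""
    n_tail = int(n_iter * tail_frac)
    n_half = (n_iter - n_tail) // 2
    n_tail = n_iter - 2 * n_half                                        # absorb rounding into the tail
    lrs = np.concatenate([np.linspace(lr_min, lr_max, n_half),          # slow ramp up
                          np.linspace(lr_max, lr_min, n_half),          # symmetric ramp down
                          np.linspace(lr_min, lr_min / 100, n_tail)])   # little annealing bit at the end
    moms = np.concatenate([np.linspace(mom_max, mom_min, n_half),       # momentum low when LR is high
                           np.linspace(mom_min, mom_max, n_half),
                           np.full(n_tail, mom_max)])
    return lrs, moms

lrs, moms = one_cycle_schedule(1000)   # one (lr, momentum) pair per training iteration
```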
As I say, there's been so many cool things this week, I'm just going to pick two. 00:04:05.320 |
There's a fairly new project called Kubeflow, which is basically TensorFlow for Kubernetes. 00:04:11.960 |
Hamel wrote a very nice article about magical sequence-to-sequence models, building data 00:04:19.800 |
products on that, using Kubernetes to kind of put that in production and so forth. 00:04:29.320 |
He said that the Google Kubeflow team created a demo based on what he wrote earlier this 00:04:34.560 |
year, directly based on the skills learned in fast.ai, and I will be presenting this technique at KDD. 00:04:41.080 |
KDD is one of the top academic conferences, so I wanted to share this as a motivation 00:04:47.120 |
for folks to blog, which I think is a great point. 00:04:51.760 |
None of us who go out and write a blog really think our blog is actually 00:04:58.120 |
going to be very good, probably nobody's going to read it, and then when people actually 00:05:02.440 |
do like it and read it, it's a great surprise; you just go, oh, it's actually something worthwhile. 00:05:09.840 |
So here is the tool where you can summarize GitHub issues, which is now live. 00:05:19.560 |
So I think that's a great story: if Hamel hadn't put his work out there, none of this 00:05:26.440 |
would have happened, and you can check out his post that made it all happen as well. 00:05:36.440 |
So talking of the magic of sequence-to-sequence models, let's build one. 00:05:44.600 |
So we're going to be specifically working on machine translation. 00:05:51.980 |
So machine translation is something that's been around for a long time, but specifically 00:05:58.040 |
we're going to look at an approach called neural machine translation, which is using neural networks for translation. 00:06:04.920 |
That wasn't really a thing in any kind of meaningful way until a couple of years ago. 00:06:13.320 |
And so thanks to Chris Manning from Stanford for the next three slides. 00:06:19.000 |
In 2015, Chris pointed out that neural machine translation first appeared properly, and it 00:06:25.000 |
was pretty crappy compared to the statistical machine translation approaches that used kind 00:06:28.840 |
of classic feature engineering and standard NLP kind of approaches of lots of stemming 00:06:35.360 |
and fiddling around with word frequencies and n-grams and lots of stuff. 00:06:42.000 |
A year later, it was better than everything else. 00:06:46.040 |
This is on a metric called BLEU, we're not going to discuss the metric because it's not 00:06:49.200 |
a very good metric and it's not very interesting, but it's what everybody uses. 00:06:53.200 |
So that was BLEU as of when Chris did this slide. 00:06:57.320 |
As of now, it's about up here, it's about 30. 00:07:02.800 |
So we're kind of seeing machine translation starting down the path that we saw starting 00:07:10.560 |
computer vision object classification in 2012, I guess, which is we kind of just surpassed 00:07:17.640 |
the state-of-the-art and now we're zipping past it at a great rate. 00:07:24.160 |
It's very unlikely that anybody watching this is actually going to build a machine translation 00:07:30.840 |
model because you can go to translate.google.com and use theirs and it works quite well. 00:07:37.260 |
So why are we learning about machine translation? 00:07:40.120 |
The reason we're learning about machine translation is that the general idea of taking some kind 00:07:44.480 |
of input, like a sentence in French, and transforming it into some other kind of output of arbitrary 00:07:52.200 |
length such as a sentence in English is a really useful thing to do. 00:07:59.080 |
For example, the thing that we just saw that Hamill did takes GitHub issues and turns them 00:08:08.240 |
Another example is taking videos and turning them into descriptions. 00:08:19.720 |
Basically anything where you're spitting out kind of an arbitrary sized output, very often 00:08:24.400 |
that's a sentence, so maybe taking a CT scan and spitting out a radiology report, this 00:08:30.760 |
is where you can use sequence-to-sequence learning. 00:08:37.200 |
So the important thing about a neural machine translation, there's more slides from Chris, 00:08:45.200 |
and generally sequence-to-sequence models is that there's no fussing around with heuristics 00:08:53.520 |
and hacky feature engineering or whatever, it's end-to-end training. 00:08:58.720 |
We're able to build these distributed representations which are shared by lots of concepts within 00:09:03.920 |
a single network, we're able to use long-term state in the RNN, so use a lot more context 00:09:10.640 |
than kind of n-gram type approaches, and in the end the text we're generating uses an 00:09:15.560 |
RNN as well so we can build something that's more fluid. 00:09:19.240 |
We're going to use a bidirectional LSTM with attention, well actually we're going to use 00:09:26.000 |
a bidirectional GRU with attention, but basically the same thing. 00:09:30.120 |
So you already know about bidirectional recurrent neural networks, and attention we're going to learn about today. 00:09:36.240 |
These general ideas you can use for lots of other things as well as Chris points out on 00:09:45.560 |
So let's jump into the code which is in the translate notebook, funnily enough. 00:10:03.960 |
And so we're going to try to translate French into English. 00:10:09.840 |
And so the basic idea is that we're going to try and make this look as much like a standard neural network training problem as possible. 00:10:21.080 |
So we're going to need three things, you all remember the three things: data, a suitable architecture, and a loss function. 00:10:33.400 |
Once you've got those three things, you run fit. 00:10:35.720 |
And all things going well, you end up with something that solves your problem. 00:10:41.920 |
So data, we generally need x y pairs because we need something which we can feed it into 00:10:50.160 |
the loss function and say I took my x value which was my French sentence and the loss 00:10:58.920 |
function says it was meant to generate this English sentence and then you had your predictions 00:11:05.560 |
which you would then compare and see how good it is. 00:11:08.740 |
So therefore we need lots of these tuples of French sentences with their equivalent English sentences. 00:11:18.840 |
Obviously this is harder to find than a corpus for a language model, because for a language model you just need text in one language. 00:11:29.640 |
For any living language whose speakers use computers, 00:11:37.480 |
there will be at least a few gigabytes of text floating around the internet for you to grab. 00:11:42.600 |
So building a language model is only challenging corpus-wise for ancient languages, one of 00:11:49.840 |
our students is trying to do a Sanskrit one for example at the moment, but that's very rarely an issue. 00:11:56.720 |
For translation there are actually some pretty good parallel corpuses available for European 00:12:03.600 |
The European Parliament basically has every sentence in every European language. 00:12:08.960 |
Anything that goes through the UN is translated to lots of languages. 00:12:14.320 |
For French to English we have a particularly nice thing, which is that pretty much any semi-official 00:12:21.400 |
Canadian website will have a French version and an English version. 00:12:27.600 |
This chap Chris Callison-Burch did a cool thing which is basically to try to transform French 00:12:32.400 |
URLs into English URLs by replacing -fr with -en and hoping that that retrieves the equivalent 00:12:39.200 |
document and then did that for lots and lots of websites and ended up creating a huge corpus 00:12:49.400 |
So French to English we have this particularly nice resource. 00:12:54.400 |
So we're going to start out by talking about how to create the data, then we'll look at 00:12:57.520 |
the architecture, and then we'll look at the loss function. 00:13:00.320 |
And so for bounding boxes all of the interesting stuff was in the loss function, but for neural 00:13:08.680 |
translation all of the interesting stuff is going to be in the architecture. 00:13:17.720 |
One of the things I want you to think about particularly is what are the relationships 00:13:22.360 |
and similarities in terms of the task we're doing and how we do it between language modeling and neural translation. 00:13:34.040 |
So the basic approach here is that we're going to take a sentence, so in this case the example 00:13:41.520 |
is English to German, and this slide's from Stephen Merity, we steal everything we can from him. 00:13:49.680 |
We start with some sentence in English, and the first step is to do basically the exact 00:13:54.640 |
same thing we do in a language model, which is to chuck it through an RNN. 00:14:00.200 |
Now with our language model, actually let's not even think about the language model, let's think about the classifier. 00:14:10.160 |
So something that turns this sentence into positive or negative sentiment. 00:14:16.760 |
We had a decoder, something which basically took the RNN output, and from our paper we did three things. 00:14:33.440 |
We took a max pool over all of the time steps, we took a mean pool over all the time steps, 00:14:38.760 |
and we took the value of the RNN at the last time step, stuck all those together, and put them through a linear layer. 00:14:46.920 |
Most people don't do that in most NLP stuff, I think it's something we invented. 00:14:54.040 |
People pretty much always use the last time step, so all the stuff we'll be talking about today uses the last time step. 00:15:01.320 |
So we start out by chucking this sentence through an RNN, and out of it comes some state. 00:15:09.480 |
So some state meaning some hidden state, some vector that represents the output of an RNN 00:15:18.840 |
You'll see the word that Stephen used here was encoder. 00:15:25.640 |
So like when we've talked about adding a custom head to an existing model, like the existing 00:15:31.280 |
pre-trained ImageNet model, for example, we say that's our backbone, and then we stick 00:15:35.480 |
on top of it some head that does the task we want. 00:15:39.620 |
In sequence-to-sequence learning, they use the word encoder. 00:15:44.640 |
But basically it's the same thing, it's some piece of a neural network architecture that 00:15:49.120 |
takes the input and turns it into some representation which we can then stick a few more layers 00:15:56.240 |
on top of to grab something out of it, such as we did for the classifier where we stuck 00:16:03.040 |
a linear layer on top of it to turn it into a sentiment, positive or negative. 00:16:11.800 |
So this time, though, we have something that's a little bit harder than just getting sentiment, 00:16:20.320 |
which is I want to turn this state not into a positive or negative sentiment, but into 00:16:24.800 |
a sequence of tokens, where that sequence of tokens is the German sentence that we want. 00:16:32.640 |
So this is sounding more like the language model than the classifier, because the language 00:16:37.720 |
model had multiple tokens for every input word, there was an output word, but the language 00:16:43.360 |
model was also much easier because the number of tokens in the language model output was 00:16:50.600 |
the same length as the number of tokens in the language model input. 00:16:54.040 |
And not only were they the same length, they exactly matched up. 00:16:58.000 |
Like after word 1 comes word 2, after word 2 comes word 3 and so forth. 00:17:04.040 |
But for translating language, you don't necessarily know that the word 'he' will be translated 00:17:11.860 |
as the first word in the output, and that 'loved' will be the second word in the output. 00:17:15.640 |
In this particular case, unfortunately, they are the same. 00:17:18.960 |
But very often the subject-object order will be different, or there will be some extra 00:17:24.480 |
words inserted, or some pronouns will need to add some gendered article to it, or whatever. 00:17:31.600 |
So this is the key issue we're going to have to deal with is the fact that we have an arbitrary 00:17:37.280 |
length output where the tokens in the output do not correspond to the same order of specific tokens in the input. 00:17:48.000 |
So the general idea is the same, use an RNN to encode the input, turns it into some hidden 00:17:53.880 |
state and then this is the new thing we're going to learn is generating a sequence output. 00:18:00.480 |
So we already know sequence to class, that's IMDB classifier, we already know sequence 00:18:08.060 |
to equal length sequence where it corresponds to the same items, that's the language model 00:18:13.800 |
for example, but we don't know yet how to do a general-purpose sequence to sequence, 00:18:21.240 |
Very little of this will make sense unless you really understand lesson 6, how an RNN works. 00:18:31.320 |
So if some of this lesson doesn't make sense to you and you find yourself wondering what 00:18:36.520 |
does he mean by 'hidden state' exactly, how's that working, go back and rewatch lesson 6 00:18:42.960 |
to give you a very quick review, we learned that an RNN at its heart is a standard fully 00:18:51.000 |
connected network, so here's one with one, two, three, four layers, takes an input and 00:18:57.520 |
puts it through four layers, but then at the second layer it can just concatenate in the 00:19:04.200 |
second input, third layer concatenate in the third input, but we actually wrote this in 00:19:08.520 |
Python as just literally a four-layer neural network, there was nothing else we used other than linear layers. 00:19:20.480 |
We used the same weight matrix every time an input came in, we used the same matrix 00:19:24.200 |
every time we went from one of these states to the next, and that's why these arrows are 00:19:28.440 |
the same color, and so we can redraw that previous thing like this. 00:19:34.960 |
And so not only did we redraw it, but we took four lines of linear linear linear linear code 00:19:43.520 |
in PyTorch and we replaced it with a for loop. 00:19:50.400 |
So remember we had something that did exactly the same thing as this, but it just had four 00:19:55.680 |
lines of code saying linear, linear, linear, linear, and we literally replaced it with a for loop. 00:20:05.380 |
So literally that refactoring, which doesn't change any of the math, any of the ideas, 00:20:12.120 |
any of the outputs, that refactoring is an RNN, it's turning a bunch of separate lines of code into a for loop. 00:20:26.680 |
We could take the output assignment that's outside the loop and put it inside the loop instead. 00:20:34.500 |
And if we do that, we're now going to generate a separate output for every input. 00:20:44.240 |
So in this case, this particular one here, the hidden state gets replaced each time and 00:20:49.960 |
we end up just spitting out the final hidden state. 00:20:52.880 |
So this one is this example, but if instead we had something that said hs.append(h) 00:21:02.200 |
and returned hs at the end, that would be this picture. 00:21:08.120 |
And so go back and relook at that notebook if this is unclear. 00:21:10.680 |
I think the main thing to remember is when we say hidden state, we're referring to a 00:21:17.480 |
See here, here's the vector: h = torch.zeros(n_hidden). 00:21:24.880 |
Now of course it's a vector for each thing in the mini-batch, so it's a matrix. 00:21:29.720 |
But generally when I speak about these things, I ignore the mini-batch piece and treat it as if it's just a single vector. 00:21:42.040 |
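As a quick reminder of what that lesson 6 refactoring looks like, here is a minimal sketch of an RNN written as a plain loop over linear layers (class and variable names are illustrative, not the lesson's exact code):

```python
import torch
import torch.nn as nn

class TinyRNN(nn.Module):
    """An RNN as an explicit loop: one reused input matrix, one reused
    hidden-to-hidden matrix, and the hidden state h is a vector per batch item."""
    def __init__(self, n_vocab, n_emb, n_hidden):
        super().__init__()
        self.emb = nn.Embedding(n_vocab, n_emb)
        self.l_in = nn.Linear(n_emb, n_hidden)          # same weights for every input step
        self.l_hidden = nn.Linear(n_hidden, n_hidden)   # same weights for every state-to-state arrow
        self.l_out = nn.Linear(n_hidden, n_vocab)

    def forward(self, xs):                              # xs: sequence length x batch size of token ids
        bs = xs.size(1)
        h = torch.zeros(bs, self.l_hidden.out_features) # the "hidden state" vector
        outs = []
        for x in xs:                                    # the for loop that replaced linear-linear-linear-linear
            h = torch.tanh(self.l_in(self.emb(x)) + self.l_hidden(h))
            outs.append(self.l_out(h))                  # output inside the loop: one output per input
        return torch.stack(outs)
```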
We also learned that you can stack these layers on top of each other. 00:21:45.840 |
So rather than this first RNN spitting out output, it could just spit out inputs into a second RNN. 00:21:53.440 |
And if you're thinking at this point, "I think I understand this, but I'm not quite sure," 00:22:00.320 |
if you're anything like me, that means you don't understand this. 00:22:04.160 |
And the only way you know that you actually understand it is to go and write this from scratch. 00:22:12.520 |
And if you can't do that, then you don't understand it. 00:22:16.080 |
You can go back and rewatch Lesson 6 and check out the notebook and copy some of the ideas 00:22:20.960 |
until you can; it's really important that you can write that from scratch. 00:22:28.400 |
So you want to make sure you create a two-layer RNN. 00:22:34.120 |
And this is what it looks like if you unroll it. 00:22:38.840 |
So that's the goal, is to get to a point that we first of all have these X, Y pairs of sentences, 00:22:48.360 |
So we're going to start by downloading this dataset, and training a translation model 00:22:59.080 |
Google's translation model has 8 layers of RNN stacked on top of each other. 00:23:03.920 |
There's no conceptual difference between 8 layers and 2 layers, it's just like if you're 00:23:09.680 |
Google and you have more GPUs or TPUs than you know what to do with, then you're fine 00:23:13.360 |
doing that, whereas in our case it's pretty likely that the kind of sequence-to-sequence 00:23:17.640 |
models we're building are not going to require that level of computation. 00:23:21.880 |
So to keep things simple, let's do a cut-down thing where rather than learning how to translate 00:23:28.000 |
French into English for any sentence, let's learn to translate French questions into English 00:23:34.440 |
And specifically questions that start with what, where, which, when. 00:23:38.280 |
So you can see here I've got a regex that looks for things that start with wh and end with a question mark. 00:23:43.160 |
So I just go through the corpus, open up each of the two files, each line is one parallel 00:23:49.120 |
text, zip them together, grab the English question, the French question, and check whether they match the regexes. 00:23:57.760 |
Dump that out as a pickle so that I don't have to do it again. 00:24:00.880 |
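Roughly, that filtering step might look something like this (the file names and exact regexes here are assumptions based on the description, not necessarily the notebook's exact code):

```python
import re
import pickle

# Hypothetical file names for the two sides of the parallel corpus.
en_fname, fr_fname = 'corpus.en', 'corpus.fr'

re_eq = re.compile(r'^(Wh[^?.!]+\?)')   # English questions starting with "Wh..." and ending in "?"
re_fq = re.compile(r'^([^?.!]+\?)')     # French side: any single question

lines = ((re_eq.search(eq), re_fq.search(fq))
         for eq, fq in zip(open(en_fname), open(fr_fname)))
qs = [(e.group(), f.group()) for e, f in lines if e and f]   # keep pairs where both sides matched

pickle.dump(qs, open('fr-en-qs.pkl', 'wb'))   # dump as a pickle so we don't have to do it again
```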
And so we now have 52,000 sentences, and here are some examples of those sentence pairs. 00:24:09.480 |
One nice thing about this is that what, who, where type questions tend to be fairly short, 00:24:17.960 |
But I would say the idea that we could learn from scratch with no previous understanding 00:24:24.160 |
of the idea of language, let alone of English or of French, that we could create something 00:24:28.840 |
that can translate one to the other for any arbitrary question with only 50,000 sentences, 00:24:34.880 |
sounds like a ludicrously difficult thing to ask this to do. 00:24:39.840 |
So I would be impressed if we could make any progress whatsoever. 00:24:43.480 |
This is very little data to do a very complex exercise. 00:24:49.400 |
So this contains the tuples of French and English. 00:24:52.840 |
You can use this handy idiom to split them apart into a list of English questions and 00:24:59.560 |
And then we tokenize the English questions and we tokenize the French questions. 00:25:05.200 |
So remember that just means splitting them up into separate words or word-like things. 00:25:12.440 |
By default, the tokenizer that we have here, and remember this is a wrapper around the 00:25:17.880 |
spaCy tokenizer, which is a fantastic tokenizer. 00:25:22.120 |
This wrapper by default assumes English, so to ask for French, you just add an extra parameter. 00:25:28.240 |
The first time you do this, you'll get an error saying that you don't have the spaCy 00:25:31.600 |
French model installed, and you can google to get the command, something like python -m spacy 00:25:37.600 |
download fr or something like that, to grab the French model. 00:25:44.480 |
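A bare-bones version of that French tokenization with plain spaCy might look like this (fastai's Tokenizer wrapper does more, e.g. adding case markers; the model name and download command depend on your spaCy version):

```python
import spacy

# Assumes the French model has been downloaded, e.g. `python -m spacy download fr`
# (the model name and command vary between spaCy versions).
nlp_fr = spacy.load('fr')

def tokenize_fr(sentences):
    """Split French sentences into lowercased tokens; spaCy's French rules handle
    the apostrophes and hyphens that an English tokenizer would get wrong."""
    return [[tok.text.lower() for tok in nlp_fr(s)] for s in sentences]

tokenize_fr(["Qu'est-ce que c'est ?"])
```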
I don't think any of you are going to have RAM problems here because this is not particularly 00:25:48.440 |
big corpus, but I know that some of you were trying to train new language models during the week and running out of memory. 00:25:55.280 |
If you do, it's worth knowing what these functions are actually doing. 00:25:59.420 |
So for example, this one here is processing every sentence across multiple processes. 00:26:05.560 |
And remember, fastai code is designed to be pretty easy to read, so here's the three lines 00:26:17.200 |
of code for proc_all_mp: find out how many CPUs you have, divide by two, because normally 00:26:23.200 |
with hyperthreading they don't actually all give you a speedup, then in parallel run this processing function. 00:26:32.720 |
So that's going to spit out a whole separate Python process for every CPU you have. 00:26:37.720 |
If you have a lot of cores, a lot of Python processes, everyone's going to load all this 00:26:42.580 |
data in and that can potentially use up all your RAM. 00:26:46.720 |
So you could replace that with just proc_all rather than proc_all_mp to use less RAM. 00:26:58.400 |
So at the moment we're calling this function partition_by_cores, which calls partition 00:27:04.920 |
on a list and asks to split it into a number of equal length things according to how many 00:27:10.920 |
CPUs you have, so you could replace that by splitting it into a smaller number of chunks and running it over those. 00:27:21.440 |
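The idea behind proc_all_mp and partition_by_cores is roughly the following (this is an illustrative re-implementation, not fastai's exact code; the tokenizer stand-in is a placeholder):

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

def partition(a, sz):
    """Split list a into chunks of size sz."""
    return [a[i:i + sz] for i in range(0, len(a), sz)]

def partition_by_cores(a):
    n_cpus = max(1, multiprocessing.cpu_count() // 2)    # halve because of hyperthreading
    return partition(a, (len(a) + n_cpus - 1) // n_cpus)

def proc_chunk(chunk):
    return [s.lower().split() for s in chunk]            # placeholder for the real tokenizer

def proc_all_mp(sentences):
    chunks = partition_by_cores(sentences)
    with ProcessPoolExecutor(len(chunks)) as ex:         # one Python process per chunk
        results = list(ex.map(proc_chunk, chunks))       # each process loads its own copy of the data
    return sum(results, [])                              # flatten back into one list of token lists
```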
Was an attention layer tried in the language model? 00:27:23.920 |
Do you think it would be a good idea to try and add one? 00:27:27.640 |
We haven't learned about attention yet, so let's ask about things that we have got to so far. 00:27:34.360 |
The short answer is no, I haven't tried it properly; yes, you should try it, because it might well help. 00:27:41.280 |
In general, there's going to be a lot of things that we cover today, which if you've done 00:27:46.360 |
some sequence-to-sequence stuff before, you'll want to know about something we haven't covered 00:27:51.480 |
I'm going to cover all the sequence-to-sequence things. 00:27:53.480 |
So at the end of this, if I haven't covered the thing you wanted to know about, please 00:27:58.600 |
If you ask me before, I'll be answering something based on something I'm about to teach you. 00:28:05.160 |
So having tokenized the English and French, you can see how it gets split out. 00:28:10.320 |
You can see that tokenization for French is quite different looking because French loves 00:28:15.080 |
their apostrophes and their hyphens and stuff. 00:28:17.920 |
So if you try to use an English tokenizer for a French sentence, you're going to get a pretty poor result. 00:28:25.040 |
So I don't find you need to know heaps of NLP ideas to use deep learning for NLP, but 00:28:31.240 |
just some basic stuff, like using the right tokenizer for your language, is important. 00:28:38.040 |
And so some of the students this week in our study group have been trying to build language 00:28:42.080 |
models for Chinese, for instance, which of course doesn't really have the same concept of a word. 00:28:50.440 |
So we've been starting to look at, briefly mentioned last week, this Google thing called 00:28:55.040 |
SentencePiece, which basically splits things into arbitrary subword units. 00:29:01.020 |
And so when I say tokenize, if you're using a language that doesn't have spaces in it, you 00:29:06.800 |
should probably be checking out SentencePiece or some other similar subword unit thing instead. 00:29:13.800 |
And hopefully in the next week or two we'll be able to report back with some early results 00:29:25.680 |
So having tokenized it, we'll save that to disk. 00:29:27.880 |
And then remember the next step after we create tokens is to turn them into numbers. 00:29:33.120 |
And to turn them into numbers, we have two steps. 00:29:35.240 |
The first is to get a list of all of the words that appear, and then we turn every word into its index in that list. 00:29:44.560 |
If there are more than 40,000 words that appear, then let's cut it off there so it doesn't get too unwieldy. 00:29:51.240 |
And we insert a few extra tokens for beginning of stream, padding, end of stream, and unknown. 00:30:00.880 |
So if we try to look up something that wasn't in the 40,000 most common, then we use a default dict to return the unknown token instead. 00:30:11.600 |
So now we can go ahead and turn every token into an id by putting it through the string-to-int dictionary we just created. 00:30:20.840 |
And then at the end of that, let's add the number 2, which is end of stream. 00:30:25.720 |
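Putting those numericalization steps together, a sketch might look like this (the token spellings are assumptions; only the special indices 0-3, the 40,000 cutoff, and the trailing 2 for end of stream come from the description above):

```python
import collections
import numpy as np

def toks2ids(tok_sents, max_vocab=40000):
    """Turn tokenized sentences into arrays of ids, plus the two-way vocab mappings."""
    freq = collections.Counter(w for s in tok_sents for w in s)
    itos = [w for w, c in freq.most_common(max_vocab)]            # int -> string
    for i, tok in enumerate(['_bos_', '_pad_', '_eos_', '_unk_']):
        itos.insert(i, tok)                                       # 0=bos, 1=pad, 2=eos, 3=unk
    stoi = collections.defaultdict(lambda: 3,                     # string -> int; unknown words map to 3
                                   {w: i for i, w in enumerate(itos)})
    ids = np.array([[stoi[w] for w in s] + [2] for s in tok_sents],   # append 2 = end of stream
                   dtype=object)
    return ids, itos, stoi
```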
And you'll see the code you see here is the code I write when I'm iterating and experimenting. 00:30:32.880 |
Because 99% of the code I write when I'm iterating and experimenting turns out to be totally 00:30:38.440 |
wrong or stupid or embarrassing and you don't get to see it. 00:30:42.640 |
But there's no point refactoring that and making it beautiful when I'm writing it. 00:30:49.320 |
So I was wanting you to see all the little shortcuts I have. 00:30:52.560 |
So rather than doing this properly and having some constant or something for end of stream 00:30:57.560 |
marker and using it, when I'm prototyping, I just do the easy stuff. 00:31:04.680 |
Not so much that I end up with broken code, but I try to find some mid-ground between 00:31:17.760 |
I just heard him mention that we divide the number of CPUs by 2 because with hyperthreading we 00:31:24.760 |
don't get a speedup using all the hyperthreaded cores. 00:31:27.800 |
Is this based on practical experience, or is there some underlying reason why we wouldn't get the speedup? 00:31:35.560 |
Not all things seem like this, but I definitely noticed that with tokenization, 00:31:41.720 |
hyperthreading seems to slow things down a little bit. 00:31:45.000 |
Also if I use all the cores, often I want to do something else at the same time, like 00:31:51.600 |
generally run some interactive notebook and I don't have any spare room to do that. 00:32:03.120 |
So now for our English and our French, we can grab our list of IDs. 00:32:07.800 |
And when we do that, of course, we need to make sure that we also store the vocabulary. 00:32:11.640 |
There's no point having IDs if we don't know what the number 5 represents. 00:32:17.600 |
So that's our vocabulary, the int-to-string list, and the reverse mapping, string to int, that we can save alongside the IDs. 00:32:28.320 |
So just to confirm it's working, we can go through each ID, convert the int to a string 00:32:33.440 |
and spit that out, and there we have our thing back, now with an end-of-stream marker at the end. 00:32:39.560 |
Our English vocab is 17,000, our French vocab is 25,000. 00:32:44.880 |
So this is not too complex a vocab that we're dealing with, which is nice to know. 00:32:54.840 |
So we spent a lot of time on the forums during the week discussing how pointless word vectors 00:33:01.060 |
are and how you should stop getting so excited about them, we're now going to use them. 00:33:07.760 |
Basically, all the stuff we've been learning about using language models and pre-trained 00:33:13.440 |
proper models rather than pre-trained linear single layers, which is what word vectors 00:33:18.240 |
are, I think applies equally well to sequence-to-sequence, but I haven't tried it yet. 00:33:25.580 |
So Sebastian and I are starting to look at that, slightly distracted by preparing this 00:33:31.560 |
class at the moment, but after this class is done. 00:33:34.080 |
So there's a whole thing, for anybody interested in creating some genuinely new, highly publishable 00:33:41.120 |
results, the entire area of sequence-to-sequence with pre-trained language models hasn't been 00:33:48.160 |
touched yet, and I strongly believe it's going to be just as good as classification stuff. 00:33:54.800 |
And if you work on this and you get to the point where you have something that's looking 00:34:00.400 |
exciting and you want help publishing it, I'm very happy to help co-author papers on it. 00:34:09.980 |
So feel free to reach out if and when you have some interesting results. 00:34:15.040 |
So at this stage, we don't have any of that, so we're going to use very little fast.ai 00:34:22.600 |
actually, and very little in terms of fast.ai ideas. 00:34:31.360 |
Anyway, so let's at least use decent word vectors. 00:34:36.880 |
There are better word vectors now, and fast text is a pretty good source of word vectors. 00:34:42.000 |
There's hundreds of languages available for them, your language is likely to be represented. 00:34:47.100 |
So to grab them, you can click on this link, download word vectors for a language that 00:34:52.080 |
you're interested in, install the fast text Python library. 00:35:01.080 |
It's not available on PyPI, but here's a handy trick. 00:35:04.720 |
If there is a GitHub repo that has a setup.py in it and a requirements.txt in it, you can 00:35:12.760 |
just chuck git+ at the start and then stick that in your pip install and it works. 00:35:19.400 |
Hardly anybody seems to know this, and if you go to the fast text repo, they won't tell you this. 00:35:25.840 |
They'll say you have to download it and cd into it and blah blah blah, but you don't have to. 00:35:30.960 |
You can also use this for the fast.ai library, by the way. 00:35:33.640 |
If you want to pip install the latest version of fast.ai, you can totally do this. 00:35:39.080 |
So you grab the library, import it, load the model, so here's my English model, and here's my French model. 00:35:47.040 |
You'll see there's a text version and a binary version, the binary version's a bit faster, 00:35:50.640 |
we're going to use that, the text version's also a bit buggy. 00:35:55.560 |
And then I'm going to convert it into a standard Python dictionary to make it a bit easier 00:35:59.680 |
to work with, so this is just going to go through each word with a dictionary comprehension 00:36:11.440 |
We can go ahead and look up a word, for example, comma, and that will return a vector. 00:36:17.640 |
The length of that vector is the dimensionality of this set of word vectors, so in this case 00:36:22.600 |
we've got 300-dimensional English and French word vectors. 00:36:31.960 |
For reasons that you'll see in a moment, I also want to find out what the mean of my 00:36:35.680 |
vectors are and the standard deviation of my vectors are. 00:36:38.680 |
So the mean's about zero and the standard deviation is about 0.3. 00:36:49.040 |
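Loading the vectors and building the dictionary might look roughly like this (paths are placeholders, and the module/method names are from the official fastText Python bindings, which may differ slightly between versions):

```python
import numpy as np
import fasttext   # module may be spelled `fastText` depending on the version you installed

en_model = fasttext.load_model('wiki.en.bin')   # placeholder paths to the downloaded binary vectors
fr_model = fasttext.load_model('wiki.fr.bin')

def get_vecs(model):
    """Turn a fastText model into a plain dict of {word: vector}."""
    return {w: model.get_word_vector(w) for w in model.get_words()}

en_vecd, fr_vecd = get_vecs(en_model), get_vecs(fr_model)

dim_en_vec = len(en_vecd[','])                        # 300-dimensional vectors
all_vecs = np.stack(list(en_vecd.values()))
vec_mean, vec_std = all_vecs.mean(), all_vecs.std()   # roughly 0 and 0.3
```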
Often corpuses have a pretty long-tailed distribution of sequence length, and it's the longest sequences 00:36:57.960 |
that kind of tend to overwhelm how long things take and how much memory is used and stuff like that. 00:37:04.380 |
So I'm going to grab, in this case, the 99th and 97th percentile of the English and French sequence lengths and truncate to those. 00:37:16.320 |
Originally I was using the 90th percentile, so these are poorly named variables, so apologies for that. 00:37:26.320 |
We've got our tokenized, numericalized English and French dataset. 00:37:39.880 |
So PyTorch expects a dataset object, and hopefully by now you all can tell me that a dataset 00:37:47.720 |
object requires two things, a length and an indexer. 00:37:53.280 |
So I started out writing this, and I was like, "Okay, I need a seq-to-seq dataset." 00:37:56.320 |
I started out writing it, and I thought, "Okay, we're going to have to pass it our x's and 00:37:59.920 |
our y's and store them away, and then my indexer is going to need to return a numpy array of 00:38:06.960 |
the x's at that point and a numpy array of the y's at that point, and oh, that's it." 00:38:13.120 |
So then after I wrote this, I realized I haven't really written a seq-to-seq dataset, I've just written a completely generic dataset. 00:38:20.240 |
So here's the simplest possible dataset that works for any pair of arrays. 00:38:26.500 |
So it's now poorly named, it's much more general than a seq-to-seq dataset, but that's what I ended up calling it. 00:38:33.600 |
This A function, remember we've got V for variables, T for tensors, A for arrays. 00:38:39.800 |
So this basically goes through each of the things you pass it, if it's not already a 00:38:43.400 |
numpy array, it converts it into a numpy array and returns back a tuple of all of the things 00:38:49.120 |
that you passed it, which are now guaranteed to be numpy arrays. 00:39:04.640 |
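Here is a sketch of what that minimal dataset looks like (names are illustrative; fastai's actual A helper is a little more involved):

```python
import numpy as np
from torch.utils.data import Dataset

def A(*a):
    """Sketch of the helper described above: make sure everything is a numpy array."""
    return tuple(o if isinstance(o, np.ndarray) else np.array(o) for o in a)

class Seq2SeqDataset(Dataset):
    """Works for any pair of arrays: a length and an indexer is all PyTorch needs."""
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __getitem__(self, idx):
        return A(self.x[idx], self.y[idx])
    def __len__(self):
        return len(self.x)
```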
So now we need to grab our English and French IDs and get a training set and a validation set. 00:39:13.360 |
And so one of the things which is pretty disappointing about a lot of code out there on the internet 00:39:18.800 |
is that they don't follow some simple best practices. 00:39:22.240 |
For example, if you go to the PyTorch website, they have an example section for sequence-to-sequence translation. 00:39:31.120 |
Their example does not have a separate validation set. 00:39:33.840 |
I tried it, training according to their settings, and tested it with a validation set, and it turned out to overfit terribly. 00:39:41.800 |
So this is not just a theoretical problem, the actual PyTorch repo has the actual official 00:39:47.720 |
sequence-to-sequence translation example, which does not check for overfitting and overfits terribly. 00:39:54.200 |
Also it fails to use minibatches, so it actually fails to utilize any of the efficiency of PyTorch. 00:40:03.120 |
Even if you find code in the official PyTorch repo, don't assume it's any good at all. 00:40:09.400 |
The other thing you'll notice is that pretty much every other sequence-to-sequence model 00:40:16.640 |
I've found in PyTorch anywhere on the internet has clearly copied from that shitty PyTorch 00:40:22.220 |
repo because it's all the same variable names, it has the same problems, it has the same mistakes. 00:40:28.400 |
Like another example, nearly every PyTorch convolutional neural network I've found does not use adaptive pooling. 00:40:36.120 |
So in other words, the final layer is always like average_pool(7,7). 00:40:41.880 |
So they assume that the previous layer is 7x7, and if you use any other size input you get an error. 00:40:49.200 |
And therefore nearly everybody I've spoken to that uses PyTorch thinks that there is 00:40:53.000 |
a fundamental limitation of CNNs that they are tied to the input size. 00:41:00.240 |
So every time we grab a new model and stick it in the fastai repo, I have to go in, search 00:41:04.680 |
for pool and add adaptive to the start and replace the 7 with a 1, and now it works on any size input. 00:41:12.400 |
So just be careful, it's still early days, and believe it or not, even though most of 00:41:18.280 |
you have only started in the last year your deep learning journey, you know quite a lot 00:41:24.040 |
more about a lot of the more important practical aspects than the vast majority of people that 00:41:28.520 |
are publishing and writing stuff in official repos. 00:41:32.520 |
So you kind of need to have a little more self-confidence than you might expect when reading other people's code. 00:41:39.120 |
If you find yourself thinking that looks odd, it's not necessarily you, right? 00:41:49.880 |
So I would say like at least 90% of deep learning code that I start looking at turns out to 00:41:59.400 |
have like deathly serious problems that make it completely unusable for anything. 00:42:07.960 |
And so I've been telling people that I've been working with recently, if a repo you're 00:42:13.920 |
looking at doesn't have a section on it saying here's the test we did where we got the same 00:42:18.080 |
results as the paper that this is meant to be implementing, that almost certainly means 00:42:21.960 |
they haven't got the same results as the paper they're implementing, they probably haven't 00:42:25.720 |
And if you run it, it definitely won't get those results, because it's hard to get things right the first time. 00:42:32.080 |
It probably takes me 12 goes; it probably takes people smarter than me 6 goes, but 00:42:38.000 |
if they haven't tested it once, it almost certainly won't work. 00:42:49.100 |
Grab a bunch of random numbers, one for each row of your data, and see whether they're above some cutoff to get a list of bools. 00:42:57.820 |
Index into your array with that list of bools to grab a training set. 00:43:01.520 |
Index into that array with the opposite of that list of bools to get your validation set. 00:43:06.080 |
There's lots of ways of doing it, I just like to do different ways to see a few approaches. 00:43:12.960 |
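A sketch of that split (the variable names en_ids/fr_ids and the 0.1 cutoff for a roughly 90/10 split are illustrative assumptions):

```python
import numpy as np

np.random.seed(42)
trn_keep = np.random.rand(len(en_ids)) > 0.1           # one random number per row; True about 90% of the time

en_trn, fr_trn = en_ids[trn_keep], fr_ids[trn_keep]    # index with the list of bools -> training set
en_val, fr_val = en_ids[~trn_keep], fr_ids[~trn_keep]  # index with its opposite -> validation set
```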
So now we can create our data set with our X's and our Y's, French and English. 00:43:17.880 |
If you want to translate instead English to French, switch these two around and you're done. 00:43:26.720 |
We can just grab our data loader and pass in our data set and batch size. 00:43:38.160 |
I'm not going to go into the details about why, we can talk about it during the week 00:43:41.840 |
if you're interested, but have a think about why we might need to transpose their orientation. 00:43:51.120 |
One is that since we've already done all the pre-processing, there's no point spawning 00:43:55.560 |
off multiple workers to do augmentation or whatever because there's no work to do. 00:44:00.680 |
So making num workers equals 1 will save you some time. 00:44:04.880 |
We have to tell it what our padding index is, that's actually pretty important because 00:44:10.320 |
what's going to happen is that we've got different length sentences and fastai will just automatically 00:44:19.160 |
stick them together and pad the shorter ones so that they'll end up equal length. 00:44:24.520 |
Because remember a tensor has to be rectangular. 00:44:31.560 |
In the decoder in particular, I actually want my padding to be at the end, not at the start. 00:44:38.840 |
For a classifier, I want the padding at the start because I want that final token to represent 00:44:47.840 |
But in the decoder, as you'll see, it's going to work out a bit better to have padding at 00:44:52.640 |
And then finally, since we've got sentences of different lengths coming in and they all 00:44:59.000 |
have to be put together in a mini-batch to be the same size by padding, we would much 00:45:04.080 |
prefer that the sentences in a mini-batch are of similar sizes already because otherwise 00:45:10.160 |
it's going to be as long as the longest sentence and that's going to end up wasting time and memory. 00:45:16.280 |
So therefore I'm going to use the sampler trick that we learned last time, which is the sortish sampler. 00:45:22.280 |
We're going to ask it to sort everything by length first. 00:45:27.360 |
And then for the training set, we're going to ask it to randomize the order of things 00:45:32.440 |
but to roughly make it so that things of similar length are about in the same spot. 00:45:37.440 |
So we've got our sort sampler and our sortish sampler. 00:45:41.960 |
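To make the pad-at-the-end idea concrete, here is a generic PyTorch-style collate function sketch (this is not fastai's DataLoader code; the padding index of 1 follows the vocabulary convention above):

```python
import numpy as np
import torch

PAD_IDX = 1   # index of the padding token in the vocabulary above

def pad_collate(batch):
    """Pad every (x, y) pair in a mini-batch to the batch's longest length, padding at the end."""
    xs, ys = zip(*batch)
    def pad_end(seqs):
        max_len = max(len(s) for s in seqs)
        out = np.full((len(seqs), max_len), PAD_IDX, dtype=np.int64)
        for i, s in enumerate(seqs):
            out[i, :len(s)] = s          # real tokens first, padding after (what the decoder wants)
        return torch.from_numpy(out)
    return pad_end(xs), pad_end(ys)
```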
And then at that point, we can create a model_data object. 00:45:45.480 |
For a model_data object, it really does one thing which is it says I have a training set 00:45:50.600 |
and a validation set and an optional test set and sticks them into a single object. 00:45:55.480 |
We also have a path so that it has somewhere to store temporary files, models, stuff like 00:46:01.400 |
So we're not using fast.ai for very much at all in this example, just a minimal set to get us going. 00:46:16.280 |
In the end, once you've got a model_data object, you can then create a learner and you can call fit. 00:46:22.080 |
So that's a minimal amount of fast.ai stuff here. 00:46:28.440 |
This is a standard PyTorch compatible dataset. 00:46:32.600 |
This is a standard PyTorch compatible data loader. 00:46:34.840 |
Behind the scenes, it's actually using the fast.ai version because I do need to do this 00:46:40.960 |
So there's a few tweaks in our version that are a bit faster and a bit more convenient. 00:46:46.880 |
The fast.ai samplers we're using, but there's not too much going on here. 00:46:59.800 |
So as I said, most of the work is in the architecture. 00:47:04.060 |
And so the architecture is going to take our sequence of tokens. 00:47:14.420 |
It's going to spit them into an encoder, or in computer vision terms, what we've been 00:47:23.240 |
calling a backbone, something that's going to try and turn this into some kind of representation. 00:47:31.920 |
That's going to spit out the final hidden state, which for each sentence is just a vector. 00:47:45.560 |
That's all going to be using very direct simple techniques that we've already learnt. 00:47:50.200 |
And then we're going to take that and we're going to spit it into a different RNN, which 00:47:54.960 |
is a decoder, and that's going to have some new stuff because we need something that can generate a sequence of words, one at a time. 00:48:03.840 |
And it's going to keep going until it thinks it's finished the sentence, it doesn't know 00:48:07.840 |
how long the sentence is going to be ahead of time, it keeps going until it thinks it's 00:48:11.480 |
finished the sentence, and then it stops and returns the sentence. 00:48:19.800 |
So in terms of variable naming here, there's basically identical variables for encoder 00:48:27.000 |
and decoder, well attributes for encoder and decoder. 00:48:29.880 |
The encoder versions have 'enc', the decoder versions have 'dec'. 00:48:39.120 |
And so I always try to mention what the mnemonics are, rather than writing things out in too verbose a way. 00:48:48.920 |
So just remember, 'enc' is an encoder, 'dec' is a decoder, and there's an embedding. 00:48:57.680 |
The RNN, in this case, is a GRU, not an LSTM, they're nearly the same thing. 00:49:04.000 |
So don't worry about the difference, you could replace it with an LSTM and you'll get basically 00:49:08.360 |
To replace it with an LSTM, simply type LSTM and you're done. 00:49:17.440 |
So we need to create an embedding layer to take -- because remember what we're being 00:49:23.320 |
passed is the index of the words into a vocabulary, and we want to grab their fast text embedding, 00:49:30.240 |
and then over time we might want to also fine-tune to train that embedding into it. 00:49:37.640 |
So to create an embedding, we'll create embedding up here, so we'll just say nn.embedding. 00:49:43.760 |
So it's important that you know now how to set the rows and columns for your embedding. 00:49:49.160 |
So the number of rows has to be equal to your vocabulary size, so each vocabulary item has its own row. And how big should each embedding be? 00:49:58.480 |
Well in this case it was determined by fast text, and the fast text embeddings are size 300. 00:50:05.120 |
So we have to use size 300 as well, otherwise we can't start out by using their embeddings. 00:50:14.000 |
So what we want to do is this is initially going to give us a random set of embeddings, 00:50:18.760 |
and so we're going to now go through each one of these, and if we find it in fast text, we'll replace the random embedding with the fast text one. 00:50:25.580 |
So again, something that you should already know is that a PyTorch module that is learnable 00:50:34.080 |
has a weight attribute, and the weight attribute is a variable, and the variables have a data 00:50:40.920 |
attribute, and the data attribute is a tensor. 00:50:44.440 |
And you'll notice very often today I'm saying here is something you should know, not so 00:50:48.440 |
that you think oh I don't know that I'm a bad person, but so that you think okay this 00:50:53.760 |
is a concept that I haven't learned yet and Jeremy thinks I ought to know about, and so 00:51:01.400 |
I've got to write that down, and I'm going to go home and Google it. This 00:51:05.640 |
is a normal PyTorch attribute in every single learnable PyTorch module. 00:51:11.280 |
This is a normal PyTorch attribute in every single PyTorch variable. 00:51:15.840 |
And so if you don't know how to grab the weights out of a module, or you don't know how to grab 00:51:19.520 |
the tensor out of a variable, it's going to be hard for you to build new things or debug existing things. 00:51:26.440 |
So if I say you ought to know this, and you're thinking I don't know this, don't run away 00:51:31.200 |
and hide, go home and learn the thing, and if you're having trouble learning the thing, 00:51:36.200 |
because you can't find documentation about it, or you don't understand that documentation, 00:51:40.080 |
or you don't know why Jeremy thought it was important you know it, jump on the forum and 00:51:44.080 |
say please explain this thing, here's my best understanding of that thing as I have it at 00:51:49.360 |
the moment, here's the resources I've looked at, help fill me in. 00:51:54.440 |
And normally if I respond, it's very likely I will not tell you the answer, but I will 00:52:00.520 |
instead give you a problem that you could solve, which, if you solve it, will answer the question for 00:52:06.500 |
you, because I know that that way it will be something you remember. 00:52:10.600 |
So again, don't be put off if I'm like okay, go read this link, try and summarize that 00:52:15.080 |
thing, tell us what you think, like I'm trying to be helpful, not unhelpful, and if you're 00:52:18.920 |
still not following, just come back and say I had a look, honestly that link you sent, 00:52:25.000 |
I don't know what any of it means, I wouldn't know where to start, whatever. 00:52:28.680 |
I'll keep trying to help you until you fully understand it. 00:52:35.680 |
So now that we've got our weight tensor, we can just go through our vocabulary and we can 00:52:42.880 |
look up the word in our pre-trained vectors, and if we find it we will replace the random weights with that pre-trained vector. 00:52:52.680 |
The random weights have a standard deviation of 1, our pre-trained vectors it turned out 00:52:58.920 |
had a standard deviation of about 0.3, so again this is the kind of hacky thing I do when 00:53:03.240 |
I'm prototyping stuff, I just multiply it by 3. 00:53:07.480 |
Obviously by the time you see the video of this and we're able to put all this 00:53:11.320 |
sequence to sequence stuff into the fastai library, you won't find horrible hacks like 00:53:15.360 |
that in there, I sure hope, but hack away when you're prototyping. 00:53:23.080 |
Some things won't be in fast text, in which case we'll just keep track of them, and I've 00:53:27.400 |
just added this print statement here just so that I can kind of see what I'm missing; 00:53:32.860 |
basically I'll probably comment it out when I actually commit this to GitHub. 00:53:43.180 |
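The embedding-initialization step just described might be sketched like this (a rough reconstruction from the description, not necessarily the notebook's exact helper):

```python
import torch
import torch.nn as nn

def create_emb(vecs, itos, em_sz=300):
    """Initialize an embedding layer from the pre-trained fastText vectors where possible."""
    emb = nn.Embedding(len(itos), em_sz, padding_idx=1)
    wgts = emb.weight.data                 # module -> weight (a parameter) -> data (a tensor)
    miss = []
    for i, w in enumerate(itos):
        try:
            # scale by 3 so the pre-trained std (~0.3) roughly matches the random init's std of 1
            wgts[i] = torch.from_numpy(vecs[w] * 3)
        except KeyError:
            miss.append(w)                 # word not in fastText: keep the random initialization
    print(len(miss), miss[:5])             # how many we missed, plus a few examples
    return emb
```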
So we create those embeddings, and so when we actually create the sequence to sequence 00:53:47.640 |
RNN, it will print out how many were missed, and so remember we had about 30,000 words, 00:53:57.040 |
And interestingly, the things that are missing, well there's our special token for uppercase, 00:54:03.260 |
not surprising that's missing, but also remember fast text doesn't work on our tokens, 00:54:10.000 |
it does words, so l' and d' and 's, they're not appearing either. 00:54:16.480 |
That does suggest that maybe we could have slightly better embeddings if we tried to 00:54:21.520 |
find some which would be tokenized the same way we tokenize, but that's okay. 00:54:27.440 |
Do we just keep the embedding vectors for the words that appear in training? 00:54:31.400 |
Why don't we keep all the word embeddings, in case you have new words in the test set? 00:54:42.800 |
We're going to be fine-tuning them, and so I don't know, it's an interesting idea. 00:54:50.440 |
Maybe that would work, I haven't tried it, obviously you can also add random embedding 00:55:05.760 |
to those, and at the beginning just keep them random, but it's going to make an effect in 00:55:13.000 |
the sense that you're going to be using those words. 00:55:18.000 |
I think it's an interesting line of inquiry, but I will say this. 00:55:22.000 |
The vast majority of the time when you're doing this in the real world, your vocabulary 00:55:27.640 |
will be bigger than 40,000, and once your vocabulary is bigger than 40,000, using the 00:55:33.160 |
standard techniques, the embedding layers get so big that it takes up all your memory, 00:55:41.920 |
There are tricks to dealing with very large vocabularies, I don't think we'll have time to 00:55:46.400 |
handle them in this session, but you definitely would not want to have all 3.5 million fast text vectors in your embedding layer. 00:55:59.360 |
I wonder, if you're not touching a word, it's not going to change, given you're fine-tuning. 00:56:09.360 |
It's in GPU RAM, and you've got to remember, 3.5 million times 300 times the size of a 00:56:16.360 |
single-precision floating-point number, plus all of the gradients for them, even if it's 00:56:24.400 |
Without being very careful and adding a lot more code and stuff, it is slow and hard and painful. 00:56:35.480 |
I think it's an interesting path of inquiry, but it's the kind of path of inquiry that 00:56:39.680 |
leads to multiple academic papers, not something that you do on a weekend. 00:56:45.440 |
I think it would be very interesting, maybe we can look at it sometime. 00:56:52.000 |
As I say, I have actually started doing some stuff around incorporating large vocabulary handling into fast.ai. 00:56:58.600 |
It's not finished, but hopefully by the time we get here, this kind of stuff will be possible. 00:57:08.160 |
We create our encoder embedding, add a bit of dropout, and then we create our RNN. 00:57:15.720 |
This input to the RNN obviously is the size of the embedding by definition. 00:57:21.240 |
Number of hidden is whatever we want, so we set it to 256 for now, however many layers 00:57:26.080 |
we want, and some dropout inside the RNN as well. 00:57:31.080 |
This is all standard PyTorch stuff, you could use an LSTM here as well, and then finally 00:57:35.880 |
we need to turn that into some output that we're going to feed to the decoder, so let's 00:57:40.960 |
use a linear layer to convert the number of hidden into the decoder embedding size. 00:57:52.620 |
We first of all initialize our hidden state to a bunch of zeros. 00:57:58.760 |
So we've now got a vector of zeros, and then we're going to take our input and put it through 00:58:05.920 |
our embedding, we're going to put that through dropout, we then pass our currently zeros 00:58:12.440 |
hidden state and our embeddings into our RNN, and it's going to spit out the usual stuff 00:58:18.560 |
that RNN spit out, which includes the final hidden state. 00:58:25.320 |
We're then going to take that final hidden state and stick it through that linear layer, 00:58:29.880 |
so we now have something of the right size to feed to our decoder. 00:58:33.040 |
So that's it, and again this ought to be very familiar and very comfortable, it's like the 00:58:40.800 |
most simple possible RNN, so if it's not, go back, check out lesson 6, make sure you 00:58:46.120 |
can write it from scratch and you understand what it does. 00:58:49.680 |
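Here is a minimal sketch of an encoder along the lines just described (hyperparameters and the decision to return only the top layer's final hidden state through the linear layer are simplifications, not the notebook's exact module):

```python
import torch
import torch.nn as nn

class Seq2SeqEncoder(nn.Module):
    def __init__(self, emb_enc, nh, out_sz, nl=2):
        super().__init__()
        self.nl, self.nh = nl, nh
        self.emb_enc = emb_enc                        # the pre-initialized 300-dim nn.Embedding
        self.emb_drop = nn.Dropout(0.15)
        self.gru = nn.GRU(emb_enc.embedding_dim, nh, num_layers=nl, dropout=0.25)
        self.out = nn.Linear(nh, out_sz)              # hidden size -> decoder embedding size

    def forward(self, inp):                           # inp: sequence length x batch size of ids
        bs = inp.size(1)
        h = torch.zeros(self.nl, bs, self.nh, device=inp.device)   # hidden state starts as zeros
        emb = self.emb_drop(self.emb_enc(inp))
        enc_out, h = self.gru(emb, h)                 # run the whole sequence through the GRU
        return self.out(h[-1])                        # final hidden state of the top layer, for the decoder
```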
But the key thing to know is that it takes our inputs and spits out a hidden vector that 00:59:01.000 |
hopefully will learn to contain all of the information about what that sentence says 00:59:11.340 |
Because if it can't do that, then we can't feed it into a decoder and hope it to spit 00:59:21.080 |
out our sentence in a different language, so that's what we want it to learn to do. 00:59:28.520 |
And we're not going to do anything special to make it learn to do that, we're just going 00:59:31.560 |
to do the three things and cross our fingers because that's what we do. 00:59:48.720 |
I guess Stephen used s for state, I used h for hidden, but there you go, you would think 00:59:53.720 |
that two Australians could agree on something like that, but apparently not. 01:00:03.780 |
And so the basic idea of the new bit is the same, we're going to do exactly the same thing, 01:00:13.560 |
And so the for loop is going to do exactly what the for loop inside pytorch does here, 01:00:21.080 |
So we're going to go through the for loop, and how big is the for loop? 01:00:30.400 |
It's something that got passed to the constructor, and it is equal to the length of the largest English sentence. 01:00:40.080 |
So we're going to do this for loop as long as the largest English sentence, because we're 01:00:45.760 |
translating it into English, so we can't possibly be longer than that. 01:00:55.100 |
If we then used it on some different corpus that was longer, this is going to fail. 01:01:02.720 |
You could always pass in a different parameter, of course. 01:01:07.400 |
So the basic idea is the same, we're going to go through and put it through the embedding, 01:01:12.560 |
we're going to stick it through the RNN, we're going to stick it through dropout, and we're going to stick it through a linear layer. 01:01:22.840 |
And once we've done that, we're then going to append that output to a list, and then 01:01:29.840 |
when we finish, we're going to stack that list up into a single tensor and return it. 01:01:38.840 |
Normally a recurrent neural network works in a whole sequence at a time, but we've got 01:01:47.720 |
a for loop to go through each part of the sequence separately. 01:01:51.520 |
So we have to add a leading unit axis to the start to basically say this is a sequence of length 1. 01:02:00.240 |
So we're not really taking advantage of the recurrent net much at all, we could easily 01:02:06.640 |
That would be an interesting experiment if you wanted to try it. 01:02:11.260 |
So we basically take our input and we feed it into our embedding, and we add something 01:02:19.300 |
to the front saying treat this as a sequence of length 1, and then we pass that to our 01:02:26.800 |
We then get the output of that RNN, feed it into our dropout, and feed it into our linear layer. 01:02:34.960 |
So there's two extra things now to be aware of. The first is: what goes in as the input at each step? 01:02:46.660 |
And the answer is, it's the previous word that we translated. 01:02:52.380 |
See how the input here is the previous word here? 01:02:58.160 |
So the basic idea is, if you're trying to translate, if you're about to translate, tell 01:03:04.960 |
me the fourth word of the new sentence, but you don't know what the third word you just said was, that's going to be really hard. 01:03:13.060 |
So we're going to feed that in at each time step, let's make it as easy as possible. 01:03:19.040 |
And so what was the previous word at the start? 01:03:23.340 |
So specifically we're going to start out with a beginning of stream token. 01:03:42.080 |
So let's start out our decoder with a beginning of stream token, which is zero. 01:03:49.880 |
And of course we're doing a mini-batch, so we need batch size number of them. 01:03:53.400 |
But let's just think about one part of that batch. 01:03:58.480 |
We look up that zero in our embedding matrix to find out what the vector for the beginning 01:04:06.160 |
We stick a unit axis on the front to say we have a single sequence length of beginning 01:04:12.480 |
We stick that through our RNN, which gets not only the fact that there's a zero at the 01:04:18.720 |
beginning of stream, but also the hidden state which at this point is whatever came out of 01:04:26.880 |
So now its job is to try and figure out what is the first word to translate this sentence. 01:04:39.760 |
Pop that through dropout, go through one linear layer in order to convert that into the correct 01:04:45.160 |
size for our decoder embedding matrix, append that to our list of translated words, and 01:04:54.720 |
now we need to figure out what word that was because we need to feed it to the next time 01:05:04.680 |
So remember what we actually output here, and don't forget, use a debugger, pdb.set_trace(), 01:05:19.100 |
So before you look it up in the debugger, try and figure it out from first principles 01:05:24.400 |
So outp is a tensor whose length is equal to the number of words in our English vocabulary, 01:05:32.040 |
and it contains the probability for every one of those words that it is that word. 01:05:38.760 |
So then if we now say outp.data.max, that looks in that tensor to find out which word has the 01:05:48.600 |
highest probability, and max in PyTorch returns two things. 01:05:53.960 |
The first thing is what is that max probability, and the second is what is the index into the 01:06:01.160 |
And so we want that second item, index number 1, which is the word index with the largest 01:06:08.180 |
So now that contains the word, or the word index into our vocabulary of the word. 01:06:17.120 |
If it's a 1, you might remember 1 was padding, then that means we're done. 01:06:22.720 |
That means we've reached the end because we've finished with a bunch of padding. 01:06:30.440 |
Now dec_inp is whatever the highest probability word was. 01:06:37.880 |
So we keep looping through, either until we get to the largest length of a sentence, or 01:06:44.840 |
until everything in our mini-batch is padding. 01:06:49.320 |
And each time we've appended our outputs, not the word but the probabilities, to this 01:06:57.120 |
list which we stack up into a tensor and we can now go ahead and feed that to a loss function. 01:07:04.880 |
So before we go to a break, since we've done 1 and 2, let's do 3, which is a loss function. 01:07:13.600 |
The loss function is categorical cross-entropy loss. 01:07:18.320 |
We've got a list of probabilities for each of our classes, the classes are all the words 01:07:24.640 |
in our English vocab, and we have a target which is the correct class, i.e. the correct 01:07:34.180 |
There's two tweaks, which is why we need to write our own little loss function, but you 01:07:37.360 |
can see basically it's going to be cross-entropy loss. 01:07:42.280 |
Tweak number 1 is we might have stopped a little bit early, and so the sequence length 01:07:49.400 |
that we generated may be different to the sequence length of the target, in which case we need to add some padding. 01:07:59.740 |
If you have a rank 3 tensor, which we do, we have sequence length by batch size by number 01:08:12.360 |
of words in the vocab, a rank 3 tensor requires a 6 tuple. 01:08:19.280 |
Each pair of things in that tuple is the padding before and then the padding after that dimension. 01:08:26.760 |
So in this case, the first dimension has no padding, the second dimension has no padding, 01:08:31.280 |
the third dimension has no padding on the left, and as much padding as is required on the right. 01:08:40.720 |
So that adds any padding that's necessary. 01:08:43.960 |
The only other thing we need to do is cross-entropy loss expects a rank 2 tensor, but we've got a rank 3 tensor. 01:08:53.500 |
So let's just flatten out the sequence length and batch size into a -1 in View. 01:09:01.120 |
So flatten out that for both of them, and now we can go ahead and call cross-entropy. 01:09:08.160 |
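A hedged sketch of that loss function, assuming the prediction is a rank 3 tensor of sequence length x batch size x vocab size and the target is sequence length x batch size (the notebook's version may differ in detail):

```python
import torch.nn.functional as F

def seq2seq_loss(inp, targ):
    sl_t, bs = targ.size()            # target: sequence length x batch size
    sl_i, bs_i, nc = inp.size()       # prediction: sequence length x batch size x vocab size
    if sl_t > sl_i:
        # F.pad on a rank 3 tensor takes a 6-tuple of (before, after) pairs, starting from
        # the last dimension; here only the sequence dimension gets padding, on the right.
        inp = F.pad(inp, (0, 0, 0, 0, 0, sl_t - sl_i))
    inp = inp[:sl_t]                  # and if we generated too much, just line the lengths up
    # cross-entropy wants a rank 2 input, so flatten sequence and batch together with view
    return F.cross_entropy(inp.view(-1, nc), targ.contiguous().view(-1))
```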
So now we can just use standard approach, here's our sequence-to-sequence RNN, that's 01:09:23.240 |
Hopefully by now you've noticed you can call .cuda(), but if you call to_gpu() it doesn't put it on the GPU if you don't have one. 01:09:32.060 |
You can also set fastai.core.USE_GPU to false to force it to not use the GPU, and that can be handy for debugging. 01:09:41.320 |
We then need something that tells it how to handle learning rate groups. 01:09:47.760 |
So there's a thing called SingleModel that you can pass it to which treats the whole model as a single layer group. 01:09:53.920 |
So this is like the easiest way to turn a PyTorch module into a fastai model. 01:10:01.280 |
Here's the model data object we created before. 01:10:05.160 |
We could then just call learner to turn that into a learner, but if we call RNN_learner, 01:10:16.480 |
It defines cross-entropy as the default criteria, in this case we're overriding that anyway, 01:10:22.920 |
But it does add in these save encoder and load encoder things that can be handy sometimes. 01:10:31.400 |
So in this case we really could have just used Learner, but RNN_Learner also works. 01:10:38.400 |
So here's how we turn our PyTorch module into a fastai model into a learner. 01:10:46.120 |
And once we have a learner, give it our new loss function, and then we can call lr_find, 01:10:53.600 |
and we can call fit, and it runs through a while, and we can save it. 01:11:02.760 |
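The wiring described here looks roughly like the following; it uses the 0.7-era fastai names (to_gpu, SingleModel, RNN_Learner, lr_find) as I remember them, so treat the exact calls as assumptions rather than gospel.

```python
# Turn the PyTorch module into a fastai learner and train it.
rnn = Seq2SeqRNN(...)                                # the module sketched above
learn = RNN_Learner(md, SingleModel(to_gpu(rnn)), opt_fn=opt_fn)
learn.crit = seq2seq_loss                            # swap in our custom loss function

learn.lr_find()                                      # pick a learning rate as usual
learn.fit(lr, 1, cycle_len=12, use_clr=(20, 10))
learn.save('seq2seq')
```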
Remember the model attribute of a learner is a standard PyTorch model. 01:11:06.480 |
So we can pass that some x which we can grab out of our validation set, or you could use 01:11:13.440 |
learn.predict_array or whatever you like to get some predictions. 01:11:19.820 |
And then we can convert those predictions into words by going .max(1) to grab the index 01:11:26.000 |
of the highest probability words to get some predictions. 01:11:30.440 |
And then we can go through a few examples and print out the French, the correct English, 01:11:37.680 |
and the predicted English for things that are not padding. 01:11:45.320 |
So amazingly enough, this kind of simplest possible written largely from scratch PyTorch 01:11:54.520 |
module on only 50,000 sentences is sometimes capable on a validation set of giving you 01:12:00.760 |
exactly the right answer, sometimes the right answer in slightly different wording, and sometimes 01:12:09.400 |
sentences that aren't grammatically sensible or even have too many question marks. 01:12:14.120 |
So we're well on the right track, I think you would agree. 01:12:18.400 |
So even the simplest possible seq-to-seq, trained for a very small number of epochs without 01:12:25.780 |
any pre-training other than the use of word embeddings is surprisingly good. 01:12:32.600 |
So I think the message here -- and we're going to improve this in a moment after the break 01:12:36.340 |
-- but I think the message here is even sequence-to-sequence models that you think are simpler than could 01:12:42.560 |
possibly work, even with less data than you think you could learn from can be surprisingly 01:12:48.480 |
effective and in certain situations this may even be enough for your needs. 01:12:54.800 |
So we're going to learn a few tricks after the break which will make this much better. 01:13:10.740 |
So one question that came up during the break is that some of the tokens that are missing 01:13:20.800 |
in fast text had a curly quote rather than a straight quote, for example, and the question was whether we should normalize that punctuation. 01:13:35.700 |
And the answer for this particular case is probably yes. 01:13:44.320 |
You do have to be very careful though because it may turn out that people using beautiful 01:13:51.140 |
curly quotes like using more formal language and actually writing in a different way. 01:13:57.340 |
So I generally -- if you're going to do some kind of pre-processing like punctuation normalization, 01:14:04.360 |
you should definitely check your results with and without, because nearly always that kind 01:14:09.680 |
of pre-processing makes things worse even when I'm sure it won't. 01:14:14.400 |
What might be some ways of regularizing these sequence-to-sequence models besides dropout 01:14:32.240 |
It's like, you know, AWDLSTM, which we've been relying on a lot, has so many great -- I mean, 01:14:46.880 |
And then there's the -- we haven't talked about it much, but there's also a kind of 01:14:50.760 |
regularization based on activations and stuff like that as well and on changes and whatever. 01:15:00.000 |
I just haven't seen anybody put anything like that amount of work into regularization of 01:15:05.280 |
sequence-to-sequence models, and I think there's a huge opportunity for somebody to do like 01:15:10.560 |
the AWD-LSTM of seq-to-seq, which might be as simple as stealing all the ideas from AWD-LSTM and using them here. 01:15:21.160 |
That would be pretty easy to try, I think, and there's been an interesting paper that 01:15:27.440 |
actually Stephen Merity has added in the last couple of weeks, where he used an idea 01:15:32.240 |
which I don't know if he stole it from me, but it was certainly something I had also 01:15:38.560 |
Either way, I'm thrilled that he's done it, which was to take all of those different AWDLSTM 01:15:46.080 |
hyperparameters and train a bunch of different models and then use a random forest to find 01:15:51.720 |
out with feature importance which ones actually matter the most and then figure out how to set them. 01:15:58.640 |
I think you could totally use this approach to figure out the sequence-to-sequence regularization 01:16:06.520 |
approaches which ones are best and optimize them, and that would be amazing. 01:16:14.560 |
But at the moment, I don't know that there are additional ideas to sequence-to-sequence 01:16:19.280 |
regularization that I can think of beyond what's in that paper for regular language 01:16:24.160 |
model stuff, and probably all those same approaches would work. 01:16:37.960 |
For classification, my approach to bidirectional that I've suggested you use is take all of 01:16:47.160 |
your token sequences, spin them around, train a new language model, and train a new classifier, 01:16:54.080 |
and I also mentioned the wiki text pre-trained model. 01:16:58.080 |
If you replace fwd with bwd in the name, you'll get the pre-trained backward model I created 01:17:06.240 |
Get a set of predictions and then average the predictions just like a normal ensemble, 01:17:12.040 |
and that's kind of how we do bidir for that kind of classification. 01:17:16.240 |
There may be ways to do it end-to-end, but I haven't quite figured them out yet, they're 01:17:21.760 |
not in fastAI yet and I don't think anybody has written a paper about them yet, so if 01:17:25.840 |
you figure it out, that's an interesting line of research. 01:17:31.680 |
But because we're not doing massive documents where we have to chunk it into separate bits 01:17:38.600 |
and then pool over them and whatever, we can do bidir very easily in this case, which is 01:17:45.480 |
literally as simple as adding bidirectional equals true to our encoder. 01:17:53.960 |
People tend not to do bidirectional for the decoder, I think partly because it's considered 01:18:00.640 |
cheating, but I don't know, I was just talking to somebody at the break about it, maybe it 01:18:10.120 |
can work in some situations, although it might need to be more of an ensembling approach in 01:18:20.120 |
The encoder is very simple, bidirectional equals true, and with bidirectional equals 01:18:29.600 |
true, rather than just having an RNN which is going in this direction, we have a second RNN going in the opposite direction. 01:18:41.200 |
And so that second RNN literally is visiting each token in the opposing order, so when 01:18:50.240 |
we get the final hidden state, it's here rather than here. 01:18:56.740 |
But the hidden state is of the same size, so the final result is that we end up with 01:19:07.000 |
And depending on what library you use, often that will be then combined with the number 01:19:11.360 |
of layers thing, so if you've got two layers and bidirectional, that tensor dimension is 01:19:19.960 |
With PyTorch, it kind of depends which bit of the process you're looking at as to whether 01:19:25.140 |
you get a separate result for each layer and each bidirectional bit and so forth. 01:19:29.320 |
You have to look up the docs and it will tell you the inputs, outputs, tensor sizes, appropriate 01:19:35.040 |
for the number of layers and whether you have bidirectional equals true. 01:19:38.940 |
In this particular case, you'll basically see all the changes I've had to make. 01:19:45.040 |
So for example, you'll see when I added bidirectional equals true, my linear layer now needs number 01:19:50.640 |
of hidden times 2 to reflect the fact that we have that second direction in our hidden 01:19:56.480 |
state now, and you'll see in the hidden state initialization, it's now the number of layers times 2 here. 01:20:05.000 |
So you'll just see there's a few places where there's been an extra 2 that has to be thrown in. 01:20:13.100 |
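Putting those extra 2s in one place, a minimal sketch of a bidirectional encoder might look like this (names and sizes illustrative, not the notebook's exact code):

```python
import torch
import torch.nn as nn

class BidirEncoder(nn.Module):
    def __init__(self, vocab_size, em_sz, nh, nl=2):
        super().__init__()
        self.nl, self.nh = nl, nh
        self.emb = nn.Embedding(vocab_size, em_sz)
        self.gru = nn.GRU(em_sz, nh, num_layers=nl, bidirectional=True)
        self.out = nn.Linear(nh * 2, nh)   # outputs carry both directions, hence nh * 2

    def init_hidden(self, bs):
        # one hidden state per layer *per direction*, hence nl * 2
        return torch.zeros(self.nl * 2, bs, self.nh)

    def forward(self, inp):
        bs = inp.size(1)                       # inp: sequence length x batch size
        h = self.init_hidden(bs)
        outp, h = self.gru(self.emb(inp), h)   # outp: sequence length x batch size x nh*2
        return self.out(outp), h
```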
Why is making a decoder bidirectional considered cheating? 01:20:19.440 |
Well, it's not just that it's cheating, it's like we have this loop going on, you know? 01:20:28.760 |
It's not as simple as just kind of having two tensors, and then how do you turn those two into one final prediction? 01:20:41.920 |
After talking about it during the break, I've kind of gone from "Hey, everybody knows it 01:20:47.480 |
doesn't work" to "Oh, maybe it kind of could work, but it requires more thought." 01:20:53.080 |
It's quite possible during the week I realized it's a dumb idea and I was being stupid, but 01:20:58.920 |
Another question people have, why do you need to have an end to that loop? 01:21:12.640 |
I mean, because when I start training everything's random, so this will probably never be true. 01:21:23.820 |
Later on it will pretty much always break out eventually. 01:21:28.360 |
It's basically like we're going to go forever. 01:21:33.640 |
It's really important to remember when you're designing an architecture that when you start, 01:21:41.040 |
So you kind of want to make sure it's going to do something that's vaguely sensible. 01:21:46.400 |
So with bidirectional, we improved on the 3.58 cross-entropy loss we got with a single direction. 01:22:02.920 |
So that improved it a bit, that's good, and as I say, it shouldn't really slow things 01:22:09.260 |
down too much; bidirectional does mean there's a little bit more sequential processing that has to happen. 01:22:20.520 |
In the Google translation model, of the eight layers, only the first layer is bidirectional. 01:22:29.880 |
So if you create really deep models, you may need to think about which one's bidirectional, 01:22:42.760 |
So teacher forcing is going to come back to this idea that when the model starts learning, 01:22:54.460 |
So when the model starts learning, it is not going to spit out the right word at this point, it's 01:23:00.160 |
going to spit out some random, meaningless word because it doesn't know anything about 01:23:05.080 |
German or about English or about the idea of language or anything. 01:23:08.140 |
And then it's going to feed it down here as an input and be totally unhelpful. 01:23:12.320 |
And so that means that early learning is going to be very, very difficult because it's feeding 01:23:16.960 |
in an input that's stupid into a model that knows nothing, and somehow it's going to get better. 01:23:23.440 |
So it's not asking too much, eventually it gets there, but it's definitely not as helpful as it could be. 01:23:30.600 |
So what if instead of feeding in the thing I predicted just now, what if instead we feed 01:23:44.840 |
in the actual correct word it was meant to be? 01:23:49.880 |
Now we can't do that at inference time because by definition we don't know the correct word 01:23:57.200 |
And we can't require a correct translation in order to do translation. 01:24:01.760 |
So the way I've set this up is I've got this thing called pr_force, which is the probability of forcing. 01:24:09.520 |
And if some random number is less than that probability, then I'm going to replace my decoder input with the actual correct word. 01:24:20.080 |
And if we've already gone too far, if it's already longer than the target sentence, I'm 01:24:24.480 |
just going to stop because I can't give it the correct thing. 01:24:28.480 |
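Inside the decoder loop sketched earlier, that logic is only a couple of lines; something along these lines (pr_force and y, the target batch, are the assumed names):

```python
import random

# ...inside the decoder for loop, right after picking dec_inp from our own prediction:
if (y is not None) and (random.random() < self.pr_force):
    if i >= len(y):
        break          # already past the end of the target: nothing correct left to feed in
    dec_inp = y[i]     # teacher forcing: feed in the actual correct word instead
```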
So you can see how beautiful PyTorch is for this, because if you tried to do this with 01:24:33.760 |
some static graph thing like classic TensorFlow, I tried. 01:24:40.600 |
One of the key reasons we switched to PyTorch at this exact point in last year's class was 01:24:46.120 |
because Jeremy tried to implement teacher forcing in Keras and TensorFlow and went even 01:24:57.300 |
And then literally on Twitter, I think it was Andrej Karpathy who said something about PyTorch 01:25:17.480 |
And all the stuff of trying to debug things, it was suddenly so much easier, and this kind 01:25:24.520 |
So this is a great example of like, hey, I get to use random numbers and if statements 01:25:30.120 |
So here's the basic idea, at the start of training, let's set PR force really high so 01:25:39.600 |
that nearly always it gets the actual correct previous word, and so it has a useful input. 01:25:47.800 |
And then as I train a bit more, let's decrease PR force so that by the end, PR force is 0, 01:25:55.480 |
and it has to learn properly, which is fine because by then it's actually feeding in sensible inputs. 01:26:03.720 |
So let's now write something such that in the training loop, it gradually decreases 01:26:15.000 |
Well one approach would be to write our own training loop, but let's not do that because 01:26:20.240 |
we already have a training loop that has progress bars and uses exponential weighted averages 01:26:25.420 |
to smooth out the losses and keeps track of metrics. 01:26:28.080 |
And it does a bunch of things which are not rocket science, but they're kind of convenient. 01:26:33.520 |
And they also keep track of calling the reset for RNNs at the start of an epoch to make 01:26:38.720 |
sure that the hidden states set to zeros and little things like that. 01:26:43.000 |
You'd rather not have to write that from scratch. 01:26:45.600 |
So what we've tended to find is that as I start to write some new thing, and I'm like 01:26:53.480 |
I need to replace some part of the code, I'll then add some little hook so that we can all 01:27:03.960 |
In this particular case, there's a hook that I've ended up using all the damn time now, 01:27:12.040 |
And so if you look at our code, model.py is where our fit function lives. 01:27:21.320 |
And so the fit function in model.py, we've seen it before, I think it's like the lowest 01:27:26.840 |
level thing that doesn't require a learner, it doesn't really require anything much at 01:27:32.320 |
It just requires a standard PyTorch model and a model data object. 01:27:35.360 |
You just need to know how many epochs, a standard PyTorch optimizer, and a standard PyTorch loss function. 01:27:42.120 |
So we've hardly ever used it in the class, we normally call learn.fit, but learn.fit calls 01:27:50.000 |
But we've looked at the source code here sometimes, we've seen how it loops through each epoch 01:27:54.800 |
and it loops through each thing in our batch and calls stepper.step. 01:28:01.560 |
And so stepper.step is the thing that's responsible for calling the model, finding the loss function 01:28:10.080 |
And so by default stepper.step uses a particular class called Stepper, which has a few things in it 01:28:18.520 |
you don't need to worry about too much, but basically it calls the model. 01:28:22.040 |
So the model ends up inside m, zeros the gradients, calls the loss function, calls backwards, 01:28:32.640 |
does gradient clipping if necessary, and then calls the optimizer. 01:28:37.080 |
So they're the basic steps that, back when we looked at PyTorch from scratch, we had to do by hand. 01:28:44.920 |
So the nice thing is we can replace that with something else rather than replacing the training loop. 01:28:54.160 |
So if you inherit from stepper and then write your own version of step, you can just copy 01:29:01.480 |
and paste the contents of step and add whatever you like. 01:29:06.180 |
Or if it's something you're going to do before or afterwards, you could even call super.step. 01:29:12.640 |
In this case, I'd rather suspect I've been unnecessarily complicated here, I probably 01:29:18.900 |
could have commented out all of that and just said super().step(xs, y). 01:29:28.640 |
Because I think this is an exact copy of everything, but as I say, when I'm prototyping I don't 01:29:34.920 |
think carefully about how to minimize my code. 01:29:37.560 |
I copied and pasted the contents of the code from step, and I added a single line to the 01:29:42.280 |
top which was to replace pr_force in my module with something that gradually decreased linearly 01:29:54.600 |
for the first 10 epochs, and after 10 epochs it was zero. 01:29:59.600 |
So total hack, but good enough to try it out. 01:30:04.960 |
So the nice thing is that everything else is the same, I've added these three lines 01:30:12.920 |
of code to my module, and the only thing I need to do other than differently is when 01:30:20.040 |
I call fit is I pass in my customized stepper class. 01:30:26.880 |
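As a sketch, the customized stepper can be as small as this (fastai 0.7's Stepper lives in fastai.model; the exact signature is from memory, and, as described above, simply calling the parent step would be enough):

```python
from fastai.model import Stepper

class Seq2SeqStepper(Stepper):
    def step(self, xs, y, epoch):
        # linearly decay teacher forcing over the first 10 epochs, then switch it off
        self.m.pr_force = (10 - epoch) * 0.1 if epoch < 10 else 0
        return super().step(xs, y, epoch)

# and then: learn.fit(lr, 1, cycle_len=12, use_clr=(20, 10), stepper=Seq2SeqStepper)
```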
And so that's going to do teacher forcing; we don't have bidirectional here, so we're comparing like with like. 01:30:34.880 |
So we should compare this to our unidirectional results, which was 3.58, and this is 3.49. 01:30:48.800 |
I needed to make sure I at least did 10 epochs because before that it was cheating by using the teacher forcing. 01:31:01.400 |
So we've got another trick, and this next trick is a bigger trick. 01:31:07.840 |
It's a pretty cool trick, and it's called attention. 01:31:12.760 |
And the basic idea of attention is this, which is, expecting the entirety of the sentence 01:31:25.920 |
to be summarized into this single hidden vector is asking a lot. 01:31:32.340 |
It has to know what was said and how it was said and everything necessary to create the 01:31:41.240 |
And so the idea of attention is basically like maybe we're asking too much, particularly 01:31:46.800 |
because we could use this form of model where we output every step of the loop to not just 01:31:57.280 |
have a hidden state at the end, but to hit a hidden state after every single word. 01:32:07.040 |
It's already there, and so far we've just been throwing it away. 01:32:13.120 |
And not only that, but bidirectional, we've got every step, we've got two vectors of state 01:32:24.200 |
So how could we use this piece of state, this piece of state, this piece of state, this piece 01:32:29.000 |
of state and this piece of state rather than just the final state? 01:32:34.000 |
And so the basic idea is, well, let's say I'm translating this word right now. 01:32:42.120 |
Which of these five pieces of state do I want? 01:32:46.040 |
And of course the answer is if I'm doing -- well, actually let's pick a more interesting word. 01:32:53.080 |
So if I'm trying to do loved, then clearly the hidden state I want is this one, because 01:33:03.360 |
And then for this preposition, whatever, this little word here, no it's not a preposition, 01:33:11.380 |
So for this part of the verb, I probably would need this and this and this to make sure that 01:33:18.000 |
I've got the tense right and know that I actually need this part of the verb and so forth. 01:33:23.760 |
So depending on which bit I'm translating, I'm going to need one or more bits of these 01:33:33.960 |
And in fact, I probably want some weighting of them. 01:33:38.580 |
So like what I'm doing here, I probably mainly want this state, but I maybe want a little 01:33:44.440 |
bit of that one and a little bit of that one. 01:33:47.780 |
So in other words, for these five pieces of hidden state, we want a weighted average, and 01:33:54.440 |
we want it weighted by something that can figure out which bits of the sentence are 01:34:03.080 |
So how do we figure out something like which bits of the sentence are important right now? 01:34:09.680 |
We create a neural net, and we train the neural net to figure it out. 01:34:21.920 |
We've actually already got a bunch, we've got an RNN encoder, an RNN decoder, a couple 01:34:32.760 |
And this neural net is going to spit out a weight for every one of these things, and we're 01:34:38.760 |
going to take the weighted average at every step. 01:34:41.120 |
And it's just another set of parameters that we learn all at the same time. 01:34:50.140 |
So the idea is that once that attention has been learned, we can see this terrific demo 01:34:56.040 |
from Chris Olah and Shan Carter, where each different word is going to take a weighted average. 01:35:01.520 |
See how the weights are different depending on which word is being translated. 01:35:06.880 |
And you can see how it's kind of figuring out the color, the deepness of the blue is 01:35:12.080 |
You can see that each word is basically which word are we translating from. 01:35:16.480 |
So when we say European, we need to know that both of these two parts are going to be influenced 01:35:20.360 |
or if we're doing economic, both of these three parts are going to be influenced, including 01:35:23.880 |
the gender of the definite article and so forth. 01:35:33.360 |
These things are all nice little interactive diagrams. 01:35:38.040 |
It basically shows you how attention works and what the actual attention looks like in 01:35:53.160 |
So with attention, it's basically this is all identical, and the encoder is identical, 01:36:08.360 |
and all of this bit of the decoder is identical. 01:36:12.280 |
There's one difference, which is that we basically are going to take a weighted average. 01:36:30.880 |
And the way that we're going to do the weighted average is we create a little neural net, 01:36:35.320 |
which we're going to see here and here, and then we use softmax, because of course the 01:36:41.600 |
nice thing about softmax is that we want to ensure that all of the weights that we're 01:36:47.400 |
using add up to 1, and we also kind of expect that one of those weights should probably 01:36:56.200 |
So softmax gives us the guarantee that they add up to 1, and because of the e^ in it, 01:37:04.840 |
it tends to encourage one of the weights to be higher than the other ones. 01:37:11.720 |
So what's going to happen is we're going to take the last layer's hidden state, and we're going to put it into a linear layer. 01:37:24.000 |
And then we're going to stick it into a nonlinear activation, and then we're going to do matrix 01:37:32.560 |
multiply, and so if you think about it, linear layer, nonlinear activation, matrix multiply, that's a little neural network. 01:37:45.640 |
Stick it into a softmax, and then we can use that to weight our encoder outputs. 01:37:56.600 |
So now, rather than just taking the last encoder output, we've got this is going to be the 01:38:02.240 |
whole tensor of all of the encoder outputs, which I just weight by this little neural net. 01:38:18.160 |
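Written out symbolically, with d_t the decoder's current hidden state, h_s the encoder output at source position s, and W, V the two learned layers just described, the weighted average is roughly:

```latex
u_t = \tanh(W d_t + b)                                        % linear layer + nonlinearity
\alpha_{t,s} = \frac{\exp\!\big((u_t V)_s\big)}{\sum_{s'} \exp\!\big((u_t V)_{s'}\big)}  % softmax over source positions
c_t = \sum_s \alpha_{t,s}\, h_s                               % weighted average of encoder outputs
```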
So what I'll do is I'll put on the wiki thread a couple of papers to check out. 01:38:27.680 |
There was basically one amazing paper that really originally introduced this idea of attention. 01:38:35.120 |
And I say amazing because it actually introduced a couple of key things which have really changed 01:38:43.640 |
This area of attention has been used not just for text, but for things like reading text 01:38:53.400 |
out of pictures, or doing various stuff with computer vision, and stuff like that. 01:38:59.280 |
And then there's a second paper which Geoffrey Hinton was involved in called Grammar as a 01:39:05.440 |
Foreign Language, which used this idea of RNNs with attention to basically try to replace 01:39:11.880 |
rules-based grammar with an RNN which automatically tagged the grammatical role of each word based on 01:39:11.880 |
this grammar, and turned out to do it better than any rules-based system. 01:39:29.480 |
I think we're now used to the idea that neural nets do lots of this stuff better than rules-based 01:39:35.880 |
systems, but at the time it was considered really surprising. 01:39:39.800 |
One nice thing is that their summary of how attention works is really nice and concise. 01:39:47.000 |
Let's go back and look at our original encoder. 01:40:14.960 |
It spits out a list of the state after every time step, and it also tells you the state 01:40:25.720 |
And we used the state at the last time step to create the input state for our decoder, 01:40:45.920 |
But we know that it's actually creating a vector at every time step, so wouldn't it be nice to use all of them? 01:40:53.440 |
And wouldn't it be nice to use the ones that are most relevant to translating the word I'm translating right now? 01:41:02.680 |
So wouldn't it be nice to take a weighted average of the hidden state at each time step, 01:41:08.140 |
weighted by whatever is the appropriate weight right now? 01:41:12.340 |
Which for example in this case, "liebte" would definitely be time step number 2, which is 01:41:18.120 |
what it's all about, because that's the word I'm translating. 01:41:21.720 |
So how do we get a list of weights that is suitable for the word we're translating right 01:41:30.640 |
now? The answer is by training a neural net to figure out the list of weights. 01:41:37.020 |
And so anytime we want to figure out how to train a little neural net that does any task, 01:41:43.320 |
the easiest way normally always to do that is to include it in your module and train 01:41:53.480 |
The minimal possible neural net is something that contains two layers and one nonlinear 01:42:09.640 |
So here is one linear layer, and in fact instead of a second linear layer, we can even just grab a random matrix. 01:42:29.400 |
And so here's a random matrix, it's just a random tensor wrapped up in a parameter. 01:42:36.000 |
A parameter, remember, is just a PyTorch variable, it's like identical to a variable, but it 01:42:42.600 |
just tells PyTorch I want you to learn the weights for this. 01:42:48.440 |
So here we've got a linear layer, here we've got a random matrix, and so here at this point 01:42:56.760 |
where we start out our decoder, let's take the current hidden state of the decoder, put 01:43:10.560 |
that into a linear layer, because what's the information we use to decide what words we 01:43:20.640 |
The only information we have to go on is what the decoder's hidden state is now. 01:43:25.000 |
So let's grab that, put it into the linear layer, put it through a nonlinearity, put 01:43:36.880 |
This one actually doesn't have a bias in it, so it's actually just a matrix multiply. 01:43:41.360 |
Put that into a softmax, and that's it, that's a little neural net. 01:43:48.020 |
It doesn't do anything, it's just a neural net, no neural nets do anything, they're just 01:43:54.520 |
linear layers with nonlinear activations with random weights. 01:43:58.240 |
But it starts to do something if we give it a job to do, and in this case the job we give 01:44:04.520 |
it to do is to say, don't just take the final state, but now let's use all of the encoder 01:44:13.000 |
states and let's take all of them and multiply them by the output of that little neural net. 01:44:22.280 |
And so given that the things in this little neural net are learnable weights, hopefully 01:44:28.640 |
it's going to learn to weight those encoder outputs, those encoder hidden states by something 01:44:35.840 |
That's all a neural net ever does, is we give it some random weights to start with and a 01:44:41.460 |
job to do, and hope that it learns to do the job. 01:44:49.200 |
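Here is a minimal, self-contained version of that mini net, matching the spoken description (a linear layer, a tanh, a learned bias-free random matrix, a softmax, then a weighted sum). It assumes the encoder outputs are padded to a fixed maximum source length; the course notebook's version is a little richer, for example it also feeds the encoder outputs into the scoring, so treat this purely as an illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniAttention(nn.Module):
    """Linear layer -> tanh -> learned random matrix -> softmax -> weighted average."""
    def __init__(self, nh, max_src_len):
        super().__init__()
        self.l1 = nn.Linear(nh, nh)                                  # the linear layer
        self.V = nn.Parameter(torch.randn(nh, max_src_len) * 0.01)   # random matrix, no bias, learned

    def forward(self, dec_h, enc_out):
        # dec_h: (batch, nh) current decoder hidden state; enc_out: (src_len, batch, nh)
        u = torch.tanh(self.l1(dec_h))               # nonlinear activation
        a = F.softmax(u @ self.V, dim=1)             # one weight per source position, summing to 1
        ctx = (a.t().unsqueeze(2) * enc_out).sum(0)  # weighted average of encoder outputs: (batch, nh)
        return ctx, a
```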
So everything else in here is identical to what it was before. 01:44:54.680 |
We've got teacher forcing, it's not bidirectional. 01:45:00.680 |
You can see, here we are using teacher forcing. 01:45:11.340 |
Teacher forcing had 3.49, and so now we've got nearly exactly the same thing, but we've 01:45:18.900 |
got this little minimal neural net figuring out what weightings to give our inputs. 01:45:30.060 |
However these things are logs, so e^ of this is quite a significant change. 01:45:51.340 |
The translations are still not perfect by any means, but quite a few of them are correct. 01:46:02.020 |
And again, considering that we're asking it to learn about the very idea of language for 01:46:06.660 |
two different languages, and how to translate them between the two, and grammar, and vocabulary, 01:46:11.820 |
and we only have 50,000 sentences, and a lot of the words only appear once, I would say 01:46:24.340 |
Why do we use tanh instead of relu for the attention mini-net? 01:46:28.500 |
I don't quite remember, it's been a while since I looked at it. 01:46:37.180 |
You should totally try using relu and see how it goes. 01:46:40.660 |
The key difference is that it can go in each direction, and it's limited both at the top and the bottom. 01:46:51.060 |
I know very often for the gates inside RNNs and LSTMs and GRUs, tanh often works out better. 01:46:59.260 |
But it's been about a year since I actually looked at that specific question, so I'll 01:47:05.100 |
The short answer is you should try a different activation function and see if you can get 01:47:13.640 |
So what we can do also is we can actually grab the attentions out of the model. 01:47:20.420 |
So I actually added this returnAttention = True; see here, in my forward you can put anything you like. 01:47:31.860 |
So I added a returnAttention parameter, false by default, because obviously the training 01:47:40.620 |
But then I just had something here saying if returnAttention, then stick the attentions 01:47:44.740 |
on as well, and the attentions is simply that value, a, just chucked into a list. 01:47:52.700 |
So we can now call the model with returnAttention = true and get back the probabilities and 01:47:58.500 |
the attentions, which means as well as printing out these here, we can draw pictures. 01:48:08.700 |
And so you can see at the start, the attention is all in the first word, second word, third word, and so forth. 01:48:14.780 |
And this is just for one particular sentence. 01:48:20.340 |
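A quick way to produce that kind of picture, assuming attns is the list of per-step attention tensors (each batch x source-positions) returned when the return-attention flag is on:

```python
import numpy as np
import matplotlib.pyplot as plt

att = np.stack([a.detach().cpu().numpy()[0] for a in attns])  # decoder steps x source positions
fig, axes = plt.subplots(3, 3, figsize=(12, 8))
for ax, row in zip(axes.flat, att):
    ax.plot(row)            # where the model is "looking" while producing this output word
plt.tight_layout()
plt.show()
```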
So you can kind of see, this is the equivalent; this is like, when you're Chris Olah and Shan 01:48:29.420 |
Carter you make things that look like this, and when you're Jeremy Howard, the exact same information looks like this. 01:48:41.820 |
So you can see basically at each different time step, we've got a different attention. 01:48:47.340 |
And it's really important when you try to build something like this, you don't really 01:48:56.980 |
Because if it's not working, and as per usual my first 12 attempts at this were broken, 01:49:03.300 |
and they were broken in the sense that it wasn't really learning anything useful, and 01:49:06.940 |
so therefore it was basically giving equal attention to everything, and therefore it 01:49:11.180 |
It just wasn't better, or it wasn't much better. 01:49:14.980 |
And so until you actually find ways to visualize the thing in a way that you know what it ought 01:49:22.220 |
to look like ahead of time, you don't really know if it's working. 01:49:24.820 |
So it's really important that you try to find ways to kind of check your intermediate steps 01:49:30.640 |
So people are asking what is the loss function for the attentional neural network? 01:49:38.260 |
No, no, no loss function for the attentional neural network. 01:49:43.140 |
So it's just sitting here inside our decoder loop. 01:49:48.100 |
So the loss function for the decoder loop is that this result contains exactly the same 01:49:55.580 |
Just the outputs, the probabilities of the words. 01:49:59.680 |
So like the loss function, it's the same loss function. 01:50:05.440 |
So how come the little mini neural net is learning something? Well, because in order to make the 01:50:12.960 |
outputs better and better, it would be great if it made the weights of this weighted average better and better. 01:50:21.300 |
So part of creating our output is to please do a good job of finding a good set of weights. 01:50:26.540 |
And if it doesn't do a good job of finding a good set of weights, then the loss function won't improve. 01:50:31.700 |
So end-to-end learning means you throw in everything that you can into one loss function. 01:50:41.420 |
And the gradients of all the different parameters point in a direction that says basically, 01:50:47.740 |
hey, if you had put more weight over there, it would have been better. 01:50:52.540 |
And thanks to the magic of the chain rule, it then knows, oh, it would have put more 01:50:55.800 |
weight over there if you would change the parameter in this matrix, multiply a little 01:51:02.360 |
And so that's the magic of end-to-end learning. 01:51:07.940 |
So it's a very understandable question of how did this little mini neural net work. 01:51:17.060 |
But you've got to realize there's nothing particularly about this code that says, hey, 01:51:22.300 |
this particular bit's a separate little mini neural network any more than the GRU is a 01:51:26.460 |
separate little neural network, or this linear layer is a separate little function. 01:51:31.900 |
It all ends up pushed into one output, which ends up in one loss function that returns 01:51:40.580 |
a single number that says this either was or wasn't a good translation. 01:51:46.220 |
And so thanks to the magic of the chain rule, we then back-propagate little updates to all 01:51:52.380 |
the parameters to make them a little bit better. 01:51:55.460 |
So this is a big, weird counterintuitive idea, and it's totally okay if it's a bit mind bending. 01:52:07.700 |
And it's the bit where, even back to lesson 1, it's like, how did we make it find dogs versus cats? 01:52:19.300 |
All we did was we said, this is our data, this is our architecture, this is our loss 01:52:25.020 |
function, please back-propagate into the weights to make them better. 01:52:29.220 |
And after you've made them better a while, it'll start finding cats from dogs. 01:52:34.140 |
In this case, we haven't used somebody else's convolutional network architecture. 01:52:39.260 |
We've said here's like a custom architecture which we hope is going to be particularly good at this problem. 01:52:45.180 |
And even without this custom architecture, it was still okay. 01:52:49.300 |
But then when we kind of made it in a way that made more sense to what we think it ought 01:52:57.100 |
But at no point did we kind of do anything different other than say here's data, here's 01:53:04.020 |
an architecture, here's a loss function, go and find the parameters please. 01:53:09.620 |
And it did it because that's what neural nets do. 01:53:25.140 |
If you want to encode an image using a CNN backbone of some kind and then pass that into 01:53:34.620 |
a decoder which is like an RNN with attention, and you make your Y values the actual correct 01:53:41.860 |
captions for each of those images, you will end up with an image caption generator. 01:53:46.860 |
If you do the same thing with videos and captions, you'll end up with a video caption generator. 01:53:50.460 |
If you do the same thing with 3D CT scans and radiology reports, you'll end up with a radiology report generator. 01:53:57.140 |
If you do the same thing with GitHub issues and people's chosen summaries of them, you'll end up with a GitHub issue summarizer. 01:54:18.540 |
And I don't feel like people have begun to scratch the surface of how to use seq-to-seq models like this. 01:54:26.460 |
Not being a GitHub person, it would never have occurred to me that it would be kind of cool 01:54:30.900 |
to start with some issue and automatically create a summary. 01:54:34.620 |
But now I'm like, of course, next time I go to GitHub I want to see a summary written there 01:54:41.660 |
I don't want to write my own damn commit message through that. 01:54:44.780 |
Why should I write my own summary of the code review when I finish adding comments to lots 01:54:51.260 |
Now I'm thinking, GitHub is so behind, it could be doing this stuff. 01:54:55.500 |
So what are the things in your industry where you could start with a sequence and generate something from it? 01:55:04.860 |
So again, it's kind of like a fairly new area, the tools for it are not easy to use, they're 01:55:12.500 |
not even built into fastai yet, as you can see, hopefully they will be soon. 01:55:18.740 |
And I don't think anybody knows what the opportunities are. 01:55:27.880 |
The bad news is we have 20 minutes to cover a topic which in last year's course took a 01:55:37.580 |
The good news is that when I went to rewrite this using fastai and PyTorch I ended up with 01:55:44.840 |
So all of the stuff that made it hard last year is basically gone now. 01:55:49.360 |
So we're going to do something bringing together for the first time our two little worlds we 01:55:55.500 |
focused on, text and images, and we're going to try and bring them together. 01:56:01.220 |
And so this idea came up really in a paper by this extraordinary deep learning practitioner 01:56:11.140 |
And Andrea was at Google at the time, and her basic crazy idea was to say words can 01:56:22.380 |
have a distributed representation, a space, which at that time really was just word vectors. 01:56:31.260 |
And images can be represented in a space, like in the end if we have a fully connected 01:56:36.340 |
layer they kind of ended up as a vector representation. 01:56:42.660 |
Could we somehow encourage the vector space that the images end up in to be the same vector space that the words are in? 01:56:52.100 |
And if we could do that, what would that mean? 01:56:57.380 |
So what could we do with that? Well, it covers things like, what if I'm wrong? 01:57:05.300 |
What if I'm making a prediction for this image of a beagle, and I predict jumbo jet, and Yannet's model predicts corgi? 01:57:20.220 |
The normal loss function says that Yannet's and Jeremy's models are equally good, i.e. they're both wrong. 01:57:27.700 |
But what if we could somehow say corgi is closer to beagle than it is to jumbo jet, so Yannet's mistake is less bad than mine? 01:57:36.860 |
And we should be able to do that, because in word vector space beagle and corgi are pretty close together, and they're both a long way from jumbo jet. 01:57:48.460 |
So it would give us a nice situation where hopefully our inferences, when they're wrong, would be wrong in more sensible ways. 01:57:57.620 |
It would also allow us to search for things that aren't in ImageNet, like a category ImageNet was never trained on. 01:58:11.700 |
Why did I have to train a whole new model to find dogs versus cats when we already had something that had seen lots of dogs and cats? 01:58:21.980 |
Well if I had trained it in word vector space, I totally could, because there's now a word 01:58:28.540 |
vector, I can find things with the right image vector, and so forth. 01:58:34.740 |
So we'll look at some cool things we can do with it in a moment, but first of all let's 01:58:38.140 |
train a model where this model is not learning a category, a one-hot encoded ID where every 01:58:47.700 |
category is equally far from every other category. 01:58:51.340 |
Let's instead train a model where we're finding the dependent variable which is a word vector. 01:59:00.660 |
Well obviously the word vector for the word you want. 01:59:03.700 |
So if it's corgi, let's train it to create a word vector that's the corgi word vector. 01:59:10.060 |
And if it's a jumbo jet, let's train it with a dependent variable that says this is the jumbo-jet word vector. 01:59:20.500 |
So let's grab the fast text word vectors again, load them in, we only need English this time. 01:59:28.780 |
And so here's an example of the word vector for king, it's just 300 numbers. 01:59:36.220 |
So for example, little j Jeremy and big j Jeremy have a correlation of 0.6, I don't like 01:59:42.420 |
bananas at all, this is good, banana and Jeremy, 0.14. 01:59:46.580 |
So words that you would expect to be correlated are correlated in words that should be as 01:59:52.080 |
far away from each other as possible, unfortunately they're still slightly correlated but not 01:59:57.140 |
So let's now grab all of the ImageNet classes, because we actually want to know which one's which. 02:00:10.740 |
So we've got a list of all of those up on files.fast.ai, we can grab them. 02:00:17.580 |
And let's also grab a list of all of the nouns in English which I've made available here 02:00:24.620 |
So here are the names of each of the 1000 ImageNet classes. 02:00:29.540 |
And here are all of the nouns in English according to WordNet, which is a popular thing for kind 02:00:41.460 |
So we can now go ahead and load that list of nouns, load the list of ImageNet classes, 02:00:55.820 |
So these are the class IDs for the 1000 ImageNet classes that are in the competition data set. 02:01:03.620 |
So here's an example, n01, is a tench which apparently is a kind of fish. 02:01:11.980 |
Let's do the same thing for all those WordNet nouns. 02:01:14.820 |
And you can see actually it turns out that ImageNet is using WordNet class names, so that 02:01:21.220 |
makes it nice and easy to map between the two. 02:01:25.580 |
And WordNet, the most basic thing is an entity, and then that includes an abstraction, and 02:01:31.580 |
a physical entity can be an object and so forth. 02:01:36.420 |
We've got the ImageNet 1000 and we've got the 82,000 which are in WordNet. 02:01:43.500 |
So we want to map the two together, which is as simple as creating a couple of dictionaries 02:01:47.220 |
to map them based on the Synset ID or the WordNet ID. 02:01:52.060 |
And it turns out we end up with 49,469 synset-to-word-vector mappings; what I need to do to get there is grab the 82,000 nouns 02:02:18.820 |
in WordNet and try and look them up in fast text. 02:02:23.140 |
And so I've managed to look up 49,000 of them in fast text. 02:02:27.920 |
So I've now got a dictionary that goes from Synset ID, which is what WordNet calls them, 02:02:34.580 |
So that's what this dictionary is, Synset to WordVector. 02:02:41.740 |
And I've also got the same thing specifically for the 1000 ImageNet classes. 02:02:54.580 |
Now I grab all of the ImageNet, which you can actually download from Kaggle now. 02:03:00.980 |
If you look up the Kaggle ImageNet localization competition, that contains the entirety of 02:03:08.140 |
It's got a validation set of 28,650 items in it. 02:03:15.740 |
And so I can basically just grab for every image in ImageNet, I can grab using that Synset 02:03:23.420 |
to WordVector, grab its fast text word vector, and I can now stick that into this ImageVectors list. 02:03:37.200 |
Stack that all up into a single matrix and save that away. 02:03:43.260 |
And so now what I've got is something for every ImageNet image. 02:03:49.660 |
I've also got the fast text WordVector that it's associated with. 02:03:55.520 |
Just by looking up the Synset ID, going to WordNet, then going to fast text and grabbing the word vector. 02:04:10.340 |
I can now create a model data object, which specifically is an image classifier data object. 02:04:17.940 |
And I've got this thing called from_names_and_array, I'm not sure if we've used it before. 02:04:21.340 |
But we can pass it a list of file names, and so these are all of the file names in ImageNet. 02:04:29.340 |
And we can just pass it an array of our dependent variables. 02:04:32.900 |
And so this is all of the fast text WordVectors. 02:04:38.140 |
And then I can pass in the validation indexes, which in this case is just all of the last 02:04:44.820 |
I need to make sure that they're the same ones ImageNet uses, otherwise I'll be cheating. 02:04:50.100 |
And then I pass in continuous = True, which once again puts a lie to the ImageClassifierData name. 02:04:59.020 |
So continuous = True means don't one-hot encode my outputs, but treat them just as continuous values. 02:05:06.580 |
So now I've got a model data object that contains all of my file names, and for every file name 02:05:12.900 |
a continuous array representing the WordVector for that. 02:05:17.500 |
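Roughly how that model data object gets built, using the 0.7-era fastai API (from_names_and_array and its argument names are written from memory here, so treat the exact signature as an assumption):

```python
# img_fnames: list of ImageNet file names; img_vecs: the matching fast text word vectors
md = ImageClassifierData.from_names_and_array(
    PATH, img_fnames, img_vecs,
    val_idxs=val_idxs,       # the same validation split ImageNet uses, so we aren't cheating
    classes=None, tfms=tfms,
    continuous=True,         # regression targets, not one-hot encoded classes
    bs=256)
```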
So I have an x, I have a y, so I have data, now I need an architecture and a loss function. 02:05:26.400 |
So let's create an architecture, and we'll revise this next week, but basically we can 02:05:33.620 |
use the tricks we've learned so far, but it's actually incredibly simple. 02:05:38.020 |
Fast AI has a ConvnetBuilder, which is what gets called when you say ConvLearner.pretrained. 02:05:48.300 |
And you basically say, okay, what architecture do you want? 02:05:55.780 |
In this case it's not really classes, it's how many outputs do you want, which is the length of the word vector, 300. 02:06:04.780 |
Obviously it's not multi-class classification, it's not classification at all. 02:06:12.480 |
And then you can just say, alright, what fully connected layers do you want? 02:06:16.820 |
So I'm just going to add one fully connected layer, hidden layer of length 1024. 02:06:23.620 |
Well I've got the last layer of ResNet-50, I think it's 1024 long. 02:06:34.940 |
I obviously need my penultimate layer to be longer than 300, otherwise there's not enough 02:06:39.180 |
information, so I kind of just picked something a bit bigger. 02:06:42.780 |
Maybe different numbers would be better, but this worked for me. 02:06:48.420 |
I found that the default dropout I was consistently underfitting, so I just decreased the dropout 02:06:56.500 |
And so this is now a convolutional neural network that does not have any softmax or 02:07:03.020 |
anything like that because it's regression, it's just a linear layer at the end. 02:07:14.680 |
So I can create a convlearner from that model, give it an optimization function. 02:07:21.460 |
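The architecture set-up just described is only a few lines; a sketch in the 0.7-era fastai API (ConvnetBuilder/ConvLearner argument names from memory):

```python
from functools import partial
from torch import optim

arch = resnet50
models = ConvnetBuilder(arch, md.c,            # md.c is 300, the word vector length
                        is_multi=False, is_reg=True,
                        xtra_fc=[1024],        # one extra fully connected hidden layer
                        ps=[0.2, 0.2])         # lower dropout than the default
learn = ConvLearner(md, models)
learn.opt_fn = partial(optim.Adam, betas=(0.9, 0.99))
```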
So now I've got data and I've got an architecture; because I said I want this 02:07:28.060 |
many outputs, 300, it knows there are 300 outputs because that's the size of this array, so all I need now is a loss function. 02:07:39.140 |
Now the default loss function for regression is L1-loss, so the absolute differences. 02:07:46.260 |
That's not bad, but unfortunately in really high dimensional spaces, anybody who's studied 02:07:55.220 |
a bit of machine learning probably knows this, in really high dimensional spaces, in this 02:07:58.700 |
case it's 300 dimensional, basically everything is on the outside. 02:08:04.060 |
And when everything's on the outside, distance is not meaningless, but it's a little bit 02:08:12.220 |
awkward, things tend to be close together or far away, doesn't really mean much in these 02:08:19.780 |
really high dimensional spaces where everything's on the edge. 02:08:24.440 |
What does mean something though is that if one thing's on the edge over here, and one 02:08:29.180 |
thing's on the edge over here, you can form an angle between those vectors, and the angle 02:08:36.900 |
And so that's why we use cosine similarity when we're basically looking for how close 02:08:43.700 |
or far apart are things in high dimensional spaces. 02:08:47.500 |
And if you haven't seen cosine similarity before, it's basically the same as Euclidean 02:08:51.500 |
distance, but it's normalized to be basically a unit norm, so basically divide by the length. 02:09:00.980 |
So we don't care about the length of the vector, we only care about its angle. 02:09:05.260 |
So there's a bunch of stuff that you could easily learn in a couple of hours, but if 02:09:11.580 |
you haven't seen it before, it's a bit mysterious. 02:09:14.620 |
For now, just know that loss functions in high dimensional spaces where you're trying 02:09:18.780 |
to find similarity, you care about angle and you don't care about distance. 02:09:26.300 |
If you didn't use this custom loss function, it would still work, I tried it, it's just not quite as good. 02:09:32.420 |
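The custom loss is tiny; a sketch (F.cosine_similarity is a standard PyTorch function):

```python
import torch.nn.functional as F

def cos_loss(inp, targ):
    # we only care about the angle between the predicted vector and the target
    # word vector, not their lengths, so minimise 1 - cosine similarity
    return 1 - F.cosine_similarity(inp, targ).mean()
```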
So we've got an architecture, we've got data, we've got a loss function, therefore we're done. 02:09:41.540 |
Now I'm training on all of ImageNet, that's going to take a long time, so precompute = True. 02:09:50.380 |
That's that thing we learned ages ago that caches the output of the final convolutional 02:09:54.500 |
layer and just trains the fully connected bit. 02:10:00.160 |
And like even with pre-compute equals true, it takes like 3 minutes to train an epoch 02:10:08.580 |
So I trained it for a while longer, so that's like an hour's worth of training. 02:10:14.080 |
But it's pretty cool that with fast.ai, we can train a new custom head basically on all 02:10:26.260 |
And so at the end of all that, we can now say, let's grab the 1000 ImageNet classes, 02:10:40.580 |
And let's just take a look at a few pictures. 02:10:46.300 |
And because the validation set is ordered, all the stuff of the same type is in the same place. 02:10:55.940 |
And what we can now do is we can now use nearest neighbors search. 02:11:00.740 |
So nearest neighbors search means here's one 300-dimensional vector, here's a whole lot 02:11:06.260 |
of other 300-dimensional vectors, which ones is it closest to? 02:11:10.460 |
And normally that takes a very long time because you have to look through every 300-dimensional 02:11:13.420 |
vector, calculate its distance, and find out how far away it is. 02:11:18.380 |
But there's an amazing almost unknown library called NMSlib that does that incredibly fast. 02:11:29.100 |
Some of you may have tried other nearest neighbors libraries. 02:11:31.740 |
I guarantee this is faster than what you're using. 02:11:34.540 |
I can tell you that because it's been benchmarked by people who do this stuff for a living. 02:11:40.120 |
This is by far the fastest on every possible dimension. 02:11:46.980 |
We basically look here, this is angular distance. 02:11:49.920 |
So we want to create an index on angular distance, and we're going to do it on all of our ImageNet 02:11:56.220 |
word vectors: add them in as a whole batch, create the index, and now I can query a bunch of 02:12:01.780 |
vectors all at once, get their 10 nearest neighbors, using multi-threading. 02:12:09.280 |
You can install it from pip, it just works, and it tells you how far away they are and what their indexes are. 02:12:17.260 |
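A hedged sketch of using nmslib the way described here (init, addDataPointBatch, createIndex and knnQueryBatch are the library's standard calls, but check the docs for the exact options):

```python
import numpy as np
import nmslib

def create_index(vectors):
    index = nmslib.init(space='angulardist')    # angular distance, as discussed
    index.addDataPointBatch(vectors)            # add everything in one batch
    index.createIndex()
    return index

def knn_batch(index, query_vecs, k=10, n_threads=4):
    # returns (neighbour indexes, distances) for each query vector
    return zip(*index.knnQueryBatch(query_vecs, k=k, num_threads=n_threads))

# e.g. index the 1,000 class word vectors, then query with the model's predictions:
# nn_index = create_index(np.asarray(class_word_vectors, dtype=np.float32))
# idxs, dists = knn_batch(nn_index, predictions)
```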
So we can now go through and print out the top 3. 02:12:21.620 |
So it turns out that bird actually is a limpkin. 02:12:29.820 |
Interestingly, this one doesn't say it's a limpkin, and I looked it up, it's the 4th one. 02:12:34.780 |
I don't know much about birds, but everything else here is brown with white spots, that's 02:12:41.460 |
So I don't know if that's actually a limpkin or if it's a mislabeled, but it sure as hell 02:12:55.780 |
Now this is not a particularly hard thing to do because it's only 1,000 ImageNet classes, 02:13:00.380 |
it's not doing anything new, but what if we now bring in the entirety of WordNet, and 02:13:06.380 |
we now say which of those 45,000 things is it closest to, exactly the same. 02:13:16.460 |
So now let's do something a bit different, which is take all of our predictions, so basically 02:13:22.180 |
take our whole validation set of images and create a KNN index of the image representations, 02:13:32.220 |
because remember it's predicting things that are meant to be word vectors. 02:13:37.260 |
And now let's grab the fast text vector for boat, and boat is not an ImageNet concept. 02:13:49.380 |
And yet I can now find all of the images in my predicted word vectors in my validation 02:13:59.320 |
And it works, even though it's not something that was ever trained on. 02:14:04.340 |
What if we now take engine's vector and boat's vector and take their average? 02:14:10.140 |
And what if we now look in our nearest neighbors for that? 02:14:14.860 |
I mean, yes, this is actually a boat with an engine, it just happens to have wings on 02:14:20.860 |
By the way, sail is not an ImageNet thing, boat is not an ImageNet thing, here's the 02:14:27.700 |
average of two things that are not ImageNet things. 02:14:31.060 |
And yet, with one exception, it's found me sailboats. 02:14:38.340 |
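That experiment is just word-vector arithmetic plus the same nearest-neighbour query; assuming en_vecd is the English word-to-vector dictionary loaded earlier and img_index is an nmslib index over the predicted image vectors, it looks roughly like:

```python
import numpy as np

vec = (en_vecd['sail'] + en_vecd['boat']) / 2          # average two fast text vectors
idxs, dists = knn_batch(img_index, vec[None].astype(np.float32), k=3)
# idxs now points at the validation images whose predicted vectors are closest to that average
```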
Let's open up an image in the validation set. 02:14:46.780 |
Let's call predict_array on that image to get its kind of word-vector-like thing. 02:14:53.380 |
And let's do a nearest-neighbors search on all the other images. 02:14:57.420 |
And here's all the other images of whatever the hell that is. 02:15:04.980 |
We've trained a thing on all of ImageNet in an hour using a custom head that required 02:15:12.340 |
And these things run in like 300 milliseconds to do these searches. 02:15:16.980 |
I actually taught this basic idea last year as well, but it was in Keras and it was just 02:15:22.180 |
pages and pages and pages of code and everything took a long time and it was complicated. 02:15:27.060 |
Back then I kind of said, I can't begin to think all the stuff you could do with this. 02:15:31.420 |
I don't think anybody has really thought deeply about this yet, but I think it's fascinating. 02:15:36.880 |
And so go back and read the DeViSE paper, because Andrea had a whole bunch of other thoughts in there. 02:15:43.420 |
And now that it's so easy to do, hopefully people will dig into this now because I think