Back to Index

Lesson 11: Deep Learning Part 2 2018 - Neural Translation


Chapters

0:0 Super Convergence
1:58 One Cycle
3:57 Kubeflow
5:41 Neural Translation
9:44 Code
13:32 Basic Approach
18:18 RNN Review
19:30 Refactoring
21:39 Stacking
22:37 Training
24:48 Tokenizing
26:13 Processing
26:54 Partition
27:19 Attention Layer
32:26 End of Stream Marker
34:13 fastText
35:54 Python Dictionary
43:11 Data Loader
48:15 Encoder

Transcript

I want to start pointing out a couple of the many cool things that happened this week. One thing that I'm really excited about is we briefly talked about how Leslie Smith has a new paper out, and basically the paper takes these previous two key papers, cyclical learning rates and superconvergence, and builds on them with a number of experiments to show how you can achieve superconvergence.

Superconvergence lets you train models 5 times faster than previous stepwise approaches. It's not 5 times faster than CLR, but it's faster than CLR as well. The key is that superconvergence lets you get up to massively high learning rates, somewhere between 1 and 3, which is quite amazing. So the interesting thing about superconvergence is that you actually train at those very high learning rates for quite a large percentage of your epochs, and during that time the loss doesn't really improve very much, but the trick is it's doing a lot of searching through the space to find really generalizable areas, it seems.

We kind of had a lot of what we needed in fastai to achieve this, but we were missing a couple of bits, and so Sylvain Gugger has done an amazing job of fleshing out the pieces that we were missing, and then confirming that he has actually achieved superconvergence training on CIFAR-10.

I think this is the first time that this has been done that I've heard of outside of Leslie Smith himself. He's got a great blog post up now on 1Cycle, which is what Leslie Smith called this approach. And this is actually, it turns out, what 1Cycle looks like. It's a single cyclical learning rate, but the key difference here is that the going up bit is the same length as the going down bit, so you go up really slowly.

And then at the end, for like a tenth of the time, you then have this little bit where you go down even further. And it's interesting, obviously this is a very easy thing to show, a very easy thing to explain. Sylvain has added it to fastai; temporarily it's called use_clr_beta, and by the time you watch this on the video, it'll probably be called 1cycle or something like that.

But you can use this right now. So that's one key piece to getting these massively high learning rates. And he shows a number of experiments when you do that. A second key piece is that as you do this to the learning rate, you do this to the momentum. So when the learning rate's low, it's fine to have a high momentum.

But when the learning rate gets up really high, your momentum needs to be quite a bit lower. So this cyclical momentum is also part of what he's added to the library. And so with these two things, you can train in about a fifth of the number of epochs of a stepwise learning rate schedule.
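As a rough sketch of how you might try this with the current fastai library: the use_clr_beta argument mentioned above takes a tuple controlling the learning rate and momentum schedule. The exact values, the tuple's meaning, and the names learn and lr are illustrative assumptions here; check the library source for the precise semantics.

```python
# Hypothetical example of the 1cycle-style schedule with cyclical momentum (fastai 0.7-era API).
# The tuple is roughly (peak-LR divisor, % of the cycle spent annealing at the end,
# maximum momentum, minimum momentum) -- verify against the library source before relying on it.
learn.fit(lr, 1, cycle_len=40, wds=1e-4, use_clr_beta=(10, 10, 0.95, 0.85))
```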

Then you can drop your weight decay down by about two orders of magnitude. You can often remove most or all of your dropout, and so you end up with something that's trained faster and generalizes better. And it actually turns out that Sylvain got quite a bit better accuracy than in Leslie Smith's paper.

His guess, I was pleased to see, is that this is because our data augmentation defaults are better than Leslie's. I hope that's true. So check that out. As I say, there's been so many cool things this week, I'm just going to pick two. Hamel Husain works at GitHub. I just really like this.

There's a fairly new project called Kubeflow, which is basically TensorFlow for Kubernetes. Hamel wrote a very nice article about magical sequence-to-sequence models, building data products on that, using Kubernetes to kind of put that in production and so forth. He said that the Google Kubeflow team created a demo based on what he wrote earlier this year, directly based on the skills he learned in fast.ai, and that he will be presenting this technique at KDD.

KDD is one of the top academic conferences, so I wanted to share this as a motivation for folks to blog, which I think is a great point. None of us who go out and write a blog really think it's actually going to be very good or that anybody's going to read it, and then when people actually do like it and read it, it comes as a great surprise: oh, it's actually something people were interested to read.

So here is the tool where you can summarize GitHub issues, which is now hosted by Google on the kubeflow.org domain. So I think that's a great story: if Hamel hadn't put his work out there, none of this would have happened, and you can check out his post that made it all happen as well.

So talking of the magic of sequence-to-sequence models, let's build one. We're going to be specifically working on machine translation. Machine translation is something that's been around for a long time, but specifically we're going to look at an approach called neural translation, which is using neural networks for translation.

That wasn't really a thing in any kind of meaningful way until a couple of years ago. And so thanks to Chris Manning from Stanford for the next three slides. Chris pointed out that in 2015 neural machine translation first appeared properly, and it was pretty crappy compared to the statistical machine translation approaches that used kind of classic feature engineering and standard NLP approaches of lots of stemming and fiddling around with word frequencies and n-grams and lots of stuff.

A year later, it was better than everything else. This is on a metric called BLEU; we're not going to discuss the metric because it's not a very good metric and it's not very interesting, but it's what everybody uses. So that was BLEU as of when Chris did this slide.

As of now, it's about up here, it's about 30. So we're kind of seeing machine translation starting down the same path that we saw computer vision object classification start down in 2012, I guess, which is we kind of just surpassed the state-of-the-art and now we're zipping past it at a great rate.

It's very unlikely that anybody watching this is actually going to build a machine translation model because you can go to translate.google.com and use theirs and it works quite well. So why are we learning about machine translation? The reason we're learning about machine translation is that the general idea of taking some kind of input, like a sentence in French, and transforming it into some other kind of output of arbitrary length such as a sentence in English is a really useful thing to do.

For example, the thing that we just saw that Hamill did takes GitHub issues and turns them into summaries. Another example is taking videos and turning them into descriptions. Basically anything where you're spitting out kind of an arbitrary sized output, very often that's a sentence, so maybe taking a CT scan and spitting out a radiology report, this is where you can use sequence-to-sequence learning.

So the important thing about neural machine translation, and these are more slides from Chris, and generally about sequence-to-sequence models, is that there's no fussing around with heuristics and hacky feature engineering or whatever, it's end-to-end training. We're able to build these distributed representations which are shared by lots of concepts within a single network, we're able to use long-term state in the RNN, so use a lot more context than kind of n-gram type approaches, and in the end the text we're generating uses an RNN as well so we can build something that's more fluid.

We're going to use a bidirectional LSTM with attention, well actually we're going to use a bidirectional GRU with attention, but basically the same thing. So you already know about bidirectional recurrent neural networks, and attention we're going to add on top today. These general ideas you can use for lots of other things as well, as Chris points out on this slide.

So let's jump into the code which is in the translate notebook, funnily enough. And so we're going to try to translate French into English. And so the basic idea is that we're going to try and make this look as much like a standard neural network approach as possible. So we're going to need three things, you all remember the three things, data, a suitable architecture and a suitable loss function.

Once you've got those three things, you run fit. And all things going well, you end up with something that solves your problem. So for data, we generally need x y pairs, because we need something which we can feed into the loss function and say: I took my x value, which was my French sentence, and the loss function says it was meant to generate this English sentence, and then you have your predictions which you then compare to see how good they are.

So therefore we need lots of these tuples of French sentences with their equivalent English sentence. That's called a parallel corpus. Obviously this is harder to find than a corpus for a language model, because for a language model we just need text in some language. For any living language whose users use computers, there will be at least a few gigabytes of text floating around the internet for you to grab.

So building a language model is only challenging corpus-wise for ancient languages, one of our students is trying to do a Sanskrit one for example at the moment, but that's very rarely a problem. For translation there are actually some pretty good parallel corpuses available for European languages. The European Parliament basically has every sentence in every European language.

Anything that goes through the UN is translated to lots of languages. For French through English we have a particularly nice thing, which is that pretty much any semi-official Canadian website will have a French version and an English version. This chap Chris Callison-Burch did a cool thing which is basically to try to transform French URLs into English URLs by replacing -fr with -en and hoping that that retrieves the equivalent document, and then did that for lots and lots of websites and ended up creating a huge corpus based on millions of web pages.

So French to English we have this particularly nice resource. So we're going to start out by talking about how to create the data, then we'll look at the architecture, and then we'll look at the loss function. And so for bounding boxes all of the interesting stuff was in the loss function, but for neural translation all of the interesting stuff is going to be in the architecture.

So let's zip through this pretty quickly. One of the things I want you to think about particularly is what are the relationships and similarities, in terms of the task we're doing and how we do it, between language modeling versus neural translation. So the basic approach here is that we're going to take a sentence, so in this case the example is English to German, and this slide's from Stephen Merity, we steal everything we can from Stephen.

We start with some sentence in English, and the first step is to do basically the exact same thing we do in a language model, which is to chuck it through an RNN. Now with our language model, actually let's not even think about the language model, let's start even easier, the classification model.

So something that turns this sentence into positive or negative sentiment. We had a decoder, something which basically took the RNN output, and from our paper we grabbed three things. We took a max pool over all of the time steps, we took a mean pool over all the time steps, and we took the value of the RNN at the last time step, stuck all those together, and put it through a linear layer.

Most people don't do that in most NLP stuff, I think it's something we invented. People pretty much always use the last time step, so all the stuff we'll be talking about today uses the last time step. So we start out by chucking this sentence through an RNN, and out of it comes some state.

So some state meaning some hidden state, some vector that represents the output of an RNN that has encoded that sentence. You'll see the word that Stephen used here was encoder. We've tended to use the word backbone. So like when we've talked about adding a custom head to an existing model, like the existing pre-trained ImageNet model, for example, we say that's our backbone, and then we stick on top of it some head that does the task we want.

In sequence-to-sequence learning, they use the word encoder. But basically it's the same thing, it's some piece of a neural network architecture that takes the input and turns it into some representation which we can then stick a few more layers on top of to grab something out of it, such as we did for the classifier where we stuck a linear layer on top of it to turn it into a sentiment, positive or negative.

So this time, though, we have something that's a little bit harder than just getting sentiment, which is I want to turn this state not into a positive or negative sentiment, but into a sequence of tokens, where that sequence of tokens is the German sentence that we want. So this is sounding more like the language model than the classifier, because the language model had multiple tokens too: for every input word there was an output word. But the language model was also much easier because the number of tokens in the language model output was the same length as the number of tokens in the language model input.

And not only were they the same length, they exactly matched up. Like after word 1 comes word 2, after word 2 comes word 3 and so forth. But for translating language, you don't necessarily know that the word 'he' will be translated as the first word in the output, and that 'loved' will be the second word in the output.

In this particular case, unfortunately, they are the same. But very often the subject-object order will be different, or there will be some extra words inserted, or some pronouns will need a gendered article added to them, or whatever. So this is the key issue we're going to have to deal with: the fact that we have an arbitrary length output where the tokens in the output do not correspond to the same order of specific tokens in the input.

So the general idea is the same, use an RNN to encode the input, turns it into some hidden state and then this is the new thing we're going to learn is generating a sequence output. So we already know sequence to class, that's IMDB classifier, we already know sequence to equal length sequence where it corresponds to the same items, that's the language model for example, but we don't know yet how to do a general-purpose sequence to sequence, so that's the new thing today.

Very little of this will make sense unless you really understand lesson 6, how an RNN works. So if some of this lesson doesn't make sense to you and you find yourself wondering what does he mean by 'hidden state' exactly, how's that working, go back and rewatch lesson 6. To give you a very quick review, we learned that an RNN at its heart is a standard fully connected network, so here's one with one, two, three, four layers. It takes an input and puts it through four layers, but then at the second layer it can just concatenate in the second input, at the third layer concatenate in the third input. But we actually wrote this in Python as just literally a four-layer neural network; there was nothing else we used other than linear layers and ReLUs.

We used the same weight matrix every time an input came in, we used the same matrix every time we went from one of these states to the next, and that's why these arrows are the same color, and so we can redraw that previous thing like this. And so not only did we redraw it, but we took four lines of linear linear linear linear code in PyTorch and we replaced it with a for loop.

So remember we had something that did exactly the same thing as this, but it just had four lines of code saying linear linear linear linear, and we literally replaced it with a for loop because that's nice to refactor. So literally that refactoring, which doesn't change any of the math, any of the ideas, any of the outputs, that refactoring is an RNN, it's turning a bunch of separate lines in the code into a Python for loop.
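To make that refactoring concrete, here's a minimal sketch (l_in, l_hidden and l_out are hypothetical nn.Linear layers, x1, x2, x3 are the successive inputs, and bs and n_hidden are assumed to be defined):

```python
# Before: the "unrolled" version -- the same two weight matrices reused at every step
h = torch.zeros(bs, n_hidden)
h = torch.tanh(l_hidden(h) + l_in(x1))
h = torch.tanh(l_hidden(h) + l_in(x2))
h = torch.tanh(l_hidden(h) + l_in(x3))
out = l_out(h)

# After: exactly the same maths, just refactored into a for loop -- that refactoring is the RNN
h = torch.zeros(bs, n_hidden)
for x in (x1, x2, x3):
    h = torch.tanh(l_hidden(h) + l_in(x))
out = l_out(h)
```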

And so that's how we can draw it. We could take the output so that it's not outside the loop and put it inside the loop, like so. And if we do that, we're now going to generate a separate output for every input. So in this case, this particular one here, the hidden state gets replaced each time and we end up just spitting out the final hidden state.

So this one is this example, but if instead we had something that said hs.append(h) and returned hs at the end, that would be this picture. And so go back and relook at that notebook if this is unclear. I think the main thing to remember is when we say hidden state, we're referring to a vector.

See here, here's the vector, h = torch.zeros(n_hidden). Now of course it's a vector for each thing in the mini-batch, so it's a matrix. But generally when I speak about these things, I ignore the mini-batch piece and treat it as just a single item. So it's just a vector of this length.

We also learned that you can stack these layers on top of each other. So rather than this first RNN spitting out output, it could just spit out inputs into a second RNN. And if you're thinking at this point, "I think I understand this, but I'm not quite sure," if you're anything like me, that means you don't understand this.

And the only way you know that you actually understand it is to go and write this from scratch in PyTorch or NumPy. And if you can't do that, then you don't understand it. Go back and rewatch Lesson 6, check out the notebook and copy some of the ideas until you can write it from scratch; it's really important that you can.
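If you want something to check yourself against, here's one possible from-scratch sketch in plain PyTorch (my own minimal version, not the lesson 6 notebook code); it happens to be the two-layer variant discussed next:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerRNN(nn.Module):
    """Minimal 2-layer RNN: layer 1's hidden state feeds layer 2 at every time step."""
    def __init__(self, n_in, n_hidden, n_out):
        super().__init__()
        self.n_hidden = n_hidden
        self.l_in  = nn.Linear(n_in, n_hidden)       # input -> layer-1 hidden
        self.l_h1  = nn.Linear(n_hidden, n_hidden)   # layer-1 hidden -> hidden (reused each step)
        self.l_12  = nn.Linear(n_hidden, n_hidden)   # layer-1 hidden -> layer-2 input
        self.l_h2  = nn.Linear(n_hidden, n_hidden)   # layer-2 hidden -> hidden (reused each step)
        self.l_out = nn.Linear(n_hidden, n_out)

    def forward(self, xs):                 # xs: (seq_len, batch_size, n_in)
        bs = xs.size(1)
        h1 = xs.new_zeros(bs, self.n_hidden)
        h2 = xs.new_zeros(bs, self.n_hidden)
        outs = []
        for x in xs:                       # the for loop *is* the recurrence
            h1 = torch.tanh(self.l_h1(h1) + self.l_in(x))
            h2 = torch.tanh(self.l_h2(h2) + self.l_12(h1))
            outs.append(F.log_softmax(self.l_out(h2), dim=-1))
        return torch.stack(outs)           # one output per input, like the unrolled diagram
```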

It's less than a screen of code. So you want to make sure you create a two-layer RNN. And this is what it looks like if you unroll it. So that's the goal, is to get to a point that we first of all have these X, Y pairs of sentences, and we're going to do French to English.

So we're going to start by downloading this dataset, and training a translation model takes a long time. Google's translation model has 8 layers of RNN stacked on top of each other. There's no conceptual difference between 8 layers and 2 layers, it's just that if you're Google and you have more GPUs or TPUs than you know what to do with, then you're fine doing that, whereas in our case it's pretty likely that the kind of sequence-to-sequence models we're building are not going to require that level of computation.

So to keep things simple, let's do a cut-down thing where rather than learning how to translate French into English for any sentence, let's learn to translate French questions into English questions. And specifically questions that start with what, where, which, when. So you can see here I've got a regex that looks for things that start with wh and end with a question mark.

So I just go through the corpus, open up each of the two files, each line is one parallel text, zip them together, grab the English question, the French question, and check whether they match the regular expressions. Dump that out as a pickle so that I don't have to do it again.
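Roughly, that corpus-filtering step might look like the following sketch; the file names and the exact regexes here are illustrative rather than the notebook's originals:

```python
import pickle
import re

re_eq = re.compile(r'^(Wh[^?.!]+\?)')    # English questions starting with "Wh..."
re_fq = re.compile(r'^([^?.!]+\?)')      # any French question

en_fname, fr_fname = 'corpus.en', 'corpus.fr'   # assumed file names for the parallel text
lines = ((re_eq.search(eq), re_fq.search(fq))
         for eq, fq in zip(open(en_fname, encoding='utf-8'), open(fr_fname, encoding='utf-8')))
qs = [(e.group(), f.group()) for e, f in lines if e and f]   # keep pairs where both matched
pickle.dump(qs, open('fr-en-qs.pkl', 'wb'))                  # cache so we only do this once
```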

And so we now have 52,000 sentences, and here are some examples of those sentence pairs. One nice thing about this is that what, who, where type questions tend to be fairly short, which is nice. But I would say the idea that we could learn from scratch with no previous understanding of the idea of language, let alone of English or of French, that we could create something that can translate one to the other for any arbitrary question with only 50,000 sentences, sounds like a ludicrously difficult thing to ask this to do.

So I would be impressed if we could make any progress whatsoever. This is very little data to do a very complex exercise. So this contains the tuples of French and English. You can use this handy idiom to split them apart into a list of English questions and a list of French questions.
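The handy idiom referred to here is just zip with argument unpacking, something like:

```python
qs = pickle.load(open('fr-en-qs.pkl', 'rb'))
en_qs, fr_qs = zip(*qs)   # list of (English, French) tuples -> two tuples, one per language
```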

And then we tokenize the English questions and we tokenize the French questions. So remember that just means splitting them up into separate words or word-like things. By default, the tokenizer that we have here, and remember this is a wrapper around the spaCy tokenizer, which is a fantastic tokenizer. This wrapper by default assumes English, so to ask for French, you just add an extra parameter.

The first time you do this, you'll get an error saying that you don't have the spaCy French model installed, and you can google to get the command, something like python -m spacy download fr, to grab the French model. I don't think any of you are going to have RAM problems here because this is not a particularly big corpus, but I know that some of you were trying to train new language models during the week and were having RAM problems.

If you do, it's worth knowing what these functions are actually doing. So for example, this one here is processing every sentence across multiple processes. And remember, fastai code is designed to be pretty easy to read, so here's the three lines of code of proc_all_mp: find out how many CPUs you have, divide by two, because normally with hyperthreading they don't actually all work in parallel, then in parallel run this proc_all function.

So that's going to spit out a whole separate Python process for every CPU you have. If you have a lot of cores, a lot of Python processes, every one is going to load all this data in and that can potentially use up all your RAM. So you could replace that with just proc_all rather than proc_all_mp to use less RAM.

Or you could just use fewer cores. So at the moment we're calling this function partition_by_cores, which calls partition on a list and asks to split it into a number of equal length things according to how many CPUs you have, so you could replace that by splitting it into a smaller list and running it on fewer things.
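For reference, those few lines inside the fastai text module look roughly like this; I'm reconstructing them from memory, so treat it as a sketch of the idea (Tokenizer.proc_all is the fastai per-process worker) rather than the exact library source:

```python
import os
from concurrent.futures import ProcessPoolExecutor

def proc_all_mp(sentence_chunks, lang='en'):
    """Sketch of fastai's parallel tokenization: one Python process per chunk of sentences."""
    ncpus = (os.cpu_count() or 2) // 2                 # halve: hyperthreaded cores don't help here
    with ProcessPoolExecutor(ncpus) as e:              # each worker loads its own copy of the data
        return sum(e.map(Tokenizer.proc_all, sentence_chunks, [lang] * len(sentence_chunks)), [])

# Lower-RAM alternatives: call Tokenizer.proc_all(...) directly in a single process,
# or partition the sentences into fewer chunks than you have cores.
```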

Yes, Rachel. Was an attention layer tried in the language model? Do you think it would be a good idea to try and add one? We haven't learned about attention yet, so let's ask about things that we have got to, not things we haven't. The short answer is no, I haven't tried it properly, yes, you should try it because it might help.

In general, there's going to be a lot of things that we cover today, which if you've done some sequence-to-sequence stuff before, you'll want to know about something we haven't covered yet. I'm going to cover all the sequence-to-sequence things. So at the end of this, if I haven't covered the thing you wanted to know about, please ask me then.

If you ask me before, I'll be answering something based on something I'm about to teach you. So having tokenized the English and French, you can see how it gets split out. You can see that tokenization for French is quite different looking because French loves their apostrophes and their hyphens and stuff.

So if you try to use an English tokenizer for a French sentence, you're going to get a pretty crappy outcome. I don't find you need to know heaps of NLP ideas to use deep learning for NLP, but just some basic stuff, like using the right tokenizer for your language, is important.

And so some of the students this week in our study group have been trying to build language models for Chinese, for instance, which of course doesn't really have the concept of a tokenizer in the same way. So we've been starting to look at, as briefly mentioned last week, this Google thing called SentencePiece, which basically splits things into arbitrary subword units.

And so when I say tokenize, if you're using a language that doesn't have spaces in it, you should probably be checking out SentencePiece or some other similar subword unit thing instead. And hopefully in the next week or two we'll be able to report back with some early results of these experiments with Chinese.

So having tokenized it, we'll save that to disk. And then remember the next step after we create tokens is to turn them into numbers. And to turn them into numbers, we have two steps. The first is to get a list of all of the words that appear, and then we turn every word into the index into that list.

If there are more than 40,000 words that appear, then let's cut it off there so it doesn't get too crazy. And we insert a few extra tokens for beginning of stream, padding, end of stream, and unknown. So if we try to look up something that wasn't in the 40,000 most common, then we use a defaultdict to return 3, which is unknown.
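Put together, that numericalization step looks roughly like this sketch; the helper name and the special-token order (bos/pad/eos/unk as indices 0 to 3) are assumptions in the spirit of the notebook rather than its exact code:

```python
import collections
from collections import Counter
import numpy as np

def toks2ids(tok_sents):
    freq = Counter(t for sent in tok_sents for t in sent)
    itos = [t for t, c in freq.most_common(40000)]          # cap the vocab at 40,000 tokens
    itos.insert(0, '_bos_')                                  # 0: beginning of stream
    itos.insert(1, '_pad_')                                  # 1: padding
    itos.insert(2, '_eos_')                                  # 2: end of stream
    itos.insert(3, '_unk_')                                  # 3: unknown
    stoi = collections.defaultdict(lambda: 3, {t: i for i, t in enumerate(itos)})  # unseen -> unknown
    ids = np.array([[stoi[t] for t in sent] + [2] for sent in tok_sents], dtype=object)  # append eos (2)
    return ids, itos, stoi
```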

So now we can go ahead and turn every token into an id by putting it through the string to integer dictionary we just created. And then at the end of that, let's add the number 2, which is end of stream. And you'll see the code you see here is the code I write when I'm iterating and experimenting.

Because 99% of the code I write when I'm iterating and experimenting turns out to be totally wrong or stupid or embarrassing and you don't get to see it. But there's no point refactoring that and making it beautiful when I'm writing it. So I was wanting you to see all the little shortcuts I have.

So rather than doing this properly and having some constant or something for end of stream marker and using it, when I'm prototyping, I just do the easy stuff. Not so much that I end up with broken code, but I try to find some mid-ground between beautiful code and code that works.

You just mentioned that we divide the number of CPUs by 2 because with hyperthreading we don't get a speed-up using all the hyperthreaded cores. Is this based on practical experience, or is there some underlying reason why we wouldn't get additional speed-up? Yeah, it's just practical experience.

It seems to vary by task, but I definitely noticed that with tokenization, hyperthreading seems to slow things down a little bit. Also, if I use all the cores, often I want to do something else at the same time, like generally run some interactive notebook, and I don't have any spare room to do that.

It's a minor issue. So now for our English and our French, we can grab our list of IDs. And when we do that, of course, we need to make sure that we also store the vocabulary. There's no point having IDs if we don't know what the number 5 represents.

There's no point having a number 5. So that's our vocabulary, the list of strings, and the reverse mapping, string to int, that we can use to numericalize more text in the future. So just to confirm it's working, we can go through each ID, convert the int to a string and spit that out, and there we have our thing back, now with an end-of-stream marker at the end.

Our English vocab is 17,000, our French vocab is 25,000. So this is not too complex a vocab that we're dealing with, which is nice to know. So we spent a lot of time on the forums during the week discussing how pointless word vectors are and how you should stop getting so excited about them, we're now going to use them.

Why is that? Basically, all the stuff we've been learning about using language models and pre-trained proper models rather than pre-trained linear single layers, which is what word vectors are, I think applies equally well to sequence-to-sequence, but I haven't tried it yet. So Sebastian and I are starting to look at that, slightly distracted by preparing this class at the moment, but after this class is done.

So there's a whole thing, for anybody interested in creating some genuinely new, highly publishable results, the entire area of sequence-to-sequence with pre-trained language models hasn't been touched yet, and I strongly believe it's going to be just as good as classification stuff. And if you work on this and you get to the point where you have something that's looking exciting and you want help publishing it, I'm very happy to help co-author papers on stuff that's looking good.

So feel free to reach out if and when you have some interesting results. So at this stage, we don't have any of that, so we're going to use very little fast.ai actually, and very little in terms of fast.ai ideas. So all we've got is word vectors. Anyway, so let's at least use decent word vectors.

So word2vec is a very old set of word vectors. There are better word vectors now, and fastText is a pretty good source of word vectors. There are hundreds of languages available for them; your language is likely to be represented. So to grab them, you can click on this link, download the word vectors for a language that you're interested in, and install the fastText Python library.

It's not available on PyPI, but here's a handy trick. If there is a GitHub repo that has a setup.py in it and a requirements.txt in it, you can just chuck git+ at the start and then stick that in your pip install and it works. Hardly anybody seems to know this, and if you go to the fastText repo, they won't tell you this.

They'll say you have to download it and CD into it and blah blah blah, but you don't, you can just run that. You can also use this for the fast.ai library, by the way. If you want to pip install the latest version of fast.ai, you can totally do this. So you grab the library, import it, load the model, so here's my English model, and here's my French model.

You'll see there's a text version and a binary version, the binary version's a bit faster, we're going to use that, the text version's also a bit buggy. And then I'm going to convert it into a standard Python dictionary to make it a bit easier to work with, so this is just going to go through each word with a dictionary comprehension and save it as a pickled dictionary.

So now we've got our pickled dictionary. We can go ahead and look up a word, for example, comma, and that will return a vector. The length of that vector is the dimensionality of this set of word vectors, so in this case we've got 300-dimensional English and French word vectors.
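As a sketch, loading the fastText binaries and converting them to plain pickled dictionaries might look like this; the file names and output paths are assumptions, and depending on the version the module may be installed as fasttext rather than fastText:

```python
# pip install git+https://github.com/facebookresearch/fastText.git   (the git+ trick from above)
import pickle
import fastText as ft   # may be 'import fasttext' in newer versions of the bindings

en_model = ft.load_model('wiki.en.bin')   # assumed names for the downloaded binary vector files
fr_model = ft.load_model('wiki.fr.bin')

def to_dict(model, out_path):
    # plain {word: 300-d numpy vector} dictionary, pickled so we only pay this cost once
    vecd = {w: model.get_word_vector(w) for w in model.get_words()}
    pickle.dump(vecd, open(out_path, 'wb'))
    return vecd

en_vecd = to_dict(en_model, 'wiki.en.pkl')
fr_vecd = to_dict(fr_model, 'wiki.fr.pkl')

dim_en_vec = len(en_vecd[','])            # 300-dimensional, as discussed
```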

For reasons that you'll see in a moment, I also want to find out what the mean of my vectors is and what the standard deviation of my vectors is. So the mean's about zero and the standard deviation is about 0.3. Often corpora have a pretty long-tailed distribution of sequence length, and it's the longest sequences that kind of tend to overwhelm how long things take and how much memory is used and stuff like that.

So I'm going to grab, in this case, the 99th and 97th percentile of the English and French and truncate them to that amount. Originally I was using the 90th percentile, so these are poorly named variables, so apologies for that. So that's just truncating them. So we're nearly there. We've got our tokenized, numericalized English and French dataset.
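The truncation step is just a couple of lines, something like the following sketch (the variable names, including the admittedly poorly named _90 suffix, are assumptions):

```python
enlen_90 = int(np.percentile([len(o) for o in en_ids], 99))   # 99th percentile of English lengths
frlen_90 = int(np.percentile([len(o) for o in fr_ids], 97))   # 97th percentile of French lengths
en_ids_tr = np.array([o[:enlen_90] for o in en_ids], dtype=object)   # chop off the long tail
fr_ids_tr = np.array([o[:frlen_90] for o in fr_ids], dtype=object)
```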

We've got some word vectors. So now we need to get it ready for PyTorch. So PyTorch expects a dataset object, and hopefully by now you all can tell me that a dataset object requires two things, a length and an indexer. So I started out writing this, and I was like, "Okay, I need a seq2seq dataset." I started out writing it, and I thought, "Okay, we're going to have to pass it our x's and our y's and store them away, and then my indexer is going to need to return a numpy array of the x's at that point and a numpy array of the y's at that point, and oh, that's it." So then after I wrote this, I realized I haven't really written a seq2seq dataset, I've just written a totally generic dataset.

So here's the simplest possible dataset that works for any pair of arrays. So it's now poorly named, it's much more general than a seq2seq dataset, but that's what I needed it for. This A function (remember we've got V for variables, T for tensors, A for arrays) basically goes through each of the things you pass it, and if it's not already a numpy array, it converts it into a numpy array and returns back a tuple of all of the things that you passed it, which are now guaranteed to be numpy arrays.
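Here's roughly what that generic dataset looks like; A is the fastai helper just described, so this is a sketch rather than a fully standalone snippet:

```python
from torch.utils.data import Dataset

class Seq2SeqDataset(Dataset):
    """Despite the name, this works for any pair of arrays: just a length and an indexer."""
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __getitem__(self, idx):
        return A(self.x[idx], self.y[idx])   # A(...) coerces each item to a numpy array, returns a tuple
    def __len__(self):
        return len(self.x)
```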

So A, V, and T are three very handy little functions. So that's it, that's our dataset. So now we need to grab our English and French IDs and get a training set and a validation set. And so one of the things which is pretty disappointing about a lot of code out there on the internet is that they don't follow some simple best practices.

For example, if you go to the PyTorch website, they have an example section for sequence-to-sequence translation. Their example does not have a separate validation set. I tried it, training according to their settings, and tested it with a validation set and it turned out that it overfit massively. So this is not just a theoretical problem, the actual PyTorch repo has the actual official sequence-to-sequence translation example, which does not check for overfitting and overfits horribly.

Also it fails to use minibatches, so it actually fails to utilize any of the efficiency of PyTorch whatsoever. Even if you find code in the official PyTorch repo, don't assume it's any good at all. The other thing you'll notice is that pretty much every other sequence-to-sequence model I've found in PyTorch anywhere on the internet has clearly copied from that shitty PyTorch repo because it's all the same variable names, it has the same problems, it has the same mistakes.

Like another example, nearly every PyTorch convolutional neural network I've found does not use an adaptive pooling layer. So in other words, the final layer is always like average_pool(7,7). So they assume that the previous layer is 7x7, and if you use any other size input you get an exception. And therefore nearly everybody I've spoken to that uses PyTorch thinks that there is a fundamental limitation of CNNs that they are tied to the input size.

And that has not been true since VGG. So every time we grab a new model and stick it in the fastai repo, I have to go in, search for pool and add adaptive to the start and replace the 7 with a 1, and now it works on any sized object.

So just be careful, it's still early days, and believe it or not, even though most of you have only started in the last year your deep learning journey, you know quite a lot more about a lot of the more important practical aspects than the vast majority of people that are publishing and writing stuff in official repos.

So you kind of need to have a little more self-confidence than you might expect when it comes to reading other people's code. If you find yourself thinking that looks odd, it's not necessarily you, right? It might well be them. So I would say like at least 90% of deep learning code that I start looking at turns out to have like deathly serious problems that make it completely unusable for anything.

And so I've been telling people that I've been working with recently, if a repo you're looking at doesn't have a section on it saying here's the test we did where we got the same results as the paper that this is meant to be implementing, that almost certainly means they haven't got the same results as the paper they're implementing, they probably haven't even checked.

And if you run it, it definitely won't get those results because it's hard to get things right the first time. It probably takes me 12 goes, probably takes normal smarter people than me 6 goes, but if they haven't tested it once, it almost certainly won't work. So there's our sequence data set.

Let's get the training and validation sets. Here's an easy way to do that. Grab a bunch of random numbers, one for each row of your data. See if they're bigger than 0.1 or not. That gets you a list of bools. Index into your array with that list of bools to grab a training set.

Index into that array with the opposite of that list of bools to get your validation set. There's lots of ways of doing it, I just like to do different ways to see a few approaches. So now we can create our dataset with our X's and our Y's, French and English.
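Concretely, that split and dataset creation looks something like this (the 0.1 threshold gives a roughly 90/10 split; the variable names are assumed):

```python
np.random.seed(42)
keep = np.random.rand(len(en_ids_tr)) > 0.1          # one bool per row: True about 90% of the time
fr_trn, en_trn = fr_ids_tr[keep],  en_ids_tr[keep]   # index with the bools -> training set
fr_val, en_val = fr_ids_tr[~keep], en_ids_tr[~keep]  # index with the opposite -> validation set

trn_ds = Seq2SeqDataset(fr_trn, en_trn)              # French in, English out
val_ds = Seq2SeqDataset(fr_val, en_val)
```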

If you want to translate instead English to French, switch these two around and you're done. Now we need to create data loaders. We can just grab our data loader and pass in our data set and batch size. We actually have to transpose the arrays. I'm not going to go into the details about why, we can talk about it during the week if you're interested, but have a think about why we might need to transpose their orientation.

But there's a few more things I want to do. One is that since we've already done all the pre-processing, there's no point spawning off multiple workers to do augmentation or whatever because there's no work to do. So making num_workers=1 will save you some time. We have to tell it what our padding index is; that's actually pretty important, because what's going to happen is that we've got different length sentences and fastai will just automatically stick them together and pad the shorter ones so that they'll end up equal length.

Because remember a tensor has to be rectangular. In the decoder in particular, I actually want my padding to be at the end, not at the start. For a classifier, I want the padding at the start because I want that final token to represent the last word of the movie review.

But in the decoder, as you'll see, it's going to work out a bit better to have padding at the end. And then finally, since we've got sentences of different lengths coming in and they all have to be put together in a mini-batch to be the same size by padding, we would much prefer that the sentences in a mini-batch are of similar sizes already because otherwise it's going to be as long as the longest sentence and that's going to end up wasting time and memory.

So therefore I'm going to use the sampler trick that we learned last time. For the validation set, we're going to ask it to sort everything by length first. And then for the training set, we're going to ask it to randomize the order of things but to roughly make it so that things of similar length are about in the same spot.

So we've got our SortSampler and our SortishSampler. And then at that point, we can create a model_data object. For a model_data object, it really does one thing, which is it says I have a training set and a validation set and an optional test set, and sticks them into a single object.

We also have a path so that it has somewhere to store temporary files, models, stuff like that. So we're not using fast.ai for very much at all in this example, just a minimal set to show you how to get your model_data object. In the end, once you've got a model_data object, you can then create a learner and you can then call fit.
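Putting the data pipeline together, it looks roughly like this with the fastai DataLoader and samplers; the argument names are from the 0.7-era library as I remember them, so double-check them against the source:

```python
bs = 125

trn_samp = SortishSampler(en_trn, key=lambda x: len(en_trn[x]), bs=bs)  # roughly sorted by length
val_samp = SortSampler(en_val, key=lambda x: len(en_val[x]))            # fully sorted by length

trn_dl = DataLoader(trn_ds, bs, transpose=True, transpose_y=True, num_workers=1,
                    pad_idx=1, pre_pad=False, sampler=trn_samp)         # pad at the end, not the start
val_dl = DataLoader(val_ds, int(bs * 1.6), transpose=True, transpose_y=True, num_workers=1,
                    pad_idx=1, pre_pad=False, sampler=val_samp)

md = ModelData(PATH, trn_dl, val_dl)    # bundles the training and validation loaders with a path
```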

So that's a minimal amount of fast.ai stuff here. This is a standard PyTorch compatible dataset. This is a standard PyTorch compatible data loader. Behind the scenes, it's actually using the fast.ai version because I do need to do this automatic padding for convenience. So there's a few tweaks in our version that are a bit faster and a bit more convenient.

The fast.ai samplers we're using, but there's not too much going on here. So now we've got our model_data object. We can basically tick off number one. So as I said, most of the work is in the architecture. And so the architecture is going to take our sequence of tokens.

It's going to spit them into an encoder, or in computer vision terms, what we've been calling a backbone, something that's going to try and turn this into some kind of representation. That's just going to be an RNN. That's going to spit out the final hidden state, which for each sentence is just a vector.

None of this is going to be new. That's all going to be using very direct simple techniques that we've already learnt. And then we're going to take that and we're going to spit it into a different RNN, which is a decoder, and that's going to have some new stuff because we need something that can go through one word at a time.

And it's going to keep going until it thinks it's finished the sentence, it doesn't know how long the sentence is going to be ahead of time, it keeps going until it thinks it's finished the sentence, and then it stops and returns the sentence. So let's start with the encoder.

So in terms of variable naming here, there's basically identical variables for encoder and decoder, well attributes for encoder and decoder. The encoder versions have 'enc', the decoder versions have 'dec'. So for the encoder, here's our embeddings. And so I always try to mention what the mnemonics are, rather than writing things out in too long hand.

So just remember, 'enc' is an encoder, 'dec' is a decoder, and there's an embedding. The final thing that comes out is 'out'. The RNN, in this case, is a GRU, not an LSTM, they're nearly the same thing. So don't worry about the difference, you could replace it with an LSTM and you'll get basically the same results.

To replace it with an LSTM, simply type LSTM and you're done. So we need to create an embedding layer to take -- because remember what we're being passed is the index of the words into a vocabulary, and we want to grab their fast text embedding, and then over time we might want to also fine-tune to train that embedding into it.

So to create an embedding, we'll create the embedding up here, so we'll just say nn.Embedding. So it's important that you know now how to set the rows and columns for your embedding. So the number of rows has to be equal to your vocabulary size, so each vocabulary item has a word vector.

And how big is your embedding? Well in this case it was determined by fastText, and the fastText embeddings are size 300. So we have to use size 300 as well, otherwise we can't start out by using their embeddings. So this is initially going to give us a random set of embeddings, and we're going to now go through each one of these, and if we find it in fastText we'll replace it with the fastText embedding.

So again, something that you should already know is that a PyTorch module that is learnable has a weight attribute, and the weight attribute is a variable, and the variables have a data attribute, and the data attribute is a tensor. And you'll notice very often today I'm saying here is something you should know, not so that you think oh I don't know that, I'm a bad person, but so that you think okay, this is a concept that I haven't learned yet and Jeremy thinks I ought to know about, so I've got to write that down, go home, and google it. This is a normal PyTorch attribute of every single learnable PyTorch module.

This is a normal PyTorch attribute in every single PyTorch variable. And so if you don't know how to grab the weights out of a module, or you don't know how to grab the tensor out of a variable, it's going to be hard for you to build new things or debug things or maintain things or whatever.

So if I say you ought to know this, and you're thinking I don't know this, don't run away and hide, go home and learn the thing, and if you're having trouble learning the thing, because you can't find documentation about it, or you don't understand that documentation, or you don't know why Jeremy thought it was important you know it, jump on the forum and say please explain this thing, here's my best understanding of that thing as I have it at the moment, here's the resources I've looked at, help fill me in.

And normally if I respond, it's very likely I will not tell you the answer, but I will instead give you a problem that you could solve that if you solve it will solve it for you because I know that that way it will be something you remember. So again, don't be put off if I'm like okay, go read this link, try and summarize that thing, tell us what you think, like I'm trying to be helpful, not unhelpful, and if you're still not following, just come back and say I had a look, honestly that link you sent, I don't know what any of it means, I wouldn't know where to start, whatever.

I'll keep trying to help you until you fully understand it. So now that we've got our weight tensor, we can just go through our vocabulary and we can look up the word in our pre-trained vectors, and if we find it we will replace the random weights with that pre-trained vector.

The random weights have a standard deviation of 1; our pre-trained vectors, it turned out, had a standard deviation of about 0.3, so again this is the kind of hacky thing I do when I'm prototyping stuff, I just multiply it by 3. Obviously by the time you see the video of this and we've been able to put all this sequence-to-sequence stuff into the fastai library, you won't find horrible hacks like that in there, I sure hope, but hack away when you're prototyping.
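Here's a sketch of that embedding-creation step along the lines just described (the *3 is the standard-deviation hack mentioned above; the function name and print are illustrative):

```python
def create_emb(vecs, itos, em_sz=300):
    emb = nn.Embedding(len(itos), em_sz, padding_idx=1)     # rows = vocab size, cols = fastText dim
    wgts = emb.weight.data                                  # module -> weight (parameter) -> tensor
    miss = []
    for i, w in enumerate(itos):
        try:
            wgts[i] = torch.from_numpy(vecs[w] * 3)         # scale std ~0.3 up towards the random init's 1
        except KeyError:
            miss.append(w)                                  # not in fastText: keep the random vector
    print(len(miss), miss[5:10])                            # just to see what's missing while prototyping
    return emb
```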

Some things won't be in fastText, in which case we'll just keep track of them, and I've just added this print statement here just so that I can kind of see why I am missing stuff; basically I'll probably comment it out when I actually commit this to GitHub. So we create those embeddings, and when we actually create the sequence-to-sequence RNN, it will print out how many were missed, and remember we had about 30,000 words, so we're not missing too many.

And interestingly, the things that are missing: well, there's our special token for uppercase, not surprising that's missing, but also remember fastText works on words, not on our tokens, so l', d', and 's aren't appearing either. So that's interesting. That does suggest that maybe we could have slightly better embeddings if we tried to find some which were tokenized the same way we tokenize, but that's okay.

Do we just keep the embedding vectors from training? Why don't we keep all word embeddings in case you have new words in the test set? We're going to be fine-tuning them, and so I don't know, it's an interesting idea. Maybe that would work, I haven't tried it. Obviously you could also add random embeddings for those and just keep them random at the beginning, but it's only going to have an effect to the extent that you actually end up using those words.

I think it's an interesting line of inquiry, but I will say this. The vast majority of the time when you're doing this in the real world, your vocabulary will be bigger than 40,000, and once your vocabulary is bigger than 40,000, using the standard techniques, the embedding layers get so big that they take up all your memory and all of the time in the backprop.

There are tricks to dealing with very large vocabularies; I don't think we'll have time to handle them in this session, but you definitely would not want to have all 3.5 million fastText vectors in an embedding layer. I wonder, if you're not touching a word, it's not going to change, given you're fine-tuning?

It's in GPU RAM, and you've got to remember, 3.5 million times 300 times the size of a single-precision floating-point number, plus all of the gradients for them, even if it's not touched. Without being very careful and adding a lot more code and stuff, it is slow and hard and we wouldn't touch it for now.

I think it's an interesting path of inquiry, but it's the kind of path of inquiry that leads to multiple academic papers, not something that you do on a weekend. I think it would be very interesting, maybe we can look at it sometime. As I say, I have actually started doing some stuff around incorporating large vocabulary handling into fast.ai.

It's not finished, but hopefully by the time we get here, this kind of stuff will be possible. We create our encoder embedding, add a bit of dropout, and then we create our RNN. This input to the RNN obviously is the size of the embedding by definition. Number of hidden is whatever we want, so we set it to 256 for now, however many layers we want, and some dropout inside the RNN as well.

This is all standard PyTorch stuff, you could use an LSTM here as well, and then finally we need to turn that into some output that we're going to feed to the decoder, so let's use a linear layer to convert the number of hidden into the decoder embedding size. In the forward pass, here's how that's used.

We first of all initialize our hidden state to a bunch of zeros. So we've now got a vector of zeros, and then we're going to take our input and put it through our embedding, we're going to put that through dropout, we then pass our currently zeros hidden state and our embeddings into our RNN, and it's going to spit out the usual stuff that RNN spit out, which includes the final hidden state.

We're then going to take that final hidden state and stick it through that linear layer, so we now have something of the right size to feed to our decoder. So that's it, and again this ought to be very familiar and very comfortable, it's like the most simple possible RNN, so if it's not, go back, check out lesson 6, make sure you can write it from scratch and you understand what it does.
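To make the shapes concrete, here's a simplified encoder along those lines, written in current PyTorch style rather than the Variable-era API of the original notebook, so it's a sketch rather than the exact class from the lesson:

```python
class Encoder(nn.Module):
    """Embed -> dropout -> GRU -> linear, returning a hidden state sized for the decoder."""
    def __init__(self, emb_enc, nh, nl, em_sz_dec):
        super().__init__()
        self.emb_enc = emb_enc                              # e.g. the embedding from create_emb above
        self.emb_enc_drop = nn.Dropout(0.15)
        self.gru_enc = nn.GRU(emb_enc.embedding_dim, nh, num_layers=nl, dropout=0.25)
        self.out_enc = nn.Linear(nh, em_sz_dec, bias=False) # resize hidden state for the decoder
        self.nl, self.nh = nl, nh

    def forward(self, inp):                                 # inp: (seq_len, batch) of token ids
        bs = inp.size(1)
        h = inp.new_zeros(self.nl, bs, self.nh, dtype=torch.float)  # hidden state starts as zeros
        emb = self.emb_enc_drop(self.emb_enc(inp))          # look up word vectors, apply dropout
        enc_out, h = self.gru_enc(emb, h)                   # run the whole sequence through the GRU
        return self.out_enc(h)                              # final hidden state, ready for the decoder
```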

But the key thing to know is that it takes our inputs and spits out a hidden vector that hopefully will learn to contain all of the information about what that sentence says and how it says it. Because if it can't do that, then we can't feed it into a decoder and hope it to spit out our sentence in a different language, so that's what we want it to learn to do.

And we're not going to do anything special to make it learn to do that, we're just going to do the three things and cross our fingers because that's what we do. So that h is that s, it's a hidden state. I guess Stephen used s for state, I used h for hidden, but there you go, you would think that two Australians could agree on something like that, but apparently not.

So how do we now do the new bit? And so the basic idea of the new bit is the same, we're going to do exactly the same thing, but we're going to write our own for loop. And so the for loop is going to do exactly what the for loop inside pytorch does here, but we're going to do it manually.

So we're going to go through the for loop, and how big is the for loop? It's an output sequence length. Well what is output sequence length? It's something that got passed to the constructor, and it is equal to the length of the largest English sentence. So we're going to do this for loop as long as the largest English sentence, because we're translating it into English, so we can't possibly be longer than that.

At least not in this corpus. If we then used it on some different corpus that was longer, this is going to fail. You could always pass in a different parameter, of course. So the basic idea is the same, we're going to go through and put it through the embedding, we're going to stick it through the RNN, we're going to stick it through dropout, and we're going to stick it through a linear layer.

So the basic four steps are the same. And once we've done that, we're then going to append that output to a list, and then when we're going to finish, we're going to stack that list up into a single tensor and return it. So that's the basic idea. Normally a recurrent neural network works in a whole sequence at a time, but we've got a for loop to go through each part of the sequence separately.

So we have to add a leading unit axis to the start to basically say this is a sequence of length 1. So we're not really taking advantage of the recurrent net much at all, we could easily rewrite this with a linear layer. That would be an interesting experiment if you wanted to try it.

So we basically take our input and we feed it into our embedding, and we add something to the front saying treat this as a sequence of length 1, and then we pass that to our RNN. We then get the output of that RNN, feed it into our dropout, and feed it into our linear layer.

So there's two extra things now to be aware of. The one thing is, what's this? What is the input to that embedding? And the answer is, it's the previous word that we translated. See how the input here is the previous word here? The input here is the previous word here.

So the basic idea is, if you're trying to translate, if you're about to translate, tell me the fourth word of the new sentence, but you don't know what the third word you just said was, that's going to be really hard. So we're going to feed that in at each time step, let's make it as easy as possible.

And so what was the previous word at the start? Well there was none. So specifically we're going to start out with a beginning of stream token. So the beginning of stream token is a zero. So let's start out our decoder with a beginning of stream token, which is zero.

And of course we're doing a mini-batch, so we need batch size number of them. But let's just think about one part of that batch. So we start out with a zero. We look up that zero in our embedding matrix to find out what the vector for the beginning of stream token is.

We stick a unit axis on the front to say we have a single sequence length of beginning of stream token. We stick that through our RNN, which gets not only the fact that there's a zero at the beginning of stream, but also the hidden state which at this point is whatever came out of our encoder.

So now its job is to try and figure out what is the first word to translate this sentence. Put that through dropout, go through one linear layer in order to convert that into the correct size for our decoder embedding matrix, append that to our list of translated words, and now we need to figure out what word that was because we need to feed it to the next time step.

We need to feed it to the next time step. So remember what we actually output here, and don't forget, use a debugger, pdb.set_trace(), put it here: what is outp? outp is a tensor. How big is the tensor? So before you look it up in the debugger, try and figure it out from first principles and check whether you're right.

So outp is a tensor whose length is equal to the number of words in our English vocabulary, and it contains the probability for every one of those words that it is that word. So then if we now say outp.data.max, that looks in its tensor to find out which word has the highest probability, and max in PyTorch returns two things.

The first thing is what is that max probability, and the second is what is the index into the array of that max probability. And so we want that second item, index number 1, which is the word index with the largest thing. So now that contains the word, or the word index into our vocabulary of the word.

If it's a 1, you might remember 1 was padding, then that means we're done. That means we've reached the end because we've finished with a bunch of padding. If it's not 1, let's go back and continue. Now dec_inp is whatever the highest probability word was. So we keep looping through, either until we get to the largest length of a sentence, or until everything in our mini-batch is padding.
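Here's a sketch of that decoder loop, again simplified and in current PyTorch style; emb_dec, gru_dec, out_drop and out stand for the decoder-side embedding, GRU, dropout and linear layers, so treat the names as assumptions:

```python
def decode(self, h, bs, out_sl):
    """One word at a time: feed the previous prediction back in as the next input."""
    dec_inp = h.new_zeros(bs, dtype=torch.long)        # 0 = beginning-of-stream token
    res = []
    for i in range(out_sl):                            # at most the longest English sentence
        emb = self.emb_dec(dec_inp).unsqueeze(0)       # add a unit axis: a sequence of length 1
        outp, h = self.gru_dec(emb, h)                 # one step of the decoder RNN
        outp = self.out(self.out_drop(outp[0]))        # (bs, vocab): a score for every English word
        res.append(outp)
        dec_inp = outp.data.max(1)[1]                  # index of the highest-scoring word
        if (dec_inp == 1).all():                       # 1 = padding: every sentence has finished
            break
    return torch.stack(res)                            # (out_len, bs, vocab) to feed the loss function
```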

And each time we've appended our outputs, not the word but the probabilities, to this list which we stack up into a tensor and we can now go ahead and feed that to a loss function. So before we go to a break, since we've done 1 and 2, let's do 3, which is a loss function.

The loss function is categorical cross-entropy loss. We've got a list of probabilities for each of our classes, the classes are all the words in our English vocab, and we have a target which is the correct class, i.e. the correct word at this location. There's two tweaks, which is why we need to write our own little loss function, but you can see basically it's going to be cross-entropy loss.

And the tweaks are as follows. Tweak number 1 is we might have stopped a little bit early, and so the sequence length that we generated may be different to the sequence length of the target, in which case we need to add some padding. PyTorch's padding function is a bit weird. If you have a rank 3 tensor, which we do, we have sequence length by batch size by number of words in the vocab, a rank 3 tensor requires a 6-tuple.

Each pair of things in that tuple is the padding before and then the padding after that dimension. So in this case, the first dimension has no padding, the second dimension has no padding, and the third dimension has no padding on the left and as much padding as is required on the right.

It's good to know how to use that function. So now we've added any padding that's necessary. The only other thing we need to do is that cross-entropy loss expects a rank 2 tensor, but we've got sequence length by batch size. So let's just flatten out the sequence length and batch size with a -1 in view.
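Putting those two tweaks together, the loss function ends up looking roughly like this (F is torch.nn.functional; the shapes are sequence length, then batch size, then vocabulary size, as discussed):

```python
def seq2seq_loss(input, target):
    sl, bs = target.size()                     # target: (seq_len, batch) of word ids
    sl_in, bs_in, nc = input.size()            # input: (seq_len, batch, vocab) of scores
    if sl > sl_in:                             # tweak 1: we may have stopped early, so pad
        input = F.pad(input, (0, 0, 0, 0, 0, sl - sl_in))   # 6-tuple: pad only the end of dim 0
    input = input[:sl]
    return F.cross_entropy(input.view(-1, nc), target.view(-1))   # tweak 2: flatten to rank 2
```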

So flatten out that for both of them, and now we can go ahead and call cross-entropy. That's it. So now we can just use the standard approach: here's our sequence-to-sequence RNN, that's this one here. So that is a standard PyTorch module. Stick it on the GPU. Hopefully by now you've noticed you can call .cuda(), but if you call to_gpu() then it doesn't put it on the GPU if you don't have one.

You can also set fastai.core.USE_GPU to False to force it to not use the GPU, and that can be super handy for debugging. We then need something that tells it how to handle learning rate groups, so there's a thing called SingleModel that you can wrap it in, which treats the whole thing as a single learning rate group.

So this is the easiest way to turn a PyTorch module into a fastai model. Here's the model data object we created before. We could then just call Learner to turn that into a learner, but here we call RNN_Learner. RNN_Learner is a Learner that defines cross-entropy as the default criterion; in this case we're overriding that anyway, so that's not what we care about.

But it does add in these save-encoder and load-encoder things that can be handy sometimes. So in this case we really could have just used Learner, but RNN_Learner also works. So that's how we turn our PyTorch module into a fastai model and then into a learner. And once we have a learner, we give it our new loss function, and then we can call lr_find, and we can call fit, and it runs for a while, and we can save it.
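A hedged sketch of that wiring, assuming the objects built earlier in the notebook (the seq2seq module rnn, the model data object md, an optimizer function opt_fn, a learning rate lr, the seq2seq_loss above, and the notebook's usual fastai imports):

```python
learn = RNN_Learner(md, SingleModel(to_gpu(rnn)), opt_fn=opt_fn)
learn.crit = seq2seq_loss     # override the default cross-entropy criterion with our own

learn.lr_find()
learn.fit(lr, 1, cycle_len=12, use_clr=(20, 10))
learn.save('seq2seq')
```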

So all the normal learner stuff now works. Remember the model attribute of a learner is a standard PyTorch model, so we can pass it some x which we can grab out of our validation set, or you could use learn.predict_array or whatever you like, to get some predictions. And then we can convert those predictions into words by calling .max(1) to grab the index of the highest-probability words.

And then we can go through a few examples and print out the French, the correct English, and the predicted English for things that are not padding. And here we go. So, amazingly enough, this simplest possible, largely-written-from-scratch PyTorch module, trained on only 50,000 sentences, is sometimes capable on the validation set of giving you exactly the right answer, sometimes the right answer in slightly different wording, and sometimes sentences that aren't grammatically sensible or have too many question marks.

So we're well on the right track, I think you would agree. So even the simplest possible seq-to-seq, trained for a very small number of epochs, without any pre-training other than the use of word embeddings, is surprisingly good. I think the message here -- and we're going to improve this in a moment after the break -- is that even sequence-to-sequence models that you think are too simple to possibly work, trained with less data than you think you could learn from, can be surprisingly effective, and in certain situations this may even be enough for your needs.

So we're going to learn a few tricks after the break which will make this much better. So let's come back at 7.50. So one question that came up during the break is that some of the tokens that are missing in fast text had a curly quote rather than a straight quote, for example, and the question was would it help to normalize punctuation?

And the answer for this particular case is probably yes. You do have to be very careful though because it may turn out that people using beautiful curly quotes like using more formal language and actually writing in a different way. So I generally -- if you're going to do some kind of pre-processing like punctuation normalization, you should definitely check your results with and without, because nearly always that kind of pre-processing makes things worse even when I'm sure it won't.

What might be some ways of regularizing these sequence-to-sequence models besides dropout and weight decay? Let me think about that during the week. AWD-LSTM, which we've been relying on a lot, has so many great -- I mean, it's all dropout. Well, not all dropout; there's dropout of many different kinds.

And then there's the -- we haven't talked about it much, but there's also a kind of regularization based on activations and stuff like that, as well as on changes in activations and whatever. I just haven't seen anybody put anything like that amount of work into regularization of sequence-to-sequence models, and I think there's a huge opportunity for somebody to do the AWD-LSTM of seq2seq, which might be as simple as stealing all the ideas from AWD-LSTM and using them directly in seq2seq.

That would be pretty easy to try, I think. And there's been an interesting paper that Stephen Merity added in the last couple of weeks, where he used an idea which -- I don't know if he stole it from me, but it was certainly something I had also recently done and talked about on Twitter.

Either way, I'm thrilled that he's done it, which was to take all of those different AWDLSTM hyperparameters and train a bunch of different models and then use a random forest to find out with feature importance which ones actually matter the most and then figure out like how to set them.

I think you could totally use this approach to figure out the sequence-to-sequence regularization approaches which ones are best and optimize them, and that would be amazing. But at the moment, I don't know that there are additional ideas to sequence-to-sequence regularization that I can think of beyond what's in that paper for regular language model stuff, and probably all those same approaches would work.

So tricks, trick number 1, go bidirectional. For classification, my approach to bidirectional that I've suggested you use is take all of your token sequences, spin them around, train a new language model, and train a new classifier, and I also mentioned the wiki text pre-trained model. If you replace fwd with bwd in the name, you'll get the pre-trained backward model I created for you.

So you can use that. Get a set of predictions and then average the predictions just like a normal ensemble, and that's kind of how we do bidir for that kind of classification. There may be ways to do it end-to-end, but I haven't quite figured them out yet, they're not in fastAI yet, and I don't think anybody has written a paper about them yet, so if you figure it out, that's an interesting line of research.

But because we're not doing massive documents where we have to chunk them into separate bits and then pool over them and whatever, we can do bidir very easily in this case, which is literally as simple as adding bidirectional equals true to our encoder. People tend not to do bidirectional for the decoder, I think partly because it's considered cheating, but I don't know -- I was just talking to somebody at the break about it, maybe it can work in some situations, although it might need to be more of an ensembling approach in the decoder because it's a bit less obvious.

The encoder is very simple, bidirectional equals true, and with bidirectional equals true rather than just having an RNN which is going this direction, we have a second RNN that's going in this direction. And so that second RNN literally is visiting each token in the opposing order, so when we get the final hidden state, it's here rather than here.

But the hidden state is the same size, so the final result is that we end up with a tensor that's got an extra axis of length 2. And depending on what library you use, often that will then be combined with the number-of-layers axis, so if you've got two layers and bidirectional, that dimension now has length 4.

With PyTorch, it kind of depends which bit of the process you're looking at as to whether you get a separate result for each layer and each bidirectional bit and so forth. You have to look up the docs and it will tell you the inputs, outputs, tensor sizes, appropriate for the number of layers and whether you have bidirectional equals true.

In this particular case, you'll basically see all the changes I've had to make. So for example, you'll see that when I added bidirectional equals true, my linear layer now needs number-of-hidden times 2, to reflect the fact that we now have that second direction in our hidden state; and in init hidden, it's now self-dot-number-of-layers times 2.
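A fragment sketch of where those extra 2s show up in the module (the attribute names here are assumptions in the spirit of the lesson notebook, not a verbatim copy):

```python
# inside __init__: the encoder GRU becomes bidirectional, so the hidden state doubles in width
self.gru_enc = nn.GRU(em_sz_enc, nh, num_layers=nl, dropout=0.25, bidirectional=True)
self.out_enc = nn.Linear(nh * 2, em_sz_dec, bias=False)   # nh*2 because of the two directions

def initHidden(self, bs):
    # one initial hidden state per layer per direction
    return V(torch.zeros(self.nl * 2, bs, self.nh))
```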

So you'll just see there's a few places where an extra 2 has to be thrown in. Yes, Yannet? Why is making the decoder bidirectional considered cheating? Well, it's not just that it's cheating, it's that we have this loop going on, you know? It's not as simple as just having two tensors -- how do you turn those two separate loops into a final result?

After talking about it during the break, I've kind of gone from "Hey, everybody knows it doesn't work" to "Oh, maybe it kind of could work, but it requires more thought." It's quite possible during the week I realized it's a dumb idea and I was being stupid, but we'll think about it.

Another question people have: why do you need to have an end to that loop -- why loop over a range rather than just looping until it's done? Because when I start training, everything's random, so that stopping condition will probably never be true.

Later on it will pretty much always break out eventually. It's basically like we're going to go forever. It's really important to remember when you're designing an architecture that when you start, the model knows nothing about anything. So you kind of want to make sure it's going to do something that's vaguely sensible.

So with a single direction we got down to a cross-entropy loss of 3.58; with bidirectional, it's down to 3.51. So that improved it a bit, that's good, and as I say, it shouldn't really slow things down too much. Bidirectional does mean there's a little bit more sequential processing that has to happen, but it's generally a good win.

In Google's translation model, of the eight layers only the first layer is bidirectional, because that allows it to do more in parallel. So if you create really deep models, you may need to think about which ones are bidirectional, otherwise you'll have performance issues. Okay, so 3.51. Now let's talk about teacher forcing.

So teacher forcing is going to come back to this idea that when the model starts learning, it knows nothing about anything. So when the model starts learning, it is not going to spit out the right word at this point, it's going to spit out some random, meaningless word, because it doesn't know anything about German, or about English, or about the idea of language, or anything.

And then it's going to feed it down here as an input and be totally unhelpful. And so that means that early learning is going to be very, very difficult because it's feeding in an input that's stupid into a model that knows nothing, and somehow it's going to get better.

So it's not asking too much -- eventually it gets there -- but it's definitely not as helpful as we could be. So what if, instead of feeding in the thing I predicted just now, we feed in the actual correct word it was meant to be? Now we can't do that at inference time, because by definition we don't know the correct word -- we've been asked to translate it.

And we can't require a correct translation in order to do translation. So the way I've set this up is I've got this thing called pr_force, which is the probability of forcing. And if some random number is less than that probability, then I'm going to replace my decoder input with the actual correct thing.

And if we've already gone too far, if it's already longer than the target sentence, I'm just going to stop because I can't give it the correct thing. So you can see how beautiful PyTorch is for this, because if you tried to do this with some static graph thing like classic TensorFlow, I tried.

One of the key reasons we switched to PyTorch at this exact point in last year's class was because I tried to implement teacher forcing in Keras and TensorFlow and went even more insane than I started. It was weeks of getting nowhere. And then literally on Twitter -- I think it was Andrej Karpathy -- somebody said something about this thing called PyTorch that had just come out and was really cool.

And I tried it that day. By the next day, I had teacher forcing. And so I was like, oh my gosh. And all the stuff of trying to debug things, it was suddenly so much easier, and this kind of dynamic stuff is so much easier. So this is a great example of like, hey, I get to use random numbers and if statements and stuff.

So here's the basic idea: at the start of training, let's set pr_force really high, so that nearly always it gets the actual correct previous word and so it has a useful input. And then as we train a bit more, let's decrease pr_force so that by the end pr_force is 0 and it has to learn properly, which is fine because it's now actually feeding in sensible inputs most of the time anyway.
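A fragment sketch of the decoder loop with teacher forcing added, following the description above (the surrounding names are assumptions; pr_force is the probability of forcing):

```python
for i in range(self.out_sl):
    emb = self.emb_dec(dec_inp).unsqueeze(0)
    outp, h = self.gru_dec(emb, h)
    outp = self.out(self.out_drop(outp[0]))
    res.append(outp)
    dec_inp = V(outp.data.max(1)[1])     # by default, feed back our own prediction
    if (dec_inp == 1).all(): break       # 1 is padding: everything in the mini-batch is done
    if (y is not None) and (random.random() < self.pr_force):
        if i >= len(y): break            # already longer than the target, nothing correct to feed
        dec_inp = y[i]                   # teacher forcing: use the actual correct previous word
```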

So let's now write something such that in the training loop, it gradually decreases pr_force. How do you do that? Well, one approach would be to write our own training loop, but let's not do that, because we already have a training loop that has progress bars, uses exponentially weighted averages to smooth out the losses, and keeps track of metrics.

And it does a bunch of things which are not rocket science, but they're kind of convenient. It also takes care of calling reset for RNNs at the start of an epoch to make sure the hidden state is set to zeros, and little things like that. You'd rather not have to write that from scratch.

So what we've tended to find is that as I start to write some new thing and realize I need to replace some part of the code, I'll add some little hook so that we can all use that hook to make things easier. In this particular case, there's a hook that I've ended up using all the damn time now, which is the hook called the Stepper.

And so if you look at our code, model.py is where our fit function lives. And so the fit function in model.py, we've seen it before, I think it's like the lowest level thing that doesn't require a learner, it doesn't really require anything much at all. It just requires a standard PyTorch model and a model data object.

You just need to know how many epochs, a standard PyTorch optimizer, and a standard PyTorch loss function. So we've hardly ever used it in the class, we normally call learn.fit, but learn.fit calls this. This is our lowest level thing. But we've looked at the source code here sometimes, we've seen how it loops through each epoch and it loops through each thing in our batch and calls stepper.step.

And so stepper.step is the thing that's responsible for calling the model, calculating the loss, and calling the optimizer. And by default, fit uses a particular class called Stepper, which does a few things you don't need to know about too much, but basically it calls the model (so the model ends up inside self.m), zeros the gradients, calls the loss function, calls backward, does gradient clipping if necessary, and then calls the optimizer.

So they're the basic steps that back when we looked at PyTorch from scratch, we had to do. So the nice thing is we can replace that with something else rather than replacing the training loop. So if you inherit from stepper and then write your own version of step, you can just copy and paste the contents of step and add whatever you like.

Or if it's something you're going to do before or afterwards, you could even call super().step. In this case, I rather suspect I've been unnecessarily complicated here: I probably could have commented out all of that and just said super().step(xs, y, epoch), because I think this is an exact copy of everything. But as I say, when I'm prototyping I don't think carefully about how to minimize my code.

I copied and pasted the contents of the code from step, and I added a single line to the top which replaces pr_force in my module with something that gradually decreases linearly for the first 10 epochs, and after 10 epochs it is zero. So: total hack, but good enough to try it out.
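The simplified version described just above -- set pr_force on the module, then hand everything else off to the stock step -- would look roughly like this sketch:

```python
class Seq2SeqStepper(Stepper):
    def step(self, xs, y, epoch):
        # linearly decay the probability of teacher forcing over the first 10 epochs, then zero
        self.m.pr_force = (10 - epoch) * 0.1 if epoch < 10 else 0
        return super().step(xs, y, epoch)
```

Passing this class in when calling fit, as described next, is then all that's needed.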

So the nice thing is that everything else is the same. I've added these three lines of code to my module, and the only thing I need to do differently is, when I call fit, pass in my customized stepper class. And so that's going to do teacher forcing. We don't have bidirectional here, so we're just changing one thing at a time.

So we should compare this to our unidirectional results, which was 3.58, and this is 3.49. So that was an improvement. So that's great! I needed to make sure I at least did 10 epochs because before that it was cheating by using the teacher forcing. So that's good, that's an improvement.

So we've got another trick, and this next trick is a bigger trick. It's a pretty cool trick, and it's called attention. And the basic idea of attention is this, which is, expecting the entirety of the sentence to be summarized into this single hidden vector is asking a lot. It has to know what was said and how it was said and everything necessary to create the sentence in German.

And so the idea of attention is basically: maybe we're asking too much, particularly because we could use this form of model where we output at every step of the loop, so we don't just have a hidden state at the end, but a hidden state after every single word.

And why not try and use that information? It's already there, and so far we've just been throwing it away. And not only that, but with bidirectional we've got two vectors of state at every step that we can use. So how could we use this piece of state, this piece of state, this piece of state, this piece of state and this piece of state, rather than just the final state?

And so the basic idea is, well, let's say I'm translating this word right now. Which of these five pieces of state do I want? And of course the answer is if I'm doing -- well, actually let's pick a more interesting word. So if I'm trying to do loved, then clearly the hidden state I want is this one, because this is the word.

And then for this preposition, whatever, this little word here, no it's not a preposition, I guess it's part of the verb. So for this part of the verb, I probably would need this and this and this to make sure that I've got the tense right and know that I actually need this part of the verb and so forth.

So depending on which bit I'm translating, I'm going to need one or more bits of these various hidden states. And in fact, I probably want some weighting of them. So like what I'm doing here, I probably mainly want this state, but I maybe want a little bit of that one and a little bit of that one.

So in other words, for these five pieces of hidden state, we want a weighted average, and we want it weighted by something that can figure out which bits of the sentence are most important right now. So how do we figure out something like which bits of the sentence are important right now?

We create a neural net, and we train the neural net to figure it out. When do we train that neural net? End to end. So let's now train two neural nets. We've actually already got a bunch, we've got an RNN encoder, an RNN decoder, a couple of linear layers.

What the hell, let's add another neural net into the mix. And this neural net is going to spit out a weight for every one of these things, and we're going to take the weighted average at every step. And it's just another set of parameters that we learn, all at the same time.

And so that's called attention. So the idea is that once that attention has been learned, we can look at this terrific demo from Chris Olah and Shan Carter: each different word takes a different weighted average, and you can see how the weights differ depending on which word is being translated -- the deepness of the blue is how much weight it's using.

You can basically see, for each word, which words we're translating from. So when we say European, we need to know that both of these two parts are going to be influenced, or if we're doing economic, all three of these parts are going to be influenced, including the gender of the definite article and so forth.

So check out this distill.pub article. These things are all nice little interactive diagrams. It basically shows you how attention works and what the actual attention looks like in a trained translation model. So let's try and implement attention. So with attention, it's basically this is all identical, and the encoder is identical, and all of this bit of the decoder is identical.

There's one difference, which is that we basically are going to take a weighted average. And the way that we're going to do the weighted average is we create a little neural net, which we're going to see here and here, and then we use softmax, because of course the nice thing about softmax is that we want to ensure that all of the weights that we're using add up to 1, and we also kind of expect that one of those weights should probably be quite a bit higher than the other ones.

So softmax gives us the guarantee that they add up to 1, and because it has an e^ (an exponential) in it, it tends to encourage one of the weights to be higher than the others. So let's see how this works. What's going to happen is we're going to take the last layer's hidden state, and we're going to stick it into a linear layer.

And then we're going to stick it into a nonlinear activation, and then we're going to do matrix multiply, and so if you think about it, linear layer, nonlinear activation, matrix multiply, that's a neural net. It's a neural net with one hidden layer. Stick it into a softmax, and then we can use that to weight our encoder outputs.

So now, rather than just taking the last encoder output, we've got this is going to be the whole tensor of all of the encoder outputs, which I just weight by this little neural net that I've created. And that's basically it. So what I'll do is I'll put on the wiki thread a couple of papers to check out.

There was basically one amazing paper that really originally introduced this idea of attention. And I say amazing because it actually introduced a couple of key things which have really changed how people work in this field. This area of attention has been used not just for text, but for things like reading text out of pictures, or doing various stuff with computer vision, and stuff like that.

And then there's a second paper, which Geoffrey Hinton was involved in, called Grammar as a Foreign Language, which used this idea of RNNs with attention to basically try to replace rules-based grammar with an RNN which automatically tagged each word grammatically, and it turned out to do it better than any rules-based system.

Which today actually kind of seems obvious. I think we're now used to the idea that neural nets do lots of this stuff better than rules-based systems, but at the time it was considered really surprising. One nice thing is that their summary of how attention works is really nice and concise.

Let's go back and look at our original encoder. So an RNN spits out two things. It spits out a list of the state after every time step, and it also tells you the state at the last time step. And we used the state at the last time step to create the input state for our decoder, which is what we see here, one vector.

But we know that it's actually creating a vector at every time step, so wouldn't it be nice to use them all? And wouldn't it be nice to use the one that's most relevant to translating the word I'm translating now? So wouldn't it be nice to take a weighted average of the hidden state at each time step, weighted by whatever is the appropriate weight right now?

Which, for example in this case, "liebte" would definitely be time step number 2, which is what it's all about, because that's the word I'm translating. So how do we get a list of weights that is suitable for the word we're translating right now? The answer is: by training a neural net to figure out the list of weights.

And so anytime we want to figure out how to train a little neural net that does any task, the easiest way normally always to do that is to include it in your module and train it in line with everything else. The minimal possible neural net is something that contains two layers and one nonlinear activation function.

So here is one linear layer, and in fact, instead of a linear layer, we can even just grab a random matrix if we don't care about bias. And so here's a random matrix; it's just a random tensor wrapped up in a Parameter. A Parameter, remember, is just a PyTorch variable -- it's identical to a variable, but it tells PyTorch "I want you to learn the weights for this."

So here we've got a linear layer, here we've got a random matrix, and so here at this point where we start out our decoder, let's take the current hidden state of the decoder, put that into a linear layer, because what's the information we use to decide what words we should focus on next?

The only information we have to go on is what the decoder's hidden state is now. So let's grab that, put it into the linear layer, put it through a nonlinearity, and then put it through one more layer; this one actually doesn't have a bias in it, so it's really just a matrix multiply.

Put that into a softmax, and that's it, that's a little neural net. It doesn't do anything, it's just a neural net, no neural nets do anything, they're just linear layers with nonlinear activations with random weights. But it starts to do something if we give it a job to do, and in this case the job we give it to do is to say, don't just take the final state, but now let's use all of the encoder states and let's take all of them and multiply them by the output of that little neural net.
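A fragment sketch of that attention step inside the decoder loop (names like W1, l2 and V_a here follow my reading of the lesson notebook, so treat them as assumptions):

```python
w1e = enc_out @ self.W1              # project every encoder hidden state (the random matrix; done once)
w2h = self.l2(h[-1])                 # current decoder hidden state through the linear layer
u = F.tanh(w1e + w2h)                # the nonlinear activation
a = F.softmax(u @ self.V_a, dim=0)   # matrix multiply by a learned vector, softmax over time steps
attns.append(a)                      # keep the weights around so we can visualize them later
ctx = (a.unsqueeze(2) * enc_out).sum(0)   # the weighted average of the encoder outputs
```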

And so given that the things in this little neural net are learnable weights, hopefully it's going to learn to weight those encoder outputs, those encoder hidden states by something useful. That's all a neural net ever does, is we give it some random weights to start with and a job to do, and hope that it learns to do the job.

And it turns out that it does. So everything else in here is identical to what it was before. We've got teacher forcing, it's not bidirectional. So we can see how this goes. You can see, here we are using teacher forcing. Teacher forcing had 3.49, and so now we've got nearly exactly the same thing, but we've got this little minimal neural net figuring out what weightings to give our inputs.

Oh wow, now it's down to 3.37. However, these things are logs, so e^ of this difference is quite a significant change. So 3.37, let's try it out. Not bad, right? "Where are they located?" "What are their skills?" "What do you do?" They're still not perfect ("why or why not"), but quite a few of them are correct.

And again, considering that we're asking it to learn about the very idea of language for two different languages, and how to translate them between the two, and grammar, and vocabulary, and we only have 50,000 sentences, and a lot of the words only appear once, I would say this is actually pretty amazing.

Why do we use tanh instead of relu for the attention mini-net? I don't quite remember, it's been a while since I looked at it. You should totally try using relu and see how it goes. The key difference is that tanh can go in each direction, and it's limited both at the top and the bottom.

I know that very often, for the gates inside RNNs, LSTMs and GRUs, tanh often works out better. But it's been about a year since I actually looked at that specific question, so I'll look at it during the week. The short answer is: you should try a different activation function and see if you can get a better result.

So what we can also do is actually grab the attentions out of the model. I added this return-attention flag -- see here, in my forward; you can put anything you like in forward. So I added a return-attention parameter, False by default, because obviously the training loop doesn't know anything about it.

But then I just had something here saying if return-attention, then stick the attentions on as well; the attentions are simply that value a, just chucked in a list. So we can now call the model with return-attention set to True and get back the probabilities and the attentions, which means as well as printing out these here, we can draw pictures of the attention at each time step.
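A hedged sketch of that flag (the exact parameter name in the notebook may differ):

```python
def forward(self, inp, y=None, return_attention=False):
    # ... the decoder loop above, appending to `res` and `attns` ...
    res = torch.stack(res)
    if return_attention:
        res = res, torch.stack(attns)   # hand the per-step attention weights back as well
    return res

# later, outside the training loop, something like:
# probs, attns = learn.model(V(x), return_attention=True)
```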

And so you can see at the start, the attention is all on the first word, then the second word, the third word, a couple of different words. And this is just for one particular sentence. So you can kind of see -- this is the equivalent: when you're Chris Olah and Shan Carter, you make things that look like this; when you're Jeremy Howard, the exact same information looks like this.

It's the same thing, just pretend that it's beautiful. So you can see that at each different time step we've got a different attention. And it's really important when you try to build something like this: you don't really know if it's not working, because if it's not working -- and as per usual my first 12 attempts at this were broken, broken in the sense that they weren't really learning anything useful -- it's basically giving equal attention to everything, and therefore it wasn't worse.

It just wasn't better, or it wasn't much better. And so until you actually find ways to visualize the thing in a way that you know what it ought to look like ahead of time, you don't really know if it's working. So it's really important that you try to find ways to kind of check your intermediate steps and your outputs.

So people are asking: what is the loss function for the attentional neural network? There's no separate loss function for the attentional neural network; it's trained end-to-end. It's just sitting here inside our decoder loop, and the loss function for the decoder loop is the same as before, because the result contains exactly the same thing as before:

just the outputs, the probabilities of the words. So it's the same loss function. So how come the little mini neural net is learning something? Well, because in order to make the outputs better and better, it would be great if it made the weights of this weighted average better and better.

So part of creating our output is to please do a good job of finding a good set of weights, and if it doesn't do a good job of finding a good set of weights, then the loss function won't improve through that bit. So end-to-end learning means you throw everything that you can into one loss function.

And the gradients of all the different parameters point in a direction that says, basically, hey, if you had put more weight over there, it would have been better. And thanks to the magic of the chain rule, it then knows, oh, it would have put more weight over there if you changed the parameter in this matrix multiply a little bit over there.

And so that's the magic of end-to-end learning. So it's a very understandable question, how this little mini neural net learns its job. But you've got to realize there's nothing about this code that says this particular bit is a separate little mini neural network, any more than the GRU is a separate little neural network, or this linear layer is a separate little function.

It all ends up pushed into one output, which ends up in one loss function that returns a single number that says this either was or wasn't a good translation. And so, thanks to the magic of the chain rule, we then back-propagate little updates to all the parameters to make them all a little bit better.

So this is a big, weird counterintuitive idea, and it's totally okay if it's a bit mind bending. And it's the bit where, even back to lesson 1, it's like, how did we make it find dogs versus cats? We didn't. All we did was we said, this is our data, this is our architecture, this is our loss function, please back-propagate into the weights to make them better.

And after you've made them better a while, it'll start finding cats from dogs. In this case, we haven't used somebody else's convolutional network architecture. We've said here's like a custom architecture which we hope is going to be particularly good at this problem. And even without this custom architecture, it was still okay.

But then when we kind of made it in a way that made more sense to what we think it ought to do, it worked even better. But at no point did we kind of do anything different other than say here's data, here's an architecture, here's a loss function, go and find the parameters please.

And it did it because that's what neural nets do. So that is sequence-to-sequence learning. If you want to encode an image using a CNN backbone of some kind, and then pass that into a decoder which is like an RNN with attention, and you make your y values the actual correct captions for each of those images, you will end up with an image caption generator.

If you do the same thing with videos and captions, you'll end up with a video caption generator. If you do the same thing with 3D CT scans and radiology reports, you'll end up with a radiology report generator. If you do the same thing with GitHub issues and people's chosen summaries of them, you'll get a GitHub issue summary generator.

Seq2seq -- I agree, they're magical, but they work. And I don't feel like people have begun to scratch the surface of how to use seq2seq models in their own domains. Not being a GitHub person, it would never have occurred to me that it would be kind of cool to start with some issue and automatically create a summary.

But now I'm like, of course, next time I go to GitHub I want to see a summary written there for me. I don't want to write my own damn commit message, for that matter. Why should I write my own summary of the code review when I finish adding comments to lots of lines?

It should do that for me as well. Now I'm thinking, GitHub is so behind, it could be doing this stuff. So what are the things in your industry that you could start with a sequence and generate something from it? I can't begin to imagine. So again, it's kind of like a fairly new area, the tools for it are not easy to use, they're not even built into fastai yet, as you can see, hopefully they will be soon.

And I don't think anybody knows what the opportunities are. So I've got good news, bad news. The bad news is we have 20 minutes to cover a topic which in last year's course took a whole lesson. The good news is that when I went to rewrite this using fastai and PyTorch I ended up with almost no code.

So all of the stuff that made it hard last year is basically gone now. So we're going to do something that brings together, for the first time, the two little worlds we've focused on: text and images. And this idea came up in a paper by an extraordinary deep learning practitioner and researcher named Andrea Frome.

And Andrea was at Google at the time, and her basic crazy idea was to say words can have a distributed representation, a space, which at that time really was just word vectors. And images can be represented in a space, like in the end if we have a fully connected layer they kind of ended up as a vector representation.

Could we merge the two? Could we somehow encourage the vector space that the images end up in to be the same vector space that the words are in? And if we could do that, what would that mean? What could we do with that? Well, "what could we do with that" covers things like: what if I'm wrong?

What if I'm predicting that this image is a beagle, and I predict jumbo jet, and Yannet's model predicts corgi? The normal loss function says that Yannet's and Jeremy's models are equally good, i.e. they're both wrong. But what if we could somehow say corgi is closer to beagle than it is to jumbo jet, so Yannet's model is better than Jeremy's?

And we should be able to do that, because in word vector space, beagle and corgi are pretty close together, but jumbo jet not so much. So it would give us a nice situation where hopefully our inferences would be wrong in saner ways, if they're wrong. It would also allow us to search for things that aren't a category in ImageNet, like dog and cat.

Why did I have to train a whole new model to find dogs versus cats when we already had something that found corgis and tabbies? Why can't I just say "find me dogs"? Well, if I had trained it in word vector space, I totally could, because there's now a word vector for dog, and I can find the images with the closest image vectors, and so forth.

So we'll look at some cool things we can do with it in a moment, but first of all let's train a model where this model is not learning a category, a one-hot encoded ID where every category is equally far from every other category. Let's instead train a model where we're finding the dependent variable which is a word vector.

So what word vector? Well obviously the word vector for the word you want. So if it's corgi, let's train it to create a word vector that's the corgi word vector. And if it's a jumbo-jet, let's train it with a dependent variable that says this is the word vector for a jumbo-jet.

So as I said, it's now shockingly easy. So let's grab the fast text word vectors again and load them in; we only need English this time. And here's an example of the word vector for king: it's just 300 numbers. So for example, lowercase-j jeremy and capital-J Jeremy have a correlation of 0.6; I don't like bananas at all, so it's good that banana and jeremy only correlate at 0.14.
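That check is just a correlation between two 300-dimensional vectors; a sketch (ft_vecs is an assumed name for the loaded fast text word-to-vector dictionary):

```python
import numpy as np

np.corrcoef(ft_vecs['jeremy'], ft_vecs['Jeremy'])[0, 1]   # ~0.6: related words correlate
np.corrcoef(ft_vecs['banana'], ft_vecs['jeremy'])[0, 1]   # ~0.14: unrelated words, much lower
```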

So words that you would expect to be correlated are correlated, and words that should be as far away from each other as possible -- unfortunately they're still slightly correlated, but not so much. So let's now grab all of the ImageNet classes, because we actually want to know which one's a corgi and which one's a jumbo jet.

So we've got a list of all of those up on files.fast.ai, we can grab them. And let's also grab a list of all of the nouns in English which I've made available here as well. So here are the names of each of the 1000 ImageNet classes. And here are all of the nouns in English according to WordNet, which is a popular thing for kind of representing what words are or not.

So we can now go ahead and load that list of nouns, load the list of ImageNet classes, turn that into a dictionary. So these are the class IDs for the 1000 ImageNet classes that are in the competition data set. There are 1000. So here's an example, n01, is a tench which apparently is a kind of fish.

Let's do the same thing for all those WordNet nouns. And you can see actually it turns out that ImageNet is using WordNet class names, so that makes it nice and easy to map between the two. And WordNet, the most basic thing is an entity, and then that includes an abstraction, and a physical entity can be an object and so forth.

So these are our two worlds. We've got the ImageNet 1000, and we've got the 82,000 nouns which are in WordNet. So we want to map the two together, which is as simple as creating a couple of dictionaries to map them based on the synset ID, i.e. the WordNet ID. What I need to do now is grab the 82,000 nouns in WordNet and try to look them up in fast text, and it turns out that gives me 49,469 synset-to-word-vector mappings.

And so I've managed to look up 49,000 of them in fast text, and I've now got a dictionary that goes from synset ID, which is what WordNet calls them, to word vectors -- that's what this synset-to-word-vector dictionary is. And I've also got the same thing specifically for the 1000 ImageNet classes.

So save them away, that's fine. Now I grab all of the ImageNet, which you can actually download from Kaggle now. If you look up the Kaggle ImageNet localization competition, that contains the entirety of the ImageNet classifications as well. It's got a validation set of 28,650 items in it. And so I can basically just grab for every image in ImageNet, I can grab using that Synset to WordVector, grab its fast text WordVector, and I can now stick that into this ImageVectors array.

Stack that all up into a single matrix and save that away. And so now what I've got is something for every ImageNet image. I've also got the fast text WordVector that it's associated with. Just by looking up the Synset ID, going to WordNet, then going to fast text and grabbing the WordVector.

And so here's a cool trick. I can now create a model data object, which specifically is an image classifier data object. And I've got this thing called from_names_and_array -- I'm not sure if we've used it before -- but we can pass it a list of file names, and these are all of the file names in ImageNet.

And we can just pass it an array of our dependent variables, which is all of the fast text word vectors. And then I can pass in the validation indexes, which in this case is just all of the last IDs; I need to make sure they're the same as the ones ImageNet uses, otherwise I'll be cheating.

And then I pass in continuous = True, which puts the lie to this being an "image classifier data" object -- it's now really image regressor data. Continuous = True means: don't one-hot encode my outputs, but treat them just as continuous values. So now I've got a model data object that contains all of my file names, and for every file name, a continuous array representing its word vector.
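A hedged sketch of that model data construction, with assumed names for the pieces built earlier (img_fnames, img_vecs, val_idxs) and the usual fastai transforms:

```python
arch = resnet50   # the architecture used just below
tfms = tfms_from_model(arch, 224, transforms_side_on, max_zoom=1.1)
md = ImageClassifierData.from_names_and_array(PATH, img_fnames, img_vecs,
        val_idxs=val_idxs, classes=None, tfms=tfms, continuous=True, bs=256)
```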

So I have an x, I have a y, so I have data, now I need an architecture and a loss function. Once I've got that, I should be done. So let's create an architecture, and we'll revise this next week, but basically we can use the tricks we've learned so far, but it's actually incredibly simple.

fastai has a ConvnetBuilder, which is what gets called when you say ConvLearner.pretrained. And you basically say: okay, what architecture do you want? We're going to use ResNet-50. How many classes do you want? In this case it's not really classes -- it's how many outputs do you want, which is the length of the fast text word vector, 300.

Obviously it's not multi-class classification, it's not classification at all. Is it regression? Yes it is regression. And then you can just say, alright, what fully connected layers do you want? So I'm just going to add one fully connected layer, hidden layer of length 1024. Why 1024? Well I've got the last layer of ResNet-50, I think it's 1024 long.

The final output I need is 300 long. I obviously need my penultimate layer to be longer than 300, otherwise there's not enough information, so I kind of just picked something a bit bigger. Maybe different numbers would be better, but this worked for me. How much dropout do you want?

I found that with the default dropout I was consistently underfitting, so I just decreased the dropout from 0.5 to 0.2. And so this is now a convolutional neural network that does not have any softmax or anything like that, because it's regression -- it's just a linear layer at the end. And that's basically it, that's my model.
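A hedged sketch of that architecture setup and the learner that wraps it (argument names follow my memory of fastai 0.7's ConvnetBuilder, so they may differ slightly; it assumes the notebook's usual imports, which bring in optim and partial):

```python
models = ConvnetBuilder(arch, md.c, is_multi=False, is_reg=True,
                        xtra_fc=[1024], ps=[0.2, 0.2])   # md.c is 300, the word vector length
learn = ConvLearner(md, models, precompute=True)
learn.opt_fn = partial(optim.Adam, betas=(0.9, 0.99))
```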

So I can create a ConvLearner from that model and give it an optimization function. So now all I need -- I've got data, I've got an architecture; because I said I want 300 outputs, it knows there are 300 outputs, since that's the size of this array. So now all I need is a loss function.

Now the default loss function for regression is L1-loss, so the absolute differences. That's not bad, but unfortunately in really high dimensional spaces, anybody who's studied a bit of machine learning probably knows this, in really high dimensional spaces, in this case it's 300 dimensional, basically everything is on the outside.

And when everything's on the outside, distance isn't quite meaningless, but it's a little bit awkward: whether things are close together or far away doesn't really mean much in these really high dimensional spaces where everything's on the edge. What does mean something, though, is that if one thing's on the edge over here and another thing's on the edge over there, you can form an angle between those vectors, and the angle is meaningful.

And so that's why we use cosine similarity when we're looking at how close or far apart things are in high dimensional spaces. And if you haven't seen cosine similarity before, it's basically like Euclidean distance, but normalized to unit norm -- basically, divide by the length.

So we don't care about the length of the vector, we only care about its angle. So there's a bunch of stuff that you could easily learn in a couple of hours, but if you haven't seen it before, it's a bit mysterious. For now, just know that loss functions in high dimensional spaces where you're trying to find similarity, you care about angle and you don't care about distance.
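So the custom loss is about as small as a loss function can get -- roughly this sketch:

```python
def cos_loss(inp, targ):
    # 1 minus cosine similarity, averaged over the batch: only the angle matters, not the length
    return 1 - F.cosine_similarity(inp, targ).mean()

learn.crit = cos_loss
```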

If you didn't use this custom loss function, it would still work, I tried it, it's just a little bit less good. So we've got an architecture, we've got data, we've got a loss function, therefore we're done. So we can go ahead and fit. Now I'm training on all of ImageNet, that's going to take a long time, so pre-compute equals true is your friend.

You remember pre-compute equals true? That's that thing we learned ages ago that caches the output of the final convolutional layer and just trains the fully connected bit. And like even with pre-compute equals true, it takes like 3 minutes to train an epoch on all of ImageNet. So I trained it for a while longer, so that's like an hour's worth of training.

But it's pretty cool that with fast.ai, we can train a new custom head basically on all of ImageNet for 40 epochs in an hour or so. And so at the end of all that, we can now say, let's grab the 1000 ImageNet classes, and let's predict on a whole validation set.

And let's just take a look at a few pictures. So here are a few pictures, and because the validation set is ordered, all the stuff of the same type is in the same place. I don't know what this thing is. And what we can now do is use a nearest neighbours search.

So a nearest neighbours search means: here's one 300-dimensional vector, and here's a whole lot of other 300-dimensional vectors; which ones is it closest to? And normally that takes a very long time, because you have to look through every 300-dimensional vector, calculate its distance, and find out how far away it is.

But there's an amazing almost unknown library called NMSlib that does that incredibly fast. Almost nobody's heard of it. Some of you may have tried other nearest neighbors libraries. I guarantee this is faster than what you're using. I can tell you that because it's been benchmarked by people who do this stuff for a living.

This is by far the fastest on every possible dimension. So this is basically a super fast way to do it. We're using angular distance here, so we want to create an index on angular distance, and we're going to do it on all of our ImageNet word vectors: add them in a whole batch, create the index, and now I can query a bunch of vectors all at once, get their 10 nearest neighbours, and use multi-threading.
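A sketch of that search (the function and variable names here are assumptions; the nmslib calls themselves are the library's real API):

```python
import nmslib

def create_index(a):
    index = nmslib.init(space='angulardist')   # index on angular distance, as described
    index.addDataPointBatch(a)
    index.createIndex()
    return index

def get_knns(index, vecs):
    # query a whole batch of vectors at once, 10 neighbours each, multi-threaded
    return zip(*index.knnQueryBatch(vecs, k=10, num_threads=4))

nn_index = create_index(class_wvs)          # e.g. the word vectors for the 1,000 ImageNet classes
idxs, dists = get_knns(nn_index, pred_wv)   # pred_wv: the model's predicted word vectors
```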

It's absolutely fantastic, this library. You can install it from pip, it just works, and it tells you how far away they are and their indexes. So we can now go through and print out the top 3. So it turns out that bird actually is a limpkin. This is the top 3 for each one.

Interestingly, this one doesn't say it's a limpkin, and I looked it up, it's the 4th one. I don't know much about birds, but everything else here is brown with white spots, that's not. So I don't know if that's actually a limpkin or if it's a mislabeled, but it sure as hell doesn't look like the other birds.

So I thought that was pretty interesting. It's kind of saying I don't think it's that. Now this is not a particularly hard thing to do because it's only 1,000 ImageNet classes, it's not doing anything new, but what if we now bring in the entirety of WordNet, and we now say which of those 45,000 things is it closest to, exactly the same.

So it's now searching all of WordNet. So now let's do something a bit different, which is take all of our predictions, so basically take our whole validation set of images and create a KNN index of the image representations, because remember it's predicting things that are meant to be word vectors.

And now let's grab the fast text vector for boat, and boat is not an ImageNet concept. And yet I can now find all of the images in my predicted word vectors in my validation set that are closest to the word boat. And it works, even though it's not something that was ever trained on.

What if we now take engine's vector and boat's vector and take their average? And what if we now look in our nearest neighbors for that? These are boats with engines. I mean, yes, this is actually a boat with an engine, it just happens to have wings on as well.

By the way, sail is not an ImageNet thing, boat is not an ImageNet thing -- here's the average of two things that are not ImageNet things, and yet, with one exception, it's found us sailboats. Okay, let's do something else crazy. Let's open up an image in the validation set.

Here it is. I don't know what it is. Let's call predict_array on that image to get its kind of word-vector-like thing. And let's do a nearest-neighbors search on all the other images. And here's all the other images of whatever the hell that is. So you can see this is crazy.

We've trained a thing on all of ImageNet in an hour using a custom head that required basically two lines of code. And these things run in like 300 milliseconds to do these searches. I actually taught this basic idea last year as well, but it was in Keras and it was just pages and pages and pages of code and everything took a long time and it was complicated.

Back then I kind of said, I can't begin to think of all the stuff you could do with this. I don't think anybody has really thought deeply about this yet, but I think it's fascinating. So go back and read the DeViSE paper, because Andrea had a whole bunch of other thoughts in there.

And now that it's so easy to do, hopefully people will dig into this now because I think it's crazy and amazing. Alright, thanks everybody. See you next week.