Lesson 11: Deep Learning Part 2 2018 - Neural Translation
Chapters
0:00 Super Convergence
1:58 One Cycle
3:57 Kubeflow
5:41 Neural Translation
9:44 Code
13:32 Basic Approach
18:18 RNN Review
19:30 Refactoring
21:39 Stacking
22:37 Training
24:48 Tokenizing
26:13 Processing
26:54 Partition
27:19 Attention Layer
32:26 End of Stream Marker
34:13 Fast Text
35:54 Python Dictionary
43:11 Data Loader
48:15 Encoder
00:00:00.000 |
I want to start pointing out a couple of the many cool things that happened this week. 00:00:07.040 |
One thing that I'm really excited about is we briefly talked about how Leslie Smith has 00:00:12.780 |
a new paper out, and basically the paper takes these previous two key papers, cyclical learning 00:00:22.120 |
rates and superconvergence, and builds on them with a number of experiments to show that 00:00:33.220 |
superconvergence lets you train models 5 times faster than previous stepwise approaches. 00:00:41.840 |
It's not 5 times faster than CLR, but it's faster than CLR as well. 00:00:46.440 |
The key is that superconvergence lets you get up to massively high learning rates, somewhere 00:01:00.640 |
So the interesting thing about superconvergence is that you actually train at those very high 00:01:10.400 |
learning rates for quite a large percentage of your epochs, and during that time the loss 00:01:16.000 |
doesn't really improve very much, but the trick is it's doing a lot of searching through 00:01:22.240 |
the space to find really generalizable areas, it seems. 00:01:28.200 |
We kind of had a lot of what we needed in fastAI to achieve this, but we're missing 00:01:32.160 |
a couple of bits, and so Sylvain Gugger has done an amazing job of fleshing out the pieces 00:01:38.180 |
that we're missing, and then confirming that he has actually achieved superconvergence 00:01:46.000 |
I think this is the first time that this has been done that I've heard of outside of Leslie Smith's own work. 00:01:51.320 |
He's got a great blog post up now on 1Cycle, which is what Leslie Smith called this approach. 00:01:59.200 |
And this is actually, it turns out, what 1Cycle looks like. 00:02:03.520 |
It's a single cyclical learning rate, but the key difference here is that the going up bit 00:02:11.400 |
is the same length as the going down bit, so you go up really slowly. 00:02:16.080 |
And then at the end, for like a tenth of the time, you then have this little bit where 00:02:23.760 |
And it's interesting, obviously this is a very easy thing to show, a very easy thing 00:02:29.240 |
Sylvain has added it to fastai; temporarily it's called use_clr_beta, but by the time you watch 00:02:37.160 |
this on the video, it'll probably be called 1cycle or something like that. 00:02:45.920 |
So that's one key piece to getting these massively high learning rates. 00:02:49.780 |
And he shows a number of experiments when you do that. 00:02:52.200 |
A second key piece is that as you do this to the learning rate, you do this to the momentum. 00:02:59.080 |
So when the learning rate's low, it's fine to have a high momentum. 00:03:02.480 |
But when the learning rate gets up really high, your momentum needs to be quite a bit lower. 00:03:10.960 |
So this is also part of what he's added to the library is this cyclical momentum. 00:03:16.040 |
And so with these two things, you can train for about a fifth of the number of epochs. 00:03:24.800 |
Then you can drop your weight decay down by about two orders of magnitude. 00:03:28.720 |
You can often remove most or all of your dropout, and so you end up with something that's trained 00:03:37.600 |
And it actually turns out that Sylvain got quite a bit better accuracy than Leslie Smith's paper. 00:03:42.880 |
His guess, I was pleased to see, is because our data augmentation defaults are better 00:03:52.280 |
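To make the shape of that schedule concrete, here is a minimal sketch of a 1cycle-style schedule (this is illustrative, not fastai's actual implementation; the function name and default hyperparameter values are assumptions), with the learning rate ramping up and back down while momentum does the opposite, plus a short annealing tail:

```python
import numpy as np

def one_cycle_schedule(n_iter, lr_max=1.0, lr_min=0.1,
                       mom_max=0.95, mom_min=0.85, tail_frac=0.1):
    """Rough 1cycle-style schedule: LR goes low -> high -> low while momentum
    mirrors it (high -> low -> high), then a short tail anneals the LR further."""
    n_tail = int(n_iter * tail_frac)
    n_half = (n_iter - n_tail) // 2
    n_tail = n_iter - 2 * n_half                                        # absorb rounding into the tail
    lrs = np.concatenate([np.linspace(lr_min, lr_max, n_half),          # slow ramp up
                          np.linspace(lr_max, lr_min, n_half),          # symmetric ramp down
                          np.linspace(lr_min, lr_min / 100, n_tail)])   # little annealing bit at the end
    moms = np.concatenate([np.linspace(mom_max, mom_min, n_half),       # momentum low when LR is high
                           np.linspace(mom_min, mom_max, n_half),
                           np.full(n_tail, mom_max)])
    return lrs, moms

lrs, moms = one_cycle_schedule(1000)   # one (lr, momentum) pair per training iteration
```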
As I say, there's been so many cool things this week, I'm just going to pick two. 00:04:05.320 |
There's a fairly new project called Kubeflow, which is basically TensorFlow for Kubernetes. 00:04:11.960 |
Hamel wrote a very nice article about magical sequence-to-sequence models, building data 00:04:19.800 |
products on that, using Kubernetes to kind of put that in production and so forth. 00:04:29.320 |
He said that the Google Kubeflow team created a demo based on what he wrote earlier this 00:04:34.560 |
year, directly based on the skills learned in fast.ai, and I will be presenting this technique at KDD. 00:04:41.080 |
KDD is one of the top academic conferences, so I wanted to share this as a motivation 00:04:47.120 |
for folks to blog, which I think is a great point. 00:04:51.760 |
None of us who go out and write a blog really think our blog is actually 00:04:58.120 |
going to be very good, probably nobody's going to read it, and then when people actually 00:05:02.440 |
do like it and read it, it's a great surprise; you just go, oh, it's actually something worthwhile. 00:05:09.840 |
So here is the tool where you can summarize GitHub issues, which is now live. 00:05:19.560 |
So I think that's a great story: if Hamel hadn't put his work out there, none of this 00:05:26.440 |
would have happened, and you can check out his post that made it all happen as well. 00:05:36.440 |
So talking of the magic of sequence-to-sequence models, let's build one. 00:05:44.600 |
So we're going to be specifically working on machine translation. 00:05:51.980 |
So machine translation is something that's been around for a long time, but specifically 00:05:58.040 |
we're going to look at an approach called neural machine translation, which is using neural networks for translation. 00:06:04.920 |
That wasn't really a thing in any kind of meaningful way until a couple of years ago. 00:06:13.320 |
And so thanks to Chris Manning from Stanford for the next three slides. 00:06:19.000 |
In 2015, Chris pointed out that neural machine translation first appeared properly, and it 00:06:25.000 |
was pretty crappy compared to the statistical machine translation approaches that used kind 00:06:28.840 |
of classic feature engineering and standard NLP kind of approaches of lots of stemming 00:06:35.360 |
and fiddling around with word frequencies and n-grams and lots of stuff. 00:06:42.000 |
A year later, it was better than everything else. 00:06:46.040 |
This is on a metric called BLEU, we're not going to discuss the metric because it's not 00:06:49.200 |
a very good metric and it's not very interesting, but it's what everybody uses. 00:06:53.200 |
So that was BLEU as of when Chris did this slide. 00:06:57.320 |
As of now, it's about up here, it's about 30. 00:07:02.800 |
So we're kind of seeing machine translation starting down the path that we saw starting 00:07:10.560 |
computer vision object classification in 2012, I guess, which is we kind of just surpassed 00:07:17.640 |
the state-of-the-art and now we're zipping past it at a great rate. 00:07:24.160 |
It's very unlikely that anybody watching this is actually going to build a machine translation 00:07:30.840 |
model because you can go to translate.google.com and use theirs and it works quite well. 00:07:37.260 |
So why are we learning about machine translation? 00:07:40.120 |
The reason we're learning about machine translation is that the general idea of taking some kind 00:07:44.480 |
of input, like a sentence in French, and transforming it into some other kind of output of arbitrary 00:07:52.200 |
length such as a sentence in English is a really useful thing to do. 00:07:59.080 |
For example, the thing that we just saw that Hamill did takes GitHub issues and turns them 00:08:08.240 |
Another example is taking videos and turning them into descriptions. 00:08:19.720 |
Basically anything where you're spitting out kind of an arbitrary sized output, very often 00:08:24.400 |
that's a sentence, so maybe taking a CT scan and spitting out a radiology report, this 00:08:30.760 |
is where you can use sequence-to-sequence learning. 00:08:37.200 |
So the important thing about a neural machine translation, there's more slides from Chris, 00:08:45.200 |
and generally sequence-to-sequence models is that there's no fussing around with heuristics 00:08:53.520 |
and hacky feature engineering or whatever, it's end-to-end training. 00:08:58.720 |
We're able to build these distributed representations which are shared by lots of concepts within 00:09:03.920 |
a single network, we're able to use long-term state in the RNN, so use a lot more context 00:09:10.640 |
than kind of n-gram type approaches, and in the end the text we're generating uses an 00:09:15.560 |
RNN as well so we can build something that's more fluid. 00:09:19.240 |
We're going to use a bidirectional LSTM with attention, well actually we're going to use 00:09:26.000 |
a bidirectional GRU with attention, but basically the same thing. 00:09:30.120 |
So you already know about bidirectional recurrent neural networks, and attention we're going to learn about today. 00:09:36.240 |
These general ideas you can use for lots of other things as well as Chris points out on 00:09:45.560 |
So let's jump into the code which is in the translate notebook, funnily enough. 00:10:03.960 |
And so we're going to try to translate French into English. 00:10:09.840 |
And so the basic idea is that we're going to try and make this look as much like a standard neural network training problem as possible. 00:10:21.080 |
So we're going to need three things, you all remember the three things: data, a suitable architecture, and a loss function. 00:10:33.400 |
Once you've got those three things, you run fit. 00:10:35.720 |
And all things going well, you end up with something that solves your problem. 00:10:41.920 |
So data, we generally need x y pairs because we need something which we can feed it into 00:10:50.160 |
the loss function and say I took my x value which was my French sentence and the loss 00:10:58.920 |
function says it was meant to generate this English sentence and then you had your predictions 00:11:05.560 |
which you would then compare and see how good it is. 00:11:08.740 |
So therefore we need lots of these tuples of French sentences with their equivalent English sentences. 00:11:18.840 |
Obviously this is harder to find than a corpus for a language model, because for a language model you just need text in one language. 00:11:29.640 |
For any living language whose speakers use computers, 00:11:37.480 |
there will be at least a few gigabytes of text floating around the internet for you to grab. 00:11:42.600 |
So building a language model is only challenging corpus-wise for ancient languages, one of 00:11:49.840 |
our students is trying to do a Sanskrit one for example at the moment, but that's very rarely an issue. 00:11:56.720 |
For translation there are actually some pretty good parallel corpuses available for European 00:12:03.600 |
The European Parliament basically has every sentence in every European language. 00:12:08.960 |
Anything that goes through the UN is translated to lots of languages. 00:12:14.320 |
For French to English we have a particularly nice thing, which is that pretty much any semi-official 00:12:21.400 |
Canadian website will have a French version and an English version. 00:12:27.600 |
This chap Chris Callison-Burch did a cool thing which is basically to try to transform French 00:12:32.400 |
URLs into English URLs by replacing -fr with -en and hoping that that retrieves the equivalent 00:12:39.200 |
document and then did that for lots and lots of websites and ended up creating a huge corpus 00:12:49.400 |
So French to English we have this particularly nice resource. 00:12:54.400 |
So we're going to start out by talking about how to create the data, then we'll look at 00:12:57.520 |
the architecture, and then we'll look at the loss function. 00:13:00.320 |
And so for bounding boxes all of the interesting stuff was in the loss function, but for neural 00:13:08.680 |
translation all of the interesting stuff is going to be in the architecture. 00:13:17.720 |
One of the things I want you to think about particularly is what are the relationships 00:13:22.360 |
and similarities in terms of the task we're doing and how we do it between language modeling and neural translation. 00:13:34.040 |
So the basic approach here is that we're going to take a sentence, so in this case the example 00:13:41.520 |
is English to German, and this slide's from Stephen Merity, we steal everything we can from him. 00:13:49.680 |
We start with some sentence in English, and the first step is to do basically the exact 00:13:54.640 |
same thing we do in a language model, which is to chuck it through an RNN. 00:14:00.200 |
Now with our language model, actually let's not even think about the language model, let's think about the classifier. 00:14:10.160 |
So something that turns this sentence into positive or negative sentiment. 00:14:16.760 |
We had a decoder, something which basically took the RNN output, and from our paper we did three things. 00:14:33.440 |
We took a max pool over all of the time steps, we took a mean pool over all the time steps, 00:14:38.760 |
and we took the value of the RNN at the last time step, stuck all those together, and put them through a linear layer. 00:14:46.920 |
Most people don't do that in most NLP stuff, I think it's something we invented. 00:14:54.040 |
People pretty much always use the last time step, so all the stuff we'll be talking about today uses the last time step. 00:15:01.320 |
So we start out by chucking this sentence through an RNN, and out of it comes some state. 00:15:09.480 |
So some state meaning some hidden state, some vector that represents the output of an RNN 00:15:18.840 |
You'll see the word that Stephen used here was encoder. 00:15:25.640 |
So like when we've talked about adding a custom head to an existing model, like the existing 00:15:31.280 |
pre-trained ImageNet model, for example, we say that's our backbone, and then we stick 00:15:35.480 |
on top of it some head that does the task we want. 00:15:39.620 |
In sequence-to-sequence learning, they use the word encoder. 00:15:44.640 |
But basically it's the same thing, it's some piece of a neural network architecture that 00:15:49.120 |
takes the input and turns it into some representation which we can then stick a few more layers 00:15:56.240 |
on top of to grab something out of it, such as we did for the classifier where we stuck 00:16:03.040 |
a linear layer on top of it to turn it into a sentiment, positive or negative. 00:16:11.800 |
So this time, though, we have something that's a little bit harder than just getting sentiment, 00:16:20.320 |
which is I want to turn this state not into a positive or negative sentiment, but into 00:16:24.800 |
a sequence of tokens, where that sequence of tokens is the German sentence that we want. 00:16:32.640 |
So this is sounding more like the language model than the classifier, because the language 00:16:37.720 |
model had multiple tokens for every input word, there was an output word, but the language 00:16:43.360 |
model was also much easier because the number of tokens in the language model output was 00:16:50.600 |
the same length as the number of tokens in the language model input. 00:16:54.040 |
And not only were they the same length, they exactly matched up. 00:16:58.000 |
Like after word 1 comes word 2, after word 2 comes word 3 and so forth. 00:17:04.040 |
But for translating language, you don't necessarily know that the word 'he' will be translated 00:17:11.860 |
as the first word in the output, and that 'loved' will be the second word in the output. 00:17:15.640 |
In this particular case, unfortunately, they are the same. 00:17:18.960 |
But very often the subject-object order will be different, or there will be some extra 00:17:24.480 |
words inserted, or some pronouns will need to add some gendered article to it, or whatever. 00:17:31.600 |
So this is the key issue we're going to have to deal with is the fact that we have an arbitrary 00:17:37.280 |
length output where the tokens in the output do not correspond to the same order of specific tokens in the input. 00:17:48.000 |
So the general idea is the same, use an RNN to encode the input, turns it into some hidden 00:17:53.880 |
state and then this is the new thing we're going to learn is generating a sequence output. 00:18:00.480 |
So we already know sequence to class, that's IMDB classifier, we already know sequence 00:18:08.060 |
to equal length sequence where it corresponds to the same items, that's the language model 00:18:13.800 |
for example, but we don't know yet how to do a general-purpose sequence to sequence, 00:18:21.240 |
Very little of this will make sense unless you really understand lesson 6, how an RNN works. 00:18:31.320 |
So if some of this lesson doesn't make sense to you and you find yourself wondering what 00:18:36.520 |
does he mean by 'hidden state' exactly, how's that working, go back and rewatch lesson 6 00:18:42.960 |
to give you a very quick review, we learned that an RNN at its heart is a standard fully 00:18:51.000 |
connected network, so here's one with one, two, three, four layers, takes an input and 00:18:57.520 |
puts it through four layers, but then at the second layer it can just concatenate in the 00:19:04.200 |
second input, third layer concatenate in the third input, but we actually wrote this in 00:19:08.520 |
Python as just literally a four-layer neural network, there was nothing else we used other than linear layers. 00:19:20.480 |
We used the same weight matrix every time an input came in, we used the same matrix 00:19:24.200 |
every time we went from one of these states to the next, and that's why these arrows are 00:19:28.440 |
the same color, and so we can redraw that previous thing like this. 00:19:34.960 |
And so not only did we redraw it, but we took four lines of linear linear linear linear code 00:19:43.520 |
in PyTorch and we replaced it with a for loop. 00:19:50.400 |
So remember we had something that did exactly the same thing as this, but it just had four 00:19:55.680 |
lines of code saying linear, linear, linear, linear, and we literally replaced it with a for loop. 00:20:05.380 |
So literally that refactoring, which doesn't change any of the math, any of the ideas, 00:20:12.120 |
any of the outputs, that refactoring is an RNN, it's turning a bunch of separate lines of code into a for loop. 00:20:26.680 |
We could take the output assignment that's outside the loop and put it inside the loop instead. 00:20:34.500 |
And if we do that, we're now going to generate a separate output for every input. 00:20:44.240 |
So in this case, this particular one here, the hidden state gets replaced each time and 00:20:49.960 |
we end up just spitting out the final hidden state. 00:20:52.880 |
So this one is this example, but if instead we had something that said hs.append(h) 00:21:02.200 |
and returned hs at the end, that would be this picture. 00:21:08.120 |
And so go back and relook at that notebook if this is unclear. 00:21:10.680 |
I think the main thing to remember is when we say hidden state, we're referring to a 00:21:17.480 |
See here, here's the vector: h = torch.zeros(n_hidden). 00:21:24.880 |
Now of course it's a vector for each thing in the mini-batch, so it's a matrix. 00:21:29.720 |
But generally when I speak about these things, I ignore the mini-batch piece and treat it as if it's just a single vector. 00:21:42.040 |
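As a quick reminder of what that lesson 6 refactoring looks like, here is a minimal sketch of an RNN written as a plain loop over linear layers (class and variable names are illustrative, not the lesson's exact code):

```python
import torch
import torch.nn as nn

class TinyRNN(nn.Module):
    """An RNN as an explicit loop: one reused input matrix, one reused
    hidden-to-hidden matrix, and the hidden state h is a vector per batch item."""
    def __init__(self, n_vocab, n_emb, n_hidden):
        super().__init__()
        self.emb = nn.Embedding(n_vocab, n_emb)
        self.l_in = nn.Linear(n_emb, n_hidden)          # same weights for every input step
        self.l_hidden = nn.Linear(n_hidden, n_hidden)   # same weights for every state-to-state arrow
        self.l_out = nn.Linear(n_hidden, n_vocab)

    def forward(self, xs):                              # xs: sequence length x batch size of token ids
        bs = xs.size(1)
        h = torch.zeros(bs, self.l_hidden.out_features) # the "hidden state" vector
        outs = []
        for x in xs:                                    # the for loop that replaced linear-linear-linear-linear
            h = torch.tanh(self.l_in(self.emb(x)) + self.l_hidden(h))
            outs.append(self.l_out(h))                  # output inside the loop: one output per input
        return torch.stack(outs)
```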
We also learned that you can stack these layers on top of each other. 00:21:45.840 |
So rather than this first RNN spitting out output, it could just spit out inputs into a second RNN. 00:21:53.440 |
And if you're thinking at this point, "I think I understand this, but I'm not quite sure," 00:22:00.320 |
if you're anything like me, that means you don't understand this. 00:22:04.160 |
And the only way you know that you actually understand it is to go and write this from scratch. 00:22:12.520 |
And if you can't do that, then you don't understand it. 00:22:16.080 |
You can go back and rewatch Lesson 6 and check out the notebook and copy some of the ideas 00:22:20.960 |
until you can; it's really important that you can write that from scratch. 00:22:28.400 |
So you want to make sure you create a two-layer RNN. 00:22:34.120 |
And this is what it looks like if you unroll it. 00:22:38.840 |
So that's the goal, is to get to a point that we first of all have these X, Y pairs of sentences, 00:22:48.360 |
So we're going to start by downloading this dataset, and training a translation model 00:22:59.080 |
Google's translation model has 8 layers of RNN stacked on top of each other. 00:23:03.920 |
There's no conceptual difference between 8 layers and 2 layers, it's just like if you're 00:23:09.680 |
Google and you have more GPUs or TPUs than you know what to do with, then you're fine 00:23:13.360 |
doing that, whereas in our case it's pretty likely that the kind of sequence-to-sequence 00:23:17.640 |
models we're building are not going to require that level of computation. 00:23:21.880 |
So to keep things simple, let's do a cut-down thing where rather than learning how to translate 00:23:28.000 |
French into English for any sentence, let's learn to translate French questions into English 00:23:34.440 |
And specifically questions that start with what, where, which, when. 00:23:38.280 |
So you can see here I've got a regex that looks for things that start with wh and end with a question mark. 00:23:43.160 |
So I just go through the corpus, open up each of the two files, each line is one parallel 00:23:49.120 |
text, zip them together, grab the English question, the French question, and check whether they match the regexes. 00:23:57.760 |
Dump that out as a pickle so that I don't have to do it again. 00:24:00.880 |
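Roughly, that filtering step might look something like this (the file names and exact regexes here are assumptions based on the description, not necessarily the notebook's exact code):

```python
import re
import pickle

# Hypothetical file names for the two sides of the parallel corpus.
en_fname, fr_fname = 'corpus.en', 'corpus.fr'

re_eq = re.compile(r'^(Wh[^?.!]+\?)')   # English questions starting with "Wh..." and ending in "?"
re_fq = re.compile(r'^([^?.!]+\?)')     # French side: any single question

lines = ((re_eq.search(eq), re_fq.search(fq))
         for eq, fq in zip(open(en_fname), open(fr_fname)))
qs = [(e.group(), f.group()) for e, f in lines if e and f]   # keep pairs where both sides matched

pickle.dump(qs, open('fr-en-qs.pkl', 'wb'))   # dump as a pickle so we don't have to do it again
```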
And so we now have 52,000 sentences, and here are some examples of those sentence pairs. 00:24:09.480 |
One nice thing about this is that what, who, where type questions tend to be fairly short, 00:24:17.960 |
But I would say the idea that we could learn from scratch with no previous understanding 00:24:24.160 |
of the idea of language, let alone of English or of French, that we could create something 00:24:28.840 |
that can translate one to the other for any arbitrary question with only 50,000 sentences, 00:24:34.880 |
sounds like a ludicrously difficult thing to ask this to do. 00:24:39.840 |
So I would be impressed if we could make any progress whatsoever. 00:24:43.480 |
This is very little data to do a very complex exercise. 00:24:49.400 |
So this contains the tuples of French and English. 00:24:52.840 |
You can use this handy idiom to split them apart into a list of English questions and 00:24:59.560 |
And then we tokenize the English questions and we tokenize the French questions. 00:25:05.200 |
So remember that just means splitting them up into separate words or word-like things. 00:25:12.440 |
By default, the tokenizer that we have here, and remember this is a wrapper around the 00:25:17.880 |
spaCy tokenizer, which is a fantastic tokenizer. 00:25:22.120 |
This wrapper by default assumes English, so to ask for French, you just add an extra parameter. 00:25:28.240 |
The first time you do this, you'll get an error saying that you don't have the spaCy 00:25:31.600 |
French model installed, and you can google to get the command, something like python -m spacy 00:25:37.600 |
download fr or something like that, to grab the French model. 00:25:44.480 |
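A bare-bones version of that French tokenization with plain spaCy might look like this (fastai's Tokenizer wrapper does more, e.g. adding case markers; the model name and download command depend on your spaCy version):

```python
import spacy

# Assumes the French model has been downloaded, e.g. `python -m spacy download fr`
# (the model name and command vary between spaCy versions).
nlp_fr = spacy.load('fr')

def tokenize_fr(sentences):
    """Split French sentences into lowercased tokens; spaCy's French rules handle
    the apostrophes and hyphens that an English tokenizer would get wrong."""
    return [[tok.text.lower() for tok in nlp_fr(s)] for s in sentences]

tokenize_fr(["Qu'est-ce que c'est ?"])
```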
I don't think any of you are going to have RAM problems here because this is not particularly 00:25:48.440 |
big corpus, but I know that some of you were trying to train new language models during the week and running out of memory. 00:25:55.280 |
If you do, it's worth knowing what these functions are actually doing. 00:25:59.420 |
So for example, this one here is processing every sentence across multiple processes. 00:26:05.560 |
And remember, fastai code is designed to be pretty easy to read, so here's the three lines 00:26:17.200 |
of code for proc_all_mp: find out how many CPUs you have, divide by two, because normally 00:26:23.200 |
with hyperthreading they don't actually all give you a speedup, then in parallel run this processing function. 00:26:32.720 |
So that's going to spit out a whole separate Python process for every CPU you have. 00:26:37.720 |
If you have a lot of cores, a lot of Python processes, everyone's going to load all this 00:26:42.580 |
data in and that can potentially use up all your RAM. 00:26:46.720 |
So you could replace that with just proc_all rather than proc_all_mp to use less RAM. 00:26:58.400 |
So at the moment we're calling this function partition_by_cores, which calls partition 00:27:04.920 |
on a list and asks to split it into a number of equal length things according to how many 00:27:10.920 |
CPUs you have, so you could replace that by splitting it into a smaller number of chunks and running it over those. 00:27:21.440 |
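The idea behind proc_all_mp and partition_by_cores is roughly the following (this is an illustrative re-implementation, not fastai's exact code; the tokenizer stand-in is a placeholder):

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

def partition(a, sz):
    """Split list a into chunks of size sz."""
    return [a[i:i + sz] for i in range(0, len(a), sz)]

def partition_by_cores(a):
    n_cpus = max(1, multiprocessing.cpu_count() // 2)    # halve because of hyperthreading
    return partition(a, (len(a) + n_cpus - 1) // n_cpus)

def proc_chunk(chunk):
    return [s.lower().split() for s in chunk]            # placeholder for the real tokenizer

def proc_all_mp(sentences):
    chunks = partition_by_cores(sentences)
    with ProcessPoolExecutor(len(chunks)) as ex:         # one Python process per chunk
        results = list(ex.map(proc_chunk, chunks))       # each process loads its own copy of the data
    return sum(results, [])                              # flatten back into one list of token lists
```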
Was an attention layer tried in the language model? 00:27:23.920 |
Do you think it would be a good idea to try and add one? 00:27:27.640 |
We haven't learned about attention yet, so let's ask about things that we have got to so far. 00:27:34.360 |
The short answer is no, I haven't tried it properly; yes, you should try it, because it might well help. 00:27:41.280 |
In general, there's going to be a lot of things that we cover today, which if you've done 00:27:46.360 |
some sequence-to-sequence stuff before, you'll want to know about something we haven't covered 00:27:51.480 |
I'm going to cover all the sequence-to-sequence things. 00:27:53.480 |
So at the end of this, if I haven't covered the thing you wanted to know about, please 00:27:58.600 |
If you ask me before, I'll be answering something based on something I'm about to teach you. 00:28:05.160 |
So having tokenized the English and French, you can see how it gets split out. 00:28:10.320 |
You can see that tokenization for French is quite different looking because French loves 00:28:15.080 |
their apostrophes and their hyphens and stuff. 00:28:17.920 |
So if you try to use an English tokenizer for a French sentence, you're going to get a pretty poor result. 00:28:25.040 |
So I don't find you need to know heaps of NLP ideas to use deep learning for NLP, but 00:28:31.240 |
just some basic stuff, like using the right tokenizer for your language, is important. 00:28:38.040 |
And so some of the students this week in our study group have been trying to build language 00:28:42.080 |
models for Chinese, for instance, which of course doesn't really have the same concept of a word. 00:28:50.440 |
So we've been starting to look at, briefly mentioned last week, this Google thing called 00:28:55.040 |
SentencePiece, which basically splits things into arbitrary subword units. 00:29:01.020 |
And so when I say tokenize, if you're using a language that doesn't have spaces in it, you 00:29:06.800 |
should probably be checking out SentencePiece or some other similar subword unit thing instead. 00:29:13.800 |
And hopefully in the next week or two we'll be able to report back with some early results 00:29:25.680 |
So having tokenized it, we'll save that to disk. 00:29:27.880 |
And then remember the next step after we create tokens is to turn them into numbers. 00:29:33.120 |
And to turn them into numbers, we have two steps. 00:29:35.240 |
The first is to get a list of all of the words that appear, and then we turn every word into its index in that list. 00:29:44.560 |
If there are more than 40,000 words that appear, then let's cut it off there so it doesn't get too unwieldy. 00:29:51.240 |
And we insert a few extra tokens for beginning of stream, padding, end of stream, and unknown. 00:30:00.880 |
So if we try to look up something that wasn't in the 40,000 most common, then we use a default dict to return the unknown token instead. 00:30:11.600 |
So now we can go ahead and turn every token into an id by putting it through the string-to-int dictionary we just created. 00:30:20.840 |
And then at the end of that, let's add the number 2, which is end of stream. 00:30:25.720 |
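Putting those numericalization steps together, a sketch might look like this (the token spellings are assumptions; only the special indices 0-3, the 40,000 cutoff, and the trailing 2 for end of stream come from the description above):

```python
import collections
import numpy as np

def toks2ids(tok_sents, max_vocab=40000):
    """Turn tokenized sentences into arrays of ids, plus the two-way vocab mappings."""
    freq = collections.Counter(w for s in tok_sents for w in s)
    itos = [w for w, c in freq.most_common(max_vocab)]            # int -> string
    for i, tok in enumerate(['_bos_', '_pad_', '_eos_', '_unk_']):
        itos.insert(i, tok)                                       # 0=bos, 1=pad, 2=eos, 3=unk
    stoi = collections.defaultdict(lambda: 3,                     # string -> int; unknown words map to 3
                                   {w: i for i, w in enumerate(itos)})
    ids = np.array([[stoi[w] for w in s] + [2] for s in tok_sents],   # append 2 = end of stream
                   dtype=object)
    return ids, itos, stoi
```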
And you'll see the code you see here is the code I write when I'm iterating and experimenting. 00:30:32.880 |
Because 99% of the code I write when I'm iterating and experimenting turns out to be totally 00:30:38.440 |
wrong or stupid or embarrassing and you don't get to see it. 00:30:42.640 |
But there's no point refactoring that and making it beautiful when I'm writing it. 00:30:49.320 |
So I was wanting you to see all the little shortcuts I have. 00:30:52.560 |
So rather than doing this properly and having some constant or something for end of stream 00:30:57.560 |
marker and using it, when I'm prototyping, I just do the easy stuff. 00:31:04.680 |
Not so much that I end up with broken code, but I try to find some mid-ground between 00:31:17.760 |
I just heard him mention that we divide the number of CPUs by 2 because with hyperthreading we 00:31:24.760 |
don't get a speedup using all the hyperthreaded cores. 00:31:27.800 |
Is this based on practical experience, or is there some underlying reason why we wouldn't get the speedup? 00:31:35.560 |
Not all things seem like this, but I definitely noticed that with tokenization, 00:31:41.720 |
hyperthreading seems to slow things down a little bit. 00:31:45.000 |
Also if I use all the cores, often I want to do something else at the same time, like 00:31:51.600 |
generally run some interactive notebook and I don't have any spare room to do that. 00:32:03.120 |
So now for our English and our French, we can grab our list of IDs. 00:32:07.800 |
And when we do that, of course, we need to make sure that we also store the vocabulary. 00:32:11.640 |
There's no point having IDs if we don't know what the number 5 represents. 00:32:17.600 |
So that's our vocabulary, the int-to-string list, and the reverse mapping, string to int, that we can save alongside the IDs. 00:32:28.320 |
So just to confirm it's working, we can go through each ID, convert the int to a string 00:32:33.440 |
and spit that out, and there we have our thing back, now with an end-of-stream marker at the end. 00:32:39.560 |
Our English vocab is 17,000, our French vocab is 25,000. 00:32:44.880 |
So this is not too complex a vocab that we're dealing with, which is nice to know. 00:32:54.840 |
So we spent a lot of time on the forums during the week discussing how pointless word vectors 00:33:01.060 |
are and how you should stop getting so excited about them, we're now going to use them. 00:33:07.760 |
Basically, all the stuff we've been learning about using language models and pre-trained 00:33:13.440 |
proper models rather than pre-trained linear single layers, which is what word vectors 00:33:18.240 |
are, I think applies equally well to sequence-to-sequence, but I haven't tried it yet. 00:33:25.580 |
So Sebastian and I are starting to look at that, slightly distracted by preparing this 00:33:31.560 |
class at the moment, but after this class is done. 00:33:34.080 |
So there's a whole thing, for anybody interested in creating some genuinely new, highly publishable 00:33:41.120 |
results, the entire area of sequence-to-sequence with pre-trained language models hasn't been 00:33:48.160 |
touched yet, and I strongly believe it's going to be just as good as classification stuff. 00:33:54.800 |
And if you work on this and you get to the point where you have something that's looking 00:34:00.400 |
exciting and you want help publishing it, I'm very happy to help co-author papers on it. 00:34:09.980 |
So feel free to reach out if and when you have some interesting results. 00:34:15.040 |
So at this stage, we don't have any of that, so we're going to use very little fast.ai 00:34:22.600 |
actually, and very little in terms of fast.ai ideas. 00:34:31.360 |
Anyway, so let's at least use decent word vectors. 00:34:36.880 |
There are better word vectors now, and fast text is a pretty good source of word vectors. 00:34:42.000 |
There's hundreds of languages available for them, your language is likely to be represented. 00:34:47.100 |
So to grab them, you can click on this link, download word vectors for a language that 00:34:52.080 |
you're interested in, install the fast text Python library. 00:35:01.080 |
It's not available on PyPI, but here's a handy trick. 00:35:04.720 |
If there is a GitHub repo that has a setup.py in it and a requirements.txt in it, you can 00:35:12.760 |
just chuck git+ at the start and then stick that in your pip install and it works. 00:35:19.400 |
Hardly anybody seems to know this, and if you go to the fast text repo, they won't tell you this. 00:35:25.840 |
They'll say you have to download it and cd into it and blah blah blah, but you don't have to. 00:35:30.960 |
You can also use this for the fast.ai library, by the way. 00:35:33.640 |
If you want to pip install the latest version of fast.ai, you can totally do this. 00:35:39.080 |
So you grab the library, import it, load the model, so here's my English model, and here's my French model. 00:35:47.040 |
You'll see there's a text version and a binary version, the binary version's a bit faster, 00:35:50.640 |
we're going to use that, the text version's also a bit buggy. 00:35:55.560 |
And then I'm going to convert it into a standard Python dictionary to make it a bit easier 00:35:59.680 |
to work with, so this is just going to go through each word with a dictionary comprehension 00:36:11.440 |
We can go ahead and look up a word, for example, comma, and that will return a vector. 00:36:17.640 |
The length of that vector is the dimensionality of this set of word vectors, so in this case 00:36:22.600 |
we've got 300-dimensional English and French word vectors. 00:36:31.960 |
For reasons that you'll see in a moment, I also want to find out what the mean of my 00:36:35.680 |
vectors are and the standard deviation of my vectors are. 00:36:38.680 |
So the mean's about zero and the standard deviation is about 0.3. 00:36:49.040 |
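Loading the vectors and building the dictionary might look roughly like this (paths are placeholders, and the module/method names are from the official fastText Python bindings, which may differ slightly between versions):

```python
import numpy as np
import fasttext   # module may be spelled `fastText` depending on the version you installed

en_model = fasttext.load_model('wiki.en.bin')   # placeholder paths to the downloaded binary vectors
fr_model = fasttext.load_model('wiki.fr.bin')

def get_vecs(model):
    """Turn a fastText model into a plain dict of {word: vector}."""
    return {w: model.get_word_vector(w) for w in model.get_words()}

en_vecd, fr_vecd = get_vecs(en_model), get_vecs(fr_model)

dim_en_vec = len(en_vecd[','])                        # 300-dimensional vectors
all_vecs = np.stack(list(en_vecd.values()))
vec_mean, vec_std = all_vecs.mean(), all_vecs.std()   # roughly 0 and 0.3
```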
Often corpuses have a pretty long-tailed distribution of sequence length, and it's the longest sequences 00:36:57.960 |
that kind of tend to overwhelm how long things take and how much memory is used and stuff like that. 00:37:04.380 |
So I'm going to grab, in this case, the 99th and 97th percentile of the English and French sequence lengths and truncate to those. 00:37:16.320 |
Originally I was using the 90th percentile, so these are poorly named variables, so apologies for that. 00:37:26.320 |
We've got our tokenized, numericalized English and French dataset. 00:37:39.880 |
So PyTorch expects a dataset object, and hopefully by now you all can tell me that a dataset 00:37:47.720 |
object requires two things, a length and an indexer. 00:37:53.280 |
So I started out writing this, and I was like, "Okay, I need a seq-to-seq dataset." 00:37:56.320 |
I started out writing it, and I thought, "Okay, we're going to have to pass it our x's and 00:37:59.920 |
our y's and store them away, and then my indexer is going to need to return a numpy array of 00:38:06.960 |
the x's at that point and a numpy array of the y's at that point, and oh, that's it." 00:38:13.120 |
So then after I wrote this, I realized I haven't really written a seq-to-seq dataset, I've just written a completely generic dataset. 00:38:20.240 |
So here's the simplest possible dataset that works for any pair of arrays. 00:38:26.500 |
So it's now poorly named, it's much more general than a seq-to-seq dataset, but that's what I ended up calling it. 00:38:33.600 |
This A function, remember we've got V for variables, T for tensors, A for arrays. 00:38:39.800 |
So this basically goes through each of the things you pass it, if it's not already a 00:38:43.400 |
numpy array, it converts it into a numpy array and returns back a tuple of all of the things 00:38:49.120 |
that you passed it, which are now guaranteed to be numpy arrays. 00:39:04.640 |
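Here is a sketch of what that minimal dataset looks like (names are illustrative; fastai's actual A helper is a little more involved):

```python
import numpy as np
from torch.utils.data import Dataset

def A(*a):
    """Sketch of the helper described above: make sure everything is a numpy array."""
    return tuple(o if isinstance(o, np.ndarray) else np.array(o) for o in a)

class Seq2SeqDataset(Dataset):
    """Works for any pair of arrays: a length and an indexer is all PyTorch needs."""
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __getitem__(self, idx):
        return A(self.x[idx], self.y[idx])
    def __len__(self):
        return len(self.x)
```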
So now we need to grab our English and French IDs and get a training set and a validation set. 00:39:13.360 |
And so one of the things which is pretty disappointing about a lot of code out there on the internet 00:39:18.800 |
is that they don't follow some simple best practices. 00:39:22.240 |
For example, if you go to the PyTorch website, they have an example section for sequence-to-sequence translation. 00:39:31.120 |
Their example does not have a separate validation set. 00:39:33.840 |
I tried it, training according to their settings, and tested it with a validation set, and it turned out to overfit terribly. 00:39:41.800 |
So this is not just a theoretical problem, the actual PyTorch repo has the actual official 00:39:47.720 |
sequence-to-sequence translation example, which does not check for overfitting and overfits terribly. 00:39:54.200 |
Also it fails to use minibatches, so it actually fails to utilize any of the efficiency of PyTorch. 00:40:03.120 |
Even if you find code in the official PyTorch repo, don't assume it's any good at all. 00:40:09.400 |
The other thing you'll notice is that pretty much every other sequence-to-sequence model 00:40:16.640 |
I've found in PyTorch anywhere on the internet has clearly copied from that shitty PyTorch 00:40:22.220 |
repo because it's all the same variable names, it has the same problems, it has the same mistakes. 00:40:28.400 |
Like another example, nearly every PyTorch convolutional neural network I've found does not use adaptive pooling. 00:40:36.120 |
So in other words, the final layer is always like average_pool(7,7). 00:40:41.880 |
So they assume that the previous layer is 7x7, and if you use any other size input you get an error. 00:40:49.200 |
And therefore nearly everybody I've spoken to that uses PyTorch thinks that there is 00:40:53.000 |
a fundamental limitation of CNNs that they are tied to the input size. 00:41:00.240 |
So every time we grab a new model and stick it in the fastai repo, I have to go in, search 00:41:04.680 |
for pool and add adaptive to the start and replace the 7 with a 1, and now it works on any size input. 00:41:12.400 |
So just be careful, it's still early days, and believe it or not, even though most of 00:41:18.280 |
you have only started in the last year your deep learning journey, you know quite a lot 00:41:24.040 |
more about a lot of the more important practical aspects than the vast majority of people that 00:41:28.520 |
are publishing and writing stuff in official repos. 00:41:32.520 |
So you kind of need to have a little more self-confidence than you might expect when reading other people's code. 00:41:39.120 |
If you find yourself thinking that looks odd, it's not necessarily you, right? 00:41:49.880 |
So I would say like at least 90% of deep learning code that I start looking at turns out to 00:41:59.400 |
have like deathly serious problems that make it completely unusable for anything. 00:42:07.960 |
And so I've been telling people that I've been working with recently, if a repo you're 00:42:13.920 |
looking at doesn't have a section on it saying here's the test we did where we got the same 00:42:18.080 |
results as the paper that this is meant to be implementing, that almost certainly means 00:42:21.960 |
they haven't got the same results as the paper they're implementing, they probably haven't 00:42:25.720 |
And if you run it, it definitely won't get those results, because it's hard to get things right the first time. 00:42:32.080 |
It probably takes me 12 goes; it probably takes people smarter than me 6 goes, but 00:42:38.000 |
if they haven't tested it once, it almost certainly won't work. 00:42:49.100 |
Grab a bunch of random numbers, one for each row of your data, and see whether they're above some cutoff to get a list of bools. 00:42:57.820 |
Index into your array with that list of bools to grab a training set. 00:43:01.520 |
Index into that array with the opposite of that list of bools to get your validation set. 00:43:06.080 |
There's lots of ways of doing it, I just like to do different ways to see a few approaches. 00:43:12.960 |
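A sketch of that split (the variable names en_ids/fr_ids and the 0.1 cutoff for a roughly 90/10 split are illustrative assumptions):

```python
import numpy as np

np.random.seed(42)
trn_keep = np.random.rand(len(en_ids)) > 0.1           # one random number per row; True about 90% of the time

en_trn, fr_trn = en_ids[trn_keep], fr_ids[trn_keep]    # index with the list of bools -> training set
en_val, fr_val = en_ids[~trn_keep], fr_ids[~trn_keep]  # index with its opposite -> validation set
```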
So now we can create our data set with our X's and our Y's, French and English. 00:43:17.880 |
If you want to translate instead English to French, switch these two around and you're done. 00:43:26.720 |
We can just grab our data loader and pass in our data set and batch size. 00:43:38.160 |
I'm not going to go into the details about why, we can talk about it during the week 00:43:41.840 |
if you're interested, but have a think about why we might need to transpose their orientation. 00:43:51.120 |
One is that since we've already done all the pre-processing, there's no point spawning 00:43:55.560 |
off multiple workers to do augmentation or whatever because there's no work to do. 00:44:00.680 |
So making num workers equals 1 will save you some time. 00:44:04.880 |
We have to tell it what our padding index is, that's actually pretty important because 00:44:10.320 |
what's going to happen is that we've got different length sentences and fastai will just automatically 00:44:19.160 |
stick them together and pad the shorter ones so that they'll end up equal length. 00:44:24.520 |
Because remember a tensor has to be rectangular. 00:44:31.560 |
In the decoder in particular, I actually want my padding to be at the end, not at the start. 00:44:38.840 |
For a classifier, I want the padding at the start because I want that final token to represent 00:44:47.840 |
But in the decoder, as you'll see, it's going to work out a bit better to have padding at 00:44:52.640 |
And then finally, since we've got sentences of different lengths coming in and they all 00:44:59.000 |
have to be put together in a mini-batch to be the same size by padding, we would much 00:45:04.080 |
prefer that the sentences in a mini-batch are of similar sizes already because otherwise 00:45:10.160 |
it's going to be as long as the longest sentence and that's going to end up wasting time and memory. 00:45:16.280 |
So therefore I'm going to use the sampler trick that we learned last time, which is the sortish sampler. 00:45:22.280 |
We're going to ask it to sort everything by length first. 00:45:27.360 |
And then for the training set, we're going to ask it to randomize the order of things 00:45:32.440 |
but to roughly make it so that things of similar length are about in the same spot. 00:45:37.440 |
So we've got our sort sampler and our sortish sampler. 00:45:41.960 |
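To make the pad-at-the-end idea concrete, here is a generic PyTorch-style collate function sketch (this is not fastai's DataLoader code; the padding index of 1 follows the vocabulary convention above):

```python
import numpy as np
import torch

PAD_IDX = 1   # index of the padding token in the vocabulary above

def pad_collate(batch):
    """Pad every (x, y) pair in a mini-batch to the batch's longest length, padding at the end."""
    xs, ys = zip(*batch)
    def pad_end(seqs):
        max_len = max(len(s) for s in seqs)
        out = np.full((len(seqs), max_len), PAD_IDX, dtype=np.int64)
        for i, s in enumerate(seqs):
            out[i, :len(s)] = s          # real tokens first, padding after (what the decoder wants)
        return torch.from_numpy(out)
    return pad_end(xs), pad_end(ys)
```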
And then at that point, we can create a model_data object. 00:45:45.480 |
For a model_data object, it really does one thing which is it says I have a training set 00:45:50.600 |
and a validation set and an optional test set and sticks them into a single object. 00:45:55.480 |
We also have a path so that it has somewhere to store temporary files, models, stuff like 00:46:01.400 |
So we're not using fast.ai for very much at all in this example, just a minimal set to get us going. 00:46:16.280 |
In the end, once you've got a model_data object, you can then create a learner and you can call fit. 00:46:22.080 |
So that's a minimal amount of fast.ai stuff here. 00:46:28.440 |
This is a standard PyTorch compatible dataset. 00:46:32.600 |
This is a standard PyTorch compatible data loader. 00:46:34.840 |
Behind the scenes, it's actually using the fast.ai version because I do need to do this 00:46:40.960 |
So there's a few tweaks in our version that are a bit faster and a bit more convenient. 00:46:46.880 |
The fast.ai samplers we're using, but there's not too much going on here. 00:46:59.800 |
So as I said, most of the work is in the architecture. 00:47:04.060 |
And so the architecture is going to take our sequence of tokens. 00:47:14.420 |
It's going to spit them into an encoder, or in computer vision terms, what we've been 00:47:23.240 |
calling a backbone, something that's going to try and turn this into some kind of representation. 00:47:31.920 |
That's going to spit out the final hidden state, which for each sentence is just a vector. 00:47:45.560 |
That's all going to be using very direct simple techniques that we've already learnt. 00:47:50.200 |
And then we're going to take that and we're going to spit it into a different RNN, which 00:47:54.960 |
is a decoder, and that's going to have some new stuff because we need something that can generate a sequence of words, one at a time. 00:48:03.840 |
And it's going to keep going until it thinks it's finished the sentence, it doesn't know 00:48:07.840 |
how long the sentence is going to be ahead of time, it keeps going until it thinks it's 00:48:11.480 |
finished the sentence, and then it stops and returns the sentence. 00:48:19.800 |
So in terms of variable naming here, there's basically identical variables for encoder 00:48:27.000 |
and decoder, well attributes for encoder and decoder. 00:48:29.880 |
The encoder versions have 'enc', the decoder versions have 'dec'. 00:48:39.120 |
And so I always try to mention what the mnemonics are, rather than writing things out in too verbose a way. 00:48:48.920 |
So just remember, 'enc' is an encoder, 'dec' is a decoder, and there's an embedding. 00:48:57.680 |
The RNN, in this case, is a GRU, not an LSTM, they're nearly the same thing. 00:49:04.000 |
So don't worry about the difference, you could replace it with an LSTM and you'll get basically 00:49:08.360 |
To replace it with an LSTM, simply type LSTM and you're done. 00:49:17.440 |
So we need to create an embedding layer to take -- because remember what we're being 00:49:23.320 |
passed is the index of the words into a vocabulary, and we want to grab their fast text embedding, 00:49:30.240 |
and then over time we might want to also fine-tune to train that embedding into it. 00:49:37.640 |
So to create an embedding, we'll create embedding up here, so we'll just say nn.embedding. 00:49:43.760 |
So it's important that you know now how to set the rows and columns for your embedding. 00:49:49.160 |
So the number of rows has to be equal to your vocabulary size, so each vocabulary item has its own row. And how big should each embedding be? 00:49:58.480 |
Well in this case it was determined by fast text, and the fast text embeddings are size 300. 00:50:05.120 |
So we have to use size 300 as well, otherwise we can't start out by using their embeddings. 00:50:14.000 |
So what we want to do is this is initially going to give us a random set of embeddings, 00:50:18.760 |
and so we're going to now go through each one of these, and if we find it in fast text, we'll replace the random embedding with the fast text one. 00:50:25.580 |
So again, something that you should already know is that a PyTorch module that is learnable 00:50:34.080 |
has a weight attribute, and the weight attribute is a variable, and the variables have a data 00:50:40.920 |
attribute, and the data attribute is a tensor. 00:50:44.440 |
And you'll notice very often today I'm saying here is something you should know, not so 00:50:48.440 |
that you think oh I don't know that I'm a bad person, but so that you think okay this 00:50:53.760 |
is a concept that I haven't learned yet and Jeremy thinks I ought to know about, and so 00:51:01.400 |
I've got to write that down, and I'm going to go home and Google it. This 00:51:05.640 |
is a normal PyTorch attribute in every single learnable PyTorch module. 00:51:11.280 |
This is a normal PyTorch attribute in every single PyTorch variable. 00:51:15.840 |
And so if you don't know how to grab the weights out of a module, or you don't know how to grab 00:51:19.520 |
the tensor out of a variable, it's going to be hard for you to build new things or debug existing things. 00:51:26.440 |
So if I say you ought to know this, and you're thinking I don't know this, don't run away 00:51:31.200 |
and hide, go home and learn the thing, and if you're having trouble learning the thing, 00:51:36.200 |
because you can't find documentation about it, or you don't understand that documentation, 00:51:40.080 |
or you don't know why Jeremy thought it was important you know it, jump on the forum and 00:51:44.080 |
say please explain this thing, here's my best understanding of that thing as I have it at 00:51:49.360 |
the moment, here's the resources I've looked at, help fill me in. 00:51:54.440 |
And normally if I respond, it's very likely I will not tell you the answer, but I will 00:52:00.520 |
instead give you a problem that you could solve, which, if you solve it, will answer the question for 00:52:06.500 |
you, because I know that that way it will be something you remember. 00:52:10.600 |
So again, don't be put off if I'm like okay, go read this link, try and summarize that 00:52:15.080 |
thing, tell us what you think, like I'm trying to be helpful, not unhelpful, and if you're 00:52:18.920 |
still not following, just come back and say I had a look, honestly that link you sent, 00:52:25.000 |
I don't know what any of it means, I wouldn't know where to start, whatever. 00:52:28.680 |
I'll keep trying to help you until you fully understand it. 00:52:35.680 |
So now that we've got our weight tensor, we can just go through our vocabulary and we can 00:52:42.880 |
look up the word in our pre-trained vectors, and if we find it we will replace the random weights with that pre-trained vector. 00:52:52.680 |
The random weights have a standard deviation of 1, our pre-trained vectors it turned out 00:52:58.920 |
had a standard deviation of about 0.3, so again this is the kind of hacky thing I do when 00:53:03.240 |
I'm prototyping stuff, I just multiply it by 3. 00:53:07.480 |
Obviously by the time you see the video of this and we're able to put all this 00:53:11.320 |
sequence to sequence stuff into the fastai library, you won't find horrible hacks like 00:53:15.360 |
that in there, I sure hope, but hack away when you're prototyping. 00:53:23.080 |
Some things won't be in fast text, in which case we'll just keep track of them, and I've 00:53:27.400 |
just added this print statement here just so that I can kind of see what I'm missing; 00:53:32.860 |
basically I'll probably comment it out when I actually commit this to GitHub. 00:53:43.180 |
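The embedding-initialization step just described might be sketched like this (a rough reconstruction from the description, not necessarily the notebook's exact helper):

```python
import torch
import torch.nn as nn

def create_emb(vecs, itos, em_sz=300):
    """Initialize an embedding layer from the pre-trained fastText vectors where possible."""
    emb = nn.Embedding(len(itos), em_sz, padding_idx=1)
    wgts = emb.weight.data                 # module -> weight (a parameter) -> data (a tensor)
    miss = []
    for i, w in enumerate(itos):
        try:
            # scale by 3 so the pre-trained std (~0.3) roughly matches the random init's std of 1
            wgts[i] = torch.from_numpy(vecs[w] * 3)
        except KeyError:
            miss.append(w)                 # word not in fastText: keep the random initialization
    print(len(miss), miss[:5])             # how many we missed, plus a few examples
    return emb
```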
So we create those embeddings, and so when we actually create the sequence to sequence 00:53:47.640 |
RNN, it will print out how many were missed, and so remember we had about 30,000 words, 00:53:57.040 |
And interestingly, the things that are missing, well there's our special token for uppercase, 00:54:03.260 |
not surprising that's missing, but also remember fast text doesn't work on our tokens, 00:54:10.000 |
it does words, so l' and d' and 's, they're not appearing either. 00:54:16.480 |
That does suggest that maybe we could have slightly better embeddings if we tried to 00:54:21.520 |
find some which would be tokenized the same way we tokenize, but that's okay. 00:54:27.440 |
Do we just keep the embedding vectors for the words that appear in training? 00:54:31.400 |
Why don't we keep all the word embeddings, in case you have new words in the test set? 00:54:42.800 |
We're going to be fine-tuning them, and so I don't know, it's an interesting idea. 00:54:50.440 |
Maybe that would work, I haven't tried it, obviously you can also add random embedding 00:55:05.760 |
to those, and at the beginning just keep them random, but it's going to make an effect in 00:55:13.000 |
the sense that you're going to be using those words. 00:55:18.000 |
I think it's an interesting line of inquiry, but I will say this. 00:55:22.000 |
The vast majority of the time when you're doing this in the real world, your vocabulary 00:55:27.640 |
will be bigger than 40,000, and once your vocabulary is bigger than 40,000, using the 00:55:33.160 |
standard techniques, the embedding layers get so big that it takes up all your memory, 00:55:41.920 |
There are tricks to dealing with very large vocabularies, I don't think we'll have time to 00:55:46.400 |
handle them in this session, but you definitely would not want to have all 3.5 million fast text vectors in your embedding layer. 00:55:59.360 |
I wonder, if you're not touching a word, it's not going to change, given you're fine-tuning. 00:56:09.360 |
It's in GPU RAM, and you've got to remember, 3.5 million times 300 times the size of a 00:56:16.360 |
single-precision floating-point number, plus all of the gradients for them, even if it's 00:56:24.400 |
Without being very careful and adding a lot more code and stuff, it is slow and hard and painful. 00:56:35.480 |
I think it's an interesting path of inquiry, but it's the kind of path of inquiry that 00:56:39.680 |
leads to multiple academic papers, not something that you do on a weekend. 00:56:45.440 |
I think it would be very interesting, maybe we can look at it sometime. 00:56:52.000 |
As I say, I have actually started doing some stuff around incorporating large vocabulary handling into fast.ai. 00:56:58.600 |
It's not finished, but hopefully by the time we get here, this kind of stuff will be possible. 00:57:08.160 |
We create our encoder embedding, add a bit of dropout, and then we create our RNN. 00:57:15.720 |
This input to the RNN obviously is the size of the embedding by definition. 00:57:21.240 |
Number of hidden is whatever we want, so we set it to 256 for now, however many layers 00:57:26.080 |
we want, and some dropout inside the RNN as well. 00:57:31.080 |
This is all standard PyTorch stuff, you could use an LSTM here as well, and then finally 00:57:35.880 |
we need to turn that into some output that we're going to feed to the decoder, so let's 00:57:40.960 |
use a linear layer to convert the number of hidden into the decoder embedding size. 00:57:52.620 |
We first of all initialize our hidden state to a bunch of zeros. 00:57:58.760 |
So we've now got a vector of zeros, and then we're going to take our input and put it through 00:58:05.920 |
our embedding, we're going to put that through dropout, we then pass our currently zeros 00:58:12.440 |
hidden state and our embeddings into our RNN, and it's going to spit out the usual stuff 00:58:18.560 |
that RNN spit out, which includes the final hidden state. 00:58:25.320 |
We're then going to take that final hidden state and stick it through that linear layer, 00:58:29.880 |
so we now have something of the right size to feed to our decoder. 00:58:33.040 |
So that's it, and again this ought to be very familiar and very comfortable, it's like the 00:58:40.800 |
most simple possible RNN, so if it's not, go back, check out lesson 6, make sure you 00:58:46.120 |
can write it from scratch and you understand what it does. 00:58:49.680 |
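Here is a minimal sketch of an encoder along the lines just described (hyperparameters and the decision to return only the top layer's final hidden state through the linear layer are simplifications, not the notebook's exact module):

```python
import torch
import torch.nn as nn

class Seq2SeqEncoder(nn.Module):
    def __init__(self, emb_enc, nh, out_sz, nl=2):
        super().__init__()
        self.nl, self.nh = nl, nh
        self.emb_enc = emb_enc                        # the pre-initialized 300-dim nn.Embedding
        self.emb_drop = nn.Dropout(0.15)
        self.gru = nn.GRU(emb_enc.embedding_dim, nh, num_layers=nl, dropout=0.25)
        self.out = nn.Linear(nh, out_sz)              # hidden size -> decoder embedding size

    def forward(self, inp):                           # inp: sequence length x batch size of ids
        bs = inp.size(1)
        h = torch.zeros(self.nl, bs, self.nh, device=inp.device)   # hidden state starts as zeros
        emb = self.emb_drop(self.emb_enc(inp))
        enc_out, h = self.gru(emb, h)                 # run the whole sequence through the GRU
        return self.out(h[-1])                        # final hidden state of the top layer, for the decoder
```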
But the key thing to know is that it takes our inputs and spits out a hidden vector that 00:59:01.000 |
hopefully will learn to contain all of the information about what that sentence says 00:59:11.340 |
Because if it can't do that, then we can't feed it into a decoder and hope it to spit 00:59:21.080 |
out our sentence in a different language, so that's what we want it to learn to do. 00:59:28.520 |
And we're not going to do anything special to make it learn to do that, we're just going 00:59:31.560 |
to do the three things and cross our fingers because that's what we do. 00:59:48.720 |
I guess Stephen used s for state, I used h for hidden, but there you go, you would think 00:59:53.720 |
that two Australians could agree on something like that, but apparently not. 01:00:03.780 |
And so the basic idea of the new bit is the same, we're going to do exactly the same thing, 01:00:13.560 |
And so the for loop is going to do exactly what the for loop inside pytorch does here, 01:00:21.080 |
So we're going to go through the for loop, and how big is the for loop? 01:00:30.400 |
It's something that got passed to the constructor, and it is equal to the length of the largest English sentence. 01:00:40.080 |
So we're going to do this for loop as long as the largest English sentence, because we're 01:00:45.760 |
translating it into English, so we can't possibly be longer than that. 01:00:55.100 |
If we then used it on some different corpus that was longer, this is going to fail. 01:01:02.720 |
You could always pass in a different parameter, of course. 01:01:07.400 |
So the basic idea is the same, we're going to go through and put it through the embedding, 01:01:12.560 |
we're going to stick it through the RNN, we're going to stick it through dropout, and we're going to stick it through a linear layer. 01:01:22.840 |
And once we've done that, we're then going to append that output to a list, and then 01:01:29.840 |
when we finish, we're going to stack that list up into a single tensor and return it. 01:01:38.840 |
Normally a recurrent neural network works in a whole sequence at a time, but we've got 01:01:47.720 |
a for loop to go through each part of the sequence separately. 01:01:51.520 |
So we have to add a leading unit axis to the start to basically say this is a sequence of length 1. 01:02:00.240 |
So we're not really taking advantage of the recurrent net much at all, we could easily 01:02:06.640 |
That would be an interesting experiment if you wanted to try it. 01:02:11.260 |
So we basically take our input and we feed it into our embedding, and we add something 01:02:19.300 |
to the front saying treat this as a sequence of length 1, and then we pass that to our 01:02:26.800 |
We then get the output of that RNN, feed it into our dropout, and feed it into our linear layer. 01:02:34.960 |
So there's two extra things now to be aware of. The first is: what goes in as the input at each step? 01:02:46.660 |
And the answer is, it's the previous word that we translated. 01:02:52.380 |
See how the input here is the previous word here? 01:02:58.160 |
So the basic idea is, if you're trying to translate, if you're about to translate, tell 01:03:04.960 |
me the fourth word of the new sentence, but you don't know what the third word you just said was, that's going to be really hard. 01:03:13.060 |
So we're going to feed that in at each time step, let's make it as easy as possible. 01:03:19.040 |
And so what was the previous word at the start? 01:03:23.340 |
So specifically we're going to start out with a beginning of stream token. 01:03:42.080 |
So let's start out our decoder with a beginning of stream token, which is zero. 01:03:49.880 |
And of course we're doing a mini-batch, so we need batch size number of them. 01:03:53.400 |
But let's just think about one part of that batch. 01:03:58.480 |
We look up that zero in our embedding matrix to find out what the vector for the beginning 01:04:06.160 |
We stick a unit axis on the front to say we have a single sequence length of beginning 01:04:12.480 |
We stick that through our RNN, which gets not only the fact that there's a zero at the 01:04:18.720 |
beginning of stream, but also the hidden state which at this point is whatever came out of 01:04:26.880 |
So now its job is to try and figure out what is the first word to translate this sentence. 01:04:39.760 |
Pop that through dropout, go through one linear layer in order to convert that into the correct 01:04:45.160 |
size for our decoder embedding matrix, append that to our list of translated words, and 01:04:54.720 |
now we need to figure out what word that was because we need to feed it to the next time 01:05:04.680 |
So remember what we actually output here, and don't forget, use a debugger, pdb.set_trace(), 01:05:19.100 |
So before you look it up in the debugger, try and figure it out from first principles 01:05:24.400 |
So outp is a tensor whose length is equal to the number of words in our English vocabulary, 01:05:32.040 |
and it contains the probability for every one of those words that it is that word. 01:05:38.760 |
So then if we now say outp.data.max, that looks in that tensor to find out which word has the 01:05:48.600 |
highest probability, and max in PyTorch returns two things. 01:05:53.960 |
The first thing is what is that max probability, and the second is what is the index into the 01:06:01.160 |
And so we want that second item, index number 1, which is the word index with the largest 01:06:08.180 |
So now that contains the word, or the word index into our vocabulary of the word. 01:06:17.120 |
If it's a 1, you might remember 1 was padding, then that means we're done. 01:06:22.720 |
That means we've reached the end because we've finished with a bunch of padding. 01:06:30.440 |
Now dec_inp is whatever the highest probability word was. 01:06:37.880 |
So we keep looping through, either until we get to the largest length of a sentence, or 01:06:44.840 |
until everything in our mini-batch is padding. 01:06:49.320 |
And each time we've appended our outputs, not the word but the probabilities, to this 01:06:57.120 |
list which we stack up into a tensor and we can now go ahead and feed that to a loss function. 01:07:04.880 |
So before we go to a break, since we've done 1 and 2, let's do 3, which is a loss function. 01:07:13.600 |
The loss function is categorical cross-entropy loss. 01:07:18.320 |
We've got a list of probabilities for each of our classes, the classes are all the words 01:07:24.640 |
in our English vocab, and we have a target which is the correct class, i.e. the correct 01:07:34.180 |
There's two tweaks, which is why we need to write our own little loss function, but you 01:07:37.360 |
can see basically it's going to be cross-entropy loss. 01:07:42.280 |
Tweak number 1 is we might have stopped a little bit early, and so the sequence length 01:07:49.400 |
that we generated may be different to the sequence length of the target, in which case we need to add some padding. 01:07:59.740 |
If you have a rank 3 tensor, which we do, we have sequence length by batch size by number 01:08:12.360 |
of words in the vocab, a rank 3 tensor requires a 6 tuple. 01:08:19.280 |
Each pair of things in that tuple is the padding before and then the padding after that dimension. 01:08:26.760 |
So in this case, the first dimension has no padding, the second dimension has no padding, 01:08:31.280 |
the third dimension has no padding on the left, and as much padding as is required on the right. 01:08:40.720 |
So that adds any padding that's necessary. 01:08:43.960 |
The only other thing we need to do is cross-entropy loss expects a rank 2 tensor, but we've got a rank 3 tensor. 01:08:53.500 |
So let's just flatten out the sequence length and batch size into a -1 in View. 01:09:01.120 |
So flatten out that for both of them, and now we can go ahead and call cross-entropy. 01:09:08.160 |
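A hedged sketch of that loss function, assuming the prediction is a rank 3 tensor of sequence length x batch size x vocab size and the target is sequence length x batch size (the notebook's version may differ in detail):

```python
import torch.nn.functional as F

def seq2seq_loss(inp, targ):
    sl_t, bs = targ.size()            # target: sequence length x batch size
    sl_i, bs_i, nc = inp.size()       # prediction: sequence length x batch size x vocab size
    if sl_t > sl_i:
        # F.pad on a rank 3 tensor takes a 6-tuple of (before, after) pairs, starting from
        # the last dimension; here only the sequence dimension gets padding, on the right.
        inp = F.pad(inp, (0, 0, 0, 0, 0, sl_t - sl_i))
    inp = inp[:sl_t]                  # and if we generated too much, just line the lengths up
    # cross-entropy wants a rank 2 input, so flatten sequence and batch together with view
    return F.cross_entropy(inp.view(-1, nc), targ.contiguous().view(-1))
```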
So now we can just use standard approach, here's our sequence-to-sequence RNN, that's 01:09:23.240 |
Hopefully by now you've noticed you can call .cuda(), but if you call to_gpu() it doesn't put it on the GPU if you don't have one. 01:09:32.060 |
You can also set fastai.core.USE_GPU to false to force it to not use the GPU, and that can be handy for debugging. 01:09:41.320 |
We then need something that tells it how to handle learning rate groups. 01:09:47.760 |
So there's a thing called SingleModel that you can pass it to which treats the whole model as a single layer group. 01:09:53.920 |
So this is like the easiest way to turn a PyTorch module into a fastai model. 01:10:01.280 |
Here's the model data object we created before. 01:10:05.160 |
We could then just call learner to turn that into a learner, but if we call RNN_learner, 01:10:16.480 |
It defines cross-entropy as the default criteria, in this case we're overriding that anyway, 01:10:22.920 |
But it does add in these save encoder and load encoder things that can be handy sometimes. 01:10:31.400 |
So in this case we really could have just used Learner, but RNN_Learner also works. 01:10:38.400 |
So here's how we turn our PyTorch module into a fastai model into a learner. 01:10:46.120 |
And once we have a learner, give it our new loss function, and then we can call lr_find, 01:10:53.600 |
and we can call fit, and it runs through a while, and we can save it. 01:11:02.760 |
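The wiring described here looks roughly like the following; it uses the 0.7-era fastai names (to_gpu, SingleModel, RNN_Learner, lr_find) as I remember them, so treat the exact calls as assumptions rather than gospel.

```python
# Turn the PyTorch module into a fastai learner and train it.
rnn = Seq2SeqRNN(...)                                # the module sketched above
learn = RNN_Learner(md, SingleModel(to_gpu(rnn)), opt_fn=opt_fn)
learn.crit = seq2seq_loss                            # swap in our custom loss function

learn.lr_find()                                      # pick a learning rate as usual
learn.fit(lr, 1, cycle_len=12, use_clr=(20, 10))
learn.save('seq2seq')
```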
Remember the model attribute of a learner is a standard PyTorch model. 01:11:06.480 |
So we can pass that some x which we can grab out of our validation set, or you could use 01:11:13.440 |
learn.predict_array or whatever you like to get some predictions. 01:11:19.820 |
And then we can convert those predictions into words by going .max(1) to grab the index 01:11:26.000 |
of the highest probability words to get some predictions. 01:11:30.440 |
And then we can go through a few examples and print out the French, the correct English, 01:11:37.680 |
and the predicted English for things that are not padding. 01:11:45.320 |
So amazingly enough, this kind of simplest possible written largely from scratch PyTorch 01:11:54.520 |
module on only 50,000 sentences is sometimes capable on a validation set of giving you 01:12:00.760 |
exactly the right answer, sometimes the right answer in slightly different wording, and sometimes 01:12:09.400 |
sentences that aren't grammatically sensible or even have too many question marks. 01:12:14.120 |
So we're well on the right track, I think you would agree. 01:12:18.400 |
So even the simplest possible seq-to-seq, trained for a very small number of epochs without 01:12:25.780 |
any pre-training other than the use of word embeddings is surprisingly good. 01:12:32.600 |
So I think the message here -- and we're going to improve this in a moment after the break 01:12:36.340 |
-- but I think the message here is even sequence-to-sequence models that you think are simpler than could 01:12:42.560 |
possibly work, even with less data than you think you could learn from can be surprisingly 01:12:48.480 |
effective and in certain situations this may even be enough for your needs. 01:12:54.800 |
So we're going to learn a few tricks after the break which will make this much better. 01:13:10.740 |
So one question that came up during the break is that some of the tokens that are missing 01:13:20.800 |
in fast text had a curly quote rather than a straight quote, for example, and the question was whether we should normalize that punctuation. 01:13:35.700 |
And the answer for this particular case is probably yes. 01:13:44.320 |
You do have to be very careful though because it may turn out that people using beautiful 01:13:51.140 |
curly quotes like using more formal language and actually writing in a different way. 01:13:57.340 |
So I generally -- if you're going to do some kind of pre-processing like punctuation normalization, 01:14:04.360 |
you should definitely check your results with and without, because nearly always that kind 01:14:09.680 |
of pre-processing makes things worse even when I'm sure it won't. 01:14:14.400 |
What might be some ways of regularizing these sequence-to-sequence models besides dropout 01:14:32.240 |
It's like, you know, AWDLSTM, which we've been relying on a lot, has so many great -- I mean, 01:14:46.880 |
And then there's the -- we haven't talked about it much, but there's also a kind of 01:14:50.760 |
regularization based on activations and stuff like that as well and on changes and whatever. 01:15:00.000 |
I just haven't seen anybody put anything like that amount of work into regularization of 01:15:05.280 |
sequence-to-sequence models, and I think there's a huge opportunity for somebody to do like 01:15:10.560 |
the AWD-LSTM of seq-to-seq, which might be as simple as stealing all the ideas from AWD-LSTM and using them here. 01:15:21.160 |
That would be pretty easy to try, I think, and there's been an interesting paper that 01:15:27.440 |
actually Stephen Merity has added in the last couple of weeks, where he used an idea 01:15:32.240 |
which I don't know if he stole it from me, but it was certainly something I had also 01:15:38.560 |
Either way, I'm thrilled that he's done it, which was to take all of those different AWDLSTM 01:15:46.080 |
hyperparameters and train a bunch of different models and then use a random forest to find 01:15:51.720 |
out with feature importance which ones actually matter the most and then figure out how to set them. 01:15:58.640 |
I think you could totally use this approach to figure out the sequence-to-sequence regularization 01:16:06.520 |
approaches which ones are best and optimize them, and that would be amazing. 01:16:14.560 |
But at the moment, I don't know that there are additional ideas to sequence-to-sequence 01:16:19.280 |
regularization that I can think of beyond what's in that paper for regular language 01:16:24.160 |
model stuff, and probably all those same approaches would work. 01:16:37.960 |
For classification, my approach to bidirectional that I've suggested you use is take all of 01:16:47.160 |
your token sequences, spin them around, train a new language model, and train a new classifier, 01:16:54.080 |
and I also mentioned the wiki text pre-trained model. 01:16:58.080 |
If you replace fwd with bwd in the name, you'll get the pre-trained backward model I created 01:17:06.240 |
Get a set of predictions and then average the predictions just like a normal ensemble, 01:17:12.040 |
and that's kind of how we do bidir for that kind of classification. 01:17:16.240 |
There may be ways to do it end-to-end, but I haven't quite figured them out yet, they're 01:17:21.760 |
not in fastAI yet and I don't think anybody has written a paper about them yet, so if 01:17:25.840 |
you figure it out, that's an interesting line of research. 01:17:31.680 |
But because we're not doing massive documents where we have to chunk it into separate bits 01:17:38.600 |
and then pool over them and whatever, we can do bidir very easily in this case, which is 01:17:45.480 |
literally as simple as adding bidirectional equals true to our encoder. 01:17:53.960 |
People tend not to do bidirectional for the decoder, I think partly because it's considered 01:18:00.640 |
cheating, but I don't know, I was just talking to somebody at the break about it, maybe it 01:18:10.120 |
can work in some situations, although it might need to be more of an ensembling approach in 01:18:20.120 |
The encoder is very simple, bidirectional equals true, and with bidirectional equals 01:18:29.600 |
true, rather than just having an RNN which is going in this direction, we have a second RNN going in the opposite direction. 01:18:41.200 |
And so that second RNN literally is visiting each token in the opposing order, so when 01:18:50.240 |
we get the final hidden state, it's here rather than here. 01:18:56.740 |
But the hidden state is of the same size, so the final result is that we end up with 01:19:07.000 |
And depending on what library you use, often that will be then combined with the number 01:19:11.360 |
of layers thing, so if you've got two layers and bidirectional, that tensor dimension is 01:19:19.960 |
With PyTorch, it kind of depends which bit of the process you're looking at as to whether 01:19:25.140 |
you get a separate result for each layer and each bidirectional bit and so forth. 01:19:29.320 |
You have to look up the docs and it will tell you the inputs, outputs, tensor sizes, appropriate 01:19:35.040 |
for the number of layers and whether you have bidirectional equals true. 01:19:38.940 |
In this particular case, you'll basically see all the changes I've had to make. 01:19:45.040 |
So for example, you'll see when I added bidirectional equals true, my linear layer now needs number 01:19:50.640 |
of hidden times 2 to reflect the fact that we have that second direction in our hidden 01:19:56.480 |
state now, and you'll see in the hidden state initialization, it's now the number of layers times 2 here. 01:20:05.000 |
So you'll just see there's a few places where there's been an extra 2 that has to be thrown in. 01:20:13.100 |
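Putting those extra 2s in one place, a minimal sketch of a bidirectional encoder might look like this (names and sizes illustrative, not the notebook's exact code):

```python
import torch
import torch.nn as nn

class BidirEncoder(nn.Module):
    def __init__(self, vocab_size, em_sz, nh, nl=2):
        super().__init__()
        self.nl, self.nh = nl, nh
        self.emb = nn.Embedding(vocab_size, em_sz)
        self.gru = nn.GRU(em_sz, nh, num_layers=nl, bidirectional=True)
        self.out = nn.Linear(nh * 2, nh)   # outputs carry both directions, hence nh * 2

    def init_hidden(self, bs):
        # one hidden state per layer *per direction*, hence nl * 2
        return torch.zeros(self.nl * 2, bs, self.nh)

    def forward(self, inp):
        bs = inp.size(1)                       # inp: sequence length x batch size
        h = self.init_hidden(bs)
        outp, h = self.gru(self.emb(inp), h)   # outp: sequence length x batch size x nh*2
        return self.out(outp), h
```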
Why is making a decoder bidirectional considered cheating? 01:20:19.440 |
Well, it's not just that it's cheating, it's like we have this loop going on, you know? 01:20:28.760 |
It's not as simple as just kind of having two tensors, and then how do you turn those two into one final prediction? 01:20:41.920 |
After talking about it during the break, I've kind of gone from "Hey, everybody knows it 01:20:47.480 |
doesn't work" to "Oh, maybe it kind of could work, but it requires more thought." 01:20:53.080 |
It's quite possible during the week I realized it's a dumb idea and I was being stupid, but 01:20:58.920 |
Another question people have, why do you need to have an end to that loop? 01:21:12.640 |
I mean, because when I start training everything's random, so this will probably never be true. 01:21:23.820 |
Later on it will pretty much always break out eventually. 01:21:28.360 |
It's basically like we're going to go forever. 01:21:33.640 |
It's really important to remember when you're designing an architecture that when you start, 01:21:41.040 |
So you kind of want to make sure it's going to do something that's vaguely sensible. 01:21:46.400 |
So with bidirectional, we improved on the 3.58 cross-entropy loss we got with a single direction. 01:22:02.920 |
So that improved it a bit, that's good, and as I say, it shouldn't really slow things 01:22:09.260 |
down too much; bidirectional does mean there's a little bit more sequential processing that has to happen. 01:22:20.520 |
In the Google translation model, of the eight layers, only the first layer is bidirectional. 01:22:29.880 |
So if you create really deep models, you may need to think about which one's bidirectional, 01:22:42.760 |
So teacher forcing is going to come back to this idea that when the model starts learning, 01:22:54.460 |
So when the model starts learning, it is not going to spit out the right word at this point, it's 01:23:00.160 |
going to spit out some random, meaningless word because it doesn't know anything about 01:23:05.080 |
German or about English or about the idea of language or anything. 01:23:08.140 |
And then it's going to feed it down here as an input and be totally unhelpful. 01:23:12.320 |
And so that means that early learning is going to be very, very difficult because it's feeding 01:23:16.960 |
in an input that's stupid into a model that knows nothing, and somehow it's going to get better. 01:23:23.440 |
So it's not asking too much, eventually it gets there, but it's definitely not as helpful as it could be. 01:23:30.600 |
So what if instead of feeding in the thing I predicted just now, what if instead we feed 01:23:44.840 |
in the actual correct word it was meant to be? 01:23:49.880 |
Now we can't do that at inference time because by definition we don't know the correct word 01:23:57.200 |
And we can't require a correct translation in order to do translation. 01:24:01.760 |
So the way I've set this up is I've got this thing called pr_force, which is the probability of forcing. 01:24:09.520 |
And if some random number is less than that probability, then I'm going to replace my decoder input with the actual correct word. 01:24:20.080 |
And if we've already gone too far, if it's already longer than the target sentence, I'm 01:24:24.480 |
just going to stop because I can't give it the correct thing. 01:24:28.480 |
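Inside the decoder loop sketched earlier, that logic is only a couple of lines; something along these lines (pr_force and y, the target batch, are the assumed names):

```python
import random

# ...inside the decoder for loop, right after picking dec_inp from our own prediction:
if (y is not None) and (random.random() < self.pr_force):
    if i >= len(y):
        break          # already past the end of the target: nothing correct left to feed in
    dec_inp = y[i]     # teacher forcing: feed in the actual correct word instead
```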
So you can see how beautiful PyTorch is for this, because if you tried to do this with 01:24:33.760 |
some static graph thing like classic TensorFlow, I tried. 01:24:40.600 |
One of the key reasons we switched to PyTorch at this exact point in last year's class was 01:24:46.120 |
because Jeremy tried to implement teacher forcing in Keras and TensorFlow and went even 01:24:57.300 |
And then literally on Twitter, I think it was Andrej Karpathy who said something about PyTorch 01:25:17.480 |
And all the stuff of trying to debug things, it was suddenly so much easier, and this kind 01:25:24.520 |
So this is a great example of like, hey, I get to use random numbers and if statements 01:25:30.120 |
So here's the basic idea, at the start of training, let's set PR force really high so 01:25:39.600 |
that nearly always it gets the actual correct previous word, and so it has a useful input. 01:25:47.800 |
And then as I train a bit more, let's decrease PR force so that by the end, PR force is 0, 01:25:55.480 |
and it has to learn properly, which is fine because by then it's actually feeding in sensible inputs. 01:26:03.720 |
So let's now write something such that in the training loop, it gradually decreases 01:26:15.000 |
Well one approach would be to write our own training loop, but let's not do that because 01:26:20.240 |
we already have a training loop that has progress bars and uses exponential weighted averages 01:26:25.420 |
to smooth out the losses and keeps track of metrics. 01:26:28.080 |
And it does a bunch of things which are not rocket science, but they're kind of convenient. 01:26:33.520 |
And they also keep track of calling the reset for RNNs at the start of an epoch to make 01:26:38.720 |
sure that the hidden states set to zeros and little things like that. 01:26:43.000 |
You'd rather not have to write that from scratch. 01:26:45.600 |
So what we've tended to find is that as I start to write some new thing, and I'm like 01:26:53.480 |
I need to replace some part of the code, I'll then add some little hook so that we can all 01:27:03.960 |
In this particular case, there's a hook that I've ended up using all the damn time now, 01:27:12.040 |
And so if you look at our code, model.py is where our fit function lives. 01:27:21.320 |
And so the fit function in model.py, we've seen it before, I think it's like the lowest 01:27:26.840 |
level thing that doesn't require a learner, it doesn't really require anything much at 01:27:32.320 |
It just requires a standard PyTorch model and a model data object. 01:27:35.360 |
You just need to know how many epochs, a standard PyTorch optimizer, and a standard PyTorch loss function. 01:27:42.120 |
So we've hardly ever used it in the class, we normally call learn.fit, but learn.fit calls 01:27:50.000 |
But we've looked at the source code here sometimes, we've seen how it loops through each epoch 01:27:54.800 |
and it loops through each thing in our batch and calls stepper.step. 01:28:01.560 |
And so stepper.step is the thing that's responsible for calling the model, finding the loss function 01:28:10.080 |
And so by default stepper.step uses a particular class called Stepper, which has a few things in it 01:28:18.520 |
you don't need to worry about too much, but basically it calls the model. 01:28:22.040 |
So the model ends up inside m, zeros the gradients, calls the loss function, calls backwards, 01:28:32.640 |
does gradient clipping if necessary, and then calls the optimizer. 01:28:37.080 |
So they're the basic steps that, back when we looked at PyTorch from scratch, we had to do by hand. 01:28:44.920 |
So the nice thing is we can replace that with something else rather than replacing the training loop. 01:28:54.160 |
So if you inherit from stepper and then write your own version of step, you can just copy 01:29:01.480 |
and paste the contents of step and add whatever you like. 01:29:06.180 |
Or if it's something you're going to do before or afterwards, you could even call super.step. 01:29:12.640 |
In this case, I'd rather suspect I've been unnecessarily complicated here, I probably 01:29:18.900 |
could have commented out all of that and just said super().step(xs, y). 01:29:28.640 |
Because I think this is an exact copy of everything, but as I say, when I'm prototyping I don't 01:29:34.920 |
think carefully about how to minimize my code. 01:29:37.560 |
I copied and pasted the contents of the code from step, and I added a single line to the 01:29:42.280 |
top which was to replace pr_force in my module with something that gradually decreased linearly 01:29:54.600 |
for the first 10 epochs, and after 10 epochs it was zero. 01:29:59.600 |
So total hack, but good enough to try it out. 01:30:04.960 |
So the nice thing is that everything else is the same, I've added these three lines 01:30:12.920 |
of code to my module, and the only thing I need to do other than differently is when 01:30:20.040 |
I call fit is I pass in my customized stepper class. 01:30:26.880 |
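As a sketch, the customized stepper can be as small as this (fastai 0.7's Stepper lives in fastai.model; the exact signature is from memory, and, as described above, simply calling the parent step would be enough):

```python
from fastai.model import Stepper

class Seq2SeqStepper(Stepper):
    def step(self, xs, y, epoch):
        # linearly decay teacher forcing over the first 10 epochs, then switch it off
        self.m.pr_force = (10 - epoch) * 0.1 if epoch < 10 else 0
        return super().step(xs, y, epoch)

# and then: learn.fit(lr, 1, cycle_len=12, use_clr=(20, 10), stepper=Seq2SeqStepper)
```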
And so that's going to do teacher forcing; we don't have bidirectional here, so we're comparing like with like. 01:30:34.880 |
So we should compare this to our unidirectional results, which was 3.58, and this is 3.49. 01:30:48.800 |
I needed to make sure I at least did 10 epochs because before that it was cheating by using the teacher forcing. 01:31:01.400 |
So we've got another trick, and this next trick is a bigger trick. 01:31:07.840 |
It's a pretty cool trick, and it's called attention. 01:31:12.760 |
And the basic idea of attention is this, which is, expecting the entirety of the sentence 01:31:25.920 |
to be summarized into this single hidden vector is asking a lot. 01:31:32.340 |
It has to know what was said and how it was said and everything necessary to create the 01:31:41.240 |
And so the idea of attention is basically like maybe we're asking too much, particularly 01:31:46.800 |
because we could use this form of model where we output every step of the loop to not just 01:31:57.280 |
have a hidden state at the end, but to hit a hidden state after every single word. 01:32:07.040 |
It's already there, and so far we've just been throwing it away. 01:32:13.120 |
And not only that, but bidirectional, we've got every step, we've got two vectors of state 01:32:24.200 |
So how could we use this piece of state, this piece of state, this piece of state, this piece 01:32:29.000 |
of state and this piece of state rather than just the final state? 01:32:34.000 |
And so the basic idea is, well, let's say I'm translating this word right now. 01:32:42.120 |
Which of these five pieces of state do I want? 01:32:46.040 |
And of course the answer is if I'm doing -- well, actually let's pick a more interesting word. 01:32:53.080 |
So if I'm trying to do loved, then clearly the hidden state I want is this one, because 01:33:03.360 |
And then for this preposition, whatever, this little word here, no it's not a preposition, 01:33:11.380 |
So for this part of the verb, I probably would need this and this and this to make sure that 01:33:18.000 |
I've got the tense right and know that I actually need this part of the verb and so forth. 01:33:23.760 |
So depending on which bit I'm translating, I'm going to need one or more bits of these 01:33:33.960 |
And in fact, I probably want some weighting of them. 01:33:38.580 |
So like what I'm doing here, I probably mainly want this state, but I maybe want a little 01:33:44.440 |
bit of that one and a little bit of that one. 01:33:47.780 |
So in other words, for these five pieces of hidden state, we want a weighted average, and 01:33:54.440 |
we want it weighted by something that can figure out which bits of the sentence are 01:34:03.080 |
So how do we figure out something like which bits of the sentence are important right now? 01:34:09.680 |
We create a neural net, and we train the neural net to figure it out. 01:34:21.920 |
We've actually already got a bunch, we've got an RNN encoder, an RNN decoder, a couple 01:34:32.760 |
And this neural net is going to spit out a weight for every one of these things, and we're 01:34:38.760 |
going to take the weighted average at every step. 01:34:41.120 |
And it's just another set of parameters that we learn all at the same time. 01:34:50.140 |
So the idea is that once that attention has been learned, we can see this terrific demo 01:34:56.040 |
from Chris Olah and Shan Carter, where each different word is going to take a weighted average. 01:35:01.520 |
See how the weights are different depending on which word is being translated. 01:35:06.880 |
And you can see how it's kind of figuring out the color, the deepness of the blue is 01:35:12.080 |
You can see that each word is basically which word are we translating from. 01:35:16.480 |
So when we say European, we need to know that both of these two parts are going to be influenced 01:35:20.360 |
or if we're doing economic, both of these three parts are going to be influenced, including 01:35:23.880 |
the gender of the definite article and so forth. 01:35:33.360 |
These things are all nice little interactive diagrams. 01:35:38.040 |
It basically shows you how attention works and what the actual attention looks like in 01:35:53.160 |
So with attention, it's basically this is all identical, and the encoder is identical, 01:36:08.360 |
and all of this bit of the decoder is identical. 01:36:12.280 |
There's one difference, which is that we basically are going to take a weighted average. 01:36:30.880 |
And the way that we're going to do the weighted average is we create a little neural net, 01:36:35.320 |
which we're going to see here and here, and then we use softmax, because of course the 01:36:41.600 |
nice thing about softmax is that we want to ensure that all of the weights that we're 01:36:47.400 |
using add up to 1, and we also kind of expect that one of those weights should probably 01:36:56.200 |
So softmax gives us the guarantee that they add up to 1, and because of the e^ in it, 01:37:04.840 |
it tends to encourage one of the weights to be higher than the other ones. 01:37:11.720 |
So what's going to happen is we're going to take the last layer's hidden state, and we're going to put it into a linear layer. 01:37:24.000 |
And then we're going to stick it into a nonlinear activation, and then we're going to do matrix 01:37:32.560 |
multiply, and so if you think about it, linear layer, nonlinear activation, matrix multiply, that's a little neural network. 01:37:45.640 |
Stick it into a softmax, and then we can use that to weight our encoder outputs. 01:37:56.600 |
So now, rather than just taking the last encoder output, we've got this is going to be the 01:38:02.240 |
whole tensor of all of the encoder outputs, which I just weight by this little neural net. 01:38:18.160 |
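Written out symbolically, with d_t the decoder's current hidden state, h_s the encoder output at source position s, and W, V the two learned layers just described, the weighted average is roughly:

```latex
u_t = \tanh(W d_t + b)                                        % linear layer + nonlinearity
\alpha_{t,s} = \frac{\exp\!\big((u_t V)_s\big)}{\sum_{s'} \exp\!\big((u_t V)_{s'}\big)}  % softmax over source positions
c_t = \sum_s \alpha_{t,s}\, h_s                               % weighted average of encoder outputs
```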
So what I'll do is I'll put on the wiki thread a couple of papers to check out. 01:38:27.680 |
There was basically one amazing paper that really originally introduced this idea of attention. 01:38:35.120 |
And I say amazing because it actually introduced a couple of key things which have really changed 01:38:43.640 |
This area of attention has been used not just for text, but for things like reading text 01:38:53.400 |
out of pictures, or doing various stuff with computer vision, and stuff like that. 01:38:59.280 |
And then there's a second paper which Geoffrey Hinton was involved in called Grammar as a 01:39:05.440 |
Foreign Language, which used this idea of RNNs with attention to basically try to replace 01:39:11.880 |
rules-based grammar with an RNN which automatically tagged the grammatical role of each word based on 01:39:11.880 |
this grammar, and turned out to do it better than any rules-based system. 01:39:29.480 |
I think we're now used to the idea that neural nets do lots of this stuff better than rules-based 01:39:35.880 |
systems, but at the time it was considered really surprising. 01:39:39.800 |
One nice thing is that their summary of how attention works is really nice and concise. 01:39:47.000 |
Let's go back and look at our original encoder. 01:40:14.960 |
It spits out a list of the state after every time step, and it also tells you the state 01:40:25.720 |
And we used the state at the last time step to create the input state for our decoder, 01:40:45.920 |
But we know that it's actually creating a vector at every time step, so wouldn't it be nice to use all of them? 01:40:53.440 |
And wouldn't it be nice to use the ones that are most relevant to translating the word I'm translating right now? 01:41:02.680 |
So wouldn't it be nice to take a weighted average of the hidden state at each time step, 01:41:08.140 |
weighted by whatever is the appropriate weight right now? 01:41:12.340 |
Which for example in this case, "liebte" would definitely be time step number 2, which is 01:41:18.120 |
what it's all about, because that's the word I'm translating. 01:41:21.720 |
So how do we get a list of weights that is suitable for the word we're translating right 01:41:30.640 |
now? The answer is by training a neural net to figure out the list of weights. 01:41:37.020 |
And so anytime we want to figure out how to train a little neural net that does any task, 01:41:43.320 |
the easiest way normally always to do that is to include it in your module and train 01:41:53.480 |
The minimal possible neural net is something that contains two layers and one nonlinear 01:42:09.640 |
So here is one linear layer, and in fact instead of a second linear layer, we can even just grab a random matrix. 01:42:29.400 |
And so here's a random matrix, it's just a random tensor wrapped up in a parameter. 01:42:36.000 |
A parameter, remember, is just a PyTorch variable, it's like identical to a variable, but it 01:42:42.600 |
just tells PyTorch I want you to learn the weights for this. 01:42:48.440 |
So here we've got a linear layer, here we've got a random matrix, and so here at this point 01:42:56.760 |
where we start out our decoder, let's take the current hidden state of the decoder, put 01:43:10.560 |
that into a linear layer, because what's the information we use to decide what words we 01:43:20.640 |
The only information we have to go on is what the decoder's hidden state is now. 01:43:25.000 |
So let's grab that, put it into the linear layer, put it through a nonlinearity, put 01:43:36.880 |
This one actually doesn't have a bias in it, so it's actually just a matrix multiply. 01:43:41.360 |
Put that into a softmax, and that's it, that's a little neural net. 01:43:48.020 |
It doesn't do anything, it's just a neural net, no neural nets do anything, they're just 01:43:54.520 |
linear layers with nonlinear activations with random weights. 01:43:58.240 |
But it starts to do something if we give it a job to do, and in this case the job we give 01:44:04.520 |
it to do is to say, don't just take the final state, but now let's use all of the encoder 01:44:13.000 |
states and let's take all of them and multiply them by the output of that little neural net. 01:44:22.280 |
And so given that the things in this little neural net are learnable weights, hopefully 01:44:28.640 |
it's going to learn to weight those encoder outputs, those encoder hidden states by something 01:44:35.840 |
That's all a neural net ever does, is we give it some random weights to start with and a 01:44:41.460 |
job to do, and hope that it learns to do the job. 01:44:49.200 |
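Here is a minimal, self-contained version of that mini net, matching the spoken description (a linear layer, a tanh, a learned bias-free random matrix, a softmax, then a weighted sum). It assumes the encoder outputs are padded to a fixed maximum source length; the course notebook's version is a little richer, for example it also feeds the encoder outputs into the scoring, so treat this purely as an illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniAttention(nn.Module):
    """Linear layer -> tanh -> learned random matrix -> softmax -> weighted average."""
    def __init__(self, nh, max_src_len):
        super().__init__()
        self.l1 = nn.Linear(nh, nh)                                  # the linear layer
        self.V = nn.Parameter(torch.randn(nh, max_src_len) * 0.01)   # random matrix, no bias, learned

    def forward(self, dec_h, enc_out):
        # dec_h: (batch, nh) current decoder hidden state; enc_out: (src_len, batch, nh)
        u = torch.tanh(self.l1(dec_h))               # nonlinear activation
        a = F.softmax(u @ self.V, dim=1)             # one weight per source position, summing to 1
        ctx = (a.t().unsqueeze(2) * enc_out).sum(0)  # weighted average of encoder outputs: (batch, nh)
        return ctx, a
```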
So everything else in here is identical to what it was before. 01:44:54.680 |
We've got teacher forcing, it's not bidirectional. 01:45:00.680 |
You can see, here we are using teacher forcing. 01:45:11.340 |
Teacher forcing had 3.49, and so now we've got nearly exactly the same thing, but we've 01:45:18.900 |
got this little minimal neural net figuring out what weightings to give our inputs. 01:45:30.060 |
However these things are logs, so e^ of this is quite a significant change. 01:45:51.340 |
The translations are still not perfect by any means, but quite a few of them are correct. 01:46:02.020 |
And again, considering that we're asking it to learn about the very idea of language for 01:46:06.660 |
two different languages, and how to translate them between the two, and grammar, and vocabulary, 01:46:11.820 |
and we only have 50,000 sentences, and a lot of the words only appear once, I would say 01:46:24.340 |
Why do we use tanh instead of relu for the attention mini-net? 01:46:28.500 |
I don't quite remember, it's been a while since I looked at it. 01:46:37.180 |
You should totally try using relu and see how it goes. 01:46:40.660 |
The key difference is that it can go in each direction, and it's limited both at the top and the bottom. 01:46:51.060 |
I know very often for the gates inside RNNs and LSTMs and GRUs, tanh often works out better. 01:46:59.260 |
But it's been about a year since I actually looked at that specific question, so I'll 01:47:05.100 |
The short answer is you should try a different activation function and see if you can get 01:47:13.640 |
So what we can do also is we can actually grab the attentions out of the model. 01:47:20.420 |
So I actually added this returnAttention = True; see here, in my forward you can put anything you like. 01:47:31.860 |
So I added a returnAttention parameter, false by default, because obviously the training 01:47:40.620 |
But then I just had something here saying if returnAttention, then stick the attentions 01:47:44.740 |
on as well, and the attentions is simply that value, a, just chucked into a list. 01:47:52.700 |
So we can now call the model with returnAttention = true and get back the probabilities and 01:47:58.500 |
the attentions, which means as well as printing out these here, we can draw pictures. 01:48:08.700 |
And so you can see at the start, the attention is all in the first word, second word, third word, and so forth. 01:48:14.780 |
And this is just for one particular sentence. 01:48:20.340 |
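A quick way to produce that kind of picture, assuming attns is the list of per-step attention tensors (each batch x source-positions) returned when the return-attention flag is on:

```python
import numpy as np
import matplotlib.pyplot as plt

att = np.stack([a.detach().cpu().numpy()[0] for a in attns])  # decoder steps x source positions
fig, axes = plt.subplots(3, 3, figsize=(12, 8))
for ax, row in zip(axes.flat, att):
    ax.plot(row)            # where the model is "looking" while producing this output word
plt.tight_layout()
plt.show()
```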
So you can kind of see, this is the equivalent; this is like, when you're Chris Olah and Shan 01:48:29.420 |
Carter you make things that look like this, and when you're Jeremy Howard, the exact same information looks like this. 01:48:41.820 |
So you can see basically at each different time step, we've got a different attention. 01:48:47.340 |
And it's really important when you try to build something like this, you don't really 01:48:56.980 |
Because if it's not working, and as per usual my first 12 attempts at this were broken, 01:49:03.300 |
and they were broken in the sense that it wasn't really learning anything useful, and 01:49:06.940 |
so therefore it was basically giving equal attention to everything, and therefore it 01:49:11.180 |
It just wasn't better, or it wasn't much better. 01:49:14.980 |
And so until you actually find ways to visualize the thing in a way that you know what it ought 01:49:22.220 |
to look like ahead of time, you don't really know if it's working. 01:49:24.820 |
So it's really important that you try to find ways to kind of check your intermediate steps 01:49:30.640 |
So people are asking what is the loss function for the attentional neural network? 01:49:38.260 |
No, no, no loss function for the attentional neural network. 01:49:43.140 |
So it's just sitting here inside our decoder loop. 01:49:48.100 |
So the loss function for the decoder loop is that this result contains exactly the same 01:49:55.580 |
Just the outputs, the probabilities of the words. 01:49:59.680 |
So like the loss function, it's the same loss function. 01:50:05.440 |
So how come the little mini neural net is learning something? Well, because in order to make the 01:50:12.960 |
outputs better and better, it would be great if it made the weights of this weighted average better and better. 01:50:21.300 |
So part of creating our output is to please do a good job of finding a good set of weights. 01:50:26.540 |
And if it doesn't do a good job of finding a good set of weights, then the loss function won't improve. 01:50:31.700 |
So end-to-end learning means you throw in everything that you can into one loss function. 01:50:41.420 |
And the gradients of all the different parameters point in a direction that says basically, 01:50:47.740 |
hey, if you had put more weight over there, it would have been better. 01:50:52.540 |
And thanks to the magic of the chain rule, it then knows, oh, it would have put more 01:50:55.800 |
weight over there if you would change the parameter in this matrix, multiply a little 01:51:02.360 |
And so that's the magic of end-to-end learning. 01:51:07.940 |
So it's a very understandable question of how did this little mini neural net work. 01:51:17.060 |
But you've got to realize there's nothing particularly about this code that says, hey, 01:51:22.300 |
this particular bit's a separate little mini neural network any more than the GRU is a 01:51:26.460 |
separate little neural network, or this linear layer is a separate little function. 01:51:31.900 |
It all ends up pushed into one output, which ends up in one loss function that returns 01:51:40.580 |
a single number that says this either was or wasn't a good translation. 01:51:46.220 |
And so thanks to the magic of the chain rule, we then back-propagate little updates to all 01:51:52.380 |
the parameters to make them a little bit better. 01:51:55.460 |
So this is a big, weird counterintuitive idea, and it's totally okay if it's a bit mind bending. 01:52:07.700 |
And it's the bit where, even back to lesson 1, it's like, how did we make it find dogs versus cats? 01:52:19.300 |
All we did was we said, this is our data, this is our architecture, this is our loss 01:52:25.020 |
function, please back-propagate into the weights to make them better. 01:52:29.220 |
And after you've made them better a while, it'll start finding cats from dogs. 01:52:34.140 |
In this case, we haven't used somebody else's convolutional network architecture. 01:52:39.260 |
We've said here's like a custom architecture which we hope is going to be particularly good at this problem. 01:52:45.180 |
And even without this custom architecture, it was still okay. 01:52:49.300 |
But then when we kind of made it in a way that made more sense to what we think it ought 01:52:57.100 |
But at no point did we kind of do anything different other than say here's data, here's 01:53:04.020 |
an architecture, here's a loss function, go and find the parameters please. 01:53:09.620 |
And it did it because that's what neural nets do. 01:53:25.140 |
If you want to encode an image using a CNN backbone of some kind and then pass that into 01:53:34.620 |
a decoder which is like an RNN with attention, and you make your Y values the actual correct 01:53:41.860 |
captions for each of those images, you will end up with an image caption generator. 01:53:46.860 |
If you do the same thing with videos and captions, you'll end up with a video caption generator. 01:53:50.460 |
If you do the same thing with 3D CT scans and radiology reports, you'll end up with a radiology report generator. 01:53:57.140 |
If you do the same thing with GitHub issues and people's chosen summaries of them, you'll end up with a GitHub issue summarizer. 01:54:18.540 |
And I don't feel like people have begun to scratch the surface of how to use seq-to-seq models like this. 01:54:26.460 |
Not being a GitHub person, it would never have occurred to me that it would be kind of cool 01:54:30.900 |
to start with some issue and automatically create a summary. 01:54:34.620 |
But now I'm like, of course, next time I go to GitHub I want to see a summary written there 01:54:41.660 |
I don't want to write my own damn commit message through that. 01:54:44.780 |
Why should I write my own summary of the code review when I finish adding comments to lots 01:54:51.260 |
Now I'm thinking, GitHub is so behind, it could be doing this stuff. 01:54:55.500 |
So what are the things in your industry where you could start with a sequence and generate something from it? 01:55:04.860 |
So again, it's kind of like a fairly new area, the tools for it are not easy to use, they're 01:55:12.500 |
not even built into fastai yet, as you can see, hopefully they will be soon. 01:55:18.740 |
And I don't think anybody knows what the opportunities are. 01:55:27.880 |
The bad news is we have 20 minutes to cover a topic which in last year's course took a 01:55:37.580 |
The good news is that when I went to rewrite this using fastai and PyTorch I ended up with 01:55:44.840 |
So all of the stuff that made it hard last year is basically gone now. 01:55:49.360 |
So we're going to do something bringing together for the first time our two little worlds we 01:55:55.500 |
focused on, text and images, and we're going to try and bring them together. 01:56:01.220 |
And so this idea came up really in a paper by this extraordinary deep learning practitioner 01:56:11.140 |
And Andrea was at Google at the time, and her basic crazy idea was to say words can 01:56:22.380 |
have a distributed representation, a space, which at that time really was just word vectors. 01:56:31.260 |
And images can be represented in a space, like in the end if we have a fully connected 01:56:36.340 |
layer they kind of ended up as a vector representation. 01:56:42.660 |
Could we somehow encourage the vector space that the images end up in to be the same vector space that the words are in? 01:56:52.100 |
And if we could do that, what would that mean? 01:56:57.380 |
So what could we do with that? Well, it covers things like, what if I'm wrong? 01:57:05.300 |
What if I'm making a prediction for this image of a beagle, and I predict jumbo jet, and Yannet's model predicts corgi? 01:57:20.220 |
The normal loss function says that Yannet's and Jeremy's models are equally good, i.e. they're both wrong. 01:57:27.700 |
But what if we could somehow say corgi is closer to beagle than it is to jumbo jet, so Yannet's mistake is less bad than mine? 01:57:36.860 |
And we should be able to do that, because in word vector space beagle and corgi are pretty close together, and they're both a long way from jumbo jet. 01:57:48.460 |
So it would give us a nice situation where hopefully our inferences, when they're wrong, would be wrong in more sensible ways. 01:57:57.620 |
It would also allow us to search for things that aren't in ImageNet, like a category ImageNet was never trained on. 01:58:11.700 |
Why did I have to train a whole new model to find dogs versus cats when we already had something that had seen lots of dogs and cats? 01:58:21.980 |
Well if I had trained it in word vector space, I totally could, because there's now a word 01:58:28.540 |
vector, I can find things with the right image vector, and so forth. 01:58:34.740 |
So we'll look at some cool things we can do with it in a moment, but first of all let's 01:58:38.140 |
train a model where this model is not learning a category, a one-hot encoded ID where every 01:58:47.700 |
category is equally far from every other category. 01:58:51.340 |
Let's instead train a model where we're finding the dependent variable which is a word vector. 01:59:00.660 |
Well obviously the word vector for the word you want. 01:59:03.700 |
So if it's corgi, let's train it to create a word vector that's the corgi word vector. 01:59:10.060 |
And if it's a jumbo jet, let's train it with a dependent variable that says this is the jumbo-jet word vector. 01:59:20.500 |
So let's grab the fast text word vectors again, load them in, we only need English this time. 01:59:28.780 |
And so here's an example of the word vector for king, it's just 300 numbers. 01:59:36.220 |
So for example, little j Jeremy and big j Jeremy have a correlation of 0.6, I don't like 01:59:42.420 |
bananas at all, this is good, banana and Jeremy, 0.14. 01:59:46.580 |
So words that you would expect to be correlated are correlated in words that should be as 01:59:52.080 |
far away from each other as possible, unfortunately they're still slightly correlated but not 01:59:57.140 |
So let's now grab all of the ImageNet classes, because we actually want to know which one's which. 02:00:10.740 |
So we've got a list of all of those up on files.fast.ai, we can grab them. 02:00:17.580 |
And let's also grab a list of all of the nouns in English which I've made available here 02:00:24.620 |
So here are the names of each of the 1000 ImageNet classes. 02:00:29.540 |
And here are all of the nouns in English according to WordNet, which is a popular thing for kind 02:00:41.460 |
So we can now go ahead and load that list of nouns, load the list of ImageNet classes, 02:00:55.820 |
So these are the class IDs for the 1000 ImageNet classes that are in the competition data set. 02:01:03.620 |
So here's an example, n01, is a tench which apparently is a kind of fish. 02:01:11.980 |
Let's do the same thing for all those WordNet nouns. 02:01:14.820 |
And you can see actually it turns out that ImageNet is using WordNet class names, so that 02:01:21.220 |
makes it nice and easy to map between the two. 02:01:25.580 |
And WordNet, the most basic thing is an entity, and then that includes an abstraction, and 02:01:31.580 |
a physical entity can be an object and so forth. 02:01:36.420 |
We've got the ImageNet 1000 and we've got the 82,000 which are in WordNet. 02:01:43.500 |
So we want to map the two together, which is as simple as creating a couple of dictionaries 02:01:47.220 |
to map them based on the Synset ID or the WordNet ID. 02:01:52.060 |
And it turns out we end up with 49,469 synset-to-word-vector mappings; what I need to do to get there is grab the 82,000 nouns 02:02:18.820 |
in WordNet and try and look them up in fast text. 02:02:23.140 |
And so I've managed to look up 49,000 of them in fast text. 02:02:27.920 |
So I've now got a dictionary that goes from Synset ID, which is what WordNet calls them, 02:02:34.580 |
So that's what this dictionary is, Synset to WordVector. 02:02:41.740 |
And I've also got the same thing specifically for the 1000 ImageNet classes. 02:02:54.580 |
Now I grab all of the ImageNet, which you can actually download from Kaggle now. 02:03:00.980 |
If you look up the Kaggle ImageNet localization competition, that contains the entirety of 02:03:08.140 |
It's got a validation set of 28,650 items in it. 02:03:15.740 |
And so I can basically just grab for every image in ImageNet, I can grab using that Synset 02:03:23.420 |
to WordVector, grab its fast text word vector, and I can now stick that into this ImageVectors list. 02:03:37.200 |
Stack that all up into a single matrix and save that away. 02:03:43.260 |
And so now what I've got is something for every ImageNet image. 02:03:49.660 |
I've also got the fast text WordVector that it's associated with. 02:03:55.520 |
Just by looking up the Synset ID, going to WordNet, then going to fast text and grabbing the word vector. 02:04:10.340 |
I can now create a model data object, which specifically is an image classifier data object. 02:04:17.940 |
And I've got this thing called from_names_and_array, I'm not sure if we've used it before. 02:04:21.340 |
But we can pass it a list of file names, and so these are all of the file names in ImageNet. 02:04:29.340 |
And we can just pass it an array of our dependent variables. 02:04:32.900 |
And so this is all of the fast text WordVectors. 02:04:38.140 |
And then I can pass in the validation indexes, which in this case is just all of the last 02:04:44.820 |
I need to make sure that they're the same ones ImageNet uses, otherwise I'll be cheating. 02:04:50.100 |
And then I pass in continuous = True, which once again puts a lie to the ImageClassifierData name. 02:04:59.020 |
So continuous = True means don't one-hot encode my outputs, but treat them just as continuous values. 02:05:06.580 |
So now I've got a model data object that contains all of my file names, and for every file name 02:05:12.900 |
a continuous array representing the WordVector for that. 02:05:17.500 |
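Roughly how that model data object gets built, using the 0.7-era fastai API (from_names_and_array and its argument names are written from memory here, so treat the exact signature as an assumption):

```python
# img_fnames: list of ImageNet file names; img_vecs: the matching fast text word vectors
md = ImageClassifierData.from_names_and_array(
    PATH, img_fnames, img_vecs,
    val_idxs=val_idxs,       # the same validation split ImageNet uses, so we aren't cheating
    classes=None, tfms=tfms,
    continuous=True,         # regression targets, not one-hot encoded classes
    bs=256)
```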
So I have an x, I have a y, so I have data, now I need an architecture and a loss function. 02:05:26.400 |
So let's create an architecture, and we'll revise this next week, but basically we can 02:05:33.620 |
use the tricks we've learned so far, but it's actually incredibly simple. 02:05:38.020 |
Fast AI has a ConvnetBuilder, which is what gets called when you say ConvLearner.pretrained. 02:05:48.300 |
And you basically say, okay, what architecture do you want? 02:05:55.780 |
In this case it's not really classes, it's how many outputs do you want, which is the length of the word vector, 300. 02:06:04.780 |
Obviously it's not multi-class classification, it's not classification at all. 02:06:12.480 |
And then you can just say, alright, what fully connected layers do you want? 02:06:16.820 |
So I'm just going to add one fully connected layer, hidden layer of length 1024. 02:06:23.620 |
Well I've got the last layer of ResNet-50, I think it's 1024 long. 02:06:34.940 |
I obviously need my penultimate layer to be longer than 300, otherwise there's not enough 02:06:39.180 |
information, so I kind of just picked something a bit bigger. 02:06:42.780 |
Maybe different numbers would be better, but this worked for me. 02:06:48.420 |
I found that the default dropout I was consistently underfitting, so I just decreased the dropout 02:06:56.500 |
And so this is now a convolutional neural network that does not have any softmax or 02:07:03.020 |
anything like that because it's regression, it's just a linear layer at the end. 02:07:14.680 |
So I can create a convlearner from that model, give it an optimization function. 02:07:21.460 |
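The architecture set-up just described is only a few lines; a sketch in the 0.7-era fastai API (ConvnetBuilder/ConvLearner argument names from memory):

```python
from functools import partial
from torch import optim

arch = resnet50
models = ConvnetBuilder(arch, md.c,            # md.c is 300, the word vector length
                        is_multi=False, is_reg=True,
                        xtra_fc=[1024],        # one extra fully connected hidden layer
                        ps=[0.2, 0.2])         # lower dropout than the default
learn = ConvLearner(md, models)
learn.opt_fn = partial(optim.Adam, betas=(0.9, 0.99))
```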
So now I've got data and I've got an architecture; because I said I want this 02:07:28.060 |
many outputs, 300, it knows there are 300 outputs because that's the size of this array, so all I need now is a loss function. 02:07:39.140 |
Now the default loss function for regression is L1-loss, so the absolute differences. 02:07:46.260 |
That's not bad, but unfortunately in really high dimensional spaces, anybody who's studied 02:07:55.220 |
a bit of machine learning probably knows this, in really high dimensional spaces, in this 02:07:58.700 |
case it's 300 dimensional, basically everything is on the outside. 02:08:04.060 |
And when everything's on the outside, distance is not meaningless, but it's a little bit 02:08:12.220 |
awkward, things tend to be close together or far away, doesn't really mean much in these 02:08:19.780 |
really high dimensional spaces where everything's on the edge. 02:08:24.440 |
What does mean something though is that if one thing's on the edge over here, and one 02:08:29.180 |
thing's on the edge over here, you can form an angle between those vectors, and the angle 02:08:36.900 |
And so that's why we use cosine similarity when we're basically looking for how close 02:08:43.700 |
or far apart are things in high dimensional spaces. 02:08:47.500 |
And if you haven't seen cosine similarity before, it's basically the same as Euclidean 02:08:51.500 |
distance, but it's normalized to be basically a unit norm, so basically divide by the length. 02:09:00.980 |
So we don't care about the length of the vector, we only care about its angle. 02:09:05.260 |
So there's a bunch of stuff that you could easily learn in a couple of hours, but if 02:09:11.580 |
you haven't seen it before, it's a bit mysterious. 02:09:14.620 |
For now, just know that loss functions in high dimensional spaces where you're trying 02:09:18.780 |
to find similarity, you care about angle and you don't care about distance. 02:09:26.300 |
If you didn't use this custom loss function, it would still work, I tried it, it's just not quite as good. 02:09:32.420 |
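The custom loss is tiny; a sketch (F.cosine_similarity is a standard PyTorch function):

```python
import torch.nn.functional as F

def cos_loss(inp, targ):
    # we only care about the angle between the predicted vector and the target
    # word vector, not their lengths, so minimise 1 - cosine similarity
    return 1 - F.cosine_similarity(inp, targ).mean()
```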
So we've got an architecture, we've got data, we've got a loss function, therefore we're done. 02:09:41.540 |
Now I'm training on all of ImageNet, that's going to take a long time, so precompute = True. 02:09:50.380 |
That's that thing we learned ages ago that caches the output of the final convolutional 02:09:54.500 |
layer and just trains the fully connected bit. 02:10:00.160 |
And like even with pre-compute equals true, it takes like 3 minutes to train an epoch 02:10:08.580 |
So I trained it for a while longer, so that's like an hour's worth of training. 02:10:14.080 |
But it's pretty cool that with fast.ai, we can train a new custom head basically on all 02:10:26.260 |
And so at the end of all that, we can now say, let's grab the 1000 ImageNet classes, 02:10:40.580 |
And let's just take a look at a few pictures. 02:10:46.300 |
And because the validation set is ordered, all the stuff of the same type is in the same place. 02:10:55.940 |
And what we can now do is we can now use nearest neighbors search. 02:11:00.740 |
So nearest neighbors search means here's one 300-dimensional vector, here's a whole lot 02:11:06.260 |
of other 300-dimensional vectors, which ones is it closest to? 02:11:10.460 |
And normally that takes a very long time because you have to look through every 300-dimensional 02:11:13.420 |
vector, calculate its distance, and find out how far away it is. 02:11:18.380 |
But there's an amazing almost unknown library called NMSlib that does that incredibly fast. 02:11:29.100 |
Some of you may have tried other nearest neighbors libraries. 02:11:31.740 |
I guarantee this is faster than what you're using. 02:11:34.540 |
I can tell you that because it's been benchmarked by people who do this stuff for a living. 02:11:40.120 |
This is by far the fastest on every possible dimension. 02:11:46.980 |
We basically look here, this is angular distance. 02:11:49.920 |
So we want to create an index on angular distance, and we're going to do it on all of our ImageNet 02:11:56.220 |
word vectors: add them in as a whole batch, create the index, and now I can query a bunch of 02:12:01.780 |
vectors all at once, get their 10 nearest neighbors, using multi-threading. 02:12:09.280 |
You can install it from pip, it just works, and it tells you how far away they are and what their indexes are. 02:12:17.260 |
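A hedged sketch of using nmslib the way described here (init, addDataPointBatch, createIndex and knnQueryBatch are the library's standard calls, but check the docs for the exact options):

```python
import numpy as np
import nmslib

def create_index(vectors):
    index = nmslib.init(space='angulardist')    # angular distance, as discussed
    index.addDataPointBatch(vectors)            # add everything in one batch
    index.createIndex()
    return index

def knn_batch(index, query_vecs, k=10, n_threads=4):
    # returns (neighbour indexes, distances) for each query vector
    return zip(*index.knnQueryBatch(query_vecs, k=k, num_threads=n_threads))

# e.g. index the 1,000 class word vectors, then query with the model's predictions:
# nn_index = create_index(np.asarray(class_word_vectors, dtype=np.float32))
# idxs, dists = knn_batch(nn_index, predictions)
```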
So we can now go through and print out the top 3. 02:12:21.620 |
So it turns out that bird actually is a limpkin. 02:12:29.820 |
Interestingly, this one doesn't say it's a limpkin, and I looked it up, it's the 4th one. 02:12:34.780 |
I don't know much about birds, but everything else here is brown with white spots, that's 02:12:41.460 |
So I don't know if that's actually a limpkin or if it's a mislabeled, but it sure as hell 02:12:55.780 |
Now this is not a particularly hard thing to do because it's only 1,000 ImageNet classes, 02:13:00.380 |
it's not doing anything new, but what if we now bring in the entirety of WordNet, and 02:13:06.380 |
we now say which of those 45,000 things is it closest to, exactly the same. 02:13:16.460 |
So now let's do something a bit different, which is take all of our predictions, so basically 02:13:22.180 |
take our whole validation set of images and create a KNN index of the image representations, 02:13:32.220 |
because remember it's predicting things that are meant to be word vectors. 02:13:37.260 |
And now let's grab the fast text vector for boat, and boat is not an ImageNet concept. 02:13:49.380 |
And yet I can now find all of the images in my predicted word vectors in my validation 02:13:59.320 |
And it works, even though it's not something that was ever trained on. 02:14:04.340 |
What if we now take engine's vector and boat's vector and take their average? 02:14:10.140 |
And what if we now look in our nearest neighbors for that? 02:14:14.860 |
I mean, yes, this is actually a boat with an engine, it just happens to have wings on 02:14:20.860 |
By the way, sail is not an ImageNet thing, boat is not an ImageNet thing, here's the 02:14:27.700 |
average of two things that are not ImageNet things. 02:14:31.060 |
And yet, with one exception, it's found me sailboats. 02:14:38.340 |
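That experiment is just word-vector arithmetic plus the same nearest-neighbour query; assuming en_vecd is the English word-to-vector dictionary loaded earlier and img_index is an nmslib index over the predicted image vectors, it looks roughly like:

```python
import numpy as np

vec = (en_vecd['sail'] + en_vecd['boat']) / 2          # average two fast text vectors
idxs, dists = knn_batch(img_index, vec[None].astype(np.float32), k=3)
# idxs now points at the validation images whose predicted vectors are closest to that average
```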
Let's open up an image in the validation set. 02:14:46.780 |
Let's call predict_array on that image to get its kind of word-vector-like thing. 02:14:53.380 |
And let's do a nearest-neighbors search on all the other images. 02:14:57.420 |
And here's all the other images of whatever the hell that is. 02:15:04.980 |
We've trained a thing on all of ImageNet in an hour using a custom head that required 02:15:12.340 |
And these things run in like 300 milliseconds to do these searches. 02:15:16.980 |
I actually taught this basic idea last year as well, but it was in Keras and it was just 02:15:22.180 |
pages and pages and pages of code and everything took a long time and it was complicated. 02:15:27.060 |
Back then I kind of said, I can't begin to think all the stuff you could do with this. 02:15:31.420 |
I don't think anybody has really thought deeply about this yet, but I think it's fascinating. 02:15:36.880 |
And so go back and read the DeViSE paper, because Andrea had a whole bunch of other thoughts in there. 02:15:43.420 |
And now that it's so easy to do, hopefully people will dig into this now because I think