Lesson 6: Deep Learning 2018
Chapters
0:00 Introduction
4:50 Learn.model
10:20 Return a variable
15:30 Zip function
20:30 PCA
24:45 Entity embedding layers
29:11 Map of Germany
30:35 Embeddings
35:45 Fake Tasks
37:35 Autoencoder
41:04 Rossmann
46:36 Rossmann kernels
49:04 Columnar model data
00:00:14.360 |
Couple of weeks ago, in lesson 4, I mentioned I was going to share that lesson with this 00:00:18.820 |
terrific NLP researcher, Sebastian Ruder, which I did, and he said he loved it and he's 00:00:25.220 |
gone on, just yesterday, to release this new post called Optimization for Deep Learning Highlights 00:00:31.600 |
in 2017 in which he covered basically everything that we talked about in that lesson. 00:00:39.720 |
And with some very nice shoutouts to some of the work that some of the students here 00:00:43.280 |
have done, including when he talked about the separation of weight decay from the momentum 00:00:56.120 |
term, and so he actually mentions here the opportunities in terms of improved software 00:01:04.240 |
decoupling this allows, and actually links to the commit from Anand Saha showing how that's done. 00:01:14.840 |
So fastai's code is actually being used as a bit of a role model now. 00:01:20.320 |
He then covers some of these learning rate training techniques that we've talked about. 00:01:28.520 |
This is the SGDR schedule, it looks a bit different to what you're used to seeing because 00:01:34.600 |
this is on a log curve, this is the way that they show it on the paper. 00:01:38.720 |
And for more information, again, links to two blog posts, the one from Vitaly about 00:01:46.880 |
this topic, and again Anand Saha's blog post on this topic. 00:01:53.360 |
So it's great to see that some of the work from fastai students is already getting noticed 00:01:59.680 |
and picked up and shared, and this blog post went on to get on the front page of Hacker News. 00:02:07.760 |
And hopefully more and more of this work will be picked up once this is released publicly. 00:02:16.080 |
So last week we were doing a deep dive into collaborative filtering. 00:02:26.240 |
Let's remind ourselves of what our final model looked like. 00:02:35.240 |
So in the end we ended up rebuilding the model that's actually in the fastai library, where 00:02:45.760 |
we had an embedding, so we had this little get embedding function that grabbed an embedding 00:02:51.840 |
and randomly initialized the weights, for the users, and for the items, that's the kind 00:03:00.280 |
of generic term; in our case the items are movies. Then the bias for the users, the bias for 00:03:04.560 |
the items, and we had n factors embedding size for each one, of course the biases just 00:03:12.120 |
had a single one, and then we grabbed the users and item embeddings, multiplied them 00:03:16.600 |
together, summed it up for each row, and added on the bias terms, popped that through a sigmoid 00:03:31.840 |
One of you asked if we can interpret this information in some way, and I promised we would. 00:03:38.800 |
So let's take a look. We're going to start with the model we built here where we just 00:03:44.540 |
used the fastai library's CollabFilterDataset.from_csv, and then that .get_learner, and 00:03:52.080 |
then we fitted it in three epochs, 19 seconds, we've got a pretty good result. 00:04:01.440 |
So what we can now do is to analyze that model. 00:04:07.720 |
So you may remember right back when we started, we read in the movies.csv file, but that's 00:04:15.600 |
just a mapping from the ID of the movie to the name of the movie. So we're just going 00:04:20.040 |
to use that for display purposes so we can see what we're doing. 00:04:25.640 |
Because not all of us have watched every movie, I'm just going to limit this to the 00:04:30.920 |
3,000 most popular movies, so we might have more chance of recognizing the movies we're 00:04:37.880 |
looking at, and then I'll go ahead and change it from the movie IDs from MovieLens to those 00:04:44.000 |
unique IDs that we're using, the contiguous IDs, because that's what our model has. 00:04:50.420 |
So from the learn object, the learner that we created, we can always grab the PyTorch 00:05:02.680 |
model itself just by saying "learn.model", and I'm going to show you more and more of 00:05:10.440 |
the code at the moment, so let's take a look at the definition of model. 00:05:18.800 |
And so model is a property, so if you haven't seen a property before, a property is just 00:05:22.600 |
something in Python which looks like a method when you define it, but you can call it without parentheses. 00:05:33.040 |
And so when you call it, it kind of looks like it's a regular attribute, 00:05:38.640 |
but every time you call it, it actually runs this code. 00:05:42.380 |
And so in this case it's just a shortcut to grab something called .models.model, so you 00:05:48.760 |
may be interested to know what that looks like, learn.models. 00:05:54.240 |
And so the fastai model type is a very thin wrapper for PyTorch models. 00:06:03.200 |
So we could take a look at this, the CollabFilterModel, and see what that is. 00:06:16.200 |
And we'll talk more about these in part 2, but basically there's this very thin wrapper 00:06:23.960 |
and one of the main things that fastai does is we have this concept of layer groups where 00:06:28.840 |
basically when you say here there are different learning rates and they get applied to different 00:06:32.680 |
sets of layers, then that's something that's not in PyTorch. 00:06:35.960 |
So when you say I want to use this PyTorch model, there's one thing we have to do, which is write this little wrapper. 00:06:43.840 |
So the details aren't terribly important, but in general if you want to create a little 00:06:48.400 |
wrapper for some other PyTorch model, you could just write something like this. 00:06:55.600 |
So to get inside that, to grab the actual PyTorch model itself, it's models.model, that's 00:07:02.280 |
the PyTorch model, and then the learn object has a shortcut to that. 00:07:07.680 |
So we're going to set m to be the PyTorch model. 00:07:12.520 |
And so when you print out a PyTorch model, it prints it out basically by listing out 00:07:18.400 |
all of the layers that you created in the constructor. 00:07:23.400 |
It's quite nifty actually when you think about the way this works thanks to some very handy 00:07:30.720 |
stuff in Python, we're actually able to use standard Python OO to define these modules 00:07:39.120 |
and these layers and they basically automatically register themselves with PyTorch. 00:07:45.040 |
So back in our EmbeddingDotBias module, we just had a bunch of things where we said each of these 00:07:51.040 |
things are equal to these things and it automatically knows how to represent that. 00:07:57.400 |
So you can see the name is u, and the name is just literally whatever we called that attribute. 00:08:05.560 |
And then the definition is it's this kind of layer. 00:08:14.760 |
So we can look inside that, basically use that, so if we say m.ib, then that's referring 00:08:23.440 |
to the embedding layer for an item which is the bias layer. 00:08:29.720 |
So an item bias in this case is the movie bias. 00:08:32.940 |
So each movie, there are 9,000 of them, has a single bias element. 00:08:39.600 |
Now the really nice thing about PyTorch layers and models is that they all look the same. 00:08:48.600 |
To use them, you call them as if they were a function. So we can go m.ib, and that basically 00:08:56.080 |
says I want you to return the value of that layer, and that layer could be a full-on model. 00:09:04.000 |
So to actually get a prediction from a PyTorch model, I would go m and pass in my variable. 00:09:12.320 |
And so in this case, m.ib and pass in my top movie indexes. 00:09:19.880 |
Now models, and remember layers are models too, require variables, not tensors, because they need to keep track 00:09:28.760 |
of the derivatives, and so we use this capital V to turn the tensor into a variable. 00:09:36.720 |
Now it's just announced this week that PyTorch 0.4, which is the version after the one that's 00:09:43.760 |
just about to be released, is going to get rid of variables and we'll actually be able 00:09:49.160 |
to use tensors directly to keep track of derivatives. 00:09:52.440 |
So if you're watching this on the MOOC and you're looking at 0.4, then you'll probably 00:09:56.640 |
notice that the code doesn't have this V in it anymore, so that will be pretty exciting 00:10:03.320 |
For now, we have to remember if we're going to pass something into a model to turn it 00:10:08.160 |
And remember, a variable has a strict superset of the API of a tensor, so anything you can 00:10:13.800 |
do to a tensor, you can do to a variable, like add it up or take its log or whatever. 00:10:20.080 |
So that's going to return a variable which consists of going through each of these movie 00:10:25.120 |
IDs, putting it through this embedding layer to get its bias. 00:10:42.720 |
So before I press Shift + Enter here, you can have a think about what I'm going to get. 00:10:48.000 |
I've got a list of 3000 movies going in, turning it into a variable, putting it through this 00:10:53.440 |
embedding layer, so just have a think about what you expect to come out. 00:10:59.800 |
And we have a variable of size 3000 by 1, hopefully that doesn't surprise you. 00:11:04.760 |
We had 3000 movies that we were looking up, each one had an embedding of length one, so there's our 3000 by 1. 00:11:12.240 |
You'll notice it's a variable, which is not surprising because we've fed it a variable, 00:11:15.920 |
so we've got a variable back, and it's a variable that's on the GPU, dot CUDA. 00:11:22.700 |
So we have a little shortcut in fast.ai because we very often want to take variables, turn 00:11:29.840 |
them into tensors, and move them back to the CPU so we can play with them more easily. 00:11:34.400 |
So to_np is "to NumPy", and that does all of those things. 00:11:39.420 |
It works regardless of whether it's a tensor or a variable, it works regardless of whether 00:11:43.400 |
it's on the CPU or GPU, it'll end up giving you a NumPy array from that. 00:11:50.160 |
So if we do that, that gives us exactly the same thing as we just looked at, but now in 00:11:58.680 |
So that's a super handy thing to use when you're playing around with PyTorch. 00:12:03.600 |
My approach to things is I try to use NumPy for everything, except when I explicitly need 00:12:13.080 |
something to run on the GPU, or I need its derivatives, in which case I use PyTorch. 00:12:19.840 |
I find NumPy's often easier to work with, it's been around many years longer than PyTorch. 00:12:29.920 |
And lots of things like the Python Imaging Library, OpenCV, and lots and lots of other stuff work with NumPy. 00:12:40.040 |
So my approach is do as much as I can in NumPy land, finally when I'm ready to do something 00:12:46.400 |
on the GPU or take its derivative, I move to PyTorch, and then as soon as I can, I put it back into NumPy. 00:12:52.760 |
And you'll see that the FastAI library really works this way, like all the transformations 00:12:57.080 |
and stuff happen in NumPy, which is different to most PyTorch computer vision libraries 00:13:03.840 |
which tend to do it all as much as possible in PyTorch. 00:13:13.280 |
So let's say we wanted to build a model with the GPU and train it, and then use it elsewhere. 00:13:21.300 |
Would we call to NumPy on the model itself, or would we have to iterate through all the layers? 00:13:31.480 |
So it's very likely that you want to do inference on a CPU rather than a GPU, it's more scalable, 00:13:37.920 |
you don't have to worry about putting things in batches, so on and so forth. 00:13:42.560 |
So you can move a model onto the CPU just by typing m.cpu(), and that model is now on the CPU. 00:13:53.280 |
And therefore you can also then put your variable on the CPU by doing exactly the same thing, calling .cpu() on it. 00:14:05.840 |
Now having said that, if your server doesn't have a GPU, you don't have to do this, because it will all happen automatically. 00:14:15.120 |
So for inferencing on the server, if you're running it on some T2 instance or something, 00:14:23.440 |
it'll work fine and it'll all run on the CPU automatically. 00:14:28.280 |
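As a rough sketch of what that looks like, reusing the m and topMovieIdx names from earlier (so purely illustrative):

    m = m.cpu()                              # moves all of the model's parameters to the CPU in place
    movie_bias = m.ib(V(topMovieIdx).cpu())  # the input variable has to live on the same device as the layer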
So if we train the model on the GPU and then we save those embeddings and the weights, 00:14:37.520 |
would we have to do anything special to load it onto the CPU? 00:14:43.360 |
We have something, well it kind of depends on how much of fast AI you're using, so I'll 00:14:48.080 |
show you how you can do that in case you have to do it manually. 00:14:52.920 |
One of the students figured this out, which is very handy. 00:14:56.240 |
When we -- there's a load_model function, and if you look at what it does, it does torch.load with a map_location argument, 00:15:07.320 |
and basically this is like some magic incantation: normally it has to load the weights 00:15:12.440 |
onto the same GPU they were saved on, but this will load them onto whatever's available. 00:15:29.600 |
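As a sketch, the incantation looks roughly like this (the filename here is just an example):

    import torch

    # map_location tells torch.load where to put the saved weights: this lambda keeps
    # them in plain CPU storage regardless of which GPU they were saved on
    state = torch.load('model.h5', map_location=lambda storage, loc: storage)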
To put that back on the GPU, I'll need to say .cuda and now I can run it again. 00:15:40.040 |
So it's really important to know about the zip function in Python, which iterates through a number of lists at the same time. 00:15:50.360 |
So in this case, I want to grab each movie along with its bias term so that I can look at them together. 00:15:58.200 |
So if I just go zip like that, that's going to iterate through each movie ID and each bias together. 00:16:05.680 |
And so then I can use that in a list comprehension to grab the name of each movie along with its bias. 00:16:12.800 |
So having done that, I can then sort, and so here I told you the John Travolta Scientology 00:16:23.160 |
movie is the most negative -- by quite a lot. 00:16:26.800 |
If this was a Kaggle competition, Battlefield Earth would have won by miles. 00:16:34.880 |
So here is the worst movie of all time, according to IMDB. 00:16:38.960 |
It's interesting when you think about what this means, because this is a much more authentic 00:16:43.440 |
way to find out how bad this movie is, because some people are just more negative about movies. 00:16:50.680 |
And if more of them watch your movie, like a highly critical audience, they're going to rate it lower. 00:16:56.160 |
So if you take an average, it's not quite fair. 00:17:01.120 |
And so what this is doing is saying once we remove the fact that different people have 00:17:07.840 |
different overall positive or negative experiences, and different people watch different kinds 00:17:11.640 |
of movies, and we correct for all that, this is the worst movie of all time. 00:17:22.580 |
So this is how we can look inside our model and interpret the bias vectors. 00:17:30.760 |
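A sketch of what that looks like, assuming movie_bias is the 3000x1 array we just computed, topMovies is the list of popular movie IDs, and movie_names maps each ID to its title (names as I recall them from the notebook):

    movie_ratings = [(b[0], movie_names[i]) for i, b in zip(topMovies, movie_bias)]
    sorted(movie_ratings, key=lambda o: o[0])[:10]   # most negative bias first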
You'll see here I've sorted by the zeroth element of each tuple by using a lambda. 00:17:42.200 |
There's also itemgetter, which is part of Python's operator library; it creates a function that returns the 00:17:47.360 |
zeroth element of something in order to save time. 00:17:51.760 |
And then I actually realized that the lambda is only one more character to write than the 00:17:56.720 |
item getter, so maybe we don't need to know this after all. 00:17:59.680 |
So really useful to make sure you know how to write lambdas in Python, so this is a function. 00:18:07.560 |
And so the sort is going to call this function every time it decides is this thing higher 00:18:12.840 |
or lower than that other thing, and this is going to return the zeroth element. 00:18:20.000 |
So here's the same thing in item getter format, and here is the reverse and Shawshank redemption 00:18:27.320 |
right at the top, I'll definitely agree with that, godfather, usual suspects, these are 00:18:31.320 |
all pretty great movies, 12 angry men, absolutely. 00:18:37.160 |
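For reference, the two forms are interchangeable; itemgetter(0) just builds a function that returns element zero of whatever it's given (movie_ratings here is the list built above):

    from operator import itemgetter

    sorted(movie_ratings, key=itemgetter(0), reverse=True)[:10]    # highest bias first
    sorted(movie_ratings, key=lambda o: o[0], reverse=True)[:10]   # same thing with a lambda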
So there you go, there's how we can look at the bias. 00:18:42.480 |
So then the second piece to look at would be the embeddings. 00:18:49.200 |
So we can do the same thing, so remember m.i was the item embeddings, whereas m.ib was 00:18:54.320 |
the item bias, we can pass in our list of movies as a variable, turn it into numpy, 00:19:00.700 |
and here's our movie embeddings, so for each of the 3000 most popular movies, here are its 50 embedding values. 00:19:10.160 |
So it's very hard, unless you're Geoffrey Hinton, to visualize a 50-dimensional space. 00:19:16.560 |
So what we'll do is we'll turn it into a 3-dimensional space. 00:19:21.000 |
So we can compress high-dimensional spaces down into lower-dimensional spaces using lots 00:19:26.520 |
of different techniques, perhaps one of the most common and popular is called PCA. 00:19:31.360 |
PCA stands for Principal Components Analysis, it's a linear technique, but linear techniques 00:19:39.920 |
generally work fine for this kind of embedding. 00:19:44.240 |
I'm not going to teach you about PCA now, but I will say in Rachel's Computational Linear 00:19:48.840 |
Algebra class, which you can get to from fast.ai, we cover PCA in a lot of detail. 00:19:57.320 |
And it's a really important technique, it turns out to be almost identical to something 00:20:01.520 |
called Singular Value Decomposition, which is a type of matrix decomposition which actually 00:20:08.180 |
does turn up in deep learning a little bit from time to time. 00:20:12.840 |
So it's kind of somewhat worth knowing if you were going to dig more into linear algebra, 00:20:19.200 |
SVD and PCA, along with eigenvalues and eigenvectors, which are all slightly different versions 00:20:25.720 |
of this kind of the same thing, are all worth knowing. 00:20:29.560 |
But for now, just know that you can grab PCA from sklearn.decomposition, say how much you 00:20:36.280 |
want to reduce the dimensionality to, so I want to find 3 components, and what this is 00:20:41.340 |
going to do is find 3 linear combinations of the 50 dimensions which capture as much 00:20:49.240 |
of the variation as possible, but are as different to each other as possible. 00:20:55.820 |
So we would call this a lower rank approximation of our matrix. 00:21:03.040 |
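A sketch of that, assuming movie_emb is the 3000x50 embedding matrix from above:

    from sklearn.decomposition import PCA

    pca = PCA(n_components=3)
    movie_pca = pca.fit(movie_emb.T).components_   # shape (3, 3000): one row per component, one column per movie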
So then, once we've fitted it, we can grab the components, and that's going to be the 3 new dimensions. 00:21:13.320 |
And so we can now take a look at the first of them, and we'll do the same thing of using 00:21:17.600 |
zip to look at each one along with its movie. 00:21:20.960 |
And so here's the thing, we don't know ahead of time what this PCA thing is, it's just 00:21:29.960 |
a bunch of latent factors, it's kind of the main axis in this space of latent factors. 00:21:39.460 |
And so what we can do is look at it and see if we can figure out what it's about. 00:21:46.760 |
So given that Police Academy 4 is high up here along with Waterworld, whereas Fargo, 00:21:54.040 |
Pulp Fiction and The Godfather are up the other end, I'm going to guess that a high value is not 00:21:59.880 |
going to represent critically acclaimed movies or serious watching. 00:22:05.440 |
So I called this "easy watching vs. serious", but this is kind of how you have to interpret 00:22:13.280 |
your embeddings, take a look at what they seem to be showing and decide what you think 00:22:20.120 |
So this is the principal axis in this set of embeddings. 00:22:25.520 |
So we can look at the next one, so do the same thing and look at the first index 1 embedding. 00:22:33.640 |
This one's a little bit harder to figure out what's going on, but with things like Mulholland 00:22:37.700 |
Drive and Purple Rose of Cairo, these look more kind of dialog-y ones, or else things 00:22:44.400 |
like Lord of the Rings and Aladdin and Star Wars, these look more like modern CGI-y ones. 00:22:49.720 |
So you can imagine that on that pair of dimensions, it probably represents a lot of differences 00:23:00.320 |
Some people like Purple Rose of Cairo type movies, Woody Allen kind of classics, and some don't. 00:23:15.240 |
Some people presumably like Police Academy 4 more than they like Fargo. 00:23:22.720 |
So you can kind of get the idea of what's happened. 00:23:34.560 |
For a model which literally just multiplies two things together and adds them up, it's learned quite a lot. 00:23:52.380 |
And then we could plot them if we wanted to; I just grabbed a small subset to plot on those first couple of dimensions. 00:24:05.280 |
So I wanted to next dig in a layer deeper into what actually happens when we say fit. 00:24:25.200 |
For something like the store model, is there a way to interpret the embeddings? 00:24:34.960 |
Well let's jump straight there, what the hell. 00:24:45.400 |
So, for the Rossmann how-much-are-we-going-to-sell-at-each-store-on-each-date model. 00:25:06.040 |
It's a great paper, by the way, well worth it, pretty accessible. 00:25:10.720 |
I think any of you would at this point be able to at least get the gist of it, and much 00:25:16.960 |
of the detail as well, particularly as you've also done the machine learning course. 00:25:22.240 |
And they actually make this point in the paper, this is in the paper, that the equivalent 00:25:26.960 |
of what they call entity embedding layers, an embedding of a categorical variable, is 00:25:32.200 |
identical to a one-hot encoding followed by a matrix multiply. 00:27:39.080 |
So they're basically saying if you've got three embeddings, that's the same as doing 00:25:43.380 |
three one-hot encodings, putting each one through a matrix multiply, and then put 00:27:48.280 |
that through a dense layer, or what PyTorch would call a linear layer. 00:25:56.980 |
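Here's a tiny illustration of that equivalence (the sizes are made up): looking up index 2 in an embedding gives the same numbers as multiplying a one-hot vector by the embedding's weight matrix.

    import torch
    import torch.nn as nn
    from torch.autograd import Variable

    emb = nn.Embedding(7, 4)                 # 7 categories, 4-dimensional embedding
    idx = Variable(torch.LongTensor([2]))
    one_hot = torch.zeros(1, 7)
    one_hot[0, 2] = 1

    lookup = emb(idx)                                 # index lookup
    matmul = torch.mm(Variable(one_hot), emb.weight)  # one-hot followed by a matrix multiply
    # lookup and matmul contain exactly the same values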
One of the nice things here is because this is kind of like, well they thought it was 00:26:00.360 |
the first paper, it was actually the second, I think, paper to show the idea of using categorical 00:26:04.560 |
embeddings for this kind of dataset, they really go into quite a lot of detail, right 00:26:11.200 |
back to the detailed stuff that we learned about, so it's kind of a second cut at thinking 00:26:21.200 |
So one of the interesting things that they did was they said after we've trained a neural 00:26:25.760 |
net with these embeddings, what else could we do with it? 00:26:33.520 |
So they got a winning result with a neural network with entity embeddings. 00:26:41.120 |
But then they said hey you know what, we could take those entity embeddings and replace each 00:26:45.780 |
categorical variable with the learned entity embeddings, and then feed that into a GBM. 00:26:54.080 |
So in other words, rather than passing into the GBM a one-hot encoded version, or an ordinal 00:26:59.600 |
version, let's actually replace the categorical variable with its embedding for the appropriate 00:27:09.680 |
So it's actually a way of feature engineering. 00:27:14.200 |
And so the mean average percent error for GBMs without that, using just one-hot encodings, was noticeably worse than with the embeddings. 00:27:28.880 |
Random forests without that was 0.16, with that 0.108, nearly as good as the neural net. 00:27:37.440 |
So this is kind of an interesting technique because what it means is in your organization 00:27:43.520 |
you can train a neural net that has an embedding of stores, and an embedding of product types, 00:27:49.880 |
and an embedding of whatever kind of high cardinality or even medium cardinality categorical variables 00:27:55.920 |
you have, and then everybody else in the organization can now chuck those into their GBM or random forest. 00:28:07.000 |
And what this is saying is they won't get quite the same result -- in fact you can even use k-nearest neighbors 00:28:12.480 |
with this technique and get nearly as good a result. 00:28:15.680 |
So this is a good way of giving the power of neural nets to everybody in your organization 00:28:22.600 |
without having them do the fast AI deep learning course first. 00:28:26.760 |
They can just use whatever sklearn or R or whatever that they're used to. 00:28:31.240 |
And those embeddings could literally be in a database table because if you think about 00:28:36.320 |
an embedding as just an index lookup, which is the same as an inner join in SQL. 00:28:43.480 |
So if you've got a table of each product along with its embedding vector, then you can literally 00:28:48.560 |
do an inner join, and now you have every row in your table along with its product embedding vector. 00:28:59.280 |
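As a purely illustrative sketch (every table and column name here is made up), if the product embeddings are stored as a table keyed by product ID, attaching them to every sales row is just a join:

    import pandas as pd

    # prod_emb_matrix, product_ids and sales are hypothetical inputs
    emb_cols = ['emb_%d' % i for i in range(prod_emb_matrix.shape[1])]
    prod_emb = pd.DataFrame(prod_emb_matrix, columns=emb_cols)
    prod_emb['product_id'] = product_ids
    sales_with_emb = sales.merge(prod_emb, on='product_id', how='inner')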
And GBMs and random forests learn a lot quicker than neural nets do. 00:29:12.080 |
So here's what happened when they took the various different states of Germany and plotted 00:29:17.360 |
the first two principal components of their embedding vectors. 00:29:21.200 |
And basically here is where each of them ended up in that 2D space. 00:29:25.240 |
And wackily enough, I've circled 3 of them in red, and I've circled the same 3 here on the map of Germany. 00:29:32.720 |
And here I've circled in purple, sorry blue, here are the blue ones, and here are the green ones. 00:29:38.960 |
So it's actually drawn a map of Germany, even though it never was told anything about how 00:29:46.920 |
far these states are away from each other or the very concept of geography didn't exist. 00:29:58.320 |
So I went ahead and looked -- here's another thing, I think this is also from their paper. 00:30:04.440 |
They took every pair of places and they looked at how far away they are on a map versus how 00:30:13.480 |
far away are they in embedding space, and they got this beautiful correlation. 00:30:20.600 |
So again, apparently stores that are nearby each other physically have similar characteristics 00:30:31.340 |
in terms of when people buy more or less stuff from them. 00:30:35.680 |
So I looked at the same thing for days of the week, so here's an embedding of the days 00:30:43.640 |
And I just joined up Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday. 00:30:47.080 |
I did the same thing for the months of the year. 00:30:57.160 |
So I think visualizing embeddings can be interesting. 00:31:03.040 |
It's good to first of all check you can see things you would expect to see, and then you 00:31:09.680 |
could try and see maybe things you didn't expect to see. 00:31:13.000 |
So you could try all kinds of clusterings or whatever. 00:31:20.680 |
And this is not something which has been widely studied at all, so I'm not going to tell you 00:31:25.720 |
what the limitations are of this technique or whatever. 00:31:29.200 |
I've heard of other ways to generate embeddings like skip-grams, I was wondering if you could 00:31:39.120 |
say whether one is better than the other -- using neural networks or skip-grams? 00:31:46.920 |
I'm not sure if we'll cover it in this course, but basically the original Word2vec approach 00:31:58.480 |
to generating embeddings was to say, we don't actually have a labeled dataset, all we have is a lot of unlabeled text. 00:32:14.200 |
And so they have an unsupervised learning problem, unlabeled problem. 00:32:18.200 |
And so the best way in my opinion to turn an unlabeled problem into a labeled problem is to invent some labels. 00:32:25.000 |
And so what they did in the Word2vec case was they said okay, here's a sentence with 00:32:29.720 |
11 words in it, and then they said okay, let's delete the middle word and replace it with some random word. 00:32:42.720 |
And so originally it said cat, and they said no, let's replace that with justice. 00:32:54.160 |
So before it said the cute little cat sat on the fuzzy mat, and now it says the cute little justice sat on the fuzzy mat. 00:33:02.480 |
And what they do is they do that so they have one sentence where they keep exactly as is, 00:33:12.200 |
and then they make a copy of it and they do the replacement. 00:33:16.080 |
And so then they have a label where they say it's a 1 if it was unchanged, the original, and a 0 if it was replaced. 00:33:26.640 |
And so basically then you now have something you can build a machine learning model on, 00:33:32.120 |
and so they went and built a machine learning model on this, so the model was: try and find the fake sentences. 00:33:40.760 |
Not because they were interested in a fake sentence finder, but because as a result they 00:33:44.960 |
now have embeddings that just like we discussed you can now use for other purposes. 00:33:50.600 |
Now it turns out that if you do this with effectively a single matrix multiply rather than 00:33:59.520 |
making a deep neural net, you can train this super quickly. 00:34:06.800 |
They decided we're going to make a pretty crappy model, like a shallow learning model rather than a deep one. 00:34:14.920 |
With the downside it's a less powerful model, but a number of upsides. 00:34:18.760 |
The first thing is we can train it on a really large dataset, and then also really importantly 00:34:23.720 |
we're going to end up with embeddings which have really very linear characteristics, so 00:34:29.800 |
we can add them together and subtract them and stuff like that. 00:34:35.840 |
So there's a lot of stuff we can learn about there for other types of embedding, like categorical 00:34:42.560 |
embeddings, specifically if we want categorical embeddings which we can kind of draw nicely 00:34:49.720 |
and expect us to be able to add and subtract them and behave linearly, probably if we want 00:34:55.840 |
to use them in k-nearest neighbors and stuff, we should probably use shallow learning. 00:35:02.580 |
If we want something that's going to be more predictive, we probably want to use a neural 00:35:08.840 |
And so actually in NLP, I'm really pushing the idea that we need to move past Word2vec 00:35:17.000 |
and GloVe, these linear-based methods, because it turns out that those embeddings are way 00:35:22.180 |
less predictive than embeddings learned from deep models. 00:35:26.240 |
And so the language model that we learned about which ended up getting a state-of-the-art 00:35:29.680 |
on sentiment analysis didn't use GloVe or Word2vec, but instead we pre-trained a deep 00:35:35.360 |
recurrent neural network, and we ended up with not just pre-trained word vectors but a full pre-trained model. 00:35:43.400 |
So it looks like to create embeddings for entities we need a dummy task, right? 00:35:51.040 |
Not necessarily a dummy task, like in this case we had a real task, right? 00:35:54.480 |
So we created the embeddings for Rossmann by trying to predict store sales. 00:36:02.120 |
This isn't just for learning embeddings, for learning any kind of feature space, you either 00:36:09.660 |
need labeled data, or you need to invent some kind of fake task. 00:36:16.840 |
So does a task matter, like if I choose a task and train embeddings, if I choose another 00:36:21.240 |
task and train embeddings, which one is better? 00:36:26.200 |
It's a great question, and it's not something that's been studied nearly enough, right? 00:36:30.800 |
I'm not sure that many people even quite understand that when they say unsupervised learning nowadays, 00:36:38.000 |
they almost always mean fake-task labeled learning. 00:36:44.280 |
And so the idea of what makes a good fake task, I don't know that I've seen a paper 00:36:49.600 |
on that, but intuitively, we need something where the kinds of relationships it's going 00:36:58.080 |
to learn are likely to be the kinds of relationships that you probably care about. 00:37:03.520 |
So for example, in computer vision, one kind of fake task people use is to say let's take 00:37:15.200 |
some images and use some kind of unreal and unreasonable data augmentation, like recolor 00:37:22.200 |
them too much or whatever, and then we'll ask the neural net to predict which one was 00:37:27.320 |
the augmented and which one was not the augmented. 00:37:36.680 |
I think it's a fascinating area, and one which would be really interesting for people, maybe 00:37:43.160 |
some of the students here to look into further, is take some interesting semi-supervised or 00:37:47.260 |
unsupervised data sets, try and come up with some more clever fake tasks, and see whether it matters. 00:37:58.800 |
In general, if you can't come up with a fake task that you think seems great, I would say 00:38:03.520 |
use the best one you can; it's often surprising how little you need. 00:38:10.120 |
The ultimately crappy fake task is called the autoencoder, and the autoencoder is the 00:38:17.240 |
thing which won the claims prediction competition that just finished on Kaggle. 00:38:22.720 |
They had lots of examples of insurance policies where we knew this was how much was claimed, 00:38:29.080 |
and then lots of examples of insurance policies where, I guess, they must have still been open. 00:38:36.600 |
So what they did was they said let's basically start off by grabbing every policy, and we'll 00:38:44.040 |
take a single policy and we'll put it through a neural net, and we'll try and have it reconstruct itself. 00:38:55.600 |
But in these intermediate layers, at least one of those intermediate layers, we'll make 00:38:59.680 |
sure there's less activations than there were inputs. 00:39:03.080 |
So let's say if there were 100 variables on the insurance policy, we'll have a layer in the middle with far fewer activations. 00:39:12.880 |
And so when you basically are saying hey, reconstruct your own input, it's not a different 00:39:18.000 |
kind of model, it doesn't require any special code, it's literally just passing, you can 00:39:24.160 |
use any standard PyTorch or fastai learner, you just say my output equals my input. 00:39:31.200 |
And that's the most uncreative, invented task you can create. 00:39:37.560 |
That's called an autoencoder, and it works surprisingly well -- in fact, to the point that 00:39:44.640 |
they took the features that it learned, chucked them into another neural net, and won. 00:39:53.800 |
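A minimal sketch of an autoencoder (not the actual winning model; the sizes are made up) with a bottleneck in the middle:

    import torch.nn as nn

    class AutoEncoder(nn.Module):
        def __init__(self, n_inp=100, n_bottleneck=5):
            super().__init__()
            # the middle layer has fewer activations than the input
            self.encoder = nn.Sequential(nn.Linear(n_inp, 30), nn.ReLU(),
                                         nn.Linear(30, n_bottleneck), nn.ReLU())
            self.decoder = nn.Sequential(nn.Linear(n_bottleneck, 30), nn.ReLU(),
                                         nn.Linear(30, n_inp))

        def forward(self, x):
            return self.decoder(self.encoder(x))

    # the training target is the input itself, e.g. loss = nn.MSELoss()(model(x), x)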
Maybe if we have enough students taking an interest in this, then we'll be able to cover 00:40:00.560 |
unsupervised learning in more detail in Part 2, especially given this Kaggle win. 00:40:06.440 |
I think this may be related to the previous question. 00:40:16.800 |
Is the language model, for example, trained on the arXiv data? 00:40:19.160 |
Is that useful at all on the movie reviews, the IMDB data? 00:40:26.000 |
I was just talking to Sebastian about this, Sebastian wrote about this this week, and 00:40:31.560 |
we thought we'd try and do some research on this in January. 00:40:37.120 |
We know that in computer vision, it's shockingly effective to train on cats and dogs, and use 00:40:44.320 |
that pre-trained network to do lung cancer diagnosis in CT scans. 00:40:49.680 |
In the NLP world, nobody much seems to have tried this, the NLP researchers I've spoken 00:40:55.600 |
to about this, other than Sebastian, assume that it wouldn't work, and they generally haven't tried it. 00:41:04.400 |
Since we're talking about Rossmann, I'll just mention that during the week, I was interested 00:41:13.080 |
to see how good this solution actually was, because I noticed that on the public leaderboard 00:41:20.340 |
it didn't look like it was going to be that great. 00:41:23.280 |
I also thought it would be good to see what does it actually take to use a test set properly 00:41:31.280 |
So if you have a look at Rossmann now, I've pushed some changes that actually run the 00:41:35.080 |
test set through as well, so you can get a sense of how to do this. 00:41:38.720 |
So you'll see basically every line appears twice, one for test and one for train when 00:41:45.840 |
we get the test, train, test, train, test, train. 00:41:48.840 |
Obviously you could do this in a lot fewer lines of code by putting all of the steps 00:41:53.120 |
into a method and then pass either the training data frame or the test data frame to it. 00:41:58.600 |
In this case, for teaching purposes you'd be able to see each step and experiment to 00:42:05.160 |
see what each step looks like, but you can certainly simplify this code. 00:42:12.040 |
So we do this for every data frame, and then for some of these you can see I kind of loop 00:42:17.520 |
through the data frames, joined and joined_test, for both train and test. 00:42:24.480 |
This whole thing about the durations, I basically put two lines here, one that says data frame 00:42:30.760 |
equals train columns, one that says data frame equals test columns. 00:42:34.640 |
And so my idea is you'd run this line first and then you would skip the next one and run 00:42:40.760 |
everything beneath it, and then you'd go back and run this line and then run everything beneath it again. 00:42:46.280 |
So some people on the forum were asking how come this code wasn't working this week, which 00:42:50.840 |
is a good reminder that the code is not designed to be code that you always run top to bottom 00:42:57.040 |
You're meant to think, what is this code here, should I be running it right now? 00:43:03.760 |
And so the early lessons I tried to make it so you can run it top to bottom, but increasingly 00:43:08.160 |
as we go along I kind of make it more and more that you actually have to think about what's going on. 00:43:13.920 |
So Jeremy, you're talking about shallow learning and deep learning, could you define that a bit more? 00:43:22.040 |
By shallow learning, I think I just mean anything that doesn't have a hidden layer. 00:43:25.760 |
So something that's like a dot product, a matrix multiply basically. 00:43:40.720 |
So we end up with a training and a test version, and then everything else is basically the same. 00:43:49.160 |
One thing to note, and a lot of the details of this we cover in the machine learning course 00:43:53.000 |
by the way, because it's not really deep learning specific, so check that out if you're interested 00:43:58.800 |
I should mention we use apply_cats rather than train_cats to make sure that the test 00:44:03.440 |
set and the training set have the same categorical codes that they join to. 00:44:12.720 |
We also need to make sure that we keep track of the mapper. 00:44:15.640 |
This is the thing which basically says what's the mean and standard deviation of each continuous 00:44:19.960 |
column and then apply that same mapper to the test set. 00:44:27.020 |
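As a rough sketch of that workflow using the fastai (0.7-era) structured-data functions, with argument names as I recall them, so double-check against your version of the library:

    apply_cats(joined_test, joined)      # reuse the training set's categorical codes on the test set

    df, y, nas, mapper = proc_df(joined, 'Sales', do_scale=True)
    df_test, _, nas, mapper = proc_df(joined_test, 'Sales', do_scale=True,
                                      mapper=mapper, na_dict=nas)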
And so when we do all that, that's basically it. 00:44:28.720 |
When the rest is easy, we just have to pass in the test data frame in the usual way when 00:44:34.560 |
we create our model data object, and then there's no changes through all here, we train 00:44:43.640 |
And then once we finish training it, we can then call predict as per usual passing in 00:44:54.880 |
true to say this is the test set rather than the validation set, and pass that off to Kaggle. 00:45:02.260 |
And so it was really interesting because this was my submission, it got a public score of 00:45:09.860 |
0.103, which would put us somewhere in the 300s on the leaderboard, which looks awful. 00:45:40.400 |
So if you're competing in a Kaggle competition and you haven't thoughtfully created a validation 00:45:47.520 |
set of your own and you're relying on public leaderboard feedback, this could totally happen 00:45:52.480 |
to you, but the other way round, you'll be like, "Oh, I'm in the top 10, I'm doing great!" 00:45:58.200 |
For example, at the moment, the icebergs competition, recognizing icebergs, a very large percentage 00:46:04.800 |
of the public leaderboard set is synthetically generated, augmented data. 00:46:13.040 |
And so your validation set is going to be much more helpful than the public leaderboard 00:46:22.120 |
So our final score here is kind of within statistical noise of the actual 3rd place getters, so I'm 00:46:28.280 |
pretty confident that we've captured their approach. 00:46:40.040 |
Something to mention, there's a nice kernel about Rossmann, quite a few nice kernels 00:46:44.600 |
actually, but you can go back and see, particularly if you're doing the groceries competition, 00:46:48.280 |
go and have a look at the Rossmann kernels, because actually quite a few of them are higher 00:46:51.400 |
quality than the ones for the Ecuadorian Groceries competition. 00:46:55.640 |
One of them, for example, showed how for particular stores, like Store 85, the sales for non-Sundays 00:47:04.320 |
and the sales for Sundays looked very different, whereas there are some other stores where 00:47:09.800 |
the sales on Sunday don't look any different, and you can kind of get a sense of why you might care. 00:47:15.800 |
The one I particularly wanted to point out is the one I think I briefly mentioned, the thing that 00:47:19.360 |
the 3rd place winners, whose approach we used, didn't notice: sales spike just before and after a store closure. 00:47:34.560 |
Just after, it's oh my god, we ran out of eggs. 00:47:39.320 |
And just before, it's oh my god, go and get the milk before the store closes. 00:47:48.300 |
So this 3rd place winner actually deleted all of the closed store rows before they started their analysis. 00:47:57.160 |
So remember how we talked about don't touch your data unless you first of all analyze 00:48:03.400 |
to see whether that thing you're doing is actually okay. 00:48:09.840 |
So in this case, I am sure, I haven't tried it, but I'm sure they would have won otherwise. 00:48:15.840 |
Although there weren't actually any store closures to my knowledge in the test set period, 00:48:21.760 |
the problem is that their model was trying to fit to these really extreme things, and 00:48:27.200 |
because it wasn't able to do it very well, it was going to end up getting a little bit 00:48:32.040 |
It's not going to break the model, but it's definitely going to harm it because it's kind 00:48:35.680 |
of trying to do computations to fit something which it literally doesn't have the data for. 00:48:44.600 |
So that Rossmann model, again, it's nice to kind of look inside to see what's actually going on. 00:48:55.240 |
And so for that Rossmann model, I want to make sure you kind of know how to find your way 00:49:04.080 |
around the code so you can answer these questions for yourself. 00:49:12.240 |
We started out by saying hey, if you want to look at the code for something, you can 00:49:16.520 |
go question mark, question mark like this, and I haven't got this read in, but you can 00:49:24.040 |
use question mark, question mark to get the source code for something. 00:49:29.840 |
But obviously that's not really a great way, because often you look at that source code 00:49:34.680 |
and it turns out you need to look at something else. 00:49:37.280 |
And so for those of you that haven't done much coding, you might not be aware that almost 00:49:42.040 |
certainly the editor you're using probably has the ability to both open up stuff directly 00:49:48.360 |
off SSH and to navigate through it so you can jump straight from place to place. 00:49:55.520 |
So if I want to find ColumnarModelData, and I happen to be using vim here, I can basically 00:50:01.000 |
say :tag ColumnarModelData and it will jump straight to the definition of that class. 00:50:09.040 |
And so then I notice here that it's actually building up a data loader. 00:50:14.320 |
If I hit Ctrl + right square bracket, it will jump to the definition of the thing that was 00:50:19.280 |
under my cursor, and after I finished reading it for a while, I can hit Ctrl + T to jump 00:50:24.760 |
back up to where I came from, and you kind of get the idea. 00:50:29.280 |
If I want to find every usage of ColumnarModelData in this file, I can hit * to 00:50:37.680 |
jump to the next place it's used, and so forth. 00:50:42.200 |
So in this case, get_learner was the thing which actually got the model. 00:50:50.760 |
We want to find out what kind of model it is, and apparently it uses ColumnarModelData's 00:51:05.480 |
get_learner, which uses -- and so here you can see MixedInputModel is the PyTorch 00:51:13.080 |
model, and then it wraps it in the StructuredLearner, which is the fastai Learner type, which wraps the two together. 00:51:24.660 |
So if we want to see the definition of this actual PyTorch model, I can go Ctrl + right square bracket again. 00:51:33.680 |
And so here is the model, and nearly all of this we can now understand. 00:52:02.720 |
In the mixed model that we saw, does it always expect categorical and continuous together? 00:52:15.920 |
And the model data behind the scenes, if there are none of the other type, creates a column of zeros. 00:52:34.600 |
It's kind of ugly and hacky, and we'll hopefully improve it, but you can pass in an empty list 00:52:40.500 |
of categorical or continuous variables to the model data, and it will basically pass 00:52:47.120 |
an unused column of 0s to avoid things breaking. 00:52:54.560 |
I'm leaving fixing some of these slightly hacky edge cases because PyTorch 0.4 as well 00:53:00.920 |
as getting rid of variables, they're going to also add rank 0 tensors, which is to say 00:53:07.480 |
if you grab a single thing out of a rank 1 tensor rather than getting back a number which 00:53:13.840 |
is qualitatively different, you're actually going to get back a tensor that just happens to have rank 0. 00:53:19.960 |
Now it turns out that a lot of this code is going to be much easier to write then, so 00:53:24.400 |
for now it's a little bit more hacky than it needs to be. 00:53:28.960 |
Jeremy, you talked about this a little bit before, but maybe it's a good time at some 00:53:35.360 |
point to talk about how we can write something that is slightly different from what is in the fastai library? 00:53:42.560 |
Yeah, I think we'll cover that a little bit next week, but I'm mainly going to do that in part 2. 00:53:49.920 |
Part 2 is going to cover quite a lot of stuff. 00:53:54.860 |
One of the main things we'll cover in part 2 is what are called generative models, so 00:53:58.480 |
things where the output is a whole sentence or a whole image, but I'll also dig into how 00:54:04.160 |
to really either customize the fastai library or use it on more custom models. 00:54:14.800 |
So if we have time, we'll touch on it a little bit next week. 00:54:20.680 |
So the learner, we were passing in a list of embedding sizes, and as you can see that 00:54:27.280 |
embedding sizes list was literally just the number of rows and the number of columns in each embedding matrix. 00:54:32.360 |
And the number of rows was just coming from literally how many stores are there in the 00:54:38.960 |
store category, for example, and the number of columns was just equal to that cardinality divided by two, capped at 50. 00:54:47.780 |
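A sketch of that rule of thumb, with variable names as I recall them from the notebook (cat_vars is the list of categorical column names, joined the training data frame): one row per category plus one for unknowns, and roughly half the cardinality of columns, capped at 50.

    cat_sz = [(c, len(joined[c].cat.categories) + 1) for c in cat_vars]
    emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]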
So that list of tuples was coming in, and so you can see here how we use it. 00:54:52.240 |
We go through each of those tuples, grab the number of categories and the size of the embedding, and create an embedding layer for each. 00:55:05.240 |
One minor thing, PyTorch-specific thing we haven't talked about before is for it to be 00:55:11.080 |
able to register these -- remember how we kind of said it registers your parameters, it registers the layers you define in the constructor. 00:55:18.280 |
So when we listed the model, it actually printed out the name of each embedding and each bias. 00:55:23.560 |
It can't do that if they're hidden inside a list. 00:55:27.000 |
They have to be an actual nn.module subclass. 00:55:33.440 |
So there's a special thing called an nn.ModuleList which takes a list, and it basically 00:55:38.600 |
says I want you to register everything in here as being part of this model, so that's what we use. 00:55:47.960 |
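A minimal sketch of that constructor pattern (not the full MixedInputModel):

    import torch.nn as nn

    class EmbSketch(nn.Module):
        def __init__(self, emb_szs):
            super().__init__()
            # nn.ModuleList registers every embedding with the model;
            # a plain Python list would leave them invisible to PyTorch
            self.embs = nn.ModuleList([nn.Embedding(c, s) for c, s in emb_szs])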
So our mixed-input model has a list of embeddings, and then I do the same thing for a list of linear layers. 00:55:57.440 |
So when I said here 1000, 500, this is saying how many activations I wanted for each of those layers. 00:56:09.240 |
And so here I just go through that list and create a linear layer that goes from this size to the next size. 00:56:18.920 |
So you can see how easy it is to construct not just your own model, but a model which 00:56:25.560 |
you can pass parameters to have it constructed on the fly dynamically. 00:56:33.200 |
This is initialization; we've mentioned Kaiming initialization before, we mentioned it last week. 00:56:44.440 |
We have here a list of how much dropout to apply to each layer. 00:56:48.360 |
So again here, let's just go through each thing in that list and create a dropout layer for it. 00:56:54.300 |
So this constructor, we understand everything in it except for BatchNorm, which we don't 00:57:00.060 |
have to worry about for now, so that's the constructor. 00:57:03.900 |
And so then the forward, also all stuff we're aware of, goes through each of those embedding 00:57:11.120 |
layers that we just saw, and remember we just treated it like it's a function. 00:57:15.480 |
So call it with the ith categorical variable, and then concatenate them all together, put 00:57:22.040 |
that through dropout, and then go through each one of our linear layers and call it, 00:57:31.080 |
apply ReLU to it, apply dropout to it, and then finally apply the final linear layer. 00:57:37.720 |
And the final linear layer has this as its size, which is here. 00:57:51.240 |
So we're kind of getting to the point where, and then of course at the end, I mentioned 00:57:56.400 |
we'd come back to this, if you passed in a y_range parameter, then we're going to do 00:58:01.920 |
the thing we just learned about last week, which is to use a sigmoid. 00:58:06.200 |
This is a cool little trick not just to make your collaborative filtering better, but in 00:58:11.040 |
this case my basic idea was sales are going to be greater than zero, and probably less than the biggest sale we've seen. 00:58:23.200 |
So I just pass in that as y_range, and so we do a sigmoid and multiply the sigmoid by that range. 00:58:38.800 |
So I actually said maybe the range is between 0 and the highest times 1.2, because maybe 00:58:47.280 |
the next two weeks we have one bigger, but this is again trying to make it a little bit 00:58:51.120 |
easier for it to give us the kind of results that it thinks is right. 00:58:56.160 |
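A small sketch of that output scaling, with a made-up range:

    import torch

    def scale_output(raw, y_range):
        # squash the raw activations with a sigmoid, then stretch into the target range
        lo, hi = y_range
        return torch.sigmoid(raw) * (hi - lo) + lo

    preds = scale_output(torch.randn(4, 1), (0., 40000. * 1.2))   # 40000 here is just an illustrative maximum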
So increasingly, I'd love you all to kind of try to not treat these learners and models 00:59:05.880 |
as black boxes, but to feel like you now have the information you need to look inside them. 00:59:11.320 |
And remember, you could then copy this class, paste it into a cell in a Jupyter 00:59:16.960 |
Notebook and start fiddling with it to create your own versions. 00:59:30.600 |
I think what I might do is we might take a bit of an early break because we've got a 00:59:35.440 |
lot to cover and I want to do it all in one big go. 00:59:38.720 |
So let's take a break until 7.45 and then we're going to come back and talk about recurrent neural networks. 00:59:59.360 |
Before we do, we're going to dig a little bit deeper into SGD, because I just want to 01:00:06.120 |
make sure everybody's totally comfortable with SGD. 01:00:11.200 |
And so what we're going to look at is we're going to look at a Lesson 6 SGD Notebook. 01:00:18.600 |
And we're going to look at a really simple example of using SGD to learn y=ax+b. 01:00:29.040 |
And so what we're going to do here is create the simplest possible model, y=ax+b. 01:00:38.280 |
And then we're going to generate some random data that looks like so. 01:00:44.320 |
So here's our x, and here's our y, we're going to predict y from x. 01:00:50.240 |
And we passed in 3 and 8 as our a and b, so we're going to try and recover that. 01:00:58.720 |
And so the idea is that if we can solve something like this, which has two parameters, we can 01:01:04.640 |
use the same technique to solve something with 100 million parameters without any changes 01:01:18.440 |
So in order to find an a and b that fits this, we need a loss function. 01:01:25.240 |
And this is a regression problem because we have a continuous output. 01:01:29.240 |
So for continuous output regression, we tend to use mean-squared error. 01:01:33.220 |
And obviously all of this stuff, there's implementations in NumPy, implementations in PyTorch, we're 01:01:37.680 |
just doing stuff by hand so you can see all the steps. 01:01:45.620 |
y-hat minus y, squared, then take the mean: there's our mean-squared error. 01:01:49.720 |
So for example, if we had 10 and 5 were our a and b, then there's our mean-squared error. 01:01:58.360 |
So if we've got an a and a b and we've got an x and a y, then our mean-squared error 01:02:01.540 |
loss is just the mean-squared error of our linear predictions and our y. 01:02:14.320 |
And so when we talk about combining linear layers and loss functions and optionally nonlinear 01:02:23.160 |
layers, this is all we're doing, we're putting a function inside a function. 01:02:30.280 |
I know people draw these clever-looking dots and lines all over the screen when they're 01:02:36.200 |
saying this is what a neural network is, but it's just a function of a function of a function. 01:02:40.640 |
So here we've got a prediction function being a linear layer, followed by a loss function 01:02:44.960 |
being MSE, and now we can say, oh, let's just define this as MSE loss and we'll use that 01:02:51.720 |
So there's our loss function, which incorporates our prediction function. 01:02:56.760 |
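Written out, the pieces look roughly along these lines (a sketch matching the structure described, not necessarily the notebook verbatim):

    import numpy as np

    def lin(a, b, x): return a * x + b                       # the y = ax + b model
    def mse(y_hat, y): return ((y_hat - y) ** 2).mean()      # mean squared error
    def mse_loss(a, b, x, y): return mse(lin(a, b, x), y)    # loss of the linear predictions

    x = np.random.uniform(0, 1, 50)
    y = lin(3., 8., x) + np.random.normal(0, 0.1, 50)        # fake data generated with a=3, b=8
    mse_loss(10., 5., x, y)                                  # loss if we guessed a=10, b=5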
So let's generate 10,000 items of fake data, and let's turn them into variables so we can 01:03:02.600 |
use them with PyTorch, because Jeremy doesn't like taking derivatives, so we're going to get PyTorch to do that for us. 01:03:08.480 |
And let's create a random weight for A and for B, so a single random number. 01:03:14.160 |
And we want the gradients of these to be calculated as we start computing with them, because these 01:03:19.160 |
are the actual things we need to update in our SGD. 01:03:30.840 |
So let's pick a learning rate, and let's do 10,000 epochs of SGD. 01:03:40.200 |
In fact, this isn't really SGD, it's not Stochastic Gradient Descent, this is actually full gradient 01:03:44.180 |
descent, each loop is going to look at all of the data. 01:03:51.600 |
Stochastic Gradient Descent would be looking at a subset each time. 01:03:56.540 |
So to do gradient descent, we basically calculate the loss. 01:04:00.400 |
So remember, we've started out with a random A and B, and so this is going to compute some 01:04:07.480 |
And it's nice from time to time, so one way of saying from time to time is if the epoch 01:04:12.760 |
number mod 1000 is 0, so every 1000 epochs, just print out the loss, see how we're doing. 01:04:21.560 |
So now that we've computed the loss, we can compute our gradients. 01:04:25.920 |
And so remember, this thing here is both a number, a single number that is our loss, something 01:04:32.840 |
we can print, but it's also a variable because we passed variables into it, and therefore 01:04:37.960 |
it also has a method .backward, which means calculate the gradients of everything that 01:04:43.880 |
we asked it to, everything that we said requires grad equals true. 01:04:48.640 |
So at this point, we now have a .grad property inside A and inside B, and here they are, 01:05:01.760 |
So now that we've calculated the gradients for A and B, we can update them by saying 01:05:06.560 |
A is equal to whatever it used to be minus the learning rate times the gradient. 01:05:14.240 |
Update data, because A is a variable, and a variable contains a tensor in its .data property, 01:05:22.720 |
and again this is going to disappear in PyTorch 0.4, but for now it's actually the tensor 01:05:30.360 |
So update the tensor inside here with whatever it used to be minus the learning rate times 01:05:38.680 |
And that's basically it, that's basically all gradient descent is. 01:05:47.320 |
There's one extra step in PyTorch, which is that you might have multiple different loss 01:05:53.400 |
functions or lots of output layers all contributing to the gradient, and you have to add them 01:06:01.760 |
And so if you've got multiple loss functions, you could be calling loss.backward on each 01:06:06.560 |
of them, and what it does is it adds it to the gradients. 01:06:10.640 |
And so you have to tell it when to set the gradients back to 0. 01:06:15.080 |
So that's where you just set A's gradients to 0 and set B's gradients to 0. 01:06:23.000 |
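Putting the whole loop together, roughly as described above (a sketch using the pre-0.4 Variable API that the lesson is based on):

    import torch
    from torch.autograd import Variable

    x = Variable(torch.rand(10000))
    y = 3 * x + 8 + 0.1 * Variable(torch.randn(10000))   # fake data with a=3, b=8
    a = Variable(torch.randn(1), requires_grad=True)
    b = Variable(torch.randn(1), requires_grad=True)
    lr = 1e-3

    for t in range(10000):
        loss = ((a * x + b - y) ** 2).mean()
        if t % 1000 == 0: print(loss.data)
        loss.backward()                 # fills in a.grad and b.grad
        a.data -= lr * a.grad.data
        b.data -= lr * b.grad.data
        a.grad.data.zero_()             # gradients accumulate, so reset them each step
        b.grad.data.zero_()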
And so this is wrapped up inside the optim.SGD class. 01:06:31.560 |
So when we use optim.SGD and we just say .step(), it's just doing these updates for us. 01:06:37.800 |
And so when we say .zero_grad(), it's just doing this for us. 01:06:42.140 |
And this underscore here, pretty much every function that applies to a tensor in PyTorch, 01:06:50.080 |
if you stick an underscore on the end, it means do it in place. 01:06:53.000 |
So this is actually going to not return a bunch of zeros, but it's going to change this tensor in place. 01:07:03.720 |
We can look at the same thing without PyTorch, which means we actually do have to do some calculus ourselves. 01:07:11.840 |
So if we generate some fake data, again, we're just going to create 50 data points this time 01:07:21.040 |
And so let's create a function called update, we're just going to use NumPy, no PyTorch. 01:07:26.840 |
So our predictions are equal to the linear function, and in this case we're actually going to calculate the derivatives by hand. 01:07:33.200 |
So the derivative of the squared loss is just 2 times the error, and the derivative with 01:07:38.280 |
respect to a just has an extra factor of x; you can confirm that yourself if you want to. 01:07:42.420 |
And so here we're going to update a minus equals learning rate times the derivative 01:07:47.960 |
of the loss with respect to a, and for b it's the learning rate times the derivative with respect to b. 01:07:56.120 |
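A sketch of that update step in pure NumPy: the derivative of (a*x + b - y)^2 with respect to b is 2*(a*x + b - y), and with respect to a there's an extra factor of x.

    def update(a, b, x, y, lr):
        y_hat = a * x + b
        dldb = 2 * (y_hat - y)          # d(loss)/db for each data point
        dlda = dldb * x                 # d(loss)/da for each data point
        return a - lr * dlda.mean(), b - lr * dldb.mean()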
And so what we can do -- let's just run all this. 01:08:01.640 |
So just for fun, rather than looping through manually, we can use matplotlib's FuncAnimation 01:08:08.320 |
command to run the animate function a bunch of times, and the animate function is going 01:08:15.920 |
to run 30 epochs, and at the end of each epoch it's going to print out on the plot where 01:08:22.920 |
the line currently is, and that creates this little movie. 01:08:27.840 |
So you can actually see the line moving into place. 01:08:32.520 |
So if you want to play around with understanding how PyTorch gradients actually work, step through this notebook. 01:08:46.560 |
And it's kind of weird to say that's it, like when you're optimizing 100 million parameters 01:08:54.200 |
in a neural net, it's doing the same thing, but it actually is. 01:08:58.200 |
You can actually look at the PyTorch code and see this is it, there's no trick. 01:09:05.200 |
Well we learned a couple of minor tricks last time, which was like momentum and Adam, but 01:09:12.760 |
if you can do it in Excel you can do it in Python. 01:09:16.680 |
So let's now talk about RNNs, so we're now in lesson 6 RNN notebook. 01:09:27.320 |
And we're going to study Nietzsche, as you should. 01:09:35.480 |
So Nietzsche says supposing that truth is a woman, apparently all philosophers have failed to understand women. 01:09:47.400 |
So apparently at the point that Nietzsche was alive, there were no female philosophers, 01:09:51.160 |
or at least those that were around didn't understand women either. 01:09:54.540 |
So anyway, this is the philosopher apparently we've chosen to study. 01:10:00.600 |
He's actually not as bad as people think, but it's a different era I guess. 01:10:07.240 |
So we're going to learn to write philosophy like Nietzsche. 01:10:14.320 |
And so we're going to do it one character at a time. 01:10:17.440 |
So this is like the language model that we did in lesson 4 where we did it a word at 01:10:21.240 |
a time, but this time we're going to do it a character at a time. 01:10:25.880 |
And so the main thing I'm going to try and convince you is that an RNN is no different 01:10:33.960 |
And so to show you that, we're going to build it from plain PyTorch layers, all of which 01:10:43.240 |
And eventually we're going to use something really complex, which is a for loop. 01:10:47.280 |
So that's when we're going to make it really sophisticated. 01:10:49.920 |
So the basic idea of RNNs is that you want to keep track of state over long-term dependencies. 01:11:00.720 |
So for example, if you're trying to model something like this template language, then 01:11:08.200 |
once you've opened a {% comment %} tag, you eventually need a closing {% endcomment %} tag. 01:11:13.800 |
And so somehow your model needs to keep track of the fact that it's inside a comment over 01:11:21.520 |
So this is this idea of state, it needs kind of memory. 01:11:25.280 |
And this is quite a difficult thing to do with just a ConvNet. 01:11:30.440 |
It turns out to be possible, but it's a little bit tricky. 01:11:36.520 |
Whereas with an RNN, it turns out to be pretty straightforward. 01:11:41.860 |
You want a stateful representation where you're keeping track of where are we now, have memory, 01:11:47.040 |
have long-term dependencies, and potentially even have variable length sequences -- these 01:11:53.160 |
are all difficult things to do with ConvNets -- they're very straightforward with RNNs. 01:11:59.040 |
So for example, SwiftKey a year or so ago did a blog post about how they had a new language 01:12:05.720 |
model where they basically said, of course this is what their neural net looks like. 01:12:14.200 |
Somehow they always looked like this on the internet. 01:12:17.400 |
You've got a bunch of words and it's basically going to take your particular words in their 01:12:21.440 |
particular orders and try and figure out what the next word's going to be, which is to say 01:12:27.120 |
They actually have a pretty good language model. 01:12:28.520 |
If you've used SwiftKey, they seem to do better predictions than anybody else still. 01:12:34.440 |
Another cool example was Andrej Karpathy a couple of years ago showed that he could use character 01:12:39.240 |
level RNN to actually create an entire LaTeX document. 01:12:44.720 |
So he didn't actually tell it in any way what LaTeX looks like, he just passed in some LaTeX 01:12:51.440 |
text like this and said generate more LaTeX text, and it literally started writing something 01:12:56.400 |
which means about as much to me as most math papers do. 01:13:03.960 |
So we're going to start with something that's not an RNN, and I've got to introduce Jeremy's 01:13:11.760 |
patented neural network notation involving boxes, circles and triangles. 01:13:24.000 |
A rectangle is an input, an arrow is a layer, a circle -- in fact every shape is a bunch of activations. 01:13:41.320 |
The rectangle is the input activations, the circle is the hidden activations, and a triangle 01:13:51.960 |
An arrow is a layer operation, or possibly more than one. 01:13:58.200 |
So here my rectangle is an input of number of rows equal to batch size and number of 01:14:04.600 |
columns equal to the number of inputs, number of variables. 01:14:08.900 |
And so my first arrow, my first operation, is going to represent a matrix product followed 01:14:14.480 |
by a ReLU, and that's going to generate a set of activations. 01:14:20.680 |
As for activations: an activation is a number, a number that's 01:14:28.000 |
being calculated by a ReLU or a matrix product or whatever, it's just a number. 01:14:34.740 |
So this circle here represents a matrix of activations. 01:14:39.480 |
All of the numbers that come out when we take the inputs, we do a matrix product followed 01:14:44.720 |
So we started with batch size by number of inputs, and so after we do this matrix operation, 01:14:49.920 |
we now have batch size by whatever the number of columns in our weight matrix was, which is the number of hidden activations. 01:15:00.380 |
And so if we now take these activations, which is a matrix, and we put it through another 01:15:05.480 |
operation, in this case another matrix product, and a softmax, we get a triangle that's our 01:15:10.560 |
output activations, another matrix of activations, and again, number of rows is batch size, number 01:15:16.160 |
of columns is equal to the number of classes, however many columns our matrix and this matrix 01:15:22.600 |
So that's a neural net, that's our basic one hidden layer neural net. 01:15:34.160 |
If you haven't written one of these from scratch, try it. 01:15:39.160 |
And in fact, in lessons 9, 10 and 11 of the machine learning course, we do this, we create 01:15:46.720 |
So if you're not quite sure how to do it, you can check out the machine learning course. 01:15:51.120 |
In general the machine learning course is much more like building stuff up from the foundations, 01:15:55.880 |
whereas this course is much more like best practices kind of top down. 01:16:02.800 |
So if we were doing a ConvNet with a single dense hidden layer, our input would be equal 01:16:09.240 |
to, as in PyTorch, number of channels by height by width, and notice that here batch size appeared 01:16:18.680 |
every time, so I'm not going to write it anymore. 01:16:26.160 |
Also the activation function, it's always basically ReLU or something similar for all 01:16:30.840 |
the hidden layers and softmax at the end for classification, so I'm not going to write 01:16:36.760 |
In each picture I'm going to simplify it a little bit. 01:16:40.640 |
So I'm not going to mention that batch size is still there, and we're not going to mention the ReLU and softmax either. 01:16:45.680 |
So here's our input, and so in this case rather than a matrix product, we'll do a convolution, 01:16:51.040 |
a stride 2 convolution, so we'll skip over every second one, or it could be a convolution followed by a max pool. 01:16:59.880 |
In either case, we end up with something which is replace number of channels with number of 01:17:04.320 |
filters, and we have now height divided by 2 and width divided by 2, and then we can flatten that out. 01:17:13.880 |
We'll talk next week about the main way we do that nowadays, which is basically to do 01:17:17.800 |
something called an adaptive max pooling, where we basically get an average across the 01:17:23.480 |
height and the width, and turn that into a vector. 01:17:27.800 |
Anyway, somehow we flatten it out into a vector, we can do a matrix product, or a couple of 01:17:34.240 |
matrix products, which is what we actually tend to do in fastai, so that'll be our fully connected layer with some number of activations. 01:17:41.640 |
The final matrix product gives us some number of classes. 01:17:45.920 |
So this is our basic component, remembering, rectangle is input, circle is hidden, triangle 01:17:52.840 |
is output, all of the shapes represent a tensor of activations, all of the arrows represent 01:18:05.080 |
So now let's jump to the first one that we're going to actually try to create for NLP. 01:18:12.880 |
And we're going to basically do exactly the same thing as here, and we're going to try 01:18:17.760 |
and predict the third character in a three-character sequence based on the previous two characters. 01:18:25.080 |
So our input, and again remember, we've removed the batch size dimension, we're not saying 01:18:34.240 |
it but it's still here, and also here I've removed the names of the layer operations 01:18:43.640 |
So for example, our first input would be the first character of each string in our mini-batch, 01:18:52.760 |
and assuming this is one-hot encoded, then the width is just however many items there 01:18:59.160 |
are in the vocabulary, how many unique characters could we have. 01:19:03.400 |
We probably won't really one-hot encode it, we'll feed it in as an integer and pretend 01:19:07.920 |
it's one-hot encoded by using an embedding layer, which is mathematically identical. 01:19:12.840 |
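To see why an embedding lookup is mathematically identical to one-hot encoding followed by a matrix product, here is a tiny check with made-up sizes:

    import torch
    import torch.nn as nn

    vocab_size, n_fac = 5, 3
    emb = nn.Embedding(vocab_size, n_fac)

    idx = torch.tensor([2])                        # a character fed in as an integer
    one_hot = torch.zeros(1, vocab_size)
    one_hot[0, 2] = 1.0                            # the same character, one-hot encoded

    looked_up = emb(idx)                           # grab row 2 of the embedding matrix
    multiplied = one_hot @ emb.weight              # multiply the one-hot vector by that matrix
    print(torch.allclose(looked_up, multiplied))   # True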
And then that's going to give us some activations which we can stick through a fully connected 01:19:18.080 |
layer, so we put that through a fully connected layer to get some activations. 01:19:27.760 |
We can then put that through another fully connected layer, and now we're going to bring 01:19:34.760 |
So the character 2 input will be exactly the same dimensionality as the character 1 input, 01:19:40.920 |
and we now need to somehow combine these two arrows together. 01:19:44.580 |
So we could just add them up, for instance, because remember this arrow here represents 01:19:51.320 |
a matrix product, so this matrix product is going to spit out the same dimensionality 01:19:57.820 |
So we could just add them up to create these activations. 01:20:02.440 |
And so now we can put that through another matrix product, and of course remember all 01:20:05.520 |
these matrix products have a value as well, and this final one will have a softmax instead 01:20:15.480 |
So it's a standard neural net with 2 hidden layers, or I guess it's actually 3 matrix products. 01:18:26.160 |
This first one is coming through an embedding layer. 01:20:29.160 |
The only difference is that we've also got a second input coming in here that we're just 01:20:34.840 |
adding in, but it's kind of conceptually identical. 01:20:46.320 |
So I'm not going to use torchtext, I'm going to try not to use almost any fastai, so we can see everything built up from scratch. 01:20:54.680 |
So here's the first 400 characters of the collected works. 01:20:59.160 |
Let's grab a set of all of the letters that we see there and sort them. 01:21:06.000 |
And so a set creates all the unique letters, so we've got 85 unique letters in our vocab. 01:21:14.180 |
It's nice to put an empty null or some kind of padding character in there for padding, 01:21:19.000 |
so we're going to put a padding character at the start. 01:21:31.880 |
So as per usual, we want some way to map every character to a unique ID and every unique 01:21:42.440 |
And so now we can just go through our collected works of Nietzsche and grab the index of each 01:21:49.340 |
one of those characters, so now we've just turned it into this. 01:22:08.780 |
And just to confirm, we can now take each of those indexes and turn them back into characters 01:22:14.880 |
and join them together, and yeah, there it is. 01:22:19.840 |
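Roughly, the mapping back and forth looks like this; the file path is illustrative:

    text = open('nietzsche.txt').read()
    chars = sorted(list(set(text)))            # every unique character that appears
    chars.insert(0, '\0')                      # reserve index 0 for a padding character
    vocab_size = len(chars)

    char_indices = {c: i for i, c in enumerate(chars)}   # character -> unique ID
    indices_char = {i: c for i, c in enumerate(chars)}   # unique ID -> character

    idx = [char_indices[c] for c in text]                # the whole corpus as integers
    print(''.join(indices_char[i] for i in idx[:70]))    # round-trips back to the original text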
So from now on we're just going to work with this IDX list, the list of character numbers 01:22:29.080 |
So Jeremy, why are we doing a model of characters and not a model of words? 01:22:38.800 |
With a vocab of 80-ish items, we can kind of see it better. 01:22:45.620 |
Character level models turn out to be potentially quite useful in a number of situations, but 01:22:53.640 |
The short answer is, you generally want to combine both a word level model and a character 01:22:58.800 |
level model, like if you're doing translation, it's a great way to deal with unusual words 01:23:06.060 |
Anytime you see a word you haven't seen before, you could use a character level model for 01:23:11.200 |
And there's actually something in between the two called a byte pair encoding, BPE, which 01:23:15.360 |
basically looks at little n-grams of characters, but we'll cover all that in Part 2. 01:23:23.000 |
If you want to look at it right now, then Part 2 of the existing course already has 01:23:30.280 |
And Part 2 of the version 1 of this course, all the NLP stuff is in PyTorch, by the way, 01:23:41.000 |
It was actually the thing that inspired us to move to PyTorch, because trying to do it in our previous framework was so painful. 01:23:52.480 |
We're actually going to do something slightly different to what I said. 01:23:54.560 |
We're actually going to try and predict the fourth character using the first 3, so the 01:24:01.680 |
index 3 character using the index 0, 1, and 2. 01:24:05.720 |
So we're going to do exactly the same thing, but with just a couple more layers. 01:24:10.060 |
So that means that we need a list of the 0th, 1st, 2nd, and 3rd characters, so I'm just 01:24:19.480 |
getting every character starting from 0, from 1, from 2, and from 3, skipping over 3 at a time. 01:24:29.200 |
So we're going to predict the 4th character from the first 3. 01:24:52.540 |
So we can just use np.stack to pop them together. 01:24:58.140 |
So here's the 0, 1, and 2 characters that are going to feed into our model, and then here's the character we want to predict. 01:25:22.620 |
So you can see for example, the very first item would be 40, 42, and 29, so that's characters 01:25:37.700 |
And then we'd be predicting 30, that's the 4th character, which is the start of the next row. 01:25:46.340 |
So 30, 25, 27, we need to predict 29, which is the start of the next row, and so forth. 01:25:54.420 |
So we're always using 3 characters to predict the 4th. 01:25:59.980 |
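Building those inputs and labels could look roughly like this, assuming the idx list from above:

    import numpy as np

    cs = 3   # use three characters to predict the fourth
    c1 = [idx[i]     for i in range(0, len(idx) - cs, cs)]
    c2 = [idx[i + 1] for i in range(0, len(idx) - cs, cs)]
    c3 = [idx[i + 2] for i in range(0, len(idx) - cs, cs)]
    c4 = [idx[i + 3] for i in range(0, len(idx) - cs, cs)]

    x1, x2, x3 = np.stack(c1), np.stack(c2), np.stack(c3)   # the three input columns
    y = np.stack(c4)                                        # the character we want to predict
    print(x1[:2], x2[:2], x3[:2], y[:2])                    # e.g. a row like (40, 42, 29) -> 30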
So there are 200,000 of these that we're going to try and model. 01:26:07.300 |
So we're going to build this model, which means we need to decide how many activations. 01:26:15.780 |
So I'm going to use 256, and we need to decide how big our embeddings are going to be. 01:26:21.660 |
And so I decided to use 42, so about half the number of characters I have. 01:26:27.180 |
And you can play around with these to see if you can come up with better numbers, it's 01:26:37.260 |
And so here is the full version, so predicting character 4 using characters 1, 2, and 3. 01:26:43.940 |
As you can see, it's the same picture as the previous page, but I put some very important 01:26:50.180 |
All the arrows of the same color are going to use the same matrix, the same weight matrix. 01:26:57.740 |
So all of our input embeddings are going to use the same matrix. 01:27:03.220 |
All of our layers that go from one layer to the next are going to use the same orange 01:27:09.520 |
arrow weight matrix, and then our output will have its own matrix. 01:27:15.020 |
So we're going to have 1, 2, 3 weight matrices. 01:27:20.260 |
And the idea here is the reason I'm not going to have a separate one for everything here 01:27:24.880 |
is that why would semantically a character have a different meaning depending if it was 01:27:30.860 |
the first or the second or the third item in a sequence? 01:27:33.940 |
It's not like we're even starting every sequence at the start of a sentence, we just arbitrarily chopped the text up. 01:27:39.900 |
So you would expect these to all have the same kind of conceptual mapping. 01:27:44.620 |
And when we're moving from character_0 to character_1, to kind of say build up some 01:27:49.780 |
state here, why would that be any different kind of operation to moving from character 1 to character 2? 01:27:57.940 |
So let's create a 3-character model, and so we're going to create one linear layer for 01:28:05.140 |
our green arrow, one linear layer for our orange arrow, and one linear layer for our output arrow. 01:28:15.860 |
So the embedding is going to bring in something of size, whatever it was, 84, and spit out 01:28:22.180 |
something with a number of factors in the embedding, we'll then put that through a linear 01:28:26.980 |
layer, and then we've got our hidden layers, we've got our output layer. 01:28:31.820 |
So when we call forward, we're going to be passing in 1, 2, 3 characters. 01:28:39.300 |
So for each one, we'll stick it through an embedding, we'll stick it through a linear layer and a ReLU. 01:28:45.660 |
So we'll do it for character_1, character_2, and character_3. 01:28:53.060 |
Then I'm going to create this circle of activations 01:29:05.620 |
here, and that matrix I'm going to call h, so it's going to be equal to my input activations 01:29:16.620 |
after going through the embedding, the linear layer and the ReLU, and then I'm going 01:29:21.380 |
to apply this l_hidden, so the orange arrow, and that's going to get me to here. 01:29:33.100 |
And then to get to the next one, I need to apply the same thing, and apply the orange 01:29:39.820 |
But I also have to add in this second input, so take my second input and add in my previous 01:29:52.500 |
I don't really see how these dimensions are the same, for h and in2. 01:30:11.260 |
self.e is going to give us something of length 42, and then it's going to go through l_in, which maps it to size n_hidden. 01:30:24.820 |
And so then we're going to pass that, which is now size nhidden, through this, which is 01:30:33.220 |
also going to return something of size nhidden. 01:30:36.500 |
So it's really important to notice that this is square, this is a square weight matrix. 01:30:42.600 |
So we now know that this is of size nhidden, so in2 is going to be exactly the same size as h. 01:30:50.000 |
So we can now sum together two sets of activations, both of size nhidden, passing it into here, 01:30:58.180 |
and again it returns something of size nhidden. 01:31:00.260 |
So basically the trick was to make this a square matrix, and to make sure that its square 01:31:04.500 |
matrix was the same size as the output of this hidden layer. 01:31:13.820 |
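Putting those pieces together, a sketch of the model being described might look like this; it is close to, but not necessarily identical to, the notebook's version:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Char3Model(nn.Module):
        def __init__(self, vocab_size, n_fac, n_hidden):
            super().__init__()
            self.e = nn.Embedding(vocab_size, n_fac)        # green arrow: character -> features
            self.l_in = nn.Linear(n_fac, n_hidden)
            self.l_hidden = nn.Linear(n_hidden, n_hidden)   # orange arrow: the square matrix
            self.l_out = nn.Linear(n_hidden, vocab_size)    # blue arrow: output layer

        def forward(self, c1, c2, c3):
            in1 = F.relu(self.l_in(self.e(c1)))
            in2 = F.relu(self.l_in(self.e(c2)))
            in3 = F.relu(self.l_in(self.e(c3)))
            h = torch.tanh(self.l_hidden(in1))              # first hidden state
            h = torch.tanh(self.l_hidden(h + in2))          # add in the second character
            h = torch.tanh(self.l_hidden(h + in3))          # and the third
            return F.log_softmax(self.l_out(h), dim=-1)

Because l_hidden maps n_hidden to n_hidden, h and each input activation have the same shape, which is what makes the addition work.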
I don't like it when I have three bits of code that look identical, and then three bits 01:31:30.460 |
of code that look nearly identical but aren't quite because it's harder to refactor. 01:31:35.040 |
So I'm going to make h into a bunch of zeros, so that I can then put h here too, and then these three lines become identical. 01:31:48.520 |
So the hugely complex trick that we're going to do very shortly is to replace these three lines with a for loop. 01:32:02.260 |
That's going to be the for loop, or actually 0, 1 and 2. 01:32:05.180 |
At that point we'll be able to call it a recurrent neural network. 01:32:11.700 |
So we create that model, so we can now just use the same columnar model data class that 01:32:26.660 |
we've used before, and if we use from_arrays, then it's basically going to spit back the exact arrays we give it. 01:32:34.420 |
So if we stack together those three arrays, then it's going to feed us those three things to our forward method. 01:32:41.660 |
So if you want to play around with training models using as raw an approach as possible 01:32:51.100 |
but without writing lots of boilerplate, this is kind of how to do it. 01:32:54.440 |
Use ColumnarModelData.from_arrays, and then whatever you pass in here, you'll get back in your forward method. 01:33:06.800 |
So I've passed in three things, which means I'm going to get sent three things. 01:33:14.540 |
Batch size 512, because this data is tiny so I can use a bigger batch size. 01:33:20.440 |
So I'm not using really much fast.ai stuff at all, I'm using fast.ai stuff just to save 01:33:26.300 |
me fiddling around with data loaders and data sets and stuff, but I'm actually going to create 01:33:30.180 |
a standard PyTorch model, I'm not going to create a learner. 01:33:33.860 |
So this is a standard PyTorch model, and because I'm using PyTorch, that means I have to remember to move it onto the GPU with .cuda(). 01:33:43.660 |
So here is how we can look inside at what's going on. 01:33:50.700 |
So we can say iter(md.trn_dl), the training data loader, to grab the iterator to iterate through the training set. 01:33:58.160 |
We can then call next on that to grab a mini-batch, and that's going to return all of our x's and 01:34:04.700 |
our y tensor, and so we can then take a look at x's, for example. 01:34:14.260 |
And so you would expect, have a think about what you would expect for this length, 3, not 01:34:21.460 |
surprisingly because these are the three things. 01:34:24.940 |
And so then xs0, not surprisingly, is of length 512, and it's not actually one hot encoded 01:34:37.860 |
because we're using embedding to pretend it is. 01:34:41.180 |
And so then we can use the model as if it's a function by passing to it the Variable-ized versions of those tensors. 01:34:50.760 |
And so have a think about what you would expect to be returned here. 01:34:55.940 |
So not surprisingly, we had a mini-batch of 512, so we still have 512. 01:35:00.240 |
And then 85 is the probability of each of the possible vocab items, and of course we've 01:35:05.380 |
got the log of them, because that's kind of what we do in PyTorch. 01:35:11.560 |
So that's how you can look inside, so you can see here how to do everything really very 01:35:19.020 |
So we can create an optimizer, again using standard PyTorch. 01:35:23.160 |
So with PyTorch, when you use a PyTorch optimizer, you have to pass in a list of the things to 01:35:28.300 |
optimize, and so if you call m.parameters, that will return that list for you. 01:35:41.020 |
And so we don't have learning rate finders and SGDR and all that stuff because we're 01:35:46.460 |
not using a learner, so we'll have to manually do learning rate annealing, setting the learning rate down ourselves as we go. 01:35:58.380 |
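In plain PyTorch, creating the optimizer and annealing the learning rate by hand could look like this; the lecture itself uses fastai's fit and learning-rate helpers, and vocab_size, n_fac and n_hidden are the sizes mentioned above:

    import torch.optim as optim

    m = Char3Model(vocab_size, n_fac=42, n_hidden=256)
    opt = optim.Adam(m.parameters(), lr=1e-2)   # pass in the list of things to optimize

    # ... train for a while, then turn the learning rate down manually:
    for param_group in opt.param_groups:
        param_group['lr'] = 1e-3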
And so now we can write a little function to test this thing out. 01:36:04.380 |
So here's something called get_next, where we can pass in 3 characters, like 'y. ', and so 01:36:17.600 |
I can then go through and turn that into a tensor of an array of the character index 01:36:27.260 |
So basically turn those into the integers, variables, pass that to our model, and then 01:36:34.740 |
we can do an argmax on that to grab which character number is it. 01:36:40.260 |
And in order to do stuff in NumPy land, I use to_np to turn that variable into a NumPy array. 01:36:46.500 |
And then I can return that character, and so for example a capital T is what it thinks 01:36:51.040 |
would be reasonable after seeing 'y', full stop, space; that seems like a very reasonable way to continue. 01:36:58.420 |
Given 'ppl' it predicts 'e', that sounds reasonable; given space-t-h it predicts 'e', that sounds reasonable; and given 'and' it predicts a space. 01:37:06.060 |
So it seems to have created something sensible. 01:37:11.020 |
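A rough paraphrase of that test function, using the chars and char_indices built earlier and skipping the Variable wrapping that the 0.3-era notebook needs:

    import numpy as np
    import torch

    def get_next(inp):
        idxs = np.array([char_indices[c] for c in inp])      # characters -> integers
        t = torch.from_numpy(idxs)
        p = m(*(t[i][None] for i in range(len(inp))))        # call the model like a function
        i = int(p.argmax(dim=-1))                            # index of the most likely next character
        return chars[i]

    print(get_next('y. '))   # e.g. a capital letter, which seems reasonable after a full stop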
So the important thing to note here is our character model is a totally standard fully connected network. 01:37:22.780 |
The only slightly interesting thing we did was to do this addition of each of the inputs 01:37:34.420 |
But there's nothing new conceptually here, we're training it in the usual way. 01:37:50.460 |
So an RNN is when we do exactly the same thing that we did here, but I could draw this more 01:38:04.900 |
simply by saying, you know what, if we've got a green arrow going to a circle, let's 01:38:09.940 |
not draw a green arrow going to a circle again and again and again, but let's just draw it 01:38:15.900 |
So this is exactly the same picture as this one. 01:38:32.300 |
And so you just have to say how many times to go around this circle. 01:38:35.420 |
So in this case, if we want to predict character number n from characters 1 through n-1, then 01:38:40.660 |
we can take the character 1 input, get some activations, feed that to some new activations 01:38:45.540 |
that go through, remember orange is the hidden to hidden weight matrix, and each time we'll 01:38:51.460 |
also bring in the next character of input through its embeddings. 01:38:57.060 |
So that picture and that picture are two ways of writing the same thing. 01:39:03.820 |
But this one is more flexible because rather than me having to say let's do it for 8, I 01:39:07.860 |
don't have to draw 8 circles, I can just say, oh just repeat this. 01:39:16.220 |
So I could simplify this a little bit further by saying, you know what, rather than having 01:39:21.820 |
this thing as a special case, let's actually start out with a bunch of zeros and then let's bring this first input inside the loop as well. 01:39:37.980 |
So I was wondering, if you can explain a little bit better, why are you reusing those... 01:39:47.780 |
You kind of seem to be reusing the same weight matrices. 01:39:53.500 |
Maybe this is kind of similar to what we did in convolutional units, like somehow... 01:39:58.300 |
No, I don't think so, at least not that I can see. 01:40:02.060 |
So the idea is just kind of semantically speaking, like this arrow here is saying take a character 01:40:17.340 |
of input and represent it as some set of features. 01:40:24.340 |
And this arrow is saying the same thing, take some character and represent it as a set of 01:40:29.380 |
So why would the 3 be represented with different weight matrices? 01:40:36.420 |
And this orange arrow is saying transition from character 0's state to character 1's state 01:40:49.740 |
Why would the transition from character 0 to 1 be different from character 1 to 2? 01:40:55.140 |
So the idea is to say, hey, if it's doing the same conceptual thing, let's use the exact same weight matrix. 01:41:07.260 |
My comment on convolutional neural networks is that a filter also can apply to multiple parts of the image. 01:41:15.380 |
So you're saying a convolution is almost like a kind of a special dot product with shared weights. 01:41:24.420 |
And in fact, one of our students actually wrote a good blog post about that last year. 01:41:29.940 |
Okay, I totally see where you're coming from and I totally agree with you. 01:41:40.940 |
So this time we're going to do 8 characters, 8 c's. 01:41:47.900 |
And so let's create a list of every 8th character from 0 through 7, and then our outputs will 01:41:55.020 |
be the next character, and so we can stack that together. 01:42:08.740 |
So for example, after this series of 8 characters, so this is characters 0 through 8, this is 01:42:17.860 |
characters 1 through 9, this is 2 through 10, these are all overlapping. 01:42:24.060 |
So after characters 1, 0 through 8, this is going to be the next one. 01:42:29.460 |
And then after these characters, this will be the next one. 01:42:32.820 |
So you can see that this one here has 43 as its y value, because after those, the next character is 43. 01:42:44.740 |
So this is the first 8 characters, this is 2 through 9, 3 through 10, and so forth. 01:42:51.060 |
So these are overlapping groups of 8 characters, and then this is the next one along. 01:43:06.320 |
So again, we use fromArrays to create a model data class. 01:43:10.540 |
And so you'll see here we have exactly the same code as we had before. 01:43:14.980 |
Here's our embedding, linear, hidden, output, these are literally identical. 01:43:21.780 |
And then we've replaced our ReLU of the linear layer of the embedding with something that's 01:43:28.380 |
inside a loop, and then we've replaced the self.l_hidden thing, also inside the loop. 01:43:44.180 |
I just realized I didn't mention last time the use of the hyperbolic tan. 01:43:49.780 |
Hyperbolic tan looks like this, so it's just a sigmoid that's offset. 01:44:00.220 |
And it's very common to use a hyperbolic tan inside this state-to-state transition because 01:44:06.700 |
it kind of stops it from flying off too high or too low. 01:44:14.020 |
Back in the old days, we used to use hyperbolic tan or the equivalent sigmoid a lot as most of our activation functions. 01:44:23.540 |
Nowadays we tend to use ReLU, but in these hidden-state transition matrices, we still tend to use tanh. 01:44:35.580 |
So you'll see I've done that also here, hyperbolic tan. 01:44:41.900 |
So this is exactly the same as before, but I've just replaced it with a for loop. 01:44:49.740 |
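Refactored with the for loop, a sketch of that model could look like this; torch.tanh is the hyperbolic tan just mentioned, and the structure mirrors the earlier sketch:

    import torch, torch.nn as nn, torch.nn.functional as F

    class CharLoopModel(nn.Module):
        def __init__(self, vocab_size, n_fac, n_hidden):
            super().__init__()
            self.n_hidden = n_hidden
            self.e = nn.Embedding(vocab_size, n_fac)
            self.l_in = nn.Linear(n_fac, n_hidden)
            self.l_hidden = nn.Linear(n_hidden, n_hidden)
            self.l_out = nn.Linear(n_hidden, vocab_size)

        def forward(self, *cs):                         # now takes 8 characters
            bs = cs[0].size(0)
            h = torch.zeros(bs, self.n_hidden)          # start the hidden state at zeros
            for c in cs:
                inp = F.relu(self.l_in(self.e(c)))
                h = torch.tanh(self.l_hidden(h + inp))  # add the input state to the hidden state
            return F.log_softmax(self.l_out(h), dim=-1)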
So does it have to do anything with convergence of these networks? 01:44:56.900 |
We'll talk about that a little bit over time. 01:45:02.900 |
For now, we're not really going to do anything special at all, recognizing this is just a 01:45:11.380 |
Mainly it's quite a deep one, because this is actually this, but we've got 8 of these 01:45:19.560 |
things now, we've now got a deep 8-layer network, which is why Yannet is starting to suggest we might have some trouble. 01:45:26.900 |
As we get deeper and deeper networks, they can be harder and harder to train. 01:45:41.140 |
As before, we've got a batch size of 512, we're using Adam, and away it goes. 01:45:49.220 |
So we won't sit there watching it, so we can then set the learning rate back down to 01:45:54.300 |
1e-3, we can fit it again, and it seems to be training fine. 01:46:04.500 |
But we're going to try something else, which is we're going to use the trick that Yannette 01:46:08.660 |
rather hinted at before, which is maybe we shouldn't be adding these things together. 01:46:13.780 |
And so the reason you might want to be feeling a little uncomfortable about adding these 01:46:17.940 |
things together is that the input state and the hidden state are kind of qualitatively 01:46:28.340 |
The input state is the encoding of this character, whereas h represents the encoding of the series 01:46:37.780 |
And so adding them together is potentially going to lose information. 01:46:43.060 |
So I think what Yannette was going to prefer that we might do is maybe to concatenate these 01:46:51.060 |
So let's now make a copy of the previous cell, all the same, but rather than using +, let's use concatenation. 01:47:01.380 |
Now if we concat, then we need to make sure now that our input layer is not from n_fac 01:47:09.740 |
to hidden, which is what we had before, but because we're concatenating, it needs to be from n_fac plus n_hidden. 01:47:19.140 |
And so now that's going to make all the dimensions work nicely. 01:47:29.380 |
This now makes it back to size nhidden again, and then this is putting it through the same 01:47:35.240 |
square matrix as before so it's still of size nhidden. 01:47:39.640 |
So this is like a good design heuristic if you're designing an architecture is if you've 01:47:46.500 |
got different types of information that you want to combine, you generally want to concatenate 01:47:51.980 |
it, adding things together, even if they're the same shape, is losing information. 01:48:00.660 |
And so once you've concatenated things together, you can always convert it back down to a fixed 01:48:06.380 |
size by just chucking it through a matrix product. 01:48:11.980 |
It's the same thing, but now we're concatenating instead. 01:48:17.280 |
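A sketch of that concatenating version; the only changes from the loop model above are the size of l_in and the torch.cat:

    import torch, torch.nn as nn, torch.nn.functional as F

    class CharLoopConcatModel(nn.Module):
        def __init__(self, vocab_size, n_fac, n_hidden):
            super().__init__()
            self.n_hidden = n_hidden
            self.e = nn.Embedding(vocab_size, n_fac)
            # concatenating, so the input layer now goes from n_fac + n_hidden down to n_hidden
            self.l_in = nn.Linear(n_fac + n_hidden, n_hidden)
            self.l_hidden = nn.Linear(n_hidden, n_hidden)
            self.l_out = nn.Linear(n_hidden, vocab_size)

        def forward(self, *cs):
            bs = cs[0].size(0)
            h = torch.zeros(bs, self.n_hidden)
            for c in cs:
                inp = torch.cat((h, self.e(c)), dim=1)   # keep the two kinds of state separate
                inp = F.relu(self.l_in(inp))
                h = torch.tanh(self.l_hidden(inp))
            return F.log_softmax(self.l_out(h), dim=-1)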
And so we can fit that, and so last time we got 1.72, this time we got 1.68. 01:48:24.640 |
So it's not setting the world on fire, but it's an improvement, and the improvement's 01:48:33.220 |
And so now we can pass in 8 characters, like 'for thos', and the predictions look good. 01:48:52.660 |
Let's see if PyTorch can do some of this for us. 01:48:55.700 |
And so basically what PyTorch will do for us is it will write this loop automatically, 01:49:03.620 |
and it will create these linear input layers automatically. 01:49:08.420 |
And so to ask it to do that, we can use the nn.RNN class. 01:49:14.740 |
So here's the exact same thing in less code by taking advantage of PyTorch. 01:49:20.660 |
And again, I'm not using a conceptual analogy to say PyTorch is doing something like this, it is doing exactly this. 01:49:28.260 |
This is just the code you just saw wrapped up a little bit, refactored a little bit for 01:49:34.900 |
So when we say we now want to create an RNN, called rnn, then what this does is it does that for loop for us. 01:49:44.660 |
Now notice that our for loop needed a starting point. 01:49:51.820 |
Because otherwise our for loop didn't quite work, we couldn't quite refactor it out. 01:49:55.420 |
And because this is exactly the same, this needs a starting point too. 01:49:59.660 |
So let's give it a starting point and so you have to pass in your initial hidden state. 01:50:05.940 |
For reasons that will become apparent later on, it turns out to be quite useful to be 01:50:14.820 |
able to get back that hidden state at the end. 01:50:19.540 |
And just like we could here, we could actually keep track of the hidden state. 01:50:25.460 |
We get back both the output and the hidden state. 01:50:29.320 |
So we pass in the input and the hidden state and we get back the output and the hidden 01:50:35.640 |
So it's the orange circle ellipse of activations, and so it is of size 256. 01:51:01.520 |
So there's one other thing to know, which is in our case we were replacing h with a new hidden state each time through the loop. 01:51:15.140 |
The one minor difference in PyTorch is they append the new hidden state to a list, or rather stack them up into a tensor. 01:51:24.340 |
So they actually give you back all of the hidden states, so in other words, rather than 01:51:27.940 |
just giving you back the final ellipse, they give you back all of the ellipses stacked 01:51:34.060 |
And so because we just want the final one, I just got indexed into it with -1. 01:51:38.740 |
Other than that, this is the same code as before. 01:51:43.700 |
We put that through our output layer to get the correct vocab size, and then we can train it as before. 01:52:01.060 |
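Letting nn.RNN write the loop for us, a sketch of the refactored model could look like this:

    import torch, torch.nn as nn, torch.nn.functional as F

    class CharRnn(nn.Module):
        def __init__(self, vocab_size, n_fac, n_hidden):
            super().__init__()
            self.n_hidden = n_hidden
            self.e = nn.Embedding(vocab_size, n_fac)
            self.rnn = nn.RNN(n_fac, n_hidden)            # PyTorch writes the loop and the input layer for us
            self.l_out = nn.Linear(n_hidden, vocab_size)

        def forward(self, *cs):
            bs = cs[0].size(0)
            h = torch.zeros(1, bs, self.n_hidden)         # initial hidden state; note the extra unit axis
            inp = self.e(torch.stack(cs))                 # seq_len x batch_size x n_fac
            outp, h = self.rnn(inp, h)                    # outp holds the hidden state at every time step
            return F.log_softmax(self.l_out(outp[-1]), dim=-1)   # we only want the final one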
So you can see here I can do it manually, I can create some hidden state, I can pass 01:52:02.060 |
it to that RNN, I can see the stuff I get back. 01:52:06.100 |
You'll see that the dimensionality of h, it's actually a rank 3 tensor, whereas in my version it was just rank 2. 01:52:23.660 |
And the difference is here we've got just a unit axis at the front. 01:52:28.380 |
We'll learn more about why that is later, but basically it turns out you can have a 01:52:32.620 |
second RNN that goes backwards, one that goes forwards, one that goes backwards, and the 01:52:37.700 |
idea is it's going to be better at finding relationships that kind of go backwards. 01:52:45.740 |
Also it turns out you can have an RNN feed to an RNN, that's called a multi-layer RNN. 01:52:50.060 |
So basically if you have those things, you need an additional axis on your tensor to 01:52:55.700 |
keep track of those additional layers of hidden state. 01:52:58.620 |
But for now, we'll always have a 1 here, and we'll always also get back a 1 at the end. 01:53:09.180 |
So if we go ahead and fit this now, let's actually train it for a bit longer. 01:53:19.500 |
This time we'll do 4 epochs at 1e-3, and then we'll do another 2 epochs at 1e-4. 01:53:30.120 |
And so we've now got our loss down to 1.5, so getting better and better. 01:53:38.460 |
So here's our getNext again, and let's just do the same thing. 01:53:44.580 |
So what we can now do is we can loop through 40 times, calling getNext each time, and then 01:53:51.900 |
each time we'll replace that input by removing the first character and adding the thing that 01:53:58.300 |
And so that way we can feed in a new set of 8 characters again and again and again. 01:54:03.660 |
And so that way we'll call that getNext in, so here are 40 characters that we've generated. 01:54:09.780 |
So we started out with 'for thos', and we got 'for thos of the same, to the same, to the same'. 01:54:16.980 |
You can probably guess what happens if you keep predicting the same to the same. 01:54:25.100 |
We now have something which we've basically built from scratch, and then we've seen that PyTorch provides the same thing for us. 01:54:38.540 |
So if you want to have an interesting little homework assignment this week, try to write your own version of nn.RNN. 01:54:48.820 |
Try to literally create your Jeremy's RNN, and then type in here Jeremy's RNN, or in your 01:54:58.180 |
case maybe your name's not Jeremy, which is OK too, and then get it to run, writing your 01:55:04.260 |
implementation of that class from scratch without looking at the PyTorch source code. 01:55:09.940 |
Basically it's just a case of going up and seeing what we did back here, make sure you 01:55:15.300 |
get the same answers, and confirm that you do. 01:55:18.620 |
So that's kind of a good little test, very simple little assignment, but I think you'll 01:55:23.940 |
feel really good when you've seen "Oh, I've just re-implemented nn.RNN." 01:55:36.420 |
When I switched from this one, when I've moved the char 1 input inside the dotted line, this 01:55:41.020 |
dotted rectangle represents the thing I'm repeating. 01:55:44.580 |
I also, watch the triangle, the output, I move that inside as well. 01:55:50.700 |
Now that's a big difference because now what I've actually done is I'm actually saying 01:55:57.940 |
spit out an output after every one of these circles. 01:56:03.380 |
So spit out an output here, and here, and here. 01:56:08.620 |
So in other words, if I have a 3-character input, I'm going to spit out a 3-character prediction. 01:56:13.900 |
I'm saying after character 1, this will be next, after character 2, this will be next, and after character 3, this will be next. 01:56:21.580 |
So again, nothing different, and again, if you wanted to go a bit further with the assignment, you could implement this version too. 01:56:31.420 |
But basically what we're saying is in the for loop, we'd be saying results = some empty 01:56:39.680 |
list, and then we'd be going through, and rather than returning that, we'd instead 01:56:45.620 |
be saying results.append(that), and then return torch.stack of the results, something like that. 01:57:03.960 |
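Roughly what that change could look like, written as the forward method of the loop-based concat model sketched earlier:

    def forward(self, *cs):
        bs = cs[0].size(0)
        h = torch.zeros(bs, self.n_hidden)
        results = []
        for c in cs:
            inp = F.relu(self.l_in(torch.cat((h, self.e(c)), dim=1)))
            h = torch.tanh(self.l_hidden(inp))
            results.append(F.log_softmax(self.l_out(h), dim=-1))   # an output after every character
        return torch.stack(results)                                # seq_len x batch_size x vocab_size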
So now we now have every step we've created an output, which is basically this picture. 01:57:13.960 |
And so the reason, well there's lots of reasons that's interesting, but I think the main reason 01:57:20.020 |
right now that's interesting is that you probably noticed this approach to dealing with that data is kind of wasteful. 01:57:33.700 |
Like we're grabbing the first 8, but then this next set, all but one of them overlaps 01:57:44.580 |
So we're kind of recalculating the exact same embeddings, 7 out of 8 of them are going to 01:57:50.460 |
be exact same embeddings, exact same transitions, it kind of seems weird to do all this calculation 01:57:59.460 |
to just predict one thing and then go back and recalculate 7 out of 8 of them and add 01:58:03.620 |
one more to the end to calculate the next thing. 01:58:07.260 |
So the basic idea then is to say, well let's not do it that way, instead let's take non-overlapping 01:58:22.580 |
Here is our first 8 characters, here is the next 8 characters, here are the next 8 characters. 01:58:28.660 |
So like if you read this top left to bottom right, that would be the whole of Nietzsche. 01:58:36.320 |
And so then, if these are the first 8 characters, then offset this by 1, starting here, that's 01:58:48.460 |
So after we see characters 0 through 7, we should predict characters 1 through 8. 01:58:55.520 |
So after 40 should come 42 as it did, after 42 should come 29 as it did. 01:59:04.580 |
And so now that can be our inputs and labels for that model. 01:59:10.940 |
And so it shouldn't be any more or less accurate, it should just be the same, but it should be more efficient. 01:59:35.540 |
So I mentioned last time that we had a -1 index here, because we just wanted to grab the final output. 01:59:48.180 |
So in this case, we're going to grab all the triangles. 01:59:51.300 |
So this is actually the way nn.RNN returns things anyway. 01:59:55.880 |
We only kept the last one, but this time we're going to keep all of them. 02:00:05.700 |
So we've made one change, which is to remove that -1. 02:00:09.900 |
Other than that, this is the exact same code as before. 02:00:16.260 |
There's nothing much to show you here, except of course this time if we look at the labels, 02:00:29.080 |
it's now 512x8 because we're trying to predict 8 things every time through. 02:00:38.220 |
So there is one complexity here, which is that we want to use the negative log likelihood 02:00:47.420 |
loss function as before, but the negative log likelihood loss function, just like RMSE, expects 02:00:55.680 |
to receive 2 rank 1 tensors, well actually with the minibatch axis, 2 rank 2 tensors. 02:01:08.000 |
The problem is that we've got 8 time steps, 8 characters; in an RNN we call it a time step. 02:01:18.700 |
We have 8 time steps, and then for each one we have 84 probabilities, the probability of each possible character. 02:01:30.780 |
And then we have that for each of our 512 items in the minibatch. 02:01:36.460 |
So we have a rank 3 tensor, not a rank 2 tensor. 02:01:42.060 |
So that means that the negative log likelihood loss function is going to spit out an error. 02:01:47.940 |
Frankly I think this is kind of dumb, I think it would be better if PyTorch had written 02:01:53.820 |
their loss functions in such a way that they didn't care at all about rank and they just 02:02:02.620 |
But for now at least, it does care about rank. 02:02:06.560 |
But the nice thing is I get to show you how to write a custom loss function. 02:02:10.460 |
So we're going to create a special negative log likelihood loss function for sequences. 02:02:16.620 |
And so it's going to take an input and a target, and it's going to call F.nll_loss at the end. 02:02:26.140 |
So what we're going to do is we're going to flatten our input, and we're going to flatten our target. 02:02:37.940 |
And it turns out these are going to be the first two axes that have to be transposed. 02:02:46.540 |
So the way PyTorch handles RNN data by default is the first axis is the sequence length. 02:02:58.120 |
So the sequence length of an RNN is how many time steps? 02:03:02.620 |
So we have 8 characters, so sequence length of 8. 02:03:05.540 |
The second axis is the batch size, and then as you would expect, the third axis is the 02:03:13.940 |
So this is going to be 8 by 512 by nhidden, which I think was 256. 02:03:26.540 |
So we can grab the size and unpack it into each of these, sequence length batch size 02:03:36.820 |
Our target is 512 by 8, whereas this one here was 8 by 512. 02:03:52.900 |
So to make them match, we're going to have to transpose the first two axes. 02:04:02.500 |
PyTorch, when you do something like transpose, doesn't generally actually shuffle the memory 02:04:08.100 |
order, but instead it just kind of keeps some internal metadata to say you should treat it as if it were transposed. 02:04:18.820 |
Some things in PyTorch will give you an error if you try and use it when it has this internal metadata set. 02:04:25.900 |
It will basically say, "Error, this tensor is not contiguous." 02:04:32.100 |
If you ever see that error, add the word "contiguous" after it and it goes away. 02:04:36.700 |
So I don't know, they can't do that for you apparently. 02:04:39.300 |
So in this particular case, I got that error, so I wrote the word "contiguous" after it. 02:04:44.020 |
And so then finally we need to flatten it out into a single vector, and so we can just 02:04:49.100 |
go .view, which is the same as NumPy's .reshape, and -1 means make it as long as it needs to be. 02:04:57.820 |
And then the input, again we also reshape that, but remember the predictions also have 02:05:08.980 |
this axis of length 84, all of the predicted probabilities. 02:05:19.140 |
So if you ever want to play around with your own loss functions, you can just do it like this. 02:05:29.460 |
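A sketch of that sequence loss, following the transpose / contiguous / view steps just described:

    import torch.nn.functional as F

    def nll_loss_seq(inp, targ):
        # inp:  seq_len x batch_size x vocab_size (log probabilities)
        # targ: batch_size x seq_len
        sl, bs, nc = inp.size()
        targ = targ.transpose(0, 1).contiguous().view(-1)    # match the axis order, then flatten
        return F.nll_loss(inp.view(-1, nc), targ)             # keep the vocab axis, flatten the rest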
So it's important to remember that fit is this lowest level fastai abstraction, this 02:05:39.180 |
is the thing that implements the training loop. 02:05:42.220 |
And so the stuff you pass it in is all standard PyTorch stuff except for this. 02:05:51.820 |
This is our model data object, this is the thing that wraps up the test set, the training set and the validation set. 02:06:05.780 |
So when we pull the triangle into the repeated structure, so the first n-1 iterations of 02:06:14.900 |
the sequence length, we don't see the whole sequence length, so does that mean that the 02:06:19.940 |
batch size should be much bigger so that you get a triangular kind of -- 02:06:23.780 |
Now be careful, you don't mean batch size, you mean sequence length, right? 02:06:27.820 |
Because the batch size is like something else entirely. 02:06:31.820 |
So yes, if you have a short sequence length like 8, the first character has nothing to 02:06:40.420 |
go on, it starts with an empty hidden state of zeros. 02:06:48.260 |
So what we're going to start with next week is we're going to learn how to avoid that 02:06:54.740 |
And so it's a really insightful question or concern. 02:06:59.700 |
But if you think about it, the basic idea is why should we reset this to 0 every time? 02:07:09.100 |
If we can kind of line up these mini-batches somehow so that the next mini-batch joins 02:07:16.540 |
up correctly, it represents the next letter in Nietzsche's works, then we'd want to move 02:07:22.340 |
this up into the constructor and then pass that here and then store it here. 02:07:36.460 |
And now we're not resetting the hidden state each time, we're actually keeping the hidden 02:07:43.340 |
state from call to call, and so the only time that it would be failing to benefit from learning 02:07:51.680 |
state would be literally at the very start of the document. 02:07:55.500 |
So that's where we're going to try ahead next week. 02:08:07.380 |
I feel like this lesson, every time I've got a punch line coming, somebody asks me a question 02:08:11.700 |
where I have to do the punch line ahead of time. 02:08:20.940 |
And I want to show you something interesting, and this is coming to another punch line that 02:08:26.740 |
Yannette tried to spoil, which is when we're -- remember, this is just doing a loop, applying the same matrix multiply again and again. 02:08:40.420 |
If that matrix multiply tends to increase the activations each time, then effectively 02:08:47.740 |
we're doing that to the power of 8, so it's going to shoot off really high, or if it's 02:08:53.260 |
decreasing it a little bit each time, it's going to shoot off really low. 02:08:57.260 |
So this is what we call a gradient explosion. 02:09:00.860 |
And so we really want to make sure that the initial l_hidden weight matrix that we create is of a size 02:09:18.700 |
that's not going to cause our activations on average to increase or decrease. 02:09:24.500 |
And there's actually a very nice matrix that does exactly that, called the identity matrix. 02:09:33.180 |
So the identity matrix for those that don't quite remember their linear algebra is this. 02:09:44.740 |
And so the trick about an identity matrix is anything times an identity matrix is itself. 02:09:52.740 |
And therefore you could multiply by this again and again and again and again and still end 02:09:57.620 |
up with itself, so there's no gradient explosion. 02:10:02.860 |
So what we could do is instead of using whatever the default random initialization is for this matrix, 02:10:10.540 |
we could instead, after we create our RNN, go into that RNN, and if we now go and look at the documentation for it, 02:10:27.740 |
And as well as the arguments for constructing it, it also tells you the inputs and outputs 02:10:33.140 |
for calling the layer, and it also tells you the attributes. 02:10:37.180 |
And so it tells you there's something called weight_hh_l0, and these are the learnable hidden 02:10:42.060 |
to hidden weights, that's that square matrix. 02:10:45.300 |
So after we've constructed our m, we can just go in and say m.rnn.weight_hh_l0.data, that's 02:10:55.820 |
the tensor, .copy_, the in-place copy, of torch.eye, that is 'eye' for identity, in case you were wondering. 02:11:08.860 |
So this is an identity matrix of size n hidden. 02:11:12.580 |
So this both copies the identity matrix into this weight matrix and returns it. 02:11:20.320 |
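In code, that one-line initialization could look like this, using the CharRnn-style model sketched earlier; the sizes are the ones from the lecture and vocab_size is assumed from before:

    import torch

    n_hidden = 256
    m = CharRnn(vocab_size, n_fac=42, n_hidden=n_hidden)
    # weight_hh_l0 is nn.RNN's learnable hidden-to-hidden matrix, the square one applied at every step
    m.rnn.weight_hh_l0.data.copy_(torch.eye(n_hidden))   # torch.eye(n) returns an n x n identity matrix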
And so this was actually from a Geoffrey Hinton paper in 2015, 02:11:32.020 |
so after recurrent neural nets have been around for decades, he was like, hey gang, maybe 02:11:39.300 |
we should just use the identity matrix to initialize this, and it actually turns out 02:11:48.140 |
And so that was a 2015 paper, believe it or not, from the father of neural networks. 02:11:52.880 |
And so here is our implementation of his paper. 02:11:55.780 |
And this is an important thing to note, right? 02:11:58.060 |
When very famous people like Geoffrey Hinton write a paper, sometimes the entire implementation can be a single line of code. 02:12:08.900 |
Before we got 0.61257, we'll fit it with exactly the same parameters, and now we've got 0.51, 02:12:23.300 |
And one of the nice things about this tweak was before I could only use a learning rate 02:12:27.220 |
of 1e-3 before it started going crazy, but after I used the identity matrix, I found 02:12:33.460 |
I could use 1e-2 because it's better behaved. 02:12:37.420 |
Weight initialization, I found I could use a higher learning rate. 02:12:42.340 |
And honestly, these things, increasingly we're trying to incorporate into the defaults in fastai. 02:12:51.260 |
You won't necessarily need to actually know them, but at this point we're still at a point 02:12:59.140 |
where most things in most libraries most of the time don't have great defaults, so it's good to know what's going on under the hood. 02:13:05.100 |
It's also nice to know if you want to improve something what kind of tricks people have 02:13:09.100 |
used elsewhere because you can often borrow them yourself. 02:13:13.360 |
Alright, that's the end of the lesson today, so next week we will look at this idea of 02:13:20.140 |
a stateful RNN that's going to keep its hidden state around, and then we're going to go back 02:13:24.740 |
to looking at language models again, and then finally we're going to go all the way back 02:13:28.880 |
to computer vision and learn about things like resnets, and batch norm, and all the 02:13:34.640 |
tricks that were figured out in cats vs dogs.