Lesson 6: Deep Learning 2018
Chapters
0:00 Introduction
4:50 Learn.model
10:20 Return a variable
15:30 Zip function
20:30 PCA
24:45 Entity embedding layers
29:11 Map of Germany
30:35 Embeddings
35:45 Fake Tasks
37:35 Autoencoder
41:04 Rossmann
46:36 Rossmann kernels
49:04 Columnar model data
00:00:14.360 |
Couple of weeks ago, in lesson 4, I mentioned I was going to share that lesson with this 00:00:18.820 |
terrific NLP researcher, Sebastian Ruder, which I did, and he said he loved it and he's 00:00:25.220 |
gone on, just yesterday, to release this new post called Optimization for Deep Learning Highlights 00:00:31.600 |
in 2017 in which he covered basically everything that we talked about in that lesson. 00:00:39.720 |
And with some very nice shoutouts to some of the work that some of the students here 00:00:43.280 |
have done, including when he talked about the separation of weight decay from the momentum 00:00:56.120 |
term, and so he actually mentions here the opportunities in terms of improved software 00:01:04.240 |
decoupling this allows, and actually links to the commit from Anand Saha showing how that's done. 00:01:14.840 |
So fastai's code is actually being used as a bit of a role model now. 00:01:20.320 |
He then covers some of these learning rate training techniques that we've talked about. 00:01:28.520 |
This is the SGDR schedule, it looks a bit different to what you're used to seeing because 00:01:34.600 |
this is on a log curve, this is the way that they show it on the paper. 00:01:38.720 |
And for more information, again, links to two blog posts, the one from Vitaly about 00:01:46.880 |
this topic, and again Anand Saha's blog post on this topic. 00:01:53.360 |
So it's great to see that some of the work from fastai students is already getting noticed 00:01:59.680 |
and picked up and shared, and this blog post went on to get on the front page of Hacker News. 00:02:07.760 |
And hopefully more and more of this work will be picked up once this is released publicly. 00:02:16.080 |
So last week we were doing a deep dive into collaborative filtering. 00:02:26.240 |
Let's remind ourselves of what our final model looked like. 00:02:35.240 |
So in the end we ended up rebuilding the model that's actually in the fastai library, where 00:02:45.760 |
we had an embedding, so we had this little get embedding function that grabbed an embedding 00:02:51.840 |
and randomly initialized the weights, for the users, and for the items, that's the kind 00:03:00.280 |
of generic term; in our case the items are movies. Then the bias for the users, the bias for 00:03:04.560 |
the items, and we had n factors embedding size for each one, of course the biases just 00:03:12.120 |
had a single one, and then we grabbed the users and item embeddings, multiplied them 00:03:16.600 |
together, summed it up for each row, and added on the bias terms, popped that through a sigmoid 00:03:31.840 |
One of you asked if we can interpret this information in some way, and I promised we would. 00:03:38.800 |
So let's take a look. We're going to start with the model we built here where we just 00:03:44.540 |
used the fastai library's CollabFilterDataset.from_csv, and then that .get_learner, and 00:03:52.080 |
then we fitted it in three epochs, 19 seconds, we've got a pretty good result. 00:04:01.440 |
So what we can now do is to analyze that model. 00:04:07.720 |
So you may remember right back when we started, we read in the movies.csv file, but that's 00:04:15.600 |
just a mapping from the ID of the movie to the name of the movie. So we're just going 00:04:20.040 |
to use that for display purposes so we can see what we're doing. 00:04:25.640 |
Because not all of us have watched every movie, I'm just going to limit this to the 00:04:30.920 |
3,000 most popular movies, so we might have more chance of recognizing the movies we're 00:04:37.880 |
looking at, and then I'll go ahead and change it from the movie IDs from MovieLens to those 00:04:44.000 |
unique IDs that we're using, the contiguous IDs, because that's what our model has. 00:04:50.420 |
So from the learn object, the learner that we created, we can always grab the PyTorch 00:05:02.680 |
model itself just by saying "learn.model", and I'm going to show you more and more of 00:05:10.440 |
the code at the moment, so let's take a look at the definition of model. 00:05:18.800 |
And so model is a property, so if you haven't seen a property before, a property is just 00:05:22.600 |
something in Python which looks like a method when you define it, but you can call it without parentheses. 00:05:33.040 |
And so when you call it, it kind of looks like it's a regular attribute, 00:05:38.640 |
but every time you call it, it actually runs this code. 00:05:42.380 |
And so in this case it's just a shortcut to grab something called .models.model, so you 00:05:48.760 |
may be interested to know what that looks like, learn.models. 00:05:54.240 |
And so the fastai model type is a very thin wrapper for PyTorch models. 00:06:03.200 |
So we could take a look at this, the CollabFilterModel, and see what that is. 00:06:16.200 |
And we'll talk more about these in part 2, but basically there's this very thin wrapper 00:06:23.960 |
and one of the main things that fastai does is we have this concept of layer groups where 00:06:28.840 |
basically when you say here there are different learning rates and they get applied to different 00:06:32.680 |
sets of layers, then that's something that's not in PyTorch. 00:06:35.960 |
So when you say I want to use this PyTorch model, there's one thing we have to do, which is write this little wrapper. 00:06:43.840 |
So the details aren't terribly important, but in general if you want to create a little 00:06:48.400 |
wrapper for some other PyTorch model, you could just write something like this. 00:06:55.600 |
So to get inside that, to grab the actual PyTorch model itself, it's models.model, that's 00:07:02.280 |
the PyTorch model, and then the learn object has a shortcut to that. 00:07:07.680 |
So we're going to set m to be the PyTorch model. 00:07:12.520 |
And so when you print out a PyTorch model, it prints it out basically by listing out 00:07:18.400 |
all of the layers that you created in the constructor. 00:07:23.400 |
It's quite nifty actually when you think about the way this works thanks to some very handy 00:07:30.720 |
stuff in Python, we're actually able to use standard Python OO to define these modules 00:07:39.120 |
and these layers and they basically automatically register themselves with PyTorch. 00:07:45.040 |
So back in our EmbeddingDotBias module, we just had a bunch of things where we said each of these 00:07:51.040 |
things are equal to these things and it automatically knows how to represent that. 00:07:57.400 |
So you can see the name is u, and the name is just literally whatever we called that attribute. 00:08:05.560 |
And then the definition is it's this kind of layer. 00:08:14.760 |
So we can look inside that, basically use that, so if we say m.ib, then that's referring 00:08:23.440 |
to the embedding layer for an item which is the bias layer. 00:08:29.720 |
So an item bias in this case is the movie bias. 00:08:32.940 |
So each movie, there are 9,000 of them, has a single bias element. 00:08:39.600 |
Now the really nice thing about PyTorch layers and models is that they all look the same. 00:08:48.600 |
To use them, you call them as if they were a function. So we can go m.ib, and that basically 00:08:56.080 |
says I want you to return the value of that layer, and that layer could be a full-on model. 00:09:04.000 |
So to actually get a prediction from a PyTorch model, I would go m and pass in my variable. 00:09:12.320 |
And so in this case, m.ib and pass in my top movie indexes. 00:09:19.880 |
Now models, and remember layers are models too, require variables, not tensors, because they need to keep track 00:09:28.760 |
of the derivatives, and so we use this capital V to turn the tensor into a variable. 00:09:36.720 |
Now it's just announced this week that PyTorch 0.4, which is the version after the one that's 00:09:43.760 |
just about to be released, is going to get rid of variables and we'll actually be able 00:09:49.160 |
to use tensors directly to keep track of derivatives. 00:09:52.440 |
So if you're watching this on the MOOC and you're looking at 0.4, then you'll probably 00:09:56.640 |
notice that the code doesn't have this V in it anymore, so that will be pretty exciting 00:10:03.320 |
For now, we have to remember if we're going to pass something into a model to turn it 00:10:08.160 |
And remember, a variable has a strict superset of the API of a tensor, so anything you can 00:10:13.800 |
do to a tensor, you can do to a variable, like add it up or take its log or whatever. 00:10:20.080 |
So that's going to return a variable which consists of going through each of these movie 00:10:25.120 |
IDs, putting it through this embedding layer to get its bias. 00:10:42.720 |
So before I press Shift + Enter here, you can have a think about what I'm going to get. 00:10:48.000 |
I've got a list of 3000 movies going in, turning it into a variable, putting it through this 00:10:53.440 |
embedding layer, so just have a think about what you expect to come out. 00:10:59.800 |
And we have a variable of size 3000 by 1, hopefully that doesn't surprise you. 00:11:04.760 |
We had 3000 movies that we were looking up, each one had an embedding of length one, so there's our 3000 by 1. 00:11:12.240 |
You'll notice it's a variable, which is not surprising because we've fed it a variable, 00:11:15.920 |
so we've got a variable back, and it's a variable that's on the GPU, dot CUDA. 00:11:22.700 |
So we have a little shortcut in fast.ai because we very often want to take variables, turn 00:11:29.840 |
them into tensors, and move them back to the CPU so we can play with them more easily. 00:11:34.400 |
So to_np is "to NumPy", and that does all of those things. 00:11:39.420 |
It works regardless of whether it's a tensor or a variable, it works regardless of whether 00:11:43.400 |
it's on the CPU or GPU, it'll end up giving you a NumPy array from that. 00:11:50.160 |
So if we do that, that gives us exactly the same thing as we just looked at, but now in 00:11:58.680 |
So that's a super handy thing to use when you're playing around with PyTorch. 00:12:03.600 |
My approach to things is I try to use NumPy for everything, except when I explicitly need 00:12:13.080 |
something to run on the GPU, or I need its derivatives, in which case I use PyTorch. 00:12:19.840 |
I find NumPy's often easier to work with, it's been around many years longer than PyTorch. 00:12:29.920 |
And lots of things like the Python Imaging Library, OpenCV, and lots and lots of other stuff work with NumPy. 00:12:40.040 |
So my approach is do as much as I can in NumPy land, finally when I'm ready to do something 00:12:46.400 |
on the GPU or take its derivative, I move to PyTorch, and then as soon as I can, I put it back into NumPy. 00:12:52.760 |
And you'll see that the FastAI library really works this way, like all the transformations 00:12:57.080 |
and stuff happen in NumPy, which is different to most PyTorch computer vision libraries 00:13:03.840 |
which tend to do it all as much as possible in PyTorch. 00:13:13.280 |
So let's say we wanted to build a model with the GPU and train it, and then use it elsewhere. 00:13:21.300 |
Would we call to NumPy on the model itself, or would we have to iterate through all the layers? 00:13:31.480 |
So it's very likely that you want to do inference on a CPU rather than a GPU, it's more scalable, 00:13:37.920 |
you don't have to worry about putting things in batches, so on and so forth. 00:13:42.560 |
So you can move a model onto the CPU just by typing m.cpu(), and that model is now on the CPU. 00:13:53.280 |
And therefore you can also then put your variable on the CPU by doing exactly the same thing, calling .cpu() on it. 00:14:05.840 |
Now having said that, if your server doesn't have a GPU, you don't have to do this, because it will all happen automatically. 00:14:15.120 |
So for inferencing on the server, if you're running it on some T2 instance or something, 00:14:23.440 |
it'll work fine and it'll all run on the CPU automatically. 00:14:28.280 |
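As a rough sketch of what that looks like, reusing the m and topMovieIdx names from earlier (so purely illustrative):

    m = m.cpu()                              # moves all of the model's parameters to the CPU in place
    movie_bias = m.ib(V(topMovieIdx).cpu())  # the input variable has to live on the same device as the layer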
So if we train the model on the GPU and then we save those embeddings and the weights, 00:14:37.520 |
would we have to do anything special to load it onto the CPU? 00:14:43.360 |
We have something, well it kind of depends on how much of fast AI you're using, so I'll 00:14:48.080 |
show you how you can do that in case you have to do it manually. 00:14:52.920 |
One of the students figured this out, which is very handy. 00:14:56.240 |
When we -- there's a load_model function, and if you look at what it does, it does torch.load with a map_location argument, 00:15:07.320 |
and basically this is like some magic incantation: normally it has to load the weights 00:15:12.440 |
onto the same GPU they were saved on, but this will load them onto whatever's available. 00:15:29.600 |
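As a sketch, the incantation looks roughly like this (the filename here is just an example):

    import torch

    # map_location tells torch.load where to put the saved weights: this lambda keeps
    # them in plain CPU storage regardless of which GPU they were saved on
    state = torch.load('model.h5', map_location=lambda storage, loc: storage)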
To put that back on the GPU, I'll need to say .cuda and now I can run it again. 00:15:40.040 |
So it's really important to know about the zip function in Python, which iterates through a number of lists at the same time. 00:15:50.360 |
So in this case, I want to grab each movie along with its bias term so that I can look at them together. 00:15:58.200 |
So if I just go zip like that, that's going to iterate through each movie ID and each bias together. 00:16:05.680 |
And so then I can use that in a list comprehension to grab the name of each movie along with its bias. 00:16:12.800 |
So having done that, I can then sort, and so here I told you the John Travolta Scientology 00:16:23.160 |
movie is the most negative -- by quite a lot. 00:16:26.800 |
If this was a Kaggle competition, Battlefield Earth would have won by miles. 00:16:34.880 |
So here is the worst movie of all time, according to IMDB. 00:16:38.960 |
It's interesting when you think about what this means, because this is a much more authentic 00:16:43.440 |
way to find out how bad this movie is, because some people are just more negative about movies. 00:16:50.680 |
And if more of them watch your movie, like a highly critical audience, they're going to rate it lower. 00:16:56.160 |
So if you take an average, it's not quite fair. 00:17:01.120 |
And so what this is doing is saying once we remove the fact that different people have 00:17:07.840 |
different overall positive or negative experiences, and different people watch different kinds 00:17:11.640 |
of movies, and we correct for all that, this is the worst movie of all time. 00:17:22.580 |
So this is how we can look inside our model and interpret the bias vectors. 00:17:30.760 |
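A sketch of what that looks like, assuming movie_bias is the 3000x1 array we just computed, topMovies is the list of popular movie IDs, and movie_names maps each ID to its title (names as I recall them from the notebook):

    movie_ratings = [(b[0], movie_names[i]) for i, b in zip(topMovies, movie_bias)]
    sorted(movie_ratings, key=lambda o: o[0])[:10]   # most negative bias first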
You'll see here I've sorted by the zeroth element of each tuple by using a lambda. 00:17:42.200 |
There's also itemgetter, which is part of Python's operator library; it creates a function that returns the 00:17:47.360 |
zeroth element of something in order to save time. 00:17:51.760 |
And then I actually realized that the lambda is only one more character to write than the 00:17:56.720 |
item getter, so maybe we don't need to know this after all. 00:17:59.680 |
So really useful to make sure you know how to write lambdas in Python, so this is a function. 00:18:07.560 |
And so the sort is going to call this function every time it decides is this thing higher 00:18:12.840 |
or lower than that other thing, and this is going to return the zeroth element. 00:18:20.000 |
So here's the same thing in item getter format, and here is the reverse and Shawshank redemption 00:18:27.320 |
right at the top, I'll definitely agree with that, godfather, usual suspects, these are 00:18:31.320 |
all pretty great movies, 12 angry men, absolutely. 00:18:37.160 |
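For reference, the two forms are interchangeable; itemgetter(0) just builds a function that returns element zero of whatever it's given (movie_ratings here is the list built above):

    from operator import itemgetter

    sorted(movie_ratings, key=itemgetter(0), reverse=True)[:10]    # highest bias first
    sorted(movie_ratings, key=lambda o: o[0], reverse=True)[:10]   # same thing with a lambda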
So there you go, there's how we can look at the bias. 00:18:42.480 |
So then the second piece to look at would be the embeddings. 00:18:49.200 |
So we can do the same thing, so remember m.i was the item embeddings, whereas m.ib was 00:18:54.320 |
the item bias, we can pass in our list of movies as a variable, turn it into numpy, 00:19:00.700 |
and here's our movie embeddings, so for each of the 3000 most popular movies, here are its 50 embedding values. 00:19:10.160 |
So it's very hard, unless you're Geoffrey Hinton, to visualize a 50-dimensional space. 00:19:16.560 |
So what we'll do is we'll turn it into a 3-dimensional space. 00:19:21.000 |
So we can compress high-dimensional spaces down into lower-dimensional spaces using lots 00:19:26.520 |
of different techniques, perhaps one of the most common and popular is called PCA. 00:19:31.360 |
PCA stands for Principal Components Analysis, it's a linear technique, but linear techniques 00:19:39.920 |
generally work fine for this kind of embedding. 00:19:44.240 |
I'm not going to teach you about PCA now, but I will say in Rachel's Computational Linear 00:19:48.840 |
Algebra class, which you can get to from fast.ai, we cover PCA in a lot of detail. 00:19:57.320 |
And it's a really important technique, it turns out to be almost identical to something 00:20:01.520 |
called Singular Value Decomposition, which is a type of matrix decomposition which actually 00:20:08.180 |
does turn up in deep learning a little bit from time to time. 00:20:12.840 |
So it's kind of somewhat worth knowing if you were going to dig more into linear algebra, 00:20:19.200 |
SVD and PCA, along with eigenvalues and eigenvectors, which are all slightly different versions 00:20:25.720 |
of this kind of the same thing, are all worth knowing. 00:20:29.560 |
But for now, just know that you can grab PCA from sklearn.decomposition, say how much you 00:20:36.280 |
want to reduce the dimensionality to, so I want to find 3 components, and what this is 00:20:41.340 |
going to do is find 3 linear combinations of the 50 dimensions which capture as much 00:20:49.240 |
of the variation as possible, but are as different to each other as possible. 00:20:55.820 |
So we would call this a lower rank approximation of our matrix. 00:21:03.040 |
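A sketch of that, assuming movie_emb is the 3000x50 embedding matrix from above:

    from sklearn.decomposition import PCA

    pca = PCA(n_components=3)
    movie_pca = pca.fit(movie_emb.T).components_   # shape (3, 3000): one row per component, one column per movie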
So then, once we've fitted it, we can grab the components, and that's going to be the 3 new dimensions. 00:21:13.320 |
And so we can now take a look at the first of them, and we'll do the same thing of using 00:21:17.600 |
zip to look at each one along with its movie. 00:21:20.960 |
And so here's the thing, we don't know ahead of time what this PCA thing is, it's just 00:21:29.960 |
a bunch of latent factors, it's kind of the main axis in this space of latent factors. 00:21:39.460 |
And so what we can do is look at it and see if we can figure out what it's about. 00:21:46.760 |
So given that Police Academy 4 is high up here along with Waterworld, whereas Fargo, 00:21:54.040 |
Pulp Fiction and The Godfather are up the other end, I'm going to guess that a high value is not 00:21:59.880 |
going to represent critically acclaimed movies or serious watching. 00:22:05.440 |
So I called this "easy watching vs. serious", but this is kind of how you have to interpret 00:22:13.280 |
your embeddings, take a look at what they seem to be showing and decide what you think 00:22:20.120 |
So this is the principal axis in this set of embeddings. 00:22:25.520 |
So we can look at the next one, so do the same thing and look at the first index 1 embedding. 00:22:33.640 |
This one's a little bit harder to figure out what's going on, but with things like Mulholland 00:22:37.700 |
Drive and Purple Rose of Cairo, these look more kind of dialog-y ones, or else things 00:22:44.400 |
like Lord of the Rings and Aladdin and Star Wars, these look more like modern CGI-y ones. 00:22:49.720 |
So you can imagine that on that pair of dimensions, it probably represents a lot of differences 00:23:00.320 |
Some people like Purple Rose of Cairo type movies, Woody Allen kind of classics, and some don't. 00:23:15.240 |
Some people presumably like Police Academy 4 more than they like Fargo. 00:23:22.720 |
So you can kind of get the idea of what's happened. 00:23:34.560 |
For a model which literally just multiplies two things together and adds them up, it's learned quite a lot. 00:23:52.380 |
And then we could plot them if we wanted to; I just grabbed a small subset to plot on those first couple of dimensions. 00:24:05.280 |
So I wanted to next dig in a layer deeper into what actually happens when we say fit. 00:24:25.200 |
For something like the store model, is there a way to interpret the embeddings? 00:24:34.960 |
Well let's jump straight there, what the hell. 00:24:45.400 |
So, for the Rossmann how-much-are-we-going-to-sell-at-each-store-on-each-date model. 00:25:06.040 |
It's a great paper, by the way, well worth it, pretty accessible. 00:25:10.720 |
I think any of you would at this point be able to at least get the gist of it, and much 00:25:16.960 |
of the detail as well, particularly as you've also done the machine learning course. 00:25:22.240 |
And they actually make this point in the paper, this is in the paper, that the equivalent 00:25:26.960 |
of what they call entity embedding layers, an embedding of a categorical variable, is 00:25:32.200 |
identical to a one-hot encoding followed by a matrix multiply. 00:27:39.080 |
So they're basically saying if you've got three embeddings, that's the same as doing 00:25:43.380 |
three one-hot encodings, putting each one through a matrix multiply, and then put 00:27:48.280 |
that through a dense layer, or what PyTorch would call a linear layer. 00:25:56.980 |
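Here's a tiny illustration of that equivalence (the sizes are made up): looking up index 2 in an embedding gives the same numbers as multiplying a one-hot vector by the embedding's weight matrix.

    import torch
    import torch.nn as nn
    from torch.autograd import Variable

    emb = nn.Embedding(7, 4)                 # 7 categories, 4-dimensional embedding
    idx = Variable(torch.LongTensor([2]))
    one_hot = torch.zeros(1, 7)
    one_hot[0, 2] = 1

    lookup = emb(idx)                                 # index lookup
    matmul = torch.mm(Variable(one_hot), emb.weight)  # one-hot followed by a matrix multiply
    # lookup and matmul contain exactly the same values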
One of the nice things here is because this is kind of like, well they thought it was 00:26:00.360 |
the first paper, it was actually the second, I think, paper to show the idea of using categorical 00:26:04.560 |
embeddings for this kind of dataset, they really go into quite a lot of detail, right 00:26:11.200 |
back to the detailed stuff that we learned about, so it's kind of a second cut at thinking 00:26:21.200 |
So one of the interesting things that they did was they said after we've trained a neural 00:26:25.760 |
net with these embeddings, what else could we do with it? 00:26:33.520 |
So they got a winning result with a neural network with entity embeddings. 00:26:41.120 |
But then they said hey you know what, we could take those entity embeddings and replace each 00:26:45.780 |
categorical variable with the learned entity embeddings, and then feed that into a GBM. 00:26:54.080 |
So in other words, rather than passing into the GBM a one-hot encoded version, or an ordinal 00:26:59.600 |
version, let's actually replace the categorical variable with its embedding for the appropriate 00:27:09.680 |
So it's actually a way of feature engineering. 00:27:14.200 |
And so the mean average percent error for GBMs without that, using just one-hot encodings, was noticeably worse than with the embeddings. 00:27:28.880 |
Random forests without that was 0.16, with that 0.108, nearly as good as the neural net. 00:27:37.440 |
So this is kind of an interesting technique because what it means is in your organization 00:27:43.520 |
you can train a neural net that has an embedding of stores, and an embedding of product types, 00:27:49.880 |
and an embedding of whatever kind of high cardinality or even medium cardinality categorical variables 00:27:55.920 |
you have, and then everybody else in the organization can now chuck those into their GBM or random forest. 00:28:07.000 |
And what this is saying is they won't get quite the same result -- in fact you can even use k-nearest neighbors 00:28:12.480 |
with this technique and get nearly as good a result. 00:28:15.680 |
So this is a good way of giving the power of neural nets to everybody in your organization 00:28:22.600 |
without having them do the fast AI deep learning course first. 00:28:26.760 |
They can just use whatever sklearn or R or whatever that they're used to. 00:28:31.240 |
And those embeddings could literally be in a database table because if you think about 00:28:36.320 |
an embedding as just an index lookup, which is the same as an inner join in SQL. 00:28:43.480 |
So if you've got a table of each product along with its embedding vector, then you can literally 00:28:48.560 |
do an inner join, and now you have every row in your table along with its product embedding vector. 00:28:59.280 |
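As a purely illustrative sketch (every table and column name here is made up), if the product embeddings are stored as a table keyed by product ID, attaching them to every sales row is just a join:

    import pandas as pd

    # prod_emb_matrix, product_ids and sales are hypothetical inputs
    emb_cols = ['emb_%d' % i for i in range(prod_emb_matrix.shape[1])]
    prod_emb = pd.DataFrame(prod_emb_matrix, columns=emb_cols)
    prod_emb['product_id'] = product_ids
    sales_with_emb = sales.merge(prod_emb, on='product_id', how='inner')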
And GBMs and random forests learn a lot quicker than neural nets do. 00:29:12.080 |
So here's what happened when they took the various different states of Germany and plotted 00:29:17.360 |
the first two principal components of their embedding vectors. 00:29:21.200 |
And basically here is where each of them ended up in that 2D space. 00:29:25.240 |
And wackily enough, I've circled 3 of them in red, and I've circled the same 3 here on the map of Germany. 00:29:32.720 |
And here I've circled in purple, sorry blue, here are the blue ones, and here are the green ones. 00:29:38.960 |
So it's actually drawn a map of Germany, even though it never was told anything about how 00:29:46.920 |
far these states are away from each other or the very concept of geography didn't exist. 00:29:58.320 |
So I went ahead and looked -- here's another thing, I think this is also from their paper. 00:30:04.440 |
They took every pair of places and they looked at how far away they are on a map versus how 00:30:13.480 |
far away are they in embedding space, and they got this beautiful correlation. 00:30:20.600 |
So again, apparently stores that are nearby each other physically have similar characteristics 00:30:31.340 |
in terms of when people buy more or less stuff from them. 00:30:35.680 |
So I looked at the same thing for days of the week, so here's an embedding of the days 00:30:43.640 |
And I just joined up Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday. 00:30:47.080 |
I did the same thing for the months of the year. 00:30:57.160 |
So I think visualizing embeddings can be interesting. 00:31:03.040 |
It's good to first of all check you can see things you would expect to see, and then you 00:31:09.680 |
could try and see maybe things you didn't expect to see. 00:31:13.000 |
So you could try all kinds of clusterings or whatever. 00:31:20.680 |
And this is not something which has been widely studied at all, so I'm not going to tell you 00:31:25.720 |
what the limitations are of this technique or whatever. 00:31:29.200 |
I've heard of other ways to generate embeddings like skip-grams, I was wondering if you could 00:31:39.120 |
say whether one is better than the other -- using neural networks or skip-grams? 00:31:46.920 |
I'm not sure if we'll cover it in this course, but basically the original Word2vec approach 00:31:58.480 |
to generating embeddings was to say, we don't actually have a labeled dataset, all we have is a lot of unlabeled text. 00:32:14.200 |
And so they have an unsupervised learning problem, unlabeled problem. 00:32:18.200 |
And so the best way in my opinion to turn an unlabeled problem into a labeled problem is to invent some labels. 00:32:25.000 |
And so what they did in the Word2vec case was they said okay, here's a sentence with 00:32:29.720 |
11 words in it, and then they said okay, let's delete the middle word and replace it with some random word. 00:32:42.720 |
And so originally it said cat, and they said no, let's replace that with justice. 00:32:54.160 |
So before it said the cute little cat sat on the fuzzy mat, and now it says the cute little justice sat on the fuzzy mat. 00:33:02.480 |
And what they do is they do that so they have one sentence where they keep exactly as is, 00:33:12.200 |
and then they make a copy of it and they do the replacement. 00:33:16.080 |
And so then they have a label where they say it's a 1 if it was unchanged, the original, and a 0 if it was replaced. 00:33:26.640 |
And so basically then you now have something you can build a machine learning model on, 00:33:32.120 |
and so they went and built a machine learning model on this, so the model was: try and find the fake sentences. 00:33:40.760 |
Not because they were interested in a fake sentence finder, but because as a result they 00:33:44.960 |
now have embeddings that just like we discussed you can now use for other purposes. 00:33:50.600 |
Now it turns out that if you do this with effectively a single matrix multiply rather than 00:33:59.520 |
making a deep neural net, you can train this super quickly. 00:34:06.800 |
They decided we're going to make a pretty crappy model, like a shallow learning model rather than a deep one. 00:34:14.920 |
With the downside it's a less powerful model, but a number of upsides. 00:34:18.760 |
The first thing is we can train it on a really large dataset, and then also really importantly 00:34:23.720 |
we're going to end up with embeddings which have really very linear characteristics, so 00:34:29.800 |
we can add them together and subtract them and stuff like that. 00:34:35.840 |
So there's a lot of stuff we can learn about there for other types of embedding, like categorical 00:34:42.560 |
embeddings, specifically if we want categorical embeddings which we can kind of draw nicely 00:34:49.720 |
and expect us to be able to add and subtract them and behave linearly, probably if we want 00:34:55.840 |
to use them in k-nearest neighbors and stuff, we should probably use shallow learning. 00:35:02.580 |
If we want something that's going to be more predictive, we probably want to use a neural 00:35:08.840 |
And so actually in NLP, I'm really pushing the idea that we need to move past Word2vec 00:35:17.000 |
and GloVe, these linear-based methods, because it turns out that those embeddings are way 00:35:22.180 |
less predictive than embeddings learned from deep models. 00:35:26.240 |
And so the language model that we learned about which ended up getting a state-of-the-art 00:35:29.680 |
on sentiment analysis didn't use GloVe or Word2vec, but instead we pre-trained a deep 00:35:35.360 |
recurrent neural network, and we ended up with not just pre-trained word vectors but a full pre-trained model. 00:35:43.400 |
So it looks like to create embeddings for entities we need a dummy task, right? 00:35:51.040 |
Not necessarily a dummy task, like in this case we had a real task, right? 00:35:54.480 |
So we created the embeddings for Rossmann by trying to predict store sales. 00:36:02.120 |
This isn't just for learning embeddings, for learning any kind of feature space, you either 00:36:09.660 |
need labeled data, or you need to invent some kind of fake task. 00:36:16.840 |
So does a task matter, like if I choose a task and train embeddings, if I choose another 00:36:21.240 |
task and train embeddings, which one is better? 00:36:26.200 |
It's a great question, and it's not something that's been studied nearly enough, right? 00:36:30.800 |
I'm not sure that many people even quite understand that when they say unsupervised learning nowadays, 00:36:38.000 |
they almost always mean fake-task labeled learning. 00:36:44.280 |
And so the idea of what makes a good fake task, I don't know that I've seen a paper 00:36:49.600 |
on that, but intuitively, we need something where the kinds of relationships it's going 00:36:58.080 |
to learn are likely to be the kinds of relationships that you probably care about. 00:37:03.520 |
So for example, in computer vision, one kind of fake task people use is to say let's take 00:37:15.200 |
some images and use some kind of unreal and unreasonable data augmentation, like recolor 00:37:22.200 |
them too much or whatever, and then we'll ask the neural net to predict which one was 00:37:27.320 |
the augmented and which one was not the augmented. 00:37:36.680 |
I think it's a fascinating area, and one which would be really interesting for people, maybe 00:37:43.160 |
some of the students here to look into further, is take some interesting semi-supervised or 00:37:47.260 |
unsupervised data sets, try and come up with some more clever fake tasks, and see whether it matters. 00:37:58.800 |
In general, if you can't come up with a fake task that you think seems great, I would say 00:38:03.520 |
use the best one you can; it's often surprising how little you need. 00:38:10.120 |
The ultimately crappy fake task is called the autoencoder, and the autoencoder is the 00:38:17.240 |
thing which won the claims prediction competition that just finished on Kaggle. 00:38:22.720 |
They had lots of examples of insurance policies where we knew this was how much was claimed, 00:38:29.080 |
and then lots of examples of insurance policies where, I guess, they must have still been open. 00:38:36.600 |
So what they did was they said let's basically start off by grabbing every policy, and we'll 00:38:44.040 |
take a single policy and we'll put it through a neural net, and we'll try and have it reconstruct itself. 00:38:55.600 |
But in these intermediate layers, at least one of those intermediate layers, we'll make 00:38:59.680 |
sure there's less activations than there were inputs. 00:39:03.080 |
So let's say if there were 100 variables on the insurance policy, we'll have a layer in the middle with far fewer activations. 00:39:12.880 |
And so when you basically are saying hey, reconstruct your own input, it's not a different 00:39:18.000 |
kind of model, it doesn't require any special code, it's literally just passing, you can 00:39:24.160 |
use any standard PyTorch or fastai learner, you just say my output equals my input. 00:39:31.200 |
And that's the most uncreative, invented task you can create. 00:39:37.560 |
That's called an autoencoder, and it works surprisingly well -- in fact, to the point that 00:39:44.640 |
they took the features that it learned, chucked them into another neural net, and won. 00:39:53.800 |
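A minimal sketch of an autoencoder (not the actual winning model; the sizes are made up) with a bottleneck in the middle:

    import torch.nn as nn

    class AutoEncoder(nn.Module):
        def __init__(self, n_inp=100, n_bottleneck=5):
            super().__init__()
            # the middle layer has fewer activations than the input
            self.encoder = nn.Sequential(nn.Linear(n_inp, 30), nn.ReLU(),
                                         nn.Linear(30, n_bottleneck), nn.ReLU())
            self.decoder = nn.Sequential(nn.Linear(n_bottleneck, 30), nn.ReLU(),
                                         nn.Linear(30, n_inp))

        def forward(self, x):
            return self.decoder(self.encoder(x))

    # the training target is the input itself, e.g. loss = nn.MSELoss()(model(x), x)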
Maybe if we have enough students taking an interest in this, then we'll be able to cover 00:40:00.560 |
unsupervised learning in more detail in Part 2, especially given this Kaggle win. 00:40:06.440 |
I think this may be related to the previous question. 00:40:16.800 |
Is the language model, for example, trained on the arXiv data? 00:40:19.160 |
Is that useful at all on the movie reviews, the IMDB data? 00:40:26.000 |
I was just talking to Sebastian about this, Sebastian wrote about this this week, and 00:40:31.560 |
we thought we'd try and do some research on this in January. 00:40:37.120 |
We know that in computer vision, it's shockingly effective to train on cats and dogs, and use 00:40:44.320 |
that pre-trained network to do lung cancer diagnosis in CT scans. 00:40:49.680 |
In the NLP world, nobody much seems to have tried this, the NLP researchers I've spoken 00:40:55.600 |
to about this, other than Sebastian, assume that it wouldn't work, and they generally haven't tried it. 00:41:04.400 |
Since we're talking about Rossmann, I'll just mention that during the week, I was interested 00:41:13.080 |
to see how good this solution actually was, because I noticed that on the public leaderboard 00:41:20.340 |
it didn't look like it was going to be that great. 00:41:23.280 |
I also thought it would be good to see what does it actually take to use a test set properly 00:41:31.280 |
So if you have a look at Rossmann now, I've pushed some changes that actually run the 00:41:35.080 |
test set through as well, so you can get a sense of how to do this. 00:41:38.720 |
So you'll see basically every line appears twice, one for test and one for train when 00:41:45.840 |
we get the test, train, test, train, test, train. 00:41:48.840 |
Obviously you could do this in a lot fewer lines of code by putting all of the steps 00:41:53.120 |
into a method and then pass either the training data frame or the test data frame to it. 00:41:58.600 |
In this case, for teaching purposes you'd be able to see each step and experiment to 00:42:05.160 |
see what each step looks like, but you can certainly simplify this code. 00:42:12.040 |
So we do this for every data frame, and then for some of these you can see I kind of loop 00:42:17.520 |
through the data frames, joined and joined_test, for both train and test. 00:42:24.480 |
This whole thing about the durations, I basically put two lines here, one that says data frame 00:42:30.760 |
equals train columns, one that says data frame equals test columns. 00:42:34.640 |
And so my idea is you'd run this line first and then you would skip the next one and run 00:42:40.760 |
everything beneath it, and then you'd go back and run this line and then run everything beneath it again. 00:42:46.280 |
So some people on the forum were asking how come this code wasn't working this week, which 00:42:50.840 |
is a good reminder that the code is not designed to be code that you always run top to bottom 00:42:57.040 |
You're meant to think, what is this code here, should I be running it right now? 00:43:03.760 |
And so the early lessons I tried to make it so you can run it top to bottom, but increasingly 00:43:08.160 |
as we go along I kind of make it more and more that you actually have to think about what's going on. 00:43:13.920 |
So Jeremy, you're talking about shallow learning and deep learning, could you define that a bit more? 00:43:22.040 |
By shallow learning, I think I just mean anything that doesn't have a hidden layer. 00:43:25.760 |
So something that's like a dot product, a matrix multiply basically. 00:43:40.720 |
So we end up with a training and a test version, and then everything else is basically the same. 00:43:49.160 |
One thing to note, and a lot of the details of this we cover in the machine learning course 00:43:53.000 |
by the way, because it's not really deep learning specific, so check that out if you're interested 00:43:58.800 |
I should mention we use apply_cats rather than train_cats to make sure that the test 00:44:03.440 |
set and the training set have the same categorical codes that they join to. 00:44:12.720 |
We also need to make sure that we keep track of the mapper. 00:44:15.640 |
This is the thing which basically says what's the mean and standard deviation of each continuous 00:44:19.960 |
column and then apply that same mapper to the test set. 00:44:27.020 |
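As a rough sketch of that workflow using the fastai (0.7-era) structured-data functions, with argument names as I recall them, so double-check against your version of the library:

    apply_cats(joined_test, joined)      # reuse the training set's categorical codes on the test set

    df, y, nas, mapper = proc_df(joined, 'Sales', do_scale=True)
    df_test, _, nas, mapper = proc_df(joined_test, 'Sales', do_scale=True,
                                      mapper=mapper, na_dict=nas)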
And so when we do all that, that's basically it. 00:44:28.720 |
When the rest is easy, we just have to pass in the test data frame in the usual way when 00:44:34.560 |
we create our model data object, and then there's no changes through all here, we train 00:44:43.640 |
And then once we finish training it, we can then call predict as per usual passing in 00:44:54.880 |
true to say this is the test set rather than the validation set, and pass that off to Kaggle. 00:45:02.260 |
And so it was really interesting because this was my submission, it got a public score of 00:45:09.860 |
0.103, which would put us somewhere in the 300s on the leaderboard, which looks awful. 00:45:40.400 |
So if you're competing in a Kaggle competition and you haven't thoughtfully created a validation 00:45:47.520 |
set of your own and you're relying on public leaderboard feedback, this could totally happen 00:45:52.480 |
to you, but the other way round, you'll be like, "Oh, I'm in the top 10, I'm doing great!" 00:45:58.200 |
For example, at the moment, the icebergs competition, recognizing icebergs, a very large percentage 00:46:04.800 |
of the public leaderboard set is synthetically generated, augmented data. 00:46:13.040 |
And so your validation set is going to be much more helpful than the public leaderboard 00:46:22.120 |
So our final score here is kind of within statistical noise of the actual 3rd place getters, so I'm 00:46:28.280 |
pretty confident that we've captured their approach. 00:46:40.040 |
Something to mention, there's a nice kernel about Rossmann, quite a few nice kernels 00:46:44.600 |
actually, but you can go back and see, particularly if you're doing the groceries competition, 00:46:48.280 |
go and have a look at the Rossmann kernels, because actually quite a few of them are higher 00:46:51.400 |
quality than the ones for the Ecuadorian Groceries competition. 00:46:55.640 |
One of them, for example, showed how for particular stores, like Store 85, the sales for non-Sundays 00:47:04.320 |
and the sales for Sundays looked very different, whereas there are some other stores where 00:47:09.800 |
the sales on Sunday don't look any different, and you can kind of get a sense of why you might care. 00:47:15.800 |
The one I particularly wanted to point out is the one I think I briefly mentioned, the thing that 00:47:19.360 |
the 3rd place winners, whose approach we used, didn't notice: sales spike just before and after a store closure. 00:47:34.560 |
Just after, it's oh my god, we ran out of eggs. 00:47:39.320 |
And just before, it's oh my god, go and get the milk before the store closes. 00:47:48.300 |
So this 3rd place winner actually deleted all of the closed store rows before they started their analysis. 00:47:57.160 |
So remember how we talked about don't touch your data unless you first of all analyze 00:48:03.400 |
to see whether that thing you're doing is actually okay. 00:48:09.840 |
So in this case, I am sure, I haven't tried it, but I'm sure they would have won otherwise. 00:48:15.840 |
Although there weren't actually any store closures to my knowledge in the test set period, 00:48:21.760 |
the problem is that their model was trying to fit to these really extreme things, and 00:48:27.200 |
because it wasn't able to do it very well, it was going to end up getting a little bit 00:48:32.040 |
It's not going to break the model, but it's definitely going to harm it because it's kind 00:48:35.680 |
of trying to do computations to fit something which it literally doesn't have the data for. 00:48:44.600 |
So that Rossmann model, again, it's nice to kind of look inside to see what's actually going on. 00:48:55.240 |
And so for that Rossmann model, I want to make sure you kind of know how to find your way 00:49:04.080 |
around the code so you can answer these questions for yourself. 00:49:12.240 |
We started out by saying hey, if you want to look at the code for something, you can 00:49:16.520 |
go question mark, question mark like this, and I haven't got this read in, but you can 00:49:24.040 |
use question mark, question mark to get the source code for something. 00:49:29.840 |
But obviously that's not really a great way, because often you look at that source code 00:49:34.680 |
and it turns out you need to look at something else. 00:49:37.280 |
And so for those of you that haven't done much coding, you might not be aware that almost 00:49:42.040 |
certainly the editor you're using probably has the ability to both open up stuff directly 00:49:48.360 |
off SSH and to navigate through it so you can jump straight from place to place. 00:49:55.520 |
So if I want to find ColumnarModelData, and I happen to be using vim here, I can basically 00:50:01.000 |
say :tag ColumnarModelData and it will jump straight to the definition of that class. 00:50:09.040 |
And so then I notice here that it's actually building up a data loader. 00:50:14.320 |
If I hit Ctrl + right square bracket, it will jump to the definition of the thing that was 00:50:19.280 |
under my cursor, and after I finished reading it for a while, I can hit Ctrl + T to jump 00:50:24.760 |
back up to where I came from, and you kind of get the idea. 00:50:29.280 |
If I want to find every usage of ColumnarModelData in this file, I can hit * to 00:50:37.680 |
jump to the next place it's used, and so forth. 00:50:42.200 |
So in this case, get_learner was the thing which actually got the model. 00:50:50.760 |
We want to find out what kind of model it is, and apparently it uses ColumnarModelData's 00:51:05.480 |
get_learner, which uses -- and so here you can see MixedInputModel is the PyTorch 00:51:13.080 |
model, and then it wraps it in the StructuredLearner, which is the fastai Learner type, which wraps the two together. 00:51:24.660 |
So if we want to see the definition of this actual PyTorch model, I can go Ctrl + right square bracket again. 00:51:33.680 |
And so here is the model, and nearly all of this we can now understand. 00:52:02.720 |
In the mixed model that we saw, does it always expect categorical and continuous together? 00:52:15.920 |
And the model data behind the scenes, if there are none of the other type, creates a column of zeros. 00:52:34.600 |
It's kind of ugly and hacky, and we'll hopefully improve it, but you can pass in an empty list 00:52:40.500 |
of categorical or continuous variables to the model data, and it will basically pass 00:52:47.120 |
an unused column of 0s to avoid things breaking. 00:52:54.560 |
I'm leaving fixing some of these slightly hacky edge cases because PyTorch 0.4 as well 00:53:00.920 |
as getting rid of variables, they're going to also add rank 0 tensors, which is to say 00:53:07.480 |
if you grab a single thing out of a rank 1 tensor rather than getting back a number which 00:53:13.840 |
is qualitatively different, you're actually going to get back a tensor that just happens to have rank 0. 00:53:19.960 |
Now it turns out that a lot of this code is going to be much easier to write then, so 00:53:24.400 |
for now it's a little bit more hacky than it needs to be. 00:53:28.960 |
Jeremy, you talked about this a little bit before, but maybe it's a good time at some 00:53:35.360 |
point to talk about how we can write something that is slightly different from what is in the fastai library? 00:53:42.560 |
Yeah, I think we'll cover that a little bit next week, but I'm mainly going to do that in part 2. 00:53:49.920 |
Part 2 is going to cover quite a lot of stuff. 00:53:54.860 |
One of the main things we'll cover in part 2 is what are called generative models, so 00:53:58.480 |
things where the output is a whole sentence or a whole image, but I'll also dig into how 00:54:04.160 |
to really either customize the fastai library or use it on more custom models. 00:54:14.800 |
So if we have time, we'll touch on it a little bit next week. 00:54:20.680 |
So the learner, we were passing in a list of embedding sizes, and as you can see that 00:54:27.280 |
embedding sizes list was literally just the number of rows and the number of columns in each embedding matrix. 00:54:32.360 |
And the number of rows was just coming from literally how many stores are there in the 00:54:38.960 |
store category, for example, and the number of columns was just equal to that cardinality divided by two, capped at 50. 00:54:47.780 |
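A sketch of that rule of thumb, with variable names as I recall them from the notebook (cat_vars is the list of categorical column names, joined the training data frame): one row per category plus one for unknowns, and roughly half the cardinality of columns, capped at 50.

    cat_sz = [(c, len(joined[c].cat.categories) + 1) for c in cat_vars]
    emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]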
So that list of tuples was coming in, and so you can see here how we use it. 00:54:52.240 |
We go through each of those tuples, grab the number of categories and the size of the embedding, and create an embedding layer for each. 00:55:05.240 |
One minor thing, PyTorch-specific thing we haven't talked about before is for it to be 00:55:11.080 |
able to register these -- remember how we kind of said it registers your parameters, it registers the layers you define in the constructor. 00:55:18.280 |
So when we listed the model, it actually printed out the name of each embedding and each bias. 00:55:23.560 |
It can't do that if they're hidden inside a list. 00:55:27.000 |
They have to be an actual nn.module subclass. 00:55:33.440 |
So there's a special thing called an nn.ModuleList which takes a list, and it basically 00:55:38.600 |
says I want you to register everything in here as being part of this model, so that's what we use. 00:55:47.960 |
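A minimal sketch of that constructor pattern (not the full MixedInputModel):

    import torch.nn as nn

    class EmbSketch(nn.Module):
        def __init__(self, emb_szs):
            super().__init__()
            # nn.ModuleList registers every embedding with the model;
            # a plain Python list would leave them invisible to PyTorch
            self.embs = nn.ModuleList([nn.Embedding(c, s) for c, s in emb_szs])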
So our mixed-input model has a list of embeddings, and then I do the same thing for a list of linear layers. 00:55:57.440 |
So when I said here 1000, 500, this is saying how many activations I wanted for each of those layers. 00:56:09.240 |
And so here I just go through that list and create a linear layer that goes from this size to the next size. 00:56:18.920 |
So you can see how easy it is to construct not just your own model, but a model which 00:56:25.560 |
you can pass parameters to have it constructed on the fly dynamically. 00:56:33.200 |
This is initialization; we've mentioned Kaiming initialization before, we mentioned it last week. 00:56:44.440 |
We have here a list of how much dropout to apply to each layer. 00:56:48.360 |
So again here, let's just go through each thing in that list and create a dropout layer for it. 00:56:54.300 |
So this constructor, we understand everything in it except for BatchNorm, which we don't 00:57:00.060 |
have to worry about for now, so that's the constructor. 00:57:03.900 |
And so then the forward, also all stuff we're aware of, goes through each of those embedding 00:57:11.120 |
layers that we just saw, and remember we just treated it like it's a function. 00:57:15.480 |
So call it with the ith categorical variable, and then concatenate them all together, put 00:57:22.040 |
that through dropout, and then go through each one of our linear layers and call it, 00:57:31.080 |
apply ReLU to it, apply dropout to it, and then finally apply the final linear layer. 00:57:37.720 |
And the final linear layer has this as its size, which is here. 00:57:51.240 |
So we're kind of getting to the point where, and then of course at the end, I mentioned 00:57:56.400 |
we'd come back to this, if you passed in a y_range parameter, then we're going to do 00:58:01.920 |
the thing we just learned about last week, which is to use a sigmoid. 00:58:06.200 |
This is a cool little trick not just to make your collaborative filtering better, but in 00:58:11.040 |
this case my basic idea was sales are going to be greater than zero, and probably less than the biggest sale we've seen. 00:58:23.200 |
So I just pass in that as y_range, and so we do a sigmoid and multiply the sigmoid by that range. 00:58:38.800 |
So I actually said maybe the range is between 0 and the highest times 1.2, because maybe 00:58:47.280 |
the next two weeks we have one bigger, but this is again trying to make it a little bit 00:58:51.120 |
easier for it to give us the kind of results that it thinks is right. 00:58:56.160 |
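A small sketch of that output scaling, with a made-up range:

    import torch

    def scale_output(raw, y_range):
        # squash the raw activations with a sigmoid, then stretch into the target range
        lo, hi = y_range
        return torch.sigmoid(raw) * (hi - lo) + lo

    preds = scale_output(torch.randn(4, 1), (0., 40000. * 1.2))   # 40000 here is just an illustrative maximum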
So increasingly, I'd love you all to kind of try to not treat these learners and models 00:59:05.880 |
as black boxes, but to feel like you now have the information you need to look inside them. 00:59:11.320 |
And remember, you could then copy this class, paste it into a cell in a Jupyter 00:59:16.960 |
Notebook and start fiddling with it to create your own versions. 00:59:30.600 |
I think what I might do is we might take a bit of an early break because we've got a 00:59:35.440 |
lot to cover and I want to do it all in one big go. 00:59:38.720 |
So let's take a break until 7.45 and then we're going to come back and talk about recurrent neural networks. 00:59:59.360 |
Before we do, we're going to dig a little bit deeper into SGD, because I just want to 01:00:06.120 |
make sure everybody's totally comfortable with SGD. 01:00:11.200 |
And so what we're going to look at is we're going to look at a Lesson 6 SGD Notebook. 01:00:18.600 |
And we're going to look at a really simple example of using SGD to learn y=ax+b. 01:00:29.040 |
And so what we're going to do here is create the simplest possible model, y=ax+b. 01:00:38.280 |
And then we're going to generate some random data that looks like so. 01:00:44.320 |
So here's our x, and here's our y, we're going to predict y from x. 01:00:50.240 |
And we passed in 3 and 8 as our a and b, so we're going to try and recover that. 01:00:58.720 |
And so the idea is that if we can solve something like this, which has two parameters, we can 01:01:04.640 |
use the same technique to solve something with 100 million parameters without any changes 01:01:18.440 |
So in order to find an a and b that fits this, we need a loss function. 01:01:25.240 |
And this is a regression problem because we have a continuous output. 01:01:29.240 |
So for continuous output regression, we tend to use mean-squared error. 01:01:33.220 |
And obviously all of this stuff, there's implementations in NumPy, implementations in PyTorch, we're 01:01:37.680 |
just doing stuff by hand so you can see all the steps. 01:01:45.620 |
y-hat minus y, squared, then take the mean: there's our mean-squared error. 01:01:49.720 |
So for example, if we had 10 and 5 were our a and b, then there's our mean-squared error. 01:01:58.360 |
So if we've got an a and a b and we've got an x and a y, then our mean-squared error 01:02:01.540 |
loss is just the mean-squared error of our linear predictions and our y. 01:02:14.320 |
And so when we talk about combining linear layers and loss functions and optionally nonlinear 01:02:23.160 |
layers, this is all we're doing, we're putting a function inside a function. 01:02:30.280 |
I know people draw these clever-looking dots and lines all over the screen when they're 01:02:36.200 |
saying this is what a neural network is, but it's just a function of a function of a function. 01:02:40.640 |
So here we've got a prediction function being a linear layer, followed by a loss function 01:02:44.960 |
being MSE, and now we can say, oh, let's just define this as MSE loss and we'll use that 01:02:51.720 |
So there's our loss function, which incorporates our prediction function. 01:02:56.760 |
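Written out, the pieces look roughly along these lines (a sketch matching the structure described, not necessarily the notebook verbatim):

    import numpy as np

    def lin(a, b, x): return a * x + b                       # the y = ax + b model
    def mse(y_hat, y): return ((y_hat - y) ** 2).mean()      # mean squared error
    def mse_loss(a, b, x, y): return mse(lin(a, b, x), y)    # loss of the linear predictions

    x = np.random.uniform(0, 1, 50)
    y = lin(3., 8., x) + np.random.normal(0, 0.1, 50)        # fake data generated with a=3, b=8
    mse_loss(10., 5., x, y)                                  # loss if we guessed a=10, b=5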
So let's generate 10,000 items of fake data, and let's turn them into variables so we can 01:03:02.600 |
use them with PyTorch, because Jeremy doesn't like taking derivatives, so we're going to get PyTorch to do that for us. 01:03:08.480 |
And let's create a random weight for A and for B, so a single random number. 01:03:14.160 |
And we want the gradients of these to be calculated as we start computing with them, because these 01:03:19.160 |
are the actual things we need to update in our SGD. 01:03:30.840 |
So let's pick a learning rate, and let's do 10,000 epochs of SGD. 01:03:40.200 |
In fact, this isn't really SGD, it's not Stochastic Gradient Descent, this is actually full gradient 01:03:44.180 |
descent, each loop is going to look at all of the data. 01:03:51.600 |
Stochastic Gradient Descent would be looking at a subset each time. 01:03:56.540 |
So to do gradient descent, we basically calculate the loss. 01:04:00.400 |
So remember, we've started out with a random A and B, and so this is going to compute some 01:04:07.480 |
And it's nice from time to time, so one way of saying from time to time is if the epoch 01:04:12.760 |
number mod 1000 is 0, so every 1000 epochs, just print out the loss, see how we're doing. 01:04:21.560 |
So now that we've computed the loss, we can compute our gradients. 01:04:25.920 |
And so remember, this thing here is both a number, a single number that is our loss, something 01:04:32.840 |
we can print, but it's also a variable because we passed variables into it, and therefore 01:04:37.960 |
it also has a method .backward, which means calculate the gradients of everything that 01:04:43.880 |
we asked it to, everything that we said requires grad equals true. 01:04:48.640 |
So at this point, we now have a .grad property inside A and inside B, and here they are, 01:05:01.760 |
So now that we've calculated the gradients for A and B, we can update them by saying 01:05:06.560 |
A is equal to whatever it used to be minus the learning rate times the gradient. 01:05:14.240 |
Update data, because A is a variable, and a variable contains a tensor in its .data property, 01:05:22.720 |
and again this is going to disappear in PyTorch 0.4, but for now it's actually the tensor 01:05:30.360 |
So update the tensor inside here with whatever it used to be minus the learning rate times 01:05:38.680 |
And that's basically it, that's basically all gradient descent is. 01:05:47.320 |
There's one extra step in PyTorch, which is that you might have multiple different loss 01:05:53.400 |
functions or lots of output layers all contributing to the gradient, and you have to add them 01:06:01.760 |
And so if you've got multiple loss functions, you could be calling loss.backward on each 01:06:06.560 |
of them, and what it does is it adds it to the gradients. 01:06:10.640 |
And so you have to tell it when to set the gradients back to 0. 01:06:15.080 |
So that's where you just set A's gradients to 0 and set B's gradients to 0. 01:06:23.000 |
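Putting the whole loop together, roughly as described above (a sketch using the pre-0.4 Variable API that the lesson is based on):

    import torch
    from torch.autograd import Variable

    x = Variable(torch.rand(10000))
    y = 3 * x + 8 + 0.1 * Variable(torch.randn(10000))   # fake data with a=3, b=8
    a = Variable(torch.randn(1), requires_grad=True)
    b = Variable(torch.randn(1), requires_grad=True)
    lr = 1e-3

    for t in range(10000):
        loss = ((a * x + b - y) ** 2).mean()
        if t % 1000 == 0: print(loss.data)
        loss.backward()                 # fills in a.grad and b.grad
        a.data -= lr * a.grad.data
        b.data -= lr * b.grad.data
        a.grad.data.zero_()             # gradients accumulate, so reset them each step
        b.grad.data.zero_()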
And so this is wrapped up inside the optim.SGD class. 01:06:31.560 |
So when we use optim.SGD and we just say .step(), it's just doing these updates for us. 01:06:37.800 |
And so when we say .zero_grad(), it's just doing this for us. 01:06:42.140 |
And this underscore here, pretty much every function that applies to a tensor in PyTorch, 01:06:50.080 |
if you stick an underscore on the end, it means do it in place. 01:06:53.000 |
So this is actually going to not return a bunch of zeros, but it's going to change this tensor in place. 01:07:03.720 |
We can look at the same thing without PyTorch, which means we actually do have to do some calculus ourselves. 01:07:11.840 |
So if we generate some fake data, again, we're just going to create 50 data points this time 01:07:21.040 |
And so let's create a function called update, we're just going to use NumPy, no PyTorch. 01:07:26.840 |
So our predictions are equal to the linear function, and in this case we're actually going to calculate the derivatives by hand. 01:07:33.200 |
So the derivative of the squared loss is just 2 times the error, and the derivative with 01:07:38.280 |
respect to a just has an extra factor of x; you can confirm that yourself if you want to. 01:07:42.420 |
And so here we're going to update a minus equals learning rate times the derivative 01:07:47.960 |
of the loss with respect to a, and for b it's the learning rate times the derivative with respect to b. 01:07:56.120 |
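A sketch of that update step in pure NumPy: the derivative of (a*x + b - y)^2 with respect to b is 2*(a*x + b - y), and with respect to a there's an extra factor of x.

    def update(a, b, x, y, lr):
        y_hat = a * x + b
        dldb = 2 * (y_hat - y)          # d(loss)/db for each data point
        dlda = dldb * x                 # d(loss)/da for each data point
        return a - lr * dlda.mean(), b - lr * dldb.mean()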
And so what we can do -- let's just run all this. 01:08:01.640 |
So just for fun, rather than looping through manually, we can use matplotlib's FuncAnimation 01:08:08.320 |
command to run the animate function a bunch of times, and the animate function is going 01:08:15.920 |
to run 30 epochs, and at the end of each epoch it's going to print out on the plot where 01:08:22.920 |
the line currently is, and that creates this little movie. 01:08:27.840 |
So you can actually see the line moving into place. 01:08:32.520 |
So if you want to play around with understanding how PyTorch gradients actually work, step through this notebook. 01:08:46.560 |
And it's kind of weird to say that's it, like when you're optimizing 100 million parameters 01:08:54.200 |
in a neural net, it's doing the same thing, but it actually is. 01:08:58.200 |
You can actually look at the PyTorch code and see this is it, there's no trick. 01:09:05.200 |
Well we learned a couple of minor tricks last time, which was like momentum and Adam, but 01:09:12.760 |
if you can do it in Excel you can do it in Python. 01:09:16.680 |
So let's now talk about RNNs, so we're now in lesson 6 RNN notebook. 01:09:27.320 |
And we're going to study Nietzsche, as you should. 01:09:35.480 |
So Nietzsche says supposing that truth is a woman, apparently all philosophers have failed to understand women. 01:09:47.400 |
So apparently at the point that Nietzsche was alive, there were no female philosophers, 01:09:51.160 |
or at least those that were around didn't understand women either. 01:09:54.540 |
So anyway, this is the philosopher apparently we've chosen to study. 01:10:00.600 |
He's actually not as bad as people think, but it's a different era I guess. 01:10:07.240 |
So we're going to learn to write philosophy like Nietzsche. 01:10:14.320 |
And so we're going to do it one character at a time. 01:10:17.440 |
So this is like the language model that we did in lesson 4 where we did it a word at 01:10:21.240 |
a time, but this time we're going to do it a character at a time. 01:10:25.880 |
And so the main thing I'm going to try and convince you is that an RNN is no different 01:10:33.960 |
And so to show you that, we're going to build it from plain PyTorch layers, all of which 01:10:43.240 |
And eventually we're going to use something really complex, which is a for loop. 01:10:47.280 |
So that's when we're going to make it really sophisticated. 01:10:49.920 |
So the basic idea of RNNs is that you want to keep track of state over long-term dependencies. 01:11:00.720 |
So for example, if you're trying to model something like this template language, then 01:11:08.200 |
once you've opened a {% comment %} tag, you eventually need a closing {% endcomment %} tag. 01:11:13.800 |
And so somehow your model needs to keep track of the fact that it's inside a comment over 01:11:21.520 |
So this is this idea of state, it needs kind of memory. 01:11:25.280 |
And this is quite a difficult thing to do with just a ConvNet. 01:11:30.440 |
It turns out to be possible, but it's a little bit tricky. 01:11:36.520 |
Whereas with an RNN, it turns out to be pretty straightforward. 01:11:41.860 |
You want a stateful representation where you're keeping track of where are we now, have memory, 01:11:47.040 |
have long-term dependencies, and potentially even have variable length sequences -- these 01:11:53.160 |
are all difficult things to do with ConvNets -- they're very straightforward with RNNs. 01:11:59.040 |
So for example, SwiftKey a year or so ago did a blog post about how they had a new language 01:12:05.720 |
model where they basically said, of course this is what their neural net looks like. 01:12:14.200 |
Somehow they always looked like this on the internet. 01:12:17.400 |
You've got a bunch of words and it's basically going to take your particular words in their 01:12:21.440 |
particular orders and try and figure out what the next word's going to be, which is to say 01:12:27.120 |
They actually have a pretty good language model. 01:12:28.520 |
If you've used SwiftKey, they seem to do better predictions than anybody else still. 01:12:34.440 |
Another cool example was Andrej Karpathy a couple of years ago showed that he could use character 01:12:39.240 |
level RNN to actually create an entire LaTeX document. 01:12:44.720 |
So he didn't actually tell it in any way what LaTeX looks like, he just passed in some LaTeX 01:12:51.440 |
text like this and said generate more LaTeX text, and it literally started writing something 01:12:56.400 |
which means about as much to me as most math papers do. 01:13:03.960 |
So we're going to start with something that's not an RNN, and I've got to introduce Jeremy's 01:13:11.760 |
patented neural network notation involving boxes, circles and triangles. 01:13:24.000 |
A rectangle is an input, an arrow is a layer, a circle -- in fact every shape is a bunch of activations. 01:13:41.320 |
The rectangle is the input activations, the circle is the hidden activations, and a triangle 01:13:51.960 |
An arrow is a layer operation, or possibly more than one. 01:13:58.200 |
So here my rectangle is an input of number of rows equal to batch size and number of 01:14:04.600 |
columns equal to the number of inputs, number of variables. 01:14:08.900 |
And so my first arrow, my first operation, is going to represent a matrix product followed 01:14:14.480 |
by a ReLU, and that's going to generate a set of activations. 01:14:20.680 |
As for activations: an activation is a number, a number that's 01:14:28.000 |
being calculated by a ReLU or a matrix product or whatever, it's just a number. 01:14:34.740 |
So this circle here represents a matrix of activations. 01:14:39.480 |
All of the numbers that come out when we take the inputs, we do a matrix product followed 01:14:44.720 |
So we started with batch size by number of inputs, and so after we do this matrix operation, 01:14:49.920 |
we now have batch size by whatever the number of columns in our weight matrix was, which is the number of hidden activations. 01:15:00.380 |
And so if we now take these activations, which is a matrix, and we put it through another 01:15:05.480 |
operation, in this case another matrix product, and a softmax, we get a triangle that's our 01:15:10.560 |
output activations, another matrix of activations, and again, number of rows is batch size, number 01:15:16.160 |
of columns is equal to the number of classes, however many columns our matrix and this matrix 01:15:22.600 |
So that's a neural net, that's our basic one hidden layer neural net. 01:15:34.160 |
If you haven't written one of these from scratch, try it. 01:15:39.160 |
And in fact, in lessons 9, 10 and 11 of the machine learning course, we do this, we create 01:15:46.720 |
So if you're not quite sure how to do it, you can check out the machine learning course. 01:15:51.120 |
In general the machine learning course is much more like building stuff up from the foundations, 01:15:55.880 |
whereas this course is much more like best practices kind of top down. 01:16:02.800 |
So if we were doing a ConvNet with a single dense hidden layer, our input would be equal 01:16:09.240 |
to, as in PyTorch, number of channels by height by width, and notice that here batch size appeared 01:16:18.680 |
every time, so I'm not going to write it anymore. 01:16:26.160 |
Also the activation function, it's always basically ReLU or something similar for all 01:16:30.840 |
the hidden layers and softmax at the end for classification, so I'm not going to write 01:16:36.760 |
In each picture I'm going to simplify it a little bit. 01:16:40.640 |
So I'm not going to mention that batch size is still there, and we're not going to mention the ReLU and softmax either. 01:16:45.680 |
So here's our input, and so in this case rather than a matrix product, we'll do a convolution, 01:16:51.040 |
a stride 2 convolution, so we'll skip over every second one, or it could be a convolution followed by a max pool. 01:16:59.880 |
In either case, we end up with something which is replace number of channels with number of 01:17:04.320 |
filters, and we have now height divided by 2 and width divided by 2, and then we can flatten that out. 01:17:13.880 |
We'll talk next week about the main way we do that nowadays, which is basically to do 01:17:17.800 |
something called an adaptive max pooling, where we basically get an average across the 01:17:23.480 |
height and the width, and turn that into a vector. 01:17:27.800 |
Anyway, somehow we flatten it out into a vector, we can do a matrix product, or a couple of 01:17:34.240 |
matrix products, which is what we actually tend to do in fastai, so that'll be our fully connected layer with some number of activations. 01:17:41.640 |
The final matrix product gives us some number of classes. 01:17:45.920 |
So this is our basic component, remembering, rectangle is input, circle is hidden, triangle 01:17:52.840 |
is output, all of the shapes represent a tensor of activations, all of the arrows represent 01:18:05.080 |
So now let's jump to the first one that we're going to actually try to create for NLP. 01:18:12.880 |
And we're going to basically do exactly the same thing as here, and we're going to try 01:18:17.760 |
and predict the third character in a three-character sequence based on the previous two characters. 01:18:25.080 |
So our input, and again remember, we've removed the batch size dimension, we're not saying 01:18:34.240 |
it but it's still here, and also here I've removed the names of the layer operations 01:18:43.640 |
So for example, our first input would be the first character of each string in our mini-batch, 01:18:52.760 |
and assuming this is one-hot encoded, then the width is just however many items there 01:18:59.160 |
are in the vocabulary, how many unique characters could we have. 01:19:03.400 |
We probably won't really one-hot encode it, we'll feed it in as an integer and pretend 01:19:07.920 |
it's one-hot encoded by using an embedding layer, which is mathematically identical. 01:19:12.840 |
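To see why an embedding lookup is mathematically identical to one-hot encoding followed by a matrix product, here is a tiny check with made-up sizes:

    import torch
    import torch.nn as nn

    vocab_size, n_fac = 5, 3
    emb = nn.Embedding(vocab_size, n_fac)

    idx = torch.tensor([2])                        # a character fed in as an integer
    one_hot = torch.zeros(1, vocab_size)
    one_hot[0, 2] = 1.0                            # the same character, one-hot encoded

    looked_up = emb(idx)                           # grab row 2 of the embedding matrix
    multiplied = one_hot @ emb.weight              # multiply the one-hot vector by that matrix
    print(torch.allclose(looked_up, multiplied))   # True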
And then that's going to give us some activations which we can stick through a fully connected 01:19:18.080 |
layer, so we put that through a fully connected layer to get some activations. 01:19:27.760 |
We can then put that through another fully connected layer, and now we're going to bring 01:19:34.760 |
So the character 2 input will be exactly the same dimensionality as the character 1 input, 01:19:40.920 |
and we now need to somehow combine these two arrows together. 01:19:44.580 |
So we could just add them up, for instance, because remember this arrow here represents 01:19:51.320 |
a matrix product, so this matrix product is going to spit out the same dimensionality 01:19:57.820 |
So we could just add them up to create these activations. 01:20:02.440 |
And so now we can put that through another matrix product, and of course remember all 01:20:05.520 |
these matrix products have a value as well, and this final one will have a softmax instead 01:20:15.480 |
So it's a standard neural net with 2 hidden layers, or I guess it's actually 3 matrix products. 01:18:26.160 |
This first one is coming through an embedding layer. 01:20:29.160 |
The only difference is that we've also got a second input coming in here that we're just 01:20:34.840 |
adding in, but it's kind of conceptually identical. 01:20:46.320 |
So I'm not going to use torchtext, I'm going to try not to use almost any fastai, so we can see everything built up from scratch. 01:20:54.680 |
So here's the first 400 characters of the collected works. 01:20:59.160 |
Let's grab a set of all of the letters that we see there and sort them. 01:21:06.000 |
And so a set creates all the unique letters, so we've got 85 unique letters in our vocab. 01:21:14.180 |
It's nice to put an empty null or some kind of padding character in there for padding, 01:21:19.000 |
so we're going to put a padding character at the start. 01:21:31.880 |
So as per usual, we want some way to map every character to a unique ID and every unique 01:21:42.440 |
And so now we can just go through our collected works of Nietzsche and grab the index of each 01:21:49.340 |
one of those characters, so now we've just turned it into this. 01:22:08.780 |
And just to confirm, we can now take each of those indexes and turn them back into characters 01:22:14.880 |
and join them together, and yeah, there it is. 01:22:19.840 |
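Roughly, the mapping back and forth looks like this; the file path is illustrative:

    text = open('nietzsche.txt').read()
    chars = sorted(list(set(text)))            # every unique character that appears
    chars.insert(0, '\0')                      # reserve index 0 for a padding character
    vocab_size = len(chars)

    char_indices = {c: i for i, c in enumerate(chars)}   # character -> unique ID
    indices_char = {i: c for i, c in enumerate(chars)}   # unique ID -> character

    idx = [char_indices[c] for c in text]                # the whole corpus as integers
    print(''.join(indices_char[i] for i in idx[:70]))    # round-trips back to the original text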
So from now on we're just going to work with this IDX list, the list of character numbers 01:22:29.080 |
So Jeremy, why are we doing a model of characters and not a model of words? 01:22:38.800 |
With a vocab of 80-ish items, we can kind of see it better. 01:22:45.620 |
Character level models turn out to be potentially quite useful in a number of situations, but 01:22:53.640 |
The short answer is, you generally want to combine both a word level model and a character 01:22:58.800 |
level model, like if you're doing translation, it's a great way to deal with unusual words 01:23:06.060 |
Anytime you see a word you haven't seen before, you could use a character level model for 01:23:11.200 |
And there's actually something in between the two called a byte pair encoding, BPE, which 01:23:15.360 |
basically looks at little n-grams of characters, but we'll cover all that in Part 2. 01:23:23.000 |
If you want to look at it right now, then Part 2 of the existing course already has 01:23:30.280 |
And Part 2 of the version 1 of this course, all the NLP stuff is in PyTorch, by the way, 01:23:41.000 |
It was actually the thing that inspired us to move to PyTorch, because trying to do it in our previous framework was so painful. 01:23:52.480 |
We're actually going to do something slightly different to what I said. 01:23:54.560 |
We're actually going to try and predict the fourth character using the first 3, so the 01:24:01.680 |
index 3 character using the index 0, 1, and 2. 01:24:05.720 |
So we're going to do exactly the same thing, but with just a couple more layers. 01:24:10.060 |
So that means that we need a list of the 0th, 1st, 2nd, and 3rd characters, so I'm just 01:24:19.480 |
getting every character starting from 0, from 1, from 2, and from 3, skipping over 3 at a time. 01:24:29.200 |
So we're going to predict the 4th character from the first 3. 01:24:52.540 |
So we can just use np.stack to pop them together. 01:24:58.140 |
So here's the 0, 1, and 2 characters that are going to feed into our model, and then here's the character we want to predict. 01:25:22.620 |
So you can see for example, the very first item would be 40, 42, and 29, so that's characters 01:25:37.700 |
And then we'd be predicting 30, that's the 4th character, which is the start of the next row. 01:25:46.340 |
So 30, 25, 27, we need to predict 29, which is the start of the next row, and so forth. 01:25:54.420 |
So we're always using 3 characters to predict the 4th. 01:25:59.980 |
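Building those inputs and labels could look roughly like this, assuming the idx list from above:

    import numpy as np

    cs = 3   # use three characters to predict the fourth
    c1 = [idx[i]     for i in range(0, len(idx) - cs, cs)]
    c2 = [idx[i + 1] for i in range(0, len(idx) - cs, cs)]
    c3 = [idx[i + 2] for i in range(0, len(idx) - cs, cs)]
    c4 = [idx[i + 3] for i in range(0, len(idx) - cs, cs)]

    x1, x2, x3 = np.stack(c1), np.stack(c2), np.stack(c3)   # the three input columns
    y = np.stack(c4)                                        # the character we want to predict
    print(x1[:2], x2[:2], x3[:2], y[:2])                    # e.g. a row like (40, 42, 29) -> 30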
So there are 200,000 of these that we're going to try and model. 01:26:07.300 |
So we're going to build this model, which means we need to decide how many activations. 01:26:15.780 |
So I'm going to use 256, and we need to decide how big our embeddings are going to be. 01:26:21.660 |
And so I decided to use 42, so about half the number of characters I have. 01:26:27.180 |
And you can play around with these to see if you can come up with better numbers, it's 01:26:37.260 |
And so here is the full version, so predicting character 4 using characters 1, 2, and 3. 01:26:43.940 |
As you can see, it's the same picture as the previous page, but I put some very important 01:26:50.180 |
All the arrows of the same color are going to use the same matrix, the same weight matrix. 01:26:57.740 |
So all of our input embeddings are going to use the same matrix. 01:27:03.220 |
All of our layers that go from one layer to the next are going to use the same orange 01:27:09.520 |
arrow weight matrix, and then our output will have its own matrix. 01:27:15.020 |
So we're going to have 1, 2, 3 weight matrices. 01:27:20.260 |
And the idea here is the reason I'm not going to have a separate one for everything here 01:27:24.880 |
is that why would semantically a character have a different meaning depending if it was 01:27:30.860 |
the first or the second or the third item in a sequence? 01:27:33.940 |
It's not like we're even starting every sequence at the start of a sentence, we just arbitrarily chopped the text up. 01:27:39.900 |
So you would expect these to all have the same kind of conceptual mapping. 01:27:44.620 |
And when we're moving from character_0 to character_1, to kind of say build up some 01:27:49.780 |
state here, why would that be any different kind of operation to moving from character 1 to character 2? 01:27:57.940 |
So let's create a 3-character model, and so we're going to create one linear layer for 01:28:05.140 |
our green arrow, one linear layer for our orange arrow, and one linear layer for our output arrow. 01:28:15.860 |
So the embedding is going to bring in something of size, whatever it was, 84, and spit out 01:28:22.180 |
something with a number of factors in the embedding, we'll then put that through a linear 01:28:26.980 |
layer, and then we've got our hidden layers, we've got our output layer. 01:28:31.820 |
So when we call forward, we're going to be passing in 1, 2, 3 characters. 01:28:39.300 |
So for each one, we'll stick it through an embedding, we'll stick it through a linear layer and a ReLU. 01:28:45.660 |
So we'll do it for character_1, character_2, and character_3. 01:28:53.060 |
Then I'm going to create this circle of activations 01:29:05.620 |
here, and that matrix I'm going to call h, so it's going to be equal to my input activations 01:29:16.620 |
after going through the embedding, the linear layer and the ReLU, and then I'm going 01:29:21.380 |
to apply this l_hidden, so the orange arrow, and that's going to get me to here. 01:29:33.100 |
And then to get to the next one, I need to apply the same thing, and apply the orange 01:29:39.820 |
But I also have to add in this second input, so take my second input and add in my previous 01:29:52.500 |
I don't really see how these dimensions are the same, for h and in2. 01:30:11.260 |
self.e is going to give us something of length 42, and then it's going to go through l_in, which maps it to size n_hidden. 01:30:24.820 |
And so then we're going to pass that, which is now size nhidden, through this, which is 01:30:33.220 |
also going to return something of size nhidden. 01:30:36.500 |
So it's really important to notice that this is square, this is a square weight matrix. 01:30:42.600 |
So we now know that this is of size nhidden, so in2 is going to be exactly the same size as h. 01:30:50.000 |
So we can now sum together two sets of activations, both of size nhidden, passing it into here, 01:30:58.180 |
and again it returns something of size nhidden. 01:31:00.260 |
So basically the trick was to make this a square matrix, and to make sure that its square 01:31:04.500 |
matrix was the same size as the output of this hidden layer. 01:31:13.820 |
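Putting those pieces together, a sketch of the model being described might look like this; it is close to, but not necessarily identical to, the notebook's version:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Char3Model(nn.Module):
        def __init__(self, vocab_size, n_fac, n_hidden):
            super().__init__()
            self.e = nn.Embedding(vocab_size, n_fac)        # green arrow: character -> features
            self.l_in = nn.Linear(n_fac, n_hidden)
            self.l_hidden = nn.Linear(n_hidden, n_hidden)   # orange arrow: the square matrix
            self.l_out = nn.Linear(n_hidden, vocab_size)    # blue arrow: output layer

        def forward(self, c1, c2, c3):
            in1 = F.relu(self.l_in(self.e(c1)))
            in2 = F.relu(self.l_in(self.e(c2)))
            in3 = F.relu(self.l_in(self.e(c3)))
            h = torch.tanh(self.l_hidden(in1))              # first hidden state
            h = torch.tanh(self.l_hidden(h + in2))          # add in the second character
            h = torch.tanh(self.l_hidden(h + in3))          # and the third
            return F.log_softmax(self.l_out(h), dim=-1)

Because l_hidden maps n_hidden to n_hidden, h and each input activation have the same shape, which is what makes the addition work.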
I don't like it when I have three bits of code that look identical, and then three bits 01:31:30.460 |
of code that look nearly identical but aren't quite because it's harder to refactor. 01:31:35.040 |
So I'm going to make h into a bunch of zeros, so that I can then put h here too, and then these three lines become identical. 01:31:48.520 |
So the hugely complex trick that we're going to do very shortly is to replace these three lines with a for loop. 01:32:02.260 |
That's going to be the for loop, or actually 0, 1 and 2. 01:32:05.180 |
At that point we'll be able to call it a recurrent neural network. 01:32:11.700 |
So we create that model, so we can now just use the same columnar model data class that 01:32:26.660 |
we've used before, and if we use from_arrays, then it's basically going to spit back the exact arrays we give it. 01:32:34.420 |
So if we stack together those three arrays, then it's going to feed us those three things to our forward method. 01:32:41.660 |
So if you want to play around with training models using as raw an approach as possible 01:32:51.100 |
but without writing lots of boilerplate, this is kind of how to do it. 01:32:54.440 |
Use ColumnarModelData.from_arrays, and then whatever you pass in here, you'll get back in your forward method. 01:33:06.800 |
So I've passed in three things, which means I'm going to get sent three things. 01:33:14.540 |
Batch size 512, because this data is tiny so I can use a bigger batch size. 01:33:20.440 |
So I'm not using really much fast.ai stuff at all, I'm using fast.ai stuff just to save 01:33:26.300 |
me fiddling around with data loaders and data sets and stuff, but I'm actually going to create 01:33:30.180 |
a standard PyTorch model, I'm not going to create a learner. 01:33:33.860 |
So this is a standard PyTorch model, and because I'm using PyTorch, that means I have to remember to move it onto the GPU with .cuda(). 01:33:43.660 |
So here is how we can look inside at what's going on. 01:33:50.700 |
So we can say iter(md.trn_dl), the training data loader, to grab the iterator to iterate through the training set. 01:33:58.160 |
We can then call next on that to grab a mini-batch, and that's going to return all of our x's and 01:34:04.700 |
our y tensor, and so we can then take a look at x's, for example. 01:34:14.260 |
And so you would expect, have a think about what you would expect for this length, 3, not 01:34:21.460 |
surprisingly because these are the three things. 01:34:24.940 |
And so then xs0, not surprisingly, is of length 512, and it's not actually one hot encoded 01:34:37.860 |
because we're using embedding to pretend it is. 01:34:41.180 |
And so then we can use the model as if it's a function by passing to it the Variable-ized versions of those tensors. 01:34:50.760 |
And so have a think about what you would expect to be returned here. 01:34:55.940 |
So not surprisingly, we had a mini-batch of 512, so we still have 512. 01:35:00.240 |
And then 85 is the probability of each of the possible vocab items, and of course we've 01:35:05.380 |
got the log of them, because that's kind of what we do in PyTorch. 01:35:11.560 |
So that's how you can look inside, so you can see here how to do everything really very 01:35:19.020 |
So we can create an optimizer, again using standard PyTorch. 01:35:23.160 |
So with PyTorch, when you use a PyTorch optimizer, you have to pass in a list of the things to 01:35:28.300 |
optimize, and so if you call m.parameters, that will return that list for you. 01:35:41.020 |
And so we don't have learning rate finders and SGDR and all that stuff because we're 01:35:46.460 |
not using a learner, so we'll have to manually do learning rate annealing, setting the learning rate down ourselves as we go. 01:35:58.380 |
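In plain PyTorch, creating the optimizer and annealing the learning rate by hand could look like this; the lecture itself uses fastai's fit and learning-rate helpers, and vocab_size, n_fac and n_hidden are the sizes mentioned above:

    import torch.optim as optim

    m = Char3Model(vocab_size, n_fac=42, n_hidden=256)
    opt = optim.Adam(m.parameters(), lr=1e-2)   # pass in the list of things to optimize

    # ... train for a while, then turn the learning rate down manually:
    for param_group in opt.param_groups:
        param_group['lr'] = 1e-3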
And so now we can write a little function to test this thing out. 01:36:04.380 |
So here's something called get_next, where we can pass in 3 characters, like 'y. ', and so 01:36:17.600 |
I can then go through and turn that into a tensor of an array of the character index 01:36:27.260 |
So basically turn those into the integers, variables, pass that to our model, and then 01:36:34.740 |
we can do an argmax on that to grab which character number is it. 01:36:40.260 |
And in order to do stuff in NumPy land, I use to_np to turn that variable into a NumPy array. 01:36:46.500 |
And then I can return that character, and so for example a capital T is what it thinks 01:36:51.040 |
would be reasonable after seeing 'y', full stop, space; that seems like a very reasonable way to continue. 01:36:58.420 |
Given 'ppl' it predicts 'e', that sounds reasonable; given space-t-h it predicts 'e', that sounds reasonable; and given 'and' it predicts a space. 01:37:06.060 |
So it seems to have created something sensible. 01:37:11.020 |
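A rough paraphrase of that test function, using the chars and char_indices built earlier and skipping the Variable wrapping that the 0.3-era notebook needs:

    import numpy as np
    import torch

    def get_next(inp):
        idxs = np.array([char_indices[c] for c in inp])      # characters -> integers
        t = torch.from_numpy(idxs)
        p = m(*(t[i][None] for i in range(len(inp))))        # call the model like a function
        i = int(p.argmax(dim=-1))                            # index of the most likely next character
        return chars[i]

    print(get_next('y. '))   # e.g. a capital letter, which seems reasonable after a full stop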
So the important thing to note here is our character model is a totally standard fully connected network. 01:37:22.780 |
The only slightly interesting thing we did was to do this addition of each of the inputs 01:37:34.420 |
But there's nothing new conceptually here, we're training it in the usual way. 01:37:50.460 |
So an RNN is when we do exactly the same thing that we did here, but I could draw this more 01:38:04.900 |
simply by saying, you know what, if we've got a green arrow going to a circle, let's 01:38:09.940 |
not draw a green arrow going to a circle again and again and again, but let's just draw it 01:38:15.900 |
So this is exactly the same picture as this one. 01:38:32.300 |
And so you just have to say how many times to go around this circle. 01:38:35.420 |
So in this case, if we want to predict character number n from characters 1 through n-1, then 01:38:40.660 |
we can take the character 1 input, get some activations, feed that to some new activations 01:38:45.540 |
that go through, remember orange is the hidden to hidden weight matrix, and each time we'll 01:38:51.460 |
also bring in the next character of input through its embeddings. 01:38:57.060 |
So that picture and that picture are two ways of writing the same thing. 01:39:03.820 |
But this one is more flexible because rather than me having to say let's do it for 8, I 01:39:07.860 |
don't have to draw 8 circles, I can just say, oh just repeat this. 01:39:16.220 |
So I could simplify this a little bit further by saying, you know what, rather than having 01:39:21.820 |
this thing as a special case, let's actually start out with a bunch of zeros and then let's bring this first input inside the loop as well. 01:39:37.980 |
So I was wondering, if you can explain a little bit better, why are you reusing those... 01:39:47.780 |
You kind of seem to be reusing the same weight matrices. 01:39:53.500 |
Maybe this is kind of similar to what we did in convolutional units, like somehow... 01:39:58.300 |
No, I don't think so, at least not that I can see. 01:40:02.060 |
So the idea is just kind of semantically speaking, like this arrow here is saying take a character 01:40:17.340 |
of input and represent it as some set of features. 01:40:24.340 |
And this arrow is saying the same thing, take some character and represent it as a set of 01:40:29.380 |
So why would the 3 be represented with different weight matrices? 01:40:36.420 |
And this orange arrow is saying transition from character 0's state to character 1's state 01:40:49.740 |
Why would the transition from character 0 to 1 be different from character 1 to 2? 01:40:55.140 |
So the idea is to say, hey, if it's doing the same conceptual thing, let's use the exact same weight matrix. 01:41:07.260 |
My comment on convolutional neural networks is that a filter also can apply to multiple parts of the image. 01:41:15.380 |
So you're saying a convolution is almost like a kind of a special dot product with shared weights. 01:41:24.420 |
And in fact, one of our students actually wrote a good blog post about that last year. 01:41:29.940 |
Okay, I totally see where you're coming from and I totally agree with you. 01:41:40.940 |
So this time we're going to do 8 characters, 8 c's. 01:41:47.900 |
And so let's create a list of every 8th character from 0 through 7, and then our outputs will 01:41:55.020 |
be the next character, and so we can stack that together. 01:42:08.740 |
So for example, after this series of 8 characters, so this is characters 0 through 8, this is 01:42:17.860 |
characters 1 through 9, this is 2 through 10, these are all overlapping. 01:42:24.060 |
So after characters 1, 0 through 8, this is going to be the next one. 01:42:29.460 |
And then after these characters, this will be the next one. 01:42:32.820 |
So you can see that this one here has 43 as its y value, because after those, the next character is 43. 01:42:44.740 |
So this is the first 8 characters, this is 2 through 9, 3 through 10, and so forth. 01:42:51.060 |
So these are overlapping groups of 8 characters, and then this is the next one along. 01:43:06.320 |
So again, we use fromArrays to create a model data class. 01:43:10.540 |
And so you'll see here we have exactly the same code as we had before. 01:43:14.980 |
Here's our embedding, linear, hidden, output, these are literally identical. 01:43:21.780 |
And then we've replaced our ReLU of the linear layer of the embedding with something that's 01:43:28.380 |
inside a loop, and then we've replaced the self.l_hidden thing, also inside the loop. 01:43:44.180 |
I just realized I didn't mention last time the use of the hyperbolic tan. 01:43:49.780 |
Hyperbolic tan looks like this, so it's just a sigmoid that's offset. 01:44:00.220 |
And it's very common to use a hyperbolic tan inside this state-to-state transition because 01:44:06.700 |
it kind of stops it from flying off too high or too low. 01:44:14.020 |
Back in the old days, we used to use hyperbolic tan or the equivalent sigmoid a lot as most of our activation functions. 01:44:23.540 |
Nowadays we tend to use ReLU, but in these hidden-state transition matrices, we still tend to use tanh. 01:44:35.580 |
So you'll see I've done that also here, hyperbolic tan. 01:44:41.900 |
So this is exactly the same as before, but I've just replaced it with a for loop. 01:44:49.740 |
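Refactored with the for loop, a sketch of that model could look like this; torch.tanh is the hyperbolic tan just mentioned, and the structure mirrors the earlier sketch:

    import torch, torch.nn as nn, torch.nn.functional as F

    class CharLoopModel(nn.Module):
        def __init__(self, vocab_size, n_fac, n_hidden):
            super().__init__()
            self.n_hidden = n_hidden
            self.e = nn.Embedding(vocab_size, n_fac)
            self.l_in = nn.Linear(n_fac, n_hidden)
            self.l_hidden = nn.Linear(n_hidden, n_hidden)
            self.l_out = nn.Linear(n_hidden, vocab_size)

        def forward(self, *cs):                         # now takes 8 characters
            bs = cs[0].size(0)
            h = torch.zeros(bs, self.n_hidden)          # start the hidden state at zeros
            for c in cs:
                inp = F.relu(self.l_in(self.e(c)))
                h = torch.tanh(self.l_hidden(h + inp))  # add the input state to the hidden state
            return F.log_softmax(self.l_out(h), dim=-1)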
So does it have to do anything with convergence of these networks? 01:44:56.900 |
We'll talk about that a little bit over time. 01:45:02.900 |
For now, we're not really going to do anything special at all, recognizing this is just a 01:45:11.380 |
Mainly it's quite a deep one, because this is actually this, but we've got 8 of these 01:45:19.560 |
things now, we've now got a deep 8-layer network, which is why Yannet is starting to suggest we might have some trouble. 01:45:26.900 |
As we get deeper and deeper networks, they can be harder and harder to train. 01:45:41.140 |
As before, we've got a batch size of 512, we're using Adam, and away it goes. 01:45:49.220 |
So we won't sit there watching it, so we can then set the learning rate back down to 01:45:54.300 |
1e-3, we can fit it again, and it seems to be training fine. 01:46:04.500 |
But we're going to try something else, which is we're going to use the trick that Yannette 01:46:08.660 |
rather hinted at before, which is maybe we shouldn't be adding these things together. 01:46:13.780 |
And so the reason you might want to be feeling a little uncomfortable about adding these 01:46:17.940 |
things together is that the input state and the hidden state are kind of qualitatively 01:46:28.340 |
The input state is the encoding of this character, whereas h represents the encoding of the series 01:46:37.780 |
And so adding them together is potentially going to lose information. 01:46:43.060 |
So I think what Yannette was going to prefer that we might do is maybe to concatenate these 01:46:51.060 |
So let's now make a copy of the previous cell, all the same, but rather than using +, let's use concatenation. 01:47:01.380 |
Now if we concat, then we need to make sure now that our input layer is not from n_fac 01:47:09.740 |
to hidden, which is what we had before, but because we're concatenating, it needs to be from n_fac plus n_hidden. 01:47:19.140 |
And so now that's going to make all the dimensions work nicely. 01:47:29.380 |
This now makes it back to size nhidden again, and then this is putting it through the same 01:47:35.240 |
square matrix as before so it's still of size nhidden. 01:47:39.640 |
So this is like a good design heuristic if you're designing an architecture is if you've 01:47:46.500 |
got different types of information that you want to combine, you generally want to concatenate 01:47:51.980 |
it, adding things together, even if they're the same shape, is losing information. 01:48:00.660 |
And so once you've concatenated things together, you can always convert it back down to a fixed 01:48:06.380 |
size by just chucking it through a matrix product. 01:48:11.980 |
It's the same thing, but now we're concatenating instead. 01:48:17.280 |
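A sketch of that concatenating version; the only changes from the loop model above are the size of l_in and the torch.cat:

    import torch, torch.nn as nn, torch.nn.functional as F

    class CharLoopConcatModel(nn.Module):
        def __init__(self, vocab_size, n_fac, n_hidden):
            super().__init__()
            self.n_hidden = n_hidden
            self.e = nn.Embedding(vocab_size, n_fac)
            # concatenating, so the input layer now goes from n_fac + n_hidden down to n_hidden
            self.l_in = nn.Linear(n_fac + n_hidden, n_hidden)
            self.l_hidden = nn.Linear(n_hidden, n_hidden)
            self.l_out = nn.Linear(n_hidden, vocab_size)

        def forward(self, *cs):
            bs = cs[0].size(0)
            h = torch.zeros(bs, self.n_hidden)
            for c in cs:
                inp = torch.cat((h, self.e(c)), dim=1)   # keep the two kinds of state separate
                inp = F.relu(self.l_in(inp))
                h = torch.tanh(self.l_hidden(inp))
            return F.log_softmax(self.l_out(h), dim=-1)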
And so we can fit that, and so last time we got 1.72, this time we got 1.68. 01:48:24.640 |
So it's not setting the world on fire, but it's an improvement, and the improvement's 01:48:33.220 |
And so now we can pass in 8 characters, like 'for thos', and the predictions look good. 01:48:52.660 |
Let's see if PyTorch can do some of this for us. 01:48:55.700 |
And so basically what PyTorch will do for us is it will write this loop automatically, 01:49:03.620 |
and it will create these linear input layers automatically. 01:49:08.420 |
And so to ask it to do that, we can use the nn.RNN class. 01:49:14.740 |
So here's the exact same thing in less code by taking advantage of PyTorch. 01:49:20.660 |
And again, I'm not using a conceptual analogy to say PyTorch is doing something like this, it is doing exactly this. 01:49:28.260 |
This is just the code you just saw wrapped up a little bit, refactored a little bit for 01:49:34.900 |
So when we say we now want to create an RNN, called rnn, then what this does is it does that for loop for us. 01:49:44.660 |
Now notice that our for loop needed a starting point. 01:49:51.820 |
Because otherwise our for loop didn't quite work, we couldn't quite refactor it out. 01:49:55.420 |
And because this is exactly the same, this needs a starting point too. 01:49:59.660 |
So let's give it a starting point and so you have to pass in your initial hidden state. 01:50:05.940 |
For reasons that will become apparent later on, it turns out to be quite useful to be 01:50:14.820 |
able to get back that hidden state at the end. 01:50:19.540 |
And just like we could here, we could actually keep track of the hidden state. 01:50:25.460 |
We get back both the output and the hidden state. 01:50:29.320 |
So we pass in the input and the hidden state and we get back the output and the hidden 01:50:35.640 |
So it's the orange circle ellipse of activations, and so it is of size 256. 01:51:01.520 |
So there's one other thing to know, which is in our case we were replacing h with a new hidden state each time through the loop. 01:51:15.140 |
The one minor difference in PyTorch is they append the new hidden state to a list, or rather stack them up into a tensor. 01:51:24.340 |
So they actually give you back all of the hidden states, so in other words, rather than 01:51:27.940 |
just giving you back the final ellipse, they give you back all of the ellipses stacked 01:51:34.060 |
And so because we just want the final one, I just got indexed into it with -1. 01:51:38.740 |
Other than that, this is the same code as before. 01:51:43.700 |
We put that through our output layer to get the correct vocab size, and then we can train it as before. 01:52:01.060 |
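Letting nn.RNN write the loop for us, a sketch of the refactored model could look like this:

    import torch, torch.nn as nn, torch.nn.functional as F

    class CharRnn(nn.Module):
        def __init__(self, vocab_size, n_fac, n_hidden):
            super().__init__()
            self.n_hidden = n_hidden
            self.e = nn.Embedding(vocab_size, n_fac)
            self.rnn = nn.RNN(n_fac, n_hidden)            # PyTorch writes the loop and the input layer for us
            self.l_out = nn.Linear(n_hidden, vocab_size)

        def forward(self, *cs):
            bs = cs[0].size(0)
            h = torch.zeros(1, bs, self.n_hidden)         # initial hidden state; note the extra unit axis
            inp = self.e(torch.stack(cs))                 # seq_len x batch_size x n_fac
            outp, h = self.rnn(inp, h)                    # outp holds the hidden state at every time step
            return F.log_softmax(self.l_out(outp[-1]), dim=-1)   # we only want the final one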
So you can see here I can do it manually, I can create some hidden state, I can pass 01:52:02.060 |
it to that RNN, I can see the stuff I get back. 01:52:06.100 |
You'll see that the dimensionality of h, it's actually a rank 3 tensor, whereas in my version it was just rank 2. 01:52:23.660 |
And the difference is here we've got just a unit axis at the front. 01:52:28.380 |
We'll learn more about why that is later, but basically it turns out you can have a 01:52:32.620 |
second RNN that goes backwards, one that goes forwards, one that goes backwards, and the 01:52:37.700 |
idea is it's going to be better at finding relationships that kind of go backwards. 01:52:45.740 |
Also it turns out you can have an RNN feed to an RNN, that's called a multi-layer RNN. 01:52:50.060 |
So basically if you have those things, you need an additional axis on your tensor to 01:52:55.700 |
keep track of those additional layers of hidden state. 01:52:58.620 |
But for now, we'll always have a 1 here, and we'll always also get back a 1 at the end. 01:53:09.180 |
So if we go ahead and fit this now, let's actually train it for a bit longer. 01:53:19.500 |
This time we'll do 4 epochs at 1e-3, and then we'll do another 2 epochs at 1e-4. 01:53:30.120 |
And so we've now got our loss down to 1.5, so getting better and better. 01:53:38.460 |
So here's our getNext again, and let's just do the same thing. 01:53:44.580 |
So what we can now do is we can loop through 40 times, calling getNext each time, and then 01:53:51.900 |
each time we'll replace that input by removing the first character and adding the thing that 01:53:58.300 |
And so that way we can feed in a new set of 8 characters again and again and again. 01:54:03.660 |
And so that way we'll call that getNext in, so here are 40 characters that we've generated. 01:54:09.780 |
So we started out with 'for thos', and we got 'for thos of the same, to the same, to the same'. 01:54:16.980 |
You can probably guess what happens if you keep predicting the same to the same. 01:54:25.100 |
We now have something which we've basically built from scratch, and then we've seen that PyTorch provides the same thing for us. 01:54:38.540 |
So if you want to have an interesting little homework assignment this week, try to write your own version of nn.RNN. 01:54:48.820 |
Try to literally create your Jeremy's RNN, and then type in here Jeremy's RNN, or in your 01:54:58.180 |
case maybe your name's not Jeremy, which is OK too, and then get it to run, writing your 01:55:04.260 |
implementation of that class from scratch without looking at the PyTorch source code. 01:55:09.940 |
Basically it's just a case of going up and seeing what we did back here, make sure you 01:55:15.300 |
get the same answers, and confirm that you do. 01:55:18.620 |
So that's kind of a good little test, very simple little assignment, but I think you'll 01:55:23.940 |
feel really good when you've seen "Oh, I've just re-implemented nn.RNN." 01:55:36.420 |
When I switched from this one, when I've moved the char 1 input inside the dotted line, this 01:55:41.020 |
dotted rectangle represents the thing I'm repeating. 01:55:44.580 |
I also, watch the triangle, the output, I move that inside as well. 01:55:50.700 |
Now that's a big difference because now what I've actually done is I'm actually saying 01:55:57.940 |
spit out an output after every one of these circles. 01:56:03.380 |
So spit out an output here, and here, and here. 01:56:08.620 |
So in other words, if I have a 3-character input, I'm going to spit out a 3-character prediction. 01:56:13.900 |
I'm saying after character 1, this will be next, after character 2, this will be next, and after character 3, this will be next. 01:56:21.580 |
So again, nothing different, and again, if you wanted to go a bit further with the assignment, you could implement this version too. 01:56:31.420 |
But basically what we're saying is in the for loop, we'd be saying results = some empty 01:56:39.680 |
list, and then we'd be going through, and rather than returning that, we'd instead 01:56:45.620 |
be saying results.append(that), and then return torch.stack of the results, something like that. 01:57:03.960 |
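Roughly what that change could look like, written as the forward method of the loop-based concat model sketched earlier:

    def forward(self, *cs):
        bs = cs[0].size(0)
        h = torch.zeros(bs, self.n_hidden)
        results = []
        for c in cs:
            inp = F.relu(self.l_in(torch.cat((h, self.e(c)), dim=1)))
            h = torch.tanh(self.l_hidden(inp))
            results.append(F.log_softmax(self.l_out(h), dim=-1))   # an output after every character
        return torch.stack(results)                                # seq_len x batch_size x vocab_size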
So now we now have every step we've created an output, which is basically this picture. 01:57:13.960 |
And so the reason, well there's lots of reasons that's interesting, but I think the main reason 01:57:20.020 |
right now that's interesting is that you probably noticed this approach to dealing with that data is kind of wasteful. 01:57:33.700 |
Like we're grabbing the first 8, but then this next set, all but one of them overlaps 01:57:44.580 |
So we're kind of recalculating the exact same embeddings, 7 out of 8 of them are going to 01:57:50.460 |
be exact same embeddings, exact same transitions, it kind of seems weird to do all this calculation 01:57:59.460 |
to just predict one thing and then go back and recalculate 7 out of 8 of them and add 01:58:03.620 |
one more to the end to calculate the next thing. 01:58:07.260 |
So the basic idea then is to say, well let's not do it that way, instead let's take non-overlapping 01:58:22.580 |
Here is our first 8 characters, here is the next 8 characters, here are the next 8 characters. 01:58:28.660 |
So like if you read this top left to bottom right, that would be the whole of Nietzsche. 01:58:36.320 |
And so then, if these are the first 8 characters, then offset this by 1, starting here, that's 01:58:48.460 |
So after we see characters 0 through 7, we should predict characters 1 through 8. 01:58:55.520 |
So after 40 should come 42 as it did, after 42 should come 29 as it did. 01:59:04.580 |
And so now that can be our inputs and labels for that model. 01:59:10.940 |
And so it shouldn't be any more or less accurate, it should just be the same, but it should be more efficient. 01:59:35.540 |
So I mentioned last time that we had a -1 index here, because we just wanted to grab the final output. 01:59:48.180 |
So in this case, we're going to grab all the triangles. 01:59:51.300 |
So this is actually the way nn.RNN returns things anyway. 01:59:55.880 |
We only kept the last one, but this time we're going to keep all of them. 02:00:05.700 |
So we've made one change, which is to remove that -1. 02:00:09.900 |
Other than that, this is the exact same code as before. 02:00:16.260 |
There's nothing much to show you here, except of course this time if we look at the labels, 02:00:29.080 |
it's now 512x8 because we're trying to predict 8 things every time through. 02:00:38.220 |
So there is one complexity here, which is that we want to use the negative log likelihood 02:00:47.420 |
loss function as before, but the negative log likelihood loss function, just like RMSE, expects 02:00:55.680 |
to receive 2 rank 1 tensors, well actually with the minibatch axis, 2 rank 2 tensors. 02:01:08.000 |
The problem is that we've got 8 time steps, 8 characters; in an RNN we call it a time step. 02:01:18.700 |
We have 8 time steps, and then for each one we have 84 probabilities, the probability of each possible character. 02:01:30.780 |
And then we have that for each of our 512 items in the minibatch. 02:01:36.460 |
So we have a rank 3 tensor, not a rank 2 tensor. 02:01:42.060 |
So that means that the negative log likelihood loss function is going to spit out an error. 02:01:47.940 |
Frankly I think this is kind of dumb, I think it would be better if PyTorch had written 02:01:53.820 |
their loss functions in such a way that they didn't care at all about rank and they just 02:02:02.620 |
But for now at least, it does care about rank. 02:02:06.560 |
But the nice thing is I get to show you how to write a custom loss function. 02:02:10.460 |
So we're going to create a special negative log likelihood loss function for sequences. 02:02:16.620 |
And so it's going to take an input and a target, and it's going to call F.nll_loss at the end. 02:02:26.140 |
So what we're going to do is we're going to flatten our input, and we're going to flatten our target. 02:02:37.940 |
And it turns out these are going to be the first two axes that have to be transposed. 02:02:46.540 |
So the way PyTorch handles RNN data by default is the first axis is the sequence length. 02:02:58.120 |
So the sequence length of an RNN is how many time steps? 02:03:02.620 |
So we have 8 characters, so sequence length of 8. 02:03:05.540 |
The second axis is the batch size, and then as you would expect, the third axis is the 02:03:13.940 |
So this is going to be 8 by 512 by nhidden, which I think was 256. 02:03:26.540 |
So we can grab the size and unpack it into each of these, sequence length batch size 02:03:36.820 |
Our target is 512 by 8, whereas this one here was 8 by 512. 02:03:52.900 |
So to make them match, we're going to have to transpose the first two axes. 02:04:02.500 |
PyTorch, when you do something like transpose, doesn't generally actually shuffle the memory 02:04:08.100 |
order, but instead it just kind of keeps some internal metadata to say you should treat it as if it were transposed. 02:04:18.820 |
Some things in PyTorch will give you an error if you try and use it when it has this internal metadata set. 02:04:25.900 |
It will basically say, "Error, this tensor is not contiguous." 02:04:32.100 |
If you ever see that error, add the word "contiguous" after it and it goes away. 02:04:36.700 |
So I don't know, they can't do that for you apparently. 02:04:39.300 |
So in this particular case, I got that error, so I wrote the word "contiguous" after it. 02:04:44.020 |
And so then finally we need to flatten it out into a single vector, and so we can just 02:04:49.100 |
go .view, which is the same as NumPy's .reshape, and -1 means make it as long as it needs to be. 02:04:57.820 |
And then the input, again we also reshape that, but remember the predictions also have 02:05:08.980 |
this axis of length 84, all of the predicted probabilities. 02:05:19.140 |
So if you ever want to play around with your own loss functions, you can just do it like this. 02:05:29.460 |
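A sketch of that sequence loss, following the transpose / contiguous / view steps just described:

    import torch.nn.functional as F

    def nll_loss_seq(inp, targ):
        # inp:  seq_len x batch_size x vocab_size (log probabilities)
        # targ: batch_size x seq_len
        sl, bs, nc = inp.size()
        targ = targ.transpose(0, 1).contiguous().view(-1)    # match the axis order, then flatten
        return F.nll_loss(inp.view(-1, nc), targ)             # keep the vocab axis, flatten the rest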
So it's important to remember that fit is this lowest level fastai abstraction, this 02:05:39.180 |
is the thing that implements the training loop. 02:05:42.220 |
And so the stuff you pass it in is all standard PyTorch stuff except for this. 02:05:51.820 |
This is our model data object, this is the thing that wraps up the test set, the training set and the validation set. 02:06:05.780 |
So when we pull the triangle into the repeated structure, so the first n-1 iterations of 02:06:14.900 |
the sequence length, we don't see the whole sequence length, so does that mean that the 02:06:19.940 |
batch size should be much bigger so that you get a triangular kind of -- 02:06:23.780 |
Now be careful, you don't mean batch size, you mean sequence length, right? 02:06:27.820 |
Because the batch size is like something else entirely. 02:06:31.820 |
So yes, if you have a short sequence length like 8, the first character has nothing to 02:06:40.420 |
go on, it starts with an empty hidden state of zeros. 02:06:48.260 |
So what we're going to start with next week is we're going to learn how to avoid that 02:06:54.740 |
And so it's a really insightful question or concern. 02:06:59.700 |
But if you think about it, the basic idea is why should we reset this to 0 every time? 02:07:09.100 |
If we can kind of line up these mini-batches somehow so that the next mini-batch joins 02:07:16.540 |
up correctly, it represents the next letter in Nietzsche's works, then we'd want to move 02:07:22.340 |
this up into the constructor and then pass that here and then store it here. 02:07:36.460 |
And now we're not resetting the hidden state each time, we're actually keeping the hidden 02:07:43.340 |
state from call to call, and so the only time that it would be failing to benefit from learning 02:07:51.680 |
state would be literally at the very start of the document. 02:07:55.500 |
So that's where we're going to try ahead next week. 02:08:07.380 |
I feel like this lesson, every time I've got a punch line coming, somebody asks me a question 02:08:11.700 |
where I have to do the punch line ahead of time. 02:08:20.940 |
And I want to show you something interesting, and this is coming to another punch line that 02:08:26.740 |
Yannette tried to spoil, which is when we're -- remember, this is just doing a loop, applying the same matrix multiply again and again. 02:08:40.420 |
If that matrix multiply tends to increase the activations each time, then effectively 02:08:47.740 |
we're doing that to the power of 8, so it's going to shoot off really high, or if it's 02:08:53.260 |
decreasing it a little bit each time, it's going to shoot off really low. 02:08:57.260 |
So this is what we call a gradient explosion. 02:09:00.860 |
And so we really want to make sure that the initial l_hidden weight matrix that we create is of a size 02:09:18.700 |
that's not going to cause our activations on average to increase or decrease. 02:09:24.500 |
And there's actually a very nice matrix that does exactly that, called the identity matrix. 02:09:33.180 |
So the identity matrix for those that don't quite remember their linear algebra is this. 02:09:44.740 |
And so the trick about an identity matrix is anything times an identity matrix is itself. 02:09:52.740 |
And therefore you could multiply by this again and again and again and again and still end 02:09:57.620 |
up with itself, so there's no gradient explosion. 02:10:02.860 |
So what we could do is instead of using whatever the default random initialization is for this matrix, 02:10:10.540 |
we could instead, after we create our RNN, go into that RNN, and if we now go and look at the documentation for it, 02:10:27.740 |
And as well as the arguments for constructing it, it also tells you the inputs and outputs 02:10:33.140 |
for calling the layer, and it also tells you the attributes. 02:10:37.180 |
And so it tells you there's something called weight_hh_l0, and these are the learnable hidden 02:10:42.060 |
to hidden weights, that's that square matrix. 02:10:45.300 |
So after we've constructed our m, we can just go in and say m.rnn.weight_hh_l0.data, that's 02:10:55.820 |
the tensor, .copy_, the in-place copy, of torch.eye, that is 'eye' for identity, in case you were wondering. 02:11:08.860 |
So this is an identity matrix of size n hidden. 02:11:12.580 |
So this both copies the identity matrix into this weight matrix and returns it. 02:11:20.320 |
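In code, that one-line initialization could look like this, using the CharRnn-style model sketched earlier; the sizes are the ones from the lecture and vocab_size is assumed from before:

    import torch

    n_hidden = 256
    m = CharRnn(vocab_size, n_fac=42, n_hidden=n_hidden)
    # weight_hh_l0 is nn.RNN's learnable hidden-to-hidden matrix, the square one applied at every step
    m.rnn.weight_hh_l0.data.copy_(torch.eye(n_hidden))   # torch.eye(n) returns an n x n identity matrix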
And so this was actually from a Geoffrey Hinton paper in 2015, 02:11:32.020 |
so after recurrent neural nets have been around for decades, he was like, hey gang, maybe 02:11:39.300 |
we should just use the identity matrix to initialize this, and it actually turns out 02:11:48.140 |
And so that was a 2015 paper, believe it or not, from the father of neural networks. 02:11:52.880 |
And so here is our implementation of his paper. 02:11:55.780 |
And this is an important thing to note, right? 02:11:58.060 |
When very famous people like Geoffrey Hinton write a paper, sometimes the entire implementation can be a single line of code. 02:12:08.900 |
Before we got 0.61257, we'll fit it with exactly the same parameters, and now we've got 0.51, 02:12:23.300 |
And one of the nice things about this tweak was before I could only use a learning rate 02:12:27.220 |
of 1e-3 before it started going crazy, but after I used the identity matrix, I found 02:12:33.460 |
I could use 1e-2 because it's better behaved. 02:12:37.420 |
Weight initialization, I found I could use a higher learning rate. 02:12:42.340 |
And honestly, these things, increasingly we're trying to incorporate into the defaults in fastai. 02:12:51.260 |
You won't necessarily need to actually know them, but at this point we're still at a point 02:12:59.140 |
where most things in most libraries most of the time don't have great defaults, so it's good to know what's going on under the hood. 02:13:05.100 |
It's also nice to know if you want to improve something what kind of tricks people have 02:13:09.100 |
used elsewhere because you can often borrow them yourself. 02:13:13.360 |
Alright, that's the end of the lesson today, so next week we will look at this idea of 02:13:20.140 |
a stateful RNN that's going to keep its hidden state around, and then we're going to go back 02:13:24.740 |
to looking at language models again, and then finally we're going to go all the way back 02:13:28.880 |
to computer vision and learn about things like resnets, and batch norm, and all the 02:13:34.640 |
tricks that were figured out in cats vs dogs.