Back to Index

Lesson 6: Deep Learning 2018


Chapters

0:00 Introduction
4:50 learn.model
10:20 Return a variable
15:30 Zip function
20:30 PCA
24:45 Entity embedding layers
29:11 Map of Germany
30:35 Embeddings
35:45 Fake Tasks
37:35 Autoencoder
41:04 Rossmann
46:36 Rossmann kernels
49:04 Columnar model data

Transcript

So this is our penultimate lesson. A couple of weeks ago, in lesson 4, I mentioned I was going to share that lesson with this terrific NLP researcher, Sebastian Ruder, which I did, and he said he loved it, and just yesterday he released this new post called Optimization for Deep Learning Highlights in 2017, in which he covered basically everything that we talked about in that lesson.

And it has some very nice shoutouts to work that some of the students here have done, including when he talked about the separation of weight decay from the momentum term, where he actually mentions the opportunities in terms of improved software decoupling this allows, and links to the commit from Anand Saha showing how to implement this in fastai.

So fastai's code is actually being used as a bit of a role model now. He then covers some of these learning rate training techniques that we've talked about. This is the SGDR schedule; it looks a bit different to what you're used to seeing because this is on a log curve, which is the way they show it in the paper.

And for more information, again, he links to two blog posts: the one from Vitaly about this topic, and again Anand Saha's blog post on this topic. So it's great to see that some of the work from fastai students is already getting noticed and picked up and shared, and this blog post went on to get on the front page of Hacker News, so that's pretty cool.

And hopefully more and more of this work will be picked up once this is released publicly. So last week we were doing a deep dive into collaborative filtering. Let's remind ourselves of what our final model looked like. In the end we ended up rebuilding the model that's actually in the fastai library, where we had an embedding: we had this little get embedding function that grabbed an embedding and randomly initialized the weights, for the users and for the items (that's the generic term; in our case the items are movies), plus a bias for the users and a bias for the items. Each embedding was of size n factors, and of course the biases just had a single column. Then we grabbed the user and item embeddings, multiplied them together, summed them up for each row, added on the bias terms, and popped that through a sigmoid to put it into the range that we wanted.
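To make that concrete, here is a minimal sketch of that dot product model in plain PyTorch. It's written from memory rather than copied from the fastai source, so the helper name and the initialization range are assumptions, but the structure is the one just described.

    import torch
    import torch.nn as nn

    def get_emb(ni, nf):
        # an embedding with randomly initialized weights, as described above
        # (the exact init range here is an assumption)
        e = nn.Embedding(ni, nf)
        e.weight.data.uniform_(-0.01, 0.01)
        return e

    class EmbeddingDotBias(nn.Module):
        def __init__(self, n_users, n_items, n_factors, y_range):
            super().__init__()
            self.u, self.i = get_emb(n_users, n_factors), get_emb(n_items, n_factors)
            self.ub, self.ib = get_emb(n_users, 1), get_emb(n_items, 1)
            self.y_range = y_range

        def forward(self, users, items):
            # multiply user and item embeddings, sum over the factors, add both biases
            dot = (self.u(users) * self.i(items)).sum(1)
            res = dot + self.ub(users).squeeze() + self.ib(items).squeeze()
            # sigmoid squashes the result into the rating range we want
            lo, hi = self.y_range
            return torch.sigmoid(res) * (hi - lo) + lo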

So that was our model. One of you asked if we can interpret this information in some way, and I promised this week we would see how to do that. So let's take a look. We're going to start with the model we built here where we just used the fastai library: CollabFilterDataset.from_csv, and then .get_learner, and then we fitted it. In three epochs, 19 seconds, we've got a pretty good result.

So what we can now do is to analyze that model. So you may remember right back when we started, we read in the movies.csv file, but that's just a mapping from the ID of the movie to the name of the movie. So we're just going to use that for display purposes so we can see what we're doing.

Because not all of us have watched every movie, I'm just going to limit this to the 3,000 most popular movies, so we have more chance of recognizing the movies we're looking at. Then I'll go ahead and convert from the MovieLens movie IDs to the contiguous IDs that we're using, because that's what our model has.

So inside the learn object that we create, inside a learner, we can always grab the PyTorch model itself just by saying "learn.model", and I'm going to be showing you more and more of the code as we go, so let's take a look at the definition of model. Model is a property, and if you haven't seen a property before, a property is just something in Python which looks like a method when you define it, but you can call it without parentheses, as we do here.

And so when you call it, it looks like a regular attribute, but when you define it, it looks like a method. Every time you call it, it actually runs this code. In this case it's just a shortcut to grab something called .models.model, so you may be interested to know what that looks like: learn.models.

And so the fastai model type is a very thin wrapper for PyTorch models. We can take a look at this CollabFilterModel and see what that is; it's only one line of code. We'll talk more about these in part 2, but basically there's this very thin wrapper, and one of the main things fastai adds is this concept of layer groups, where basically when you say here are different learning rates, they get applied to different sets of layers; that's something that's not in PyTorch.

So when you say I want to use this PyTorch model, there's one thing you have to do, which is to say how to split it into layer groups. The details aren't terribly important, but in general if you want to create a little wrapper for some other PyTorch model, you could just write something like this.

So to get inside that, to grab the actual PyTorch model itself, it's models.model, that's the PyTorch model, and then the learn object has a shortcut to that. So we're going to set m to be the PyTorch model. And so when you print out a PyTorch model, it prints it out basically by listing out all of the layers that you created in the constructor.

It's quite nifty actually when you think about the way this works: thanks to some very handy stuff in Python, we're able to use standard Python OO to define these modules and these layers, and they basically automatically register themselves with PyTorch. So back in our EmbeddingDotBias class, we just had a bunch of attributes where we said each of these things is equal to one of these layers, and it automatically knows how to represent that.

So you can see the name is u, and the name is just literally whatever we called it here, u. And then the definition says it's this kind of layer. So that's our PyTorch model. We can look inside that and basically use it: if we say m.ib, then that's referring to the embedding layer for an item, which is the bias layer.

So an item bias in this case is the movie bias. So each movie, there are 9,000 of them, has a single bias element. Now the really nice thing about PyTorch layers and models is that they all look the same. To use them, you call them as if they were a function.

So we can go m.ib, and that basically says I want you to return the value of that layer, and that layer could be a full-on model. So to actually get a prediction from a PyTorch model, I would go m and pass in my variable. And so in this case, m.ib and pass in my top movie indexes.

Now models, remember, just like layers, require variables, not tensors, because they need to keep track of the derivatives, and so we use this capital V to turn the tensor into a variable. It's just been announced this week that PyTorch 0.4, which is the version after the one that's just about to be released, is going to get rid of variables and we'll actually be able to use tensors directly to keep track of derivatives.

So if you're watching this on the MOOC and you're looking at 0.4, then you'll probably notice that the code doesn't have this V in it anymore, so that will be pretty exciting when that happens. For now, we have to remember if we're going to pass something into a model to turn it into a variable first.

And remember, a variable has a strict superset of the API of a tensor, so anything you can do to a tensor, you can do to a variable, like add it up or take its log or whatever. So that's going to return a variable which consists of going through each of these movie IDs, putting it through this embedding layer to get its bias.

And that's going to return a variable. Let's take a look. So before I press Shift + Enter here, you can have a think about what I'm going to get. I've got a list of 3000 movies going in, I'm turning it into a variable and putting it through this embedding layer, so just have a think about what you expect to come out.

And we have a variable of size 3000 by 1, hopefully that doesn't surprise you. We had 3000 movies that we were looking up, each one had a one-long embedding, so there's our 3000 by 1. You'll notice it's a variable, which is not surprising because we've fed it a variable, so we've got a variable back, and it's a variable that's on the GPU, .cuda.

So we have a little shortcut in fastai, because we very often want to take variables, turn them into tensors, and move them back to the CPU so we can play with them more easily. So to_np is "to NumPy", and that does all of those things. It works regardless of whether it's a tensor or a variable, it works regardless of whether it's on the CPU or GPU, and it'll end up giving you a NumPy array.
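So the pattern for poking at a layer ends up looking something like this. V and to_np are the fastai (0.7-era) helpers just described, and topMovieIdx stands for whatever the notebook calls its array of contiguous movie IDs, so treat the names as assumptions.

    m = learn.model                            # the underlying PyTorch model
    movie_bias = to_np(m.ib(V(topMovieIdx)))   # variable in, plain NumPy array out
    movie_bias.shape                           # -> (3000, 1)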

So if we do that, that gives us exactly the same thing as we just looked at, but now in NumPy form. So that's a super handy thing to use when you're playing around with PyTorch. My approach to things is I try to use NumPy for everything, except when I explicitly need something to run on the GPU, or I need its derivatives, in which case I use PyTorch.

I find NumPy's often easier to work with; it's been around many years longer than PyTorch, and lots of things like the Python Imaging Library, OpenCV, and lots and lots of stuff like Pandas work with NumPy. So my approach is to do as much as I can in NumPy land, then when I'm ready to do something on the GPU or take a derivative, I move to PyTorch, and as soon as I can, I put it back in NumPy.

And you'll see that the FastAI library really works this way, like all the transformations and stuff happen in NumPy, which is different to most PyTorch computer vision libraries which tend to do it all as much as possible in PyTorch. I try to do as much as possible in NumPy.

So let's say we wanted to build a model and train it on the GPU, and then we want to bring it to production. Would we call to_np on the model itself, or would we have to iterate through all the different layers and then call to_np?

Yeah, good question. So it's very likely that you want to do inference on a CPU rather than a GPU; it's more scalable, you don't have to worry about putting things in batches, and so on. So you can move a model onto the CPU just by typing m.cpu(), and that model is now on the CPU.

And therefore you can also then put your variable on the CPU by doing exactly the same thing, so you can say like so. Now having said that, if your server doesn't have a GPU, you don't have to do this because it won't put it on the GPU at all.

So for inferencing on the server, if you're running it on some T2 instance or something, it'll work fine and it'll all run on the CPU automatically. So if we train the model on the GPU and then we save those embeddings and the weights, would we have to do anything special to load it onto the CPU?

No, you won't. We have something, well it kind of depends on how much of fastai you're using, so I'll show you how you can do it in case you have to do it manually. One of the students figured this out, which is very handy. There's a load_model function, and you'll see what it does: it does torch.load with what is basically a magic incantation. Normally a model has to be loaded onto the same GPU it was saved on, but this will load it onto whatever's available.
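The incantation being referred to is the map_location argument to torch.load. A sketch of the pattern, assuming the filename is just an example and that model is an already constructed module with a matching architecture:

    import torch

    # load weights that were saved on a GPU onto whatever device is available,
    # including a CPU-only machine
    state = torch.load('rossmann.h5', map_location=lambda storage, loc: storage)
    model.load_state_dict(state)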

So that was a handy discovery. Thanks for the great questions. To put that back on the GPU, I'll need to say .cuda and now I can run it again. So it's really important to know about the zip function in Python, which iterates through a number of lists at the same time.

So in this case, I want to grab each movie along with its bias term so that I can just pop it into a list of tuples. So if I just go zip like that, that's going to iterate through each movie ID and each bias term. And so then I can use that in a list comprehension to grab the name of each movie along with its bias.
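In code that's just something like the following, where movie_bias is the array we pulled out above and movie_names stands for whatever lookup maps a movie id to its title (both names are assumptions about the notebook).

    # pair each movie's bias with its title for display
    movie_ratings = [(b[0], movie_names[i]) for i, b in zip(topMovies, movie_bias)]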

So having done that, I can then sort, and here, as I told you, the John Travolta Scientology movie comes out as the most negative, and by quite a lot. If this was a Kaggle competition, Battlefield Earth would have won by miles. Look at this. So here is the worst movie of all time, according to IMDB.

It's interesting when you think about what this means, because this is a much more authentic way to find out how bad this movie is, because some people are just more negative about movies. And if more of them watch your movie, like a highly critical audience, they're going to rate it badly.

So if you take an average, it's not quite fair. And so what this is doing is saying once we remove the fact that different people have different overall positive or negative experiences, and different people watch different kinds of movies, and we correct for all that, this is the worst movie of all time.

So that's a good thing to know. So this is how we can look inside our model and interpret the bias vectors. You'll see here I've sorted by the zeroth element of each tuple by using a lambda. Originally I used this special itemgetter; it's part of Python's operator library, and it creates a function that returns the zeroth element of whatever it's given, in order to save time.

And then I actually realized that the lambda is only one more character to write than the itemgetter, so maybe we don't need to know this after all. It's really useful to make sure you know how to write lambdas in Python; this is a function, and the sort is going to call this function every time it decides whether this thing is higher or lower than that other thing, and it's going to return the zeroth element.
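Both forms look like this; the key argument is just the function the sort calls on each element.

    from operator import itemgetter

    # sort by the zeroth element of each tuple, i.e. the bias value
    worst = sorted(movie_ratings, key=lambda o: o[0])[:15]
    best = sorted(movie_ratings, key=itemgetter(0), reverse=True)[:15]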

So here's the same thing in itemgetter format, and here is the reverse sort, with The Shawshank Redemption right at the top, which I'll definitely agree with; The Godfather, The Usual Suspects, these are all pretty great movies, 12 Angry Men, absolutely. So there you go, that's how we can look at the bias.

So then the second piece to look at would be the embeddings. We can do the same thing: remember m.i was the item embeddings, whereas m.ib was the item bias. We can pass in our list of movies as a variable, turn it into NumPy, and here are our movie embeddings: for each of the 3000 most popular movies, here are its 50 embedding values.

So it's very hard, unless you're Geoffrey Hinton, to visualize a 50-dimensional space. So what we'll do is turn it into a 3-dimensional space. We can compress high-dimensional spaces down into lower-dimensional spaces using lots of different techniques; perhaps one of the most common and popular is called PCA.

PCA stands for Principal Components Analysis, it's a linear technique, but linear techniques generally work fine for this kind of embedding. I'm not going to teach you about PCA now, but I will say in Rachel's Computational Linear Algebra class, which you can get to from fast.ai, we cover PCA in a lot of detail.

And it's a really important technique; it turns out to be almost identical to something called Singular Value Decomposition, which is a type of matrix decomposition that actually does turn up in deep learning from time to time. So if you were going to dig more into linear algebra, SVD and PCA, along with eigenvalues and eigenvectors, which are all slightly different versions of the same kind of thing, are all worth knowing.

But for now, just know that you can grab PCA from sklearn.decomposition, say how much you want to reduce the dimensionality to (I want to find 3 components), and what this is going to do is find 3 linear combinations of the 50 dimensions which capture as much of the variation as possible while being as different from each other as possible.

So we would call this a lower rank approximation of our matrix. So then we can grab the components, so that's going to be the 3 dimensions, so once we've done that, we've now got 3 by 3000. And so we can now take a look at the first of them, and we'll do the same thing of using zip to look at each one along with its movie.
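The sklearn part is only a few lines; the variable names here roughly follow the notebook, so treat them as assumptions.

    from sklearn.decomposition import PCA

    pca = PCA(n_components=3)
    movie_pca = pca.fit(movie_emb.T).components_   # shape (3, 3000)
    fac0 = movie_pca[0]
    # pair each movie's value on the first component with its title
    movie_comp = [(f, movie_names[i]) for f, i in zip(fac0, topMovies)]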

And so here's the thing, we don't know ahead of time what this PCA thing is, it's just a bunch of latent factors, it's kind of the main axis in this space of latent factors. And so what we can do is look at it and see if we can figure out what it's about.

So given that Police Academy 4 is up at one end along with Waterworld, whereas Fargo, Pulp Fiction and The Godfather are up at the other end, I'm going to guess that a high value is not going to represent critically acclaimed movies or serious watching. So I called this "easy watching vs. serious", but this is how you have to interpret your embeddings: take a look at what they seem to be showing and decide what you think it means.

So this is the principal axis in this set of embeddings. We can look at the next one, so do the same thing and look at the index 1 component. This one's a little bit harder to figure out, but with things like Mulholland Drive and The Purple Rose of Cairo, these look like more dialog-driven ones, whereas things like Lord of the Rings and Aladdin and Star Wars look more like modern CGI-heavy ones.

So you can imagine that this pair of dimensions probably represents a lot of the differences in how people rate movies. Some people like Purple Rose of Cairo type movies, Woody Allen kind of classics, and some people like these big Hollywood spectacles. Some people presumably like Police Academy 4 more than they like Fargo.

So you can kind of get the idea of what's happened. For a model which was literally just multiplying two things together and adding them up, it's learned quite a lot, which is kind of cool. So that's what we can do with that. And then we could plot them if we wanted to; I just grabbed a small subset to plot on those first two axes.

Alright, so that's that. So next I wanted to dig a layer deeper into what actually happens when we say fit. So when we say learn.fit, what's it doing? For something like the Rossmann store model, is there a way to interpret the embeddings? Yes.

We'll see that in a moment. Well, let's jump straight there, what the hell. So for the Rossmann how-much-are-we-going-to-sell-at-each-store-on-each-date model, this is from the paper by Guo and Berkhahn. It's a great paper, by the way, well worth it, and pretty accessible.

I think any of you would at this point be able to at least get the gist of it, and much of the detail as well, particularly as you've also done the machine learning course. And they actually make this point in the paper: what they call entity embedding layers, an embedding of a categorical variable, is identical to a one-hot encoding followed by a matrix multiply.

So they're basically saying if you've got three embeddings, that's the same as doing three one-hot encodings and putting each one through a matrix multiply, and then putting that through a dense layer, or what PyTorch would call a linear layer. One of the nice things here is that because they thought this was the first paper (it was actually the second, I think) to show the idea of using categorical embeddings for this kind of dataset, they go into quite a lot of detail, right back to the basic stuff that we learned about, so it's a good second cut at thinking about what embeddings are doing.

So one of the interesting things that they did was they said after we've trained a neural net with these embeddings, what else could we do with it? So they got a winning result with a neural network with entity embeddings. But then they said hey you know what, we could take those entity embeddings and replace each categorical variable with the learned entity embeddings, and then feed that into a GBM.

So in other words, rather than passing into the GBM a one-hot encoded version, or an ordinal version, let's actually replace the categorical variable with its embedding for the appropriate level for that row. So it's actually a way of doing feature engineering. And the mean average percent error without that for GBMs, using just one-hot encodings, was 0.15, but with that it was 0.11.

Random forests without that was 0.16, and with that 0.108, nearly as good as the neural net. So this is an interesting technique, because what it means is that in your organization you can train a neural net that has an embedding of stores, an embedding of product types, an embedding of whatever kind of high cardinality or even medium cardinality categorical variables you have, and then everybody else in the organization can chuck those into their GBM or random forest or whatever and use them.

And what this is saying is they won't get, in fact you can even use k-nearest neighbors with this technique and get nearly as good a result. So this is a good way of giving the power of neural nets to everybody in your organization without having them do the fast AI deep learning course first.

They can just use whatever sklearn or R tools they're used to. And those embeddings could literally be in a database table, because an embedding is just an index lookup, which is the same as an inner join in SQL. So if you've got a table of each product along with its embedding vector, then you can literally do an inner join, and now you have every row in your table along with its product embedding vector.
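As a tiny illustration of that point, here is what the "embedding as inner join" idea looks like in pandas, with made-up store ids and embedding values.

    import pandas as pd

    sales = pd.DataFrame({'store_id': [1, 2, 1], 'units': [10, 3, 7]})
    store_emb = pd.DataFrame({'store_id': [1, 2],
                              'emb_0': [0.12, -0.40],
                              'emb_1': [-0.91, 0.22]})

    # the embedding lookup is literally just an inner join on the key
    sales_with_emb = sales.merge(store_emb, on='store_id', how='inner')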

This is a really useful idea. And GBMs and random forests learn a lot quicker than neural nets do. So here's what happened when they took the various different states of Germany and plotted the first two principal components of their embedding vectors. And they basically here is where they were in that 2D space.

And wackily enough, I've circled in red three states here in the embedding space and the same three states on the map of Germany, and done the same thing in blue and in green. So it's actually drawn a map of Germany, even though it was never told anything about how far these states are from each other; the very concept of geography didn't exist for it.

So that's pretty crazy. So that was from their paper. So I went ahead and looked -- here's another thing, I think this is also from their paper. They took every pair of places and they looked at how far away they are on a map versus how far away are they in embedding space, and they got this beautiful correlation.

So again, apparently stores that are near each other physically have similar characteristics in terms of when people buy more or less stuff from them. So I looked at the same thing for days of the week: here's an embedding of the days of the week from our model, and I just joined up Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday.

I did the same thing for the months of the year. You can see, here's winter, here's summer. So I think visualizing embeddings can be interesting. It's good to first check that you can see things you would expect to see, and then you can try to see things you maybe didn't expect to see.

So you could try all kinds of clusterings or whatever, and this is not something which has been widely studied at all, so I'm not going to tell you what the limitations of this technique are. I've heard of other ways to generate embeddings, like skip-grams; I was wondering if you could say whether one is better than the other, using neural networks or skip-grams?

So skip-grams is quite specific to NLP. I'm not sure if we'll cover it in this course, but basically the original Word2vec approach to generating embeddings was to say: we don't actually have a labeled dataset, all we have is a big corpus of text, like Google News. So they have an unsupervised learning problem, an unlabeled problem.

And so the best way in my opinion to turn an unlabeled problem into a labeled problem is to kind of invent some labels. And so what they did in the Word2vec case was they said okay, here's a sentence with 11 words in it, and then they said okay, let's delete the middle word and replace it with a random word.

And so originally it said cat, and they said no, let's replace that with justice. So before it said the cute little cat sat on the fuzzy mat, and now it says the cute little justice sat on the fuzzy mat. And what they do is they do that so they have one sentence where they keep exactly as is, and then they make a copy of it and they do the replacement.

And so then they have a label where they say it's a 1 if it was unchanged, it was the original, and 0 otherwise. And so basically then you now have something you can build a machine learning model on, and so they went and built a machine learning model on this, so the model was like try and find the faked sentences.

Not because they were interested in a fake-sentence finder, but because as a result they now have embeddings that, just like we discussed, you can now use for other purposes. And that became Word2vec. Now it turns out that if you make this effectively just a single matrix multiply, rather than a deep neural net, you can train it super quickly.

And so that's basically what they did. They decided we're going to make a pretty crappy model, like a shallow learning model rather than a deep model. With the downside it's a less powerful model, but a number of upsides. The first thing is we can train it on a really large dataset, and then also really importantly we're going to end up with embeddings which have really very linear characteristics, so we can add them together and subtract them and stuff like that.

So there's a lot we can learn from there for other types of embedding, like categorical embeddings: specifically, if we want categorical embeddings which we can draw nicely and expect to be able to add and subtract and have behave linearly, probably because we want to use them in k-nearest neighbors and the like, we should probably use shallow learning.

If we want something that's going to be more predictive, we probably want to use a neural net. And so actually in NLP, I'm really pushing the idea that we need to move past Word2vec and GloVe, these linear-based methods, because it turns out that those embeddings are way less predictive than embeddings learned from deep models.

And so the language model that we learned about, which ended up getting a state-of-the-art result on sentiment analysis, didn't use GloVe or Word2vec; instead we pre-trained a deep recurrent neural network, and we ended up with not just pre-trained word vectors but a full pre-trained model. So it looks like to create embeddings for entities we need a dummy task, right?

Not necessarily a dummy task, like in this case we had a real task, right? So we created the embeddings for Rossman by trying to predict store sales. This isn't just for learning embeddings, for learning any kind of feature space, you either need labeled data, or you need to invent some kind of fake task.

So does the task matter? Like, if I choose one task and train embeddings, and I choose another task and train embeddings, which one is better? It's a great question, and it's not something that's been studied nearly enough. I'm not sure that many people even quite understand that when they say unsupervised learning nowadays, they almost always mean fake-task labeled learning.

And so the idea of what makes a good fake task, I don't know that I've seen a paper on that, but intuitively, we need something where the kinds of relationships it's going to learn are likely to be the kinds of relationships that you probably care about. So for example, in computer vision, one kind of fake task people use is to say let's take some images and use some kind of unreal and unreasonable data augmentation, like recolor them too much or whatever, and then we'll ask the neural net to predict which one was the augmented and which one was not the augmented.

I think it's a fascinating area, and one which would be really interesting for people, maybe some of the students here to look into further, is take some interesting semi-supervised or unsupervised data sets and try and come up with some more clever fake tasks and see does it matter, how much does it matter.

In general, if you can't come up with a fake task that you think seems great, I would say just use the best one you can; it's often surprising how little you need. The ultimate crappy fake task is called the autoencoder, and the autoencoder is the thing which won the claims prediction competition that just finished on Kaggle.

They had lots of examples of insurance policies where we knew this was how much was claimed, and then lots of examples of insurance policies where I guess they must have been still open, we didn't yet know how much they claimed. So what they did was they said let's basically start off by grabbing every policy, and we'll take a single policy and we'll put it through a neural net, and we'll try and have it reconstruct itself.

But in the intermediate layers, or at least one of those intermediate layers, we'll make sure there are fewer activations than there were inputs. So let's say there were 100 variables on the insurance policy; we'll have something in the middle that only has 20 activations. And when you're basically saying hey, reconstruct your own input, it's not a different kind of model and it doesn't require any special code; you can use any standard PyTorch or fastai learner, you just say my output equals my input.
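A minimal sketch of that idea in PyTorch might look like this; the actual winning solution was considerably more elaborate, so treat this purely as the shape of the thing.

    import torch.nn as nn
    import torch.nn.functional as F

    class AutoEncoder(nn.Module):
        # 100 input features squeezed through a 20-activation bottleneck
        def __init__(self, n_in=100, n_bottleneck=20):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_in, n_bottleneck), nn.ReLU())
            self.decoder = nn.Linear(n_bottleneck, n_in)

        def forward(self, x):
            return self.decoder(self.encoder(x))

    # the training target is the input itself, e.g. loss = F.mse_loss(model(x), x)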

And that's the most uncreative fake task you can invent. That's called an autoencoder, and it works surprisingly well, to the point that it literally just won a Kaggle competition. They took the features it learned, chucked them into another neural net, and won. Maybe if we have enough students taking an interest in this, we'll be able to cover unsupervised learning in more detail in Part 2, especially given this Kaggle win.

I think this may be related to the previous question. Is the language model, for example, trained on the arXiv data useful at all on the IMDB data? Great question. I was just talking to Sebastian about this; Sebastian wrote about this this week, and we thought we'd try and do some research on this in January.

Again, it's not well known. We know that in computer vision, it's shockingly effective to train on cats and dogs, and use that pre-trained network to do lung cancer diagnosis in CT scans. In the NLP world, nobody much seems to have tried this, the NLP researchers I've spoken to, other than Sebastian, about this assume that it wouldn't work and they generally haven't bothered trying.

I think it would work great. Since we're talking about Rossmann, I'll just mention that during the week I was interested to see how good this solution actually was, because I noticed that on the public leaderboard it didn't look like it was going to be that great. I also thought it would be good to see what it actually takes to use a test set properly with this kind of structured data.

So if you have a look at the Rossmann notebook now, I've pushed some changes that actually run the test set through as well, so you can get a sense of how to do this. You'll see basically every line appears twice, once for train and once for test: train, test, train, test, and so on.

Obviously you could do this in a lot fewer lines of code by putting all of the steps into a method and then passing either the train data frame or the test data frame to it. In this case, for teaching purposes, I wanted you to be able to see each step and experiment with what each step looks like, but you can certainly simplify this code.
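For example, a sketch of that simplification might look like the following; the column names are hypothetical, it's just to show the shape of the refactor.

    import pandas as pd

    def add_date_features(df):
        # the same steps applied identically to train and test
        df['Date'] = pd.to_datetime(df['Date'])
        df['Year'] = df.Date.dt.year
        df['Month'] = df.Date.dt.month
        df['DayOfWeek'] = df.Date.dt.dayofweek
        return df

    train_df = add_date_features(train_df)
    test_df = add_date_features(test_df)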

So we do this for every data frame, and for some of them you can see I loop through both the train and test versions when doing the joins. For this whole section about the durations, I basically put two lines here, one that says the data frame equals the train columns and one that says the data frame equals the test columns.

And so my idea is you'd run this line first and then you would skip the next one and run everything beneath it, and then you'd go back and run this line and then run everything beneath it. So some people on the forum were asking how come this code wasn't working this week, which is a good reminder that the code is not designed to be code that you always run top to bottom without thinking.

You're meant to think, what is this code here, should I be running it right now? And so the early lessons I tried to make it so you can run it top to bottom, but increasingly as we go along I kind of make it more and more that you actually have to think about what's going on.

So Jeremy, you were talking about shallow learning and deep learning; could you define that a bit better? By shallow learning, I think I just mean anything that doesn't have a hidden layer. So something that's like a dot product, a matrix multiply basically. So we end up with a training and a test version, and then everything else is basically the same.

One thing to note, and a lot of the details of this we cover in the machine learning course by the way, because it's not really deep learning specific, so check that out if you're interested: I should mention we use apply_cats rather than train_cats to make sure that the test set and the training set have the same categorical codes.

We also need to make sure that we keep track of the mapper. This is the thing which basically records the mean and standard deviation of each continuous column, so we can apply that same mapper to the test set. And when we do all that, that's basically it; the rest is easy. We just have to pass in the test data frame in the usual way when we create our model data object, and then there are no changes through all of here; we train it in the same way.
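Put together, the fastai (0.7-era) pattern being described is roughly the following; the exact argument names are from memory, so check the notebook for the precise call.

    # reuse the training set's categorical codes on the test set
    train_cats(joined)
    apply_cats(joined_test, joined)

    # do_scale=True returns a mapper holding each column's mean and std,
    # which we then pass back in so the test set gets the same scaling
    df, y, nas, mapper = proc_df(joined, 'Sales', do_scale=True)
    df_test, _, _, _ = proc_df(joined_test, 'Sales', do_scale=True,
                               mapper=mapper, na_dict=nas)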

And then once we finish training it, we can call predict as per usual, passing in True to say this is the test set rather than the validation set, and pass that off to Kaggle. And it was really interesting, because this was my submission: it got a public score of 103, which would put us somewhere in the 300s on the leaderboard, which looks awful.

And our private score of 107 is about 5th. So if you're competing in a Kaggle competition and you haven't thoughtfully created a validation set of your own and you're relying on public leaderboard feedback, this could totally happen to you, but the other way round, you'll be like, "Oh, I'm in the top 10, I'm doing great!" And then, uh-oh.

For example, at the moment, the icebergs competition, recognizing icebergs, a very large percentage of the public leaderboard set is synthetically generated data augmentation data. Like totally meaningless. And so your validation set is going to be much more helpful than the public leaderboard feedback. So be very careful. So our final score here is kind of within statistical noise of the actual 3rd place getters, so I'm pretty confident that we've captured their approach.

And so that's pretty interesting. Something to mention: there are some nice kernels about Rossmann, quite a few actually, that you can go back and see. Particularly if you're doing the groceries competition, go and have a look at the Rossmann kernels, because quite a few of them are higher quality than the ones for the Ecuadorian groceries competition.

One of them, for example, showed how for particular stores, like store 85, the sales for non-Sundays and the sales for Sundays looked very different, whereas there are other stores where the sales on Sunday don't look any different, and you can get a sense of why you need these kinds of interactions.

The one I particularly wanted to point out is the one I think I briefly mentioned: the thing that the 3rd place winners, whose approach we used, didn't notice. And here's a really cool visualization. Here you can see that the store is closed, and just after, oh my god, we ran out of eggs.

And just before, oh my god, go and get the milk before the store closes. And here again, closed, bang. So this 3rd place winner actually deleted all of the closed store rows before they started doing any analysis. So remember how we talked about don't touch your data unless you first of all analyze to see whether that thing you're doing is actually okay.

No assumptions. So in this case, I am sure, I haven't tried it, but I'm sure they would have won otherwise. Although there weren't actually any store closures to my knowledge in the test set period, the problem is that their model was trying to fit to these really extreme things, and because it wasn't able to do it very well, it was going to end up getting a little bit confused.

It's not going to break the model, but it's definitely going to harm it, because it's trying to do computations to fit something which it literally doesn't have the data for. Yannette, can you pass that back there? So for that Rossmann model, again, it's nice to look inside to see what's actually going on.

And so for that Rossmann model, I want to make sure you know how to find your way around the code so you can answer these questions for yourself. It's inside ColumnarModelData. We started out by saying hey, if you want to look at the code for something, you can go question mark, question mark like this, and I haven't got this loaded in here, but you can use question mark, question mark to get the source code for something.

But obviously that's not really a great way, because often you look at that source code and it turns out you need to look at something else. And so for those of you that haven't done much coding, you might not be aware that almost certainly the editor you're using probably has the ability to both open up stuff directly off SSH and to navigate through it so you can jump straight from place to place.

So I want to show you what I mean. If I want to find ColumnarModelData, and I happen to be using vim here, I can basically say tag ColumnarModelData and it will jump straight to the definition of that class. And then I notice here that it's actually building up a data loader; that's interesting.

That's interesting. If I hit Ctrl + right square bracket, it will jump to the definition of the thing that was under my cursor, and after I finished reading it for a while, I can hit Ctrl + T to jump back up to where I came from, and you kind of get the idea.

If I want to find every usage of ColumnarModelData in this file, I can hit * to jump to the next place it's used, and so forth. So in this case, get_learner was the thing which actually got the model. We want to find out what kind of model it is, and it turns out that ColumnarModelData.get_learner creates a MixedInputModel, which is the PyTorch model, and then wraps it in StructuredLearner, which is the fastai Learner type that wraps the data and the model together.

So if we want to see the definition of this actual PyTorch model, I can hit Ctrl + right square bracket to see it. And here is the model, and nearly all of this we can now understand. So we get passed a list of embedding sizes. In the mixed model that we saw, does it always expect categorical and continuous together?

Yes, it does. And the model data behind the scenes, if there are none of the other type, it creates a column of 1s or 0s or something. So if it is null, it can still work. It's kind of ugly and hacky, and we'll hopefully improve it, but you can pass in an empty list of categorical or continuous variables to the model data, and it will basically pass an unused column of 0s to avoid things breaking.

I'm leaving fixing some of these slightly hacky edge cases because PyTorch 0.4 as well as getting rid of variables, they're going to also add rank 0 tensors, which is to say if you grab a single thing out of a rank 1 tensor rather than getting back a number which is qualitatively different, you're actually going to get back a tensor that just happens to have no rank.

Now it turns out that a lot of this code is going to be much easier to write then, so for now it's a little bit more hacky than it needs to be. Jeremy, you talked about this a little bit before, but maybe it's a good time at some point to talk about how can we write something that is slightly different from what is in the library.

Yeah, I think we'll cover that a little bit next week, but I'm mainly going to do that in part 2. Part 2 is going to cover quite a lot of stuff. One of the main things we'll cover in part 2 is what are called generative models, so things where the output is a whole sentence or a whole image, but I'll also dig into how to really either customize the fastai library or use it on more custom models.

So if we have time, we'll touch on it a little bit next week. So the learner, we were passing in a list of embedding sizes, and as you can see that embedding sizes list was literally just the number of rows and the number of columns in each embedding. And the number of rows was just coming from literally how many stores are there in the store category, for example, and the number of columns was just equal to that divided by 2 and a maximum of 50.
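That rule of thumb comes out to something like this, assuming the categorical columns have already been set up as pandas categoricals and the variable names follow the notebook roughly:

    cat_sz = [(c, len(joined[c].cat.categories) + 1) for c in cat_vars]
    # embedding width: half the cardinality, capped at 50
    emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]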

So that list of tuples comes in, and you can see here how we use it: we go through each of those tuples, grab the number of categories and the size of the embedding, and construct an embedding. And so that's a list. One minor PyTorch-specific thing we haven't talked about before is what it takes for PyTorch to register these; remember how we said it registers your parameters and registers your layers?

So when we listed the model, it actually printed out the name of each embedding and each bias. It can't do that if they're hidden inside a plain list; they have to be in an actual nn.Module subclass. So there's a special thing called nn.ModuleList which takes a list and basically says: I want you to register everything in here as being part of this model. So that's just a minor tweak.
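A stripped-down sketch of that part of the constructor, not the real fastai code, just the registration trick:

    import torch.nn as nn

    class WithEmbeddings(nn.Module):
        def __init__(self, emb_szs):
            super().__init__()
            # nn.ModuleList registers each embedding as part of the model,
            # which a plain Python list would not do
            self.embs = nn.ModuleList([nn.Embedding(c, s) for c, s in emb_szs])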

So our mixed-input model has a list of embeddings, and then I do the same thing for a list of linear layers. So when I said here 1000, 500, this is saying how many activations I wanted for each of my linear layers. And so here I just go through that list and create a linear layer that goes from this size to the next size.

So you can see how easy it is to construct not just your own model, but a model which you can pass parameters to and have it constructed on the fly, dynamically. BatchNorm we'll talk about next week. This is initialization; we've mentioned Kaiming initialization before, last week. And then dropout, same thing.

We have here a list of how much dropout to apply to each layer. So again here, let's just go through each thing in that list and create a dropout layer for it. So this constructor, we understand everything in it except for BatchNorm, which we don't have to worry about for now, so that's the constructor.

And so then the forward, which is also all stuff we're aware of, goes through each of those embedding layers that we just saw (and remember, we just treat each one like a function), calls it with the i-th categorical variable, and concatenates them all together. It puts that through dropout, then goes through each of our linear layers and calls it, applies ReLU to it, applies dropout to it, and then finally applies the final linear layer.

And the final linear layer has this as its size, which is here: size 1, a single unit, sales. So we're getting there. And then of course at the end, as I mentioned we'd come back to this: if you passed in a y_range parameter, then we do the thing we just learned about last week, which is to use a sigmoid.

This is a cool little trick, not just for making your collaborative filtering better; in this case my basic idea was that sales are going to be greater than zero, and probably less than the largest sale they've ever had. So I just pass that in as y_range, and we do a sigmoid and multiply the sigmoid output by the range that I passed in.

And so hopefully we can find that here. So I actually said maybe the range is between 0 and the highest times 1.2, because maybe the next two weeks we have one bigger, but this is again trying to make it a little bit easier for it to give us the kind of results that it thinks is right.
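In other words, roughly this. The sales values here are made up, and the notebook actually works on log sales, so max_log_y is the log of the biggest sale seen so far:

    import numpy as np

    y = np.array([5263., 6064., 8314.])      # made-up sales figures
    max_log_y = np.log(np.max(y))
    y_range = (0, max_log_y * 1.2)           # zero up to a bit above the biggest sale seen

    # inside forward(), after the final linear layer (x), the model then does roughly:
    #   x = torch.sigmoid(x)
    #   x = x * (self.y_range[1] - self.y_range[0]) + self.y_range[0]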

So increasingly, I'd love you all to try not to treat these learners and models as black boxes, but to feel like you now have the information you need to look inside them. And remember, you could copy this class, paste it into a cell in a Jupyter Notebook, and start fiddling with it to create your own versions.

I think what I might do is we might take a bit of an early break because we've got a lot to cover and I want to do it all in one big go. So let's take a break until 7.45 and then we're going to come back and talk about recurrent neural networks.

So we're going to talk about RNNs. Before we do, we're going to dig a little bit deeper into SGD, because I just want to make sure everybody's totally comfortable with SGD. And so what we're going to look at is we're going to look at a Lesson 6 SGD Notebook.

And we're going to look at a really simple example of using SGD to learn y=ax+b. And so what we're going to do here is create the simplest possible model, y=ax+b. And then we're going to generate some random data that looks like so. So here's our x, and here's our y, we're going to predict y from x.

And we passed in 3 and 8 as our a and b, so we're going to try and recover that. And so the idea is that if we can solve something like this, which has two parameters, we can use the same technique to solve something with 100 million parameters without any changes at all.

So in order to find an a and b that fits this, we need a loss function. And this is a regression problem because we have a continuous output. So for continuous output regression, we tend to use mean-squared error. And obviously all of this stuff, there's implementations in NumPy, implementations in PyTorch, we're just doing stuff by hand so you can see all the steps.

So there's MSE. y-hat is what we often call our predictions, and y-hat minus y, squared, then averaged: there's our mean squared error. So for example, if 10 and 5 were our a and b, then there's our mean squared error. So if we've got an a and a b, and we've got an x and a y, then our mean squared error loss is just the mean squared error of our linear predictions and our y.

So there's our loss for 10, 5, x, y. So that's a loss function. And so when we talk about combining linear layers and loss functions and optionally nonlinear layers, this is all we're doing, we're putting a function inside a function. I know people draw these clever-looking dots and lines all over the screen when they're saying this is what a neural network is, but it's just a function of a function of a function.

So here we've got a prediction function being a linear layer, followed by a loss function being MSE, and now we can say, oh, let's just define this as MSE loss and we'll use that in the future. So there's our loss function, which incorporates our prediction function. So let's generate 10,000 items of fake data, and let's turn them into variables so we can use them with PyTorch, because Jeremy doesn't like taking derivatives, so we're going to use PyTorch for that.
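Written out in NumPy, those pieces are just a function inside a function; the names here are my own, but this is the whole thing:

    import numpy as np

    def lin(a, b, x):
        return a * x + b                      # the prediction function (a "linear layer")

    def mse(y_hat, y):
        return ((y_hat - y) ** 2).mean()      # the loss function

    def mse_loss(a, b, x, y):
        return mse(lin(a, b, x), y)           # loss of the predictions, as described above

    x = np.random.uniform(0, 1, 10000)
    y = lin(3., 8., x) + np.random.normal(0, 0.3, 10000)   # fake data around y = 3x + 8
    mse_loss(10., 5., x, y)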

And let's create a random weight for A and for B, so a single random number. And we want the gradients of these to be calculated as we start computing with them, because these are the actual things we need to update in our SGD. So here's our A and B, 0.029, 0.111.

So let's pick a learning rate, and let's do 10,000 epochs of SGD. In fact, this isn't really SGD, it's not Stochastic Gradient Descent, this is actually full gradient descent, each loop is going to look at all of the data. Stochastic Gradient Descent would be looking at a subset each time.

So to do gradient descent, we basically calculate the loss. So remember, we've started out with a random A and B, and so this is going to compute some amount of loss. And it's nice from time to time, so one way of saying from time to time is if the epoch number mod 1000 is 0, so every 1000 epochs, just print out the loss, see how we're doing.

So now that we've computed the loss, we can compute our gradients. And so remember, this thing here is both a number, a single number that is our loss, something we can print, but it's also a variable because we passed variables into it, and therefore it also has a method .backward, which means calculate the gradients of everything that we asked it to, everything that we said requires grad equals true.

So at this point, we now have a .grad property inside A and inside B, and here they are, here is that .grad property. So now that we've calculated the gradients for A and B, we can update them by saying A is equal to whatever it used to be minus the learning rate times the gradient.

We update .data because A is a variable, and a variable contains a tensor in its .data property (again, this distinction is going to disappear in PyTorch 0.4), so for now it's actually that tensor that we need to update. So: update the tensor inside here to be whatever it used to be, minus the learning rate times the gradient.

And that's basically it, that's basically all gradient descent is. So it's as simple as we claimed. There's one extra step in PyTorch, which is that you might have multiple different loss functions or lots of output layers all contributing to the gradient, and you have to add them all together.

And so if you've got multiple loss functions, you could be calling loss.backward on each of them, and what that does is add to the gradients. So you have to tell it when to set the gradients back to zero, and that's where you just set A's gradients to zero and set B's gradients to zero.

And so this is wrapped up inside the optim.SGD class. So when we use optim.SGD and we just say .step(), it's doing these updates for us; and when we say .zero_grad(), it's just doing this for us. And this underscore here: for pretty much every function that applies to a tensor in PyTorch, if you stick an underscore on the end, it means do it in place.
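The whole loop, then, is only a few lines. This sketch uses requires_grad tensors (the PyTorch 0.4+ equivalent of wrapping things in V), so the details differ slightly from the 0.3-era notebook, but the steps are exactly the ones just described.

    import torch

    x = torch.rand(10000)
    y = 3. * x + 8. + torch.randn(10000) * 0.3            # fake data

    a = torch.randn(1, requires_grad=True)
    b = torch.randn(1, requires_grad=True)
    lr = 1e-3

    for t in range(10000):
        loss = ((a * x + b - y) ** 2).mean()              # MSE of the linear prediction
        if t % 1000 == 0: print(loss.item())
        loss.backward()                                   # fills in a.grad and b.grad
        a.data -= lr * a.grad.data                        # update the underlying tensors
        b.data -= lr * b.grad.data
        a.grad.data.zero_()                               # gradients accumulate, so reset them
        b.grad.data.zero_()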

So this is actually going to not return a bunch of zeros, but it's going to change this in place to be a bunch of zeros. So that's basically it. We can look at the same thing without PyTorch, which means we actually do have to do some calculus. So if we generate some fake data, again, we're just going to create 50 data points this time just to make this fast and easy to look at.

And so let's create a function called update; we're just going to use NumPy, no PyTorch. So our predictions are equal to the linear function, and in this case we're actually going to calculate the derivatives. The derivative of the squared loss is just 2 times the error, and the derivative with respect to a is just that times x; you can confirm that yourself if you want to.

And so here we're going to update: a minus-equals the learning rate times the derivative of the loss with respect to a, and for b it's the learning rate times the derivative with respect to b. So let's just run all this. Just for fun, rather than looping through manually, we can use matplotlib's FuncAnimation command to run the animate function a bunch of times, which is going to run 30 epochs, and at the end of each epoch it prints out on the plot where the line currently is, and that creates this little movie.
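Here is the same idea with the derivatives written out by hand in NumPy, no autograd at all; the learning rate and step count are just illustrative.

    import numpy as np

    def upd(a, b, x, y, lr):
        y_hat = a * x + b
        dldb = 2 * (y_hat - y)                 # d(loss)/db, per data point
        dlda = x * dldb                        # d(loss)/da, per data point
        return a - lr * dlda.mean(), b - lr * dldb.mean()

    x = np.random.uniform(0, 1, 50)
    y = 3. * x + 8. + np.random.normal(0, 0.3, 50)
    a, b = 0., 0.
    for _ in range(300):
        a, b = upd(a, b, x, y, lr=0.4)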

So you can actually see the line moving into place. So if you want to play around with understanding how PyTorch gradients actually work, step by step here's the world's simplest example. And it's kind of weird to say that's it, like when you're optimizing 100 million parameters in a neural net, it's doing the same thing, but it actually is.

You can actually look at the PyTorch code and see this is it, there's no trick. Well, we learned a couple of minor tricks last time, like momentum and Adam, but if you can do it in Excel you can do it in Python. So let's now talk about RNNs; we're now in the lesson 6 RNN notebook.

And we're going to study Nietzsche, as you should. So Nietzsche says supposing that truth is a woman, apparently all philosophers have failed to understand women. So apparently at the point that Nietzsche was alive, there were no female philosophers, or at least those that were around didn't understand women either.

So anyway, this is the philosopher we've apparently chosen to study, who is actually much less awful than people think, but it was a different era, I guess. So we're going to learn to write philosophy like Nietzsche, and we're going to do it one character at a time.

So this is like the language model that we did in lesson 4 where we did it a word at a time, but this time we're going to do it a character at a time. And so the main thing I'm going to try and convince you is that an RNN is no different to anything you've already learned.

And so to show you that, we're going to build it from plain PyTorch layers, all of which are extremely familiar already. And eventually we're going to use something really complex, which is a for loop. So that's when we're going to make it really sophisticated. The basic idea of RNNs is that you want to keep track of state, so you can handle long-term dependencies.

So for example, if you're trying to model something like this template language, then at the end of your percent comment do percent, you need a percent comment end percent. And so somehow your model needs to keep track of the fact that it's inside a comment over all of these different characters.

So this is this idea of state, it needs kind of memory. And this is quite a difficult thing to do with just a ConvNet. It turns out to be possible, but it's a little bit tricky. Whereas with an RNN, it turns out to be pretty straightforward. So these are the basic ideas.

You want a stateful representation where you're keeping track of where we are now, you want memory and long-term dependencies, and potentially even variable-length sequences. These are all difficult things to do with ConvNets, but they're very straightforward with RNNs. So for example, SwiftKey a year or so ago did a blog post about their new language model, where of course this is what their neural net looks like.

Somehow they always looked like this on the internet. You've got a bunch of words and it's basically going to take your particular words in their particular orders and try and figure out what the next word's going to be, which is to say they built a language model. They actually have a pretty good language model.

If you've used SwiftKey, they seem to do better predictions than anybody else still. Another cool example was Andre Kepathy a couple of years ago showed that he could use character level RNN to actually create an entire LaTeX document. So he didn't actually tell it in any way what LaTeX looks like, he just passed in some LaTeX text like this and said generate more LaTeX text, and it literally started writing something which means about as much to me as most math papers do.

So we're going to start with something that's not an RNN, and I'm going to introduce Jeremy's patented neural network notation involving boxes, circles and triangles. So let me explain what's going on. A rectangle is an input, an arrow is a layer, a circle is hidden activations; in fact, every shape is a bunch of activations.

The rectangle is the input activations, the circle is the hidden activations, and a triangle is the output activations. An arrow is a layer operation, or possibly more than one. So here my rectangle is an input of number of rows equal to batch size and number of columns equal to the number of inputs, number of variables.

And so my first arrow, my first operation, is going to represent a matrix product followed by a ReLU, and that's going to generate a set of activations. An activation is just a number, a number that's been calculated by a ReLU or a matrix product or whatever.

So this circle here represents a matrix of activations. All of the numbers that come out when we take the inputs, we do a matrix product followed by a ReLU. So we started with batch size by number of inputs, and so after we do this matrix operation, we now have batch size by whatever the number of columns in our matrix product was, by number of hidden units.

And so if we now take these activations, which is a matrix, and we put it through another operation, in this case another matrix product and a softmax, we get a triangle, that's our output activations, another matrix of activations, and again, number of rows is batch size, number of columns is equal to the number of classes, however many columns the matrix in this matrix product had.
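If it helps to see that picture as code, here is a minimal sketch in plain PyTorch; the class name and the sizes in the example call at the bottom are just made up for illustration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class OneHiddenLayerNet(nn.Module):
        def __init__(self, n_in, n_hidden, n_classes):
            super().__init__()
            self.l1 = nn.Linear(n_in, n_hidden)        # first arrow: matrix product (plus ReLU)
            self.l2 = nn.Linear(n_hidden, n_classes)   # second arrow: matrix product (plus softmax)

        def forward(self, x):                          # x: batch size by number of inputs
            h = F.relu(self.l1(x))                     # circle: hidden activations
            return F.log_softmax(self.l2(h), dim=-1)   # triangle: output activations

    net = OneHiddenLayerNet(20, 100, 10)
    out = net(torch.randn(64, 20))                     # 64 x 20 in, 64 x 10 out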

So that's a neural net, that's our basic one hidden layer neural net. If you haven't written one of these from scratch, try it. And in fact, in lessons 9, 10 and 11 of the machine learning course, we do this, we create one of these from scratch. So if you're not quite sure how to do it, you can check out the machine learning course.

In general the machine learning course is much more like building stuff up from the foundations, whereas this course is much more like best practices, kind of top down. So if we were doing a ConvNet with a single dense hidden layer, our input would be, as usual in PyTorch, number of channels by height by width, and notice that batch size appeared every time before, so I'm not going to write it anymore.

So I've removed the batch size. Also the activation function: it's always basically ReLU or something similar for all the hidden layers and softmax at the end for classification, so I'm not going to write that either. In each picture I'm going to simplify it a little bit. So I'm not going to mention that batch size is still there, and I'm not going to mention ReLU or softmax, but they're still there.

So here's our input, and so in this case rather than a matrix product, we'll do a convolution, a stride 2 convolution, so we'll skip over every second one, or it could be a convolution followed by a max pool. In either case, we end up with something where the number of channels is replaced by the number of filters, and we now have height divided by 2 and width divided by 2, and then we can flatten that out somehow.

We'll talk next week about the main way we do that nowadays, which is basically to do something called adaptive max pooling, where we basically get an average across the height and the width, and turn that into a vector. Anyway, somehow we flatten it out into a vector, we can do a matrix product, or a couple of matrix products.

That's what we actually tend to do in fastai, so that'll be our fully connected layer with some number of activations. The final matrix product gives us some number of classes. So this is our basic component, remembering, rectangle is input, circle is hidden, triangle is output, all of the shapes represent a tensor of activations, all of the arrows represent a layer operation.
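And here is a rough sketch of that ConvNet picture, again in plain PyTorch; the kernel size, the use of adaptive max pooling, and all the numbers in the example call are assumptions for illustration, not the lesson's exact model.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyConvNet(nn.Module):
        def __init__(self, n_channels, n_filters, n_fc, n_classes):
            super().__init__()
            self.conv = nn.Conv2d(n_channels, n_filters, kernel_size=3, stride=2, padding=1)
            self.fc1 = nn.Linear(n_filters, n_fc)       # fully connected layer
            self.fc2 = nn.Linear(n_fc, n_classes)       # final matrix product -> classes

        def forward(self, x):                           # x: batch x channels x height x width
            x = F.relu(self.conv(x))                    # stride 2: now batch x filters x h/2 x w/2
            x = F.adaptive_max_pool2d(x, 1)             # pool over the height and width
            x = x.view(x.size(0), -1)                   # flatten into a vector per item
            x = F.relu(self.fc1(x))
            return F.log_softmax(self.fc2(x), dim=-1)

    net = TinyConvNet(3, 16, 100, 10)
    out = net(torch.randn(8, 3, 32, 32))                # 8 x 10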

So now let's jump to the first one that we're going to actually try to create for NLP. And we're going to basically do exactly the same thing as here, and we're going to try and predict the third character in a three-character sequence based on the previous two characters. So our input, and again remember, we've removed the batch size dimension, we're not saying it but it's still here, and also here I've removed the names of the layer operations entirely, just keeping simplifying things.

So for example, our first input would be the first character of each string in our mini-batch, and assuming this is one-hot encoded, then the width is just however many items there are in the vocabulary, how many unique characters could we have. We probably won't really one-hot encode it, we'll feed it in as an integer and pretend it's one-hot encoded by using an embedding layer, which is mathematically identical.

And then that's going to give us some activations which we can stick through a fully connected layer, so we put that through a fully connected layer to get some activations. We can then put that through another fully connected layer, and now we're going to bring in the input of character 2.

So the character 2 input will be exactly the same dimensionality as the character 1 input, and we now need to somehow combine these two arrows together. So we could just add them up, for instance, because remember this arrow here represents a matrix product, so this matrix product is going to spit out the same dimensionality as this matrix product.

So we could just add them up to create these activations. And so now we can put that through another matrix product, and of course remember all these matrix products have a ReLU as well, and this final one will have a softmax instead to create our predicted set of characters.

So it's a standard 2-hidden-layer neural net, or I guess it's actually 3 matrix products. This first one is coming through an embedding layer. The only difference is that we've also got a second input coming in here that we're just adding in, but it's conceptually identical. So let's implement that.

So I'm not going to use torchtext, and I'm going to try not to use almost any fastai, so we can see it all again from the raw pieces. So here are the first 400 characters of the collected works. Let's grab a set of all of the letters that we see there and sort them.

And so a set creates all the unique letters, so we've got 85 unique letters in our vocab. It's nice to put an empty null or some kind of padding character in there for padding, so we're going to put a padding character at the start. And so here is what our vocab looks like.

So chars is our vocab. So as per usual, we want some way to map every character to a unique ID and every unique ID to a character. And so now we can just go through our collected works of Nietzsche and grab the index of each one of those characters, so now we've just turned it into this.

So rather than "pe", we now have 40, 42, 29. So that's basically the first step. And just to confirm, we can now take each of those indexes and turn them back into characters and join them together, and yeah, there it is. So from now on we're just going to work with this IDX list, the list of character numbers in the connected works of Nietzsche.

Yes? So Jeremy, why are we doing a model of characters and not a model of words? I just thought it seemed simpler. With a vocab of 80-ish items, we can kind of see it better. Character level models turn out to be potentially quite useful in a number of situations, but we'll cover that in Part 2.

The short answer is, you generally want to combine both a word level model and a character level model, like if you're doing translation, it's a great way to deal with unusual words rather than treating it as unknown. Anytime you see a word you haven't seen before, you could use a character level model for that.

And there's actually something in between the two called a byte pair encoding, BPE, which basically looks at little n-grams of characters, but we'll cover all that in Part 2. If you want to look at it right now, then Part 2 of the existing course already has this stuff taught.

And Part 2 of the version 1 of this course, all the NLP stuff is in PyTorch, by the way, so you'll understand it straight away. It was actually the thing that inspired us to move to PyTorch, because trying to do it in Keras turned out to be a nightmare.

So let's create the inputs to this. We're actually going to do something slightly different to what I said: we're going to try and predict the fourth character using the first 3, so the index 3 character using the characters at index 0, 1, and 2. So we're going to do exactly the same thing, but with just a couple more layers.

So that means that we need a list of the 0th, 1st, 2nd, and 3rd characters, so I'm just grabbing every third character, starting from position 0, then from 1, then from 2, then from 3. So we're going to predict the 4th character from the first 3, and our inputs will be the first three of those lists.

So we can just use np.stack to pop them together. So here's the 0, 1, and 2 characters that are going to feed into our model, and then here is the next character in the list. So for example, x1, x2, x3, and y. So you can see for example, the very first item would be 40, 42, and 29, so that's characters 0, 1, and 2.

And then we'd be predicting 30, that's the 4th character, which is the start of the next row. So 30, 25, 27, we need to predict 29, which is the start of the next row, and so forth. So we're always using 3 characters to predict the 4th. So there are 200,000 of these that we're going to try and model.
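A sketch of building those lists; the names c1_dat and so on are assumptions, and idx is the integer version of the text from above.

    import numpy as np

    cs = 3   # use 3 characters to predict the 4th
    c1_dat = [idx[i]     for i in range(0, len(idx) - cs, cs)]   # characters at offset 0
    c2_dat = [idx[i + 1] for i in range(0, len(idx) - cs, cs)]   # offset 1
    c3_dat = [idx[i + 2] for i in range(0, len(idx) - cs, cs)]   # offset 2
    c4_dat = [idx[i + 3] for i in range(0, len(idx) - cs, cs)]   # offset 3: the labels

    x1, x2, x3 = np.stack(c1_dat), np.stack(c2_dat), np.stack(c3_dat)
    y = np.stack(c4_dat)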

So we're going to build this model, which means we need to decide how many activations. So I'm going to use 256, and we need to decide how big our embeddings are going to be. And so I decided to use 42, so about half the number of characters I have.

And you can play around with these to see if you can come up with better numbers, it's just kind of experimental. And now we're going to build our model. Now I'm going to change my model slightly. And so here is the full version, so predicting character 4 using characters 1, 2, and 3.

As you can see, it's the same picture as the previous page, but I put some very important colored arrows here. All the arrows of the same color are going to use the same matrix, the same weight matrix. So all of our input embeddings are going to use the same matrix.

All of our layers that go from one layer to the next are going to use the same orange arrow weight matrix, and then our output will have its own matrix. So we're going to have 1, 2, 3 weight matrices. And the reason I'm not going to have a separate one for everything here is: why would a character semantically have a different meaning depending on whether it was the first or the second or the third item in a sequence?

It's not like we're even starting every sequence at the start of a sentence, we just arbitrarily chopped it into groups of 3. So you would expect these to all have the same kind of conceptual mapping. And when we're moving from character_0 to character_1, to kind of say build up some state here, why would that be any different kind of operation to moving from character_1 to character_2?

So that's the basic idea. So let's create a 3-character model, and so we're going to create one linear layer for our green arrow, one linear layer for our orange arrow, and one linear layer for our blue arrow, and then also one embedding. So the embedding is going to bring in something of size, whatever it was, 84, and spit out something with a number of factors in the embedding, we'll then put that through a linear layer, and then we've got our hidden layers, we've got our output layer.

So when we call forward, we're going to be passing in 1, 2, 3 characters. So for each one, we'll stick it through an embedding, we'll stick it through a linear layer, and we'll stick it through a ReLU. So we'll do it for character_1, character_2, and character_3. Then I'm going to create this circle of activations here, and that matrix I'm going to call h, so it's going to be equal to my input activations after going through the ReLU and the linear layer and the embedding, and then I'm going to apply this l_hidden, so the orange arrow, and that's going to get me to here.

So that's what this layer here does. And then to get to the next one, I need to apply the same thing, and apply the orange arrow to that. But I also have to add in this second input, so take my second input and add it to my previous layer. I don't really see how these dimensions are the same, for h and in2.

Let's figure out the dimensions together. self.e is going to give us something of length 42, and then it's going to go through l_in, which is going to make it of size n_hidden. And so then we're going to pass that, which is now size n_hidden, through this, which is also going to return something of size n_hidden.

So it's really important to notice that this is square, this is a square weight matrix. So we now know that this is of size n_hidden, and in2 is going to be exactly the same size as in1 was, which is n_hidden. So we can now sum together two sets of activations, both of size n_hidden, pass that into here, and again it returns something of size n_hidden.

So basically the trick was to make this a square matrix, and to make sure that this square matrix was the same size as the output of this hidden layer. Thanks for the great question. Could you pass that back now? I don't like it when I have three bits of code that look identical, and then three bits of code that look nearly identical but aren't quite, because it's harder to refactor.
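For reference, here is roughly what that model looks like, before the refactoring Jeremy is about to describe; treat it as a sketch, since details like the exact non-linearities may differ slightly from the notebook.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Char3Model(nn.Module):
        def __init__(self, vocab_size, n_fac, n_hidden):
            super().__init__()
            self.e = nn.Embedding(vocab_size, n_fac)       # pretend one-hot encoding
            self.l_in = nn.Linear(n_fac, n_hidden)         # green arrow
            self.l_hidden = nn.Linear(n_hidden, n_hidden)  # orange arrow: the square matrix
            self.l_out = nn.Linear(n_hidden, vocab_size)   # blue arrow

        def forward(self, c1, c2, c3):
            in1 = F.relu(self.l_in(self.e(c1)))
            in2 = F.relu(self.l_in(self.e(c2)))
            in3 = F.relu(self.l_in(self.e(c3)))
            h = torch.tanh(self.l_hidden(in1))             # first circle of hidden activations
            h = torch.tanh(self.l_hidden(h + in2))         # add in character 2, orange arrow again
            h = torch.tanh(self.l_hidden(h + in3))         # add in character 3
            return F.log_softmax(self.l_out(h), dim=-1)

Notice the three in lines are identical, but the three h lines are only nearly identical; that's what gets cleaned up next.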

So I'm going to make h into a bunch of zeros, so that I can then put h here, and these are now identical. So the hugely complex trick that we're going to do very shortly is to replace these three things with a for loop. And it's going to loop through 1, 2 and 3.

That's going to be the for loop, or actually 0, 1 and 2. At that point we'll be able to call it a recurrent neural network. So just to skip ahead a little bit. So we create that model, so we can now just use the same columnar model data class that we've used before, and if we use from_arrays, then it's basically going to spit back the exact arrays we gave it.

So if we stack together those three arrays, then it's going to feed those three things back to our forward method. So if you want to play around with training models using as raw an approach as possible but without writing lots of boilerplate, this is kind of how to do it.

Use ColumnarModelData.from_arrays, and then whatever you pass in here, you're going to get back here. So I've passed in three things, which means I'm going to get sent three things. So that's how that works. Batch size 512, because this data is tiny so I can use a bigger batch size.
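If I'm remembering the old fastai (0.7-era) call correctly, it looks roughly like this; the '.' is just a path for the model data object and [-1] is a trivial set of validation indices, so treat the exact arguments as assumptions.

    from fastai.column_data import ColumnarModelData   # fastai 0.7-era import
    import numpy as np

    bs = 512
    # whatever you stack together and pass in here comes back out as the
    # arguments to the model's forward method
    md = ColumnarModelData.from_arrays('.', [-1], np.stack([x1, x2, x3], axis=1), y, bs=bs)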

So I'm not using really much fast.ai stuff at all, I'm using fast.ai stuff just to save me fiddling around with data loaders and data sets and stuff, but I'm actually going to create a standard PyTorch model, I'm not going to create a learner. So this is a standard PyTorch model, and because I'm using PyTorch, that means I have to remember to write .cuda, just to get on the GPU.

So here is how we can look inside at what's going on. So we can say iter md.train data loader to grab the iterator to iterate through the training set. We can then call next on that to grab a mini-batch, and that's going to return all of our x's and our y tensor, and so we can then take a look at x's, for example.

And so you would expect, well, have a think about what you would expect for this length: 3, not surprisingly, because these are the three things. And so then xs[0], not surprisingly, is of length 512, and it's not actually one-hot encoded because we're using an embedding to pretend it is. And so then we can use the model as if it's a function by passing to it the variable-ized version of our tensors.

And so have a think about what you would expect to be returned here. So not surprisingly, we had a mini-batch of 512, so we still have 512. And then 85 is the probability of each of the possible vocab items, and of course we've got the log of them, because that's kind of what we do in PyTorch.

So that's how you can look inside, so you can see here how to do everything really very much by hand. So we can create an optimizer, again using standard PyTorch. So with PyTorch, when you use a PyTorch optimizer, you have to pass in a list of the things to optimize, and so if you call m.parameters, that will return that list for you.

And then we can fit. And there it goes. And so we don't have learning rate finders and SGDR and all that stuff because we're not using a learner, so we'll have to manually do learning rate annealing, so set the learning rate a little bit lower and fit again. And so now we can write a little function to test this thing out.
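Before we get to that little test function, here's roughly what those steps look like; fit and set_lrs are the old fastai helpers that come in with the library's imports, Char3Model and md are the sketches from above, and the learning rates here are plausible values rather than necessarily the lesson's exact ones.

    import torch.nn.functional as F
    import torch.optim as optim

    n_fac, n_hidden = 42, 256
    m = Char3Model(vocab_size, n_fac, n_hidden).cuda()   # a plain PyTorch model, moved to the GPU

    opt = optim.Adam(m.parameters(), 1e-2)   # pass in the list of things to optimize
    fit(m, md, 1, opt, F.nll_loss)           # fastai's lowest-level training loop

    set_lrs(opt, 1e-3)                       # manual learning rate annealing
    fit(m, md, 1, opt, F.nll_loss)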

So here's something called get_next where we can pass in 3 characters, like 'y. ', and so I can then go through and turn that into a tensor of an array of the character index for each character in that list. So we basically turn those into integers and variables, pass that to our model, and then we can do an argmax on that to grab which character number it is.

And in order to do stuff in NumPy land, I use to_np to turn that variable into a NumPy array. And then I can return that character, and so for example a capital T is what it thinks would be reasonable after seeing 'y. ', y, full stop, space, and that seems like a very reasonable way to start a sentence.
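The notebook's version of this uses fastai's little T, V and to_np wrappers; here is the same idea written in plain PyTorch, assuming the m, chars and char_indices defined above.

    import torch

    def get_next(inp):
        # inp is a string, e.g. 'y. '; look up each character's integer id
        device = next(m.parameters()).device
        idxs = torch.tensor([char_indices[c] for c in inp], dtype=torch.long, device=device)
        # one argument per character position, each a mini-batch of one
        p = m(*(idxs[i].unsqueeze(0) for i in range(len(inp))))
        return chars[p.argmax(dim=-1).item()]   # the most likely next character

    get_next('y. ')   # e.g. 'T'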

If it was 'ppl', then 'e', that sounds reasonable; ' th', then 'e', that sounds reasonable; and 'and', then a space, that sounds reasonable. So it seems to have created something sensible. So the important thing to note here is our character model is a totally standard fully connected model. The only slightly interesting thing we did was to do this addition of each of the inputs one at a time.

But there's nothing new conceptually here, we're training it in the usual way. Let's now create an RNN. So an RNN is when we do exactly the same thing that we did here, but I could draw this more simply by saying, you know what, if we've got a green arrow going to a circle, let's not draw a green arrow going to a circle again and again and again, but let's just draw it like this.

So this is exactly the same picture as this one. And so you just have to say how many times to go around this circle. So in this case, if we want to predict character number n from characters 1 through n-1, then we can take the character 1 input, get some activations, feed that to some new activations that go through, remember orange is the hidden to hidden weight matrix, and each time we'll also bring in the next character of input through its embeddings.

So that picture and that picture are two ways of writing the same thing. But this one is more flexible because rather than me having to say let's do it for 8, I don't have to draw 8 circles, I can just say, oh just repeat this. So I could simplify this a little bit further by saying, you know what, rather than having this thing as a special case, let's actually start out with a bunch of zeros and then let's have all of our characters inside here.

So I was wondering, if you can explain a little bit better, why are you reusing those... Why use the same colored arrows? You kind of seem to be reusing the same weight matrices. Maybe this is kind of similar to what we did in convolutional units, like somehow... No, I don't think so, at least not that I can see.

So the idea is just kind of semantically speaking, like this arrow here is saying take a character of input and represent it as some set of features. And this arrow is saying the same thing, take some character and represent it as a set of features, and so is this one.

So why would the 3 be represented with different weight matrices? Because it's all doing the same thing. And this orange arrow is saying transition from character 0's state to character 1's state to character 2's state. Again, it's the same thing. Why would the transition from character 0 to 1 be different from character 1 to 2?

So the idea is to say, hey, if it's doing the same conceptual thing, let's use the exact same weight matrix. My comment on convolutional neural networks is that a filter also can apply to multiple places. Yeah, that's an interesting point of view. So you're saying a convolution is almost like a kind of a special dot product with shared weights.

Yeah, that's a very good point. And in fact, one of our students actually wrote a good blog post about that last year. We should dig that up. Okay, I totally see where you're coming from and I totally agree with you. So let's implement this version. So this time we're going to do 8 characters, 8 c's.

And so let's create lists of the characters at offsets 0 through 7, and then our outputs will be the next character, and so we can stack that together. And so now we've got 600,000 by 8. So here's an example: this is characters 0 through 7, this is characters 1 through 8, this is 2 through 9; these are all overlapping.

So after characters 0 through 7, this is going to be the next one. And then after these characters, this will be the next one. So you can see that this one here has 43 as its y value, because after those, the next one will be 43. So this is the first 8 characters, this is 1 through 8, this is 2 through 9, and so forth.

So these are overlapping groups of 8 characters, and then this is the next one along. So let's create that model. So again, we use from_arrays to create a model data class. And so you'll see here we have exactly the same code as we had before. Here's our embedding, linear, hidden, output, these are literally identical.

And then we've replaced our ReLU of the linear of the embedding with something that's inside a loop, and then we've replaced the self.l_hidden thing, also inside the loop. I just realized I didn't mention last time the use of the hyperbolic tan. Hyperbolic tan looks like this, so it's just a sigmoid that's offset.

And it's very common to use a hyperbolic tan inside this state-to-state transition because it kind of stops it from flying off too high or too low. It's nicely controlled. Back in the old days, we used to use hyperbolic tan or the equivalent sigmoid a lot as most of our activation functions.

Nowadays we tend to use ReLU, but in these hidden-state transition matrices, we still tend to use hyperbolic tan quite a lot. So you'll see I've done that also here, hyperbolic tan. So this is exactly the same as before, but I've just replaced it with a for loop. And then here's my output.
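So, roughly, the loop version looks like this; it's the same layers as before, with the zeros trick and a tanh for the hidden-to-hidden step. The notebook wraps the zeros in the old Variable helpers and a .cuda(), which modern PyTorch no longer needs, so take this as a sketch.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CharLoopModel(nn.Module):
        def __init__(self, vocab_size, n_fac, n_hidden):
            super().__init__()
            self.n_hidden = n_hidden
            self.e = nn.Embedding(vocab_size, n_fac)
            self.l_in = nn.Linear(n_fac, n_hidden)
            self.l_hidden = nn.Linear(n_hidden, n_hidden)
            self.l_out = nn.Linear(n_hidden, vocab_size)

        def forward(self, *cs):
            bs = cs[0].size(0)
            h = torch.zeros(bs, self.n_hidden, device=cs[0].device)  # start with zeros
            for c in cs:                                  # the hugely sophisticated for loop
                inp = F.relu(self.l_in(self.e(c)))
                h = torch.tanh(self.l_hidden(h + inp))    # add, then the hidden-to-hidden step
            return F.log_softmax(self.l_out(h), dim=-1)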

Yes, Yannette? So does it have anything to do with the convergence of these networks? Yeah, kind of. We'll talk about that a little bit over time. Let's come back to that though. For now, we're not really going to do anything special at all, recognizing this is just a standard fully connected network.

Mainly it's quite a deep one, because this is actually this, but we've got 8 of these things now, so we've now got a deep 8-layer network, which is why Yannette is starting to suggest we should be concerned. As we get deeper and deeper networks, they can be harder and harder to train.

But let's try training this. So away it goes. As before, we've got a batch size of 512, we're using Adam, and away it goes. So we won't sit there watching it, so we can then set the learning rate back down to 1e-3, we can fit it again, and it seems to be training fine.

But we're going to try something else, which is we're going to use the trick that Yannette rather hinted at before, which is maybe we shouldn't be adding these things together. And so the reason you might want to be feeling a little uncomfortable about adding these things together is that the input state and the hidden state are kind of qualitatively different kinds of things.

The input state is the encoding of this character, whereas h represents the encoding of the series of characters so far. And so adding them together is potentially going to lose information. So I think what Yannette was going to prefer that we might do is maybe to concatenate these instead of adding them.

Does that sound good to you, Yannette? So let's now make a copy of the previous cell, all the same, but rather than using +, let's use cat. Now if we concat, then we need to make sure that our input layer is not from n_fac to n_hidden, which is what we had before, but, because we're concatenating, it needs to be n_fac + n_hidden to n_hidden.

And so now that's going to make all the dimensions work nicely. So this now is of size n_fac + n_hidden. This now makes it back to size n_hidden again, and then this is putting it through the same square matrix as before, so it's still of size n_hidden. So here's a good design heuristic if you're designing an architecture: if you've got different types of information that you want to combine, you generally want to concatenate it; adding things together, even if they're the same shape, is losing information.

And so once you've concatenated things together, you can always convert it back down to a fixed size by just chucking it through a matrix product. So that's what we've done here. It's the same thing, but now we're concatenating instead. And so we can fit that, and so last time we got 1.72, this time we got 1.68.
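Here is roughly what that concatenating version looks like; the only changes from the loop model above are the wider input layer and the torch.cat.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CharLoopConcatModel(nn.Module):
        def __init__(self, vocab_size, n_fac, n_hidden):
            super().__init__()
            self.n_hidden = n_hidden
            self.e = nn.Embedding(vocab_size, n_fac)
            # because we concatenate, the input layer is n_fac + n_hidden wide
            self.l_in = nn.Linear(n_fac + n_hidden, n_hidden)
            self.l_hidden = nn.Linear(n_hidden, n_hidden)   # still the square matrix
            self.l_out = nn.Linear(n_hidden, vocab_size)

        def forward(self, *cs):
            bs = cs[0].size(0)
            h = torch.zeros(bs, self.n_hidden, device=cs[0].device)
            for c in cs:
                inp = torch.cat((h, self.e(c)), dim=1)      # concatenate instead of adding
                inp = F.relu(self.l_in(inp))                # back down to size n_hidden
                h = torch.tanh(self.l_hidden(inp))
            return F.log_softmax(self.l_out(h), dim=-1)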

So it's not setting the world on fire, but it's an improvement, and the improvement's good. So we can now test that with get_next. And so now we can pass in 8 things: 'for thos' looks good, and 'part of ' sounds good as well. So that's enough manual hackery.

Let's see if PyTorch can do some of this for us. And so basically what PyTorch will do for us is it will write this loop automatically, and it will create these linear input layers automatically. And so to ask it to do that, we can use the nn.RNN class. So here's the exact same thing in less code by taking advantage of PyTorch.

And again, I'm not using a conceptual analogy to say PyTorch is doing something like it, I'm saying PyTorch is doing it. This is just the code you just saw wrapped up a little bit, refactored a little bit for your convenience. So when we say we now want to create an RNN, by calling nn.RNN, then what this does is it does that for loop.

Now notice that our for loop needed a starting point. You remember why, right? Because otherwise our for loop didn't quite work, we couldn't quite refactor it out. And because this is exactly the same, this needs a starting point too. So let's give it a starting point and so you have to pass in your initial hidden state.

For reasons that will become apparent later on, it turns out to be quite useful to be able to get back that hidden state at the end. And just like we could here, we could actually keep track of the hidden state. We get back two things. We get back both the output and the hidden state.

So we pass in the input and the hidden state and we get back the output and the hidden state. Yes? So it's the orange circle, the ellipse of activations, and so it is of size 256. So there's one other thing to know, which is that in our case we were replacing h with a new hidden state.

The one minor difference in PyTorch is they append the new hidden state to a list, or to a tensor, which gets bigger and bigger. So they actually give you back all of the hidden states, so in other words, rather than just giving you back the final ellipse, they give you back all of the ellipses stacked on top of each other.

And so because we just want the final one, I just indexed into it with -1. Other than that, this is the same code as before. We put that through our output layer to get the correct vocab size, and then we can train that. So you can see here I can do it manually, I can create some hidden state, I can pass it to that RNN, I can see the stuff I get back.

You'll see that the dimensionality of h is that it's actually a rank 3 tensor, whereas in my version it was a rank 2 tensor. And the difference is here we've got just a unit axis at the front. We'll learn more about why that is later, but basically it turns out you can have a second RNN that goes backwards, so one goes forwards and one goes backwards, and the idea is it's going to be better at finding relationships that kind of go backwards.

That's called a bidirectional RNN. Also it turns out you can have an RNN feed to an RNN, that's called a multi-layer RNN. So basically if you have those things, you need an additional axis on your tensor to keep track of those additional layers of hidden state. But for now, we'll always have a 1 here, and we'll always also get back a 1 at the end.
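So the nn.RNN version looks roughly like this; note the extra unit axis on the initial hidden state, which is the axis we were just talking about.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CharRnn(nn.Module):
        def __init__(self, vocab_size, n_fac, n_hidden):
            super().__init__()
            self.n_hidden = n_hidden
            self.e = nn.Embedding(vocab_size, n_fac)
            self.rnn = nn.RNN(n_fac, n_hidden)              # PyTorch writes the loop for us
            self.l_out = nn.Linear(n_hidden, vocab_size)

        def forward(self, *cs):
            bs = cs[0].size(0)
            h = torch.zeros(1, bs, self.n_hidden, device=cs[0].device)  # initial hidden state
            inp = self.e(torch.stack(cs))                   # sequence length x batch x n_fac
            outp, h = self.rnn(inp, h)                      # all the hidden states, plus the final one
            return F.log_softmax(self.l_out(outp[-1]), dim=-1)  # just the last time step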

So if we go ahead and fit this now, let's actually train it for a bit longer. So last time we only did a couple of epochs. This time we'll do 4 epochs at 1e-3, and then we'll do another 2 epochs at 1e-4. And so we've now got our loss down to 1.5, so it's getting better and better.

So here's our get_next again, and let's just do the same thing. So what we can now do is we can loop through 40 times, calling get_next each time, and then each time we'll replace that input by removing the first character and adding the thing that we just predicted. And so that way we can feed in a new set of 8 characters again and again and again. And so that way we'll call that get_next_n, so here are 40 characters that we've generated.
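That generation loop, using the get_next sketch from earlier, looks something like this; get_next_n is the name I'm assuming from the notebook.

    def get_next_n(inp, n):
        res = inp
        for i in range(n):
            c = get_next(inp)        # predict one character
            res += c
            inp = inp[1:] + c        # drop the first character, append the prediction
        return res

    get_next_n('for thos', 40)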

And so that way we'll call that getNext in, so here are 40 characters that we've generated. So we started out with 4thos, we got 4 those of the same, to the same, to the same. You can probably guess what happens if you keep predicting the same to the same.

So it's doing okay. We now have something which we've basically built from scratch, and then we've said here's how PyTorch refactored it for us. So if you want to have an interesting little homework assignment this week, try to write your own version of an RNN class. Try to literally create your own JeremysRNN, and then type in here JeremysRNN (or in your case maybe your name's not Jeremy, which is OK too), and then get it to run, writing your implementation of that class from scratch without looking at the PyTorch source code.

Basically it's just a case of going up and seeing what we did back here, make sure you get the same answers, and confirm that you do. So that's kind of a good little test, very simple little assignment, but I think you'll feel really good when you've seen "Oh, I've just re-implemented nn.RNN." So I'm going to do one other thing.

When I switched from this one, I moved the char 1 input inside the dotted line; this dotted rectangle represents the thing I'm repeating. I also, watch the triangle, the output, I moved that inside as well. Now that's a big difference because what I've actually done is to say spit out an output after every one of these circles.

So spit out an output here, and here, and here. So in other words, if I have a 3-character input, I'm going to spit out a 3-character output. I'm saying after character 1, this will be next, after character 2, this will be next, after character 3, this will be next.

So again, nothing different, and again, if you wanted to go a bit further with the assignment, you could write this by hand as well. But basically what we're saying is, in the for loop, we'd be saying results = some empty list, and then we'd be going through, and rather than returning that, we'd instead be saying results.append that, and then return torch.stack, something like that. That may even be right, I'm not quite sure.
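Taking Jeremy at his word, a hand-written version of that multi-output loop might look like this; it's just the concat model from before with the append and the torch.stack, and the class name is made up for this sketch.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CharLoopSeqModel(nn.Module):
        def __init__(self, vocab_size, n_fac, n_hidden):
            super().__init__()
            self.n_hidden = n_hidden
            self.e = nn.Embedding(vocab_size, n_fac)
            self.l_in = nn.Linear(n_fac + n_hidden, n_hidden)
            self.l_hidden = nn.Linear(n_hidden, n_hidden)
            self.l_out = nn.Linear(n_hidden, vocab_size)

        def forward(self, *cs):
            bs = cs[0].size(0)
            h = torch.zeros(bs, self.n_hidden, device=cs[0].device)
            results = []
            for c in cs:
                inp = F.relu(self.l_in(torch.cat((h, self.e(c)), dim=1)))
                h = torch.tanh(self.l_hidden(inp))
                results.append(F.log_softmax(self.l_out(h), dim=-1))  # an output at every step
            return torch.stack(results)   # sequence length x batch x vocab size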

So now, at every step, we've created an output, which is basically this picture. And so the reason, well, there's lots of reasons that's interesting, but I think the main reason right now that's interesting is that you probably noticed this approach to dealing with that data seems terribly inefficient.

Like we're grabbing the first 8, but then this next set, all but one of them overlaps the previous one. So we're kind of recalculating the exact same embeddings, 7 out of 8 of them are going to be exact same embeddings, exact same transitions, it kind of seems weird to do all this calculation to just predict one thing and then go back and recalculate 7 out of 8 of them and add one more to the end to calculate the next thing.

So the basic idea then is to say, well let's not do it that way, instead let's take non-overlapping sets of characters, like so. Here is our first 8 characters, here is the next 8 characters, here are the next 8 characters. So if you read this top left to bottom right, that would be the whole of Nietzsche.

And so then, if these are the first 8 characters, then offset this by 1, starting here, that's a list of outputs. So after we see characters 0 through 7, we should predict characters 1 through 8. So after 40 should come 42 as it did, after 42 should come 29 as it did.
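A sketch of building those non-overlapping inputs and the one-character-shifted labels; again, the variable names here are my assumptions.

    import numpy as np

    cs = 8
    # non-overlapping chunks of 8 characters as the inputs...
    c_in_dat  = [[idx[i + j]     for i in range(cs)] for j in range(0, len(idx) - cs - 1, cs)]
    # ...and the same chunks shifted along by one character as the labels
    c_out_dat = [[idx[i + j + 1] for i in range(cs)] for j in range(0, len(idx) - cs - 1, cs)]

    xs = np.stack(c_in_dat)    # (number of chunks) x 8
    ys = np.stack(c_out_dat)   # the same shape, offset by one character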

And so now that can be our inputs and labels for that model. And so it shouldn't be any more or less accurate, it should just be the same, but it should allow us to do it more efficiently. So let's try that. So I mentioned last time that we had a -1 index here, because we just wanted to grab the last triangle.

So in this case, we're going to grab all the triangles. So this is actually the way nn.RNN creates things. We only kept the last one, but this time we're going to keep all of them. So we've made one change, which is to remove that -1.

Other than that, this is the exact same code as before. There's nothing much to show you here, except of course this time if we look at the labels, it's now 512 by 8 because we're trying to predict 8 things every time through. So there is one complexity here, which is that we want to use the negative log likelihood loss function as before, but the negative log likelihood loss function, just like RMSE, expects to receive 2 rank 1 tensors (well, actually, with the minibatch axis, 2 rank 2 tensors).

So 2 minibatches of vectors. The problem is that we've got 8 time steps (8 characters; in an RNN we call it a time step). We have 8 time steps, and then for each one we have 84 probabilities, the probability for every single one of the characters in the vocab.

And then we have that for each of our 512 items in the minibatch. So we have a rank 3 tensor, not a rank 2 tensor. So that means that the negative log likelihood loss function is going to spit out an error. Frankly I think this is kind of dumb; I think it would be better if PyTorch had written their loss functions in such a way that they didn't care at all about rank and they just applied it to whatever rank you gave it.

But for now at least, it does care about rank. But the nice thing is I get to show you how to write a custom loss function. So we're going to create a special negative log likelihood loss function for sequences. And so it's going to take an input and a target, and it's going to call F.nll_loss, so the PyTorch one.

So what we're going to do is we're going to flatten our input, and we're going to flatten our targets. And it turns out these are going to be the first two axes that have to be transposed. So the way PyTorch handles RNN data by default is the first axis is the sequence length.

In this case, 8. So the sequence length of an RNN is how many time steps we have. So we have 8 characters, so a sequence length of 8. The second axis is the batch size, and then as you would expect, the third axis is the actual hidden state itself. So this is going to be 8 by 512 by n_hidden, which I think was 256.

So we can grab the size and unpack it into each of these: sequence length, batch size, number hidden. Our target is 512 by 8, whereas this one here was 8 by 512. So to make them match, we're going to have to transpose the first two axes. PyTorch, when you do something like transpose, doesn't generally actually shuffle the memory order, but instead it just keeps some internal metadata to say you should treat this as if it's transposed.

Some things in PyTorch will give you an error if you try and use it when it has this internal state. It will basically say, "Error, this tensor is not contiguous." If you ever see that error, add the word "contiguous" after it and it goes away. So I don't know, they can't do that for you apparently.

So in this particular case, I got that error, so I wrote the word "contiguous" after it. And so then finally we need to flatten it out into a single vector, and so we can just go .view, which is the same as NumPy's reshape, and -1 means make it as long as it needs to be.

And then the input, again we also reshape that, but remember the predictions also have this axis of length 84, all of the predicted probabilities. So here's a custom loss function, that's it. So if you ever want to play around with your own loss functions, you can just do that like so, and then pass that to fit.
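Here is roughly that custom loss function; note that the last axis of inp is really the vocabulary-sized set of predicted log probabilities, even though it comes off the tensor as the third element of size().

    import torch.nn.functional as F

    def nll_loss_seq(inp, targ):
        # inp is sequence length x batch size x vocab size (the model's log probabilities);
        # targ is batch size x sequence length, so transpose it, make it contiguous, flatten it
        sl, bs, nh = inp.size()
        targ = targ.transpose(0, 1).contiguous().view(-1)
        return F.nll_loss(inp.view(-1, nh), targ)

    # then pass it to fit in place of F.nll_loss, e.g.
    # fit(m, md, 4, opt, nll_loss_seq)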

So it's important to remember that fit is this lowest level fastai abstraction, this is the thing that implements the training loop. And so the stuff you pass it in is all standard PyTorch stuff except for this. This is our model data object, this is the thing that wraps up the test set, the training set and the validation set together.

So when we pull the triangle into the repeated structure, so the first n-1 iterations of the sequence length, we don't see the whole sequence length, so does that mean that the batch size should be much bigger so that you get a triangular kind of -- Now be careful, you don't mean batch size, you mean sequence length, right?

Because the batch size is like something else entirely. So yes, if you have a short sequence length like 8, the first character has nothing to go on, it starts with an empty hidden state of zeros. So what we're going to start with next week is we're going to learn how to avoid that problem.

And so it's a really insightful question or concern. But if you think about it, the basic idea is why should we reset this to 0 every time? If we can kind of line up these mini-batches somehow so that the next mini-batch joins up correctly, it represents the next letter in nature's works, then we'd want to move this up into the constructor and then pass that here and then store it here.

And now we're not resetting the hidden state each time, we're actually keeping the hidden state from call to call, and so the only time that it would be failing to benefit from its learned state would be literally at the very start of the document. So that's where we're going to head next week.

I feel like this lesson, every time I've got a punch line coming, somebody asks me a question where I have to do the punch line ahead of time. So we can fit that, and fit it some more. And I want to show you something interesting, and this is coming to another punch line that Yannette tried to spoil, which is that -- remember, this is just doing a loop -- we're applying the same matrix multiply again and again.

If that matrix multiply tends to increase the activations each time, then effectively we're doing that to the power of 8, so it's going to shoot off really high, or if it's decreasing it a little bit each time, it's going to shoot off really low. So this is what we call a gradient explosion.

And so we really want to make sure that the initial l_hidden weight matrix that we create is one that's not going to cause our activations on average to increase or decrease. And there's actually a very nice matrix that does exactly that, called the identity matrix. So the identity matrix, for those that don't quite remember their linear algebra, is this.

This would be a size 3 identity matrix. And so the trick about an identity matrix is that anything times an identity matrix is itself. And therefore you could multiply by this again and again and again and still end up with itself, so there's no gradient explosion. So what we could do is, instead of using whatever the default random initialization is for this matrix, we could instead, after we create our RNN, go into that RNN, and if we now go like so, we can get the docs for nn.RNN.

And as well as the arguments for constructing it, it also tells you the inputs and outputs for calling the layer, and it also tells you the attributes. And so it tells you there's something called weight_hh_l0, and these are the learnable hidden-to-hidden weights, that's that square matrix. So after we've constructed our m, we can just go in and say m.rnn.weight_hh_l0.data, that's the tensor, .copy_, the in-place copy, torch.eye, that is 'eye' as in identity, in case you were wondering.
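So the one-liner Jeremy is describing looks roughly like this; m here is assumed to be one of the nn.RNN-based models from above, with n_hidden defined as before.

    import torch

    # torch.eye(3), for reference, is the size 3 identity matrix from the slide:
    # tensor([[1., 0., 0.],
    #         [0., 1., 0.],
    #         [0., 0., 1.]])

    # copy an n_hidden x n_hidden identity into the hidden-to-hidden weights, in place
    m.rnn.weight_hh_l0.data.copy_(torch.eye(n_hidden))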

So this is an identity matrix of size n_hidden, and this both copies it into that weight matrix in place and returns it. And this was actually a Geoffrey Hinton paper: in 2015, after recurrent neural nets had been around for decades, he was like, hey gang, maybe we should just use the identity matrix to initialize this, and it actually turns out to work really well.

And so that was a 2015 paper, believe it or not, from the father of neural networks. And so here is our implementation of his paper. And this is an important thing to note, right? When very famous people like Geoffrey Hinton write a paper, sometimes the entire implementation of that paper looks like one line of code.

So let's do it. Before, we got 0.61257; we'll fit it with exactly the same parameters, and now we've got 0.51, and in fact we can keep training: 0.50. So this tweak really, really, really helped. And one of the nice things about this tweak was that before, I could only use a learning rate of 1e-3 before it started going crazy, but after I used the identity matrix, I found I could use 1e-2 because it's better behaved.

With better weight initialization, I found I could use a higher learning rate. And honestly, these things we're increasingly trying to incorporate into the defaults in fastai. You won't necessarily need to actually know them, but we're still at a point where most things in most libraries most of the time don't have great defaults, so it's good to know all these little tricks.

It's also nice to know if you want to improve something what kind of tricks people have used elsewhere because you can often borrow them yourself. Alright, that's the end of the lesson today, so next week we will look at this idea of a stateful RNN that's going to keep its hidden state around, and then we're going to go back to looking at language models again, and then finally we're going to go all the way back to computer vision and learn about things like resnets, and batch norm, and all the tricks that were figured out in cats vs dogs.

See you then!