Lesson 6: Practical Deep Learning for Coders
00:00:00.000 |
We talked about pseudo-labeling a couple of weeks ago, and this is a way of dealing with unlabeled data. 00:00:09.960 |
Remember how in the State Farm competition we had far more unlabeled images in the test set than labeled images in the training set. 00:00:19.040 |
And so the question was, how do we take advantage of knowing something about the structure of that unlabeled data? 00:00:26.040 |
We learned this crazy technique called pseudo-labeling, or a combination of pseudo-labeling and 00:00:29.920 |
knowledge distillation, which is where you predict the outputs of the test set and then 00:00:38.120 |
you act as if those outputs were true labels and you kind of add them in to your training. 00:00:45.360 |
And the reason I wasn't able to actually implement that and see how it works was because we needed 00:00:50.080 |
a way of combining two different sets of batches. 00:00:54.320 |
And in particular, I think the advice I saw from Jeff Hinton when he wrote about pseudo-labeling 00:01:01.840 |
is that you want something like 1 in 3 or 1 in 4 of your training data to come from 00:01:08.720 |
the pseudo-label data and the rest to come from your real data. 00:01:13.720 |
So the good news is I built that thing and it was ridiculously easy. 00:01:21.280 |
This is the entire code; I called it the MixIterator, and it will be in our utils module from now on. 00:01:31.000 |
And all it does is it's something where you create whatever generators of batches you 00:01:36.720 |
like and then you pass an array of those iterators to this constructor and then every time the 00:01:46.240 |
Keras system calls next on it, it grabs the next batch from each of those sets of batches and joins them together. 00:01:55.200 |
And so what that means in practice is that I tried doing pseudo-labeling, for example, on MNIST. 00:02:01.440 |
Because remember on MNIST we already had that pretty close to state-of-the-art result, which 00:02:07.360 |
was 99.69, so I thought can we improve it anymore if we use pseudo-labeling on the test set. 00:02:24.040 |
You grab your training batches as usual using data augmentation if you want, whatever else, 00:02:33.760 |
And then you create your pseudo-batches by saying, okay, my data is my test set and my 00:02:45.320 |
labels are my predictions, and these are the predictions that I calculated back up here. 00:02:51.960 |
So now this is the second set of batches, which is my pseudo-batches. 00:02:57.720 |
And so then passing an array of those two things to the mix iterator now creates a new 00:03:02.480 |
batch generator, which is going to give us a few images from here and a few images from there. 00:03:12.520 |
So in this case, I was getting 64 from my training set and 64/4 from my test set. 00:03:25.080 |
Now I can use that just like any other generator, so then I just call model.fit_generator with it. 00:03:39.240 |
And so what it's going to do is create a bunch of batches which will be 64 items from my 00:03:48.400 |
regular training set and a quarter of that number of items from my pseudo-labeled set. 00:03:56.000 |
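As a rough sketch of the idea (this is not the exact MixIterator from the course's utils; the generator names and batch sizes in the usage comment are illustrative), an iterator like this just pulls one batch from each underlying generator and concatenates them:

```python
import numpy as np

class MixIterator:
    """Minimal sketch: wrap several batch generators and, on every call to
    next(), pull one batch from each and concatenate them into a single
    combined batch."""
    def __init__(self, iters):
        self.iters = iters

    def __iter__(self):
        return self

    def __next__(self):
        # Grab the next (x, y) batch from every underlying generator.
        batches = [next(it) for it in self.iters]
        xs = np.concatenate([x for x, y in batches])
        ys = np.concatenate([y for x, y in batches])
        return xs, ys

# Hypothetical usage: 64 real training images per step plus 16 pseudo-labeled ones.
# trn_batches = gen.flow(x_train, y_train, batch_size=64)
# pseudo_batches = gen.flow(x_test, test_preds, batch_size=16)
# mixed = MixIterator([trn_batches, pseudo_batches])
# model.fit_generator(mixed, steps_per_epoch=..., epochs=...)
```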
And lo and behold, it gave me a slightly better score. 00:04:00.320 |
There's only so much better we can do at this point, but that took us up to 99.72. 00:04:05.520 |
It's worth mentioning that every 0.01% at this point is just one image, so we're really 00:04:11.400 |
kind of on the edges at this point, but this is getting even closer to the state-of-the-art 00:04:17.120 |
despite the fact we're not doing any handwriting-specific techniques. 00:04:21.440 |
I also tried it on the fish dataset and I realized at that point that this allows us 00:04:28.040 |
to do something else which is pretty neat, which is normally when we train on the training 00:04:33.780 |
set and set aside a validation set, if we don't want to submit to Kaggle, we've only 00:04:39.520 |
trained on a subset of the data that they gave us. 00:04:42.340 |
We didn't train on the validation set as well, which is not great, right? 00:04:47.120 |
So what you can actually do is you can send three sets of batches to the MixIterator. 00:04:52.820 |
You can have your regular training batches, you can have your pseudo-label test batches, 00:04:59.800 |
and if you think about it, you could also add in some validation batches using the true labels. 00:05:06.080 |
So this is something you do just right at the end when you say this is a model I'm happy 00:05:10.800 |
with, you could fine-tune it a bit using some of the real validation data. 00:05:15.920 |
You can see here I've got out of my batch size of 64, I'm putting 44 from the training 00:05:20.260 |
set, 4 from the validation set, and 16 from the pseudo-label test set. 00:05:28.520 |
It got me from about 110th to about 60th on the leaderboard. 00:05:42.720 |
Question- If we go to the Keras documentation, there is something called sample_weight. 00:05:47.120 |
And I wonder if you can just set the sample weight to be lower for... 00:05:52.680 |
Yeah, you can use the sample weight, but you would still have to manually construct the concatenated dataset yourself. 00:06:00.920 |
So this is like a more convenient way where you don't have to append it all together. 00:06:14.800 |
I will mention that I found the way I'm doing it seems a little slow. 00:06:19.640 |
There are some obvious ways I can speed it up. 00:06:22.560 |
I'm not quite sure why it is, but it might be because this concatenation each time is 00:06:29.440 |
kind of having to create new memory and that takes a long time. 00:06:33.420 |
There are some obvious things I can do to try and speed it up. 00:06:39.560 |
I'm pleased that we now have a way to do convenient pseudo-labeling in Keras, and it seems to do a good job. 00:06:51.840 |
So the other thing I wanted to talk about before we move on to the new material today is embeddings. 00:06:57.800 |
I've had lots of questions about embeddings and I think it's pretty clear that at least 00:07:05.560 |
for some of you some additional explanations would be helpful. 00:07:09.160 |
So I wanted to start out by reminding you that when I introduced embeddings to you, 00:07:16.680 |
the data that we had, we looked at this crosstab form of data. 00:07:22.800 |
When it's in this crosstab form, it's very easy to visualize what embeddings look like, 00:07:27.000 |
which is for movie_27 and user_id number 14, here is that movie_id's embedding right here 00:07:34.000 |
and here is that user_id's embedding right here, and so here is the dot product of the two. 00:07:44.920 |
And so then all we had to do to optimize our embeddings was use the gradient descent solver 00:07:51.000 |
that is built into Microsoft Excel, which is called solver, and we just told it what 00:07:58.320 |
our objective is, which is this cell, and we set to minimize it by changing these sets of cells. 00:08:08.520 |
Now the data that we are given in the movie lens dataset, however, requires some manipulation 00:08:16.960 |
to get into a crosstab form, we're actually given it in this form, and we wouldn't want 00:08:20.920 |
to create a crosstab with all of this data because it would be way too big, every single 00:08:25.960 |
user times every single movie, and it would also be very inconvenient. 00:08:29.480 |
So that's not how Keras works, Keras uses this data in exactly this format. 00:08:35.560 |
And so let me show you how that works and what an embedding is really doing. 00:08:41.200 |
So here is the exact same thing, but I'm going to show you this using the data in the format that Keras uses. 00:08:51.520 |
Every rating is a row, it has a user_id, a movie_id, and a rating. 00:08:57.420 |
And this is what an embedding matrix looks like for 15 users. 00:09:02.600 |
So these are the user_id's, and for each user_id, here's user_id 14's embedding, and so on for each user. 00:09:12.760 |
At this stage, they're just random, they're just initializing random numbers. 00:09:16.880 |
So this thing here is called an embedding matrix. 00:09:23.720 |
So the embedding for movie_id 27 is these 5 numbers. 00:09:29.360 |
So what happens when we look at user_id 14, movie_id 417, rating 2? 00:09:37.340 |
Well the first thing that happens is that we have to find user_id number 14. 00:09:50.760 |
So then here is the first row from the user_embedding matrix. 00:10:01.240 |
Similarly for movie_id 417: here is movie_id 417, and it is the 14th row of this table. 00:10:15.600 |
And so we want to return the 14th row, and so you can see here it has looked up and found 00:10:21.280 |
that it's the 14th row, and then indexed into the table and grabbed the 14th row. 00:10:26.640 |
And so then to calculate the dot product, we simply take the dot product of the user embedding and the movie embedding. 00:10:35.120 |
And then to calculate the loss, we simply take the rating, subtract the prediction, and square it. 00:10:42.040 |
And then to get the total loss function, we just add that all up and take the square root. 00:10:48.240 |
So the orange background cells are the cells which we want our SGD solver to change in order to minimize the loss. 00:11:03.880 |
And then all of the orange bold cells are the calculated cells. 00:11:09.160 |
So when I was saying last week that an embedding is simply looking up an array by an index, this is what I meant. 00:11:18.800 |
It's literally taking an index and it looks it up in an array and returns that row. 00:11:27.600 |
You might want to convince yourself during the week that this is identical to taking 00:11:32.720 |
a one-hot encoded matrix and multiplying it by an embedding matrix; that's identical to doing this lookup. 00:11:44.040 |
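Here is a minimal numpy sketch of that equivalence (the sizes match the toy spreadsheet, but the numbers are random):

```python
import numpy as np

np.random.seed(0)
n_users, n_factors = 15, 5
user_emb = np.random.randn(n_users, n_factors)   # the embedding matrix

user_idx = 13                                    # e.g. the row for user_id 14 (0-based)

# An embedding "lookup" is literally indexing into the matrix...
looked_up = user_emb[user_idx]

# ...which is identical to multiplying a one-hot row vector by the matrix.
one_hot = np.zeros(n_users)
one_hot[user_idx] = 1.0
multiplied = one_hot @ user_emb

assert np.allclose(looked_up, multiplied)
```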
So back in the spreadsheet, we can go Data, Solver and say we want to set this cell to a minimum by changing 00:11:54.560 |
these cells, and if I say Solve, then Excel will go away and try to improve our objective, and 00:12:03.040 |
you can see it's decreasing, it's down to about 2.5. 00:12:07.160 |
And so what it's doing here is it's using gradient descent to try to find ways to increase or 00:12:12.680 |
decrease all of these numbers such that that RMSE becomes as low as possible. 00:12:20.400 |
So that's literally all that is going on in our Keras example, here, this dot product. 00:12:33.320 |
So this thing here where we said create an embedding for a user, that's just saying create 00:12:37.280 |
something where I can look up the user_id and find their row. 00:12:41.600 |
This is doing the same for a movie, look up the movie_id and find its row. 00:12:45.760 |
And this here says take the dot product once you've looked up the two, and then this here says train 00:12:51.600 |
a model where you take in that user_id and movie_id and try to predict the rating, minimizing that loss. 00:13:04.000 |
So you can see here that it's got the RMSE down to 0.4, so for example the first one predicted 00:13:14.960 |
3 where it's actually 2, then 4.5 versus 4.6, 5 and so forth, so you get the idea of how it works. 00:13:28.360 |
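A minimal sketch of that dot-product model, written with the current tensorflow.keras functional API rather than the Keras 1 syntax used in the lesson (the layer sizes are illustrative toy numbers matching the spreadsheet):

```python
from tensorflow.keras.layers import Input, Embedding, Flatten, Dot
from tensorflow.keras.models import Model

n_users, n_movies, n_factors = 15, 15, 5   # toy sizes

user_in = Input(shape=(1,), name='user_in')
movie_in = Input(shape=(1,), name='movie_in')

# Each Embedding layer is just a lookup table: id -> row of n_factors floats.
u = Flatten()(Embedding(n_users, n_factors)(user_in))
m = Flatten()(Embedding(n_movies, n_factors)(movie_in))

rating = Dot(axes=1)([u, m])          # the dot product of the two embedding rows

model = Model([user_in, movie_in], rating)
model.compile(optimizer='adam', loss='mse')
# model.fit([user_ids, movie_ids], ratings, ...)
```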
So inspired by one of the students who talked about this during the week, I grabbed the text of Green Eggs and Ham and built embeddings for it. 00:13:39.800 |
And so here is the text of Green Eggs and Ham. 00:13:41.640 |
I am Daniel, I am Sam, Sam I am, that's Sam I am, etc. 00:13:55.280 |
And the way I did that was to take every unique word in that poem. 00:14:03.080 |
Here is the ID of each of those words, just index from 1. 00:14:07.480 |
And so then I just randomly generated an embedding matrix, I equally well could have used the 00:14:18.880 |
And so then just for each word, I just look up in the list to find that word and find 00:14:23.360 |
out what number it is, so I is number 8, and so here is the 8th row of the embedding matrix. 00:14:31.040 |
So you can see here that we've started with a poem and we've turned it into a matrix of floats. 00:14:38.960 |
And so the reason we do this is because our machine learning tools want a matrix of floats, 00:14:47.680 |
So all of the questions about does it matter what the word IDs are, you can see it doesn't matter at all. 00:14:55.920 |
All we're doing is we're looking them up in this matrix and returning the floats. 00:15:01.880 |
And once we've done that, we never use them again, we just use this matrix of floats. 00:15:13.840 |
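A toy sketch of that word-to-floats step (using only a fragment of the poem and a randomly initialized embedding matrix):

```python
import numpy as np

text = "i am sam sam i am that sam i am"
words = text.split()

# Every unique word gets an id, exactly as in the spreadsheet.
vocab = sorted(set(words))
word2idx = {w: i for i, w in enumerate(vocab)}

# A randomly initialized embedding matrix: one row of 5 floats per unique word.
emb = np.random.randn(len(vocab), 5)

# The poem becomes a matrix of floats: one embedding row per word.
ids = [word2idx[w] for w in words]
poem_as_floats = emb[ids]          # shape: (len(words), 5)
```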
Feel free to ask if you have any questions either now or at any other time because we're 00:15:18.800 |
going to be using embeddings throughout this class. 00:15:23.960 |
So hopefully that helped a few people clarify what's going on. 00:15:33.980 |
So let's get back to recurrent neural networks. 00:15:39.240 |
So to remind you, we talked about the purpose of recurrent neural networks as being really about memory. 00:15:58.360 |
So it's really all about this idea of memory. 00:16:02.880 |
If we're going to handle something like recognizing a comment start and a comment end, and being 00:16:09.760 |
able to keep track of the fact that we're in a comment for all of this time so that 00:16:14.120 |
we can do modeling on this kind of structured language data, we're really going to need some kind of memory. 00:16:20.920 |
That allows us to handle long-term dependencies and it provides this stateful representation. 00:16:27.240 |
So in general, the stuff we're talking about, we're going to be looking at things that kind of need this memory. 00:16:32.840 |
And it's also somewhat helpful just for when you have a variable length sequence. 00:16:41.840 |
Question- One is, how does the size of my embedding depend on the number of unique words? 00:16:46.280 |
So mapping Green Eggs and Ham to five real numbers seems sufficient but wouldn't be for a large vocabulary. 00:16:54.000 |
So your choice of how big to make your embedding matrix, as in how many latent factors to create, 00:17:01.680 |
is one of these architectural decisions which we don't really have an answer to. 00:17:08.880 |
My best suggestion is to read the Word2Vec paper which introduced a lot of this and look 00:17:22.240 |
at the difference between a 50 dimensional, 100 dimensional, 200, 300, 600 dimensional 00:17:28.240 |
and see what are the different levels of accuracy that those different size embedding matrices 00:17:33.120 |
created when the authors of that paper provided this information. 00:17:38.840 |
So that's a quick shortcut because other people have already experimented and provided those results. 00:17:44.840 |
The other is to do your own experiments, try a few different sizes. 00:17:48.700 |
It's not really about the length of the word list, it's really about the complexity of 00:17:54.640 |
the language or other problem that you're trying to solve. 00:17:59.960 |
That's really problem dependent and will require both your intuition developed from reading 00:18:05.720 |
and experimenting and also your own experiments. 00:18:09.500 |
Question- And what would be the range of root mean squared error value to say that a model is good? 00:18:16.760 |
To say that a model is good is another model specific issue. 00:18:21.980 |
So a root mean squared error is very interpretable, it's basically how far out is it on average. 00:18:29.800 |
So we were finding that we were getting ratings within about 0.4, this mini Excel dataset 00:18:38.000 |
is too small to really make intelligent comments, but let's say it was bigger. 00:18:43.180 |
If we're getting within 0.4 on average, that sounds like it's probably good enough to be 00:18:49.280 |
useful for helping people find movies that they might like. 00:18:53.880 |
But there's really no one solution, I actually wrote a whole paper about this. 00:19:09.640 |
If you look up "Designing Great Data Products" and look at my name, this is based on really 00:19:16.040 |
mainly 10 years of work I did at a company I created called Optimal Decision Group. 00:19:22.240 |
And Optimal Decision Group was all about how to use predictive modeling not just to make 00:19:30.060 |
predictions but to optimize actions, and this whole paper is about that. 00:19:35.460 |
In the end, it's really about coming up with a way to measure the benefit to your organization 00:19:42.120 |
or to your project of getting that extra 0.1% accuracy, and there are some suggestions on how to do that in the paper. 00:19:57.760 |
So we looked at a kind of a visual vocabulary that we developed for writing down neural 00:20:08.660 |
nets where any colored box represents a matrix of activations, and that's a really important thing to remember. 00:20:18.360 |
A colored box represents a matrix of activations, so it could either be the input matrix, it 00:20:25.880 |
could be the output matrix, or it could be the matrix that comes from taking an input 00:20:31.840 |
and putting it through like a matrix product. 00:20:37.280 |
The rectangle boxes represent inputs, the circular ones represent hidden, so intermediate 00:20:44.360 |
activations, and the triangles represent outputs. 00:20:49.200 |
Arrows, very importantly, represent what we'll call layer operations. 00:20:54.760 |
And a layer operation is anything that you do to one colored box to create another colored box. 00:20:59.560 |
In general, it's almost always going to involve some kind of linear function like a matrix 00:21:04.000 |
product or convolution, and it will probably also include some kind of activation function. 00:21:13.960 |
Because the activation functions are pretty unimportant in terms of detail, I started 00:21:19.560 |
removing those from the pictures as we started to look at more complex models. 00:21:25.560 |
And then in fact, because the layer operations actually are pretty consistent, we probably 00:21:29.320 |
know what they are, I started removing those as well, just to keep these simple. 00:21:35.560 |
And so we're simplifying these diagrams to try and just keep the main pieces. 00:21:40.200 |
And as we did so, we could start to create more complex diagrams. 00:21:43.520 |
And so we talked about a kind of language model where we would take inputs of a character, 00:21:50.800 |
character number 1 and character number 2, and we would try and predict character number 3. 00:21:57.360 |
And so we thought one way to do that would be to create a deep neural network with two hidden layers. 00:22:05.040 |
The character 1 input would go through a layer operation to create our first fully connected 00:22:11.920 |
That would go through another layer operation to create a second fully connected layer. 00:22:15.820 |
And we would also add our second character input going through its own fully connected layer. 00:22:23.440 |
And to recall, the last important thing we have to learn is that two arrows going into 00:22:28.160 |
a single shape means that we are adding the results of those two layer operations together. 00:22:34.360 |
So two arrows going into a shape represents summing up, element-wise, the results of these two layer operations. 00:22:45.440 |
So this was the kind of little visual vocabulary that we set up last week. 00:22:51.060 |
And I've kept track of it down here as to what the things are in case you forget. 00:22:56.760 |
So now I wanted to point out something really interesting, which is that there are three kinds of layer operations here. 00:23:08.480 |
We've got predicting a fourth character of a sequence using characters 1, 2 and 3. 00:23:13.400 |
It's exactly the same method as on the previous slide. 00:23:18.640 |
There are layer operations that turn a character input into a hidden activation matrix. 00:23:29.480 |
There are layer operations that turn one hidden layer activation into a new hidden layer activation. 00:23:36.920 |
And then there's an operation that takes hidden activations and turns it into output activations. 00:23:43.360 |
And so you can see here, I've colored them in. 00:23:45.040 |
And here I've got a little legend of these different colors. 00:23:49.240 |
Green are the input to hidden, blue is the hidden to output, and orange is the hidden to hidden. 00:23:56.560 |
So my claim is that the dimensions of the weight matrices for each of these different 00:24:03.360 |
colored arrows, all of the green ones have the same dimensions because they're taking 00:24:07.480 |
an input of vocab size and turning it into an output hidden activation of size number of hidden units. 00:24:17.360 |
So all of these arrows represent weight matrices which are of the same dimensionality. 00:24:22.640 |
Ditto, the orange arrows represent weight matrices with the same dimensionality. 00:24:28.800 |
I would go further than that though and say the green arrows represent semantically the same thing. 00:24:35.320 |
They're all saying how do you take a character and convert it into a hidden state. 00:24:41.200 |
And the orange arrows are all saying how do you take a hidden state from a previous character 00:24:46.240 |
and turn it into a hidden state for a new character. 00:24:49.660 |
And then the blue one is saying how do you take a hidden state and turn it into an output. 00:24:55.920 |
When you look at it that way, all of these circles are basically the same thing, they're 00:25:00.240 |
just representing this hidden state at a different point in time. 00:25:04.400 |
And I'm going to use this word 'time' in a fairly general way, I'm not really talking 00:25:09.360 |
about time, I'm just talking about the sequence in which we're presenting additional pieces of information. 00:25:15.360 |
We first of all present the first character, the second character and the third character. 00:25:21.840 |
So we could redraw this whole thing in a simpler way and a more general way. 00:25:28.400 |
Before we do, I'm actually going to show you in Keras how to build this model. 00:25:35.640 |
And in doing so, we're going to learn a bit more about the functional API, which hopefully you'll find useful. 00:25:47.240 |
To do that, we are going to use this corpus of all of the collected works of Nietzsche. 00:25:54.040 |
So we load in those works, we find all of the unique characters of which there are 86. 00:26:03.440 |
Here they are, joined up together, and then we create a mapping from the character to 00:26:09.640 |
the index at which it appears in this list and a mapping from the index to the character. 00:26:15.720 |
So this is basically creating the equivalent of these tables, or more specifically I guess 00:26:25.160 |
But rather than using words, we're looking at characters. 00:26:29.400 |
So that allows us to take the text of Nietzsche and convert it into a list of numbers where 00:26:37.320 |
the numbers represent the index at which the character appears in this list. 00:26:45.320 |
So that's called idx; we've converted our whole text into a list of numbers. 00:26:54.040 |
At any point we can turn it back into text by simply taking those indexes and looking each one up in the character list. 00:27:02.480 |
So here you can see we turn it back into the start of the text again. 00:27:08.280 |
The data we're working with is a list of character IDs at this point, where those character IDs index into that list of 86 unique characters. 00:27:18.060 |
So we're going to build a model which attempts to predict the fourth character from the previous three. 00:27:30.040 |
So to do that, we're going to go through our whole list of indexes from 0 up to the end 00:27:41.920 |
And we're going to create a whole list of the 0th, 4th, 8th, 12th etc characters and 00:27:52.240 |
a list of the 1st, 5th, 9th etc and the 2nd, 6th, 10th and so forth. 00:28:00.320 |
So this is going to represent the first character of each sequence, the second character of 00:28:03.920 |
each sequence, the third character of each sequence, and this is the one we want to predict, the fourth character. 00:28:11.080 |
So we can now turn these into NumPy arrays just by stacking them up together. 00:28:16.560 |
And so now we've got our input for our first characters, second characters and third characters 00:28:22.160 |
of every four character piece of this collected works. 00:28:27.440 |
And then our y's, our labels, will simply be the fourth characters of each sequence. 00:28:35.800 |
So for example, if we took x1, x2 and x3 and took the first element of each, this is the 00:28:43.640 |
first character of the text, the second character of the text, the third character of the text, and the fourth character of the text. 00:28:52.240 |
So we'll be trying to predict this based on these three. 00:28:56.960 |
And then we'll try to predict this based on these three. 00:29:06.800 |
So you can see we've got about 200,000 of these inputs for each of x1 through x3, and the same for our labels. 00:29:18.600 |
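Roughly, the data preparation looks like this (a sketch with a random stand-in for idx; the exact stride in the lesson's notebook may differ slightly):

```python
import numpy as np

# idx is the whole corpus as a list of character ids, as built above. Here is a
# small random stand-in so the sketch runs on its own.
idx = list(np.random.randint(0, 86, size=10_000))

cs = 3            # use 3 characters to predict the 4th
step = cs + 1     # per the transcript, sequences start at characters 0, 4, 8, ...

c1 = [idx[i]     for i in range(0, len(idx) - cs, step)]
c2 = [idx[i + 1] for i in range(0, len(idx) - cs, step)]
c3 = [idx[i + 2] for i in range(0, len(idx) - cs, step)]
c4 = [idx[i + 3] for i in range(0, len(idx) - cs, step)]

x1, x2, x3 = np.stack(c1), np.stack(c2), np.stack(c3)
y = np.stack(c4)
print(x1.shape, y.shape)
```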
And so as per usual, we're going to first of all turn them into embeddings by creating an embedding matrix for the characters. 00:29:29.840 |
I haven't actually seen anybody else do this. 00:29:33.360 |
Most people just treat them as one-hot encodings. 00:29:37.440 |
So for example, the most widely used blog post about char-RNNs, which really made them 00:29:46.800 |
popular, was Andrej Karpathy's, and it's quite fantastic. 00:29:51.280 |
And you can see that in his version, he shows them as being one-hot encoded. 00:30:03.040 |
We're not going to do that, we're going to turn them into embeddings. 00:30:13.240 |
Capital A and lowercase a have some similarities that an embedding can understand. 00:30:17.880 |
Different types of things that have to be opened and closed, like different types of 00:30:21.640 |
parentheses and quotes, have certain characteristics that can be constructed in embedding. 00:30:27.520 |
There's all kinds of things that we would expect an embedding to capture. 00:30:31.560 |
So my hypothesis was that an embedding is going to do a better job than just a one-hot encoding. 00:30:48.880 |
In my experiments over the last couple of weeks, that generally seems to be true. 00:30:54.960 |
So we're going to take each character, 1-3, and turn them into embeddings by first creating 00:31:01.680 |
an input layer for them and then creating an embedding layer for that input. 00:31:07.160 |
And then we can return the input layer and the flattened version of the embedding layer. 00:31:12.880 |
So this is the input and output of each of our three embedding layers for our three characters. 00:31:26.280 |
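A sketch of that helper in current tensorflow.keras syntax (the lesson's version uses the Keras 1 API, so details differ):

```python
from tensorflow.keras.layers import Input, Embedding, Flatten

vocab_size, n_fac = 86, 42

def embedding_input(name, n_in, n_out):
    # One Input per character position, plus an Embedding lookup for it; return
    # both the input layer and the flattened embedding output.
    inp = Input(shape=(1,), dtype='int64', name=name)
    return inp, Flatten()(Embedding(n_in, n_out)(inp))

c1_in, c1 = embedding_input('c1', vocab_size, n_fac)
c2_in, c2 = embedding_input('c2', vocab_size, n_fac)
c3_in, c3 = embedding_input('c3', vocab_size, n_fac)
```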
So we now have to decide how many activations we want, and I've gone with 256. 00:31:43.040 |
That's something that seems reasonable, seems to have worked okay. 00:31:47.840 |
So we now have to somehow construct something where each of our green arrows ends up using the same weight matrix. 00:31:59.680 |
And it turns out Keras makes this really easy with the Keras Functional API. 00:32:05.200 |
When you call Dense like this, what it's actually doing is it's creating a layer with a specific number of outputs. 00:32:16.880 |
Notice that I haven't passed in anything here to say what it's connected to, so it's not connected to anything yet. 00:32:22.400 |
This is just saying, I'm going to have something which is a dense layer which creates 256 activations. 00:32:36.280 |
So it doesn't actually do anything until I then do this, so I connect it to something. 00:32:41.160 |
So here I'm going to say character1's hidden state comes from taking character1, which 00:32:47.000 |
was the output of our first embedding, and putting it through this dense_in layer. 00:32:53.660 |
So this is the thing which creates our first circle. 00:32:58.760 |
So the embedding is the thing that creates the output of our first rectangle, and this creates our first circle. 00:33:08.960 |
So what that means is that in order to create the next set of activations, we need a new dense layer. 00:33:16.560 |
So since the orange arrow is a different weight matrix to the green arrow, we have to create a separate layer: 00:33:23.240 |
I've got a new dense layer, and again with n_hidden outputs. 00:33:28.500 |
So by creating a new dense layer, this is a whole separate weight matrix; this is going to be our orange arrow. 00:33:34.800 |
So now that I've done that, I can create my character2 hidden state, which is here, and 00:33:43.040 |
I'm going to have to sum up two separate things. 00:33:45.980 |
I'm going to take my character 2 embedding and put it through my green arrow, dense_in; that's the first piece. 00:33:54.200 |
I'm going to take the output of my character1's hidden state and run it through my orange 00:34:01.840 |
arrow, which we call dense_hidden, and then we're going to merge the two together. 00:34:12.040 |
So this is adding together these two outputs. 00:34:16.880 |
In other words, it's adding together these two layer operation outputs. 00:34:24.600 |
So the third character output is done in exactly the same way. 00:34:28.160 |
We take the third character's embedding, run it through our green arrow, take the result 00:34:32.880 |
of our previous hidden activations and run it through our orange arrow, and then merge the two together. 00:34:38.000 |
Question- Is the first output the size of the latent fields in the embedding? 00:34:44.800 |
Answer- The size of the latent embeddings is what we defined when we created the embeddings up 00:34:55.600 |
here; we defined them as having size n_fac, and n_fac we defined as 42. 00:35:07.200 |
So c1, c2 and c3 represent the result of putting each character through this embedding and flattening it. 00:35:18.800 |
Those are then the things that we put into our green arrow. 00:35:25.240 |
So after doing this three times, we now have C3 hidden, which is 1, 2, 3 here. 00:35:33.160 |
So we now need a new set of weights, we need another dense layer, the blue arrow. 00:35:42.240 |
And this needs to create an output of size 86, vocab size, we need to create something 00:35:48.360 |
which can match to the one-hot encoded list of possible characters, which is 86 long. 00:35:54.520 |
So now that we've got this blue arrow, we can apply that to our final hidden state to create our output. 00:36:02.400 |
So in Keras, all we need to do now is call Model, passing in the three inputs, and so 00:36:09.560 |
the three inputs were returned to us way back here. 00:36:14.160 |
Each time we created an embedding, we returned the input layer, so that's c1_in, c2_in and c3_in. 00:36:23.640 |
So passing in the three inputs, and passing in our output. 00:36:29.640 |
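Putting those pieces together, a sketch of the whole three-character model in current tensorflow.keras syntax (it repeats the embedding helper so it stands alone; the activation choices, loss and fit call are illustrative, and the lesson's Keras 1 code uses merge rather than add):

```python
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense, add
from tensorflow.keras.models import Model

vocab_size, n_fac, n_hidden = 86, 42, 256

def embedding_input(name, n_in, n_out):
    inp = Input(shape=(1,), dtype='int64', name=name)
    return inp, Flatten()(Embedding(n_in, n_out)(inp))

c1_in, c1 = embedding_input('c1', vocab_size, n_fac)
c2_in, c2 = embedding_input('c2', vocab_size, n_fac)
c3_in, c3 = embedding_input('c3', vocab_size, n_fac)

# One shared layer per arrow colour: sharing the layer object shares its weights.
dense_in = Dense(n_hidden, activation='relu')         # green: input -> hidden
dense_hidden = Dense(n_hidden, activation='relu')     # orange: hidden -> hidden
dense_out = Dense(vocab_size, activation='softmax')   # blue: hidden -> output

c1_hidden = dense_in(c1)
c2_hidden = add([dense_in(c2), dense_hidden(c1_hidden)])
c3_hidden = add([dense_in(c3), dense_hidden(c2_hidden)])

model = Model([c1_in, c2_in, c3_in], dense_out(c3_hidden))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# model.fit([x1, x2, x3], y, batch_size=64, epochs=...)
```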
And so we can now compile it, set a learning rate, fit it, and as you can see, its loss improves. 00:36:39.080 |
And we can then test that out very easily by creating a little function that we're going to pass three letters to. 00:36:46.520 |
We're going to take those three letters and turn them into character indices, just look 00:36:52.000 |
them up to find the indexes, turn each of those into a numpy array, and call model.predict on those three arrays. 00:37:02.960 |
That gives us 86 outputs, which we then do argmax to find which index into those 86 is 00:37:14.240 |
the highest, and that's the character number that we want to return. 00:37:18.400 |
So if we pass in 'phi', it thinks that l is most likely next; for ' th', it thinks that e is most likely next; 00:37:27.640 |
and for ' an', it thinks that d is most likely next. 00:37:31.480 |
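A sketch of that prediction helper, assuming the two mapping dictionaries built earlier are called char_indices and indices_char (the names are illustrative):

```python
import numpy as np

def get_next(inp, model, char_indices, indices_char):
    # Map the three characters to their ids, predict the 86-way distribution
    # for the next character, and return the argmax character.
    idxs = [np.array([char_indices[c]]) for c in inp]
    preds = model.predict(idxs)
    return indices_char[np.argmax(preds)]

# get_next('phi', model, char_indices, indices_char)   # -> 'l' in the lesson
```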
So you can see that it seems to be doing a pretty reasonable job of taking three characters 00:37:36.480 |
and returning a fourth character that seems pretty sensible, not the world's most powerful 00:37:42.800 |
model, but a good example of how we can construct pretty arbitrary architectures using Keras's functional API. 00:38:00.640 |
Question- This model, how would it consider the context in which we are trying to predict the next character? 00:38:06.520 |
Answer- It knows nothing about the wider context; all it has at any point in time is the previous three characters. 00:38:16.960 |
We're going to improve it though, we've got to start somewhere. 00:38:21.200 |
In order to answer your question, let's build this up a little further, and rather than 00:38:38.560 |
trying to predict character 4 from the previous three characters, let's try and predict character n from characters 1 through n minus 1. 00:38:46.920 |
And since all of these circles basically mean the same thing, which is the hidden state 00:38:51.040 |
at this point, and since all of these orange arrows are literally the same thing, it's 00:38:55.520 |
a dense layer with exactly the same weight matrix, let's stack all of the circles on top 00:38:59.640 |
of each other, which means that these orange arrows can then just become one arrow pointing back into the circle itself. 00:39:06.040 |
And this is the definition of a recurrent neural network. 00:39:10.400 |
When we see it in this form, we say that we're looking at it in its recurrent form. 00:39:15.280 |
When we see it in this form, we can say that we're looking at it in its unrolled form. 00:39:27.840 |
And so for quickly sketching out an RNN architecture, this is much more convenient. 00:39:33.040 |
But actually, this unrolled form is really important. 00:39:35.880 |
For example, when Keras uses TensorFlow as a backend, it actually always unrolls it into this explicit form. 00:39:50.760 |
And so it's quite nice being able to use the Theano backend with Keras which can actually 00:39:55.200 |
directly implement it as this kind of loop, and that's what we'll be doing today shortly. 00:40:04.360 |
We're going to have character 1 input come in, go through the first green arrow, go through 00:40:11.640 |
the first orange arrow, and from then on, we can just say take the second character, 00:40:17.760 |
repeat the third character, repeat, and at each time period, we're getting a new character 00:40:23.680 |
going through a layer operation, as well as taking the previous hidden state and putting it through the orange arrow. 00:40:31.440 |
And then at the very end, we will put it through a different layer operation, the blue arrow, to create our output. 00:40:40.640 |
Does every fully connected layer have to have the same activation function? 00:40:50.560 |
In general, no, in all of the models we've seen so far, we have constructed them in a 00:41:01.200 |
way where you can write anything you like as the activation function. 00:41:07.440 |
In general though, I haven't seen any examples of successful architectures which mix activation 00:41:14.120 |
functions, other than at the output layer, which would pretty much always be a softmax for classification. 00:41:22.080 |
I'm not sure it's not something that might become a good idea, it's just not something 00:41:27.840 |
that anybody has done very successfully with so far. 00:41:32.560 |
I will mention something important about activation functions though, which is that you can use 00:41:39.360 |
pretty much any nonlinear function as an activation function and get pretty reasonable results. 00:41:45.360 |
There are actually some pretty cool papers that people have written where they've tried 00:41:48.680 |
all kinds of weird activation functions and they pretty much all work. 00:41:55.280 |
It's more just certain activation functions will train more quickly and more resiliently. 00:42:02.360 |
In particular, ReLU and ReLU variations tend to work particularly well. 00:42:15.320 |
So we're going to use a very similar approach to what we used before. 00:42:20.840 |
And we're going to create our first RNN, and we're going to create it from scratch using the same approach. 00:42:29.920 |
In this case, we're not going to create c1, c2 and c3 separately; we're going to have to create an array of inputs. 00:42:39.840 |
We're going to have to decide what N we're going to use, and so for this one I've decided 00:42:44.480 |
to use 8, so CS is characters, so I'm going to use 8 characters to predict the 9th character. 00:42:52.880 |
So I'm going to create an array with 8 elements in it, and each element will contain a list 00:43:00.280 |
of the 0, 8, 16, 24th character, the 1, 9, 17, etc. character, the 2, 10, 18, etc. character, and so forth for all 8 positions. 00:43:12.400 |
So we're going to have a sequence of inputs where each one is offset by 1 from the previous 00:43:20.480 |
one, and then our output will be exactly the same thing, except we're going to look at the next character along. 00:43:31.800 |
So this will be the 8th thing in each sequence and we're going to predict it with the previous 8 characters. 00:43:39.640 |
So now we can go through every one of those input data items, lists, and turn them into 00:43:44.960 |
a NumPy array, and so here you can see that we have 8 inputs, and each one is the same length. 00:43:59.560 |
Do the same thing for our y, get a NumPy array out of it, and here we can visualize it. 00:44:09.200 |
So here are the first 8 elements of x, so in looking at the first 8 elements of x, let's 00:44:19.280 |
look at the very first element of each one, 40, 42, 29. 00:44:23.460 |
So this column is the first 8 characters of our text, and here is the 9th character. 00:44:31.560 |
So the first thing that the model will try to do is to look at these 8 to predict this, 00:44:36.480 |
and then look at these 8 to predict this, and look at these 8 and predict this and so forth. 00:44:41.400 |
And indeed you can see that this list here is exactly the same as this list here. 00:44:48.360 |
The character we're predicting for each sequence is the same as the first character of the next sequence. 00:44:53.040 |
So it's almost exactly the same as our previous data, we've just done it in a more flexible way. 00:44:59.040 |
We'll create 42 latent factors as before, and we'll use exactly the same embedding input function as before. 00:45:06.960 |
And again, we're just going to have to use lists to store everything. 00:45:11.960 |
So in this case, all of our embeddings are going to be in a list, so we'll go through 00:45:15.400 |
each of our characters and create an embedding input and output for each one, store it here. 00:45:25.400 |
And here we're going to define them all at once, our green arrow, orange arrow, and blue arrow. 00:45:31.200 |
So here we're basically saying we've got 3 different weight matrices that we want Keras to create for us. 00:45:39.960 |
So the very first hidden state here is going to take the list of all of our inputs, the 00:45:49.200 |
first one of those, and then that's a tuple of two things. 00:45:52.960 |
The first is the input to it, and the second is the output of the embedding. 00:45:56.320 |
So we're going to take the output of the embedding for the very first character, pass that into 00:46:01.840 |
our green arrow, and that's going to give us our initial hidden state. 00:46:08.240 |
And then this looks exactly the same as we saw before, but rather than doing it listing 00:46:14.160 |
separately, we're just going to loop through all of our remaining 1 through 8 characters 00:46:19.400 |
and go ahead and create the green arrow, orange arrow, and add the two together. 00:46:27.900 |
So finally we can take that final hidden state and put it through our blue arrow to create our output. 00:46:35.280 |
So we can then tell Keras that our model is all of the embedding inputs for that list 00:46:42.040 |
we created together, that's our inputs, and then the output that we just created is the output. 00:46:51.000 |
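A sketch of that loop-built model in current tensorflow.keras syntax (the shared layers give the shared green, orange and blue weight matrices; hyperparameters and activations are illustrative):

```python
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense, add
from tensorflow.keras.models import Model

cs, vocab_size, n_fac, n_hidden = 8, 86, 42, 256

# One (input, flattened-embedding) pair per character position, kept in lists.
c_ins = [Input(shape=(1,), dtype='int64', name=f'c{i}') for i in range(cs)]
c_embs = [Flatten()(Embedding(vocab_size, n_fac)(inp)) for inp in c_ins]

dense_in = Dense(n_hidden, activation='relu')         # green arrow
dense_hidden = Dense(n_hidden, activation='relu')     # orange arrow
dense_out = Dense(vocab_size, activation='softmax')   # blue arrow

# First hidden state comes straight from the first character...
hidden = dense_in(c_embs[0])
# ...then each later character is merged with the carried-forward hidden state.
for i in range(1, cs):
    hidden = add([dense_in(c_embs[i]), dense_hidden(hidden)])

model = Model(c_ins, dense_out(hidden))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# model.fit(xs, y, ...)   # xs is the list of 8 input arrays built earlier
```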
So we would expect this to be more accurate because it's now got 8 pieces of context instead of 3. 00:46:59.080 |
So previously we were getting a higher loss; this time we get down to 1.8. 00:47:11.200 |
So it's still not great, but it's an improvement and we can create exactly the same kind of 00:47:15.160 |
tests as before, so now we can pass in 8 characters and get a prediction of the ninth. 00:47:24.160 |
So that is our first RNN that we've now built from scratch. 00:47:30.440 |
This kind of RNN where we're taking a list and predicting a single thing is most likely 00:47:38.160 |
to be useful for things like sentiment analysis. 00:47:44.160 |
Remember our sentiment analysis example using IMDB? 00:47:47.880 |
So in this case we were taking a sequence, being a list of words in a sentence, and predicting 00:47:52.840 |
whether or not something is positive sentiment or negative sentiment. 00:47:56.400 |
So that would seem like an appropriate kind of use case for this style of RNN. 00:48:06.300 |
So at that moment my computer crashed and we lost a little bit of the class's video. 00:48:13.000 |
So I'm just going to fill in the bit that we missed here. 00:48:23.440 |
So I wanted to show you something kind of interesting, which you may have noticed, which 00:48:27.620 |
is when we created our hidden dense layer, that is our orange arrow, I did not initialize 00:48:40.160 |
it in the default way, which is the Glorot initialization, but instead I said init='identity'. 00:48:48.600 |
You may also have noticed that the equivalent thing was shown in our Keras RNN. 00:48:57.320 |
This here where it says inner_init='identity' was referring to the same thing. 00:49:02.360 |
It's referring to what is the initialization that is used for this orange arrow, how are those hidden-to-hidden weights initialized. 00:49:12.120 |
So rather than initializing them randomly, we're going to initialize them with an identity matrix. 00:49:19.040 |
An identity matrix, you may recall from your linear algebra at school, is a matrix which 00:49:24.740 |
is all zeros, except it is just ones down the diagonal. 00:49:30.320 |
So if you multiply any matrix by the identity matrix, it doesn't change the original matrix at all. 00:49:37.800 |
You get back exactly what you started with. 00:49:40.440 |
So in other words, we're going to start off by initializing our orange arrow, not with 00:49:47.660 |
a random matrix, but with a matrix that causes the hidden state to not change at all. 00:50:00.040 |
It seems reasonable to say "well, in the absence of other knowledge to the contrary, why don't 00:50:04.520 |
we start off by having the hidden state stay the same until the SGD has a chance to update it?" 00:50:12.320 |
But it actually turns out that it also makes sense based on an empirical analysis. 00:50:18.360 |
So since we always only do things that Jeffrey Hinton tells us to do, that's good news because 00:50:23.680 |
this is a paper by Jeff Hinton in which he points out this rather neat trick which is 00:50:29.960 |
if you initialize an RNN with the hidden weight matrix initialized to an identity matrix and 00:50:39.960 |
use rectified linear units as we are here, you actually get an architecture which can 00:50:51.260 |
get fantastic results on some reasonably significant problems, including speech recognition, among others. 00:51:00.840 |
I don't see this paper referred to or discussed very often, even though it is well over a year old. 00:51:09.240 |
So I'm not sure if people forgot about it or haven't noticed it or what, but this is 00:51:13.200 |
actually a good trick to remember is that you can often get quite a long way doing nothing 00:51:19.560 |
but an identity matrix initialization and rectified linear units, just as we have here. 00:51:29.780 |
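For reference, a sketch of how that initialization is expressed (in current Keras the argument is recurrent_initializer, whereas the Keras 1 code in the lesson spelled it inner_init='identity'; sizes are illustrative):

```python
from tensorflow.keras.layers import Dense, SimpleRNN
from tensorflow.keras.initializers import Identity

n_hidden = 256

# The orange (hidden-to-hidden) matrix starts as the identity, so at first the
# hidden state is passed through unchanged, per the Hinton et al. IRNN trick.
dense_hidden = Dense(n_hidden, activation='relu', kernel_initializer=Identity())

# The equivalent for Keras's built-in RNN layer is the recurrent initializer.
rnn = SimpleRNN(n_hidden, activation='relu', recurrent_initializer=Identity())
```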
Okay, so that's a nice little trick to remember. 00:51:36.440 |
And so the next thing we're going to do is to make a couple of minor changes to this model. 00:51:47.200 |
So the first change we're going to make is we're going to take this rectangle here, so 00:51:51.160 |
this rectangle is referring to what is it that we repeat and so since in this case we're 00:51:57.320 |
predicting character n from characters 1 through n minus 1, then this whole area here we're 00:52:06.040 |
looping from 2 to n minus 1 before we generate our output once again. 00:52:13.480 |
So what we're going to do is we're going to take this triangle and we're going to put 00:52:17.540 |
it inside the loop, put it inside the rectangle. 00:52:21.700 |
And so what that means is that every time we loop through this, we're going to generate an output. 00:52:27.440 |
So rather than generating one output at the end, this is going to predict characters 2 00:52:32.120 |
through n using characters 1 through n minus 1. 00:52:37.980 |
So it's going to predict character 2 using character 1 and character 3 using characters 00:52:44.200 |
1 and 2 and character 4 using characters 1, 2 and 3 and so forth. 00:52:51.840 |
It's nearly exactly the same as the previous model, except after every single step after 00:52:58.760 |
creating the hidden state on every step, we're going to create an output every time. 00:53:03.040 |
So this is not going to create a single output like this does, which predicted a single character, 00:53:09.120 |
the last character, in fact, the next after the last character of the sequence, character n. 00:53:18.880 |
This is going to predict a whole sequence of characters 2 through n using characters 1 through n minus 1. 00:53:25.480 |
OK, so that was all the stuff that we'd lost when we had our computer crash. 00:53:34.400 |
Let's now talk about how we would implement this sequence, where we're going to predict 00:53:38.400 |
characters 2 through n using characters 1 through n minus 1, and why that's a good idea. 00:53:47.120 |
There's a few reasons, but one obvious reason why this would be a good idea is that if we're 00:53:51.800 |
only predicting one output for every n inputs, then the number of times that our model has 00:53:59.520 |
the opportunity to back-propagate gradients and improve those weights is just once per sequence. 00:54:08.880 |
Whereas if we predict characters 2 through n using characters 1 through n minus 1, we're 00:54:15.280 |
actually getting a whole lot of feedback about how our model is going. 00:54:18.840 |
So we can back-propagate n times, or actually n minus 1 times every time we do another sequence. 00:54:27.600 |
So there's a lot more learning going on for nearly the same amount of computation. 00:54:35.240 |
The other reason this is handy is that as you'll see in a moment, it's very helpful 00:54:38.960 |
for creating RNNs which can handle truly long-term dependencies or context, as one of the people here asked about earlier. 00:54:49.920 |
So we're going to start here before we look at how to do context. 00:54:53.800 |
And so really anytime you're doing a kind of sequence-to-sequence exercise, you probably 00:55:00.000 |
want to construct something of this format, where your triangle is inside the square rather than outside it. 00:55:10.240 |
It's going to look very similar, and so I'm calling this returning sequences, rather than 00:55:15.840 |
returning a single character, we're going to return a sequence. 00:55:22.400 |
Our character_in data is identical to before, so I've just commented it out. 00:55:28.600 |
And now our character_out output isn't just a single character, but it's actually a list of characters. 00:55:36.480 |
In fact, it's exactly the same as the input, except that I have removed the -1, so it's just shifted over by one. 00:55:47.320 |
In each sequence, the first character will be used to predict the second, the first and 00:55:51.720 |
second will predict the third, the first, second and third will predict the fourth and so forth. 00:55:58.000 |
So we've got a lot more predictions going on, and therefore a lot more opportunity for the model to learn. 00:56:04.380 |
So then we will create our y's just as before with our x's. 00:56:12.400 |
And so now our y dataset looks exactly like our x dataset did, but everything's just shifted across by one. 00:56:21.360 |
And the model's going to look almost identical as well. 00:56:24.160 |
We've got our three dense layers as before, but we're going to do one other thing differently this time. 00:56:32.160 |
Rather than treating the first character as special, I won't treat it as special. 00:56:37.400 |
I'm going to move the first character inside the loop, so rather than repeating from 2 to n-1, I'm going to repeat from 1. 00:56:47.520 |
So the only thing I have to be careful of is that we have to somehow initialize our hidden state. 00:56:53.440 |
So we're going to initialize our hidden state to a vector of zeros. 00:56:59.200 |
So here we do that, we say we're going to have something to initialize our hidden state, 00:57:04.080 |
which we're going to feed it with a vector of zeros shortly. 00:57:06.680 |
So our initial hidden state is just going to be the result of that. 00:57:11.180 |
And then our loop is identical to before, but at the end of every loop, we're going to create an output. 00:57:20.080 |
So we're now going to have 8 outputs for every sequence rather than 1. 00:57:27.400 |
So there are two differences: the first is it's got an array of outputs, and the second is that we have to add the 00:57:33.480 |
thing that we're going to use to store our vector of zeros somewhere, so we're going to add that as an extra input. 00:57:41.040 |
The box refers to the area that we're looping through. 00:57:51.880 |
So initially we repeated the character n input coming into here, and then the hidden state being updated. 00:58:02.040 |
So the box is the thing which I'm looping through all those times. 00:58:07.000 |
This time I'm looping through this whole thing. 00:58:09.360 |
So a character input coming in, generating the hidden state, and creating an output, repeating that each time. 00:58:18.500 |
And so now you can see creating the output is inside the loop rather than outside the 00:58:25.960 |
So therefore we end up with an array of outputs. 00:58:30.920 |
So our model's nearly exactly the same as before, it's just got these two changes. 00:58:34.680 |
So now when we fit our model, we're going to add an array of zeros to the start of our inputs. 00:58:46.320 |
Our outputs are going to be those lists of 8 that have been offset by 1, and we can go ahead and fit it. 00:58:55.160 |
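A sketch of that returning-sequences version built from scratch, with the vector of zeros fed in as an extra input (a sketch under those assumptions, not the lesson's exact notebook code):

```python
import numpy as np
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense, add
from tensorflow.keras.models import Model

cs, vocab_size, n_fac, n_hidden = 8, 86, 42, 256

c_ins = [Input(shape=(1,), dtype='int64', name=f'c{i}') for i in range(cs)]
c_embs = [Flatten()(Embedding(vocab_size, n_fac)(inp)) for inp in c_ins]

dense_in = Dense(n_hidden, activation='relu')         # green arrow
dense_hidden = Dense(n_hidden, activation='relu')     # orange arrow
dense_out = Dense(vocab_size, activation='softmax')   # blue arrow

# The hidden state is initialized from an explicit vector-of-zeros input, so
# the first character is handled inside the loop like every other character.
zeros_in = Input(shape=(n_fac,), name='zeros')
hidden = dense_in(zeros_in)

outs = []
for i in range(cs):
    hidden = add([dense_in(c_embs[i]), dense_hidden(hidden)])
    outs.append(dense_out(hidden))    # one output per time step

model = Model([zeros_in] + c_ins, outs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# zeros = np.zeros((len(xs[0]), n_fac))
# model.fit([zeros] + xs, ys, ...)   # ys is the list of 8 shifted label arrays
```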
And you can see that as we train it, now we don't just have one loss, we have 8 losses. 00:59:04.400 |
And that's because every one of those 8 outputs has its own loss. 00:59:08.000 |
How are we going at predicting character 1 in each sequence? 00:59:11.880 |
And as you would expect, our ability to predict the first character using nothing but a vector of zeros is not very good. 00:59:23.880 |
Whereas our ability to predict the 8th character, it has a lot more context. 00:59:31.840 |
And so you can see that the 8th character's loss keeps on improving. 00:59:37.200 |
And indeed, by a few epochs, we have a significantly better loss than we did before. 00:59:46.800 |
And so you can see a sequence model when we test it. 00:59:50.760 |
We pass in a sequence like this, space this is, and after every character, it returns its prediction of the next. 00:59:59.200 |
So after seeing a space, it guesses the next will be a t. 01:00:02.600 |
After seeing a space t, it guesses the next will be an h. 01:00:05.840 |
After seeing a space th, it guesses the next will be an e and so forth. 01:00:12.080 |
And so you can see that it's predicting some pretty reasonable things here, and indeed often the right thing. 01:00:21.080 |
So after seeing space p-a-r-t, it expects that will be the end of the word, and indeed it is. 01:00:27.040 |
So after seeing part, it's guessing that the next word is going to be of, and indeed it is. 01:00:32.560 |
So it's able to use sequences of 8 to create a context, which isn't brilliant, but it's a lot better than nothing. 01:00:47.600 |
With Keras, it's identical to our previous model, except that we have to use the different 01:00:55.560 |
input and output arrays, just like I just showed you, so the whole sequence of labels rather than just the final one. 01:01:05.720 |
And then the second thing we have to do is add one parameter, which is return_sequences=True. 01:01:11.920 |
return_sequences=True simply says rather than putting the triangle outside the loop, put it inside the loop. 01:01:19.960 |
So it returns an output every time you go to another time step, rather than just returning one at the end. 01:01:30.800 |
I add this return_sequences=True, I don't have to change my data at all other 01:01:37.360 |
than some very minor dimensionality changes, and then I can just go ahead and fit it. 01:01:45.720 |
As you can see, I get a pretty similar loss function to what I did before, and I can build 01:01:53.100 |
something that looks very much like we had before and generate some pretty similar results. 01:01:58.880 |
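A rough sketch of that Keras version with return_sequences=True, in current tensorflow.keras syntax (hyperparameters are illustrative; TimeDistributed is shown explicitly, matching the discussion later in the lesson):

```python
from tensorflow.keras.layers import Input, Embedding, SimpleRNN, TimeDistributed, Dense
from tensorflow.keras.models import Model

cs, vocab_size, n_fac, n_hidden = 8, 86, 42, 256

inp = Input(shape=(cs,), dtype='int64')
emb = Embedding(vocab_size, n_fac)(inp)                     # (batch, 8, 42)
rnn = SimpleRNN(n_hidden, activation='relu',
                return_sequences=True)(emb)                 # (batch, 8, 256)
out = TimeDistributed(Dense(vocab_size, activation='softmax'))(rnn)  # (batch, 8, 86)

model = Model(inp, out)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# model.fit(x, y, ...)   # y is the (n, 8) array of shifted character ids
```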
So that's how we create a sequence model with Keras. 01:02:07.600 |
So then there's the question of how do you create more state, how do you generate a model which is stateful. 01:02:19.620 |
To generate a model that understands long-term dependencies, we can't any more present our sequences in a random order. 01:02:28.780 |
So far, we've always been using the default, which is shuffle=True. 01:02:36.960 |
So it's passing across these sequences of 8 in a random order. 01:02:42.800 |
If we're going to do something which understands long-term dependencies, the first thing we 01:02:46.160 |
are going to have to do is we're going to have to use shuffle=False. 01:02:50.920 |
The second thing we're going to have to do is we're going to have to stop passing in 01:02:56.000 |
an array of zeros as my starting point every time around. 01:03:01.820 |
So effectively what I want to do is I want to pass in my array of zeros right at the 01:03:12.000 |
very start when I first start training, but then at the end of my sequence of 8, rather 01:03:18.000 |
than going back to initialize to zeros, I actually want to keep this hidden state. 01:03:24.680 |
So then I'd start my next sequence of 8 with this hidden state exactly where it was before, 01:03:29.960 |
and that's going to allow it to basically build up arbitrarily long dependencies. 01:03:37.360 |
So in Keras, that's actually as simple as adding one additional parameter, and that additional 01:03:53.360 |
parameter is stateful=True. And so when you say stateful=True, what that tells Keras is at the end of each sequence, 01:04:01.280 |
don't reset the hidden activations to zero, but leave them as they are. 01:04:07.320 |
And that means that we have to make sure we pass shuffle=False when we train it, so it's 01:04:12.960 |
now going to pass the first 8 characters of the book and then the second 8 characters 01:04:16.560 |
of the book and then the third 8 characters of the book, leaving the hidden state untouched 01:04:22.640 |
between each one, and therefore it's allowing it to continue to build up as much state as it needs. 01:04:30.720 |
Training these stateful models is a lot harder than training the models we've seen so far. 01:04:39.920 |
In these stateful models, this orange arrow, this single weight matrix, it's being applied 01:04:47.920 |
to this hidden matrix not 8 times, but 100,000 times or more, depending on how big your text is. 01:04:57.400 |
And just imagine if this weight matrix was even slightly poorly scaled, so if there was 01:05:03.120 |
like one number in it which was just a bit too high, then effectively that number is 01:05:08.920 |
going to be to the power of 100,000, it's being multiplied again and again and again. 01:05:15.560 |
So what can happen is you get this problem they call exploding gradients, or really in 01:05:20.900 |
some ways it's better described as exploding activations. 01:05:24.560 |
Because we're multiplying this by this almost the same weight matrix each time, if that 01:05:30.520 |
weight matrix is anything less than perfectly scaled, then it's going to make our hidden activations explode. 01:05:40.360 |
And so we have to be very careful of how to train these, and indeed these kinds of long-term 01:05:45.920 |
dependency models were thought of as impossible to train for a while, until some folks in 01:05:54.640 |
the mid-90s came up with a model called the LSTM, or Long Short-Term Memory. 01:06:03.520 |
And in the Long Short-Term Memory, and we'll learn more about it next week, and we're actually 01:06:07.040 |
going to implement it ourselves from scratch, we replace this loop here with a loop where 01:06:12.960 |
there is actually a neural network inside the loop that decides how much of this state 01:06:20.240 |
matrix to keep and how much to use at each activation. 01:06:25.480 |
And so by having a neural network which actually controls how much state is kept and how much 01:06:31.320 |
is used, it can actually learn how to avoid those gradient explosions, and it can actually be trained successfully. 01:06:44.560 |
So we're going to look at that a lot more next week, but for now I will tell you that 01:06:50.000 |
when I tried to run this using a simple RNN, even with an identity matrix initialization 01:06:57.120 |
and ReLUs, I had no luck at all. So I had to replace it with an LSTM. Even that wasn't 01:07:03.880 |
enough, I had to have well-scaled inputs, so I added a batch normalization layer after the embedding. 01:07:11.320 |
And after I did those things, then I could fit it. It still ran pretty slowly, so before 01:07:20.600 |
I was getting 4 seconds per epoch, now it's 13 seconds per epoch, and the reason here 01:07:25.200 |
is it's much harder to parallelize this. It has to do each sequence in order, so it's 01:07:31.560 |
going to be slower. But over time, it does eventually get substantially better loss than 01:07:40.200 |
I had before, and that's because it's able to keep track of and use this state. 01:07:44.800 |
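A sketch of that stateful setup in current tensorflow.keras syntax, assuming a fixed batch size and the batch normalization after the embedding mentioned above (all hyperparameters are illustrative):

```python
from tensorflow.keras.layers import (Input, Embedding, BatchNormalization,
                                     LSTM, TimeDistributed, Dense)
from tensorflow.keras.models import Model

bs, cs, vocab_size, n_fac, n_hidden = 64, 8, 86, 42, 256

# Stateful layers need a fixed batch size, so it is declared on the Input.
inp = Input(shape=(cs,), batch_size=bs, dtype='int64')
x = Embedding(vocab_size, n_fac)(inp)
x = BatchNormalization()(x)        # keep the LSTM's inputs well scaled
x = LSTM(n_hidden, return_sequences=True, stateful=True)(x)
out = TimeDistributed(Dense(vocab_size, activation='softmax'))(x)

model = Model(inp, out)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# The data must arrive in order so the carried-over state lines up:
# model.fit(x_train, y_train, batch_size=bs, shuffle=False, epochs=...)
```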
Question- Would something like batch normalization help here? Answer- That's a good question. Definitely maybe. There's been a lot of discussion and papers 01:08:01.620 |
about this recently. There's something called layer normalization, which is a method which 01:08:06.920 |
is explicitly designed to work well with RNNs. Standard batch norm doesn't. It turns out 01:08:16.000 |
it's actually very easy to do layer normalization with Keras using a couple of simple parameters 01:08:21.880 |
you can provide for the normal batch norm constructor. In my experiments, that hasn't 01:08:27.960 |
worked so well, and I will show you a lot more about that in just a few minutes. 01:08:38.200 |
Stateful models are great. We're going to look at some very successful stateful models 01:08:43.120 |
in just a moment, but just be aware that they are more challenging to train. You'll see 01:08:48.120 |
another thing I had to do here is I had to reduce the learning rate in the middle, again 01:08:53.840 |
because you just have to be so careful of these exploding gradient problems. 01:09:04.840 |
Let me show you what I did with this, which is I tried to create a stateful model which 01:09:12.720 |
worked as well as I could. I took the same Nietzsche data as before, and I tried splitting 01:09:19.920 |
it into chunks of 40 rather than 8, so each one could do more work. Here are some example chunks. 01:09:33.120 |
I built a model that was slightly more sophisticated than the previous one in two ways. The first 01:09:37.800 |
is it has an RNN feeding into an RNN. That's kind of a crazy idea, so I've drawn a picture. 01:09:49.480 |
An RNN feeding into an RNN means that the output is no longer going to an output, it's 01:09:57.080 |
actually the output of the first RNN is becoming the input to the second RNN. So the character 01:10:05.680 |
input goes into our first RNN and has the state updates as per usual, and then each 01:10:11.200 |
time we go through the sequence, it feeds the result to the state of the second RNN. 01:10:19.040 |
Why is this useful? Well, because it means that this output is now coming from not just 01:10:25.640 |
a single dense matrix and then a single dense matrix here, it's actually going through one, 01:10:36.960 |
two, three dense matrices and activation functions. 01:10:41.760 |
So I now have a deep neural network, assuming that two layers get to count as deep, between 01:10:48.180 |
my first character and my first output. And then indeed, between every hidden state and 01:10:54.520 |
every output, I now have multiple hidden layers. So effectively, what this is allowing us to 01:11:00.080 |
do is to create a little deep neural net for all of our activations. That turns out to 01:11:09.200 |
work really well because the structure of language is pretty complex and so it's nice 01:11:15.560 |
to be able to give it a more flexible function that it can learn. 01:11:22.120 |
That's the first thing I do. It's this easy to create that. You just copy and paste whatever 01:11:28.380 |
your RNN line is twice. You can see I've now added dropout inside my RNN. And as I talked 01:11:38.160 |
about before, adding dropout inside your RNN turns out to be a really good idea. There's 01:11:44.200 |
a really great paper about that quite recently showing that this is a great way to regularize recurrent networks. 01:11:52.840 |
And then the second change I made is rather than going straight from the RNN to our output, 01:11:59.600 |
I went through a dense layer. Now there's something that you might have noticed here 01:12:05.920 |
is that our dense layers have this extra word at the front. Why do they have this extra 01:12:11.800 |
word at the front? Time distributed. It might be easier to understand why by looking at the shape of the RNN's output. 01:12:23.160 |
And note that the output of our RNN is not just a vector of length 256, but 8 vectors 01:12:32.200 |
of length 256 because it's actually predicting 8 outputs. So we can't just have a normal 01:12:38.040 |
dense layer because a normal dense layer needs a single dimension that it can squish down. 01:12:48.040 |
So in this case, what we actually want to do is create 8 separate dense layers at the 01:12:53.260 |
output, one for every one of the outputs. And so what time distributed does is it says whatever 01:12:59.680 |
the layer is in the middle, I want you to create 8 copies of them, or however long this 01:13:07.160 |
dimension is. And every one of those copies is going to share the same weight matrix, so it doesn't add any extra parameters. 01:13:13.940 |
So the short version here is in Keras, anytime you say return_sequences=true, any dense layers 01:13:21.880 |
you have after that will always have to have time distributed wrapped around them because 01:13:27.120 |
we want to create not just one dense layer, but 8 dense layers. So in this case, since 01:13:35.280 |
we're saying return_sequences=true, we then have a time distributed dense layer, some 01:13:41.120 |
dropout, and another time distributed dense layer. 01:13:44.280 |
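Putting those pieces together, the architecture being described looks roughly like this in Keras (a sketch, not the exact notebook code; the dropout argument names differ between Keras versions, e.g. dropout_W/dropout_U in Keras 1 versus dropout/recurrent_dropout in Keras 2):

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout, TimeDistributed

vocab_size = 86   # distinct characters
seq_len = 40      # the longer chunks mentioned above

model = Sequential([
    Embedding(vocab_size, 42, input_length=seq_len),
    # two stacked LSTMs: the first returns its whole sequence of outputs,
    # which becomes the input sequence of the second
    LSTM(512, return_sequences=True, dropout=0.2, recurrent_dropout=0.2),
    LSTM(512, return_sequences=True, dropout=0.2, recurrent_dropout=0.2),
    # because return_sequences=True, every dense layer must be wrapped in TimeDistributed
    TimeDistributed(Dense(512, activation='relu')),
    Dropout(0.2),
    TimeDistributed(Dense(vocab_size, activation='softmax')),
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
```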
I have a few questions. Does the first RNN complete before it passes to the second, or do they run together step by step? 01:13:53.800 |
No, it's operating exactly like this. So my initialization starts, my first character 01:14:02.880 |
comes in, and at the output of that comes two things, the hidden state for my next hidden 01:14:09.020 |
state and the output that goes into my second LSTM. The best way to think of this is to 01:14:21.240 |
draw it in the unrolled form, and then you'll realize there's nothing magical about this 01:14:28.080 |
at all. In an unrolled form, it just looks like a pretty standard deep neural net. 01:14:38.240 |
We'll talk about that more next week. In an LSTM, I mentioned that there's kind of like 01:14:44.040 |
little neural nets that control how the state updates work, and so this is talking about 01:14:48.640 |
how the dropout works inside these little neural nets. 01:14:53.280 |
And when stateful is false, can you explain again what is reset after each training example? 01:15:01.800 |
The best way to describe that is to show us doing it. Remember that the RNNs that we built 01:15:12.720 |
are identical to what Keras does, or close enough to identical. Let's go and have a look 01:15:24.480 |
You can see that what we did was we created a matrix of zeros that we stuck onto the front 01:15:33.360 |
of our inputs. Every set of 8 characters now starts with a vector of zeros. In other words, 01:15:43.800 |
this initialize to zeros happens every time we finish a sequence. In other words, this 01:15:51.360 |
hidden state gets initialized to 0 at the end of every sequence. It's this hidden state 01:15:58.280 |
which is where all of these dependencies and state is kept. So doing that is resetting 01:16:05.920 |
the state every time we look at a new sequence. 01:16:09.760 |
So when we say stateful = true, it only does this initialize to 0 once at the very start, 01:16:19.920 |
or when we explicitly ask it to. So when I actually run this model, the way I do it is 01:16:28.120 |
I wrote a little thing called run_epochs that calls model.reset_states() and then does a fit 01:16:34.800 |
on one epoch, which is what you really want: at the end of the entire works of Nietzsche 01:16:41.320 |
you want to reset the state, because you're about to go back to the very start and start again. 01:16:55.760 |
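A minimal sketch of that helper (the name run_epochs and its arguments are assumptions; nb_epoch is the Keras 1 spelling, epochs in Keras 2):

```python
def run_epochs(model, n, x, y, batch_size=64):
    for i in range(n):
        # we are about to start again from the very beginning of the text,
        # so zero out the carried-over hidden state first
        model.reset_states()
        model.fit(x, y, batch_size=batch_size, nb_epoch=1, shuffle=False)
```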
So with this multilayer LSTM going into a multilayer neural net, I then tried seeing 01:17:03.640 |
how that goes. And remember that with our simpler versions, we were getting about 1.6 loss, which 01:17:10.600 |
was the best we could do. After one epoch, it's awful. And now rather than just printing 01:17:19.840 |
out one letter, I'm starting with a whole sequence of letters, which is that, and asking 01:17:26.280 |
it to generate a sequence. You can see it starts out by generating a pretty rubbishy sequence of characters. 01:18:31.160 |
One more question. In the double LSTM layer model, what is the input to the second LSTM layer? 01:18:45.960 |
In addition to the output of the first LSTM, it gets the previous value of its own hidden state. 01:17:55.440 |
Okay, so after a few more epochs, it's starting to create some actual proper English words, 01:18:08.480 |
although the English words aren't necessarily making a lot of sense. So I keep running epochs. 01:18:15.120 |
At this point, it's learned how to start chapters. This is actually how in this book the chapters 01:18:20.680 |
always start with a number and then an equal sign. It hasn't learned how to close quotes 01:18:26.040 |
apparently, it's not really saying anything useful. 01:18:29.920 |
So anyway, I kind of ran this overnight, and I then seeded it with a large amount of data, 01:18:39.560 |
so I seeded it with all this data, and I started getting some pretty reasonable results. Shreds 01:18:46.360 |
into one's own suffering sounds exactly like the kind of thing that you might see. Religions 01:18:52.000 |
have acts done by man. It's not all perfect, but it's not bad. 01:18:57.520 |
Interestingly, this sequence here, when I looked it up, it actually appears in his book. 01:19:05.120 |
This makes sense, right? It's kind of overfitting in a sense. He loves talking in all caps, 01:19:14.440 |
but he only does it from time to time. So once it so happened to start writing something 01:19:21.040 |
in all caps that looked like this phrase that only appeared once and is very unique, there 01:19:27.200 |
was kind of no other way that it could have finished it. So sometimes you get these little 01:19:32.880 |
rare phrases that it has basically plagiarized directly from the original text. 01:19:39.080 |
Now I didn't stop there because I thought, how can we improve this? And it was at this 01:19:43.440 |
point that I started thinking about batch normalization. And I started fiddling around 01:19:47.840 |
with a lot of different types of batch normalization and layer normalization and discovered this 01:19:54.400 |
interesting insight, which is that at least in this case, the very best approach was when 01:20:03.320 |
I simply applied batch normalization to the embedding layer. 01:20:14.840 |
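In Keras terms, that just means a BatchNormalization layer immediately after the Embedding, something like this (sizes illustrative):

```python
from keras.models import Sequential
from keras.layers import Embedding, BatchNormalization, LSTM, TimeDistributed, Dense

vocab_size = 86

model = Sequential([
    Embedding(vocab_size, 42, input_length=40),
    BatchNormalization(),   # normalize the embedding activations before the RNN sees them
    LSTM(512, return_sequences=True),
    TimeDistributed(Dense(vocab_size, activation='softmax')),
])
```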
When I applied batch normalization to the embedding layer, this is the training curve 01:20:19.800 |
that I got. So over epochs, this is my loss. With no batch normalization on the embedding 01:20:25.940 |
layer, this was my loss. And so you can see this was actually starting to flatten out. 01:20:31.840 |
This one really wasn't, and this one was training a lot quicker. So then I tried training it 01:20:36.960 |
with batch norm on the embedding layer overnight, and I was pretty stunned by the results. This 01:20:45.560 |
was my seeding text, and after 1000 epochs, this is what it came up with. And it's got 01:20:59.160 |
all kinds of pretty interesting little things. Perhaps some morality equals self-glorification. 01:21:06.680 |
This is really cool. For there are holy eyes to Schopenhauer's blind. This is interesting. 01:21:15.400 |
In reality, we must step above it. You can see that it's learnt to close quotes even 01:21:19.580 |
when those quotes were opened a long time ago. So if we weren't using stateful, it would 01:21:23.760 |
never have learnt how to do this. I've looked up these words in the original text and pretty 01:21:32.320 |
much none of these phrases appear. This is a genuinely novel piece 01:21:39.400 |
of text. It's not perfect by any means, but considering that this is only doing it character 01:21:48.240 |
by character, using nothing but a 42-long embedding vector for each character, with no 01:21:56.040 |
pre-trained vectors, and just a fairly short corpus of about 600,000 characters per epoch, 01:22:03.760 |
I think it's done a pretty amazing job of creating a pretty good model. 01:22:09.680 |
And so there's all kinds of things you could do with a model like this. The most obvious 01:22:13.320 |
one would be if you were producing a software keyboard for a mobile phone, for example. You 01:22:20.600 |
could use this to have a pretty accurate guess as to what they were going to type next and 01:22:24.600 |
correct it for them. You could do something similar on a word basis. But more generally, 01:22:31.800 |
you could do something like anomaly detection with this. You could generate a sequence that 01:22:38.000 |
is predicting what the rest of the sequence is going to look like for the next hour and 01:22:42.760 |
then recognize if something falls outside of what your prediction was and you know there's 01:22:47.760 |
been some kind of anomaly. There's all kinds of things you can do with these kinds of models. 01:22:54.660 |
I think that's pretty fun, but I want to show you something else which is pretty fun, which 01:23:01.720 |
is to build an RNN from scratch in Theano. And what we're going to do is we're going 01:23:10.680 |
to try and work up to next week where we're going to build an RNN from scratch in NumPy. 01:23:18.920 |
And we're also going to build an LSTM from scratch in Theano. And the reason we're doing 01:23:26.900 |
this is because next week's our last class in this part of the course, I want us to leave 01:23:34.680 |
with kind of feeling like we really understand the details of what's going on behind the 01:23:39.560 |
scenes. The main thing I wanted to teach in this class is the applied stuff, these kind 01:23:44.960 |
of practical tips about how you build a sequence model. Use return_sequences=True, put batch norm 01:23:51.160 |
after the embedding layer, add TimeDistributed to the dense layer. But I also know that to 01:23:59.760 |
really debug your models and to build your architectures and stuff, it really helps to 01:24:05.440 |
understand what's going on. Particularly in the current situation where the tools and 01:24:12.120 |
libraries available are not that mature, they still require a whole lot of manual stuff. 01:24:19.680 |
So I do want to try and explain a bit more about what's going on behind the scenes. 01:24:27.800 |
In order to build an RNN in Theano, first of all make a small change to our Keras model, 01:24:36.560 |
which is that I'm going to use One Hot encoding. I don't know if you noticed this, but we did 01:24:41.240 |
something pretty cool in all of our models so far, which is that we never actually One Hot encoded our outputs. 01:24:51.840 |
Question 3. Will time distributed dense take longer to train than dense? And is it really 01:24:57.920 |
that important to use time distributed dense? 01:25:01.240 |
So if you don't add time distributed dense to a model where return sequence equals true, 01:25:06.840 |
it literally won't work. It won't compile. Because you're trying to predict eight things 01:25:13.080 |
and the dense layer is going to stick that all into one thing. So it's going to say there's 01:25:16.360 |
a mismatch in your dimensions. But no, it doesn't really add much time because that's 01:25:26.240 |
something that can be very easily parallelized. And since a lot of things in RNNs can't be 01:25:32.040 |
easily parallelized, there generally is plenty of room in your GPU to do more work. So that 01:25:38.760 |
should be fine. The short answer is you have to use it, otherwise it won't work. 01:25:46.000 |
I wanted to point out something which is that in all of our models so far, we did not One 01:25:52.760 |
Hot encode our outputs. So our outputs, remember, looked like this. They were sequences of numbers. 01:26:08.400 |
And so always before, we've had to One Hot encode our outputs to use them. It turns out 01:26:16.160 |
that Keras has a very cool loss function called sparse-categorical-cross-entropy. This is identical 01:26:26.080 |
to categorical-cross-entropy, but rather than taking a One Hot encoded target, it takes an 01:26:34.360 |
integer target, and basically it acts as if you had One Hot encoded it. So it basically does the One Hot encoding for you behind the scenes. 01:26:44.240 |
So this is a really helpful thing to know about because when you have a lot of output 01:26:50.100 |
categories like, for example, if you're doing a word model, you could have 100,000 output 01:26:57.120 |
categories. There's no way you want to create a matrix that is 100,000 long, nearly all 01:27:02.960 |
zeros for every single word in your output. So by using sparse-categorical-cross-entropy, 01:27:09.880 |
you can just forget the whole One Hot encode. You don't have to do it. Keras implicitly 01:27:15.800 |
does it for you, but without ever actually materializing the One Hot matrix, it just does a direct lookup of the target index. 01:27:23.960 |
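As a quick sketch of the difference (model, xs and ys are placeholders for the model and integer-encoded arrays above; depending on the Keras version, sequence targets may need a trailing length-1 axis, and nb_epoch is the Keras 1 spelling):

```python
import numpy as np

# ys holds plain integer character indices; no one-hot target matrix is ever built
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.fit(xs, np.expand_dims(ys, -1), batch_size=64, nb_epoch=1)
```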
However, because I want to make things simpler for us to understand, I'm going to go ahead 01:27:30.160 |
and recreate our Keras model using One Hot encoding. So I'm going to take exactly the 01:27:38.280 |
same model that we had before with return_sequences=True, but this time I'm going to use normal categorical cross-entropy. 01:27:49.320 |
And the other thing I'm doing is I don't have an embedding layer. 01:27:53.120 |
So since I don't have an embedding layer, I also have to One Hot encode my inputs. So 01:27:58.200 |
you can see I'm calling to_categorical on all my inputs and to_categorical on all my outputs. 01:28:06.760 |
So now the shape is 75,000 x 8, as before, by 86. So this is the One Hot encoding dimension 01:28:16.960 |
in which there are 85 zeros and a single 1. So we fit this in exactly the same way, we get 01:28:24.160 |
exactly the same answer. So the only reason I was doing that was because I want to use 01:28:31.920 |
One Hot encoding for the version that we're going to create ourselves from scratch. 01:28:39.960 |
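A sketch of that preprocessing (xs and ys are the integer-encoded arrays of shape (75000, 8); the names are placeholders):

```python
import numpy as np
from keras.utils.np_utils import to_categorical

vocab_size = 86
oh_xs = np.array([to_categorical(x, vocab_size) for x in xs])   # shape (75000, 8, 86)
oh_ys = np.array([to_categorical(y, vocab_size) for y in ys])   # shape (75000, 8, 86)

model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(oh_xs, oh_ys, batch_size=64, nb_epoch=1)
```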
So we haven't really looked at Theano before, but particularly if you come back next year, 01:28:47.640 |
as we start to try to add more and more stuff on top of Keras or into Keras, increasingly 01:28:57.240 |
you'll find yourself wanting to use Theano, because Theano is the language, if you like, 01:29:03.640 |
that Keras is using behind the scenes, and therefore it's the language which you can 01:29:09.200 |
use to extend it. Of course you can use TensorFlow as well, but we're using Theano in this course 01:29:15.920 |
because I think it's much easier for this kind of application. 01:29:23.520 |
So let's learn to use Theano. In the process of doing it in Theano, we're going to have 01:29:30.360 |
to force ourselves to think through a lot more of the details than we have before, because 01:29:37.720 |
Theano doesn't have any of the conveniences that Keras has. There's no such thing as a 01:29:42.920 |
layer. We have to think about all of the weight matrices and activation functions ourselves. 01:29:49.360 |
So let me show you how it works. In Theano, there's this concept of a variable, and a 01:29:58.680 |
variable is something which we basically define like so. We can say there is a variable which 01:30:05.000 |
is a matrix which I will call T_input, and there is a variable which is a matrix that 01:30:10.480 |
we'll call T_output, and there is a variable that is a vector that we will call H0. 01:30:16.520 |
What these are all saying is that these are things that we will give values to later. 01:30:25.280 |
Programming in Theano is very different to programming in normal Python, and the reason 01:30:29.800 |
for this is Theano's job in life is to provide a way for you to describe a computation that 01:30:37.560 |
you want to do, and then it's going to compile it for the GPU, and then it's going to run 01:30:45.880 |
So it's going to be a little more complex to work in Theano, because Theano isn't going 01:30:49.880 |
to be something where we immediately say do this, and then do this, and then do this. Instead 01:30:55.400 |
we're going to build up what's called a computation graph. It's going to be a series of steps. 01:30:59.840 |
We're going to say in the future, I'm going to give you some data, and when I do, I want you to do this computation. 01:31:08.200 |
So rather than actually starting off by giving it data, we start off by just describing the 01:31:13.640 |
types of data that when we do give it data, we're going to give it. So eventually we're 01:31:18.400 |
going to give it some input data, we're going to give it some output data, and we're going 01:31:24.640 |
to give it some way of initializing the first hidden state. 01:31:29.600 |
And also we'll give it a learning rate, because we might want to change it later. So that's 01:31:35.240 |
all these things do. They create Theano variables. So then we can create a list of those, and 01:31:41.600 |
so this is all of the arguments that we're going to have to provide to Theano later on. 01:31:45.960 |
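Concretely, those declarations look something like this (the variable names are illustrative, and they carry through the sketches below):

```python
import theano
import theano.tensor as T

t_inp  = T.matrix('inp')    # the one-hot encoded input sequence, supplied later
t_outp = T.matrix('outp')   # the one-hot encoded target sequence, supplied later
t_h0   = T.vector('h0')     # the initial hidden state
lr     = T.scalar('lr')     # the learning rate

all_args = [t_h0, t_inp, t_outp, lr]   # everything we will have to pass in when we call the function
```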
So there's no data here, nothing's being computed, we're just telling Theano that these things will exist and what type they will be. 01:31:54.680 |
The next thing that we need to do, because we're going to try to build this, is we're 01:32:02.480 |
going to have to build all of the pieces in all of these layer operations. So specifically 01:32:07.800 |
we're going to have to create the weight matrix and bias vector for the orange arrow, the 01:32:13.560 |
weight matrix and the bias vector for the green arrow, 01:32:18.400 |
and the weight matrix and the bias vector for the 01:32:22.080 |
blue arrow, because that's what these layer operations 01:32:26.120 |
are. They're a matrix multiplication followed by a non-linear activation function. 01:32:33.120 |
So I've created some functions to do that. WH is what I'm going to call the weights and 01:32:41.080 |
bias to my hidden layer, Wx will be my weights and bias to my input, and Wy will be my weights 01:32:47.120 |
and bias to my output. So to create them, I've created this little function called weights 01:32:52.920 |
and bias in which I tell it the size of the matrix that I want to create. So the matrix 01:33:00.320 |
that goes from input to hidden therefore has n input rows and n hidden columns. So weights 01:33:10.320 |
and bias is here, and it's going to return a tuple, it's going to return our weights, 01:33:18.600 |
and it's going to return our bias. So how do we create the weights? 01:33:23.600 |
To create the weights, we first of all calculate the magic Glorot number, the square root of 01:33:29.480 |
2 over fan n, so that's the scale of the random numbers that we're going to use. We then create 01:33:36.800 |
those random numbers using NumPy's normal random number function, and then we use a special Theano 01:33:47.420 |
keyword called 'shared'. What shared does is it says to Theano, this data is something 01:33:55.680 |
that I'm going to want you to pass off to the GPU later and keep track of. 01:34:00.840 |
So as soon as you wrap something in shared, it kind of belongs to Theano now. So here 01:34:06.280 |
is a weight matrix that belongs to Theano, here is a vector of zeros that belongs to 01:34:13.500 |
Theano and that's our initial bias. So we've initialized our weights and our bias, so we 01:34:19.800 |
can do that for our inputs and we can do that for our outputs. 01:34:26.080 |
And then for our hidden, which is the orange arrow, we're going to do something slightly 01:34:32.480 |
different which is we will initialize it using an identity matrix. And rather amusingly in 01:34:38.640 |
numpy, it is 'eye' (np.eye) for identity. So this is an identity matrix, believe it or not, of size 01:34:47.320 |
n by n. And so that's our initial weights, and our initial bias is exactly as before: just a vector of zeros. 01:34:58.320 |
So you can see we've had to manually construct each of these 3 weight matrices and bias vectors. 01:35:08.200 |
It's nice to now stick them all into a single list. And Python has this thing called chain.from_iterable 01:35:13.120 |
(in itertools), which basically takes all of these tuples and dumps them all together into 01:35:17.920 |
a single list. And so this now has all 6 weight matrices and bias vectors in a single list. 01:35:30.400 |
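Put together, the initialization code looks roughly like this (a sketch that is close to, but not necessarily identical to, the notebook; n_input, n_hidden and n_output hold the layer sizes):

```python
import numpy as np
from numpy.random import normal
from itertools import chain
import theano

n_input = n_output = 86   # one-hot character dimension
n_hidden = 256

def init_wgts(rows, cols):
    scale = np.sqrt(2.0 / rows)   # scale of the random numbers: sqrt(2 / fan_in)
    return theano.shared(normal(scale=scale, size=(rows, cols)).astype(np.float32))

def init_bias(n):
    return theano.shared(np.zeros(n, dtype=np.float32))

def wgts_and_bias(n_in, n_out):
    return init_wgts(n_in, n_out), init_bias(n_out)

def id_and_bias(n):
    # hidden-to-hidden weights start out as an identity matrix: np.eye
    return theano.shared(np.eye(n, dtype=np.float32)), init_bias(n)

W_x = wgts_and_bias(n_input, n_hidden)    # green arrow: input to hidden
W_h = id_and_bias(n_hidden)               # orange arrow: hidden to hidden
W_y = wgts_and_bias(n_hidden, n_output)   # blue arrow: hidden to output

# flatten the three (weights, bias) tuples into one list of six shared variables
w_all = list(chain.from_iterable([W_h, W_x, W_y]))
```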
We have defined the initial contents of each of these arrows. And we've also defined kind 01:35:37.280 |
of symbolically the concept that we're going to have something to initialize it with here, 01:35:43.360 |
something to initialize it with here, and something to initialize it with here. 01:35:48.080 |
So the next thing we have to do is to tell Theano what happens each time we take a single 01:36:00.680 |
On the GPU, you can't use a for loop. The reason you can't use a for loop is because 01:36:07.000 |
a GPU wants to be able to parallelize things and wants to do things at the same time. And 01:36:11.800 |
a for loop by definition can't do the second part of the loop until it's done the first part. 01:36:17.120 |
I don't know if we'll get time to do it in this course or not, but there's a very neat 01:36:22.600 |
result which shows that there's something very similar to a for loop that you can parallelize, 01:36:28.320 |
and it's called a scan operation. A scan operation is something that's defined in a very particular way. 01:36:37.960 |
A scan operation is something where you call some function for every element of some sequence. 01:36:48.040 |
And at every point, the function returns some output, and the next time through that function 01:36:55.320 |
is called, it's going to get the output of the previous time you called it, along with the next element of the sequence. 01:37:03.600 |
So in fact, I've got an example of it. I actually wrote a very simple example of it in Python. 01:37:17.920 |
Here is the definition of scan, and here is an example of scan. Let's start with the example. 01:37:24.000 |
I want to do a scan, and the function I'm going to use is to add two things together. 01:37:31.840 |
And I'm going to start off with the number 0, and then I'm going to pass in a range of 01:37:39.600 |
So what scan does is it starts out by taking the first time through, it's going to call 01:37:45.000 |
this function with that argument and the first element of this. So it's going to be 0 plus 01:37:53.520 |
The second time, it's going to call this function with the second element of this, along with 01:38:00.440 |
the result of the previous call. So it will be 0 plus 1 equals 1. The next time through, 01:38:08.040 |
it's going to call this function with the result of the previous call plus the next 01:38:13.320 |
element of this range, so it will be 1 plus 2 equals 3. 01:38:18.640 |
So you can see here, this scan operation defines a cumulative sum. And so you can see the definition 01:38:26.600 |
of scan here. We're going to be returning an array of results. Initially, we take our starting 01:38:32.920 |
point, 0, and that's our initial value for the previous answer from scan. And then we're 01:38:40.400 |
going to go through everything in the sequence, which is 0 through 4. We're going to apply 01:38:45.240 |
this function, which in this case was AddThingsUp, and we're going to apply it to the previous 01:38:50.640 |
result along with the next element of the sequence. Stick the result at the end of our 01:38:56.400 |
list, set the previous result to whatever we just got, and then go on to the next element of the sequence. 01:39:04.800 |
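Written out in plain Python, the scan definition and the cumulative-sum example look roughly like this (a sketch of what was shown on screen, not necessarily the exact code):

```python
def scan(fn, start, seq):
    res = []
    prev = start
    for s in seq:
        app = fn(prev, s)   # call fn on the previous result and the next element
        res.append(app)
        prev = app
    return res

# example: a cumulative sum over 0..4
print(scan(lambda prev, curr: prev + curr, 0, range(5)))   # [0, 1, 3, 6, 10]
```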
So it may be very surprising, I mean hopefully it is very surprising because it's an extraordinary 01:39:10.000 |
result, but it is possible to write a parallel version of this. So if you can turn your algorithm 01:39:18.640 |
into a scan, you can run it quickly on GPU. So what we're going to do is our job is to 01:39:25.360 |
turn this RNN into something that we can put into this kind of format, into a scan. So let's do that. 01:39:45.640 |
So the function that we're going to call on each step through is the function called Step. 01:39:52.500 |
And the function called Step is going to be something which hopefully will not be very 01:39:56.600 |
surprising to you. It's going to be something which takes our input, x, it does a dot product 01:40:02.520 |
by that weight matrix we created earlier, wx, and adds on that bias vector we created earlier. 01:40:10.400 |
And then we do the same thing, taking our previous hidden state, multiplying it by the 01:40:15.240 |
weight matrix for the hidden state, and adding the biases for the hidden state, and then 01:40:19.760 |
puts the whole thing through an activation function, relu. 01:40:24.000 |
So in other words, let's go back to the unrolled version. So we had one bit which was calculating 01:40:31.860 |
our previous hidden state and putting it through the hidden state weight matrix. It was taking 01:40:40.040 |
our next input and putting it through the input one and then adding the two together. 01:40:47.880 |
So that's what we have here, the x by wx and the h by wh, and then adding the two together 01:40:54.780 |
along with the biases, and then put that through an activation function. 01:40:59.980 |
So once we've done that, we now want to create an output every single time, and so our output 01:41:07.980 |
is going to be exactly the same thing. It's going to take the result of that, which we 01:41:11.720 |
called h, our hidden state, multiply it by the output weight matrix, adding on the 01:41:17.480 |
bias, and this time we're going to use softmax. So you can see that this sequence here is exactly the set of layer operations we drew in the diagram. 01:41:31.480 |
And so this therefore defines what we want to do each step through. And at the end of 01:41:37.400 |
that, we're going to return the hidden state we have so far and our output. So that's what's going to happen at each step. 01:41:48.560 |
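As code, the step function is something like this (the weight names follow the list built earlier; T.flatten turns softmax's row-matrix result back into a vector):

```python
def step(x, h, W_h, b_h, W_x, b_x, W_y, b_y):
    # new hidden state: input through the green arrow plus the previous hidden
    # state through the orange arrow, then a ReLU
    h = T.nnet.relu(T.dot(x, W_x) + b_x + T.dot(h, W_h) + b_h)
    # output for this step: hidden state through the blue arrow, then a softmax
    y = T.nnet.softmax(T.dot(h, W_y) + b_y)
    return h, T.flatten(y, 1)
```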
So the sequence that we're going to pass into it is, well, we're not going to give it any 01:41:55.040 |
data yet because remember, all we're doing is we're describing a computation. So for 01:41:59.160 |
now, we're just telling it that the sequence will be a matrix, and that we'll pass 01:42:06.560 |
that matrix in later. It also needs a starting point, 01:42:14.080 |
and so the starting point is, again, we are going to provide to you an initial value for 01:42:21.100 |
our hidden state, but we haven't done it yet. 01:42:26.080 |
And then finally in Theano, you have to tell it what are all of the other things that are 01:42:29.520 |
passed to the function, and we're going to pass it that whole list of weights. That's 01:42:34.480 |
why we have here the x, the hidden, and then all of the weights and biases. 01:42:45.360 |
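The corresponding theano.scan call looks roughly like this (a sketch; the second entry of outputs_info is None because the output y is not fed back into the next step):

```python
[v_h, v_y], _ = theano.scan(step,
                            sequences=t_inp,              # one row of the input matrix per step
                            outputs_info=[t_h0, None],    # initial hidden state; y is not recurrent
                            non_sequences=w_all)          # the shared weights and biases
```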
So that's now described how to execute a whole sequence of steps for an RNN. So we've now 01:42:55.560 |
described how to do this to Theano. We haven't given it any data to do it, we've just set 01:43:02.120 |
up the computation. And so when that computation is run, it's going to return two things because 01:43:09.640 |
step returns two things. It's going to return the hidden state and it's going to return our output. 01:43:19.040 |
So now we need to calculate our error. Our error will be the categorical cross-entropy, 01:43:25.160 |
and so these things are all part of Theano. You can see I'm using some Theano functions 01:43:28.920 |
here. And so we're going to compare the output that came out of our scan, and we're going 01:43:35.520 |
to compare it to what we don't know yet, but it will be a matrix. And then once you do 01:43:45.880 |
Now here's the amazing thing. Every step we're going to want to apply SGD, which means every 01:43:51.560 |
step we're going to want to take the derivative of this whole thing with respect to all of 01:43:59.480 |
the weights and use that, along with the learning rate, to update all of the weights. In Theano, 01:44:07.360 |
that's how you do it. You just say, "Please tell me the gradient of this function with 01:44:14.600 |
respect to these inputs." And Theano will symbolically, automatically calculate all 01:44:20.240 |
of the derivatives for you. So that's very nearly magic, but we don't have to worry about 01:44:26.700 |
derivatives because it's going to calculate them all for us. 01:44:31.200 |
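In code, the loss and the gradients are just a couple of lines (a sketch; here the cross-entropy is summed over the sequence):

```python
error = T.nnet.categorical_crossentropy(v_y, t_outp).sum()   # loss over the whole sequence
g_all = T.grad(error, w_all)   # symbolic derivatives of the loss w.r.t. every weight and bias
```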
So at this point, I now have a function that calculates our loss, and I have a function 01:44:36.240 |
that calculates all of the gradients that we need with respect to all of the different weights and biases. 01:44:43.720 |
So we're now ready to build our final function. Our final function, as input, takes all of 01:44:51.640 |
our arguments, that is, these four things, which is the things we told it we're going 01:44:57.360 |
to need later. The thing that's going to create an output is the error, which was this output. 01:45:07.360 |
And then at each step, it's going to do some updates. What are the updates going to do? 01:45:12.680 |
The updates it's going to do is the result of this little function. And this little function 01:45:17.240 |
is something that creates a dictionary that is going to map every one of our weights to 01:45:24.880 |
that weight minus each one of our gradients times the learning rate. So it's going to 01:45:33.840 |
update every weight to itself minus its gradient times the learning rate. 01:45:41.480 |
So basically what Theano does is it says, it's got this little thing called updates, 01:45:46.760 |
it says every time you calculate the next step, I want you to change your shared variables 01:45:53.720 |
as follows. So there's our list of changes to make. 01:45:58.960 |
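Putting that together, the update rule and the compiled training function look something like this (the helper name upd_dict is an assumption):

```python
def upd_dict(wgts, grads, lr):
    # map every shared weight to its new value: weight minus gradient times learning rate
    return {w: w - g * lr for (w, g) in zip(wgts, grads)}

upd = upd_dict(w_all, g_all, lr)

# inputs are the four arguments declared earlier, the output is the loss, and after
# every call Theano applies the updates to the shared variables
fn = theano.function(all_args, error, updates=upd, allow_input_downcast=True)
```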
And so that's it. So we use our one hot encoded x's and our one hot encoded y's, and we have 01:46:05.520 |
to now manually create our own loop. Theano doesn't have any built-in stuff for us, so 01:46:11.440 |
we're going to go through every element of our input and we're going to say let's call 01:46:19.320 |
that function, so that function is the function that we just created, and now we have to pass 01:46:24.480 |
in all of these inputs. So we have to finally pass in a value for the initial hidden state, 01:46:32.320 |
the input, the target, and the learning rate. So this is where we get to do it is when we 01:46:37.880 |
finally call it here. So here's our initial hidden state, it's just a bunch of zeros, our 01:46:44.920 |
input, our output, and our learning rate, which we set to 0.01. 01:46:50.880 |
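A sketch of that manual loop, one sequence at a time (variable names follow the earlier sketches):

```python
import numpy as np

err = 0.0
l_rate = 0.01
for i in range(len(oh_xs)):
    # zero initial hidden state, one input sequence, its target, and the learning rate
    err += fn(np.zeros(n_hidden), oh_xs[i], oh_ys[i], l_rate)
    if i % 1000 == 999:
        print("Error: {:.3f}".format(err / 1000))
        err = 0.0
```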
And then I've just set it to something here that says every thousand times, print out 01:46:54.720 |
the error. And so as you can see, over time, it learns. And so at the end of learning, I 01:47:05.800 |
get a new theano function which takes some piece of input, along with some initial hidden 01:47:13.080 |
state, and it produces not the loss, but the output. 01:47:19.160 |
Are we using gradient descent and not stochastic gradient descent here? 01:47:27.240 |
We're using stochastic gradient descent with a mini-batch size of 1. So gradient descent 01:47:32.920 |
without stochastic actually means you're using a mini-batch size of the whole data set. This 01:47:37.400 |
is kind of the opposite of that. I think this is called online gradient descent. 01:47:45.320 |
So remember earlier on, we had this thing to calculate the vector of outputs. So now 01:47:54.120 |
to do our testing, we're going to create a new function which goes from our input to 01:47:58.760 |
our vector of outputs. And so our predictions will be to take that function, pass it in 01:48:05.600 |
our initial hidden state, and some input, and that's going to give us some predictions. 01:48:13.640 |
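A sketch of that test-time function (indices_char, a mapping from index back to character, is an assumption):

```python
import numpy as np

# compile a function that returns the sequence of output distributions rather than the loss
f_y = theano.function([t_h0, t_inp], v_y, allow_input_downcast=True)

preds = np.argmax(f_y(np.zeros(n_hidden), oh_xs[0]), axis=1)   # predicted index at each step
print([indices_char[p] for p in preds])
```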
So if we call it, we can now see, let's now grab some sequence of text, pass it to our 01:48:27.280 |
function to get some predictions, and let's see what it does. So after 't', it expected 01:48:31.200 |
'h'; after 'th', it expected 'e'; after 'the', it expected a space; after 'then', it expected a space; after 01:48:40.880 |
'then?', it expected a space. So you can see here that we have successfully built an RNN from scratch in Theano. 01:49:00.040 |
That's been a very, very quick run-through. My goal really tonight is to kind of get to 01:49:10.400 |
a point where you can start to look at this during the week and kind of see all the pieces. 01:49:16.240 |
Because next week, we're going to try and build an LSTM in Theano, which is going to 01:49:22.320 |
mean that I want you by next week to start to feel like you've got a good understanding 01:49:27.560 |
of what's going on. So please ask lots of questions on the forum, look at the documentation, 01:49:36.560 |
And then the next thing we're going to do after that is we're going to build an RNN 01:49:42.920 |
without using Theano. We're going to use pure NumPy. And that means that we're not going 01:49:48.040 |
to be able to use t.grad, we're going to have to calculate the gradients by hand. So hopefully 01:49:55.960 |
that will be a useful exercise in really understanding what's going on in backpropagation. 01:50:09.120 |
So I kind of want to make sure you feel like you've got enough information to get started 01:50:13.600 |
with looking at Theano this week. So did anybody want to ask any questions about this piece? 01:50:24.960 |
So this is maybe a bit too far away from what we did today, but how would you apply an RNN 01:50:37.920 |
to, say, something other than text, like images? Is that something that's worth doing, and if so, what would it look like? 01:50:43.960 |
Yeah, sure. So the main way in which an RNN is applied to images is what we looked at 01:50:51.020 |
last week, which is these things called attentional models, which is where you basically say, 01:50:57.680 |
given which part of the image you're currently looking at, which part would make sense to 01:51:03.960 |
look at next. This is most useful on really big images where you can't really look at 01:51:23.120 |
the whole thing at once because it would just eat up all your GPU's RAM, so you can only look at a piece of it at a time. 01:51:32.200 |
Another way that RNNs are very useful for images is for captioning images. We'll talk 01:51:49.680 |
a lot more about this in the next year's course, but have a think about this in the meantime. 01:51:56.440 |
If we've got an image, then a CNN can turn that into a vector representation of that 01:52:06.620 |
image. For example, we could chuck it through VGG and take the penultimate layer's activations. 01:52:14.480 |
There's all kinds of things we could do, but in some way we can turn an image into a vector. 01:52:26.240 |
We could do the same thing to a sentence. We can take a sentence consisting of a number 01:52:30.520 |
of words and we can stick that through an RNN, and at the end of it we will get some state. 01:52:41.680 |
And that state is also just a vector. What we could then do is learn a neural network 01:52:51.440 |
which maps the picture to the text, assuming that this sentence was originally a caption 01:53:02.320 |
that had been created for this image. In that way, if we can learn a mapping from some representation 01:53:13.820 |
of the image that came out of a CNN to some representation of a sentence that came out 01:53:21.040 |
of an RNN, then we could basically reverse that in order to generate captions for an 01:53:28.360 |
image. So basically what we could then do is we could take some new image that we've 01:53:32.840 |
never seen before, chuck it through the CNN to get our state out, and then we could figure 01:53:40.000 |
out what RNN state we would expect would be attached to that based on this neural net 01:53:48.360 |
that we had learned, and then we can basically do a sequence generation just like we have 01:53:54.680 |
been today and generate a sequence of words. And this is roughly how these image captioning 01:54:10.680 |
So RNNs, I guess finally the only other way in which I've seen RNNs applied to images 01:54:18.200 |
is for really big 3D images, for example, like in medical imaging. So if you've got 01:54:27.000 |
something like an MRI that's basically a series of slices, it's too big to look at the whole 01:54:32.720 |
thing. Instead you can use an RNN to start in the top corner and then look one pixel 01:54:42.440 |
to the left, then one pixel across, then one pixel back, and then it can go down into the 01:54:47.200 |
next layer and it can gradually look one pixel at a time. And it can do that and gradually 01:54:53.000 |
cover the whole thing. And in that way, it's gradually able to generate state about what it's seeing in the whole image. 01:55:06.000 |
And so this is not something which is very widely used, at least at this point, but I 01:55:13.640 |
think it's worth thinking about. Because again, you could combine this with a CNN. Maybe you 01:55:19.440 |
could have a CNN that looks at large chunks of this MRI at a time and generates state 01:55:28.160 |
for each of these chunks, and then maybe you could use an RNN to go through the chunks. 01:55:33.560 |
There's all kinds of ways that you can combine CNNs and RNNs together. 01:55:45.600 |
So can you build a custom layer in Theano and then mix it with Keras? 01:56:14.460 |
There's lots of examples of them that you'll generally find in the GitHub issues where 01:56:23.400 |
people will show like I was trying to build this layer and I had this problem. But it's definitely something you can do. 01:56:33.600 |
The other thing I find really useful to do is to actually look at the definition of the 01:56:49.400 |
layers in Keras. One of the things I actually did was I created this little thing called 01:57:03.200 |
PyPath which allows me to put in any Python module and it returns the directory that that 01:57:16.880 |
module is defined in, so I can go, let's have a look at how any particular layer is defined. 01:57:29.560 |
So let's say I want to look at pooling. Here is a max_pooling1d layer and you can see it's 01:57:39.000 |
defined in nine lines of code. Generally speaking, you can kind of see that layers don't take much code to write. 01:58:02.040 |
You can absolutely create an image from a caption. There's a lot of image generation 01:58:07.040 |
stuff going on at the moment. It's not at a point that's probably useful for anything 01:58:13.480 |
in practice. It's more like an interesting research journey I guess. So generally speaking, 01:58:23.400 |
this is in the area called generative models. We'll be looking at generative models next 01:58:28.440 |
year because they're very important for unsupervised and semi-supervised learning. 01:58:32.520 |
What would get the best performance on a document classification task? A CNN, an RNN, or both? 01:58:43.040 |
So let's go back to sentiment analysis. To remind ourselves, when we looked at sentiment 01:58:50.560 |
analysis for IMDB, the best result we got came from a multi-size convolutional neural 01:58:57.400 |
network where we basically took a bunch of convolutional neural networks of varying sizes. 01:59:03.000 |
A simple convolutional network was nearly as good. I actually tried an LSTM for this 01:59:18.280 |
and I found the accuracy that I got was less good than the accuracy of the CNN. I think 01:59:29.320 |
the reason for this is that when you have a whole movie review, which is a few paragraphs, 01:59:40.000 |
the information you can get just by looking at a few words at a time is enough to tell 01:59:44.880 |
you whether this is a positive review or a negative review. If you see a sequence of 01:59:51.360 |
five words like this is totally shit, you can probably learn that's not a good thing, 01:59:57.520 |
or else if this is totally awesome, you can probably learn that is a good thing. 02:00:01.840 |
The extra nuance you would get from reading an entire review word-by-word just doesn't seem 02:00:11.200 |
necessary in practice. In general, once you get to a certain sized piece of text, 02:00:21.240 |
like a paragraph or two, there doesn't seem to be any sign that RNNs are helpful, at least for classification. 02:00:35.240 |
Before I close off, I wanted to show you two little tricks because I don't spend enough 02:00:39.840 |
time showing you cool little tricks. When I was working with Brad today, there were two 02:00:44.760 |
little tricks that we realized that other people might like to learn about. The first 02:00:50.720 |
trick I wanted to point out to you is, if you want to learn about how a function works, 02:01:11.080 |
what would be a quick way to find out? If you've got a function there on your screen 02:01:17.000 |
and you hit Shift + Tab, all of the parameters to it will pop up. If you hit Shift + Tab twice, 02:01:26.440 |
the documentation will pop up. So that was one little tip that I wanted you guys to know 02:01:35.640 |
The second little tip that you may not have been aware of is that you can actually run 02:01:39.560 |
the Python debugger inside Jupyter Notebook. So today we were trying to do that when we 02:01:45.040 |
were trying to debug our pure Python RNN. So we can see an example of that. 02:01:58.240 |
So let's say we were having some problem inside our loop here. You can go import pdb, that's 02:02:05.480 |
the Python debugger, and then you can set a breakpoint anywhere. So now if I run this 02:02:14.600 |
as soon as it gets to here, it pops up a little dialog box and at this point I can look at 02:02:21.720 |
anything. For example, I can say what's the value of 'er' at this point? And I can say 02:02:26.960 |
what are the lines I'm about to execute? And I can say execute the next line, and it moves on one step. 02:02:35.520 |
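A minimal sketch of that (the loop body here is just a placeholder):

```python
import pdb

err = 0.0
for i in range(10):
    pdb.set_trace()   # execution pauses here and drops into the debugger
    err += i          # placeholder for the real loop body

# at the (Pdb) prompt you can then type, for example:
#   p err   -- print the current value of err
#   l       -- list the lines about to be executed
#   n       -- execute the next line
#   c       -- continue until the next breakpoint
```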
If you want to learn about the Python debugger, just Google for a Python debugger, but learning 02:02:41.640 |
to use the debugger is one of the most helpful things because it lets you step through each 02:02:46.600 |
step of what's going on and see the values of all of your variables and do all kinds 02:02:53.000 |
of cool stuff like that. So those were two little tips I thought I would leave you with 02:02:59.080 |
And that's 9 o'clock. Thanks very much everybody.