Lesson 5: Practical Deep Learning for Coders
00:00:00.000 |
So I wanted to start off by showing you something I'm kind of excited about, which is, here 00:00:05.120 |
is the Dogs and Cats competition, which we all know so well. 00:00:08.840 |
And it was interesting that the winner of this competition won by a very big margin. 00:00:20.080 |
It is very unusual in a Kaggle competition to see anybody win by a 50-60% margin. 00:00:25.760 |
You can see that after that, people are generally clustering around 98.1 or so. 00:00:35.900 |
This is the guy who actually created a piece of deep learning software called OverFeat. 00:00:42.480 |
So I want to show you something pretty interesting, which is that this week I tried something new, 00:00:58.360 |
and the way I did it was by using nearly only techniques I've already shown you, which is 00:01:03.700 |
basically that I created a standard dense model. 00:01:11.560 |
And then I pre-computed the last convolutional layer, and then I trained the dense model 00:01:18.680 |
lots of times, and the other thing I did was to use some data augmentation. 00:01:26.760 |
And I didn't actually have time to figure out the best data augmentation parameters, 00:01:29.560 |
so I just picked some that seemed reasonable. 00:01:31.960 |
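For reference, here is a hedged sketch of what "reasonable" augmentation parameters might look like with Keras's ImageDataGenerator. The directory path and all the values below are assumptions for illustration, not the ones used in the lecture:

```python
from keras.preprocessing.image import ImageDataGenerator

# Illustrative, untuned augmentation parameters (all values are assumptions)
gen = ImageDataGenerator(rotation_range=10, width_shift_range=0.05,
                         height_shift_range=0.05, shear_range=0.1,
                         zoom_range=0.1, horizontal_flip=True)

# 'data/dogscats/train' is a hypothetical path to the training images
batches = gen.flow_from_directory('data/dogscats/train', target_size=(224, 224),
                                  class_mode='categorical', shuffle=True, batch_size=64)
```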
I should also mention this 98.95 would be easy to make a lot better. 00:01:38.000 |
I'm not doing any pseudo-labeling here, and I'm not even using the full dataset. 00:01:44.880 |
So with those two changes we would definitely get well over 99% accuracy. 00:01:51.400 |
The missing piece is that I added batch normalization to VGG. 00:01:58.140 |
So batch normalization, if you guys remember, I said the important takeaway is that all 00:02:03.320 |
modern networks should use batch norm because you can get 10x or more improvements in training 00:02:13.760 |
speed, and because it tends to reduce overfitting. Because of that second point, it means you can use less dropout, and dropout of course is 00:02:20.200 |
destroying some of your network, so you don't want to use more dropout than necessary. 00:02:32.340 |
So VGG was kind of mid to late 2014, and batch norm was maybe early to mid 2015. 00:02:43.600 |
So why haven't people added batch norm to VGG already? 00:02:49.000 |
And the answer is actually interesting to think about. 00:02:51.380 |
So to remind you what batch norm is: batch norm is something which, first of all, normalizes 00:03:02.100 |
all of the activations by subtracting the mean and dividing by the standard deviation. 00:03:11.420 |
And I know somebody on the forum today asked why that is a good idea, and I've put a link 00:03:15.160 |
to some more information about that, so anybody who wants to know more about why we do normalization 00:03:22.680 |
can follow that link. But just doing that alone isn't enough, because SGD is quite bloody-minded, and so if it was 00:03:31.280 |
trying to de-normalize the activations because it thought that was a good thing to do, it 00:03:39.320 |
would just do that. So every time you tried to normalize them, SGD would just undo it again. 00:03:43.580 |
So what batch norm does is it adds two additional trainable parameters to each layer. 00:03:50.040 |
One which multiplies the activations and one which is added to the activations. 00:03:55.400 |
So it basically allows it to undo the normalization, but not by changing every single weight, but 00:04:03.160 |
by just changing two weights for each activation. 00:04:06.400 |
So it makes things much more stable in practice. 00:04:11.020 |
So you can't just go ahead and stick batch norm into a pre-trained network, because if 00:04:15.740 |
you do, it's going to take that layer and it's going to normalize all of the incoming activations -- 00:04:21.640 |
subtract the mean and divide by the standard deviation -- which means those pre-trained 00:04:28.240 |
weights from then on are wrong, because those weights were created for a completely different distribution of inputs. 00:04:36.280 |
So it's not rocket science, but I realized all we need to do is to insert a batch norm 00:04:45.000 |
layer and figure out what the mean and standard deviation of the incoming activations would 00:04:51.880 |
be for that dataset, and basically create the batch norm layer such that the two trainable 00:05:01.840 |
parameters exactly undo that normalization. So that way we would insert a batch norm layer and it would not change the outputs at all. 00:05:08.380 |
So I grabbed the whole of ImageNet and I created our standard dense layer model. 00:05:17.680 |
I pre-computed the convolutional outputs for all of ImageNet, and then I created two batch 00:05:24.840 |
norm layers, and I created a little function which allows us to insert a layer into an existing model. 00:05:31.560 |
I inserted the layers just after the two dense layers. 00:05:38.960 |
I set the weights on the new batch norm layers equal to the variance and the mean which 00:05:50.520 |
I had just calculated -- I calculated the mean of each of those two layer outputs and the variance of each of them. 00:05:56.560 |
And so that allowed me to insert these batch norm layers into an existing model. 00:06:03.240 |
And then afterwards I evaluated it and I checked that indeed it's giving me the same answers as before. 00:06:11.520 |
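A minimal sketch of that idea, assuming Keras 1-era APIs. `insert_identity_bn` is a hypothetical helper, not the actual function from the course code, and the BatchNormalization weight ordering should be checked against your Keras version:

```python
import numpy as np
from keras.layers import BatchNormalization
from keras.models import Sequential

def insert_identity_bn(layers, insert_after_idx, activations):
    """Rebuild a Sequential model with a BatchNormalization layer inserted after
    layers[insert_after_idx], initialized so it leaves the pre-trained activations unchanged.
    `activations` are that layer's pre-computed outputs over the dataset."""
    mean, var = activations.mean(axis=0), activations.var(axis=0)
    bn = BatchNormalization()
    model = Sequential()
    for i, layer in enumerate(layers):
        model.add(layer)
        if i == insert_after_idx:
            model.add(bn)
    # With gamma = sqrt(var) and beta = mean, gamma * (x - mean) / sqrt(var) + beta ≈ x,
    # i.e. the new layer is an identity transform on this dataset's activations.
    # Weight order assumed to be [gamma, beta, moving_mean, moving_variance]; verify for your version.
    bn.set_weights([np.sqrt(var), mean, mean, var])
    return model
```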
As well as doing that, I then thought: if you train a model with batch norm from the start, 00:06:21.120 |
you're going to end up with weights which are designed to take advantage of the fact that the activations are being normalized. 00:06:27.340 |
And so I thought I wonder what would happen if we now fine-tuned the ImageNet network 00:06:32.480 |
on all of ImageNet after we added these batch norm layers. 00:06:37.300 |
So I then tried training it for one epoch on both the ImageNet images and the horizontally flipped versions of them. 00:06:50.840 |
And you can see with modern GPUs, it takes less than an hour to run through the entirety of ImageNet. 00:07:00.480 |
And the interesting thing was that my accuracy on the validation set went up from 63% to 00:07:08.160 |
So adding batch norm actually improves ImageNet, which is cool. 00:07:12.880 |
That wasn't the main reason I did it; the main reason I did it was so that we can now 00:07:21.280 |
all use it. So I did all that, I saved the weights, and I then edited our VGG model. 00:07:38.920 |
So if we now look at the fully connected block in our VGG model, it now has batch norm in there. 00:07:53.560 |
I also saved to our website a new weights file called VGG16BN for batch norm. 00:08:02.480 |
And so then when I did cats and dogs, I used that model. 00:08:10.560 |
So now if you go and redownload the vgg16.py file from platform.ai, it will automatically download 00:08:19.680 |
the new weights, and you will have this without any changes to your code. 00:08:23.800 |
So I'll be interested to hear during the week if you try this out -- just rerun the code you've already got. 00:08:30.600 |
And hopefully you'll find it trains more quickly and you get better results. 00:08:36.520 |
At this stage, I've only added batch norm to the dense layers, not to the convolutional layers. 00:08:43.600 |
There's no reason I shouldn't add it to the convolutional layers as well, I just had other things to do. 00:08:50.760 |
Since most of us are mainly fine-tuning just the dense layers, this is the part that's going to impact you most. 00:08:58.760 |
So that's an exciting step which everybody can now use. 00:09:06.920 |
As well as -- the other thing to mention is now that you'll be using batch norm by default 00:09:12.720 |
in your VGG networks, you should find that you can increase your learning rates. 00:09:18.100 |
Because batch norm normalizes the activations, it makes sure that there's no activation that's 00:09:24.220 |
gone really high or really low, and that means that generally speaking you can use higher 00:09:30.560 |
So if you try higher learning rates in your code than you were before, you should find 00:09:36.240 |
You should also find that things that previously you couldn't get to train, now will start 00:09:42.680 |
Because often the reason that they don't train is because one of the activations shoots off 00:09:48.040 |
into really high or really low values and screws everything up, and that kind of thing gets fixed by batch norm. 00:09:56.400 |
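For example, something like the following -- assuming `vgg.model` is the underlying Keras model from the course's Vgg16 wrapper, and with 1e-3 as an illustrative value to try, not a recommendation from the lecture:

```python
from keras.optimizers import Adam

# A higher learning rate than you'd normally dare with plain VGG (value is illustrative)
vgg.model.compile(optimizer=Adam(lr=1e-3),
                  loss='categorical_crossentropy', metrics=['accuracy'])
```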
So there's some things to try this week, I'll be interested to hear how you go. 00:10:02.040 |
So last week we looked at collaborative filtering. 00:10:08.280 |
And to remind you, we had a file that basically looked something like this. 00:10:16.400 |
We had a bunch of movies and a bunch of users, and for some subset of those combinations we had a rating. 00:10:27.440 |
The way the actual file came to us didn't look like this, this is a crosstab. 00:10:32.440 |
The way the file came to us looked like this. 00:10:36.160 |
Each row was a single user rating a single movie with a single rating at a single time. 00:10:42.560 |
So I showed you in Excel how we could take the crosstab version, and we could create a 00:10:51.440 |
table of dot products, where the dot products would be between a set of 5 random numbers 00:11:00.200 |
for the movie and 5 random numbers for the user. 00:11:04.040 |
And we could then use gradient descent to optimize those sets of 5 random numbers for each movie and each user. 00:11:13.520 |
And if we did so, we end up getting pretty decent guesses as to the original ratings. 00:11:21.440 |
And then we went a step further in the spreadsheet and we learned how you could take the dot 00:11:27.240 |
product and you could also add on a single bias, a movie bias and a user bias. 00:11:35.480 |
So we saw all that in Excel, and we also learned that Excel comes with a gradient 00:11:42.860 |
descent solver called, funnily enough, Solver. 00:11:46.440 |
And we saw that if we ran Solver, telling it that these are our varying cells and this 00:11:53.720 |
is our target cell, it came up with some pretty decent weight matrices. 00:12:01.100 |
We learned that these kinds of weight matrices are called embeddings. 00:12:05.120 |
An embedding is basically something where we can start with an integer, like 27, and 00:12:10.280 |
look up the movie number 27's vector of weights, that's called an embedding. 00:12:16.320 |
Also, in collaborative filtering, this particular kind of embedding is known as the latent factors, 00:18:25.720 |
Where we hypothesized that once trained, each of these latent factors may mean something. 00:12:33.300 |
And I said next week we might come back and have a look and see if we can figure out what those factors mean. 00:12:41.920 |
So I'm going to take the bias model that we created. 00:12:47.400 |
The bias model we created was the one where we took a user embedding and a movie embedding, 00:12:57.600 |
and we took the dot product of the two, and then we added to it a user bias and a movie 00:13:05.600 |
bias where those biases are just embeddings which have a single output. 00:13:11.200 |
Just like in Excel, the bias was a single cell for each movie and a single cell for each user. 00:13:21.040 |
So then we tried fitting that model, and you might remember that we ended up getting an 00:13:27.680 |
accuracy that was quite a bit higher than the previous state-of-the-art. 00:13:36.840 |
Actually, for that one we didn't -- the previous state-of-the-art we broke by using the neural net version. 00:13:43.600 |
I discovered something interesting during the week, which is that I can get a state-of-the-art 00:13:48.600 |
result using just this simple bias model, and the trick was that I just had to increase the amount of regularization. 00:13:57.720 |
So we haven't talked too much about regularization, we've briefly mentioned it a couple of times, 00:14:01.640 |
but it's a very simple thing where we can basically say: add to the loss function the sum of the squares of the weights. 00:14:10.120 |
So we're trying to minimize the loss, and so if you're adding the sum of the squares 00:14:14.720 |
of the weights to the loss function, then the SGD solver is going to have to try to keep the weights small. 00:14:24.200 |
And so we can pass to most Keras layers a parameter called W_regularizer, which stands 00:14:33.680 |
for weight regularizer, and we can tell it how to regularize our weights. 00:14:37.000 |
In this case, I say use an L2 norm -- that means the sum of the squares -- and how much to weight it is 00:14:44.160 |
something that I pass in, and I used 1e-4. 00:14:49.660 |
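In Keras 1 syntax that looks something like the following. `n_users`, `n_movies` and `n_factors` stand for the dataset sizes and the number of latent factors (50 in the notebook), and are assumptions here:

```python
from keras.layers import Embedding
from keras.regularizers import l2

# L2 (sum of squares) weight regularization on each embedding, weighted by 1e-4.
# Keras 1 calls this W_regularizer; Keras 2 renamed it embeddings_regularizer for Embedding layers.
user_emb = Embedding(n_users, n_factors, input_length=1, W_regularizer=l2(1e-4))
movie_emb = Embedding(n_movies, n_factors, input_length=1, W_regularizer=l2(1e-4))
```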
And it turns out if I do that, and then I train it for a while, it takes quite a lot 00:14:55.000 |
longer to train, but let's see if I've got this somewhere. 00:15:02.260 |
I got down to a loss of 0.7979, which is quite a bit better than the best results that that 00:15:12.840 |
paper reported. That's not quite as good as the neural net -- the neural net got 0.7938 at best. 00:15:19.920 |
But it's still interesting that this pretty simple approach actually gets results better 00:15:28.660 |
than the academic state-of-the-art as of 2012 or 2013, and I haven't been able to find more recent benchmarks than that. 00:15:41.060 |
So I took this model, and I wanted to find out what we can learn from these results. 00:15:52.080 |
So obviously one thing that we would do with this model is just to make predictions with 00:15:56.080 |
So if you were building a website for recommending movies, and a new user came along and said 00:16:03.120 |
I like these movies this much, what else would you recommend? 00:16:06.760 |
You could just go through and do a prediction for each movie for that user ID and tell them which ones they would probably like best. 00:16:13.400 |
That's the normal way we would use collaborative filtering. 00:16:18.480 |
We can grab the top 2,000 most popular movies, just to make this more interesting, and we 00:16:29.640 |
can grab the bias term for each of them. And I'll talk more about this particular syntax in just a moment, but just for now, this is 00:16:38.140 |
a model which simply takes a movie ID in and returns the movie bias out. 00:16:43.600 |
In other words, it does nothing but look up that movie in the movie bias table and return its bias. 00:16:54.840 |
I then combine that bias with the actual name of each movie, and print out the top and bottom of the list. 00:17:01.980 |
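A sketch of that step, assuming `movie_in` and `mb` are the movie-id Input and the flattened movie-bias embedding output from the bias model, and `top_movies` is an array of the 2,000 most-rated movie ids (all names here are illustrative):

```python
import numpy as np
from keras.models import Model

get_movie_bias = Model(movie_in, mb)              # movie id in, learned bias out
biases = get_movie_bias.predict(top_movies).squeeze()
worst = top_movies[np.argsort(biases)[:15]]       # most negative biases
best = top_movies[np.argsort(biases)[-15:]]       # most positive biases
```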
So according to MovieLens, the worst movie of all time is the Church of Scientology classic Battlefield Earth. 00:17:12.760 |
So this is interesting, because these ratings are quite a lot more sophisticated than a simple average rating. 00:17:20.440 |
What this is saying is that these have been normalized for the fact that some reviewers are more positive than others, 00:17:29.520 |
and some people are watching better or crappier films than others, and so this bias is removing 00:17:34.280 |
all of that noise and really telling us, after removing all of that noise, these are the least 00:17:40.280 |
good movies -- and Battlefield Earth is even worse than Spice World by a significant margin. 00:17:49.120 |
On the other hand, here are the best. Miyazaki fans will be pleased to see Howl's Moving Castle near the top. 00:18:02.640 |
Perhaps what's more interesting is to try and figure out what's going on not in the biases but in the latent factors. 00:18:19.120 |
The latent factors are a little bit harder to interpret, because for every movie we have 00:18:24.680 |
50 of them -- in the Excel spreadsheet we had 5, but in our version we have 50. 00:18:30.120 |
So what we want to do is take those 50 latent factors and find two or three combinations of them that capture as much of the information as possible. 00:18:41.440 |
The way we do this, the details aren't important but a lot of you will already be familiar 00:18:46.640 |
with it, which is that there's something called PCA, or Principal Components Analysis. 00:18:52.040 |
Principal Components Analysis does exactly what I just said. 00:18:55.360 |
It looks through a matrix, in this case it's got 50 columns, and it says what are the combinations 00:19:01.140 |
of columns that we can add together because they tend to move in the same direction. 00:19:06.240 |
And so in this case we say start with our 50 columns, and I want to create just three 00:19:10.120 |
columns that capture all of the information of the original 50. 00:19:15.240 |
If you're interested in learning more about how this works, PCA is something which is 00:19:19.080 |
kind of everywhere on the internet, so there's lots of information about it. 00:19:22.120 |
But as I say, the details aren't important, the important thing to recognize is that we're 00:19:26.080 |
just squishing our 50 latent factors down into 3. 00:19:30.440 |
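A hedged sketch of that with scikit-learn, assuming `movie_emb` is the trained (n_movies, 50) movie embedding matrix pulled out of the model:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
movie_pca = pca.fit_transform(movie_emb)          # shape (n_movies, 3)
fac0, fac1, fac2 = movie_pca[:, 0], movie_pca[:, 1], movie_pca[:, 2]
```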
So if we look at the first PCA factor, and we sort on it, we can see that at one end 00:19:38.400 |
we have fairly well-regarded movies like "The Godfather", "Pulp Fiction", "The Usual Suspects" and so forth. 00:19:53.000 |
At the other end we have things like Ace Ventura and Robocop 3, which are perhaps not so classic. 00:19:59.600 |
So our first PCA factor is some kind of classic score. 00:20:07.680 |
On our second one, we have something similar but actually very different. 00:20:12.040 |
At one end we've got 10 movies that are huge Hollywood blockbusters with lots of special effects. 00:20:20.360 |
And at the other end we have things like Annie Hall and Brokeback Mountain, which are kind of the opposite of that. 00:20:30.200 |
So there's another dimension there, and it's the second most important one. 00:20:35.500 |
The first factor is the most important dimension by which people judge movies differently. 00:20:39.960 |
This is the second most important one by which people judge movies differently. 00:20:43.600 |
And then the third most important one by which people judge movies differently is something 00:20:47.240 |
where at one end we have a bunch of violent and scary movies, and at the other end we 00:20:57.260 |
have some very happy movies. And for those of you who haven't seen Babe -- Australian movie, happiest movie ever. 00:21:02.440 |
It's about a small pig and its adventures and its path to success, so happiest movie ever. 00:21:11.680 |
It's not saying that these factors are good or bad or anything like that, it's just saying 00:21:17.120 |
that these are the things that when we've done this matrix decomposition have popped 00:21:23.560 |
out as being the ways in which people are differing in their ratings for different kinds 00:21:30.360 |
So one of the reasons I wanted to show you this is to say that these kinds of SGD-learned latent factors are something you can actually interpret. 00:21:45.160 |
Admittedly it's not great to go in and look at every one of those fifty latent factor coefficients 00:21:50.080 |
in detail, but you have to think about how to visualize them, how to look at them. 00:21:56.960 |
In this case, I actually went a step further and I grabbed a couple of principal components and plotted them against each other. 00:22:04.680 |
And so with pictures, of course, you can start to see things in multiple dimensions. 00:22:09.920 |
And so here I've got the first and third principal components, and you can see on the far right-hand 00:22:14.900 |
side here we have more of the Hollywood type movies, and at the far left some of the more 00:22:21.400 |
classic movies, and at the top some of the more violent movies, and at the bottom some of the 00:22:25.240 |
happier movies -- and one is so far towards happy that it's right off the bottom of the chart. 00:22:31.160 |
And so then if you wanted to find a movie that was violent and classic, you would go 00:22:37.920 |
into the top left, and yeah, Kubrick's A Clockwork Orange would probably be the one most people would pick. 00:22:43.360 |
Or if you wanted to come up with something that was very Hollywood and very non-violent, 00:22:50.600 |
you would be down here in Sleepless in Seattle. 00:22:53.960 |
You can really learn a lot by looking at these kinds of models, but you don't do it by looking 00:23:01.360 |
at the coefficients, you do it by visualizations, you do it by interrogating it. 00:23:07.240 |
And so I think this is a big difference, but for any of you that have done much statistics 00:23:11.680 |
before or have a background in the social sciences, you've spent most of your time doing 00:23:16.120 |
regressions and looking at coefficients and t-tests and stuff. 00:23:20.940 |
This is a world where you're asking the model questions and getting the model's answers, which is quite a different way of working. 00:23:32.000 |
I mentioned I would talk briefly about this syntax. 00:23:37.800 |
And this syntax is something that we're going to be using a lot more of, and it's part of what's called the Keras functional API. 00:23:46.440 |
The Keras functional API is a way of doing exactly the same things that you've already 00:23:57.680 |
learned how to do, but with a lot more flexibility.
The API you've learned so far is the sequential API, that's where you use the word sequential, 00:24:03.200 |
and then you write in order the layers of your neural network. 00:24:08.840 |
But what if you want to do something like what we wanted to do just now, where we had 00:24:13.040 |
like 2 different things coming in, we had a user ID coming in and a movie ID coming in, 00:24:18.200 |
and each one went through its own embedding, and then they got multiplied together. 00:24:26.440 |
So the functional API was designed to answer this question. 00:24:31.360 |
The first thing to note about the functional API is that you can do everything you can 00:24:37.480 |
And here's an example of something you could do perfectly well with the sequential API, which 00:24:46.080 |
is just a simple stack of dense layers. Every functional API model starts with an input layer, and then you assign that to some variable. 00:24:54.240 |
And then you list each of the layers in order, and for each of them, after you've provided 00:25:00.280 |
the details for that layer, you then immediately call the layer, passing in the output of the previous layer. 00:25:08.680 |
So this passes in inputs and calls the result x, and then this passes in our x, and this is our 00:25:14.580 |
new version of x, and then this next dense layer gets that version of x and returns the output. 00:25:20.080 |
So you can see that each layer is saying what its previous layer is. 00:25:24.640 |
So it's doing exactly the same thing as a sequential API, just in a different way. 00:25:33.080 |
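In code, that pattern looks like this -- a generic example in the spirit of the Keras docs, with arbitrary layer sizes:

```python
from keras.layers import Input, Dense
from keras.models import Model

inputs = Input(shape=(784,))                      # the input "layer", assigned to a variable
x = Dense(64, activation='relu')(inputs)          # each layer is called on the previous output
x = Dense(64, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)
model = Model(inputs, predictions)                # finally say what goes in and what comes out
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
```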
Now as the docs note here, the sequential model is probably a better choice to implement a network like this one. 00:25:47.440 |
On the other hand, the model that we just looked at would be quite difficult, if not 00:25:54.420 |
impossible to do with the sequential model API, but with the functional API, it was very 00:26:01.360 |
We created a whole separate model which gave an output u for the user, and that was the result 00:26:09.240 |
of creating an embedding, where we said an embedding has its own input and then goes 00:26:14.680 |
through an embedding layer, and then we returned both the input to that and the embedding layer's output. 00:26:21.040 |
So that gave us our user input, our user embedding, our movie input and our movie embedding. 00:26:29.640 |
And then we did a similar thing to create two little models for our bias terms. 00:26:33.320 |
They were both things that grabbed an embedding returning a single output, and then flattened it. 00:26:41.720 |
And so now we've got four separate models, and so we can merge them -- Keras has both a capital-M Merge layer and a small-m merge function. 00:26:52.520 |
In general, you will be using the small m merge. 00:26:55.120 |
I'm not going to go into the details of why they're both there. 00:27:00.360 |
If something weird happens to you with merge, try remembering to use the small m merge. 00:27:05.560 |
The small m merge takes two previous outputs that you've just created using the functional 00:27:11.240 |
API and combines them in whatever way you want, in this case the dot product. 00:27:18.320 |
And so that grabs our user and movie embeddings and takes the dot product. 00:27:23.240 |
We grab the output of that and our user bias and take the sum, and then the output of that and our movie bias and take the sum again. 00:27:33.440 |
So that's a functional API to creating that model. 00:27:38.600 |
At the end of which, we then use the model function to actually create our model, saying 00:27:44.640 |
what are the inputs to the model and what is the output of the model. 00:27:49.800 |
So you can see this is different to usual because we've now got multiple inputs. 00:27:54.720 |
So then when we call fit, we now have to pass in an array of inputs, a user_id and movie_id. 00:28:03.600 |
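Putting those pieces together, here is a sketch of the whole bias model in Keras 1 style. The sizes, the `ratings` dataframe, and the training settings are assumptions for illustration:

```python
from keras.layers import Input, Embedding, Flatten, merge
from keras.models import Model
from keras.regularizers import l2

user_in = Input(shape=(1,), dtype='int64', name='user_in')
movie_in = Input(shape=(1,), dtype='int64', name='movie_in')
u = Embedding(n_users, n_factors, input_length=1, W_regularizer=l2(1e-4))(user_in)
m = Embedding(n_movies, n_factors, input_length=1, W_regularizer=l2(1e-4))(movie_in)
ub = Flatten()(Embedding(n_users, 1, input_length=1)(user_in))    # user bias
mb = Flatten()(Embedding(n_movies, 1, input_length=1)(movie_in))  # movie bias

x = merge([u, m], mode='dot')      # dot product of user and movie embeddings
x = Flatten()(x)
x = merge([x, ub], mode='sum')     # add the user bias
x = merge([x, mb], mode='sum')     # add the movie bias

model = Model([user_in, movie_in], x)             # two inputs, one output
model.compile(optimizer='adam', loss='mse')
model.fit([ratings.userId, ratings.movieId], ratings.rating,
          batch_size=64, nb_epoch=1)              # note the list of inputs when fitting
```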
So the functional API is something that we're going to be using increasingly from now on. 00:28:10.400 |
Now that we've kind of learned just about all the basic architectures, we're going to be 00:28:15.360 |
starting to build more exotic architectures for more special cases, and we'll be using the functional API to do that. 00:28:22.960 |
Is the only reason to use an embedding layer so that you can provide a list of integers -- 00:28:30.520 |
that is, is the only reason to use an embedding layer so that you can use integers as input? 00:28:36.040 |
So instead of using an embedding layer, we could have one-hot encoded all of those user 00:28:40.120 |
IDs and one-hot encoded all of those movie IDs and created dense layers on top of them 00:28:45.680 |
and it would have done exactly the same thing. 00:28:54.640 |
Why choose 50 latent factors and then reduce them down with a principal component analysis? 00:29:00.720 |
Why not just have 3 latent factors to begin with? 00:29:05.720 |
If we only use 3 latent factors, then our predictive model would have been less accurate. 00:29:15.560 |
So we want an accurate predictive model so that when people come to our website, we can 00:29:21.240 |
do a good job of telling them what movie to watch. 00:29:26.360 |
But then for the purpose of our visualization of understanding what those factors are doing, 00:29:30.580 |
we want a small number so that we can interpret them more easily. 00:29:34.960 |
Okay, so one thing you might want to try during the week is taking one or two of your models 00:29:43.120 |
and converting them to use the functional API. 00:29:45.960 |
Just as a little thing, you could try to start to get the hang of how this API looks. 00:29:52.200 |
Are these functional models how we would add additional information to images in CNNs, like metadata about each image? 00:30:02.480 |
In general, the idea of adding additional information to, say, a CNN basically means having multiple inputs. 00:30:10.040 |
This happens in collaborative filtering a lot. 00:30:12.720 |
You might have a collaborative filtering model that as well as having the ratings table, 00:30:18.840 |
you also have information about what genre the movie is in, maybe the demographic information 00:30:25.820 |
So you can incorporate all that stuff by having additional inputs. 00:30:30.240 |
And so with a CNN, for example, in the new Kaggle fish recognition competition, one of the things 00:30:41.640 |
that turns out to be a useful predictor -- this is a leakage problem -- is the size of the image. 00:30:47.340 |
So you could have another input which is the height and width of the image, just as integers, 00:30:52.600 |
and have that as a separate input which is concatenated to the output of your convolutional 00:30:56.760 |
layers after the first flatten layer, and then your dense layers can incorporate 00:31:01.240 |
both the convolutional outputs and your metadata. That would be a good example. 00:31:06.520 |
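Here is a hypothetical sketch of that idea with the functional API. The conv-feature shape, layer sizes and class count are illustrative, not taken from the lecture:

```python
from keras.layers import Input, Dense, Dropout, Flatten, merge
from keras.models import Model

conv_feat = Input(shape=(512, 14, 14))            # pre-computed convolutional outputs (shape assumed)
size_in = Input(shape=(2,))                       # height and width of the original image
x = Flatten()(conv_feat)
x = merge([x, size_in], mode='concat')            # concatenate metadata onto the conv features
x = Dense(512, activation='relu')(x)
x = Dropout(0.5)(x)
out = Dense(8, activation='softmax')(x)           # e.g. the 8 fish classes
model = Model([conv_feat, size_in], out)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```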
That's a great question, two great questions. 00:31:10.240 |
So you might remember from last week that this whole thing about collaborative filtering was really part of a journey. 00:31:19.720 |
And the journey is to NLP, natural language processing. 00:31:30.680 |
This is a question about collaborative filtering. 00:31:33.320 |
So if we need to predict the missing values -- the NaNs or the 0.0 -- so if a user hasn't watched 00:31:42.440 |
a movie, what would be the prediction or how do we go about predicting that? 00:31:53.840 |
So this is really the key purpose of creating this model is so that you can make predictions 00:32:00.840 |
for movie user combinations you haven't seen before. 00:32:06.160 |
And the way you do that is to simply do something like this. 00:32:14.340 |
You just call model.predict and pass in a movieId userId pair that you haven't seen 00:32:22.560 |
And all that's going to do is it's going to take the dot product of that movie's latent 00:32:28.440 |
factors and that user's latent factors and add on those biases and return you back the predicted rating. 00:32:38.680 |
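Something like this, where the ids are just examples:

```python
import numpy as np

# Predict the rating for a (user, movie) pair that has no rating in the data
model.predict([np.array([3]), np.array([6])])
```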
And so if this was a Kaggle competition, that's how we would generate our submission: 00:32:45.040 |
we would take their test set, which would be a bunch of 00:32:48.000 |
movie-user pairs that we haven't seen before, and predict each one. 00:32:58.280 |
Collaborative filtering is extremely useful in itself. 00:33:05.000 |
Without any doubt, it is far more commercially important right now than NLP is. 00:33:10.600 |
Having said that, fast.ai's mission is to impact society in as positive a way as possible, 00:33:18.400 |
and doing a better job of predicting movie ratings is not necessarily the best way to do that. 00:33:23.160 |
So we're maybe less excited about collaborative filtering than some people in industry are. 00:33:30.720 |
NLP, on the other hand, can be a very big deal. 00:33:34.780 |
If you can do a good job, for example, of reading through lots of medical journal articles 00:33:42.040 |
or family histories and patient notes, you could be a long way towards creating a fantastic 00:33:48.340 |
diagnostic tool to use in the developing world to help bring medicine to people who don't 00:33:53.160 |
currently have it, which is almost as good as telling them not to watch Battlefield Earth. 00:34:04.400 |
In order to do this, we're going to look at a particular dataset. 00:34:09.000 |
This dataset is like a really classic example of what people do with natural language processing, which is sentiment analysis. 00:34:17.840 |
Sentiment analysis means that you take a piece of text -- it could be a phrase, a sentence, 00:34:23.680 |
a paragraph, or a whole document -- and decide whether or not that text expresses a positive or negative sentiment. 00:34:33.520 |
Keras actually comes with such a dataset, which is called the IMDb sentiment dataset. 00:34:41.320 |
The IMDb sentiment dataset was originally developed by the Stanford AI group, and the paper that introduced it is worth a read. 00:35:03.680 |
They talk about all the details about what people try to do with sentiment analysis. 00:35:09.760 |
In general, although academic papers tend to be way more math-y than they should be, 00:35:17.080 |
the introductory sections often do a great job of capturing why this is an interesting 00:35:22.520 |
problem, what kind of approaches people have taken, and so forth. 00:35:26.120 |
The other reason papers are super helpful is that you can skip down to the experiment 00:35:29.920 |
section -- every machine learning paper pretty much has an experiment section -- and find out what result they got. 00:35:40.160 |
Here they showed that using this dataset they created of IMDb movie reviews, along with 00:35:46.000 |
their sentiment, their full model plus an additional model got a score of 88.33% accuracy 00:35:57.720 |
They had another one here where they also added in some unlabeled data. 00:36:01.720 |
We're not going to be looking at that today, that would be a semi-supervised learning problem. 00:36:05.360 |
So today our goal is to beat 88.33% accuracy, that being the academic state of the art for this dataset. 00:36:19.440 |
To grab it, we can just say from keras.datasets import imdb. 00:36:23.840 |
Keras actually kind of fiddles around with it in ways that I don't really like, so I 00:36:27.480 |
actually copied and pasted from the Keras file these three lines to import it directly 00:36:34.680 |
So that's why rather than using the Keras dataset directly, I'm using these three lines. 00:36:43.220 |
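For reference, the standard loader and word index look like this. Note that `imdb.load_data()` remaps the ids (reserving a few low ids for special tokens), which is the "fiddling" referred to above; the lecture loads the raw data via lines copied from the Keras source instead:

```python
from keras.datasets import imdb

(x_train, labels_train), (x_test, labels_test) = imdb.load_data()
idx = imdb.get_word_index()   # maps word -> integer id
```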
There are 25,000 movie reviews in the training set, and here's an example of one: 00:36:53.960 |
"Bromwell High is a cartoon comedy. It ran at the same time as some other programs..." 00:37:01.600 |
So the dataset actually does not quite come to us in this format -- it actually comes to 00:37:13.840 |
us as lists of word IDs. And so these IDs we can then look up in the word index, which is something that they provide. 00:37:22.800 |
And so for example, if we look at the word index, as you can see, it basically maps every word to an integer. 00:37:44.480 |
The integers are in order of how frequently those words appeared in this particular corpus, which is convenient. 00:37:51.480 |
So then I also create a reverse index, which goes from ID to word. 00:37:59.480 |
So I can see that in the very first training example, the very first word is word number 23022. 00:38:07.700 |
So if I look up 23022 in the index-to-word mapping, it is the word Bromwell. 00:38:13.460 |
And so then I just go through and I map everything in that first review through index-to-word and 00:38:20.280 |
join it together with a space, and that's how we can turn the data that they give us back into readable text. 00:38:29.520 |
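That step is just a dictionary reversal. This sketch assumes `x_train` holds the reviews as lists of raw word ids matching `idx`, as loaded in the lecture:

```python
# id -> word, then rebuild the first review as text
idx2word = {v: k for k, v in idx.items()}
' '.join(idx2word[o] for o in x_train[0])
```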
As well as providing the reviews, they also provide labels. 00:38:34.080 |
One is positive sentiment, zero is negative sentiment. 00:38:39.640 |
So our goal is to take these 25,000 reviews that look like this and predict whether each one 00:38:46.200 |
is positive or negative in sentiment, and the data is actually provided to us as those lists of word IDs plus the labels. 00:38:54.960 |
Is everybody clear on the problem we are trying to solve and how it's laid out? 00:39:03.080 |
So there's a couple of things we can do to make it simpler. 00:39:08.400 |
So currently there are some pretty unusual words -- like word number 23022, Bromwell. 00:39:15.760 |
And if we're trying to figure out how to deal with all these different words, having to 00:39:21.440 |
figure out the various ways in which the word Bromwell is used is probably not going to 00:39:25.680 |
gain us much for a lot of computation and memory cost. 00:39:28.880 |
So we're going to truncate the vocabulary down to 5000. 00:39:32.000 |
And it's very easy to do that because the words are already ordered by frequency. 00:39:37.320 |
I simply go through everything in our training set and I just say: if the word ID is less 00:39:45.420 |
than this vocab size of 5000, we'll leave it as it is, otherwise we'll replace it with the last ID in the vocabulary. 00:39:54.160 |
So at the end of this, we now have replaced all of our rare words with a single ID. 00:40:04.720 |
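A sketch of that clipping step, using `vocab_size - 1` as the single "rare word" id, consistent with the 4999 that shows up later:

```python
import numpy as np

vocab_size = 5000
# Any id at or above the cutoff becomes the sentinel id vocab_size - 1
trn = [np.array([i if i < vocab_size - 1 else vocab_size - 1 for i in review]) for review in x_train]
test = [np.array([i if i < vocab_size - 1 else vocab_size - 1 for i in review]) for review in x_test]
```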
The reviews are sometimes up to 2493 words long. 00:40:17.960 |
As you will see, we actually need to make all of our reviews the same length. 00:40:26.040 |
Allowing this 2493 word review would again use up a lot of memory and time. 00:40:32.880 |
So we're going to decide to truncate every review at 500 words. 00:40:37.320 |
And that's more than twice as big as the mean review length. 00:40:43.280 |
So what we now need to do is create a rectangular matrix. Question: what if using the word 5000 for all the rare words introduces a bias? 00:40:52.840 |
So we're about to learn a machine learning model, and so the vast majority of the time 00:41:21.160 |
it comes across the word 5000, it's actually going to mean 'rare word'. 00:41:29.560 |
And it's going to learn to deal with that as best as it can. 00:41:34.280 |
The idea is the rare words don't appear too often, so hopefully this is not going to cause 00:41:40.720 |
We're not just using frequencies, all we're doing is we're just truncating our vocabulary. 00:41:57.840 |
Question: so for that word 5000, can we just replace it with some neutral word to take care of that? 00:42:11.800 |
The fact that occasionally the word 1987 actually pops up is totally insignificant. 00:42:20.700 |
We could replace it with -1, it's just a sentinel value which has no meaning. 00:42:27.540 |
It's one of these design decisions which it's not worth spending a lot of time thinking about 00:42:53.920 |
So I just picked whatever happened to be easiest at the time. 00:42:58.440 |
As I said, I could personally always use -1, it's just not important. 00:43:04.600 |
What is important is that we have to create a rectangular matrix in which every review has the same length. 00:43:17.480 |
So quite conveniently Keras comes with something called pad_sequences that does that for us. 00:43:22.280 |
It takes everything greater than this length and truncates it, and everything less than 00:43:28.520 |
that length it pads with whatever we've asked for, which in this case is zeros. 00:43:34.920 |
So at the end of this, the shape of our training set is now a NumPy array of 25,000 rows by 500 columns. 00:43:43.460 |
And as you can see, it's padded the front with zeros, such that it has 500 words in 00:43:53.000 |
And you can see that Bromwell has now been replaced not with 5000, but with 4999. 00:43:59.320 |
So this is our same movie review again after going through that padding process. 00:44:08.320 |
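The padding step is one call per dataset:

```python
from keras.preprocessing import sequence

seq_len = 500
# Truncate long reviews and pad short ones with zeros (at the front, by default)
trn = sequence.pad_sequences(trn, maxlen=seq_len, value=0)
test = sequence.pad_sequences(test, maxlen=seq_len, value=0)
print(trn.shape)   # (25000, 500)
```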
I know that there's some reason that Keras decided to pad the front rather than the back. 00:44:16.800 |
Since it's what it does by default, I don't worry about it, I don't think it's important. 00:44:23.880 |
So now that we have a rectangular matrix of numbers, and we have some labels, we can use 00:44:30.240 |
the exact techniques we've already learned to create a model. 00:44:34.880 |
And as per usual, we should try to create the simplest possible model we can to start 00:44:39.880 |
And we know that the simplest model we can is one with one hidden layer in the middle. 00:44:44.600 |
Or at least this is the simplest model that we generally think ought to be pretty useful 00:44:51.760 |
Now here is why we started with collaborative filtering, and that's because we're starting 00:44:57.600 |
So if you think about it, our inputs are word IDs, and we want to convert each one into a vector. 00:45:08.100 |
So again, rather than one-hot encoding this into a 5000-column-long huge input thing and 00:45:17.600 |
then doing a matrix product, an embedding just says: look up that word ID and grab its vector directly. 00:45:27.180 |
So it's just a computational and memory shortcut to creating a one-hot encoding followed by a matrix product. 00:45:36.380 |
So we're creating an embedding where we are going to have 5000 latent factors, or 5000 embeddings -- one per word. 00:45:44.400 |
Each one is going to have 32 items in this case, rather than 50. 00:45:53.400 |
So then we're going to flatten that, have our single dense layer, a bit of dropout, and then our output. 00:46:04.520 |
You can see it's a good idea to go through and make sure you understand why all these shapes are what they are. 00:46:10.000 |
That's something you can do during the week -- double-check that you're comfortable with it. 00:46:15.040 |
So this is the size of each of the weight matrices at each point. 00:46:23.160 |
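Here is a sketch of that single-hidden-layer model; the hidden size and dropout values below are plausible choices, not necessarily the exact ones used:

```python
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, Dropout

model = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len),  # 5000 words -> 32-dim vectors
    Flatten(),                                        # 500 x 32 -> 16,000
    Dense(100, activation='relu'),                    # the single hidden layer (size assumed)
    Dropout(0.7),                                     # dropout amount assumed
    Dense(1, activation='sigmoid')])                  # single-column sentiment output
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(trn, labels_train, validation_data=(test, labels_test),
          nb_epoch=2, batch_size=64)                  # nb_epoch in Keras 1, epochs in Keras 2
```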
And after two epochs, we have 88% accuracy on the validation set. 00:46:36.960 |
And so let's just compare that to Stanford, where they had 88.3 and we have 88.04. 00:46:52.320 |
So we're not yet there, but we're well on the right track. 00:46:55.120 |
This is always the question about why have X number of filters in your convolutional 00:47:06.880 |
layer or why have X number of outputs in your dense layer. 00:47:11.280 |
It's just a case of trying things and seeing what works and also getting some intuition 00:47:19.440 |
In this case, I think 32 was the first I tried, I kind of felt like from my understanding 00:47:27.160 |
of really big embedding models, which we'll learn about shortly, even 50 dimensions is 00:47:32.760 |
enough to capture vocabularies of size 100,000 or more. 00:47:37.520 |
So I felt like 32 was likely to be more than enough to capture a vocabulary of size 5,000. 00:47:42.280 |
I tried it and I got a pretty good result, and so I've basically left it there. 00:47:46.920 |
If at some point I discovered that I wasn't getting great results, I would try increasing it. 00:47:58.520 |
You can always use a softmax instead of a sigmoid, it just means that you would have 00:48:04.240 |
to change your labels, because remember our labels were just 1's or 0's -- a single column. 00:48:15.500 |
If I wanted to use a softmax, I would have to create two columns. 00:48:18.440 |
It wouldn't just be a 1 or a 0, it would be (1, 0) or (0, 1). 00:48:25.320 |
In the past, I've generally stuck to using softmax and then categorical cross-entropy 00:48:30.380 |
loss just to be consistent, because then regardless of whether you have two classes or more than 00:48:35.080 |
two classes, you can always do the same thing. 00:48:38.600 |
In this case, I thought I want to show the other way that you can do this, which is to 00:48:43.440 |
just have a single column output -- and remember, a sigmoid is exactly the same thing as a two-class softmax. 00:48:53.920 |
And so rather than using categorical cross-entropy, we use binary cross-entropy and again it's 00:48:57.600 |
exactly the same thing, it just means I didn't have to worry about one hot encoding the output 00:49:16.480 |
The important thing as far as I'm concerned is what is the benchmark that the Stanford 00:49:21.680 |
people got and they compared it to a range of other previous benchmarks and they found 00:49:28.520 |
And I'm sure there have been other techniques that have come out since that are probably 00:49:32.560 |
better, but I haven't seen them in any papers yet, so this is my target. 00:49:41.280 |
You can see that we can, in one second of training, get an accuracy which is pretty competitive with the state of the art. 00:49:51.760 |
And so hopefully you're starting to get a sense that a neural net with one hidden layer 00:49:57.720 |
is a great starting point for nearly everything, you now know how to create a pretty good sentiment 00:50:03.080 |
analysis model and before today you didn't, so that's a good step. 00:50:07.160 |
So an embedding is something I think would be particularly helpful to explain if we go back to our spreadsheet. 00:50:41.520 |
And remember that the actual data coming in does not look like this crosstab; it looks like the list of individual ratings. 00:50:51.520 |
So when we then come along and say, okay, what do we predict the rating would be for 00:50:56.160 |
user ID 1 for movie ID 1172, we actually have to go through our list of movie IDs and find 00:51:08.280 |
where that movie is -- say it's number 31 in the list -- and then, having found it, look up its latent factors. 00:51:15.400 |
And then we have to do the same thing for user ID number 1 and find its latent factor, 00:51:19.920 |
and then we have to multiply the two together. 00:51:21.900 |
So that step of taking an ID, finding it in a list, and returning the vector it corresponds to -- that's what an embedding does. 00:51:31.260 |
So an embedding returns a vector which is of length, in this case 32. 00:51:40.160 |
So in the output of this, the None always means your mini-batch size. 00:51:46.840 |
So for each movie review, for each of the 500 words in that sequence, you're getting a 32-element vector. 00:51:59.500 |
And so therefore you have a mini-batch size by 500 by 32 tensor coming out of this layer. 00:52:07.520 |
That gets flattened, so 500 times 32 is 16,000, and that is the input into your first dense layer. 00:52:16.840 |
Q. And I also think it might be helpful to show that for a review, instead of having 00:52:23.920 |
it in words, it's being entered as a sequence of numbers, where each number is -- 00:52:32.160 |
A. So we look at this first review and we take -- and remember this has now been truncated 00:52:37.120 |
to 4999; this one is still 309, so it's going to take 309, and it's going to look up the 309th 00:52:44.320 |
vector in the embedding, and it's going to return it, and then it's going to concatenate all of those vectors together. 00:52:54.880 |
An embedding is a shortcut to a one-hot encoding followed by a matrix product. 00:53:04.520 |
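You can convince yourself of that equivalence with a few lines of NumPy:

```python
import numpy as np

emb = np.random.randn(5, 3)             # a tiny embedding matrix: 5 "words", 3 factors each
word_id = 2
one_hot = np.zeros(5)
one_hot[word_id] = 1
# One-hot times the matrix picks out exactly the same row as indexing does
assert np.allclose(np.dot(one_hot, emb), emb[word_id])
```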
Can you show us words which have similar latent features? 00:53:07.600 |
I'm hoping these words would be synonyms or semantically similar. 00:53:13.160 |
Q. And who made the labels, and why should I believe them, it seems difficult and subjective? 00:53:18.960 |
A. Well, that's the whole point of sentiment analysis and these kinds of things -- it is somewhat subjective. 00:53:24.680 |
So the interesting thing about NLP is that we're trying to capture something which is inherently fuzzy. 00:53:31.560 |
So in this case you would have to read the original paper to find out how they got these labels. 00:53:40.880 |
The way that people tend to get labels varies; in this case it's the IMDb dataset. 00:53:46.600 |
IMDB has ratings, so you could just say anything higher than 8 is very positive and anything 00:53:51.600 |
lower than 2 is very negative, and we'll throw away everything in the middle. 00:53:57.080 |
The other way that people tend to label academic data sets is to send it off to Amazon Mechanical 00:54:02.400 |
Turk and pay them a few cents to label each thing. 00:54:07.120 |
So that's the kind of ways that you can label stuff. 00:54:10.920 |
Q. And there are places where people don't just use Mechanical Turk, but they specifically pay experts to do the labeling? 00:54:18.560 |
A. Yeah, you certainly wouldn't do that for this because the whole purpose here is to 00:54:28.080 |
Q. We know of a team at Google that does that. 00:54:30.360 |
A. Yeah, so for example -- and I know when I was in medicine, we went through all these 00:54:36.200 |
radiology reports and tried to capture which ones were critical findings and which ones 00:54:39.960 |
weren't critical findings, and we used good radiologists rather than Mechanical Turk for 00:54:45.920 |
Q. So we're not considering any sentence construction or bigrams -- just a bag of words and the 00:54:55.120 |
literal set of words that are being used in a comment? 00:55:01.060 |
If you think about it, this dense layer here has 1.6 million parameters. 00:55:06.480 |
It's connecting every one of those 500 inputs to our output. 00:55:14.840 |
And not only that, but it's doing that for every one of the incoming factors. 00:55:22.160 |
So it's creating a pretty complex kind of big Cartesian product of all of these weights, 00:55:30.240 |
and so it's taking account of the position of a word in the overall sentence. 00:55:35.820 |
It's not terribly sophisticated, and it's not taking account of its position compared 00:55:40.600 |
to other words, but it is taking account of whereabouts it occurs in the whole review. 00:55:46.920 |
So it's not like -- it's the dumbest kind of model I could come up with. 00:55:55.520 |
It's a good starting point, but we would expect that with a little bit of thought, which we're about to apply, we can do better. 00:56:06.920 |
So the slightly better model -- hopefully you guys have all predicted what that would be -- is a convolutional neural network. 00:56:12.960 |
And the reason I hope you predicted that is because (a) we've already talked about how 00:56:16.360 |
CNNs are taking over the world, and (b) specifically they're taking over the world any time we 00:56:26.220 |
have data with some kind of ordering. One word comes after another word; it has a specific ordering. 00:56:34.580 |
We can't use a 2D convolution because a sentence is not in 2D -- a sentence is in 1D. 00:56:42.020 |
So a 1D convolution is even simpler than a 2D convolution. 00:56:45.420 |
We're just going to grab a string of a few words, and we're going to take their embeddings, 00:56:51.760 |
and we're going to take that string, and we're going to multiply it by some filter. 00:56:56.240 |
And then we're going to move that sequence along our sentence. 00:57:01.520 |
So this is our normal next place we go as we try to gradually increase the complexity, 00:57:10.920 |
which is to grab our simplest possible CNN, which is a convolution, dropout, max pooling. 00:57:18.940 |
And then flatten that, and then we have our dense layer and our output. 00:57:22.880 |
So this is exactly like what we did when we were looking at gradually improving our State Farm model, 00:57:29.320 |
But rather than having convolution 2D, we have convolution 1D. 00:57:37.160 |
How many filters do you want to create, and what is the size of your convolution? 00:57:42.480 |
Originally I tried 3 here, 5 turned out to be better. 00:57:46.920 |
So I'm looking at 5 words at a time and multiplying them by each one of 64 filters. 00:57:54.440 |
So that is going to return -- so we're going to start with the same embedding as before. 00:58:02.060 |
So we take our sentences and we turn them into a 500x32 matrix for each of our inputs. 00:58:10.680 |
We then put it through our convolution, and because our convolution has a border mode 00:58:16.240 |
is same, we get back exactly the same shape that we gave it. 00:58:21.320 |
We then put it through our 1D max pooling and that will halve its size, and then we 00:58:25.360 |
stick it through the same dense layers as we had before. 00:58:29.540 |
So that's a really simple convolutional neural network for words. 00:58:34.400 |
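Here is a sketch of that 1D convolutional model in Keras 1 syntax; the dense size and dropout amounts are assumptions, while the 64 filters of length 5 match what is described above:

```python
from keras.models import Sequential
from keras.layers import Embedding, Dropout, Convolution1D, MaxPooling1D, Flatten, Dense

conv = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len, dropout=0.2),  # dropout on the embeddings themselves
    Dropout(0.2),                                                  # dropout of whole words
    Convolution1D(64, 5, border_mode='same', activation='relu'),   # 64 filters, 5 words at a time
    Dropout(0.2),
    MaxPooling1D(),                                                # halves the sequence length
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])
conv.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
conv.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)
```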
Compile it, run it, and we get 89.47 -- compared to, let's go back to the videotape, 88.33 without the unlabeled data. 00:58:55.680 |
So we have already broken the academic state-of-the-art as at when this paper was written. 00:59:00.240 |
And again, simple convolutional neural network gets us a very, very long way. 00:59:06.800 |
I was going to point out it's 10 to 8 -- maybe time for a break -- but there's also a question. 00:59:12.480 |
Convolution2D for images is easier to understand -- element-wise multiplication and addition -- but how does a 1D convolution work on a sequence of words? 00:59:23.280 |
Don't think of it as a sequence of words because remember it's been through an embedding. 00:59:31.840 |
So it's doing exactly the same thing as we're doing in a 2D convolution, but rather than 00:59:37.840 |
having 3 channels of color, we have 32 channels of embedding. 00:59:45.760 |
So we're just going through it, exactly like in our convolution spreadsheet. 00:59:58.840 |
Remember how in the second one, once we had two filters already, our filter had to be 01:00:05.080 |
a 3x3x2 tensor in order to allow us to create the second layer. 01:00:14.440 |
For us, we now don't have a 3x3x2 tensor, we have a 5x1x32, or more conveniently, a 5x32 tensor. 01:00:26.560 |
So each convolution is going to go through each of the 5 words and each of the 32 embeddings, 01:00:32.560 |
do an element-wise multiplication, and add them all up. 01:00:36.840 |
So the important thing to remember is that once we've done the embedding layer, which 01:00:42.440 |
is always going to be our first step for every NLP model, is that we don't have words anymore. 01:00:48.080 |
We now have vectors which are attempting to capture the information in that word in some 01:00:54.880 |
way, just like our latent factors captured information about a movie and a user in our collaborative filtering model. 01:01:03.720 |
We haven't yet looked at what they do, we will in a moment, just like we did with the 01:01:09.080 |
movie vectors, but we do know from our experience that SGD is going to try to fill out those 01:01:16.360 |
32 places with information about how that word is being used, which allows us to make good predictions. 01:01:26.800 |
Just like when you first learned about 2D convolutions, it took you probably a few days 01:01:31.800 |
of fiddling around with spreadsheets and pieces of paper and Python and checking inputs and 01:01:37.680 |
outputs to get a really intuitive understanding of what a 2D convolution is doing. 01:01:42.440 |
You may find it's the same with a 1D convolution, but it will take you probably a fifth of the 01:01:47.240 |
time to get there because you've really done all the hard work already. 01:01:55.120 |
I think now is a great time to have a break, so let's come back here at 7.57. 01:02:07.680 |
There's a couple of concepts that we come across from time to time in this class for which 01:02:16.080 |
there is no way that me lecturing to you is going to be enough to give you an intuitive understanding of them. 01:02:21.920 |
The first clearly is the 2D convolution, and hopefully you've had lots of opportunities 01:02:28.320 |
to experiment and practice and read -- these are things you have to tackle from many different directions. 01:02:36.880 |
And 2D convolutions in a sense are really 3D, because if it's in full color, you've got three channels. 01:02:41.840 |
Hopefully that's something you've all played with. 01:02:44.280 |
And once you have multiple filters later on in your image models, you still have 3D data and 01:02:50.640 |
you've got more than 3 channels -- you might have 32 filters or 64 filters. 01:02:55.680 |
In this lesson we've introduced one much simpler concept, which is the 1D convolution, which 01:03:06.920 |
is really a 2D convolution, because just like with images we had red, green, blue, now we have 32 embedding channels. 01:03:18.080 |
So that's something you will definitely need to experiment with. 01:03:23.200 |
Create a model with just an embedding layer, look at what the output is, what its shape is, 01:03:27.160 |
what it looks like, and then how a 1D convolution modifies that. 01:03:36.360 |
And then trying to understand what an embedding is is kind of your next big task, if you're 01:03:45.600 |
not already comfortable with them. And if you haven't seen them before today, I'm sure you won't be yet, because this is a big new concept. 01:03:52.640 |
It's not in any way mathematically challenging. 01:03:55.760 |
It's literally looking up an index in an array and returning the thing at that ID. 01:04:00.700 |
So an embedding looking at movie_id 3 just says: go to the third column of the matrix and return what's there. 01:04:13.000 |
They couldn't be mathematically simpler, it's the simplest possible operation. 01:04:20.640 |
But the kind of intuitive understanding of what happens when you put an embedding into 01:04:26.180 |
an SGD and learn a vector which turns out to be useful is something which is kind of 01:04:35.600 |
mind-blowing because as we saw from the movie lens example, with just a dot product and 01:04:45.040 |
this simple lookup something in an index operation, we ended up with vectors which captured all 01:04:52.960 |
kinds of interesting features about movies without us in any way asking it to. 01:04:58.240 |
So I wanted to make sure that you guys really felt like after this class, you're going to 01:05:06.520 |
go away and try and find a dozen different ways of looking at these concepts. 01:05:12.560 |
One of those ways is to look at how other people explain them. 01:05:15.800 |
And Chris Olah has one of the very, very best technical blogs I've come across, and it's quite 01:05:23.640 |
often referred to in this class, and in his Understanding Convolutions post, he actually 01:05:28.760 |
has a very interesting example of thinking about what a dropped ball does as a convolutional 01:05:35.440 |
operation and he shows how you can think about a 1D convolution using this dropped ball analogy. 01:05:44.520 |
Particularly if you have some background in electrical or mechanical engineering, I suspect 01:05:52.880 |
There are many resources out there for thinking about convolutions and I hope some of you 01:05:58.480 |
will share on the forums any that you come across. 01:06:01.640 |
Question -- this one is from just before the break: essentially, are we training the input embeddings as well? 01:06:12.200 |
Yeah, we are absolutely training the input, because the only input we have is 25,000 sequences of word IDs. 01:06:24.000 |
And so we take each of those integers and replace them with a lookup into an embedding matrix. 01:06:32.400 |
Initially that matrix is random, just like in our Excel example. 01:06:37.840 |
We started with a random matrix -- these are all random numbers -- and then we created this 01:06:45.840 |
loss function, which was the sum of the squares of the differences between the dot product and the actual rating. 01:06:52.400 |
And if we then use the gradient descent solver in Excel to solve that, it attempts to modify 01:07:01.280 |
the two embedding matrices (as you can see, the objective is going down) to try and come 01:07:10.640 |
up with the two embedding matrices which give us the best approximation of the original ratings. 01:07:18.720 |
So this Excel spreadsheet is something which you can play with to do exactly what our Python version is doing. 01:07:33.440 |
The only difference is that our version in Python also has L2 regularization. 01:07:45.760 |
So this one's just finished here, so you can see these are no longer random numbers. 01:07:51.520 |
We've now got two embedding matrices which have got the loss function down from 40 to 01:07:55.360 |
5.6, and so you can see, for example, these ratings are now very close to what they're supposed to be. 01:08:03.240 |
So this is exactly what Keras and SGD are doing in our Python example. 01:08:08.280 |
Q. So my question is: is it that we've got an embedding in which each word is a vector? 01:08:20.960 |
Each word in our vocabulary of 5000 has been converted into a vector of 32 elements. 01:08:27.560 |
Q Another question is, what would be the equivalent 01:08:31.960 |
dense network if we didn't use a 2D embedding? 01:08:35.440 |
This is in the initial model, the simple one. 01:08:39.720 |
A dense layer with input of size, embedding size, we have size? 01:08:44.880 |
A I actually don't know what that meant, sorry. 01:08:48.760 |
Q Okay, next question is, does it matter that encoded values which are close by are close 01:08:57.000 |
in color in the case of pictures, which is not true for word vectors? 01:09:01.000 |
For example, 254 and 255 are close as colors, but for words the IDs have no such relation. 01:09:12.200 |
A. No, it doesn't matter -- the word IDs are not used mathematically in any way at all, other than as an index to look up the corresponding vector. 01:09:20.720 |
So the fact that this is movie number 27, the number 27 is not used in any way. 01:09:26.160 |
We just take the number 27 and find its vector. 01:09:31.040 |
So what's important is the values of each latent factor as to whether they're close 01:09:37.040 |
So in the movie example, there were some latent factors that were something about is it a 01:09:42.080 |
And there were some latent factors that were something about is it a violent movie or not? 01:09:48.080 |
It's the similarity on those factors that matters. 01:09:51.800 |
The ID is never ever used, other than is an index to simply index into a matrix to return 01:10:01.080 |
So as Yannette was mentioning, in our case now for the word embeddings, we're looking 01:10:05.840 |
up in our embeddings to return a 32-element vector of floats that are initially random, 01:10:15.000 |
and the model is trying to learn the 32 floats for each of our words that are semantically useful. 01:10:24.640 |
And in a moment we're going to look at some visualizations of that to try and understand what it has learned. 01:10:29.600 |
You can apply the dropout parameter to the embedding layer itself, and what that does 01:10:50.440 |
is it zeroes out at random 20% of each of these 32 embeddings for each word. 01:11:00.160 |
So it's basically avoiding overfitting the specifics of each word's embedding. 01:11:06.080 |
This dropout, on the other hand, is removing at random some of the words, effectively, from each review. 01:11:16.120 |
The significance of which one to use where is not something which I've seen anybody research 01:11:23.560 |
in depth, so I'm not sure that we have an answer that says use this amount in this place. 01:11:30.920 |
I just tried a few different values in different places, and it seems that putting the same 01:11:36.280 |
amount of dropout in all these different spots seems to work pretty well in my experiments. 01:11:44.100 |
If you find you're massively overfitting or massively underfitting, try playing around 01:11:48.920 |
with the various values and report back on the forum and tell us what you find. 01:11:52.880 |
Maybe you'll find some different, better configurations than I've come up with. 01:12:08.840 |
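For reference, here is a rough sketch of where those two kinds of dropout sit, written against the Keras-1-style API used in this lesson; the layer sizes are placeholders and the 0.2 values are just the ones discussed above, not tuned.

```python
from keras.models import Sequential
from keras.layers import Embedding, Dropout, Convolution1D, MaxPooling1D, Flatten, Dense

model = Sequential([
    # dropout argument on the Embedding layer itself (the per-embedding dropout discussed above)
    Embedding(5000, 32, input_length=500, dropout=0.2),
    # a separate Dropout layer applied to the embedding output (the word-level dropout discussed above)
    Dropout(0.2),
    Convolution1D(64, 5, border_mode='same', activation='relu'),
    Dropout(0.2),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid')])
```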
We are taking each of our 5,000 words in our vocabulary and we're replacing them with a 01:12:14.040 |
32 element long vector, which we are training to hopefully capture all of the information 01:12:21.720 |
about what this word means and what it does and how it works. 01:12:26.880 |
You might expect intuitively that somebody might have done this before. 01:12:31.720 |
Just like with ImageNet and VGG, you can get a pre-trained network that says, oh, if you've 01:12:37.920 |
got an image that looks a bit like a dog, well we've had a trained network which has 01:12:43.000 |
seen lots of dogs, so it will probably take your dog image and return some useful predictions 01:12:50.320 |
because we've done lots of dog images before. 01:12:53.640 |
The interesting thing here is your dog picture and the VGG author's dog pictures are not 01:13:00.920 |
They're going to be different in all kinds of ways. 01:13:04.560 |
To get pre-trained weights for images, you have to give somebody a whole pre-trained 01:13:09.200 |
network, which is like 500 megabytes worth of weights in a whole architecture. 01:13:17.740 |
In a document, the word 'dog' always appears the same way. 01:13:24.880 |
It doesn't have different lighting conditions or facial expressions or whatever, it's just 01:13:30.740 |
So the cool thing is in NLP, we don't have to pass around pre-trained networks, we can 01:13:36.720 |
pass around pre-trained embeddings, or as they're commonly known, pre-trained word vectors. 01:13:43.440 |
That is to say, other people have already created big models with big text corpuses 01:13:49.280 |
where they've attempted to build a 32-element vector, or however long vector, which captures 01:13:57.080 |
all of the useful information about what that word is and how it behaves. 01:14:02.200 |
So for example, if we type in 'word-vector-download', you can see that -- this is not quite what 01:14:16.040 |
we wanted -- let's do 'word-embeddings-download'. 01:14:23.720 |
Lots of questions and answers and pages about where we can download pre-trained word embeddings. 01:14:40.120 |
But I guess what was a little unintuitive to me is that I think this means that if I can 01:14:48.680 |
train a corpus on, I don't know, the works of Shakespeare, somehow that tells me something 01:14:53.080 |
about how I can understand movie reviews, and I imagine that in some sense that's true about 01:15:01.680 |
how language is structured and whatnot, but the meaning of the word 'dog' in Shakespeare 01:15:05.640 |
is probably going to be used pretty differently. 01:15:20.880 |
The word vectors that I'm going to be using (and I don't strongly recommend them, but do slightly recommend them) are the GloVe word vectors. 01:15:29.160 |
The other main competition to these is called the Word2Vec word vectors. 01:15:34.440 |
The GloVe word vectors come from a researcher named Jeffrey Pennington from Stanford. 01:15:44.640 |
I will mention that the TensorFlow documentation on the Word2Vec vectors is fantastic. 01:15:51.720 |
So I would definitely highly recommend checking this out. 01:15:56.400 |
The GloVe word vectors have been pre-trained on a number of different corpuses. 01:16:07.320 |
One of them has been pre-trained on all of Wikipedia and a huge database full of newspaper 01:16:13.360 |
articles -- a total of 6 billion words covering a 400,000-word vocabulary. 01:16:21.880 |
And they provide 50-dimensional, 100-dimensional, 200-dimensional and 300-dimensional pre-trained vectors. 01:16:29.140 |
They have another one which has been trained on 840 billion words of a huge dump of the entire Internet (the Common Crawl). 01:16:39.360 |
And then they have another one which has been trained on 2 billion tweets, which I believe 01:16:43.880 |
all of the Donald Trump tweets have been carefully cleaned out prior to usage. 01:16:50.480 |
So in my case, what I've done is I've downloaded the 6 billion token version, and I will show you what that looks like. 01:17:19.480 |
Sometimes these are cased, so you can see for example this particular one includes case. 01:17:27.400 |
There are 2.2 million items of vocabulary in this, sometimes they're uncased. 01:17:36.920 |
Here is the start of the GloVe 50-dimensional word vectors trained on a corpus of 6 billion. 01:17:44.880 |
Here is the word "the," and here are the 50 floats which attempt to capture all of the information about what "the" means and how it behaves. 01:18:05.000 |
And here, for the next token, are the 50 floats that attempt to capture all of the information about it. 01:18:13.160 |
So here is the word "in," here is the word "double quote," here is "apostrophe s." 01:18:19.560 |
So you can see that the GloVe authors have tokenized their text in a very particular way. 01:18:24.680 |
And the idea that "apostrophe s" should be treated as a thing, that makes a lot of sense. 01:18:30.320 |
It certainly has that thinginess in the English language. 01:18:34.640 |
And so indeed, the way the authors of a word-embedding corpus have chosen to tokenize their text really matters. 01:18:43.360 |
And one of the things I quite like about GloVe is that they've been pretty smart, in my opinion, about how they have tokenized things. 01:18:53.760 |
So the question is, how does one create word vectors in general? 01:19:00.080 |
What is the model that you're creating and what are the labels that you're building? 01:19:09.080 |
So one of the things that we talked about getting to at some point is unsupervised learning. 01:19:15.200 |
And this is a great example of unsupervised learning. 01:19:17.440 |
We want to take 840 billion tokens of an internet dump and build a model of something. 01:19:29.360 |
We're trying to capture some structure of this data, in this case, how does the English language work? 01:19:38.040 |
The way that this is done, at least in the Word2Vec example, is quite cool. 01:19:42.640 |
What they do is they take every sentence of, say, 11 words long -- not just every sentence, 01:19:52.800 |
but every 11-word-long string of words that appears in the corpus -- and then they take the middle word. 01:19:58.800 |
The first thing they do is they create a copy of it, an exact copy. 01:20:06.640 |
And then in the copy, they delete the middle word and replace it with some random word. 01:20:17.440 |
So we now have two strings of 11 words, one of which makes sense because it's real, one 01:20:24.640 |
of which probably doesn't make sense because the middle word has been replaced with something 01:20:30.120 |
And so the model task that they create, the label is 1 if it's a real sentence, or 0 if it's the corrupted copy. 01:20:43.440 |
So you can see it's not a directly useful task in any way, unless somebody actually 01:20:49.760 |
comes along and says, "I just found this corpus in which somebody's replaced half of the middle words with random words." 01:20:56.560 |
And it is something where, in order to be able to tackle this task, you're going to have to learn something about language. 01:21:02.360 |
You're going to have to be able to recognize that this sentence doesn't make sense, and that this one does. 01:21:07.440 |
So this is a great example of unsupervised learning. 01:21:10.120 |
Generally speaking in deep learning, unsupervised learning means coming up with a task which 01:21:15.820 |
is as close to the task you're eventually going to be interested in as possible but that doesn't 01:21:20.600 |
require labels, or where labels are really cheap to generate. 01:21:50.520 |
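As a concrete illustration of that labelling trick, here is a small sketch (not the actual Word2Vec code) that builds the dataset just described: real 11-word windows get label 1, and copies with the middle word swapped for a random word get label 0.

```python
import random

def make_unsup_examples(tokens, window=11, vocab=None):
    """Build (window, label) pairs: 1 = real window, 0 = middle word replaced at random."""
    vocab = vocab or list(set(tokens))
    examples = []
    mid = window // 2
    for i in range(len(tokens) - window + 1):
        real = tokens[i:i + window]
        fake = list(real)
        fake[mid] = random.choice(vocab)      # corrupt the middle word
        examples.append((real, 1))            # a real string of words
        examples.append((fake, 0))            # probably nonsense now
    return examples

corpus = "the quick brown fox jumps over the lazy dog near the river bank every day".split()
pairs = make_unsup_examples(corpus)
```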
So it turns out that the embeddings that are created when you look at, say, Hindi and Japanese end up having very similar structures. 01:22:05.360 |
And so one way to translate language is to create a bunch of word vectors in English 01:22:15.000 |
for various words, and then to create a bunch of word vectors in Japanese for various words. 01:22:22.080 |
And then what you can do is you can say, "Okay, I want to translate this particular English word." 01:22:29.960 |
You can basically look up and find the nearest word in the same vector space among the Japanese word vectors. 01:22:39.800 |
So it's a fascinating thing about language, in fact, Google has just announced that they've 01:22:47.200 |
replaced Google Translate with a neural translation system, and part of what that is doing is basically this kind of thing. 01:22:54.120 |
In fact, here are some interesting examples of some word embeddings. 01:23:00.800 |
The word embedding for king and queen has the same distance and direction as the word embedding for man and woman. 01:23:07.320 |
Ditto for walking vs. walked and swimming vs. swam, and ditto for Spain vs. Madrid and Italy vs. Rome. 01:23:15.000 |
So the embeddings that have to get learned in order to solve this stupid, meaningless task end up capturing a lot of real structure about how language works. 01:23:27.640 |
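The classic way to check those linear relationships is with a little vector arithmetic; here is a sketch, assuming word2vec is the word-to-vector lookup described just below (this is not code from the lesson notebook):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land near queen in the embedding space
target = word2vec['king'] - word2vec['man'] + word2vec['woman']
print(cosine(target, word2vec['queen']))   # high similarity if the analogy holds
```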
And so I've actually downloaded those glove embeddings, and I've pre-processed them, and 01:23:34.320 |
I'm going to upload these for you shortly into a form that's going to be really easy 01:23:40.880 |
And I've created this little thing called load_glove, which loads the pre-processed version. 01:23:48.240 |
It's going to give you the word vectors, which is the 400,000 by, in this case, 50-dimensional 01:23:54.280 |
matrix of vectors; a list of the words, and here they are: "the", comma, period, "of", "to"; and a list of the word indexes. 01:24:04.920 |
So you can now take a word and call word2vec to get back its 50-dimensional array. 01:24:16.260 |
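The pre-processed files themselves aren't shown in the transcript, but a sketch of what a load_glove-style loader does with the raw downloaded text file might look like this; the filename and the three return values are assumptions based on the description above.

```python
import numpy as np

def load_glove_txt(path='glove.6B.50d.txt'):
    """Parse a raw GloVe text file into (vectors, words, word -> index)."""
    words, rows = [], []
    with open(path, encoding='utf8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            words.append(parts[0])
            rows.append(np.array(parts[1:], dtype=np.float32))
    vecs = np.stack(rows)                               # 400,000 x 50 for the 6B/50d file
    wordidx = {w: i for i, w in enumerate(words)}
    return vecs, words, wordidx

vecs, words, wordidx = load_glove_txt()
word2vec = {w: vecs[i] for w, i in wordidx.items()}     # word -> 50-float vector
print(words[:5])                                        # 'the', ',', '.', 'of', 'to'
print(word2vec['the'][:5])
```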
In order to turn a 50-dimensional vector into something 2-dimensional that I can plot, we 01:24:23.200 |
have to do something called dimensionality reduction. 01:24:25.880 |
And there's a particular technique, the details don't really matter, called TSNE, which attempts 01:24:30.240 |
to find a way of taking your high-dimensional information and plot it on 2 dimensions such 01:24:36.840 |
that things that were close in the 50 dimensions are still close in the 2 dimensions. 01:24:41.520 |
And so I used TSNE to plot the first 350 most common words, and here they all are. 01:24:50.720 |
And so you can see that bits of punctuation have appeared close to each other, numerals 01:24:55.820 |
appear close to each other, written versions of numerals are close to each other, seasons, 01:25:00.280 |
games, leagues played are all close to each other, various things about politics, school 01:25:05.440 |
and university, president, general, prime minister, and Bush. 01:25:11.640 |
Now this is a great example of where this TSNE 2-dimensional projection is misleading 01:25:18.680 |
about the level of complexity that's actually in these word vectors. 01:25:22.440 |
In a different projection, Bush would be very close to tree. 01:25:27.340 |
The 2-dimensional projection is losing a lot of information. 01:25:31.480 |
The true detail here is a lot more complex than us mere humans can see on a page. 01:25:41.920 |
So all I've done here is I've just taken those 50-dimensional word vectors and I've plotted the most common 350 of them in two dimensions. 01:25:49.920 |
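A sketch of that projection using scikit-learn's TSNE; the variable names vecs and words follow the loader sketch above, and the plotting details are just one reasonable choice rather than the notebook's exact code.

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

n = 350                                   # the 350 most common words
tsne = TSNE(n_components=2, random_state=0)
coords = tsne.fit_transform(vecs[:n])     # 50-d -> 2-d, keeping close things close

plt.figure(figsize=(12, 12))
plt.scatter(coords[:, 0], coords[:, 1], s=2)
for (x, y), w in zip(coords, words[:n]):
    plt.annotate(w, (x, y), fontsize=8)   # label each point with its word
plt.show()
```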
And so you can see that when you learn an embedding, you end up with something useful. We've 01:25:58.620 |
now seen this not just for words: for movies, we were able to plot some 01:26:03.000 |
movies in 2 dimensions and see how they relate to each other, and we can do the same thing with words. 01:26:07.920 |
In general, when you have some high-dimension, high-cardinality categorical variable, whether 01:26:13.640 |
it be lots of movies or lots of reviewers or lots of words or whatever, you can turn it 01:26:18.440 |
into a useful, lower-dimensional space using this very simple technique of creating an embedding. 01:26:24.680 |
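In Keras terms, that just means putting an Embedding layer on the integer IDs; a toy sketch for something like the movie example (all of the sizes here are made up):

```python
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

n_movies, n_factors = 5000, 32        # e.g. 5,000 distinct movie IDs -> 32 latent factors each
model = Sequential([
    Embedding(n_movies, n_factors, input_length=1),
    Flatten(),
    Dense(1)])                        # e.g. predict a rating from the movie's latent factors
model.compile('adam', 'mse')

# after training, the learned lower-dimensional representation is just the weight matrix
movie_vectors = model.layers[0].get_weights()[0]   # shape (5000, 32)
```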
The explanation on how unsupervised learning was used in Word2Vec was pretty smart. 01:26:31.320 |
I don't recall how it was done in GloVe, I believe it was something similar. 01:26:35.520 |
I should mention though that both GloVe and Word2Vec did not use deep learning. 01:26:41.460 |
They actually tried to create a linear model, and the reason they did that was that they 01:26:47.600 |
specifically wanted to create representations which had these kinds of linear relationships 01:26:53.220 |
because they felt that this would be a useful characteristic of these representations. 01:26:59.400 |
I'm not even sure if anybody has tried to create a similarly useful representation using 01:27:07.520 |
a deeper model and whether that turns out to be better. 01:27:10.600 |
Obviously with these linear models, it saves a lot of computational time as well. 01:27:15.720 |
The embeddings, however, even though they were built using linear models, we can now 01:27:22.680 |
use them as inputs to deep models, which is what we're about to do. There's a question just behind you, Rachel. 01:27:32.440 |
So Google SyntaxNet model that just came out, was that the one you were mentioning? 01:27:41.760 |
Word2Vec has been around for 2 and a half years, 2 years. 01:27:54.480 |
I think it's called Parsey McParseface; that one is the one where they claim 97% accuracy 01:28:17.000 |
on NLP, and it also returns parts of speech, so if you give it a sentence it'll tell you the part of speech of each word. 01:28:21.960 |
In that high-dimensional space, for example, you can see there is information about things like tense. 01:28:29.440 |
So it's very easy to take a word vector and use it to create a part of speech recognizer, 01:28:35.240 |
you just need a fairly small labeled corpus, and it's actually pretty easy to download 01:28:40.400 |
a rather large labeled corpus, and build a simple model that goes from word vector to part of speech. 01:28:47.720 |
There's a really interesting paper called "Exploring the Limits of Language Modeling." 01:28:55.720 |
That Parsey McParseface thing got far more PR than it deserved. 01:29:01.800 |
It was not really an advance over the state-of-the-art language models of the time, but since that 01:29:11.480 |
time there have been some much more interesting things. 01:29:15.680 |
One of the interesting papers is "Exploring the Limits of Language Modeling," which is 01:29:19.480 |
looking at what happens when you take a very, very, very large dataset and spend shitloads 01:29:28.680 |
of Google's money on lots and lots of GPUs for a very long time, and they have some genuine 01:29:36.760 |
massive improvements to the state-of-the-art in language modeling. 01:29:41.600 |
In general, when we're talking about language modeling, we're talking about things like 01:29:46.160 |
is this a noun or a verb, is this a happy sentence or a sad sentence, is this a formal 01:29:52.840 |
speech or an informal speech, so on and so forth. 01:29:57.600 |
And all of these things that NLP researchers do, we can now do super easily with these 01:30:02.800 |
This uses two techniques, one of which you know and one of which you're about to know, 01:30:22.080 |
convolutional neural networks and recurrent neural networks, specifically a type called an LSTM. 01:30:29.340 |
You can check out this paper to see how they compare. 01:30:32.220 |
Around the same time, there's been an even newer paper that has furthered the state-of-the-art 01:30:36.760 |
in language modeling and it's using a convolutional neural network. 01:30:41.000 |
So right now, CNNs with pre-trained word embeddings are the state-of-the-art. 01:30:56.280 |
So given that we can now download these pre-trained word embeddings, that leads to the question 01:31:04.280 |
of why are we using randomly generated word embeddings when we do our sentiment analysis. 01:31:21.920 |
From now on, you should always use pre-trained word embeddings anytime you do NLP. 01:31:31.120 |
Over the next few weeks, we will be gradually making this easier and easier. 01:31:35.600 |
At this stage, it requires slightly less than a screen of code. 01:31:39.700 |
You have to load the embeddings off disk, creating your word vectors, your list of words, and your word indexes. 01:31:48.040 |
The next thing you have to do is, the word indexes that come from GloVe are going to 01:31:54.000 |
be different to the word indexes in your vocabulary. 01:32:04.800 |
In the GloVe case, it's probably not the word Bromwell. 01:32:07.320 |
So this little piece of code is simply something that is mapping from one index to the other 01:32:16.320 |
So this createEmbedding function is then going to create an embedding matrix where the indexes 01:32:28.160 |
are the indexes in the IMDB dataset, and the embeddings are the embeddings from GloVe. 01:32:38.080 |
This embedding matrix contains the GloVe word vectors, indexed according to the IMDB dataset. 01:32:38.080 |
So now I have simply copied and pasted the previous code and I have added this weights parameter, passing in the embedding matrix we just built. 01:32:44.440 |
Since we think these embeddings are pretty good, I've set trainable to false. 01:32:59.600 |
I won't leave it at false because we're going to fine-tune them, but we'll start it at false. 01:33:04.920 |
One particular reason that we can't leave it at false is that sometimes I've had to 01:33:09.400 |
create a random embedding because sometimes the word that I looked up in GloVe didn't exist. 01:33:16.640 |
For example, anything that finishes with apostrophe s, in GloVe they tokenize that to have apostrophe 01:33:23.120 |
s and the word as separate tokens, but in IMDB they were combined into one token. 01:33:29.160 |
And so all of those things, there aren't vectors for them. 01:33:32.000 |
So I just randomly created embeddings for anything that I couldn't find in the GloVe vocabulary. 01:33:41.040 |
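A sketch of what a createEmbedding-style function does: look each IMDB index's word up in GloVe and fall back to small random values when it isn't there. This is not the notebook's exact code; idx2word (an IMDB index-to-word mapping) and the word2vec lookup from earlier are assumed to exist.

```python
import numpy as np

def create_emb(idx2word, word2vec, vocab_size=5000, n_factors=50):
    """Embedding matrix indexed by IMDB word index, filled with GloVe vectors."""
    emb = np.zeros((vocab_size, n_factors), dtype=np.float32)
    for i in range(1, vocab_size):
        word = idx2word.get(i, '')
        if word in word2vec:
            emb[i] = word2vec[word]            # use the pre-trained GloVe vector
        else:
            # e.g. IMDB tokens that GloVe tokenized differently, like "movie's"
            emb[i] = np.random.normal(scale=0.6, size=n_factors)
    return emb

emb = create_emb(idx2word, word2vec)
```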
But for now, let's start using just the embeddings that were given, and we will set this to non-trainable, 01:33:46.840 |
and we will train a convolutional neural network using those embeddings for the IMDB task. 01:33:55.800 |
And after 2 epochs, we have 89.8. Previously, with random embeddings, we had 89.5. 01:34:16.640 |
Let's now go ahead and set the first layer's trainable to true. 01:34:16.640 |
Drop the learning rate a bit and do just one more epoch, and we're now up to 90.1. 01:34:29.880 |
So we've got way beyond the academic state of the art here. 01:34:34.400 |
We're kind of cheating because we're now not just building a model, we're now using a pre-trained 01:34:40.640 |
word embedding model that somebody else has provided for us. 01:34:44.880 |
But why would you ever not do that if that exists? 01:34:48.800 |
So you can see that we've had a big jump, and furthermore it's only taken us 12 seconds 01:34:56.000 |
So we started out with the pre-trained word embeddings, we set them initially to non-trainable 01:35:02.520 |
in order to just train the layers that used them, waited until that was stable, which 01:35:11.280 |
took really 2 epochs, and then we set them to trainable and did one more little fine-tuning epoch. 01:35:19.120 |
And this kind of approach of these 3 epochs of training is likely to work for a lot of NLP problems. 01:35:28.520 |
Do you not need to compile the model after resetting the input layer to trainable equals 01:35:37.280 |
No you don't, because the architecture of the model has not changed in any way; it's exactly the same model. 01:35:48.040 |
There's never any harm in compiling the model. 01:35:51.480 |
Sometimes if you forget to compile, it just continues to use the old model, so best to compile anyway. 01:36:01.560 |
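Putting the pieces together, roughly, in the Keras-1-style API used throughout this lesson; emb is the embedding matrix built above, and the data variables (trn, labels_train, test, labels_test), layer sizes and optimizer settings are illustrative assumptions rather than the exact notebook code.

```python
from keras.models import Sequential
from keras.layers import Embedding, Dropout, Convolution1D, MaxPooling1D, Flatten, Dense
from keras.optimizers import Adam

model = Sequential([
    Embedding(5000, 50, input_length=500, dropout=0.2,
              weights=[emb], trainable=False),          # start from the GloVe matrix, frozen
    Dropout(0.2),
    Convolution1D(64, 5, border_mode='same', activation='relu'),
    Dropout(0.2),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid')])
model.compile(Adam(), loss='binary_crossentropy', metrics=['accuracy'])
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)

# now unfreeze the embeddings and fine-tune one more epoch at a lower learning rate;
# recompiling here does no harm, as noted above
model.layers[0].trainable = True
model.compile(Adam(1e-4), loss='binary_crossentropy', metrics=['accuracy'])
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=1, batch_size=64)
```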
Something that I thought was pretty cool is that during the week, one of our students 01:36:06.220 |
here had an extremely popular post appear all over the place; I saw it on the front 01:36:11.240 |
page of Hacker News, talking about how his company, Quid, uses deep learning with small data, 01:36:16.880 |
which is what we're all about, so I was very happy to see it. 01:36:21.200 |
For those of you who don't know it, Quid is a company, quite a successful startup actually, 01:36:26.300 |
that is processing millions and millions of documents, things like patents and stuff like 01:36:31.760 |
that, and providing enterprise customers with really cool visualizations and interactive 01:36:40.680 |
And so this is by Ben Bowles, one of our students here, and he talked about how he compared 01:36:44.920 |
three different approaches to a particular NLP classification task, one of which involved 01:36:52.000 |
some pretty complex and slow to develop carefully engineered features. 01:37:02.200 |
But Model 3 in this example was a convolutional neural network. 01:37:07.680 |
So I think this is pretty cool and I was hoping to talk to Ben about this piece of work. 01:37:18.200 |
Could you give us a little bit of context on what you were doing in this project? 01:37:23.760 |
Yeah, so the task is about detecting marketing language from company descriptions. 01:37:30.960 |
So it's had the flavor of being very similar to sentiment analysis, like you have two classes 01:37:36.040 |
of things, they're kind of different in some kind of semantic way. 01:37:39.440 |
And you've got some examples here, so one was "our patent-pending support system is engineered 01:37:44.240 |
and designed to bring confidence and style", which is your more marketing one, I guess, and "your spatial scanning 01:37:49.920 |
software for mobile devices" is your more informative one. 01:37:53.200 |
Yeah, I mean the semantics of the marketing language is like, oh this is exciting. 01:37:59.640 |
There are certain types of meanings and semantics around which the marketing tends to cluster, 01:38:04.520 |
and I sort of realized, hey, this would be kind of a nice task for deep learning. 01:38:09.440 |
How were these labeled, your data set in the first place? 01:38:13.400 |
Basically by a couple of us in the company, we basically just found some good ones and 01:38:18.440 |
found the bad ones and then literally tried it out. 01:38:21.560 |
I mean, it's literally as hacky as you could possibly imagine. 01:38:25.320 |
So yeah, it was super, super scrappy. 01:38:30.520 |
But it actually ended up being very useful for us. I think that's kind of a nice 01:38:33.600 |
lesson: sometimes scrappy gets you most of the way you need. When you think about, like, 01:38:38.440 |
hey, how do you get the data for your project, well, you can actually just create it, right? 01:38:44.120 |
I mean, I love this lesson because -- that's so startup, right? 01:38:48.720 |
When I talk to big enterprise executives, they're all about their five year metadata 01:38:54.880 |
and data lake repository infrastructure program at the end of which maybe they'll actually 01:39:00.760 |
try and get some value out of it, whereas startups are just like, okay, what have we 01:39:05.280 |
got that we can do by Monday, let's throw it together and see if it works. 01:39:10.680 |
The latter approach is so much better because by Monday you know whether it kind of looks 01:39:15.760 |
good, which kind of things are important, and you can decide how much further it's worth investing. 01:39:23.240 |
So one of the things I wanted to show is your convolutional neural network did something 01:39:28.240 |
pretty neat, and so I wanted to use this same neat trick for our convolutional neural network as well. 01:39:37.220 |
So I mentioned earlier that when I built this CNN, I tried using a filter size of 5, and 01:39:50.680 |
And what Ben in his blog post points out is that there's a neat paper in which they describe 01:39:56.400 |
doing something interesting, which is not just using one size convolution, but trying 01:40:04.320 |
And you can see here, this is a great use of the functional API, and I haven't exactly 01:40:10.600 |
used your code, I've kind of rewritten it a little bit, but basically it's the same concept. 01:40:14.720 |
Let's try size 3 and size 4 and size 5 convolutional filters, and so let's create a 1D convolutional 01:40:22.920 |
filter of size 3 and then size 4 and then size 5, and then for each one using the functional 01:40:29.760 |
API we'll add max pooling and we'll flatten it and we'll add it to a list of these different convolution outputs. 01:40:36.960 |
And then at the end, we'll merge them all together by simply concatenating them. 01:40:42.400 |
So we're now going to have a single vector containing the result of the 3 and 4 and 5 01:40:48.000 |
size convolutions, like why settle for 1. And then let's return that whole model as a little 01:40:54.800 |
sub-model, which in Ben's code he called graph. 01:40:59.280 |
The reason I assume you call this graph is because people tend to think of these things as computational graphs. 01:41:06.560 |
A computational graph basically is saying this is a computation being expressed as various 01:41:13.520 |
inputs and outputs, so you can think of it as a graph. 01:41:16.640 |
So once you've got this little multi-layer convolution module, you can stick it inside 01:41:23.760 |
a standard sequential model by simply replacing the Convolution1D and max pooling piece with 01:41:32.120 |
graph, where graph is the concatenated version of all of these different scales of convolution. 01:41:41.240 |
And so trying this out, I got a slightly better answer again, which is 90.36%. 01:41:50.840 |
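A sketch of that multi-size convolution block using the Keras 1 functional API, following the description above rather than Ben's exact code; emb and the sequence and vocabulary sizes are assumptions carried over from the earlier IMDB model.

```python
from keras.models import Model, Sequential
from keras.layers import Input, Embedding, Dropout, Convolution1D, MaxPooling1D, Flatten, Dense, merge

graph_in = Input((500, 50))              # sequence length x embedding size
convs = []
for fsz in (3, 4, 5):                    # several filter sizes rather than settling for one
    x = Convolution1D(64, fsz, border_mode='same', activation='relu')(graph_in)
    x = MaxPooling1D()(x)
    x = Flatten()(x)
    convs.append(x)
out = merge(convs, mode='concat')        # concatenate the 3-, 4- and 5-wide results
graph = Model(graph_in, out)             # the little multi-size convolution sub-model

model = Sequential([
    Embedding(5000, 50, input_length=500, weights=[emb], dropout=0.2),
    Dropout(0.2),
    graph,                               # drops in where Convolution1D + MaxPooling1D used to be
    Dropout(0.2),
    Dense(100, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid')])
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])
```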
And I hadn't seen that paper before, so thank you for giving that great idea. 01:41:55.360 |
Did you have anything to add about this multi-scale convolution idea? 01:41:59.760 |
Not really, other than I think it's super cool. 01:42:04.240 |
But actually I'm still trying to figure out all the ins and outs of exactly how it works. 01:42:10.400 |
In some ways, implementation is easier than understanding. 01:42:14.720 |
In a lot of these things, the math is kind of ridiculously simple, and then you throw 01:42:23.880 |
it at an SGD and let it do billions and billions of calculations in a fraction of a second, 01:42:29.960 |
and what it comes up with is kind of hard to grasp. 01:42:34.360 |
And you are using capital M merge in this example, did you want to talk about that? 01:42:40.640 |
Ben used capital M merge and I just did the same thing. 01:42:44.800 |
Were it me, I would have used small M merge, so we'll have to agree to disagree here. 01:43:01.280 |
So we have a few minutes to talk about something enormous, so we're going to do a brief introduction. 01:43:16.960 |
So everything we've learned so far about convolutional neural networks does not necessarily do a 01:43:24.800 |
great job of solving a problem like how would you model this? 01:43:31.960 |
Now, whatever this markup is exactly -- I'm not quite sure -- notice what it takes to model it. 01:43:37.160 |
It has to recognize when you have a start tag and know to close that tag, but then over 01:43:42.720 |
a longer period of time that it's inside a weird XML comment thing and to know that it 01:43:48.980 |
has to finish off the weird XML comment thing, which means it has to kind of keep memory 01:43:55.720 |
about what happened in the distant past if you're going to successfully do any kind of modeling of this. 01:44:04.480 |
And so with that kind of memory therefore, it can handle long-term dependencies. 01:44:15.640 |
Also think about these two different sentences. 01:44:19.320 |
They both mean effectively the same thing, but in order to realize that, you're going 01:44:24.360 |
to have to keep some kind of state that knows that after this has been read in, you're now 01:44:29.720 |
talking about something that happened in 2009, and you then have to remember it all the way 01:44:35.440 |
to here to know when it was that this thing happened that you did in Nepal. 01:44:41.200 |
So we want to create some kind of stateful representation. 01:44:46.680 |
Furthermore it would be nice if we're going to deal with big long pieces of language like 01:44:50.800 |
this with a lot of structure to be able to handle variable length sequences, so that 01:44:55.320 |
we can handle some things that might be really long and some things that might be really 01:44:59.480 |
So these are all things which convolutional neural networks don't necessarily do that well. 01:45:06.480 |
So we're going to look at something else, which is a recurrent neural network, which handles all of these things. 01:45:12.200 |
And here is a great example of a good use of a recurrent neural network. 01:45:17.680 |
At the top here, you can see that there is a convolutional neural network that is looking at images of house numbers. 01:45:30.560 |
These images are coming from really big Google Street View pictures, and so it has to figure 01:45:36.440 |
out what part of the image should I look at next in order to figure out the house number. 01:45:42.400 |
And so you can see that there's a little square box that is scanning through and figuring 01:45:49.680 |
And then at the bottom, you can see it's then showing you what it's actually seeing after 01:45:57.120 |
So the thing that is figuring out where to look next is a recurrent neural network. 01:46:02.120 |
It's something which is taking its previous state and figuring out what should its next 01:46:09.000 |
And this kind of model is called an attentional model. 01:46:14.120 |
And it's a really interesting avenue of research when it comes to dealing with things like 01:46:19.320 |
very large images, images which might be too big for a single convolutional neural network to look at all at once. 01:46:29.400 |
On the left is another great example of a useful recurrent neural network, which is the 01:46:34.440 |
very popular Android and iOS text entry system called SwiftKey. 01:46:40.320 |
And SwiftKey had a post-up a few months ago in which they announced that they had just 01:46:46.920 |
replaced their language model with a neural network of this kind, which basically looked 01:46:52.720 |
at your previous words and figured out what word you are likely to be typing next. 01:47:01.040 |
A final example: Andrej Karpathy showed a really cool thing where he was able to generate 01:47:10.320 |
random mathematical papers by generating random LaTeX, and to generate random LaTeX you actually 01:47:17.440 |
have to learn things like \begin{proof} and \end{proof} and these kinds of long-term dependencies. 01:47:25.840 |
And he was able to do that successfully, so this is actually a randomly generated piece 01:47:30.200 |
of LaTeX which is being created with a recurrent neural network. 01:47:36.400 |
So today I am not going to show you exactly how it works, I'm going to try to give you an intuition for it. 01:47:44.400 |
And I'm going to start off by showing you how to think about neural networks as computational 01:47:53.560 |
So this is coming back to that word Ben used earlier, this idea of a graph. 01:47:57.880 |
And so I started out by trying to draw -- this is like my notation, you won't see this anywhere 01:48:02.480 |
else but it'll do for now -- here is a picture of a single hidden layer basic neural network. 01:48:11.000 |
We can think of it as having an input, which is going to be of size batch size by number of inputs. 01:48:23.160 |
And then this arrow, this orange arrow, represents something that we're doing to that matrix. 01:48:29.440 |
So each of the boxes represents a matrix, and each of the arrows represents one or more 01:48:37.640 |
In this case, we do a matrix product and then we throw it through a rectified linear unit. 01:48:43.520 |
And then we get a circle which represents a matrix, but it's now a hidden layer which 01:48:49.840 |
is of size, batch size, by number of activations. 01:48:54.640 |
And number of activations is just, when we created that dense layer, we would have said Dense, 01:48:59.880 |
and then we would have had some number, and that number is how many activations we create. 01:49:06.520 |
And then we put that through another operation, which in this case is a matrix product followed 01:49:12.320 |
by a softmax, and so triangle here represents an output matrix. 01:49:18.400 |
And that's going to be batch size by, if it's ImageNet, 1000. 01:49:24.440 |
So this is my little way of representing the computation graph of a basic neural network with a single hidden layer. 01:49:35.000 |
I'm now going to create some slightly more complex models, but I'm going to slightly simplify the notation first. 01:49:43.560 |
One thing to note is that batch size appears all the time, so I'm going to get rid of it. 01:49:49.980 |
So here's the same thing where I've removed batch size. 01:49:53.180 |
Also the specific activation function, who gives a shit? 01:49:56.640 |
It's probably ReLU everywhere except the last layer, where it's softmax, so I've removed that too. 01:50:03.280 |
Let's now look at what a convolutional neural network with a single dense hidden layer would 01:50:10.500 |
So we'd have our input, which this time will be, and remember I've removed batch size, 01:50:15.960 |
number of channels by height by width, the operation, and we're ignoring the activation 01:50:21.040 |
function is going to be a convolution followed by a max pool. 01:50:24.760 |
Remember any shape is representing a matrix, so that gives us a matrix which will be size 01:50:30.440 |
num_filters by height/2 by width/2, since we did a max pooling. 01:50:39.040 |
I've put flatten in parentheses because flattening mathematically does nothing at all. 01:50:45.080 |
Flattening is just telling Keras to think of it as a vector. 01:50:49.840 |
It doesn't actually calculate anything, it doesn't move anything, it doesn't really do 01:50:55.560 |
It just says think of it as being a different shape. 01:50:59.600 |
So let's then take a matrix product, and remember I'm not putting in the activation functions 01:51:05.280 |
So that would be our dense layer, gives us our first fully connected layer, which will 01:51:10.920 |
be of size, number of activations, and then we put that through a final matrix product and softmax to get our output. 01:51:19.040 |
So here is how we can represent a convolutional neural network with a single dense hidden layer. 01:51:27.800 |
The number of activations, again, is the same as we had last time: it's whatever number we passed 01:51:37.200 |
when we created the Dense layer, just like the number of filters is the number we pass when we write Convolution2D. 01:51:51.280 |
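In Keras terms, that diagram corresponds to something like the following; the filter count, the input shape (channels first, as in the description) and the layer sizes are arbitrary placeholders.

```python
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    # num_filters=32, on a channels-first image of num_channels x height x width
    Convolution2D(32, 3, 3, activation='relu', border_mode='same', input_shape=(3, 224, 224)),
    MaxPooling2D(),                       # halves height and width
    Flatten(),                            # just a reshape; no computation happens here
    Dense(100, activation='relu'),        # the single dense hidden layer: 100 activations
    Dense(1000, activation='softmax')])   # e.g. ImageNet's 1000 classes
```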
So I'm going to now create a slightly more complex computation graph, but again I'm going 01:51:57.520 |
to slightly simplify what I put on the screen, which is this time I'm going to remove all of the layer operations. 01:52:03.400 |
Because now that we have removed the activation function, you can see that in every case we 01:52:09.560 |
basically have either some kind of linear thing, either a matrix product or a convolution, 01:52:16.700 |
and optionally there might also be a max pooling. 01:52:19.320 |
So really, this is not adding much additional information, so I'm going to get rid of it from the diagram. 01:52:25.160 |
So we're now not showing the layer operations. 01:52:26.800 |
So remember now, every arrow is representing one or more layer operations, which will generally 01:52:33.520 |
be a convolution or a matrix product, followed by an activation function, and maybe there's a max pooling as well. 01:52:42.000 |
So let's say we wanted to predict the third word of a three-word string based on the previous two words. 01:52:53.240 |
Now there's all kinds of ways we could do this, but here is one interesting way, which 01:52:58.360 |
you will now recognize you could do with Keras's functional API. 01:53:02.440 |
Which is, we could take word1 input, and that could be either a one-hot encoded thing, in 01:53:13.160 |
which case its size would be vocab size, or it could be an embedding of it. 01:53:21.600 |
We then stick that through a layer operation to get a matrix output, which is our first hidden layer. 01:53:32.560 |
And this thing here, we could then take and put through another layer operation, but this 01:53:38.960 |
time we could also add in the word2 input, again, either of vocab size or the embedding 01:53:45.680 |
of it, put that through a layer operation of its own, and then when we have two arrows coming into one shape, we merge them. 01:53:55.680 |
And a merge could either be done as a sum, or as a concatenation. 01:54:02.000 |
I'm not going to say one's better than the other, but there are two ways that we can 01:54:05.640 |
take two input vectors and combine them together. 01:54:09.600 |
So now at this point, we have the input from word2 after sticking that through a layer. 01:54:18.240 |
We have the input from word1 after sticking that through two layers. 01:54:23.160 |
Merge them together, stick that through another layer to get our output, which we could then 01:54:27.600 |
compare to word3 and try to train that to recognize word3 from words1 and word2. 01:54:38.480 |
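A sketch of that graph in the Keras 1 functional API; the vocabulary size and layer widths are made up, and the merge could equally be mode='concat' rather than mode='sum'.

```python
from keras.models import Model
from keras.layers import Input, Embedding, Flatten, Dense, merge

vocab_size, n_fac = 5000, 50

def word_input(name):
    inp = Input(shape=(1,), dtype='int64', name=name)
    emb = Flatten()(Embedding(vocab_size, n_fac, input_length=1)(inp))
    return inp, emb

w1_in, w1 = word_input('word1')
w2_in, w2 = word_input('word2')

h1 = Dense(256, activation='relu')(w1)              # layer operation on word 1
h2 = merge([Dense(256, activation='relu')(h1),      # another layer operation on that state...
            Dense(256, activation='relu')(w2)],     # ...merged with word 2 after its own layer
           mode='sum')
out = Dense(vocab_size, activation='softmax')(h2)   # predict word 3

model = Model([w1_in, w2_in], out)
model.compile('adam', 'sparse_categorical_crossentropy')
```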
You could try and build this network using some corpus you find online, see how it goes. 01:54:45.440 |
Pretty obviously then, you could bring it up another level to say let's try and predict 01:54:51.800 |
the fourth word of a three-word string using words1 and 2 and 3. 01:55:00.160 |
The reason I'm doing it in this way is that what's happening is each time I'm going through 01:55:05.680 |
another layer operation and then bringing in word2 and going through a layer operation 01:55:11.160 |
and bringing in word3 and going through a layer operation is I am collecting state. 01:55:17.800 |
Each of these things has the ability to capture state about all of the words that have come 01:55:23.240 |
so far and the order in which they've arrived. 01:55:27.480 |
So by the time I get to predicting word4, this matrix has had the opportunity to learn 01:55:35.280 |
what does it need to know about the previous words' orderings and how they're connected 01:55:40.080 |
to each other and so forth in order to predict this fourth word. 01:55:47.920 |
It's important to note that we have not yet previously built a model in Keras which has 01:55:53.880 |
input coming in anywhere other than the first layer, but there's no reason we can't. 01:56:01.040 |
One of you asked a great question earlier, which was could we use this to bring in metadata 01:56:05.800 |
like the speed a car was going to add it with a convolutional neural network's image data. 01:56:11.520 |
I said yes we can, so in this case we're doing the same thing, which is we're bringing in 01:56:17.240 |
an additional word's worth of data, and remember, each time you see two different arrows coming into one shape, they get merged. 01:56:25.160 |
So here's a perfectly reasonable way of trying to predict the fourth word from the previous 01:56:32.920 |
So this leads to a really interesting question, which was what if instead we said let's bring 01:56:41.160 |
in our Word 1, and then we had a layer operation in order to create our hidden state, and that 01:56:50.080 |
would be enough to predict Word 2, and then to predict Word 3, could we just do a layer operation from that hidden state back to itself? 01:57:06.560 |
And then that could be used to predict Word 3, and then run it again to predict Word 4, and so on. 01:57:14.600 |
This is called an RNN, and everything that you see here is exactly the same structurally 01:57:26.000 |
The colored-in areas represent matrices, and the arrows represent layer operations. 01:57:33.360 |
One of the really interesting things about an RNN is each of these arrows that you see 01:57:38.800 |
- three arrows - there's only one weight matrix attached to those. 01:57:43.200 |
In other words, it's the equivalent thing of saying every time you see an arrow from 01:57:49.240 |
a circle to a circle, so that would be that one and that one, those two weight matrices are actually the same matrix. 01:57:59.000 |
Every time you see an arrow from a rectangle to a circle, those three matrices have to be the same as each other too. 01:58:07.640 |
And then finally, you've got an arrow from a circle to a triangle, and that weight matrix is the third one. 01:58:13.080 |
The idea being that if you have a word coming in and being added to some state, why would 01:58:19.200 |
you want to treat it differently depending on whether it's the first word in a string 01:58:25.520 |
Given that generally speaking, we kind of split up strings pretty much at random anyway. 01:58:29.720 |
We're going to be having a whole bunch of 11-word strings. 01:58:36.680 |
One of the nice things about this way of thinking about it where you have it going back to itself 01:58:41.640 |
is that you can very clearly see there is one layer operation, one weight matrix for 01:58:46.640 |
input to hidden, one for hidden to hidden, circle to circle, and one for hidden to output, circle to triangle. 01:58:58.320 |
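A pure-numpy sketch of that weight sharing: one input-to-hidden matrix, one hidden-to-hidden matrix and one hidden-to-output matrix, reused at every step. The sizes and the random stand-in inputs are arbitrary.

```python
import numpy as np

n_in, n_hidden, n_out, n_steps = 50, 100, 5000, 8
rng = np.random.RandomState(0)
W_ih = rng.normal(scale=0.01, size=(n_in, n_hidden))      # rectangle -> circle (shared)
W_hh = rng.normal(scale=0.01, size=(n_hidden, n_hidden))  # circle -> circle (shared)
W_ho = rng.normal(scale=0.01, size=(n_hidden, n_out))     # circle -> triangle

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.zeros(n_hidden)
for t in range(n_steps):
    x_t = rng.normal(size=n_in)               # stand-in for the embedding of word t
    h = np.maximum(0, x_t @ W_ih + h @ W_hh)  # the same two matrices are used at every step
out = softmax(h @ W_ho)                       # prediction for the next word
```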
So we're going to talk about that in a lot more detail next week. 01:59:04.520 |
So now, I'm just going to quickly show you something in the last one minute, which is 01:59:14.720 |
that we can train something which takes, for example, all of the text of Nietzsche, so here's 01:59:25.040 |
a bit of his text, I've just read it in here, and we could split it up into every sequence 01:59:31.600 |
- let's grab it here - into every sequence of length 40. 01:59:36.200 |
So I've gone through the whole text and grabbed every sequence of length 40. 01:59:41.440 |
And then I've created an RNN and its goal is to take the sentence which represents the 01:59:46.120 |
indexes from i to i+40 and predict the sentence from i+1 to i+40+1. 01:59:55.560 |
So for every string of length maxlen, I'm trying to predict the string one character along from it. 02:00:04.040 |
And so I can take that now and create a model which has - an LSTM is a kind of recurrent 02:00:10.340 |
neural network, we'll talk about it next week - which has a recurrent neural network, starts 02:00:14.720 |
of course with an embedding. And then I can train that by passing in my sentences and the sentences shifted along by one as the labels. 02:00:31.360 |
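A sketch of that setup, assuming idx is the list of character indexes for the Nietzsche text; the embedding size, LSTM size and training settings are illustrative rather than the exact notebook values.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, TimeDistributed, Dense

maxlen = 40
vocab_size = max(idx) + 1        # number of distinct characters in the corpus

# every overlapping 40-character sequence, and the same sequence shifted along by one
sentences  = np.stack([idx[i:     i + maxlen]     for i in range(len(idx) - maxlen - 1)])
next_chars = np.stack([idx[i + 1: i + maxlen + 1] for i in range(len(idx) - maxlen - 1)])

model = Sequential([
    Embedding(vocab_size, 24, input_length=maxlen),
    LSTM(512, return_sequences=True),                     # the recurrent part
    TimeDistributed(Dense(vocab_size, activation='softmax'))])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.fit(sentences, np.expand_dims(next_chars, -1), batch_size=64, nb_epoch=1)
```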
And I can then say, okay, let's try and generate 300 characters by building a prediction of 02:00:37.720 |
what do you think the next character would be. And so I have to seed it with something, 02:00:42.140 |
I don't know, I thought it felt very Nietzschean, ethics is a basic foundation of all that. 02:00:48.400 |
And see what happens. And after training it for only a few seconds, I get ethics is a 02:00:53.720 |
basic foundation of all that. You can get the sense that it's starting to learn a bit 02:00:59.760 |
about the idea that - oh by the way, one thing to mention is this Nietzsche corpus is slightly 02:01:04.760 |
annoying. It has carriage returns after every line, so you'll see it's going to throw carriage 02:01:10.320 |
returns in all over the place. It's got some pretty hideous formatting. 02:01:15.920 |
So then I train it for another 30 seconds. I train it for another 30 seconds and I get 02:01:22.760 |
to a point where it's kind of understanding the concept of punctuation and spacing. 02:01:27.760 |
And then I've trained it for 640 seconds and it's starting to actually create real words. 02:01:35.080 |
And then I've trained it for another 640 seconds. And interestingly, each section of Nietzsche 02:01:42.160 |
starts with a numbered section that looks exactly like this. It's even starting to learn 02:01:47.360 |
to close its quotation marks. It also notes that at the start of a chapter, it always 02:01:52.160 |
has three lines, so it's learned to start chapters after another 640 seconds and another 02:02:01.680 |
640 seconds. And so by this time, it's actually got to a point where it's saying some things 02:02:06.920 |
which are so obscure and difficult to understand, it could really be Nietzsche. 02:02:15.300 |
These char-RNN models are fun and all, but the reason this is interesting is that we're 02:02:22.320 |
showing that we only provided that amount of text and it was able to generate text out 02:02:29.280 |
here because it has state, it has recurrence. And what that means is that we could use this 02:02:35.120 |
kind of model to generate something like SwiftKey, where as you're typing it's saying this is the 02:02:40.140 |
next thing you're going to type. I would love you to think about during the week whether 02:02:47.120 |
this is likely to help our IMDB sentiment model or not. That would be an interesting 02:02:53.400 |
thing to talk about. Next week, we will look into the details of how RNNs work. Thanks.