So I wanted to start off by showing you something I'm kind of excited about, which is, here is the Dogs and Cats competition, which we all know so well. And it was interesting that the winner of this competition won by a very big margin, a 1.1% error versus a 1.7% error.
This is very unusual in a Kaggle competition, to see anybody win by a 50-60% margin. You can see that after that, people are generally clustering around about the same kind of number. So this was a pretty impressive performance. This is the guy who actually created a piece of deep learning software called OverFeat.
So I want to show you something pretty interesting, which is that this week I tried something new, and on Dogs and Cats got 98.95. So I want to show you how I did that. The way I did that was by using almost only techniques I've already shown you: I created a standard model, which is basically a dense model.
And then I pre-computed the last convolutional layer, and then I trained the dense model lots of times, and the other thing I did was to use some data augmentation. And I didn't actually have time to figure out the best data augmentation parameters, so I just picked some that seemed reasonable.
I should also mention this 98.95 would be easy to make a lot better. I'm not doing any pseudo-labeling here, and I'm not even using the full dataset. I put aside 2000 for the validation set. So with those two changes we would definitely get well over 99% accuracy. The missing piece that I added is I added batch normalization to VGG.
So batch normalization, if you guys remember, I said the important takeaway is that all modern networks should use batch norm because you can get 10x or more improvements in training speed and it tends to reduce overfitting. Because of the second one, it means you can use less dropout, and dropout of course is destroying some of your network, so you don't want to use more dropout than necessary.
So why didn't VGG already have batch norm? Because it didn't exist. So VGG was kind of mid to late 2014, and batch norm was maybe early to mid 2015. So why haven't people added batch norm to VGG already? And the answer is actually interesting to think about. So to remind you what batch norm is, batch norm is something which first of all normalizes every intermediate layer.
So it normalizes all of the activations by subtracting the mean and dividing by the standard deviation, which is always a good idea. And I know somebody on the forum today asked why is it a good idea, and I've put a link to some more information about that, so anybody who wants to know more about why do normalization check out the forum.
But just doing that alone isn't enough because SGD is quite bloody-minded, and so if it was trying to de-normalize the activations because it thought that was a good thing to do, it would do so anyway. So every time you tried to normalize them, SGD would just undo it again.
So what batch norm does is it adds two additional trainable parameters to each layer. One which multiplies the activations and one which is added to the activations. So it basically allows it to undo the normalization, but not by changing every single weight, but by just changing two weights for each activation.
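In numpy-ish pseudocode, that computation looks roughly like this (a sketch of the idea only, ignoring the running statistics that batch norm keeps for test time):

```python
import numpy as np

def batchnorm(x, gamma, beta, eps=1e-7):
    """What batch norm computes per mini-batch: normalize the activations, then let the two
    trainable parameters rescale (gamma) and shift (beta) the result."""
    return gamma * (x - x.mean(axis=0)) / (x.std(axis=0) + eps) + beta
```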
So it makes things much more stable in practice. But you can't just go ahead and stick batch norm into a pre-trained network, because if you do, it's going to take that layer's incoming activations, subtract the mean and divide by the standard deviation, which means those pre-trained weights are now wrong from then on, because they were created for a completely different set of activations.
So it's not rocket science, but I realized all we need to do is to insert a batch norm layer and figure out what the mean and standard deviation of the incoming activations would be for that dataset and basically create the batch norm layer such that the two trainable parameters immediately undo that.
So that way we would insert a batch norm layer and it would not change the outputs at all. So I grabbed the whole of ImageNet and I created our standard dense layer model. I pre-computed the convolutional outputs for all of ImageNet, and then I created two batch norm layers, and I created a little function which allows us to insert a layer into an existing model.
I inserted the layers just after the two dense layers. And then here is the key piece. I set the weights on the new batch norm layers equal to the variance and the mean, which I calculated on all of ImageNet. So I calculated the mean of each of those two layer outputs and the variance of each of those two layer outputs.
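If you want to try this yourself, here is a rough sketch of the idea in Keras 1.x style. This is not the exact notebook code: `insert_layer` and `add_identity_bn` are illustrative helpers, and `acts` stands for the pre-computed activations of the dense layer you are inserting after.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import BatchNormalization

def insert_layer(model, new_layer, index):
    """Rebuild a Sequential model with new_layer inserted at position `index`."""
    new_model = Sequential()
    for i, layer in enumerate(model.layers):
        if i == index:
            new_model.add(new_layer)
        new_model.add(layer)
    return new_model

def add_identity_bn(model, index, acts):
    """Insert a BatchNormalization layer at `index`, initialized so that it initially
    undoes its own normalization of `acts`, the pre-computed activations feeding it."""
    mu, var = acts.mean(axis=0), acts.var(axis=0)
    bn = BatchNormalization()
    model = insert_layer(model, bn, index)
    # Note: the ordering/meaning of the four weight arrays differs across Keras versions;
    # inspect bn.get_weights() before relying on this particular ordering.
    bn.set_weights([np.sqrt(var), mu, mu, var])
    return model
```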
And so that allowed me to insert these batch norm layers into an existing model. And then afterwards I evaluated it and I checked that indeed it's giving me the same answers as it was before. As well as doing that, I then thought if you train a model with batch norm from the start, you're going to end up with weights which are designed to take advantage of the fact that the activations are being normalized.
And so I thought I wonder what would happen if we now fine-tuned the ImageNet network on all of ImageNet after we added these batch norm layers. So I then tried training it for one epoch on both the ImageNet images and the horizontally flipped ImageNet images. So that's what these 2.5 million here are.
And you can see with modern GPUs, it only takes less than an hour to run the entirety of ImageNet twice. And the interesting thing was that my accuracy on the validation set went up from 63% to 67%. So adding batch norm actually improves ImageNet, which is cool. That wasn't the main reason I did it, the main reason I did it was so that we can now use VGG with batch norm in our models.
So I did all that, I saved the weights, I then edited our VGG model. So if we now look at the fully connected block in our VGG model, it now has batch norm in there. I also saved to our website a new weights file called VGG16BN for batch norm.
And so then when I did cats and dogs, I used that model. So now if you go and redownload from platform.ai the VGG16.py, it will automatically download the new weights, you will have this without any changes to your code. So I'll be interested to hear during the week if you try this out, just rerun the code you've got whether you see improvements.
And hopefully you'll find it trains more quickly and you get better results. At this stage, I've only added batch norm to the dense layers, not to the convolutional layers. There's no reason I shouldn't add it to the convolutional layers as well, I just had other things to do this week.
Since most of us are mainly fine-tuning just the dense layers, this is going to impact most of us the most anyway. So that's an exciting step which everybody can now use. As well as -- the other thing to mention is now that you'll be using batch norm by default in your VGG networks, you should find that you can increase your learning rates.
Because batch norm normalizes the activations, it makes sure that there's no activation that's gone really high or really low, and that means that generally speaking you can use higher learning rates. So if you try higher learning rates in your code than you were before, you should find that they work pretty well.
You should also find that things that previously you couldn't get to train, now will start to train. Because often the reason that they don't train is because one of the activations shoots off into really high or really low and screws everything up, and that kind of gets fixed when you use batch norm.
So there's some things to try this week, I'll be interested to hear how you go. So last week we looked at collaborative filtering. And to remind you, we had a file that basically meant something like this. We had a bunch of movies and a bunch of users, and for some subset of those combinations we had a review of that movie by that user.
The way the actual file came to us didn't look like this, this is a crosstab. The way the file came to us looked like this. Each row was a single user rating a single movie with a single rating at a single time. So I showed you in Excel how we could take the crosstab version, and we could create a table of dot products, where the dot products would be between a set of 5 random numbers for the movie and 5 random numbers for the user.
And we could then use gradient descent to optimize those sets of 5 random numbers for every user and every movie. And if we did so, we end up getting pretty decent guesses as to the original ratings. And then we went a step further in the spreadsheet and we learned how you could take the dot product and you could also add on a single bias, a movie bias and a user bias.
So we saw all that in Excel and we also learned that in Excel, Excel comes with a gradient descent solver called, funnily enough, solver. And we saw that if we ran solver, I'm telling it that these are our varying cells and this is our target cell, then it came up with some pretty decent weight matrices.
We learned that these kinds of weight matrices are called embeddings. An embedding is basically something where we can start with an integer, like 27, and look up the movie number 27's vector of weights, that's called an embedding. It's also in collaborative filtering, this particular kind of embedding is known as latent factors.
Where we hypothesized that once trained, each of these latent factors may mean something. And I said next week we might come back and have a look and see if we can figure out what they mean. So that was what I thought I would do now. So I'm going to take the bias model that we created.
The bias model we created was the one where we took a user embedding and a movie embedding, and we took the dot product of the two, and then we added to it a user bias and a movie bias where those biases are just embeddings which have a single output.
Just like in Excel, the bias was a single cell for each movie and a single cell for each user. So then we tried fitting that model. You might remember that we ended up beating the previous state-of-the-art, although actually not with this one; it was the neural network version that broke the previous state-of-the-art.
I discovered something interesting during the week, which is that I can get a state-of-the-art result using just this simple bias model, and the trick was that I just had to increase my regularization. So we haven't talked too much about regularization, we've briefly mentioned it a couple of times, but it's a very simple thing where we can basically say, add to the loss function the sum of the squares of the weights.
So we're trying to minimize the loss, and if you're adding the sum of the squares of the weights to the loss function, then the SGD solver is going to have to try to avoid increasing the weights where it can. And so we can pass to most Keras layers a parameter called W_regularizer, which stands for weight regularizer, and we can tell it how to regularize our weights.
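For example, in the Keras 1.x API that looks roughly like this (a sketch; the sizes here are stand-ins, not the real dataset's):

```python
from keras.layers import Embedding
from keras.regularizers import l2

n_users, n_factors = 1000, 50   # example values, not the actual dataset sizes

# Attach an L2 weight regularizer to an embedding layer:
user_emb = Embedding(n_users, n_factors, input_length=1, W_regularizer=l2(1e-4))
```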
In this case, I say use an L2 norm (that means the sum of the squares), and how much is something that I pass in; I used 1e-4. And it turns out if I do that and then I train it for a while, it takes quite a lot longer to train, but let's see if I've got this somewhere.
I got down to a loss of 0.7979, which is quite a bit better than the best results that that Stanford paper showed. That's not quite as good as the neural net; the neural net got 0.7938 at best. But it's still interesting that this pretty simple approach actually gets results better than the academic state-of-the-art as of 2012 or 2013, and I haven't been able to find more recent academic benchmarks than that.
So I took this model, and I wanted to find out what we can learn from these results. So obviously one thing that we would do with this model is just to make predictions with it. So if you were building a website for recommending movies, and a new user came along and said I like these movies this much, what else would you recommend?
You could just go through and do a prediction for each movie for that user ID and tell them which ones had the highest numbers. That's the normal way we would use collaborative filtering. We can do some other things. We can grab the top 2,000 most popular movies, just to make this more interesting, and we can say let's just grab the bias term.
And I'll talk more about this particular syntax in just a moment, but just for now, this is a particularly simple kind of model. It's a model which simply takes the movie ID in and returns the movie bias out. In other words, it just looks up the movie bias table and returns the movie bias indexed by this movie ID.
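In code, the idea looks something like this sketch. The names `movie_in`, `mb` and `topMovies` are assumptions standing in for the movie input tensor, the bias output of the model above, and the array of the 2,000 most-rated movie ids.

```python
import numpy as np
from keras.models import Model

# A tiny functional-API model that maps a movie id straight to its learned bias.
get_movie_bias = Model(movie_in, mb)
movie_biases = get_movie_bias.predict(np.array(topMovies))
```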
That's what those two lines in the notebook do. I then combine that bias with the actual name of each movie, and print out the top and bottom 15. So according to MovieLens, the worst movie of all time is the Church of Scientology classic Battlefield Earth. So this is interesting, because these ratings are quite a lot more sophisticated than your average movie rating.
What this is saying is that these scores have been normalized for the fact that some reviewers are more positive or negative than others, and some people are watching better or crappier films than others. So this bias is removing all of that noise and really telling us, after removing all of that noise, which are the least good movies, and Battlefield Earth is even worse than Spice World by a significant margin.
On the other hand, here are the best. Miyazaki fans will be pleased to see Howl's Moving Castle at number 2. So that's interesting. Perhaps what's more interesting is to try and figure out what's going on not in the biases but in the latent factors. The latent factors are a little bit harder to interpret, because for every movie we have 50 of them; in the Excel spreadsheet we had 5, but in our version we have 50.
So what we want to do is take those 50 latent factors and find two or three main components. The way we do this (the details aren't important, but a lot of you will already be familiar with it) is something called PCA, or Principal Components Analysis.
Principal Components Analysis does exactly what I just said. It looks through a matrix, in this case one with 50 columns, and it says: what are the combinations of columns that we can add together, because they tend to move in the same direction? And so in this case we say start with our 50 columns, and I want to create just three columns that capture all of the information of the original 50.
If you're interested in learning more about how this works, PCA is something which is kind of everywhere on the internet, so there's lots of information about it. But as I say, the details aren't important, the important thing to recognize is that we're just squishing our 50 latent factors down into 3.
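If you want to experiment, here's a minimal sketch with scikit-learn. The `movie_emb` matrix here is random just to make the example self-contained; in the real model it would be the learned movie embedding matrix of shape (n_movies, 50).

```python
import numpy as np
from sklearn.decomposition import PCA

# Squash 50 latent factors down to 3 principal components.
movie_emb = np.random.randn(2000, 50)                       # stand-in for the learned embeddings
movie_pca = PCA(n_components=3).fit_transform(movie_emb)    # shape (2000, 3)
```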
So if we look at the first PCA factor and we sort on it, we can see that at one end we have fairly well-regarded movies like The Godfather, Pulp Fiction and The Usual Suspects; these are all kind of classics. At the other end we have things like Ace Ventura and Robocop 3, which are perhaps not so classic.
So our first PCA factor is some kind of classic score. On our second one, we have something similar but actually very different. At one end we've got 10 movies that are huge Hollywood blockbusters with lots of special effects. And at the other end we have things like Annie Hall and Brokeback Mountain, which are kind of dialogue-heavy, not big Hollywood hits.
So there's another dimension. The first factor was the most important dimension by which people judge movies differently; this is the second most important one. And then the third most important one by which people judge movies differently is something where at one end we have a bunch of violent and scary movies, and at the other end we have a bunch of very happy movies.
And for those of you who haven't seen Babe: it's an Australian movie, and the happiest movie ever. It's about a small pig and its adventures and its path to success, so happiest movie ever according to MovieLens. So that's interesting. It's not saying that these factors are good or bad or anything like that; it's just saying that these are the things that, when we've done this matrix decomposition, have popped out as the ways in which people differ in their ratings of different kinds of movies.
So one of the reasons I wanted to show you this is to say that these kinds of SGD-learned many-parameter networks are not inscrutable. It's true that it's not useful to go in and look at every one of those fifty latent factor coefficients in detail, but you do have to think about how to visualize them, how to look at them.
In this case, I actually went a step further and grabbed a couple of principal components and tried drawing a picture. And with pictures, of course, you can start to see things in multiple dimensions. So here I've got the first and third principal components: at the far right-hand side we have more of the Hollywood-type movies, at the far left some of the more classic movies, at the top some of the more violent movies, and at the bottom some of the happier movies; one is so happy it's right off the bottom of the chart.
And so then if you wanted to find a movie that was violent and classic, you would go into the top left, and yeah, Kubrick's A Clockwork Orange would probably be the one most people would come up with first. Or if you wanted to come up with something that was very Hollywood and very non-violent, you would be down here in Sleepless in Seattle.
You can really learn a lot by looking at these kinds of models, but you don't do it by looking at the coefficients, you do it by visualizations, you do it by interrogating it. And so I think this is a big difference, but for any of you that have done much statistics before or have a background in the social sciences, you've spent most of your time doing regressions and looking at coefficients and t-tests and stuff.
This is a very different world. This is a world where you're asking the model questions and getting the model results, which is kind of what we're doing here. I mentioned I would talk briefly about this syntax. And this syntax is something that we're going to be using a lot more of, and it's part of what's called the Keras Functional API.
The Keras Functional API is a way of doing exactly the same things that you've already learned how to do, using a different API. That is not such a dumb idea. The API you've learned so far is the sequential API, that's where you use the word sequential, and then you write in order the layers of your neural network.
That's all very well. But what if you want to do something like what we wanted to do just now, where we had like 2 different things coming in, we had a user ID coming in and a movie ID coming in, and each one went through its own embedding, and then they got multiplied together.
How do you express that as a sequence? It's not very easy to do that. So the functional API was designed to answer this question. The first thing to note about the functional API is that you can do everything you can do in this sequential API. And here's an example of something you could do perfectly well with a sequential API, which is something with two dense layers.
But it looks a bit different. Every functional API model starts with an input layer, and then you assign that to some variable. And then you list each of the layers in order, and for each of them, after you've provided the details for that layer, you then immediately call the layer passing in the output of the previous layer.
So this passes in inputs and calls it x, and then this passes in our x, and this is our new version of x, and then this next dense layer gets the next version of x and returns the predictions. So you can see that each layer is saying what its previous layer is.
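That example looks roughly like this (essentially the Keras docs' two-dense-layer example; the 784-input, 10-output shapes are the docs' MNIST-style assumption):

```python
from keras.layers import Input, Dense
from keras.models import Model

# Every functional-API model starts with an Input; each layer is then called on the
# output of the previous one.
inputs = Input(shape=(784,))
x = Dense(64, activation='relu')(inputs)
x = Dense(64, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)

model = Model(input=inputs, output=predictions)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
```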
Each layer is saying what its previous layer is. So it's doing exactly the same thing as a sequential API, just in a different way. Now as the docs note here, the sequential model is probably a better choice to implement this particular network because it's easier. This is just showing that you can do it.
On the other hand, the model that we just looked at would be quite difficult, if not impossible, to do with the sequential model API, but with the functional API it was very easy. We created a whole separate model which gave an output u for user, and that was the result of creating an embedding: we said an embedding has its own input and then goes through an embedding layer, and then we returned both that input and the embedding output, like so.
So that gave us our user input, our user embedding, our movie input and our movie embedding. So there's like two separate little models. And then we did a similar thing to create two little models for our bias terms. They were both things that grabbed an embedding, returning a single output, and then flattened it.
And that grabbed our biases. And so now we've got four separate models, and so we can merge them. There's this function called merge. It's pretty confusing: there's a lowercase-m merge function and a capital-M Merge layer. In general, you will be using the lowercase-m merge. I'm not going to go into the details of why they're both there.
They are there for a reason, but if something weird happens to you with merge, try remembering to use the lowercase-m merge. The lowercase-m merge takes two previous outputs that you've just created using the functional API and combines them in whatever way you want, in this case the dot product.
And so that grabs our user and movie embeddings and takes the dot product. Then we grab the output of that and our user bias and add them, and the output of that and the movie bias and add them. So that's the functional API way of creating that model. At the end of which, we then use the Model function to actually create our model, saying what the inputs to the model are and what the output of the model is.
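Put together, the whole thing looks roughly like this in the Keras 1.x functional API. This is a rough reconstruction rather than the exact notebook code, with example sizes instead of the real dataset's, and `trn` in the final comment is a hypothetical ratings dataframe.

```python
from keras.layers import Input, Embedding, Flatten, merge
from keras.models import Model
from keras.regularizers import l2

n_users, n_movies, n_factors = 1000, 2000, 50   # example sizes, not the real dataset's

def embedding_input(name, n_in, n_out, reg):
    inp = Input(shape=(1,), dtype='int64', name=name)
    return inp, Embedding(n_in, n_out, input_length=1, W_regularizer=l2(reg))(inp)

def create_bias(inp, n_in):
    return Flatten()(Embedding(n_in, 1, input_length=1)(inp))

user_in, u = embedding_input('user_in', n_users, n_factors, 1e-4)
movie_in, m = embedding_input('movie_in', n_movies, n_factors, 1e-4)
ub = create_bias(user_in, n_users)
mb = create_bias(movie_in, n_movies)

x = merge([u, m], mode='dot')     # dot product of the user and movie factors
x = Flatten()(x)
x = merge([x, ub], mode='sum')    # add the user bias
x = merge([x, mb], mode='sum')    # add the movie bias

model = Model([user_in, movie_in], x)
model.compile(optimizer='adam', loss='mse')
# model.fit([trn.userId, trn.movieId], trn.rating, ...)   # two inputs, so we pass a list
```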
So you can see this is different to usual because we've now got multiple inputs. So then when we call fit, we now have to pass in an array of inputs, a user_id and movie_id. So the functional API is something that we're going to be using increasingly from now on.
Now that we've kind of learned all the basic architectures just about, we're going to be starting to build more exotic architectures for more special cases and we'll be using the functional API more and more. Is the only reason to use an embedding layer so that you can provide a list of integers as input?
That's a great question. Is the only reason to use an embedding layer so that you can use integers as input? Absolutely yes. So instead of using an embedding layer, we could have one-hot encoded all of those user IDs and one-hot encoded all of those movie IDs and created dense layers on top of them and it would have done exactly the same thing.
Green box please. Why choose 50 latent factors and then reduce them down with a principal component analysis? Why not just have 3 latent factors to begin with? I'm not quite sure why you use both. Sure. If we only use 3 latent factors, then our predictive model would have been less accurate.
So we want an accurate predictive model so that when people come to our website, we can do a good job of telling them what movie to watch. So 50 latent factors for that. But then for the purpose of our visualization of understanding what those factors are doing, we want a small number so that we can interpret them more easily.
Okay, so one thing you might want to try during the week is taking one or two of your models and converting them to use the functional API, just as a little exercise to start to get the hang of how this API looks. Are these functional models how we would add additional information to images in CNNs, say driving speed or turning radius?
Yes, absolutely. In general, the idea of adding additional information to say a CNN is basically like adding metadata. This happens in collaborative filtering a lot. You might have a collaborative filtering model that as well as having the ratings table, you also have information about what genre the movie is in, maybe the demographic information about the user.
So you can incorporate all that stuff by having additional inputs. With a CNN, for example, in the new Kaggle fish recognition competition, one of the things that turns out to be a useful predictor (this is a leakage problem) is the size of the image. So you could have another input which is just the height and width of the image as integers, have that as a separate input which is concatenated to the output of your convolutional layers after the first Flatten layer, and then your dense layers can incorporate both the convolutional outputs and your metadata. That would be a good example.
That's a great question; two great questions, in fact. So you might remember from last week that this whole thing about collaborative filtering was a journey to somewhere else, and the journey is to NLP, natural language processing. This is a question about collaborative filtering: if we need to predict the missing values, the NaNs or the 0.0s, so if a user hasn't watched a movie, what would the prediction be, or how do we go about predicting that?
So this is really the key purpose of creating this model is so that you can make predictions for movie user combinations you haven't seen before. And the way you do that is to simply do something like this. You just call model.predict and pass in a movieId userId pair that you haven't seen before.
And all that's going to do is take the dot product of that movie's latent factors and that user's latent factors, add on those biases, and return you the answer. It's that easy. And so if this was a Kaggle competition, that's how we would generate our submission: we would take their test set, which would be a bunch of user-movie pairs that we haven't seen before.
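A minimal sketch of that prediction call, where the ids 3 and 1172 are arbitrary examples and `model` is the collaborative filtering model built above:

```python
import numpy as np

# Predict the rating for a (user, movie) pair the model hasn't seen before.
model.predict([np.array([3]), np.array([1172])])
```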
Natural language processing. Collaborative filtering is extremely useful in itself. Without any doubt, it is far more commercially important right now than NLP is. Having said that, fast.ai's mission is to impact society in as positive a way as possible, and doing a better job of predicting movie ratings is not necessarily the best way to do that.
So we're maybe less excited about collaborative filtering than some people in industry are. So that's why it's not our main destination. NLP, on the other hand, can be a very big deal. If you can do a good job, for example, of reading through lots of medical journal articles or family histories and patient notes, you could be a long way towards creating a fantastic diagnostic tool to use in the developing world to help bring medicine to people who don't currently have it, which is almost as good as telling them not to watch Battlefield Earth.
They're both important. So let's talk a bit about NLP. In order to do this, we're going to look at a particular dataset. This dataset is like a really classic example of what people do with natural language processing, and it's called sentiment analysis. Sentiment analysis means that you take a piece of text, it could be a phrase, a sentence, a paragraph, or a whole document, and decide whether or not that is a positive or negative sentiment.
Keras actually comes with such a dataset, which is called the IMDB sentiment dataset. The IMDB sentiment dataset was originally developed by the Stanford AI group, and the paper about it was actually published in 2012. They talk about all the details of what people try to do with sentiment analysis.
In general, although academic papers tend to be way more math-y than they should be, the introductory sections often do a great job of capturing why this is an interesting problem, what kind of approaches people have taken, and so forth. The other reason papers are super helpful is that you can skip down to the experiment section -- every machine learning paper pretty much has an experiment section -- and find out what the score is.
So here's their score section. Here they showed that using this dataset they created of IMDb movie reviews, along with their sentiment, their full model plus an additional model got a score of 88.33% accuracy in predicting sentiment. They had another one here where they also added in some unlabeled data.
We're not going to be looking at that today; that would be a semi-supervised learning problem. So today our goal is to beat 88.33% accuracy, that being the academic state of the art for this dataset, at least as at this time. To grab it, we can just say from keras.datasets import imdb.
Keras actually kind of fiddles around with it in ways that I don't really like, so I actually copied and pasted from the Keras file these three lines to import it directly without screwing with it. So that's why rather than using the Keras dataset directly, I'm using these three lines.
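Roughly what those lines do is shown below (a sketch, not the exact three lines from the notebook; the filename and URL follow the standard Keras dataset location, so treat them as assumptions):

```python
from keras.utils.data_utils import get_file
import cPickle as pickle   # the 2016-era notebooks ran on Python 2

# Fetch the raw pickle and load it directly, without Keras's extra preprocessing.
path = get_file('imdb_full.pkl',
                origin='https://s3.amazonaws.com/text-datasets/imdb_full.pkl')
with open(path, 'rb') as f:
    (x_train, labels_train), (x_test, labels_test) = pickle.load(f)
```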
There are 25,000 movie reviews in the training set, and here's an example of one: "Bromwell High is a cartoon comedy. It ran at the same time as some other programs...". So the dataset actually does not quite come to us in this format; it actually comes to us in this format, which is a list of IDs.
And so these IDs then we can look up in the word index, which is something that they provide. And so for example, if we look at the word index, as you can see, basically maps an integer to every word. It's in order of how frequently those words appeared in this particular corpus, which is kind of handy.
So then I also create a reverse index, so it goes from ID to word. So I can see that in the very first training example, the very first word is word number 23022. So if I look up index-to-word 23022, it is the word Bromwell. And so then I just go through and map everything in that first review through index-to-word and join it together with spaces, and that's how we can turn the data that they give us back into the movie review.
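A sketch of that lookup, using the `x_train` list loaded above:

```python
from keras.datasets import imdb

# Build the reverse index (id -> word) and reconstruct the first review as text.
idx = imdb.get_word_index()                    # word -> id
idx2word = {v: k for k, v in idx.items()}      # id -> word
first_review = ' '.join(idx2word[i] for i in x_train[0])
```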
As well as providing the reviews, they also provide labels. One is positive sentiment, zero is negative sentiment. So our goal is to take these 25,000 reviews that look like this and predict whether it will be positive or negative in sentiment, and the data is actually provided to us as a list of word IDs for each review.
Is everybody clear on the problem we are trying to solve and how it's laid out? Ok, you guys are quick. So there's a couple of things we can do to make it simpler. One is we can reduce the vocabulary. So currently there are some pretty unusual words, like word number 23022 is Bromwell.
And if we're trying to figure out how to deal with all these different words, having to figure out the various ways in which the word Bromwell is used is probably not going to net as much for a lot of computation and memory cost. So we're going to truncate the vocabulary down to 5000.
And it's very easy to do that because the words are already ordered by frequency. I simply go through everything in our training set and I just say: if the word ID is below our vocab size of 5000, we leave it as it is; otherwise we replace it with a single "rare word" ID at the top of the vocabulary (4,999, as you'll see below).
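In code, that truncation is roughly this (a sketch, using the `x_train` and `x_test` lists loaded earlier):

```python
import numpy as np

vocab_size = 5000
# Keep frequent word ids as they are; map every rarer word to the single sentinel id
# vocab_size - 1 (i.e. 4,999).
trn = [np.array([i if i < vocab_size - 1 else vocab_size - 1 for i in review])
       for review in x_train]
test = [np.array([i if i < vocab_size - 1 else vocab_size - 1 for i in review])
        for review in x_test]
```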
So at the end of this, we now have replaced all of our rare words with a single ID. Here's a quick look at these sentences. The reviews are sometimes up to 2493 words long. Some people spend far too much time on IMDb. Some are as short as 10 words.
On average, they're 237 words. As you will see, we actually need to make all of our reviews the same length. Allowing this 2493-word review would again use up a lot of memory and time, so we're going to decide to truncate every review at 500 words, which is more than twice the mean.
So we're not going to lose too much. So what we now need to do is create a rectangular... Question: what if the word 5000 introduces a bias? So, we're about to learn a machine learning model, and the vast majority of the time it comes across the word 5000, it's actually going to mean 'rare word'.
It's not going to specifically mean 1987. And it's going to learn to deal with that as best as it can. The idea is the rare words don't appear too often, so hopefully this is not going to cause too much problem. We're not just using frequencies, all we're doing is we're just truncating our vocabulary.
Can you put that close to your mouth? Q. So for the word 5000, can we just replace it with some neutral word to take care of that bias? A. There's really not going to be a bias here; we're just replacing it with an arbitrary ID. The fact that occasionally the word 1987 actually pops up is totally insignificant.
We could replace it with -1, it's just a sentinel value which has no meaning. It's one of these design decisions which it's not worth spending a lot of time thinking about because it's not significant. So I just picked whatever happened to be easiest at the time. As I said, I could personally always use -1, it's just not important.
What is important is that we have to create a rectangular matrix which we can pass to our machine learning model. Quite conveniently, Keras comes with something called pad_sequences that does that for us. It takes everything greater than this length and truncates it, and everything less than that length it pads with whatever we ask for, which in this case is zeros.
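A sketch of that call, continuing from the truncated `trn` and `test` lists above:

```python
from keras.preprocessing.sequence import pad_sequences

seq_len = 500
# Truncate reviews longer than 500 words and pad shorter ones with zeros
# (at the front, which is Keras's default).
trn = pad_sequences(trn, maxlen=seq_len, value=0)
test = pad_sequences(test, maxlen=seq_len, value=0)
```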
So at the end of this, the shape of our training set is now a NumPy array of 25,000 rows by 500 columns. And as you can see, it's padded the front with zeros so that it has 500 words in it; otherwise it's exactly the same as before. And you can see that Bromwell has now been replaced, not with 5000, but with 4,999.
So this is our same movie review again after going through that padding process. I know that there's some reason that Keras decided to pad the front rather than the back. I don't recall what it is. Since it's what it does by default, I don't worry about it, I don't think it's important.
So now that we have a rectangular matrix of numbers, and we have some labels, we can use the exact techniques we've already learned to create a model. And as per usual, we should try to create the simplest possible model we can to start with. And we know that the simplest model we can is one with one hidden layer in the middle.
Or at least this is the simplest model that we generally think ought to be pretty useful for just about everything. Now here is why we started with collaborative filtering, and that's because we're starting with an embedding. So if you think about it, our input are word ids, and we want to convert that into a vector.
And that is what an embedding does. So again, rather than one-hot encoding this into a huge 5000-column input and then doing a matrix product, an embedding just says look up that word ID and grab its vector directly. So it's just a computational and memory shortcut to a one-hot encoding followed by a matrix product.
So we're creating an embedding where we are going to have 5000 latent factors, or 5000 embeddings, and each one is going to have 32 elements in this case, rather than 50. So then we're going to flatten that, have our single dense layer, a bit of dropout, and then our output through a sigmoid.
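Here is a sketch of that single-hidden-layer model in Keras 1.x style. The hidden size of 100 and the dropout rate are assumptions standing in for the notebook's values.

```python
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, Dropout

model = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len),   # 5000 words -> 32-element vectors
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
# model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)
```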
So that's a pretty simple model. You can see it's a good idea to go through and make sure you understand why all these parameter counts are what they are. That's something you can do during the week and double-check that you're comfortable with all of those. So this is the size of each of the weight matrices at each point.
And we can fit it. And after two epochs, we have 88% accuracy on the validation set. And so let's just compare that to Stanford, where they had 88.3 and we have 88.04. So we're not yet there, but we're well on the right track. This is always the question about why have X number of filters in your convolutional layer or why have X number of outputs in your dense layer.
It's just a case of trying things and seeing what works and also getting some intuition by looking at other models. In this case, I think 32 was the first I tried, I kind of felt like from my understanding of really big embedding models, which we'll learn about shortly, even 50 dimensions is enough to capture vocabularies of size 100,000 or more.
So I felt like 32 was likely to be more than enough to capture a vocabulary of size 5,000. I tried it and I got a pretty good result, and so I've basically left it there. If at some point I discovered that I wasn't getting great results, I would try increasing it.
You can always use a softmax instead of a sigmoid; it just means that you would have to change your labels, because remember our labels were just 1's and 0's, a single column. If I wanted to use a softmax, I would have to create two columns: each label wouldn't just be a 1 or a 0, it would be (1,0) or (0,1).
In the past, I've generally stuck to using softmax and then categorical cross-entropy loss just to be consistent, because then regardless of whether you have two classes or more than two classes, you can always do the same thing. In this case, I thought I want to show the other way that you can do this, which is to just have a single column output, and remember a sigmoid is exactly the same thing as a softmax if you just have a binary output.
And so rather than using categorical cross-entropy, we use binary cross-entropy and again it's exactly the same thing, it just means I didn't have to worry about one hot encoding the output because it's just a binary output. No, we don't. It's not something I have looked at. The important thing as far as I'm concerned is what is the benchmark that the Stanford people got and they compared it to a range of other previous benchmarks and they found that their technique was the best.
So that's my goal here. And I'm sure there have been other techniques that have come out since that are probably better, but I haven't seen them in any papers yet, so this is my target. You can see that we can in one second of training get an accuracy which is pretty competitive, and it's just a simple neural net.
And so hopefully you're starting to get a sense that a neural net with one hidden layer is a great starting point for nearly everything, you now know how to create a pretty good sentiment analysis model and before today you didn't, so that's a good step. So an embedding is something I think would be particularly helpful if we go back to our movie, movie lens recommendation dataset.
And remember that the actual data coming in does not look like this, but it looks like this. So when we then come along and say, okay, what do we predict the rating would be for user ID 1 for movie ID 1172, we actually have to go through our list of movie IDs and find movie ID number 31, say, and then having found 31, then look up its latent factor.
And then we have to do the same thing for user ID number 1 and find its latent factor, and then we have to multiply the two together. So that step of taking an ID and finding it in a list and returning the vector that it's attached to, that is what an embedding is.
So an embedding returns a vector which is of length 32 in this case. So the output of this is that (the None always means your mini-batch size) for each movie review, for each of the 500 words in that sequence, you're getting a 32-element vector.
And so therefore you have a mini-batch-size by 500 by 32 tensor coming out of this layer. That gets flattened, so 500 times 32 is 16,000, and that is the input into your first dense layer. Q. And I also think it might be helpful to show that for a review, instead of having it in words, what's being entered is a sequence of numbers, where each number is an index.
A. Yeah, that's right. So we look at this first review and, remembering that this word has now been truncated to 4999 while this one is still 309, it's going to take 309, look up the 309th vector in the embedding, return it, and then concatenate it to create this tensor.
So that's all an embedding is. An embedding is a shortcut to a one-hot encoding followed by a matrix product. Q. Then two other questions. Can you show us words which have similar latent features? I'm hoping these words would be synonyms or semantically similar. A. Yes, we'll see that shortly.
Q. And who made the labels, and why should I believe them, it seems difficult and subjective? A. Well that's the whole point of sentiment analysis and these kinds of things, is that it's totally subjective. So the interesting thing about NLP is that we're trying to capture something which is very subjective.
So in this case you would have to read the original paper to find out how they got these particular labels. The way that people tend to get labels is either, in this case it's the IMDB data set. IMDB has ratings, so you could just say anything higher than 8 is very positive and anything lower than 2 is very negative, and we'll throw away everything in the middle.
The other way that people tend to label academic data sets is to send it off to Amazon Mechanical Turk and pay them a few cents to label each thing. So that's the kind of ways that you can label stuff. Q. And there are places where people don't just use Mechanical Turk, but they specifically try to hire linguistics PhDs.
A. Yeah, you certainly wouldn't do that for this because the whole purpose here is to kind of capture normal people's sentiment. You would hire -- Q. We know of a team at Google that does that. A. Yeah, so for example -- and I know when I was in medicine, we went through all these radiology reports and tried to capture which ones were critical findings and which ones weren't critical findings, and we used good radiologists rather than Mechanical Turk for that purpose.
Q. So we're not considering any sentence construction or diagrams or just a bag of words and the literal set of words that are being used in a comment? A. It's not actually just a bag of words. If you think about it, this dense layer here has 1.6 million parameters.
It's connecting every one of those 500 inputs to our output. And not only that, but it's doing that for every one of the incoming factors. So it's creating a pretty complex kind of big Cartesian product of all of these weights, and so it's taking account of the position of a word in the overall sentence.
It's not terribly sophisticated, and it's not taking account of its position compared to other words, but it is taking account of whereabouts it occurs in the whole review. So it's not like -- it's the dumbest kind of model I could come up with. It's a good starting point, but we would expect that with a little bit of thought, which we're about to use, we could do a lot better.
So why don't we go ahead and do that? So the slightly better -- hopefully you guys have all predicted what that would be -- it's a convolutional neural network. And the reason I hope you predicted that is because (a) we've already talked about how CNNs are taking over the world, and (b) specifically they're taking over the world any time we have some kind of ordered data.
And clearly a sentence is ordered. One word comes after another word, it has a specific ordering. So therefore we can use a convolution. We can't use a 2D convolution because the sentence is not in 2D, a sentence is in 1D, so we're going to use a 1D convolution. So a 1D convolution is even simpler than a 2D convolution.
We're just going to grab a string of a few words, take their embeddings, and multiply them by some filter, and then we're going to move that filter along our sentence. So this is our normal next step as we gradually increase the complexity, which is to grab our simplest possible CNN: a convolution, dropout, max pooling.
And then flatten that, and then we have our dense layer and our output. So this is exactly like what we did when we were gradually improving our State Farm result, but rather than Convolution2D, we have Convolution1D. The parameters are exactly the same: how many filters do you want to create, and what is the size of your convolution?
Originally I tried 3 here, 5 turned out to be better. So I'm looking at 5 words at a time and multiplying them by each one of 64 filters. So that is going to return -- so we're going to start with the same embedding as before. So we take our sentences and we turn them into a 500x32 matrix for each of our inputs.
We then put it through our convolution, and because our convolution uses border mode 'same', we get back exactly the same shape that we gave it. We then put it through our 1D max pooling, which halves its length, and then we stick it through the same dense layers as we had before.
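A sketch of that 1D convolutional model in Keras 1.x names; the exact dropout values and the hidden size of 100 are assumptions rather than the notebook's exact settings.

```python
from keras.models import Sequential
from keras.layers import Embedding, Convolution1D, MaxPooling1D, Flatten, Dense, Dropout

conv = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len, dropout=0.2),
    Dropout(0.2),
    Convolution1D(64, 5, border_mode='same', activation='relu'),  # 64 filters, 5 words each
    Dropout(0.2),
    MaxPooling1D(),                                               # halves the sequence length
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])
conv.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```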
So that's a really simple convolutional neural network for words. Compile it, run it, and we get 89.47 compared to -- let's go back to the videotape -- without any unlabeled data, 88.33. So we have already broken the academic state-of-the-art as at when this paper was written. And again, simple convolutional neural network gets us a very, very long way.
I was going to point out that it's ten to eight, so maybe time for a break, but there's also a question: Convolution2D for images is easier to understand (element-wise multiplication and addition), but what does it mean for a sequence of words? Don't think of it as a sequence of words, because remember it's been through an embedding.
So it's a sequence of 32 element vectors. So it's doing exactly the same thing as we're doing in a 2D convolution, but rather than having 3 channels of color, we have 32 channels of embedding. So we're just going through and we're just like in our convolution spreadsheet. Remember how in the second one, once we had two filters already, our filter had to be a 3x3x2 tensor in order to allow us to create the second layer.
For us, we now don't have a 3x3x2 tensor; we have a 5x1x32, or more conveniently a 5x32, matrix. So each convolution is going to go through each of the 5 words and each of the 32 embeddings, do an element-wise multiplication, and add them all up. The important thing to remember is that once we've done the embedding layer, which is always going to be our first step for every NLP model, we don't have words anymore.
We now have vectors which are attempting to capture the information in that word in some way, just like our latent factors captured information about a movie and a user into our collaborative filtering. We haven't yet looked at what they do, we will in a moment, just like we did with the movie vectors, but we do know from our experience that SGD is going to try to fill out those 32 places with information about how that word is being used which allows us to make these predictions.
Just like when you first learned about 2D convolutions, it took you probably a few days of fiddling around with spreadsheets and pieces of paper and Python and checking inputs and outputs to get a really intuitive understanding of what a 2D convolution is doing. You may find it's the same with a 1D convolution, but it will take you probably a fifth of the time to get there because you've really done all the hard work already.
I think now is a great time to have a break, so let's come back here at 7.57. There's a couple of concepts that we come across from time to time in this class which there is no way that me lecturing to you is going to be enough to get an intuitive understanding of it.
The first clearly is the 2D convolution, and hopefully you've had lots of opportunities to experiment and practice and read, and these are things you have to tackle from many different directions to understand a 2D convolution. And 2D convolutions in a sense are really 3D because if it's in full color, you've got 3 channels.
Hopefully that's something you've all played with. And once you have multiple filters later on in your image models, you still have 3D and you've got more than 3 channels, you might have 32 filters or 64 filters. In this lesson we've introduced one much simpler concept, which is the 1D convolution, which is really a 2D convolution because just like with images we had red, green, blue, now we have the 32 or whatever embedding factors.
So that's something you will definitely need to experiment with. Create a model with just an embedding layer, look at what the output is, what does it shape, what does it look like, and then how does a 1D convolution modify that. And then trying to understand what an embedding is is kind of your next big task if you're not already feeling comfortable with it.
And if you haven't seen embeddings before today, I'm sure you won't be comfortable with them yet, because this is a big new concept. It's not in any way mathematically challenging; it's literally looking up an array and returning the thing at that ID. So an embedding looking at movie_id 3 just means: go to the third column of the matrix and return what you see.
That's all an embedding does. They couldn't be mathematically simpler, it's the simplest possible operation. Return the thing at this index. But the kind of intuitive understanding of what happens when you put an embedding into an SGD and learn a vector which turns out to be useful is something which is kind of mind-blowing because as we saw from the movie lens example, with just a dot product and this simple lookup something in an index operation, we ended up with vectors which captured all kinds of interesting features about movies without us in any way asking it to.
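A toy check of that equivalence (the numbers here are random, purely for illustration):

```python
import numpy as np

# An embedding lookup is just a one-hot encoding followed by a matrix product.
emb = np.random.randn(5, 3)          # 5 "movies", 3 latent factors
movie_id = 3
onehot = np.eye(5)[movie_id]         # one-hot vector for movie 3
assert np.allclose(np.dot(onehot, emb), emb[movie_id])
```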
So I wanted to make sure that you guys really feel like, after this class, you're going to go away and try to find a dozen different ways of looking at these concepts. One of those ways is to look at how other people explain them. Chris Olah has one of the very best technical blogs I've come across, and one I've quite often referred to in this class. In his Understanding Convolutions post, he has a very interesting example of thinking about what a dropped ball does as a convolutional operation, and he shows how you can think about a 1D convolution using this dropped-ball analogy.
Particularly if you have some background in electrical or mechanical engineering, I suspect you'll find this a very helpful example. There are many resources out there for thinking about convolutions and I hope some of you will share on the forums any that you come across. Question - so one, this is from just before the break, essentially are we training the input too?
Yeah, we are absolutely training the input, because the only input we have is 25,000 sequences of 500 integers. And so we take each of those integers and replace it with a lookup into the embedding matrix. Initially that matrix is random, just like in our Excel example: we started with a random matrix (these are all random numbers), and then we created this loss function, which was the sum of the squares of differences between the dot product and the rating.
And if we then use the gradient descent solver in Excel to solve that, it attempts to modify the two embedding matrices (as you can see, the objective is going down) to try and come up with the two embedding matrices which give us the best approximation of the original rating matrix.
So this Excel spreadsheet is something which you can play with to do exactly what our first MovieLens example is doing in Python. The only difference is that our version in Python also has L2 regularization. So this one's just finished here, and you can see these are no longer random.
We've now got two embedding matrices which have got the loss function down from 40 to 5.6, and so you can see for example these ratings are now very close to what they're meant to be. So this is exactly what Keras and SGD are doing in our Python example. Q. So my question is, is it that we've got an embedding in which each word is a vector of 32 elements?
A. Yes. Q. It's more clear in that way, no? A. Yes, exactly. Each word in our vocabulary of 5000 has been converted into a vector of 32 elements. Q. Another question is, what would be the equivalent dense network if we didn't use a 2D embedding? This is in the initial model, the simple one.
A dense layer with input of size, embedding size, we have size? A. I actually don't know what that meant, sorry. Q. Okay, next question is, does it matter that encoded values which are close by are close in color in the case of pictures, which is not true for word vectors?
For example, 254 and 255 are close as colors, but for words those IDs have no relation. A. So the important thing to realize is that the word IDs are not used mathematically in any way at all, other than as an index to look up a vector. So the fact that this is movie number 27, the number 27, is not used in any way.
We just take the number 27 and find its vector. So what's important is the values of each latent factor as to whether they're close together. So in the movie example, there were some latent factors that were something about is it a Hollywood blockbuster? And there were some latent factors that were something about is it a violent movie or not?
It's the similarity on those factors that matters. The ID is never ever used, other than is an index to simply index into a matrix to return the vector that we found. So as Yannette was mentioning, in our case now for the word embeddings, we're looking up in our embeddings to return a 32-element vector of floats that are initially random, and the model is trying to learn the 32 floats for each of our words that is semantically useful.
And in a moment we're going to look at some visualizations of that to try and understand what it's actually learned. You can apply the dropout parameter to the embedding layer itself, and what that does is it zeroes out at random 20% of each of these 32 embeddings for each word.
So it's basically avoiding overfitting the specifics of each word's embedding. This dropout, on the other hand, is removing at random some of the words, effectively, some of the whole vectors. The significance of which one to use where is not something which I've seen anybody research in depth, so I'm not sure that we have an answer that says use this amount in this place.
I just tried a few different values in different places, and it seems that putting the same amount of dropout in all these different spots works pretty well in my experiments, so it's a reasonable rule of thumb. If you find you're massively overfitting or massively underfitting, try playing around with the various values, and report back on the forum and tell us what you find.
Maybe you'll find some different, better configurations than I've come up with. I'm sure some of you will. Let's think about what's going on here. We are taking each of our 5,000 words in our vocabulary and we're replacing them with a 32 element long vector, which we are training to hopefully capture all of the information about what this word means and what it does and how it works.
You might expect intuitively that somebody might have done this before. Just like with ImageNet and VGG, you can get a pre-trained network that says, oh, if you've got an image that looks a bit like a dog, well we've had a trained network which has seen lots of dogs, so it will probably take your dog image and return some useful predictions because we've done lots of dog images before.
The interesting thing here is your dog picture and the VGG author's dog pictures are not the same. They're going to be different in all kinds of ways. To get pre-trained weights for images, you have to give somebody a whole pre-trained network, which is like 500 megabytes worth of weights in a whole architecture.
Words are much easier. In a document, the word 'dog' always appears the same way. It's the word 'dog'. It doesn't have different lighting conditions or facial expressions or whatever, it's just the word 'dog'. So the cool thing is in NLP, we don't have to pass around pre-trained networks, we can pass around pre-trained embeddings, or as they're commonly known, pre-trained word vectors.
That is to say, other people have already created big models with big text corpuses where they've attempted to build a 32-element vector, or however long vector, which captures all of the useful information about what that word is and how it behaves. So for example, if we type in 'word-vector-download', you can see that -- this is not quite what we wanted -- let's do 'word-embeddings-download'.
That's better. Lots of questions and answers and pages about where we can download pre-trained word embeddings. So, that's pretty cool. Q. But I guess what was a little unintuitive to me is that I think this means that if I train on a corpus of, I don't know, the works of Shakespeare, somehow that tells me something about how to understand movie reviews. I imagine that in some sense that's true of how language is structured and whatnot, but the word 'dog' in Shakespeare is probably used pretty differently.
We're getting to that now. The word vectors that I'm going to be using, which I don't strongly recommend but do slightly recommend, are the GloVe word vectors. The other main competition to these is called the Word2Vec word vectors. The GloVe word vectors come from a researcher named Jeffrey Pennington from Stanford.
The Word2Vec word vectors come from Google. I will mention that the TensorFlow documentation on the Word2Vec vectors is fantastic, so I would definitely highly recommend checking this out. The GloVe word vectors have been pre-trained on a number of different corpuses. One of them has been pre-trained on all of Wikipedia and a huge database full of newspaper articles -- a total of 6 billion words covering a 400,000-word vocabulary.
And they provide 50-dimensional, 100-dimensional, 200-dimensional and 300-dimensional pre-trained vectors. They have another one which has been trained on 840 billion words of a huge dump of the entire Internet. And then they have another one which has been trained on 2 billion tweets, which I believe all of the Donald Trump tweets have been carefully cleaned out prior to usage.
So in my case, what I've done is I've downloaded the 6 billion token version, and I will show you what one of these looks like. Sometimes these are cased -- you can see, for example, this particular one includes case and has 2.2 million items of vocabulary -- and sometimes they're uncased.
So we'll look at punctuation in a moment. Here is the start of the GloVe 50-dimensional word vectors trained on a corpus of 6 billion. Here is the word "the," and here are the 50 floats which attempt to capture all of the information in the word "the." Punctuation, here is the word "full stop." And so here are the 50 floats that attempt to capture all of the information captured by a full stop.
So here is the word "in," here is the word "double quote," here is "apostrophe s." So you can see that the GloVe authors have tokenized their text in a very particular way. And the idea that "apostrophe s" should be treated as a thing, that makes a lot of sense.
It certainly has that thinginess in the English language. And so indeed, the way the authors of a word-embedding corpus have chosen to tokenize their text definitely matters. And one of the things I quite like about GloVe is that they've been pretty smart, in my opinion, about how they've done this.
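If you want to poke at one of these raw files yourself, a minimal sketch of reading a GloVe text file into a dictionary might look like this (the file name is an assumption; each line is just the token followed by its floats):

```python
import numpy as np

def read_glove(path='glove.6B.50d.txt'):
    """Read a raw GloVe text file into a {token: vector-of-floats} dict."""
    vecs = {}
    with open(path, encoding='utf8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vecs[parts[0]] = np.array(parts[1:], dtype=np.float32)  # token, then its floats
    return vecs

# e.g. glove = read_glove(); glove['the'], glove['.'], glove["'s"]
```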
So the question is, how does one create word vectors in general? What is the model that you're creating and what are the labels that you're building? So one of the things that we talked about getting to at some point is unsupervised learning. And this is a great example of unsupervised learning.
We want to take 840 billion tokens of an internet dump and build a model of something. So what do we build a model of? And this is a case of unsupervised learning. We're trying to capture some structure of this data, in this case, how does English look, work and feel.
The way that this is done, at least in the Word2Vec example, is quite cool. What they do is they take every sentence of, say, 11 words long, not just every sentence, but every 11 long string of words that appears in the corpus, and then they take the middle word.
The first thing they do is they create a copy of it, an exact copy. And then in the copy, they delete the middle word and replace it with some random word. So we now have two strings of 11 words, one of which makes sense because it's real, one of which probably doesn't make sense because the middle word has been replaced with something random.
And so the model task that they create, the label is 1 if it's a real sentence, or 0 if it's a fake sentence. And that's the task they give it. So you can see it's not a directly useful task in any way, unless somebody actually comes along and says, "I just found this corpus in which somebody's replaced half of the middle words with random words." And it is something where in order to be able to tackle this task, you're going to have to know something about language.
You're going to have to be able to recognize that this sentence doesn't make sense, and this sentence does make sense. So this is a great example of unsupervised learning. Generally speaking in deep learning, unsupervised learning means coming up with a task which is as close as possible to the task you're eventually going to be interested in, but that doesn't require labels, or where labels are really cheap to generate.
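As a rough sketch of the labelling trick just described (this is the idea, not the actual Word2Vec code): take each 11-word window, make a copy with the middle word swapped for a random one, and label the real window 1 and the corrupted one 0.

```python
import random

def make_real_and_fake(tokens, window=11):
    """tokens: the corpus as a list of words; returns (window, label) pairs."""
    mid = window // 2
    examples = []
    for i in range(len(tokens) - window + 1):
        real = tokens[i:i + window]
        fake = list(real)
        fake[mid] = random.choice(tokens)  # replace the middle word with a random word
        examples.append((real, 1))         # real text  -> label 1
        examples.append((fake, 0))         # corrupted  -> label 0
    return examples
```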
So it turns out that the embedding spaces that are created when you look at, say, Hindi and Japanese turn out to be nearly the same. And so one way to translate language is to create a bunch of word vectors in English for various words, and then to create a bunch of word vectors in Japanese for various words.
And then what you can do is you can say, "Okay, I want to translate this word, which might be 'queen', to Japanese." You can basically look up and find the nearest word in the same vector space in the Japanese corpus and it turns out it works. So it's a fascinating thing about language, in fact, Google has just announced that they've replaced Google Translate with a neural translation system and part of what that is doing is basically doing this.
In fact, here are some interesting examples of some word embeddings. The word embedding for king and queen has the same distance and direction as the word embeddings for man and woman. Ditto for walking vs. walked and swimming vs. swam, and ditto for Spain vs. Madrid and Italy vs. Rome.
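You can check this kind of relationship yourself with simple vector arithmetic; a rough sketch, assuming the glove dictionary from the earlier reading sketch:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land close to queen
target = glove['king'] - glove['man'] + glove['woman']
print(cosine(target, glove['queen']))   # noticeably higher than for an unrelated word
print(cosine(target, glove['carrot']))
```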
So the embeddings that have to get learned in order to solve this stupid, meaningless, random sentence task are quite amazing. And so I've actually downloaded those GloVe embeddings, and I've pre-processed them, and I'm going to upload these for you shortly into a form that's going to be really easy for you to use in Python.
And I've created this little thing called load_glove, which loads the pre-processed stuff that I've created for you. And it's going to give you three things: the word vectors, which is the 400,000 by, in this case, 50-dimensional matrix; a list of the words -- here they are, 'the', comma, dot, 'of', 'to'; and a list of the word indexes.
So you can now take a word and call word2vec to get back its 50-dimensional array. And so then I drew a picture. In order to turn a 50-dimensional vector into something 2-dimensional that I can plot, we have to do something called dimensionality reduction. And there's a particular technique -- the details don't really matter -- called TSNE, which attempts to find a way of taking your high-dimensional information and plotting it in 2 dimensions such that things that were close in the 50 dimensions are still close in the 2 dimensions.
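A minimal sketch of that projection, assuming vecs is the 400,000 x 50 array and words is the word list that load_glove returns (scikit-learn's TSNE stands in for whatever was used in the notebook):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

n = 350                                   # the 350 most common words
coords = TSNE(n_components=2, random_state=0).fit_transform(vecs[:n])  # 50-d -> 2-d

plt.figure(figsize=(15, 15))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for i in range(n):
    plt.annotate(words[i], coords[i])     # label each point with its word
plt.show()
```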
And so I used TSNE to plot the first 350 most common words, and here they all are. And so you can see that bits of punctuation have appeared close to each other, numerals appear close to each other, written versions of numerals are close to each other, seasons, games, leagues played are all close to each other, various things about politics, school and university, president, general, prime, minister, and Bush.
Now this is a great example of where this TSNE 2-dimensional projection is misleading about the level of complexity that's actually in these word vectors. In a different projection, Bush would be very close to tree. The 2-dimensional projection is losing a lot of information. The true detail here is a lot more complex than us mere humans can see on a page.
But hopefully you get a sense of this. So all I've done here is I've just taken those 50-dimensional word vectors and plotted them in 2 dimensions. And so you can see that when you learn an embedding, you end up with something useful -- we've now seen this not just for word embeddings: for movies, we were able to plot some movies in 2 dimensions and see how they relate to each other, and we can do the same thing for words.
In general, when you have some high-dimension, high-cardinality categorical variable, whether it be lots of movies or lots of reviewers or lots of words or whatever, you can turn it into a useful, lower-dimensional space using this very simple technique of creating an embedding. The explanation on how unsupervised learning was used in Word2Vec was pretty smart.
How was it done in GloVe? I don't recall how it was done in GloVe, I believe it was something similar. I should mention though that both GloVe and Word2Vec did not use deep learning. They actually tried to create a linear model, and the reason they did that was that they specifically wanted to create representations which had these kinds of linear relationships because they felt that this would be a useful characteristic of these representations.
I'm not even sure if anybody has tried to create a similarly useful representation using a deeper model and whether that turns out to be better. Obviously with these linear models, it saves a lot of computational time as well. The embeddings, however, even though they were built using linear models, we can now use them as inputs to deep models, which is what we're about to do, just behind you Rachel.
So the Google SyntaxNet model that just came out -- was that the one you were mentioning? No, I was mentioning Word2Vec. Word2Vec has been around for 2 and a half years, 2 years. SyntaxNet is a whole framework, so -- I think it's called Parsey McParseface; that one is the one where they claim 97% accuracy, and it also returns parts of speech, so if you give it a sentence it'll say this is a noun, this is a verb.
Right. In that high-dimensional space, you can see there is information about tense, for example. So it's very easy to take a word vector and use it to create a part-of-speech recognizer; you just need a fairly small labeled corpus, and it's actually pretty easy to download a rather large labeled corpus, and build a simple model that goes from word vector to part of speech.
There's a really interesting paper called "Exploring the Limits of Language Modeling." That Parsey McParseface thing got far more PR than it deserved. It was not really an advance over the state-of-the-art language models of the time, but since that time there have been some much more interesting things. One of the interesting papers is "Exploring the Limits of Language Modeling," which is looking at what happens when you take a very, very, very large dataset and spend shitloads of Google's money on lots and lots of GPUs for a very long time, and they made some genuinely massive improvements to the state-of-the-art in language modeling.
In general, when we're talking about language modeling, we're talking about things like is this a noun or a verb, is this a happy sentence or a sad sentence, is this a formal speech or an informal speech, so on and so forth. And all of these things that NLP researchers do, we can now do super easily with these embeddings.
This uses two techniques, one of which you know and one of which you're about to know: convolutional neural networks and recurrent neural networks, specifically a type called LSTM. You can check out this paper to see how they compare. Around the same time, there's been an even newer paper that has furthered the state-of-the-art in language modeling, and it's using a convolutional neural network.
So right now, CNNs with pre-trained word embeddings are the state-of-the-art. So given that we can now download these pre-trained word embeddings, that leads to the question of why are we using randomly generated word embeddings when we do our sentiment analysis. That doesn't seem like a very good idea. And indeed, it's not a remotely good idea.
You should never do that. From now on, you should always use pre-trained word embeddings anytime you do NLP. Over the next few weeks, we will be gradually making this easier and easier. At this stage, it requires slightly less than a screen of code. You have to load the embeddings off disk, which gives you your word vectors, your words and your word indexes.
The next thing you have to do is, the word indexes that come from GloVe are going to be different to the word indexes in your vocabulary. In our case, this was the word Bromwell. In the GloVe case, it's probably not the word Bromwell. So this little piece of code is simply something that is mapping from one index to the other index.
So this createEmbedding function is then going to create an embedding matrix where the indexes are the indexes in the IMDB dataset, and the embeddings are the embeddings from GloVe. So that's what EMB now contains. This embedding matrix contains the GloVe word vectors indexed according to the IMDB dataset. So now I have simply copied and pasted the previous code and I have added this: weights equals my pre-trained embeddings.
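A sketch of what a function like createEmbedding is doing (the names idx2word and glove here are my assumptions, not the course's exact code): build a matrix whose row i is the GloVe vector for IMDB word i, falling back to a small random vector when GloVe doesn't have the word.

```python
import numpy as np

def create_emb(idx2word, glove, n_factors=50):
    """idx2word: IMDB index -> word; glove: word -> GloVe vector."""
    emb = np.zeros((len(idx2word), n_factors))
    for i, word in idx2word.items():
        if word in glove:
            emb[i] = glove[word]
        else:
            emb[i] = np.random.normal(scale=0.6, size=n_factors)  # e.g. 'bromwell' isn't in GloVe
    return emb

emb = create_emb(idx2word, glove)
```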
Since we think these embeddings are pretty good, I've set trainable to false. I won't leave it at false because we're going to fine-tune them, but we'll start it at false. One particular reason that we can't leave it at false is that sometimes I've had to create a random embedding because sometimes the word that I looked up in GloVe didn't exist.
For example, anything that finishes with apostrophe s, in GloVe they tokenize that to have apostrophe s and the word as separate tokens, but in IMDB they were combined into one token. And so all of those things, there aren't vectors for them. So I just randomly created embeddings for anything that I couldn't find in the GloVe dictionary.
But for now, let's start using just the embeddings that were given, and we will set this to non-trainable, and we will train a convolutional neural network using those embeddings for the IMDB task. And after 2 epochs, we have 89.8. Previously, with random embeddings, we had 89.5. And the academic state of the art was 88.3.
So we made significant improvements. Let's now go ahead and say first layer trainable is true, lower the learning rate a bit and do just one more epoch, and we're now up to 90.1. So we've got way beyond the academic state of the art here. We're kind of cheating because we're now not just building a model, we're now using a pre-trained word embedding model that somebody else has provided for us.
But why would you ever not do that if that exists? So you can see that we've had a big jump, and furthermore it's only taken us 12 seconds to train this network. So we started out with the pre-trained word embeddings, we set them initially to non-trainable in order to just train the layers that used them, waited until that was stable, which took really 2 epochs, and then we set them to trainable and did one more little fine tuning step.
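Putting that recipe together, here is a minimal sketch (Keras 1 style; all layer sizes, rates and variable names like trn and labels_train are assumptions) of freezing the pre-trained embeddings, training for a couple of epochs, then unfreezing, lowering the learning rate and fine-tuning:

```python
from keras.models import Sequential
from keras.layers import Embedding, Dropout, Convolution1D, MaxPooling1D, Flatten, Dense
from keras import backend as K

model = Sequential([
    Embedding(vocab_size, 50, input_length=seq_len,
              weights=[emb], trainable=False),     # start with the GloVe vectors frozen
    Dropout(0.25),
    Convolution1D(64, 5, border_mode='same', activation='relu'),
    Dropout(0.25),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2)

# now unfreeze the embeddings, drop the learning rate, and do one more epoch
model.layers[0].trainable = True
K.set_value(model.optimizer.lr, 1e-4)
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=1)
```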
And this kind of approach of these 3 epochs of training is likely to work for a lot of the NLP stuff that you'll come across for a while. Do you not need to compile the model after resetting the input layer to trainable equals true? No you don't, because the architecture of the model has not changed in any way; it's just changed the metadata attached to it.
There's never any harm in compiling the model. Sometimes if you forget to compile, it just continues to use the old model, so best to err on the side of compiling. Something that I thought was pretty cool is that during the week, one of our students here had an extremely popular post appear all over the place -- I saw it on the front page of Hacker News -- talking about how his company, Quid, uses deep learning, and I was very happy to see it was with small data, which is what we're all about.
For those of you who don't know it, Quid is a company, quite a successful startup actually, that is processing millions and millions of documents, things like patents and stuff like that, and providing enterprise customers with really cool visualizations and interactive tools that let them analyze huge datasets. And so this is by Ben Bowles, one of our students here, and he talked about how he compared three different approaches to a particular NLP classification task, one of which involved some pretty complex and slow-to-develop, carefully engineered features.
But Model 3 in this example was a convolutional neural network. So I think this is pretty cool and I was hoping to talk to Ben about this piece of work. Could you give us a little bit of context on what you were doing in this project? Yeah, so the task is about detecting marketing language from company descriptions.
So it had the flavor of being very similar to sentiment analysis: you have two classes of things that are different in some kind of semantic way. And you've got some examples here -- so one was 'our patent pending support system is engineered and designed to bring confidence and style', which is your more marketing one I guess, and 'spatial scanning software for mobile devices' is your more informative one.
Yeah, I mean the semantics of the marketing language is like, oh this is exciting. There are certain types of meanings and semantics around which the marketing tends to cluster, and I sort of realized, hey, this would be kind of a nice task for deep learning. How were these labeled, your data set in the first place?
Basically by a couple of us in the company -- we basically just found some good ones and found some bad ones and then literally tried it out. I mean, it's literally as hacky as you could possibly imagine. So yeah, it was super, super scrappy. But it actually ended up being very useful for us, and I think that's kind of a nice lesson: sometimes scrappy gets you most of the way you need. When you think about, hey, how do I get the data for my project -- well, you can actually just create it, right?
Yeah, exactly. I mean, I love this lesson because -- and this is so startup, right? When I talk to big enterprise executives, they're all about their five-year metadata and data lake repository infrastructure program, at the end of which maybe they'll actually try and get some value out of it, whereas startups are just like, okay, what have we got that we can do by Monday, let's throw it together and see if it works.
The latter approach is so much better because by Monday you know whether it kind of looks good, which kind of things are important, and you can decide on how much it's worth investing in, so that's cool. So one of the things I wanted to show is your convolutional neural network did something pretty neat, and so I wanted to use this same neat trick for our convolutional neural network, and it's a multi-size CNN.
So I mentioned earlier that when I built this CNN, I tried using a filter size of 5, and I found it better than 3. And what Ben in his blog post points out is that there's a neat paper in which they describe doing something interesting, which is not just using one size convolution, but trying a few size convolutions.
And you can see here, this is a great use of the functional API. I haven't exactly used your code, I've kind of rewritten it a little bit, but basically it's the same concept. Let's try size 3 and size 4 and size 5 convolutional filters, and so let's create a 1D convolutional filter of size 3 and then size 4 and then size 5, and then for each one, using the functional API, we'll add max pooling and we'll flatten it and we'll add it to a list of these different convolutions.
And then at the end, we'll merge them all together by simply concatenating them. So we're now going to have a single vector containing the result of the 3 and 4 and 5 size convolutions, like why settle for 1. And then let's return that whole model as a little sub-model, which in Ben's code he called graph.
The reason I assume you call this graph is because people tend to think of these things as what they call a computational graph. A computational graph basically is saying this is a computation being expressed as various inputs and outputs, so you can think of it as a graph. So once you've got this little multi-size convolution module, you can stick it inside a standard sequential model by simply replacing the Convolution1D and max pooling piece with graph, where graph is the concatenated version of all of these different scales of convolution.
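A rough sketch of that multi-size module (Keras 1 functional API; the sizes and the lowercase merge are my choices, not necessarily Ben's or the notebook's exact code):

```python
from keras.models import Model, Sequential
from keras.layers import Input, Embedding, Dropout, Convolution1D, MaxPooling1D, Flatten, Dense, merge

graph_in = Input((seq_len, 50))
convs = []
for fsz in [3, 4, 5]:
    x = Convolution1D(64, fsz, border_mode='same', activation='relu')(graph_in)
    x = MaxPooling1D()(x)
    x = Flatten()(x)
    convs.append(x)
out = merge(convs, mode='concat')   # one long vector with all three filter sizes
graph = Model(graph_in, out)        # the reusable multi-size convolution sub-model

model = Sequential([
    Embedding(vocab_size, 50, input_length=seq_len, weights=[emb]),
    Dropout(0.2),
    graph,                          # drops in where a single Convolution1D + pooling used to be
    Dropout(0.5),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```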
And so trying this out, I got a slightly better answer again, which is 90.36%. And I hadn't seen that paper before, so thank you for giving that great idea. Did you have anything to add about this multi-scale convolution idea? Not really, other than I think it's super cool. But actually I'm still trying to figure out all the ins and outs of exactly how it works.
Some ways implementation is easier than understanding. That's exactly right. In a lot of these things, the math is kind of ridiculously simple, and then you throw it at an SGD and let it do billions and billions of calculations in a fraction of a second, and what it comes up with is kind of hard to grasp.
And you are using capital M merge in this example, did you want to talk about that? Not really. Ben used capital M merge and I just did the same thing. Were it me, I would have used small M merge, so we'll have to agree to disagree here. Okay, now let's not go there.
So I think that's super fun. So we have a few minutes to talk about something enormous, so we're going to do a brief introduction. And then next week, we will do a deep dive. So everything we've learned so far about convolutional neural networks does not necessarily do a great job of solving a problem like how would you model this?
Now, I'm not quite sure what this markup is exactly, but to model it, the model has to recognize when you have a start tag and know to close that tag, but then over a longer period of time know that it's inside a weird XML comment thing and that it has to finish off the weird XML comment thing, which means it has to keep memory about what happened in the distant past if you're going to successfully do any kind of modeling with data that looks like this.
And so with that kind of memory therefore, it can handle long-term dependencies. Also think about these two different sentences. They both mean effectively the same thing, but in order to realize that, you're going to have to keep some kind of state that knows that after this has been read in, you're now talking about something that happened in 2009, and you then have to remember it all the way to here to know when it was that this thing happened that you did in Nepal.
So we want to create some kind of stateful representation. Furthermore it would be nice if we're going to deal with big long pieces of language like this with a lot of structure to be able to handle variable length sequences, so that we can handle some things that might be really long and some things that might be really short.
So these are all things which convolutional neural networks don't necessarily do that well. So we're going to look at something else which is a recurrent neural network which handles that kind of thing well. And here is a great example of a good use of a recurrent neural network. At the top here, you can see that there is a convolutional neural network that is looking at images of house numbers.
These images are coming from really big Google Street View pictures, and so it has to figure out what part of the image should I look at next in order to figure out the house number. And so you can see that there's a little square box that is scanning through and figuring out I want to look at this piece next.
And then at the bottom, you can see it's then showing you what it's actually seeing after each time step. So the thing that is figuring out where to look next is a recurrent neural network. It's something which is taking its previous state and figuring out what should its next state be.
And this kind of model is called an attentional model. And it's a really interesting avenue of research when it comes to dealing with things like very large images, images which might be too big for a single convolutional neural network with our current hardware constraints. On the left is another great example of a useful recurrent neural network, which is the very popular Android and iOS text entry system called SwiftKey.
And SwiftKey had a post up a few months ago in which they announced that they had just replaced their language model with a neural network of this kind, which basically looked at your previous words and figured out what word you are likely to be typing next, and then it could predict that word.
A final example: Andrej Karpathy showed a really cool thing where he was able to generate random mathematical papers by generating random LaTeX, and to generate random LaTeX you actually have to learn things like \begin{proof} and \end{proof} and these kinds of long-term dependencies. And he was able to do that successfully, so this is actually a randomly generated piece of LaTeX which is being created with a recurrent neural network.
So today I am not going to show you exactly how it works, I'm going to try to give you an intuition. And I'm going to start off by showing you how to think about neural networks as computational graphs. So this is coming back to that word Ben used earlier, this idea of a graph.
And so I started out by trying to draw -- this is like my notation, you won't see this anywhere else but it'll do for now -- here is a picture of a single hidden layer basic neural network. We can think of it as having an input, which is going to be of size batch size by number of inputs.
And then this arrow, this orange arrow, represents something that we're doing to that matrix. So each of the boxes represents a matrix, and each of the arrows represents one or more things we do to that. In this case, we do a matrix product and then we throw it through a rectified linear unit.
And then we get a circle which represents a matrix, but it's now a hidden layer which is of size batch size by number of activations. And number of activations is just whatever number we gave when we created that dense layer -- we would have said Dense and then some number, and that number is how many activations we create.
And then we put that through another operation, which in this case is a matrix product followed by a softmax, and so triangle here represents an output matrix. And that's going to be batch size by, if it's ImageNet, 1000. So this is my little way of representing the computation graph of a basic neural network with a single hidden layer.
I'm now going to create some slightly more complex models, but I'm going to slightly reduce the amount of stuff on the screen. One thing to note is that batch size appears all the time, so I'm going to get rid of it. So here's the same thing where I've removed batch size.
Also, the specific activation function -- who gives a shit? It's probably ReLU everywhere except the last layer where it's softmax, so I've removed that as well. Let's now look at what a convolutional neural network with a single dense hidden layer would look like. So we'd have our input, which this time will be -- and remember I've removed batch size -- number of channels by height by width. The operation, again ignoring the activation function, is going to be a convolution followed by a max pool.
Remember any shape is representing a matrix, so that gives us a matrix which will be size num_filters by height/2 by width/2, since we did a max pooling. And then we take that and we flatten it. I've put flatten in parentheses because flattening mathematically does nothing at all. Flattening is just telling Keras to think of it as a vector.
It doesn't actually calculate anything, it doesn't move anything, it doesn't really do anything. It just says think of it as being a different shape. That's why I put it in parentheses. So let's then take a matrix product, and remember I'm not putting in the activation functions anymore. So that would be our dense layer, gives us our first fully connected layer, which will be of size, number of activations, and then we put that through a final matrix product to get an output of size, number of classes.
So here is how we can represent a convolutional neural network with a single dense hidden layer. The number of activations again is the same as we had last time: it's whatever n we wrote in Dense(n). Just like the number of filters is what we write when we call Convolution2D -- we say the number of filters followed by its size.
So I'm going to now create a slightly more complex computation graph, but again I'm going to slightly simplify what I put on the screen, which is this time I'm going to remove all of the layer operations. Because now that we have removed the activation function, you can see that in every case we basically have some kind of linear thing, either a matrix product or a convolution, and optionally there might also be a max pool.
So really, this is not adding much additional information, so I'm going to get rid of it from now on. So we're now not showing the layer operations. So remember now, every arrow is representing one or more layer operations, which will generally be a convolution or a matrix product, followed by an activation function, and maybe there will be a max pooling in there as well.
So let's say we wanted to predict the third word of a three-word string based on the previous two words. Now there's all kinds of ways we could do this, but here is one interesting way, which you will now recognize you could do with Keras's functional API. Which is, we could take word1 input, and that could be either a one-hot encoded thing, in which case its size would be vocab size, or it could be an embedding of it.
It doesn't really matter either way. We then stick that through a layer operation to get a matrix output, which is our first fully connected layer. And this thing here, we could then take and put through another layer operation, but this time we could also add in the word2 input, again, either of vocab size or the embedding of it, put that through a layer operation of its own, and then when we have two arrows coming in together, that represents a merge.
And a merge could either be done as a sum, or as a concat. I'm not going to say one's better than the other, but there are two ways that we can take two input vectors and combine them together. So now at this point, we have the input from word2 after sticking that through a layer.
We have the input from word1 after sticking that through two layers. Merge them together, stick that through another layer to get our output, which we could then compare to word3 and try to train it to predict word3 from word1 and word2. So you could try this. You could try and build this network using some corpus you find online and see how it goes.
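If you do want to try it, here is a minimal sketch of that graph with the functional API (every name and size here is an assumption; the merge is done as a concat, though a sum would also work if the shapes matched):

```python
from keras.models import Model
from keras.layers import Input, Embedding, Flatten, Dense, merge

vocab_size, n_fac, n_hidden = 5000, 32, 256

def word_input(name):
    inp = Input(shape=(1,), dtype='int64', name=name)
    emb = Flatten()(Embedding(vocab_size, n_fac, input_length=1)(inp))
    return inp, emb

w1_in, w1 = word_input('word1')
w2_in, w2 = word_input('word2')

h1 = Dense(n_hidden, activation='relu')(w1)                               # state after word 1
h2 = Dense(n_hidden, activation='relu')(merge([h1, w2], mode='concat'))   # bring in word 2
out = Dense(vocab_size, activation='softmax')(h2)                         # predict word 3

model = Model([w1_in, w2_in], out)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
# model.fit([word1_ids, word2_ids], word3_ids, ...) on whatever corpus you choose
```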
Pretty obviously then, you could bring it up another level and say let's try to predict the fourth word using words 1, 2 and 3. The reason I'm doing it in this way is that each time -- going through another layer operation, then bringing in word2 and going through a layer operation, then bringing in word3 and going through a layer operation -- I am collecting state.
Each of these things has the ability to capture state about all of the words that have come so far and the order in which they've arrived. So by the time I get to predicting word4, this matrix has had the opportunity to learn what does it need to know about the previous words' orderings and how they're connected to each other and so forth in order to predict this fourth word.
So we're actually capturing state here. It's important to note that we have not yet previously built a model in Keras which has input coming in anywhere other than the first layer, but there's no reason we can't. One of you asked a great question earlier, which was could we use this to bring in metadata like the speed a car was going to add it with a convolutional neural network's image data.
I said yes we can, so in this case we're doing the same thing, which is we're bringing in an additional word's worth of data, and remember each time you see two different arrows coming in that represents a merge operation. So here's a perfectly reasonable way of trying to predict the fourth word from the previous three words.
So this leads to a really interesting question, which is: what if instead we said let's bring in our Word 1, and then we had a layer operation in order to create our hidden state, and that would be enough to predict Word 2 -- and then to predict Word 3, could we just do a layer operation from that hidden state back to itself?
And then that could be used to predict Word 3, and then run it again to predict Word 4, and run it again to predict Word 5. This is called an RNN, and everything that you see here is exactly the same structurally as everything I've shown before. The colored-in areas represent matrices, and the arrows represent layer operations.
One of the really interesting things about an RNN is each of these arrows that you see - three arrows - there's only one weight matrix attached to those. In other words, it's the equivalent thing of saying every time you see an arrow from a circle to a circle, so that would be that one and that one, those two weight matrices have to be exactly the same.
Every time you see an arrow from a rectangle to a circle, those three matrices have to be exactly the same. And then finally, you've got an arrow from a circle to a triangle, and that weight matrix is separate. The idea being that if you have a word coming in and being added to some state, why would you want to treat it differently depending on whether it's the first word in a string or the third word in a string?
Given that generally speaking, we kind of split up strings pretty much at random anyway. We're going to be having a whole bunch of 11-word strings. One of the nice things about this way of thinking about it where you have it going back to itself is that you can very clearly see there is one layer operation, one weight matrix for input to hidden, one for hidden to hidden, circle to circle, and one for hidden to output, i.e., circle to triangle.
So we're going to talk about that in a lot more detail next week. So now, I'm just going to quickly show you something in the last one minute, which is that we can train something which takes, for example, all of the text of Nietzsche, so here's a bit of his text, I've just read it in here, and we could split it up into every sequence - let's grab it here - into every sequence of length 40.
So I've gone through the whole text and grabbed every sequence of length 40. And then I've created an RNN and its goal is to take the sequence which represents the indexes from i to i+40 and predict the sequence from i+1 to i+40+1. So for every string of length maxlen, I'm trying to predict the string one character after that.
And so I can take that now and create a model which has - an LSTM is a kind of recurrent neural network, we'll talk about it next week - which has a recurrent neural network, starts of course with an embedding. And then I can train that by passing in my sentences and my sentence one character later.
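A rough sketch of that setup (the character indices, layer sizes and variable names here are all assumptions, with an LSTM as the recurrent layer, as in the lecture):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, TimeDistributed, Dense

maxlen = 40
idx = [char_indices[c] for c in text]          # the Nietzsche text as a list of character ids
sentences  = np.array([idx[i:     i + maxlen]     for i in range(len(idx) - maxlen - 1)])
next_chars = np.array([idx[i + 1: i + maxlen + 1] for i in range(len(idx) - maxlen - 1)])

model = Sequential([
    Embedding(vocab_size, 24, input_length=maxlen),
    LSTM(512, return_sequences=True),                          # the recurrent layer
    TimeDistributed(Dense(vocab_size, activation='softmax')),  # predict a character at every position
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.fit(sentences, np.expand_dims(next_chars, -1), batch_size=64, nb_epoch=1)
```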
And I can then say, okay, let's try and generate 300 characters by building a prediction of what do you think the next character would be. And so I have to seed it with something -- I don't know, I thought this felt very Nietzschean: 'ethics is a basic foundation of all that'.
And see what happens. And after training it for only a few seconds, I get ethics is a basic foundation of all that. You can get the sense that it's starting to learn a bit about the idea that - oh by the way, one thing to mention is this Nietzsche corpus is slightly annoying.
It has carriage returns after every line, so you'll see it's going to throw carriage returns in all over the place. It's got some pretty hideous formatting. So then I train it for another 30 seconds. I train it for another 30 seconds and I get to a point where it's kind of understanding the concept of punctuation and spacing.
And then I've trained it for 640 seconds and it's starting to actually create real words. And then I've trained it for another 640 seconds. And interestingly, each section of Nietzsche starts with a numbered section that looks exactly like this. It's even starting to learn to close its quotation marks.
It has also noticed that at the start of a chapter there are always three lines, so it's learned to start chapters, after another 640 seconds and another 640 seconds. And so by this time, it's actually got to a point where it's saying some things which are so obscure and difficult to understand, it could really be Nietzsche.
These char-RNN models are fun and all, but the reason this is interesting is that we're showing that we only provided that amount of text and it was able to generate text out here because it has state, it has recurrence. And what that means is that we could use this kind of model to build something like SwiftKey, where as you're typing it's saying this is the next thing you're going to type.
I would love you to think about during the week whether this is likely to help our IMDB sentiment model or not. That would be an interesting thing to talk about. Next week, we will look into the details of how RNNs work. Thanks. (audience applauds)