Lesson 5: Practical Deep Learning for Coders
00:00:00.000 |
So I wanted to start off by showing you something I'm kind of excited about, which is, here 00:00:05.120 |
is the Dogs and Cats competition, which we all know so well. 00:00:08.840 |
And it was interesting that the winner of this competition won by a very big margin. 00:00:20.080 |
It is very unusual in a Kaggle competition to see anybody win by a 50-60% margin. 00:00:25.760 |
You can see that after that, people are generally clustering around 98.1 or so. 00:00:35.900 |
This is the guy who actually created a piece of deep learning software called OverFeat. 00:00:42.480 |
So I want to show you something pretty interesting, which is that this week I tried something new, 00:00:58.360 |
and the way I did it was by using nearly only techniques I've already shown you, which is 00:01:03.700 |
basically that I created a standard dense model. 00:01:11.560 |
And then I pre-computed the last convolutional layer, and then I trained the dense model 00:01:18.680 |
lots of times, and the other thing I did was to use some data augmentation. 00:01:26.760 |
And I didn't actually have time to figure out the best data augmentation parameters, 00:01:29.560 |
so I just picked some that seemed reasonable. 00:01:31.960 |
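For reference, here is a hedged sketch of what "reasonable" augmentation parameters might look like with Keras's ImageDataGenerator. The directory path and all the values below are assumptions for illustration, not the ones used in the lecture:

```python
from keras.preprocessing.image import ImageDataGenerator

# Illustrative, untuned augmentation parameters (all values are assumptions)
gen = ImageDataGenerator(rotation_range=10, width_shift_range=0.05,
                         height_shift_range=0.05, shear_range=0.1,
                         zoom_range=0.1, horizontal_flip=True)

# 'data/dogscats/train' is a hypothetical path to the training images
batches = gen.flow_from_directory('data/dogscats/train', target_size=(224, 224),
                                  class_mode='categorical', shuffle=True, batch_size=64)
```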
I should also mention this 98.95 would be easy to make a lot better. 00:01:38.000 |
I'm not doing any pseudo-labeling here, and I'm not even using the full dataset. 00:01:44.880 |
So with those two changes we would definitely get well over 99% accuracy. 00:01:51.400 |
The missing piece is that I added batch normalization to VGG. 00:01:58.140 |
So batch normalization, if you guys remember, I said the important takeaway is that all 00:02:03.320 |
modern networks should use batch norm because you can get 10x or more improvements in training 00:02:13.760 |
speed, and because it tends to reduce overfitting. Because of that second point, it means you can use less dropout, and dropout of course is 00:02:20.200 |
destroying some of your network, so you don't want to use more dropout than necessary. 00:02:32.340 |
So VGG was kind of mid to late 2014, and batch norm was maybe early to mid 2015. 00:02:43.600 |
So why haven't people added batch norm to VGG already? 00:02:49.000 |
And the answer is actually interesting to think about. 00:02:51.380 |
So to remind you what batch norm is: batch norm is something which, first of all, normalizes 00:03:02.100 |
all of the activations by subtracting the mean and dividing by the standard deviation. 00:03:11.420 |
And I know somebody on the forum today asked why that is a good idea, and I've put a link 00:03:15.160 |
to some more information about that, so anybody who wants to know more about why we do normalization 00:03:22.680 |
can follow that link. But just doing that alone isn't enough, because SGD is quite bloody-minded, and so if it was 00:03:31.280 |
trying to de-normalize the activations because it thought that was a good thing to do, it 00:03:39.320 |
would just do that. So every time you tried to normalize them, SGD would just undo it again. 00:03:43.580 |
So what batch norm does is it adds two additional trainable parameters to each layer. 00:03:50.040 |
One which multiplies the activations and one which is added to the activations. 00:03:55.400 |
So it basically allows it to undo the normalization, but not by changing every single weight, but 00:04:03.160 |
by just changing two weights for each activation. 00:04:06.400 |
So it makes things much more stable in practice. 00:04:11.020 |
So you can't just go ahead and stick batch norm into a pre-trained network, because if 00:04:15.740 |
you do, it's going to take that layer and it's going to normalize all of the incoming activations -- 00:04:21.640 |
subtract the mean and divide by the standard deviation -- which means those pre-trained 00:04:28.240 |
weights from then on are wrong, because those weights were created for a completely different distribution of inputs. 00:04:36.280 |
So it's not rocket science, but I realized all we need to do is to insert a batch norm 00:04:45.000 |
layer and figure out what the mean and standard deviation of the incoming activations would 00:04:51.880 |
be for that dataset, and basically create the batch norm layer such that the two trainable 00:05:01.840 |
parameters exactly undo that normalization. So that way we would insert a batch norm layer and it would not change the outputs at all. 00:05:08.380 |
So I grabbed the whole of ImageNet and I created our standard dense layer model. 00:05:17.680 |
I pre-computed the convolutional outputs for all of ImageNet, and then I created two batch 00:05:24.840 |
norm layers, and I created a little function which allows us to insert a layer into an existing model. 00:05:31.560 |
I inserted the layers just after the two dense layers. 00:05:38.960 |
I set the weights on the new batch norm layers equal to the variance and the mean which 00:05:50.520 |
I had just calculated -- I calculated the mean of each of those two layer outputs and the variance of each of them. 00:05:56.560 |
And so that allowed me to insert these batch norm layers into an existing model. 00:06:03.240 |
And then afterwards I evaluated it and I checked that indeed it's giving me the same answers as before. 00:06:11.520 |
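A minimal sketch of that idea, assuming Keras 1-era APIs. `insert_identity_bn` is a hypothetical helper, not the actual function from the course code, and the BatchNormalization weight ordering should be checked against your Keras version:

```python
import numpy as np
from keras.layers import BatchNormalization
from keras.models import Sequential

def insert_identity_bn(layers, insert_after_idx, activations):
    """Rebuild a Sequential model with a BatchNormalization layer inserted after
    layers[insert_after_idx], initialized so it leaves the pre-trained activations unchanged.
    `activations` are that layer's pre-computed outputs over the dataset."""
    mean, var = activations.mean(axis=0), activations.var(axis=0)
    bn = BatchNormalization()
    model = Sequential()
    for i, layer in enumerate(layers):
        model.add(layer)
        if i == insert_after_idx:
            model.add(bn)
    # With gamma = sqrt(var) and beta = mean, gamma * (x - mean) / sqrt(var) + beta ≈ x,
    # i.e. the new layer is an identity transform on this dataset's activations.
    # Weight order assumed to be [gamma, beta, moving_mean, moving_variance]; verify for your version.
    bn.set_weights([np.sqrt(var), mean, mean, var])
    return model
```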
As well as doing that, I then thought: if you train a model with batch norm from the start, 00:06:21.120 |
you're going to end up with weights which are designed to take advantage of the fact that the activations are being normalized. 00:06:27.340 |
And so I thought I wonder what would happen if we now fine-tuned the ImageNet network 00:06:32.480 |
on all of ImageNet after we added these batch norm layers. 00:06:37.300 |
So I then tried training it for one epoch on both the ImageNet images and the horizontally flipped versions of them. 00:06:50.840 |
And you can see with modern GPUs, it takes less than an hour to run through the entirety of ImageNet. 00:07:00.480 |
And the interesting thing was that my accuracy on the validation set went up from 63% to 00:07:08.160 |
So adding batch norm actually improves ImageNet, which is cool. 00:07:12.880 |
That wasn't the main reason I did it; the main reason I did it was so that we can now 00:07:21.280 |
all use it. So I did all that, I saved the weights, and I then edited our VGG model. 00:07:38.920 |
So if we now look at the fully connected block in our VGG model, it now has batch norm in there. 00:07:53.560 |
I also saved to our website a new weights file called VGG16BN for batch norm. 00:08:02.480 |
And so then when I did cats and dogs, I used that model. 00:08:10.560 |
So now if you go and redownload the vgg16.py file from platform.ai, it will automatically download 00:08:19.680 |
the new weights, and you will have this without any changes to your code. 00:08:23.800 |
So I'll be interested to hear during the week if you try this out -- just rerun the code you've already got. 00:08:30.600 |
And hopefully you'll find it trains more quickly and you get better results. 00:08:36.520 |
At this stage, I've only added batch norm to the dense layers, not to the convolutional layers. 00:08:43.600 |
There's no reason I shouldn't add it to the convolutional layers as well, I just had other things to do. 00:08:50.760 |
Since most of us are mainly fine-tuning just the dense layers, this is the part that's going to impact you most. 00:08:58.760 |
So that's an exciting step which everybody can now use. 00:09:06.920 |
As well as -- the other thing to mention is now that you'll be using batch norm by default 00:09:12.720 |
in your VGG networks, you should find that you can increase your learning rates. 00:09:18.100 |
Because batch norm normalizes the activations, it makes sure that there's no activation that's 00:09:24.220 |
gone really high or really low, and that means that generally speaking you can use higher 00:09:30.560 |
So if you try higher learning rates in your code than you were before, you should find 00:09:36.240 |
You should also find that things that previously you couldn't get to train, now will start 00:09:42.680 |
Because often the reason that they don't train is because one of the activations shoots off 00:09:48.040 |
into really high or really low values and screws everything up, and that kind of thing gets fixed by batch norm. 00:09:56.400 |
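For example, something like the following -- assuming `vgg.model` is the underlying Keras model from the course's Vgg16 wrapper, and with 1e-3 as an illustrative value to try, not a recommendation from the lecture:

```python
from keras.optimizers import Adam

# A higher learning rate than you'd normally dare with plain VGG (value is illustrative)
vgg.model.compile(optimizer=Adam(lr=1e-3),
                  loss='categorical_crossentropy', metrics=['accuracy'])
```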
So there's some things to try this week, I'll be interested to hear how you go. 00:10:02.040 |
So last week we looked at collaborative filtering. 00:10:08.280 |
And to remind you, we had a file that basically looked something like this. 00:10:16.400 |
We had a bunch of movies and a bunch of users, and for some subset of those combinations we had a rating. 00:10:27.440 |
The way the actual file came to us didn't look like this, this is a crosstab. 00:10:32.440 |
The way the file came to us looked like this. 00:10:36.160 |
Each row was a single user rating a single movie with a single rating at a single time. 00:10:42.560 |
So I showed you in Excel how we could take the crosstab version, and we could create a 00:10:51.440 |
table of dot products, where the dot products would be between a set of 5 random numbers 00:11:00.200 |
for the movie and 5 random numbers for the user. 00:11:04.040 |
And we could then use gradient descent to optimize those sets of 5 random numbers for each movie and each user. 00:11:13.520 |
And if we did so, we end up getting pretty decent guesses as to the original ratings. 00:11:21.440 |
And then we went a step further in the spreadsheet and we learned how you could take the dot 00:11:27.240 |
product and you could also add on a single bias, a movie bias and a user bias. 00:11:35.480 |
So we saw all that in Excel, and we also learned that Excel comes with a gradient 00:11:42.860 |
descent solver called, funnily enough, Solver. 00:11:46.440 |
And we saw that if we ran Solver, telling it that these are our varying cells and this 00:11:53.720 |
is our target cell, it came up with some pretty decent weight matrices. 00:12:01.100 |
We learned that these kinds of weight matrices are called embeddings. 00:12:05.120 |
An embedding is basically something where we can start with an integer, like 27, and 00:12:10.280 |
look up the movie number 27's vector of weights, that's called an embedding. 00:12:16.320 |
Also, in collaborative filtering, this particular kind of embedding is known as the latent factors, 00:18:25.720 |
Where we hypothesized that once trained, each of these latent factors may mean something. 00:12:33.300 |
And I said next week we might come back and have a look and see if we can figure out what those factors mean. 00:12:41.920 |
So I'm going to take the bias model that we created. 00:12:47.400 |
The bias model we created was the one where we took a user embedding and a movie embedding, 00:12:57.600 |
and we took the dot product of the two, and then we added to it a user bias and a movie 00:13:05.600 |
bias where those biases are just embeddings which have a single output. 00:13:11.200 |
Just like in Excel, the bias was a single cell for each movie and a single cell for each user. 00:13:21.040 |
So then we tried fitting that model, and you might remember that we ended up getting an 00:13:27.680 |
accuracy that was quite a bit higher than the previous state-of-the-art. 00:13:36.840 |
Actually, for that one we didn't -- the previous state-of-the-art we broke by using the neural net version. 00:13:43.600 |
I discovered something interesting during the week, which is that I can get a state-of-the-art 00:13:48.600 |
result using just this simple bias model, and the trick was that I just had to increase the amount of regularization. 00:13:57.720 |
So we haven't talked too much about regularization, we've briefly mentioned it a couple of times, 00:14:01.640 |
but it's a very simple thing where we can basically say: add to the loss function the sum of the squares of the weights. 00:14:10.120 |
So we're trying to minimize the loss, and so if you're adding the sum of the squares 00:14:14.720 |
of the weights to the loss function, then the SGD solver is going to have to try to keep the weights small. 00:14:24.200 |
And so we can pass to most Keras layers a parameter called W_regularizer, which stands 00:14:33.680 |
for weight regularizer, and we can tell it how to regularize our weights. 00:14:37.000 |
In this case, I say use an L2 norm -- that means the sum of the squares -- and how much to weight it is 00:14:44.160 |
something that I pass in, and I used 1e-4. 00:14:49.660 |
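In Keras 1 syntax that looks something like the following. `n_users`, `n_movies` and `n_factors` stand for the dataset sizes and the number of latent factors (50 in the notebook), and are assumptions here:

```python
from keras.layers import Embedding
from keras.regularizers import l2

# L2 (sum of squares) weight regularization on each embedding, weighted by 1e-4.
# Keras 1 calls this W_regularizer; Keras 2 renamed it embeddings_regularizer for Embedding layers.
user_emb = Embedding(n_users, n_factors, input_length=1, W_regularizer=l2(1e-4))
movie_emb = Embedding(n_movies, n_factors, input_length=1, W_regularizer=l2(1e-4))
```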
And it turns out if I do that, and then I train it for a while, it takes quite a lot 00:14:55.000 |
longer to train, but let's see if I've got this somewhere. 00:15:02.260 |
I got down to a loss of 0.7979, which is quite a bit better than the best results that that 00:15:12.840 |
paper reported. That's not quite as good as the neural net -- the neural net got 0.7938 at best. 00:15:19.920 |
But it's still interesting that this pretty simple approach actually gets results better 00:15:28.660 |
than the academic state-of-the-art as of 2012 or 2013, and I haven't been able to find more recent benchmarks than that. 00:15:41.060 |
So I took this model, and I wanted to find out what we can learn from these results. 00:15:52.080 |
So obviously one thing that we would do with this model is just to make predictions with 00:15:56.080 |
So if you were building a website for recommending movies, and a new user came along and said 00:16:03.120 |
I like these movies this much, what else would you recommend? 00:16:06.760 |
You could just go through and do a prediction for each movie for that user ID and tell them which ones they would probably like best. 00:16:13.400 |
That's the normal way we would use collaborative filtering. 00:16:18.480 |
We can grab the top 2,000 most popular movies, just to make this more interesting, and we 00:16:29.640 |
can grab the bias term for each of them. And I'll talk more about this particular syntax in just a moment, but just for now, this is 00:16:38.140 |
a model which simply takes a movie ID in and returns the movie bias out. 00:16:43.600 |
In other words, it does nothing but look up that movie in the movie bias table and return its bias. 00:16:54.840 |
I then combine that bias with the actual name of each movie, and print out the top and bottom of the list. 00:17:01.980 |
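A sketch of that step, assuming `movie_in` and `mb` are the movie-id Input and the flattened movie-bias embedding output from the bias model, and `top_movies` is an array of the 2,000 most-rated movie ids (all names here are illustrative):

```python
import numpy as np
from keras.models import Model

get_movie_bias = Model(movie_in, mb)              # movie id in, learned bias out
biases = get_movie_bias.predict(top_movies).squeeze()
worst = top_movies[np.argsort(biases)[:15]]       # most negative biases
best = top_movies[np.argsort(biases)[-15:]]       # most positive biases
```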
So according to MovieLens, the worst movie of all time is the Church of Scientology classic Battlefield Earth. 00:17:12.760 |
So this is interesting, because these ratings are quite a lot more sophisticated than a simple average rating. 00:17:20.440 |
What this is saying is that these have been normalized for the fact that some reviewers are more positive than others, 00:17:29.520 |
and some people are watching better or crappier films than others, and so this bias is removing 00:17:34.280 |
all of that noise and really telling us, after removing all of that noise, these are the least 00:17:40.280 |
good movies -- and Battlefield Earth is even worse than Spice World by a significant margin. 00:17:49.120 |
On the other hand, here are the best. Miyazaki fans will be pleased to see Howl's Moving Castle near the top. 00:18:02.640 |
Perhaps what's more interesting is to try and figure out what's going on not in the biases but in the latent factors. 00:18:19.120 |
The latent factors are a little bit harder to interpret, because for every movie we have 00:18:24.680 |
50 of them -- in the Excel spreadsheet we had 5, but in our version we have 50. 00:18:30.120 |
So what we want to do is take those 50 latent factors and find two or three combinations of them that capture as much of the information as possible. 00:18:41.440 |
The way we do this, the details aren't important but a lot of you will already be familiar 00:18:46.640 |
with it, which is that there's something called PCA, or Principal Components Analysis. 00:18:52.040 |
Principal Components Analysis does exactly what I just said. 00:18:55.360 |
It looks through a matrix, in this case it's got 50 columns, and it says what are the combinations 00:19:01.140 |
of columns that we can add together because they tend to move in the same direction. 00:19:06.240 |
And so in this case we say start with our 50 columns, and I want to create just three 00:19:10.120 |
columns that capture all of the information of the original 50. 00:19:15.240 |
If you're interested in learning more about how this works, PCA is something which is 00:19:19.080 |
kind of everywhere on the internet, so there's lots of information about it. 00:19:22.120 |
But as I say, the details aren't important, the important thing to recognize is that we're 00:19:26.080 |
just squishing our 50 latent factors down into 3. 00:19:30.440 |
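A hedged sketch of that with scikit-learn, assuming `movie_emb` is the trained (n_movies, 50) movie embedding matrix pulled out of the model:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
movie_pca = pca.fit_transform(movie_emb)          # shape (n_movies, 3)
fac0, fac1, fac2 = movie_pca[:, 0], movie_pca[:, 1], movie_pca[:, 2]
```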
So if we look at the first PCA factor, and we sort on it, we can see that at one end 00:19:38.400 |
we have fairly well-regarded movies like "The Godfather", "Pulp Fiction", "The Usual Suspects" and so forth. 00:19:53.000 |
At the other end we have things like Ace Ventura and Robocop 3, which are perhaps not so classic. 00:19:59.600 |
So our first PCA factor is some kind of classic score. 00:20:07.680 |
On our second one, we have something similar but actually very different. 00:20:12.040 |
At one end we've got 10 movies that are huge Hollywood blockbusters with lots of special effects. 00:20:20.360 |
And at the other end we have things like Annie Hall and Brokeback Mountain, which are kind of the opposite of that. 00:20:30.200 |
So there's another dimension there, and it's the second most important one. 00:20:35.500 |
The first factor is the most important dimension by which people judge movies differently. 00:20:39.960 |
This is the second most important one by which people judge movies differently. 00:20:43.600 |
And then the third most important one by which people judge movies differently is something 00:20:47.240 |
where at one end we have a bunch of violent and scary movies, and at the other end we 00:20:57.260 |
have some very happy movies. And for those of you who haven't seen Babe -- Australian movie, happiest movie ever. 00:21:02.440 |
It's about a small pig and its adventures and its path to success, so happiest movie ever. 00:21:11.680 |
It's not saying that these factors are good or bad or anything like that, it's just saying 00:21:17.120 |
that these are the things that when we've done this matrix decomposition have popped 00:21:23.560 |
out as being the ways in which people are differing in their ratings for different kinds 00:21:30.360 |
So one of the reasons I wanted to show you this is to say that these kinds of SGD-learned latent factors are something you can actually interpret. 00:21:45.160 |
Admittedly it's not great to go in and look at every one of those fifty latent factor coefficients 00:21:50.080 |
in detail, but you have to think about how to visualize them, how to look at them. 00:21:56.960 |
In this case, I actually went a step further and I grabbed a couple of principal components and plotted them against each other. 00:22:04.680 |
And so with pictures, of course, you can start to see things in multiple dimensions. 00:22:09.920 |
And so here I've got the first and third principal components, and you can see on the far right-hand 00:22:14.900 |
side here we have more of the Hollywood type movies, and at the far left some of the more 00:22:21.400 |
classic movies, and at the top some of the more violent movies, and at the bottom some of the 00:22:25.240 |
happier movies -- and one is so far towards happy that it's right off the bottom of the chart. 00:22:31.160 |
And so then if you wanted to find a movie that was violent and classic, you would go 00:22:37.920 |
into the top left, and yeah, Kubrick's A Clockwork Orange would probably be the one most people would pick. 00:22:43.360 |
Or if you wanted to come up with something that was very Hollywood and very non-violent, 00:22:50.600 |
you would be down here in Sleepless in Seattle. 00:22:53.960 |
You can really learn a lot by looking at these kinds of models, but you don't do it by looking 00:23:01.360 |
at the coefficients, you do it by visualizations, you do it by interrogating it. 00:23:07.240 |
And so I think this is a big difference, but for any of you that have done much statistics 00:23:11.680 |
before or have a background in the social sciences, you've spent most of your time doing 00:23:16.120 |
regressions and looking at coefficients and t-tests and stuff. 00:23:20.940 |
This is a world where you're asking the model questions and getting the model's answers, which is quite a different way of working. 00:23:32.000 |
I mentioned I would talk briefly about this syntax. 00:23:37.800 |
And this syntax is something that we're going to be using a lot more of, and it's part of what's called the Keras functional API. 00:23:46.440 |
The Keras functional API is a way of doing exactly the same things that you've already 00:23:57.680 |
learned how to do, but with a lot more flexibility.
The API you've learned so far is the sequential API, that's where you use the word sequential, 00:24:03.200 |
and then you write in order the layers of your neural network. 00:24:08.840 |
But what if you want to do something like what we wanted to do just now, where we had 00:24:13.040 |
like 2 different things coming in, we had a user ID coming in and a movie ID coming in, 00:24:18.200 |
and each one went through its own embedding, and then they got multiplied together. 00:24:26.440 |
So the functional API was designed to answer this question. 00:24:31.360 |
The first thing to note about the functional API is that you can do everything you can 00:24:37.480 |
And here's an example of something you could do perfectly well with the sequential API, which 00:24:46.080 |
is just a simple stack of dense layers. Every functional API model starts with an input layer, and then you assign that to some variable. 00:24:54.240 |
And then you list each of the layers in order, and for each of them, after you've provided 00:25:00.280 |
the details for that layer, you then immediately call the layer, passing in the output of the previous layer. 00:25:08.680 |
So this passes in inputs and calls the result x, and then this passes in our x, and this is our 00:25:14.580 |
new version of x, and then this next dense layer gets that version of x and returns the output. 00:25:20.080 |
So you can see that each layer is saying what its previous layer is. 00:25:24.640 |
So it's doing exactly the same thing as a sequential API, just in a different way. 00:25:33.080 |
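In code, that pattern looks like this -- a generic example in the spirit of the Keras docs, with arbitrary layer sizes:

```python
from keras.layers import Input, Dense
from keras.models import Model

inputs = Input(shape=(784,))                      # the input "layer", assigned to a variable
x = Dense(64, activation='relu')(inputs)          # each layer is called on the previous output
x = Dense(64, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)
model = Model(inputs, predictions)                # finally say what goes in and what comes out
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
```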
Now as the docs note here, the sequential model is probably a better choice to implement a network like this one. 00:25:47.440 |
On the other hand, the model that we just looked at would be quite difficult, if not 00:25:54.420 |
impossible to do with the sequential model API, but with the functional API, it was very 00:26:01.360 |
We created a whole separate model which gave an output u for the user, and that was the result 00:26:09.240 |
of creating an embedding, where we said an embedding has its own input and then goes 00:26:14.680 |
through an embedding layer, and then we returned both the input to that and the embedding layer's output. 00:26:21.040 |
So that gave us our user input, our user embedding, our movie input and our movie embedding. 00:26:29.640 |
And then we did a similar thing to create two little models for our bias terms. 00:26:33.320 |
They were both things that grabbed an embedding returning a single output, and then flattened it. 00:26:41.720 |
And so now we've got four separate models, and so we can merge them -- Keras has both a capital-M Merge layer and a small-m merge function. 00:26:52.520 |
In general, you will be using the small m merge. 00:26:55.120 |
I'm not going to go into the details of why they're both there. 00:27:00.360 |
If something weird happens to you with merge, try remembering to use the small m merge. 00:27:05.560 |
The small m merge takes two previous outputs that you've just created using the functional 00:27:11.240 |
API and combines them in whatever way you want, in this case the dot product. 00:27:18.320 |
And so that grabs our user and movie embeddings and takes the dot product. 00:27:23.240 |
We grab the output of that and our user bias and take the sum, and then the output of that and our movie bias and take the sum again. 00:27:33.440 |
So that's a functional API to creating that model. 00:27:38.600 |
At the end of which, we then use the model function to actually create our model, saying 00:27:44.640 |
what are the inputs to the model and what is the output of the model. 00:27:49.800 |
So you can see this is different to usual because we've now got multiple inputs. 00:27:54.720 |
So then when we call fit, we now have to pass in an array of inputs, a user_id and movie_id. 00:28:03.600 |
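Putting those pieces together, here is a sketch of the whole bias model in Keras 1 style. The sizes, the `ratings` dataframe, and the training settings are assumptions for illustration:

```python
from keras.layers import Input, Embedding, Flatten, merge
from keras.models import Model
from keras.regularizers import l2

user_in = Input(shape=(1,), dtype='int64', name='user_in')
movie_in = Input(shape=(1,), dtype='int64', name='movie_in')
u = Embedding(n_users, n_factors, input_length=1, W_regularizer=l2(1e-4))(user_in)
m = Embedding(n_movies, n_factors, input_length=1, W_regularizer=l2(1e-4))(movie_in)
ub = Flatten()(Embedding(n_users, 1, input_length=1)(user_in))    # user bias
mb = Flatten()(Embedding(n_movies, 1, input_length=1)(movie_in))  # movie bias

x = merge([u, m], mode='dot')      # dot product of user and movie embeddings
x = Flatten()(x)
x = merge([x, ub], mode='sum')     # add the user bias
x = merge([x, mb], mode='sum')     # add the movie bias

model = Model([user_in, movie_in], x)             # two inputs, one output
model.compile(optimizer='adam', loss='mse')
model.fit([ratings.userId, ratings.movieId], ratings.rating,
          batch_size=64, nb_epoch=1)              # note the list of inputs when fitting
```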
So the functional API is something that we're going to be using increasingly from now on. 00:28:10.400 |
Now that we've kind of learned just about all the basic architectures, we're going to be 00:28:15.360 |
starting to build more exotic architectures for more special cases, and we'll be using the functional API to do that. 00:28:22.960 |
Is the only reason to use an embedding layer so that you can provide a list of integers -- 00:28:30.520 |
that is, is the only reason to use an embedding layer so that you can use integers as input? 00:28:36.040 |
So instead of using an embedding layer, we could have one-hot encoded all of those user 00:28:40.120 |
IDs and one-hot encoded all of those movie IDs and created dense layers on top of them 00:28:45.680 |
and it would have done exactly the same thing. 00:28:54.640 |
Why choose 50 latent factors and then reduce them down with a principal component analysis? 00:29:00.720 |
Why not just have 3 latent factors to begin with? 00:29:05.720 |
If we only use 3 latent factors, then our predictive model would have been less accurate. 00:29:15.560 |
So we want an accurate predictive model so that when people come to our website, we can 00:29:21.240 |
do a good job of telling them what movie to watch. 00:29:26.360 |
But then for the purpose of our visualization of understanding what those factors are doing, 00:29:30.580 |
we want a small number so that we can interpret them more easily. 00:29:34.960 |
Okay, so one thing you might want to try during the week is taking one or two of your models 00:29:43.120 |
and converting them to use the functional API. 00:29:45.960 |
Just as a little thing, you could try to start to get the hang of how this API looks. 00:29:52.200 |
Are these functional models how we would add additional information to images in CNNs, like metadata about each image? 00:30:02.480 |
In general, the idea of adding additional information to, say, a CNN basically means having multiple inputs. 00:30:10.040 |
This happens in collaborative filtering a lot. 00:30:12.720 |
You might have a collaborative filtering model that as well as having the ratings table, 00:30:18.840 |
you also have information about what genre the movie is in, maybe the demographic information 00:30:25.820 |
So you can incorporate all that stuff by having additional inputs. 00:30:30.240 |
And so with a CNN, for example, in the new Kaggle fish recognition competition, one of the things 00:30:41.640 |
that turns out to be a useful predictor -- this is a leakage problem -- is the size of the image. 00:30:47.340 |
So you could have another input which is the height and width of the image, just as integers, 00:30:52.600 |
and have that as a separate input which is concatenated to the output of your convolutional 00:30:56.760 |
layers after the first flatten layer, and then your dense layers can incorporate 00:31:01.240 |
both the convolutional outputs and your metadata. That would be a good example. 00:31:06.520 |
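Here is a hypothetical sketch of that idea with the functional API. The conv-feature shape, layer sizes and class count are illustrative, not taken from the lecture:

```python
from keras.layers import Input, Dense, Dropout, Flatten, merge
from keras.models import Model

conv_feat = Input(shape=(512, 14, 14))            # pre-computed convolutional outputs (shape assumed)
size_in = Input(shape=(2,))                       # height and width of the original image
x = Flatten()(conv_feat)
x = merge([x, size_in], mode='concat')            # concatenate metadata onto the conv features
x = Dense(512, activation='relu')(x)
x = Dropout(0.5)(x)
out = Dense(8, activation='softmax')(x)           # e.g. the 8 fish classes
model = Model([conv_feat, size_in], out)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```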
That's a great question, two great questions. 00:31:10.240 |
So you might remember from last week that this whole thing about collaborative filtering was really part of a journey. 00:31:19.720 |
And the journey is to NLP, natural language processing. 00:31:30.680 |
This is a question about collaborative filtering. 00:31:33.320 |
So if we need to predict the missing values -- the NaNs or the 0.0 -- so if a user hasn't watched 00:31:42.440 |
a movie, what would be the prediction or how do we go about predicting that? 00:31:53.840 |
So this is really the key purpose of creating this model is so that you can make predictions 00:32:00.840 |
for movie user combinations you haven't seen before. 00:32:06.160 |
And the way you do that is to simply do something like this. 00:32:14.340 |
You just call model.predict and pass in a movieId userId pair that you haven't seen 00:32:22.560 |
And all that's going to do is it's going to take the dot product of that movie's latent 00:32:28.440 |
factors and that user's latent factors and add on those biases and return you back the predicted rating. 00:32:38.680 |
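Something like this, where the ids are just examples:

```python
import numpy as np

# Predict the rating for a (user, movie) pair that has no rating in the data
model.predict([np.array([3]), np.array([6])])
```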
And so if this was a Kaggle competition, that's how we would generate our submission: 00:32:45.040 |
we would take their test set, which would be a bunch of 00:32:48.000 |
movie-user pairs that we haven't seen before, and predict each one. 00:32:58.280 |
Collaborative filtering is extremely useful in itself. 00:33:05.000 |
Without any doubt, it is far more commercially important right now than NLP is. 00:33:10.600 |
Having said that, fast.ai's mission is to impact society in as positive a way as possible, 00:33:18.400 |
and doing a better job of predicting movie ratings is not necessarily the best way to do that. 00:33:23.160 |
So we're maybe less excited about collaborative filtering than some people in industry are. 00:33:30.720 |
NLP, on the other hand, can be a very big deal. 00:33:34.780 |
If you can do a good job, for example, of reading through lots of medical journal articles 00:33:42.040 |
or family histories and patient notes, you could be a long way towards creating a fantastic 00:33:48.340 |
diagnostic tool to use in the developing world to help bring medicine to people who don't 00:33:53.160 |
currently have it, which is almost as good as telling them not to watch Battlefield Earth. 00:34:04.400 |
In order to do this, we're going to look at a particular dataset. 00:34:09.000 |
This dataset is like a really classic example of what people do with natural language processing, which is sentiment analysis. 00:34:17.840 |
Sentiment analysis means that you take a piece of text -- it could be a phrase, a sentence, 00:34:23.680 |
a paragraph, or a whole document -- and decide whether or not that text expresses a positive or negative sentiment. 00:34:33.520 |
Keras actually comes with such a dataset, which is called the IMDb sentiment dataset. 00:34:41.320 |
The IMDb sentiment dataset was originally developed by the Stanford AI group, and the paper that introduced it is worth a read. 00:35:03.680 |
They talk about all the details about what people try to do with sentiment analysis. 00:35:09.760 |
In general, although academic papers tend to be way more math-y than they should be, 00:35:17.080 |
the introductory sections often do a great job of capturing why this is an interesting 00:35:22.520 |
problem, what kind of approaches people have taken, and so forth. 00:35:26.120 |
The other reason papers are super helpful is that you can skip down to the experiment 00:35:29.920 |
section -- every machine learning paper pretty much has an experiment section -- and find out what result they got. 00:35:40.160 |
Here they showed that using this dataset they created of IMDb movie reviews, along with 00:35:46.000 |
their sentiment, their full model plus an additional model got a score of 88.33% accuracy 00:35:57.720 |
They had another one here where they also added in some unlabeled data. 00:36:01.720 |
We're not going to be looking at that today, that would be a semi-supervised learning problem. 00:36:05.360 |
So today our goal is to beat 88.33% accuracy, that being the academic state of the art for this dataset. 00:36:19.440 |
To grab it, we can just say from keras.datasets import imdb. 00:36:23.840 |
Keras actually kind of fiddles around with it in ways that I don't really like, so I 00:36:27.480 |
actually copied and pasted from the Keras file these three lines to import it directly 00:36:34.680 |
So that's why rather than using the Keras dataset directly, I'm using these three lines. 00:36:43.220 |
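For reference, the standard loader and word index look like this. Note that `imdb.load_data()` remaps the ids (reserving a few low ids for special tokens), which is the "fiddling" referred to above; the lecture loads the raw data via lines copied from the Keras source instead:

```python
from keras.datasets import imdb

(x_train, labels_train), (x_test, labels_test) = imdb.load_data()
idx = imdb.get_word_index()   # maps word -> integer id
```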
There are 25,000 movie reviews in the training set, and here's an example of one: 00:36:53.960 |
"Bromwell High is a cartoon comedy. It ran at the same time as some other programs..." 00:37:01.600 |
So the dataset actually does not quite come to us in this format -- it actually comes to 00:37:13.840 |
us as lists of word IDs. And so these IDs we can then look up in the word index, which is something that they provide. 00:37:22.800 |
And so for example, if we look at the word index, as you can see, it basically maps every word to an integer. 00:37:44.480 |
The integers are in order of how frequently those words appeared in this particular corpus, which is convenient. 00:37:51.480 |
So then I also create a reverse index, which goes from ID to word. 00:37:59.480 |
So I can see that in the very first training example, the very first word is word number 23022. 00:38:07.700 |
So if I look up 23022 in the index-to-word mapping, it is the word Bromwell. 00:38:13.460 |
And so then I just go through and I map everything in that first review through index-to-word and 00:38:20.280 |
join it together with a space, and that's how we can turn the data that they give us back into readable text. 00:38:29.520 |
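That step is just a dictionary reversal. This sketch assumes `x_train` holds the reviews as lists of raw word ids matching `idx`, as loaded in the lecture:

```python
# id -> word, then rebuild the first review as text
idx2word = {v: k for k, v in idx.items()}
' '.join(idx2word[o] for o in x_train[0])
```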
As well as providing the reviews, they also provide labels. 00:38:34.080 |
One is positive sentiment, zero is negative sentiment. 00:38:39.640 |
So our goal is to take these 25,000 reviews that look like this and predict whether each one 00:38:46.200 |
is positive or negative in sentiment, and the data is actually provided to us as those lists of word IDs plus the labels. 00:38:54.960 |
Is everybody clear on the problem we are trying to solve and how it's laid out? 00:39:03.080 |
So there's a couple of things we can do to make it simpler. 00:39:08.400 |
So currently there are some pretty unusual words -- like word number 23022, Bromwell. 00:39:15.760 |
And if we're trying to figure out how to deal with all these different words, having to 00:39:21.440 |
figure out the various ways in which the word Bromwell is used is probably not going to 00:39:25.680 |
gain us much for a lot of computation and memory cost. 00:39:28.880 |
So we're going to truncate the vocabulary down to 5000. 00:39:32.000 |
And it's very easy to do that because the words are already ordered by frequency. 00:39:37.320 |
I simply go through everything in our training set and I just say: if the word ID is less 00:39:45.420 |
than this vocab size of 5000, we'll leave it as it is, otherwise we'll replace it with the last ID in the vocabulary. 00:39:54.160 |
So at the end of this, we now have replaced all of our rare words with a single ID. 00:40:04.720 |
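A sketch of that clipping step, using `vocab_size - 1` as the single "rare word" id, consistent with the 4999 that shows up later:

```python
import numpy as np

vocab_size = 5000
# Any id at or above the cutoff becomes the sentinel id vocab_size - 1
trn = [np.array([i if i < vocab_size - 1 else vocab_size - 1 for i in review]) for review in x_train]
test = [np.array([i if i < vocab_size - 1 else vocab_size - 1 for i in review]) for review in x_test]
```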
The reviews are sometimes up to 2493 words long. 00:40:17.960 |
As you will see, we actually need to make all of our reviews the same length. 00:40:26.040 |
Allowing this 2493 word review would again use up a lot of memory and time. 00:40:32.880 |
So we're going to decide to truncate every review at 500 words. 00:40:37.320 |
And that's more than twice as big as the mean review length. 00:40:43.280 |
So what we now need to do is create a rectangular matrix. Question: what if using the word 5000 for all the rare words introduces a bias? 00:40:52.840 |
So we're about to learn a machine learning model, and so the vast majority of the time 00:41:21.160 |
it comes across the word 5000, it's actually going to mean 'rare word'. 00:41:29.560 |
And it's going to learn to deal with that as best as it can. 00:41:34.280 |
The idea is the rare words don't appear too often, so hopefully this is not going to cause 00:41:40.720 |
We're not just using frequencies, all we're doing is we're just truncating our vocabulary. 00:41:57.840 |
Question: so for that word 5000, can we just replace it with some neutral word to take care of that? 00:42:11.800 |
The fact that occasionally the word 1987 actually pops up is totally insignificant. 00:42:20.700 |
We could replace it with -1, it's just a sentinel value which has no meaning. 00:42:27.540 |
It's one of these design decisions which it's not worth spending a lot of time thinking about 00:42:53.920 |
So I just picked whatever happened to be easiest at the time. 00:42:58.440 |
As I said, I could personally always use -1, it's just not important. 00:43:04.600 |
What is important is that we have to create a rectangular matrix in which every review has the same length. 00:43:17.480 |
So quite conveniently Keras comes with something called pad_sequences that does that for us. 00:43:22.280 |
It takes everything greater than this length and truncates it, and everything less than 00:43:28.520 |
that length it pads with whatever we've asked for, which in this case is zeros. 00:43:34.920 |
So at the end of this, the shape of our training set is now a NumPy array of 25,000 rows by 500 columns. 00:43:43.460 |
And as you can see, it's padded the front with zeros, such that it has 500 words in 00:43:53.000 |
And you can see that Bromwell has now been replaced not with 5000, but with 4999. 00:43:59.320 |
So this is our same movie review again after going through that padding process. 00:44:08.320 |
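The padding step is one call per dataset:

```python
from keras.preprocessing import sequence

seq_len = 500
# Truncate long reviews and pad short ones with zeros (at the front, by default)
trn = sequence.pad_sequences(trn, maxlen=seq_len, value=0)
test = sequence.pad_sequences(test, maxlen=seq_len, value=0)
print(trn.shape)   # (25000, 500)
```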
I know that there's some reason that Keras decided to pad the front rather than the back. 00:44:16.800 |
Since it's what it does by default, I don't worry about it, I don't think it's important. 00:44:23.880 |
So now that we have a rectangular matrix of numbers, and we have some labels, we can use 00:44:30.240 |
the exact techniques we've already learned to create a model. 00:44:34.880 |
And as per usual, we should try to create the simplest possible model we can to start 00:44:39.880 |
And we know that the simplest model we can is one with one hidden layer in the middle. 00:44:44.600 |
Or at least this is the simplest model that we generally think ought to be pretty useful 00:44:51.760 |
Now here is why we started with collaborative filtering, and that's because we're starting 00:44:57.600 |
So if you think about it, our inputs are word IDs, and we want to convert each one into a vector. 00:45:08.100 |
So again, rather than one-hot encoding this into a 5000-column-long huge input thing and 00:45:17.600 |
then doing a matrix product, an embedding just says: look up that word ID and grab its vector directly. 00:45:27.180 |
So it's just a computational and memory shortcut to creating a one-hot encoding followed by a matrix product. 00:45:36.380 |
So we're creating an embedding where we are going to have 5000 latent factors, or 5000 embeddings -- one per word. 00:45:44.400 |
Each one is going to have 32 items in this case, rather than 50. 00:45:53.400 |
So then we're going to flatten that, have our single dense layer, a bit of dropout, and then our output. 00:46:04.520 |
You can see it's a good idea to go through and make sure you understand why all these shapes are what they are. 00:46:10.000 |
That's something you can do during the week -- double-check that you're comfortable with it. 00:46:15.040 |
So this is the size of each of the weight matrices at each point. 00:46:23.160 |
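Here is a sketch of that single-hidden-layer model; the hidden size and dropout values below are plausible choices, not necessarily the exact ones used:

```python
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, Dropout

model = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len),  # 5000 words -> 32-dim vectors
    Flatten(),                                        # 500 x 32 -> 16,000
    Dense(100, activation='relu'),                    # the single hidden layer (size assumed)
    Dropout(0.7),                                     # dropout amount assumed
    Dense(1, activation='sigmoid')])                  # single-column sentiment output
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(trn, labels_train, validation_data=(test, labels_test),
          nb_epoch=2, batch_size=64)                  # nb_epoch in Keras 1, epochs in Keras 2
```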
And after two epochs, we have 88% accuracy on the validation set. 00:46:36.960 |
And so let's just compare that to Stanford, where they had 88.3 and we have 88.04. 00:46:52.320 |
So we're not yet there, but we're well on the right track. 00:46:55.120 |
This is always the question about why have X number of filters in your convolutional 00:47:06.880 |
layer or why have X number of outputs in your dense layer. 00:47:11.280 |
It's just a case of trying things and seeing what works and also getting some intuition 00:47:19.440 |
In this case, I think 32 was the first I tried, I kind of felt like from my understanding 00:47:27.160 |
of really big embedding models, which we'll learn about shortly, even 50 dimensions is 00:47:32.760 |
enough to capture vocabularies of size 100,000 or more. 00:47:37.520 |
So I felt like 32 was likely to be more than enough to capture a vocabulary of size 5,000. 00:47:42.280 |
I tried it and I got a pretty good result, and so I've basically left it there. 00:47:46.920 |
If at some point I discovered that I wasn't getting great results, I would try increasing it. 00:47:58.520 |
You can always use a softmax instead of a sigmoid, it just means that you would have 00:48:04.240 |
to change your labels, because remember our labels were just 1's or 0's -- a single column. 00:48:15.500 |
If I wanted to use a softmax, I would have to create two columns. 00:48:18.440 |
It wouldn't just be a 1 or a 0, it would be (1, 0) or (0, 1). 00:48:25.320 |
In the past, I've generally stuck to using softmax and then categorical cross-entropy 00:48:30.380 |
loss just to be consistent, because then regardless of whether you have two classes or more than 00:48:35.080 |
two classes, you can always do the same thing. 00:48:38.600 |
In this case, I thought I want to show the other way that you can do this, which is to 00:48:43.440 |
just have a single column output -- and remember, a sigmoid is exactly the same thing as a two-class softmax. 00:48:53.920 |
And so rather than using categorical cross-entropy, we use binary cross-entropy and again it's 00:48:57.600 |
exactly the same thing, it just means I didn't have to worry about one hot encoding the output 00:49:16.480 |
The important thing as far as I'm concerned is what is the benchmark that the Stanford 00:49:21.680 |
people got and they compared it to a range of other previous benchmarks and they found 00:49:28.520 |
And I'm sure there have been other techniques that have come out since that are probably 00:49:32.560 |
better, but I haven't seen them in any papers yet, so this is my target. 00:49:41.280 |
You can see that we can, in one second of training, get an accuracy which is pretty competitive with the state of the art. 00:49:51.760 |
And so hopefully you're starting to get a sense that a neural net with one hidden layer 00:49:57.720 |
is a great starting point for nearly everything, you now know how to create a pretty good sentiment 00:50:03.080 |
analysis model and before today you didn't, so that's a good step. 00:50:07.160 |
So an embedding is something I think would be particularly helpful to explain if we go back to our spreadsheet. 00:50:41.520 |
And remember that the actual data coming in does not look like this crosstab; it looks like the list of individual ratings. 00:50:51.520 |
So when we then come along and say, okay, what do we predict the rating would be for 00:50:56.160 |
user ID 1 for movie ID 1172, we actually have to go through our list of movie IDs and find 00:51:08.280 |
where that movie is -- say it's number 31 in the list -- and then, having found it, look up its latent factors. 00:51:15.400 |
And then we have to do the same thing for user ID number 1 and find its latent factor, 00:51:19.920 |
and then we have to multiply the two together. 00:51:21.900 |
So that step of taking an ID, finding it in a list, and returning the vector it corresponds to -- that's what an embedding does. 00:51:31.260 |
So an embedding returns a vector which is of length, in this case 32. 00:51:40.160 |
So in the output of this, the None always means your mini-batch size. 00:51:46.840 |
So for each movie review, for each of the 500 words in that sequence, you're getting a 32-element vector. 00:51:59.500 |
And so therefore you have a mini-batch size by 500 by 32 tensor coming out of this layer. 00:52:07.520 |
That gets flattened, so 500 times 32 is 16,000, and that is the input into your first dense layer. 00:52:16.840 |
Q. And I also think it might be helpful to show that for a review, instead of having 00:52:23.920 |
it in words, it's being entered as a sequence of numbers, where each number is -- 00:52:32.160 |
A. So we look at this first review and we take -- and remember this has now been truncated 00:52:37.120 |
to 4999; this one is still 309, so it's going to take 309, and it's going to look up the 309th 00:52:44.320 |
vector in the embedding, and it's going to return it, and then it's going to concatenate all of those vectors together. 00:52:54.880 |
An embedding is a shortcut to a one-hot encoding followed by a matrix product. 00:53:04.520 |
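You can convince yourself of that equivalence with a few lines of NumPy:

```python
import numpy as np

emb = np.random.randn(5, 3)             # a tiny embedding matrix: 5 "words", 3 factors each
word_id = 2
one_hot = np.zeros(5)
one_hot[word_id] = 1
# One-hot times the matrix picks out exactly the same row as indexing does
assert np.allclose(np.dot(one_hot, emb), emb[word_id])
```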
Can you show us words which have similar latent features? 00:53:07.600 |
I'm hoping these words would be synonyms or semantically similar. 00:53:13.160 |
Q. And who made the labels, and why should I believe them, it seems difficult and subjective? 00:53:18.960 |
A. Well, that's the whole point of sentiment analysis and these kinds of things -- it is somewhat subjective. 00:53:24.680 |
So the interesting thing about NLP is that we're trying to capture something which is inherently fuzzy. 00:53:31.560 |
So in this case you would have to read the original paper to find out how they got these labels. 00:53:40.880 |
The way that people tend to get labels varies; in this case it's the IMDb dataset. 00:53:46.600 |
IMDB has ratings, so you could just say anything higher than 8 is very positive and anything 00:53:51.600 |
lower than 2 is very negative, and we'll throw away everything in the middle. 00:53:57.080 |
The other way that people tend to label academic data sets is to send it off to Amazon Mechanical 00:54:02.400 |
Turk and pay them a few cents to label each thing. 00:54:07.120 |
So that's the kind of ways that you can label stuff. 00:54:10.920 |
Q. And there are places where people don't just use Mechanical Turk, but they specifically pay experts to do the labeling? 00:54:18.560 |
A. Yeah, you certainly wouldn't do that for this because the whole purpose here is to 00:54:28.080 |
Q. We know of a team at Google that does that. 00:54:30.360 |
A. Yeah, so for example -- and I know when I was in medicine, we went through all these 00:54:36.200 |
radiology reports and tried to capture which ones were critical findings and which ones 00:54:39.960 |
weren't critical findings, and we used good radiologists rather than Mechanical Turk for 00:54:45.920 |
Q. So we're not considering any sentence construction or bigrams -- just a bag of words and the 00:54:55.120 |
literal set of words that are being used in a comment? 00:55:01.060 |
If you think about it, this dense layer here has 1.6 million parameters. 00:55:06.480 |
It's connecting every one of those 500 inputs to our output. 00:55:14.840 |
And not only that, but it's doing that for every one of the incoming factors. 00:55:22.160 |
So it's creating a pretty complex kind of big Cartesian product of all of these weights, 00:55:30.240 |
and so it's taking account of the position of a word in the overall sentence. 00:55:35.820 |
It's not terribly sophisticated, and it's not taking account of its position compared 00:55:40.600 |
to other words, but it is taking account of whereabouts it occurs in the whole review. 00:55:46.920 |
So it's not like -- it's the dumbest kind of model I could come up with. 00:55:55.520 |
It's a good starting point, but we would expect that with a little bit of thought, which we're about to apply, we can do better. 00:56:06.920 |
So the slightly better model -- hopefully you guys have all predicted what that would be -- is a convolutional neural network. 00:56:12.960 |
And the reason I hope you predicted that is because (a) we've already talked about how 00:56:16.360 |
CNNs are taking over the world, and (b) specifically they're taking over the world any time we 00:56:26.220 |
have data with some kind of ordering. One word comes after another word; it has a specific ordering. 00:56:34.580 |
We can't use a 2D convolution because a sentence is not in 2D -- a sentence is in 1D. 00:56:42.020 |
So a 1D convolution is even simpler than a 2D convolution. 00:56:45.420 |
We're just going to grab a string of a few words, and we're going to take their embeddings, 00:56:51.760 |
and we're going to take that string, and we're going to multiply it by some filter. 00:56:56.240 |
And then we're going to move that sequence along our sentence. 00:57:01.520 |
So this is our normal next place we go as we try to gradually increase the complexity, 00:57:10.920 |
which is to grab our simplest possible CNN, which is a convolution, dropout, max pooling. 00:57:18.940 |
And then flatten that, and then we have our dense layer and our output. 00:57:22.880 |
So this is exactly like what we did when we were looking at gradually improving our State Farm model, 00:57:29.320 |
But rather than having convolution 2D, we have convolution 1D. 00:57:37.160 |
How many filters do you want to create, and what is the size of your convolution? 00:57:42.480 |
Originally I tried 3 here, 5 turned out to be better. 00:57:46.920 |
So I'm looking at 5 words at a time and multiplying them by each one of 64 filters. 00:57:54.440 |
So that is going to return -- so we're going to start with the same embedding as before. 00:58:02.060 |
So we take our sentences and we turn them into a 500x32 matrix for each of our inputs. 00:58:10.680 |
We then put it through our convolution, and because our convolution has a border mode 00:58:16.240 |
is same, we get back exactly the same shape that we gave it. 00:58:21.320 |
We then put it through our 1D max pooling and that will halve its size, and then we 00:58:25.360 |
stick it through the same dense layers as we had before. 00:58:29.540 |
So that's a really simple convolutional neural network for words. 00:58:34.400 |
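Here is a sketch of that 1D convolutional model in Keras 1 syntax; the dense size and dropout amounts are assumptions, while the 64 filters of length 5 match what is described above:

```python
from keras.models import Sequential
from keras.layers import Embedding, Dropout, Convolution1D, MaxPooling1D, Flatten, Dense

conv = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len, dropout=0.2),  # dropout on the embeddings themselves
    Dropout(0.2),                                                  # dropout of whole words
    Convolution1D(64, 5, border_mode='same', activation='relu'),   # 64 filters, 5 words at a time
    Dropout(0.2),
    MaxPooling1D(),                                                # halves the sequence length
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])
conv.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
conv.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)
```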
Compile it, run it, and we get 89.47 -- compared to, let's go back to the videotape, 88.33 without the unlabeled data. 00:58:55.680 |
So we have already broken the academic state-of-the-art as at when this paper was written. 00:59:00.240 |
And again, simple convolutional neural network gets us a very, very long way. 00:59:06.800 |
I was going to point out it's 10 to 8 -- maybe time for a break -- but there's also a question. 00:59:12.480 |
Convolution2D for images is easier to understand -- element-wise multiplication and addition -- but how does a 1D convolution work on a sequence of words? 00:59:23.280 |
Don't think of it as a sequence of words because remember it's been through an embedding. 00:59:31.840 |
So it's doing exactly the same thing as we're doing in a 2D convolution, but rather than 00:59:37.840 |
having 3 channels of color, we have 32 channels of embedding. 00:59:45.760 |
So we're just going through it, exactly like in our convolution spreadsheet. 00:59:58.840 |
Remember how in the second one, once we had two filters already, our filter had to be 01:00:05.080 |
a 3x3x2 tensor in order to allow us to create the second layer. 01:00:14.440 |
For us, we now don't have a 3x3x2 tensor, we have a 5x1x32, or more conveniently, a 5x32 tensor. 01:00:26.560 |
So each convolution is going to go through each of the 5 words and each of the 32 embeddings, 01:00:32.560 |
do an element-wise multiplication, and add them all up. 01:00:36.840 |
So the important thing to remember is that once we've done the embedding layer, which 01:00:42.440 |
is always going to be our first step for every NLP model, is that we don't have words anymore. 01:00:48.080 |
We now have vectors which are attempting to capture the information in that word in some 01:00:54.880 |
way, just like our latent factors captured information about a movie and a user in our collaborative filtering model. 01:01:03.720 |
We haven't yet looked at what they do, we will in a moment, just like we did with the 01:01:09.080 |
movie vectors, but we do know from our experience that SGD is going to try to fill out those 01:01:16.360 |
32 places with information about how that word is being used, which allows us to make good predictions. 01:01:26.800 |
Just like when you first learned about 2D convolutions, it took you probably a few days 01:01:31.800 |
of fiddling around with spreadsheets and pieces of paper and Python and checking inputs and 01:01:37.680 |
outputs to get a really intuitive understanding of what a 2D convolution is doing. 01:01:42.440 |
You may find it's the same with a 1D convolution, but it will take you probably a fifth of the 01:01:47.240 |
time to get there because you've really done all the hard work already. 01:01:55.120 |
I think now is a great time to have a break, so let's come back here at 7.57. 01:02:07.680 |
There's a couple of concepts that we come across from time to time in this class for which 01:02:16.080 |
there is no way that me lecturing to you is going to be enough to give you an intuitive understanding of them. 01:02:21.920 |
The first clearly is the 2D convolution, and hopefully you've had lots of opportunities 01:02:28.320 |
to experiment and practice and read -- these are things you have to tackle from many different directions. 01:02:36.880 |
And 2D convolutions in a sense are really 3D, because if it's in full color, you've got three channels. 01:02:41.840 |
Hopefully that's something you've all played with. 01:02:44.280 |
And once you have multiple filters later on in your image models, you still have 3D data and 01:02:50.640 |
you've got more than 3 channels -- you might have 32 filters or 64 filters. 01:02:55.680 |
In this lesson we've introduced one much simpler concept, which is the 1D convolution, which 01:03:06.920 |
is really a 2D convolution, because just like with images we had red, green, blue, now we have 32 embedding channels. 01:03:18.080 |
So that's something you will definitely need to experiment with. 01:03:23.200 |
Create a model with just an embedding layer, look at what the output is, what its shape is, 01:03:27.160 |
what it looks like, and then how a 1D convolution modifies that. 01:03:36.360 |
And then trying to understand what an embedding is is kind of your next big task, if you're 01:03:45.600 |
not already comfortable with them. And if you haven't seen them before today, I'm sure you won't be yet, because this is a big new concept. 01:03:52.640 |
It's not in any way mathematically challenging. 01:03:55.760 |
It's literally looking up an index in an array and returning the thing at that ID. 01:04:00.700 |
So an embedding looking at movie_id 3 just says: go to the third column of the matrix and return what's there. 01:04:13.000 |
They couldn't be mathematically simpler, it's the simplest possible operation. 01:04:20.640 |
But the kind of intuitive understanding of what happens when you put an embedding into 01:04:26.180 |
an SGD and learn a vector which turns out to be useful is something which is kind of 01:04:35.600 |
mind-blowing because as we saw from the movie lens example, with just a dot product and 01:04:45.040 |
this simple lookup something in an index operation, we ended up with vectors which captured all 01:04:52.960 |
kinds of interesting features about movies without us in any way asking it to. 01:04:58.240 |
So I wanted to make sure that you guys really felt like after this class, you're going to 01:05:06.520 |
go away and try and find a dozen different ways of looking at these concepts. 01:05:12.560 |
One of those ways is to look at how other people explain them. 01:05:15.800 |
And Chris Olah has one of the very, very best technical blogs I've come across, and it's quite 01:05:23.640 |
often referred to in this class, and in his Understanding Convolutions post, he actually 01:05:28.760 |
has a very interesting example of thinking about what a dropped ball does as a convolutional 01:05:35.440 |
operation and he shows how you can think about a 1D convolution using this dropped ball analogy. 01:05:44.520 |
Particularly if you have some background in electrical or mechanical engineering, I suspect 01:05:52.880 |
There are many resources out there for thinking about convolutions and I hope some of you 01:05:58.480 |
will share on the forums any that you come across. 01:06:01.640 |
Question -- this one is from just before the break: essentially, are we training the input embeddings as well? 01:06:12.200 |
Yeah, we are absolutely training the input, because the only input we have is 25,000 sequences of word IDs. 01:06:24.000 |
And so we take each of those integers and replace them with a lookup into an embedding matrix. 01:06:32.400 |
Initially that matrix is random, just like in our Excel example. 01:06:37.840 |
We started with a random matrix -- these are all random numbers -- and then we created this 01:06:45.840 |
loss function, which was the sum of the squares of the differences between the dot product and the actual rating. 01:06:52.400 |
And if we then use the gradient descent solver in Excel to solve that, it attempts to modify 01:07:01.280 |
the two embedding matrices (as you can see, the objective is going down) to try and come 01:07:10.640 |
up with the two embedding matrices which give us the best approximation of the original ratings. 01:07:18.720 |
So this Excel spreadsheet is something which you can play with to do exactly what our Python version is doing. 01:07:33.440 |
The only difference is that our version in Python also has L2 regularization. 01:07:45.760 |
So this one's just finished here, so you can see these are no longer random numbers. 01:07:51.520 |
We've now got two embedding matrices which have got the loss function down from 40 to 01:07:55.360 |
5.6, and so you can see, for example, these ratings are now very close to what they're supposed to be. 01:08:03.240 |
So this is exactly what Keras and SGD are doing in our Python example. 01:08:08.280 |
Q. So my question is: is it that we've got an embedding in which each word is a vector? 01:08:20.960 |
Each word in our vocabulary of 5000 has been converted into a vector of 32 elements. 01:08:27.560 |
Q Another question is, what would be the equivalent 01:08:31.960 |
dense network if we didn't use a 2D embedding? 01:08:35.440 |
This is in the initial model, the simple one. 01:08:39.720 |
A dense layer with input of size, embedding size, we have size? 01:08:44.880 |
A I actually don't know what that meant, sorry. 01:08:48.760 |
Q Okay, next question is, does it matter that encoded values which are close by are close 01:08:57.000 |
in color in the case of pictures, which is not true for word vectors? 01:09:01.000 |
For example, 254 and 255 are close as colors, but for words the IDs have no such relation. 01:09:12.200 |
A. No, it doesn't matter -- the word IDs are not used mathematically in any way at all, other than as an index to look up the corresponding vector. 01:09:20.720 |
So the fact that this is movie number 27, the number 27 is not used in any way. 01:09:26.160 |
We just take the number 27 and find its vector. 01:09:31.040 |
So what's important is the values of each latent factor as to whether they're close 01:09:37.040 |
So in the movie example, there were some latent factors that were something about is it a 01:09:42.080 |
And there were some latent factors that were something about is it a violent movie or not? 01:09:48.080 |
It's the similarity on those factors that matters. 01:09:51.800 |
The ID is never ever used, other than is an index to simply index into a matrix to return 01:10:01.080 |
So as Yannette was mentioning, in our case now for the word embeddings, we're looking 01:10:05.840 |
up in our embeddings to return a 32-element vector of floats that are initially random, 01:10:15.000 |
and the model is trying to learn the 32 floats for each of our words that are semantically useful. 01:10:24.640 |
And in a moment we're going to look at some visualizations of that to try and understand what it has learned. 01:10:29.600 |
You can apply the dropout parameter to the embedding layer itself, and what that does 01:10:50.440 |
is it zeroes out at random 20% of each of these 32 embeddings for each word. 01:11:00.160 |
So it's basically avoiding overfitting the specifics of each word's embedding. 01:11:06.080 |
This dropout, on the other hand, is removing at random some of the words, effectively, from each review. 01:11:16.120 |
The significance of which one to use where is not something which I've seen anybody research 01:11:23.560 |
in depth, so I'm not sure that we have an answer that says use this amount in this place. 01:11:30.920 |
I just tried a few different values in different places, and it seems that putting the same 01:11:36.280 |
amount of dropout in all these different spots seems to work pretty well in my experiments. 01:11:44.100 |
If you find you're massively overfitting or massively underfitting, try playing around 01:11:48.920 |
with the various values and report back on the forum and tell us what you find. 01:11:52.880 |
Maybe you'll find some different, better configurations than I've come up with. 01:12:08.840 |
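For reference, here is a rough sketch of where those two kinds of dropout sit, written against the Keras-1-style API used in this lesson; the layer sizes are placeholders and the 0.2 values are just the ones discussed above, not tuned.

```python
from keras.models import Sequential
from keras.layers import Embedding, Dropout, Convolution1D, MaxPooling1D, Flatten, Dense

model = Sequential([
    # dropout argument on the Embedding layer itself (the per-embedding dropout discussed above)
    Embedding(5000, 32, input_length=500, dropout=0.2),
    # a separate Dropout layer applied to the embedding output (the word-level dropout discussed above)
    Dropout(0.2),
    Convolution1D(64, 5, border_mode='same', activation='relu'),
    Dropout(0.2),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid')])
```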
We are taking each of our 5,000 words in our vocabulary and we're replacing them with a 01:12:14.040 |
32 element long vector, which we are training to hopefully capture all of the information 01:12:21.720 |
about what this word means and what it does and how it works. 01:12:26.880 |
You might expect intuitively that somebody might have done this before. 01:12:31.720 |
Just like with ImageNet and VGG, you can get a pre-trained network that says, oh, if you've 01:12:37.920 |
got an image that looks a bit like a dog, well we've had a trained network which has 01:12:43.000 |
seen lots of dogs, so it will probably take your dog image and return some useful predictions 01:12:50.320 |
because we've done lots of dog images before. 01:12:53.640 |
The interesting thing here is your dog picture and the VGG author's dog pictures are not 01:13:00.920 |
They're going to be different in all kinds of ways. 01:13:04.560 |
To get pre-trained weights for images, you have to give somebody a whole pre-trained 01:13:09.200 |
network, which is like 500 megabytes worth of weights in a whole architecture. 01:13:17.740 |
In a document, the word 'dog' always appears the same way. 01:13:24.880 |
It doesn't have different lighting conditions or facial expressions or whatever, it's just 01:13:30.740 |
So the cool thing is in NLP, we don't have to pass around pre-trained networks, we can 01:13:36.720 |
pass around pre-trained embeddings, or as they're commonly known, pre-trained word vectors. 01:13:43.440 |
That is to say, other people have already created big models with big text corpuses 01:13:49.280 |
where they've attempted to build a 32-element vector, or however long vector, which captures 01:13:57.080 |
all of the useful information about what that word is and how it behaves. 01:14:02.200 |
So for example, if we type in 'word-vector-download', you can see that -- this is not quite what 01:14:16.040 |
we wanted -- let's do 'word-embeddings-download'. 01:14:23.720 |
Lots of questions and answers and pages about where we can download pre-trained word embeddings. 01:14:40.120 |
But I guess what was a little unintuitive to me is that I think this means that if I can 01:14:48.680 |
train a corpus on, I don't know, the works of Shakespeare, somehow that tells me something 01:14:53.080 |
about how I can understand movie reviews, and I imagine that in some sense that's true about 01:15:01.680 |
how language is structured and whatnot, but the meaning of the word 'dog' in Shakespeare 01:15:05.640 |
is probably going to be used pretty differently. 01:15:20.880 |
The word vectors that I'm going to be using (and I don't strongly recommend them, but do slightly recommend them) are the GloVe word vectors. 01:15:29.160 |
The other main competition to these is called the Word2Vec word vectors. 01:15:34.440 |
The GloVe word vectors come from a researcher named Jeffrey Pennington from Stanford. 01:15:44.640 |
I will mention that the TensorFlow documentation on the Word2Vec vectors is fantastic. 01:15:51.720 |
So I would definitely highly recommend checking this out. 01:15:56.400 |
The GloVe word vectors have been pre-trained on a number of different corpuses. 01:16:07.320 |
One of them has been pre-trained on all of Wikipedia and a huge database full of newspaper 01:16:13.360 |
articles -- a total of 6 billion words covering a 400,000-word vocabulary. 01:16:21.880 |
And they provide 50-dimensional, 100-dimensional, 200-dimensional and 300-dimensional pre-trained vectors. 01:16:29.140 |
They have another one which has been trained on 840 billion words of a huge dump of the entire Internet (the Common Crawl). 01:16:39.360 |
And then they have another one which has been trained on 2 billion tweets, which I believe 01:16:43.880 |
all of the Donald Trump tweets have been carefully cleaned out prior to usage. 01:16:50.480 |
So in my case, what I've done is I've downloaded the 6 billion token version, and I will show you what that looks like. 01:17:19.480 |
Sometimes these are cased, so you can see for example this particular one includes case. 01:17:27.400 |
There are 2.2 million items of vocabulary in this, sometimes they're uncased. 01:17:36.920 |
Here is the start of the GloVe 50-dimensional word vectors trained on a corpus of 6 billion. 01:17:44.880 |
Here is the word "the," and here are the 50 floats which attempt to capture all of the information about what "the" means and how it behaves. 01:18:05.000 |
And here, for the next token, are the 50 floats that attempt to capture all of the information about it. 01:18:13.160 |
So here is the word "in," here is the word "double quote," here is "apostrophe s." 01:18:19.560 |
So you can see that the GloVe authors have tokenized their text in a very particular way. 01:18:24.680 |
And the idea that "apostrophe s" should be treated as a thing, that makes a lot of sense. 01:18:30.320 |
It certainly has that thinginess in the English language. 01:18:34.640 |
And so indeed, the way the authors of a word-embedding corpus have chosen to tokenize their text really matters. 01:18:43.360 |
And one of the things I quite like about GloVe is that they've been pretty smart, in my opinion, about how they have tokenized things. 01:18:53.760 |
So the question is, how does one create word vectors in general? 01:19:00.080 |
What is the model that you're creating and what are the labels that you're building? 01:19:09.080 |
So one of the things that we talked about getting to at some point is unsupervised learning. 01:19:15.200 |
And this is a great example of unsupervised learning. 01:19:17.440 |
We want to take 840 billion tokens of an internet dump and build a model of something. 01:19:29.360 |
We're trying to capture some structure of this data, in this case, how does the English language work? 01:19:38.040 |
The way that this is done, at least in the Word2Vec example, is quite cool. 01:19:42.640 |
What they do is they take every sentence of, say, 11 words long -- not just every sentence, 01:19:52.800 |
but every 11-word-long string of words that appears in the corpus -- and then they take the middle word. 01:19:58.800 |
The first thing they do is they create a copy of it, an exact copy. 01:20:06.640 |
And then in the copy, they delete the middle word and replace it with some random word. 01:20:17.440 |
So we now have two strings of 11 words, one of which makes sense because it's real, one 01:20:24.640 |
of which probably doesn't make sense because the middle word has been replaced with something 01:20:30.120 |
And so the model task that they create, the label is 1 if it's a real sentence, or 0 if it's the corrupted copy. 01:20:43.440 |
So you can see it's not a directly useful task in any way, unless somebody actually 01:20:49.760 |
comes along and says, "I just found this corpus in which somebody's replaced half of the middle words with random words." 01:20:56.560 |
And it is something where, in order to be able to tackle this task, you're going to have to learn something about language. 01:21:02.360 |
You're going to have to be able to recognize that this sentence doesn't make sense, and that this one does. 01:21:07.440 |
So this is a great example of unsupervised learning. 01:21:10.120 |
Generally speaking in deep learning, unsupervised learning means coming up with a task which 01:21:15.820 |
is as close to the task you're eventually going to be interested in as possible but that doesn't 01:21:20.600 |
require labels, or where labels are really cheap to generate. 01:21:50.520 |
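As a concrete illustration of that labelling trick, here is a small sketch (not the actual Word2Vec code) that builds the dataset just described: real 11-word windows get label 1, and copies with the middle word swapped for a random word get label 0.

```python
import random

def make_unsup_examples(tokens, window=11, vocab=None):
    """Build (window, label) pairs: 1 = real window, 0 = middle word replaced at random."""
    vocab = vocab or list(set(tokens))
    examples = []
    mid = window // 2
    for i in range(len(tokens) - window + 1):
        real = tokens[i:i + window]
        fake = list(real)
        fake[mid] = random.choice(vocab)      # corrupt the middle word
        examples.append((real, 1))            # a real string of words
        examples.append((fake, 0))            # probably nonsense now
    return examples

corpus = "the quick brown fox jumps over the lazy dog near the river bank every day".split()
pairs = make_unsup_examples(corpus)
```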
So it turns out that the embeddings that are created when you look at, say, Hindi and Japanese end up having very similar structures. 01:22:05.360 |
And so one way to translate language is to create a bunch of word vectors in English 01:22:15.000 |
for various words, and then to create a bunch of word vectors in Japanese for various words. 01:22:22.080 |
And then what you can do is you can say, "Okay, I want to translate this particular English word." 01:22:29.960 |
You can basically look up and find the nearest word in the same vector space among the Japanese word vectors. 01:22:39.800 |
So it's a fascinating thing about language, in fact, Google has just announced that they've 01:22:47.200 |
replaced Google Translate with a neural translation system, and part of what that is doing is basically this kind of thing. 01:22:54.120 |
In fact, here are some interesting examples of some word embeddings. 01:23:00.800 |
The word embedding for king and queen has the same distance and direction as the word embedding for man and woman. 01:23:07.320 |
Ditto for walking vs. walked and swimming vs. swam, and ditto for Spain vs. Madrid and Italy vs. Rome. 01:23:15.000 |
So the embeddings that have to get learned in order to solve this stupid, meaningless task end up capturing a lot of real structure about how language works. 01:23:27.640 |
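The classic way to check those linear relationships is with a little vector arithmetic; here is a sketch, assuming word2vec is the word-to-vector lookup described just below (this is not code from the lesson notebook):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land near queen in the embedding space
target = word2vec['king'] - word2vec['man'] + word2vec['woman']
print(cosine(target, word2vec['queen']))   # high similarity if the analogy holds
```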
And so I've actually downloaded those glove embeddings, and I've pre-processed them, and 01:23:34.320 |
I'm going to upload these for you shortly into a form that's going to be really easy 01:23:40.880 |
And I've created this little thing called load_glove, which loads the pre-processed version. 01:23:48.240 |
It's going to give you the word vectors, which is the 400,000 by, in this case, 50-dimensional 01:23:54.280 |
matrix of vectors; a list of the words, and here they are: "the", comma, period, "of", "to"; and a list of the word indexes. 01:24:04.920 |
So you can now take a word and call word2vec to get back its 50-dimensional array. 01:24:16.260 |
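The pre-processed files themselves aren't shown in the transcript, but a sketch of what a load_glove-style loader does with the raw downloaded text file might look like this; the filename and the three return values are assumptions based on the description above.

```python
import numpy as np

def load_glove_txt(path='glove.6B.50d.txt'):
    """Parse a raw GloVe text file into (vectors, words, word -> index)."""
    words, rows = [], []
    with open(path, encoding='utf8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            words.append(parts[0])
            rows.append(np.array(parts[1:], dtype=np.float32))
    vecs = np.stack(rows)                               # 400,000 x 50 for the 6B/50d file
    wordidx = {w: i for i, w in enumerate(words)}
    return vecs, words, wordidx

vecs, words, wordidx = load_glove_txt()
word2vec = {w: vecs[i] for w, i in wordidx.items()}     # word -> 50-float vector
print(words[:5])                                        # 'the', ',', '.', 'of', 'to'
print(word2vec['the'][:5])
```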
In order to turn a 50-dimensional vector into something 2-dimensional that I can plot, we 01:24:23.200 |
have to do something called dimensionality reduction. 01:24:25.880 |
And there's a particular technique, the details don't really matter, called TSNE, which attempts 01:24:30.240 |
to find a way of taking your high-dimensional information and plot it on 2 dimensions such 01:24:36.840 |
that things that were close in the 50 dimensions are still close in the 2 dimensions. 01:24:41.520 |
And so I used TSNE to plot the first 350 most common words, and here they all are. 01:24:50.720 |
And so you can see that bits of punctuation have appeared close to each other, numerals 01:24:55.820 |
appear close to each other, written versions of numerals are close to each other, seasons, 01:25:00.280 |
games, leagues played are all close to each other, various things about politics, school 01:25:05.440 |
and university, president, general, prime minister, and Bush. 01:25:11.640 |
Now this is a great example of where this TSNE 2-dimensional projection is misleading 01:25:18.680 |
about the level of complexity that's actually in these word vectors. 01:25:22.440 |
In a different projection, Bush would be very close to tree. 01:25:27.340 |
The 2-dimensional projection is losing a lot of information. 01:25:31.480 |
The true detail here is a lot more complex than us mere humans can see on a page. 01:25:41.920 |
So all I've done here is I've just taken those 50-dimensional word vectors and I've plotted the most common 350 of them in two dimensions. 01:25:49.920 |
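A sketch of that projection using scikit-learn's TSNE; the variable names vecs and words follow the loader sketch above, and the plotting details are just one reasonable choice rather than the notebook's exact code.

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

n = 350                                   # the 350 most common words
tsne = TSNE(n_components=2, random_state=0)
coords = tsne.fit_transform(vecs[:n])     # 50-d -> 2-d, keeping close things close

plt.figure(figsize=(12, 12))
plt.scatter(coords[:, 0], coords[:, 1], s=2)
for (x, y), w in zip(coords, words[:n]):
    plt.annotate(w, (x, y), fontsize=8)   # label each point with its word
plt.show()
```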
And so you can see that when you learn an embedding, you end up with something useful. We've 01:25:58.620 |
now seen this not just for words: for movies, we were able to plot some 01:26:03.000 |
movies in 2 dimensions and see how they relate to each other, and we can do the same thing with words. 01:26:07.920 |
In general, when you have some high-dimension, high-cardinality categorical variable, whether 01:26:13.640 |
it be lots of movies or lots of reviewers or lots of words or whatever, you can turn it 01:26:18.440 |
into a useful, lower-dimensional space using this very simple technique of creating an embedding. 01:26:24.680 |
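In Keras terms, that just means putting an Embedding layer on the integer IDs; a toy sketch for something like the movie example (all of the sizes here are made up):

```python
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

n_movies, n_factors = 5000, 32        # e.g. 5,000 distinct movie IDs -> 32 latent factors each
model = Sequential([
    Embedding(n_movies, n_factors, input_length=1),
    Flatten(),
    Dense(1)])                        # e.g. predict a rating from the movie's latent factors
model.compile('adam', 'mse')

# after training, the learned lower-dimensional representation is just the weight matrix
movie_vectors = model.layers[0].get_weights()[0]   # shape (5000, 32)
```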
The explanation on how unsupervised learning was used in Word2Vec was pretty smart. 01:26:31.320 |
I don't recall how it was done in GloVe, I believe it was something similar. 01:26:35.520 |
I should mention though that both GloVe and Word2Vec did not use deep learning. 01:26:41.460 |
They actually tried to create a linear model, and the reason they did that was that they 01:26:47.600 |
specifically wanted to create representations which had these kinds of linear relationships 01:26:53.220 |
because they felt that this would be a useful characteristic of these representations. 01:26:59.400 |
I'm not even sure if anybody has tried to create a similarly useful representation using 01:27:07.520 |
a deeper model and whether that turns out to be better. 01:27:10.600 |
Obviously with these linear models, it saves a lot of computational time as well. 01:27:15.720 |
The embeddings, however, even though they were built using linear models, we can now 01:27:22.680 |
use them as inputs to deep models, which is what we're about to do. There's a question just behind you, Rachel. 01:27:32.440 |
So Google SyntaxNet model that just came out, was that the one you were mentioning? 01:27:41.760 |
Word2Vec has been around for 2 and a half years, 2 years. 01:27:54.480 |
I think it's called Parsey McParseface; that one is the one where they claim 97% accuracy 01:28:17.000 |
on NLP, and it also returns parts of speech, so if you give it a sentence it'll tell you the part of speech of each word. 01:28:21.960 |
In that high-dimensional space, for example, you can see there is information about things like tense. 01:28:29.440 |
So it's very easy to take a word vector and use it to create a part of speech recognizer, 01:28:35.240 |
you just need a fairly small labeled corpus, and it's actually pretty easy to download 01:28:40.400 |
a rather large labeled corpus, and build a simple model that goes from word vector to part of speech. 01:28:47.720 |
There's a really interesting paper called "Exploring the Limits of Language Modeling." 01:28:55.720 |
That Parsey McParseface thing got far more PR than it deserved. 01:29:01.800 |
It was not really an advance over the state-of-the-art language models of the time, but since that 01:29:11.480 |
time there have been some much more interesting things. 01:29:15.680 |
One of the interesting papers is "Exploring the Limits of Language Modeling," which is 01:29:19.480 |
looking at what happens when you take a very, very, very large dataset and spend shitloads 01:29:28.680 |
of Google's money on lots and lots of GPUs for a very long time, and they have some genuine 01:29:36.760 |
massive improvements to the state-of-the-art in language modeling. 01:29:41.600 |
In general, when we're talking about language modeling, we're talking about things like 01:29:46.160 |
is this a noun or a verb, is this a happy sentence or a sad sentence, is this a formal 01:29:52.840 |
speech or an informal speech, so on and so forth. 01:29:57.600 |
And all of these things that NLP researchers do, we can now do super easily with these 01:30:02.800 |
This uses two techniques, one of which you know and one of which you're about to know, 01:30:22.080 |
convolutional neural networks and recurrent neural networks, specifically a type called an LSTM. 01:30:29.340 |
You can check out this paper to see how they compare. 01:30:32.220 |
Around the same time, there's been an even newer paper that has furthered the state-of-the-art 01:30:36.760 |
in language modeling and it's using a convolutional neural network. 01:30:41.000 |
So right now, CNNs with pre-trained word embeddings are the state-of-the-art. 01:30:56.280 |
So given that we can now download these pre-trained word embeddings, that leads to the question 01:31:04.280 |
of why are we using randomly generated word embeddings when we do our sentiment analysis. 01:31:21.920 |
From now on, you should always use pre-trained word embeddings anytime you do NLP. 01:31:31.120 |
Over the next few weeks, we will be gradually making this easier and easier. 01:31:35.600 |
At this stage, it requires slightly less than a screen of code. 01:31:39.700 |
You have to load the embeddings off disk, creating your word vectors, your list of words, and your word indexes. 01:31:48.040 |
The next thing you have to do is, the word indexes that come from GloVe are going to 01:31:54.000 |
be different to the word indexes in your vocabulary. 01:32:04.800 |
In the GloVe case, it's probably not the word Bromwell. 01:32:07.320 |
So this little piece of code is simply something that is mapping from one index to the other 01:32:16.320 |
So this createEmbedding function is then going to create an embedding matrix where the indexes 01:32:28.160 |
are the indexes in the IMDB dataset, and the embeddings are the embeddings from GloVe. 01:32:38.080 |
This embedding matrix contains the GloVe word vectors, indexed according to the IMDB dataset. 01:32:38.080 |
So now I have simply copied and pasted the previous code and I have added this weights parameter, passing in the embedding matrix we just built. 01:32:44.440 |
Since we think these embeddings are pretty good, I've set trainable to false. 01:32:59.600 |
I won't leave it at false because we're going to fine-tune them, but we'll start it at false. 01:33:04.920 |
One particular reason that we can't leave it at false is that sometimes I've had to 01:33:09.400 |
create a random embedding because sometimes the word that I looked up in GloVe didn't exist. 01:33:16.640 |
For example, anything that finishes with apostrophe s, in GloVe they tokenize that to have apostrophe 01:33:23.120 |
s and the word as separate tokens, but in IMDB they were combined into one token. 01:33:29.160 |
And so all of those things, there aren't vectors for them. 01:33:32.000 |
So I just randomly created embeddings for anything that I couldn't find in the GloVe vocabulary. 01:33:41.040 |
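A sketch of what a createEmbedding-style function does: look each IMDB index's word up in GloVe and fall back to small random values when it isn't there. This is not the notebook's exact code; idx2word (an IMDB index-to-word mapping) and the word2vec lookup from earlier are assumed to exist.

```python
import numpy as np

def create_emb(idx2word, word2vec, vocab_size=5000, n_factors=50):
    """Embedding matrix indexed by IMDB word index, filled with GloVe vectors."""
    emb = np.zeros((vocab_size, n_factors), dtype=np.float32)
    for i in range(1, vocab_size):
        word = idx2word.get(i, '')
        if word in word2vec:
            emb[i] = word2vec[word]            # use the pre-trained GloVe vector
        else:
            # e.g. IMDB tokens that GloVe tokenized differently, like "movie's"
            emb[i] = np.random.normal(scale=0.6, size=n_factors)
    return emb

emb = create_emb(idx2word, word2vec)
```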
But for now, let's start using just the embeddings that were given, and we will set this to non-trainable, 01:33:46.840 |
and we will train a convolutional neural network using those embeddings for the IMDB task. 01:33:55.800 |
And after 2 epochs, we have 89.8. Previously, with random embeddings, we had 89.5. 01:34:16.640 |
Let's now go ahead and set the first layer's trainable to true. 01:34:16.640 |
Drop the learning rate a bit and do just one more epoch, and we're now up to 90.1. 01:34:29.880 |
So we've got way beyond the academic state of the art here. 01:34:34.400 |
We're kind of cheating because we're now not just building a model, we're now using a pre-trained 01:34:40.640 |
word embedding model that somebody else has provided for us. 01:34:44.880 |
But why would you ever not do that if that exists? 01:34:48.800 |
So you can see that we've had a big jump, and furthermore it's only taken us 12 seconds 01:34:56.000 |
So we started out with the pre-trained word embeddings, we set them initially to non-trainable 01:35:02.520 |
in order to just train the layers that used them, waited until that was stable, which 01:35:11.280 |
took really 2 epochs, and then we set them to trainable and did one more little fine-tuning epoch. 01:35:19.120 |
And this kind of approach of these 3 epochs of training is likely to work for a lot of NLP problems. 01:35:28.520 |
Do you not need to compile the model after resetting the input layer to trainable equals 01:35:37.280 |
No you don't, because the architecture of the model has not changed in any way; it's exactly the same model. 01:35:48.040 |
There's never any harm in compiling the model. 01:35:51.480 |
Sometimes if you forget to compile, it just continues to use the old model, so best to compile anyway. 01:36:01.560 |
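Putting the pieces together, roughly, in the Keras-1-style API used throughout this lesson; emb is the embedding matrix built above, and the data variables (trn, labels_train, test, labels_test), layer sizes and optimizer settings are illustrative assumptions rather than the exact notebook code.

```python
from keras.models import Sequential
from keras.layers import Embedding, Dropout, Convolution1D, MaxPooling1D, Flatten, Dense
from keras.optimizers import Adam

model = Sequential([
    Embedding(5000, 50, input_length=500, dropout=0.2,
              weights=[emb], trainable=False),          # start from the GloVe matrix, frozen
    Dropout(0.2),
    Convolution1D(64, 5, border_mode='same', activation='relu'),
    Dropout(0.2),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid')])
model.compile(Adam(), loss='binary_crossentropy', metrics=['accuracy'])
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)

# now unfreeze the embeddings and fine-tune one more epoch at a lower learning rate;
# recompiling here does no harm, as noted above
model.layers[0].trainable = True
model.compile(Adam(1e-4), loss='binary_crossentropy', metrics=['accuracy'])
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=1, batch_size=64)
```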
Something that I thought was pretty cool is that during the week, one of our students 01:36:06.220 |
here had an extremely popular post appear all over the place; I saw it on the front 01:36:11.240 |
page of Hacker News, talking about how his company, Quid, uses deep learning with small data, 01:36:16.880 |
which is what we're all about, so I was very happy to see it. 01:36:21.200 |
For those of you who don't know it, Quid is a company, quite a successful startup actually, 01:36:26.300 |
that is processing millions and millions of documents, things like patents and stuff like 01:36:31.760 |
that, and providing enterprise customers with really cool visualizations and interactive 01:36:40.680 |
And so this is by Ben Bowles, one of our students here, and he talked about how he compared 01:36:44.920 |
three different approaches to a particular NLP classification task, one of which involved 01:36:52.000 |
some pretty complex and slow to develop carefully engineered features. 01:37:02.200 |
But Model 3 in this example was a convolutional neural network. 01:37:07.680 |
So I think this is pretty cool and I was hoping to talk to Ben about this piece of work. 01:37:18.200 |
Could you give us a little bit of context on what you were doing in this project? 01:37:23.760 |
Yeah, so the task is about detecting marketing language from company descriptions. 01:37:30.960 |
So it's had the flavor of being very similar to sentiment analysis, like you have two classes 01:37:36.040 |
of things, they're kind of different in some kind of semantic way. 01:37:39.440 |
And you've got some examples here, so one was "our patent-pending support system is engineered 01:37:44.240 |
and designed to bring confidence and style", which is your more marketing one, I guess, and "your spatial scanning 01:37:49.920 |
software for mobile devices" is your more informative one. 01:37:53.200 |
Yeah, I mean the semantics of the marketing language is like, oh this is exciting. 01:37:59.640 |
There are certain types of meanings and semantics around which the marketing tends to cluster, 01:38:04.520 |
and I sort of realized, hey, this would be kind of a nice task for deep learning. 01:38:09.440 |
How were these labeled, your data set in the first place? 01:38:13.400 |
Basically by a couple of us in the company, we basically just found some good ones and 01:38:18.440 |
found the bad ones and then literally tried it out. 01:38:21.560 |
I mean, it's literally as hacky as you could possibly imagine. 01:38:25.320 |
So yeah, it was super, super scrappy. 01:38:30.520 |
But it actually ended up being very useful for us. I think that's kind of a nice 01:38:33.600 |
lesson: sometimes scrappy gets you most of the way you need. When you think about, like, 01:38:38.440 |
hey, how do you get the data for your project, well, you can actually just create it, right? 01:38:44.120 |
I mean, I love this lesson because -- that's so startup, right? 01:38:48.720 |
When I talk to big enterprise executives, they're all about their five year metadata 01:38:54.880 |
and data lake repository infrastructure program at the end of which maybe they'll actually 01:39:00.760 |
try and get some value out of it, whereas startups are just like, okay, what have we 01:39:05.280 |
got that we can do by Monday, let's throw it together and see if it works. 01:39:10.680 |
The latter approach is so much better because by Monday you know whether it kind of looks 01:39:15.760 |
good, which kind of things are important, and you can decide how much further it's worth investing. 01:39:23.240 |
So one of the things I wanted to show is your convolutional neural network did something 01:39:28.240 |
pretty neat, and so I wanted to use this same neat trick for our convolutional neural network as well. 01:39:37.220 |
So I mentioned earlier that when I built this CNN, I tried using a filter size of 5, and 01:39:50.680 |
And what Ben in his blog post points out is that there's a neat paper in which they describe 01:39:56.400 |
doing something interesting, which is not just using one size convolution, but trying 01:40:04.320 |
And you can see here, this is a great use of the functional API, and I haven't exactly 01:40:10.600 |
used your code, I've kind of rewritten it a little bit, but basically it's the same concept. 01:40:14.720 |
Let's try size 3 and size 4 and size 5 convolutional filters, and so let's create a 1D convolutional 01:40:22.920 |
filter of size 3 and then size 4 and then size 5, and then for each one using the functional 01:40:29.760 |
API we'll add max pooling and we'll flatten it and we'll add it to a list of these different convolution outputs. 01:40:36.960 |
And then at the end, we'll merge them all together by simply concatenating them. 01:40:42.400 |
So we're now going to have a single vector containing the result of the 3 and 4 and 5 01:40:48.000 |
size convolutions, like why settle for 1. And then let's return that whole model as a little 01:40:54.800 |
sub-model, which in Ben's code he called graph. 01:40:59.280 |
The reason I assume you call this graph is because people tend to think of these things as computational graphs. 01:41:06.560 |
A computational graph basically is saying this is a computation being expressed as various 01:41:13.520 |
inputs and outputs, so you can think of it as a graph. 01:41:16.640 |
So once you've got this little multi-layer convolution module, you can stick it inside 01:41:23.760 |
a standard sequential model by simply replacing the Convolution1D and max pooling piece with 01:41:32.120 |
graph, where graph is the concatenated version of all of these different scales of convolution. 01:41:41.240 |
And so trying this out, I got a slightly better answer again, which is 90.36%. 01:41:50.840 |
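A sketch of that multi-size convolution block using the Keras 1 functional API, following the description above rather than Ben's exact code; emb and the sequence and vocabulary sizes are assumptions carried over from the earlier IMDB model.

```python
from keras.models import Model, Sequential
from keras.layers import Input, Embedding, Dropout, Convolution1D, MaxPooling1D, Flatten, Dense, merge

graph_in = Input((500, 50))              # sequence length x embedding size
convs = []
for fsz in (3, 4, 5):                    # several filter sizes rather than settling for one
    x = Convolution1D(64, fsz, border_mode='same', activation='relu')(graph_in)
    x = MaxPooling1D()(x)
    x = Flatten()(x)
    convs.append(x)
out = merge(convs, mode='concat')        # concatenate the 3-, 4- and 5-wide results
graph = Model(graph_in, out)             # the little multi-size convolution sub-model

model = Sequential([
    Embedding(5000, 50, input_length=500, weights=[emb], dropout=0.2),
    Dropout(0.2),
    graph,                               # drops in where Convolution1D + MaxPooling1D used to be
    Dropout(0.2),
    Dense(100, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid')])
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])
```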
And I hadn't seen that paper before, so thank you for giving that great idea. 01:41:55.360 |
Did you have anything to add about this multi-scale convolution idea? 01:41:59.760 |
Not really, other than I think it's super cool. 01:42:04.240 |
But actually I'm still trying to figure out all the ins and outs of exactly how it works. 01:42:10.400 |
In some ways, implementation is easier than understanding. 01:42:14.720 |
In a lot of these things, the math is kind of ridiculously simple, and then you throw 01:42:23.880 |
it at an SGD and let it do billions and billions of calculations in a fraction of a second, 01:42:29.960 |
and what it comes up with is kind of hard to grasp. 01:42:34.360 |
And you are using capital M merge in this example, did you want to talk about that? 01:42:40.640 |
Ben used capital M merge and I just did the same thing. 01:42:44.800 |
Were it me, I would have used small M merge, so we'll have to agree to disagree here. 01:43:01.280 |
So we have a few minutes to talk about something enormous, so we're going to do a brief introduction. 01:43:16.960 |
So everything we've learned so far about convolutional neural networks does not necessarily do a 01:43:24.800 |
great job of solving a problem like how would you model this? 01:43:31.960 |
Now, whatever this markup is exactly -- I'm not quite sure -- notice what it takes to model it. 01:43:37.160 |
It has to recognize when you have a start tag and know to close that tag, but then over 01:43:42.720 |
a longer period of time that it's inside a weird XML comment thing and to know that it 01:43:48.980 |
has to finish off the weird XML comment thing, which means it has to kind of keep memory 01:43:55.720 |
about what happened in the distant past if you're going to successfully do any kind of modeling of this. 01:44:04.480 |
And so with that kind of memory therefore, it can handle long-term dependencies. 01:44:15.640 |
Also think about these two different sentences. 01:44:19.320 |
They both mean effectively the same thing, but in order to realize that, you're going 01:44:24.360 |
to have to keep some kind of state that knows that after this has been read in, you're now 01:44:29.720 |
talking about something that happened in 2009, and you then have to remember it all the way 01:44:35.440 |
to here to know when it was that this thing happened that you did in Nepal. 01:44:41.200 |
So we want to create some kind of stateful representation. 01:44:46.680 |
Furthermore it would be nice if we're going to deal with big long pieces of language like 01:44:50.800 |
this with a lot of structure to be able to handle variable length sequences, so that 01:44:55.320 |
we can handle some things that might be really long and some things that might be really 01:44:59.480 |
So these are all things which convolutional neural networks don't necessarily do that well. 01:45:06.480 |
So we're going to look at something else, which is a recurrent neural network, which handles all of these things. 01:45:12.200 |
And here is a great example of a good use of a recurrent neural network. 01:45:17.680 |
At the top here, you can see that there is a convolutional neural network that is looking at images of house numbers. 01:45:30.560 |
These images are coming from really big Google Street View pictures, and so it has to figure 01:45:36.440 |
out what part of the image should I look at next in order to figure out the house number. 01:45:42.400 |
And so you can see that there's a little square box that is scanning through and figuring 01:45:49.680 |
And then at the bottom, you can see it's then showing you what it's actually seeing after 01:45:57.120 |
So the thing that is figuring out where to look next is a recurrent neural network. 01:46:02.120 |
It's something which is taking its previous state and figuring out what should its next 01:46:09.000 |
And this kind of model is called an attentional model. 01:46:14.120 |
And it's a really interesting avenue of research when it comes to dealing with things like 01:46:19.320 |
very large images, images which might be too big for a single convolutional neural network to look at all at once. 01:46:29.400 |
On the left is another great example of a useful recurrent neural network, which is the 01:46:34.440 |
very popular Android and iOS text entry system called SwiftKey. 01:46:40.320 |
And SwiftKey had a post-up a few months ago in which they announced that they had just 01:46:46.920 |
replaced their language model with a neural network of this kind, which basically looked 01:46:52.720 |
at your previous words and figured out what word you are likely to be typing next. 01:47:01.040 |
A final example: Andrej Karpathy showed a really cool thing where he was able to generate 01:47:10.320 |
random mathematical papers by generating random LaTeX, and to generate random LaTeX you actually 01:47:17.440 |
have to learn things like \begin{proof} and \end{proof} and these kinds of long-term dependencies. 01:47:25.840 |
And he was able to do that successfully, so this is actually a randomly generated piece 01:47:30.200 |
of LaTeX which is being created with a recurrent neural network. 01:47:36.400 |
So today I am not going to show you exactly how it works, I'm going to try to give you an intuition for it. 01:47:44.400 |
And I'm going to start off by showing you how to think about neural networks as computational 01:47:53.560 |
So this is coming back to that word Ben used earlier, this idea of a graph. 01:47:57.880 |
And so I started out by trying to draw -- this is like my notation, you won't see this anywhere 01:48:02.480 |
else but it'll do for now -- here is a picture of a single hidden layer basic neural network. 01:48:11.000 |
We can think of it as having an input, which is going to be of size batch size by number of inputs. 01:48:23.160 |
And then this arrow, this orange arrow, represents something that we're doing to that matrix. 01:48:29.440 |
So each of the boxes represents a matrix, and each of the arrows represents one or more 01:48:37.640 |
In this case, we do a matrix product and then we throw it through a rectified linear unit. 01:48:43.520 |
And then we get a circle which represents a matrix, but it's now a hidden layer which 01:48:49.840 |
is of size, batch size, by number of activations. 01:48:54.640 |
And number of activations is just, when we created that dense layer, we would have said Dense, 01:48:59.880 |
and then we would have had some number, and that number is how many activations we create. 01:49:06.520 |
And then we put that through another operation, which in this case is a matrix product followed 01:49:12.320 |
by a softmax, and so triangle here represents an output matrix. 01:49:18.400 |
And that's going to be batch size by, if it's ImageNet, 1000. 01:49:24.440 |
So this is my little way of representing the computation graph of a basic neural network with a single hidden layer. 01:49:35.000 |
I'm now going to create some slightly more complex models, but I'm going to slightly simplify the notation first. 01:49:43.560 |
One thing to note is that batch size appears all the time, so I'm going to get rid of it. 01:49:49.980 |
So here's the same thing where I've removed batch size. 01:49:53.180 |
Also the specific activation function, who gives a shit? 01:49:56.640 |
It's probably ReLU everywhere except the last layer, where it's softmax, so I've removed that too. 01:50:03.280 |
Let's now look at what a convolutional neural network with a single dense hidden layer would 01:50:10.500 |
So we'd have our input, which this time will be, and remember I've removed batch size, 01:50:15.960 |
number of channels by height by width, the operation, and we're ignoring the activation 01:50:21.040 |
function is going to be a convolution followed by a max pool. 01:50:24.760 |
Remember any shape is representing a matrix, so that gives us a matrix which will be size 01:50:30.440 |
num_filters by height/2 by width/2, since we did a max pooling. 01:50:39.040 |
I've put flatten in parentheses because flattening mathematically does nothing at all. 01:50:45.080 |
Flattening is just telling Keras to think of it as a vector. 01:50:49.840 |
It doesn't actually calculate anything, it doesn't move anything, it doesn't really do 01:50:55.560 |
It just says think of it as being a different shape. 01:50:59.600 |
So let's then take a matrix product, and remember I'm not putting in the activation functions 01:51:05.280 |
So that would be our dense layer, gives us our first fully connected layer, which will 01:51:10.920 |
be of size, number of activations, and then we put that through a final matrix product and softmax to get our output. 01:51:19.040 |
So here is how we can represent a convolutional neural network with a single dense hidden layer. 01:51:27.800 |
The number of activations, again, is the same as we had last time: it's whatever number we passed 01:51:37.200 |
when we created the Dense layer, just like the number of filters is the number we pass when we write Convolution2D. 01:51:51.280 |
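In Keras terms, that diagram corresponds to something like the following; the filter count, the input shape (channels first, as in the description) and the layer sizes are arbitrary placeholders.

```python
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    # num_filters=32, on a channels-first image of num_channels x height x width
    Convolution2D(32, 3, 3, activation='relu', border_mode='same', input_shape=(3, 224, 224)),
    MaxPooling2D(),                       # halves height and width
    Flatten(),                            # just a reshape; no computation happens here
    Dense(100, activation='relu'),        # the single dense hidden layer: 100 activations
    Dense(1000, activation='softmax')])   # e.g. ImageNet's 1000 classes
```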
So I'm going to now create a slightly more complex computation graph, but again I'm going 01:51:57.520 |
to slightly simplify what I put on the screen, which is this time I'm going to remove all of the layer operations. 01:52:03.400 |
Because now that we have removed the activation function, you can see that in every case we 01:52:09.560 |
basically have either some kind of linear thing, either a matrix product or a convolution, 01:52:16.700 |
and optionally there might also be a max pooling. 01:52:19.320 |
So really, this is not adding much additional information, so I'm going to get rid of it from the diagram. 01:52:25.160 |
So we're now not showing the layer operations. 01:52:26.800 |
So remember now, every arrow is representing one or more layer operations, which will generally 01:52:33.520 |
be a convolution or a matrix product, followed by an activation function, and maybe there's a max pooling as well. 01:52:42.000 |
So let's say we wanted to predict the third word of a three-word string based on the previous two words. 01:52:53.240 |
Now there's all kinds of ways we could do this, but here is one interesting way, which 01:52:58.360 |
you will now recognize you could do with Keras's functional API. 01:53:02.440 |
Which is, we could take word1 input, and that could be either a one-hot encoded thing, in 01:53:13.160 |
which case its size would be vocab size, or it could be an embedding of it. 01:53:21.600 |
We then stick that through a layer operation to get a matrix output, which is our first hidden layer. 01:53:32.560 |
And this thing here, we could then take and put through another layer operation, but this 01:53:38.960 |
time we could also add in the word2 input, again, either of vocab size or the embedding 01:53:45.680 |
of it, put that through a layer operation of its own, and then when we have two arrows coming into one shape, we merge them. 01:53:55.680 |
And a merge could either be done as a sum, or as a concatenation. 01:54:02.000 |
I'm not going to say one's better than the other, but there are two ways that we can 01:54:05.640 |
take two input vectors and combine them together. 01:54:09.600 |
So now at this point, we have the input from word2 after sticking that through a layer. 01:54:18.240 |
We have the input from word1 after sticking that through two layers. 01:54:23.160 |
Merge them together, stick that through another layer to get our output, which we could then 01:54:27.600 |
compare to word3 and try to train that to recognize word3 from words1 and word2. 01:54:38.480 |
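A sketch of that graph in the Keras 1 functional API; the vocabulary size and layer widths are made up, and the merge could equally be mode='concat' rather than mode='sum'.

```python
from keras.models import Model
from keras.layers import Input, Embedding, Flatten, Dense, merge

vocab_size, n_fac = 5000, 50

def word_input(name):
    inp = Input(shape=(1,), dtype='int64', name=name)
    emb = Flatten()(Embedding(vocab_size, n_fac, input_length=1)(inp))
    return inp, emb

w1_in, w1 = word_input('word1')
w2_in, w2 = word_input('word2')

h1 = Dense(256, activation='relu')(w1)              # layer operation on word 1
h2 = merge([Dense(256, activation='relu')(h1),      # another layer operation on that state...
            Dense(256, activation='relu')(w2)],     # ...merged with word 2 after its own layer
           mode='sum')
out = Dense(vocab_size, activation='softmax')(h2)   # predict word 3

model = Model([w1_in, w2_in], out)
model.compile('adam', 'sparse_categorical_crossentropy')
```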
You could try and build this network using some corpus you find online, see how it goes. 01:54:45.440 |
Pretty obviously then, you could bring it up another level to say let's try and predict 01:54:51.800 |
the fourth word of a three-word string using words1 and 2 and 3. 01:55:00.160 |
The reason I'm doing it in this way is that what's happening is each time I'm going through 01:55:05.680 |
another layer operation and then bringing in word2 and going through a layer operation 01:55:11.160 |
and bringing in word3 and going through a layer operation is I am collecting state. 01:55:17.800 |
Each of these things has the ability to capture state about all of the words that have come 01:55:23.240 |
so far and the order in which they've arrived. 01:55:27.480 |
So by the time I get to predicting word4, this matrix has had the opportunity to learn 01:55:35.280 |
what does it need to know about the previous words' orderings and how they're connected 01:55:40.080 |
to each other and so forth in order to predict this fourth word. 01:55:47.920 |
It's important to note that we have not yet previously built a model in Keras which has 01:55:53.880 |
input coming in anywhere other than the first layer, but there's no reason we can't. 01:56:01.040 |
One of you asked a great question earlier, which was could we use this to bring in metadata 01:56:05.800 |
like the speed a car was going to add it with a convolutional neural network's image data. 01:56:11.520 |
I said yes we can, so in this case we're doing the same thing, which is we're bringing in 01:56:17.240 |
an additional word's worth of data, and remember, each time you see two different arrows coming into one shape, they get merged. 01:56:25.160 |
So here's a perfectly reasonable way of trying to predict the fourth word from the previous 01:56:32.920 |
So this leads to a really interesting question, which was what if instead we said let's bring 01:56:41.160 |
in our Word 1, and then we had a layer operation in order to create our hidden state, and that 01:56:50.080 |
would be enough to predict Word 2, and then to predict Word 3, could we just do a layer operation from that hidden state back to itself? 01:57:06.560 |
And then that could be used to predict Word 3, and then run it again to predict Word 4, and so on. 01:57:14.600 |
This is called an RNN, and everything that you see here is exactly the same structurally 01:57:26.000 |
The colored-in areas represent matrices, and the arrows represent layer operations. 01:57:33.360 |
One of the really interesting things about an RNN is each of these arrows that you see 01:57:38.800 |
- three arrows - there's only one weight matrix attached to those. 01:57:43.200 |
In other words, it's the equivalent thing of saying every time you see an arrow from 01:57:49.240 |
a circle to a circle, so that would be that one and that one, those two weight matrices are actually the same matrix. 01:57:59.000 |
Every time you see an arrow from a rectangle to a circle, those three matrices have to be the same as each other too. 01:58:07.640 |
And then finally, you've got an arrow from a circle to a triangle, and that weight matrix is the third one. 01:58:13.080 |
The idea being that if you have a word coming in and being added to some state, why would 01:58:19.200 |
you want to treat it differently depending on whether it's the first word in a string 01:58:25.520 |
Given that generally speaking, we kind of split up strings pretty much at random anyway. 01:58:29.720 |
We're going to be having a whole bunch of 11-word strings. 01:58:36.680 |
One of the nice things about this way of thinking about it where you have it going back to itself 01:58:41.640 |
is that you can very clearly see there is one layer operation, one weight matrix for 01:58:46.640 |
input to hidden, one for hidden to hidden, circle to circle, and one for hidden to output, circle to triangle. 01:58:58.320 |
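A pure-numpy sketch of that weight sharing: one input-to-hidden matrix, one hidden-to-hidden matrix and one hidden-to-output matrix, reused at every step. The sizes and the random stand-in inputs are arbitrary.

```python
import numpy as np

n_in, n_hidden, n_out, n_steps = 50, 100, 5000, 8
rng = np.random.RandomState(0)
W_ih = rng.normal(scale=0.01, size=(n_in, n_hidden))      # rectangle -> circle (shared)
W_hh = rng.normal(scale=0.01, size=(n_hidden, n_hidden))  # circle -> circle (shared)
W_ho = rng.normal(scale=0.01, size=(n_hidden, n_out))     # circle -> triangle

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.zeros(n_hidden)
for t in range(n_steps):
    x_t = rng.normal(size=n_in)               # stand-in for the embedding of word t
    h = np.maximum(0, x_t @ W_ih + h @ W_hh)  # the same two matrices are used at every step
out = softmax(h @ W_ho)                       # prediction for the next word
```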
So we're going to talk about that in a lot more detail next week. 01:59:04.520 |
So now, I'm just going to quickly show you something in the last one minute, which is 01:59:14.720 |
that we can train something which takes, for example, all of the text of Nietzsche, so here's 01:59:25.040 |
a bit of his text, I've just read it in here, and we could split it up into every sequence 01:59:31.600 |
- let's grab it here - into every sequence of length 40. 01:59:36.200 |
So I've gone through the whole text and grabbed every sequence of length 40. 01:59:41.440 |
And then I've created an RNN and its goal is to take the sentence which represents the 01:59:46.120 |
indexes from i to i+40 and predict the sentence from i+1 to i+40+1. 01:59:55.560 |
So for every string of length maxlen, I'm trying to predict the string one character along from it. 02:00:04.040 |
And so I can take that now and create a model which has - an LSTM is a kind of recurrent 02:00:10.340 |
neural network, we'll talk about it next week - which has a recurrent neural network, starts 02:00:14.720 |
of course with an embedding. And then I can train that by passing in my sentences and the sentences shifted along by one as the labels. 02:00:31.360 |
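A sketch of that setup, assuming idx is the list of character indexes for the Nietzsche text; the embedding size, LSTM size and training settings are illustrative rather than the exact notebook values.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, TimeDistributed, Dense

maxlen = 40
vocab_size = max(idx) + 1        # number of distinct characters in the corpus

# every overlapping 40-character sequence, and the same sequence shifted along by one
sentences  = np.stack([idx[i:     i + maxlen]     for i in range(len(idx) - maxlen - 1)])
next_chars = np.stack([idx[i + 1: i + maxlen + 1] for i in range(len(idx) - maxlen - 1)])

model = Sequential([
    Embedding(vocab_size, 24, input_length=maxlen),
    LSTM(512, return_sequences=True),                     # the recurrent part
    TimeDistributed(Dense(vocab_size, activation='softmax'))])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.fit(sentences, np.expand_dims(next_chars, -1), batch_size=64, nb_epoch=1)
```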
And I can then say, okay, let's try and generate 300 characters by building a prediction of 02:00:37.720 |
what do you think the next character would be. And so I have to seed it with something, 02:00:42.140 |
I don't know, I thought it felt very Nietzschean, ethics is a basic foundation of all that. 02:00:48.400 |
And see what happens. And after training it for only a few seconds, I get ethics is a 02:00:53.720 |
basic foundation of all that. You can get the sense that it's starting to learn a bit 02:00:59.760 |
about the idea that - oh by the way, one thing to mention is this Nietzsche corpus is slightly 02:01:04.760 |
annoying. It has carriage returns after every line, so you'll see it's going to throw carriage 02:01:10.320 |
returns in all over the place. It's got some pretty hideous formatting. 02:01:15.920 |
So then I train it for another 30 seconds. I train it for another 30 seconds and I get 02:01:22.760 |
to a point where it's kind of understanding the concept of punctuation and spacing. 02:01:27.760 |
And then I've trained it for 640 seconds and it's starting to actually create real words. 02:01:35.080 |
And then I've trained it for another 640 seconds. And interestingly, each section of Nietzsche 02:01:42.160 |
starts with a numbered section that looks exactly like this. It's even starting to learn 02:01:47.360 |
to close its quotation marks. It also notes that at the start of a chapter, it always 02:01:52.160 |
has three lines, so it's learned to start chapters after another 640 seconds and another 02:02:01.680 |
640 seconds. And so by this time, it's actually got to a point where it's saying some things 02:02:06.920 |
which are so obscure and difficult to understand, it could really be Nietzsche. 02:02:15.300 |
These char-RNN models are fun and all, but the reason this is interesting is that we're 02:02:22.320 |
showing that we only provided that amount of text and it was able to generate text out 02:02:29.280 |
here because it has state, it has recurrence. And what that means is that we could use this 02:02:35.120 |
kind of model to generate something like SwiftKey, where as you're typing it's saying this is the 02:02:40.140 |
next thing you're going to type. I would love you to think about during the week whether 02:02:47.120 |
this is likely to help our IMDB sentiment model or not. That would be an interesting 02:02:53.400 |
thing to talk about. Next week, we will look into the details of how RNNs work. Thanks.