
Lesson 4: Deep Learning 2018


Chapters

0:0 Introduction
10:11 Dropout and Generalization
17:40 Default Dropout Values
28:3 Categorical vs Continuous Variables
45:7 Create the Learner
45:44 Embeddings
46:4 Continuous Variables
57:5 Embedding Matrices
61:56 Distributed Representation
71:50 Custom Metrics
79:14 Data Augmentation
79:32 What Is Dropout Doing
84:48 NLP
85:10 Language Modeling
91:14 Language Model
97:32 Create a Language Model
99:50 Tokenization in NLP
100:58 Example of Tokenization
102:4 Create the Model Data Object
102:25 Model Data Object
102:55 Minimum Frequency
107:11 Bag of Words
115:55 Create a Model
116:0 Embedding Matrix
118:19 Create an Embedding Matrix
124:24 Fit the Model

Whisper Transcript | Transcript Only Page

00:00:00.000 | Okay, hi everybody welcome back good to see you all here
00:00:03.300 | It's been another
00:00:08.240 | Busy week of deep learning
00:00:11.360 | Lots of cool things going on and like last week
00:00:16.400 | I wanted to highlight a few really interesting articles that some of some of you folks who have written
00:00:24.640 | Fatali wrote one of the best articles I've seen for a while. I think actually talking about
00:00:32.880 | differential learning rates and stochastic gradient descent with restarts
00:00:36.720 | Be sure to check it out if you can because what he's done. I feel like he's done a great job of
00:00:43.040 | Kind of positioning in a place that you can get a lot out of it
00:00:48.360 | You know regardless of your background, but for those who want to go further
00:00:52.160 | He's also got links to like the academic papers that came from and kind of graphs of showing examples of all of all the things
00:00:59.080 | He's talking about
00:01:00.320 | And I think it's a it's a particularly
00:01:02.440 | Nicely done article so a good kind of role model for technical communication
00:01:08.600 | One of the things I've liked about you know seeing people post these
00:01:12.560 | Post these articles during the week is the discussion on the forums have also been like really great. There's been a lot of a
00:01:20.320 | lot of people helping out like
00:01:22.320 | Explaining things you know which you know maybe there's parts of the post bit where people have said actually that's not quite how it works
00:01:28.640 | And people have learned new things that way people have come up with new ideas as a result as well
00:01:33.600 | These discussions of stochastic gradient descent with restarts and cyclical learning rates has been a few of them actually
00:01:41.520 | Anand Sahar has written another great post
00:01:44.720 | talking about a similar
00:01:48.720 | Similar topic and why it works so well and again lots of great pictures and references to
00:01:53.780 | Papers and most importantly perhaps code are showing how it actually works
00:01:59.600 | Mark Hoffman covered the same topic at kind of a nice introductory level. I think really really kind of clear intuition
00:02:10.180 | Many can't talk specifically about differential learning rates
00:02:15.920 | And why it's interesting and again providing some nice context people not familiar with transfer learning
00:02:22.120 | Going right back to saying, like, well, what is transfer learning?
00:02:24.520 | Why is that interesting and given that why could differential learning rates be helpful?
00:02:30.300 | and then
00:02:33.440 | One thing I particularly liked about Arjun's
00:02:35.440 | article was that he talked not just about the technology that we're looking at but also talked about some of the
00:02:42.840 | implications particularly from a commercial point of view
00:02:45.280 | So thinking about like based on some of the things we've learned about so far
00:02:49.800 | What are some of the implications that that has you know in real life?
00:02:53.160 | And lots of background lots of pictures
00:02:56.180 | And then discussing some of the yeah some of the implications
00:03:00.400 | So there's been lots of great stuff online and thanks to everybody for all the great work that you've been doing
00:03:08.640 | As we talked about last week if you're kind of vaguely wondering about writing something
00:03:13.800 | But you're feeling a bit intimidated about it because you've never really written a technical post before just jump in you know
00:03:19.240 | It's a really
00:03:21.980 | Welcoming and encouraging group. I think to to work with
00:03:26.360 | So we're going to have a kind of an interesting lesson today, which is we're going to cover a
00:03:37.120 | Whole lot of different applications, so we've we've spent quite a lot of time on computer vision
00:03:42.600 | And today we're going to try if we can to get through three totally different areas
00:03:48.120 | Structured learning so looking at kind of how you look at
00:03:52.120 | So we're going to start out looking at structured learning or structured data learning by which I mean
00:03:59.340 | Building models on top of things that look more like database tables
00:04:04.680 | So kind of columns of different types of data. They might be financial or geographical or whatever
00:04:10.240 | We're going to look at using deep learning for language natural language processing
00:04:16.600 | And we're going to look at using deep learning for recommendation systems, and so we're going to cover these
00:04:22.440 | at a very high level and the focus will be on
00:04:26.640 | Here is how to use the software to do it
00:04:31.480 | More than here is what's going on behind the scenes, and then the next three lessons
00:04:36.160 | We'll be digging into the details of what's been going on behind the scenes and also coming back to
00:04:41.920 | Looking at a lot of the details of computer vision that we kind of skipped over so far
00:04:47.740 | So the focus today is really on like how do you actually do these applications?
00:04:53.880 | And we'll kind of talk briefly about some of the concepts involved
00:04:59.720 | Before we do I did want to talk about one key
00:05:02.200 | New concept
00:05:06.480 | Which is dropout and you might have seen dropout mentioned a bunch of times already and got the got the impression that this is
00:05:12.840 | Something important and indeed it is
00:05:14.840 | So to look at dropout. I'm going to look at the the dog breeds
00:05:18.740 | Current Kaggle competition that's going on, and what I've done is I've gone ahead and I've created a
00:05:28.240 | pre-trained network as per usual
00:05:30.240 | and I've passed in pre compute equals true and so that's going to
00:05:34.520 | Pre-compute the activations that come out of the last convolutional layer. Remember an activation is just a number
00:05:42.080 | It's a number just a reminder
00:05:44.240 | an activation
00:05:46.480 | Like here is one activation. It's a number and
00:05:49.600 | Specifically the activations are calculated based on some
00:05:54.240 | Weights also called parameters that make up
00:05:58.160 | kernels or filters and they get applied to the previous layers activations
00:06:03.160 | Which could well be the inputs or they could themselves be the results of other calculations
00:06:09.000 | Okay, so when we say activation keep remembering we're talking about a number that's being calculated
00:06:13.640 | So we pre compute some activations
00:06:17.580 | And then what we do is we put on top of that a bunch of additional
00:06:22.880 | Initially randomly generated
00:06:24.880 | Fully connected layers, so we're just going to do some matrix multiplications on top of these just like in our Excel worksheet
00:06:31.440 | at the very end
00:06:34.000 | We had this matrix that we just did a matrix multiplication
00:06:39.520 | So what you can actually do is if you just type
00:06:45.360 | The name of your learner object you can actually see
00:06:49.200 | What's in it? You can see the layers in it. So when I was previously been skipping over a little bit about oh
00:06:54.800 | We add a few layers to the end. These are actually the layers that we add
00:06:58.120 | We're going to do batch norm in the last lesson. So don't worry about that for now a
00:07:03.240 | Linear layer simply means a matrix multiply. Okay, so this is a matrix which has a 1024 rows and
00:07:10.360 | 512 columns and so in other words, it's going to take in 1,024 activations and spit out
00:07:18.400 | 512 activations
00:07:20.400 | Then we have a relu which remember is just replace the negatives with zero
00:07:24.600 | We'll skip over the batch norm
00:07:27.200 | We'll come back to drop out and then we have a second linear layer that takes those
00:07:30.800 | 512 activations from the previous linear layer and puts them through a new matrix multiply
00:07:35.880 | 512 by 120 spits out a new 120 activations and then finally put that through
00:07:43.440 | Softmax, and for those of you that don't remember softmax, we looked at that last week
00:07:49.160 | It's this idea that we basically just
00:07:51.960 | Take the previous the activation. Let's say for dog
00:07:55.680 | Go e to the power of that and then divide that into the sum of e to the power of all the activations
00:08:02.960 | So that was the thing that adds up to one all of them add up to one and each one individually is between zero and one
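Written out as a formula, the softmax being described here is:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

so each output is between zero and one and they all add up to one.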
00:08:08.620 | okay, so
00:08:10.960 | That's that's what we added on top and that's the thing when we have pre compute equals true
00:08:15.920 | That's the thing we train so I wanted to talk about what this dropout is and what this P is because it's a really important
00:08:22.400 | Thing that we get to choose
00:08:25.000 | So a dropout layer with P equals zero point five
00:08:28.340 | Literally does this we go over to our spreadsheet and let's pick any layer with some activations and let's say okay
00:08:34.800 | I'm going to apply dropout with a P of zero point five two times two what that means is I go through and
00:08:42.000 | with a 50% chance I
00:08:44.720 | Pick a cell right pick an activation. So I picked like half of them randomly and I delete them
00:08:54.880 | That's that's what dropout is right? So it's so the P equals point five means what's the probability of
00:09:02.080 | deleting that cell
00:09:04.600 | Right. So when I delete those cells
00:09:07.720 | If you have a look at the output
00:09:11.560 | It doesn't actually change by very much at all just a little bit particularly because remember it's going through a max pooling layer
00:09:17.760 | Right, so it's only going to change it at all if it was actually the maximum in that group of four
00:09:22.660 | and furthermore, it's just one piece of you know, if it's going into a convolution rather than into a max pool
00:09:30.460 | It's just one piece of that that filter
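As a minimal sketch (not the course code), here is what dropout with p = 0.5 means for one layer's activations, using a hypothetical numpy array:

```python
import numpy as np

# Each activation is independently "deleted" (set to zero) with probability p.
acts = np.array([0.2, 1.3, 0.7, 2.1, 0.4, 1.8])   # hypothetical activations
p = 0.5                                            # probability of deleting each one
mask = np.random.rand(len(acts)) >= p              # keep with probability 1 - p
print(acts * mask)                                 # dropped activations become zero
```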
00:09:33.720 | so interestingly
00:09:35.720 | The idea of like randomly throwing away half of the activations in a layer
00:09:41.840 | Has a really interesting result and one important thing to mention is each mini batch we throw away a different
00:09:50.960 | random half of activations in that layer and so what it means is
00:09:56.160 | It it forces it to not overfit right in other words if there's some particular activation
00:10:02.920 | That's really learnt just that exact
00:10:05.680 | That exact dog or that exact cat right then when that gets dropped out
00:10:12.160 | The whole thing now isn't going to work as well. It's not going to recognize that image, right?
00:10:17.160 | so it has to in order for this to work it has to try and find a
00:10:21.480 | representation that
00:10:24.280 | That actually continues to work even as random half of the activations get thrown away every time
00:10:31.720 | Right, so it's, I guess, about three or four years old now, and it's been
00:10:39.020 | Absolutely critical in
00:10:43.120 | Making modern deep learning work and the reason why is it really just about solve?
00:10:49.120 | The problem of generalization for us before dropout came along
00:10:53.440 | if you try to train a model with lots of parameters and you were overfitting and
00:11:01.160 | You'd already tried all the data augmentation you could, and you already had as much data as you could.
00:11:07.240 | There were some other things you could try, but to a large degree you were kind of stuck
00:11:11.160 | and so then
00:11:13.800 | Geoffrey Hinton and his colleagues came up with this dropout idea that was loosely inspired by the way the brain works
00:11:22.160 | And also loosely inspired by Geoffrey Hinton's experience in bank teller queues, apparently
00:11:30.160 | yeah, somehow they came up with this amazing idea of like hey, let's let's try throwing things away at random and
00:11:36.080 | So as you can imagine if your P was like 0.01
00:11:42.420 | Then you're throwing away 1% of your activations for that layer at random. It's not going to randomly
00:11:49.320 | Change things up very much at all
00:11:51.920 | So it's not really going to protect you from
00:11:56.040 | Overfitting much at all on the other hand if your P was 0.99
00:12:00.040 | then that would be like going through the whole thing and throwing away nearly everything right and
00:12:06.360 | That would be very hard for it to overfit so that would be great for generalization, but it's also going to kill your
00:12:14.780 | accuracy
00:12:16.880 | so this is kind of
00:12:19.040 | Trade-off: high p values generalize well
00:12:22.860 | But will decrease your training accuracy, and low p values will generalize less well but will give you a better training accuracy
00:12:30.760 | So for those of you that have been wondering why is it that particularly early in training are my validation losses better?
00:12:39.040 | Than my training losses right which seems otherwise really surprising. Hopefully some of you have been wondering why that is
00:12:46.400 | because on a data set that it never gets to see you wouldn't expect the losses to ever be
00:12:51.880 | Better, and the reason why is because when we look at the validation set we turn off dropout
00:12:58.200 | Right so in other words when you're doing inference when you're trying to say is this a cat or is this a dog?
00:13:03.240 | We certainly don't want to be including
00:13:05.800 | Random dropout there right we want to be using the best model we can
00:13:10.300 | Okay, so that's why early in training in particular
00:13:14.840 | We actually see that our validation
00:13:16.840 | Accuracy and loss tends to be better
00:13:19.920 | If we're using dropout, okay, so yes
00:13:24.000 | Do you have to do anything to accommodate for the fact that you are throwing away some activations?
00:13:30.920 | That's a great question, so
00:13:34.280 | We don't but pytorch does so pytorch behind the scenes does two things if you say P equals point five
00:13:42.040 | It throws away half of the activations
00:13:45.360 | but it also
00:13:48.120 | Doubles all the activations that are already there so on average the kind of the average activation doesn't change
00:13:55.040 | Which is pretty pretty neat trick?
00:13:57.640 | So yeah, you don't have to worry about it basically it's done for you
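As a sketch of that rescaling: in training mode PyTorch's nn.Dropout zeroes activations with probability p and scales the survivors by 1/(1-p), and in eval mode it does nothing.

```python
import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
print(drop(x))   # roughly half zeros, the rest are 2.0, i.e. 1 / (1 - 0.5)

drop.eval()
print(drop(x))   # all ones: dropout is switched off at inference time
```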
00:14:02.840 | So you can pass in ps.
00:14:08.560 | This is the p value for all of the added layers, to say
00:14:13.820 | With fastai what dropout do you want on each of the layers in these these added layers?
00:14:19.360 | It won't change the dropout in the pre-trained network like the hope is that that's already been
00:14:25.440 | Pretty trained with some appropriate level of dropout
00:14:28.440 | We don't change it but on these layers that we add you can say how much and so you can see here
00:14:33.080 | I said ps equals 0.5, so my first dropout has 0.5, my second dropout has 0.5
00:14:39.460 | All right, and remember coming to the input of this
00:14:42.240 | Was the output of the last convolutional layer of pre-trained network?
00:14:47.460 | And we actually throw away half of that before it even goes through our linear layer
00:14:53.360 | Throw away the negatives
00:14:56.120 | Throw away half of the result of that go through another linear layer and then pass that to our softmax
00:15:04.080 | For minor numerical precision reasons it turns out to be better to take the log of the softmax than the softmax directly
00:15:12.080 | And that's why you'll have noticed that when you actually get predictions out of our models you always have to go
00:15:17.440 | np.exp of the predictions
00:15:20.040 | Again, the details as to why aren't important. So if we want to
00:15:25.240 | Try removing dropout, we could go ps equals zero
00:15:30.200 | Right, and you'll see whereas before we started with a 0.76 accuracy in the first epoch, now
00:15:35.680 | You've got a point eight accuracy in the first epoch
00:15:37.680 | So by not doing dropout our first epoch worked better not surprisingly because we're not throwing anything away
00:15:44.240 | but by the third epoch here, we had eighty four point eight and
00:15:48.160 | Here we have eighty four point one. So it started out better and ended up worse
00:15:53.080 | So even after three epochs, you can already see we're massively overfitting, right?
00:15:57.520 | We've got point three loss on the train and point five loss on the validation
00:16:03.560 | And so if you look now you can see in the resulting model there's no dropout at all
00:16:11.760 | So if the P is zero, we don't even add it to the model
00:16:14.840 | Another thing to mention is you might have noticed that what we've been doing is we've been adding two
00:16:24.200 | linear layers
00:16:26.000 | Right in our additional layers. You don't have to do that. By the way, there's actually a parameter called extra fully connected
00:16:33.520 | Layers, where you can basically pass a list of how big you want each of the additional fully connected
00:16:41.320 | Layers to be and so by default
00:16:43.320 | Well, you need to have at least one
00:16:45.840 | Right because you need something that takes the output of the convolutional layer
00:16:50.120 | which in this case is a size 1024 and turns it into the number of
00:16:54.800 | Classes you have cats versus dogs would be two dog breeds would be 120
00:17:00.960 | Planet satellite would be 17, whatever. You always need at least one linear layer, and you can't pick how big that is
00:17:08.640 | That's defined by your problem
00:17:10.640 | But you can choose what the other size is or if it happens at all
00:17:15.600 | So if we were to pass in an empty list, then now we're saying don't add any additional linear layers
00:17:21.080 | Just the one that we have to have
00:17:23.080 | Right. So here, if we've got ps equals zero and extra fully connected layers is empty, this is like the minimum
00:17:29.640 | possible
00:17:32.240 | Kind of top model we can put on top and again like if we do that
00:17:37.800 | You can see above we actually end up with in this case a
00:17:44.960 | Reasonably good result because we're not training it for very long and this particular pre-trained network is very well suited
00:17:51.560 | To this particular problem. Yes, you know
00:17:54.040 | So Jeremy, what kind of p should we be using
00:17:58.960 | by default
00:18:01.080 | So the one that's there by default
00:18:04.120 | for the first layer
00:18:06.720 | Is 0.25 and for the second layer is 0.5
00:18:10.800 | That seems to work pretty well
00:18:14.760 | For most things, right? So you don't necessarily need to change it at all
00:18:19.760 | Basically, if you find it's overfitting
00:18:23.200 | Just start bumping it up. So try first of all setting it to 0.5
00:18:28.240 | That'll set them both to 0.5 if it's still overfitting a lot try 0.7 like you can you can narrow down
00:18:34.320 | And like there's not that many
00:18:37.040 | Numbers to try, right, and if you're underfitting
00:18:42.000 | Then you can try making it lower
00:18:44.160 | It's unlikely you would need to make it much lower because like even in these dogs versus cats situations
00:18:51.600 | You know, we don't seem to have to make it lower so it's more likely you'd be increasing it to like 0.6 or 0.7
00:18:58.800 | But you can fiddle around I find these the ones that are there by default seem to work pretty well most of the time
00:19:05.680 | So one place I actually did increase this
00:19:14.080 | Was in the dog breeds one. I did set them both to 0.5
00:19:14.080 | when I used a
00:19:16.760 | Bigger model so like resnet 34 has less parameters
00:19:21.120 | So it doesn't overfit as much, but then when I started bumping it up to like a ResNet-50
00:19:26.420 | Which has a lot more parameters. I noticed it started overfitting. So then I also increased my dropout. So as you use like
00:19:32.920 | Bigger models you'll often need to add more dropout. Can you pass that over there, please? You know
00:19:39.360 | If you set p to 0.5, roughly what percentage is dropped? 50%? 50%, yeah
00:19:48.680 | Was there how can you pass that back?
00:19:51.640 | Thanks. Is there a particular way in which you can determine if the data is being old fitted?
00:20:01.280 | You can see that the like here you can see that the training error is a
00:20:07.200 | Loss is much lower than the validation loss
00:20:09.760 | you can't tell if it's like
00:20:12.520 | to overfitted like
00:20:15.080 | Zero overfitting is not generally optimal like the only way to find that out is
00:20:19.920 | Remember the only thing you're trying to do is to get this number low right the validation loss number low
00:20:24.440 | So in the end you kind of have to play around with a few different things and see which thing ends up getting the validation
00:20:31.080 | Loss low, but you're kind of going to feel over time for your particular problem
00:20:36.720 | What does overfitting? What does too much overfitting look like?
00:20:40.240 | Great so
00:20:44.840 | So that's dropout, and we're going to be using that a lot, and remember it's there by default. Sorry, is there another question?
00:20:50.880 | So I have two questions one is
00:20:55.520 | So when it says the dropout rate is 0.5
00:21:00.280 | Does it, like, you know, delete each cell with a probability of
00:21:06.120 | 0.5, or does it just pick 50% randomly? I mean, I know they're both effectively the same
00:21:11.280 | It's the former yeah, okay, okay, second question is why why does the average activation matter?
00:21:17.920 | well, it matters because the remember if you look at the
00:21:22.960 | Excel spreadsheet that the result of
00:21:26.720 | this cell for example is equal to
00:21:31.520 | These
00:21:38.360 | Multiplied by each of these nine
00:21:40.520 | Right and add it up, so if we deleted half of these
00:21:44.000 | Then that would also cause this number to half which would cause like everything else after that to change and so if you change
00:21:51.600 | What it means you know like it then you're changing something that used to say like oh
00:21:57.080 | Fluffy ears are fluffy if this is greater than point six now
00:22:00.720 | It's only fluffy if it's greater than point three like we're changing the meaning of everything
00:22:04.000 | So you the goal here is to delete things without changing
00:22:08.800 | We're using a linear activation for one of the earlier activations
00:22:17.560 | Why are we using linear? Yeah? Why that particular activation?
00:22:22.040 | Because that's what this set of layers is so we've we've the the pre trained network is all is the convolutional network
00:22:28.960 | And that's pretty computed, so we don't see it so what that spits out is a vector
00:22:35.320 | So the only choice we have is to use linear layers at this point
00:22:41.760 | Can we have different level of dropout by layer? Yes, absolutely how to do that great so
00:22:49.880 | You can absolutely have different dropout by layer, and that's why this is actually called ps
00:22:54.720 | So you can pass in an array here, so if I went zero
00:22:58.400 | comma 0.2 for example and then extra fully connected. I might add 512
00:23:05.120 | Right then that's going to be zero dropout before the first of them and point two dropout before the second of them
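A sketch of what that call might look like with the fastai (0.7) library used in the course; the argument names ps and xtra_fc are as I recall them, and arch and data are assumed to be defined earlier in the notebook:

```python
# Per-layer dropout via `ps`, one extra 512-wide fully connected layer via `xtra_fc`.
learn = ConvLearner.pretrained(arch, data,
                               ps=[0., 0.2],    # no dropout before the first added layer, 0.2 before the second
                               xtra_fc=[512],   # one additional hidden linear layer of size 512
                               precompute=True)
```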
00:23:12.800 | Yes. And I must admit, I don't have a great
00:23:17.760 | Intuition even after doing this for a few years for like
00:23:20.640 | When should earlier or later layers have different amounts of dropout?
00:23:26.640 | It's still something I kind of play with and I can't quite find rules of thumb
00:23:32.000 | So there's some of you come up with some good rules of thumb. I'd love to
00:23:35.020 | Hear about them. I think if in doubt
00:23:37.840 | You can use the same dropout in every fully connected layer
00:23:42.040 | The other thing you can try is often people only put dropout on the very last
00:23:48.040 | Linear layer, so that'd be the two things to try
00:23:50.440 | So Jeremy, why do you monitor the log loss the loss instead of the accuracy going up?
00:24:00.800 | Well because the loss is the only thing that we can see
00:24:05.080 | For both the validation set and the training set so it's nice to be able to compare them
00:24:13.440 | also as we'll learn about
00:24:16.720 | Later the loss is the thing that we're actually
00:24:19.400 | optimizing
00:24:22.240 | So it's it's kind of a little more. It's a little easier to monitor that and understand what that means
00:24:28.520 | Can you pass it over there?
00:24:32.120 | So with dropout we are kind of adding some random noise every iteration right so
00:24:39.240 | So that means that we don't do as much learning right or actually so that's right
00:24:45.800 | So we have to play around with the learning rate and it doesn't seem to impact the learning rate
00:24:50.860 | Enough that I've ever noticed it. I I would say you're probably right in theory it might but not enough that it's ever affected me
00:24:59.280 | Okay, so let's talk about this
00:25:07.360 | Structured data problem, and so to remind you, we were looking at Kaggle's Rossmann competition
00:25:15.160 | Which is a German
00:25:17.160 | Chain of supermarkets, I believe and you can find this in lesson 3 Rossman
00:25:26.280 | The main data set is the one where we were looking to say at a particular store
00:25:33.040 | How much did they sell?
00:25:36.040 | Okay, and there's a few big key pieces of information one is what was the date another was were they open?
00:25:42.840 | Did they have a promotion on?
00:25:44.840 | Was it a holiday in that state?
00:25:47.400 | And was it a holiday as for school a state holiday there?
00:25:51.360 | Or was it a school holiday there and then we had some more information about stores like what for this store?
00:25:57.200 | What kind of stuff did they tend to sell what kind of store are they how far away the competition and so forth so?
00:26:03.240 | With the data set like this there's really two main kinds of column. There's columns that we think of as
00:26:10.600 | Categorical they have a number of levels so the assortment
00:26:13.760 | Column is categorical, and it has levels such as a B and C
00:26:19.200 | Whereas something like competition distance we would call continuous
00:26:25.380 | It has a number attached to it where differences or ratios of that number have some kind of meaning
00:26:31.480 | And so we need to deal with these two things quite differently, okay, so anybody who's done any
00:26:39.240 | Machine learning of any kind will be familiar with using continuous columns if you've done any linear regression for example
00:26:45.400 | You can just like multiply them by parameters for instance
00:26:48.680 | Categorical columns we're going to have to think about a little bit more
00:26:52.440 | We're not going to go through the data cleaning or the feature engineering; we're going to assume all that's been done
00:27:00.240 | And so basically
00:27:04.280 | at the end of that we have a list of columns, and in this case I
00:27:09.960 | Didn't do any of the thinking around the feature engineering or data cleaning myself
00:27:16.920 | This is all directly from the third place winners of this competition
00:27:20.680 | And so they came up with all of these different
00:27:25.160 | Columns that they found useful
00:27:28.760 | and so
00:27:30.640 | You'll notice the list here is a list of the things that we're going to treat as categorical variables
00:27:37.000 | Numbers like year, month and day
00:27:42.480 | Although we could treat them as continuous, like the difference between 2000 and 2003 is meaningful
00:27:51.200 | We don't have to right and you'll see shortly how
00:27:55.080 | how categorical
00:27:59.840 | variables are treated
00:28:00.880 | But basically if we decide to make something a categorical variable what we're telling our neural net down the track is
00:28:07.480 | That for every different level of say year, you know, 2000 2001 2002 you can treat it totally differently
00:28:14.920 | Whereas if we say it's continuous, it's going to have to come up with some kind of, like, function, some kind of smooth-ish
00:28:22.120 | function right and so often even for things like year that actually are continuous
00:28:29.280 | But they don't actually have many distinct levels it often works better
00:28:33.640 | To treat it as categorical
00:28:36.200 | So another good example day of week, right? So like day of week between naught and six
00:28:42.080 | It's a number and it means something like the difference between three and five is two days and has meaning but if you think about
00:28:49.680 | like how would
00:28:51.680 | Sales in a store vary by day of week
00:28:54.520 | It could well be that like, you know, Saturdays and Sundays are over here and Fridays are over here and Wednesdays over here
00:29:00.860 | Like each day is going to behave
00:29:03.000 | Kind of qualitatively differently, right? So by saying this is the categorical variable as you'll see we're going to let the neural net
00:29:11.920 | Do that right? So this thing where we get where we say
00:29:15.960 | Which are continuous and which are categorical to some extent? This is a modeling decision you get to make
00:29:23.440 | now if something is coded in your data is like a B and C or
00:29:29.560 | You know Jeremy and your net or whatever you actually you're going to have to call that categorical, right?
00:29:36.400 | There's no way to treat that directly as a continuous variable
00:29:40.000 | On the other hand if it starts out as a continuous variable like age or day of week
00:29:45.540 | You get to decide
00:29:48.280 | Whether you want to treat it as continuous or categorical. Okay, so summarize if it's categorical in the data
00:29:54.060 | It's going to have to be categorical in the model if it's continuous in the data
00:29:58.240 | You get to pick whether to make it continuous or categorical in the model
00:30:02.680 | So in this case again, I just did whatever the third place winners of this competition did
00:30:09.440 | These are the ones that they decided to use as categorical. These were the ones they decided to use as continuous and you can see
00:30:15.360 | that basically
00:30:18.000 | The continuous ones are all of the ones which are actual
00:30:22.120 | Floating point numbers like competition distance actually has a decimal place to it, right and temperature actually has a decimal place to it
00:30:30.080 | So these would be very hard to make
00:30:32.080 | categorical because they have many many levels right like if it's like five digits of floating point then potentially there will be as
00:30:40.640 | many levels as there are rows.
00:30:43.160 | And by the way, the word we use to say how many levels are in a category
00:30:49.400 | We use the word cardinality, right?
00:30:51.400 | So if you hear me say cardinality for example the cardinality of the day of week
00:30:55.200 | Variable is seven because there are seven different days of the week
00:30:58.480 | Do you have a heuristic for when to bin continuous variables, or do you ever bin variables? I don't ever bin continuous variables
00:31:11.800 | So yeah, so one thing we could do with like max temperature is group it into
00:31:16.520 | 0 to 10 10 to 20 20 to 30 and then call that categorical
00:31:21.000 | interestingly a paper just came out last week in which a group of researchers found that
00:31:28.280 | Sometimes binning can be helpful
00:31:30.440 | But that literally came out in the last week and until that time I haven't seen anything in deep learning saying that so I haven't
00:31:36.440 | I haven't looked at it myself until this week. I would have said it's a bad idea
00:31:41.360 | Now I have to think differently. I guess maybe it is sometimes
00:31:44.480 | So if you're using
00:31:51.480 | Year as a category, what happens when you run the model on a year it's never seen? So you trained it on
00:31:58.080 | Well, we'll get there. Yeah, the short answer is it'll be treated as an unknown category
00:32:04.300 | And so pandas, which is the underlying data frame library
00:32:08.840 | We're using, has a special category called unknown, and if it sees a category it hasn't seen before it gets treated as unknown
00:32:16.800 | So for our deep learning model unknown would just be another category
00:32:22.600 | If our training data set doesn't have a category and
00:32:32.080 | Test has an unknown, how will it be? It'll just be part of this unknown category. Will it still predict?
00:32:39.480 | It'll predict something right like it will just have the value
00:32:42.940 | 0 behind the scenes and if there's been any unknowns of any kind in the training set then it'll have learned a
00:32:49.680 | Way to predict unknown if it hasn't it's going to have some random vector. And so that's a
00:32:56.480 | Interesting detail around training that we probably won't talk about in this part of the course
00:33:01.720 | But we can certainly talk about on the forum
00:33:03.720 | Okay, so we've got our categorical and continuous variable lists defined; in this case there were about 800,000 rows
00:33:14.720 | So 800,000 dates basically by stores
00:33:18.520 | And so you can now take all of these columns
00:33:25.120 | loop through each one and
00:33:28.880 | Replace it in the data frame with a version where you say take it and change its type to category
00:33:34.800 | Okay, and so that just that's just a pandas thing. So I'm not going to teach you pandas
00:33:41.120 | There's plenty of books; particularly Wes McKinney's book Python for Data Analysis is great
00:33:47.080 | But hopefully it's intuitive as to what's going on even if you haven't seen the specific syntax before
00:33:52.320 | So we're going to turn that column into a categorical column
00:33:56.840 | And then for the continuous variables, we're going to make them all
00:33:59.920 | 32-bit floating-point and for the reason for that is that PyTorch
00:34:05.400 | Expects everything to be 32-bit floating-point. Okay, so like some of these include like
00:34:13.480 | 1-0 things like
00:34:16.720 | Can't see them straight away. But anyway, some of them. Yeah, like was there a promo was was a holiday
00:34:23.760 | And so those will become the floating-point values one and zero instead.
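A minimal sketch of those conversions, assuming df, cat_vars and contin_vars are the data frame and column-name lists from the notebook:

```python
for c in cat_vars:
    df[c] = df[c].astype('category').cat.as_ordered()   # pandas categorical column

for c in contin_vars:
    df[c] = df[c].astype('float32')                      # PyTorch wants 32-bit floats
```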
00:34:29.640 | I try to do as much of my work as possible on
00:34:35.640 | small data sets
00:34:38.000 | For when I'm working with images that generally means resizing the images to like 64 by 64 or 128 by 128
00:34:45.320 | We can't do that with structured data. So instead I tend to take a sample. So I randomly pick a few rows
00:34:53.080 | So I start running with a sample and I can use exactly the same thing that we've seen before
00:34:57.920 | For getting a validation set we can use the same way to get some random
00:35:02.460 | Random row numbers to use in a random sample. Okay, so this is just a bunch of random numbers
00:35:09.280 | And then okay, so that's going to be a size 150,000 rather than 840,000
00:35:20.840 | And so my data that before I go any further it basically looks like this. You can see I've got some booleans here
00:35:27.880 | I've got some
00:35:29.880 | Integers here of various different scales. There's my year 2014
00:35:35.240 | And I've got some letters here. So even though I said
00:35:39.880 | Please call that a pandas category
00:35:42.880 | Pandas still displays that in the notebook as strings, right?
00:35:49.000 | It's just stored in internally differently
00:35:51.440 | So then the fastai library has a special little function called proc_df (process data frame), and
00:35:57.440 | proc_df takes a data frame and you tell it what's my dependent variable
00:36:03.200 | Right, and it does a few different things
00:36:05.720 | The first thing is it pulls out that dependent variable and puts it into a separate variable
00:36:10.620 | Okay, and deletes it from the original data frame
00:36:13.800 | So df now does not have the sales column in it, whereas y just contains the sales column
00:36:19.880 | Something else that it does is scaling.
00:36:24.600 | so neural nets
00:36:27.040 | Really like to have the input data to all be somewhere around zero with a standard deviation of somewhere around one
00:36:34.920 | all right, so we can always take our data and
00:36:37.200 | Subtract the mean and divide by the standard deviation to make that happen
00:36:43.080 | So that's what do scale equals true does and it actually returns a special object
00:36:47.480 | Which keeps track of what mean and standard deviation it used for that normalizing
00:36:52.560 | So you can then do the same thing to the test set later
00:36:56.180 | It also handles missing values
00:37:01.040 | Missing values in categorical variables just become the ID 0, and then all the other categories become 1, 2, 3, 4, 5 for that
00:37:09.680 | categorical variable
00:37:11.560 | for continuous variables that replaces the
00:37:15.080 | missing value with the median
00:37:18.080 | And creates a new column
00:37:20.400 | That's a Boolean and just says is this missing or not and I'm going to skip over this pretty quickly because we talk about this
00:37:26.120 | In detail in the machine learning course, okay, so if you've got any questions about this part
00:37:30.800 | That would be a good place to go. It's nothing deep learning specific there
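A hedged sketch of the proc_df call being described (the name joined_samp is an assumption based on the notebook): it splits off the dependent variable, numericalises the categories, handles missing values, and with do_scale=True normalises the continuous columns, returning a mapper so the same scaling can be applied to the test set later:

```python
df, y, nas, mapper = proc_df(joined_samp, 'Sales', do_scale=True)
```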
00:37:36.040 | So you can see afterwards year 2014
00:37:39.800 | For example has become year 2 ok because these categorical variables have all been replaced with
00:37:44.560 | With contiguous integers starting at 0
00:37:48.440 | Right and the reason for that is later on we're going to be putting them into a matrix
00:37:53.680 | Right and so we wouldn't want the matrix to be 2014 rows long when it could just be 2 rows long
00:37:59.440 | so that's the basic idea there, and you'll see that the
00:38:05.120 | 'a' and 'c', for example, have been replaced in the same way with 1 and 3
00:38:09.840 | Okay, so we now have a data frame
00:38:14.040 | Which does not contain the dependent variable and where everything is a number okay?
00:38:18.880 | And so that's the that's where we need to get to to do deep learning and all of the stage above that
00:38:24.160 | As I said we talk about in detail in the machine learning course nothing deep learning specific about any of it
00:38:29.860 | This is exactly what we throw into our random forests as well, so
00:38:34.280 | Another
00:38:36.280 | Thing we talk about a lot in the machine learning course of course is validation sets
00:38:40.200 | In this case we need to predict the next two weeks of sales
00:38:45.800 | Right it's not like pick a random set of sales, but we have to pick the next two weeks of sales. That was what the Kaggle
00:38:53.200 | competition folks told us to do
00:38:55.960 | And therefore I'm going to create a validation set which is the last two weeks of
00:39:02.320 | my training set right to try and make it as similar to the test set as possible and
00:39:06.640 | We just posted actually Rachel wrote this thing last week about
00:39:11.480 | Creating validation sets, so if you go to fast.ai you can check it out
00:39:16.320 | We'll put that in the lesson wiki as well
00:39:18.840 | But it's basically a summary of a recent machine learning lesson that we did
00:39:25.180 | The videos are available for that as well, and this is kind of a written a written summary of it
00:39:33.960 | So yeah
00:39:37.160 | So Rachel has spent a lot of time thinking about, kind of, you know
00:39:39.760 | How do you need to think about validation sets and training sets and test sets and so forth and that's all there?
00:39:45.480 | But again, nothing deep learning specific, so let's get straight to the the deep learning action, okay?
00:39:51.400 | so in this particular competition as always with any competition or any kind of
00:39:59.920 | Machine learning project you really need to make sure you have a strong understanding of your metric
00:40:05.280 | How are you going to be judged here and in this case?
00:40:08.400 | You know Kaggle makes it easy, they tell us how we're going to be judged, and so we're going to be judged on the root
00:40:13.080 | mean squared
00:40:14.440 | percentage error
00:40:15.920 | Right so we're going to say like oh you predicted three
00:40:19.180 | It was actually three point three so you were ten percent out
00:40:24.520 | And then we're going to average all those percents right and remember. I warned you that
00:40:30.480 | You are going to need to make sure you know logarithms really well right and so in this case from you know
00:40:37.920 | We're basically being saying your prediction divided by the actual the mean of that
00:40:43.880 | Right is the thing that we care about
00:40:46.880 | and so we don't have a
00:40:52.160 | Metric in Pytorch called root mean squared percent error
00:40:54.880 | We could actually easily create it by the way if you look at the source code
00:41:00.480 | You'll see like it's you know a line of code, but easier still would be to realize that
00:41:05.000 | That if you have
00:41:09.240 | That right then you could replace a with like
00:41:13.320 | Log of a dash and B with like log of B dash
00:41:17.960 | And then you can replace that whole thing with a subtraction
00:41:22.040 | That's just the rule of logs right and so if you don't know that rule
00:41:28.520 | Then you know make sure you go look it up because it's super helpful
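For reference, the rule of logs being pointed to is

$$\log\frac{a}{b} = \log a - \log b$$

so the ratio of prediction to actual turns into a difference of logs, which is roughly what the root mean squared error measures once the data has been log-transformed.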
00:41:31.200 | But it means in this case all we need to do is to
00:41:34.520 | Take the log of our data
00:41:38.640 | which I actually did earlier in this
00:41:41.200 | Notebook, and when you take the log of the data, getting the root mean squared error
00:41:46.120 | Will actually get you the root mean squared percentage error for free, okay?
00:41:50.720 | But then when we want to, like, print out our root mean squared percentage error
00:41:55.440 | We actually have to go e to the power of it
00:41:58.640 | Again, right and then we can actually return the percent difference, so that's all that's going on here
00:42:04.760 | It's again. Not really deep learning specific at all
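A sketch of that metric; the course notebook defines something very much like this (e.g. as exp_rmspe). Predictions and targets are in log space, so we exponentiate back before computing the root mean squared percentage error:

```python
import math
import numpy as np

def exp_rmspe(y_pred, targ):
    targ = np.exp(targ)                        # back from log space to actual sales
    pct_var = (targ - np.exp(y_pred)) / targ   # percentage error per row
    return math.sqrt((pct_var ** 2).mean())
```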
00:42:06.960 | So here we finally get to the deep learning alright, so as per usual like you'll see everything
00:42:15.840 | We look at today looks exactly the same as everything. We've looked at so far. Which is first we create a model data object
00:42:22.620 | Something that has a validation set
00:42:25.400 | Training set an optional test set built into it from that we will get a learner we will then
00:42:32.000 | Optionally call learner.lr_find, we'll then call learner.fit
00:42:37.600 | It'll be all the same parameters and everything that you've seen many times before okay
00:42:42.320 | So the difference though is obviously we're not going to go
00:42:45.560 | ImageClassifierData.from_csv or .from_paths; we need to get some different kind of model data
00:42:53.640 | And so for stuff that is in rows and columns
00:42:56.280 | We use ColumnarModelData
00:42:58.960 | Okay, but this will return an object with basically the same API that you're familiar with and rather than from paths
00:43:06.720 | Or from CSV this is from data frame, okay, so this gets past a few things
00:43:12.520 | The path here is just used for it to know where should it store?
00:43:17.600 | Like model files or stuff like that right this is just basically saying where do you want to store anything that you save later?
00:43:24.160 | This is the list of the indexes of the rows that we want to put in the validation set we created earlier
00:43:30.920 | Here's our data frame
00:43:33.560 | okay, and
00:43:38.040 | Let's have a look here's this is where we did the log right so I talked the
00:43:42.680 | The y that came out of proc_df, our dependent variable, I logged it and I called that yl
00:43:48.180 | Right so we tell it
00:43:50.680 | When we create our model data we need to tell it that's our dependent variable
00:43:54.200 | So so far we've got list of the stuff to go in the validation set which is what's our independent variables?
00:44:00.800 | What's our dependent variables and then we have to tell it which things do we want treated as categorical right?
00:44:07.120 | Because remember by this time
00:44:09.800 | Everything's a number
00:44:14.600 | Right so it could do the whole thing as if it's continuous it would just be totally meaningless
00:44:20.260 | Right so we need to tell it which things do we want to treat as categories and so here we just pass in
00:44:26.640 | That list of names that we use before
00:44:30.960 | okay, and then a bunch of the parameters are the same as the ones you're used to for example you can set the batch size
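A hedged sketch of that call, following the fastai 0.7 API as I recall it; PATH, val_idx, df, yl and cat_vars are assumed from the surrounding notebook:

```python
md = ColumnarModelData.from_data_frame(PATH, val_idx, df,
                                       yl.astype(np.float32),   # logged dependent variable
                                       cat_flds=cat_vars,       # columns to treat as categorical
                                       bs=128)                  # batch size
```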
00:44:37.800 | Yeah, so after we do that. We've got a
00:44:42.180 | You know a standard
00:44:48.400 | Model data object with a trn_dl
00:44:54.840 | Attribute, there's a val_dl attribute, a trn_ds attribute, a val_ds attribute. It's got a length
00:44:54.840 | It's got all this stuff
00:44:56.480 | Exactly like it did in all of our
00:44:59.920 | image based
00:45:01.920 | data objects
00:45:03.840 | Okay, so now we need to create the the model or create the learner and so to skip ahead a little bit
00:45:10.280 | We're basically going to pass in something that looks pretty familiar
00:45:15.120 | We're going to be passing saying from our model from our model data
00:45:18.560 | Create a learner that is suitable for it
00:45:21.560 | And we'll basically be passing in a few other bits of information which will include
00:45:27.640 | How much dropout to use at the very start?
00:45:29.920 | How many activations to have in each layer, how much dropout to use at the later layers
00:45:38.120 | But then there's a couple of extra things that we need to learn about and specifically it's this thing called
00:45:44.560 | embeddings
00:45:47.120 | So this is really the key new concept we have to learn about all right, so
00:45:55.960 | All we're doing basically is we're going to take our
00:45:59.520 | Let's forget about categorical variables for a moment and just think about the continuous variables
00:46:05.920 | For our continuous variables all we're going to do
00:46:09.680 | Is we're going to grab them all
00:46:12.760 | Okay, so for our continuous variables, we're basically going to say like okay, here's a
00:46:22.520 | big list of all of our continuous variables like the minimum temperature and
00:46:26.600 | maximum temperature and the distance to the nearest competitor and so forth right and so here's just a bunch of
00:46:33.480 | floating point numbers and so basically what the neural nets going to do is it's going to take that that 1d array or
00:46:40.120 | Or vector or to be very DL like
00:46:45.520 | rank 1 tensor
00:46:48.200 | All means the same thing okay, so we're going to take our rank 1 tensor
00:46:51.680 | And let's put it through a matrix multiplication, so let's say this has got like I don't know 20
00:46:57.880 | continuous variables, and then we can put it through a matrix which
00:47:03.160 | Must have 20 rows; that's how matrix multiplication works, and then we can decide how many columns we want, right
00:47:10.400 | So maybe we decided 100 right and so that matrix multiplication is going to spit out a new
00:47:15.800 | length 100
00:47:19.040 | rank 1 tensor
00:47:20.800 | Okay, that's what a matrix product does, and that's the definition of a linear layer
00:47:28.080 | in deep learning
00:47:30.160 | Okay, and so then the next thing we do is we can put that through a relu right which means we throw away the negatives
00:47:37.200 | Okay, and now we can put that through another matrix product. Okay, so this is going to have to have a hundred rows by definition
00:47:45.100 | And we can have as many columns as we like and so let's say maybe this was
00:47:50.840 | The last layer so the next thing we're trying to do is to predict sales
00:47:55.040 | So there's just one
00:47:57.720 | value, we're trying to predict the sales so we could put it through a
00:48:00.400 | Matrix product that just had one column and that's going to spit out a single number
00:48:05.520 | All right, so that's like
00:48:08.280 | That's kind of like a one layer
00:48:11.440 | Neural net if you like now in practice, you know we wouldn't make it one layer
00:48:18.440 | so we'd actually have like
00:48:20.440 | You know, maybe we'd have 50 here and so then that gives us a 50 long vector and
00:48:32.800 | Maybe we then put that into our final
00:48:35.020 | 50 by 1
00:48:38.760 | And that spits out a single number and one reason I wanted to change that there was to point out, you know, relu
00:48:44.920 | You would never put relu in the last layer
00:48:48.240 | Like you'd never want to throw away the negatives, because the softmax
00:48:57.320 | Needs negatives in it because it's the negatives that are the things that allow it to create low probabilities
00:49:02.820 | That's minor detail, but it's useful to remember. Okay, so basically
00:49:08.120 | So basically a
00:49:16.240 | simple view of a
00:49:18.240 | Fully connected neural net is something that takes in as an input a rank one tensor
00:49:29.720 | It puts it through a linear layer, an
00:49:29.720 | Activation layer another linear layer
00:49:34.680 | Softmax and
00:49:38.400 | That's the output
00:49:41.960 | Okay, and so we could obviously decide to add more
00:49:46.920 | Linear layers we could decide maybe to add dropout
00:49:51.000 | Right. So these are some of the decisions that we we get to make right but we there's not that much we can do
00:49:58.800 | Right. There's not much really crazy architecture stuff to do. So when we come back to
00:50:03.540 | Image models later in the course
00:50:06.520 | We're going to learn about all the weird things that go on and like res nets and inception networks and blah blah blah
00:50:12.100 | But in these fully connected networks, they're really pretty simple. They're just interspersed
00:50:16.600 | linear layers that is matrix products and
00:50:19.580 | Activation functions like relu, and a softmax at the end
00:50:24.680 | And if it's not classification which actually ours is not classification in this case. We're trying to predict sales
00:50:31.900 | There isn't even a softmax
00:50:34.420 | Right, we don't want it to be between 0 and 1
00:50:37.780 | Okay, so we can just throw away the last activation all together
00:50:41.580 | If we have time we can talk about a slight trick we can do there but for now we can think of it that way
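A minimal PyTorch sketch of the fully connected net just drawn, for 20 continuous inputs: linear, relu, linear, relu, then a single output with no final activation, because we are predicting a raw sales number rather than class probabilities:

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(20, 100),
    nn.ReLU(),
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Linear(50, 1),
)

x = torch.randn(4, 20)   # a batch of 4 rows, each with 20 continuous variables
print(model(x).shape)    # torch.Size([4, 1])
```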
00:50:48.940 | So that was all assuming that everything was continuous, right? But what about categorical, right? So we've got like
00:50:57.360 | Day of week
00:51:01.500 | right and
00:51:04.500 | We're going to treat it as categorical, right? So it's like Saturday Sunday Monday
00:51:15.220 | Friday
00:51:16.760 | okay, how do we feed that in because I want to find a way of getting that in so that we still end up with a
00:51:22.940 | rank one tensor of floats and
00:51:24.940 | so the trick is this we create a new little matrix of
00:51:29.240 | With seven rows
00:51:33.700 | And as many columns as we choose right so let's pick four right so here's our
00:51:40.860 | Seven rows and
00:51:44.380 | four columns
00:51:46.900 | Right and basically what we do is let's add our categorical variables to the end. So let's say the first row was Sunday
00:51:55.380 | Right then what we do is we do a lookup into this matrix and we say oh here's Sunday
00:52:01.660 | We do a lookup into here and we grab
00:52:04.340 | This row and so this matrix we basically fill with floating point numbers. So we're going to end up grabbing a
00:52:11.540 | little
00:52:14.020 | Subset of four floating point numbers; Sunday's particular four floating point numbers
00:52:20.720 | And so that way we convert
00:52:23.500 | Sunday
00:52:25.740 | Into a rank one tensor of four floating point numbers and initially those four numbers are random
00:52:33.080 | Right and in fact this whole thing we initially start out
00:52:37.100 | random, okay
00:52:40.020 | But then we're going to put that through our neural net, right?
00:52:44.260 | So we basically then take those four numbers and we remove Sunday instead we add
00:52:49.660 | Our four numbers on here, right? So we've turned our categorical thing into a floating point vector
00:52:56.360 | Right and so now we can just put that through our neural net
00:53:00.100 | just like before and at the very end we find out the loss and
00:53:04.300 | then we can figure out which direction is down and
00:53:08.180 | Do gradient descent in that direction and eventually that will find its way back
00:53:12.940 | To this little list of four numbers and it'll say okay those random numbers weren't very good
00:53:18.620 | This one needs to go up a bit that one needs to go up a bit that one needs to go down a bit
00:53:22.660 | That one needs to go up a bit and so we'll actually update
00:53:25.260 | our original those four numbers in that matrix and
00:53:29.340 | We'll do this again and again and again
00:53:31.780 | And so this this matrix will stop looking random and it will start looking more and more like like
00:53:37.660 | The exact four numbers that happen to work best for Sunday the exact four numbers that happen to work best for Friday and so forth
00:53:45.700 | And so in other words this matrix is just another bunch of weights
00:53:51.000 | in our neural net
00:53:53.780 | All right, and so matrices of this type are called
00:53:57.180 | embedding matrices
00:54:00.540 | So an embedding matrix is something where we start out with an
00:54:10.100 | integer between zero and the maximum number of levels of that category
00:54:15.420 | We literally index into a matrix to find our particular row
00:54:20.460 | So if it was the level was one we take the first row
00:54:24.420 | we grab that row and
00:54:27.340 | we append it to all of our continuous variables and
00:54:31.100 | So we now have a new
00:54:35.020 | Vector of continuous variables and when we can do the same thing for let's say zip code
00:54:39.540 | Right, so we could like have an embedding matrix. Let's say there are 5,000 zip codes
00:54:45.260 | It would be 5,000 rows long as wide as we decide maybe it's 50 wide and so we'd say okay. Here's
00:54:52.140 | nine four zero zero three
00:54:54.860 | That zip code is index number four in our matrix
00:54:58.560 | So go down and we find the fourth row regret those 50 numbers and append those
00:55:03.900 | Onto our big vector and then everything after that is just the same. We just put it through a linear layer value linear layer, whatever
00:55:10.460 | What are those four numbers
00:55:15.180 | Represent that's a great question and we'll learn more about that when we look at collaborative filtering for now
00:55:21.860 | They represent no more or no less than any other parameter in our neural net, you know, they're just
00:55:28.900 | They're just parameters that we're learning that happen to end up giving us
00:55:33.980 | a good loss
00:55:35.780 | We will discover later that these particular parameters often
00:55:39.260 | However, are human interpretable and can be quite interesting, but that's a side effect of them. It's not
00:55:45.660 | Fundamental they're just four random numbers for now that we're that we're learning or sets of four random numbers
00:55:52.940 | Do you have a good heuristic for the dimensionality of the embedding matrix? So why four here?
00:56:02.660 | sure do
00:56:10.940 | What I first of all did was I made a little list of every categorical variable and its cardinality
00:56:17.460 | Okay, so there's a thousand-plus different stores
00:56:23.460 | apparently in Rossman's network
00:56:26.620 | There are eight days of the week
00:56:28.740 | That's because there are seven days of the week plus one left over for unknown
00:56:32.700 | Even if there were no missing values in the original data
00:56:36.060 | I always still set aside one just in case there's a missing or an unknown or something different in the test set
00:56:41.900 | Again, four years, but that's actually three plus room for an unknown, and so forth. Alright, so what I do
00:56:49.380 | My rule of thumb is this
00:56:52.300 | Take the cardinality of the variable
00:56:57.660 | Divide it by two
00:56:59.660 | But don't make it bigger than 50
00:57:01.700 | Okay, so
00:57:04.700 | These are my embedding matrices. So my store matrix. So the that has to have a
00:57:10.140 | thousand one hundred and sixteen rows, because I need to look up, right, to find, oh here's store number three, and then it's going to return back a
00:57:18.380 | Rank one tensor of length 50
00:57:21.940 | Day of week it's going to look up into which one of the eight and return the thing of length four
00:57:28.400 | So would you typically build an embedding matrix for each categorical feature? Yes. Yeah, so that's what I've done here
00:57:38.300 | So I've said
00:57:40.300 | For C in categorical variables
00:57:44.140 | See how many categories there are and
00:57:49.260 | then for each of those things
00:57:52.140 | create one of these and
00:57:55.260 | Then this is called embedding sizes
00:57:57.380 | And then you may have noticed that that's actually the first thing that we pass to get learner
00:58:03.260 | And so that tells it for every categorical variable. That's the embedding matrix to use for that variable
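Roughly, the loop being described looks like this (a sketch; `df` and `cat_vars` are assumed names for the pandas data frame and the list of categorical column names, and the columns are assumed to already be pandas category dtype):

```python
# For each categorical variable, record its cardinality plus one for an "unknown" level,
# then apply the rule of thumb: embedding width is the cardinality over two, capped at 50.
cat_sz  = [(c, len(df[c].cat.categories) + 1) for c in cat_vars]
emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]
# emb_szs is then the first thing passed to get_learner
```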
00:58:09.660 | There's a question behind you, isn't there? Yes
00:58:12.420 | So besides
00:58:17.060 | Random initialization are there other ways to actually initialize embedding?
00:58:21.000 | Yes or no, there's two ways one is random the other is pre-trained and
00:58:28.460 | We'll probably talk about pre-trained more later in the course
00:58:32.060 | But the basic idea though is if somebody else at Rossman had already trained a neural net
00:58:36.280 | just like you would use a pre-trained net from ImageNet to look at pictures of cats and dogs, if
00:58:42.300 | Somebody else has pre-trained a network to predict cheese sales in Rossman
00:58:47.200 | You may as well start with their embedding matrix of stores to predict liquor sales in Rossman
00:58:52.680 | And this is what happens for example at
00:58:55.280 | At Pinterest and Instacart they both use this technique Instacart uses it for routing their shoppers
00:59:03.200 | Pinterest uses it for deciding what to display on a web page when you go there and they have
00:59:08.920 | embedding matrices of products
00:59:12.000 | In Instacart's case of stores that get shared in the organization so people don't have to train new ones
00:59:19.800 | So for the embedding size
00:59:25.760 | Why wouldn't you just use like the one hot scheme and just
00:59:31.400 | Well, what is the advantage of doing this?
00:59:34.280 | As opposed to just doing one-hot? So we could easily, as you point out, have
00:59:41.600 | Instead of passing in these four numbers. We could instead have passed in seven numbers
00:59:47.640 | all zeros, but one of them is a one and that also is a list of floats and
00:59:53.280 | That would totally work
00:59:56.800 | and that's how
00:59:58.960 | Generally speaking categorical variables have been used in statistics for many years. It's called dummy variable coding
01:00:06.440 | The problem is that in that case?
01:00:10.520 | the concept of Sunday
01:00:12.520 | Could only ever be associated with a single floating-point number
01:00:16.840 | Right, and so it basically gets this kind of linear behavior. It says like Sunday is more or less of a single thing
01:00:25.960 | Yeah, well, it's not just interactions. It's saying like now Sunday is a concept in four-dimensional space
01:00:32.200 | Right. And so what we tend to find happen is that these
01:00:37.440 | Embedding vectors tend to get these kind of rich semantic concepts. So for example
01:00:43.760 | if it turns out that
01:00:46.480 | Weekends
01:00:49.560 | Kind of have a different behavior
01:00:51.080 | You'll tend to see that Saturday and Sunday will have like some particular number higher or more likely
01:00:57.320 | it turns out that certain days of the week are associated with higher sales of
01:01:07.000 | Certain kinds of goods that you kind of can't go without I don't know like gas or milk say
01:01:12.640 | Whereas there might be other products
01:01:17.240 | like wine, for example
01:01:19.240 | Like wine that tend to be associated with like the days before weekends or holidays, right? So there might be kind of a column
01:01:29.160 | which is like
01:01:31.440 | To what extent is this day of the week?
01:01:35.200 | Kind of associated with people going out
01:01:37.800 | You know, so basically yeah by by having this higher dimensionality vector rather than just a single number
01:01:45.280 | It gives the deep learning
01:01:47.560 | Network a chance to learn these rich
01:01:50.480 | Representations and so this idea of an embedding is actually what's called a distributed representation
01:01:58.600 | It's kind of the most fundamental concept of neural networks
01:02:02.440 | It's this idea that a concept in a neural network has a kind of a high dimensional
01:02:08.960 | Representation and often it can be hard to interpret because the idea is like each of these
01:02:14.720 | Numbers in this vector doesn't even have to have just one meaning
01:02:18.480 | You know
01:02:19.200 | It could mean one thing if this is low and that one's high and something else if that one's high and that one's low
01:02:23.640 | Because it's going through this kind of rich nonlinear
01:02:26.880 | Function right and so it's this
01:02:30.920 | It's this rich representation that allows it to learn such interesting
01:02:37.080 | Relationships
01:02:40.520 | Oh, another question. Sure. I'll speak louder. So
01:02:46.200 | I get the fundamentals of embeddings, like the word2vec vector algebra
01:02:55.040 | You can run on this, but are embeddings suitable for certain types of variables?
01:03:00.640 | Like, are these only suitable for some?
01:03:03.800 | Are there different categories that the embeddings are suitable for? An embedding is suitable for any categorical variable
01:03:11.120 | Okay, so so the only thing it it can't really work
01:03:16.120 | Well at all for would be something that is too high cardinality
01:03:19.880 | So like in other words, we had, whatever it was, 600,000 rows; if you had a variable with 600,000 levels
01:03:26.640 | That's just not a useful
01:03:30.360 | categorical variable you could bucketize it I guess
01:03:33.660 | But yeah in general like you can see here that the third place getters in this competition
01:03:39.560 | Really decided that everything that was not too high cardinality
01:03:45.880 | They put them all as categorical variables and I think that's a good rule of thumb
01:03:49.320 | You know if you can make it a categorical variable you may as well because that way it can learn this rich distributed representation
01:03:57.080 | Or else if you leave it as continuous, you know, the most it can do is to kind of try and find a
01:04:02.520 | You know a single functional form that fits it well
01:04:05.560 | Another question, so
01:04:09.080 | You were saying that you are kind of increasing the dimension
01:04:12.960 | But actually in most cases we would use a one-hot encoding which has an even bigger dimension
01:04:19.520 | Than that, so in a way you are also
01:04:23.240 | Reducing, but in a more rich way. I think that's fair. Yeah. Yeah, it's like
01:04:28.240 | Yes, you know you can think of it as one hot encoding which actually is high dimensional, but it's not
01:04:34.800 | Meaningfully high dimensional because everything except one is zero
01:04:38.200 | I'm saying that also because even this will reduce the amount of memory and things like this that you have to write
01:04:43.680 | This is better. You're absolutely right. Absolutely, right?
01:04:46.760 | And and so we may as well go ahead and actually describe like what's going on with the matrix algebra behind the scenes
01:04:52.920 | It this if this doesn't quite make sense you can kind of skip over it
01:04:56.600 | But for some people I know this really helps if we started out with something saying this is Sunday
01:05:03.280 | right
01:05:05.320 | we could represent this as a one hot encoded vector right and so
01:05:09.640 | Sunday, you know, maybe was positioned here. So that would be a one and then the rest of zeros
01:05:16.560 | Okay, and then we've got our
01:05:22.360 | Embedding matrix right with eight rows and in this case four columns
01:05:28.540 | One way to think of this actually is a matrix product
01:05:35.840 | Right, so I said you could think of this as like looking up the number one, you know and finding like its index in the array
01:05:44.820 | But if you think about it, that's actually
01:05:48.000 | identical to doing a matrix product between a one hot encoded vector and
01:05:53.080 | The embedding matrix like you're going to go zero times this row one times this row zero times this row
01:06:02.040 | And so it's like a one hot embedding matrix product is identical
01:06:06.720 | to doing a lookup and so
01:06:09.680 | Some people in the bad old days actually implemented embedding
01:06:16.200 | Matrices by doing a one hot encoding and then a matrix product and in fact a lot of like machine learning
01:06:22.200 | methods still kind of do that
01:06:24.560 | But as you were kind of alluding to, that's terribly inefficient. So all of the modern
01:06:31.660 | Libraries implement this as taken take an integer and do a lookup into an array
01:06:37.040 | But the nice thing about realizing that it's actually a matrix product
01:06:40.400 | Mathematically is it makes it more obvious?
01:06:43.320 | How the gradients are going to flow so when we do stochastic gradient descent, it's we can think of it as just another
01:06:50.060 | Linear layer. Okay, so as I say, that's like a somewhat minor detail, but hopefully for some of you it helps
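If it helps to see that equivalence concretely, here is a tiny check (a sketch, not part of the lesson notebook): multiplying a one-hot vector by the embedding weight matrix returns exactly the same row as indexing into it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

emb = nn.Embedding(8, 4)                            # 8 levels, 4-dimensional embedding

idx = torch.tensor([1])                             # say Sunday is level 1
one_hot = F.one_hot(idx, num_classes=8).float()     # shape (1, 8): all zeros except one

by_matmul = one_hot @ emb.weight                    # matrix product with the embedding matrix
by_lookup = emb(idx)                                # direct array lookup by integer

print(torch.allclose(by_matmul, by_lookup))         # True: the two are identical
```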
01:06:56.680 | Could you touch on using dates and times as categoricals how that affects seasonality? Yeah, absolutely. That's a great question
01:07:06.360 | Did I cover dates at all last week?
01:07:09.800 | No, okay
01:07:13.680 | So I covered dates in a lot of detail in the machine learning course, but it's worth briefly mentioning here
01:07:19.120 | There's a fast AI function called add date part
01:07:26.920 | Which takes a data frame and a column name
01:07:30.640 | That column name needs to be a date
01:07:33.920 | It removes unless you've got drop equals false
01:07:37.800 | It optionally removes the column from the data frame and replaces it with lots of columns
01:07:43.560 | representing all of the useful information about that date like
01:07:47.680 | Day of week, day of month, month of year, year, is it the start of a quarter,
01:07:52.600 | Is it the end of a quarter, basically everything that pandas
01:07:55.220 | gives us
01:07:57.480 | And so that way we end up
01:08:00.200 | When we look at our list of features where you can see them here, right?
01:08:05.840 | Year, month, week, day, day of week, etc. So these all get created for us by add_datepart
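Usage looks roughly like this (a sketch based on the fast.ai 0.7-era library used in this lesson; the import path and the toy data frame are assumptions):

```python
import pandas as pd
from fastai.structured import add_datepart   # fast.ai 0.7-era module (assumed path)

df = pd.DataFrame({'Date': pd.to_datetime(['2015-07-31', '2015-08-01'])})

# Replaces the 'Date' column (unless drop=False) with lots of derived columns:
# Year, Month, Week, Day, Dayofweek, Is_quarter_start, Is_quarter_end, Elapsed, ...
add_datepart(df, 'Date', drop=True)
print(df.columns.tolist())
```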
01:08:11.500 | so we end up with
01:08:14.680 | you know this
01:08:17.120 | Eight long embedding
01:08:20.720 | Matrix, so I guess eight rows by four column embedding matrix for day of week and
01:08:26.800 | Conceptually that allows our model to create some pretty interesting time series models
01:08:34.920 | Right like it can if there's something that has a
01:08:37.760 | seven-day period cycle
01:08:40.840 | That kind of goes up on Mondays and down on Wednesdays, but only for dairy and only in Berlin
01:08:47.040 | It can totally do that, but it has all the information it needs
01:08:51.020 | to do that
01:08:53.320 | So this turns out to be a really fantastic way to deal with time series
01:08:57.960 | So I'm really glad you asked the question you just need to make sure that
01:09:02.560 | That the the cycle indicator in your time series exists as a column
01:09:07.800 | So if you didn't have a column there called day of week
01:09:11.280 | it would be very very difficult for the neural network to somehow learn to do like a
01:09:17.060 | Divide mod 7 and then somehow look that up in an embedding matrix
01:09:20.960 | I guess not impossible, but really hard; it would use lots of computation and wouldn't do it very well
01:09:26.720 | So an example of the kind of thing that you need to think about might be
01:09:32.360 | Holidays for example, you know, or if you were doing something in in, you know of sales of
01:09:39.560 | Beverages in San Francisco
01:09:41.840 | You probably want a list of like when when are the when is the ball game on at AT&T Park?
01:09:47.120 | All right, because that's going to impact how many people are drinking beer in SoMa
01:09:51.660 | all right, so you need to make sure that the kind of the basic indicators or
01:09:57.120 | Periodicities or whatever are there in your data and as long as they are the neural nets going to learn to use them
01:10:03.200 | So I'm kind of trying to skip over some of the non-deep learning parts
01:10:08.320 | All right, so
01:10:13.320 | The key thing here is that we've got our model data that came from the data frame
01:10:17.560 | We tell it how big to make the embedding matrices
01:10:21.260 | We also have to tell it of the columns in that data frame
01:10:27.360 | How many of those?
01:10:29.360 | Categorical variables or how many of them are continuous variables. So the actual parameter is number of continuous variables
01:10:36.560 | So you can hear you can see we just pass in how many columns are there minus how many categorical variables are there?
01:10:43.000 | so then that way the
01:10:45.120 | The neural net knows how to create something that puts the continuous variables over here and the categorical variables over there
01:10:54.480 | The embedding matrix has its own dropout
01:10:57.680 | All right. So this is the dropout applied to the embedding matrix
01:11:01.280 | This is the number of activations in the first linear layer the number of activations in the second linear layer
01:11:07.280 | The dropout in the first linear layer the dropout for the second linear layer
01:11:11.840 | This bit we won't worry about for now and then finally is how many outputs do we want to create?
01:11:16.880 | Okay, so this is the output of the last linear layer and obviously it's one because we want to predict a single number
01:11:23.560 | Which is sales?
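Reading those arguments off, the call looks roughly like this (following the lesson notebook; the particular numbers should be treated as a reasonable starting point rather than the one true setting; `md`, `df`, `cat_vars` and `emb_szs` are names from earlier, and `y_range` stands for the bit we won't worry about for now and is assumed to be defined in the notebook):

```python
m = md.get_learner(
    emb_szs,                           # one (cardinality, width) pair per categorical variable
    len(df.columns) - len(cat_vars),   # how many of the columns are continuous variables
    0.04,                              # dropout applied to the embedding matrices
    1,                                 # output of the last linear layer: a single number, sales
    [1000, 500],                       # activations in the first and second linear layers
    [0.001, 0.01],                     # dropout for the first and second linear layers
    y_range=y_range,                   # the bit we won't worry about for now
)
```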
01:11:26.680 | So after that we now have a learner where we can call lr_find and we get the standard-looking shape and we can see
01:11:33.680 | what learning rate we want to use and
01:11:35.880 | we can then go ahead and
01:11:38.680 | Start training using exactly the same API. We've seen before
01:11:44.120 | So this is all identical
01:11:46.920 | You can pass in I'm not sure if you've seen this before
01:11:51.000 | Custom metrics what this does is it just says please print out a number at the end of every epoch by calling
01:11:57.560 | this function and this is a function we defined a little bit earlier, which was the
01:12:02.120 | Root mean squared percentage error, first of all taking e to the power of our
01:12:07.320 | Sales because our sales were originally logged. So this doesn't change the training at all
01:12:14.800 | It just it's just something to print out
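That metric looks roughly like this (a sketch of the exp_rmspe function from the lesson notebook): it takes e to the power of the predictions and the targets, because the sales were logged before training, and then computes root mean squared percentage error. It is only printed at the end of each epoch and does not change the training.

```python
import math
import numpy as np

def exp_rmspe(y_pred, targ):
    """Root mean squared percentage error on the un-logged sales."""
    targ, y_pred = np.exp(targ), np.exp(y_pred)   # undo the log taken on sales
    pct_var = (targ - y_pred) / targ
    return math.sqrt((pct_var ** 2).mean())

# Passed to fit so it gets printed each epoch, e.g. m.fit(lr, 3, metrics=[exp_rmspe])
```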
01:12:16.840 | So we train that for a while
01:12:20.920 | And you know, we've got some benefits that the original people that built this don't have specifically we've got things like
01:12:29.280 | Cyclical not cyclic learning rate stochastic gradient descent with restarts. And so it's actually interesting to have a look and compare
01:12:37.400 | Although our validation set isn't identical to the test set it's very similar
01:12:45.720 | It's a two-week period that is at the end of the training data
01:12:49.880 | so our numbers should be similar and if we look at what we get 0.097 and compare that to the
01:12:57.360 | Leaderboard public leaderboard
01:13:00.880 | You can see we're kind of
01:13:07.520 | Let's have a look in the top actually that's interesting
01:13:13.960 | There's a big difference between the public and private leaderboards. It would have
01:13:19.960 | Been right at the top of the private leaderboard
01:13:22.280 | But only in the top 30 or 40 on the public leaderboard. So not quite sure but you can see like we're certainly in
01:13:28.400 | the top end of this competition I
01:13:33.200 | actually tried running the third place getters code and
01:13:38.120 | Their final result was over 0.1. So I actually think that we should be compared to the private leaderboard
01:13:48.840 | So anyway, so you can see there basically there's a technique for dealing with time series and
01:13:55.600 | Structured data and you know, interestingly the group that that used this technique. They actually wrote a paper about it. That's linked in this notebook
01:14:04.640 | When you compare it to the folks that won this competition and came second
01:14:11.560 | They did the other folks did way more feature engineering like the winners of this competition were actually
01:14:19.000 | subject matter experts in logistics sales forecasting and so they had their own like code to create lots and lots of features and
01:14:27.400 | Talking to the folks at Pinterest who built their very similar model for recommendations of Pinterest
01:14:33.400 | They said the same thing which is that when they switched from gradient boosting machines to deep learning
01:14:38.880 | They did like way way way less
01:14:41.760 | Feature engineering it was a much much simpler model and requires much less maintenance
01:14:48.440 | And so this is like one of the big benefits of using this approach to deep learning. We can get state-of-the-art results
01:14:54.400 | But with a lot less work
01:15:00.080 | Are we using any time series in any of these fits
01:15:06.960 | indirectly
01:15:10.280 | Absolutely using what we just saw we have day of week month of year all that stuff columns
01:15:17.200 | And most of them are being treated as categories. So we're building a distributed representation of January
01:15:23.040 | We're building a distributed representation of Sunday. We're building a distributed representation of Christmas. So we're not using any
01:15:30.720 | Classic time series techniques all we're doing is
01:15:35.760 | two fully connected layers in a neural net
01:15:40.360 | Embedded matrix, that's what
01:15:42.960 | Exactly. Exactly. Yes. So the embedding matrix is able to deal with this stuff like
01:15:48.400 | Day of week periodicity and so forth in a
01:15:52.480 | Richer way than any
01:15:55.800 | Standard time series technique I've ever come across
01:15:58.400 | one last question
01:16:01.280 | The metrics, in the earlier models when we did the CNN, we did not pass them during the fit
01:16:08.480 | We passed it when the data was
01:16:10.640 | When we got the data, so we're not passing
01:16:15.120 | Anything to fit just the learning rate and the number of cycles
01:16:19.000 | In this case we're passing in metrics because we want to print out some extra stuff
01:16:22.800 | There is a difference in that we're calling data dot get learner. So with
01:16:29.480 | The imaging approach
01:16:34.600 | We just go learner dot pretrained and pass it the data
01:16:40.680 | In for these kinds of models in fact for a lot of the models the model that we build
01:16:46.680 | Depends on the data in this case. We actually need to know like
01:16:50.760 | What embedding matrices do we have?
01:16:53.400 | And stuff like that. So in this case, it's actually the data object that creates the learner
01:16:59.200 | So yeah, it is it is a bit upside down to what we've seen before
01:17:04.440 | So just to summarize or maybe I'm confused
01:17:09.920 | So in this case what we are doing is that we have some kind of a structured data
01:17:16.400 | We did feature engineering
01:17:18.400 | We got some columns in a database or some things in a pandas data frame
01:17:25.320 | Yeah data frame and then we are mapping it to deep learning by using this
01:17:33.060 | Embedding matrix for the categorical variables. So the continuous we just put them straight in
01:17:38.580 | So all I need to do is like if I have a if I have already have a feature engineering model
01:17:46.100 | Yeah, then to map it to deep learning. I just have to figure out which one I can move in to categorical and then
01:17:52.560 | Yeah, great question. So yes, exactly if you want to use this on your own data set
01:17:59.900 | Step one is list the categorical variable names list the continuous variable names
01:18:05.040 | Put it in a data frame pandas data frame
01:18:08.920 | Step two is to
01:18:12.580 | Create a list of which row indexes you want in your validation set
01:18:18.700 | step three
01:18:21.980 | Is to call this line of code using this exact like these exact you can just copy and paste it
01:18:29.600 | step four is to create your list of how big you want each embedding matrix to be and
01:18:35.760 | Then step five is to call get learner
01:18:39.560 | You can use these exact parameters to start with
01:18:42.880 | And if it over fits or under fits you can fiddle with them and then the final step is to call
01:18:49.320 | Fit so yeah, almost all of this code will be nearly identical
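Summarizing those steps as a rough code skeleton (fast.ai 0.7-style, following the lesson notebook; the column names, validation split and hyperparameter values are placeholders for your own data, and the exact argument list of from_data_frame may differ slightly between library versions):

```python
# 1. List the categorical and continuous variable names; put the data in a pandas DataFrame df.
cat_vars  = ['Store', 'DayOfWeek', 'Year']           # hypothetical column names
cont_vars = ['CompetitionDistance', 'Temperature']   # hypothetical column names

# 2. Choose which row indexes form the validation set (e.g. the last couple of weeks).
val_idx = list(range(len(df) - 5000, len(df)))

# 3. Build the model data object from the data frame (y is the dependent variable array).
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, y, cat_flds=cat_vars, bs=128)

# 4. Decide how big you want each embedding matrix to be.
cat_sz  = [(c, len(df[c].cat.categories) + 1) for c in cat_vars]
emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]

# 5. Create the learner; start with parameters like these, fiddle if it over- or under-fits.
m = md.get_learner(emb_szs, len(cont_vars), 0.04, 1, [1000, 500], [0.001, 0.01])

# 6. Fit.
m.fit(1e-3, 3, metrics=[exp_rmspe])
```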
01:18:56.140 | Have a couple of questions one is
01:19:02.540 | How is data augmentation can be used in this case and the second one is?
01:19:09.340 | Why what are dropouts doing in here? Okay, so data augmentation I have no idea. I mean, that's a really interesting question. I
01:19:21.300 | Think it's got to be domain specific. I've never seen any paper or anybody in industry doing data augmentation with structured data and deep learning
01:19:28.220 | So I don't I think it can be done. I just haven't seen it done
01:19:32.060 | What is dropout doing?
01:19:35.140 | Exactly the same as before so at each point
01:19:39.460 | we have
01:19:42.380 | The output of each of these linear layers is just a
01:19:49.380 | Rank one tensor and so dropout is going to go ahead and say let's throw away half of the activations
01:19:56.420 | and the very first dropout embedding dropout literally goes through the embedding matrix and says
01:20:03.460 | Let's throw away half the activations
01:20:07.980 | That's it
01:20:11.860 | Okay, let's take a break and let's come back at a 5 past 8
01:20:16.980 | Okay, thanks everybody
01:20:18.980 | So now
01:20:29.940 | We're going to move into something
01:20:32.660 | Equally exciting actually before I do I just mention that I had a good question during the break which was
01:20:40.260 | What's the downside like?
01:20:44.340 | Like look almost no one's using this
01:20:46.860 | Why not
01:20:50.580 | And and basically I think the answer is like as we discussed before
01:20:54.660 | No one in academia almost is working on this because it's not something that people really publish on
01:21:00.260 | And as a result there haven't been really great examples where people could look at and say oh, here's a technique that works
01:21:08.660 | Well, so let's have our company implemented
01:21:12.020 | But perhaps equally importantly
01:21:14.020 | Until now with this fast AI library. There hasn't been any
01:21:18.500 | Way to to do it conveniently if you wanted to implement one of these models
01:21:24.380 | You had to write all the custom code
01:21:27.100 | Yourself or else now as we discussed. It's you know six
01:21:31.900 | It's basically a six step process, you know involving about you know, not much more than six lines of code
01:21:41.340 | So the reason I mentioned this is to say like I think there are a lot of big
01:21:46.420 | commercial and scientific
01:21:49.100 | opportunities to use this to solve problems that previously haven't been solved very well before
01:21:55.860 | So like I'll be really interested to hear if some of you
01:22:00.220 | Try this out, you know, maybe on like
01:22:03.780 | Old Kaggle competitions you might find like oh I would have won this if I'd use this technique
01:22:09.420 | That would be interesting or if you've got some data set you work with at work
01:22:13.900 | You know some kind of predictive model that you've been doing with a GBM or a random forest. Does this help?
01:22:18.700 | You know the thing I I'm still somewhat new to this I've been doing this for
01:22:26.220 | Basically since the start of the year was when I started working on these structured deep learning models
01:22:31.860 | So I haven't had enough opportunity to know
01:22:35.540 | Where might it fail? It's worked for nearly everything. I've tried it with so far
01:22:39.480 | But yeah, I think this class is the first time that
01:22:44.700 | There's going to be like more than half a dozen people in the world who actually are working on this
01:22:50.220 | So I think you know as a group we're going to hopefully learn a lot and build some interesting things
01:22:55.120 | and this would be a great thing if you're thinking of writing a post about something or here's an area that
01:23:01.420 | There's a couple of that. There's a post from Instacart about what they did
01:23:05.260 | Pinterest has a
01:23:08.340 | O'Reilly AI video about what they did, that's about it, and there's two academic papers
01:23:13.860 | Both about Kaggle competition victories, one from Yoshua Bengio and his group; they won a taxi
01:23:23.300 | Destination forecasting competition and then also the one linked
01:23:28.300 | for this Rossman competition
01:23:32.540 | Yeah, there's some background on that all right
01:23:34.540 | so language
01:23:37.380 | natural language processing
01:23:39.900 | is the area which
01:23:42.900 | Is kind of like the most up-and-coming area of deep learning. It's kind of like two or three years behind
01:23:49.820 | Computer vision in deep learning it was kind of like the the second area that deep learning started getting really popular in and
01:23:59.340 | You know computer vision
01:24:01.340 | Got to the point where it was like the clear state-of-the-art
01:24:04.700 | For most computer vision things maybe in like 2014, you know and in some things in like 2012
01:24:11.380 | In NLP, we're still at the point where
01:24:14.740 | For a lot of things deep learning is now the state of the art, but not quite everything
01:24:19.580 | but as you'll see the state of kind of
01:24:23.820 | The software and some of the concepts is much less mature than it is for computer vision
01:24:30.340 | So in general none of the stuff we talk about after computer vision is going to be as like
01:24:36.620 | Settled as the computer vision stuff was so NLP
01:24:40.980 | One of the interesting things is in the last few months
01:24:43.980 | Some of the good ideas from computer vision have started to spread into NLP for the first time and we've seen some really big
01:24:51.180 | Advances so a lot of the stuff you'll see in NLP is is pretty new
01:24:54.920 | So I'm going to start with a particular
01:24:58.900 | Kind of NLP problem and one of the things you'll find in NLP
01:25:03.780 | It's like there are particular problems you can solve and they have particular names
01:25:07.580 | and so there's a particular kind of problem in NLP called language modeling and
01:25:12.020 | Language modeling has a very specific definition. It means build a model where given a
01:25:18.740 | Few words of a sentence. Can you predict what the next word is going to be?
01:25:23.140 | So if you're using your mobile phone and you're typing away and you press space and then it says like this is what the next
01:25:30.700 | Word might be like SwiftKey does this like really well and SwiftKey actually uses deep learning for this
01:25:36.620 | That's that's a language model. Okay, so it has a very specific meaning when we say language modeling
01:25:42.980 | We mean a model that can predict the next word of a sentence
01:25:47.980 | So let me give you an example. I
01:25:49.980 | downloaded
01:25:51.820 | about 18 months worth of
01:25:53.820 | Papers from archive. So for those of you that don't know it archive is
01:25:59.100 | The most popular pre-print server in this community and various others
01:26:05.060 | And has you know, lots of academic papers
01:26:08.220 | and so I grabbed the
01:26:12.820 | Abstracts and the topics for each and so here's an example. So the category of this particular paper was compute a
01:26:19.660 | cs.NI, which is computer science networking, and
01:26:22.140 | Then the summary, that is the abstract of the paper,
01:26:25.180 | Says the exploitation of mm-wave bands is one of the key enablers for 5G mobile, blah blah blah. Okay, so here's like an
01:26:32.800 | example
01:26:35.140 | piece of text from my language model
01:26:39.420 | So I trained a language model on this archive data set that I downloaded and then I built a simple little test
01:26:45.940 | which basically
01:26:48.260 | You would pass it some like priming text
01:26:52.140 | So you'd say like oh imagine you started reading a document that said
01:26:55.460 | Category is computer science networking and the summary is algorithms that and then I said, please write
01:27:03.100 | An archive abstract so it said that if it's networking
01:27:08.900 | algorithms that
01:27:10.220 | Use the same network as a single node are not able to achieve the same performance as a traditional network based routing algorithms in this
01:27:16.860 | Paper we propose a novel routing scheme, but okay
01:27:19.700 | So it it's learned by reading archive papers that somebody who was saying algorithms that
01:27:26.500 | Where the words 'cat cs.NI' came before it, is going to talk like this, and remember it started out not knowing English at all
01:27:35.740 | Right, it actually started out with an embedding matrix for every word in English that was random
01:27:42.180 | Okay, and by reading lots of archive papers, it learnt what kind of words followed others
01:27:47.700 | So then I tried what if we said cat computer science computer vision?
01:27:52.220 | summary
01:27:54.300 | algorithms that
01:27:55.820 | Use the same data to perform image classification are increasingly being used to improve the performance of image classification
01:28:03.100 | Algorithms and this paper we propose a novel method for image classification using a deep convolutional neural network parentheses CNN
01:28:10.020 | So you can see like it's kind of like almost the same sentence as back here
01:28:15.940 | But things have just changed into this world of computer vision rather than networking
01:28:21.060 | So I tried something else which is like, okay
01:28:23.700 | Category computer vision and I created the world's shortest ever abstract algorithms
01:28:29.260 | And then I said title on and the title of this is going to be on the performance of deep learning for image classification
01:28:36.980 | EOS is end of string. So that's like end of title
01:28:40.740 | What if it was networking summary algorithms title on the performance of wireless networks as opposed to?
01:28:48.420 | Towards computer vision towards a new approach to image classification
01:28:52.900 | Networking towards a new approach to the analysis of wireless networks
01:28:58.340 | So like I find this mind-blowing right? I started out with some random matrices
01:29:04.020 | Right, like literally no
01:29:07.260 | Pre-trained anything. I fed it 18 months worth of archive articles and it learnt not only
01:29:14.380 | How to write English pretty well
01:29:17.420 | but also after you say something's a convolutional neural network, you should then use parentheses to say what it's called and
01:29:24.900 | furthermore that the kinds of things people talk and say create algorithms for in computer vision are
01:29:30.940 | performing image classification and in networking are
01:29:34.220 | Achieving the same performance as traditional network-based routing algorithms. So like a language model is
01:29:42.500 | Can be like incredibly deep and subtle
01:29:47.420 | Right, and so we're going to try and build that
01:29:50.480 | But actually not because we care about this at all
01:29:54.540 | We're going to build it because we're going to try and create a pre-trained model
01:29:58.340 | what we're actually going to try and do is take IMDB movie reviews and
01:30:02.960 | Figure out whether they're positive or negative
01:30:06.060 | So if you think about it, this is a lot like cats versus dogs. It's a classification algorithm, but rather than an image
01:30:13.160 | We're going to have the text of a review
01:30:15.620 | So I'd really like to use a pre-trained network
01:30:19.860 | like I would at least like a net to start with a network that knows how to read English, right and so
01:30:27.380 | My view was like okay that to know how to read English means you should be able to like predict the next word of a sentence
01:30:34.740 | so what if we pre-train a language model and
01:30:38.700 | Then use that pre-trained language model and then just like in computer vision
01:30:43.580 | Stick some new layers on the end and ask it instead of to predicting the next word in the sentence
01:30:49.340 | Instead predict whether something is positive or negative
01:30:52.520 | So when I started working on this, this was actually a new idea
01:30:57.860 | Unfortunately in the last couple of months I've been doing it
01:31:01.300 | You know a few people have actually couple people have started publishing this and so this has moved from being a totally new idea to being
01:31:07.660 | a you know somewhat new idea
01:31:12.420 | so this idea of
01:31:14.780 | Creating a language model making that the pre-trained model for a classification model is what we're going to learn to do now
01:31:22.380 | And so the idea is we're really kind of trying to leverage exactly what we learned in our computer vision work
01:31:28.420 | Which is how do we do fine-tuning to create powerful classification models? Yes, you know
01:31:33.820 | So why don't you think that doing just directly what you want to do?
01:31:40.820 | Doesn't work better
01:31:43.660 | Well a because it doesn't just turns out it doesn't empirically
01:31:48.300 | And the reason it doesn't is a number of things
01:31:52.460 | first of all
01:31:55.180 | as we know
01:31:56.780 | Fine-tuning a pre-trained network is really powerful
01:31:59.500 | Right. So if we can get it to learn some related tasks first, then we can use all that information
01:32:06.900 | To try and help it on the second task
01:32:12.380 | the other reason is
01:32:14.380 | IMDB movie reviews
01:32:17.140 | You know up to a thousand words long
01:32:19.300 | They're pretty big and so after reading a thousand words knowing nothing about
01:32:24.220 | How English is structured or even what the concept of the word is?
01:32:28.340 | or punctuation or whatever
01:32:31.100 | at the end of this thousand
01:32:33.340 | Integers, you know, they end up as ints. All you get is a one or a zero
01:32:38.340 | Positive or negative and so trying to like learn the entire structure of English and then how it expresses positive and negative
01:32:44.540 | Sentiments from a single number is just too much to expect
01:32:48.260 | So by building a language model first we can try to build a neural network that kind of understands
01:32:54.900 | The English of movie reviews and then we hope that some of the things it's learnt about
01:33:01.100 | Are going to be useful in deciding whether something's a positive or a negative
01:33:05.060 | That's a great question
01:33:08.020 | Thanks. Is this similar to the char-RNN by Karpathy?
01:33:15.780 | Yeah, this is somewhat similar to char-RNN by Karpathy. So the famous char, as in C-H-A-R, RNN
01:33:23.660 | Try to predict the next letter given a number of previous letters
01:33:29.100 | Language models generally work at a word level. They don't have to
01:33:33.460 | and doing things at a word level turns out to be a
01:33:37.940 | Can be quite a bit more powerful and we're going to focus on word level modeling in this course
01:33:42.980 | To what extent are these generated words?
01:33:47.380 | Actual copies of what it's found in the in the training data set or are these completely
01:33:54.100 | Random things that it actually learned and how do we know how to distinguish between those two? Yeah, I mean these are all good questions
01:34:02.340 | The words are definitely words we've seen before the work because it's not at a character level
01:34:06.660 | So it can only give us the word it's seen before the sentences
01:34:10.060 | There's a number of kind of rigorous ways of doing it
01:34:14.380 | But I think the easiest is to get a sense of like well here are two like different categories
01:34:19.780 | Where it's kind of created very similar concepts, but mixing them up in just the right way like it would be very hard
01:34:27.660 | To to do what we've seen here just by like spitting back things. It's seen before
01:34:34.220 | But you could of course actually go back and check. You know have you seen that sentence before or like a string distance
01:34:40.780 | Have you seen a similar sentence before?
01:34:42.780 | in this case
01:34:44.820 | And of course another way to do it is the length most importantly when we train the language model as we'll see
01:34:51.080 | We'll have a validation set and so we're trying to predict the next word
01:34:54.540 | Of something that's never seen before and so if it's good at doing that. It should be good at generating text in this case the purpose
01:35:03.380 | The purpose is not to generate text
01:35:05.380 | That was just a fun example and so I'm not really going to study that too much
01:35:09.340 | But you know you during the week totally can like you can totally build
01:35:14.620 | The or you know greater American novel generator or whatever
01:35:18.940 | there are actually some tricks to
01:35:21.740 | To using language models to generate text that I'm not using here. They're pretty simple
01:35:27.940 | We can talk about them on the forum if you like, but my focus is actually on classification
01:35:33.180 | So I think that's the thing which is
01:35:35.500 | Incredibly powerful like text classification I
01:35:40.880 | Don't know you're a hedge fund
01:35:43.340 | You want to like read every article as soon as it comes out through Reuters or Twitter or whatever and immediately
01:35:50.220 | Identify things which in the past have caused you know massive market drops. That's a classification model or you want to
01:36:00.740 | Recognize all of the customer service
01:36:02.740 | queries which tend to be associated with people who
01:36:06.940 | Who leave your you know who cancel their contracts in the next month?
01:36:12.500 | That's a classification problem, so like it's a really powerful kind of thing for
01:36:17.740 | data journalism
01:36:20.540 | Activism
01:36:24.260 | commerce so forth right like
01:36:27.500 | I'm trying to classify documents into whether they're part of legal discovery or not part of legal discovery
01:36:32.680 | Okay, so you get the idea?
01:36:38.260 | In terms of stuff. We're importing we're importing a few new things here
01:36:41.820 | one of the bunch of things we're importing is
01:36:45.180 | Torchtext. Torchtext is PyTorch's NLP
01:36:52.420 | Library, and so fast AI is designed to work hand-in-hand with torchtext as you'll see, and then there's a few
01:36:59.180 | Text-specific sub-parts of fast AI that we'll be using
01:37:04.200 | So we're going to be working with the IMDB large movie review data set. It's very very well studied in academia
01:37:12.740 | you know
01:37:15.420 | Lots and lots of people over the years have
01:37:17.660 | Studied this data set
01:37:21.180 | 50,000 reviews highly polarized reviews either positive or negative each one has been
01:37:26.980 | classified by sentiment
01:37:29.860 | Okay, so we're going to try our first of all however to create a language model
01:37:33.540 | So we're going to ignore the sentiment entirely right so just like the dogs and cats
01:37:37.580 | Pre-train the model to do one thing and then fine-tune it to do something else
01:37:41.300 | Because this kind of idea in NLP is is so so so new
01:37:47.980 | There's basically no models you can download for this so we're going to have to create our own
01:37:52.940 | right, so
01:37:55.620 | Having downloaded the data you can use the link here. We do the usual stuff saying the path to it training and validation path
01:38:03.220 | And as you can see it looks pretty pretty traditional compared to vision. There's a directory of training
01:38:10.120 | There's a directory of test we don't actually have separate test and validation in this case
01:38:15.940 | And just like in in vision the training directory has a bunch of files in it
01:38:22.440 | In this case not representing images, but representing movie reviews
01:38:26.940 | So we could cat one of those files and here we learn about the classic zombie Geddon movie
01:38:36.460 | I have to say with a name like zombie Geddon and an atom bomb on the front cover
01:38:42.120 | I was expecting a flat-out chop socky funku
01:38:45.040 | Rented if you want to get stoned on a Friday night and laugh with your buddies
01:38:51.780 | Don't rent it if you're an uptight weenie or want a zombie movie with lots of fresh eating
01:38:55.560 | I think I'm going to enjoy zombie Geddon so all right, so we've learned something today
01:39:00.360 | All right, so we can just use standard unique stuff to see like how many words are in the data set so the training set we've got
01:39:09.360 | 17 and a half million words
01:39:13.400 | Test set we've got five point six million words
01:39:16.300 | So here's
01:39:20.260 | This is IMDB, so IMDB is, yeah, random people; this is not a New York Times listed reviewer as far as I know
01:39:30.060 | Okay, so
01:39:35.580 | Before we can do anything with text we have to turn it into a list of tokens
01:39:41.580 | A token is basically like a word right so we're going to try and turn this eventually into a list of numbers
01:39:47.180 | So the first step is to turn it into a list of words
01:39:49.580 | That's called tokenization in NLP NLP has a huge lot of jargon that we'll we'll learn over time
01:39:56.180 | One thing that's a bit tricky though when we're doing tokenization is here
01:40:02.740 | I've tokenized that review and then joined it back up with spaces and you'll see here that wasn't
01:40:09.220 | Has become two tokens which makes perfect sense right wasn't is two things, right?
01:40:16.340 | Dot dot dot has become one token
01:40:20.500 | Right, where else lots of exclamation marks has become lots of tokens. So like a good tokenizer
01:40:26.960 | will do a good job of recognizing like
01:40:30.260 | Pieces of an English sentence each separate piece of punctuation will be separated
01:40:36.740 | And each part of a multi-part word will be separated as appropriate. So
01:40:42.500 | spaCy is, I think it's an Australian-developed piece of software actually, that does lots of NLP stuff
01:40:49.260 | It's got the best tokenizer I know, and so
01:40:52.220 | Fast AI is designed to work well with the spaCy tokenizer, as is torchtext. So here's an example of
01:40:59.100 | Tokenization, right so what we do with torch text is we basically have to start out by creating
01:41:06.700 | Something called a field and a field is a definition of how to pre-process some text
01:41:12.620 | And so here's an example of the definition of a field. It says I want to lowercase
01:41:17.360 | The text and I want to tokenize it with the spaCy tokenize function
01:41:23.160 | Okay, so it hasn't done anything yet. We're just telling her when we do do something
01:41:28.100 | This is what to do. And so that we're going to store that
01:41:30.620 | description of what to do in a thing called
01:41:33.420 | capital text
01:41:35.580 | And so none of this is fast AI specific at all
01:41:39.900 | This is part of torch text. You can go to the torch text website read the docs. There's not lots of docs yet
01:41:45.340 | This is all very very new
01:41:48.300 | Probably the best information you'll find about it is in this lesson, but there's some more information on this site
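Concretely, the field definition looks roughly like this (the lesson uses a spaCy-based tokenizer helper from the fast.ai library; here it is written out inline, the spaCy model name is an assumption, and this is the 2018-era torchtext API):

```python
import spacy
from torchtext import data            # the torchtext API as it existed at the time of this lesson

spacy_en = spacy.load('en_core_web_sm')   # assumed model name

def spacy_tok(text):
    """Tokenize a string into a list of word and punctuation tokens with spaCy."""
    return [tok.text for tok in spacy_en.tokenizer(text)]

# A Field is a recipe for how to pre-process text: lowercase it, then tokenize with spaCy.
# Nothing is processed yet; TEXT just stores the description of what to do.
TEXT = data.Field(lower=True, tokenize=spacy_tok)
```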
01:41:54.260 | Alright, so what we can now do is go ahead and create the usual fast AI model data object
01:42:03.060 | Okay, and so to create the model data object. We have to provide a few bits of information
01:42:07.820 | We have to say what's the training set?
01:42:10.260 | So the path to the text files the validation set and the test set in this case just to keep things simple
01:42:17.660 | I don't have a separate validation in test set so I'm going to pass in the validation set for both of those two things
01:42:23.620 | Right. So now we can create our model data object as per usual. The first thing we give it is the path
01:42:31.060 | The second thing we give it is the torch text field definition of how to pre-process that text
01:42:36.940 | The third thing we give it is the dictionary or the list of all of the files we have train validation test
01:42:44.540 | As per usual we can pass in a batch size and then we've got a special special couple of extra things here
01:42:51.900 | One is a very commonly used in NLP minimum frequency. What this says is
01:43:00.020 | In a moment, we're going to be replacing every one of these words with an integer
01:43:04.980 | Which basically will be a unique index for every word and this basically says if there are any words that occur less than 10 times
01:43:13.340 | Just call it unknown
01:43:16.220 | Right don't think of it as a word, but we'll see that in more detail in a moment
01:43:20.740 | And then we're going to see this in more detail as well BP TT stands for back prop through time
01:43:27.580 | And this is where we define how long a sentence will we?
01:43:32.060 | Stick on the GPU at once. So we're going to break them up in this case. We're going to break them up into sentences of
01:43:38.820 | 70 tokens or less on the whole so we're going to see all this in a moment
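Put together, the model data construction being described looks roughly like this (a sketch; depending on the fast.ai 0.7 version the constructor is spelled either LanguageModelData(...) or LanguageModelData.from_text_files(...), and PATH and TEXT are the path and field defined above):

```python
bs, bptt = 64, 70

# Folders of text files for train / validation / test; validation doubles as test here.
FILES = dict(train='train', validation='test', test='test')

# min_freq=10: any word occurring fewer than 10 times becomes the unknown token.
# bptt=70: roughly how many tokens of each sequence go onto the GPU at once.
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES,
                                       bs=bs, bptt=bptt, min_freq=10)
```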
01:43:44.860 | All right. So after building our model data object, right what it actually does is it's going to fill this text field
01:43:54.700 | With an additional attribute called vocab and this is a really important NLP concept
01:44:01.020 | I'm sorry. There's so many NLP concepts. We just have to throw at you kind of quickly, but we'll see them a few times
01:44:05.980 | right a
01:44:08.100 | Vocab is the vocabulary and the vocabulary in NLP has a very specific meaning it is
01:44:14.100 | What is the list of unique words that appeared in this text?
01:44:17.160 | So every one of them is going to get a unique index. So let's take a look right here is text
01:44:24.540 | Vocab dot itos, this stands for, and this is all torchtext not fast AI,
01:44:29.340 | TEXT's vocab, int to string.
01:44:32.300 | It maps the integer zero to unknown, the integer one to padding, int two to 'the', then comma, dot, 'and',
01:44:41.500 | 'of', 'to' and so forth. All right, so this is the first 12
01:44:45.540 | elements of the array
01:44:50.220 | Of the vocab from the IMDB movie review and it's been sorted by frequency
01:44:55.820 | Except for the first two special ones. So for example, we can then go backwards: stoi, string to int.
01:45:02.900 | Here is 'the', it's in position 0, 1, 2, so string to int of 'the' is 2
01:45:09.460 | So the vocab lets us take a word and map it to an integer or take an integer and map it to a word
01:45:19.060 | Right. And so that means that we can then take
01:45:22.060 | the first 12 tokens for example of our text and turn them into
01:45:28.380 | 12 it's so for example here is of the you can see 7 2 and
01:45:35.940 | Here you can see 7 2
01:45:38.500 | Right. So we're going to be working in this form. Did you have a question? Yeah, could you pass that back there?
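In code, that mapping back and forth looks like this (the integers in the comments are examples from the lesson):

```python
# int -> string: the first few entries of the vocab, sorted by frequency
# after the special unknown and padding tokens.
print(TEXT.vocab.itos[:12])     # e.g. ['<unk>', '<pad>', 'the', ',', '.', 'and', ...]

# string -> int: look up the integer index for a particular word.
print(TEXT.vocab.stoi['the'])   # e.g. 2

# Numericalize a few tokens into their integer ids.
tokens = ['of', 'the']
print([TEXT.vocab.stoi[t] for t in tokens])   # e.g. [7, 2]
```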
01:45:47.940 | Is it common to do any stemming or lemmatizing?
01:45:50.860 | Not really. No
01:45:53.900 | Generally tokenization is is what we want like with a language model
01:45:57.800 | We you know to keep it as general as possible we want to know what's coming next and so like whether it's
01:46:04.700 | Future tense or past tense or plural or singular like we don't really know which things are going to be interesting in which aren't
01:46:15.420 | It seems that it's generally best to kind of leave it alone as much as possible
01:46:20.380 | Be the short answer
01:46:23.340 | You know having said that as I say, this is all pretty new
01:46:26.660 | So if there are some particular areas that some researcher maybe has already discovered that some other kinds of pre-processing are helpful
01:46:33.620 | You know, I wouldn't be surprised not to know about it
01:46:37.220 | So when you're dealing with
01:46:40.420 | You know natural language is in context important context is very important. So if you're if you're using
01:46:46.940 | Words no, no, we're not looking at words
01:46:51.780 | This is this look this is I just don't get some of the big premises of this like they're in order
01:46:57.740 | Yeah, so just because we replaced I with the number 12
01:47:02.700 | These are still in that order. Yeah
01:47:07.380 | There is a different way of dealing with natural language called a bag of words and bag of words
01:47:12.380 | You do throw away the order in the context and in the machine learning course
01:47:16.220 | We'll be learning about working with bag of words representations
01:47:18.940 | But my belief is that they are
01:47:21.740 | No longer useful, or on the verge of becoming no longer useful
01:47:26.540 | We're starting to learn how to use deep learning to use context properly now
01:47:32.620 | But it's kind of for the first time it's really like only in the last few months
01:47:37.140 | All right, so I mentioned that we've got two numbers batch size and BPT T back prop through time
01:47:45.420 | So this is kind of subtle
01:47:47.620 | So we've got some big long piece of text
01:47:58.940 | Okay, so we've got some big long piece of text, you know, here's our sentence. It's a bunch of words, right and
01:48:03.540 | Actually what happens in a language model is even though we have lots of movie reviews
01:48:10.460 | They actually all get concatenated together into one big block of text, right? So it's basically predict the next word
01:48:18.580 | In this huge long thing, which is all of the IMDb movie reviews concatenate together. So this thing is, you know
01:48:26.340 | What do we say? It was like tens of millions of words long and so what we do
01:48:32.060 | Is we split it up into batches?
01:48:36.020 | First right so these like are our spits into batches, right? And so if we said
01:48:42.420 | we want a batch size of
01:48:45.020 | 64 we actually break the whatever was 60 million words into the 64
01:48:51.620 | sections
01:48:53.700 | right, and then we take each one of the 64 sections and
01:48:59.060 | We move it
01:49:02.340 | Like underneath the previous one I didn't do a great job of that
01:49:09.140 | Right move it underneath
01:49:14.420 | So we end up with a matrix
01:49:18.320 | Which is
01:49:24.100 | Actually, I think we've moved them across wise so it's actually I think just transpose it we end up with a matrix. It's like 64
01:49:37.460 | columns
01:49:39.340 | Wide and the length let's say the original was 64 million right then the length is like
01:49:46.900 | 10 million
01:49:50.060 | Right. So each of these represents
01:49:52.740 | 1/64 of our entire IMDb review set
01:49:58.340 | And so that's our starting point
01:50:01.140 | so then what we do is
01:50:03.660 | We then grab a little chunk of this at a time and those chunk lengths are approximately equal to
01:50:11.500 | BP TT which I think we had equal to 70. So we basically grab a little
01:50:16.980 | 70 long
01:50:19.500 | section and
01:50:20.980 | That's the first thing we chuck into our GPU. That's a batch, right? So a batch is always of length of width
01:50:28.020 | 64 or batch size and each bit is a sequence of length up to 70
01:50:35.220 | So let me show you
01:50:37.260 | Right. So here if I go take my train data loader
01:50:42.060 | I don't know if you folks have tried playing with this yet
01:50:44.980 | But you can take any data loader wrap it with it up to turn it into an iterator and then call next on it to grab
01:50:51.660 | a batch of data just as if you were a neural net you get exactly what the neural net gets and you can see here we
01:50:58.940 | get back a
01:51:00.940 | 75 by 64
01:51:03.060 | Tensor right so it's 64 wide right and I said it's approximately
01:51:09.900 | 70 high and
01:51:13.300 | But not exactly
01:51:15.140 | And that's actually kind of interesting a really neat trick that torch text does is they randomly change
01:51:22.060 | The back prop through time number every time so each epoch it's getting slightly different
01:51:29.300 | bits of text
01:51:32.220 | This is kind of like in computer vision. We randomly shuffle the images
01:51:37.080 | We can't randomly shuffle the words right because we need to be in the right order
01:51:42.100 | So instead we randomly move their break points a little bit. Okay, so this is the equivalent
01:51:47.240 | so in other words this
01:51:50.340 | This here is of length 75 right there's a there's an ellipsis in the middle
01:52:00.420 | And that represents the first 75 words of the first review
01:52:05.700 | Right, whereas this 75 here
01:52:09.780 | Represents the first 75 words of this of the second of the 64 segments
01:52:15.060 | That's it have to go in like 10 million words to find that one right and so here's the first
01:52:20.780 | 75 words of the last of those 64 segments okay, and so then what we have
01:52:27.820 | down here is
01:52:30.940 | The next
01:52:34.540 | sequence, right, so 51, there's 51,
01:52:38.540 | 615, there's 615, 25, there's 25, right, and in this case
01:52:45.180 | it actually is of the same size
01:52:47.820 | It's also 75 by 64, but for minor technical reasons it's been flattened out
01:52:53.060 | into a single vector. Basically it's exactly the same as this matrix, but it's just moved down
01:53:01.980 | by one, because we're trying to predict the next word
01:53:05.740 | Right, so that all happens for us. This is fastai now: if you ask for a language model data
01:53:15.420 | object then it's going to create these batches of
01:53:18.820 | batch-size width by bptt height,
01:53:23.980 | bits of our language corpus, along with the same thing shifted along by one word
01:53:32.220 | Right and so we're always going to try and predict the next word
01:53:36.100 | So why don't you, instead of just arbitrarily choosing 64,
01:53:47.100 | why don't you choose, like, 64 is a large number,
01:53:52.900 | maybe do it by sentences and make it a large number and then pad it with zeros or something,
01:54:00.860 | so that you
01:54:02.340 | actually have one full sentence per line
01:54:05.460 | Basically, wouldn't that make more sense? Not really, because remember we're using columns, right, so each of our columns is of length about a million
01:54:13.140 | Right, so although it's true that those columns aren't always exactly finishing on a full stop, they're so damn long we don't care,
01:54:21.520 | because they're like a million words long
01:54:25.340 | Right, and so each column contains multiple sentences?
01:54:32.120 | Yeah, it's of length about a million
01:54:35.500 | and it contains many, many, many sentences
01:54:38.880 | Because remember the first thing we did was take the whole thing and split it into 64 groups
01:54:43.660 | Okay, great
01:54:50.620 | So, um, I found this, you know, pertaining to this question, this thing about
01:54:55.960 | what's in this language model matrix a little mind-bending for quite a while
01:55:01.780 | So don't worry if it takes a while and you have to ask a thousand questions on the forum. That's fine, right?
01:55:09.540 | Go back and listen to what I just said in this lecture again
01:55:12.420 | Go back to that bit where I showed you splitting it up into 64 and moving them around, and try it with some sentences in
01:55:17.600 | Excel or something, and see if you can do a better job of explaining it than I did
01:55:22.240 | Because this is basically how torchtext works
01:55:26.260 | And then what fastai adds on is this idea of how to build a language model out of it
01:55:33.460 | Although actually a lot of that is stolen from torchtext as well, so where torchtext ends and fastai starts,
01:55:39.700 | or vice versa, is a little
01:55:41.700 | subtle; they really work closely together, okay?
01:55:48.020 | Now that we have a model data object
01:55:50.980 | That can feed us
01:55:53.540 | Batches we can go ahead and create a model right and so in this case
01:55:59.300 | We're going to create an embedding matrix for our vocab
01:56:04.140 | We can see how big our vocab was
01:56:06.780 | Let's have a look back here, so we can see here in the model data object there are this many
01:56:18.020 | kind of pieces that we're going to go through, and that's basically equal to
01:56:22.540 | the total length of everything divided by batch size times
01:56:27.620 | bptt. And this one I wanted to show you: nt
01:56:31.020 | I've got the definition up here, number of unique tokens. nt is the number of tokens,
01:56:36.080 | that's the size of our vocab, so we've got thirty-four thousand, nine hundred and forty-five unique words
01:56:43.700 | And notice, to count as a unique word it had to appear at least ten times,
01:56:46.900 | okay, because otherwise it would have been replaced with the unknown token, unk
01:56:50.300 | The length of the data set is one because as far as a language model is concerned there's only one
01:57:00.860 | Thing which is the whole corpus all right, and then that thing has
01:57:06.500 | Here it is twenty point six million
01:57:09.420 | words
01:57:11.500 | right
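As a rough sketch, those numbers can be read straight off the model data object; the attribute names below are the fastai 0.7 ones as I recall them, so treat them as assumptions:

```python
# Assumes `md` is the LanguageModelData object from the notebook (fastai 0.7).
print(len(md.trn_dl))          # number of batches, roughly corpus length / (bs * bptt)
print(md.nt)                   # number of unique tokens (vocab size), 34,945 here
print(len(md.trn_ds))          # 1: the language model sees one big concatenated corpus
print(len(md.trn_ds[0].text))  # total token count, about 20.6 million here
```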
01:57:12.820 | So those thirty-four thousand, nine hundred and forty-five things are used to create an embedding matrix
01:57:18.820 | with a number of rows equal to
01:57:23.060 | thirty-four thousand,
01:57:27.340 | nine hundred and forty-five
01:57:29.340 | Right, and so the first one represents unk, the second one represents pad,
01:57:35.180 | the third one was dot, the fourth one was comma, this one, I'm just guessing, was 'the', and so forth
01:57:42.660 | Right and so each one of these gets an
01:57:45.460 | embedding vector
01:57:47.660 | So this is literally identical to what we did
01:57:50.500 | Before the break right this is a categorical variable. It's just a very high cardinality categorical variable and furthermore
01:57:59.300 | It's the only variable right. This is pretty standard in NLP. You have a variable which is a word
01:58:06.740 | Right we have a single categorical variable
01:58:10.260 | a single column basically, and it's a thirty-four-thousand-nine-hundred-and-forty-five
01:58:16.860 | cardinality categorical variable, and so we're going to create an embedding matrix for it
01:58:21.900 | So em_sz is the size of the embedding vector, 200, okay?
01:58:28.020 | So that's going to be length 200, a lot bigger than our previous embedding vectors. Not surprising, because a word
01:58:34.740 | has a lot more nuance to it than the concept of Sunday,
01:58:39.580 | right,
01:58:40.780 | or Rossmann's Berlin store or whatever. So generally an embedding size for a word
01:58:47.640 | will be somewhere between about 50 and about 600
01:58:50.400 | Okay, so I've kind of gone somewhere in the middle
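The embedding matrix itself is just a plain PyTorch embedding of that size; a minimal sketch:

```python
import torch.nn as nn

# One 200-long vector per vocabulary token: the same idea as the categorical
# embeddings before the break, just with ~35,000 rows.
vocab_size, em_sz = 34_945, 200
emb = nn.Embedding(vocab_size, em_sz)
print(emb.weight.shape)   # torch.Size([34945, 200])
```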
01:58:52.980 | We then have to say as per usual how many activations
01:58:58.100 | Do you want in your layers so we're going to use 500 and then how many layers?
01:59:02.140 | Do you want in your neural net we're going to use three okay?
01:59:08.140 | This is a minor technical detail. It turns out that,
01:59:11.180 | and we're going to learn later about the Adam optimizer,
01:59:14.460 | basically the defaults for it don't work very well with these kinds of models
01:59:18.720 | So we just have to change some of them. You know, basically any time you're doing NLP, you should probably
01:59:24.660 | include this line,
01:59:27.300 | because it works pretty well
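As I recall, the line in question swaps in lower Adam betas via a partial; the exact values below are an assumption rather than something to rely on:

```python
from functools import partial
import torch.optim as optim

# Adam with a lower beta1 (the momentum term) than the (0.9, 0.999) defaults,
# which tend to behave badly on these recurrent language models.
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))
```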
01:59:30.060 | So having done that we can now again take our model data object and grab a model out of it
01:59:36.260 | And we can pass in a few different things
01:59:38.580 | What optimization function do we want, how big an embedding do we want, how many activations (the number of hidden units),
01:59:46.860 | how many layers, and
01:59:48.900 | how much dropout of many different kinds
01:59:52.500 | So this language model we're going to use is a very recent development called AWD-LSTM by Stephen Merity,
02:00:01.020 | who's an NLP researcher based in San Francisco, and his main contribution really was to show
02:00:07.680 | how to put dropout all over the place in these NLP models
02:00:13.360 | So we're not going to worry now,
02:00:15.740 | we'll do this in the last lecture, about
02:00:18.780 | what the architecture is and what all these dropouts are. For now,
02:00:22.460 | just know it's the same as per usual: if you try to build an NLP model and you're under-fitting,
02:00:28.540 | then decrease all of these dropouts; if you're over-fitting, then increase all of these dropouts, in roughly this ratio
02:00:35.960 | Okay, that's my rule of thumb. And again, this is such a recent paper,
02:00:42.260 | hardly anybody else is working on this model yet, so there's not a lot of guidance, but I've found these ratios work
02:00:49.260 | well, and that's what Stephen's been using as well
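A sketch of what creating the learner looks like in the old fastai 0.7 API; the dropout keyword names and values are as I recall them, so treat them as assumptions and scale them up or down together:

```python
# Assumes `md` and `opt_fn` from earlier; fastai 0.7 API, keyword names recalled
# from the notebook rather than verified.
em_sz, nh, nl = 200, 500, 3   # embedding size, hidden activations, layers

learner = md.get_model(opt_fn, em_sz, nh, nl,
                       dropouti=0.05, dropout=0.05, wdrop=0.1,
                       dropoute=0.02, dropouth=0.05)
```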
02:00:51.500 | There's another kind of way we can avoid overfitting that we'll talk about in the last class
02:00:58.540 | Again for now this one actually works totally reliably so all of your NLP models probably want this particular line of code
02:01:05.600 | And then this one we're going to talk about at the end of the last lecture as well. You can always include this. Basically what it says is,
02:01:14.700 | when you
02:01:19.220 | look at your gradients, and you multiply them by the learning rate, and you decide how much to update your weights by,
02:01:26.580 | this says clip them,
02:01:28.580 | like, literally,
02:01:31.220 | don't let them be more than zero point three
02:01:34.740 | And this is quite a cool little trick, right, because
02:01:39.540 | if your learning rate's pretty high, you kind of don't want to get in that situation
02:01:46.140 | we talked about where you've got this kind of thing where you go,
02:01:54.100 | you know, rather than little step, little step, little step, instead you go, oh, too big, oh, too big. With gradient
02:02:01.340 | clipping it kind of goes this far, and it's like, oh my goodness, I'm going too far, I'll stop
02:02:05.900 | Right, that's basically what gradient clipping does
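In the notebook this is, as I recall, just `learner.clip = 0.3`; under the hood it amounts to the standard PyTorch gradient clipping utility, sketched here on a dummy model:

```python
import torch
import torch.nn as nn

# After the backward pass, rescale the gradients so their overall norm cannot
# exceed 0.3: the "whoa, that step would be too big, stop" behaviour above.
model = nn.Linear(10, 1)
loss = model(torch.randn(4, 10)).sum()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.3)
```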
02:02:11.980 | Anyway, so these are a bunch of parameters the details don't matter too much right now. You can just steal these
02:02:16.980 | And then we can go ahead and call fit
02:02:22.140 | with exactly the same parameters as usual
02:02:24.140 | So Jeremy, um, there are all these other
02:02:32.940 | word embedding things, like
02:02:36.420 | word2vec and GloVe, so I have two questions about that. One is,
02:02:41.840 | how are those different from these, and the second question, why don't you initialize them with one of those? Yeah, so
02:02:51.900 | So basically that's a great question, so basically
02:02:54.540 | People have pre-trained
02:02:57.820 | These embedding matrices before to do various other tasks. They're not whole pre-trained models
02:03:03.860 | They're just a pre-trained embedding matrix, and you can download them, and as Yannet says they have names like word2vec and GloVe
02:03:10.780 | And they're literally just a matrix
02:03:12.780 | There's no reason we couldn't download them really it's just like
02:03:20.300 | kind of
02:03:22.300 | I found that
02:03:25.620 | building a whole pre-trained model in this way didn't seem to benefit much, if at all, from using pre-trained word vectors,
02:03:32.700 | whereas using a whole pre-trained language model
02:03:35.140 | made a much bigger difference
02:03:37.460 | So, like, those of you who saw word2vec will remember it made a big splash when it came out;
02:03:42.820 | I'm finding this technique of pre-trained language models seems much more powerful, basically,
02:03:49.740 | but I think we could combine both to make them a little better still
02:03:53.620 | What is what is the model that you have used like how can I know the architecture of the model?
02:04:00.020 | So we'll be learning about the model architecture in the last lesson. For now, it's a recurrent neural network
02:04:07.980 | using something called an LSTM, long short-term memory
02:04:17.740 | So there are lots of details that we're skipping over, but you know, you can do all this without any of those details
02:04:23.500 | We go ahead and fit the model
02:04:25.980 | I found that this language model took quite a while to fit, so I kind of ran it for a while,
02:04:31.260 | noticed it was still under-fitting, saved where it was up to,
02:04:34.860 | ran it a bit more with a longer cycle length, saved it again, it still
02:04:39.500 | was kind of under-fitting,
02:04:42.180 | you know, ran it again,
02:04:44.220 | and finally got to the point where, honestly, I kind of ran out of patience,
02:04:48.300 | so I just saved it at that point
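That train-a-bit, save, train-some-more loop looks roughly like this in fastai 0.7 style; the learning rate, cycle settings and file names are illustrative assumptions, not the notebook's exact values:

```python
# Assumes `learner` from earlier (fastai 0.7); values are illustrative.
learner.fit(3e-3, 1, wds=1e-6, cycle_len=1)     # first pass: still under-fitting
learner.save_encoder('lm_enc_1')                # save where it was up to
learner.fit(3e-3, 1, wds=1e-6, cycle_len=10)    # longer cycle: still under-fitting
learner.save_encoder('lm_enc_2')                # ran out of patience, keep this one
```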
02:04:53.700 | I did the same kind of test that we looked at before, so I gave it "it wasn't quite what I was expecting,
02:04:58.620 | but I really liked it anyway. The best", and then I was like, okay,
02:05:01.080 | let's see how that goes, and it continued with "performance was one in the movie was a little bit", so I said, okay,
02:05:05.020 | it looks like the language model is working pretty well
02:05:07.180 | So I've pre-trained the language model
02:05:12.980 | And so now I want to use it
02:05:14.980 | fine-tune it to do classification, sentiment classification. Now obviously if I'm going to use a pre-trained model
02:05:21.760 | I need to use exactly the same vocab, right, the word "the"
02:05:25.860 | still needs to map to the same number so that I can look up the vector for it, right, so that's why I first of all
02:05:33.820 | load back up my field object, the thing with the vocab in it. Now in this case,
02:05:41.060 | if I run it straight afterwards, this is unnecessary,
02:05:43.880 | it's already in memory, but this means I can come back to this later in a new session, basically
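The reload itself is just unpickling the saved torchtext field; the path and file name below are assumptions, the point is only that the vocab comes back with it:

```python
import pickle

# Reload the torchtext TEXT field saved when the language model was built, so a
# new session gets exactly the same word-to-id mapping. PATH and the file name
# are assumptions here.
TEXT = pickle.load(open(f'{PATH}models/TEXT.pkl', 'rb'))
```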
02:05:50.860 | I can then go ahead and say, okay, I've now got one more field, right: in addition to my field
02:05:59.780 | which represents the reviews, I've also got a field which represents the label
02:06:05.780 | And the details aren't too important here
02:06:09.060 | Now this time I need to not treat the whole thing as one big
02:06:14.180 | piece of text, but keep every review separate, because each one has a different sentiment attached to it
02:06:20.420 | And it so happens that torchtext already has a dataset that does that for IMDB, so I just used the IMDB dataset
02:06:27.460 | built into torchtext
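Sketched out, the torchtext side looks something like this; note this is the older torchtext API (in recent versions these classes moved under `torchtext.legacy`), so take the import paths as assumptions:

```python
from torchtext import data, datasets

# A non-sequential field for the positive/negative label, plus the built-in
# IMDB dataset, which keeps every review as a separate example.
IMDB_LABEL = data.Field(sequential=False)
splits = datasets.IMDB.splits(TEXT, IMDB_LABEL, root='data/')
```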
02:06:30.180 | So basically once we've done all that we end up with something where we can like grab for a particular example
02:06:36.180 | We can grab its label
02:06:38.860 | positive and
02:06:40.020 | Here's some of the text: "this is another great Tom Berenger movie", blah blah blah. All right, so
02:06:45.220 | there's nothing fastai-specific here
02:06:49.820 | We'll come back to it in the last lecture
02:06:51.660 | but the torchtext docs can help you understand what's going on. All you need to know is that
02:06:56.660 | once you've used this special torchtext thing called splits to grab a splits object,
02:07:02.860 | you can pass it straight into fastai's TextData.from_splits, and that basically converts a torchtext
02:07:10.140 | object into a fastai object we can train on. So as soon as you've done that, you can just go ahead and say
02:07:17.500 | get_model, right, and that gets us our learner
02:07:20.700 | And then we can load into it the pre-trained model, the language model,
02:07:26.860 | right, and so we can now take that pre-trained language model and
02:07:31.900 | use the stuff that we're kind of familiar with, right, so we can
02:07:35.300 | make sure that everything but the last layer is frozen, train it a bit,
02:07:40.140 | unfreeze it, train it a bit, and the nice thing is, once you've got a pre-trained
02:07:45.300 | language model it actually trains super fast. You can see here it's like a couple of minutes
02:07:50.380 | per epoch, and to get my best one here
02:07:56.060 | it only took me like 10 epochs, so it's like 20 minutes to train this bit. It's really fast
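Put together, the fine-tuning stage looks roughly like this; this is the old fastai 0.7 API from memory, the method names follow the steps just described but the exact signatures and numbers are assumptions:

```python
# Assumes the notebook's star imports plus PATH, splits, opt_fn, em_sz, nh, nl,
# bptt from earlier; signatures are recalled, not verified.
md2 = TextData.from_splits(PATH, splits, 64)        # torchtext splits -> fastai data
m3 = md2.get_model(opt_fn, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl)
m3.load_encoder('lm_enc_2')    # load the pre-trained language model weights
m3.freeze_to(-1)               # everything except the last layer frozen
m3.fit(3e-3, 1)                # train the head a bit
m3.unfreeze()
m3.fit(3e-3, 1, cycle_len=1)   # then fine-tune the whole thing a bit more
```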
02:08:01.900 | And I ended up with
02:08:03.900 | 94.5% so how good is 94.5% well it so happens that
02:08:11.540 | Actually, one of Stephen Merity's colleagues, James Bradbury, recently wrote a paper
02:08:17.220 | where they tried to create a new state-of-the-art for a bunch of NLP things, and one of the things
02:08:25.980 | they looked at was
02:08:27.940 | IMDB, and they actually have here a list of the current world's best for
02:08:33.180 | IMDB, and
02:08:35.780 | even with stuff that is highly specialized for sentiment analysis, the best anybody had previously come up with was 94.1
02:08:43.220 | So in other words, this technique,
02:08:45.700 | getting 94.5, is literally
02:08:48.980 | better than anything
02:08:51.100 | anybody has created in the world before, as far as we know, or as far as James Bradbury knows
02:08:58.820 | so when I say like there are big opportunities to use this I mean like
02:09:03.180 | this is a technique that nobody else currently has access to. Whatever
02:09:10.300 | IBM has in Watson, or whatever any big company has that they're
02:09:16.180 | advertising, unless they have some secret sauce that they're not publishing, which they don't, right, because
02:09:23.020 | if they have a better thing they publish it,
02:09:25.380 | then you now have access to a better text classification method than has ever existed before
02:09:30.340 | So I really hope that you know, you can try this out and see how you go
02:09:35.140 | There may be some things it works really well on and others that it doesn't work as well on, I don't know
02:09:41.860 | I think this is kind of the sweet spot here, that we had about 25,000,
02:09:48.420 | you know, short to medium sized documents; if you don't have at least that much text
02:09:54.060 | it may be hard to train a decent language model
02:09:56.540 | But having said that, there's a lot more we could do here, right, and we won't be able to do it in part one of this course,
02:10:02.660 | we'll do it in part two, but for example, we could start training language models that look at, like,
02:10:08.860 | You know lots and lots of medical journals and then we could like make a downloadable
02:10:13.620 | medical language model that then anybody could use to like fine-tune on like a
02:10:20.300 | Prostate cancer subset of medical literature for instance, like there's so much we could do
02:10:26.300 | It's kind of exciting, and then, you know, to Yannet's point, we could also combine this with pre-trained word vectors
02:10:32.020 | So, like, even without
02:10:34.260 | trying that hard, you know,
02:10:37.780 | we could have pre-trained a Wikipedia, say, corpus language model, and then fine-tuned it into an
02:10:45.820 | IMDb language model, and then fine-tuned that into an IMDb sentiment analysis model, and we would have got something better than this
02:10:53.100 | So, like, I really think this is the tip of the iceberg
02:10:56.780 | And I was talking to a really fantastic researcher called Sebastian Ruder, who is
02:11:04.500 | basically the only NLP researcher I know who's been really writing a lot about
02:11:11.380 | training and fine-tuning and transfer learning in NLP, and I was asking him, like, why isn't this happening more?
02:11:17.740 | And his view was it's because there isn't the software to make it easy, you know
02:11:23.500 | So I'm actually going to share this lecture with him tomorrow
02:11:27.780 | Because you know it feels like there's you know
02:11:32.540 | Hopefully going to be a lot of stuff coming out now that we're making it really easy to do this
02:11:41.380 | We're kind of out of time so what I'll do is I'll quickly look at
02:11:53.360 | the collaborative filtering introduction, and then we'll finish it next time. For collaborative filtering there's very, very little new to learn
02:11:56.300 | We've basically learned everything we're going to need
02:12:02.980 | So we'll cover collaborative filtering itself quite quickly next week,
02:12:07.980 | and then we're going to do a really deep dive into it next week,
02:12:07.980 | Where we're going to learn about like we're actually going to from scratch learn how to do stochastic gradient descent
02:12:13.820 | How to create loss functions how they work exactly and then we'll go from there and we'll gradually build back up to really deeply understand
02:12:22.820 | what's going on in the structured models, and then what's going on in convnets, and then finally what's going on in recurrent neural networks
02:12:30.500 | And hopefully we'll be able to build them all
02:12:32.940 | from scratch, okay? So this MovieLens dataset is going to be really important, because we're going to use it to
02:12:39.100 | learn a lot of
02:12:40.860 | really foundational theory and the kind of math behind it. So the MovieLens dataset,
02:12:47.380 | This is basically what it looks like it contains a bunch of ratings. It says user number one
02:12:54.140 | Watched movie number 31 and they gave it a rating of two and a half
02:12:58.740 | at this particular time and
02:13:02.020 | Then they watched movie 1029 and they gave it a rating of three, and they watched movie 1172
02:13:08.340 | and they gave it a rating of four. Okay, and so forth
02:13:11.020 | So this is the ratings table. This is really the only one that matters, and our goal will be, for some user,
02:13:18.740 | sorry, for some user-movie combination we haven't seen before, to predict whether they'll like it
02:13:25.580 | Right, and so this is how recommendation systems are built
02:13:29.220 | This is how Amazon decides what books to recommend, how Netflix decides what movies to recommend, and so forth
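If you want to poke at it during the week, the ratings table is just a CSV; the path below assumes the standard small MovieLens download, so adjust it to wherever you put the data:

```python
import pandas as pd

ratings = pd.read_csv('data/ml-latest-small/ratings.csv')
print(ratings.head())   # columns: userId, movieId, rating, timestamp
```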
02:13:34.880 | To make it more interesting we'll also actually download a list of movies so each movie
02:13:42.020 | We're actually going to have the title and so for that question earlier about like what's actually going to be in these embedding matrices
02:13:47.420 | How do we interpret them? We're actually going to be able to look and see
02:13:50.260 | How that's working?
02:13:52.660 | So basically this is kind of like what we're creating this is kind of crosstab of users
02:13:59.960 | by movies
02:14:01.400 | Alright, and so feel free to look ahead during the week. You'll see, basically as per usual, CollabFilterDataset.from_csv,
02:14:08.300 | model data dot get_learner,
02:14:10.800 | learn dot fit, and we're done
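For the curious, those three lines look roughly like this in the old fastai 0.7 API; the class and method names are as I recall them, and the factor count, validation indices and training settings are illustrative assumptions:

```python
# Assumes the notebook's star imports and the MovieLens ratings.csv in PATH.
val_idxs = get_cv_idxs(len(ratings))   # hold some rows out for validation
cf = CollabFilterDataset.from_csv(PATH, 'ratings.csv', 'userId', 'movieId', 'rating')
learn = cf.get_learner(50, val_idxs, 64, opt_fn=optim.Adam)   # 50 factors, bs 64
learn.fit(1e-2, 2, wds=1e-4, cycle_len=1, cycle_mult=2)
```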
02:14:16.680 | And you won't be surprised to hear that when we then compare it to the benchmarks we looked at, it seems to do better. So that'll basically be it, and then next week
02:14:22.040 | we'll have a deep dive and we'll see how to actually build this from scratch. All right. See you next week
02:14:27.640 | [APPLAUSE]