
Lesson 4: Deep Learning 2018


Chapters

0:0 Introduction
10:11 Dropout and Generalization
17:40 Default Dropout Values
28:3 Categorical vs Continuous Variables
45:7 Create the Learner
45:44 Embeddings
46:4 Continuous Variables
57:5 Embedding Matrices
61:56 Distributed Representation
71:50 Custom Metrics
79:14 Data Augmentation
79:32 What Is Dropout Doing
84:48 NLP
85:10 Language Modeling
91:14 Language Model
97:32 Create a Language Model
99:50 Tokenization in NLP
100:58 Example of Tokenization
102:4 Create the Model Data Object
102:25 Model Data Object
102:55 Minimum Frequency
107:11 Bag of Words
115:55 Create a Model
116:0 Embedding Matrix
118:19 Create an Embedding Matrix
124:24 Fit the Model

Whisper Transcript | Transcript Only Page

00:00:00.000 | Okay, hi everybody welcome back good to see you all here
00:00:03.300 | It's been another
00:00:08.240 | Busy week of deep learning
00:00:11.360 | Lots of cool things going on and like last week
00:00:16.400 | I wanted to highlight a few really interesting articles that some of some of you folks who have written
00:00:24.640 | Fatali wrote one of the best articles I've seen for a while. I think actually talking about
00:00:32.880 | differential learning rates and stochastic gradient descent with restarts
00:00:36.720 | Be sure to check it out if you can because what he's done. I feel like he's done a great job of
00:00:43.040 | Kind of positioning in a place that you can get a lot out of it
00:00:48.360 | You know regardless of your background, but for those who want to go further
00:00:52.160 | He's also got links to like the academic papers that came from and kind of graphs of showing examples of all of all the things
00:00:59.080 | He's talking about
00:01:00.320 | And I think it's a it's a particularly
00:01:02.440 | Nicely done article so a good kind of role model for technical communication
00:01:08.600 | One of the things I've liked about you know seeing people post these
00:01:12.560 | Post these articles during the week is the discussion on the forums have also been like really great. There's been a lot of a
00:01:20.320 | lot of people helping out like
00:01:22.320 | Explaining things you know which you know maybe there's parts of the post bit where people have said actually that's not quite how it works
00:01:28.640 | And people have learned new things that way people have come up with new ideas as a result as well
00:01:33.600 | These discussions of stochastic gradient descent with restarts and cyclical learning rates has been a few of them actually
00:01:41.520 | Anand Sahar has written another great post
00:01:44.720 | talking about a similar
00:01:48.720 | Similar topic and why it works so well and again lots of great pictures and references to
00:01:53.780 | Papers and most importantly perhaps code are showing how it actually works
00:01:59.600 | Mark Hoffman covered the same topic at kind of a nice introductory level. I think really really kind of clear intuition
00:02:10.180 | Many can't talk specifically about differential learning rates
00:02:15.920 | And why it's interesting and again providing some nice context people not familiar with transfer learning
00:02:22.120 | Going right back to saying, like, well, what is transfer learning?
00:02:24.520 | Why is that interesting and given that why could differential learning rates be helpful?
00:02:30.300 | and then
00:02:33.440 | One thing I particularly liked about Arjun's
00:02:35.440 | article was that he talked not just about the technology that we're looking at but also talked about some of the
00:02:42.840 | implications particularly from a commercial point of view
00:02:45.280 | So thinking about like based on some of the things we've learned about so far
00:02:49.800 | What are some of the implications that that has you know in real life?
00:02:53.160 | And lots of background lots of pictures
00:02:56.180 | And then discussing some of the yeah some of the implications
00:03:00.400 | So there's been lots of great stuff online and thanks to everybody for all the great work that you've been doing
00:03:08.640 | As we talked about last week if you're kind of vaguely wondering about writing something
00:03:13.800 | But you're feeling a bit intimidated about it because you've never really written a technical post before just jump in you know
00:03:19.240 | It's a really
00:03:21.980 | Welcoming and encouraging group. I think to to work with
00:03:26.360 | So we're going to have a kind of an interesting lesson today, which is we're going to cover a
00:03:37.120 | Whole lot of different applications, so we've we've spent quite a lot of time on computer vision
00:03:42.600 | And today we're going to try if we can to get through three totally different areas
00:03:48.120 | Structured learning so looking at kind of how you look at
00:03:52.120 | So we're going to start out looking at structured learning or structured data learning by which I mean
00:03:59.340 | Building models on top of things that look more like database tables
00:04:04.680 | So kind of columns of different types of data. They might be financial or geographical or whatever
00:04:10.240 | We're going to look at using deep learning for language natural language processing
00:04:16.600 | And we're going to look at using deep learning for recommendation systems, and so we're going to cover these
00:04:22.440 | at a very high level and the focus will be on
00:04:26.640 | Here is how to use the software to do it
00:04:31.480 | More than here is what's going on behind the scenes, and then the next three lessons
00:04:36.160 | We'll be digging into the details of what's been going on behind the scenes and also coming back to
00:04:41.920 | Looking at a lot of the details of computer vision that we kind of skipped over so far
00:04:47.740 | So the focus today is really on like how do you actually do these applications?
00:04:53.880 | And we'll kind of talk briefly about some of the concepts involved
00:04:59.720 | Before we do I did want to talk about one key
00:05:02.200 | New concept
00:05:06.480 | Which is dropout and you might have seen dropout mentioned a bunch of times already and got the got the impression that this is
00:05:12.840 | Something important and indeed it is
00:05:14.840 | So to look at dropout. I'm going to look at the the dog breeds
00:05:18.740 | Current Kaggle competition that's going on, and what I've done is I've gone ahead and I've created a
00:05:28.240 | pre-trained network as per usual
00:05:30.240 | and I've passed in pre compute equals true and so that's going to
00:05:34.520 | Pre-compute the activations that come out of the last convolutional layer. Remember an activation is just a number
00:05:42.080 | It's a number just a reminder
00:05:44.240 | an activation
00:05:46.480 | Like here is one activation. It's a number and
00:05:49.600 | Specifically the activations are calculated based on some
00:05:54.240 | Weights also called parameters that make up
00:05:58.160 | kernels or filters and they get applied to the previous layers activations
00:06:03.160 | Which could well be the inputs or they could themselves be the results of other calculations
00:06:09.000 | Okay, so when we say activation keep remembering we're talking about a number that's being calculated
00:06:13.640 | So we pre compute some activations
00:06:17.580 | And then what we do is we put on top of that a bunch of additional
00:06:22.880 | Initially randomly generated
00:06:24.880 | Fully connected layers, so we're just going to do some matrix multiplications on top of these just like in our Excel worksheet
00:06:31.440 | at the very end
00:06:34.000 | We had this matrix that we just did a matrix multiplication
00:06:39.520 | So what you can actually do is if you just type
00:06:45.360 | The name of your learner object you can actually see
00:06:49.200 | What's in it? You can see the layers in it. So when I was previously been skipping over a little bit about oh
00:06:54.800 | We add a few layers to the end. These are actually the layers that we add
00:06:58.120 | We're going to do batch norm in the last lesson. So don't worry about that for now a
00:07:03.240 | Linear layer simply means a matrix multiply. Okay, so this is a matrix which has a 1024 rows and
00:07:10.360 | 512 columns and so in other words, it's going to take in 1,024 activations and spit out
00:07:18.400 | 512 activations
00:07:20.400 | Then we have a relu which remember is just replace the negatives with zero
00:07:24.600 | We'll skip over the batch norm
00:07:27.200 | We'll come back to drop out and then we have a second linear layer that takes those
00:07:30.800 | 512 activations from the previous linear layer and puts them through a new matrix multiply
00:07:35.880 | 512 by 120 spits out a new 120 activations and then finally put that through
00:07:43.440 | Softmax, and for those of you that don't remember softmax, we looked at that last week
00:07:49.160 | It's this idea that we basically just
00:07:51.960 | Take the previous the activation. Let's say for dog
00:07:55.680 | Go e to the power of that and then divide that into the sum of e to the power of all the activations
00:08:02.960 | So that was the thing that adds up to one all of them add up to one and each one individually is between zero and one
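Written out as a formula, the softmax being described here is:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

so each output is between zero and one and they all add up to one.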
00:08:08.620 | okay, so
00:08:10.960 | That's that's what we added on top and that's the thing when we have pre compute equals true
00:08:15.920 | That's the thing we train so I wanted to talk about what this dropout is and what this P is because it's a really important
00:08:22.400 | Thing that we get to choose
00:08:25.000 | So a dropout layer with P equals zero point five
00:08:28.340 | Literally does this we go over to our spreadsheet and let's pick any layer with some activations and let's say okay
00:08:34.800 | I'm going to apply dropout with a P of zero point five two times two what that means is I go through and
00:08:42.000 | with a 50% chance I
00:08:44.720 | Pick a cell right pick an activation. So I picked like half of them randomly and I delete them
00:08:54.880 | That's that's what dropout is right? So it's so the P equals point five means what's the probability of
00:09:02.080 | deleting that cell
00:09:04.600 | Right. So when I delete those cells
00:09:07.720 | If you have a look at the output
00:09:11.560 | It doesn't actually change by very much at all just a little bit particularly because remember it's going through a max pooling layer
00:09:17.760 | Right, so it's only going to change it at all if it was actually the maximum in that group of four
00:09:22.660 | and furthermore, it's just one piece of you know, if it's going into a convolution rather than into a max pool
00:09:30.460 | It's just one piece of that that filter
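As a minimal sketch (not the course code), here is what dropout with p = 0.5 means for one layer's activations, using a hypothetical numpy array:

```python
import numpy as np

# Each activation is independently "deleted" (set to zero) with probability p.
acts = np.array([0.2, 1.3, 0.7, 2.1, 0.4, 1.8])   # hypothetical activations
p = 0.5                                            # probability of deleting each one
mask = np.random.rand(len(acts)) >= p              # keep with probability 1 - p
print(acts * mask)                                 # dropped activations become zero
```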
00:09:33.720 | so interestingly
00:09:35.720 | The idea of like randomly throwing away half of the activations in a layer
00:09:41.840 | Has a really interesting result and one important thing to mention is each mini batch we throw away a different
00:09:50.960 | random half of activations in that layer and so what it means is
00:09:56.160 | It it forces it to not overfit right in other words if there's some particular activation
00:10:02.920 | That's really learnt just that exact
00:10:05.680 | That exact dog or that exact cat right then when that gets dropped out
00:10:12.160 | The whole thing now isn't going to work as well. It's not going to recognize that image, right?
00:10:17.160 | so it has to in order for this to work it has to try and find a
00:10:21.480 | representation that
00:10:24.280 | That actually continues to work even as random half of the activations get thrown away every time
00:10:31.720 | Right, so it's, I guess, about three or four years old now, and it's been
00:10:39.020 | Absolutely critical in
00:10:43.120 | Making modern deep learning work and the reason why is it really just about solve?
00:10:49.120 | The problem of generalization for us before dropout came along
00:10:53.440 | if you try to train a model with lots of parameters and you were overfitting and
00:11:01.160 | You'd already tried all the data augmentation you could, and you already had as much data as you could.
00:11:07.240 | There were some other things you could try, but to a large degree you were kind of stuck
00:11:11.160 | and so then
00:11:13.800 | Geoffrey Hinton and his colleagues came up with this dropout idea that was loosely inspired by the way the brain works
00:11:22.160 | And also loosely inspired by Geoffrey Hinton's experience in bank teller queues, apparently
00:11:30.160 | yeah, somehow they came up with this amazing idea of like hey, let's let's try throwing things away at random and
00:11:36.080 | So as you can imagine if your P was like 0.01
00:11:42.420 | Then you're throwing away 1% of your activations for that layer at random. It's not going to randomly
00:11:49.320 | Change things up very much at all
00:11:51.920 | So it's not really going to protect you from
00:11:56.040 | Overfitting much at all on the other hand if your P was 0.99
00:12:00.040 | then that would be like going through the whole thing and throwing away nearly everything right and
00:12:06.360 | That would be very hard for it to overfit so that would be great for generalization, but it's also going to kill your
00:12:14.780 | accuracy
00:12:16.880 | so this is kind of
00:12:19.040 | Trade-off: high p values generalize well
00:12:22.860 | But will decrease your training accuracy, and low p values will generalize less well but will give you a better training accuracy
00:12:30.760 | So for those of you that have been wondering why is it that particularly early in training are my validation losses better?
00:12:39.040 | Than my training losses right which seems otherwise really surprising. Hopefully some of you have been wondering why that is
00:12:46.400 | because on a data set that it never gets to see you wouldn't expect the losses to ever be
00:12:51.880 | Better, and the reason why is because when we look at the validation set we turn off dropout
00:12:58.200 | Right so in other words when you're doing inference when you're trying to say is this a cat or is this a dog?
00:13:03.240 | We certainly don't want to be including
00:13:05.800 | Random dropout there right we want to be using the best model we can
00:13:10.300 | Okay, so that's why early in training in particular
00:13:14.840 | We actually see that our validation
00:13:16.840 | Accuracy and loss tends to be better
00:13:19.920 | If we're using dropout, okay, so yes
00:13:24.000 | Do you have to do anything to accommodate for the fact that you are throwing away some activations?
00:13:30.920 | That's a great question, so
00:13:34.280 | We don't but pytorch does so pytorch behind the scenes does two things if you say P equals point five
00:13:42.040 | It throws away half of the activations
00:13:45.360 | but it also
00:13:48.120 | Doubles all the activations that are already there so on average the kind of the average activation doesn't change
00:13:55.040 | Which is pretty pretty neat trick?
00:13:57.640 | So yeah, you don't have to worry about it basically it's done for you
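As a sketch of that rescaling: in training mode PyTorch's nn.Dropout zeroes activations with probability p and scales the survivors by 1/(1-p), and in eval mode it does nothing.

```python
import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
print(drop(x))   # roughly half zeros, the rest are 2.0, i.e. 1 / (1 - 0.5)

drop.eval()
print(drop(x))   # all ones: dropout is switched off at inference time
```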
00:14:02.840 | So you can pass in ps.
00:14:08.560 | This is the p value for all of the added layers, to say
00:14:13.820 | With fastai what dropout do you want on each of the layers in these these added layers?
00:14:19.360 | It won't change the dropout in the pre-trained network like the hope is that that's already been
00:14:25.440 | Pretty trained with some appropriate level of dropout
00:14:28.440 | We don't change it but on these layers that we add you can say how much and so you can see here
00:14:33.080 | I said ps equals 0.5, so my first dropout has 0.5, my second dropout has 0.5
00:14:39.460 | All right, and remember coming to the input of this
00:14:42.240 | Was the output of the last convolutional layer of pre-trained network?
00:14:47.460 | And we actually throw away half of that before it even goes through our linear layer
00:14:53.360 | Throw away the negatives
00:14:56.120 | Throw away half of the result of that go through another linear layer and then pass that to our softmax
00:15:04.080 | For minor numerical precision reasons it turns out to be better to take the log of the softmax than the softmax directly
00:15:12.080 | And that's why you'll have noticed that when you actually get predictions out of our models you always have to go
00:15:17.440 | np.exp of the predictions
00:15:20.040 | Again, the details as to why aren't important. So if we want to
00:15:25.240 | Try removing dropout, we could go ps equals zero
00:15:30.200 | Right, and you'll see whereas before we started with a 0.76 accuracy in the first epoch, now
00:15:35.680 | You've got a point eight accuracy in the first epoch
00:15:37.680 | So by not doing dropout our first epoch worked better not surprisingly because we're not throwing anything away
00:15:44.240 | but by the third epoch here, we had eighty four point eight and
00:15:48.160 | Here we have eighty four point one. So it started out better and ended up worse
00:15:53.080 | So even after three epochs, you can already see we're massively overfitting, right?
00:15:57.520 | We've got point three loss on the train and point five loss on the validation
00:16:03.560 | And so if you look now you can see in the resulting model there's no dropout at all
00:16:11.760 | So if the P is zero, we don't even add it to the model
00:16:14.840 | Another thing to mention is you might have noticed that what we've been doing is we've been adding two
00:16:24.200 | linear layers
00:16:26.000 | Right in our additional layers. You don't have to do that. By the way, there's actually a parameter called extra fully connected
00:16:33.520 | Layers, where you can basically pass a list of how big you want each of the additional fully connected
00:16:41.320 | Layers to be and so by default
00:16:43.320 | Well, you need to have at least one
00:16:45.840 | Right because you need something that takes the output of the convolutional layer
00:16:50.120 | which in this case is a size 1024 and turns it into the number of
00:16:54.800 | Classes you have cats versus dogs would be two dog breeds would be 120
00:17:00.960 | Planet satellite would be 17, whatever. You always need at least one linear layer, and you can't pick how big that is
00:17:08.640 | That's defined by your problem
00:17:10.640 | But you can choose what the other size is or if it happens at all
00:17:15.600 | So if we were to pass in an empty list, then now we're saying don't add any additional linear layers
00:17:21.080 | Just the one that we have to have
00:17:23.080 | Right. So here, if we've got ps equals zero and extra fully connected layers is empty, this is like the minimum
00:17:29.640 | possible
00:17:32.240 | Kind of top model we can put on top and again like if we do that
00:17:37.800 | You can see above we actually end up with in this case a
00:17:44.960 | Reasonably good result because we're not training it for very long and this particular pre-trained network is very well suited
00:17:51.560 | To this particular problem. Yes, you know
00:17:54.040 | So Jeremy, what kind of p should we be using
00:17:58.960 | by default
00:18:01.080 | So the one that's there by default
00:18:04.120 | for the first layer
00:18:06.720 | Is 0.25 and for the second layer is 0.5
00:18:10.800 | That seems to work pretty well
00:18:14.760 | For most things, right? So you don't necessarily need to change it at all
00:18:19.760 | Basically, if you find it's overfitting
00:18:23.200 | Just start bumping it up. So try first of all setting it to 0.5
00:18:28.240 | That'll set them both to 0.5 if it's still overfitting a lot try 0.7 like you can you can narrow down
00:18:34.320 | And like there's not that many
00:18:37.040 | Numbers to try, right, and if you're underfitting
00:18:42.000 | Then you can try making it lower
00:18:44.160 | It's unlikely you would need to make it much lower because like even in these dogs versus cats situations
00:18:51.600 | You know, we don't seem to have to make it lower so it's more likely you'd be increasing it to like 0.6 or 0.7
00:18:58.800 | But you can fiddle around I find these the ones that are there by default seem to work pretty well most of the time
00:19:05.680 | So one place I actually did increase this
00:19:14.080 | Was in the dog breeds one. I did set them both to 0.5
00:19:14.080 | when I used a
00:19:16.760 | Bigger model so like resnet 34 has less parameters
00:19:21.120 | So it doesn't overfit as much, but then when I started bumping it up to like a ResNet-50
00:19:26.420 | Which has a lot more parameters. I noticed it started overfitting. So then I also increased my dropout. So as you use like
00:19:32.920 | Bigger models you'll often need to add more dropout. Can you pass that over there, please? You know
00:19:39.360 | If you set p to 0.5, roughly what percentage is dropped? 50%? 50%, yeah
00:19:48.680 | Was there how can you pass that back?
00:19:51.640 | Thanks. Is there a particular way in which you can determine if the data is being old fitted?
00:20:01.280 | You can see that the like here you can see that the training error is a
00:20:07.200 | Loss is much lower than the validation loss
00:20:09.760 | you can't tell if it's like
00:20:12.520 | to overfitted like
00:20:15.080 | Zero overfitting is not generally optimal like the only way to find that out is
00:20:19.920 | Remember the only thing you're trying to do is to get this number low right the validation loss number low
00:20:24.440 | So in the end you kind of have to play around with a few different things and see which thing ends up getting the validation
00:20:31.080 | Loss low, but you're kind of going to feel over time for your particular problem
00:20:36.720 | What does overfitting? What does too much overfitting look like?
00:20:40.240 | Great so
00:20:44.840 | So that's dropout, and we're going to be using that a lot, and remember it's there by default. Sorry, is there another question?
00:20:50.880 | So I have two questions one is
00:20:55.520 | So when it says the dropout rate is 0.5
00:21:00.280 | Does it, like, you know, delete each cell with a probability of
00:21:06.120 | 0.5, or does it just pick 50% randomly? I mean, I know they're both effectively the same
00:21:11.280 | It's the former yeah, okay, okay, second question is why why does the average activation matter?
00:21:17.920 | well, it matters because the remember if you look at the
00:21:22.960 | Excel spreadsheet that the result of
00:21:26.720 | this cell for example is equal to
00:21:31.520 | These
00:21:38.360 | Multiplied by each of these nine
00:21:40.520 | Right and add it up, so if we deleted half of these
00:21:44.000 | Then that would also cause this number to half which would cause like everything else after that to change and so if you change
00:21:51.600 | What it means you know like it then you're changing something that used to say like oh
00:21:57.080 | Fluffy ears are fluffy if this is greater than point six now
00:22:00.720 | It's only fluffy if it's greater than point three like we're changing the meaning of everything
00:22:04.000 | So you the goal here is to delete things without changing
00:22:08.800 | We're using a linear activation for one of the earlier activations
00:22:17.560 | Why are we using linear? Yeah? Why that particular activation?
00:22:22.040 | Because that's what this set of layers is so we've we've the the pre trained network is all is the convolutional network
00:22:28.960 | And that's pretty computed, so we don't see it so what that spits out is a vector
00:22:35.320 | So the only choice we have is to use linear layers at this point
00:22:41.760 | Can we have different level of dropout by layer? Yes, absolutely how to do that great so
00:22:49.880 | You can absolutely have different dropout by layer, and that's why this is actually called ps
00:22:54.720 | So you can pass in an array here, so if I went zero
00:22:58.400 | comma 0.2 for example and then extra fully connected. I might add 512
00:23:05.120 | Right then that's going to be zero dropout before the first of them and point two dropout before the second of them
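A sketch of what that call might look like with the fastai (0.7) library used in the course; the argument names ps and xtra_fc are as I recall them, and arch and data are assumed to be defined earlier in the notebook:

```python
# Per-layer dropout via `ps`, one extra 512-wide fully connected layer via `xtra_fc`.
learn = ConvLearner.pretrained(arch, data,
                               ps=[0., 0.2],    # no dropout before the first added layer, 0.2 before the second
                               xtra_fc=[512],   # one additional hidden linear layer of size 512
                               precompute=True)
```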
00:23:12.800 | Yes. And I must admit, I don't have a great
00:23:17.760 | Intuition even after doing this for a few years for like
00:23:20.640 | When should earlier or later layers have different amounts of dropout?
00:23:26.640 | It's still something I kind of play with and I can't quite find rules of thumb
00:23:32.000 | So there's some of you come up with some good rules of thumb. I'd love to
00:23:35.020 | Hear about them. I think if in doubt
00:23:37.840 | You can use the same dropout in every fully connected layer
00:23:42.040 | The other thing you can try is often people only put dropout on the very last
00:23:48.040 | Linear layer, so that'd be the two things to try
00:23:50.440 | So Jeremy, why do you monitor the log loss the loss instead of the accuracy going up?
00:24:00.800 | Well because the loss is the only thing that we can see
00:24:05.080 | For both the validation set and the training set so it's nice to be able to compare them
00:24:13.440 | also as we'll learn about
00:24:16.720 | Later the loss is the thing that we're actually
00:24:19.400 | optimizing
00:24:22.240 | So it's it's kind of a little more. It's a little easier to monitor that and understand what that means
00:24:28.520 | Can you pass it over there?
00:24:32.120 | So with dropout we are kind of adding some random noise every iteration right so
00:24:39.240 | So that means that we don't do as much learning right or actually so that's right
00:24:45.800 | So we have to play around with the learning rate and it doesn't seem to impact the learning rate
00:24:50.860 | Enough that I've ever noticed it. I I would say you're probably right in theory it might but not enough that it's ever affected me
00:24:59.280 | Okay, so let's talk about this
00:25:07.360 | Structured data problem, and so to remind you, we were looking at Kaggle's Rossmann competition
00:25:15.160 | Which is a German
00:25:17.160 | Chain of supermarkets, I believe and you can find this in lesson 3 Rossman
00:25:26.280 | The main data set is the one where we were looking to say at a particular store
00:25:33.040 | How much did they sell?
00:25:36.040 | Okay, and there's a few big key pieces of information one is what was the date another was were they open?
00:25:42.840 | Did they have a promotion on?
00:25:44.840 | Was it a holiday in that state?
00:25:47.400 | And was it a holiday as for school a state holiday there?
00:25:51.360 | Or was it a school holiday there and then we had some more information about stores like what for this store?
00:25:57.200 | What kind of stuff did they tend to sell what kind of store are they how far away the competition and so forth so?
00:26:03.240 | With the data set like this there's really two main kinds of column. There's columns that we think of as
00:26:10.600 | Categorical they have a number of levels so the assortment
00:26:13.760 | Column is categorical, and it has levels such as a B and C
00:26:19.200 | Whereas something like competition distance we would call continuous
00:26:25.380 | It has a number attached to it where differences or ratios of that number have some kind of meaning
00:26:31.480 | And so we need to deal with these two things quite differently, okay, so anybody who's done any
00:26:39.240 | Machine learning of any kind will be familiar with using continuous columns if you've done any linear regression for example
00:26:45.400 | You can just like multiply them by parameters for instance
00:26:48.680 | Categorical columns we're going to have to think about a little bit more
00:26:52.440 | We're not going to go through the data cleaning or the feature engineering; we're going to assume all that's been done
00:27:00.240 | And so basically
00:27:04.280 | at the end of that we have a list of columns, and in this case I
00:27:09.960 | Didn't do any of the thinking around the feature engineering or data cleaning myself
00:27:16.920 | This is all directly from the third place winners of this competition
00:27:20.680 | And so they came up with all of these different
00:27:25.160 | Columns that they found useful
00:27:28.760 | and so
00:27:30.640 | You'll notice the list here is a list of the things that we're going to treat as categorical variables
00:27:37.000 | Numbers like year, month and day
00:27:42.480 | Although we could treat them as continuous, like the difference between 2000 and 2003 is meaningful
00:27:51.200 | We don't have to right and you'll see shortly how
00:27:55.080 | how categorical
00:27:59.840 | variables are treated
00:28:00.880 | But basically if we decide to make something a categorical variable what we're telling our neural net down the track is
00:28:07.480 | That for every different level of say year, you know, 2000 2001 2002 you can treat it totally differently
00:28:14.920 | Whereas if we say it's continuous, it's going to have to come up with some kind of, like, function, some kind of smooth-ish
00:28:22.120 | function right and so often even for things like year that actually are continuous
00:28:29.280 | But they don't actually have many distinct levels it often works better
00:28:33.640 | To treat it as categorical
00:28:36.200 | So another good example day of week, right? So like day of week between naught and six
00:28:42.080 | It's a number and it means something like the difference between three and five is two days and has meaning but if you think about
00:28:49.680 | like how would
00:28:51.680 | Sales in a store vary by day of week
00:28:54.520 | It could well be that like, you know, Saturdays and Sundays are over here and Fridays are over here and Wednesdays over here
00:29:00.860 | Like each day is going to behave
00:29:03.000 | Kind of qualitatively differently, right? So by saying this is the categorical variable as you'll see we're going to let the neural net
00:29:11.920 | Do that right? So this thing where we get where we say
00:29:15.960 | Which are continuous and which are categorical to some extent? This is a modeling decision you get to make
00:29:23.440 | now if something is coded in your data is like a B and C or
00:29:29.560 | You know Jeremy and your net or whatever you actually you're going to have to call that categorical, right?
00:29:36.400 | There's no way to treat that directly as a continuous variable
00:29:40.000 | On the other hand if it starts out as a continuous variable like age or day of week
00:29:45.540 | You get to decide
00:29:48.280 | Whether you want to treat it as continuous or categorical. Okay, so summarize if it's categorical in the data
00:29:54.060 | It's going to have to be categorical in the model if it's continuous in the data
00:29:58.240 | You get to pick whether to make it continuous or categorical in the model
00:30:02.680 | So in this case again, I just did whatever the third place winners of this competition did
00:30:09.440 | These are the ones that they decided to use as categorical. These were the ones they decided to use as continuous and you can see
00:30:15.360 | that basically
00:30:18.000 | The continuous ones are all of the ones which are actual
00:30:22.120 | Floating point numbers like competition distance actually has a decimal place to it, right and temperature actually has a decimal place to it
00:30:30.080 | So these would be very hard to make
00:30:32.080 | categorical because they have many many levels right like if it's like five digits of floating point then potentially there will be as
00:30:40.640 | many levels as there are rows.
00:30:43.160 | And by the way, the word we use to say how many levels are in a category
00:30:49.400 | We use the word cardinality, right?
00:30:51.400 | So if you hear me say cardinality for example the cardinality of the day of week
00:30:55.200 | Variable is seven because there are seven different days of the week
00:30:58.480 | Do you have a heuristic for when to bin continuous variables, or do you ever bin variables? I don't ever bin continuous variables
00:31:11.800 | So yeah, so one thing we could do with like max temperature is group it into
00:31:16.520 | 0 to 10 10 to 20 20 to 30 and then call that categorical
00:31:21.000 | interestingly a paper just came out last week in which a group of researchers found that
00:31:28.280 | Sometimes binning can be helpful
00:31:30.440 | But that literally came out in the last week and until that time I haven't seen anything in deep learning saying that so I haven't
00:31:36.440 | I haven't looked at it myself until this week. I would have said it's a bad idea
00:31:41.360 | Now I have to think differently. I guess maybe it is sometimes
00:31:44.480 | So if you're using
00:31:51.480 | Year as a category, what happens when you run the model on a year it's never seen? So you trained it on
00:31:58.080 | Well, we'll get there. Yeah, the short answer is it'll be treated as an unknown category
00:32:04.300 | And so pandas, which is the underlying data frame library
00:32:08.840 | We're using, has a special category called unknown, and if it sees a category it hasn't seen before it gets treated as unknown
00:32:16.800 | So for our deep learning model unknown would just be another category
00:32:22.600 | If our training data set doesn't have a category and
00:32:32.080 | Test has an unknown, how will it be? It'll just be part of this unknown category. Will it still predict?
00:32:39.480 | It'll predict something right like it will just have the value
00:32:42.940 | 0 behind the scenes and if there's been any unknowns of any kind in the training set then it'll have learned a
00:32:49.680 | Way to predict unknown if it hasn't it's going to have some random vector. And so that's a
00:32:56.480 | Interesting detail around training that we probably won't talk about in this part of the course
00:33:01.720 | But we can certainly talk about on the forum
00:33:03.720 | Okay, so we've got our categorical and continuous variable lists defined; in this case there were about 800,000 rows
00:33:14.720 | So 800,000 dates basically by stores
00:33:18.520 | And so you can now take all of these columns
00:33:25.120 | loop through each one and
00:33:28.880 | Replace it in the data frame with a version where you say take it and change its type to category
00:33:34.800 | Okay, and so that just that's just a pandas thing. So I'm not going to teach you pandas
00:33:41.120 | There's plenty of books; particularly Wes McKinney's book Python for Data Analysis is great
00:33:47.080 | But hopefully it's intuitive as to what's going on even if you haven't seen the specific syntax before
00:33:52.320 | So we're going to turn that column into a categorical column
00:33:56.840 | And then for the continuous variables, we're going to make them all
00:33:59.920 | 32-bit floating-point and for the reason for that is that PyTorch
00:34:05.400 | Expects everything to be 32-bit floating-point. Okay, so like some of these include like
00:34:13.480 | 1-0 things like
00:34:16.720 | Can't see them straight away. But anyway, some of them. Yeah, like was there a promo was was a holiday
00:34:23.760 | And so those will become the floating-point values one and zero instead.
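A minimal sketch of those conversions, assuming df, cat_vars and contin_vars are the data frame and column-name lists from the notebook:

```python
for c in cat_vars:
    df[c] = df[c].astype('category').cat.as_ordered()   # pandas categorical column

for c in contin_vars:
    df[c] = df[c].astype('float32')                      # PyTorch wants 32-bit floats
```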
00:34:29.640 | I try to do as much of my work as possible on
00:34:35.640 | small data sets
00:34:38.000 | For when I'm working with images that generally means resizing the images to like 64 by 64 or 128 by 128
00:34:45.320 | We can't do that with structured data. So instead I tend to take a sample. So I randomly pick a few rows
00:34:53.080 | So I start running with a sample and I can use exactly the same thing that we've seen before
00:34:57.920 | For getting a validation set we can use the same way to get some random
00:35:02.460 | Random row numbers to use in a random sample. Okay, so this is just a bunch of random numbers
00:35:09.280 | And then okay, so that's going to be a size 150,000 rather than 840,000
00:35:20.840 | And so my data that before I go any further it basically looks like this. You can see I've got some booleans here
00:35:27.880 | I've got some
00:35:29.880 | Integers here of various different scales. There's my year 2014
00:35:35.240 | And I've got some letters here. So even though I said
00:35:39.880 | Please call that a pandas category
00:35:42.880 | Pandas still displays that in the notebook as strings, right?
00:35:49.000 | It's just stored in internally differently
00:35:51.440 | So then the fastai library has a special little function called proc_df (process data frame), and
00:35:57.440 | proc_df takes a data frame and you tell it what's my dependent variable
00:36:03.200 | Right, and it does a few different things
00:36:05.720 | The first thing is it pulls out that dependent variable and puts it into a separate variable
00:36:10.620 | Okay, and deletes it from the original data frame
00:36:13.800 | So df now does not have the sales column in it, whereas y just contains the sales column
00:36:19.880 | Something else that it does is scaling.
00:36:24.600 | so neural nets
00:36:27.040 | Really like to have the input data to all be somewhere around zero with a standard deviation of somewhere around one
00:36:34.920 | all right, so we can always take our data and
00:36:37.200 | Subtract the mean and divide by the standard deviation to make that happen
00:36:43.080 | So that's what do scale equals true does and it actually returns a special object
00:36:47.480 | Which keeps track of what mean and standard deviation it used for that normalizing
00:36:52.560 | So you can then do the same thing to the test set later
00:36:56.180 | It also handles missing values
00:37:01.040 | Missing values in categorical variables just become the ID 0, and then all the other categories become 1, 2, 3, 4, 5 for that
00:37:09.680 | categorical variable
00:37:11.560 | for continuous variables that replaces the
00:37:15.080 | missing value with the median
00:37:18.080 | And creates a new column
00:37:20.400 | That's a Boolean and just says is this missing or not and I'm going to skip over this pretty quickly because we talk about this
00:37:26.120 | In detail in the machine learning course, okay, so if you've got any questions about this part
00:37:30.800 | That would be a good place to go. It's nothing deep learning specific there
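A hedged sketch of the proc_df call being described (the name joined_samp is an assumption based on the notebook): it splits off the dependent variable, numericalises the categories, handles missing values, and with do_scale=True normalises the continuous columns, returning a mapper so the same scaling can be applied to the test set later:

```python
df, y, nas, mapper = proc_df(joined_samp, 'Sales', do_scale=True)
```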
00:37:36.040 | So you can see afterwards year 2014
00:37:39.800 | For example has become year 2 ok because these categorical variables have all been replaced with
00:37:44.560 | With contiguous integers starting at 0
00:37:48.440 | Right and the reason for that is later on we're going to be putting them into a matrix
00:37:53.680 | Right and so we wouldn't want the matrix to be 2014 rows long when it could just be 2 rows long
00:37:59.440 | so that's the basic idea there, and you'll see that the
00:38:05.120 | 'a' and 'c', for example, have been replaced in the same way with 1 and 3
00:38:09.840 | Okay, so we now have a data frame
00:38:14.040 | Which does not contain the dependent variable and where everything is a number okay?
00:38:18.880 | And so that's the that's where we need to get to to do deep learning and all of the stage above that
00:38:24.160 | As I said we talk about in detail in the machine learning course nothing deep learning specific about any of it
00:38:29.860 | This is exactly what we throw into our random forests as well, so
00:38:34.280 | Another
00:38:36.280 | Thing we talk about a lot in the machine learning course of course is validation sets
00:38:40.200 | In this case we need to predict the next two weeks of sales
00:38:45.800 | Right it's not like pick a random set of sales, but we have to pick the next two weeks of sales. That was what the Kaggle
00:38:53.200 | competition folks told us to do
00:38:55.960 | And therefore I'm going to create a validation set which is the last two weeks of
00:39:02.320 | my training set right to try and make it as similar to the test set as possible and
00:39:06.640 | We just posted actually Rachel wrote this thing last week about
00:39:11.480 | Creating validation sets, so if you go to fast.ai you can check it out
00:39:16.320 | We'll put that in the lesson wiki as well
00:39:18.840 | But it's basically a summary of a recent machine learning lesson that we did
00:39:25.180 | The videos are available for that as well, and this is kind of a written a written summary of it
00:39:33.960 | So yeah
00:39:37.160 | So Rachel has spent a lot of time thinking about, kind of, you know
00:39:39.760 | How do you need to think about validation sets and training sets and test sets and so forth and that's all there?
00:39:45.480 | But again, nothing deep learning specific, so let's get straight to the the deep learning action, okay?
00:39:51.400 | so in this particular competition as always with any competition or any kind of
00:39:59.920 | Machine learning project you really need to make sure you have a strong understanding of your metric
00:40:05.280 | How are you going to be judged here and in this case?
00:40:08.400 | You know Kaggle makes it easy, they tell us how we're going to be judged, and so we're going to be judged on the root
00:40:13.080 | mean squared
00:40:14.440 | percentage error
00:40:15.920 | Right so we're going to say like oh you predicted three
00:40:19.180 | It was actually three point three so you were ten percent out
00:40:24.520 | And then we're going to average all those percents right and remember. I warned you that
00:40:30.480 | You are going to need to make sure you know logarithms really well right and so in this case from you know
00:40:37.920 | We're basically being saying your prediction divided by the actual the mean of that
00:40:43.880 | Right is the thing that we care about
00:40:46.880 | and so we don't have a
00:40:52.160 | Metric in Pytorch called root mean squared percent error
00:40:54.880 | We could actually easily create it by the way if you look at the source code
00:41:00.480 | You'll see like it's you know a line of code, but easier still would be to realize that
00:41:05.000 | That if you have
00:41:09.240 | That right then you could replace a with like
00:41:13.320 | Log of a dash and B with like log of B dash
00:41:17.960 | And then you can replace that whole thing with a subtraction
00:41:22.040 | That's just the rule of logs right and so if you don't know that rule
00:41:28.520 | Then you know make sure you go look it up because it's super helpful
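For reference, the rule of logs being pointed to is

$$\log\frac{a}{b} = \log a - \log b$$

so the ratio of prediction to actual turns into a difference of logs, which is roughly what the root mean squared error measures once the data has been log-transformed.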
00:41:31.200 | But it means in this case all we need to do is to
00:41:34.520 | Take the log of our data
00:41:38.640 | which I actually did earlier in this
00:41:41.200 | Notebook, and when you take the log of the data, getting the root mean squared error
00:41:46.120 | Will actually get you the root mean squared percentage error for free, okay?
00:41:50.720 | But then when we want to, like, print out our root mean squared percentage error
00:41:55.440 | We actually have to go e to the power of it
00:41:58.640 | Again, right and then we can actually return the percent difference, so that's all that's going on here
00:42:04.760 | It's again. Not really deep learning specific at all
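A sketch of that metric; the course notebook defines something very much like this (e.g. as exp_rmspe). Predictions and targets are in log space, so we exponentiate back before computing the root mean squared percentage error:

```python
import math
import numpy as np

def exp_rmspe(y_pred, targ):
    targ = np.exp(targ)                        # back from log space to actual sales
    pct_var = (targ - np.exp(y_pred)) / targ   # percentage error per row
    return math.sqrt((pct_var ** 2).mean())
```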
00:42:06.960 | So here we finally get to the deep learning alright, so as per usual like you'll see everything
00:42:15.840 | We look at today looks exactly the same as everything. We've looked at so far. Which is first we create a model data object
00:42:22.620 | Something that has a validation set
00:42:25.400 | Training set an optional test set built into it from that we will get a learner we will then
00:42:32.000 | Optionally call learner.lr_find, we'll then call learner.fit
00:42:37.600 | It'll be all the same parameters and everything that you've seen many times before okay
00:42:42.320 | So the difference though is obviously we're not going to go
00:42:45.560 | ImageClassifierData.from_csv or .from_paths; we need to get some different kind of model data
00:42:53.640 | And so for stuff that is in rows and columns
00:42:56.280 | We use ColumnarModelData
00:42:58.960 | Okay, but this will return an object with basically the same API that you're familiar with and rather than from paths
00:43:06.720 | Or from CSV this is from data frame, okay, so this gets past a few things
00:43:12.520 | The path here is just used for it to know where should it store?
00:43:17.600 | Like model files or stuff like that right this is just basically saying where do you want to store anything that you save later?
00:43:24.160 | This is the list of the indexes of the rows that we want to put in the validation set we created earlier
00:43:30.920 | Here's our data frame
00:43:33.560 | okay, and
00:43:38.040 | Let's have a look here's this is where we did the log right so I talked the
00:43:42.680 | The y that came out of proc_df, our dependent variable, I logged it and I called that yl
00:43:48.180 | Right so we tell it
00:43:50.680 | When we create our model data we need to tell it that's our dependent variable
00:43:54.200 | So so far we've got list of the stuff to go in the validation set which is what's our independent variables?
00:44:00.800 | What's our dependent variables and then we have to tell it which things do we want treated as categorical right?
00:44:07.120 | Because remember by this time
00:44:09.800 | Everything's a number
00:44:14.600 | Right so it could do the whole thing as if it's continuous it would just be totally meaningless
00:44:20.260 | Right so we need to tell it which things do we want to treat as categories and so here we just pass in
00:44:26.640 | That list of names that we use before
00:44:30.960 | okay, and then a bunch of the parameters are the same as the ones you're used to for example you can set the batch size
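A hedged sketch of that call, following the fastai 0.7 API as I recall it; PATH, val_idx, df, yl and cat_vars are assumed from the surrounding notebook:

```python
md = ColumnarModelData.from_data_frame(PATH, val_idx, df,
                                       yl.astype(np.float32),   # logged dependent variable
                                       cat_flds=cat_vars,       # columns to treat as categorical
                                       bs=128)                  # batch size
```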
00:44:37.800 | Yeah, so after we do that. We've got a
00:44:42.180 | You know a standard
00:44:48.400 | Model data object with a trn_dl
00:44:54.840 | Attribute, there's a val_dl attribute, a trn_ds attribute, a val_ds attribute. It's got a length
00:44:54.840 | It's got all this stuff
00:44:56.480 | Exactly like it did in all of our
00:44:59.920 | image based
00:45:01.920 | data objects
00:45:03.840 | Okay, so now we need to create the the model or create the learner and so to skip ahead a little bit
00:45:10.280 | We're basically going to pass in something that looks pretty familiar
00:45:15.120 | We're going to be passing saying from our model from our model data
00:45:18.560 | Create a learner that is suitable for it
00:45:21.560 | And we'll basically be passing in a few other bits of information which will include
00:45:27.640 | How much dropout to use at the very start?
00:45:29.920 | How many activations to have in each layer, how much dropout to use at the later layers
00:45:38.120 | But then there's a couple of extra things that we need to learn about and specifically it's this thing called
00:45:44.560 | embeddings
00:45:47.120 | So this is really the key new concept we have to learn about all right, so
00:45:55.960 | All we're doing basically is we're going to take our
00:45:59.520 | Let's forget about categorical variables for a moment and just think about the continuous variables
00:46:05.920 | For our continuous variables all we're going to do
00:46:09.680 | Is we're going to grab them all
00:46:12.760 | Okay, so for our continuous variables, we're basically going to say like okay, here's a
00:46:22.520 | big list of all of our continuous variables like the minimum temperature and
00:46:26.600 | maximum temperature and the distance to the nearest competitor and so forth right and so here's just a bunch of
00:46:33.480 | floating point numbers and so basically what the neural nets going to do is it's going to take that that 1d array or
00:46:40.120 | Or vector or to be very DL like
00:46:45.520 | rank 1 tensor
00:46:48.200 | All means the same thing okay, so we're going to take our rank 1 tensor
00:46:51.680 | And let's put it through a matrix multiplication, so let's say this has got like I don't know 20
00:46:57.880 | continuous variables, and then we can put it through a matrix which
00:47:03.160 | Must have 20 rows; that's how matrix multiplication works, and then we can decide how many columns we want, right
00:47:10.400 | So maybe we decided 100 right and so that matrix multiplication is going to spit out a new
00:47:15.800 | length 100
00:47:19.040 | rank 1 tensor
00:47:20.800 | Okay, that's what a matrix product does, and that's the definition of a linear layer
00:47:28.080 | in deep learning
00:47:30.160 | Okay, and so then the next thing we do is we can put that through a relu right which means we throw away the negatives
00:47:37.200 | Okay, and now we can put that through another matrix product. Okay, so this is going to have to have a hundred rows by definition
00:47:45.100 | And we can have as many columns as we like and so let's say maybe this was
00:47:50.840 | The last layer so the next thing we're trying to do is to predict sales
00:47:55.040 | So there's just one
00:47:57.720 | value, we're trying to predict the sales so we could put it through a
00:48:00.400 | Matrix product that just had one column and that's going to spit out a single number
00:48:05.520 | All right, so that's like
00:48:08.280 | That's kind of like a one layer
00:48:11.440 | Neural net if you like now in practice, you know we wouldn't make it one layer
00:48:18.440 | so we'd actually have like
00:48:20.440 | You know, maybe we'd have 50 here and so then that gives us a 50 long vector and
00:48:32.800 | Maybe we then put that into our final
00:48:35.020 | 50 by 1
00:48:38.760 | And that spits out a single number and one reason I wanted to change that there was to point out, you know, relu
00:48:44.920 | You would never put relu in the last layer
00:48:48.240 | Like you'd never want to throw away the negatives, because the softmax
00:48:57.320 | Needs negatives in it because it's the negatives that are the things that allow it to create low probabilities
00:49:02.820 | That's minor detail, but it's useful to remember. Okay, so basically
00:49:08.120 | So basically a
00:49:16.240 | simple view of a
00:49:18.240 | Fully connected neural net is something that takes in as an input a rank one tensor
00:49:29.720 | It puts it through a linear layer, an
00:49:29.720 | Activation layer another linear layer
00:49:34.680 | Softmax and
00:49:38.400 | That's the output
00:49:41.960 | Okay, and so we could obviously decide to add more
00:49:46.920 | Linear layers we could decide maybe to add dropout
00:49:51.000 | Right. So these are some of the decisions that we we get to make right but we there's not that much we can do
00:49:58.800 | Right. There's not much really crazy architecture stuff to do. So when we come back to
00:50:03.540 | Image models later in the course
00:50:06.520 | We're going to learn about all the weird things that go on and like res nets and inception networks and blah blah blah
00:50:12.100 | But in these fully connected networks, they're really pretty simple. They're just interspersed
00:50:16.600 | linear layers that is matrix products and
00:50:19.580 | Activation functions like relu, and a softmax at the end
00:50:24.680 | And if it's not classification which actually ours is not classification in this case. We're trying to predict sales
00:50:31.900 | There isn't even a softmax
00:50:34.420 | Right, we don't want it to be between 0 and 1
00:50:37.780 | Okay, so we can just throw away the last activation all together
00:50:41.580 | If we have time we can talk about a slight trick we can do there but for now we can think of it that way
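A minimal PyTorch sketch of the fully connected net just drawn, for 20 continuous inputs: linear, relu, linear, relu, then a single output with no final activation, because we are predicting a raw sales number rather than class probabilities:

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(20, 100),
    nn.ReLU(),
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Linear(50, 1),
)

x = torch.randn(4, 20)   # a batch of 4 rows, each with 20 continuous variables
print(model(x).shape)    # torch.Size([4, 1])
```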
00:50:48.940 | So that was all assuming that everything was continuous, right? But what about categorical, right? So we've got like
00:50:57.360 | Day of week
00:51:01.500 | right and
00:51:04.500 | We're going to treat it as categorical, right? So it's like Saturday Sunday Monday
00:51:15.220 | Friday
00:51:16.760 | okay, how do we feed that in because I want to find a way of getting that in so that we still end up with a
00:51:22.940 | rank one tensor of floats and
00:51:24.940 | so the trick is this we create a new little matrix of
00:51:29.240 | With seven rows
00:51:33.700 | And as many columns as we choose right so let's pick four right so here's our
00:51:40.860 | Seven rows and
00:51:44.380 | four columns
00:51:46.900 | Right and basically what we do is let's add our categorical variables to the end. So let's say the first row was Sunday
00:51:55.380 | Right then what we do is we do a lookup into this matrix and we say oh here's Sunday
00:52:01.660 | We do a lookup into here and we grab
00:52:04.340 | This row and so this matrix we basically fill with floating point numbers. So we're going to end up grabbing a
00:52:11.540 | little
00:52:14.020 | Subset of four floating point numbers; Sunday's particular four floating point numbers
00:52:20.720 | And so that way we convert
00:52:23.500 | Sunday
00:52:25.740 | Into a rank one tensor of four floating point numbers and initially those four numbers are random
00:52:33.080 | Right and in fact this whole thing we initially start out
00:52:37.100 | random, okay
00:52:40.020 | But then we're going to put that through our neural net, right?
00:52:44.260 | So we basically then take those four numbers and we remove Sunday instead we add
00:52:49.660 | Our four numbers on here, right? So we've turned our categorical thing into a floating point vector
00:52:56.360 | Right and so now we can just put that through our neural net
00:53:00.100 | just like before and at the very end we find out the loss and
00:53:04.300 | then we can figure out which direction is down and
00:53:08.180 | Do gradient descent in that direction and eventually that will find its way back
00:53:12.940 | To this little list of four numbers and it'll say okay those random numbers weren't very good
00:53:18.620 | This one needs to go up a bit that one needs to go up a bit that one needs to go down a bit
00:53:22.660 | That one needs to go up a bit and so we'll actually update
00:53:25.260 | our original those four numbers in that matrix and
00:53:29.340 | We'll do this again and again and again
00:53:31.780 | And so this this matrix will stop looking random and it will start looking more and more like like
00:53:37.660 | The exact four numbers that happen to work best for Sunday the exact four numbers that happen to work best for Friday and so forth
00:53:45.700 | And so in other words this matrix is just another bunch of weights
00:53:51.000 | in our neural net
00:53:53.780 | All right, and so matrices of this type are called
00:53:57.180 | embedding matrices
00:54:00.540 | So an embedding matrix is something where we start out with an
00:54:10.100 | integer between zero and the maximum number of levels of that category
00:54:15.420 | We literally index into a matrix to find our particular row
00:54:20.460 | So if it was the level was one we take the first row
00:54:24.420 | we grab that row and
00:54:27.340 | we append it to all of our continuous variables and
00:54:31.100 | So we now have a new
00:54:35.020 | Vector of continuous variables and when we can do the same thing for let's say zip code
00:54:39.540 | Right, so we could like have an embedding matrix. Let's say there are 5,000 zip codes
00:54:45.260 | It would be 5,000 rows long as wide as we decide maybe it's 50 wide and so we'd say okay. Here's
00:54:52.140 | nine four zero zero three
00:54:54.860 | That zip code is index number four in our matrix
00:54:58.560 | So go down and we find the fourth row regret those 50 numbers and append those
00:55:03.900 | Onto our big vector and then everything after that is just the same. We just put it through a linear layer value linear layer, whatever
00:55:10.460 | What are those four numbers
00:55:15.180 | Represent that's a great question and we'll learn more about that when we look at collaborative filtering for now
00:55:21.860 | They represent no more or no less than any other parameter in our neural net, you know, they're just
00:55:28.900 | They're just parameters that we're learning that happen to end up giving us
00:55:33.980 | a good loss
00:55:35.780 | We will discover later that these particular parameters often
00:55:39.260 | However, are human interpretable and can be quite interesting, but that's a side effect of them. It's not
00:55:45.660 | Fundamental they're just four random numbers for now that we're that we're learning or sets of four random numbers
00:55:52.940 | Do you have a good heuristic for the dimensionality of the embedding matrix? So why four here?
00:56:02.660 | sure do
00:56:10.940 | What I first of all did was I made a little list of every categorical variable and its cardinality
00:56:17.460 | Okay, so there's a thousand-plus different stores
00:56:23.460 | apparently in Rossman's network
00:56:26.620 | There are eight days of the week
00:56:28.740 | That's because there are seven days of the week plus one left over for unknown
00:56:32.700 | Even if there were no missing values in the original data
00:56:36.060 | I always still set aside one just in case there's a missing or an unknown or something different in the test set
00:56:41.900 | Again, four years, but that's actually three plus room for an unknown, and so forth. Alright, so what I do
00:56:49.380 | My rule of thumb is this
00:56:52.300 | Take the cardinality of the variable
00:56:57.660 | Divide it by two
00:56:59.660 | But don't make it bigger than 50
00:57:01.700 | Okay, so
00:57:04.700 | These are my embedding matrices. So my store matrix. So the that has to have a
00:57:10.140 | thousand one hundred and sixteen rows, because I need to look up, right, to find, oh here's store number three, and then it's going to return back a
00:57:18.380 | Rank one tensor of length 50
00:57:21.940 | Day of week it's going to look up into which one of the eight and return the thing of length four
00:57:28.400 | So would you typically build an embedding matrix for each categorical feature? Yes. Yeah, so that's what I've done here
00:57:38.300 | So I've said
00:57:40.300 | For C in categorical variables
00:57:44.140 | See how many categories there are and
00:57:49.260 | then for each of those things
00:57:52.140 | create one of these and
00:57:55.260 | Then this is called embedding sizes
00:57:57.380 | And then you may have noticed that that's actually the first thing that we pass to get learner
00:58:03.260 | And so that tells it for every categorical variable. That's the embedding matrix to use for that variable
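Roughly, the loop being described looks like this (a sketch; `df` and `cat_vars` are assumed names for the pandas data frame and the list of categorical column names, and the columns are assumed to already be pandas category dtype):

```python
# For each categorical variable, record its cardinality plus one for an "unknown" level,
# then apply the rule of thumb: embedding width is the cardinality over two, capped at 50.
cat_sz  = [(c, len(df[c].cat.categories) + 1) for c in cat_vars]
emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]
# emb_szs is then the first thing passed to get_learner
```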
00:58:09.660 | There's a question behind you, isn't there? Yes
00:58:12.420 | So besides
00:58:17.060 | Random initialization are there other ways to actually initialize embedding?
00:58:21.000 | Yes or no, there's two ways one is random the other is pre-trained and
00:58:28.460 | We'll probably talk about pre-trained more later in the course
00:58:32.060 | But the basic idea though is if somebody else at Rossman had already trained a neural net
00:58:36.280 | just like you would use a pre-trained net from ImageNet to look at pictures of cats and dogs, if
00:58:42.300 | Somebody else has pre-trained a network to predict cheese sales in Rossman
00:58:47.200 | You may as well start with their embedding matrix of stores to predict liquor sales in Rossman
00:58:52.680 | And this is what happens for example at
00:58:55.280 | At Pinterest and Instacart they both use this technique Instacart uses it for routing their shoppers
00:59:03.200 | Pinterest uses it for deciding what to display on a web page when you go there and they have
00:59:08.920 | embedding matrices of products
00:59:12.000 | In Instacart's case of stores that get shared in the organization so people don't have to train new ones
00:59:19.800 | So for the embedding size
00:59:25.760 | Why wouldn't you just use like the one hot scheme and just
00:59:31.400 | Well, what is the advantage of doing this?
00:59:34.280 | As opposed to just doing one-hot? So we could easily, as you point out, have
00:59:41.600 | Instead of passing in these four numbers. We could instead have passed in seven numbers
00:59:47.640 | all zeros, but one of them is a one and that also is a list of floats and
00:59:53.280 | That would totally work
00:59:56.800 | and that's how
00:59:58.960 | Generally speaking categorical variables have been used in statistics for many years. It's called dummy variable coding
01:00:06.440 | The problem is that in that case?
01:00:10.520 | the concept of Sunday
01:00:12.520 | Could only ever be associated with a single floating-point number
01:00:16.840 | Right, and so it basically gets this kind of linear behavior. It says like Sunday is more or less of a single thing
01:00:25.960 | Yeah, well, it's not just interactions. It's saying like now Sunday is a concept in four-dimensional space
01:00:32.200 | Right. And so what we tend to find happen is that these
01:00:37.440 | Embedding vectors tend to get these kind of rich semantic concepts. So for example
01:00:43.760 | if it turns out that
01:00:46.480 | Weekends
01:00:49.560 | Kind of have a different behavior
01:00:51.080 | You'll tend to see that Saturday and Sunday will have like some particular number higher or more likely
01:00:57.320 | it turns out that certain days of the week are associated with higher sales of
01:01:07.000 | Certain kinds of goods that you kind of can't go without I don't know like gas or milk say
01:01:12.640 | Whereas there might be other products
01:01:17.240 | like wine, for example
01:01:19.240 | Like wine that tend to be associated with like the days before weekends or holidays, right? So there might be kind of a column
01:01:29.160 | which is like
01:01:31.440 | To what extent is this day of the week?
01:01:35.200 | Kind of associated with people going out
01:01:37.800 | You know, so basically yeah by by having this higher dimensionality vector rather than just a single number
01:01:45.280 | It gives the deep learning
01:01:47.560 | Network a chance to learn these rich
01:01:50.480 | Representations and so this idea of an embedding is actually what's called a distributed representation
01:01:58.600 | It's kind of the most fundamental concept of neural networks
01:02:02.440 | It's this idea that a concept in a neural network has a kind of a high dimensional
01:02:08.960 | Representation and often it can be hard to interpret because the idea is like each of these
01:02:14.720 | Numbers in this vector doesn't even have to have just one meaning
01:02:18.480 | You know
01:02:19.200 | It could mean one thing if this is low and that one's high and something else if that one's high and that one's low
01:02:23.640 | Because it's going through this kind of rich nonlinear
01:02:26.880 | Function right and so it's this
01:02:30.920 | It's this rich representation that allows it to learn such interesting
01:02:37.080 | Relationships
01:02:40.520 | Oh, another question. Sure. I'll speak louder. So
01:02:46.200 | I get the fundamentals of embeddings, like the word2vec vector algebra
01:02:55.040 | You can run on this, but are embeddings suitable for certain types of variables?
01:03:00.640 | Like, are these only suitable for some?
01:03:03.800 | Are there different categories that the embeddings are suitable for? An embedding is suitable for any categorical variable
01:03:11.120 | Okay, so so the only thing it it can't really work
01:03:16.120 | Well at all for would be something that is too high cardinality
01:03:19.880 | So like in other words, we had, whatever it was, 600,000 rows; if you had a variable with 600,000 levels
01:03:26.640 | That's just not a useful
01:03:30.360 | categorical variable you could bucketize it I guess
01:03:33.660 | But yeah in general like you can see here that the third place getters in this competition
01:03:39.560 | Really decided that everything that was not too high cardinality
01:03:45.880 | They put them all as categorical variables and I think that's a good rule of thumb
01:03:49.320 | You know if you can make it a categorical variable you may as well because that way it can learn this rich distributed representation
01:03:57.080 | Or else if you leave it as continuous, you know, the most it can do is to kind of try and find a
01:04:02.520 | You know a single functional form that fits it well
01:04:05.560 | Another question, so
01:04:09.080 | You were saying that you are kind of increasing the dimension
01:04:12.960 | But actually in most cases we would use a one-hot encoding which has an even bigger dimension
01:04:19.520 | Than that, so in a way you are also
01:04:23.240 | Reducing, but in a more rich way. I think that's fair. Yeah. Yeah, it's like
01:04:28.240 | Yes, you know you can think of it as one hot encoding which actually is high dimensional, but it's not
01:04:34.800 | Meaningfully high dimensional because everything except one is zero
01:04:38.200 | I'm saying that also because even this will reduce the amount of memory and things like this that you have to write
01:04:43.680 | This is better. You're absolutely right. Absolutely, right?
01:04:46.760 | And and so we may as well go ahead and actually describe like what's going on with the matrix algebra behind the scenes
01:04:52.920 | It this if this doesn't quite make sense you can kind of skip over it
01:04:56.600 | But for some people I know this really helps if we started out with something saying this is Sunday
01:05:03.280 | right
01:05:05.320 | we could represent this as a one hot encoded vector right and so
01:05:09.640 | Sunday, you know, maybe was positioned here. So that would be a one and then the rest of zeros
01:05:16.560 | Okay, and then we've got our
01:05:22.360 | Embedding matrix right with eight rows and in this case four columns
01:05:28.540 | One way to think of this actually is a matrix product
01:05:35.840 | Right, so I said you could think of this as like looking up the number one, you know and finding like its index in the array
01:05:44.820 | But if you think about it, that's actually
01:05:48.000 | identical to doing a matrix product between a one hot encoded vector and
01:05:53.080 | The embedding matrix like you're going to go zero times this row one times this row zero times this row
01:06:02.040 | And so it's like a one hot embedding matrix product is identical
01:06:06.720 | to doing a lookup and so
01:06:09.680 | Some people in the bad old days actually implemented embedding
01:06:16.200 | Matrices by doing a one hot encoding and then a matrix product and in fact a lot of like machine learning
01:06:22.200 | methods still kind of do that
01:06:24.560 | But as you were kind of alluding to, that's terribly inefficient. So all of the modern
01:06:31.660 | Libraries implement this as taken take an integer and do a lookup into an array
01:06:37.040 | But the nice thing about realizing that it's actually a matrix product
01:06:40.400 | Mathematically is it makes it more obvious?
01:06:43.320 | How the gradients are going to flow so when we do stochastic gradient descent, it's we can think of it as just another
01:06:50.060 | Linear layer. Okay, so as I say, that's like a somewhat minor detail, but hopefully for some of you it helps
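If it helps to see that equivalence concretely, here is a tiny check (a sketch, not part of the lesson notebook): multiplying a one-hot vector by the embedding weight matrix returns exactly the same row as indexing into it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

emb = nn.Embedding(8, 4)                            # 8 levels, 4-dimensional embedding

idx = torch.tensor([1])                             # say Sunday is level 1
one_hot = F.one_hot(idx, num_classes=8).float()     # shape (1, 8): all zeros except one

by_matmul = one_hot @ emb.weight                    # matrix product with the embedding matrix
by_lookup = emb(idx)                                # direct array lookup by integer

print(torch.allclose(by_matmul, by_lookup))         # True: the two are identical
```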
01:06:56.680 | Could you touch on using dates and times as categoricals how that affects seasonality? Yeah, absolutely. That's a great question
01:07:06.360 | Did I cover dates at all last week?
01:07:09.800 | No, okay
01:07:13.680 | So I covered dates in a lot of detail in the machine learning course, but it's worth briefly mentioning here
01:07:19.120 | There's a fast AI function called add date part
01:07:26.920 | Which takes a data frame and a column name
01:07:30.640 | That column name needs to be a date
01:07:33.920 | It removes unless you've got drop equals false
01:07:37.800 | It optionally removes the column from the data frame and replaces it with lots of columns
01:07:43.560 | representing all of the useful information about that date like
01:07:47.680 | Day of week, day of month, month of year, year, is it the start of a quarter,
01:07:52.600 | Is it the end of a quarter, basically everything that pandas
01:07:55.220 | gives us
01:07:57.480 | And so that way we end up
01:08:00.200 | When we look at our list of features where you can see them here, right?
01:08:05.840 | Year, month, week, day, day of week, etc. So these all get created for us by add_datepart
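Usage looks roughly like this (a sketch based on the fast.ai 0.7-era library used in this lesson; the import path and the toy data frame are assumptions):

```python
import pandas as pd
from fastai.structured import add_datepart   # fast.ai 0.7-era module (assumed path)

df = pd.DataFrame({'Date': pd.to_datetime(['2015-07-31', '2015-08-01'])})

# Replaces the 'Date' column (unless drop=False) with lots of derived columns:
# Year, Month, Week, Day, Dayofweek, Is_quarter_start, Is_quarter_end, Elapsed, ...
add_datepart(df, 'Date', drop=True)
print(df.columns.tolist())
```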
01:08:11.500 | so we end up with
01:08:14.680 | you know this
01:08:17.120 | Eight long embedding
01:08:20.720 | Matrix, so I guess eight rows by four column embedding matrix for day of week and
01:08:26.800 | Conceptually that allows our model to create some pretty interesting time series models
01:08:34.920 | Right like it can if there's something that has a
01:08:37.760 | seven-day period cycle
01:08:40.840 | That kind of goes up on Mondays and down on Wednesdays, but only for dairy and only in Berlin
01:08:47.040 | It can totally do that, but it has all the information it needs
01:08:51.020 | to do that
01:08:53.320 | So this turns out to be a really fantastic way to deal with time series
01:08:57.960 | So I'm really glad you asked the question you just need to make sure that
01:09:02.560 | That the the cycle indicator in your time series exists as a column
01:09:07.800 | So if you didn't have a column there called day of week
01:09:11.280 | it would be very very difficult for the neural network to somehow learn to do like a
01:09:17.060 | Divide mod 7 and then somehow look that up in an embedding matrix
01:09:20.960 | I guess not impossible, but really hard; it would use lots of computation and wouldn't do it very well
01:09:26.720 | So an example of the kind of thing that you need to think about might be
01:09:32.360 | Holidays for example, you know, or if you were doing something in in, you know of sales of
01:09:39.560 | Beverages in San Francisco
01:09:41.840 | You probably want a list of like when when are the when is the ball game on at AT&T Park?
01:09:47.120 | All right, because that's going to impact how many people are drinking beer in SoMa
01:09:51.660 | all right, so you need to make sure that the kind of the basic indicators or
01:09:57.120 | Periodicities or whatever are there in your data and as long as they are the neural nets going to learn to use them
01:10:03.200 | So I'm kind of trying to skip over some of the non-deep learning parts
01:10:08.320 | All right, so
01:10:13.320 | The key thing here is that we've got our model data that came from the data frame
01:10:17.560 | We tell it how big to make the embedding matrices
01:10:21.260 | We also have to tell it of the columns in that data frame
01:10:27.360 | How many of those?
01:10:29.360 | Categorical variables or how many of them are continuous variables. So the actual parameter is number of continuous variables
01:10:36.560 | So you can hear you can see we just pass in how many columns are there minus how many categorical variables are there?
01:10:43.000 | so then that way the
01:10:45.120 | The neural net knows how to create something that puts the continuous variables over here and the categorical variables over there
01:10:54.480 | The embedding matrix has its own dropout
01:10:57.680 | All right. So this is the dropout applied to the embedding matrix
01:11:01.280 | This is the number of activations in the first linear layer the number of activations in the second linear layer
01:11:07.280 | The dropout in the first linear layer the dropout for the second linear layer
01:11:11.840 | This bit we won't worry about for now and then finally is how many outputs do we want to create?
01:11:16.880 | Okay, so this is the output of the last linear layer and obviously it's one because we want to predict a single number
01:11:23.560 | Which is sales?
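Reading those arguments off, the call looks roughly like this (following the lesson notebook; the particular numbers should be treated as a reasonable starting point rather than the one true setting; `md`, `df`, `cat_vars` and `emb_szs` are names from earlier, and `y_range` stands for the bit we won't worry about for now and is assumed to be defined in the notebook):

```python
m = md.get_learner(
    emb_szs,                           # one (cardinality, width) pair per categorical variable
    len(df.columns) - len(cat_vars),   # how many of the columns are continuous variables
    0.04,                              # dropout applied to the embedding matrices
    1,                                 # output of the last linear layer: a single number, sales
    [1000, 500],                       # activations in the first and second linear layers
    [0.001, 0.01],                     # dropout for the first and second linear layers
    y_range=y_range,                   # the bit we won't worry about for now
)
```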
01:11:26.680 | So after that we now have a learner where we can call lr_find and we get the standard-looking shape and we can see
01:11:33.680 | what learning rate we want to use and
01:11:35.880 | we can then go ahead and
01:11:38.680 | Start training using exactly the same API. We've seen before
01:11:44.120 | So this is all identical
01:11:46.920 | You can pass in I'm not sure if you've seen this before
01:11:51.000 | Custom metrics what this does is it just says please print out a number at the end of every epoch by calling
01:11:57.560 | this function and this is a function we defined a little bit earlier, which was the
01:12:02.120 | Root mean squared percentage error, first of all taking e to the power of our
01:12:07.320 | Sales because our sales were originally logged. So this doesn't change the training at all
01:12:14.800 | It just it's just something to print out
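That metric looks roughly like this (a sketch of the exp_rmspe function from the lesson notebook): it takes e to the power of the predictions and the targets, because the sales were logged before training, and then computes root mean squared percentage error. It is only printed at the end of each epoch and does not change the training.

```python
import math
import numpy as np

def exp_rmspe(y_pred, targ):
    """Root mean squared percentage error on the un-logged sales."""
    targ, y_pred = np.exp(targ), np.exp(y_pred)   # undo the log taken on sales
    pct_var = (targ - y_pred) / targ
    return math.sqrt((pct_var ** 2).mean())

# Passed to fit so it gets printed each epoch, e.g. m.fit(lr, 3, metrics=[exp_rmspe])
```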
01:12:16.840 | So we train that for a while
01:12:20.920 | And you know, we've got some benefits that the original people that built this don't have specifically we've got things like
01:12:29.280 | Cyclical not cyclic learning rate stochastic gradient descent with restarts. And so it's actually interesting to have a look and compare
01:12:37.400 | Although our validation set isn't identical to the test set it's very similar
01:12:45.720 | It's a two-week period that is at the end of the training data
01:12:49.880 | so our numbers should be similar and if we look at what we get 0.097 and compare that to the
01:12:57.360 | Leaderboard public leaderboard
01:13:00.880 | You can see we're kind of
01:13:07.520 | Let's have a look in the top actually that's interesting
01:13:13.960 | There's a big difference between the public and private leaderboards. It would have
01:13:19.960 | Been right at the top of the private leaderboard
01:13:22.280 | But only in the top 30 or 40 on the public leaderboard. So not quite sure but you can see like we're certainly in
01:13:28.400 | the top end of this competition I
01:13:33.200 | actually tried running the third place getters code and
01:13:38.120 | Their final result was over 0.1. So I actually think that we should be compared to the private leaderboard
01:13:48.840 | So anyway, so you can see there basically there's a technique for dealing with time series and
01:13:55.600 | Structured data and you know, interestingly the group that that used this technique. They actually wrote a paper about it. That's linked in this notebook
01:14:04.640 | When you compare it to the folks that won this competition and came second
01:14:11.560 | They did the other folks did way more feature engineering like the winners of this competition were actually
01:14:19.000 | subject matter experts in logistics sales forecasting and so they had their own like code to create lots and lots of features and
01:14:27.400 | Talking to the folks at Pinterest who built their very similar model for recommendations of Pinterest
01:14:33.400 | They said the same thing which is that when they switched from gradient boosting machines to deep learning
01:14:38.880 | They did like way way way less
01:14:41.760 | Feature engineering it was a much much simpler model and requires much less maintenance
01:14:48.440 | And so this is like one of the big benefits of using this approach to deep learning. We can get state-of-the-art results
01:14:54.400 | But with a lot less work
01:15:00.080 | Are we using any time series in any of these fits
01:15:06.960 | indirectly
01:15:10.280 | Absolutely using what we just saw we have day of week month of year all that stuff columns
01:15:17.200 | And most of them are being treated as categories. So we're building a distributed representation of January
01:15:23.040 | We're building a distributed representation of Sunday. We're building a distributed representation of Christmas. So we're not using any
01:15:30.720 | Classic time series techniques all we're doing is
01:15:35.760 | two fully connected layers in a neural net
01:15:40.360 | Embedded matrix, that's what
01:15:42.960 | Exactly. Exactly. Yes. So the embedding matrix is able to deal with this stuff like
01:15:48.400 | Day of week periodicity and so forth in a
01:15:52.480 | Richer way than any
01:15:55.800 | Standard time series technique I've ever come across
01:15:58.400 | one last question
01:16:01.280 | The metrics, in the earlier models when we did the CNN, we did not pass them during the fit
01:16:08.480 | We passed it when the data was
01:16:10.640 | When we got the data, so we're not passing
01:16:15.120 | Anything to fit just the learning rate and the number of cycles
01:16:19.000 | In this case we're passing in metrics because we want to print out some extra stuff
01:16:22.800 | There is a difference in that we're calling data dot get learner. So with
01:16:29.480 | The imaging approach
01:16:34.600 | We just go learner dot pretrained and pass it the data
01:16:40.680 | In for these kinds of models in fact for a lot of the models the model that we build
01:16:46.680 | Depends on the data in this case. We actually need to know like
01:16:50.760 | What embedding matrices do we have?
01:16:53.400 | And stuff like that. So in this case, it's actually the data object that creates the learner
01:16:59.200 | So yeah, it is it is a bit upside down to what we've seen before
01:17:04.440 | So just to summarize or maybe I'm confused
01:17:09.920 | So in this case what we are doing is that we have some kind of a structured data
01:17:16.400 | We did feature engineering
01:17:18.400 | We got some columns in a database or some things in a pandas data frame
01:17:25.320 | Yeah data frame and then we are mapping it to deep learning by using this
01:17:33.060 | Embedding matrix for the categorical variables. So the continuous we just put them straight in
01:17:38.580 | So all I need to do is like if I have a if I have already have a feature engineering model
01:17:46.100 | Yeah, then to map it to deep learning. I just have to figure out which one I can move in to categorical and then
01:17:52.560 | Yeah, great question. So yes, exactly if you want to use this on your own data set
01:17:59.900 | Step one is list the categorical variable names list the continuous variable names
01:18:05.040 | Put it in a data frame pandas data frame
01:18:08.920 | Step two is to
01:18:12.580 | Create a list of which row indexes you want in your validation set
01:18:18.700 | step three
01:18:21.980 | Is to call this line of code using this exact like these exact you can just copy and paste it
01:18:29.600 | step four is to create your list of how big you want each embedding matrix to be and
01:18:35.760 | Then step five is to call get learner
01:18:39.560 | You can use these exact parameters to start with
01:18:42.880 | And if it over fits or under fits you can fiddle with them and then the final step is to call
01:18:49.320 | Fit so yeah, almost all of this code will be nearly identical
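Summarizing those steps as a rough code skeleton (fast.ai 0.7-style, following the lesson notebook; the column names, validation split and hyperparameter values are placeholders for your own data, and the exact argument list of from_data_frame may differ slightly between library versions):

```python
# 1. List the categorical and continuous variable names; put the data in a pandas DataFrame df.
cat_vars  = ['Store', 'DayOfWeek', 'Year']           # hypothetical column names
cont_vars = ['CompetitionDistance', 'Temperature']   # hypothetical column names

# 2. Choose which row indexes form the validation set (e.g. the last couple of weeks).
val_idx = list(range(len(df) - 5000, len(df)))

# 3. Build the model data object from the data frame (y is the dependent variable array).
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, y, cat_flds=cat_vars, bs=128)

# 4. Decide how big you want each embedding matrix to be.
cat_sz  = [(c, len(df[c].cat.categories) + 1) for c in cat_vars]
emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]

# 5. Create the learner; start with parameters like these, fiddle if it over- or under-fits.
m = md.get_learner(emb_szs, len(cont_vars), 0.04, 1, [1000, 500], [0.001, 0.01])

# 6. Fit.
m.fit(1e-3, 3, metrics=[exp_rmspe])
```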
01:18:56.140 | Have a couple of questions one is
01:19:02.540 | How is data augmentation can be used in this case and the second one is?
01:19:09.340 | Why what are dropouts doing in here? Okay, so data augmentation I have no idea. I mean, that's a really interesting question. I
01:19:21.300 | Think it's got to be domain specific. I've never seen any paper or anybody in industry doing data augmentation with structured data and deep learning
01:19:28.220 | So I don't I think it can be done. I just haven't seen it done
01:19:32.060 | What is dropout doing?
01:19:35.140 | Exactly the same as before so at each point
01:19:39.460 | we have
01:19:42.380 | The output of each of these linear layers is just a
01:19:49.380 | Rank one tensor and so dropout is going to go ahead and say let's throw away half of the activations
01:19:56.420 | and the very first dropout embedding dropout literally goes through the embedding matrix and says
01:20:03.460 | Let's throw away half the activations
01:20:07.980 | That's it
01:20:11.860 | Okay, let's take a break and let's come back at a 5 past 8
01:20:16.980 | Okay, thanks everybody
01:20:18.980 | So now
01:20:29.940 | We're going to move into something
01:20:32.660 | Equally exciting actually before I do I just mention that I had a good question during the break which was
01:20:40.260 | What's the downside like?
01:20:44.340 | Like look almost no one's using this
01:20:46.860 | Why not
01:20:50.580 | And and basically I think the answer is like as we discussed before
01:20:54.660 | No one in academia almost is working on this because it's not something that people really publish on
01:21:00.260 | And as a result there haven't been really great examples where people could look at and say oh, here's a technique that works
01:21:08.660 | Well, so let's have our company implemented
01:21:12.020 | But perhaps equally importantly
01:21:14.020 | Until now with this fast AI library. There hasn't been any
01:21:18.500 | Way to to do it conveniently if you wanted to implement one of these models
01:21:24.380 | You had to write all the custom code
01:21:27.100 | Yourself or else now as we discussed. It's you know six
01:21:31.900 | It's basically a six step process, you know involving about you know, not much more than six lines of code
01:21:41.340 | So the reason I mentioned this is to say like I think there are a lot of big
01:21:46.420 | commercial and scientific
01:21:49.100 | opportunities to use this to solve problems that previously haven't been solved very well before
01:21:55.860 | So like I'll be really interested to hear if some of you
01:22:00.220 | Try this out, you know, maybe on like
01:22:03.780 | Old Kaggle competitions you might find like oh I would have won this if I'd use this technique
01:22:09.420 | That would be interesting or if you've got some data set you work with at work
01:22:13.900 | You know some kind of predictive model that you've been doing with a GBM or a random forest. Does this help?
01:22:18.700 | You know the thing I I'm still somewhat new to this I've been doing this for
01:22:26.220 | Basically since the start of the year was when I started working on these structured deep learning models
01:22:31.860 | So I haven't had enough opportunity to know
01:22:35.540 | Where might it fail? It's worked for nearly everything. I've tried it with so far
01:22:39.480 | But yeah, I think this class is the first time that
01:22:44.700 | There's going to be like more than half a dozen people in the world who actually are working on this
01:22:50.220 | So I think you know as a group we're going to hopefully learn a lot and build some interesting things
01:22:55.120 | and this would be a great thing if you're thinking of writing a post about something or here's an area that
01:23:01.420 | There's a couple of that. There's a post from Instacart about what they did
01:23:05.260 | Pinterest has a
01:23:08.340 | O'Reilly AI video about what they did, that's about it, and there's two academic papers
01:23:13.860 | Both about Kaggle competition victories, one from Yoshua Bengio and his group; they won a taxi
01:23:23.300 | Destination forecasting competition and then also the one linked
01:23:28.300 | for this Rossman competition
01:23:32.540 | Yeah, there's some background on that all right
01:23:34.540 | so language
01:23:37.380 | natural language processing
01:23:39.900 | is the area which
01:23:42.900 | Is kind of like the most up-and-coming area of deep learning. It's kind of like two or three years behind
01:23:49.820 | Computer vision in deep learning it was kind of like the the second area that deep learning started getting really popular in and
01:23:59.340 | You know computer vision
01:24:01.340 | Got to the point where it was like the clear state-of-the-art
01:24:04.700 | For most computer vision things maybe in like 2014, you know and in some things in like 2012
01:24:11.380 | In NLP, we're still at the point where
01:24:14.740 | For a lot of things deep learning is now the state of the art, but not quite everything
01:24:19.580 | but as you'll see the state of kind of
01:24:23.820 | The software and some of the concepts is much less mature than it is for computer vision
01:24:30.340 | So in general none of the stuff we talk about after computer vision is going to be as like
01:24:36.620 | Settled as the computer vision stuff was so NLP
01:24:40.980 | One of the interesting things is in the last few months
01:24:43.980 | Some of the good ideas from computer vision have started to spread into NLP for the first time and we've seen some really big
01:24:51.180 | Advances so a lot of the stuff you'll see in NLP is is pretty new
01:24:54.920 | So I'm going to start with a particular
01:24:58.900 | Kind of NLP problem and one of the things you'll find in NLP
01:25:03.780 | It's like there are particular problems you can solve and they have particular names
01:25:07.580 | and so there's a particular kind of problem in NLP called language modeling and
01:25:12.020 | Language modeling has a very specific definition. It means build a model where given a
01:25:18.740 | Few words of a sentence. Can you predict what the next word is going to be?
01:25:23.140 | So if you're using your mobile phone and you're typing away and you press space and then it says like this is what the next
01:25:30.700 | Word might be like SwiftKey does this like really well and SwiftKey actually uses deep learning for this
01:25:36.620 | That's that's a language model. Okay, so it has a very specific meaning when we say language modeling
01:25:42.980 | We mean a model that can predict the next word of a sentence
01:25:47.980 | So let me give you an example. I
01:25:49.980 | downloaded
01:25:51.820 | about 18 months worth of
01:25:53.820 | Papers from archive. So for those of you that don't know it archive is
01:25:59.100 | The most popular pre-print server in this community and various others
01:26:05.060 | And has you know, lots of academic papers
01:26:08.220 | and so I grabbed the
01:26:12.820 | Abstracts and the topics for each and so here's an example. So the category of this particular paper was compute a
01:26:19.660 | cs.NI, which is computer science networking, and
01:26:22.140 | Then the summary, that is the abstract of the paper,
01:26:25.180 | Says the exploitation of mm-wave bands is one of the key enablers for 5G mobile, blah blah blah. Okay, so here's like an
01:26:32.800 | example
01:26:35.140 | piece of text from my language model
01:26:39.420 | So I trained a language model on this archive data set that I downloaded and then I built a simple little test
01:26:45.940 | which basically
01:26:48.260 | You would pass it some like priming text
01:26:52.140 | So you'd say like oh imagine you started reading a document that said
01:26:55.460 | Category is computer science networking and the summary is algorithms that and then I said, please write
01:27:03.100 | An archive abstract so it said that if it's networking
01:27:08.900 | algorithms that
01:27:10.220 | Use the same network as a single node are not able to achieve the same performance as a traditional network based routing algorithms in this
01:27:16.860 | Paper we propose a novel routing scheme, but okay
01:27:19.700 | So it it's learned by reading archive papers that somebody who was saying algorithms that
01:27:26.500 | Where the words 'cat cs.NI' came before it, is going to talk like this, and remember it started out not knowing English at all
01:27:35.740 | Right, it actually started out with an embedding matrix for every word in English that was random
01:27:42.180 | Okay, and by reading lots of archive papers, it learnt what kind of words followed others
01:27:47.700 | So then I tried what if we said cat computer science computer vision?
01:27:52.220 | summary
01:27:54.300 | algorithms that
01:27:55.820 | Use the same data to perform image classification are increasingly being used to improve the performance of image classification
01:28:03.100 | Algorithms and this paper we propose a novel method for image classification using a deep convolutional neural network parentheses CNN
01:28:10.020 | So you can see like it's kind of like almost the same sentence as back here
01:28:15.940 | But things have just changed into this world of computer vision rather than networking
01:28:21.060 | So I tried something else which is like, okay
01:28:23.700 | Category computer vision and I created the world's shortest ever abstract algorithms
01:28:29.260 | And then I said title on and the title of this is going to be on the performance of deep learning for image classification
01:28:36.980 | EOS is end of string. So that's like end of title
01:28:40.740 | What if it was networking summary algorithms title on the performance of wireless networks as opposed to?
01:28:48.420 | Towards computer vision towards a new approach to image classification
01:28:52.900 | Networking towards a new approach to the analysis of wireless networks
01:28:58.340 | So like I find this mind-blowing right? I started out with some random matrices
01:29:04.020 | Right, like literally no
01:29:07.260 | Pre-trained anything. I fed it 18 months worth of archive articles and it learnt not only
01:29:14.380 | How to write English pretty well
01:29:17.420 | but also after you say something's a convolutional neural network, you should then use parentheses to say what it's called and
01:29:24.900 | furthermore that the kinds of things people talk and say create algorithms for in computer vision are
01:29:30.940 | performing image classification and in networking are
01:29:34.220 | Achieving the same performance as traditional network-based routing algorithms. So like a language model is
01:29:42.500 | Can be like incredibly deep and subtle
01:29:47.420 | Right, and so we're going to try and build that
01:29:50.480 | But actually not because we care about this at all
01:29:54.540 | We're going to build it because we're going to try and create a pre-trained model
01:29:58.340 | what we're actually going to try and do is take IMDB movie reviews and
01:30:02.960 | Figure out whether they're positive or negative
01:30:06.060 | So if you think about it, this is a lot like cats versus dogs. It's a classification algorithm, but rather than an image
01:30:13.160 | We're going to have the text of a review
01:30:15.620 | So I'd really like to use a pre-trained network
01:30:19.860 | like I would at least like a net to start with a network that knows how to read English, right and so
01:30:27.380 | My view was like okay that to know how to read English means you should be able to like predict the next word of a sentence
01:30:34.740 | so what if we pre-train a language model and
01:30:38.700 | Then use that pre-trained language model and then just like in computer vision
01:30:43.580 | Stick some new layers on the end and ask it instead of to predicting the next word in the sentence
01:30:49.340 | Instead predict whether something is positive or negative
01:30:52.520 | So when I started working on this, this was actually a new idea
01:30:57.860 | Unfortunately in the last couple of months I've been doing it
01:31:01.300 | You know a few people have actually couple people have started publishing this and so this has moved from being a totally new idea to being
01:31:07.660 | a you know somewhat new idea
01:31:12.420 | so this idea of
01:31:14.780 | Creating a language model making that the pre-trained model for a classification model is what we're going to learn to do now
01:31:22.380 | And so the idea is we're really kind of trying to leverage exactly what we learned in our computer vision work
01:31:28.420 | Which is how do we do fine-tuning to create powerful classification models? Yes, you know
01:31:33.820 | So why don't you think that doing just directly what you want to do?
01:31:40.820 | Doesn't work better
01:31:43.660 | Well a because it doesn't just turns out it doesn't empirically
01:31:48.300 | And the reason it doesn't is a number of things
01:31:52.460 | first of all
01:31:55.180 | as we know
01:31:56.780 | Fine-tuning a pre-trained network is really powerful
01:31:59.500 | Right. So if we can get it to learn some related tasks first, then we can use all that information
01:32:06.900 | To try and help it on the second task
01:32:12.380 | the other reason is
01:32:14.380 | IMDB movie reviews
01:32:17.140 | You know up to a thousand words long
01:32:19.300 | They're pretty big and so after reading a thousand words knowing nothing about
01:32:24.220 | How English is structured or even what the concept of the word is?
01:32:28.340 | or punctuation or whatever
01:32:31.100 | at the end of this thousand
01:32:33.340 | Integers, you know, they end up as ints. All you get is a one or a zero
01:32:38.340 | Positive or negative and so trying to like learn the entire structure of English and then how it expresses positive and negative
01:32:44.540 | Sentiments from a single number is just too much to expect
01:32:48.260 | So by building a language model first we can try to build a neural network that kind of understands
01:32:54.900 | The English of movie reviews and then we hope that some of the things it's learnt about
01:33:01.100 | Are going to be useful in deciding whether something's a positive or a negative
01:33:05.060 | That's a great question
01:33:08.020 | Thanks. Is this similar to the char-RNN by Karpathy?
01:33:15.780 | Yeah, this is somewhat similar to char-RNN by Karpathy. So the famous char, as in C-H-A-R, RNN
01:33:23.660 | Try to predict the next letter given a number of previous letters
01:33:29.100 | Language models generally work at a word level. They don't have to
01:33:33.460 | and doing things at a word level turns out to be a
01:33:37.940 | Can be quite a bit more powerful and we're going to focus on word level modeling in this course
01:33:42.980 | To what extent are these generated words?
01:33:47.380 | Actual copies of what it's found in the in the training data set or are these completely
01:33:54.100 | Random things that it actually learned and how do we know how to distinguish between those two? Yeah, I mean these are all good questions
01:34:02.340 | The words are definitely words we've seen before the work because it's not at a character level
01:34:06.660 | So it can only give us the word it's seen before the sentences
01:34:10.060 | There's a number of kind of rigorous ways of doing it
01:34:14.380 | But I think the easiest is to get a sense of like well here are two like different categories
01:34:19.780 | Where it's kind of created very similar concepts, but mixing them up in just the right way like it would be very hard
01:34:27.660 | To to do what we've seen here just by like spitting back things. It's seen before
01:34:34.220 | But you could of course actually go back and check. You know have you seen that sentence before or like a string distance
01:34:40.780 | Have you seen a similar sentence before?
01:34:42.780 | in this case
01:34:44.820 | And of course another way to do it is the length most importantly when we train the language model as we'll see
01:34:51.080 | We'll have a validation set and so we're trying to predict the next word
01:34:54.540 | Of something that's never seen before and so if it's good at doing that. It should be good at generating text in this case the purpose
01:35:03.380 | The purpose is not to generate text
01:35:05.380 | That was just a fun example and so I'm not really going to study that too much
01:35:09.340 | But you know you during the week totally can like you can totally build
01:35:14.620 | The or you know greater American novel generator or whatever
01:35:18.940 | there are actually some tricks to
01:35:21.740 | To using language models to generate text that I'm not using here. They're pretty simple
01:35:27.940 | We can talk about them on the forum if you like, but my focus is actually on classification
01:35:33.180 | So I think that's the thing which is
01:35:35.500 | Incredibly powerful like text classification I
01:35:40.880 | Don't know you're a hedge fund
01:35:43.340 | You want to like read every article as soon as it comes out through Reuters or Twitter or whatever and immediately
01:35:50.220 | Identify things which in the past have caused you know massive market drops. That's a classification model or you want to
01:36:00.740 | Recognize all of the customer service
01:36:02.740 | queries which tend to be associated with people who
01:36:06.940 | Who leave your you know who cancel their contracts in the next month?
01:36:12.500 | That's a classification problem, so like it's a really powerful kind of thing for
01:36:17.740 | data journalism
01:36:20.540 | Activism
01:36:24.260 | commerce so forth right like
01:36:27.500 | I'm trying to classify documents into whether they're part of legal discovery or not part of legal discovery
01:36:32.680 | Okay, so you get the idea?
01:36:38.260 | In terms of stuff. We're importing we're importing a few new things here
01:36:41.820 | one of the bunch of things we're importing is
01:36:45.180 | Torchtext. Torchtext is PyTorch's NLP
01:36:52.420 | Library, and so fast AI is designed to work hand-in-hand with torchtext as you'll see, and then there's a few
01:36:59.180 | Text-specific sub-parts of fast AI that we'll be using
01:37:04.200 | So we're going to be working with the IMDB large movie review data set. It's very very well studied in academia
01:37:12.740 | you know
01:37:15.420 | Lots and lots of people over the years have
01:37:17.660 | Studied this data set
01:37:21.180 | 50,000 reviews highly polarized reviews either positive or negative each one has been
01:37:26.980 | classified by sentiment
01:37:29.860 | Okay, so we're going to try our first of all however to create a language model
01:37:33.540 | So we're going to ignore the sentiment entirely right so just like the dogs and cats
01:37:37.580 | Pre-train the model to do one thing and then fine-tune it to do something else
01:37:41.300 | Because this kind of idea in NLP is is so so so new
01:37:47.980 | There's basically no models you can download for this so we're going to have to create our own
01:37:52.940 | right, so
01:37:55.620 | Having downloaded the data you can use the link here. We do the usual stuff saying the path to it training and validation path
01:38:03.220 | And as you can see it looks pretty pretty traditional compared to vision. There's a directory of training
01:38:10.120 | There's a directory of test we don't actually have separate test and validation in this case
01:38:15.940 | And just like in in vision the training directory has a bunch of files in it
01:38:22.440 | In this case not representing images, but representing movie reviews
01:38:26.940 | So we could cat one of those files and here we learn about the classic zombie Geddon movie
01:38:36.460 | I have to say with a name like zombie Geddon and an atom bomb on the front cover
01:38:42.120 | I was expecting a flat-out chop socky funku
01:38:45.040 | Rented if you want to get stoned on a Friday night and laugh with your buddies
01:38:51.780 | Don't rent it if you're an uptight weenie or want a zombie movie with lots of fresh eating
01:38:55.560 | I think I'm going to enjoy zombie Geddon so all right, so we've learned something today
01:39:00.360 | All right, so we can just use standard unique stuff to see like how many words are in the data set so the training set we've got
01:39:09.360 | 17 and a half million words
01:39:13.400 | Test set we've got five point six million words
01:39:16.300 | So here's
01:39:20.260 | This is IMDB, so IMDB is, yeah, random people; this is not a New York Times listed reviewer as far as I know
01:39:30.060 | Okay, so
01:39:35.580 | Before we can do anything with text we have to turn it into a list of tokens
01:39:41.580 | A token is basically like a word right so we're going to try and turn this eventually into a list of numbers
01:39:47.180 | So the first step is to turn it into a list of words
01:39:49.580 | That's called tokenization in NLP NLP has a huge lot of jargon that we'll we'll learn over time
01:39:56.180 | One thing that's a bit tricky though when we're doing tokenization is here
01:40:02.740 | I've tokenized that review and then joined it back up with spaces and you'll see here that wasn't
01:40:09.220 | Has become two tokens which makes perfect sense right wasn't is two things, right?
01:40:16.340 | Dot dot dot has become one token
01:40:20.500 | Right, where else lots of exclamation marks has become lots of tokens. So like a good tokenizer
01:40:26.960 | will do a good job of recognizing like
01:40:30.260 | Pieces of an English sentence each separate piece of punctuation will be separated
01:40:36.740 | And each part of a multi-part word will be separated as appropriate. So
01:40:42.500 | spaCy is, I think it's an Australian-developed piece of software actually, that does lots of NLP stuff
01:40:49.260 | It's got the best tokenizer I know, and so
01:40:52.220 | Fast AI is designed to work well with the spaCy tokenizer, as is torchtext. So here's an example of
01:40:59.100 | Tokenization, right so what we do with torch text is we basically have to start out by creating
01:41:06.700 | Something called a field and a field is a definition of how to pre-process some text
01:41:12.620 | And so here's an example of the definition of a field. It says I want to lowercase
01:41:17.360 | The text and I want to tokenize it with the spaCy tokenize function
01:41:23.160 | Okay, so it hasn't done anything yet. We're just telling her when we do do something
01:41:28.100 | This is what to do. And so that we're going to store that
01:41:30.620 | description of what to do in a thing called
01:41:33.420 | capital text
01:41:35.580 | And so none of this is fast AI specific at all
01:41:39.900 | This is part of torch text. You can go to the torch text website read the docs. There's not lots of docs yet
01:41:45.340 | This is all very very new
01:41:48.300 | Probably the best information you'll find about it is in this lesson, but there's some more information on this site
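Concretely, the field definition looks roughly like this (the lesson uses a spaCy-based tokenizer helper from the fast.ai library; here it is written out inline, the spaCy model name is an assumption, and this is the 2018-era torchtext API):

```python
import spacy
from torchtext import data            # the torchtext API as it existed at the time of this lesson

spacy_en = spacy.load('en_core_web_sm')   # assumed model name

def spacy_tok(text):
    """Tokenize a string into a list of word and punctuation tokens with spaCy."""
    return [tok.text for tok in spacy_en.tokenizer(text)]

# A Field is a recipe for how to pre-process text: lowercase it, then tokenize with spaCy.
# Nothing is processed yet; TEXT just stores the description of what to do.
TEXT = data.Field(lower=True, tokenize=spacy_tok)
```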
01:41:54.260 | Alright, so what we can now do is go ahead and create the usual fast AI model data object
01:42:03.060 | Okay, and so to create the model data object. We have to provide a few bits of information
01:42:07.820 | We have to say what's the training set?
01:42:10.260 | So the path to the text files the validation set and the test set in this case just to keep things simple
01:42:17.660 | I don't have a separate validation in test set so I'm going to pass in the validation set for both of those two things
01:42:23.620 | Right. So now we can create our model data object as per usual. The first thing we give it is the path
01:42:31.060 | The second thing we give it is the torch text field definition of how to pre-process that text
01:42:36.940 | The third thing we give it is the dictionary or the list of all of the files we have train validation test
01:42:44.540 | As per usual we can pass in a batch size and then we've got a special special couple of extra things here
01:42:51.900 | One is a very commonly used in NLP minimum frequency. What this says is
01:43:00.020 | In a moment, we're going to be replacing every one of these words with an integer
01:43:04.980 | Which basically will be a unique index for every word and this basically says if there are any words that occur less than 10 times
01:43:13.340 | Just call it unknown
01:43:16.220 | Right don't think of it as a word, but we'll see that in more detail in a moment
01:43:20.740 | And then we're going to see this in more detail as well BP TT stands for back prop through time
01:43:27.580 | And this is where we define how long a sentence will we?
01:43:32.060 | Stick on the GPU at once. So we're going to break them up in this case. We're going to break them up into sentences of
01:43:38.820 | 70 tokens or less on the whole so we're going to see all this in a moment
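Put together, the model data construction being described looks roughly like this (a sketch; depending on the fast.ai 0.7 version the constructor is spelled either LanguageModelData(...) or LanguageModelData.from_text_files(...), and PATH and TEXT are the path and field defined above):

```python
bs, bptt = 64, 70

# Folders of text files for train / validation / test; validation doubles as test here.
FILES = dict(train='train', validation='test', test='test')

# min_freq=10: any word occurring fewer than 10 times becomes the unknown token.
# bptt=70: roughly how many tokens of each sequence go onto the GPU at once.
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES,
                                       bs=bs, bptt=bptt, min_freq=10)
```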
01:43:44.860 | All right. So after building our model data object, right what it actually does is it's going to fill this text field
01:43:54.700 | With an additional attribute called vocab and this is a really important NLP concept
01:44:01.020 | I'm sorry. There's so many NLP concepts. We just have to throw at you kind of quickly, but we'll see them a few times
01:44:05.980 | right a
01:44:08.100 | Vocab is the vocabulary and the vocabulary in NLP has a very specific meaning it is
01:44:14.100 | What is the list of unique words that appeared in this text?
01:44:17.160 | So every one of them is going to get a unique index. So let's take a look right here is text
01:44:24.540 | Vocab dot itos, this stands for, and this is all torchtext not fast AI,
01:44:29.340 | TEXT's vocab, int to string.
01:44:32.300 | It maps the integer zero to unknown, the integer one to padding, int two to 'the', then comma, dot, 'and',
01:44:41.500 | 'of', 'to' and so forth. All right, so this is the first 12
01:44:45.540 | elements of the array
01:44:50.220 | Of the vocab from the IMDB movie review and it's been sorted by frequency
01:44:55.820 | Except for the first two special ones. So for example, we can then go backwards: stoi, string to int.
01:45:02.900 | Here is 'the', it's in position 0, 1, 2, so string to int of 'the' is 2
01:45:09.460 | So the vocab lets us take a word and map it to an integer or take an integer and map it to a word
01:45:19.060 | Right. And so that means that we can then take
01:45:22.060 | the first 12 tokens for example of our text and turn them into
01:45:28.380 | 12 it's so for example here is of the you can see 7 2 and
01:45:35.940 | Here you can see 7 2
01:45:38.500 | Right. So we're going to be working in this form. Did you have a question? Yeah, could you pass that back there?
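In code, that mapping back and forth looks like this (the integers in the comments are examples from the lesson):

```python
# int -> string: the first few entries of the vocab, sorted by frequency
# after the special unknown and padding tokens.
print(TEXT.vocab.itos[:12])     # e.g. ['<unk>', '<pad>', 'the', ',', '.', 'and', ...]

# string -> int: look up the integer index for a particular word.
print(TEXT.vocab.stoi['the'])   # e.g. 2

# Numericalize a few tokens into their integer ids.
tokens = ['of', 'the']
print([TEXT.vocab.stoi[t] for t in tokens])   # e.g. [7, 2]
```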
01:45:47.940 | Is it common to do any stemming or lemmatizing?
01:45:50.860 | Not really. No
01:45:53.900 | Generally tokenization is is what we want like with a language model
01:45:57.800 | We you know to keep it as general as possible we want to know what's coming next and so like whether it's
01:46:04.700 | Future tense or past tense or plural or singular like we don't really know which things are going to be interesting in which aren't
01:46:15.420 | It seems that it's generally best to kind of leave it alone as much as possible
01:46:20.380 | Be the short answer
01:46:23.340 | You know having said that as I say, this is all pretty new
01:46:26.660 | So if there are some particular areas that some researcher maybe has already discovered that some other kinds of pre-processing are helpful
01:46:33.620 | You know, I wouldn't be surprised not to know about it
01:46:37.220 | So when you're dealing with
01:46:40.420 | You know natural language is in context important context is very important. So if you're if you're using
01:46:46.940 | Words no, no, we're not looking at words
01:46:51.780 | This is this look this is I just don't get some of the big premises of this like they're in order
01:46:57.740 | Yeah, so just because we replaced I with the number 12
01:47:02.700 | These are still in that order. Yeah
01:47:07.380 | There is a different way of dealing with natural language called a bag of words and bag of words
01:47:12.380 | You do throw away the order in the context and in the machine learning course
01:47:16.220 | We'll be learning about working with bag of words representations
01:47:18.940 | But my belief is that they are
01:47:21.740 | No longer useful, or on the verge of becoming no longer useful
01:47:26.540 | We're starting to learn how to use deep learning to use context properly now
01:47:32.620 | But it's kind of for the first time it's really like only in the last few months
01:47:37.140 | All right, so I mentioned that we've got two numbers batch size and BPT T back prop through time
01:47:45.420 | So this is kind of subtle
01:47:47.620 | So we've got some big long piece of text
01:47:58.940 | Okay, so we've got some big long piece of text, you know, here's our sentence. It's a bunch of words, right and
01:48:03.540 | Actually what happens in a language model is even though we have lots of movie reviews
01:48:10.460 | They actually all get concatenated together into one big block of text, right? So it's basically predict the next word
01:48:18.580 | In this huge long thing, which is all of the IMDb movie reviews concatenate together. So this thing is, you know
01:48:26.340 | What do we say? It was like tens of millions of words long and so what we do
01:48:32.060 | Is we split it up into batches?
01:48:36.020 | First right so these like are our spits into batches, right? And so if we said
01:48:42.420 | we want a batch size of
01:48:45.020 | 64 we actually break the whatever was 60 million words into the 64
01:48:51.620 | sections
01:48:53.700 | right, and then we take each one of the 64 sections and
01:48:59.060 | We move it
01:49:02.340 | Like underneath the previous one I didn't do a great job of that
01:49:09.140 | Right move it underneath
01:49:14.420 | So we end up with a matrix
01:49:18.320 | Which is
01:49:24.100 | Actually, I think we've moved them across wise so it's actually I think just transpose it we end up with a matrix. It's like 64
01:49:37.460 | columns
01:49:39.340 | Wide and the length let's say the original was 64 million right then the length is like
01:49:46.900 | 10 million
01:49:50.060 | Right. So each of these represents
01:49:52.740 | 1/64 of our entire IMDb review set
01:49:58.340 | And so that's our starting point
01:50:01.140 | so then what we do is
01:50:03.660 | We then grab a little chunk of this at a time and those chunk lengths are approximately equal to
01:50:11.500 | BP TT which I think we had equal to 70. So we basically grab a little
01:50:16.980 | 70 long
01:50:19.500 | section and
01:50:20.980 | That's the first thing we chuck into our GPU. That's a batch, right? So a batch is always of length of width
01:50:28.020 | 64 or batch size and each bit is a sequence of length up to 70
01:50:35.220 | So let me show you
01:50:37.260 | Right. So here if I go take my train data loader
01:50:42.060 | I don't know if you folks have tried playing with this yet
01:50:44.980 | But you can take any data loader wrap it with it up to turn it into an iterator and then call next on it to grab
01:50:51.660 | a batch of data just as if you were a neural net you get exactly what the neural net gets and you can see here we
01:50:58.940 | get back a
01:51:00.940 | 75 by 64
01:51:03.060 | Tensor right so it's 64 wide right and I said it's approximately
01:51:09.900 | 70 high and
01:51:13.300 | But not exactly
01:51:15.140 | And that's actually kind of interesting a really neat trick that torch text does is they randomly change
01:51:22.060 | The back prop through time number every time so each epoch it's getting slightly different
01:51:29.300 | bits of text
01:51:32.220 | This is kind of like in computer vision. We randomly shuffle the images
01:51:37.080 | We can't randomly shuffle the words right because we need to be in the right order
01:51:42.100 | So instead we randomly move their break points a little bit. Okay, so this is the equivalent
01:51:47.240 | so in other words this
01:51:50.340 | This here is of length 75 right there's a there's an ellipsis in the middle
01:52:00.420 | And that represents the first 75 words of the first review
01:52:05.700 | Right, whereas this 75 here
01:52:09.780 | Represents the first 75 words of this of the second of the 64 segments
01:52:15.060 | That's it have to go in like 10 million words to find that one right and so here's the first
01:52:20.780 | 75 words of the last of those 64 segments okay, and so then what we have
01:52:27.820 | down here is
01:52:30.940 | The next
01:52:34.540 | sequence, right, so 51, there's 51,
01:52:38.540 | 615, there's 615, 25, there's 25, right, and in this case
01:52:45.180 | it actually is of the same size
01:52:47.820 | It's also 75 by 64, but for minor technical reasons it's been flattened out
01:52:53.060 | into a single vector. Basically it's exactly the same as this matrix, but it's just moved down
01:53:01.980 | by one, because we're trying to predict the next word
01:53:05.740 | Right, so that all happens for us. This is fastai now: if you ask for a language model data
01:53:15.420 | object then it's going to create these batches of
01:53:18.820 | batch-size width by bptt height,
01:53:23.980 | bits of our language corpus, along with the same thing shifted along by one word
01:53:32.220 | Right and so we're always going to try and predict the next word
01:53:36.100 | So why don't you, instead of just arbitrarily choosing 64,
01:53:47.100 | why don't you choose, like, 64 is a large number,
01:53:52.900 | maybe do it by sentences and make it a large number and then pad it with zeros or something,
01:54:00.860 | so that you
01:54:02.340 | actually have one full sentence per line
01:54:05.460 | Basically, wouldn't that make more sense? Not really, because remember we're using columns, right, so each of our columns is of length about a million
01:54:13.140 | Right, so although it's true that those columns aren't always exactly finishing on a full stop, they're so damn long we don't care,
01:54:21.520 | because they're like a million words long
01:54:25.340 | Right, and so each column contains multiple sentences?
01:54:32.120 | Yeah, it's of length about a million
01:54:35.500 | and it contains many, many, many sentences
01:54:38.880 | Because remember the first thing we did was take the whole thing and split it into 64 groups
01:54:43.660 | Okay, great
01:54:50.620 | So, um, I found this, you know, pertaining to this question, this thing about
01:54:55.960 | what's in this language model matrix a little mind-bending for quite a while
01:55:01.780 | So don't worry if it takes a while and you have to ask a thousand questions on the forum. That's fine, right?
01:55:09.540 | Go back and listen to what I just said in this lecture again
01:55:12.420 | Go back to that bit where I showed you splitting it up into 64 and moving them around, and try it with some sentences in
01:55:17.600 | Excel or something, and see if you can do a better job of explaining it than I did
01:55:22.240 | Because this is basically how torchtext works
01:55:26.260 | And then what fastai adds on is this idea of how to build a language model out of it
01:55:33.460 | Although actually a lot of that is stolen from torchtext as well, so where torchtext ends and fastai starts,
01:55:39.700 | or vice versa, is a little
01:55:41.700 | subtle; they really work closely together, okay?
01:55:48.020 | Now that we have a model data object
01:55:50.980 | That can feed us
01:55:53.540 | Batches we can go ahead and create a model right and so in this case
01:55:59.300 | We're going to create an embedding matrix for our vocab
01:56:04.140 | We can see how big our vocab was
01:56:06.780 | Let's have a look back here, so we can see here in the model data object there are this many
01:56:18.020 | kind of pieces that we're going to go through, and that's basically equal to
01:56:22.540 | the total length of everything divided by batch size times
01:56:27.620 | bptt. And this one I wanted to show you: nt
01:56:31.020 | I've got the definition up here, number of unique tokens. nt is the number of tokens,
01:56:36.080 | that's the size of our vocab, so we've got thirty-four thousand, nine hundred and forty-five unique words
01:56:43.700 | And notice, to count as a unique word it had to appear at least ten times,
01:56:46.900 | okay, because otherwise it would have been replaced with the unknown token, unk
01:56:50.300 | The length of the data set is one because as far as a language model is concerned there's only one
01:57:00.860 | Thing which is the whole corpus all right, and then that thing has
01:57:06.500 | Here it is twenty point six million
01:57:09.420 | words
01:57:11.500 | right
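As a rough sketch, those numbers can be read straight off the model data object; the attribute names below are the fastai 0.7 ones as I recall them, so treat them as assumptions:

```python
# Assumes `md` is the LanguageModelData object from the notebook (fastai 0.7).
print(len(md.trn_dl))          # number of batches, roughly corpus length / (bs * bptt)
print(md.nt)                   # number of unique tokens (vocab size), 34,945 here
print(len(md.trn_ds))          # 1: the language model sees one big concatenated corpus
print(len(md.trn_ds[0].text))  # total token count, about 20.6 million here
```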
01:57:12.820 | So those thirty-four thousand, nine hundred and forty-five things are used to create an embedding matrix
01:57:18.820 | with a number of rows equal to
01:57:23.060 | thirty-four thousand,
01:57:27.340 | nine hundred and forty-five
01:57:29.340 | Right, and so the first one represents unk, the second one represents pad,
01:57:35.180 | the third one was dot, the fourth one was comma, this one, I'm just guessing, was 'the', and so forth
01:57:42.660 | Right and so each one of these gets an
01:57:45.460 | embedding vector
01:57:47.660 | So this is literally identical to what we did
01:57:50.500 | Before the break right this is a categorical variable. It's just a very high cardinality categorical variable and furthermore
01:57:59.300 | It's the only variable right. This is pretty standard in NLP. You have a variable which is a word
01:58:06.740 | Right we have a single categorical variable
01:58:10.260 | a single column basically, and it's a thirty-four-thousand-nine-hundred-and-forty-five
01:58:16.860 | cardinality categorical variable, and so we're going to create an embedding matrix for it
01:58:21.900 | So em_sz is the size of the embedding vector, 200, okay?
01:58:28.020 | So that's going to be length 200, a lot bigger than our previous embedding vectors. Not surprising, because a word
01:58:34.740 | has a lot more nuance to it than the concept of Sunday,
01:58:39.580 | right,
01:58:40.780 | or Rossmann's Berlin store or whatever. So generally an embedding size for a word
01:58:47.640 | will be somewhere between about 50 and about 600
01:58:50.400 | Okay, so I've kind of gone somewhere in the middle
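The embedding matrix itself is just a plain PyTorch embedding of that size; a minimal sketch:

```python
import torch.nn as nn

# One 200-long vector per vocabulary token: the same idea as the categorical
# embeddings before the break, just with ~35,000 rows.
vocab_size, em_sz = 34_945, 200
emb = nn.Embedding(vocab_size, em_sz)
print(emb.weight.shape)   # torch.Size([34945, 200])
```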
01:58:52.980 | We then have to say as per usual how many activations
01:58:58.100 | Do you want in your layers so we're going to use 500 and then how many layers?
01:59:02.140 | Do you want in your neural net we're going to use three okay?
01:59:08.140 | This is a minor technical detail. It turns out that,
01:59:11.180 | and we're going to learn later about the Adam optimizer,
01:59:14.460 | basically the defaults for it don't work very well with these kinds of models
01:59:18.720 | So we just have to change some of them. You know, basically any time you're doing NLP, you should probably
01:59:24.660 | include this line,
01:59:27.300 | because it works pretty well
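As I recall, the line in question swaps in lower Adam betas via a partial; the exact values below are an assumption rather than something to rely on:

```python
from functools import partial
import torch.optim as optim

# Adam with a lower beta1 (the momentum term) than the (0.9, 0.999) defaults,
# which tend to behave badly on these recurrent language models.
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))
```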
01:59:30.060 | So having done that we can now again take our model data object and grab a model out of it
01:59:36.260 | And we can pass in a few different things
01:59:38.580 | What optimization function do we want, how big an embedding do we want, how many activations (the number of hidden units),
01:59:46.860 | how many layers, and
01:59:48.900 | how much dropout of many different kinds
01:59:52.500 | So this language model we're going to use is a very recent development called AWD-LSTM by Stephen Merity,
02:00:01.020 | who's an NLP researcher based in San Francisco, and his main contribution really was to show
02:00:07.680 | how to put dropout all over the place in these NLP models
02:00:13.360 | So we're not going to worry now,
02:00:15.740 | we'll do this in the last lecture, about
02:00:18.780 | what the architecture is and what all these dropouts are. For now,
02:00:22.460 | just know it's the same as per usual: if you try to build an NLP model and you're under-fitting,
02:00:28.540 | then decrease all of these dropouts; if you're over-fitting, then increase all of these dropouts, in roughly this ratio
02:00:35.960 | Okay, that's my rule of thumb. And again, this is such a recent paper,
02:00:42.260 | hardly anybody else is working on this model yet, so there's not a lot of guidance, but I've found these ratios work
02:00:49.260 | well, and that's what Stephen's been using as well
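A sketch of what creating the learner looks like in the old fastai 0.7 API; the dropout keyword names and values are as I recall them, so treat them as assumptions and scale them up or down together:

```python
# Assumes `md` and `opt_fn` from earlier; fastai 0.7 API, keyword names recalled
# from the notebook rather than verified.
em_sz, nh, nl = 200, 500, 3   # embedding size, hidden activations, layers

learner = md.get_model(opt_fn, em_sz, nh, nl,
                       dropouti=0.05, dropout=0.05, wdrop=0.1,
                       dropoute=0.02, dropouth=0.05)
```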
02:00:51.500 | There's another kind of way we can avoid overfitting that we'll talk about in the last class
02:00:58.540 | Again for now this one actually works totally reliably so all of your NLP models probably want this particular line of code
02:01:05.600 | And then this one we're going to talk about at the end of the last lecture as well. You can always include this. Basically what it says is,
02:01:14.700 | when you
02:01:19.220 | look at your gradients, and you multiply them by the learning rate, and you decide how much to update your weights by,
02:01:26.580 | this says clip them,
02:01:28.580 | like, literally,
02:01:31.220 | don't let them be more than zero point three
02:01:34.740 | And this is quite a cool little trick, right, because
02:01:39.540 | if your learning rate's pretty high, you kind of don't want to get in that situation
02:01:46.140 | we talked about where you've got this kind of thing where you go,
02:01:54.100 | you know, rather than little step, little step, little step, instead you go, oh, too big, oh, too big. With gradient
02:02:01.340 | clipping it kind of goes this far, and it's like, oh my goodness, I'm going too far, I'll stop
02:02:05.900 | Right, that's basically what gradient clipping does
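In the notebook this is, as I recall, just `learner.clip = 0.3`; under the hood it amounts to the standard PyTorch gradient clipping utility, sketched here on a dummy model:

```python
import torch
import torch.nn as nn

# After the backward pass, rescale the gradients so their overall norm cannot
# exceed 0.3: the "whoa, that step would be too big, stop" behaviour above.
model = nn.Linear(10, 1)
loss = model(torch.randn(4, 10)).sum()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.3)
```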
02:02:11.980 | Anyway, so these are a bunch of parameters the details don't matter too much right now. You can just steal these
02:02:16.980 | And then we can go ahead and call fit
02:02:22.140 | with exactly the same parameters as usual
02:02:24.140 | So Jeremy, um, there are all these other
02:02:32.940 | word embedding things, like
02:02:36.420 | word2vec and GloVe, so I have two questions about that. One is,
02:02:41.840 | how are those different from these, and the second question, why don't you initialize them with one of those? Yeah, so
02:02:51.900 | So basically that's a great question, so basically
02:02:54.540 | People have pre-trained
02:02:57.820 | These embedding matrices before to do various other tasks. They're not whole pre-trained models
02:03:03.860 | They're just a pre-trained embedding matrix, and you can download them, and as Yannet says they have names like word2vec and GloVe
02:03:10.780 | And they're literally just a matrix
02:03:12.780 | There's no reason we couldn't download them really it's just like
02:03:20.300 | kind of
02:03:22.300 | I found that
02:03:25.620 | building a whole pre-trained model in this way didn't seem to benefit much, if at all, from using pre-trained word vectors,
02:03:32.700 | whereas using a whole pre-trained language model
02:03:35.140 | made a much bigger difference
02:03:37.460 | So, like, those of you who saw word2vec will remember it made a big splash when it came out;
02:03:42.820 | I'm finding this technique of pre-trained language models seems much more powerful, basically,
02:03:49.740 | but I think we could combine both to make them a little better still
02:03:53.620 | What is what is the model that you have used like how can I know the architecture of the model?
02:04:00.020 | So we'll be learning about the model architecture in the last lesson. For now, it's a recurrent neural network
02:04:07.980 | using something called an LSTM, long short-term memory
02:04:17.740 | So there are lots of details that we're skipping over, but you know, you can do all this without any of those details
02:04:23.500 | We go ahead and fit the model
02:04:25.980 | I found that this language model took quite a while to fit, so I kind of ran it for a while,
02:04:31.260 | noticed it was still under-fitting, saved where it was up to,
02:04:34.860 | ran it a bit more with a longer cycle length, saved it again, it still
02:04:39.500 | was kind of under-fitting,
02:04:42.180 | you know, ran it again,
02:04:44.220 | and finally got to the point where, honestly, I kind of ran out of patience,
02:04:48.300 | so I just saved it at that point
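That train-a-bit, save, train-some-more loop looks roughly like this in fastai 0.7 style; the learning rate, cycle settings and file names are illustrative assumptions, not the notebook's exact values:

```python
# Assumes `learner` from earlier (fastai 0.7); values are illustrative.
learner.fit(3e-3, 1, wds=1e-6, cycle_len=1)     # first pass: still under-fitting
learner.save_encoder('lm_enc_1')                # save where it was up to
learner.fit(3e-3, 1, wds=1e-6, cycle_len=10)    # longer cycle: still under-fitting
learner.save_encoder('lm_enc_2')                # ran out of patience, keep this one
```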
02:04:53.700 | I did the same kind of test that we looked at before, so I gave it "it wasn't quite what I was expecting,
02:04:58.620 | but I really liked it anyway. The best", and then I was like, okay,
02:05:01.080 | let's see how that goes, and it continued with "performance was one in the movie was a little bit", so I said, okay,
02:05:05.020 | it looks like the language model is working pretty well
02:05:07.180 | So I've pre-trained the language model
02:05:12.980 | And so now I want to use it
02:05:14.980 | fine-tune it to do classification, sentiment classification. Now obviously if I'm going to use a pre-trained model
02:05:21.760 | I need to use exactly the same vocab, right, the word "the"
02:05:25.860 | still needs to map to the same number so that I can look up the vector for it, right, so that's why I first of all
02:05:33.820 | load back up my field object, the thing with the vocab in it. Now in this case,
02:05:41.060 | if I run it straight afterwards, this is unnecessary,
02:05:43.880 | it's already in memory, but this means I can come back to this later in a new session, basically
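The reload itself is just unpickling the saved torchtext field; the path and file name below are assumptions, the point is only that the vocab comes back with it:

```python
import pickle

# Reload the torchtext TEXT field saved when the language model was built, so a
# new session gets exactly the same word-to-id mapping. PATH and the file name
# are assumptions here.
TEXT = pickle.load(open(f'{PATH}models/TEXT.pkl', 'rb'))
```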
02:05:50.860 | I can then go ahead and say, okay, I've now got one more field, right: in addition to my field
02:05:59.780 | which represents the reviews, I've also got a field which represents the label
02:06:05.780 | And the details aren't too important here
02:06:09.060 | Now this time I need to not treat the whole thing as one big
02:06:14.180 | piece of text, but keep every review separate, because each one has a different sentiment attached to it
02:06:20.420 | And it so happens that torchtext already has a dataset that does that for IMDB, so I just used the IMDB dataset
02:06:27.460 | built into torchtext
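Sketched out, the torchtext side looks something like this; note this is the older torchtext API (in recent versions these classes moved under `torchtext.legacy`), so take the import paths as assumptions:

```python
from torchtext import data, datasets

# A non-sequential field for the positive/negative label, plus the built-in
# IMDB dataset, which keeps every review as a separate example.
IMDB_LABEL = data.Field(sequential=False)
splits = datasets.IMDB.splits(TEXT, IMDB_LABEL, root='data/')
```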
02:06:30.180 | So basically once we've done all that we end up with something where we can like grab for a particular example
02:06:36.180 | We can grab its label
02:06:38.860 | positive and
02:06:40.020 | Here's some of the text: "this is another great Tom Berenger movie", blah blah blah. All right, so
02:06:45.220 | there's nothing fastai-specific here
02:06:49.820 | We'll come back to it in the last lecture
02:06:51.660 | but the torchtext docs can help you understand what's going on. All you need to know is that
02:06:56.660 | once you've used this special torchtext thing called splits to grab a splits object,
02:07:02.860 | you can pass it straight into fastai's TextData.from_splits, and that basically converts a torchtext
02:07:10.140 | object into a fastai object we can train on. So as soon as you've done that, you can just go ahead and say
02:07:17.500 | get_model, right, and that gets us our learner
02:07:20.700 | And then we can load into it the pre-trained model, the language model,
02:07:26.860 | right, and so we can now take that pre-trained language model and
02:07:31.900 | use the stuff that we're kind of familiar with, right, so we can
02:07:35.300 | make sure that everything but the last layer is frozen, train it a bit,
02:07:40.140 | unfreeze it, train it a bit, and the nice thing is, once you've got a pre-trained
02:07:45.300 | language model it actually trains super fast. You can see here it's like a couple of minutes
02:07:50.380 | per epoch, and to get my best one here
02:07:56.060 | it only took me like 10 epochs, so it's like 20 minutes to train this bit. It's really fast
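Put together, the fine-tuning stage looks roughly like this; this is the old fastai 0.7 API from memory, the method names follow the steps just described but the exact signatures and numbers are assumptions:

```python
# Assumes the notebook's star imports plus PATH, splits, opt_fn, em_sz, nh, nl,
# bptt from earlier; signatures are recalled, not verified.
md2 = TextData.from_splits(PATH, splits, 64)        # torchtext splits -> fastai data
m3 = md2.get_model(opt_fn, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl)
m3.load_encoder('lm_enc_2')    # load the pre-trained language model weights
m3.freeze_to(-1)               # everything except the last layer frozen
m3.fit(3e-3, 1)                # train the head a bit
m3.unfreeze()
m3.fit(3e-3, 1, cycle_len=1)   # then fine-tune the whole thing a bit more
```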
02:08:01.900 | And I ended up with
02:08:03.900 | 94.5% so how good is 94.5% well it so happens that
02:08:11.540 | Actually, one of Stephen Merity's colleagues, James Bradbury, recently wrote a paper
02:08:17.220 | where they tried to create a new state-of-the-art for a bunch of NLP things, and one of the things
02:08:25.980 | they looked at was
02:08:27.940 | IMDB, and they actually have here a list of the current world's best for
02:08:33.180 | IMDB, and
02:08:35.780 | even with stuff that is highly specialized for sentiment analysis, the best anybody had previously come up with was 94.1
02:08:43.220 | So in other words, this technique,
02:08:45.700 | getting 94.5, is literally
02:08:48.980 | better than anything
02:08:51.100 | anybody has created in the world before, as far as we know, or as far as James Bradbury knows
02:08:58.820 | so when I say like there are big opportunities to use this I mean like
02:09:03.180 | this is a technique that nobody else currently has access to. Whatever
02:09:10.300 | IBM has in Watson, or whatever any big company has that they're
02:09:16.180 | advertising, unless they have some secret sauce that they're not publishing, which they don't, right, because
02:09:23.020 | if they have a better thing they publish it,
02:09:25.380 | then you now have access to a better text classification method than has ever existed before
02:09:30.340 | So I really hope that you know, you can try this out and see how you go
02:09:35.140 | There may be some things it works really well on and others that it doesn't work as well on, I don't know
02:09:41.860 | I think this is kind of the sweet spot here, that we had about 25,000,
02:09:48.420 | you know, short to medium sized documents; if you don't have at least that much text
02:09:54.060 | it may be hard to train a decent language model
02:09:56.540 | But having said that, there's a lot more we could do here, right, and we won't be able to do it in part one of this course,
02:10:02.660 | we'll do it in part two, but for example, we could start training language models that look at, like,
02:10:08.860 | You know lots and lots of medical journals and then we could like make a downloadable
02:10:13.620 | medical language model that then anybody could use to like fine-tune on like a
02:10:20.300 | Prostate cancer subset of medical literature for instance, like there's so much we could do
02:10:26.300 | It's kind of exciting, and then, you know, to Yannet's point, we could also combine this with pre-trained word vectors
02:10:32.020 | So, like, even without
02:10:34.260 | trying that hard, you know,
02:10:37.780 | we could have pre-trained a Wikipedia, say, corpus language model, and then fine-tuned it into an
02:10:45.820 | IMDb language model, and then fine-tuned that into an IMDb sentiment analysis model, and we would have got something better than this
02:10:53.100 | So, like, I really think this is the tip of the iceberg
02:10:56.780 | And I was talking to a really fantastic researcher called Sebastian Ruder, who is
02:11:04.500 | basically the only NLP researcher I know who's been really writing a lot about
02:11:11.380 | training and fine-tuning and transfer learning in NLP, and I was asking him, like, why isn't this happening more?
02:11:17.740 | And his view was it's because there isn't the software to make it easy, you know
02:11:23.500 | So I'm actually going to share this lecture with him tomorrow
02:11:27.780 | Because you know it feels like there's you know
02:11:32.540 | Hopefully going to be a lot of stuff coming out now that we're making it really easy to do this
02:11:41.380 | We're kind of out of time so what I'll do is I'll quickly look at
02:11:53.360 | the collaborative filtering introduction, and then we'll finish it next time. For collaborative filtering there's very, very little new to learn
02:11:56.300 | We've basically learned everything we're going to need
02:12:02.980 | So we'll cover collaborative filtering itself quite quickly next week,
02:12:07.980 | and then we're going to do a really deep dive into it next week,
02:12:07.980 | Where we're going to learn about like we're actually going to from scratch learn how to do stochastic gradient descent
02:12:13.820 | How to create loss functions how they work exactly and then we'll go from there and we'll gradually build back up to really deeply understand
02:12:22.820 | what's going on in the structured models, and then what's going on in convnets, and then finally what's going on in recurrent neural networks
02:12:30.500 | And hopefully we'll be able to build them all
02:12:32.940 | from scratch, okay? So this MovieLens dataset is going to be really important, because we're going to use it to
02:12:39.100 | learn a lot of
02:12:40.860 | really foundational theory and the kind of math behind it. So the MovieLens dataset,
02:12:47.380 | This is basically what it looks like it contains a bunch of ratings. It says user number one
02:12:54.140 | Watched movie number 31 and they gave it a rating of two and a half
02:12:58.740 | at this particular time and
02:13:02.020 | Then they watched movie 1029 and they gave it a rating of three, and they watched movie 1172
02:13:08.340 | and they gave it a rating of four. Okay, and so forth
02:13:11.020 | So this is the ratings table. This is really the only one that matters, and our goal will be, for some user,
02:13:18.740 | sorry, for some user-movie combination we haven't seen before, to predict whether they'll like it
02:13:25.580 | Right, and so this is how recommendation systems are built
02:13:29.220 | This is how Amazon decides what books to recommend, how Netflix decides what movies to recommend, and so forth
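If you want to poke at it during the week, the ratings table is just a CSV; the path below assumes the standard small MovieLens download, so adjust it to wherever you put the data:

```python
import pandas as pd

ratings = pd.read_csv('data/ml-latest-small/ratings.csv')
print(ratings.head())   # columns: userId, movieId, rating, timestamp
```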
02:13:34.880 | To make it more interesting we'll also actually download a list of movies so each movie
02:13:42.020 | We're actually going to have the title and so for that question earlier about like what's actually going to be in these embedding matrices
02:13:47.420 | How do we interpret them? We're actually going to be able to look and see
02:13:50.260 | How that's working?
02:13:52.660 | So basically this is kind of like what we're creating this is kind of crosstab of users
02:13:59.960 | by movies
02:14:01.400 | Alright, and so feel free to look ahead during the week. You'll see, basically as per usual, CollabFilterDataset.from_csv,
02:14:08.300 | model data dot get_learner,
02:14:10.800 | learn dot fit, and we're done
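For the curious, those three lines look roughly like this in the old fastai 0.7 API; the class and method names are as I recall them, and the factor count, validation indices and training settings are illustrative assumptions:

```python
# Assumes the notebook's star imports and the MovieLens ratings.csv in PATH.
val_idxs = get_cv_idxs(len(ratings))   # hold some rows out for validation
cf = CollabFilterDataset.from_csv(PATH, 'ratings.csv', 'userId', 'movieId', 'rating')
learn = cf.get_learner(50, val_idxs, 64, opt_fn=optim.Adam)   # 50 factors, bs 64
learn.fit(1e-2, 2, wds=1e-4, cycle_len=1, cycle_mult=2)
```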
02:14:16.680 | And you won't be surprised to hear that when we then compare it to the benchmarks we looked at, it seems to do better. So that'll basically be it, and then next week
02:14:22.040 | we'll have a deep dive and we'll see how to actually build this from scratch. All right. See you next week
02:14:27.640 | [APPLAUSE]