Back to Index

Lesson 4: Deep Learning 2018


Chapters

0:0
10:11 Dropout
17:40 Default Dropout Values
28:3 Categorical vs Continuous Variables
45:7 Create the Learner
45:44 Embeddings
46:4 Continuous Variables
57:5 Embedding Matrices
61:56 Distributed Representation
71:50 Custom Metrics
79:14 Data Augmentation
79:32 What Is Dropout Doing
84:48 Nlp
85:10 Language Modeling
91:14 Language Model
97:32 Create a Language Model
99:50 Tokenization in Nlp
100:58 Example of Tokenization
102:4 Create the Model Data Object
102:25 Model Data Object
102:55 Minimum Frequency
107:11 Bag of Words
115:55 Create a Model
116:0 Embedding Matrix
118:19 Create an Embedding Matrix
124:24 Fit the Model

Transcript

Okay, hi everybody, welcome back, good to see you all here. It's been another busy week of deep learning with lots of cool things going on, and like last week I wanted to highlight a few really interesting articles that some of you folks have written. Fatali wrote one of the best articles I've seen for a while.

It talks about differential learning rates and stochastic gradient descent with restarts. Be sure to check it out if you can, because I feel he's done a great job of positioning it so that you can get a lot out of it regardless of your background, while for those who want to go further he's also got links to the academic papers it came from and graphs showing examples of all the things he's talking about. It's a particularly nicely done article and a good role model for technical communication. One of the things I've liked about seeing people post these articles during the week is that the discussion on the forums has also been really great.

A lot of people have been helping out and explaining things; sometimes people have pointed out "actually, that's not quite how it works" about part of a post, and people have learned new things that way and come up with new ideas as a result. There have been a few of these discussions of stochastic gradient descent with restarts and cyclical learning rates. Anand Sahar has written another great post on a similar topic and why it works so well, again with lots of great pictures, references to papers and, perhaps most importantly, code showing how it actually works. Mark Hoffman covered the same topic at a nice introductory level, with really clear intuition.

Another post talks specifically about differential learning rates and why they're interesting, again providing nice context for people not familiar with transfer learning: what is transfer learning, why is it interesting, and given that, why could differential learning rates be helpful?

One thing I particularly liked about Arjun's article was that he talked not just about the technology we're looking at but also about some of the implications, particularly from a commercial point of view: based on some of the things we've learned so far, what are the implications in real life?

So there's been lots of great stuff online, with lots of background, lots of pictures, and discussion of the implications, and thanks to everybody for all the great work you've been doing. As we talked about last week, if you're vaguely wondering about writing something but feeling a bit intimidated because you've never written a technical post before, just jump in; it's a really welcoming and encouraging group to work with. So we're going to have an interesting lesson today, in that we're going to cover a whole lot of different applications. We've spent quite a lot of time on computer vision, and today we're going to try, if we can, to get through three totally different areas. We're going to start with structured learning, or structured data learning, by which I mean building models on top of things that look more like database tables: columns of different types of data.

They might be financial or geographical or whatever. We're also going to look at using deep learning for natural language processing, and at using deep learning for recommendation systems. We're going to cover these at a fairly high level, and the focus will be on "here is how to use the software to do it" more than "here is what's going on behind the scenes". In the next three lessons we'll dig into the details of what's going on behind the scenes, and also come back to a lot of the details of computer vision that we've skipped over so far. So the focus today is really on how you actually do these applications, and we'll talk briefly about some of the concepts involved.

Before we do, I wanted to talk about one key new concept, which is dropout. You might have seen dropout mentioned a bunch of times already and got the impression that it's something important, and indeed it is. To look at dropout, I'm going to use the dog breeds competition that's currently running on Kaggle. What I've done is I've gone ahead and created a pre-trained network as per usual, and I've passed in precompute=True, so that's going to pre-compute the activations that come out of the last convolutional layer.

Remember, an activation is just a number that gets calculated. Specifically, the activations are calculated from some weights, also called parameters, that make up kernels or filters, and they get applied to the previous layer's activations, which could be the inputs or could themselves be the results of other calculations. So when we say activation, keep remembering we're talking about a number that's being calculated. We pre-compute some activations, and then on top of that we put a bunch of additional, initially randomly generated, fully connected layers. We're just going to do some matrix multiplications on top of those, just like at the very end of our Excel worksheet, where we had a matrix that we did a matrix multiplication with. If you just type the name of your learner object, you can actually see what's in it.

You can see the layers in it. So where I'd previously been skipping over the bit about "we add a few layers to the end", these are actually the layers that we add. We'll do batch norm in the last lesson, so don't worry about that for now. A linear layer simply means a matrix multiply.

This is a matrix which has 1,024 rows and 512 columns; in other words, it's going to take in 1,024 activations and spit out 512 activations. Then we have a ReLU, which remember just replaces the negatives with zero. We'll skip over the batch norm and come back to dropout, and then we have a second linear layer that takes those 512 activations from the previous linear layer and puts them through a new matrix multiply, 512 by 120, which spits out 120 new activations, and finally puts those through a softmax. For those of you that don't remember softmax, we looked at it last week.

It's the idea that we take the previous activation, let's say for "dog", go e to the power of that, and then divide it by the sum of e to the power of all the activations. That was the thing where they all add up to one, and each one individually is between zero and one. So that's what we add on top, and when we have precompute=True, that's the thing we train. I wanted to talk about what this dropout is and what this p is, because it's a really important thing we get to choose. A dropout layer with p=0.5 literally does this: go over to our spreadsheet, pick any layer with some activations, and apply dropout with a p of 0.5 to it. What that means is I go through and, with a 50% chance, I pick a cell, an activation; so I've picked half of them randomly, and I delete them.

That's what dropout is: p=0.5 means "what's the probability of deleting that cell?". When I delete those cells, if you have a look at the output, it doesn't actually change very much at all, just a little bit, particularly because it's going through a max pooling layer, so it only changes at all if the deleted cell was actually the maximum in that group of four; and furthermore, if it's going into a convolution rather than a max pool, it's just one piece of that filter. So interestingly, this idea of randomly throwing away half of the activations in a layer has a really interesting result, and one important thing to mention is that for each mini-batch we throw away a different random half of the activations in that layer. What that means is it forces the network not to overfit: if there's some particular activation that has really just learnt that exact dog or that exact cat, then when it gets dropped out, the whole thing isn't going to work as well.

It's not going to recognize that image. So in order for this to work, it has to try to find a representation that continues to work even as a random half of the activations gets thrown away every time. Dropout is, I guess, about three or four years old now, and it's been absolutely critical in making modern deep learning work, because it pretty much solved the problem of generalization for us. Before dropout came along, if you tried to train a model with lots of parameters and you were overfitting, and you'd already tried all the data augmentation you could and you already had as much data as you could get, there were some other things you could try, but to a large degree you were stuck. Then Geoffrey Hinton and his colleagues came up with this dropout idea, which was loosely inspired by the way the brain works, and also, apparently, loosely inspired by Geoffrey Hinton's experience in bank teller queues. Somehow they came up with this amazing idea: let's try throwing things away at random.

As you can imagine, if your p was 0.01, you'd be throwing away 1% of the activations for that layer at random. That's not going to change things very much at all, so it's not really going to protect you from overfitting much. On the other hand, if your p was 0.99, that would be like going through the whole thing and throwing away nearly everything, which would make it very hard to overfit, so it would be great for generalization, but it's also going to kill your accuracy. So there's a trade-off: high p values generalize well but will decrease your training accuracy, and low p values will generalize less well but will give you a better training accuracy. For those of you who have been wondering why, particularly early in training, your validation losses are better than your training losses, which otherwise seems really surprising (on a data set the model never gets to see, you wouldn't expect the losses to ever be better), the reason is that when we look at the validation set, we turn off dropout.

In other words, when you're doing inference, when you're trying to say "is this a cat or is this a dog", we certainly don't want to be including random dropout; we want to be using the best model we can. That's why, early in training in particular, we actually see that our validation accuracy and loss tend to be better if we're using dropout. Question: do you have to do anything to accommodate the fact that you are throwing away some of the activations? That's a great question. We don't, but PyTorch does. Behind the scenes, if you say p=0.5, PyTorch does two things: it throws away half of the activations, but it also doubles all the activations that are still there, so on average the activations don't change, which is a pretty neat trick.
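(A minimal PyTorch sketch of that behaviour, not from the lecture itself: nn.Dropout zeroes activations at random during training and rescales the survivors by 1/(1-p), and does nothing in eval mode.)

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # each activation is zeroed with probability 0.5
x = torch.ones(8)

drop.train()               # training mode: mask and rescale
print(drop(x))             # e.g. tensor([2., 0., 2., 2., 0., 0., 2., 2.]) -- survivors doubled

drop.eval()                # inference mode: dropout is a no-op
print(drop(x))             # tensor([1., 1., 1., 1., 1., 1., 1., 1.])
```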

So you don't have to worry about it; it's basically done for you. You can pass in ps: this is the p value for all of the added layers, telling fastai what dropout you want on each of these added layers. It won't change the dropout in the pre-trained network; the hope is that that's already been trained with some appropriate level of dropout, so we don't change it. But for the layers that we add, you can say how much, and you can see here that I said ps=0.5, so my first dropout has 0.5 and my second dropout has 0.5. And remember, coming into the input of this was the output of the last convolutional layer of the pre-trained network.

So we actually throw away half of that before we even start, go through our linear layer, throw away the negatives, throw away half of the result of that, go through another linear layer, and then pass it to our softmax. For minor numerical precision reasons, it turns out to be better to take the log of the softmax rather than the softmax directly, and that's why you'll have noticed that when you get predictions out of our models, you always have to take np.exp of the predictions. Again, the details as to why aren't important.
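(A hedged sketch of what that looks like with a fastai 0.7-style learner; `learn` is the learner object assumed from earlier in the lesson, and its `predict()` is taken to return the log_softmax outputs for the validation set.)

```python
import numpy as np

log_preds = learn.predict()   # log-probabilities, shape (n_images, n_classes); `learn` assumed from above
probs = np.exp(log_preds)     # undo the log to get probabilities
print(probs.sum(axis=1)[:5])  # each row should sum to roughly 1.0
```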

So if we want to try removing dropout, we can pass ps=0, and you'll see that whereas before we started with 0.76 accuracy in the first epoch, now we've got 0.8 accuracy in the first epoch. By not doing dropout, our first epoch worked better, not surprisingly, because we're not throwing anything away; but by the third epoch, before we had 84.8 and here we have 84.1, so it started out better and ended up worse.

Even after three epochs you can already see we're massively overfitting: we've got 0.3 loss on the training set and 0.5 loss on the validation set. And if you look at the resulting model, there's no dropout at all: if p is zero, we don't even add it to the model. Another thing to mention is that you might have noticed we've been adding two linear layers in our additional layers.

You don't have to do that. There's actually a parameter (xtra_fc, for extra fully connected layers) where you can pass a list saying how big you want each of the additional fully connected layers to be. By default you need at least one, because you need something that takes the output of the convolutional layer, which in this case is of size 1,024, and turns it into the number of classes you have: cats versus dogs would be two, dog breeds would be 120, the Planet satellite data would be 17, whatever. So you always need at least one linear layer, and you can't pick how big that one is; that's defined by your problem. But you can choose what the other sizes are, or whether they happen at all. If we pass in an empty list, we're saying don't add any additional linear layers, just the one we have to have.

So here, with ps=0 and the extra fully connected layers list empty, this is about the minimum possible top model we can put on top. And if we do that, you can see above that we actually end up, in this case, with a reasonably good result, because we're not training it for very long and this particular pre-trained network is really well suited to this particular problem.

Question: Jeremy, what kind of p should we be using by default? The defaults are 0.25 for the first layer and 0.5 for the second layer, and that seems to work pretty well for most things, so you don't necessarily need to change it at all. Basically, if you find it's overfitting, just start bumping it up.

Try setting it to 0.5 first, which sets them both to 0.5; if it's still overfitting a lot, try 0.7. You can narrow it down; there aren't that many numbers to try. If you're underfitting, you can try making it lower, though it's unlikely you'd need to make it much lower, because even in these dogs-versus-cats situations we don't seem to have to; it's more likely you'd be increasing it to 0.6 or 0.7. But you can fiddle around; I find the defaults seem to work pretty well most of the time. One place I actually did increase it was in dog breeds, where I set them both to 0.5 when I used a bigger model.

A ResNet-34 has fewer parameters, so it doesn't overfit as much, but when I started bumping up to something like a ResNet-50, which has a lot more parameters, I noticed it started overfitting, so I also increased my dropout. As you use bigger models, you'll often need to add more dropout. Can you pass that over there, please?

Question: if you set p to 0.5, roughly what percentage is dropped? 50%. Question: is there a particular way to determine whether the data is being overfitted? You can see here that the training loss is much lower than the validation loss, but you can't tell whether it's too overfitted; zero overfitting is not generally optimal. Remember, the only thing you're trying to do is get the validation loss low, so in the end you have to play around with a few different things and see what ends up getting the validation loss low, and you'll get a feel over time, for your particular problem, for what too much overfitting looks like.

Great, so that's dropout, and we're going to be using it a lot; remember it's there by default.

There's another question here, two questions actually. First: when you say the dropout rate is 0.5, does it delete each cell with a probability of 0.5, or does it just pick 50% of them randomly? It's the former, although they're effectively much the same. Second question: why does the average activation matter? It matters because, if you look at the Excel spreadsheet, the result of this cell, for example, is equal to these nine values multiplied by these nine weights and added up. So if we deleted half of them, that would also roughly halve this number, which would cause everything after it to change. We'd be changing the meaning of everything: something that used to say "this is a fluffy ear if this is greater than 0.6" would now say it's fluffy if it's greater than 0.3. The goal is to delete things without changing the meaning. Question: we're using a linear layer for one of the earlier added layers; why linear? Because that's what this set of layers is: the pre-trained network is the convolutional part, and that's pre-computed, so we don't see it. What it spits out is a vector, so the only choice we have at this point is linear layers.

Question: can we have different levels of dropout by layer, and how do we do that? Yes, absolutely, and that's why the argument is actually called ps: you can pass in an array. So if I passed ps=[0, 0.2], and for the extra fully connected layers I added one of size 512, that would mean zero dropout before the first of them and 0.2 dropout before the second of them.
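(A hedged sketch of that call in fastai 0.7 style; `arch` and `data` are the architecture and image data object assumed from the earlier dog-breeds setup, and the exact keyword names may differ between library versions.)

```python
from fastai.conv_learner import *   # fastai 0.7-style import used in the course

learn = ConvLearner.pretrained(
    arch, data,                     # pre-trained architecture and data object (assumed defined earlier)
    ps=[0.0, 0.2],                  # dropout before each added fully connected layer
    xtra_fc=[512],                  # one extra hidden linear layer of size 512
    precompute=True)
```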

And I must admit, even after doing this for a few years, I don't have a great intuition for when earlier or later layers should have different amounts of dropout. It's still something I play with, and I can't quite find rules of thumb, so if some of you come up with good rules of thumb, I'd love to hear about them. If in doubt, you can use the same dropout in every fully connected layer.

The other thing you can try is what many people do: only put dropout on the very last linear layer. Those would be the two things to try. Question: Jeremy, why do you monitor the loss instead of the accuracy?

Because the loss is the only thing we can see for both the validation set and the training set, so it's nice to be able to compare them. Also, as we'll learn about later, the loss is the thing we're actually optimizing, so it's a little easier to monitor it and understand what it means. Can you pass the microphone over there?

Question: with dropout we're adding some random noise every iteration, so does that mean we don't learn as much, and do we have to play around with the learning rate? It doesn't seem to impact the learning rate enough that I've ever noticed it. You're probably right that in theory it might, but not enough that it's ever affected me.

I I would say you're probably right in theory it might but not enough that it's ever affected me Okay, so let's talk about this Structured data problem and so to remind you we were looking at Kaggles Rossman competition Which is a German? Chain of supermarkets, I believe and you can find this in lesson 3 Rossman and The main data set is the one where we were looking to say at a particular store How much did they sell?

Okay, and there's a few big key pieces of information one is what was the date another was were they open? Did they have a promotion on? Was it a holiday in that state? And was it a holiday as for school a state holiday there? Or was it a school holiday there and then we had some more information about stores like what for this store?

What kind of stuff did they tend to sell what kind of store are they how far away the competition and so forth so? With the data set like this there's really two main kinds of column. There's columns that we think of as Categorical they have a number of levels so the assortment Column is categorical, and it has levels such as a B and C Where else something like competition distance we would call continuous It has a number attached to it where differences or ratios even if that number have some kind of meaning And so we need to deal with these two things quite differently, okay, so anybody who's done any Machine learning of any kind will be familiar with using continuous columns if you've done any linear regression for example You can just like multiply them by parameters for instance Categorical columns we're going to have to think about a little bit more We're not going to go through the data cleaning we're going to assume that that's in feature engineering we're going to assume all that's been done And so basically at the end of that we have a list of columns and the in this case I Didn't do any of the thinking around the feature engineering or data cleaning myself This is all directly from the third place winners of this competition And so they came up with all of these different Columns that they found useful and so You'll notice the list here is a list of the things that we're going to treat as categorical variables Numbers like year a month and day Although we could treat them as continuous like they the differences between 2000 and 2003 is meaningful We don't have to right and you'll see shortly how how categorical variables are treated But basically if we decide to make something a categorical variable what we're telling our neural net down the track is That for every different level of say year, you know, 2000 2001 2002 you can treat it totally differently Where else if we say it's continuous, it's going to have to come up with some kind of like function some kind of smooth ish function right and so often even for things like year that actually are continuous But they don't actually have many distinct levels it often works better To treat it as categorical So another good example day of week, right?

Day of week, between nought and six, is a number, and it means something: the difference between three and five is two days, which has meaning. But if you think about how sales in a store vary by day of week, it could well be that Saturdays and Sundays are over here, Fridays are over here, and Wednesdays are over here; each day is going to behave qualitatively differently.

So by saying this is a categorical variable, as you'll see, we're going to let the neural net do that. This choice of which variables are continuous and which are categorical is, to some extent, a modelling decision you get to make. If something is coded in your data as a, b and c, or as names like "Jeremy" or whatever, you're going to have to treat it as categorical; there's no way to treat that directly as a continuous variable.

On the other hand, if it starts out as a continuous variable, like age or day of week, you get to decide whether to treat it as continuous or categorical. To summarize: if it's categorical in the data, it has to be categorical in the model; if it's continuous in the data, you get to pick whether to make it continuous or categorical in the model. In this case, again, I just did whatever the third-place winners of this competition did: these are the ones they decided to use as categorical, and these are the ones they decided to use as continuous. You can see that the continuous ones are basically all the ones that are actual floating-point numbers: CompetitionDistance has a decimal place to it, temperature has a decimal place to it. These would be very hard to make categorical, because they have many, many levels; with five digits of floating point there could potentially be nearly as many levels as there are rows. By the way, the word we use for how many levels are in a category is cardinality.

So if you hear me say cardinality: for example, the cardinality of the day-of-week variable is seven, because there are seven different days of the week. Question: do you have a heuristic for when to bin continuous variables, or do you ever bin them?

I don't bin continuous variables. One thing we could do with, say, max temperature is group it into 0 to 10, 10 to 20, 20 to 30 and call that categorical. Interestingly, a paper just came out last week in which a group of researchers found that sometimes binning can be helpful. But that literally came out in the last week, and until then I hadn't seen anything in deep learning saying so; I haven't looked at it myself yet, and until this week I would have said it's a bad idea.

I would have said it's a bad idea Now I have to think differently. I guess maybe it is sometimes So if you're using Year as a category what happens when you run the model on a year? It's never seen so you trained it in Well, we'll get there. Yeah, the short answer is it'll be treated as an unknown category And so plan does which is the underlying data frame thing?

We're using with categories as a special category called unknown and if it sees a category it hasn't seen before it gets treated as unknown So for our deep learning model unknown would just be another category If our data set training the data set doesn't have a category and Test has unknown.

How will it be? It'll just be part of this unknown category. Well, it's still predict It'll predict something right like it will just have the value 0 behind the scenes and if there's been any unknowns of any kind in the training set then it'll have learned a Way to predict unknown if it hasn't it's going to have some random vector.

And so that's a Interesting detail around training that we probably won't talk about in this part of the course But we can certainly talk about on the forum Okay, so we've got our categorical and continuous variable lists to find in this case there was a 800,000 rows So 800,000 dates basically by stores And so you can now take all of these columns loop through each one and Replace it in the data frame with a version where you say take it and change its type to category Okay, and so that just that's just a pandas thing.

So I'm not going to teach you pandas There's plenty of books particularly Wes McKinney's books book on Python for data analysis is great But hopefully it's intuitive as to what's going on even if you haven't seen the specific syntax before So we're going to turn that column into a categorical column And then for the continuous variables, we're going to make them all 32-bit floating-point and for the reason for that is that PyTorch Expects everything to be 32-bit floating-point.

Some of these include one/zero things, like whether there was a promo or whether it was a holiday, and those will become the floating-point values one and zero. Now, I try to do as much of my work as possible on small data sets. When I'm working with images, that generally means resizing the images to something like 64 by 64 or 128 by 128. We can't do that with structured data, so instead I tend to take a sample: I randomly pick a subset of rows.

So I start by running with a sample, and I can use exactly the same thing we've seen before for getting a validation set: the same way of getting some random row numbers to use in a random sample. So this is just a bunch of random numbers, and it gives us a size of 150,000 rather than 840,000.

Before I go any further, my data basically looks like this: you can see I've got some booleans here, some integers of various different scales (there's my year, 2014), and some letters here. So even though I said "please make that a pandas category", pandas still displays it in the notebook as strings; it's just stored differently internally. Then the fastai library has a special little function called proc_df (process data frame). proc_df takes a data frame, and you tell it what your dependent variable is.

What's my dependent variable? Right, and it does a few different things The first thing is it pulls out that dependent variable and puts it into a separate variable Okay, and deletes it from the original data frame So DF now does not have the sales column in where else y just contains the sales column Something else that it does is it does scaling?

so neural nets Really like to have the input data to all be somewhere around zero with a standard deviation of somewhere around one all right, so we can always take our data and Subtract the mean and divide by the standard deviation to make that happen So that's what do scale equals true does and it actually returns a special object Which keeps track of what mean and standard deviation did it use for that normalizing?

So you can then do the same thing to the test set later It also handles missing values so missing values and categorical variables just become the ID 0 and then all the other categories become 12345 for that categorical variable for continuous variables that replaces the missing value with the median And creates a new column That's a Boolean and just says is this missing or not and I'm going to skip over this pretty quickly because we talk about this In detail in the machine learning course, okay, so if you've got any questions about this part That would be a good place to go.

It's nothing deep learning specific there So you can see afterwards year 2014 For example has become year 2 ok because these categorical variables have all been replaced with With contiguous integers starting at 0 Right and the reason for that is later on we're going to be putting them into a matrix Right and so we wouldn't want the matrix to be 2014 rows long when it could just be 2 rows long so that's the basic idea there, and you'll see that the AC for example has been replaced in the same way with 1 and 3 Okay, so we now have a data frame Which does not contain the dependent variable and where everything is a number okay?
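(A hedged sketch of that step using the fastai 0.7 structured-data helper; `joined_samp` is the sampled data frame assumed from above, and the exact return signature may vary by library version.)

```python
import numpy as np
from fastai.structured import proc_df   # fastai 0.7 helper (module path assumed)

# Pull out the dependent variable, numericalise categoricals (0 reserved for
# missing/unknown), fill continuous NAs with the median plus an is-missing column,
# and scale the continuous columns, returning a mapper to reuse on the test set.
df, y, nas, mapper = proc_df(joined_samp, 'Sales', do_scale=True)
yl = np.log(y)   # log the dependent variable (see the metric discussion below)
```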

That's where we need to get to in order to do deep learning, and everything up to that point we talk about in detail in the machine learning course; there's nothing deep-learning-specific about any of it. This is exactly what we throw into our random forests as well. Another thing we talk about a lot in the machine learning course is validation sets. In this case we need to predict the next two weeks of sales: it's not a random subset of sales, it's specifically the next two weeks, because that's what the Kaggle competition folks told us to do.

Therefore I'm going to create a validation set which is the last two weeks of my training set, to try to make it as similar to the test set as possible. Rachel actually wrote a piece last week about creating validation sets, so if you go to fast.ai you can check it out, and we'll put it in the lesson wiki as well. It's basically a written summary of a recent machine learning lesson we did; the videos are available for that too. Rachel has spent a lot of time thinking about how you need to think about validation sets, training sets and test sets, and that's all there.

But again, there's nothing deep-learning-specific there, so let's get straight to the deep learning action. In this particular competition, as with any competition or any machine learning project, you really need to make sure you have a strong understanding of your metric: how are you going to be judged?

In this case Kaggle makes it easy: they tell us we're going to be judged on the root mean squared percentage error. So we say, for example, you predicted three, it was actually 3.3, so you were about ten percent out, and then we take the root mean square of all those percentages. And remember, I warned you that you're going to need to know logarithms really well: in this case the thing we basically care about is the ratio of the prediction to the actual.

We don't have a metric in PyTorch called root mean squared percentage error. We could easily create one (if you look at the source code, it's a line of code), but easier still is to realise that if you replace a with log(a) and b with log(b), then the ratio a/b can be replaced with a subtraction, log(a) minus log(b). That's just the rule of logs, and if you don't know that rule, make sure you go look it up, because it's super helpful. In this case it means all we need to do is take the log of our data, which I actually did earlier in this notebook, and then getting the root mean squared error on the logs will give us the root mean squared percentage error essentially for free.

But when we want to print out our root mean squared percentage error, we have to take e to the power of it again, and then we can return the percentage difference. That's all that's going on here; again, it's not really deep learning specific at all.
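(A sketch of the metric logic just described. The competition metric is the root mean squared percentage error, and because log(a/b) = log(a) - log(b), training on log(sales) and measuring plain RMSE in log space approximates it; to report it we exponentiate back first.)

```python
import math
import numpy as np

def exp_rmspe(y_pred, targ):
    # predictions and targets are in log space, so undo the log first
    targ, y_pred = np.exp(targ), np.exp(y_pred)
    pct_var = (targ - y_pred) / targ        # percentage error per row
    return math.sqrt((pct_var ** 2).mean())
```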

We've looked at so far. Which is first we create a model data object Something that has a validation set Training set an optional test set built into it from that we will get a learner we will then Optionally called learner dot LR find will then called learner dot fit It'll be all the same parameters and everything that you've seen many times before okay So the difference though is obviously we're not going to go Image classifier data dot from CSV or dot from paths we need to get some different kind of model data And so for stuff that is in rows and columns We use columnar model data Okay, but this will return an object with basically the same API that you're familiar with and rather than from paths Or from CSV this is from data frame, okay, so this gets past a few things The path here is just used for it to know where should it store?

Like model files or stuff like that right this is just basically saying where do you want to store anything that you save later? This is the list of the indexes of the rows that we want to put in the validation set we created earlier Here's our data frame okay, and then Let's have a look here's this is where we did the log right so I talked the The Y that came out of property F our dependent variable.

I logged it and I call that YL Right so we tell it When we create our model data we need to tell it that's our dependent variable So so far we've got list of the stuff to go in the validation set which is what's our independent variables? What's our dependent variables and then we have to tell it which things do we want treated as categorical right?

Because remember by this time Everything's a number Right so it could do the whole thing as if it's continuous it would just be totally meaningless Right so we need to tell it which things do we want to treat as categories and so here we just pass in That list of names that we use before okay, and then a bunch of the parameters are the same as the ones you're used to for example you can set the batch size Yeah, so after we do that.

We've got a You know a standard Model data object with a trend train DL Attribute there's a vowel DL attribute a train DS attribute of our DS attribute. It's got a length It's got all this stuff Exactly like it did in all of our image based data objects Okay, so now we need to create the the model or create the learner and so to skip ahead a little bit We're basically going to pass in something that looks pretty familiar We're going to be passing saying from our model from our model data Create a learner that is suitable for it And we'll basically be passing in a few other bits of information which will include How much dropout to use at the very start?

How many how many activations to have in each layer how much dropout to use at the at the later layers? But then there's a couple of extra things that we need to learn about and specifically it's this thing called embeddings So this is really the key new concept we have to learn about all right, so All we're doing basically is we're going to take our Let's forget about categorical variables for a moment and just think about the continuous variables For our continuous variables all we're going to do Is we're going to grab them all Okay, so for our continuous variables, we're basically going to say like okay, here's a big list of all of our continuous variables like the minimum temperature and maximum temperature and the distance to the nearest competitor and so forth right and so here's just a bunch of floating point numbers and so basically what the neural nets going to do is it's going to take that that 1d array or Or vector or to be very DL like rank 1 tensor All means the same thing okay, so we're going to take our rank 1 tensor And let's put it through a matrix multiplication, so let's say this has got like I don't know 20 continuous variables, and then we can put it through a matrix which Must have 20 rows.

That's how much this multiplication works, and then we can decide how many columns we want right So maybe we decided 100 right and so that matrix multiplication is going to spit out a new length 100 rank 1 tensor Okay, that's that's what that's what a linear. That's what a matrix product does and that's the definition of a linear layer in deep length Okay, and so then the next thing we do is we can put that through a relu right which means we throw away the negatives Okay, and now we can put that through another matrix product.

Okay, so this is going to have to have a hundred rows by definition And we can have as many columns as we like and so let's say maybe this was The last layer so the next thing we're trying to do is to predict sales So there's just one value, we're trying to predict the sales so we could put it through a Matrix product that just had one column and that's going to spit out a single number All right, so that's like That's kind of like a one layer Neural net if you like now in practice, you know we wouldn't make it one layer so we'd actually have like You know, maybe we'd have 50 here and so then that gives us a 50 long vector and then Maybe we then put that into our final 50 by 1 And that spits out a single number and one reason I wanted to change that there was to point out, you know, relu You would never put relu in the last layer Like you'd never want to throw away the negatives because that the softmax The softmax Needs negatives in it because it's the negatives that are the things that allow it to create low probabilities That's minor detail, but it's useful to remember.

Okay, so basically So basically a simple view of a Fully connected neural net is something that takes in as an input a rank one tensor it's bits it's through a linear layer an Activation layer another linear layer Softmax and That's the output Okay, and so we could obviously decide to add more Linear layers we could decide maybe to add dropout Right.

So these are some of the decisions that we we get to make right but we there's not that much we can do Right. There's not much really crazy architecture stuff to do. So when we come back to Image models later in the course We're going to learn about all the weird things that go on and like res nets and inception networks and blah blah blah But in these fully connected networks, they're really pretty simple.

They're just interspersed linear layers that is matrix products and Activation functions like value and a softmax at the end And if it's not classification which actually ours is not classification in this case. We're trying to predict sales There isn't even a softmax Right, we don't want it to be between 0 and 1 Okay, so we can just throw away the last activation all together If we have time we can talk about a slight trick we can do there but for now we can think of it that way So that was all assuming that everything was continuous, right?
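(A minimal PyTorch sketch of the fully connected net just described, for 20 continuous inputs; there's no softmax at the end because we're predicting a single continuous value.)

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 100),   # 20 continuous inputs -> 100 activations (a matrix multiply)
    nn.ReLU(),            # throw away the negatives
    nn.Linear(100, 1),    # -> one predicted value (sales), no final activation
)
```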

But what about categorical, right? So we've got like Day of week right and We're going to treat it as categorical, right? So it's like Saturday Sunday Monday Six Friday okay, how do we feed that in because I want to find a way of getting that in so that we still end up with a rank one tensor of floats and so the trick is this we create a new little matrix of With seven rows And as many columns as we choose right so let's pick four right so here's our Seven rows and four columns Right and basically what we do is let's add our categorical variables to the end.

So let's say the first row was Sunday Right then what we do is we do a lookup into this matrix and we say oh here's Sunday We do a lookup into here and we grab This row and so this matrix we basically fill with floating point numbers. So we're going to end up grabbing a little Subset of four floating point numbers.

It's Sunday's particular for floating point numbers And so that way we convert Sunday Into a rank one tensor of four floating point numbers and initially those four numbers are random Right and in fact this whole thing we initially start out random, okay But then we're going to put that through our neural net, right?

So we basically then take those four numbers and we remove Sunday instead we add Our four numbers on here, right? So we've turned our categorical thing into a floating point vector Right and so now we can just put that through our neural net just like before and at the very end we find out the loss and then we can figure out which direction is down and Do gradient descent in that direction and eventually that will find its way back To this little list of four numbers and it'll say okay those random numbers weren't very good This one needs to go up a bit that one needs to go up a bit that one needs to go down a bit That one needs to go up a bit and so we'll actually update our original those four numbers in that matrix and We'll do this again and again and again And so this this matrix will stop looking random and it will start looking more and more like like The exact four numbers that happen to work best for Sunday the exact four numbers that happen to work best for Friday and so forth And so in other words this matrix is just another bunch of weights in our neural net All right, and so matrices of this type are called embedding matrices So an embedding matrix is something where we start out with an integer between zero and the maximum number of levels of that category We literally index into a matrix to find our particular row So if it was the level was one we take the first row we grab that row and we append it to all of our continuous variables and So we now have a new Vector of continuous variables and when we can do the same thing for let's say zip code Right, so we could like have an embedding matrix.

Let's say there are 5,000 zip codes It would be 5,000 rows long as wide as we decide maybe it's 50 wide and so we'd say okay. Here's nine four zero zero three That zip code is index number four in our matrix So go down and we find the fourth row regret those 50 numbers and append those Onto our big vector and then everything after that is just the same.
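(A minimal PyTorch sketch of that lookup: a day-of-week embedding with 8 rows, 7 days plus one for unknown, and 4 columns, initialised randomly and learned like any other weight.)

```python
import torch
import torch.nn as nn

emb = nn.Embedding(8, 4)       # 8 categories, 4-dimensional embedding
sunday = torch.tensor([0])     # an integer code, not a one-hot vector
print(emb(sunday))             # the four floats for "Sunday" (shape (1, 4))
```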

We just put it through a linear layer value linear layer, whatever What are those four numbers Represent that's a great question and we'll learn more about that when we look at collaborative filtering for now They represent no more or no less than any other parameter in our neural net, you know, they're just They're just parameters that we're learning that happen to end up giving us a good loss We will discover later that these particular parameters often However, our human interpretable and quite can be quite interesting, but that's a side effect of them.

It's not Fundamental they're just four random numbers for now that we're that we're learning or sets of four random numbers To have a good heuristic for the dimensionality of the embedding matrix, so why four here sure do So What I first of all did was I made a little list of every categorical variable and its cardinality Okay, so there they allow so there's a hundred and that's a thousand plus different stores apparently in Rossman's network There are eight days of the week That's because there are seven days of the week plus one left over four unknown Even if there were no missing values in the original data I always still set aside one just in case there's a missing or an unknown or something different in the test set Again four years, but there's actually three plus room for an unknown and so forth.

Alright, so what I do My rule of thumb is this Take the cardinality of the variable Divide it by two But don't make it bigger than 50 Okay, so These are my embedding matrices. So my store matrix. So the that has to have a thousand one hundred and sixteen rows because I need to look up right to find his store number three and then it's going to return back a Rank one tensor of length 50 Day of week it's going to look up into which one of the eight and return the thing of length four So would you typically build an embedding matrix for each categorical feature?

Yes. Yeah, so that's what I've done here So I've said For C in categorical variables See how many categories there are and then for each of those things create one of these and Then this is called embedding sizes And then you may have noticed that that's actually the first thing that we pass to get learner And so that tells it for every categorical variable.
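(A hedged sketch of that list comprehension; joined_samp and cat_vars are the data frame and categorical column list assumed from earlier.)

```python
# Cardinality of each categorical variable, plus one spare level for unknown...
cat_sz = [(c, len(joined_samp[c].cat.categories) + 1) for c in cat_vars]
# ...then the rule of thumb: embedding width = cardinality // 2, capped at 50.
emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]
# e.g. Store -> (1116, 50), DayOfWeek -> (8, 4)
```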

That's the embedding matrix to use for that variable That is behind you doesn't yes So besides Random initialization are there other ways to actually initialize embedding? Yes or no, there's two ways one is random the other is pre-trained and We'll probably talk about pre-trained more later in the course But the basic idea though is if somebody else at Rossman had already trained a neural net just like you you would use a pre-trained net from image net to look at pictures of cats and dogs if Somebody else has pre-trained a network to predict cheese sales in Rossman You may as well start with their embedding matrix of stores to predict liquor sales in Rossman And this is what happens for example at At Pinterest and Instacart they both use this technique Instacart uses it for routing their shoppers Pinterest uses it for deciding what to display on a web page when you go there and they have embedding matrices of products In Instacart's case of stores that get shared in the organization so people don't have to train new ones So for the embedding size Why wouldn't you just use like the one hot scheme and just Well, what is the advantage of doing this?

As opposed to just doing a lot of questions. So so we could easily as you point out have Instead of passing in these four numbers. We could instead have passed in seven numbers all zeros, but one of them is a one and that also is a list of floats and That would totally work and that's how Generally speaking categorical variables have been used in statistics for many years.

It's called dummy variable coding The problem is that in that case? the concept of Sunday Could only ever be associated with a single floating-point number Right, and so it basically gets this kind of linear behavior. It says like Sunday is more or less of a single thing Yeah, well, it's not just interactions.

It's saying like now Sunday is a concept in four-dimensional space Right. And so what we tend to find happen is that these Embedding vectors tend to get these kind of rich semantic concepts. So for example if it turns out that Weekends Kind of have a different behavior You'll tend to see that Saturday and Sunday will have like some particular number higher or more likely it turns out that certain days of the week are associated with higher sales of Certain kinds of goods that you kind of can't go without I don't know like gas or milk say Where else there might be other products?

like like wine, for example Like wine that tend to be associated with like the days before weekends or holidays, right? So there might be kind of a column which is like To what extent is this day of the week? Kind of associated with people going out You know, so basically yeah by by having this higher dimensionality vector rather than just a single number It gives the deep learning Network a chance to learn these rich Representations and so this idea of an embedding is actually what's called a distributed representation It's kind of the fun most fundamental concept of neural networks It's this idea that a concept in a neural network has a kind of a high dimensional Representation and often it can be hard to interpret because the idea is like each of these Numbers in this vector doesn't even have to have just one meaning You know It could mean one thing if this is low and that one's high and something else if that one's high and that one's low Because it's going through this kind of rich nonlinear Function right and so it's this It's this rich representation that allows it to learn such such such interesting Relationships Kind of oh another question.

Sure. I'll speak louder. So are there Is an embedding so I get the the fundamental of the like the word vector word to Vic vector algebra You can run on this but are they embedding suited suitable for certain types of variables? Like are are these only suitable for? Are there different categories that that the embeddings are suitable for an embedding is suitable for any categorical variable?

Okay, so so the only thing it it can't really work Well at all for would be something that is too high cardinality So like in other words, we had likes whatever it was 600,000 rows if you had a variable with 600,000 levels That's just not a useful categorical variable you could bucketize it I guess But yeah in general like you can see here that the third place getters in this competition Really decided that everything that was not too high cardinality They put them all as categorical variables and I think that's a good rule of thumb You know if you can make it a categorical variable you may as well because that way it can learn this rich distributed representation Or else if you leave it as continuous, you know, the most it can do is to kind of try and find a You know a single functional form that fits it well after question, so You were saying that you are kind of increasing the dimension But actually in in most cases we will use a one-holding calling which has even a bigger dimension That so so in a way you are also Reducing but in the most rich.

I think that's that's that's fair. Yeah. Yeah it like Yes, you know you can think of it as one hot encoding which actually is high dimensional, but it's not Meaningfully high dimensional because everything except one is zero I'm saying that also because even this will reduce the amount of memory and things like this that you have to write This is better.

You're absolutely right. So we may as well go ahead and describe what's going on with the matrix algebra behind the scenes — if this doesn't quite make sense you can skip over it, but for some people it really helps. Say we started with "Sunday": we could represent that as a one-hot encoded vector, with a one in Sunday's position and zeros everywhere else. Then we've got our embedding matrix, with eight rows and, in this case, four columns. One way to think of the lookup is as a matrix product. I said you can think of it as looking up index one in the array, but that is actually identical to multiplying the one-hot encoded vector by the embedding matrix: zero times this row, one times this row, zero times this row, and so on. So a one-hot vector times an embedding matrix is identical to doing a lookup. Some people in the bad old days actually implemented embedding matrices by doing a one-hot encoding and then a matrix product — and a lot of machine learning methods still do — but as was just alluded to, that's terribly inefficient. So all of the modern libraries implement this as "take an integer and do a lookup into an array". The nice thing about realizing it is mathematically a matrix product, though, is that it makes it more obvious how the gradients are going to flow: when we do stochastic gradient descent, we can think of it as just another linear layer.
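Here is a minimal sketch of that equivalence in PyTorch (this is illustrative, not code from the lesson notebook): an embedding lookup gives exactly the same result as multiplying a one-hot vector by the embedding matrix.

```python
# One-hot matrix product vs. embedding lookup: same result, lookup is just faster.
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=8, embedding_dim=4)  # 8 levels, 4-dimensional vectors

idx = torch.tensor([1])                  # "Sunday" encoded as index 1
lookup = emb(idx)                        # the efficient way: an array lookup

one_hot = torch.zeros(1, 8)
one_hot[0, 1] = 1.0                      # one-hot vector with a 1 in position 1
matmul = one_hot @ emb.weight            # the slow way: a matrix product

print(torch.allclose(lookup, matmul))    # True -- so gradients flow just as for a linear layer
```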

Okay — that's a somewhat minor detail, but hopefully for some of you it helps.

Could you touch on using dates and times as categoricals, and how that affects seasonality?

Yeah, absolutely — great question. Did I cover dates at all last week? No? Okay. I cover dates in a lot of detail in the machine learning course, but it's worth briefly mentioning here. There's a fastai function called add_datepart which takes a data frame and a column name; that column needs to be a date. Unless you pass drop=False, it removes the column from the data frame and replaces it with lots of columns representing all the useful information about that date: day of week, day of month, month of year, year, is it the start of a quarter, is it the end of a quarter — basically everything pandas gives us. That way, when we look at our list of features, you can see them here: year, month, week, day, day of week, and so on. These all get created for us by add_datepart, so we end up with, for example, an eight-row by four-column embedding matrix for day of week. Conceptually that allows our model to create some pretty interesting time series models: if there's something with a seven-day cycle that goes up on Mondays and down on Wednesdays, but only for dairy and only in Berlin, it can totally do that — it has all the information it needs. This turns out to be a really fantastic way to deal with time series, so I'm glad you asked. You just need to make sure the cycle indicator in your time series exists as a column: if you didn't have a column called day of week, it would be very difficult for the neural network to learn to divide mod 7 and then look that up in an embedding matrix. Not impossible, but really hard, it would use lots of computation, and it wouldn't do it very well.
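Roughly, the add_datepart call looks like this (a sketch of the 2018-era fastai.structured API; the column names here are made up for illustration):

```python
# Expand a date column into its component parts so periodicity is available to the model.
import pandas as pd
from fastai.structured import add_datepart

df = pd.DataFrame({'Date': pd.to_datetime(['2015-07-31', '2015-08-01']),
                   'Sales': [5263, 6064]})

add_datepart(df, 'Date')   # drops 'Date' (unless drop=False) and adds Year, Month, Week,
                           # Day, Dayofweek, Is_quarter_start, Is_quarter_end, Elapsed, ...
print(df.columns)
```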

An example of the kind of thing you need to think about is holidays — or, if you were modelling sales of beverages in San Francisco, you'd probably want a column for when the ball game is on at AT&T Park, because that's going to impact how many people are drinking beer in SoMa. So you need to make sure the basic indicators or periodicities are there in your data, and as long as they are, the neural net is going to learn to use them. I'm trying to skip over some of the non-deep-learning parts. The key thing here is that we've got our model data that came from the data frame, and we tell it how big to make the embedding matrices. We also have to tell it which of the columns in that data frame are categorical and which are continuous.

The actual parameter is the number of continuous variables: you can see we just pass in how many columns there are minus how many categorical variables there are. That way the neural net knows how to create something that puts the continuous variables over here and the categorical variables over there. The embedding matrix has its own dropout — so this first one is the dropout applied to the embedding matrix. Then there's the number of activations in the first linear layer, the number of activations in the second linear layer, the dropout in the first linear layer, and the dropout in the second linear layer. This bit we won't worry about for now, and then finally there's how many outputs we want to create — the output of the last linear layer — which is obviously one, because we want to predict a single number: sales. After that we have a learner, so we can call lr_find, get the standard-looking shape, pick the learning rate we want to use, and then go ahead and start training using exactly the same API we've seen before.
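Pulling those arguments together, the learner setup looks roughly like this (a sketch of the fastai 0.7-era API; variable names and the exact sizes/dropout values here are approximations, not copied from the notebook):

```python
# Structured-data learner: embeddings for categoricals, raw values for continuous columns.
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, y, cat_flds=cat_vars, bs=128)

# one (cardinality, embedding-dim) pair per categorical variable
emb_szs = [(c, min(50, (c + 1) // 2)) for c in cat_cardinalities]

m = md.get_learner(emb_szs,
                   n_cont=len(df.columns) - len(cat_vars),  # continuous = everything else
                   emb_drop=0.04,          # dropout on the embedding matrices
                   out_sz=1,               # one output: (log) sales
                   szs=[1000, 500],        # activations in the two linear layers
                   drops=[0.001, 0.01])    # dropout for each linear layer
m.lr_find()
m.fit(1e-3, 3, metrics=[exp_rmspe])        # exp_rmspe is the custom metric described next
```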

So this is all identical. One thing you may not have seen before: you can pass in custom metrics. All this does is say "please print out a number at the end of every epoch by calling this function" — a function we defined a little earlier, the root mean squared percentage error, which first takes the exponent of our sales because our sales were originally logged. It doesn't change the training at all; it's just something to print out. So we train that for a while, and we have some benefits that the people who originally built this didn't have — specifically things like cyclic learning rates and stochastic gradient descent with restarts.
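For reference, here is a hedged reconstruction of that metric — exponentiate to undo the log, then compute root mean squared percentage error. The names `learn` and `lr` are assumptions for the surrounding call, not taken from the transcript:

```python
import numpy as np

def exp_rmspe(y_pred, targ):
    # predictions and targets are log(sales), so undo the log first
    pct_var = (np.exp(targ) - np.exp(y_pred)) / np.exp(targ)
    return np.sqrt((pct_var ** 2).mean())

learn.fit(lr, 3, metrics=[exp_rmspe])   # printed after every epoch; doesn't affect training
```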

So it's interesting to compare. Although our validation set isn't identical to the test set, it's very similar — a two-week period at the end of the training data — so our numbers should be comparable. We get 0.097; compare that to the public leaderboard. Actually, that's interesting: there's a big difference between the public and private leaderboards — this would have been right at the top of the private leaderboard, but only in the top 30 or 40 on the public leaderboard. So I'm not quite sure, but you can see we're certainly at the top end of this competition. I actually tried running the third-place getters' code, and their final result was over 0.1, so I think we should really be comparing against the private leaderboard. Anyway, you can see there's basically a technique here for dealing with time series and structured data. Interestingly, the group that used this technique actually wrote a paper about it, which is linked in this notebook. When you compare them to the folks who won the competition and came second, those other folks did way more feature engineering — the winners were subject-matter experts in logistics sales forecasting, so they had their own code to create lots and lots of features. Talking to the folks at Pinterest, who built a very similar model for recommendations, they said the same thing: when they switched from gradient boosting machines to deep learning, they did way, way less feature engineering — a much simpler model requiring much less maintenance. And that's one of the big benefits of this approach to deep learning:

we can get state-of-the-art results, but with a lot less work.

Are we using any time series in any of these fits?

Indirectly, absolutely. As we just saw, we have day of week, month of year, all those columns, and most of them are being treated as categories, so we're building a distributed representation of January, a distributed representation of Sunday, a distributed representation of Christmas. We're not using any classic time series techniques; all we're doing is two fully connected layers in a neural net.

The embedding matrix is what handles it?

Exactly. The embedding matrix is able to deal with things like day-of-week periodicity in a far richer way than any standard time series technique I've ever come across.

One last question: in the earlier models, when we did the CNN, we didn't pass the metrics during the fit — we passed things in when we got the data. Here we're not passing anything to fit except the learning rate and the number of cycles?

In this case we're passing in metrics because we want to print out some extra stuff. There is a difference, though, in that we're calling data.get_learner.

With the imaging approach, we just create a pre-trained learner and pass it the data. But for these kinds of models — in fact for a lot of models — the model we build depends on the data: in this case we need to know which embedding matrices we have, and so on. So here it's actually the data object that creates the learner. Yes, it is a bit upside down compared to what we've seen before.

So just to summarize, or maybe I'm confused: in this case we have some structured data, we did feature engineering, we have some columns in a pandas data frame, and then we map it to deep learning by using this embedding matrix for the categorical variables — and the continuous ones we just put straight in. So if I already have a feature-engineered model, to map it to deep learning I just have to figure out which columns I can move into categorical, and then —

Yeah, great question. So yes, exactly. If you want to use this on your own data set: step one is to list the categorical variable names and the continuous variable names, and put them in a pandas data frame. Step two is to create a list of which row indexes you want in your validation set. Step three is to call this line of code — you can just copy and paste it. Step four is to create your list of how big you want each embedding matrix to be. Step five is to call get_learner — you can use these exact parameters to start with, and if it over-fits or under-fits, fiddle with them. The final step is to call fit. Almost all of this code will be nearly identical; the sketch below lays out the steps schematically.
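Here is that six-step recipe as a code skeleton. All the column names and values are placeholders for your own data, and it assumes the categorical columns have already been converted to numeric codes (the course notebooks use proc_df and friends for that) — treat it as a sketch, not a drop-in script:

```python
cat_vars  = ['Store', 'DayOfWeek', 'Promo']          # 1. list categorical column names
cont_vars = ['CompetitionDistance', 'Temperature']   # 1. list continuous column names
df = df[cat_vars + cont_vars + ['Sales']]            #    a pandas DataFrame

val_idx = list(range(len(df) - 20000, len(df)))      # 2. row indexes for the validation set

md = ColumnarModelData.from_data_frame(              # 3. build the model data object
    PATH, val_idx, df[cat_vars + cont_vars], df['Sales'], cat_flds=cat_vars, bs=128)

emb_szs = [(df[c].nunique() + 1,                     # 4. embedding size per categorical
            min(50, (df[c].nunique() + 2) // 2)) for c in cat_vars]

m = md.get_learner(emb_szs, n_cont=len(cont_vars),   # 5. create the learner
                   emb_drop=0.04, out_sz=1, szs=[1000, 500], drops=[0.001, 0.01])

m.fit(1e-3, 3)                                       # 6. fit
```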

I have a couple of questions. One: how can data augmentation be used in this case? And two: what are the dropouts doing here?

Okay. Data augmentation — I have no idea. It's a really interesting question, but I think it has to be domain specific; I've never seen any paper, or anybody in industry, doing data augmentation with structured data and deep learning. I think it can be done — I just haven't seen it done.

What is dropout doing? Exactly the same as before: the output of each of these linear layers is just a rank-one tensor, and dropout throws away half of the activations. The very first one, the embedding dropout, literally goes through the embedding output and throws away half of those activations. That's it.

Okay, let's take a break and come back at five past eight.

Okay, thanks everybody. Now we're going to move into something equally exciting — but before I do, I had a good question during the break, which was: what's the downside?

Look, almost no one's using this — why not? Basically, I think the answer is, as we discussed before, that almost no one in academia is working on this, because it's not something people really publish on. As a result there haven't been great examples people could look at and say "here's a technique that works well, let's have our company implement it". But perhaps equally importantly, until now — with this fastai library — there hasn't been any way to do it conveniently. If you wanted to implement one of these models, you had to write all the custom code yourself; now, as we discussed, it's basically a six-step process involving not much more than six lines of code. The reason I mention this is that I think there are a lot of big commercial and scientific opportunities to use this to solve problems that haven't been solved very well before. I'll be really interested to hear if some of you try it out — maybe on old Kaggle competitions, where you might find "oh, I would have won this if I'd used this technique", or on some data set you work with at work, some predictive model you've been building with a GBM or a random forest: does this help? The thing is, I'm still somewhat new to this myself — I've been working on these structured deep learning models basically since the start of the year — so I haven't had enough opportunity to know where it might fail. It's worked for nearly everything I've tried it on so far.

I've tried it with so far But yeah, I think this class is the first time that There's going to be like more than half a dozen people in the world who actually are working on this So I think you know as a group we're going to hopefully learn a lot and build some interesting things and this would be a great thing if you're thinking of writing a post about something or here's an area that There's a couple of that.

There's a post from Instacart about what they did Pinterest has a Riley AI video about what they did that's about it, and there's two academic papers Both about Kaggle competition victories one from a Yoshio Yoshio Benjio and his group they won a taxi Destination forecasting competition and then also the one linked for this Rossman competition so Yeah, there's some background on that all right so language natural language processing is the area which Is kind of like the most up-and-coming area of deep learning.

it's maybe two or three years behind computer vision in deep learning. It was the second area that deep learning started getting really popular in. Computer vision got to the point where deep learning was the clear state of the art for most tasks maybe around 2014, and for some things in 2012. In NLP we're still at the point where deep learning is now the state of the art for a lot of things, but not quite everything, and as you'll see, the state of the software and of some of the concepts is much less mature than for computer vision. In general, none of the material after computer vision is going to be as settled as the computer vision material was. One of the interesting things is that in the last few months some of the good ideas from computer vision have started to spread into NLP for the first time, and we've seen some really big advances — so a lot of what you'll see here is pretty new. I'm going to start with one particular kind of NLP problem. One thing you'll find in NLP is that there are particular problems you can solve and they have particular names; there's one called language modelling, and it has a very specific definition: build a model where, given a

It means build a model where given a Few words of a sentence. Can you predict what the next word is going to be? So if you're using your mobile phone and you're typing away and you press space and then it says like this is what the next Word might be like SwiftKey does this like really well and SwiftKey actually uses deep learning for this That's that's a language model.

Okay, so it has a very specific meaning when we say language modeling We mean a model that can predict the next word of a sentence So let me give you an example. I downloaded about 18 months worth of Papers from archive. So for those of you that don't know it archive is The most popular pre-print server in this community and various others And has you know, lots of academic papers and so I grabbed the Abstracts and the topics for each and so here's an example.

So the category of this particular paper was compute a CSNI is computer science and networking and Then the summaries let the abstract of the paper Let's say in the exploitation of mm-wave bands is one of the key enabler for 5g mobile blah blah blah. Okay, so here's like an example piece of text from my language model So I trained a language model on this archive data set that I downloaded and then I built a simple little test which basically You would pass it some like priming text So you'd say like oh imagine you started reading a document that said Category is computer science networking and the summary is algorithms that and then I said, please write An archive abstract so it said that if it's networking algorithms that Use the same network as a single node are not able to achieve the same performance as a traditional network based routing algorithms in this Paper we propose a novel routing scheme, but okay So it it's learned by reading archive papers that somebody who was saying algorithms that Where the word cat CSNI came before it is going to talk like this and remember it started out not knowing English at all Right, it actually started out with an embedding matrix for every word in English that was random Okay, and by reading lots of archive papers, it weren't what kind of words followed others So then I tried what if we said cat computer science computer vision?

summary algorithms that Use the same data to perform image classification are increasingly being used to improve the performance of image classification Algorithms and this paper we propose a novel method for image classification using a deep convolutional neural network parentheses CNN So you can see like it's kind of like almost the same sentence as back here But things have just changed into this world of computer vision rather than networking So I tried something else which is like, okay Category computer vision and I created the world's shortest ever abstract algorithms And then I said title on and the title of this is going to be on the performance of deep learning for image classification EOS is end of string.

So that's like end of title What if it was networking summary algorithms title on the performance of wireless networks as opposed to? Towards computer vision towards a new approach to image classification Networking towards a new approach to the analysis of wireless networks So like I find this mind-blowing right?

I started out with some random matrices Richard like literally no No, pre-trained anything. I fed it 18 months worth of archive articles and it learnt not only How to write English pretty well but also after you say something's a convolutional neural network, you should then use parentheses to say what it's called and furthermore that the kinds of things people talk and say create algorithms for in computer vision are performing image classification and in networking are Achieving the same performance as traditional network-based routing algorithms.

So like a language model is Can be like incredibly deep and subtle Right, and so we're going to try and build that But actually not because we care about this at all We're going to build it because we're going to try and create a pre-trained model what we're actually going to try and do is take IMDB movie reviews and Figure out whether they're positive or negative So if you think about it, this is a lot like cats versus dogs.

It's a classification algorithm, but rather than an image We're going to have the text of a review So I'd really like to use a pre-trained network like I would at least like a net to start with a network that knows how to read English, right and so My view was like okay that to know how to read English means you should be able to like predict the next word of a sentence so what if we pre-train a language model and Then use that pre-trained language model and then just like in computer vision Stick some new layers on the end and ask it instead of to predicting the next word in the sentence Instead predict whether something is positive or negative So when I started working on this, this was actually a new idea Unfortunately in the last couple of months I've been doing it You know a few people have actually couple people have started publishing this and so this has moved from being a totally new idea to being a you know somewhat new idea so so this idea of Creating a language model making that the pre-trained model for a classification model is what we're going to learn to do now And so the idea is we're really kind of trying to leverage exactly what we learned in our computer vision work Which is how do we do fine-tuning to create powerful classification models?

Yes — why don't you think that just doing the thing you actually want to do directly would work better?

Well, first, because empirically it just doesn't, and the reason is several things. First of all, as we know, fine-tuning a pre-trained network is really powerful: if we can get it to learn a related task first, we can use all that information to help it on the second task. The other reason is that IMDB movie reviews are up to a thousand words long. They're pretty big, and after reading a thousand words — knowing nothing about how English is structured, or even what the concept of a word or punctuation is — at the end of those thousand integers (that's what they end up as: integers), all you get is a one or a zero, positive or negative. Trying to learn the entire structure of English, and then how English expresses positive and negative sentiment, from a single number is just too much to expect. By building a language model first, we can try to build a neural network that understands the English of movie reviews, and then we hope that some of what it has learnt will be useful in deciding whether something is positive or negative.

That's a great question, thanks.

Is this similar to char-RNN by Karpathy?

Yeah, this is somewhat similar to char-RNN. The famous char-RNN (as in C-H-A-R) tries to predict the next letter given a number of previous letters. Language models generally work at the word level — they don't have to, but working at the word level turns out to be quite a bit more powerful, and we're going to focus on word-level modelling in this course.

To what extent are these generated words

actual copies of things it found in the training data set, versus completely new things it actually learned — and how do we distinguish between the two?

These are all good questions. The words are definitely words it has seen before — because it's not at a character level, it can only produce words it has seen. As for the sentences, there are a number of rigorous ways to check, but I think the easiest is to get a sense of it: here are two different categories where it created very similar structures but mixed the concepts in just the right way. It would be very hard to do what we've seen here just by spitting back things it had seen before. You could of course go back and check — have you seen that exact sentence before, or, using a string distance, a similar sentence? And most importantly, when we train the language model we'll have a validation set, so we're trying to predict the next word of text it has never seen before; if it's good at that, it should be good at generating text. In this case, though, the purpose is not to generate text — that was just a fun example — so I'm not going to study it much further. During the week you totally can, though: build your great-American-novel generator or whatever. There are actually some tricks to using language models to generate text that I'm not using here;

they're pretty simple, and we can talk about them on the forum if you like. My focus is on classification, because I think that's the thing that's incredibly powerful. Text classification: say you're a hedge fund and you want to read every article the moment it comes out on Reuters or Twitter and immediately identify things which, in the past, have preceded massive market drops — that's a classification model. Or you want to recognize all the customer service queries which tend to be associated with people who cancel their contracts in the next month — that's a classification problem. It's a really powerful kind of thing for data journalism, activism, law, commerce and so forth — like classifying documents into whether they're part of legal discovery or not. Okay, so you get the idea.

In terms of what we're importing, there are a few new things here. One of them is torchtext, which is PyTorch's NLP library; fastai is designed to work hand in hand with torchtext, as you'll see. Then there are a few text-specific parts of fastai that we'll be using. We're going to work with the IMDB large movie review data set. It's very well studied in academia — lots of people have studied it over the years: 50,000 reviews, highly polarized, either positive or negative, each one classified by sentiment. We're going to try, first of all, to create a language model, so we ignore the sentiment entirely — just like with dogs and cats, pre-train the model to do one thing, then fine-tune it to do something else. Because this idea is so new in NLP, there are basically no models you can download for it, so we're going to have to create our own.

Having downloaded the data (you can use the link here), we do the usual stuff: set the path to it, a training path and a validation path. As you can see, it looks pretty traditional compared to vision: there's a directory of training data and a directory of test data (we don't actually have separate test and validation sets in this case), and just like in vision, the training directory has a bunch of files in it — in this case not images but movie reviews. We can cat one of those files, and here we learn about the classic Zombiegeddon: "I have to say, with a name like Zombiegeddon and an atom bomb on the front cover, I was expecting flat-out chop-socky… Rent it if you want to get stoned on a Friday night and laugh with your buddies; don't rent it if you're an uptight weenie or want a zombie movie with lots of flesh eating." I think I'm going to enjoy Zombiegeddon — so we've learned something today. We can use standard Unix tools to see how many words are in the data set: the training set has about 17.5 million words, the test set about 5.6 million. This is IMDB, so these are reviews from random people, not New York Times critics, as far as I know.

Okay. Before we can do anything with text, we have to turn it into a list of tokens. A token is basically a word; we're eventually going to turn the text into a list of numbers, so the first step is to turn it into a list of words. That's called tokenization in NLP — NLP has a huge amount of jargon that we'll pick up over time. One slightly tricky thing about tokenization: here I've tokenized that review and then joined it back up with spaces, and you'll see that "wasn't" has become two tokens, which makes perfect sense — "wasn't" really is two things.

"…" has become one token, whereas a run of exclamation marks has become lots of tokens. A good tokenizer will do a good job of recognizing the pieces of an English sentence: each separate piece of punctuation gets separated, and each part of a multi-part word gets separated as appropriate. spaCy — I think it's an Australian-developed piece of software, actually — does lots of NLP and has the best tokenizer I know of, and fastai is designed to work well with the spaCy tokenizer, as is torchtext. So here's an example of tokenization. With torchtext, we start by creating something called a Field, which is a definition of how to pre-process some text. Here's an example of a Field definition: it says I want to lowercase the text, and I want to tokenize it with the spaCy tokenize function. It hasn't done anything yet — we're just telling it what to do when we do do something — and we store that description in a thing called TEXT (capitalized). None of this is fastai-specific at all; it's part of torchtext. You can go to the torchtext website and read the docs — there aren't many docs yet, this is all very new, so probably the best information you'll find is in this lesson, but there's some more on that site.

Now we can go ahead and create the usual fastai model data object. To create it we provide a few pieces of information: the training set (the path to the text files), the validation set and the test set — in this case, to keep things simple, I don't have separate validation and test sets, so I pass in the validation set for both. So the first thing we give it is the path, the second is the torchtext Field describing how to pre-process the text, and the third is the dictionary listing all of the files we have: train, validation, test. As usual we can pass in a batch size, and then there are a couple of special extras. One is something very commonly used in NLP, a minimum frequency: in a moment we're going to replace every word with an integer — a unique index per word — and this says that any word occurring fewer than 10 times should just be treated as "unknown" rather than as a word (more on that shortly). The other is bptt, which stands for backprop through time, and it defines how long a sequence we'll stick on the GPU at once: we're going to break the text up into chunks of roughly 70 tokens — we'll see all of this in a moment. After building our model data object, what it actually does is fill this TEXT field with an additional attribute called vocab, and that's a really important NLP concept.

Stick on the GPU at once. So we're going to break them up in this case. We're going to break them up into sentences of 70 tokens or less on the whole so we're going to see all this in a moment All right. So after building our model data object, right what it actually does is it's going to fill this text field With an additional attribute called vocab and this is a really important NLP concept I'm sorry.

There's so many NLP concepts. We just have to throw at you kind of quickly, but we'll see them a few times right a Vocab is the vocabulary and the vocabulary in NLP has a very specific meaning it is What is the list of unique words that appeared in this text?

So every one of them is going to get a unique index. So let's take a look right here is text Vocab dot I to s this stands for this is all torch text not fast AI Text of vocab dot int to string Maps the integer zero to unknown the integer one the padding into to the then comma dot and Of two and so forth.

All right, so this is the first 12 elements of the array Of the vocab from the IMDB movie review and it's been sorted by frequency Except for the first two special ones. So for example, we can then go backwards s to I string to int Here is the it's in position 0 1 2 so stream to int the is 2 So the vocab lets us take a word and map it to an integer or take an integer and map it to a word Right.

And so that means that we can then take the first 12 tokens for example of our text and turn them into 12 it's so for example here is of the you can see 7 2 and Here you can see 7 2 Right. So we're going to be working in this form.

Did you have a question? Yeah, could you pass that back there? Is it a common to any stemming or limitizing? Not really. No Generally tokenization is is what we want like with a language model We you know to keep it as general as possible we want to know what's coming next and so like whether it's Future tense or past tense or plural or singular like we don't really know which things are going to be interesting in which aren't so It seems that it's generally best to kind of leave it alone as much as possible Be the short answer You know having said that as I say, this is all pretty new So if there are some particular areas that some researcher maybe has already discovered that some other kinds of pre-processing are helpful You know, I wouldn't be surprised not to know about it So when you're dealing with You know natural language is in context important context is very important.

So if you're if you're using Words no, no, we're not looking at words This is this look this is I just don't get some of the big premises of this like they're in order Yeah, so just because we replaced I with the number 12 These are still in that order.

Yeah There is a different way of dealing with natural language called a bag of words and bag of words You do throw away the order in the context and in the machine learning course We'll be learning about working with bag of words representations But my belief is that they are No longer useful or in the verge of becoming no longer useful We're starting to learn how to use dick learning to use context properly now But it's kind of for the first time it's really like only in the last few months All right, so I mentioned that we've got two numbers batch size and BPT T back prop through time So this is kind of subtle So we've got some big long piece of text Okay, so we've got some big long piece of text, you know, here's our sentence.

It's a bunch of words, right and Actually what happens in a language model is even though we have lots of movie reviews They actually all get concatenated together into one big block of text, right? So it's basically predict the next word In this huge long thing, which is all of the IMDb movie reviews concatenate together.

So this thing is, you know What do we say? It was like tens of millions of words long and so what we do Is we split it up into batches? First right so these like are our spits into batches, right? And so if we said we want a batch size of 64 we actually break the whatever was 60 million words into the 64 sections right, and then we take each one of the 64 sections and We move it Like underneath the previous one I didn't do a great job of that Right move it underneath So we end up with a matrix Which is 64 Actually, I think we've moved them across wise so it's actually I think just transpose it we end up with a matrix.

It's like 64 columns Wide and the length let's say the original was 64 million right then the length is like 10 million long Right. So each of these represents 1/64 of our entire IMDb review set And so that's our starting point so then what we do is We then grab a little chunk of this at a time and those chunk lengths are approximately equal to BP TT which I think we had equal to 70.

So we basically grab a little 70 long section and That's the first thing we chuck into our GPU. That's a batch, right? So a batch is always of length of width 64 or batch size and each bit is a sequence of length up to 70 So let me show you Right.

If I take my training data loader — I don't know if you've tried playing with this yet, but you can take any data loader, wrap it with iter to turn it into an iterator, and then call next on it to grab a batch of data, just as if you were the neural net: you get exactly what the neural net gets. You can see here we get back a 75-by-64 tensor: it's 64 wide, and I said approximately 70 high, but not exactly — and that's actually interesting. A really neat trick torchtext does is to randomly change the backprop-through-time length each batch, so each epoch it sees slightly different chunks of text. This is the equivalent of randomly shuffling the images in computer vision: we can't randomly shuffle the words, because they need to stay in order, so instead we randomly move the break points a little. In other words, this first column of length 75 (with an ellipsis in the middle) represents the first 75 words of the first review, whereas the next column of 75 represents the first 75 words of the second of the 64 segments — you'd have to go about 10 million words in to find that text — and here are the first 75 words of the last of the 64 segments. Then what we have down here is the target: the next sequence — where the input has 51, the target has 51; where it has 615, the target has 615; 25, 25 — and it's actually the same size, also 75 by 64, but for minor technical reasons it has been flattened out into a single vector. It's exactly the same as the input matrix, just moved along by one, because we're trying to predict the next word. That all happens for us — this is the fastai part: if you ask for a language model data object, it creates these batches, batch-size wide by roughly bptt high, of our language corpus, along with the same thing shifted along by one word, because we're always trying to predict the next word.
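Peeking at one batch looks roughly like this (the shapes are approximate, since torchtext jitters the bptt length from batch to batch):

```python
it = iter(md.trn_dl)   # wrap the training data loader in an iterator
x, y = next(it)        # grab one batch, exactly as the network would see it

x.size()    # e.g. torch.Size([75, 64])  -> ~bptt rows, batch-size columns
y.size()    # e.g. torch.Size([4800])    -> the same tokens shifted one ahead, flattened

x[:5, :3]   # first few tokens of the first three of the 64 text segments
```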

Why, instead of arbitrarily choosing 64 — which is a large number — don't you do it by sentences and pad with zeros, so that you actually have one full sentence per line? Wouldn't that make more sense?

Not really, because remember we're using columns, and each column is about 10 million tokens long. Although it's true the columns don't always break exactly on a full stop, they're so long that we just don't care.

And each column contains multiple sentences?

Yeah — each column is about 10 million long and contains many, many sentences, because remember, the first thing we did was take the whole corpus and split it into 64 groups. Okay, great. I found this question — what's actually in this language-model matrix — a little mind-bending for quite a while, so don't worry if it takes a while and you have to ask a thousand questions on the forum; that's fine. Go back and listen to this bit of the lecture again, go back to where I showed splitting it into 64 pieces and moving them around, and try it with some sentences in Excel or something — see if you can do a better job of explaining it than I did. This is basically how torchtext works, and what fastai adds on top is how to build a language model out of it — although a lot of that is borrowed from torchtext too; where torchtext stops and fastai starts, and vice versa, is a little

subtle — they really work closely together. So, now that we have a model data object that can feed us batches, we can go ahead and create a model, and in this case we're going to create an embedding matrix from our vocab. We can see how big the vocab is. Looking back at the model data object, there are about 202 batches we'll go through, which is basically the total length of the corpus divided by batch size times bptt. The one I wanted to show you is nt, the number of unique tokens — the size of our vocab — which is 34,945 unique words. And remember, to count as a word it had to appear at least ten times, because otherwise it's been replaced with the unknown token. The length of the data set is 1, because as far as a language model is concerned there's only one thing: the whole corpus — and that thing is about 20.6 million words long. So those 34,945 tokens are used to create an embedding matrix with 34,945 rows: the first row represents the unknown token, the second padding, then the full stop, the comma, and — I'm guessing — "the", and so forth, and each row is an embedding vector. This is literally identical to what we did before the break: it's a categorical variable, just a very high-cardinality one — and furthermore, it's the only variable. That's pretty standard in NLP: you have a single categorical variable, a single column basically, which is the word, and it's a 34,945-cardinality categorical variable, so we create an embedding matrix for it. em_sz is the size of the embedding vector: 200. That's a lot bigger than our previous embedding vectors, which isn't surprising — a word has a lot more nuance to it than the concept of Sunday, or Rossmann's Berlin store, or whatever. Generally the embedding size for a word is somewhere between about 50 and about 600, so I've gone for something in the middle. Then, as usual, we say how many activations we want in our layers — we're going to use 500 — and how many layers we want in our neural net — we're going to use three.

There's a minor technical detail here: we'll learn later about the Adam optimizer, but it turns out its defaults don't work very well with these kinds of models, so we have to change some of them — basically, any time you're doing NLP you should probably include this line, because it works pretty well. Having done that, we can take our model data object and grab a model out of it, passing in a few things: which optimization function we want, how big an embedding, how many hidden activations, how many layers, and how much dropout of many different kinds. The language model we're going to use is a very recent development called AWD-LSTM, by Stephen Merity, an NLP researcher based in San Francisco, and his main contribution was really to show how to put dropout all over the place in these NLP models.

We're not going to worry right now — we'll do it in the last lecture — about what the architecture is or what each of these dropouts does. For now, just know that the usual rule applies: if you build an NLP model and it's under-fitting, decrease all of these dropouts; if it's over-fitting, increase them, in roughly this ratio. That's my rule of thumb. Again, this is such a recent paper that hardly anyone else is working with this model yet, so there isn't a lot of guidance, but I've found these ratios work well, and they're what Stephen has been using too. There's another way to avoid overfitting that we'll talk about in the last class; for now, this line works totally reliably, so all of your NLP models will probably want it. And then this one — which we'll also cover in the last lecture, and which you can always include — is gradient clipping. Basically it says: when you look at your gradients and multiply them by the learning rate to decide how much to update your weights by, don't let that step be more than 0.3. It's quite a cool little trick: if your learning rate is pretty high, you don't want to get into the situation we talked about where, rather than taking little step after little step, you go "oh, too big — oh, too big" and bounce around. With gradient clipping, it goes a certain distance and then says "I'm going too far — I'll stop." That's basically what gradient clipping does. Anyway, these are a bunch of parameters whose details don't matter too much right now — you can just steal them — and then we go ahead and call fit with exactly the same parameters as usual.

Jeremy, there are all these other word-embedding things, like word2vec and GloVe. I have two questions about them: how are those different from these embeddings, and

why don't you initialize your embeddings with one of those?

That's a great question. People have pre-trained these embedding matrices before, to do various other tasks. They're not whole pre-trained models — they're just a pre-trained embedding matrix — and you can download them; as you say, they have names like word2vec and GloVe, and they're literally just a matrix. There's no reason we couldn't download them. It's just that I found building a whole pre-trained language model didn't seem to benefit much, if at all, from starting with pre-trained word vectors, whereas using a whole pre-trained language model made a much bigger difference. Those of you who remember word2vec will recall it made a big splash when it came out; I'm finding this technique of pre-trained language models seems much more powerful. But I think we could combine both to do a little better still.

What is the model you've used — how can I know its architecture?

We'll learn about the model architecture in the last lesson; for now, it's a recurrent neural network using something called an LSTM — long short-term memory. There are lots of details we're skipping over, but you can do all of this without them. So we go ahead and fit the model. I found this language model took quite a while to fit, so I ran it for a while, noticed it was still under-fitting, saved where it was up to, ran it a bit more with a longer cycle length, saved it again — still under-fitting — ran it again, and finally got to the point where, honestly, I ran out of patience, so I saved it there. Then I did the same kind of test we looked at before, generating a bit of text from it, and

I say okay It looks like the language models working pretty well So I've pre-trained the language model And so now I want to use it Fine-tune it to do classification send my classification now obviously if I'm going to use a pre-trained model I need to use exactly the same vocab right the word there Still needs to map for the number two so that I can look up the vector for that right so that's why I first of all Load back up my my field object the thing with the vocab in right now in this case If I run it straight afterwards, this is unnecessary It's already in memory, but this means I can come back to this later right and a new session basically I can then go ahead and say okay.

I've never got one more field right in addition to my field Which represents the reviews I've also got a field which represents the label Okay And the details are too important here Now this time I need to not treat the whole thing as one big Piece of text, but every review is separate because each one has a different sentiment attached to it And it so happens that torch text already has a data set that does that for IMDB, so I just used IMDB built into torch text So basically once we've done all that we end up with something where we can like grab for a particular example We can grab its label positive and Here's some of the text.

This is another great Tom Beringdon movie blah blah blah blah all right, so This is all not nothing fast AI specific here We'll come back to it in the last lecture But torch text docs can help understand what's going on all you need to know is that Once you've used this special talks torch text thing called splits to grab a splits object You can pass it straight into fast AI text data from splits and that basically converts a torch text Object into a fast AI object we can train on so as soon as you've done that you can just go ahead and say Get model right and that gets us our learner And then we can load into it the pre-trained model the language model right, and so we can now take that pre-trained language model and Use the stuff that we're kind of familiar with right so we can Make sure that you know all it's at the last layer is frozen train it a bit Unfreeze it train it a bit and the nice thing is once you've got a pre-trained Language model it actually trained super fast you can see here.
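A rough sketch of that classifier stage, using the 2018-era torchtext/fastai APIs as I recall them — the saved-encoder filename and exact dropout values are illustrative, not authoritative:

```python
# Reuse TEXT (same vocab as the language model), build the IMDB splits, then fine-tune.
IMDB_LABEL = data.Field(sequential=False)                  # the sentiment label field
splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, 'data/')

md2 = TextData.from_splits(PATH, splits, bs=64)            # torchtext splits -> fastai model data

m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl,
                   dropout=0.1, dropouti=0.4, wdrop=0.5, dropoute=0.05, dropouth=0.3)
m3.load_encoder('adam3_20_enc')   # weights saved from the pre-trained language model

m3.freeze_to(-1)                  # train just the new classifier head first
m3.fit(3e-3, 1, metrics=[accuracy])
m3.unfreeze()                     # then fine-tune the whole network
m3.fit(3e-3, 1, metrics=[accuracy], cycle_len=1)
```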

You can see it's only a couple of minutes per epoch, and it only took about 10 epochs to get my best result, so roughly 20 minutes to train this part — it's really fast. I ended up with 94.5% accuracy. How good is 94.5%? Well, it so happens that one of Stephen Merity's colleagues, James Bradbury, recently wrote a paper where they tried to set a new state of the art for a bunch of NLP tasks, and one of the things they looked at was IMDB. They list the current world's best results for IMDB, and even among models highly specialized for sentiment analysis, the best anyone had previously achieved was 94.1. So this technique, getting 94.5, is literally better than anything created before — as far as we know, or as far as James Bradbury knows. So when I say there are big opportunities to use this, I mean this is a technique nobody else currently has access to. Whatever IBM has in Watson, or whatever any big company is advertising — unless they have some secret sauce they're not publishing, which they don't, because when people have a better method they publish it — you now have access to a better text classification method than has existed before. I really hope you try this out and see how you go. There may be some things it works really well on and others where it doesn't — I don't know. I think the sweet spot is that we had about 25,000 short-to-medium-length documents; if you don't have at least that much text, it may be hard to train a decent language model. Having said that, there's a lot more we could do here that we won't get to in part one of this course (we will in part two). For example, we could train language models on lots and lots of medical journals and make a downloadable medical language model that anybody could then fine-tune on, say, a prostate-cancer subset of the medical literature. There's so much we could do — it's exciting. And, to the earlier point, we could also combine this with pre-trained word vectors; or, without even trying that hard, we could pre-train a language model on, say, a Wikipedia corpus, fine-tune it into an IMDB language model, and then fine-tune that into an IMDB sentiment analysis model, and we'd get something better than this. I really think this is the tip of the iceberg. There's a really fantastic researcher called Sebastian Ruder who is basically the only NLP researcher I know who's been writing a lot about pre-training, fine-tuning and transfer learning in NLP, and I asked him why this isn't happening more. His view was that it's because there isn't the software to make it easy, so I'm actually going to share this lecture with him tomorrow — it feels like there's hopefully going to be a lot of work coming out now that we're making this really easy to do.

Okay, we're nearly out of time, so what I'll do is quickly introduce collaborative filtering and then finish it next time. For collaborative filtering there's very little new to learn:

There's very very little new to learn We basically learned everything we're going to need So collaborative filtering will will cover this quite quickly next week And then we're going to do a really deep dive into collaborative filtering next week Where we're going to learn about like we're actually going to from scratch learn how to do stochastic gradient descent How to create loss functions how they work exactly and then we'll go from there and we'll gradually build back up to really deeply understand What's going on in the structured models and then what's going on in confidence and then finally what's going on in recurrent neural networks And hopefully we'll be able to build them all From scratch okay, so this is kind of going to be really important this movie lens data set because we're going to use it to learn a lot of like Really foundational theory and kind of math behind it so the movie lens data set This is basically what it looks like it contains a bunch of ratings.

It says user number one Watched movie number 31 and they gave it a rating of two and a half at this particular time and Then they watched movie one or two nine and they gave it a rating of three and they watched rating one ones movie one one seven Two and they gave it a rating of four.

Okay, and so forth So this is the ratings table. This is really the only one that matters and our goal will be for some user We haven't seen before sorry for some user movie combination. We haven't seen before we have to predict if they'll like it Right and so this is how recommendation systems are built This is how like Amazon besides what books to recommend how Netflix decides what movies to recommend and so forth To make it more interesting we'll also actually download a list of movies so each movie We're actually going to have the title and so for that question earlier about like what's actually going to be in these embedding matrices How do we interpret them?

We're actually going to be able to look and see How that's working? So basically this is kind of like what we're creating this is kind of crosstab of users by movies Alright, and so feel free to look ahead during the week. You'll see basically as per usual collab filter data set from CSP model data dot get learner Learn dot fit and we're done and you won't be surprised to hear when we then take that and we can cut the benchmarks It seems to be better than the benchmarks where you looked at so that'll basically be it and then next week We'll have a deep dive and we'll see how to actually build this from scratch.
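For those who want to look ahead, the collaborative-filtering steps just mentioned look roughly like this (names follow the MovieLens notebook's conventions as I recall them; treat them as illustrative):

```python
ratings = pd.read_csv(path + 'ratings.csv')    # columns: userId, movieId, rating, timestamp

val_idxs = get_cv_idxs(len(ratings))           # random validation indexes
cf = CollabFilterDataset.from_csv(path, 'ratings.csv', 'userId', 'movieId', 'rating')

learn = cf.get_learner(n_factors=50, val_idxs=val_idxs, bs=64, opt_fn=optim.Adam)
learn.fit(1e-2, 2, wds=2e-4, cycle_len=1, cycle_mult=2)
```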

All right. See you next week