Lesson 4: Deep Learning 2019 - NLP; Tabular data; Collaborative filtering; Embeddings
Chapters
0:00
4:18 Neural Networks
5:45 Transfer Learning
5:56 Transfer Learning in NLP
6:44 Language Model
7:19 N-grams
8:34 WikiText-103 Dataset
12:02 Self-Supervised Learning
15:20 Basic Process
20:29 Fine-Tuning Our IMDB Language Model
22:39 Create a Language Model Learner
23:43 Amount of Dropout
27:18 Create Our Classifier
34:31 Discriminative Learning Rates
35:03 Random Forest
36:41 Tabular Data
40:39 What Are the 10% of Cases Where You Would Not Default to Neural Nets
42:42 Data Block API
46:45 Normalization
48:14 Add Labels
52:16 Layers
53:05 Collaborative Filtering
59:14 The Cold Start Problem
67:29 Microsoft Excel
70:33 Matrix Multiplications
73:08 Loss Function
75:07 Gradient Descent Solver in Excel
81:00 Get Embedding
00:00:00.000 |
Okay, welcome to lesson 4. We are going to finish our journey through these kind of key applications. We've already looked at a range of vision applications. We've looked at classification, localization, image regression. We've briefly touched on NLP. We're going to do a deeper dive into NLP transfer learning today. We're going to then look at tabular data and we're going to look at collaborative filtering. 00:00:29.000 |
Which are both super useful applications, and then we're going to take a complete U-turn. We're going to take that collaborative filtering example and dive deeply into it to understand exactly what's happening mathematically, exactly what's happening in the computer, and we're going to use that to gradually go back in reverse order through the applications again 00:00:49.000 |
in order to understand exactly what's going on behind the scenes of all of those applications. 00:00:56.000 |
Before we do, somebody on the forum was kind enough to point out that when we compared ourselves to what we think might be the state-of-the-art or was recently the state-of-the-art for CamVid, it wasn't a fair comparison because the paper actually used a small subset of the classes and we used all of the classes. 00:01:20.000 |
So Jason in our study group was kind enough to rerun the experiments with the correct subset of classes from the paper and our accuracy went up to 94% compared to 91.5% in the paper. 00:01:34.000 |
So I think that's a really cool result and a great example of how pretty much just using the defaults nowadays can get you far beyond what was the best of a year or two ago. 00:01:48.000 |
Certainly the best last year when we were doing this course, because we studied it quite intensely. 00:02:09.000 |
to understand really what was going on there. So first of all, a quick review. 00:02:15.000 |
So remember NLP is natural language processing. It's about taking text and doing something with it. 00:02:24.000 |
And text classification is a particularly useful, kind of practically useful application. It's what we're going to start off focusing on. 00:02:33.000 |
Because classifying a document can be used for anything from spam prevention to identifying fake news to finding a diagnosis from medical reports. 00:02:50.000 |
Finding mentions of your product in Twitter, so on and so forth. 00:02:56.000 |
So it's pretty interesting, and actually 00:03:02.000 |
there was a great example during the week from one of our students who is a lawyer. 00:03:10.000 |
And he mentioned on the forum that he had really great results from classifying legal texts using this NLP approach. 00:03:21.000 |
And I thought this was a great example. So this is the poster that they presented at an academic conference this week describing the approach. 00:03:29.000 |
And actually this series of three steps that you see here, and I'm sure you recognize this classification matrix, 00:03:37.000 |
this series of three steps here is what we're going to start by digging into. 00:03:43.000 |
So we're going to start out with a movie review like this one and going to decide whether it's positive or negative sentiment about the movie. 00:03:54.000 |
But here's the problem. We have in the training set 25,000 movie reviews. 00:04:04.000 |
So we've got 25,000 movie reviews and for each one we have like one bit of information. 00:04:14.000 |
And as we're going to look into in a lot more detail today and in the current lessons, our neural networks, 00:04:20.000 |
remember they're just a bunch of matrix multiplies and simple nonlinearities, particularly replacing negatives with zeros. 00:04:32.000 |
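A minimal sketch of that idea in plain Python, assuming a single layer and made-up numbers: one matrix multiply followed by replacing negatives with zeros. The helper names `matmul` and `relu` and the weights are illustrative only, not anything from the course notebooks.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """The simple nonlinearity: replace every negative entry with zero."""
    return [[max(0.0, v) for v in row] for row in m]

# Hypothetical numbers, chosen only so the arithmetic is easy to follow.
inputs = [[1.0, -2.0]]                   # one example with two features
weights = [[0.5, -1.0], [-0.25, 2.0]]    # a 2 x 2 weight matrix

hidden = relu(matmul(inputs, weights))
print(hidden)  # [[1.0, 0.0]] because the second activation, -5.0, is replaced with zero
```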
And so if you start out with some random parameters and try to train those parameters to learn how to recognize positive versus negative movie reviews, 00:04:44.000 |
you only have literally 25,000 ones and zeros to actually tell you I like this one, I don't like that one. 00:04:49.000 |
That's clearly not enough information to learn basically 00:04:55.000 |
how to speak English well enough to recognize they liked this or they didn't like this. 00:05:00.000 |
And sometimes that can be pretty nuanced, right? 00:05:03.000 |
The English language often, particularly with like movie reviews, people, because these are like online movie reviews in IMDB, 00:05:10.000 |
people can often like use sarcasm, it could be really quite tricky. 00:05:13.000 |
So for a long time, in fact until very recently, like this year, 00:05:21.000 |
neural nets didn't do a good job at all of this kind of classification problem. 00:05:28.000 |
And that was because there's not enough information available. 00:05:33.000 |
So the trick, hopefully you can all guess, it's to use transfer learning, it's always the trick. 00:05:39.000 |
So last year in this course, I tried something crazy, which was I thought, 00:05:45.000 |
what if I try transfer learning to demonstrate that it can work for NLP as well. 00:05:50.000 |
And I tried it out and it worked extraordinarily well. 00:05:55.000 |
And so here we are a year later, and transfer learning in NLP is absolutely the hit thing now. 00:06:01.000 |
And so I'm going to describe to you what happens. 00:06:04.000 |
The key thing is we're going to start with the same kind of thing that we used for computer vision, 00:06:11.000 |
a pre-trained model that's been trained to do something different to what we're doing with it. 00:06:17.000 |
And so for ImageNet, that was originally built as a model to predict which of a thousand categories each photo falls into. 00:06:25.000 |
And people then fine-tuned that for all kinds of different things, as we've seen. 00:06:30.000 |
So we're going to start with a pre-trained model that's going to do something else, not movie review classification. 00:06:36.000 |
We're going to start with a pre-trained model, which is called a language model. 00:06:40.000 |
A language model has a very specific meaning in NLP, and it's this. 00:06:45.000 |
A language model is a model that learns to predict the next word of a sentence. 00:06:50.000 |
And to predict the next word of a sentence, you actually have to know quite a lot about English, 00:06:57.000 |
assuming you're doing it in English, and quite a lot of world knowledge. 00:07:01.000 |
And by world knowledge, I'll give you an example. 00:07:03.000 |
Here's your language model, and it's read, I'd like to eat a hot, what? Obviously, dog, right? 00:07:17.000 |
Now, previous approaches to NLP use something called n-grams, largely, 00:07:21.000 |
which is basically saying how often do these pairs or triplets of words tend to appear next to each other? 00:07:27.000 |
And n-grams are terrible at this kind of thing. 00:07:29.000 |
As you can see, there's not enough information here to decide what the next word probably is. 00:07:40.000 |
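To make the n-gram limitation concrete, here is a small sketch of a bigram counter over an invented toy corpus (the corpus and the function name `bigram_counts` are assumptions for illustration). It only ever looks at the single previous word, so it cannot use the earlier "eat" to tell a hot dog from a hot day.

```python
from collections import Counter, defaultdict

def bigram_counts(text):
    """Count how often each word follows each previous word."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

# Invented toy corpus, purely for illustration.
corpus = "the soup is hot the tea is hot a hot dog a hot day a hot dog"
counts = bigram_counts(corpus)

# The only evidence a bigram model has after "hot" is this table of counts.
print(counts["hot"].most_common(1))  # [('dog', 2)]
```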
If you train a neural net to predict the next word of a sentence, 00:07:47.000 |
then you actually have a lot of information, rather than having a single bit for every 2,000-word movie review, 00:07:53.000 |
liked it or didn't like it, every single word you can try and predict the next word. 00:07:59.000 |
So in a 2,000-word movie review, there are 1,999 opportunities to predict the next word. 00:08:07.000 |
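The counting argument can be checked with a short sketch. The function name `next_word_examples` is hypothetical; it just pairs every prefix of a token list with the word that follows it, so n tokens give n - 1 training targets.

```python
def next_word_examples(tokens):
    """Pair each prefix of the text with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

review = "i would like to eat a hot dog".split()
examples = next_word_examples(review)

print(len(review), len(examples))  # 8 7
print(examples[-1])                # the full prefix paired with the target 'dog'

# A 2,000-token review gives 1,999 prediction targets.
print(len(next_word_examples(list(range(2000)))))  # 1999
```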
Better still, you don't just have to look at movie reviews. 00:08:11.000 |
Because really the hard thing isn't so much does this person like the movie or not, 00:08:20.000 |
So you can learn how do you speak English, roughly, from some much bigger set of documents. 00:08:27.000 |
And so what we did was we started with Wikipedia. 00:08:31.000 |
And Stephen Merity and some of his colleagues built something called the WikiText-103 dataset, 00:08:36.000 |
which is simply a subset of most of the largest articles from Wikipedia 00:08:43.000 |
with a little bit of preprocessing that's available for download. 00:08:46.000 |
And so you're basically grabbing Wikipedia, and then I built a language model on all of Wikipedia. 00:08:52.000 |
So I've just built a neural net which would predict the next word in every significantly sized Wikipedia article. 00:09:02.000 |
If I remember correctly, it's something like a billion tokens. 00:09:05.000 |
So we've got a billion separate things to predict. 00:09:08.000 |
Every time we make a mistake on one of those predictions, the loss can get gradients from that, 00:09:15.000 |
and we can update our weights and make them better and better 00:09:17.000 |
until we can get pretty good at predicting the next word of Wikipedia. 00:09:23.000 |
Because at that point I've got a model that knows probably how to complete sentences like this, 00:09:27.000 |
and so it knows quite a lot about English and quite a lot about how the world works, 00:09:32.000 |
what kinds of things tend to be hot in different situations, for instance. 00:09:40.000 |
in 1996, in a speech to the United Nations, the United States President Blah said, 00:09:48.000 |
now that would be a really good language model because it would actually have to know 00:09:52.000 |
who was the United States President in that year. 00:09:55.000 |
So getting really good at training language models is a great way to learn a lot about, 00:10:00.000 |
or teach a neural net, a lot about what is our world, what's in our world, 00:10:11.000 |
and it's actually one that philosophers have been studying for hundreds of years now. 00:10:15.000 |
There's actually a whole theory of philosophy, 00:10:18.000 |
which is about what can be learned from studying language alone. 00:10:27.000 |
And so here's the interesting thing, you can start by training a language model on all of Wikipedia, 00:10:31.000 |
and then we can make that available to all of you. 00:10:34.000 |
Just like a pre-trained image net model for vision, 00:10:36.000 |
we've now made available a pre-trained wiki text model for NLP. 00:10:41.000 |
Not because it's particularly useful of itself, 00:10:43.000 |
predicting the next word of sentences is somewhat useful, 00:10:49.000 |
But it tells us it's a model that understands a lot about language, 00:10:56.000 |
So then we can take that and we can do transfer learning to create a new language model 00:11:04.000 |
that's specifically good at predicting the next word of movie reviews. 00:11:10.000 |
So if we can build a language model that's good at predicting the next word of movie reviews, 00:11:19.000 |
then that's going to understand a lot about my favorite actor is Tom Hu. 00:11:27.000 |
I thought the photography was fantastic, but I wasn't really so happy about the director. 00:11:33.000 |
It's going to learn a lot about specifically how movie reviews are written. 00:11:38.000 |
It'll even learn things like what are the names of some popular movies. 00:11:44.000 |
So that would then mean we can still use a huge corpus of lots of movie reviews, 00:11:49.000 |
even if we don't know whether they're positive or negative, 00:11:52.000 |
to learn a lot about how movie reviews are written. 00:11:54.000 |
So for all of this pre-training and all of this language model fine-tuning, 00:11:59.000 |
It's what the researcher Yann LeCun calls self-supervised learning. 00:12:04.000 |
In other words, it's a classic supervised model. 00:12:07.000 |
We have labels, but the labels are not things that somebody else has created. 00:12:11.000 |
They're kind of built into the data set itself. 00:12:16.000 |
Because at this point we've now got something that's good at understanding movie reviews, 00:12:20.000 |
and we can fine-tune that with transfer learning to do the thing we want to do, 00:12:25.000 |
which in this case is to classify movie reviews to be positive or negative. 00:12:29.000 |
And so my hope was, and I tried this last year, 00:12:32.000 |
that at that point 25,000 ones and zeros would be enough feedback to fine-tune that model. 00:12:47.000 |
Does the language model approach work for text and forums that are informal English, 00:12:52.000 |
misspelled words or slang or short form like S6 instead of Samsung S6? 00:13:03.000 |
Particularly if you start with your WikiText model and then fine-tune it with your, 00:13:13.000 |
It could be emails or tweets or medical reports or whatever. 00:13:19.000 |
So you could fine-tune it so it can learn a bit about the specifics of the slang 00:13:25.000 |
or abbreviations or whatever that didn't appear in the full corpus. 00:13:30.000 |
And so interestingly, this is one of the big things that people were surprised about 00:13:36.000 |
People thought that learning from something like Wikipedia wouldn't be that helpful 00:13:41.000 |
because it's not that representative of how people tend to write. 00:13:44.000 |
But it turns out it's extremely helpful because there's a much bigger difference 00:13:48.000 |
between Wikipedia and random words than there is between Wikipedia and Reddit. 00:14:02.000 |
So these language models themselves can be quite powerful. 00:14:06.000 |
So for example, there was a blog post from, what do they call it, SwiftKey? 00:14:13.000 |
SwiftKey, the folks that do the mobile phone predictive text keyboard. 00:14:19.000 |
And they described how they kind of rewrote their underlying model to use neural nets. 00:14:26.000 |
So now, this was a year or two ago, now most phone keyboards seem to do this. 00:14:30.000 |
You'll be typing away on your mobile phone and in the predictions 00:14:33.000 |
there'll be something telling you what words you might want next. 00:14:39.000 |
Another example was the researcher Andrej Karpathy, 00:14:42.000 |
who now runs all this stuff at Tesla, back when he was a PhD student. 00:14:48.000 |
He created a language model of text in LaTeX documents 00:14:53.000 |
and created this automatic generation of LaTeX documents 00:14:57.000 |
that then became these kind of automatically generated papers. 00:15:02.000 |
So we're not really that interested in the output of the language model ourselves. 00:15:07.000 |
We're just interested in it because it's helpful with this process. 00:15:13.000 |
So we briefly looked at the process last week, so let's just have a reminder. 00:15:20.000 |
The basic process is we're going to start with the data in some format. 00:15:27.000 |
So for example, we've prepared a little IMDB sample that you can use where it's in a CSV file. 00:15:32.000 |
So you can read it in with pandas and see there's negative or positive, 00:15:36.000 |
the text of each movie review, and a boolean of is it in the validation set or the training set. 00:15:45.000 |
And so you can just go TextDataBunch.from_csv to grab a language model specific data bunch, 00:15:52.000 |
and then you can create a learner from that in the usual way and fit it. 00:15:56.000 |
You can save the data bunch, which means that the preprocessing that is done, 00:16:01.000 |
you don't have to do it again, you can just load it. 00:16:08.000 |
Well, what happens behind the scenes if we now load it as a classification data bunch 00:16:13.000 |
that's going to allow us to see the labels as well. 00:16:16.000 |
Then as we described, it basically creates a separate unit. 00:16:21.000 |
We call it a token for each separate part of a word. 00:16:25.000 |
So most of them are just for words, but sometimes if it's like an apostrophe S from it's, 00:16:33.000 |
Every bit of punctuation tends to get its own token like a comma or a full stop and so forth. 00:16:39.000 |
And then the next thing that we do is numericalization, 00:16:46.000 |
which is where we find what are all of the unique tokens that appear here, 00:16:56.000 |
And that big list of unique possible tokens is called the vocabulary. 00:17:02.000 |
And so what we then do is we replace the tokens with the ID of where is that token in the vocab. 00:17:16.000 |
As you'll learn, every word in our vocab is going to require a separate row in a weight matrix in our neural net. 00:17:26.000 |
And so to avoid that weight matrix getting too huge, we restrict the vocab to no more than by default 60,000 words. 00:17:35.000 |
And if a word doesn't appear more than two times, we don't put it in the vocab either. 00:17:40.000 |
So we kind of keep the vocab to a reasonable size in that way. 00:17:45.000 |
And so when you see these xxunk, that's an unknown token. 00:17:52.000 |
So when you see those unknown tokens, it just means this was something that was not a common enough word to appear in our vocab. 00:18:07.000 |
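A minimal sketch of numericalization with a capped vocabulary, in plain Python rather than the library's own code. The names `build_vocab` and `numericalize` are hypothetical, and treating "common enough" as "appears at least `min_freq` times" is an assumption about the exact threshold.

```python
from collections import Counter

UNK = "xxunk"  # the unknown token mentioned above

def build_vocab(tokens, max_vocab=60000, min_freq=2):
    """Keep at most max_vocab of the most frequent tokens, dropping rare ones.
    Assumption: a token is kept if it appears at least min_freq times."""
    freq = Counter(tokens)
    itos = [UNK] + [tok for tok, c in freq.most_common(max_vocab) if c >= min_freq]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Replace each token with its ID in the vocab; rare tokens map to xxunk."""
    return [stoi.get(tok, stoi[UNK]) for tok in tokens]

tokens = "the movie was great , the acting was superb .".split()
itos, stoi = build_vocab(tokens)
ids = numericalize(tokens, stoi)

print(itos)  # ['xxunk', 'the', 'was']
print(ids)   # [1, 0, 2, 0, 0, 1, 0, 2, 0, 0]
```

Each ID is the row of the weight matrix that the token will use, which is why the vocab size is capped.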
We also have a couple of other special tokens like xxfld. 00:18:12.000 |
This is a special thing where if you've got like title, summary, abstract, body, like separate parts of a document, 00:18:19.000 |
each one will get a separate field, so they all get numbered. 00:18:23.000 |
Also, you'll find if there's something in all caps, it gets lowercased and a token called xxcap will get added to it. 00:18:32.000 |
Personally, I more often use the data block API because there's less to remember about exactly what data bunch to use 00:18:41.000 |
and what parameters and so forth, and it can be a bit more flexible. 00:18:45.000 |
So another approach to doing this is to just decide what kind of list you're creating. 00:18:52.000 |
So in this case, my independent variable is text. 00:18:57.000 |
How do you want to split it into validation versus training? 00:19:01.000 |
So in this case, column number two is the validation flag. 00:19:06.000 |
How do you want to label it with positive or negative sentiment, for example? 00:19:11.000 |
So column zero had that and then turn that into a data bunch. 00:19:19.000 |
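The three decisions described here (what the input is, how to split, how to label) can be sketched without the library, using only the standard `csv` module. The sample rows and the column names `label`, `text` and `is_valid` are made up for illustration; this is not the data block API itself.

```python
import csv
import io

# A made-up three-row sample in the layout described above:
# column 0 is the label, column 1 the text, column 2 the validation flag.
sample = io.StringIO(
    "label,text,is_valid\n"
    "negative,Dull and far too long.,False\n"
    "positive,A wonderful surprise.,False\n"
    "positive,Great cast and script.,True\n"
)

reader = csv.reader(sample)
header = next(reader)  # skip the header row

train, valid = [], []
for row in reader:
    label = row[0]               # label from column zero
    text = row[1]                # the independent variable is the text
    is_valid = row[2] == "True"  # split on column number two
    (valid if is_valid else train).append((text, label))

print(len(train), len(valid))  # 2 1
```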
So now let's grab the whole dataset which has 25,000 reviews in training, 25,000 reviews in validation, 00:19:29.000 |
and then 50,000 what they call unsupervised movie reviews. 00:19:33.000 |
So 50,000 movie reviews that haven't been scored at all. 00:19:39.000 |
So there it is, positive, negative, unsupervised. 00:19:45.000 |
So we're going to start, as we described, with the language model. 00:19:52.000 |
Now the good news is we don't have to train the Wikitext 103 language model. 00:19:57.000 |
You can use exactly the same steps that you see here. 00:19:59.000 |
Just download the Wikitext 103 corpus and run the same code. 00:20:05.000 |
But it takes two or three days on a decent GPU, so there's not much point in you doing it. 00:20:12.000 |
Even if you've got a big corpus of medical documents or legal documents, 00:20:18.000 |
There's just no reason to start with random weights. 00:20:21.000 |
It's always good to use transfer learning if you can. 00:20:27.000 |
So we're going to start then at this point, which is fine-tuning our IMDB language model. 00:20:33.000 |
So we can say, OK, it's a list of text files. 00:20:43.000 |
So that's why we use a different constructor for our independent variable, text files list. 00:20:49.000 |
And in this case we have to make sure we just include the train and test folders. 00:21:00.000 |
Why are we randomly splitting it by 10% rather than using the predefined train and test they gave us? 00:21:06.000 |
This is one of the cool things about transfer learning. 00:21:09.000 |
Even though our test set or our validation set has to be held aside, 00:21:13.000 |
it's actually only the labels that we have to keep aside. 00:21:17.000 |
So we're not allowed to use the labels in the test set. 00:21:19.000 |
So if you think about something like a Kaggle competition, 00:21:21.000 |
you certainly can't use the labels because they don't even give them to you. 00:21:25.000 |
But you can certainly use the independent variables. 00:21:27.000 |
So in this case you can absolutely use the text that is in the test set to train your language model. 00:21:36.000 |
It's actually when you do the language model, 00:21:38.000 |
concatenate the training and test set together and then just split out a smaller validation set. 00:21:44.000 |
So you've got more data to train your language model. 00:21:49.000 |
And so if you're doing NLP stuff on Kaggle, for example, 00:21:53.000 |
or you've just got a smaller subset of labeled data, 00:21:57.000 |
make sure that you use all of the text you have to train your language model because there's no reason not to. 00:22:05.000 |
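A small sketch of that split, assuming the texts are already loaded as two lists. The function name `language_model_split`, the 10% default and the fixed seed are illustrative choices; only the texts are used, never the sentiment labels.

```python
import random

def language_model_split(train_texts, test_texts, valid_pct=0.1, seed=42):
    """Pool train and test text, then hold out a random fraction for validation.
    No sentiment labels are involved: the language model makes its own targets."""
    texts = list(train_texts) + list(test_texts)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(texts)
    n_valid = int(len(texts) * valid_pct)
    return texts[n_valid:], texts[:n_valid]

# Placeholder documents standing in for the real reviews.
train_texts = [f"train review {i}" for i in range(50)]
test_texts = [f"test review {i}" for i in range(50)]

lm_train, lm_valid = language_model_split(train_texts, test_texts)
print(len(lm_train), len(lm_valid))  # 90 10
```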
Well, remember a language model kind of has its own labels. 00:22:10.000 |
So label_for_lm does that for us. 00:22:16.000 |
And that takes a few minutes to tokenize and numericalize. 00:22:30.000 |
And at this point, things are going to look very familiar. 00:22:35.000 |
But instead of creating a CNN learner, we're going to create a language model learner. 00:22:43.000 |
So behind the scenes, this is actually not going to create a CNN, a convolutional neural network. 00:22:48.000 |
It's going to create an RNN, a recurrent neural network. 00:22:51.000 |
So we're going to be learning exactly how they're built over the coming lessons. 00:22:55.000 |
But in short, they're the same basic structure. 00:23:14.000 |
So as usual, when we create a learner, you have to pass in two things. 00:23:22.000 |
And in this case, what pre-trained model we want to use. 00:23:27.000 |
And so here, the pre-trained model is the Wikitext 103 model. 00:23:31.000 |
That will be downloaded for you from FastAI if you haven't used it before. 00:23:35.000 |
Just like ImageNet pre-trained models are downloaded for you. 00:23:46.000 |
We've talked briefly about this idea that there's something called regularization. 00:23:49.000 |
And you can reduce the regularization to avoid underfitting. 00:23:53.000 |
So for now, just know that we're using a number lower than one because when I first tried to run this, 00:24:01.000 |
And so if you reduce that number, then it will avoid underfitting. 00:24:13.000 |
And so what's happening here is we are just fine-tuning the last layers. 00:24:21.000 |
So normally after we fine-tuned the last layers, the next thing we do is we go unfreeze and train the whole thing. 00:24:35.000 |
And as you can see, even on a pretty beefy GPU, that takes two or three hours. 00:24:44.000 |
So probably tonight I might train it overnight and try and do a little bit better. 00:24:49.000 |
Because you can see, well, I guess I'm not underfitting. 00:24:53.000 |
I'm guessing I could probably train this a bit longer because you can see the accuracy hasn't started going down again. 00:24:59.000 |
So I wouldn't mind trying to train that a bit longer. 00:25:04.000 |
Point three means we're guessing the next word of the movie review correctly about a third of the time. 00:25:11.000 |
That sounds like a pretty high number, the idea that you can actually guess the next word that often. 00:25:17.000 |
So that's a good sign that my language model is doing pretty well. 00:25:21.000 |
For more limited domain documents like medical transcripts and legal transcripts, 00:25:29.000 |
you'll often find this accuracy gets a lot higher. 00:25:42.000 |
So you can now run learn.predict and pass in the start of a sentence. 00:25:50.000 |
And it will try and finish off that sentence for you. 00:25:53.000 |
Now I should mention this is not designed to be a good text generation system. 00:25:59.000 |
This is really more designed to kind of check that it seems to be creating something that's vaguely sensible. 00:26:05.000 |
There's a lot of tricks that you can use to generate much higher quality text, none of which we're using here. 00:26:13.000 |
But you can kind of see that it's certainly not random words that it's generating. It sounds vaguely English-like, even though it doesn't make any sense. 00:26:25.000 |
So at this point we have a movie review model. 00:26:35.000 |
So now we're going to save that in order to load it into our classifier, to be our pre-trained model for the classifier. 00:26:44.000 |
But I actually don't want to save the whole thing. 00:26:46.000 |
The second half of the language model is all about predicting the next word rather than about understanding the sentence so far. 00:26:57.000 |
So the bit which is specifically about understanding the sentence so far is called the encoder. 00:27:05.000 |
And again, we're going to learn the details over the coming weeks. 00:27:09.000 |
We're just going to save the encoder so the bit that understands the sentence rather than the bit that generates the word. 00:27:21.000 |
So step one, as per usual, is to create a data bunch. 00:27:24.000 |
And we're going to do basically exactly the same thing. 00:27:26.000 |
Bring it in. And here's our path. But we want to make sure that it uses exactly the same vocab that it used for the language model. 00:27:36.000 |
If word number 10 was "the" in the language model, we need to make sure that word number 10 is "the" in the classifier. 00:27:44.000 |
Because otherwise the pre-trained model is going to be totally meaningless. 00:27:49.000 |
So that's why we pass in the vocab from the language model to make sure that this data bunch is going to have exactly the same vocab. 00:27:59.000 |
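Why the vocab has to match can be seen in a toy example. The vocab contents and the helper `make_stoi` below are invented for illustration: if the classifier rebuilds its own vocab, the same word gets a different ID and therefore points at a different row of the pre-trained weights.

```python
def make_stoi(itos):
    """Map each token to its position in the vocab list."""
    return {tok: i for i, tok in enumerate(itos)}

# Hypothetical vocab learned while fine-tuning the language model.
lm_itos = ["xxunk", "the", "movie", "was", "great"]
lm_stoi = make_stoi(lm_itos)

clf_tokens = ["great", "movie"]

# Wrong: a fresh vocab built only from the classifier's text.
rebuilt_stoi = make_stoi(["xxunk"] + clf_tokens)

shared_ids = [lm_stoi.get(t, 0) for t in clf_tokens]
rebuilt_ids = [rebuilt_stoi.get(t, 0) for t in clf_tokens]

print(shared_ids)               # [4, 2]
print(rebuilt_ids)              # [1, 2]
print(lm_itos[rebuilt_ids[0]])  # 'the': the rebuilt ID for "great" hits the wrong row
```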
Split by folder. And this time label -- so remember the last time we had split randomly. 00:28:06.000 |
But this time we need to make sure that the labels of the test set are not touched. 00:28:13.000 |
And then this time we label it, not for a language model, but we label these classes. 00:28:23.000 |
And remember sometimes you'll find that you run out of GPU memory. 00:28:27.000 |
This will very often happen to you if you -- so I was running this in an 11 gig machine. 00:28:33.000 |
So you should make sure this number is a bit lower if you run out of memory. 00:28:36.000 |
You may also want to make sure you restart the notebook and kind of start it just from here. 00:28:41.000 |
So batch size 50 is as high as I could get on an 11 gig card. 00:28:45.000 |
If you're using a P2 or P3 on Amazon or the K80 on Google, for example, I think you'll get 16 gigs. 00:28:55.000 |
So you might be able to make this a bit higher, get it up to 64. 00:28:58.000 |
So you can find whatever batch size fits on your card. 00:29:02.000 |
So here's our data bunch, as we saw before, and the labels. 00:29:07.000 |
So this time, rather than creating a language model learner, we're creating a text classifier learner. 00:29:13.000 |
But again, same thing, pass in the data that we want, figure out how much regularization we need. 00:29:19.000 |
Again, if you're overfitting, then you can increase this number. 00:29:23.000 |
If you're underfitting, you can decrease the number. 00:29:25.000 |
And most importantly, load in our pre-trained model. 00:29:29.000 |
And remember specifically it's this half of the model called the encoder, which is the bit that we want to load in. 00:29:38.000 |
Now I'll find the learning rate and fit for a little bit. 00:29:43.000 |
And we're already up nearly to 92% accuracy after less than three minutes of training. 00:29:51.000 |
In your particular domain, whether it be law or medicine or journalism or government or whatever, 00:29:59.000 |
you probably only need to train your domain's language model once. 00:30:09.000 |
But once you've got it, you can now very quickly create all kinds of different classifiers and models with that. 00:30:17.000 |
In this case, already a pretty good model after three minutes. 00:30:21.000 |
So when you first start doing this, you might find it a bit annoying that your first models take four hours or more to create that language model. 00:30:31.000 |
But the key thing to remember is you only have to do that once for your entire domain of stuff that you're interested in. 00:30:37.000 |
And then you can build lots of different classifiers and other models on top of that in a few minutes. 00:30:46.000 |
So we can save that to make sure we don't have to run it again. 00:30:50.000 |
I'm going to explain this more in just a few minutes. 00:30:57.000 |
And what that says is unfreeze the last two layers. 00:31:03.000 |
And so we've just found it really helps with these text classification problems 00:31:07.000 |
not to unfreeze the whole thing, but to unfreeze one layer at a time. 00:31:24.000 |
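The unfreezing schedule can be sketched as a list of trainable flags per layer group. The helper `freeze_to` below is a stand-alone illustration written for this note, not the library method, and the choice of five layer groups is an assumption.

```python
def freeze_to(n_groups, n):
    """Return one trainable flag per layer group; groups before index n are frozen.
    A negative n counts from the end, so -2 leaves the last two groups trainable."""
    start = n if n >= 0 else n_groups + n
    return [i >= start for i in range(n_groups)]

n_groups = 5  # assumed number of layer groups, for illustration only

# Train the last group, then the last two, then the last three, then everything.
schedule = [freeze_to(n_groups, n) for n in (-1, -2, -3, 0)]
for flags in schedule:
    print(flags)
# [False, False, False, False, True]
# [False, False, False, True, True]
# [False, False, True, True, True]
# [True, True, True, True, True]
```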
You'll also see I'm passing in this thing, momentums equals (0.8, 0.7). 00:31:30.000 |
We're going to learn exactly what that means in the next week or two, probably next week. 00:31:39.000 |
So maybe by the time you watch the video of this, this won't even be necessary anymore. 00:31:43.000 |
Basically we found for training recurrent neural networks, RNNs, 00:31:48.000 |
it really helps to decrease the momentum a little bit. 00:31:55.000 |
So that gets us 94.4% accuracy after about half an hour or less of training, 00:32:02.000 |
actually quite a lot less, of training the actual classifier. 00:32:06.000 |
And we can actually get this quite a bit better with a few tricks. 00:32:16.000 |
But even this very simple kind of standard approach is pretty great. 00:32:21.000 |
If we compare it to last year's state-of-the-art on IMDB, 00:32:26.000 |
this is from the CoVe paper from McCann et al. at Salesforce Research, 00:32:36.000 |
In the best paper they could find, they found a fairly domain-specific sentiment analysis paper from 2017. 00:32:49.000 |
And the best models I've been able to build since have been about 95, 95.1. 00:32:56.000 |
So if you're looking to do text classification, 00:32:58.000 |
this really standardized transfer learning approach works super well. 00:33:14.000 |
And we'll be learning more about NLP later in this course. 00:33:18.000 |
But now I wanted to switch over and look at tabular. 00:33:22.000 |
Now, tabular data is pretty interesting because it's the stuff that, for a lot of you, 00:33:28.000 |
is actually what you use day-to-day at work in spreadsheets and relational databases. 00:33:38.000 |
So where does the magic number of 2.6 to the fourth in the learning rate come from? 00:33:50.000 |
So the learning rate is various things divided by 2.6 to the fourth. 00:34:00.000 |
The reason it's to the fourth, you will learn about at the end of today. 00:34:14.000 |
Basically, as we're going to see in more detail later today, 00:34:19.000 |
the difference between the bottom of the slice and the top of the slice is basically, 00:34:24.000 |
what's the difference between how quickly the lowest layer of the model learns 00:34:28.000 |
versus the highest layer of the model learns? 00:34:31.000 |
So this is called discriminative learning rates. 00:34:34.000 |
So really the question is, as you go from layer to layer, 00:34:41.000 |
And we found out that for NLP RNNs, the answer is 2.6. 00:34:49.000 |
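One way to read the 2.6 figure is as a per-layer divisor. The sketch below assumes five layer groups, so there are four steps between the top and the bottom and the lowest group ends up at the top rate divided by 2.6 to the fourth. The group count, the function name `discriminative_lrs` and the example rate are all assumptions for illustration.

```python
def discriminative_lrs(top_lr, n_groups, factor=2.6):
    """Learning rate per layer group, lowest group first.
    Each step down from the top divides the rate by `factor`."""
    return [top_lr / factor ** (n_groups - 1 - i) for i in range(n_groups)]

# Assumption: five layer groups, so four steps from top to bottom.
lrs = discriminative_lrs(1e-2, n_groups=5)

for group, lr in enumerate(lrs):
    print(group, lr)

# The lowest group trains at top_lr / 2.6**4, the highest at top_lr.
print(lrs[0], 1e-2 / 2.6 ** 4)
```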
I ran lots and lots of different models, like a year ago or so, 00:34:56.000 |
using lots of different sets of hyperparameters of various types, 00:34:59.000 |
dropout, learning rates, and discriminative learning rate, and so forth. 00:35:03.000 |
And then I created something called a random forest, 00:35:06.000 |
which is a kind of model where I attempted to predict 00:35:09.000 |
how accurate my NLP classifier would be based on the hyperparameters. 00:35:14.000 |
And then I used random forest interpretation methods 00:35:18.000 |
to basically figure out what the optimal parameter settings were. 00:35:23.000 |
And I found out that the answer for this number was 2.6. 00:35:28.000 |
So that's actually not something I've published, 00:35:30.000 |
or I don't think I've even talked about it before. 00:35:39.000 |
I think Stephen Merity and somebody else did publish a paper 00:35:44.000 |
describing a similar approach, so the basic idea may be out there already. 00:35:49.000 |
Some of that idea comes from a researcher named Frank Hutter 00:35:55.000 |
They did some interesting work showing how you can use random forests 00:36:05.000 |
A lot of people are very interested in this thing called AutoML, 00:36:08.000 |
which is this idea of building models to figure out how to train your model. 00:36:16.000 |
but we do find that building models to better understand 00:36:21.000 |
how your hyperparameters work and then finding those rules of thumb, 00:36:24.000 |
like basically it can always be 2.6, quite helpful. 00:36:29.000 |
So that's just something we've kind of been playing with. 00:36:43.000 |
Tabular data, such as you might see in a spreadsheet 00:36:47.000 |
or a relational database or a financial report, 00:36:51.000 |
it can contain all kinds of different things, 00:36:56.000 |
and I kind of tried to make a little list of some of the kinds of things 00:37:01.000 |
that I've seen tabular data analysis used for. 00:37:06.000 |
Using neural nets for analyzing tabular data is, 00:37:11.000 |
or at least last year when I first presented this, 00:37:18.000 |
When I first presented this, people were deeply skeptical 00:37:26.000 |
because everybody knows that you should use logistic regression 00:37:29.000 |
or random forests or gradient boosting machines, 00:37:32.000 |
all of which have their place for certain types of things. 00:37:45.000 |
It's not true that neural nets are not useful for tabular data. 00:37:50.000 |
We've shown this in quite a few of our courses, 00:38:04.000 |
and stuff describing how they've been using neural nets 00:38:09.000 |
One of the key things that comes up again and again 00:38:12.000 |
is that although feature engineering doesn't go away, 00:38:18.000 |
Pinterest, for example, replaced the gradient boosting machines 00:38:23.000 |
how to put stuff on their homepage with neural nets. 00:38:30.000 |
and they described how it really made engineering a lot easier 00:38:39.000 |
You still need some, but it was just simpler. 00:38:43.000 |
So they ended up with something that was more accurate, 00:38:55.000 |
that you need in your toolbox for analyzing tabular data, 00:39:04.000 |
when I was doing machine learning with tabular data. 00:39:11.000 |
It's kind of my standard first go-to approach now 00:39:15.000 |
and it tends to be pretty reliable, pretty effective. 00:39:22.000 |
is that until now there hasn't been an easy way 00:39:27.000 |
to kind of create and train tabular neural nets, 00:39:30.000 |
like nobody's really made it available in a library. 00:39:33.000 |
So we've actually just created fastai.tabular 00:39:39.000 |
and I think this is pretty much the first time 00:39:41.000 |
that it's become really easy to use neural nets with tabular data. 00:39:51.000 |
This is actually coming directly from the examples folder 00:39:59.000 |
And as per usual, as well as importing fastai, 00:40:09.000 |
We assume that your data is in a pandas data frame. 00:40:13.000 |
A pandas data frame is kind of the standard format 00:40:21.000 |
but probably the most common might be pd.read_csv. 00:40:27.000 |
you can probably get it into a pandas data frame easily enough. 00:40:53.000 |
I guess I still tend to kind of give them a try. 00:41:17.000 |
I would say try a random forest and try a neural net. 00:41:27.000 |
and see if I can make them better and better. 00:41:29.000 |
But if the random forest is doing way better, 00:41:32.000 |
I'd probably just stick with that, use whatever works. 00:41:41.000 |
So I currently have the wrong notebook in the lesson repo, 00:41:46.000 |
so I'll update it after the class, so sorry about that. 00:41:57.000 |
And so we've got a little thing, adult sample. 00:42:12.000 |
that's good for experimenting with, basically. 00:42:16.000 |
And it's a CSV file, so you can read it into a data frame 00:42:28.000 |
If it's in Spark or Hadoop, pandas can read from that. 00:42:31.000 |
Pandas can read from most stuff that you can throw at it. 00:42:34.000 |
So that's why we kind of use it as a default starting point. 00:42:39.000 |
And as per usual, I think it's nice to use the data block API. 00:42:46.000 |
And so in this case, the list that we're trying to create 00:42:49.000 |
is a tabular list, and we're going to create it from a data frame. 00:42:54.000 |
And so you can tell it what the data frame is 00:42:58.000 |
to kind of save models and intermediate steps is. 00:43:01.000 |
And then you need to tell it what are your categorical variables 00:43:09.000 |
about what that means to the neural net next week. 00:43:29.000 |
Some of those variables, like age, are basically numbers. 00:43:37.000 |
You could be 13.36 years old or 19.4 years old or whatever. 00:43:46.000 |
are options that can be selected from a discrete group. 00:43:53.000 |
Sometimes those options might be quite a lot more, 00:44:16.000 |
And so we're going to need to use a different approach 00:44:18.000 |
to the neural net to modeling categorical variables 00:44:24.000 |
we're going to be using something called embeddings, 00:44:35.000 |
Because pixels in a neural net are already numbers. 00:44:37.000 |
These continuous things are already numbers as well. 00:44:43.000 |
So that's why you have to tell TabularList.from_df 00:44:59.000 |
It's kind of nice to have one API for doing everything. 00:45:05.000 |
Then we've got something which is a lot like transforms 00:45:13.000 |
Transforms in computer vision do things like flip a photo 00:45:31.000 |
but the key difference, which is quite important, 00:45:33.000 |
is that a processor is something that happens ahead of time. 00:45:44.000 |
So transformations are really for data augmentation 00:45:51.000 |
Whereas processors are things that you want to do once ahead of time. 00:45:55.000 |
So we have a number of processors in the fastai library, 00:46:00.000 |
and the ones we're going to use this time are fill missing. 00:46:16.000 |
And we're going to do normalization ahead of time, 00:46:41.000 |
which is a binary column of saying whether that was missing or not. 00:46:46.000 |
Normalization, there's an important thing here, 00:46:55.000 |
you need to do exactly the same thing to the validation set 00:46:59.000 |
So whatever you replace your missing values with, 00:47:02.000 |
you need to replace them with exactly the same thing 00:47:08.000 |
These are the kinds of things that, if you have to do them manually, 00:47:12.000 |
you'll screw it up lots of times until you finally get it right. 00:47:21.000 |
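As a rough illustration of why processors are fitted once on the training set and then reused, the sketch below learns a fill value and normalization statistics from a training column and applies exactly the same numbers to the validation column. The column values are made up, the choice of the median as the fill value is an assumption of this sketch, and `fit_processor` and `apply_processor` are hypothetical helpers rather than fastai's own implementation.

```python
from statistics import mean, median, pstdev

def fit_processor(train_values):
    """Learn fill and scaling statistics from the training column only."""
    present = [v for v in train_values if v is not None]
    fill = median(present)
    filled = [fill if v is None else v for v in train_values]
    return {"fill": fill, "mean": mean(filled), "std": pstdev(filled)}

def apply_processor(values, stats):
    """Apply previously learned statistics; returns (normalized, was_missing)."""
    was_missing = [v is None for v in values]  # the extra binary column
    filled = [stats["fill"] if v is None else v for v in values]
    normalized = [(v - stats["mean"]) / stats["std"] for v in filled]
    return normalized, was_missing

# Made-up ages, with None marking a missing value.
train_age = [25, 38, None, 52, 41]
valid_age = [None, 30]

stats = fit_processor(train_age)  # computed once, ahead of time
train_norm, train_missing = apply_processor(train_age, stats)
valid_norm, valid_missing = apply_processor(valid_age, stats)  # same stats reused
print(stats["fill"], valid_missing)
```

The key design point is that `fit_processor` never sees the validation data, so the validation set is filled and scaled with the training set's numbers.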
Then we're going to split into training versus validation sets, 00:47:27.000 |
and in this case we do it by providing a list of indexes, 00:47:36.000 |
I don't quite remember the details of this data set, 00:47:38.000 |
but it's very common to want to keep your validation sets 00:47:46.000 |
they should be the map tiles that are next to each other. 00:47:50.000 |
they should be days that are next to each other. 00:47:54.000 |
they should be video frames next to each other, 00:47:58.000 |
So it's often a good idea to use split_by_idx 00:48:01.000 |
and to grab a range that's next to each other 00:48:04.000 |
if your data has some kind of structure like that 00:48:07.000 |
or find some other way to structure it in that way. 00:48:10.000 |
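A minimal sketch of splitting by a contiguous range of indexes, so that neighbouring rows stay together in the validation set. The function below is a stand-alone illustration that borrows the name of the data block method; the row count and the range 800 to 1000 are assumptions made for the example.

```python
def split_by_idx(items, valid_idx):
    """Split items into (train, valid) using an explicit collection of row indexes."""
    valid_set = set(valid_idx)
    train = [x for i, x in enumerate(items) if i not in valid_set]
    valid = [x for i, x in enumerate(items) if i in valid_set]
    return train, valid

# Hypothetical data: 1000 rows, with one contiguous block held out.
rows = [f"row_{i}" for i in range(1000)]
train, valid = split_by_idx(rows, range(800, 1000))
print(len(train), len(valid))
```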
All right, so that's now given us a training and a validation set. 00:48:16.000 |
and in this case the labels can come straight from the data frame 00:48:19.000 |
so we just have to tell it which column it is. 00:48:23.000 |
I think it's whether they're making over $50,000 salary. 00:48:28.000 |
That's the thing we're trying to predict in this case. 00:48:39.000 |
So at that point we have something that looks like this. 00:48:54.000 |
You get a learner, in this case it's a tabular learner, 00:49:15.000 |
>> How to combine NLP tokenized data with metadata, 00:49:24.000 |
how to use information like who the actors are, 00:49:34.000 |
so I need to learn a little bit more about how neural net architectures work. 00:49:40.000 |
Conceptually, it's kind of the same as the way we combine categorical variables 00:49:46.000 |
Basically, in the neural network you can have two different sets of inputs 00:49:55.000 |
It could go into an early layer or into a later layer. 00:49:59.000 |
If it's like text and an image and some metadata, 00:50:03.000 |
you probably want the text going into an RNN, 00:50:07.000 |
the metadata going into some kind of tabular model like this, 00:50:10.000 |
and then you'd have them basically all concatenated together 00:50:13.000 |
and then go through some fully connected layers 00:50:17.000 |
We'll probably largely get into that in part two. 00:50:20.000 |
In fact, we might get into it entirely in part two. 00:50:23.000 |
I'm not sure if we'll have time to cover it in part one. 00:50:31.000 |
of what we'll be learning in the next three weeks. 00:50:36.000 |
Next question is, do you think things like scikit-learn 00:50:42.000 |
Will everyone use deep learning tools in the future 00:50:59.000 |
I mean, xgboost is a really nice piece of software. 00:51:04.000 |
There's quite a few really nice pieces of software 00:51:13.000 |
have some really nice features for interpretation, 00:51:15.000 |
which I'm sure we'll find similar versions for neural nets, 00:51:38.000 |
Again, it's hard to predict where things will end up. 00:51:42.000 |
In some ways, it's more focused on some older approaches 00:51:50.000 |
They keep on adding new things, so we'll see. 00:51:53.000 |
I keep trying to incorporate more scikit-learn stuff 00:51:58.000 |
I think I can do it better, and I throw it away again. 00:52:01.000 |
So that's why there's still no scikit-learn dependencies 00:52:13.000 |
Okay, so we're going to learn what layers equals means 00:52:23.000 |
where we're basically defining our architecture, 00:52:26.000 |
just like when we chose ResNet 34 or whatever for ConvNets. 00:52:32.000 |
We'll look at more about metrics in a moment, 00:52:34.000 |
but just to remind you, metrics are just the things 00:52:40.000 |
So in this case, we're saying I want you to print out 00:52:53.000 |
And the idea was that after three and a half lessons, 00:52:55.000 |
we're going to hit the end of all of the quick overview 00:52:57.000 |
of applications, and then we're going to go down 00:53:00.000 |
I think we're going to hit it to the minute, 00:53:03.000 |
because the next one is collaborative filtering. 00:53:09.000 |
So collaborative filtering is where you have information 00:53:22.000 |
It's basically something where you have something 00:53:33.000 |
or what they've written about or what they reviewed. 00:53:36.000 |
So in the most basic version of collaborative filtering, 00:53:40.000 |
you just have two columns, something like user ID 00:53:43.000 |
and movie ID, and that just says this user bought that movie, 00:53:46.000 |
this user bought that movie, this user bought that movie. 00:53:48.000 |
So for example, Amazon has a really big list of user IDs 00:53:56.000 |
Then you can add additional information to that table, 00:54:03.000 |
So it's now like user ID, movie ID, number of stars. 00:54:09.000 |
You could add a time code, so like this user bought this product 00:54:17.000 |
But they're all basically the same kind of structure. 00:54:21.000 |
So there's kind of like two ways you could draw 00:54:32.000 |
where you've got like user and, I don't know, movie. 00:54:37.000 |
And you've got user ID, movie ID, user ID, you know. 00:54:43.000 |
watch that movie, possibly also plus number of stars, 00:54:58.000 |
would be you could have like all the users down here 00:55:12.000 |
And then, you know, you can look and find a particular cell 00:55:25.000 |
could be the rating of that user for that movie 00:55:28.000 |
or there's just a one there if that user watched that movie or whatever. 00:55:31.000 |
So there's like two different ways of representing the same information. 00:55:37.000 |
Conceptually, it's often easier to think of it this way, right? 00:55:44.000 |
But most of the time you won't store it that way explicitly 00:55:47.000 |
because most of the time you'll have what's called a very sparse matrix, 00:55:51.000 |
which is to say most users haven't watched most movies 00:55:56.000 |
or most customers haven't purchased most products. 00:56:00.000 |
So if you store it as a matrix where every combination of customer 00:56:05.000 |
and product is a separate cell in that matrix, it's going to be enormous. 00:56:09.000 |
So you tend to store it like this or you can store it as a matrix 00:56:14.000 |
using some kind of special sparse matrix format. 00:56:18.000 |
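The two representations can be compared side by side in a few lines. This is only a sketch with made-up ratings: the long format keeps one record per observed rating, while the dense matrix needs a cell for every user and movie combination, most of which are empty in real data.

```python
# Long format: one record per observed (user, movie, rating). Values are made up.
ratings = [
    (14, 27, 3.0),
    (14, 49, 5.0),
    (293, 49, 4.0),
]
users = sorted({u for u, _, _ in ratings})
movies = sorted({m for _, m, _ in ratings})

# Dense format: one cell per (user, movie) combination, None where unobserved.
lookup = {(u, m): r for u, m, r in ratings}
dense = [[lookup.get((u, m)) for m in movies] for u in users]

n_cells = len(users) * len(movies)
n_observed = len(ratings)
print(f"{n_observed} observed ratings out of {n_cells} cells")
```

With thousands of users and movies the number of cells grows as the product of the two, while the long format grows only with the number of ratings actually recorded.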
And if that sounds interesting, you should check out 00:56:20.000 |
Rachel's computational linear algebra course on FastAI 00:56:24.000 |
where we have lots and lots and lots of information 00:56:30.000 |
For now, though, we're just going to kind of keep it 00:56:38.000 |
So for collaborative filtering, there's a really nice data set 00:56:44.000 |
called MovieLens, created by the GroupLens group, very helpfully. 00:56:52.000 |
And you can download various different sizes, 20 million. 00:56:59.000 |
We've actually created an extra small version for playing around with, 00:57:06.000 |
And then probably next week we'll use the bigger version. 00:57:13.000 |
But so you can grab the small version using URLs.ML_SAMPLE. 00:57:17.000 |
And it's a CSV, so you can read it with pandas. 00:57:28.000 |
We don't actually know anything about who these users are. 00:57:32.000 |
There is some information about what the movies are, 00:57:46.000 |
So that's a subset of our data that's the head. 00:57:49.000 |
So the head in pandas is just the first few rows. 00:57:56.000 |
the nice thing about collaborative filtering is it's incredibly simple. 00:58:03.000 |
So you can now go ahead and say get collaborative learner, 00:58:08.000 |
and you can actually just pass in the data frame directly. 00:58:13.000 |
The architecture, you have to tell it how many factors you want to use, 00:58:16.000 |
and we're going to learn what that means after the break. 00:58:19.000 |
And then something that can be helpful is to tell it what the range of scores are, 00:58:24.000 |
and we're going to see how that helps after the break as well. 00:58:27.000 |
So in this case, the minimum score is zero, the maximum score is five. 00:58:37.000 |
And trains for a few epochs, and there it is. 00:58:41.000 |
So at the end of it, you now have something where you can pick 00:58:45.000 |
a user ID and a movie ID and guess whether or not that user will like that movie. 00:58:54.000 |
So this is obviously a super useful application 00:58:59.000 |
that a lot of you are probably going to try during the week. 00:59:03.000 |
In past classes, a lot of people have taken this collaborative filtering approach back to their workplaces 00:59:08.000 |
and discovered that using it in practice is much more tricky than this 00:59:13.000 |
because in practice you have something called the cold start problem. 00:59:16.000 |
So the cold start problem is that the time you particularly want to be good 00:59:21.000 |
at recommending movies is when you have a new user, 00:59:25.000 |
and the time you particularly care about recommending a movie is when it's a new movie. 00:59:31.000 |
But at that point, you don't have any data in your collaborative filtering system, 00:59:38.000 |
As I say this, we don't currently have anything built into fastai 00:59:43.000 |
And that's really because the cold start problem, the only way I know of to solve it, 00:59:47.000 |
in fact the only way I think that conceptually you can solve it, 00:59:50.000 |
is to have a second model, which is not a collaborative filtering model, 00:59:53.000 |
but a metadata driven model for new users or new movies. 01:00:01.000 |
I don't know if Netflix still does this, but certainly what they used to do 01:00:04.000 |
when I signed up to Netflix was they started showing me lots of movies 01:00:08.000 |
and saying, "Have you seen this? Did you like it? Have you seen this? Did you like it?" 01:00:12.000 |
So they fixed the cold start problem through the UX. 01:00:19.000 |
They found like 20 really common movies and asked me if I liked them. 01:00:23.000 |
They used my replies to those 20 to show me 20 more that I might have seen. 01:00:27.000 |
And by the time I had gone through 60, there was no cold start problem anymore. 01:00:34.000 |
And for new movies, it's not really a problem because like the first 100 users 01:00:38.000 |
who haven't seen the movie, you know, go in and say whether they liked it 01:00:42.000 |
and then the next 100,000, the next million, it's not a cold start problem anymore. 01:00:48.000 |
But the other thing you can do if you, for whatever reason, 01:00:52.000 |
kind of can't go through that UX of like asking people, "Did you like those things?" 01:00:56.000 |
So for example, if you're selling products and you don't really want to show them 01:00:59.000 |
like a big selection of your products and say, "Did you like this?" 01:01:05.000 |
You can instead try and use a metadata-based kind of tabular model. 01:01:14.000 |
You can try and make some guesses about the initial recommendations. 01:01:18.000 |
So collaborative filtering is specifically for once you have a bit of information 01:01:24.000 |
about your users and movies or customers and products or whatever. 01:01:37.000 |
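One way to picture the two-model idea is as a simple routing rule: use the collaborative filtering model once enough ratings exist, and fall back to a metadata-driven model before that. This is only a sketch; the function name, the model labels, and the threshold of 20 ratings are all hypothetical choices for illustration.

```python
def choose_model(n_ratings, min_ratings=20):
    """Pick which model to use for a user (or item) given how much data exists.

    `min_ratings` is a hypothetical threshold, not a recommended value.
    """
    if n_ratings >= min_ratings:
        return "collaborative_filtering"
    return "metadata_model"

print(choose_model(0))    # brand-new user: cold start
print(choose_model(60))   # enough history for collaborative filtering
```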
How does the language model trained in this manner perform on code-switched data 01:01:42.000 |
such as Hindi written in English words or text with a lot of emojis? 01:02:06.000 |
And where they are in Wikipedia, it's more like a Wikipedia page about the emoji 01:02:11.000 |
rather than the emoji being used in a sensible place. 01:02:15.000 |
But you can and should do this language model fine-tuning 01:02:24.000 |
where you take a corpus of text where people are using emojis in usual ways. 01:02:29.000 |
And so you fine-tune the wiki text language model to your Reddit 01:02:37.000 |
If you think about it, there's hundreds of thousands of possible words 01:02:41.000 |
that people can be using, but a small number of possible emojis. 01:02:44.000 |
So it'll very quickly learn how those emojis are being used. 01:02:55.000 |
So I'm not very familiar with Hindi, but I'll take an example. 01:02:59.000 |
In Mandarin, you could have a model that's trained with Chinese characters. 01:03:04.000 |
So there's kind of five or six thousand Chinese characters in common use. 01:03:09.000 |
But there's also a romanization of those characters called pinyin. 01:03:13.000 |
And it's a bit tricky because although there's a 01:03:17.000 |
nearly direct mapping from the character to the pinyin, 01:03:21.000 |
I mean there is a direct mapping, the pronunciation's not exactly direct, 01:03:25.000 |
there isn't direct mapping from the pinyin to the character 01:03:29.000 |
because one pinyin corresponds to multiple characters. 01:03:34.000 |
So the first thing to note is that if you're going to use this approach for Chinese, 01:03:44.000 |
you would need to start with a Chinese language model. 01:03:47.000 |
So actually, FastAI has something called a model zoo 01:03:52.000 |
where we're adding more and more language models for different languages 01:03:56.000 |
and also increasingly for different domain areas 01:04:00.000 |
like English medical texts or even language models for things other than NLP 01:04:05.000 |
like genome sequences, molecular data, musical MIDI notes, and so forth. 01:04:17.000 |
To then convert that, that'll be in either simplified or traditional Chinese, 01:04:22.000 |
to then convert that into, if you want to do pinyin, 01:04:26.000 |
you could either kind of map the vocab directly 01:04:31.000 |
or as you'll learn, these multi-layer models, 01:04:35.000 |
it's only the first layer that basically converts the tokens into a set of vectors. 01:04:43.000 |
You could actually throw that away and fine-tune just the first layer of the model. 01:04:50.000 |
So that second part is going to require a few more weeks of learning 01:04:55.000 |
before you exactly understand how to do that and so forth. 01:04:58.000 |
But if it's something you're interested in doing, we can talk about it on the forum 01:05:01.000 |
because it's a kind of a nice test of understanding. 01:05:12.000 |
Is there an RNN model involved in tabular.models? 01:05:18.000 |
So we're going to look at time series tabular data next week. 01:05:28.000 |
you don't use an RNN for time series tabular data, 01:05:33.000 |
but instead you extract a bunch of columns for things like day of week, 01:05:38.000 |
is it a weekend, is it a holiday, was the store open, stuff like that. 01:05:43.000 |
It turns out that adding those extra columns, 01:05:47.000 |
which you can do somewhat automatically, basically gives you state-of-the-art results. 01:05:55.000 |
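A minimal sketch of that kind of column extraction, using only the standard library. The holiday set is a hypothetical stand-in for whatever calendar applies to your data, and `date_features` is an illustrative helper, not a fastai function.

```python
from datetime import date

# Hypothetical holiday calendar, for illustration only.
HOLIDAYS = {date(2019, 1, 1), date(2019, 12, 25)}

def date_features(d):
    """Turn a single date into a few extra tabular columns."""
    return {
        "day_of_week": d.weekday(),     # Monday is 0, Sunday is 6
        "is_weekend": d.weekday() >= 5,
        "is_holiday": d in HOLIDAYS,
        "month": d.month,
    }

features = date_features(date(2019, 12, 25))
print(features)
```

Each of these columns can then be treated as an ordinary categorical or continuous variable in the tabular model.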
There are some good uses of RNNs for time series, 01:06:01.000 |
but not really for these kind of tabular style time series, 01:06:05.000 |
like retail store logistics databases and stuff like that. 01:06:15.000 |
And is there a source to learn more about the cold start problem? 01:06:26.000 |
If you know a good resource, please mention it on the forums. 01:06:35.000 |
So that is both the break and the middle of lesson four. 01:06:45.000 |
and it's the point at which we have now seen an example of all the key applications. 01:06:50.000 |
And so the rest of this course is going to be digging deeper into how they actually work behind the scenes, 01:06:57.000 |
more of the theory, more of how the source code is written, and so forth. 01:07:03.000 |
So it's a good time to have a nice break, come back. 01:07:10.000 |
And furthermore, it's my birthday today, so it's really a special moment. 01:07:19.000 |
So let's have a break and come back at 5 past 8. 01:07:26.000 |
So Microsoft Excel, this is one of my favorite ways to explore data and understand models. 01:07:44.000 |
And actually, this one we can probably largely do in Google Sheets. 01:07:48.000 |
I've tried to move as much as I can over the last few weeks into Google Sheets, 01:07:53.000 |
but I just keep finding this is such a terrible product. 01:08:01.000 |
Please try to find a copy of Microsoft Excel because there's nothing close. 01:08:10.000 |
Spreadsheets get a bad rap from people that basically don't know how to use them, 01:08:16.000 |
just like people who spend their lives on Excel and then they start using Python. 01:08:20.000 |
And they're like, what the hell is this stupid thing? 01:08:23.000 |
It takes thousands of hours to get really good at spreadsheets, 01:08:26.000 |
but a few dozen hours to get competent at them. 01:08:30.000 |
And once you're competent at them, you can see everything in front of you. 01:08:38.000 |
I'll give you one spreadsheet tip today, which is if you hold down the control key 01:08:43.000 |
or command key on your keyboard and press the arrow keys, here's control right. 01:08:48.000 |
It takes you to the end of a block of a table that you're in, 01:08:52.000 |
and it's by far the best way to move around the place. 01:08:59.000 |
In this case, I want to skip around through this table, 01:09:03.000 |
so I can hit control down right to get to the bottom right, 01:09:07.000 |
control left up to get to the top left, skip around and see what's going on. 01:09:19.000 |
one way to look at collaborative filtering data is like this. 01:09:23.000 |
And so what we did was we grabbed from the movie lens data 01:09:33.000 |
and just filtered the data set down to those 15. 01:09:40.000 |
And as you can see, when you do it that way, it's not sparse anymore. 01:09:52.000 |
So this is something that we can now build a model with. 01:10:04.000 |
What we want to do is we want to create something 01:10:07.000 |
which can predict for user 293 will they like movie 49, for example. 01:10:17.000 |
So we've got to come up with some way of, you know, 01:10:21.000 |
some function that can represent that decision. 01:10:31.000 |
And so we're going to take this idea of doing some matrix multiplications. 01:10:44.000 |
And I've created here another matrix of random numbers. 01:10:50.000 |
More specifically, for each movie I've created five random numbers. 01:10:57.000 |
And for each user I've created five random numbers. 01:11:04.000 |
And so we could say then that user 14, movie 27, did they like it or not? 01:11:18.000 |
Well the rating, what we could do would be to multiply together this vector and that vector. 01:11:25.000 |
We could do a dot product, and here's a dot product. 01:11:31.000 |
And so then we can basically do that for every possible thing in here. 01:11:40.000 |
And, you know, thanks to spreadsheets we can just do that in one place 01:11:44.000 |
and copy it over and it fills in the whole thing for us. 01:11:49.000 |
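The spreadsheet setup can be sketched in plain Python: five random numbers per user, five per movie, and a dot product for every user and movie pair. The IDs below are just examples, and the random values mean nothing yet, exactly as in the spreadsheet before any training.

```python
import random

random.seed(0)
n_factors = 5
user_ids = [14, 29, 72]      # example IDs only
movie_ids = [27, 49, 57]     # example IDs only

# Five random numbers for each user and for each movie.
user_factors = {u: [random.random() for _ in range(n_factors)] for u in user_ids}
movie_factors = {m: [random.random() for _ in range(n_factors)] for m in movie_ids}

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# One predicted rating per (user, movie) cell, like copying the formula across the sheet.
predictions = {(u, m): dot(user_factors[u], movie_factors[m])
               for u in user_ids for m in movie_ids}
print(round(predictions[(14, 27)], 3))
```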
Well, this is the basic starting point of a neural net, isn't it? 01:11:54.000 |
The basic starting point of a neural net is that you take the matrix 01:11:58.000 |
multiplication of two matrices and that's what your first layer always is. 01:12:05.000 |
And so we just have to come up with some way of saying like, 01:12:08.000 |
well, what are two matrices that we can multiply? 01:12:12.000 |
And so clearly, you know, you need a matrix for a user, 01:12:21.000 |
you know, or a vector for a user, a matrix for all the users, 01:12:25.000 |
and a vector for a movie, or a matrix for all the movies, 01:12:31.000 |
and multiply them together and you get some numbers, right? 01:12:38.000 |
Like, so they don't mean anything yet, they're just random, right? 01:12:42.000 |
But we can now use gradient descent to try to make these numbers, 01:12:49.000 |
and these numbers give us results that are closer to what we wanted. 01:13:01.000 |
Well, we've set this up now as a linear model, right? 01:13:06.000 |
So the next thing we need is a loss function. 01:13:10.000 |
So we can calculate our loss function by saying, well, okay, movie 3 01:13:21.000 |
for user_id 14 should have been a rating of 3. 01:13:29.000 |
With these random matrices, it's actually a rating of 0.91. 01:13:33.000 |
So the squared error for that cell would be 3 minus 0.91, squared. 01:13:46.000 |
So there's actually a function for the sum of squared differences in Excel already, SUMXMY2 (sum of x minus y, squared), 01:13:56.000 |
so we can just use the SUMXMY2 function, passing in those two ranges. 01:14:02.000 |
And then divide by the count to get the mean. 01:14:06.000 |
So here is a number that is the square root of the mean squared error. 01:14:14.000 |
So sometimes you'll see people talk about MSE, so that's the mean squared error. 01:14:19.000 |
Sometimes you'll see RMSE, that's the root mean squared error. 01:14:23.000 |
So since I've got a square root at the front, this is the root mean squared error. 01:14:32.000 |
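The same loss can be written out in a few lines of Python. This sketch mirrors the spreadsheet steps: sum the squared differences, divide by the count, and take the square root. The single-cell example uses the 3 and 0.91 from the discussion above; `rmse` is a helper defined here for illustration.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Squared error for the single cell discussed above.
single_error = (3 - 0.91) ** 2
print(round(single_error, 4))

# With only one cell, the RMSE is just the absolute difference.
print(round(rmse([3], [0.91]), 2))
```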
So now all we need to do is use gradient descent to try to modify our weight matrices 01:14:53.000 |
So it's probably worth knowing how to do that, so we have to install add-ins. 01:15:07.000 |
So the gradient descent solver in Excel is called solver, 01:15:16.000 |
You'll need to make sure that in your settings that you've enabled the solver extension, 01:15:21.000 |
And all you need to do is say, which cell represents my loss function? 01:15:36.000 |
So you can see here I've got H19 to V23, which is up here, 01:15:48.000 |
And then you can just say, okay, set your loss function to a minimum 01:15:59.000 |
And you'll see it starts at 2.81, and you can see the numbers going down. 01:16:04.000 |
And so all that's doing is using gradient descent exactly the same way 01:16:09.000 |
that we did when we did it manually in the notebook the other day. 01:16:13.000 |
But rather than minimizing the mean squared error for a @ x in Python, 01:16:23.000 |
it is instead minimizing the loss function here, 01:16:27.000 |
which is the mean squared error of the dot product of each of those vectors 01:16:37.000 |
So we'll let that run for a little while and see what happens. 01:16:42.000 |
But basically in micro, here is a simple way of creating a neural network, 01:16:50.000 |
which is really in this case, it's like just a single linear layer 01:16:57.000 |
with gradient descent to solve a collaborative filtering problem. 01:17:03.000 |
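For comparison, here is a small pure-Python sketch of the same idea: random user and movie vectors, a dot product as the prediction, and plain full-batch gradient descent on the mean squared error. The ratings are made up, the learning rate and step count are arbitrary choices, and this is not a description of what Excel's Solver does internally.

```python
import random

random.seed(42)

# Made-up ratings keyed by (user index, movie index); illustrative only.
ratings = {(0, 0): 3.0, (0, 1): 5.0, (1, 0): 4.0,
           (1, 2): 2.0, (2, 1): 4.5, (2, 2): 1.0}
n_users, n_movies, n_factors = 3, 3, 5

# Start from random numbers, as in the spreadsheet.
users = [[random.random() for _ in range(n_factors)] for _ in range(n_users)]
movies = [[random.random() for _ in range(n_factors)] for _ in range(n_movies)]

def predict(u, m):
    return sum(a * b for a, b in zip(users[u], movies[m]))

def mse():
    return sum((predict(u, m) - r) ** 2 for (u, m), r in ratings.items()) / len(ratings)

def gradient_step(lr):
    """One full-batch gradient descent step on the mean squared error."""
    user_grads = [[0.0] * n_factors for _ in range(n_users)]
    movie_grads = [[0.0] * n_factors for _ in range(n_movies)]
    for (u, m), r in ratings.items():
        err = predict(u, m) - r
        for k in range(n_factors):
            user_grads[u][k] += 2 * err * movies[m][k] / len(ratings)
            movie_grads[m][k] += 2 * err * users[u][k] / len(ratings)
    for u in range(n_users):
        for k in range(n_factors):
            users[u][k] -= lr * user_grads[u][k]
    for m in range(n_movies):
        for k in range(n_factors):
            movies[m][k] -= lr * movie_grads[m][k]

initial_loss = mse()
for _ in range(1000):
    gradient_step(lr=0.02)
final_loss = mse()
print(f"MSE before: {initial_loss:.3f}, after: {final_loss:.3f}")
```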
So let's go back and see what we do over here. 01:17:14.000 |
Okay, so the function that was called in the notebook was get collab learner. 01:17:25.000 |
one of the really good ways to dig deeper into deep learning 01:17:28.000 |
is to dig into the fastai source code and see what's going on. 01:17:33.000 |
And so if you're going to be able to do that, 01:17:35.000 |
you need to know how to use your editor well enough to dig through the source code. 01:17:40.000 |
And basically there's two main things you need to know how to do. 01:17:45.000 |
like a particular class or function by its name. 01:17:49.000 |
And the other is that when you're looking at a particular symbol, 01:17:54.000 |
So for example, in this case I want to find get collab learner. 01:17:58.000 |
So in most editors, including the one I use, Vim, 01:18:05.000 |
you can set it up so that you can kind of hit tab or something 01:18:08.000 |
and it jumps through all the possible completions 01:18:12.000 |
and you can hit enter and it jumps straight to the definition for you. 01:18:18.000 |
So here is the definition of get collab learner. 01:18:24.000 |
And as you can see, it's pretty small as these things tend to be. 01:18:29.000 |
And in this case it kind of wraps a data frame 01:18:34.000 |
and automatically creates a data bunch for you because it's so simple. 01:18:37.000 |
But the key thing it does then is to create a model of a particular kind, 01:18:41.000 |
which is an EmbeddingDotBias model, passing in the various things you asked for. 01:18:47.000 |
So you want to find out in your editor how you jump to the definition of that, 01:18:51.000 |
which in Vim you do by hitting control right square bracket. 01:18:57.000 |
And here is the definition of EmbeddingDotBias. 01:19:02.000 |
And so now we have everything on screen at once 01:19:08.000 |
and as you can see there's not much going on. 01:19:12.000 |
So the models that are being created for you by FastAI 01:19:30.000 |
It's a little more nuanced than that but that's a good starting point for now. 01:19:41.000 |
when you calculate the result of that layer or neural net or whatever, 01:19:45.000 |
specifically it always calls a method for you called forward. 01:19:49.000 |
So it's in here that you get to find out how this thing is actually calculated. 01:19:55.000 |
When the model is built at the start it calls this thing called 01:20:01.000 |
underscore underscore init underscore underscore. 01:20:04.000 |
And as I think we briefly mentioned before in Python, 01:20:07.000 |
people tend to call this dunder init, double underscore init. 01:20:11.000 |
So dunder init is how we create the model and forward is how we run the model. 01:20:19.000 |
One thing if you're watching carefully you might notice 01:20:22.000 |
is there's nothing here saying how to calculate the gradients of the model 01:20:30.000 |
So you only have to tell it how to calculate the output of your model 01:20:34.000 |
and PyTorch will go ahead and calculate the gradients for you. 01:20:44.000 |
the model contains a set of weights for a user, 01:20:57.000 |
and each one of those is coming from this thing called get embedding. 01:21:14.000 |
And all it does basically is it calls this PyTorch thing called nn.Embedding. 01:21:23.000 |
So in PyTorch they have a lot of like standard neural network layers set up for you. 01:21:31.000 |
And then this thing here is, it just randomizes it. 01:21:35.000 |
So this is something which creates normal random numbers for the embedding. 01:21:43.000 |
An embedding, not surprisingly, is a matrix of weights. 01:21:59.000 |
Specifically an embedding is a matrix of weights that looks something like this. 01:22:04.000 |
It's a matrix of weights which you can basically look up into 01:22:16.000 |
and we're going to be digging into this in a lot more detail in the coming lessons, 01:22:20.000 |
but an embedding matrix is just a weight matrix 01:22:23.000 |
that is designed to be something that you kind of index into it as an array 01:22:35.000 |
And so in our case we have an embedding matrix for a user 01:22:43.000 |
And here we have been taking the dot product of them. 01:22:49.000 |
But if you think about it, that's not quite enough 01:22:55.000 |
maybe there are certain movies that everybody likes more. 01:23:00.000 |
Maybe there are some users that just tend to like movies more. 01:23:04.000 |
So I don't really just want to multiply these two vectors together 01:23:08.000 |
but I really want to add a single number of how popular is this movie 01:23:13.000 |
and add a single number of how much does this user like movies in general. 01:23:20.000 |
Remember how I said there's this kind of idea of bias 01:23:24.000 |
and the way we dealt with that in our gradient descent notebook 01:23:30.000 |
But what we tend to do in practice is we actually explicitly say 01:23:40.000 |
So we don't just want to have prediction equals dot product of these two things. 01:23:47.000 |
We want to say it's the dot product of those two things plus a bias term 01:24:08.000 |
and then we also set up the bias vector for the users 01:24:22.000 |
Just like we did, right? We just take that product, 01:24:32.000 |
and then putting aside the min and max score for a moment, 01:24:38.000 |
So you can see that our model is literally doing 01:24:43.000 |
what we did here with the tweak that we're also adding the bias. 01:24:58.000 |
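The dot product plus the two biases can be sketched as a small PyTorch module. This is an illustration under assumptions, not the fastai source: `DotBiasSketch`, `make_embedding`, the standard deviation of 0.01 and the sizes below are all hypothetical choices, and the min and max score step is left out.

```python
import torch
from torch import nn


def make_embedding(n_items, n_factors):
    # Create the lookup table, then fill it with small normal random numbers.
    # The 0.01 standard deviation is an assumed value for this sketch.
    emb = nn.Embedding(n_items, n_factors)
    with torch.no_grad():
        emb.weight.normal_(0, 0.01)
    return emb


class DotBiasSketch(nn.Module):
    """Hypothetical dot-product-plus-bias model for collaborative filtering."""

    def __init__(self, n_factors, n_users, n_movies):
        super().__init__()
        self.u_weight = make_embedding(n_users, n_factors)
        self.m_weight = make_embedding(n_movies, n_factors)
        # A bias is just an embedding with a single number per user or movie.
        self.u_bias = make_embedding(n_users, 1)
        self.m_bias = make_embedding(n_movies, 1)

    def forward(self, users, movies):
        # Look up one vector per user and per movie, then take the dot product.
        dot = (self.u_weight(users) * self.m_weight(movies)).sum(dim=1)
        # Add how much this user likes movies and how popular this movie is.
        return dot + self.u_bias(users).squeeze(1) + self.m_bias(movies).squeeze(1)


model = DotBiasSketch(n_factors=5, n_users=15, n_movies=15)
users = torch.tensor([0, 3, 7])    # user ids for three ratings
movies = torch.tensor([2, 2, 9])   # movie ids for the same three ratings
preds = model(users, movies)
print(preds.shape)
```

Each id simply picks out a row of the corresponding weight matrix, which is all that "looking up an embedding" means here.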
And for these kinds of collaborative filtering problems, 01:25:04.000 |
this kind of simple linear model actually tends to work pretty well. 01:25:13.000 |
And then there's one tweak that we do at the end, 01:25:17.000 |
which is that in our case we said that there's a min score of 0 01:25:40.000 |
so you do that dot product and you add on the two biases 01:25:43.000 |
and that could give you any possible number along the number line 01:25:47.000 |
from very negative through to very positive numbers. 01:25:50.000 |
But we know that we always want to end up with a number between 0 and 5. 01:26:01.000 |
So what if we mapped that number line like so to this function? 01:26:12.000 |
And so the shape of that function is called a sigmoid. 01:26:20.000 |
And so it's going to asymptote to 5 and it's going to asymptote to 0. 01:26:27.000 |
And so that way whatever number comes out of our dot product 01:26:33.000 |
and adding the biases, if we then stick it through this function, 01:26:37.000 |
it's never going to be higher than 5 and never going to be smaller than 0. 01:26:45.000 |
because our parameters could learn a set of weights 01:26:54.000 |
So why would we do this extra thing if it's not necessary? 01:26:57.000 |
Well the reason is we want to make life as easy for our model as possible. 01:27:08.000 |
so it's impossible for it to ever predict too much or ever predict too little, 01:27:12.000 |
then it can spend more of its weights predicting the thing we care about 01:27:16.000 |
which is deciding who's going to like what movie. 01:27:19.000 |
So this is an idea we're going to keep coming back to when it comes to making neural networks work better. 01:27:26.000 |
It's about all these little decisions that we make 01:27:29.000 |
to basically make it easier for the network to learn the right thing. 01:27:34.000 |
So that's the last tweak here, which is we take 01:27:44.000 |
we put it through a sigmoid, and so a sigmoid is just a function, 01:27:48.000 |
it's basically 1 over 1 plus e to the minus x, the definition doesn't much matter 01:27:52.000 |
but it just has the shape that I just mentioned, 01:27:58.000 |
And if you then multiply that by max minus min and then add min, 01:28:02.000 |
then that's going to give you something that's between min score and max score. 01:28:06.000 |
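To make the scaling concrete, here is a small sketch in plain Python. The function names and the sample inputs are made up for illustration; the sigmoid itself is the standard 1 / (1 + e^(-x)).

```python
import math


def sigmoid(x):
    # Standard sigmoid: squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))


def scaled_sigmoid(x, min_score, max_score):
    # Stretch (0, 1) to (min_score, max_score).
    return sigmoid(x) * (max_score - min_score) + min_score


raw_scores = [-10.0, -1.0, 0.0, 1.0, 10.0]   # anything along the number line
scaled = [scaled_sigmoid(s, 0.0, 5.0) for s in raw_scores]
print(scaled)
```

However negative or positive the raw number is, the result stays strictly between 0 and 5, and a raw value of 0 lands at the midpoint.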
So that means that this tiny little neural network, 01:28:12.000 |
I mean it's a push to call it a neural network, but it is. 01:28:14.000 |
It's a neural network with one weight matrix and no non-linearities, 01:28:20.000 |
so it's kind of the world's most boring neural network with a sigmoid at the end. 01:28:25.000 |
That's actually, I guess it does have a non-linearity, 01:28:34.000 |
That actually turns out to give close to state-of-the-art performance, 01:28:42.000 |
like I've looked up online to find out what are the best results people have 01:28:48.000 |
and the results I get from this little thing are better than any of the results 01:28:52.000 |
I can find from the standard commercial products that you can download 01:28:58.000 |
And the trick seems to be that adding this little sigmoid makes a big difference. 01:29:10.000 |
There was a question about how you set up your VIM, 01:29:14.000 |
but I wanted to know if you had more to say about that. 01:29:34.000 |
When you've got a class that you're not currently working on, 01:29:44.000 |
You basically want something where it's easy to close and open folds. 01:29:52.000 |
And then, as I mentioned, you also want something 01:29:54.000 |
where you can jump to the definition of things, 01:30:00.000 |
So if you want to jump to the definition of "learner." 01:30:02.000 |
Basically, VIM already does all this for you. 01:30:08.000 |
I basically hardly use any extensions or anything. 01:30:12.000 |
Another great editor to use is VS Code, Visual Studio Code. 01:30:20.000 |
and it has all the same features that you're seeing that VIM does. 01:30:23.000 |
Basically, VS Code does all of those things as well. 01:30:29.000 |
I quite like using VIM because I can use it on the remote machine 01:30:33.000 |
and play around, but you can of course just clone the Git repo 01:30:40.000 |
onto your local computer and open it up in VS Code to play around with it. 01:30:44.000 |
Just don't try and look through the code just on GitHub or something. 01:30:50.000 |
You need to be able to open it and close it and jump and jump back. 01:30:55.000 |
Maybe people can create some threads on the forum for VIM tips, 01:31:04.000 |
For me, I would say if you're going to pick an editor, 01:31:12.000 |
If you want to use something on the terminal side, 01:31:15.000 |
I would go with VIM or Emacs. To me, they're clear winners. 01:31:27.000 |
What I wanted to close with today is to take this collaborative filtering example 01:31:32.000 |
and describe how we're going to build on top of it for the next three lessons 01:31:36.000 |
to create the more complex neural networks we've been seeing. 01:31:40.000 |
Roughly speaking, this is the bunch of concepts that we need to learn about. 01:31:49.000 |
Let's think about what happens when you're using a CNN or a neural network 01:32:15.000 |
You've got lots of pixels, but let's take a single pixel. 01:32:27.000 |
Each one of those is some number between 0 and 255. 01:32:31.000 |
We kind of normalize them so they're a floating point 01:32:35.000 |
with the mean of 0 and standard deviation of 1. 01:32:56.000 |
We basically treat that as a vector and we multiply it by a matrix. 01:33:06.000 |
Depending on how you think of the rows and the columns, 01:33:10.000 |
let's treat the matrix as having three rows and then how many columns? 01:33:19.000 |
Just like with the collaborative filtering version, 01:33:23.000 |
I decided to pick a vector of size 5 for each of my embedding vectors. 01:33:30.000 |
That would mean that that's an embedding of size 5. 01:33:35.000 |
You can get to pick how big your weight matrix is. 01:33:46.000 |
Initially, this weight matrix contains random numbers. 01:33:51.000 |
Remember when we looked up get embedding matrix just now? 01:33:55.000 |
The first line was like create the matrix and the second was fill it with random numbers. 01:34:00.000 |
It all gets hidden behind the scenes by fastai and PyTorch. 01:34:05.000 |
It's creating a matrix of random numbers when you set it up. 01:34:10.000 |
The number of rows has to be 3 to match the input. 01:34:14.000 |
The number of columns can be as big as you like. 01:34:16.000 |
After you multiply the input vector by that weight matrix, 01:34:20.000 |
you're going to end up with a vector of size 5. 01:34:29.000 |
People often ask how much linear algebra do I need to know to be able to do deep learning? 01:34:38.000 |
If you're not familiar with this, that's fine. 01:34:46.000 |
You just need to know computationally what do they do. 01:34:52.000 |
You've got to be very comfortable with a matrix of size blah times a matrix of size blah 01:35:02.000 |
If you have a vector of size 3, and remember in NumPy and PyTorch we use @ for matrix multiplication, then 3 @ 3 by 5 gives a vector of size 5. 01:35:16.000 |
It goes through an activation function such as ReLU, which is just max(0, x), 01:35:26.000 |
and spits out a new vector which is of course going to be exactly the same size. 01:35:33.000 |
Because no activation function changes the size. 01:35:38.000 |
It only changes the contents, so that's still a size 5. 01:35:49.000 |
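The size bookkeeping can be checked with a few lines of plain Python. This is only a sketch: the pixel values are invented, and `matvec` and `relu` are hypothetical helpers written out by hand so the shapes are visible, rather than the `@` operator from NumPy or PyTorch.

```python
import random


def matvec(vector, matrix):
    """Multiply a vector of length n by an n-by-m matrix given as a list of rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    assert len(vector) == n_rows, "vector length must match the number of rows"
    return [sum(vector[i] * matrix[i][j] for i in range(n_rows))
            for j in range(n_cols)]


def relu(vector):
    # max(0, x) applied to each element: contents change, size does not.
    return [max(0.0, v) for v in vector]


random.seed(0)
pixel = [0.2, -1.3, 0.7]   # three normalized values for one pixel (made up)
weights = [[random.gauss(0, 1) for _ in range(5)] for _ in range(3)]  # 3 rows, 5 columns

hidden = matvec(pixel, weights)   # size 3 times 3 by 5 gives size 5
activated = relu(hidden)          # still size 5
print(len(hidden), len(activated))
```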
And again, it can be any number of columns, but the number of rows has to map nicely. 01:36:19.000 |
And again, that gives us something of the same size. 01:36:23.000 |
And then we can put that through another matrix. 01:36:31.000 |
Actually, just to make this a bit clearer, you'll see why in a moment, I'm going to use 8, not 10. 01:36:45.000 |
So my last weight matrix has to be 10 in size. 01:36:54.000 |
Because then that's going to mean my final output is a vector of 10 in size. 01:37:00.000 |
And remember, if we're doing digit recognition, what happens? 01:37:15.000 |
And if the number that we're trying to predict was the number 3, that's the thing we're trying to predict. 01:37:25.000 |
Then that means the target is a vector of zeros with a 1 in the position for the digit 3. 01:37:34.000 |
So what happens is our neural net runs along, starting with our input. 01:37:43.000 |
And going weight matrix, ReLU, weight matrix, ReLU, weight matrix, final output. 01:37:51.000 |
And then we compare these two together to see how close they are, how close they match, using some loss function. 01:38:00.000 |
We'll learn about all the loss functions that we use next week. 01:38:03.000 |
For now, the only one we've learned is mean squared error. 01:38:06.000 |
And we compare the predictions, which you can think of as probabilities for each of the 10, to the actual values for each of the 10 to get a loss. 01:38:16.000 |
And then we find the gradients of every one of the weight matrices with respect to that, and we update the weight matrices. 01:38:23.000 |
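The whole forward pass, up to the loss, can be sketched in plain Python. The layer sizes 3, 5, 8 and 10 are an assumption for this example, as are the input values and the helper names. The gradient and update step is left out here.

```python
import random


def matvec(vector, matrix):
    # Vector of length n times an n-by-m matrix (list of rows) gives length m.
    return [sum(vector[i] * matrix[i][j] for i in range(len(matrix)))
            for j in range(len(matrix[0]))]


def relu(vector):
    return [max(0.0, v) for v in vector]


def random_matrix(n_rows, n_cols):
    # Parameters start out as random numbers.
    return [[random.gauss(0, 0.1) for _ in range(n_cols)] for _ in range(n_rows)]


def one_hot(index, size):
    # All zeros, with a 1 in the position of the correct class.
    return [1.0 if i == index else 0.0 for i in range(size)]


def mse(predicted, actual):
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


random.seed(0)
# Parameters: numbers that are stored and used to make a calculation.
w1, w2, w3 = random_matrix(3, 5), random_matrix(5, 8), random_matrix(8, 10)

x = [0.2, -1.3, 0.7]            # input (made-up values)
# Activations: numbers that are the result of a calculation.
a1 = relu(matvec(x, w1))        # size 5
a2 = relu(matvec(a1, w2))       # size 8
output = matvec(a2, w3)         # size 10, one number per digit

target = one_hot(3, 10)         # the digit we are trying to predict is 3
loss = mse(output, target)
print(len(output), loss)
```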
So the main thing I wanted to show right now is the terminology we use, because it's really important. 01:38:36.000 |
Specifically, they initially are matrices containing random numbers. 01:38:40.000 |
And we can refer to these yellow things, in PyTorch they're called parameters. 01:38:51.000 |
Sometimes we'll refer to them as weights, although weights is slightly less accurate because there can also be biases. 01:39:00.000 |
But we kind of use the terms a little bit interchangeably, but strictly speaking we should call them parameters. 01:39:06.000 |
And then after each of those matrix products, that calculates a vector of numbers. 01:39:12.000 |
So here are some numbers that are calculated by a weight matrix, multiply. 01:39:26.000 |
And then there are some other sets of numbers that are calculated as a result of a ReLU or another activation function. 01:39:52.000 |
So activations and parameters both refer to numbers, they are numbers. 01:39:58.000 |
The parameters are numbers that are stored, they're used to make a calculation. 01:40:05.000 |
Activations are the result of a calculation, they're numbers that are calculated. 01:40:11.000 |
So they're the two key things you need to remember. So use these terms, and use them correctly and accurately. 01:40:25.000 |
And if you read these terms they mean these very specific things, so don't mix them up in your head. 01:40:31.000 |
And remember they're nothing weird and magical, they're very simple things. An activation is the result of either a matrix multiply or an activation function. 01:40:43.000 |
And the parameters are the numbers inside the matrices that we multiply by. 01:40:49.000 |
Okay, that's it. And then there are some special layers. Every one of these things that does a calculation is called a layer. 01:41:07.000 |
They're the layers of our neural net. So every layer results in a set of activations, because there's a calculation that results in a set of results. 01:41:19.000 |
There's a special layer at the start, which is called the input layer, and then at the end you just have a set of activations. 01:41:28.000 |
And we can refer to those special, I mean they're not special mathematically, but they're semantically special, we can call those the outputs. 01:41:37.000 |
So the important point to realize here is the outputs of a neural net are not actually mathematically special, they're just the activations of a layer. 01:41:46.000 |
And so what we did in our collaborative filtering example, we did something interesting, we actually added an additional activation function right at the very end. 01:42:00.000 |
We added an extra activation function, which was sigmoid. 01:42:06.000 |
Specifically it was a scaled sigmoid, between 0 and 5. And that's really common. It's very common to have an activation function as your last layer. 01:42:16.000 |
And it's almost never going to be a ReLU, because it's very unlikely that what you actually want is something that stops, that truncates at 0. 01:42:24.000 |
It's very often going to be a sigmoid or something similar, because it's very likely that actually what you want is something that's between two values, and kind of scaled in that way. 01:42:34.000 |
So that's nearly it, right? So we've got inputs, weights, activations, activation functions, which we sometimes call non-linearities, output. 01:42:46.000 |
And then the function that compares those two things together is called the loss function, which so far we've used MSE. 01:42:59.000 |
Yeah, okay. And that's enough for today. So what we're going to do next week is we're going to kind of add in a few more extra bits, 01:43:10.000 |
which is we're going to learn the loss function that's used for classification, which is called cross-entropy. 01:43:15.000 |
We're going to use the activation function that's used for single-label classification, which is called softmax. 01:43:21.000 |
And we're also going to learn exactly what happens when we do fine-tuning in terms of these layers: what happens with unfreeze, and what happens when we do transfer learning. 01:43:32.000 |
So thanks, everybody. Looking forward to seeing you next week.