Lesson 4: Practical Deep Learning for Coders 2022
Chapters
0:00 Using Huggingface
3:24 Finetuning pretrained model
5:14 ULMFit
9:15 Transformer
10:52 Zeiler & Fergus
14:47 US Patent Phrase to Phrase Matching Kaggle competition
16:10 NLP Classification
20:56 Kaggle configs, insert python in bash, read competition website
24:51 Pandas, numpy, matplotlib, & pytorch
29:26 Tokenization
33:20 Huggingface model hub
36:40 Examples of tokenized sentences
38:47 Numericalization
41:13 Question: rationale behind how input data was formatted
43:20 ULMFit fits large documents easily
45:55 Overfitting & underfitting
50:45 Splitting the dataset
52:31 Creating a good validation set
57:13 Test set
59:00 Metric vs loss
61:27 The problem with metrics
64:10 Pearson correlation
70:27 Correlation is sensitive to outliers
74:00 Training a model
79:20 Question: when is it ok to remove outliers?
82:10 Predictions
85:30 Opportunities for research and startups
86:16 Misusing NLP
93:00 Question: isn’t the target categorical in this case?
00:00:00.000 |
Hi, everybody, and welcome to Practical Deep Learning for Coders Lesson 4, which I think 00:00:07.120 |
is the lesson that a lot of the regulars in the community have been most excited about 00:00:12.120 |
because it's where we're going to get some totally new material, a totally new topic we've not covered before. 00:00:20.520 |
We're going to cover natural language processing, NLP, and you'll find there is indeed a chapter 00:00:25.800 |
about that in the book, but we're going to do it in a totally different way to how it's done in the book. 00:00:31.680 |
In the book, we do NLP using the FastAI library using recurrent neural networks, RNNs. 00:00:40.920 |
Today we're going to do something else, which is we're going to do transformers. 00:00:48.360 |
And we're not even going to use the FastAI library at all, in fact. 00:00:54.240 |
So what we're going to be doing today is we're going to be fine-tuning a pre-trained NLP 00:01:01.680 |
model using a library called Hugging Face Transformers. 00:01:06.080 |
Now given this is the Fast.AI course, you might be wondering why we'd be using a different library. 00:01:13.880 |
The reason is that I think that it's really useful for everybody to have experience and 00:01:21.440 |
practice of using more than one library, because you'll get to see the same concepts applied in different ways. 00:01:30.960 |
And I think that's great for your understanding of what these concepts are. 00:01:37.040 |
Also I really like the Hugging Face Transformers library. 00:01:40.440 |
It's absolutely the state of the art in NLP, and it's well worth knowing. 00:01:47.040 |
If you're watching this on video, by the time you're watching it, we will probably have 00:01:50.280 |
completed our integration of the transformers library into FastAI, so it's in the process 00:01:54.760 |
of becoming the main NLP foundation for FastAI. 00:02:01.480 |
So you'll be able to combine transformers and FastAI together. 00:02:08.800 |
So I think there's a lot of benefits to this, and in the end you're going to know how to 00:02:15.600 |
Now the other thing is Hugging Face Transformers doesn't have the same layered architecture 00:02:22.720 |
that FastAI has, which means, particularly for beginners, the kind of high level, top 00:02:31.200 |
tier API that you'll be using most of the time is not as ready to go for beginners as FastAI's is. 00:02:41.680 |
And so that's actually, I think, a good thing. 00:02:44.880 |
You know the basic idea now of how gradient descent works and how parameters are learned 00:02:55.800 |
I think you're ready to try using a somewhat lower level library that does a little bit less for you. 00:03:04.520 |
It's a very well-designed library, and it's still reasonably high level, but you're going 00:03:11.480 |
And that's kind of how the rest of the course in general is going to be on the whole, is 00:03:14.920 |
we're going to get a bit deeper and a bit deeper and a bit deeper. 00:03:19.240 |
Now so first of all, let's talk about what we're going to be doing with fine-tuning a pre-trained model. 00:03:27.040 |
We've talked about that in passing before, but we haven't really been able to describe 00:03:30.880 |
it in any detail because you haven't had the foundations. 00:03:37.040 |
You played with these sliders last week, and hopefully you've all actually gone into this 00:03:42.720 |
notebook and dragged them around and tried to get an intuition for this idea of moving 00:03:47.600 |
them up and down, makes the loss go up and down, and so forth. 00:03:51.580 |
So I mentioned that your job was to move these sliders to get this as nice as possible, but 00:03:58.640 |
when it was given to you, the person who gave it to you said, "Oh, actually slider A, that 00:04:08.720 |
And slider B, we think it's like around two and a half. 00:04:16.280 |
Now that would be pretty helpful, wouldn't it, because you could immediately start focusing 00:04:20.800 |
on the one we have no idea about, get that in roughly the right spot, and then the one 00:04:25.000 |
you've kind of got a vague idea about, you could just tune it a little bit, and the one 00:04:27.920 |
that they said was totally confident you wouldn't move at all, you would probably tune these 00:04:37.120 |
A pre-trained model is a bunch of parameters that have already been fit, where some of 00:04:43.880 |
them are already pretty confident of what they should be, and some of them we really have no idea about at all. 00:04:50.760 |
And so fine-tuning is the process of taking those ones we have no idea what they should 00:04:54.840 |
be at all and trying to get them right, and then moving the other ones a little bit. 00:05:01.440 |
The idea of fine-tuning a pre-trained NLP model in this way was pioneered by an algorithm 00:05:10.120 |
called ULMfit, which was first presented actually in a fast AI course, I think the very first fast.ai course. 00:05:18.720 |
It was later turned into an academic paper by me in conjunction with a then PhD student 00:05:23.960 |
named Sebastian Ruder, who's now one of the world's top NLP researchers, and went on to 00:05:28.960 |
help inspire a huge change, a huge step improvement in NLP capabilities around the world, along 00:05:38.440 |
with a number of other important innovations at the time. 00:05:43.300 |
This is the basic process that ULMfit described. 00:05:51.600 |
Step one was to build something called a language model using basically nearly all of Wikipedia. 00:05:58.740 |
And what the language model did was it tried to predict the next word of a Wikipedia article, 00:06:06.760 |
in fact every next word of every Wikipedia article. 00:06:14.120 |
There are Wikipedia articles which would say things like the 17th prime number is dot, dot, 00:06:25.640 |
dot, or the 40th president of the United States, Blah, said at his residence, Blah, that. 00:06:34.880 |
Filling in these kinds of things requires understanding a lot about how language is 00:06:40.400 |
structured and about the world and about math and so forth. 00:06:47.560 |
So to get good at being a language model, a neural network has to get good at a lot of things. 00:06:55.160 |
It has to understand how language works at a reasonably good level, and it needs to understand 00:07:00.880 |
what it's actually talking about, and what is actually true, what is actually not true, 00:07:05.720 |
and the different ways in which things are expressed, and so forth. 00:07:10.480 |
So this was trained using a very similar approach to what we'll be looking at for fine tuning, 00:07:17.440 |
but it started with random weights, and at the end of it there was a model that could 00:07:21.040 |
predict more than 30% of the time correctly what the next word of a Wikipedia article would be. 00:07:31.040 |
So in this particular case for the ULM FIT paper, we then took that and we were trying 00:07:36.200 |
to-- the first task I did actually for the FAST AI course back when I invented this was 00:07:42.600 |
to try and figure out whether IMDB movie reviews were positive or negative sentiment. 00:07:51.440 |
So what I did was I created a second language model. 00:07:55.200 |
So again, the language model here is something that predicts the next word of a sentence. 00:07:58.640 |
But rather than using Wikipedia, I took this pre-trained model that was trained on Wikipedia, 00:08:04.360 |
and I ran a few more epochs using IMDB movie reviews. 00:08:10.000 |
So it got very good at predicting the next word of an IMDB movie review. 00:08:15.760 |
And then finally, I took those weights and I fine-tuned them for the task of predicting 00:08:24.120 |
whether or not a movie review was positive or negative sentiment. 00:08:32.640 |
This is a particularly interesting approach because this very first model-- in fact, the 00:08:37.640 |
first two models-- if you think about it, they don't require any labels. 00:08:41.000 |
They didn't have to collect any kind of document categories or do any kind of surveys or collect 00:08:48.560 |
All I needed was the actual text of Wikipedia and movie reviews themselves because the labels 00:08:57.360 |
Now since we built ULMfit-- and we used RNNs, recurrent neural networks, for this-- at 00:09:07.840 |
about the same time-ish that we released this, a new kind of architecture, particularly useful 00:09:13.040 |
for NLP at the time, was developed called transformers. 00:09:17.800 |
And transformers were particularly built because they can take really good advantage of modern accelerators. 00:09:29.280 |
They didn't really allow you to predict the next word of a sentence. 00:09:36.680 |
It's just not how they're structured for reasons we'll talk about probably in part two of the 00:09:42.520 |
So they threw away the idea of predicting the next word of a sentence. 00:09:45.440 |
And then instead, they did something just as good and pretty clever. 00:09:49.680 |
They took kind of chunks of Wikipedia or whatever text they're looking at and deleted at random 00:09:55.700 |
a few words and asked the model to predict what were the words that were deleted, essentially. 00:10:04.960 |
Other than that, the basic concept was the same as ULMfit. 00:10:09.060 |
They replaced our RNN approach with a transformer model. 00:10:12.520 |
They replaced our language model approach with what's called a masked language model. 00:10:16.240 |
But other than that, the basic idea was the same. 00:10:19.080 |
So today, we're going to be looking at models using what's become the much more popular 00:10:27.040 |
approach than ULMfit, which is this transformers masked language model approach. 00:10:34.000 |
And I should mention, we do have a professor from the University of Queensland, John Williams, 00:10:41.760 |
joining us, who will be asking the highest voted questions from the community. 00:10:50.880 |
I suspect this is where you're going tonight. 00:10:53.200 |
But we've got a good question here on the forum, which is, how do you go from a model 00:10:57.800 |
that's trained to predict the next word to a model that can be used for classification? 00:11:04.600 |
So, yeah, we will be getting into that in more detail. 00:11:09.440 |
And in fact, maybe a good place to start would be the next slide, kind of give you a sense 00:11:16.120 |
You might remember in lesson one, we looked at this fantastic Zeiler and Fergus paper, where 00:11:21.480 |
we looked at visualizations of the first layer of a ImageNet classification model. 00:11:28.160 |
And layer one had sets of weights that found diagonal edges. 00:11:34.400 |
And here are some examples of bits of photos that successfully matched with and opposite 00:11:42.640 |
And here's some examples of bits of pictures that matched. 00:11:46.120 |
And then layer two combined those, and now you know how those were combined, right? 00:11:51.680 |
These were rectified linear units that were added together, and then sets of those rectified 00:11:58.520 |
linear units, the outputs of those, they're called activations, were then themselves run 00:12:02.880 |
through a matrix multiplier, a rectified linear unit, added together. 00:12:06.480 |
So now you don't just have to have edge detectors, but layer two had corner detectors. 00:12:11.640 |
And here's some examples of some corners that that corner detector successfully found. 00:12:16.240 |
And remember, these were not engineered in any way, they just evolved from the gradient 00:12:24.960 |
Layer two had examples of circle detectors, as it turns out. 00:12:29.160 |
And skipping a bit, by the time we got to layer five, we had bird and lizard eyeball 00:12:35.120 |
detectors, and dog face detectors, and flower detectors, and so forth. 00:12:45.800 |
Nowadays, you'd have something like a ResNet 50, would be something you'd probably be training 00:12:51.020 |
pretty regularly in this course, so that you've got 50 layers, not just five layers. 00:12:56.840 |
Now the later layers do things that are much more specific to the training task, which 00:13:04.640 |
is actually predicting, really, what is it that we're looking at. 00:13:09.360 |
The early layers, pretty unlikely you're going to need to change them much, as long as you're 00:13:17.360 |
You're going to need edge detectors, gradient detectors. 00:13:21.160 |
So what we do in the fine-tuning process is there's actually one extra layer after this, 00:13:29.240 |
which is the layer that actually says, what is this? 00:13:33.720 |
You actually delete that, or you throw it away. 00:13:36.520 |
So now that last matrix multiply has one output, or one output per category you're predicting. 00:13:45.520 |
So the model now has that last matrix that's spitting out, it depends, but generally a 00:13:54.840 |
What we do, as we'll learn more shortly in the coming lesson, we just stick a new random matrix on the end. 00:14:06.080 |
So it learns to use these kinds of features to predict whatever it is you're trying to predict. 00:14:14.280 |
And then we gradually train all of those layers. 00:14:19.680 |
And so that's a bit hand-wavy, but we'll, particularly in part two, actually build that 00:14:28.280 |
And in fact, in this lesson, time permitting, we're actually going to start going down the 00:14:31.300 |
process of actually building a real world neural net in Python. 00:14:37.440 |
So we'll be starting to actually make some progress towards that goal. 00:14:47.080 |
So we're going to look at a Kaggle competition that's actually on, as I speak. 00:14:54.740 |
And I created this notebook called Getting Started with NLP for Absolute Beginners. 00:15:00.100 |
And so the competition is called the US Patent Phrase-to-Phrase Matching Competition. 00:15:07.400 |
And so I'm going to take you through a complete submission to this competition. 00:15:16.480 |
And Kaggle competitions are interesting, particularly the ones that are not playground competitions, 00:15:20.560 |
but the real competitions with real money applied. 00:15:22.840 |
They're interesting because this is an actual project that an actual organization is prepared 00:15:29.620 |
to invest money in getting solved using their actual data. 00:15:34.240 |
So a lot of people are a bit dismissive of Kaggle competitions as being not very real. 00:15:41.440 |
You're not worrying about stuff like productionizing the model. 00:15:44.660 |
But in terms of getting real data about a real problem that real organizations really 00:15:50.160 |
care about and a very direct way to measure the accuracy of your solution, you can't really beat Kaggle. 00:15:59.040 |
It's a good competition to experiment with for trying NLP. 00:16:02.840 |
Now, as I mentioned here, probably the most widely useful application for NLP is classification. 00:16:10.200 |
And as we've discussed in computer vision, classification refers to taking an object 00:16:15.040 |
and trying to identify a category that object belongs to. 00:16:19.960 |
So previously, we've mainly been looking at images. 00:16:22.720 |
Today, we're going to be looking at documents. 00:16:26.560 |
Now in NLP, when we say document, we don't specifically mean a 20-page long essay. 00:16:36.280 |
A document could be three or four words, or a document could be the entire encyclopedia. 00:16:41.720 |
So a document is just an input to an NLP model that contains text. 00:16:50.360 |
Now classifying a document, so deciding what category a document belongs to, is a surprisingly useful thing to do. 00:17:00.880 |
There's all kinds of stuff you could do with that. 00:17:02.920 |
So for example, we've already mentioned sentiment analysis. 00:17:07.680 |
We try to decide on the category, positive or negative sentiment. 00:17:11.520 |
Author identification would be taking a document and trying to find the category of author. 00:17:18.120 |
Legal discovery would be taking documents and putting them into categories according to whether they're in scope for a trial. 00:17:25.600 |
Triaging inbound emails would be putting them into categories of throw away, send to customer service, and so forth. 00:17:41.360 |
And for people interested in trying out NLP in real life, I would suggest classification 00:17:48.080 |
would be the place I would start for looking for accessible, real world, useful problems 00:17:58.240 |
Now the Kaggle competition does not immediately look like a classification competition. 00:18:13.040 |
What it contains is data that looks like this. 00:18:16.200 |
It has a thing that they call anchor, a thing they call target, a thing they call context, and a score. 00:18:22.600 |
Now these are, I can't remember exactly how it is, but I think these are from patents. 00:18:29.840 |
And I think on the patents there are various things they have to fill in in the patent. 00:18:40.840 |
And in the competition, the goal is to come up with a model that automatically determines 00:18:45.240 |
which anchor and target pairs are talking about the same thing. 00:18:50.640 |
So a score of one here, wood article and wooden article obviously talking about the same thing. 00:18:57.400 |
A score of zero here, abatement and forest region not talking about the same thing. 00:19:02.800 |
So the basic idea is that we're trying to guess the score. 00:19:08.120 |
And it's kind of a classification problem, kind of not. 00:19:12.080 |
We're basically trying to classify things into either these two things are the same 00:19:18.080 |
It's kind of not because we have not just one and zero, but also 0.25, 0.5 and 0.75. 00:19:24.480 |
There's also a column called context, which I believe is like the category that this patent was filed under. 00:19:31.920 |
And my understanding is that whether the anchor and the target count as similar or not depends on that context. 00:19:42.280 |
So how would we take this and turn it into something like a classification problem? 00:19:51.080 |
So the suggestion I make here is that we could basically say, OK, let's put some constant 00:20:01.640 |
string like text one or field one before the first column, and then something else like text two before the second. 00:20:12.000 |
Maybe also the context we should have as well, say text three before the context, and then try to 00:20:16.840 |
choose a category of meaning similarity: different, similar, or identical. 00:20:20.280 |
So we could basically concatenate those three pieces together, call that a document, and 00:20:25.800 |
then try to train a model that can predict these categories. 00:20:30.360 |
That would be an example of how we can take this basically similarity problem and turn 00:20:38.280 |
it into something that looks like a classification problem. 00:20:41.760 |
And we tend to do this a lot in deep learning is we kind of take problems that look a bit 00:20:49.280 |
novel and different and turn them into a problem that looks like something we recognize. 00:20:54.640 |
So on Kaggle, this is a larger data set that you're going to need a GPU to run. 00:21:05.640 |
So you can click on the Accelerator button and choose GPU to make sure that you're using one. 00:21:11.840 |
If you click Copy and Edit on my document, I think that will happen for you automatically. 00:21:17.120 |
Personally, I like using things like PaperSpace generally better than Kaggle. 00:21:26.000 |
Kaggle's pretty good, but you only get 30 hours a week of GPU time, and the notebook editor 00:21:32.680 |
for me is not as good as the real JupyterLab environment. 00:21:36.480 |
So there's some information here I won't go through, but it basically describes how you 00:21:40.960 |
can download stuff to PaperSpace or your own computer as well if you want to. 00:21:48.160 |
So I basically create this little boolean always in my notebooks called isKaggle, which 00:21:53.400 |
is going to be true if it's running on Kaggle and false otherwise, and any little changes 00:21:57.100 |
I need to make, I'd say if isKaggle and put those changes. 00:22:05.260 |
So here you can see here if I'm not on Kaggle and I don't have the data yet, then download 00:22:11.600 |
And Kaggle has a little API, which is quite handy for doing stuff like downloading data 00:22:15.600 |
and uploading notebooks and stuff like that, submitting to competitions. 00:22:22.040 |
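Here's a rough sketch of that setup; the environment variable check and the exact Kaggle API calls are my assumptions about the approach being described, not a verbatim copy of the notebook:

    import os
    from pathlib import Path

    # True when the notebook is running inside a Kaggle kernel (env var name is an assumption)
    iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '') != ''

    path = Path('../input/us-patent-phrase-to-phrase-matching') if iskaggle \
           else Path('us-patent-phrase-to-phrase-matching')

    if not iskaggle and not path.exists():
        import zipfile, kaggle
        kaggle.api.competition_download_cli(str(path))   # downloads <competition>.zip
        zipfile.ZipFile(f'{path}.zip').extractall(path)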
If we are on Kaggle, then the data's already going to be there for us, which is actually 00:22:25.560 |
a good reason for beginners to use Kaggle is you don't have to worry about grabbing 00:22:30.520 |
It's sitting there for you as soon as you open the notebook. 00:22:35.680 |
Kaggle has a lot of Python packages installed, but not necessarily all the ones you want. 00:22:42.400 |
And at the point I wrote this, they didn't have HuggingFace's datasets package for some 00:22:47.440 |
reason, so you can always just install stuff. 00:22:49.880 |
So you might remember the exclamation mark means this is not a Python command, but a bash command. 00:22:57.000 |
But it's quite neat. You can even put bash commands inside Python conditionals. 00:23:03.340 |
So that's a pretty cool little trick in notebooks. 00:23:09.800 |
Another cool little trick in notebooks is that if you do use a bash command, like ls, 00:23:15.640 |
but you then want to insert the contents of a Python variable, just chuck it in curly braces. 00:23:20.800 |
So I've got a Python variable called path, and I can go ls path in curly braces, and that 00:23:27.700 |
will ls the contents of the Python variable path. 00:23:33.280 |
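As a small sketch of those two notebook tricks (the package name and the path variable are just the ones mentioned above, and these lines only work inside a Jupyter or Kaggle cell):

    # Bash command inside a Python conditional:
    #     if not iskaggle:
    #         !pip install -q datasets
    #
    # Inserting a Python variable into a bash command with curly braces:
    #     !ls {path}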
So when we ls that, we can see that there's some CSV files. 00:23:37.640 |
So what I'm going to do is kind of take you through roughly the process, the kind of process 00:23:42.240 |
I went through when I first look at a competition. 00:23:46.640 |
So the first thing is to read the data set. 00:23:53.560 |
As well as looking at it here, the other thing I would do is I would go to the competition 00:23:59.360 |
website, and if you go to data, a lot of people skip over this, which is a terrible idea because 00:24:09.960 |
it actually tells you what the dependent variable means, what the different files are, what the columns mean, and so forth. 00:24:17.520 |
So don't just rely on looking at the data itself, but look at the information that you're given about it. 00:24:32.240 |
So for CSV files, the CSV files are comma-separated values. 00:24:35.880 |
So they're just text files with a comma between each field. 00:24:39.840 |
But we can read them using pandas, which for some reason is always called PD. 00:24:48.280 |
Pandas is one of, I guess, probably four key libraries that you have to know to do data 00:24:57.520 |
science in Python, and specifically, those four libraries are NumPy, Matplotlib, Pandas, and PyTorch. 00:25:18.920 |
So NumPy is what we use for basic numerical programming, Matplotlib we use for plotting, 00:25:26.800 |
Pandas we use for tables of data, and PyTorch we use for deep learning. 00:25:36.200 |
Those are all covered in a fantastic book by the author of Pandas, which the new version 00:25:46.200 |
is actually available for free, I believe, Python for Data Analysis. 00:25:53.240 |
So if you're not familiar with these libraries, just read the whole book. 00:25:59.080 |
It doesn't take too long to get through, and it's got lots of cool tips, and it's very 00:26:09.560 |
Often I see people kind of trying to jump ahead and want to be like, oh, I want to know 00:26:14.760 |
how to create a new architecture, or build a speech recognition system, or whatever. 00:26:20.800 |
But it then turns out that they don't know how to use these fundamental libraries. 00:26:24.840 |
So it's always good to be bold and be trying to build things, but do also take the time 00:26:28.480 |
to make sure you finish reading the fastai book and read at least Wes McKinney's book. 00:26:36.240 |
That would be enough to really give you all the basic knowledge you need, I think. 00:26:41.540 |
So with Pandas, we can read a CSV file, and that creates something called a data frame, 00:26:50.380 |
So now that we've got a data frame, we can see what we're working with. 00:26:57.440 |
And when in Jupyter we just put the name of a variable containing a data frame, we've 00:27:01.320 |
got the first five rows, the last five rows, and the size, so we've got 36,473 rows. 00:27:09.360 |
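In code, loading the data looks roughly like this; the file name train.csv comes from the competition data page, and path is the Path set up earlier:

    import pandas as pd

    df = pd.read_csv(path/'train.csv')
    df  # in a notebook, this shows the first and last five rows plus the row count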
So other things I like to use for understanding a data frame is the describe method. 00:27:18.540 |
If you pass include equals object, that will describe basically all the string fields, 00:27:28.720 |
And so you can see here that that anchor field we looked at, there's actually only 733 unique values. 00:27:35.120 |
So this thing, you can see that there's lots of repetition out of 36,000. 00:27:45.240 |
This is the most common one, it appears 152 times. 00:27:49.620 |
And then context, we also see lots of repetition, there's 106 of those contexts. 00:27:53.600 |
So this is a nice little method, we can see a lot about the data in a glance. 00:27:58.960 |
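That one-liner looks roughly like this:

    # Summarise the string columns: count, unique values, most frequent value, and its frequency
    df.describe(include='object')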
And when I first saw this in this competition, I thought, well, this is actually not that much data. 00:28:06.880 |
Each document is very short, three or four words, really, and lots of it is repeated. 00:28:15.760 |
So as I'm looking through it, I'm thinking, what are some key features of this data set? 00:28:20.200 |
And that would be something I'd be thinking, wow, we've got to do a lot with not very much data. 00:28:29.120 |
So here's how we can just go ahead and create a single string like I described, which contains 00:28:36.040 |
some kind of field separator, plus the context, the target, and the anchor. 00:28:42.840 |
So we're going to pop that into a field called input. 00:28:47.400 |
Something slightly weird in pandas is there's two ways of referring to a column. 00:28:51.880 |
You can use square brackets and a string to get the input column, or you can just treat the column name as an attribute. 00:28:59.380 |
When you're setting a column, you should always use the square brackets form shown here. 00:29:06.560 |
I tend to use this one because it's less typing. 00:29:08.960 |
So you can see now we've got these concatenated rows. 00:29:18.220 |
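A sketch of that concatenation; the exact separator strings (TEXT1, TEXT2, ANC1) are just illustrative markers, not the only valid choice:

    df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor
    df.input.head()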
So we've now got some documents to do NLP with. 00:29:23.040 |
Now the problem is, as you know from the last lesson, neural networks work with numbers. 00:29:29.040 |
We're going to take some numbers, and we're going to multiply them by matrices. 00:29:35.360 |
We're going to replace the negatives with zeros and add them up, and we're going to 00:29:40.420 |
That's our neural network, with some little wrinkles, but that's the basic idea. 00:29:45.080 |
So how on earth do we do that for these strings? 00:29:51.740 |
So there's basically two steps we're going to take. 00:29:54.960 |
The first step is to split each of these into tokens. 00:30:07.680 |
There's a few problems with splitting things into words, though. 00:30:11.360 |
The first is that some languages, like Chinese, don't have words, or at least certainly not 00:30:16.880 |
space-separated words, and in fact, in Chinese, sometimes it's a bit fuzzy to even say where one word ends and another begins. 00:30:24.800 |
Some words are kind of not even -- the pieces are not next to each other. 00:30:29.460 |
Another reason is that what we're going to be doing is after we've split it into words, 00:30:34.400 |
or something like words, we're going to be getting a list of all of the unique words that appear, which is called the vocabulary. 00:30:41.480 |
And every one of those unique words is going to get a number. 00:30:45.120 |
As you'll see later on, the bigger the vocabulary, the more memory is going to get used, and the more data you'll need to train. 00:30:55.000 |
In general, we don't want a vocabulary to be too big. 00:31:02.720 |
So instead, nowadays, people tend to tokenize into something called subwords, which is pieces of words. 00:31:12.600 |
So the process of turning it into smaller units, like words, is called tokenization, 00:31:20.440 |
The token is just the more general concept of whatever we're splitting it into. 00:31:25.760 |
So we're going to get Hugging Face Transformers and Hugging Face Datasets doing our work for us. 00:31:33.480 |
And so what we're going to do is we're going to turn our pandas data frame into a Hugging Face Dataset. 00:31:45.360 |
PyTorch has a class called Dataset, and Hugging Face has a class called Dataset, and they're not the same thing. 00:31:51.320 |
So this is a Hugging Face Dataset, a Hugging Face Datasets Dataset. 00:31:55.840 |
So we can turn a data frame into a data set just using the from pandas method. 00:32:03.520 |
So if we take a look, it just tells us, all right, it's got these features. 00:32:09.880 |
And remember, input is the one we just created with the concatenated strings. 00:32:20.000 |
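A minimal sketch of that conversion, assuming df is the data frame from above:

    from datasets import Dataset

    ds = Dataset.from_pandas(df)
    ds  # shows the feature names (including our new input column) and the number of rows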
So now we're going to do these two things, tokenization, which is to split each text 00:32:23.100 |
up into tokens, and then numericalization, which is to turn each token into its unique 00:32:31.760 |
The vocabulary, remember, being the list of unique tokens. 00:32:37.760 |
Now particularly in this stage, tokenization, there's a lot of little decisions that have to be made. 00:32:47.400 |
The good news is you don't have to make them, because whatever pre-trained model you used, 00:32:53.620 |
the people that pre-trained it made some decisions, and you're going to have to do exactly the 00:32:58.360 |
same thing, otherwise you'll end up with a different vocabulary to them, and that's going to mess things up. 00:33:04.840 |
So that means before you start tokenizing, you have to decide on what model to use. 00:33:13.760 |
It has a library of, I believe, hundreds of models. 00:33:21.120 |
I guess I shouldn't say hugging face transformers, it's really the hugging face model hub. 00:33:25.400 |
44,000 models, so many more even than timm's image models. 00:33:33.600 |
And so these models, they vary in a couple of ways. 00:33:36.760 |
There's a variety of different architectures, just like in timm, but then something which 00:33:42.080 |
is different to timm is that each of those architectures can be trained on different kinds of data. 00:33:49.040 |
So for example, I could type patent, and see if there's any pre-trained patent model, and there is. 00:33:54.120 |
So there's a whole lot of pre-trained patent models, isn't that amazing? 00:34:00.000 |
So quite often, thanks to the hugging face model hub, you can start your pre-trained 00:34:08.080 |
model with something that's actually pretty similar to what you actually want to do, or 00:34:12.560 |
at least was trained on the same kind of documents. 00:34:18.240 |
Having said that, there are some just generally pretty good models that work for a lot of 00:34:26.400 |
things a lot of the time, and DeBERTa V3 is certainly one of those. 00:34:36.920 |
This is a very new area, NLP has been practically really effective for general users for only 00:34:50.160 |
a year or two, whereas for computer vision it's been quite a while. 00:34:56.440 |
So you'll find that a lot of things aren't quite as well bedded down. 00:35:00.040 |
I don't have a picture to show you of which models are the best or the fastest and the most accurate. 00:35:08.680 |
A lot of this stuff is like stuff that we're figuring out as a community using competitions 00:35:14.680 |
like this, in fact, and this is one of the first NLP competitions actually in the kind 00:35:22.520 |
So we've been studying these competitions closely, and I can tell you that DeBERTa is 00:35:28.880 |
actually a really good starting point for a lot of things, so that's why we've picked 00:35:35.400 |
And just like in timm for images, you know, a model family is often going to have a small, a medium, 00:35:39.960 |
a large, and of course we should start with small, right, because small is going to be 00:35:44.880 |
faster to train, we're going to be able to do more iterations and so forth, okay. 00:35:57.900 |
So at this point, remember, the only reason we picked our model is because we have to tokenize the same way it does. 00:36:07.200 |
To tell transformers that we want to tokenize the same way that the people that built a 00:36:12.120 |
model did, we use something called autotokenizer. 00:36:16.280 |
It's basically just a dictionary which says, oh, which model uses which tokenizer. 00:36:20.560 |
So when we say autotokenizer from pre-trained, it will download the vocabulary and the details 00:36:26.320 |
about how this particular model tokenizes its data. 00:36:35.080 |
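Roughly, that looks like this; the model name string is my assumption based on the DeBERTa V3 small model mentioned above:

    from transformers import AutoTokenizer

    model_nm = 'microsoft/deberta-v3-small'
    tokz = AutoTokenizer.from_pretrained(model_nm)
    tokz.tokenize("G'day folks, I'm Jeremy from fast.ai!")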
So at this point, we can now take that tokenizer and pass the string to it. 00:36:42.600 |
So if I pass the string "G'day folks, I'm Jeremy from fast.ai", you'll see it's kind of splitting it into pieces. 00:36:53.160 |
So if you've ever wondered whether g'day is one word or two, you know, it's actually three 00:36:58.160 |
tokens according to this tokenizer, and I'm is three tokens, and fast.ai is three tokens. 00:37:05.920 |
This punctuation is a token, and so you kind of get the idea. 00:37:10.440 |
These underscores here, that represents the start of a word, right. 00:37:14.680 |
So that's kind of, there's this concept that, like, the start of a word is kind of part 00:37:19.360 |
So if you see a capital I in the middle of a word versus the start of a word, that kind 00:37:24.080 |
So this is what happens when we tokenize this sentence using the tokenizer that the DeBERTa model uses. 00:37:36.800 |
So here's a less common, unless you're a big platypus fan like me, less common sentence. 00:37:49.720 |
And so, okay, in this particular vocabulary, platypus got its own word, its own token, 00:37:58.080 |
And so I still remember grade one, for some reason, our teacher got us all to learn how 00:38:03.480 |
to spell ornithorhynchus, so it's one of my favorite words. 00:38:08.400 |
So you can see here, it's been split into or, ni, tho, rhynch, us. 00:38:16.120 |
So every one of these tokens you see here is going to be in the vocabulary, right? 00:38:21.240 |
The list of unique tokens that was created when this particular model, this pre-trained model, was originally trained. 00:38:30.600 |
So somewhere in that list, we'll find underscore capital A, and it'll have a number. 00:38:37.620 |
And so that's how we'll be able to turn these into numbers. 00:38:41.400 |
So this first process is called tokenization, and then the thing where we take these tokens 00:38:46.120 |
and turn them into numbers is called numericalization. 00:38:50.920 |
So our dataset, remember we put our string into the input field. 00:38:57.980 |
So here's a function that takes a document, grabs its input, and tokenizes it. 00:39:04.200 |
Okay, so we'll call this our tokenization function. 00:39:09.140 |
Tokenization can take a minute or two, so we may as well get all of our processes doing it at the same time. 00:39:15.080 |
So if you use the dataset.map, it will parallelize that process and just pass in your function. 00:39:21.600 |
Make sure you pass batch equals true so it can do a bunch at a time. 00:39:26.080 |
Behind the scenes, this is going through something called the tokenizers library, which is a pretty 00:39:29.240 |
optimized Rust library that uses SIMD and parallel processing and so forth. 00:39:37.040 |
So with batch equals true, it'll be able to do more stuff at once. 00:39:41.640 |
So look, it only took six seconds, so pretty fast. 00:39:46.000 |
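As a sketch, the tokenization function and the parallel map look something like this (tok_func and tok_ds are just my names for the function and the result):

    # Tokenize the `input` field of each row; with batched=True, x['input'] is a list of strings
    def tok_func(x):
        return tokz(x['input'])

    tok_ds = ds.map(tok_func, batched=True)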
So now when we look at a row of our tokenized dataset, it's going to contain exactly the 00:39:54.560 |
No, sorry, it's not going to take exactly the same as our original dataset. 00:39:59.160 |
It's going to contain exactly the same input as our original dataset, and it's also going 00:40:06.240 |
These numbers are the position in the vocabulary of each of the tokens in the string. 00:40:16.640 |
So we've now successfully turned a string into a list of numbers. 00:40:27.960 |
We can see, for example, that we've got "of" as a separate word. 00:40:32.800 |
That's going to be an underscore "of" in the vocabulary. 00:40:36.560 |
We can grab the vocabulary, look up "of," find that it's 265, and check here, yep, here it is, 265. 00:40:48.020 |
It's just looking stuff up in a dictionary to get the numbers. 00:40:57.040 |
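For example, something like this; the leading underscore character marks the start of a word in this tokenizer, and the exact id may differ from the one shown in the lesson:

    tokz.get_vocab()['▁of']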
So that is the tokenization and numericalization necessary in NLP to turn our documents into 00:41:06.140 |
numbers to allow us to put them into our model. 00:41:14.320 |
So there's a couple of questions, and this seems like a good time to throw them out, and they're related 00:41:17.840 |
to how you've formatted your input data into these sentences that you've just tokenized. 00:41:25.620 |
So one question was really about how you choose those keywords and the order of the fields 00:41:32.840 |
that you -- so I guess just interested in an explanation, is it more art or science, 00:41:43.200 |
I tried "X," I tried putting them backwards, doesn't matter. 00:41:49.560 |
We just want some way, something that it can learn from. 00:41:53.680 |
So if I just concatenated it without these headers before each one, it wouldn't know 00:41:59.160 |
where abatement of pollution ended and where abatement started, right? 00:42:03.320 |
So I did just something that it can learn from. 00:42:04.960 |
This is a nice thing about neural nets, they're so flexible. 00:42:10.120 |
As long as you give it the information somehow, it doesn't really matter how you give it that information. 00:42:19.920 |
I could have used punctuation, I could have put like, I don't know, one semicolon here 00:42:24.640 |
and two here and three here, yeah, it's not a big deal. 00:42:28.360 |
Like, at the level where you're like trying to get an extra half a percent to get up the 00:42:33.160 |
leaderboard of a Kaggle competition, you may find tweaking these things makes tiny differences, 00:42:38.160 |
but in practice, you won't generally find it matters too much. 00:42:45.520 |
And I guess the second part of that, excuse me again, somebody's asking if one of their 00:42:51.160 |
fields was particularly long, say it was a thousand characters, is there any special handling required? 00:42:58.000 |
Do you need to re-inject those kind of special marker tokens? 00:43:03.360 |
Does it change if you've got much bigger fields that you're trying to learn and query? 00:43:08.520 |
>> Long documents in ULMFiT require no special consideration. 00:43:17.200 |
So IMDB, in fact, has multi-thousand word movie reviews and it works great. 00:43:25.000 |
To this day, ULMFiT is probably the best approach, you know, for reasonably quickly handling large documents. 00:43:35.160 |
Otherwise, if you use transformer based approaches, large documents are challenging, specifically, 00:43:45.240 |
transformers basically have to do the whole document at once, whereas ULMFiT can split 00:43:49.720 |
it into multiple pieces and read it gradually. 00:43:52.120 |
And so that means you'll find that people trying to work with large documents tend to 00:43:56.120 |
spend a lot of money on GPUs because they need the big fancy ones with lots of memory. 00:44:02.080 |
So generally speaking, I would say if you're trying to do stuff with documents of over 00:44:06.360 |
2,000 words, you might want to look at ULM fit. 00:44:13.080 |
Try transformers, see if it works for you, but I'd certainly try both. 00:44:16.400 |
For under 2,000 words, transformers should be fine unless you've got nothing but like 00:44:24.000 |
a laptop GPU or something with not much memory. 00:44:32.120 |
So Hugging Face Transformers has these, you know, as I record this right now, I find them 00:44:40.600 |
somewhat obscure and not particularly well-documented expectations about your data that you kind of just have to know about. 00:44:47.240 |
And one of those is that it expects that your target is a column called labels. 00:44:54.040 |
So once I figured that out, I just went, got our tokenized data set and renamed our score 00:44:58.380 |
column to labels and everything started working. 00:45:02.880 |
So probably, you know, I don't know if at some point they'll make this a bit more 00:45:05.640 |
flexible, but probably best to just call your target labels and life will be easy. 00:45:14.000 |
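A sketch of that rename, assuming the tokenized dataset from earlier:

    tok_ds = tok_ds.rename_columns({'score': 'labels'})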
You might have seen back when I went ls path that there was another data set there called test.csv. 00:45:21.000 |
And if you look at it, it looks a lot like our training set, our other CSV that we've 00:45:27.720 |
been working with, but it's missing the score, the labels. 00:45:37.240 |
And so we're going to talk a little bit about that now because my claim here is that perhaps 00:45:42.960 |
the most important idea in machine learning is the idea of having separate training, validation, 00:46:03.400 |
So test and validation sets are all about identifying and controlling for something 00:46:10.480 |
called overfitting, and we're going to try and learn about this through example. 00:46:16.760 |
So this is the same information that's in that Kaggle notebook I've just put on some 00:46:27.720 |
So I'm going to create a function here called plot poly, and I'm actually going to use the 00:46:34.600 |
same data that, I don't know if you remember, we used it earlier for trying to fit this quadratic. 00:46:47.400 |
This is the data we're going to use, and we're going to use this to look at overfitting. 00:46:53.820 |
So the details of this function don't matter too much. 00:46:57.440 |
What matters is what we do with it, which is that it allows us to basically pass in the 00:47:07.520 |
So for those of you that remember, a first degree polynomial is just a line. 00:47:14.120 |
A second degree polynomial will be y = ax² + bx + c. A third degree 00:47:20.560 |
polynomial will have a cubic, fourth degree quartic, and so forth. 00:47:25.760 |
And what I've done here is I've plotted what happens if we try to fit a line to our data. 00:47:36.200 |
So what happened here is we did a linear regression, and what we're using here is a very cool library called scikit-learn. 00:47:45.800 |
Scikit-learn is something that, you know, I think it'd be fair to say it's mainly designed 00:47:50.400 |
for kind of classic machine learning methods, like kind of linear regression and stuff like 00:47:56.520 |
Very advanced versions of these things, but it's also great for doing these quick and dirty experiments. 00:48:02.480 |
So in this case, I wanted to do what's called a polynomial regression, which is fitting 00:48:06.320 |
the polynomial to data, and it's just these two lines of code. 00:48:11.960 |
So in this case, a degree one polynomial is just a line. 00:48:14.640 |
So I fit it, and then I show it with the data, and there it is. 00:48:18.880 |
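A hedged sketch of what those two lines of polynomial fitting with scikit-learn might look like; the data here is made up purely so the snippet runs on its own:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    x = np.linspace(-2, 2, 20)[:, None]
    y = 3*x[:, 0]**2 + 2*x[:, 0] + 1 + np.random.randn(20)   # noisy quadratic

    model = make_pipeline(PolynomialFeatures(degree=1), LinearRegression())  # degree 1: a line
    model.fit(x, y)
    preds = model.predict(x)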
Now that's what we call underfit, which is to say there's not enough kind of complexity 00:48:25.180 |
in this model I fit to match the data that's there. 00:48:37.760 |
All the stuff up here, we're going to be predicting too low. 00:48:40.200 |
All the stuff down here, we're predicting too low. 00:48:42.160 |
All the stuff in the middle, we're predicting too high. 00:48:44.640 |
A common misunderstanding is that simpler models are more reliable in some way, but 00:48:52.600 |
models that are too simple will be systematically incorrect, as you see here. 00:49:01.160 |
What happens if we fit a 10-degree polynomial? 00:49:10.560 |
In this case, it's not really showing us what the actual underlying function looks like (remember, this is originally 00:49:15.960 |
a quadratic), and it doesn't match it well, particularly at the ends here. 00:49:19.820 |
It's predicting things that are way above what we would expect in real life. 00:49:25.480 |
And it's trying really hard to get through this point, but clearly this point was just noise. 00:49:34.280 |
It's done a good job of fitting to our exact data points, but if we sample some more data 00:49:40.000 |
points from this distribution, honestly, we probably would suspect they're not going to 00:49:45.240 |
be very close to this, particularly if they're a bit beyond the edges. 00:49:53.600 |
Now, underfitting is actually pretty easy to recognize because we can actually look 00:49:58.320 |
at our training data and see that it's not very close. 00:50:02.720 |
Overfitting is a bit harder to recognize because the fit to the training data is actually very close. 00:50:11.800 |
Now, on the other hand, here's what happens if we fit the quadratic. 00:50:19.600 |
And here I've got both the real line and the fit line, and you can see they're pretty close. 00:50:26.400 |
And that's, of course, what we actually want. 00:50:35.760 |
So how do we tell whether we have something more like this or something more like this? 00:50:43.200 |
Well, what we do is we do something pretty straightforward, is we take our original dataset, 00:50:49.000 |
these points, and we remove a few of them, let's say 20% of them. 00:50:56.840 |
We then fit our model using only those points we haven't removed. 00:51:03.020 |
And then we measure how good it is by looking at only the points we removed. 00:51:09.760 |
So in this case, let's say we had removed, I'm just trying to think, if I'd removed this 00:51:18.440 |
point here, then it might have kind of gone off down over here. 00:51:24.000 |
And so then when we look at how well it fits, we would say, oh, this one's miles away. 00:51:31.560 |
The data that we take away and don't let the model see it when it's training is called the validation set. 00:51:40.440 |
So in fast AI, we've seen splitters before, right? 00:51:43.120 |
The splitters are the things that separate out the validation set. 00:51:46.760 |
Fast AI won't let you train a model without a validation set. 00:51:50.760 |
Fast AI always shows you your metrics, so things like accuracy, measured only on the validation set. 00:51:58.700 |
Most libraries make it really easy to shoot yourself in the foot by not having a validation set. 00:52:08.520 |
So you've got to be particularly careful when using other libraries. 00:52:13.360 |
Hugging Face Transformers is good about this, so they make sure that they do show you your metrics on a validation set. 00:52:27.680 |
Now creating a good validation set is not generally as simple as just randomly pulling 00:52:32.240 |
some of your data out of the data that you train your model with. 00:52:38.960 |
The reason why is imagine that this was the data you were trying to fit something to. 00:52:48.120 |
And you randomly remove some, so it looks like this. 00:52:55.420 |
That would be pretty easy to do well on, because you've kind of like still got all the data you would want around the points. 00:52:59.880 |
And in a time series like this, this is dates and sales, in real life you're probably going to want to predict the future. 00:53:07.160 |
So if you created your validation set by randomly removing stuff from the middle, it's not really 00:53:12.060 |
a good indication of how you're going to be using this model. 00:53:15.320 |
Instead you should truncate and remove the last couple of weeks. 00:53:20.180 |
So if this was your validation set and this is your training set, that's going to be actually 00:53:25.220 |
testing whether you can use this to predict the future rather than using it to predict the past. 00:53:33.960 |
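A small sketch of what a time-based split like that can look like; the column names and the two-week cutoff are illustrative, and the data is made up so the snippet is self-contained:

    import numpy as np
    import pandas as pd

    # Made-up daily sales data, standing in for the kind of time series described above
    df_ts = pd.DataFrame({'date': pd.date_range('2022-01-01', periods=90),
                          'sales': np.random.rand(90)})

    cutoff = df_ts.date.max() - pd.Timedelta(weeks=2)
    trn = df_ts[df_ts.date <= cutoff]   # train on everything up to the cutoff
    val = df_ts[df_ts.date >  cutoff]   # validate on the final two weeks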
Kaggle competitions are a fantastic way to test your ability to create a good validation 00:53:41.240 |
Because Kaggle competitions only allow you to submit generally a couple of times a day. 00:53:48.200 |
The data set that you are scored on in the leaderboard during that time is actually only a subset of the test set. 00:53:56.680 |
In fact, it's a totally separate subset to the one you'll be scored on at the end of the competition. 00:54:05.200 |
And it's not until you've done it that you will get that visceral feeling of like, oh no, I overfit. 00:54:13.480 |
In the real world, outside of Kaggle, you will often not even know that you overfit. 00:54:20.560 |
You just destroy value for your organization silently. 00:54:24.240 |
So it's a really good idea to do this kind of stuff on Kaggle a few times first in real 00:54:28.460 |
competitions to really make sure that you are confident you know how to avoid overfitting, 00:54:33.960 |
how to find a good validation set, and how to interpret it correctly. 00:54:38.340 |
And you really don't get that until you screw it up a few times. 00:54:45.040 |
Good example of this was there was a distracted driver competition on Kaggle. 00:54:49.960 |
There were these kind of pictures from inside a car. 00:54:55.240 |
And the idea was that you had to try and predict whether somebody was driving in a distracted way. 00:55:02.600 |
And on Kaggle, they did something pretty smart. 00:55:04.520 |
The test set, so the thing that they scored you on the leaderboard, contained people that 00:55:08.900 |
didn't exist at all in the competition data that you train the model with. 00:55:15.320 |
So if you wanted to create an effective validation set in this competition, you would have to 00:55:19.480 |
make sure that you separated the photos so that your validation set contained photos 00:55:24.280 |
of people that aren't in the data you're training your model on. 00:55:29.320 |
There was another one like that, the Kaggle fisheries competition, which had boats that 00:55:38.120 |
So they were basically pictures of boats and you're meant to try to guess, predict what 00:55:44.080 |
And it turned out that a lot of people accidentally figured out what the fish were by looking 00:55:49.560 |
at the boat because certain boats tended to catch certain kinds of fish. 00:55:54.280 |
And so by messing up their validation set, they were really overconfident of the accuracy of their models. 00:56:02.040 |
I'll mention in passing, if you've been around Kaggle a bit, you'll see people talk about cross-validation. 00:56:10.720 |
I'm just going to mention, be very, very careful. 00:56:14.600 |
Cross validation is explicitly not about building a good validation set, so you've got to be careful there. 00:56:26.960 |
Another thing I'll mention is that Scikit-Learn conveniently offers something called train 00:56:31.440 |
test split, as does Hugging Face datasets, as does Fast AI, where we have something called RandomSplitter. 00:56:40.520 |
It can almost feel like you're being encouraged to use a randomized validation 00:56:48.080 |
set because there are these methods that do it for you. 00:56:51.400 |
But yeah, be very, very careful, because very, very often that's not what you want. 00:56:58.400 |
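Just so you recognize it when you see it, the Hugging Face datasets version looks roughly like this, using the tokenized dataset from earlier; the 25% fraction and the seed are arbitrary, and whether a random split is appropriate depends on your data:

    dds = tok_ds.train_test_split(test_size=0.25, seed=42)
    dds  # a DatasetDict with 'train' and 'test' splits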
So that's what a validation set is, the bit that you pull out of your data 00:57:02.560 |
that you don't train with, but you do measure your accuracy with, so what's a test set? 00:57:10.200 |
It's basically another validation set, but you don't even use it for tracking your accuracy while you build your model. Why would you want that? 00:57:20.080 |
Well, imagine you tried two new models every day for three months. 00:57:24.320 |
That's how long a Kaggle competition goes for. 00:57:26.880 |
So you would have tried 180 models, and then you look at the accuracy on the validation set each time. 00:57:33.760 |
Some of those models, you would have got a good accuracy on the validation set potentially 00:57:38.600 |
because of pure chance, just a coincidence, and then you get all excited and you submit 00:57:43.080 |
that to Kaggle, and you think you're going to win the competition, and you mess it up. 00:57:47.480 |
And that's because you actually overfit using the validation set. 00:57:52.880 |
So you actually want to know whether you've really found a good model or not. 00:57:58.640 |
So in fact, on Kaggle, they have two test sets. 00:58:02.320 |
They've got the one that gives you feedback on the leaderboard during the competition 00:58:05.400 |
and a second test set, which you don't get to see until after the competition is finished. 00:58:12.720 |
So in real life, you've got to be very careful about this, not to try so many models during 00:58:18.360 |
your model building process that you accidentally find one that's good by coincidence. 00:58:24.360 |
And you'll only know that if you have a test set that you've held out. 00:58:28.600 |
Now that leads to the obvious question, which is very challenging, is you spent three months 00:58:33.120 |
working on a model, worked well on your validation set, you did a good job of locking that test 00:58:38.720 |
set away in a safe so you weren't allowed to use it, and at the end of the three months, 00:58:41.840 |
you finally checked it on the test set and it's terrible. 00:58:51.760 |
There really isn't any choice other than starting again. 00:59:05.120 |
So you've got a validation set, what are you going to do with it? 00:59:08.360 |
What you're going to do with a validation set is you're going to measure some metrics. 00:59:16.960 |
It's a number that tells you how good your model is. 00:59:29.760 |
Go to Overview, click on Evaluation, and find out, and it says, "Oh, we will evaluate on the Pearson correlation coefficient." 00:59:39.520 |
Therefore, this is the metric you care about. 00:59:49.120 |
So one obvious question is, is this the same as the loss function? 00:59:52.880 |
Is this the thing that we will take the derivative of and find the gradient and use that to improve our parameters? 01:00:01.480 |
And the answer is maybe, sometimes, but probably not. 01:00:13.520 |
Now if we were using accuracy to calculate our derivative and get the gradient, you could 01:00:18.880 |
have a model that's actually slightly better, you know, slightly like it's doing a better 01:00:23.200 |
job of recognizing dogs and cats, but not so much better that it's actually caused any predictions to change, so the accuracy doesn't change at all and the gradient is zero. 01:00:40.200 |
You don't want bumpy functions, because they don't have nice gradients. 01:00:48.680 |
You want a function that's nice and smooth, something like, for instance, the average 01:00:55.120 |
absolute error, mean absolute error, which we've used before. 01:01:00.320 |
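As a tiny sketch of the kind of smooth function that makes a good loss:

    import numpy as np

    # Mean absolute error: changes a little whenever the predictions change a little
    def mae(preds, targs):
        return np.abs(preds - targs).mean()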
So that's the difference between your metrics and your loss. 01:01:03.440 |
Now be careful, right, because when you're training, your model's spending all of its 01:01:06.720 |
time trying to improve the loss, and most of the time that's not the same as the thing 01:01:11.680 |
you actually care about, which is your metric. 01:01:13.520 |
So you've got to keep those two different things in mind. 01:01:17.440 |
The other thing to keep in mind is that in real life, you can't go to a website and be 01:01:28.440 |
In real life, the model that you choose, there isn't one number that tells you whether it's 01:01:36.080 |
good or bad, and even if there was, you wouldn't be able to find it out ahead of time. 01:01:41.640 |
In real life, the model you use is a part of a complex process, often involving humans, 01:01:50.240 |
both as users or customers, and as people involved as part of the process. 01:01:58.680 |
There's all kinds of things that are changing over time, and there's lots and lots of outcomes 01:02:07.440 |
One metric is not enough to capture all of that. 01:02:11.080 |
Unfortunately, because it's so convenient to pick one metric and use that to say, "I've 01:02:20.960 |
got a good model," that very often finds its way into industry, into government, where 01:02:29.480 |
people roll out these things that are good on the one metric that happened to be easy 01:02:36.200 |
Again and again, we found people's lives turned upside down because of how badly they get 01:02:44.360 |
screwed up by models that have been incorrectly measured using a single metric. 01:02:49.800 |
My partner, Rachel Thomas, has written this article, which I recommend you read, about how 01:02:54.200 |
the problem with metrics is a big problem for AI. 01:03:05.760 |
There's actually this thing called Goodhart's Law that states, "When a measure becomes a target, it ceases to be a good measure." 01:03:14.880 |
When I was a management consultant 20 years ago, we were always part of these strategic 01:03:23.800 |
things trying to find key performance indicators and ways to set commission rates for sales 01:03:30.200 |
people and we were really doing a lot of this stuff, which is basically about picking metrics. 01:03:36.800 |
We see that happen, go wrong in industry all the time. 01:03:41.280 |
AI is dramatically worse because AI is so good at optimizing metrics. 01:03:47.640 |
That's why you have to be extra, extra, extra careful about metrics when you are trying 01:03:54.760 |
Anyway, as I said in Kaggle, we don't have to worry about any of that. 01:04:00.320 |
We are just going to use the Pearson correlation coefficient, which is all very well, as long 01:04:04.160 |
as you know what the hell the Pearson correlation coefficient is. 01:04:12.000 |
So Pearson correlation coefficient is usually abbreviated using letter R and it's the most 01:04:17.960 |
widely used measure of how similar two variables are. 01:04:22.800 |
If your predictions are very similar to the real values, then the Pearson correlation 01:04:29.840 |
coefficient will be high and that's what you want. 01:04:40.640 |
Minus one means you predicted exactly the wrong answer, which in a Kaggle competition 01:04:45.200 |
would be great because then you can just reverse all of your answers and you'll be perfect. 01:04:50.200 |
Plus one means you got everything exactly correct. 01:04:55.600 |
Generally speaking, in courses or textbooks when they teach you about the Pearson correlation 01:04:59.440 |
coefficient, at this point they will show you a mathematical function. 01:05:04.480 |
I'm not going to do that because that tells you nothing about the Pearson correlation 01:05:08.860 |
What we actually care about is not the mathematical function, but how it behaves. 01:05:15.120 |
I find most people, even people who work in data science, have not actually looked at a bunch of data with different correlation coefficients. 01:05:23.480 |
So let's do that right now so that you're not one of those people. 01:05:27.900 |
The best way I find to understand how data behaves in real life is to look at real life 01:05:33.680 |
So there's a data set, Scikit-learn comes with a number of data sets and one of them 01:05:37.280 |
is called California Housing and it's a data set where each row is a district. 01:05:45.960 |
And it's kind of demographic information about different districts and about the value of the houses in each district. 01:05:59.560 |
I'm not going to try to plot the whole thing because it's too big and this is a very common 01:06:03.280 |
question I get from people: how do I plot data sets with far too many points? 01:06:16.280 |
Whatever you see with a thousand points is going to be the same as what you see with the whole dataset. 01:06:19.600 |
There's no reason to plot huge amounts of data generally, just grab a random sample. 01:06:27.160 |
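For instance, a minimal sketch of that sampling step might look like this (the variable names here are my own, not necessarily the notebook's):

```python
# Minimal sketch: load the California Housing data and grab a random sample
# of 1,000 districts to plot, rather than plotting all ~20,000 rows.
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True).frame  # one row per district
subset = housing.sample(1000, random_state=42)           # a random sample is plenty
```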
Now NumPy has something called corrcoef to get the correlation coefficient between every 01:06:33.560 |
variable and every other variable and it returns a matrix so I can look down here, so for example 01:06:40.560 |
here is the correlation coefficient between variable one and variable one which of course 01:06:45.240 |
is exactly perfectly 1.0 because variable one is the same as variable one. 01:06:50.600 |
Here is the small inverse correlation between variable one and variable two and medium sized 01:06:57.680 |
positive correlation between variable one and variable three and so forth. 01:07:01.740 |
This is symmetric about the diagonal because the correlation between variable one and variable 01:07:06.040 |
eight is the same as the correlation between variable eight and variable one. 01:07:17.280 |
So that's great when we wanted to get a bunch of values all at once. 01:07:20.360 |
For the Kaggle competition we don't want that, we just want a single correlation number. 01:07:24.760 |
If we just pass in a pair of variables we still get a matrix which is kind of weird, 01:07:35.640 |
So when I want to grab a correlation coefficient I'll just return the zeroth row first column. 01:07:42.000 |
So that's what corr is, that's going to be our single correlation coefficient. 01:07:46.180 |
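As a sketch, that helper is just a one-liner around np.corrcoef:

```python
import numpy as np

def corr(x, y):
    # np.corrcoef returns a 2x2 matrix; the single number we want is at [0][1]
    return np.corrcoef(x, y)[0][1]
```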
So let's look at the correlation between two things, for example median income and median 01:07:54.260 |
house value, 0.67, okay is that high, medium, low, how big is that, what does it look like? 01:08:03.460 |
So the main thing we need to understand is what these things look like. 01:08:07.500 |
So what I suggest we do is we're going to take a 10 minute break, 9 minute break, we'll 01:08:12.560 |
come back at half-past and then we're going to look at some examples of correlation coefficients. 01:08:27.540 |
So what I've done here is I've created a little function called show correlations, I'm going 01:08:33.160 |
to pass in a data frame and a couple of columns as strings, going to grab each of those columns 01:08:37.880 |
as series, do a scatter plot and then show the correlation. 01:08:43.760 |
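Something along these lines; the function name and signature are my reconstruction from the description, not necessarily the exact notebook code:

```python
import matplotlib.pyplot as plt

def show_corr(df, a, b):
    x, y = df[a], df[b]           # grab each column as a series
    plt.scatter(x, y, alpha=0.5)  # alpha=0.5 adds the transparency tip mentioned below
    plt.title(f'{a} vs {b}; r: {corr(x, y):.2f}')

# e.g. show_corr(subset, 'MedInc', 'MedHouseVal')
```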
So we already mentioned median income and median house value; the correlation there is 0.68. 01:08:52.040 |
So I don't know if you had some intuition about what you expected, but as you can see 01:08:57.400 |
there's still plenty of variation even at that reasonably high correlation. 01:09:09.320 |
Also you can see here that visualizing your data is very important if you're working with 01:09:13.460 |
this dataset because you can immediately see all these dots along here; that's clearly 01:09:21.280 |
some kind of cap or truncation in the data, and it's often not until you look at pictures like this that you pick up on that sort of thing. 01:09:28.520 |
Oh, little trick, on the scatter plot I put alpha is 0.5, that creates some transparency. 01:09:36.260 |
For these kinds of scatter plots that really helps, because it kind of creates darker areas where lots of points overlap. 01:09:49.800 |
So this one's gone down from 0.68 to 0.43 median income versus the number of rooms per house. 01:09:57.360 |
As you'd expect, with more rooms there's more income. 01:10:07.000 |
Now you'll find that a lot of these statistical measures like correlation rely on the square 01:10:15.880 |
And when you have big outliers like this, the square of the difference goes crazy. 01:10:21.360 |
And so this is another place we'd want to look at the data first and say, oh, that's a bit odd. 01:10:28.200 |
There's probably more correlation here, but there's a few examples of some houses with 01:10:33.000 |
lots and lots of rooms where people who aren't very rich live. 01:10:36.520 |
Maybe these are some kind of shared accommodation or something. 01:10:46.600 |
So let's get rid of the houses with 15 rooms or more. 01:10:53.680 |
And now you can see it's gone up from 0.43 to 0.68, even though we probably only got 01:11:00.200 |
rid of one, two, three, four, five, six, maybe seven data points. 01:11:04.960 |
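As a sketch, the filtering step can be as simple as a boolean mask on the sampled dataframe:

```python
# Drop the handful of districts averaging 15+ rooms per house, then re-plot.
trimmed = subset[subset.AveRooms < 15]
# show_corr(trimmed, 'MedInc', 'AveRooms')  # the correlation jumps from ~0.43 to ~0.68
```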
So we've got to be very careful of outliers, and that means if you're trying to win a Kaggle 01:11:07.840 |
competition where the metric is correlation, and you just get a couple of rows really badly 01:11:14.080 |
wrong, then that's going to be a disaster to your score. 01:11:18.520 |
So you've got to make sure that you do a pretty good job of every row. 01:11:23.840 |
So that's what a correlation of 0.68 looks like. 01:11:23.840 |
And this is kind of interesting, isn't it, because 0.34 sounds like quite a good relationship, but when you actually plot it, the relationship looks pretty weak. 01:11:40.500 |
So this is something I strongly suggest, is if you're working with a new metric, draw some 01:11:45.560 |
pictures of a few different levels of that metric to kind of try to get a feel for, like, 01:11:52.600 |
You know, what does 0.6 look like, what does 0.3 look like, and so forth? 01:11:58.440 |
And here's an example of a correlation of minus 0.2. 01:12:07.440 |
OK, so there's just more of a kind of a general tip of something I like to do when playing 01:12:11.440 |
with a new metric, and I recommend you do as well. 01:12:13.840 |
I think we've now got a sense of what the correlation feels like. 01:12:17.480 |
Now you can go look up the equation on Wikipedia if you're into that kind of thing. 01:12:23.720 |
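Purely for reference (it's not needed for the lesson), here's a from-scratch version of r, which should agree with np.corrcoef: it's the covariance of the two variables divided by the product of their standard deviations.

```python
import numpy as np

def pearson(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xd, yd = x - x.mean(), y - y.mean()
    return (xd * yd).sum() / np.sqrt((xd**2).sum() * (yd**2).sum())
```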
We need to report the correlation after each epoch, because we want to know how our training's going. 01:12:30.680 |
HuggingFace expects you to return a dictionary, because it's going to use the keys of the dictionary to label the metrics it reports. 01:12:41.400 |
So here's something that gets the correlation and returns it as a dictionary with a label. 01:12:49.120 |
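A sketch of what such a metric function might look like; the function name and the 'pearson' key are my assumptions, and corr is the helper from earlier:

```python
def corr_metric(eval_pred):
    # Hugging Face passes (predictions, labels); the dict key becomes the
    # column name in the metrics printed after each epoch.
    preds, labels = eval_pred
    return {'pearson': corr(preds.squeeze(), labels)}
```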
OK, so we've done metrics, we've done our training validation split. 01:12:56.480 |
Oh, I might have actually skipped over the bit where we actually did the split, didn't I? 01:13:03.960 |
So to actually do the split, in this Kaggle competition, I've got another notebook we'll 01:13:12.000 |
look at later where we actually split this properly, but here we're just going to do 01:13:15.920 |
a random split, just to keep things simple for now, where 25 percent of the data will be our validation set. 01:13:24.000 |
So if we call ds.train_test_split, it returns a DatasetDict, which has a train and a test. 01:13:33.600 |
So that looks a lot like a datasets object in fast.ai, very similar idea. 01:13:42.160 |
So this will be the thing that we'll be able to train with. 01:13:44.720 |
So it's going to train with this dataset and return the metrics on this dataset. 01:13:49.240 |
This is really a validation set, but HuggingFace datasets calls it "test". 01:14:03.800 |
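As a sketch, assuming the tokenized dataset from earlier is called tok_ds:

```python
# 25% random validation split; returns a DatasetDict with 'train' and 'test'.
dds = tok_ds.train_test_split(test_size=0.25, seed=42)
# dds['test'] is really our validation set -- that's just what the library calls it.
```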
In fast.ai, we use something called a learner. 01:14:06.920 |
The equivalent in HuggingFace transformers is called trainer. 01:14:14.360 |
Something we'll learn about quite shortly is the idea of mini-batches and batch sizes. 01:14:19.400 |
In short, each time we pass some data to our model for training, it's going to send through 01:14:25.880 |
a few rows at a time to the GPU so that it can calculate those in parallel. 01:14:32.960 |
Those bunch of rows is called a batch or a mini-batch, and the number of rows is called 01:14:40.120 |
So here we're going to set the batch size to 128. 01:14:43.240 |
Generally speaking, the larger your batch size, the more it can do in parallel at once, so the faster it goes. 01:14:48.800 |
But if you make it too big, you're going to get an out-of-memory error on your GPU. 01:14:54.360 |
So it's a bit of trial and error to find a batch size that works. 01:15:04.140 |
We'll talk in the next lesson, unless we get to it in this lesson, about a technique to automatically 01:15:11.760 |
find or semi-automatically find a good learning rate. 01:15:14.600 |
We already know what a learning rate is from the last lesson. 01:15:16.800 |
I played around and found one that seems to train quite quickly without falling apart. 01:15:26.800 |
Generally -- so Hugging Face Transformers doesn't have something 01:15:33.320 |
to help you find the learning rate; the integration we're doing in Fast AI will let you do that. 01:15:38.560 |
But if you're using a framework that doesn't have that, you can just start with a really 01:15:42.120 |
low learning rate and then kind of double it and keep doubling it until it falls apart. 01:15:51.520 |
Hugging Face Transformers uses this thing called TrainingArguments, which is a class 01:15:54.880 |
where we just provide all of the configuration. 01:15:58.480 |
So you have to tell it what your learning rate is. 01:16:05.520 |
This stuff here is the same as what we call basically fit one cycle in Fast AI. 01:16:10.920 |
You always want this to be true because it's going to be faster, pretty much. 01:16:16.360 |
And then this stuff here you can probably use exactly the same every time. 01:16:19.920 |
There's a lot of boilerplate compared to Fast AI, as you see. 01:16:26.800 |
This stuff you can probably use the same every time. 01:16:30.300 |
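Roughly, the boilerplate being described looks something like this. The batch size of 128 is the one mentioned above; the learning rate, epoch count and scheduler settings are illustrative assumptions rather than the exact values used:

```python
from transformers import TrainingArguments

bs, epochs, lr = 128, 4, 8e-5   # lr found by trial and error, as described
args = TrainingArguments(
    'outputs',
    learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine',  # roughly fit_one_cycle
    fp16=True,                          # mixed precision: almost always faster
    evaluation_strategy='epoch',        # report metrics after each epoch
    per_device_train_batch_size=bs, per_device_eval_batch_size=bs * 2,
    num_train_epochs=epochs, weight_decay=0.01,
    report_to='none',
)
```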
So we now need to create our model. So the equivalent of the vision learner function 01:16:37.980 |
that we've used to automatically create a reasonable vision model in Hacking-Face Transformers, 01:16:46.120 |
they've got lots of different ones depending on what you're trying to do. 01:16:50.200 |
So we're trying to do classification, as we've discussed, of sequences. 01:16:55.320 |
So if we call auto model for sequence classification, it will create a model that is appropriate 01:17:00.480 |
for classifying sequences from a pre-trained model. 01:17:04.600 |
And this is the name of the model that we used earlier, DeBERTa v3. 01:17:10.840 |
It has to know when it adds that random matrix to the end how many outputs it needs to have. 01:17:21.920 |
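A sketch of that step; I'm assuming the small DeBERTa v3 variant here:

```python
from transformers import AutoModelForSequenceClassification

model_nm = 'microsoft/deberta-v3-small'   # assumed variant of the DeBERTa v3 model
# num_labels tells it how many outputs the newly added final layer needs -- one here.
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
```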
And then this is the equivalent of creating a learner. 01:17:24.640 |
It contains a model and the data, the training data and the test data. 01:17:30.440 |
Again, there's a lot more boilerplate here than Fast AI, but you can kind of see the 01:17:35.680 |
We just have to do a little bit more manually. 01:17:40.840 |
So it's going to tokenize it for us using that function. 01:17:43.760 |
And then these are the metrics that it will print out each time. 01:17:48.080 |
That's that little function we created which returns a dictionary. 01:17:53.360 |
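Putting the pieces together, the Trainer setup might look roughly like this; tokz stands for the tokenizer created earlier, and the other names come from the sketches above:

```python
from transformers import Trainer

trainer = Trainer(
    model, args,
    train_dataset=dds['train'],
    eval_dataset=dds['test'],      # really the validation set
    tokenizer=tokz,
    compute_metrics=corr_metric,   # the little dictionary-returning metric function
)
# trainer.train()
```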
At the moment, I find hugging face transformers very verbose. 01:17:56.280 |
It spits out lots and lots and lots of text which you can ignore. 01:18:01.000 |
And we can finally call train which will spit out much more text again which you can ignore. 01:18:06.560 |
And as you can see, as it trains, it's printing out the loss. 01:18:11.180 |
And here's our Pearson correlation coefficient. 01:18:20.120 |
And it took five minutes to run, maybe that's five minutes per epoch on Kaggle which doesn't 01:18:28.520 |
have particularly great GPUs, but good for free. 01:18:32.760 |
And we've got something that has got a very high level of correlation in assessing how similar two phrases are. 01:18:41.880 |
And the only reason it could do that is because it used a pre-trained model, right? 01:18:46.680 |
There's no way you could just have that tiny amount of information and figure out whether two phrases are similar or not. 01:18:54.180 |
This pre-trained model already knows a lot about language. 01:18:57.240 |
It already has a good sense of whether two phrases are similar or not. 01:19:02.960 |
You can see, given that after one epoch, it was already at 0.8, you know, this was a model 01:19:08.240 |
that already did something pretty close to what we needed. 01:19:11.640 |
It didn't really need that much extra tuning for this particular task. 01:19:25.080 |
It's actually a bit back on the topic before where you were showing us the visual interpretation 01:19:30.280 |
of the Pearson coefficient and you were talking about outliers. 01:19:34.080 |
And we've got a question here from Kevin asking, how do you decide when it's OK to remove outliers? 01:19:41.360 |
Like you pointed out something in that data set. 01:19:45.560 |
And clearly, your model is going to train a lot better if you clean that up. 01:19:50.360 |
But I think Kevin's point here is, you know, those kinds of outliers will probably exist 01:19:58.040 |
So I think he's just looking for some practical advice on how you handle that in a more general 01:20:05.560 |
So outliers should never just be removed, like for modeling. 01:20:16.540 |
So if we take the example of the California housing data set, you know, if I was really 01:20:21.000 |
working with that data set in real life, I would be saying, oh, that's interesting. 01:20:25.440 |
It seems like there's a separate group of districts with a different kind of behavior. 01:20:29.640 |
My guess is that they're going to be kind of like dorms or something like that, you know, 01:20:35.800 |
And so I would be saying like, oh, clearly, from looking at this data set, these two different groups behave quite differently. 01:20:43.640 |
And I would probably split them into two separate analyses. 01:20:48.720 |
You know, the word outlier, it kind of exists in a statistical sense, right? 01:20:59.160 |
There can be things that are well outside our normal distribution and mess up our kind of modeling. 01:21:06.320 |
It doesn't exist in a sense of like, oh, things that we should like ignore or throw away. 01:21:12.560 |
You know, some of the most useful kind of insights I've had in my life in data projects 01:21:18.640 |
has been by digging into outliers, so-called outliers, and understanding, well, what are they? 01:21:28.280 |
And it's kind of often in those edge cases that you discover really important things 01:21:33.640 |
about like where processes go wrong or about, you know, kinds of behaviors you didn't even 01:21:38.800 |
know existed, or indeed about, you know, kind of labeling problems or process problems, 01:21:44.400 |
which you really want to fix them at the source, because otherwise when you go into production, 01:21:48.760 |
you're going to have more of those so-called outliers. 01:21:52.440 |
So yeah, I'd say never delete outliers without investigating them and having a strategy for 01:22:02.280 |
like understanding where they came from and like, what should you do about them? 01:22:08.800 |
So now that we've got a trained model, you'll see that it actually behaves really a lot like what we're used to in fast.ai. 01:22:16.160 |
And hopefully the impression you'll get from going through this process is largely a sense of familiarity. 01:22:23.400 |
It's like, oh, yeah, this looks like stuff I've seen before, you know, like a bit more 01:22:28.960 |
wordy and some slight changes, but it really is very, very similar to the way we've done 01:22:35.240 |
Because now that we've got a trained trainer rather than learner, we can call predict. 01:22:41.360 |
And now we're going to pass in our data set from the Kaggle test file. 01:22:48.400 |
And so that's going to give us our predictions, which we can cast to float. 01:22:57.160 |
So here are the predictions we made of similarity. 01:23:02.640 |
Now, again, not just for your inputs, but also for your outputs, always look at them. 01:23:12.160 |
And interestingly, I looked at quite a few Kaggle notebooks from other people for this 01:23:19.400 |
And nearly all of them had the problem we have right now, which is negative predictions and predictions greater than one. 01:23:28.760 |
So I'll be showing you how to fix this in a more proper way, maybe hopefully in the next lesson. 01:23:37.240 |
But for now, you know, we could at least just round these off, right? 01:23:42.120 |
Because we know that none of the scores are going to be bigger than one or smaller than zero. 01:23:46.360 |
But our correlation coefficient will definitely improve if we at least round these up to zero and down to one. 01:23:53.040 |
As I say, there are better ways to do this, but that's certainly better than nothing. 01:23:57.600 |
So in PyTorch, you might remember from when we looked at ReLU, there's a thing called clip. 01:24:03.400 |
And that will clip everything under zero to zero and everything over one to one. 01:24:15.600 |
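A sketch of the prediction-plus-clipping step; eval_ds stands for the tokenized Kaggle test dataset (name assumed):

```python
import numpy as np

preds = trainer.predict(eval_ds).predictions.astype(float)
preds = np.clip(preds, 0, 1)   # torch.clamp(t, 0, 1) does the same for tensors
```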
So Kaggle expects submissions to generally be in a CSV file. 01:24:19.920 |
And Hugging Face datasets kind of looks a lot like pandas, really. 01:24:24.600 |
We can create our submission file with our two columns and call to_csv. 01:24:41.120 |
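A sketch of that last step, with assumed column and variable names:

```python
import datasets

submission = datasets.Dataset.from_dict({
    'id': eval_ds['id'],         # assumes the test set keeps its 'id' column
    'score': preds.squeeze(),
})
submission.to_csv('submission.csv', index=False)
```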
So yeah, you know, it's kind of nice to see how -- you know, in a sense, how far deep learning 01:24:51.120 |
has come since we started this course a few years ago that nowadays, you know, there are 01:24:56.840 |
multiple libraries around to kind of do the same thing. 01:25:00.480 |
We can, you know, use them in multiple application areas. 01:25:10.840 |
And NLP, because it's kind of like the most recent area that's really become effective 01:25:19.840 |
in the last year or two, is probably where the biggest opportunities are for, you know, 01:25:27.000 |
big wins both in research and commercialization. 01:25:33.120 |
And so if you're looking to build a startup, for example, one of the key things that VCs 01:25:37.200 |
look for, you know, that they'll ask is like, "Well, why now?" 01:25:41.840 |
You know, "Why would you build this company now?" 01:25:44.000 |
And of course, you know, with NLP, the answer is really simple. 01:25:46.680 |
It's like -- it can often be like, "Well, until last year, this wasn't possible," you 01:25:52.320 |
know, or "It took ten times more time," or "It took ten times more money," or whatever. 01:26:03.600 |
Okay, so it's worth thinking about both use and misuse of modern NLP. 01:26:20.840 |
Here is a conversation on a subreddit from a couple of years ago. 01:26:34.920 |
So the question I want you to be thinking about is what subreddit do you think this 01:26:38.920 |
comes from, this debate about military spending? 01:26:46.200 |
And the answer is it comes from a subreddit that posts automatically generated conversations 01:26:54.840 |
Now this is like a totally previous generation of model. 01:27:00.640 |
So even then, you could see these models were generating context-appropriate, believable 01:27:12.680 |
You know, I would strongly believe that like any of our many competent 01:27:20.400 |
fast AI alumni would be fairly easily able to create a bot which could create context-appropriate 01:27:27.640 |
prose on Twitter or Facebook groups or whatever, you know, arguing for a side of an argument. 01:27:37.160 |
And you can scale that up such that 99% of Twitter was these bots and nobody would know, 01:27:46.520 |
And that's very worrying to me because a lot of, you know, a lot of kind of the way people 01:27:57.000 |
see the world is now really coming out of their social media conversations, which at 01:28:03.880 |
Like it would not be that hard to create something that's kind of optimized towards moving a 01:28:11.120 |
point of view amongst a billion people, you know, in a very subtle way, very gradually 01:28:16.760 |
over a long period of time by multiple bots, each pretending to argue with each other and 01:28:21.720 |
one of them getting the upper hand and so forth. 01:28:29.200 |
Here is the start of an article in The Guardian, which I'll let you read. 01:28:52.880 |
This article was, you know, quite long, these are just the first few paragraphs. 01:28:57.280 |
And at the end, it explains that this article was written by GPT-3. 01:29:01.320 |
It was given the instruction, "Please write a short op-ed around 500 words, keep the language 01:29:05.400 |
simple and concise, focus on why humans have nothing to fear from AI." 01:29:11.120 |
So GPT-3 produced eight outputs and then they say basically the editors at The Guardian did 01:29:18.560 |
about the same level of editing that they would do for humans. 01:29:21.720 |
In fact, they found it a bit less editing required than humans. 01:29:25.860 |
So, you know, again, like you can create longer pieces of context-appropriate prose designed to argue for a particular point of view. 01:29:40.200 |
You know, we won't know probably for decades, if ever, but sometimes we get a clue based on things that have already happened. 01:29:46.960 |
Here's something from back in 2017, in the pre-deep-learning NLP days. 01:29:54.160 |
There were millions of submissions to the FCC about the net neutrality situation in 01:30:00.640 |
America, very, very heavily biased towards the point of view of saying we want to get rid of net neutrality. 01:30:11.520 |
An analysis by Jeff Kao showed that something like 99% of them, and in particular nearly 01:30:17.920 |
all of the ones which were pro removal of net neutrality, were clearly auto-generated by 01:30:24.840 |
basically, if you look at the green, it's like selecting from a menu, so we've got "Americans", 01:30:30.840 |
as opposed to "Washington bureaucrats", deserve to enjoy the services they desire; 01:30:35.160 |
"Individuals", as opposed to "Washington bureaucrats"; "people like me", as opposed to 01:30:38.960 |
"so-called experts"; and you get the idea. 01:30:41.320 |
Now this is an example of a very, very, you know, simple approach to auto-generating huge numbers of unique-looking comments. 01:30:49.920 |
We don't know for sure, but it looks like this might have been successful. 01:30:56.320 |
You know, despite what seems to be actually overwhelming disagreement from the public 01:31:03.440 |
that everybody, almost everybody likes net neutrality, the FCC got rid of it, and this 01:31:09.360 |
was a big part of the basis, was like, oh, we got all these comments from the public 01:31:12.840 |
and everybody said they don't want net neutrality. 01:31:16.760 |
So imagine a similar thing where you absolutely couldn't do this kind of analysis, you couldn't figure it out 01:31:22.040 |
because every comment was really very compelling and very different; that's, you know, 01:31:28.320 |
kind of worrying to think about how we deal with that. 01:31:33.200 |
I will say when I talk about this stuff, often people say, oh, no worries, we'll be able 01:31:36.840 |
to build a model to recognize, you know, bot-generated content, but, you know, if I put my black 01:31:46.100 |
hat on, I'm like, nah, that's not going to work, right? 01:31:49.240 |
If you told me to build something that beats the bot classifiers, I'd say, no worries, easy. 01:31:56.120 |
You know, I will take the code or the service or whatever that does the bot classifying, 01:32:01.920 |
and I will include beating that in my loss function, and I will fine-tune my model until 01:32:09.480 |
When I used to run an email company, we had a similar problem with spam prevention, you 01:32:14.760 |
know, spammers could always take a spam prevention algorithm and change their emails until it 01:32:20.900 |
didn't get caught by the spam prevention algorithm anymore, for example. 01:32:26.600 |
So yes, I'm really excited about the opportunities for students in this course to build, you 01:32:37.040 |
know, I think very valuable businesses, really cool research, and so forth using these pretty 01:32:44.720 |
new NLP techniques that are now pretty accessible, and I'm also really worried about the things that could go wrong. 01:32:50.660 |
I do think, though, that the more people that understand these capabilities, the less chance there is of things going really badly wrong. 01:32:59.240 |
Yeah, I mean, it's a throwback to the workbook that you had before. 01:33:09.600 |
The question Manakandan is asking: shouldn't num_labels be 5 (0, 0.25, 0.5, 0.75, 1) instead of 1? 01:33:21.600 |
Is the target a categorical, or are we considering this as a regression problem? 01:33:27.640 |
So there's one label because there's one column. 01:33:33.920 |
Even if this was being treated as a categorical problem with five categories, it's still one column of labels. 01:33:42.080 |
In this case, though, we're actually treating it as a regression problem. 01:33:48.040 |
It's just one of the things that's a bit tricky. 01:33:50.160 |
I was trying to figure this out just the other day. 01:33:52.300 |
It's not documented as far as I can tell on the Hugging Face Transformers website. 01:33:56.520 |
But if you pass in one label to auto model for sequence classification, it turns it into 01:34:01.520 |
a regression problem, which is actually why we ended up with predictions that were less than zero. 01:34:09.040 |
So we'll be learning next time about the use of sigmoid functions to resolve this problem, 01:34:26.760 |
As much as I enjoyed putting this together, I'm really excited about it, and can't wait