Lesson 4: Practical Deep Learning for Coders 2022

00:00:00.000 | Hi, everybody, and welcome to Practical Deep Learning for Coders Lesson 4, which I think

00:00:07.120 | is the lesson that a lot of the regulars in the community have been most excited about

00:00:12.120 | because it's where we're going to get some totally new material, totally new topic we've

00:00:18.600 | never covered before.

00:00:20.520 | We're going to cover natural language processing in LP, and you'll find there is indeed a chapter

00:00:25.800 | about that in the book, but we're going to do it in a totally different way to how it's

00:00:30.520 | done in the book.

00:00:31.680 | In the book, we do NLP using the FastAI library using recurrent neural networks, RNNs.

00:00:40.920 | Today we're going to do something else, which is we're going to do transformers.

00:00:48.360 | And we're not even going to use the FastAI library at all, in fact.

00:00:54.240 | So what we're going to be doing today is we're going to be fine-tuning a pre-trained NLP

00:01:01.680 | model using a library called Hugging Face Transformers.

00:01:06.080 | Now given this is the Fast.AI course, you might be wondering why we'd be using a different

00:01:10.120 | library other than FastAI.

00:01:13.880 | The reason is that I think that it's really useful for everybody to have experience and

00:01:21.440 | practice of using more than one library, because you'll get to see the same concepts applied

00:01:29.240 | in different ways.

00:01:30.960 | And I think that's great for your understanding of what these concepts are.

00:01:37.040 | Also I really like the Hugging Face Transformers library.

00:01:40.440 | It's absolutely the state of the art in NLP, and it's well worth knowing.

00:01:47.040 | If you're watching this on video, by the time you're watching it, we will probably have

00:01:50.280 | completed our integration of the transformers library into FastAI, so it's in the process

00:01:54.760 | of becoming the main NLP foundation for FastAI.

00:02:01.480 | So you'll be able to combine transformers and FastAI together.

00:02:08.800 | So I think there's a lot of benefits to this, and in the end you're going to know how to

00:02:12.320 | do NLP in a really fantastic library.

00:02:15.600 | Now the other thing is Hugging Face Transformers doesn't have the same layered architecture

00:02:22.720 | that FastAI has, which means, particularly for beginners, the kind of high level, top

00:02:31.200 | tier API that you'll be using most of the time is not as ready to go for beginners as

00:02:39.200 | you're used to from FastAI.

00:02:41.680 | And so that's actually, I think, a good thing.

00:02:43.240 | You're up to lesson four.

00:02:44.880 | You know the basic idea now of how gradient descent works and how parameters are learned

00:02:52.240 | as part of a flexible function.

00:02:55.800 | I think you're ready to try using a somewhat lower level library that does a little bit

00:03:00.840 | less for you.

00:03:02.280 | So it's going to be a little bit more work.

00:03:04.520 | It's a very well-designed library, and it's still reasonably high level, but you're going

00:03:09.480 | to learn to go a little bit deeper.

00:03:11.480 | And that's kind of how the rest of the course in general is going to be on the whole, is

00:03:14.920 | we're going to get a bit deeper and a bit deeper and a bit deeper.

00:03:19.240 | Now so first of all, let's talk about what we're going to be doing with fine-tuning a

00:03:25.920 | pre-trained model.

00:03:27.040 | We've talked about that in passing before, but we haven't really been able to describe

00:03:30.880 | it in any detail because you haven't had the foundations.

00:03:35.440 | Now you do.

00:03:37.040 | You played with these sliders last week, and hopefully you've all actually gone into this

00:03:42.720 | notebook and dragged them around and tried to get an intuition for this idea of moving

00:03:47.600 | them up and down, makes the loss go up and down, and so forth.

00:03:51.580 | So I mentioned that your job was to move these sliders to get this as nice as possible, but

00:03:58.640 | when it was given to you, the person who gave it to you said, "Oh, actually slider A, that

00:04:05.480 | should be on 2.0."

00:04:07.440 | We know for sure.

00:04:08.720 | And slider B, we think it's like around two and a half.

00:04:13.760 | Slider C, we've got no idea.

00:04:16.280 | Now that would be pretty helpful, wouldn't it, because you could immediately start focusing

00:04:20.800 | on the one we have no idea about, get that in roughly the right spot, and then the one

00:04:25.000 | you've kind of got a vague idea about, you could just tune it a little bit, and the one

00:04:27.920 | that they said was totally confident you wouldn't move at all, you would probably tune these

00:04:32.000 | sliders really quickly.

00:04:35.200 | That's what a pre-trained model is.

00:04:37.120 | A pre-trained model is a bunch of parameters that have already been fit, where some of

00:04:43.880 | them are already pretty confident of what they should be, and some of them we really

00:04:49.080 | have no idea at all.

00:04:50.760 | And so fine-tuning is the process of taking those ones we have no idea what they should

00:04:54.840 | be at all and trying to get them right, and then moving the other ones a little bit.

00:05:01.440 | The idea of fine-tuning a pre-trained NLP model in this way was pioneered by an algorithm

00:05:10.120 | called ULMfit, which was first presented actually in a fast AI course, I think the very first

00:05:16.840 | fast AI course.

00:05:18.720 | It was later turned into an academic paper by me in conjunction with a then PhD student

00:05:23.960 | named Sebastian Ruder, who's now one of the world's top NLP researchers, and went on to

00:05:28.960 | help inspire a huge change, a huge step improvement in NLP capabilities around the world, along

00:05:38.440 | with a number of other important innovations at the time.

00:05:43.300 | This is the basic process that ULMfit described.

00:05:51.600 | Step one was to build something called a language model using basically nearly all of Wikipedia.

00:05:58.740 | And what the language model did was it tried to predict the next word of a Wikipedia article,

00:06:06.760 | in fact every next word of every Wikipedia article.

00:06:12.100 | Doing that is very difficult.

00:06:14.120 | There are Wikipedia articles which would say things like the 17th prime number is dot, dot,

00:06:25.640 | dot, or the 40th president of the United States, Blah, said at his residence, Blah, that.

00:06:34.880 | Filling in these kinds of things requires understanding a lot about how language is

00:06:40.400 | structured and about the world and about math and so forth.

00:06:47.560 | So to get good at being a language model, a neural network has to get good at a lot

00:06:53.160 | of things.

00:06:55.160 | It has to understand how language works at a reasonably good level, and it needs to understand

00:07:00.880 | what it's actually talking about, and what is actually true, what is actually not true,

00:07:05.720 | and the different ways in which things are expressed, and so forth.

00:07:10.480 | So this was trained using a very similar approach to what we'll be looking at for fine tuning,

00:07:17.440 | but it started with random weights, and at the end of it there was a model that could

00:07:21.040 | predict more than 30% of the time correctly what the next word of a Wikipedia article

00:07:27.320 | would be.

00:07:31.040 | So in this particular case for the ULM FIT paper, we then took that and we were trying

00:07:36.200 | to-- the first task I did actually for the FAST AI course back when I invented this was

00:07:42.600 | to try and figure out whether IMDB movie reviews were positive or negative sentiment.

00:07:47.480 | Did the person like the movie or not?

00:07:51.440 | So what I did was I created a second language model.

00:07:55.200 | So again, the language model here is something that predicts the next word of a sentence.

00:07:58.640 | But rather than using Wikipedia, I took this pre-trained model that was trained on Wikipedia,

00:08:04.360 | and I ran a few more epochs using IMDB movie reviews.

00:08:10.000 | So it got very good at predicting the next word of an IMDB movie review.

00:08:15.760 | And then finally, I took those weights and I fine-tuned them for the task of predicting

00:08:24.120 | whether or not a movie review was positive or negative sentiment.

00:08:28.520 | So those were the three steps.

00:08:32.640 | This is a particularly interesting approach because this very first model-- in fact, the

00:08:37.640 | first two models-- if you think about it, they don't require any labels.

00:08:41.000 | They didn't have to collect any kind of document categories or do any kind of surveys or collect

00:08:47.560 | anything.

00:08:48.560 | All I needed was the actual text of Wikipedia and movie reviews themselves because the labels

00:08:53.480 | was, what's the next word of a sentence?

00:08:57.360 | Now since we built ULMfit-- and we used RNNs, the current neural networks for this-- at

00:09:07.840 | about the same time-ish that we released this, a new kind of architecture, particularly useful

00:09:13.040 | for NLP at the time, was developed called transformers.

00:09:17.800 | And transformers were particularly built because they can take really good advantage of modern

00:09:22.920 | accelerators like Google's TPUs.

00:09:29.280 | They didn't really allow you to predict the next word of a sentence.

00:09:36.680 | It's just not how they're structured for reasons we'll talk about probably in part two of the

00:09:41.520 | course.

00:09:42.520 | So they threw away the idea of predicting the next word of a sentence.

00:09:45.440 | And then instead, they did something just as good and pretty clever.

00:09:49.680 | They took kind of chunks of Wikipedia or whatever text they're looking at and deleted at random

00:09:55.700 | a few words and asked the model to predict what were the words that were deleted, essentially.

00:10:03.360 | So it's a pretty similar idea.

00:10:04.960 | Other than that, the basic concept was the same as ULMfit.

00:10:09.060 | They replaced our RNN approach with a transformer model.

00:10:12.520 | They replaced our language model approach with what's called a masked language model.

00:10:16.240 | But other than that, the basic idea was the same.

00:10:19.080 | So today, we're going to be looking at models using what's become the much more popular

00:10:27.040 | approach than ULMfit, which is this transformers masked language model approach.

00:10:31.520 | OK.

00:10:32.520 | John, do we have any questions?

00:10:34.000 | And I should mention, we do have a professor from the University of Queensland, John Williams,

00:10:41.760 | joining us, who will be asking the highest voted questions from the community.

00:10:46.360 | What do you got, John?

00:10:47.360 | Yeah.

00:10:48.360 | Thanks, Jeremy.

00:10:49.360 | And we might be jumping the gun here.

00:10:50.880 | I suspect this is where you're going tonight.

00:10:53.200 | But we've got a good question here on the forum, which is, how do you go from a model

00:10:57.800 | that's trained to predict the next word to a model that can be used for classification?

00:11:03.600 | Sure.

00:11:04.600 | So, yeah, we will be getting into that in more detail.

00:11:09.440 | And in fact, maybe a good place to start would be the next slide, kind of give you a sense

00:11:14.560 | of this.

00:11:16.120 | You might remember in lesson one, we looked at this fantastic Zyla and Fergus paper, where

00:11:21.480 | we looked at visualizations of the first layer of a ImageNet classification model.

00:11:28.160 | And layer one had sets of weights that found diagonal edges.

00:11:34.400 | And here are some examples of bits of photos that successfully matched with and opposite

00:11:39.760 | diagonal edges, and kind of color gradients.

00:11:42.640 | And here's some examples of bits of pictures that matched.

00:11:46.120 | And then layer two combined those, and now you know how those were combined, right?

00:11:51.680 | These were rectified linear units that were added together, and then sets of those rectified

00:11:58.520 | linear units, the outputs of those, they're called activations, were then themselves run

00:12:02.880 | through a matrix multiplier, a rectified linear unit, added together.

00:12:06.480 | So now you don't just have to have edge detectors, but layer two had corner detectors.

00:12:11.640 | And here's some examples of some corners that that corner detector successfully found.

00:12:16.240 | And remember, these were not engineered in any way, they just evolved from the gradient

00:12:22.320 | descent training process.

00:12:24.960 | Layer two had examples of circle detectors, as it turns out.

00:12:29.160 | And skipping a bit, by the time we got to layer five, we had bird and lizard eyeball

00:12:35.120 | detectors, and dog face detectors, and flower detectors, and so forth.

00:12:45.800 | Nowadays, you'd have something like a ResNet 50, would be something you'd probably be training

00:12:51.020 | pretty regularly in this course, so that you've got 50 layers, not just five layers.

00:12:56.840 | Now the later layers do things that are much more specific to the training task, which

00:13:04.640 | is actually predicting, really, what is it that we're looking at.

00:13:09.360 | The early layers, pretty unlikely you're going to need to change them much, as long as you're

00:13:13.920 | looking at some kind of natural photos.

00:13:17.360 | You're going to need edge detectors, gradient detectors.

00:13:21.160 | So what we do in the fine-tuning process is there's actually one extra layer after this,

00:13:29.240 | which is the layer that actually says, what is this?

00:13:32.080 | It's a dog, or a cat, or whatever.

00:13:33.720 | You actually delete that, or you throw it away.

00:13:36.520 | So now that last matrix multiply has one output, or one output per category you're predicting.

00:13:44.080 | We throw that away.

00:13:45.520 | So the model now has that last matrix that's spitting out, it depends, but generally a

00:13:51.480 | few hundred activations.

00:13:54.840 | What we do, as we'll learn more shortly in the coming lesson, we just stick a new random

00:14:01.600 | matrix on the end of that.

00:14:04.080 | And that's what we initially train.

00:14:06.080 | So it learns to use these kinds of features to predict whatever it is you're trying to

00:14:13.200 | predict.

00:14:14.280 | And then we gradually train all of those layers.

00:14:18.400 | So that's basically how it's done.

00:14:19.680 | And so that's a bit hand-wavy, but we'll, particularly in part two, actually build that

00:14:25.920 | from scratch ourselves.

00:14:28.280 | And in fact, in this lesson, time permitting, we're actually going to start going down the

00:14:31.300 | process of actually building a real world neural net in Python.

00:14:37.440 | So we'll be starting to actually make some progress towards that goal.

00:14:41.120 | Okay.

00:14:42.200 | So let's jump into the notebook.

00:14:47.080 | So we're going to look at a Kaggle competition that's actually on, as I speak.

00:14:54.740 | And I created this notebook called Getting Started with NLP for Absolute Beginners.

00:15:00.100 | And so the competition is called the US Patent Phrase-to-Phrase Matching Competition.

00:15:07.400 | And so I'm going to take you through a complete submission to this competition.

00:15:16.480 | And Kaggle competitions are interesting, particularly the ones that are not playground competitions,

00:15:20.560 | but the real competitions with real money applied.

00:15:22.840 | They're interesting because this is an actual project that an actual organization is prepared

00:15:29.620 | to invest money in getting solved using their actual data.

00:15:34.240 | So a lot of people are a bit dismissive of Kaggle competitions as being not very real.

00:15:40.440 | And it's certainly true.

00:15:41.440 | You're not worrying about stuff like productionizing the model.

00:15:44.660 | But in terms of getting real data about a real problem that real organizations really

00:15:50.160 | care about and a very direct way to measure the accuracy of your solution, you can't really

00:15:55.980 | get better than this.

00:15:58.040 | So this is a good place.

00:15:59.040 | It's a good competition to experiment with for trying NLP.

00:16:02.840 | Now, as I mentioned here, probably the most widely useful application for NLP is classification.

00:16:10.200 | And as we've discussed in computer vision, classification refers to taking an object

00:16:15.040 | and trying to identify a category that object belongs to.

00:16:19.960 | So previously, we've mainly been looking at images.

00:16:22.720 | Today, we're going to be looking at documents.

00:16:26.560 | Now in NLP, when we say document, we don't specifically mean a 20-page long essay.

00:16:36.280 | A document could be three or four words, or a document could be the entire encyclopedia.

00:16:41.720 | So a document is just an input to an NLP model that contains text.

00:16:50.360 | Now classifying a document, so deciding what category a document belongs to, is a surprisingly

00:16:57.920 | rich thing to do.

00:17:00.880 | There's all kinds of stuff you could do with that.

00:17:02.920 | So for example, we've already mentioned sentiment analysis.

00:17:05.680 | That's a classification task.

00:17:07.680 | We try to decide on the category, positive or negative sentiment.

00:17:11.520 | Author identification would be taking a document and trying to find the category of author.

00:17:18.120 | Digital discovery would be taking documents and putting them into categories according

00:17:21.440 | to in or out of scope for a court case.

00:17:25.600 | Triaging inbound emails would be putting them into categories of throw away, send to customer

00:17:33.040 | service, send to sales, et cetera.

00:17:36.720 | So classification is a very, very rich area.

00:17:41.360 | And for people interested in trying out NLP in real life, I would suggest classification

00:17:48.080 | would be the place I would start for looking for accessible, real world, useful problems

00:17:54.560 | you can solve right away.

00:17:58.240 | Now the Kaggle competition does not immediately look like a classification competition.

00:18:05.320 | What it contains, let me show you some data.

00:18:13.040 | What it contains is data that looks like this.

00:18:16.200 | It has a thing that they call anchor, a thing they call target, a thing they call context

00:18:21.120 | and a score.

00:18:22.600 | Now these are, I can't remember exactly how it is, but I think these are from patents.

00:18:29.840 | And I think on the patents there are various things they have to fill in in the patent.

00:18:35.960 | One of those things is called anchor.

00:18:38.680 | One of those things is called target.

00:18:40.840 | And in the competition, the goal is to come up with a model that automatically determines

00:18:45.240 | which anchor and target pairs are talking about the same thing.

00:18:50.640 | So a score of one here, wood article and wooden article obviously talking about the same thing.

00:18:57.400 | A score of zero here, abatement and forest region not talking about the same thing.

00:19:02.800 | So the basic idea is that we're trying to guess the score.

00:19:08.120 | And it's kind of a classification problem, kind of not.

00:19:12.080 | We're basically trying to classify things into either these two things are the same

00:19:15.840 | or these two things aren't the same.

00:19:18.080 | It's kind of not because we have not just one and zero, but also 0.25, 0.5 and 0.75.

00:19:24.480 | There's also a column called context, which I believe is like the category that this patent

00:19:30.160 | was filed in.

00:19:31.920 | And my understanding is that whether the anchor and the target count as similar or not depends

00:19:37.520 | on what the patent was filed under.

00:19:42.280 | So how would we take this and turn it into something like a classification problem?

00:19:51.080 | So the suggestion I make here is that we could basically say, OK, let's put some constant

00:20:01.640 | string like text one or field one before the first column, and then something else like

00:20:07.520 | text two before the second column.

00:20:12.000 | Maybe also the context I should have as well, text three in the context, and then try to

00:20:16.840 | choose a category of meaning similarity, different, similar or identical.

00:20:20.280 | So we could basically concatenate those three pieces together, call that a document, and

00:20:25.800 | then try to train a model that can predict these categories.

00:20:30.360 | That would be an example of how we can take this basically similarity problem and turn

00:20:38.280 | it into something that looks like a classification problem.

00:20:41.760 | And we tend to do this a lot in deep learning is we kind of take problems that look a bit

00:20:49.280 | novel and different and turn them into a problem that looks like something we recognize.

00:20:54.640 | So on Kaggle, this is a larger data set that you're going to need a GPU to run.

00:21:05.640 | So you can click on the Accelerator button and choose GPU to make sure that you're using

00:21:10.440 | a GPU.

00:21:11.840 | If you click Copy and Edit on my document, I think that will happen for you automatically.

00:21:17.120 | Personally, I like using things like PaperSpace generally better than Kaggle.

00:21:26.000 | Kaggle's pretty good, but you only get 30 hours a week of GPU time, and the notebook editor

00:21:32.680 | for me is not as good as the real JupyterLab environment.

00:21:36.480 | So there's some information here I won't go through, but it basically describes how you

00:21:40.960 | can download stuff to PaperSpace or your own computer as well if you want to.

00:21:48.160 | So I basically create this little boolean always in my notebooks called isKaggle, which

00:21:53.400 | is going to be true if it's running on Kaggle and false otherwise, and any little changes

00:21:57.100 | I need to make, I'd say if isKaggle and put those changes.

00:22:05.260 | So here you can see here if I'm not on Kaggle and I don't have the data yet, then download

00:22:10.600 | it.

00:22:11.600 | And Kaggle has a little API, which is quite handy for doing stuff like downloading data

00:22:15.600 | and uploading notebooks and stuff like that, submitting to competitions.

00:22:22.040 | If we are on Kaggle, then the data's already going to be there for us, which is actually

00:22:25.560 | a good reason for beginners to use Kaggle is you don't have to worry about grabbing

00:22:29.520 | the data at all.

00:22:30.520 | It's sitting there for you as soon as you open the notebook.

00:22:35.680 | Kaggle has a lot of Python packages installed, but not necessarily all the ones you want.

00:22:42.400 | And at the point I wrote this, they didn't have HuggingFace's datasets package for some

00:22:47.440 | reason, so you can always just install stuff.

00:22:49.880 | So you might remember the exclamation mark means this is not a Python command, but a

00:22:55.200 | shell command, a bash command.

00:22:57.000 | But it's quite neat. You can even put bash commands inside Python conditionals.

00:23:03.340 | So that's a pretty cool little trick in notebooks.

00:23:09.800 | Another cool little trick in notebooks is that if you do use a bash command, like ls,

00:23:15.640 | but you then want to insert the contents of a Python variable, just chuck it in parentheses.

00:23:20.800 | So I've got a Python variable called path, and I can go ls path in parentheses, and that

00:23:27.700 | will ls the contents of the Python variable path.

00:23:31.680 | So there's another little trick for you.

00:23:33.280 | So when we ls that, we can see that there's some CSV files.

00:23:37.640 | So what I'm going to do is kind of take you through roughly the process, the kind of process

00:23:42.240 | I went through when I first look at a competition.

00:23:46.640 | So the first thing is already dataset, indeed.

00:23:49.160 | What's in it?

00:23:50.160 | It's got some CSV files.

00:23:53.560 | As well as looking at it here, the other thing I would do is I would go to the competition

00:23:59.360 | website, and if you go to data, a lot of people skip over this, which is a terrible idea because

00:24:09.960 | it actually tells you what the dependent variable means, what the different files are, what

00:24:15.000 | the columns are, and so forth.

00:24:17.520 | So don't just rely on looking at the data itself, but look at the information that you're

00:24:22.880 | given about the data.

00:24:32.240 | So for CSV files, the CSV files are comma-separated values.

00:24:35.880 | So they're just text files with a comma between each field.

00:24:39.840 | But we can read them using pandas, which for some reason is always called PD.

00:24:48.280 | Pandas is one of, I guess, probably four key libraries that you have to know to do data

00:24:57.520 | science in Python, and specifically, those four libraries are NumPy, Matplotlib, Pandas,

00:25:16.800 | and PyTorch.

00:25:18.920 | So NumPy is what we use for basic numerical programming, Matplotlib we use for plotting,

00:25:26.800 | Pandas we use for tables of data, and PyTorch we use for deep learning.

00:25:36.200 | Those are all covered in a fantastic book by the author as Pandas, which the new version

00:25:46.200 | is actually available for free, I believe, Python for Data Analysis.

00:25:53.240 | So if you're not familiar with these libraries, just read the whole book.

00:25:59.080 | It doesn't take too long to get through, and it's got lots of cool tips, and it's very

00:26:03.400 | readable.

00:26:04.400 | I do find a lot of people doing this course.

00:26:09.560 | Often I see people kind of trying to jump ahead and want to be like, oh, I want to know

00:26:14.760 | how to create a new architecture, or build a speech recognition system, or whatever.

00:26:20.800 | But it then turns out that they don't know how to use these fundamental libraries.

00:26:24.840 | So it's always good to be bold and be trying to build things, but do also take the time

00:26:28.480 | to make sure you finish reading the first AI book and read at least Wes McKinney's book.

00:26:36.240 | That would be enough to really give you all the basic knowledge you need, I think.

00:26:41.540 | So with Pandas, we can read a CSV file, and that creates something called a data frame,

00:26:45.680 | which is just a table of data, as you see.

00:26:50.380 | So now that we've got a data frame, we can see what we're working with.

00:26:57.440 | And when in Jupyter we just put the name of a variable containing a data frame, we've

00:27:01.320 | got the first five rows, the last five rows, and the size, so we've got 36,473 rows.

00:27:09.360 | So other things I like to use for understanding a data frame is the describe method.

00:27:18.540 | If you pass include equals object, that will describe basically all the string fields,

00:27:23.680 | the non-numeric fields.

00:27:25.280 | So in this case, there's four of those.

00:27:28.720 | And so you can see here that that anchor field we looked at, there's actually only 733 unique

00:27:34.120 | values.

00:27:35.120 | So this thing, you can see that there's lots of repetition out of 36,000.

00:27:41.480 | So there's lots of repetition.

00:27:45.240 | This is the most common one, it appears 152 times.

00:27:49.620 | And then context, we also see lots of repetition, there's 106 of those contexts.

00:27:53.600 | So this is a nice little method, we can see a lot about the data in a glance.

00:27:58.960 | And when I first saw this in this competition, I thought, well, this is actually not that

00:28:04.000 | much language data when you think about it.

00:28:06.880 | Each document is very short, three or four words, really, and lots of it is repeated.

00:28:15.760 | So as I'm looking through it, I'm thinking, what are some key features of this data set?

00:28:20.200 | And that would be something I'd be thinking, wow, we've got to do a lot with not very much

00:28:24.200 | unique data here.

00:28:29.120 | So here's how we can just go ahead and create a single string like I described, which contains

00:28:36.040 | some kind of field separator, plus the context, the target, and the anchor.

00:28:42.840 | So we're going to pop that into a field called input.

00:28:47.400 | Something slightly weird in pandas is there's two ways of referring to a column.

00:28:51.880 | You can use square brackets and a string to get the input column, or you can just treat

00:28:56.800 | it as an attribute.

00:28:59.380 | When you're setting it, you should always use the forms in here.

00:29:04.880 | When reading it, you can use either.

00:29:06.560 | I tend to use this one because it's less typing.

00:29:08.960 | So you can see now we've got these concatenated rows.

00:29:13.040 | So head is the first few rows.

00:29:18.220 | So we've now got some documents to do NLP with.

00:29:23.040 | Now the problem is, as you know from the last lesson, neural networks work with numbers.

00:29:29.040 | We're going to take some numbers, and we're going to multiply them by matrices.

00:29:35.360 | We're going to replace the negatives with zeros and add them up, and we're going to

00:29:38.520 | do that a few times.

00:29:40.420 | That's our neural network, with some little wrinkles, but that's the basic idea.

00:29:45.080 | So how on earth do we do that for these strings?

00:29:51.740 | So there's basically two steps we're going to take.

00:29:54.960 | The first step is to split each of these into tokens.

00:29:59.960 | Tokens are basically words.

00:30:02.280 | We're going to split it into words.

00:30:07.680 | There's a few problems with splitting things into words, though.

00:30:11.360 | The first is that some languages, like Chinese, don't have words, or at least certainly not

00:30:16.880 | space-separated words, and in fact, in Chinese, sometimes it's a bit fuzzy to even say where

00:30:22.480 | a word begins and ends.

00:30:24.800 | Some words are kind of not even -- the pieces are not next to each other.

00:30:29.460 | Another reason is that what we're going to be doing is after we've split it into words,

00:30:34.400 | or something like words, we're going to be getting a list of all of the unique words

00:30:38.240 | that appear, which is called the vocabulary.

00:30:41.480 | And every one of those unique words is going to get a number.

00:30:45.120 | As you'll see later on, the bigger the vocabulary, the more memory is going to get used, the

00:30:51.680 | more data we'll need to train.

00:30:55.000 | In general, we don't want a vocabulary to be too big.

00:31:02.720 | So instead, nowadays, people tend to tokenize into something called subwords, which is pieces

00:31:09.920 | of words.

00:31:10.920 | So I'll show you what it looks like.

00:31:12.600 | So the process of turning it into smaller units, like words, is called tokenization,

00:31:18.840 | and we call them tokens instead of words.

00:31:20.440 | The token is just the more general concept of whatever we're splitting it into.

00:31:25.760 | So we're going to get hugging face transformers and hugging face datasets doing our work for

00:31:30.840 | us.

00:31:33.480 | And so what we're going to do is we're going to turn our pandas data frame into a hugging

00:31:39.620 | face datasets dataset.

00:31:42.960 | It's a bit confusing.

00:31:45.360 | PyTorch has a class called dataset, and hugging face has a class called dataset, and they're

00:31:50.320 | different things.

00:31:51.320 | So this is a hugging face dataset, hugging face datasets dataset.

00:31:55.840 | So we can turn a data frame into a data set just using the from pandas method.

00:32:01.400 | And so we've now got a dataset.

00:32:03.520 | So if we take a look, it just tells us, all right, it's got these features.

00:32:09.880 | And remember, input is the one we just created with the concatenated strings.

00:32:14.120 | And here's those 36,000 rows.

00:32:20.000 | So now we're going to do these two things, tokenization, which is to split each text

00:32:23.100 | up into tokens, and then numericalization, which is to turn each token into its unique

00:32:28.680 | ID based on where it is in the vocabulary.

00:32:31.760 | The vocabulary, remember, being the list of unique tokens.

00:32:37.760 | Now particularly in this stage, tokenization, there's a lot of little decisions that have

00:32:45.400 | to be made.

00:32:47.400 | The good news is you don't have to make them, because whatever pre-trained model you used,

00:32:53.620 | the people that pre-trained it made some decisions, and you're going to have to do exactly the

00:32:58.360 | same thing, otherwise you'll end up with a different vocabulary to them, and that's going

00:33:03.080 | to mess everything up.

00:33:04.840 | So that means before you start tokenizing, you have to decide on what model to use.

00:33:10.160 | Hugging face transformers is a lot like Tim.

00:33:13.760 | It has a library of, I believe, hundreds of models.

00:33:21.120 | I guess I shouldn't say hugging face transformers, it's really the hugging face model hub.

00:33:25.400 | 44,000 models, so many more even than Tim's image models.

00:33:33.600 | And so these models, they vary in a couple of ways.

00:33:36.760 | There's a variety of different architectures, just like in Tim, but then something which

00:33:42.080 | is different to Tim is that each of those architectures can be trained on different

00:33:46.520 | corpuses for solving different problems.

00:33:49.040 | So for example, I could type patent, and see if there's any pre-trained patent, there is.

00:33:54.120 | So there's a whole lot of pre-trained patent models, isn't that amazing?

00:34:00.000 | So quite often, thanks to the hugging face model hub, you can start your pre-trained

00:34:08.080 | model with something that's actually pretty similar to what you actually want to do, or

00:34:12.560 | at least was trained on the same kind of documents.

00:34:18.240 | Having said that, there are some just generally pretty good models that work for a lot of

00:34:26.400 | things a lot of the time, and DiBerta V3 is certainly one of those.

00:34:36.920 | This is a very new area, NLP has been practically really effective for general users for only

00:34:50.160 | a year or two, whereas for computer vision it's been quite a while.

00:34:56.440 | So you'll find that a lot of things aren't as quite well bedded down.

00:35:00.040 | I don't have a picture to show you of which models are the best or the fastest and the

00:35:06.400 | most accurate and whatever, right?

00:35:08.680 | A lot of this stuff is like stuff that we're figuring out as a community using competitions

00:35:14.680 | like this, in fact, and this is one of the first NLP competitions actually in the kind

00:35:20.200 | of modern NLP era.

00:35:22.520 | So we've been studying these competitions closely, and I can tell you that DiBerta is

00:35:28.880 | actually a really good starting point for a lot of things, so that's why we've picked

00:35:33.400 | it.

00:35:34.400 | It's really cool.

00:35:35.400 | And just like in Tim for image, you know, a model says often going to be a small, a medium,

00:35:39.960 | a large, and of course we should start with small, right, because small is going to be

00:35:44.880 | faster to train, we're going to be able to do more iterations and so forth, okay.

00:35:57.900 | So at this point, remember, the only reason we picked our model is because we have to

00:36:01.080 | make sure we tokenize in the same way.

00:36:07.200 | To tell transformers that we want to tokenize the same way that the people that built a

00:36:12.120 | model did, we use something called autotokenizer.

00:36:15.280 | It's nothing fancy.

00:36:16.280 | It's basically just a dictionary which says, oh, which model uses which tokenizer.

00:36:20.560 | So when we say autotokenizer from pre-trained, it will download the vocabulary and the details

00:36:26.320 | about how this particular model tokenized dataset.

00:36:35.080 | So at this point, we can now take that tokenizer and pass the string to it.

00:36:42.600 | So if I pass the string g'day folks on Jeremy from fast.ai, you'll see it's kind of putting

00:36:49.760 | it into words, kind of not.

00:36:53.160 | So if you've ever wondered whether g'day is one word or two, you know, it's actually three

00:36:58.160 | tokens according to this tokenizer, and I'm is three tokens, and fast.ai is three tokens.

00:37:05.920 | This punctuation is a token, and so you kind of get the idea.

00:37:10.440 | These underscores here, that represents the start of a word, right.

00:37:14.680 | So that's kind of, there's this concept that, like, the start of a word is kind of part

00:37:18.160 | of the token.

00:37:19.360 | So if you see a capital I in the middle of a word versus the start of a word, that kind

00:37:22.400 | of means a different thing.

00:37:24.080 | So this is what happens when we tokenize this sentence using the tokenizer that the Daburda

00:37:31.120 | V3 developers used.

00:37:36.800 | So here's a less common, unless you're a big platypus fan like me, less common sentence.

00:37:45.800 | A platypus is an ornithrinx and a tenus.

00:37:49.720 | And so, okay, in this particular vocabulary, platypus got its own word, its own token,

00:37:54.840 | but ornithrinx didn't.

00:37:58.080 | And so I still remember grade one, for some reason, our teacher got us all to learn how

00:38:03.480 | to spell ornithrinx, so one of my favorite words.

00:38:08.400 | So you can see here, it's been split into all, knee, toe, ring, us.

00:38:16.120 | So every one of these tokens you see here is going to be in the vocabulary, right?

00:38:21.240 | The list of unique tokens that was created when this particular model, this pre-trained

00:38:27.600 | model was first trained.

00:38:30.600 | So somewhere in that list, we'll find underscore capital A, and it'll have a number.

00:38:37.620 | And so that's how we'll be able to turn these into numbers.

00:38:41.400 | So this first process is called tokenization, and then the thing where we take these tokens

00:38:46.120 | and turn them into numbers is called numericalization.

00:38:50.920 | So our dataset, remember we put our string into the input field.

00:38:57.980 | So here's a function that takes a document, grabs its input, and tokenizes it.

00:39:04.200 | Okay, so we'll call this our tokenization function.

00:39:09.140 | Tokenization can take a minute or two, so we may as well get all of our processes doing

00:39:12.880 | it at the same time to save some time.

00:39:15.080 | So if you use the dataset.map, it will paralyze that process and just pass in your function.

00:39:21.600 | Make sure you pass batch equals true so it can do a bunch at a time.

00:39:26.080 | Behind the scenes, this is going through something called the tokenizes library, which is a pretty

00:39:29.240 | optimized Rust library that uses SIMD and parallel processing and so forth.

00:39:37.040 | So with batch equals true, it'll be able to do more stuff at once.

00:39:41.640 | So look, it only took six seconds, so pretty fast.

00:39:46.000 | So now when we look at a row of our tokenized dataset, it's going to contain exactly the

00:39:52.640 | same as our original dataset.

00:39:54.560 | No, sorry, it's not going to take exactly the same as our original dataset.

00:39:59.160 | It's going to contain exactly the same input as our original dataset, and it's also going

00:40:03.560 | to contain a bunch of numbers.

00:40:06.240 | These numbers are the position in the vocabulary of each of the tokens in the string.

00:40:16.640 | So we've now successfully turned a string into a list of numbers.

00:40:21.860 | So that is a great first step.

00:40:24.880 | So we can see how this works.

00:40:27.960 | We can see, for example, that we've got "of" at a separate word.

00:40:32.800 | That's going to be an underscore "of" in the vocabulary.

00:40:36.560 | We can grab the vocabulary, look up "of," find that it's 265, and check here, yep, here

00:40:44.560 | it is, 265.

00:40:45.560 | So it's not rocket science, right?

00:40:48.020 | It's just looking stuff up in a dictionary to get the numbers.

00:40:57.040 | So that is the tokenization and Americanization necessary in NLP to turn our documents into

00:41:06.140 | numbers to allow us to put it into our model.

00:41:09.560 | Any questions so far, John?

00:41:11.440 | >> Excuse me, yeah, thanks, Jeremy.

00:41:14.320 | So there's a couple, and this seems like a good time to throw them out, and it's related

00:41:17.840 | to how you've formatted your input data into these sentences that you've just tokenized.

00:41:25.620 | So one question was really about how you choose those keywords and the order of the fields

00:41:32.840 | that you -- so I guess just interested in an explanation, is it more art or science,

00:41:40.120 | how you -- >> No, it's arbitrary.

00:41:42.200 | I tried a few things.

00:41:43.200 | I tried "X," I tried putting them backwards, doesn't matter.

00:41:49.560 | We just want some way, something that it can learn from.

00:41:53.680 | So if I just concatenated it without these headers before each one, it wouldn't know

00:41:59.160 | where abatement of pollution ended and where abatement started, right?

00:42:03.320 | So I did just something that it can learn from.

00:42:04.960 | This is a nice thing about neural nets, they're so flexible.

00:42:10.120 | As long as you give it the information somehow, it doesn't really matter how you give it the

00:42:15.840 | information as long as it's there, right?

00:42:19.920 | I could have used punctuation, I could have put like, I don't know, one semicolon here

00:42:24.640 | and two here and three here, yeah, it's not a big deal.

00:42:28.360 | Like, at the level where you're like trying to get an extra half a percent to get up the

00:42:33.160 | later board or Kaggle competition, you may find tweaking these things makes tiny differences,

00:42:38.160 | but in practice, you won't generally find it matters too much.

00:42:43.480 | >> Right, thank you.

00:42:45.520 | And I guess the second part of that, excuse me again, somebody's asking if one of their

00:42:51.160 | fields was particularly long, say it was a thousand characters, is there any special handling

00:42:57.000 | required there?

00:42:58.000 | Do you need to re-inject those kind of special marker tokens?

00:43:03.360 | Does it change if you've got much bigger fields that you're trying to learn and query?

00:43:08.520 | >> Long documents and ULM fit require no special consideration.

00:43:17.200 | So IMDB, in fact, has multi-thousand word movie reviews and it works great.

00:43:25.000 | To this day, ULM fit is probably the best approach, you know, for reasonably quickly

00:43:31.760 | and easily using large documents.

00:43:35.160 | Otherwise, if you use transformer based approaches, large documents are challenging, specifically,

00:43:45.240 | transformers basically have to do the whole document at once, where else ULM fit can split

00:43:49.720 | it into multiple pieces and read it gradually.

00:43:52.120 | And so that means you'll find that people trying to work with large documents tend to

00:43:56.120 | spend a lot of money on GPUs because they need the big fancy ones with lots of memory.

00:44:02.080 | So generally speaking, I would say if you're trying to do stuff with documents of over

00:44:06.360 | 2,000 words, you might want to look at ULM fit.

00:44:13.080 | Try transformers, see if it works for you, but I'd certainly try both.

00:44:16.400 | For under 2,000 words, transformers should be fine unless you've got nothing but like

00:44:24.000 | a laptop GPU or something with not much memory.

00:44:32.120 | So how can face transformers has these, you know, as I say it right now that I find them

00:44:40.600 | somewhat obscure and not particularly well-documented expectations about your data that you kind

00:44:45.600 | of have to figure out.

00:44:47.240 | And one of those is that it expects that your target is a column called labels.

00:44:54.040 | So once I figured that out, I just went, got our tokenized data set and renamed our score

00:44:58.380 | column to labels and everything started working.

00:45:02.880 | So probably is, you know, I don't know if at some point they'll make this a bit more

00:45:05.640 | flexible, but probably best to just call your target labels and life will be easy.

00:45:14.000 | You might have seen back when I went LS path that there was another data set there called

00:45:17.960 | test.csv.

00:45:21.000 | And if you look at it, it looks a lot like our training set, our other CSV that we've

00:45:27.720 | been working with, but it's missing the score, the labels.

00:45:33.560 | This is called a test set.

00:45:37.240 | And so we're going to talk a little bit about that now because my claim here is that perhaps

00:45:42.960 | the most important idea in machine learning is the idea of having separate training, validation,

00:45:49.640 | and test data sets.

00:46:03.400 | So test and validation sets are all about identifying and controlling for something

00:46:10.480 | called overfitting, and we're going to try and learn about this through example.

00:46:16.760 | So this is the same information that's in that Kaggle notebook I've just put on some

00:46:22.320 | slides here.

00:46:27.720 | So I'm going to create a function here called plot poly, and I'm actually going to use the

00:46:34.600 | same data that, I don't know if you remember, we used it earlier for trying to fit this

00:46:42.000 | quadratic.

00:46:43.000 | We created an X and some X and some Y data.

00:46:47.400 | This is the data we're going to use, and we're going to use this to look at overfitting.

00:46:53.820 | So the details of this function don't matter too much.

00:46:57.440 | What matters is what we do with it, which is that it allows us to basically pass in the

00:47:06.040 | degree of a polynomial.

00:47:07.520 | So for those of you that remember, a first degree polynomial is just a line.

00:47:13.120 | It's Y equals AX.

00:47:14.120 | A second degree polynomial will be Y equals A squared X plus BX plus C. A third degree

00:47:20.560 | polynomial will have a cubic, fourth degree quartic, and so forth.

00:47:25.760 | And what I've done here is I've plotted what happens if we try to fit a line to our data.

00:47:33.200 | It doesn't fit very well.

00:47:36.200 | So what happened here is we did a linear regression, and what we're using here is a very cool library

00:47:44.800 | called scikit-learn.

00:47:45.800 | Scikit-learn is something that, you know, I think it'd be fair to say it's mainly designed

00:47:50.400 | for kind of classic machine learning methods, like kind of linear regression and stuff like

00:47:55.520 | that.

00:47:56.520 | Very advanced versions of these things, but it's also great for doing these quick and

00:48:01.480 | dirty things.

00:48:02.480 | So in this case, I wanted to do what's called a polynomial regression, which is fitting

00:48:06.320 | the polynomial to data, and it's just these two lines of code.

00:48:09.880 | It's a super nice library.

00:48:11.960 | So in this case, a degree one polynomial is just a line.

00:48:14.640 | So I fit it, and then I show it with the data, and there it is.

00:48:18.880 | Now that's what we call underfit, which is to say there's not enough kind of complexity

00:48:25.180 | in this model I fit to match the data that's there.

00:48:32.920 | So an underfit model is a problem.

00:48:35.680 | It's going to be systematically biased.

00:48:37.760 | All the stuff up here, we're going to be predicting too low.

00:48:40.200 | All the stuff down here, we're predicting too low.

00:48:42.160 | All the stuff in the middle, we're predicting too high.

00:48:44.640 | A common misunderstanding is that simpler models are more reliable in some way, but

00:48:52.600 | models that are too simple will be systematically incorrect, as you see here.

00:49:01.160 | What happens if we fit a 10-degree polynomial?

00:49:08.280 | That's not great either.

00:49:10.560 | In this case, it's not really showing us what the actual-- remember, this is originally

00:49:15.960 | a quadratic because this is meant to match, particularly at the ends here.

00:49:19.820 | It's predicting things that are way above what we would expect in real life.

00:49:25.480 | And it's trying really hard to get through this point, but clearly this point was just

00:49:28.960 | some noise.

00:49:30.960 | So this is what we call overfit.

00:49:34.280 | It's done a good job of fitting to our exact data points, but if we sample some more data

00:49:40.000 | points from this distribution, honestly, we probably would suspect they're not going to

00:49:45.240 | be very close to this, particularly if they're a bit beyond the edges.

00:49:49.240 | So that's what overfitting looks like.

00:49:51.040 | We don't want underfitting or overfitting.

00:49:53.600 | Now, underfitting is actually pretty easy to recognize because we can actually look

00:49:58.320 | at our training data and see that it's not very close.

00:50:02.720 | Underfitting is a bit harder to recognize because the training data is actually very

00:50:10.800 | close.

00:50:11.800 | Now, on the other hand, here's what happens if we fit the quadratic.

00:50:19.600 | And here I've got both the real line and the fit line, and you can see they're pretty close.

00:50:26.400 | And that's, of course, what we actually want.

00:50:35.760 | So how do we tell whether we have something more like this or something more like this?

00:50:43.200 | Well, what we do is we do something pretty straightforward, is we take our original dataset,

00:50:49.000 | these points, and we remove a few of them, let's say 20% of them.

00:50:56.840 | We then fit our model using only those points we haven't removed.

00:51:03.020 | And then we measure how good it is by looking at only the points we removed.

00:51:09.760 | So in this case, let's say we had removed, I'm just trying to think, if I'd removed this

00:51:18.440 | point here, then it might have kind of gone off down over here.

00:51:24.000 | And so then when we look at how well it fits, we would say, oh, this one's miles away.

00:51:31.560 | The data that we take away and don't let the model see it when it's training is called

00:51:37.800 | the validation set.

00:51:40.440 | So in fast AI, we've seen splitters before, right?

00:51:43.120 | The splitters are the things that separate out the validation set.

00:51:46.760 | Fast AI won't let you train a model without a validation set.

00:51:50.760 | Fast AI always shows you your metrics, so things like accuracy, measured only on the

00:51:56.000 | validation set.

00:51:57.340 | This is really unusual.

00:51:58.700 | Most libraries make it really easy to shoot yourself in the foot by not having a validation

00:52:03.740 | set or accidentally not using it correctly.

00:52:06.420 | So fast AI won't even let you do that.

00:52:08.520 | So you've got to be particularly careful when using other libraries.

00:52:13.360 | Hacking face transformers is good about this, so they make sure that they do show you your

00:52:19.960 | metrics on a validation set.

00:52:27.680 | Now creating a good validation set is not generally as simple as just randomly pulling

00:52:32.240 | some of your data out of your model, out of the data that you train with your model.

00:52:38.960 | The reason why is imagine that this was the data you were trying to fit something to.

00:52:48.120 | And you randomly remove some, so it looks like this.

00:52:53.060 | That looks very easy, doesn't it?

00:52:55.420 | Because you've kind of like still got all the data you would want around the points.

00:52:59.880 | And in a time series like this, this is dates and sales, in real life you're probably going

00:53:04.600 | to want to predict future dates.

00:53:07.160 | So if you created your validation set by randomly removing stuff from the middle, it's not really

00:53:12.060 | a good indication of how you're going to be using this model.

00:53:15.320 | Instead you should truncate and remove the last couple of weeks.

00:53:20.180 | So if this was your validation set and this is your training set, that's going to be actually

00:53:25.220 | testing whether you can use this to predict the future rather than using it to predict

00:53:30.280 | the past.

00:53:33.960 | Kaggle competitions are a fantastic way to test your ability to create a good validation

00:53:39.440 | set.

00:53:41.240 | Because Kaggle competitions only allow you to submit generally a couple of times a day.

00:53:48.200 | The data set that you are scored on in the leaderboard during that time is actually only

00:53:55.680 | a small subset.

00:53:56.680 | In fact, it's a totally separate subset to the one you'll be scored on on the end of

00:54:00.200 | the competition.

00:54:02.080 | And so most beginners on Kaggle overfit.

00:54:05.200 | And it's not until you've done it that you will get that visceral feeling of like, oh

00:54:10.480 | my god, I overfit.

00:54:13.480 | In the real world, outside of Kaggle, you will often not even know that you overfit.

00:54:20.560 | You just destroy value of your organization silently.

00:54:24.240 | So it's a really good idea to do this kind of stuff on Kaggle a few times first in real

00:54:28.460 | competitions to really make sure that you are confident you know how to avoid overfitting,

00:54:33.960 | how to find a good validation set, and how to interpret it correctly.

00:54:38.340 | And you really don't get that until you screw it up a few times.

00:54:45.040 | Good example of this was there was a distracted driver competition on Kaggle.

00:54:49.960 | There were these kind of pictures from inside a car.

00:54:55.240 | And the idea was that you had to try and predict whether somebody was driving in a distracted

00:55:00.200 | way or not.

00:55:02.600 | And on Kaggle, they did something pretty smart.

00:55:04.520 | The test set, so the thing that they scored you on the leaderboard, contained people that

00:55:08.900 | didn't exist at all in the competition data that you train the model with.

00:55:15.320 | So if you wanted to create an effective validation set in this competition, you would have to

00:55:19.480 | make sure that you separated the photos so that your validation set contained photos

00:55:24.280 | of people that aren't in the data you're training your model on.

00:55:29.320 | There was another one like that, the Kaggle fisheries competition, which had boats that

00:55:37.040 | didn't appear.

00:55:38.120 | So they were basically pictures of boats and you're meant to try to guess, predict what

00:55:41.520 | fish were in the pictures.

00:55:44.080 | And it turned out that a lot of people accidentally figured out what the fish were by looking

00:55:49.560 | at the boat because certain boats tended to catch certain kinds of fish.

00:55:54.280 | And so by messing up their validation set, they were really overconfident of the accuracy

00:55:59.400 | of their model.

00:56:02.040 | I'll mention in passing, if you've been around Kaggle a bit, you'll see people talk about

00:56:08.760 | cross validation a lot.

00:56:10.720 | I'm just going to mention, be very, very careful.

00:56:14.600 | Cross validation is explicitly not about building a good validation set, so you've got to be

00:56:20.440 | super, super careful if you ever do that.

00:56:26.960 | Another thing I'll mention is that Scikit-Learn conveniently offers something called train

00:56:31.440 | test split, as does Hugging Face datasets, as does Fast AI, we have something called

00:56:38.880 | Random Splitter.

00:56:40.520 | It can be encouraged, it can almost feel like it's encouraging you to use a randomized validation

00:56:48.080 | set because there are these methods that do it for you.

00:56:51.400 | But yeah, be very, very careful, because very, very often that's not what you want.

00:56:58.400 | So if you want what a validation set is, so that's the bit that you pull out of your data

00:57:02.560 | that you don't train with, but you do measure your accuracy with, so what's a test set?

00:57:10.200 | It's basically another validation set, but you don't even use it for tracking your accuracy

00:57:16.680 | while you build your model.

00:57:19.080 | Why not?

00:57:20.080 | Well, imagine you tried two new models every day for three months.

00:57:24.320 | That's how long a Kaggle competition goes for.

00:57:26.880 | So you would have tried 180 models, and then you look at the accuracy on the validation

00:57:31.520 | set for each one.

00:57:33.760 | Some of those models, you would have got a good accuracy on the validation set potentially

00:57:38.600 | because of pure chance, just a coincidence, and then you get all excited and you submit

00:57:43.080 | that to Kaggle, and you think you're going to win the competition, and you mess it up.

00:57:47.480 | And that's because you actually overfit using the validation set.

00:57:52.880 | So you actually want to know whether you've really found a good model or not.

00:57:58.640 | So in fact, on Kaggle, they have two test sets.

00:58:02.320 | They've got the one that gives you feedback on the leaderboard during the competition

00:58:05.400 | and a second test set, which you don't get to see until after the competition is finished.

00:58:12.720 | So in real life, you've got to be very careful about this, not to try so many models during

00:58:18.360 | your model building process that you accidentally find one that's good by coincidence.

00:58:24.360 | And only if you have a test set that you've held out or you know that.

00:58:28.600 | Now that leads to the obvious question, which is very challenging, is you spent three months

00:58:33.120 | working on a model, worked well on your validation set, you did a good job of locking that test

00:58:38.720 | set away in a safe so you weren't allowed to use it, and at the end of the three months,

00:58:41.840 | you finally checked it on the test set and it's terrible.

00:58:46.680 | What do you do?

00:58:48.200 | Honestly, you have to go back to square one.

00:58:51.760 | There really isn't any choice other than starting again.

00:58:56.240 | So this is tough.

00:58:57.240 | But it's better to know, right?

00:58:58.240 | Better to know than to not know.

00:58:59.680 | So that's what a test set's for.

00:59:05.120 | So you've got a validation set, what are you going to do with it?

00:59:08.360 | What you're going to do with a validation set is you're going to measure some metrics.

00:59:14.440 | So a metric is something like accuracy.

00:59:16.960 | It's a number that tells you how good is your model.

00:59:21.520 | Now on Kaggle, this is very easy.

00:59:26.560 | What metric should we use?

00:59:28.080 | Well, they tell us.

00:59:29.760 | Go to Overview, click on Evaluation, and find out, and it says, "Oh, we will evaluate

00:59:36.800 | on the Pearson correlation coefficient."

00:59:39.520 | Therefore, this is the metric you care about.

00:59:49.120 | So one obvious question is, is this the same as the loss function?

00:59:52.880 | Is this the thing that we will take the derivative of and find the gradient and use that to improve

00:59:59.160 | our parameters during training?

01:00:01.480 | And the answer is maybe, sometimes, but probably not.

01:00:09.680 | For example, consider accuracy.

01:00:13.520 | Now if we were using accuracy to calculate our derivative and get the gradient, you could

01:00:18.880 | have a model that's actually slightly better, you know, slightly like it's doing a better

01:00:23.200 | job of recognizing dogs and cats, but not so much better that it's actually caused any

01:00:30.240 | incorrectly classified cat to become a dog.

01:00:33.360 | So the accuracy doesn't change at all.

01:00:35.960 | The gradient is zero.

01:00:39.120 | You don't want stuff like that.

01:00:40.200 | You don't want bumpy functions, because they don't have nice gradients.

01:00:44.600 | Often they don't have gradients at all.

01:00:46.080 | They're basically zero nearly everywhere.

01:00:48.680 | You want a function that's nice and smooth, something like, for instance, the average

01:00:55.120 | absolute error, mean absolute error, which we've used before.

01:01:00.320 | So that's the difference between your metrics and your loss.

01:01:03.440 | Now be careful, right, because when you're training, your model's spending all of its

01:01:06.720 | time trying to improve the loss, and most of the time that's not the same as the thing

01:01:11.680 | you actually care about, which is your metric.

01:01:13.520 | So you've got to keep those two different things in mind.

01:01:17.440 | The other thing to keep in mind is that in real life, you can't go to a website and be

01:01:25.880 | told what metric to use.

01:01:28.440 | In real life, the model that you choose, there isn't one number that tells you whether it's

01:01:36.080 | good or bad, and even if there was, you wouldn't be able to find it out ahead of time.

01:01:41.640 | In real life, the model you use is a part of a complex process, often involving humans,

01:01:50.240 | both as users or customers, and as people involved as part of the process.

01:01:58.680 | There's all kinds of things that are changing over time, and there's lots and lots of outcomes

01:02:04.040 | of decisions that are made.

01:02:07.440 | One metric is not enough to capture all of that.

01:02:11.080 | Unfortunately, because it's so convenient to pick one metric and use that to say, "I've

01:02:20.960 | got a good model," that very often finds its way into industry, into government, where

01:02:29.480 | people roll out these things that are good on the one metric that happened to be easy

01:02:33.840 | to measure.

01:02:36.200 | Again and again, we found people's lives turned upside down because of how badly they get

01:02:44.360 | screwed up by models that have been incorrectly measured using a single metric.

01:02:49.800 | My partner, Rachel Thomas, has written this article, which I recommend you read about

01:02:54.200 | the problem with metrics, is a big problem for AI.

01:03:04.240 | It's not just an AI thing.

01:03:05.760 | There's actually this thing called Goodhart's Law that states, "When a measure becomes a

01:03:08.880 | target, it ceases to be a good measure."

01:03:14.880 | When I was a management consultant 20 years ago, we were always part of these strategic

01:03:23.800 | things trying to find key performance indicators and ways to set commission rates for sales

01:03:30.200 | people and we were really doing a lot of this stuff, which is basically about picking metrics.

01:03:36.800 | We see that happen, go wrong in industry all the time.

01:03:41.280 | AI is dramatically worse because AI is so good at optimizing metrics.

01:03:47.640 | That's why you have to be extra, extra, extra careful about metrics when you are trying

01:03:52.880 | to use a model in real life.

01:03:54.760 | Anyway, as I said in Kaggle, we don't have to worry about any of that.

01:04:00.320 | We are just going to use the Pearson correlation coefficient, which is all very well, as long

01:04:04.160 | as you know what the hell the Pearson correlation coefficient is.

01:04:09.440 | If you don't, let's learn about it.

01:04:12.000 | So Pearson correlation coefficient is usually abbreviated using letter R and it's the most

01:04:17.960 | widely used measure of how similar two variables are.

01:04:22.800 | If your predictions are very similar to the real values, then the Pearson correlation

01:04:29.840 | coefficient will be high and that's what you want.

01:04:37.080 | R can be between minus one and one.

01:04:40.640 | Minus one means you predicted exactly the wrong answer, which in a Kaggle competition

01:04:45.200 | would be great because then you can just reverse all of your answers and you'll be perfect.

01:04:50.200 | Minus one means you got everything exactly correct.

01:04:55.600 | Generally speaking, in courses or textbooks when they teach you about the Pearson correlation

01:04:59.440 | coefficient, at this point they will show you a mathematical function.

01:05:04.480 | I'm not going to do that because that tells you nothing about the Pearson correlation

01:05:07.860 | coefficient.

01:05:08.860 | What we actually care about is not the mathematical function, but how it behaves.

01:05:15.120 | I find most people even who work in data science have not actually looked at a bunch of data

01:05:19.880 | sets to understand how R behaves.

01:05:23.480 | So let's do that right now so that you're not one of those people.

01:05:27.900 | The best way I find to understand how data behaves in real life is to look at real life

01:05:31.680 | data.

01:05:33.680 | So there's a data set, Scikit-learn comes with a number of data sets and one of them

01:05:37.280 | is called California Housing and it's a data set where each row is a district.

01:05:45.960 | And it's kind of demographic information about different districts and about the value of

01:05:53.000 | houses in that district.

01:05:59.560 | I'm not going to try to plot the whole thing because it's too big and this is a very common

01:06:03.280 | question I have from people is how do I plot data sets with far too many points?

01:06:09.320 | The answer is very simple, get less points.

01:06:16.280 | Whatever you see with a thousand points is going to be the same as what you see with

01:06:18.600 | a million points.

01:06:19.600 | There's no reason to plot huge amounts of data generally, just grab a random sample.

01:06:27.160 | Now NumPy has something called CoreCoF to get the correlation coefficient between every

01:06:33.560 | variable and every other variable and it returns a matrix so I can look down here, so for example

01:06:40.560 | here is the correlation coefficient between variable one and variable one which of course

01:06:45.240 | is exactly perfectly 1.0 because variable one is the same as variable one.

01:06:50.600 | Here is the small inverse correlation between variable one and variable two and medium sized

01:06:57.680 | positive correlation between variable one and variable three and so forth.

01:07:01.740 | This is symmetric about the diagonal because the correlation between variable one and variable

01:07:06.040 | eight is the same as the correlation between variable eight and variable one.

01:07:10.600 | So this is a correlation coefficient matrix.

01:07:17.280 | So that's great when we wanted to get a bunch of values all at once.

01:07:20.360 | For the Kaggle competition we don't want that, we just want a single correlation number.

01:07:24.760 | If we just pass in a pair of variables we still get a matrix which is kind of weird,

01:07:32.160 | it's not weird, it's not what we want.

01:07:33.760 | So we should grab one of these.

01:07:35.640 | So when I want to grab a correlation coefficient I'll just return the zeroth row first column.

01:07:42.000 | So that's what Core is, that's going to be our single correlation coefficient.

01:07:46.180 | So let's look at the correlation between two things, for example median income and medium

01:07:54.260 | house value, 0.67, okay is that high, medium, low, how big is that, what does it look like?

01:08:03.460 | So the main thing we need to understand is what these things look like.

01:08:07.500 | So what I suggest we do is we're going to take a 10 minute break, 9 minute break, we'll

01:08:12.560 | come back at half-past and then we're going to look at some examples of correlation coefficients.

01:08:24.700 | Okay welcome back.

01:08:27.540 | So what I've done here is I've created a little function called show correlations, I'm going

01:08:33.160 | to pass in a data frame and a couple of columns as strings, going to grab each of those columns

01:08:37.880 | as series, do a scatter plot and then show the correlation.

01:08:43.760 | So we already mentioned median income and median house valuation of 0.68.

01:08:49.280 | So here it is, here's what 0.68 looks like.

01:08:52.040 | So I don't know if you had some intuition about what you expected, but as you can see

01:08:57.400 | it's still plenty of variation even at that reasonably high correlation.

01:09:09.320 | Also you can see here that visualizing your data is very important if you're working with

01:09:13.460 | this dataset because you can immediately see all these dots along here, that's clearly

01:09:18.920 | truncation, right?

01:09:21.280 | So this is like when it's not until you look at pictures like this that you've got to pick

01:09:24.440 | stuff like this, pictures are great.

01:09:28.520 | Oh, little trick, on the scatter plot I put alpha is 0.5, that creates some transparency.

01:09:36.260 | For these kind of scatter plots that really helps because it kind of creates darker areas

01:09:41.680 | in places where there's lots of dots.

01:09:43.840 | So yeah, alpha in scatter plots is nice.

01:09:47.000 | Okay here's another pair.

01:09:49.800 | So this one's gone down from 0.68 to 0.43 median income versus the number of rooms per house.

01:09:57.360 | As you'd expect, more rooms, it's more income.

01:10:03.840 | But this is a very weird looking thing.

01:10:07.000 | Now you'll find that a lot of these statistical measures like correlation rely on the square

01:10:13.920 | of the difference.

01:10:15.880 | And when you have big outliers like this, the square of the difference goes crazy.

01:10:21.360 | And so this is another place we'd want to look at the data first and say, oh, that's

01:10:25.600 | going to be a bit of an issue.

01:10:28.200 | There's probably more correlation here, but there's a few examples of some houses with

01:10:33.000 | lots and lots of room where people that aren't very rich live.

01:10:36.520 | Maybe these are some kind of shared accommodation or something.

01:10:42.840 | So R is very sensitive to outliers.

01:10:46.600 | So let's get rid of the houses, the houses with 15 rooms or more.

01:10:53.680 | And now you can see it's gone up from 0.43 to 0.68, even though we probably only got

01:11:00.200 | rid of one, two, three, four, five, six, even got rid of seven data points.

01:11:04.960 | So we've got to be very careful of outliers, and that means if you're trying to win a Kaggle

01:11:07.840 | competition where the metric is correlation, and you just get a couple of rows really badly

01:11:14.080 | wrong, then that's going to be a disaster to your score.

01:11:18.520 | So you've got to make sure that you do a pretty good job of every row.

01:11:23.840 | So there's what a correlation of 0.68 looks like.

01:11:27.400 | OK, here's a correlation of 0.34.

01:11:30.840 | And this is kind of interesting, isn't it, because 0.34 sounds like quite a good relationship,

01:11:37.400 | but you almost can't see it.

01:11:40.500 | So this is something I strongly suggest, is if you're working with a new metric, draw some

01:11:45.560 | pictures of a few different levels of that metric to kind of try to get a feel for, like,

01:11:51.600 | what does it mean?

01:11:52.600 | You know, what does 0.6 look like, what does 0.3 look like, and so forth?

01:11:58.440 | And here's an example of a correlation of minus 0.2.

01:12:03.160 | It's a very slight negative slope.

01:12:07.440 | OK, so there's just more of a kind of a general tip of something I like to do when playing

01:12:11.440 | with a new metric, and I recommend you do as well.

01:12:13.840 | I think we've now got a sense of what the correlation feels like.

01:12:17.480 | Now you can go look up the equation on Wikipedia if you're into that kind of thing.

01:12:23.720 | We need to report the correlation after each epoch, because we want to know how our training's

01:12:29.680 | going.

01:12:30.680 | HuggingFace expects you to return a dictionary because it's going to use the keys of the

01:12:38.040 | dictionary to, like, label each metric.

01:12:41.400 | So here's something that gets the correlation and returns it as a dictionary with the label

01:12:48.120 | Pearson.

01:12:49.120 | OK, so we've done metrics, we've done our training validation split.

01:12:56.480 | Oh, we might have actually skipped over the bit where we actually did the split, did I?

01:13:02.960 | I did.

01:13:03.960 | So to actually do the split, in this Kaggle competition, I've got another notebook we'll

01:13:12.000 | look at later where we actually split this properly, but here we're just going to do

01:13:15.920 | a random split, just to keep things simple for now, of 25 percent of the data will be

01:13:21.720 | a validation set.

01:13:24.000 | So if we go ds_train_test_split, it returns a dataset dict, which has a train and a test.

01:13:33.600 | So that looks a lot like a datasets object in fast.ai, very similar idea.

01:13:42.160 | So this will be the thing that we'll be able to train with.

01:13:44.720 | So it's going to train with this dataset and return the metrics on this dataset.

01:13:49.240 | This is really a validation set, but HuggingFace datasets calls a test.

01:13:58.840 | OK, we're now ready to train our model.

01:14:03.800 | In fast.ai, we use something called a learner.

01:14:06.920 | The equivalent in HuggingFace transformers is called trainer.

01:14:10.520 | So we'll bring that in.

01:14:14.360 | Something we'll learn about quite shortly is the idea of mini-batches and batch sizes.

01:14:19.400 | In short, each time we pass some data to our model for training, it's going to send through

01:14:25.880 | a few rows at a time to the GPU so that it can calculate those in parallel.

01:14:32.960 | Those bunch of rows is called a batch or a mini-batch, and the number of rows is called

01:14:37.920 | the batch size.

01:14:40.120 | So here we're going to set the batch size to 128.

01:14:43.240 | Generally speaking, the larger your batch size, the more it can do in parallel at once,

01:14:47.800 | and it will be faster.

01:14:48.800 | But if you make it too big, you're going to get an out-of-memory error on your GPU.

01:14:54.360 | So it's a bit of trial and error to find a batch size that works.

01:14:57.960 | Epochs we've seen before.

01:15:00.080 | Then we've got the learning rate.

01:15:04.140 | We'll talk in the next lesson, unless we get to this lesson, about a technique to automatically

01:15:11.760 | find or semi-automatically find a good learning rate.

01:15:14.600 | We already know what a learning rate is from the last lesson.

01:15:16.800 | I played around and found one that seems to train quite quickly without falling apart.

01:15:21.680 | So I just tried a few.

01:15:26.800 | Generally I kind of -- if I don't have a -- so Hacking-Face Transformers doesn't have something

01:15:33.320 | to help you find the learning rate, the integration we're doing in Fast AI will let you do that.

01:15:38.560 | But if you're using a framework that doesn't have that, you can just start with a really

01:15:42.120 | low learning rate and then kind of double it and keep doubling it until it falls apart.

01:15:51.520 | Hacking-Face Transformers uses this thing called training arguments, which is a class

01:15:54.880 | we just provide all of the kind of configuration.

01:15:58.480 | So you have to tell it what your learning rate is.

01:16:05.520 | This stuff here is the same as what we call basically fit one cycle in Fast AI.

01:16:10.920 | You always want this to be true because it's going to be faster, pretty much.

01:16:16.360 | And then this stuff here you can probably use exactly the same every time.

01:16:19.920 | There's a lot of boilerplate compared to Fast AI, as you see.

01:16:26.800 | This stuff you can probably use the same every time.

01:16:30.300 | So we now need to create our model. So the equivalent of the vision learner function

01:16:37.980 | that we've used to automatically create a reasonable vision model in Hacking-Face Transformers,

01:16:46.120 | they've got lots of different ones depending on what you're trying to do.

01:16:50.200 | So we're trying to do classification, as we've discussed, of sequences.

01:16:55.320 | So if we call auto model for sequence classification, it will create a model that is appropriate

01:17:00.480 | for classifying sequences from a pre-trained model.

01:17:04.600 | And this is the name of the model that we just did earlier, the DiBerto V3.

01:17:10.840 | It has to know when it adds that random matrix to the end how many outputs it needs to have.

01:17:15.480 | So we have one label, which is the score.

01:17:19.880 | So that's going to create our model.

01:17:21.920 | And then this is the equivalent of creating a learner.

01:17:24.640 | It contains a model and the data, the training data and the test data.

01:17:30.440 | Again, there's a lot more boilerplate here than Fast AI, but you can kind of see the

01:17:34.040 | same basic steps here.

01:17:35.680 | We just have to do a little bit more manually.

01:17:37.920 | But it's nothing too crazy.

01:17:40.840 | So it's going to tokenize it for us using that function.

01:17:43.760 | And then these are the metrics that it will print out each time.

01:17:48.080 | That's that little function we created which returns a dictionary.

01:17:53.360 | At the moment, I find hugging face transformers very verbose.

01:17:56.280 | It spits out lots and lots and lots of text which you can ignore.

01:18:01.000 | And we can finally call train which will spit out much more text again which you can ignore.

01:18:06.560 | And as you can see, as it trains, it's printing out the loss.

01:18:11.180 | And here's our Pearson correlation coefficient.

01:18:15.080 | So it's training.

01:18:16.480 | And we've got a 0.834 correlation.

01:18:19.120 | That's pretty cool, right?

01:18:20.120 | And it took five minutes to run, maybe that's five minutes per epoch on Kaggle which doesn't

01:18:28.520 | have particularly great GPUs, but good for free.

01:18:32.760 | And we've got something that has got a very high level of correlation in assessing how

01:18:39.360 | similar the two columns are.

01:18:41.880 | And the only reason it could do that is because it used a pre-trained model, right?

01:18:46.680 | There's no way you could just have that tiny amount of information and figure out whether

01:18:50.600 | those two columns are very similar.

01:18:54.180 | This pre-trained model already knows a lot about language.

01:18:57.240 | It already has a good sense of whether two phrases are similar or not.

01:19:01.200 | And we've just fine-tuned it.

01:19:02.960 | You can see, given that after one epoch, it was already at 0.8, you know, this was a model

01:19:08.240 | that already did something pretty close to what we needed.

01:19:11.640 | It didn't really need that much extra tuning for this particular task.

01:19:18.080 | We've got any questions here, John?

01:19:23.440 | Yeah, we do.

01:19:25.080 | It's actually a bit back on the topic before where you were showing us the visual interpretation

01:19:30.280 | of the Pearson coefficient and you were talking about outliers.

01:19:33.080 | Yeah.

01:19:34.080 | And we've got a question here from Kevin asking, how do you decide when it's OK to remove outliers?

01:19:41.360 | Like you pointed out something in that data set.

01:19:45.560 | And clearly, your model is going to train a lot better if you clean that up.

01:19:50.360 | But I think Kevin's point here is, you know, those kinds of outliers will probably exist

01:19:56.720 | in the test set as well.

01:19:58.040 | So I think he's just looking for some practical advice on how you handle that in a more general

01:20:02.040 | sense.

01:20:05.560 | So outliers should never just be removed, like for modeling.

01:20:16.540 | So if we take the example of the California housing data set, you know, if I was really

01:20:21.000 | working with that data set in real life, I would be saying, oh, that's interesting.

01:20:25.440 | It seems like there's a separate group of districts with a different kind of behavior.

01:20:29.640 | My guess is that they're going to be kind of like dorms or something like that, you know,

01:20:32.880 | probably low income housing.

01:20:35.800 | And so I would be saying like, oh, clearly, from looking at this data set, these two different

01:20:40.680 | groups can't be treated the same way.

01:20:42.640 | They have very different behaviors.

01:20:43.640 | And I would probably split them into two separate analyses.

01:20:48.720 | You know, the word outlier, it kind of exists in a statistical sense, right?

01:20:59.160 | There can be things that are well outside our normal distribution and mess up our kind

01:21:02.840 | of metrics and things.

01:21:04.500 | It doesn't exist in a real sense.

01:21:06.320 | It doesn't exist in a sense of like, oh, things that we should like ignore or throw away.

01:21:12.560 | You know, some of the most useful kind of insights I've had in my life in data projects

01:21:18.640 | has been by digging into outliers, so-called outliers, and understanding, well, what are

01:21:24.760 | they?

01:21:25.760 | Where did they come from?

01:21:28.280 | And it's kind of often in those edge cases that you discover really important things

01:21:33.640 | about like where processes go wrong or about, you know, kinds of behaviors you didn't even

01:21:38.800 | know existed, or indeed about, you know, kind of labeling problems or process problems,

01:21:44.400 | which you really want to fix them at the source, because otherwise when you go into production,

01:21:48.760 | you're going to have more of those so-called outliers.

01:21:52.440 | So yeah, I'd say never delete outliers without investigating them and having a strategy for

01:22:02.280 | like understanding where they came from and like, what should you do about them?

01:22:06.200 | All right.

01:22:08.800 | So now that we've got a trained model, you'll see that it actually behaves really a lot

01:22:14.720 | like a fast AI learner.

01:22:16.160 | And hopefully the impression you'll get from going through this process is largely a sense

01:22:22.400 | of familiarity.

01:22:23.400 | It's like, oh, yeah, this looks like stuff I've seen before, you know, like a bit more

01:22:28.960 | wordy and some slight changes, but it really is very, very similar to the way we've done

01:22:34.240 | it before.

01:22:35.240 | Because now that we've got a trained trainer rather than learner, we can call predict.

01:22:41.360 | And now we're going to pass in our data set from the Kaggle test file.

01:22:48.400 | And so that's going to give us our predictions, which we can cast to float.

01:22:55.160 | And here they are.

01:22:57.160 | So here are the predictions we made of similarity.

01:23:02.640 | Now, again, not just for your inputs, but also for your outputs, always look at them.

01:23:10.160 | Always.

01:23:11.160 | Right?

01:23:12.160 | And interestingly, I looked at quite a few Kaggle notebooks from other people for this

01:23:17.600 | competition.

01:23:19.400 | And nearly all of them had the problem we have right now, which is negative predictions

01:23:26.000 | and predictions over one.

01:23:28.760 | So I'll be showing you how to fix this in a more proper way, maybe hopefully in the

01:23:34.720 | next lesson.

01:23:37.240 | But for now, you know, we could at least just round these off, right?

01:23:42.120 | Because we know that none of the scores are going to be bigger than one or smaller than

01:23:45.360 | zero.

01:23:46.360 | But our correlation coefficient will definitely improve if we at least round this up to zero

01:23:50.440 | and round this down to one.

01:23:53.040 | As I say, there are better ways to do this, but that's certainly better than nothing.

01:23:57.600 | So in PyTorch, you might remember from when we looked at Relu, there's a thing called

01:24:02.400 | Clip.

01:24:03.400 | And that will clip everything under zero to zero and everything over one to one.

01:24:08.680 | And so now that looks much better.

01:24:13.080 | So here's our predictions.

01:24:15.600 | So Kaggle expects submissions to generally be in a CSV file.

01:24:19.920 | And Hackingface datasets kind of looks a lot like pandas, really.

01:24:24.600 | We can create our submission file with our two columns, call.csv.

01:24:31.760 | And there we go.

01:24:36.240 | That's basically it.

01:24:41.120 | So yeah, you know, it's kind of nice to see how -- you know, in a sense, how far deep learning

01:24:51.120 | has come since we started this course a few years ago that nowadays, you know, there are

01:24:56.840 | multiple libraries around to kind of do the same thing.

01:25:00.480 | We can, you know, use them in multiple application areas.

01:25:05.320 | They all look kind of pretty familiar.

01:25:07.480 | They're reasonably beginner-friendly.

01:25:10.840 | And NLP, because it's kind of like the most recent area that's really become effective

01:25:19.840 | in the last year or two, is probably where the biggest opportunities are for, you know,

01:25:27.000 | big wins both in research and commercialization.

01:25:33.120 | And so if you're looking to build a startup, for example, one of the key things that VCs

01:25:37.200 | look for, you know, that they'll ask is like, "Well, why now?"

01:25:41.840 | You know, "Why would you build this company now?"

01:25:44.000 | And of course, you know, with NLP, the answer is really simple.

01:25:46.680 | It's like -- it can often be like, "Well, until last year, this wasn't possible," you

01:25:52.320 | know, or "It took ten times more time," or "It took ten times more money," or whatever.

01:25:58.280 | So I think NLP is a huge opportunity area.

01:26:03.600 | Okay, so it's worth thinking about both use and misuse of modern NLP.

01:26:17.600 | And I want to show you a subreddit.

01:26:20.840 | Here is a conversation on a subreddit from a couple of years ago.

01:26:24.480 | I'll let you have a quick read of it.

01:26:34.920 | So the question I want you to be thinking about is what subreddit do you think this

01:26:38.920 | comes from, this debate about military spending?

01:26:46.200 | And the answer is it comes from a subreddit that posts automatically generated conversations

01:26:50.240 | between GPT-2 models.

01:26:54.840 | Now this is like a totally previous generation of model.

01:26:58.480 | They're much, much better now.

01:27:00.640 | So even then, you could see these models were generating context-appropriate, believable

01:27:06.720 | pros.

01:27:12.680 | You know, I would strongly believe that like any of our kind of like aperture of competent

01:27:20.400 | fast AI alumni would be fairly easily able to create a bot which could create context-appropriate

01:27:27.640 | pros on Twitter or Facebook groups or whatever, you know, arguing for a side of an argument.

01:27:37.160 | And you can scale that up such that 99% of Twitter was these bots and nobody would know,

01:27:43.600 | you know, nobody would know.

01:27:46.520 | And that's very worrying to me because a lot of, you know, a lot of kind of the way people

01:27:57.000 | see the world is now really coming out of their social media conversations, which at

01:28:02.120 | this point, they're controllable.

01:28:03.880 | Like it would not be that hard to create something that's kind of optimized towards moving a

01:28:11.120 | point of view amongst a billion people, you know, in a very subtle way, very gradually

01:28:16.760 | over a long period of time by multiple bots, each pretending to argue with each other and

01:28:21.720 | one of them getting the upper hand and so forth.

01:28:29.200 | Here is the start of an article in The Guardian, which I'll let you read.

01:28:52.880 | This article was, you know, quite long, these are just the first few paragraphs.

01:28:57.280 | And at the end, it explains that this article was written by GPT-3.

01:29:01.320 | It was given the instruction, "Please write a short op-ed around 500 words, keep the language

01:29:05.400 | simple and concise, focus on why humans have nothing to fear from AI."

01:29:11.120 | So GPT-3 produced eight outputs and then they say basically the editors at The Guardian did

01:29:18.560 | about the same level of editing that they would do for humans.

01:29:21.720 | In fact, they found it a bit less editing required than humans.

01:29:25.860 | So, you know, again, like you can create longer pieces of context appropriate prose designed

01:29:33.440 | to argue a particular point of view.

01:29:36.880 | What kind of things might this be used for?

01:29:40.200 | You know, we won't know probably for decades if ever, but sometimes we get a clue based

01:29:45.280 | on older technology.

01:29:46.960 | Here's something from back 2017 and the pre kind of deep learning NLP days.

01:29:54.160 | There were millions of submissions to the FTC about the net neutrality situation in

01:30:00.640 | America, very, very heavily biased towards the point of view of saying we want to get

01:30:08.240 | rid of net neutrality.

01:30:11.520 | An analysis by Jeff Kao showed that something like 99% of them, and in particular nearly

01:30:17.920 | all of the ones which were pro removal net neutrality were clearly auto generated by

01:30:24.840 | basically if you look at the green, there's like selecting from a menu, so we've got Americans

01:30:30.840 | as opposed to Washington bureaucrats deserve to enjoy the services they desire.

01:30:35.160 | Individuals as opposed to Washington bureaucrats should be just people like me as opposed to

01:30:38.960 | so-called experts should be, and you get the idea.

01:30:41.320 | Now this is an example of a very, very, you know, simple approach to auto generating huge

01:30:48.920 | amounts of text.

01:30:49.920 | We don't know for sure, but it looks like this might have been successful because this

01:30:55.320 | went through.

01:30:56.320 | You know, despite what seems to be actually overwhelming disagreement from the public

01:31:03.440 | that everybody, almost everybody likes net neutrality, the FTC got rid of it, and this

01:31:09.360 | was a big part of the basis, was like, oh, we got all these comments from the public

01:31:12.840 | and everybody said they don't want net neutrality.

01:31:16.760 | So imagine a similar thing where you absolutely couldn't do this, you couldn't figure it out

01:31:22.040 | because everyone was really very compelling and very different, that's, you know, it's

01:31:28.320 | kind of worrying about how we deal with that.

01:31:33.200 | I will say when I talk about this stuff, often people say, oh, no worries, we'll be able

01:31:36.840 | to model to recognize, you know, bot generated content, but, you know, if I put my black

01:31:46.100 | hat on, I'm like, nah, that's not going to work, right?

01:31:49.240 | If you told me to build something that beats the bot classifiers, I'd say, no worries, easy.

01:31:56.120 | You know, I will take the code or the service or whatever that does the bot classifying,

01:32:01.920 | and I will include beating that in my loss function, and I will fine-tune my model until

01:32:06.760 | it beats the bot classifier, you know.

01:32:09.480 | When I used to run an email company, we had a similar problem with spam prevention, you

01:32:14.760 | know, spammers could always take a spam prevention algorithm and change their emails until it

01:32:20.900 | didn't get the spam prevention algorithm anymore, for example.

01:32:26.600 | So yes, I'm really excited about the opportunities for students in this course to build, you

01:32:37.040 | know, I think very valuable businesses, really cool research, and so forth using these pretty

01:32:44.720 | new NLP techniques that are now pretty accessible, and I'm also really worried about the things

01:32:49.660 | that might go wrong.

01:32:50.660 | I do think, though, that the more people that understand these capabilities, the less chance

01:32:55.320 | they'll go wrong.

01:32:56.640 | John, was there some questions?

01:32:59.240 | Yeah, I mean, it's a throwback to the workbook that you had before.

01:33:05.480 | Yeah, that's the one.

01:33:09.600 | The question Manakandan is asking, shouldn't num labels be 5, 0, 0.25, 0.5, 0.751 instead

01:33:20.360 | of 1?

01:33:21.600 | Is the target a categorical, or are we considering this as a regression problem?

01:33:25.960 | Yeah, it's a good question.

01:33:27.640 | So there's one label because there's one column.

01:33:33.920 | Even if this was being treated as a categorical problem with five categories, it's still considered

01:33:39.080 | one label.

01:33:42.080 | In this case, though, we're actually treating it as a regression problem.

01:33:48.040 | It's just one of the things that's a bit tricky.

01:33:50.160 | I was trying to figure this out just the other day.

01:33:52.300 | It's not documented as far as I can tell on the Hugging Phase Transformer's website.

01:33:56.520 | But if you pass in one label to auto model for sequence classification, it turns it into

01:34:01.520 | a regression problem, which is actually why we ended up with predictions that were less

01:34:06.700 | than 0 and bigger than 1.

01:34:09.040 | So we'll be learning next time about the use of sigmoid functions to resolve this problem,

01:34:16.680 | and that should fix it up for us.

01:34:21.120 | OK, great.

01:34:22.600 | Well thanks, everybody.

01:34:23.600 | I hope you enjoyed learning about NLP.

01:34:26.760 | As much as I enjoyed putting this together, I'm really excited about it, and can't wait

01:34:30.800 | for next week's lesson.

01:34:32.640 | See ya.

01:34:33.400 | [APPLAUSE]

Lesson 4: Practical Deep Learning for Coders 2022

Chapters