Back to Index

Lesson 4: Practical Deep Learning for Coders 2022


Chapters

0:00 Using Hugging Face
3:24 Finetuning pretrained model
5:14 ULMFiT
9:15 Transformer
10:52 Zeiler & Fergus
14:47 US Patent Phrase to Phrase Matching Kaggle competition
16:10 NLP Classification
20:56 Kaggle configs, insert python in bash, read competition website
24:51 Pandas, numpy, matplotlib, & pytorch
29:26 Tokenization
33:20 Hugging Face model hub
36:40 Examples of tokenized sentences
38:47 Numericalization
41:13 Question: rationale behind how input data was formatted
43:20 ULMFit fits large documents easily
45:55 Overfitting & underfitting
50:45 Splitting the dataset
52:31 Creating a good validation set
57:13 Test set
59:00 Metric vs loss
61:27 The problem with metrics
64:10 Pearson correlation
70:27 Correlation is sensitive to outliers
74:0 Training a model
79:20 Question: when is it ok to remove outliers?
82:10 Predictions
85:30 Opportunities for research and startups
86:16 Misusing NLP
93:00 Question: isn’t the target categorical in this case?

Transcript

Hi, everybody, and welcome to Practical Deep Learning for Coders Lesson 4, which I think is the lesson that a lot of the regulars in the community have been most excited about because it's where we're going to get some totally new material, a totally new topic we've never covered before. We're going to cover natural language processing, NLP, and you'll find there is indeed a chapter about that in the book, but we're going to do it in a totally different way to how it's done in the book.

In the book, we do NLP using the FastAI library using recurrent neural networks, RNNs. Today we're going to do something else, which is we're going to do transformers. And we're not even going to use the FastAI library at all, in fact. So what we're going to be doing today is we're going to be fine-tuning a pre-trained NLP model using a library called Hugging Face Transformers.

Now given this is the Fast.AI course, you might be wondering why we'd be using a different library other than FastAI. The reason is that I think that it's really useful for everybody to have experience and practice of using more than one library, because you'll get to see the same concepts applied in different ways.

And I think that's great for your understanding of what these concepts are. Also I really like the Hugging Face Transformers library. It's absolutely the state of the art in NLP, and it's well worth knowing. If you're watching this on video, by the time you're watching it, we will probably have completed our integration of the transformers library into FastAI, so it's in the process of becoming the main NLP foundation for FastAI.

So you'll be able to combine transformers and FastAI together. So I think there's a lot of benefits to this, and in the end you're going to know how to do NLP in a really fantastic library. Now the other thing is Hugging Face Transformers doesn't have the same layered architecture that FastAI has, which means, particularly for beginners, the kind of high level, top tier API that you'll be using most of the time is not as ready to go for beginners as you're used to from FastAI.

And so that's actually, I think, a good thing. You're up to lesson four. You know the basic idea now of how gradient descent works and how parameters are learned as part of a flexible function. I think you're ready to try using a somewhat lower level library that does a little bit less for you.

So it's going to be a little bit more work. It's a very well-designed library, and it's still reasonably high level, but you're going to learn to go a little bit deeper. And that's kind of how the rest of the course in general is going to be on the whole, is we're going to get a bit deeper and a bit deeper and a bit deeper.

Now so first of all, let's talk about what we're going to be doing with fine-tuning a pre-trained model. We've talked about that in passing before, but we haven't really been able to describe it in any detail because you haven't had the foundations. Now you do. You played with these sliders last week, and hopefully you've all actually gone into this notebook and dragged them around and tried to get an intuition for this idea of moving them up and down, makes the loss go up and down, and so forth.

So I mentioned that your job was to move these sliders to get this as nice as possible, but when it was given to you, the person who gave it to you said, "Oh, actually slider A, that should be on 2.0." We know for sure. And slider B, we think it's like around two and a half.

Slider C, we've got no idea. Now that would be pretty helpful, wouldn't it, because you could immediately start focusing on the one we have no idea about, get that in roughly the right spot, and then the one you've kind of got a vague idea about, you could just tune it a little bit, and the one that they said was totally confident you wouldn't move at all, you would probably tune these sliders really quickly.

That's what a pre-trained model is. A pre-trained model is a bunch of parameters that have already been fit, where some of them are already pretty confident of what they should be, and some of them we really have no idea at all. And so fine-tuning is the process of taking those ones we have no idea what they should be at all and trying to get them right, and then moving the other ones a little bit.

The idea of fine-tuning a pre-trained NLP model in this way was pioneered by an algorithm called ULMFiT, which was first presented actually in a fast.ai course, I think the very first fast.ai course. It was later turned into an academic paper by me in conjunction with a then PhD student named Sebastian Ruder, who's now one of the world's top NLP researchers, and went on to help inspire a huge change, a huge step improvement in NLP capabilities around the world, along with a number of other important innovations at the time.

This is the basic process that ULMFiT described. Step one was to build something called a language model using basically nearly all of Wikipedia. And what the language model did was it tried to predict the next word of a Wikipedia article, in fact every next word of every Wikipedia article.

Doing that is very difficult. There are Wikipedia articles which would say things like the 17th prime number is dot, dot, dot, or the 40th president of the United States, Blah, said at his residence, Blah, that. Filling in these kinds of things requires understanding a lot about how language is structured and about the world and about math and so forth.

So to get good at being a language model, a neural network has to get good at a lot of things. It has to understand how language works at a reasonably good level, and it needs to understand what it's actually talking about, and what is actually true, what is actually not true, and the different ways in which things are expressed, and so forth.

So this was trained using a very similar approach to what we'll be looking at for fine tuning, but it started with random weights, and at the end of it there was a model that could predict more than 30% of the time correctly what the next word of a Wikipedia article would be.

So in this particular case for the ULMFiT paper, we then took that and -- the first task I did, actually for the fast.ai course back when I invented this, was to try and figure out whether IMDB movie reviews were positive or negative sentiment. Did the person like the movie or not?

So what I did was I created a second language model. So again, the language model here is something that predicts the next word of a sentence. But rather than using Wikipedia, I took this pre-trained model that was trained on Wikipedia, and I ran a few more epochs using IMDB movie reviews.

So it got very good at predicting the next word of an IMDB movie review. And then finally, I took those weights and I fine-tuned them for the task of predicting whether or not a movie review was positive or negative sentiment. So those were the three steps. This is a particularly interesting approach because this very first model-- in fact, the first two models-- if you think about it, they don't require any labels.

They didn't have to collect any kind of document categories or do any kind of surveys or collect anything. All I needed was the actual text of Wikipedia and the movie reviews themselves, because the label was simply: what's the next word of the sentence? Now since we built ULMFiT -- and we used RNNs, recurrent neural networks, for this -- at about the same time-ish that we released this, a new kind of architecture, particularly useful for NLP at the time, was developed, called transformers.

And transformers were particularly built because they can take really good advantage of modern accelerators like Google's TPUs. They didn't really allow you to predict the next word of a sentence. It's just not how they're structured for reasons we'll talk about probably in part two of the course. So they threw away the idea of predicting the next word of a sentence.

And then instead, they did something just as good and pretty clever. They took kind of chunks of Wikipedia or whatever text they're looking at and deleted at random a few words and asked the model to predict what were the words that were deleted, essentially. So it's a pretty similar idea.

Other than that, the basic concept was the same as ULMFiT. They replaced our RNN approach with a transformer model. They replaced our language model approach with what's called a masked language model. But other than that, the basic idea was the same. So today, we're going to be looking at models using what's become the much more popular approach than ULMFiT, which is this transformers masked language model approach.

OK. John, do we have any questions? And I should mention, we do have a professor from the University of Queensland, John Williams, joining us, who will be asking the highest voted questions from the community. What do you got, John? Yeah. Thanks, Jeremy. And we might be jumping the gun here.

I suspect this is where you're going tonight. But we've got a good question here on the forum, which is, how do you go from a model that's trained to predict the next word to a model that can be used for classification? Sure. So, yeah, we will be getting into that in more detail.

And in fact, maybe a good place to start would be the next slide, kind of give you a sense of this. You might remember in lesson one, we looked at this fantastic Zeiler and Fergus paper, where we looked at visualizations of the first layer of an ImageNet classification model.

And layer one had sets of weights that found diagonal edges, and here are some examples of bits of photos that successfully matched them; likewise for opposite diagonal edges and color gradients, and here are some examples of bits of pictures that matched those. And then layer two combined those, and now you know how those were combined, right?

These were rectified linear units that were added together, and then sets of those rectified linear units -- the outputs of those are called activations -- were then themselves run through a matrix multiply and a rectified linear unit, and added together. So now you don't just have edge detectors: layer two had corner detectors.

And here's some examples of some corners that that corner detector successfully found. And remember, these were not engineered in any way, they just evolved from the gradient descent training process. Layer two had examples of circle detectors, as it turns out. And skipping a bit, by the time we got to layer five, we had bird and lizard eyeball detectors, and dog face detectors, and flower detectors, and so forth.

Nowadays you'd have something like a ResNet-50, which is something you'd probably be training pretty regularly in this course, so you've got 50 layers, not just five layers. Now the later layers do things that are much more specific to the training task, which is actually predicting, really, what it is that we're looking at.

The early layers, pretty unlikely you're going to need to change them much, as long as you're looking at some kind of natural photos. You're going to need edge detectors, gradient detectors. So what we do in the fine-tuning process is there's actually one extra layer after this, which is the layer that actually says, what is this?

It's a dog, or a cat, or whatever. You actually delete that; you throw it away. That last matrix multiply has one output per category you're predicting, and we throw that away. So the model now ends with a layer that's spitting out, it depends, but generally a few hundred activations.

What we do, as we'll learn more shortly in the coming lesson, we just stick a new random matrix on the end of that. And that's what we initially train. So it learns to use these kinds of features to predict whatever it is you're trying to predict. And then we gradually train all of those layers.
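
To make that a little more concrete, here's a minimal PyTorch sketch of the idea. It is not the fastai or Transformers implementation (those handle this for you); the torchvision ResNet-50 and the 10-class head are just illustrative assumptions.

```python
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights  # needs a recent torchvision

# Load a model pretrained on ImageNet: its early layers are the edge and
# gradient detectors we almost certainly want to keep.
model = resnet50(weights=ResNet50_Weights.DEFAULT)

# Throw away the old "what is this?" layer and attach a new random matrix
# with one output per category we're predicting (say, 10 categories).
model.fc = nn.Linear(model.fc.in_features, 10)

# Freeze the pretrained layers and train only the new head first...
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True
# ...then later unfreeze everything and gradually train all the layers.
```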

So that's basically how it's done. And so that's a bit hand-wavy, but we'll, particularly in part two, actually build that from scratch ourselves. And in fact, in this lesson, time permitting, we're actually going to start going down the process of actually building a real world neural net in Python.

So we'll be starting to actually make some progress towards that goal. Okay. So let's jump into the notebook. So we're going to look at a Kaggle competition that's actually on, as I speak. And I created this notebook called Getting Started with NLP for Absolute Beginners. And so the competition is called the US Patent Phrase-to-Phrase Matching Competition.

And so I'm going to take you through a complete submission to this competition. And Kaggle competitions are interesting, particularly the ones that are not playground competitions, but the real competitions with real money applied. They're interesting because this is an actual project that an actual organization is prepared to invest money in getting solved using their actual data.

So a lot of people are a bit dismissive of Kaggle competitions as being not very real. And it's certainly true. You're not worrying about stuff like productionizing the model. But in terms of getting real data about a real problem that real organizations really care about and a very direct way to measure the accuracy of your solution, you can't really get better than this.

So this is a good place. It's a good competition to experiment with for trying NLP. Now, as I mentioned here, probably the most widely useful application for NLP is classification. And as we've discussed in computer vision, classification refers to taking an object and trying to identify a category that object belongs to.

So previously, we've mainly been looking at images. Today, we're going to be looking at documents. Now in NLP, when we say document, we don't specifically mean a 20-page long essay. A document could be three or four words, or a document could be the entire encyclopedia. So a document is just an input to an NLP model that contains text.

Now classifying a document, so deciding what category a document belongs to, is a surprisingly rich thing to do. There's all kinds of stuff you could do with that. So for example, we've already mentioned sentiment analysis. That's a classification task. We try to decide on the category, positive or negative sentiment.

Author identification would be taking a document and trying to find the category of author. Legal discovery would be taking documents and putting them into categories according to in or out of scope for a court case. Triaging inbound emails would be putting them into categories of throw away, send to customer service, send to sales, et cetera.

So classification is a very, very rich area. And for people interested in trying out NLP in real life, I would suggest classification would be the place I would start for looking for accessible, real world, useful problems you can solve right away. Now the Kaggle competition does not immediately look like a classification competition.

What it contains, let me show you some data. What it contains is data that looks like this. It has a thing that they call anchor, a thing they call target, a thing they call context and a score. Now these are, I can't remember exactly how it is, but I think these are from patents.

And I think on the patents there are various things they have to fill in in the patent. One of those things is called anchor. One of those things is called target. And in the competition, the goal is to come up with a model that automatically determines which anchor and target pairs are talking about the same thing.

So a score of one here, wood article and wooden article obviously talking about the same thing. A score of zero here, abatement and forest region not talking about the same thing. So the basic idea is that we're trying to guess the score. And it's kind of a classification problem, kind of not.

We're basically trying to classify things into either these two things are the same or these two things aren't the same. It's kind of not because we have not just one and zero, but also 0.25, 0.5 and 0.75. There's also a column called context, which I believe is like the category that this patent was filed in.

And my understanding is that whether the anchor and the target count as similar or not depends on what the patent was filed under. So how would we take this and turn it into something like a classification problem? So the suggestion I make here is that we could basically say, OK, let's put some constant string like text one or field one before the first column, and then something else like text two before the second column.

Maybe also the context I should have as well, text three in the context, and then try to choose a category of meaning similarity, different, similar or identical. So we could basically concatenate those three pieces together, call that a document, and then try to train a model that can predict these categories.

That would be an example of how we can take this basically similarity problem and turn it into something that looks like a classification problem. And we tend to do this a lot in deep learning is we kind of take problems that look a bit novel and different and turn them into a problem that looks like something we recognize.

So on Kaggle, this is a larger data set that you're going to need a GPU to run. So you can click on the Accelerator button and choose GPU to make sure that you're using a GPU. If you click Copy and Edit on my document, I think that will happen for you automatically.

Personally, I like using things like PaperSpace generally better than Kaggle. Kaggle's pretty good, but you only get 30 hours a week of GPU time, and the notebook editor for me is not as good as the real JupyterLab environment. So there's some information here I won't go through, but it basically describes how you can download stuff to PaperSpace or your own computer as well if you want to.

So I basically always create this little boolean in my notebooks called iskaggle, which is going to be true if it's running on Kaggle and false otherwise, and for any little changes I need to make, I say if iskaggle and put those changes there. So you can see here, if I'm not on Kaggle and I don't have the data yet, then download it.

And Kaggle has a little API, which is quite handy for doing stuff like downloading data and uploading notebooks and stuff like that, submitting to competitions. If we are on Kaggle, then the data's already going to be there for us, which is actually a good reason for beginners to use Kaggle is you don't have to worry about grabbing the data at all.

It's sitting there for you as soon as you open the notebook. Kaggle has a lot of Python packages installed, but not necessarily all the ones you want. And at the point I wrote this, they didn't have HuggingFace's datasets package for some reason, so you can always just install stuff.

So you might remember the exclamation mark means this is not a Python command, but a shell command, a bash command. But it's quite neat. You can even put bash commands inside Python conditionals. So that's a pretty cool little trick in notebooks. Another cool little trick in notebooks is that if you do use a bash command, like ls, but you then want to insert the contents of a Python variable, just chuck it in curly braces.

So I've got a Python variable called path, and I can go ls {path}, with path in curly braces, and that will ls the contents of the Python variable path. So there's another little trick for you. So when we ls that, we can see that there's some CSV files. So what I'm going to do is take you through roughly the process I go through when I first look at a competition.
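
Putting those tricks together, the setup cell looks roughly like this. The dataset path, the KAGGLE_KERNEL_RUN_TYPE environment variable, and the datasets package are assumptions based on what this particular notebook needs; adjust for your own setup.

```python
import os
from pathlib import Path

# True-ish when running inside a Kaggle kernel, empty string otherwise
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

if iskaggle:
    path = Path('../input/us-patent-phrase-to-phrase-matching')
    # a bash command inside a Python conditional
    !pip install -q datasets
else:
    # off Kaggle you'd download the data yourself, e.g. via the Kaggle API
    path = Path('us-patent-phrase-to-phrase-matching')

# insert a Python variable into a bash command with curly braces
!ls {path}
```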

So the first thing is, all right, here's the dataset, what's in it? It's got some CSV files. As well as looking at it here, the other thing I would do is go to the competition website, and if you go to Data -- a lot of people skip over this, which is a terrible idea -- it actually tells you what the dependent variable means, what the different files are, what the columns are, and so forth.

So don't just rely on looking at the data itself, but look at the information that you're given about the data. So for the CSV files: CSV stands for comma-separated values, so they're just text files with a comma between each field. But we can read them using pandas, which by convention is always imported as pd.

Pandas is one of, I guess, probably four key libraries that you have to know to do data science in Python, and specifically, those four libraries are NumPy, Matplotlib, Pandas, and PyTorch. So NumPy is what we use for basic numerical programming, Matplotlib we use for plotting, Pandas we use for tables of data, and PyTorch we use for deep learning.

Those are all covered in a fantastic book by the author of pandas, Python for Data Analysis, the new edition of which is actually available for free, I believe. So if you're not familiar with these libraries, just read the whole book. It doesn't take too long to get through, and it's got lots of cool tips, and it's very readable.

I do find, with a lot of people doing this course, often I see people trying to jump ahead, wanting to be like, oh, I want to know how to create a new architecture, or build a speech recognition system, or whatever. But it then turns out that they don't know how to use these fundamental libraries.

So it's always good to be bold and be trying to build things, but do also take the time to make sure you finish reading the fast.ai book and read at least Wes McKinney's book. That would be enough to really give you all the basic knowledge you need, I think.

So with pandas, we can read a CSV file, and that creates something called a DataFrame, which is just a table of data, as you see. So now that we've got a DataFrame, we can see what we're working with. And when in Jupyter we just put the name of a variable containing a DataFrame, we get the first five rows, the last five rows, and the size, so we've got 36,473 rows.
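
For reference, that cell is just something like this (path being the variable set up earlier):

```python
import pandas as pd

df = pd.read_csv(path/'train.csv')
df  # as the last expression in a cell, this shows the head, tail, and shape
```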

So other things I like to use for understanding a data frame is the describe method. If you pass include equals object, that will describe basically all the string fields, the non-numeric fields. So in this case, there's four of those. And so you can see here that that anchor field we looked at, there's actually only 733 unique values.

So you can see that there's lots of repetition out of those 36,000 rows. This is the most common one; it appears 152 times. And then context, we also see lots of repetition; there's 106 of those contexts. So this is a nice little method: we can see a lot about the data at a glance.
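
That one-liner, for reference:

```python
# summarise the non-numeric (string) columns: count, unique, top, freq
df.describe(include='object')
```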

And when I first saw this in this competition, I thought, well, this is actually not that much language data when you think about it. Each document is very short, three or four words, really, and lots of it is repeated. So as I'm looking through it, I'm thinking, what are some key features of this data set?

And that would be something I'd be thinking, wow, we've got to do a lot with not very much unique data here. So here's how we can just go ahead and create a single string like I described, which contains some kind of field separator, plus the context, the target, and the anchor.

So we're going to pop that into a field called input. Something slightly weird in pandas is that there's two ways of referring to a column. You can use square brackets and a string to get the input column, or you can just treat it as an attribute. When you're setting a column, you should always use the square-bracket form, as in the sketch below.
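
Here's roughly what that looks like; the exact marker strings ("TEXT1: " and so on) are arbitrary assumptions, for the reasons discussed a bit later.

```python
# concatenate the fields into a single document, with markers so the model
# can tell where one field ends and the next begins
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor

df.input.head()  # reading a column via the attribute form
```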

When reading it, you can use either; I tend to use the attribute form because it's less typing. So you can see now we've got these concatenated rows, and head shows the first few of them. So we've now got some documents to do NLP with. Now the problem is, as you know from the last lesson, neural networks work with numbers.

We're going to take some numbers, and we're going to multiply them by matrices. We're going to replace the negatives with zeros and add them up, and we're going to do that a few times. That's our neural network, with some little wrinkles, but that's the basic idea. So how on earth do we do that for these strings?

So there's basically two steps we're going to take. The first step is to split each of these into tokens. Tokens are basically words. We're going to split it into words. There's a few problems with splitting things into words, though. The first is that some languages, like Chinese, don't have words, or at least certainly not space-separated words, and in fact, in Chinese, sometimes it's a bit fuzzy to even say where a word begins and ends.

Some words are kind of not even -- the pieces are not next to each other. Another reason is that what we're going to be doing is after we've split it into words, or something like words, we're going to be getting a list of all of the unique words that appear, which is called the vocabulary.

And every one of those unique words is going to get a number. As you'll see later on, the bigger the vocabulary, the more memory is going to get used, the more data we'll need to train. In general, we don't want a vocabulary to be too big. So instead, nowadays, people tend to tokenize into something called subwords, which is pieces of words.

So I'll show you what it looks like. So the process of turning it into smaller units, like words, is called tokenization, and we call them tokens instead of words. The token is just the more general concept of whatever we're splitting it into. So we're going to get Hugging Face Transformers and Hugging Face Datasets doing our work for us.

And so what we're going to do is we're going to turn our pandas DataFrame into a Hugging Face Datasets Dataset. It's a bit confusing: PyTorch has a class called Dataset, and Hugging Face has a class called Dataset, and they're different things. So this is a Hugging Face Dataset, a Hugging Face Datasets Dataset.

So we can turn a DataFrame into a Dataset just using the from_pandas method. And so we've now got a Dataset. So if we take a look, it just tells us, all right, it's got these features. And remember, input is the one we just created with the concatenated strings.
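
That conversion is a one-liner (the features listed are just whatever columns the DataFrame had):

```python
from datasets import Dataset

ds = Dataset.from_pandas(df)
ds  # shows the feature names and num_rows
```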

And here's those 36,000 rows. So now we're going to do these two things, tokenization, which is to split each text up into tokens, and then numericalization, which is to turn each token into its unique ID based on where it is in the vocabulary. The vocabulary, remember, being the list of unique tokens.

Now particularly in this stage, tokenization, there's a lot of little decisions that have to be made. The good news is you don't have to make them, because whatever pre-trained model you used, the people that pre-trained it made some decisions, and you're going to have to do exactly the same thing, otherwise you'll end up with a different vocabulary to them, and that's going to mess everything up.

So that means before you start tokenizing, you have to decide on what model to use. Hugging Face Transformers is a lot like timm. It has a library of, I believe, hundreds of models. I guess I shouldn't say Hugging Face Transformers; it's really the Hugging Face model hub: 44,000 models, so many more even than timm's image models.

And so these models, they vary in a couple of ways. There's a variety of different architectures, just like in timm, but then something which is different to timm is that each of those architectures can be trained on different corpora for solving different problems. So for example, I could type patent, and see if there's any pre-trained patent model, and there is.

So there's a whole lot of pre-trained patent models, isn't that amazing? So quite often, thanks to the hugging face model hub, you can start your pre-trained model with something that's actually pretty similar to what you actually want to do, or at least was trained on the same kind of documents.

Having said that, there are some just generally pretty good models that work for a lot of things a lot of the time, and DeBERTa v3 is certainly one of those. This is a very new area; NLP has been practically really effective for general users for only a year or two, whereas for computer vision it's been quite a while.

So you'll find that a lot of things aren't quite as well bedded down. I don't have a picture to show you of which models are the best or the fastest or the most accurate or whatever, right? A lot of this stuff is stuff that we're figuring out as a community using competitions like this, in fact, and this is one of the first NLP competitions in the kind of modern NLP era.

So we've been studying these competitions closely, and I can tell you that DeBERTa is actually a really good starting point for a lot of things, so that's why we've picked it. It's really cool. And just like in timm for images, you know, a model's often going to come in a small, a medium, and a large, and of course we should start with small, right, because small is going to be faster to train, and we're going to be able to do more iterations and so forth, okay.

So at this point, remember, the only reason we picked our model is because we have to make sure we tokenize in the same way. To tell Transformers that we want to tokenize the same way that the people who built the model did, we use something called AutoTokenizer. It's nothing fancy.

It's basically just a dictionary which says, oh, which model uses which tokenizer. So when we say AutoTokenizer.from_pretrained, it will download the vocabulary and the details about how this particular model tokenized its dataset. So at this point, we can now take that tokenizer and pass a string to it.
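
A sketch of that step; the exact checkpoint id here is my assumption of the small DeBERTa v3 model on the hub, so check the model hub for the name you actually want.

```python
from transformers import AutoTokenizer

model_nm = 'microsoft/deberta-v3-small'
tokz = AutoTokenizer.from_pretrained(model_nm)  # downloads the vocab and tokenizer rules

tokz.tokenize("G'day folks, I'm Jeremy from fast.ai!")
```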

So if I pass the string "G'day folks, I'm Jeremy from fast.ai", you'll see it's kind of putting it into words, kind of not. So if you've ever wondered whether g'day is one word or two, you know, it's actually three tokens according to this tokenizer, and I'm is three tokens, and fast.ai is three tokens.

This punctuation is a token, and so you kind of get the idea. These underscores here, that represents the start of a word, right. So that's kind of, there's this concept that, like, the start of a word is kind of part of the token. So if you see a capital I in the middle of a word versus the start of a word, that kind of means a different thing.

So this is what happens when we tokenize this sentence using the tokenizer that the DeBERTa v3 developers used. So here's a less common, unless you're a big platypus fan like me, less common sentence: "A platypus is an ornithorhynchus anatinus." And so, okay, in this particular vocabulary, platypus got its own word, its own token, but ornithorhynchus didn't.

And so I still remember in grade one, for some reason, our teacher got us all to learn how to spell ornithorhynchus, so it's one of my favorite words. So you can see here, it's been split into sub-word pieces: or, ni, tho, rhynch, us. So every one of these tokens you see here is going to be in the vocabulary, right?

The list of unique tokens that was created when this particular model, this pre-trained model was first trained. So somewhere in that list, we'll find underscore capital A, and it'll have a number. And so that's how we'll be able to turn these into numbers. So this first process is called tokenization, and then the thing where we take these tokens and turn them into numbers is called numericalization.

So our dataset, remember we put our string into the input field. So here's a function that takes a document, grabs its input, and tokenizes it. Okay, so we'll call this our tokenization function. Tokenization can take a minute or two, so we may as well get all of our processes doing it at the same time to save some time.

So if you use dataset.map, it will parallelize that process; just pass in your function. Make sure you pass batched=True so it can do a bunch at a time. Behind the scenes, this is going through something called the Tokenizers library, which is a pretty optimized Rust library that uses SIMD and parallel processing and so forth.
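
In code, that's roughly (assuming the tokz and ds objects from the earlier sketches):

```python
def tok_func(x):
    # tokenize (and numericalize) the "input" field of a batch of rows
    return tokz(x["input"])

# batched=True lets the fast Rust tokenizers process many rows at once
tok_ds = ds.map(tok_func, batched=True)
```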

So with batched=True, it'll be able to do more stuff at once. So look, it only took six seconds, so pretty fast. So now when we look at a row of our tokenized dataset, it's going to contain exactly the same as our original dataset. No, sorry, it's not going to contain exactly the same as our original dataset.

It's going to contain exactly the same input as our original dataset, and it's also going to contain a bunch of numbers. These numbers are the position in the vocabulary of each of the tokens in the string. So we've now successfully turned a string into a list of numbers. So that is a great first step.

So we can see how this works. We can see, for example, that we've got "of" as a separate word. That's going to be an underscore "of" in the vocabulary. We can grab the vocabulary, look up "of", find that it's 265, and check here, yep, here it is, 265. So it's not rocket science, right?

It's just looking stuff up in a dictionary to get the numbers. So that is the tokenization and numericalization necessary in NLP to turn our documents into numbers and allow us to put them into our model. Any questions so far, John? >> Excuse me, yeah, thanks, Jeremy. So there's a couple, and this seems like a good time to throw them out, and it's related to how you've formatted your input data into these sentences that you've just tokenized.

So one question was really about how you choose those keywords and the order of the fields that you -- so I guess just interested in an explanation, is it more art or science, how you -- >> No, it's arbitrary. I tried a few things. I tried "X," I tried putting them backwards, doesn't matter.

We just want some way, something that it can learn from. So if I just concatenated it without these headers before each one, it wouldn't know where abatement of pollution ended and where abatement started, right? So I did just something that it can learn from. This is a nice thing about neural nets, they're so flexible.

As long as you give it the information somehow, it doesn't really matter how you give it the information as long as it's there, right? I could have used punctuation, I could have put like, I don't know, one semicolon here and two here and three here, yeah, it's not a big deal.

Like, at the level where you're like trying to get an extra half a percent to get up the later board or Kaggle competition, you may find tweaking these things makes tiny differences, but in practice, you won't generally find it matters too much. >> Right, thank you. And I guess the second part of that, excuse me again, somebody's asking if one of their fields was particularly long, say it was a thousand characters, is there any special handling required there?

Do you need to re-inject those kinds of special marker tokens? Does it change if you've got much bigger fields that you're trying to learn and query? >> Long documents in ULMFiT require no special consideration. IMDB, in fact, has multi-thousand word movie reviews and it works great. To this day, ULMFiT is probably the best approach, you know, for reasonably quickly and easily using large documents.

Otherwise, if you use transformer-based approaches, large documents are challenging; specifically, transformers basically have to do the whole document at once, whereas ULMFiT can split it into multiple pieces and read it gradually. And so that means you'll find that people trying to work with large documents tend to spend a lot of money on GPUs, because they need the big fancy ones with lots of memory.

So generally speaking, I would say if you're trying to do stuff with documents of over 2,000 words, you might want to look at ULMFiT. Try transformers, see if it works for you, but I'd certainly try both. For under 2,000 words, transformers should be fine unless you've got nothing but like a laptop GPU or something with not much memory.

So Hugging Face Transformers has these, you know, as I record this right now, somewhat obscure and not particularly well-documented expectations about your data that you kind of have to figure out. And one of those is that it expects that your target is a column called labels.

So once I figured that out, I just went and got our tokenized dataset and renamed our score column to labels, and everything started working. So, you know, I don't know if at some point they'll make this a bit more flexible, but for now it's probably best to just call your target labels and life will be easy.
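
That rename is a one-liner on the tokenized dataset:

```python
# Transformers expects the target column to be called "labels"
tok_ds = tok_ds.rename_columns({'score': 'labels'})
```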

You might have seen back when I went LS path that there was another data set there called test.csv. And if you look at it, it looks a lot like our training set, our other CSV that we've been working with, but it's missing the score, the labels. This is called a test set.

And so we're going to talk a little bit about that now because my claim here is that perhaps the most important idea in machine learning is the idea of having separate training, validation, and test data sets. So test and validation sets are all about identifying and controlling for something called overfitting, and we're going to try and learn about this through example.

So this is the same information that's in that Kaggle notebook; I've just put it on some slides here. So I'm going to create a function here called plot_poly, and I'm actually going to use the same data that, I don't know if you remember, we used earlier for trying to fit this quadratic.

We created some x and some y data. This is the data we're going to use, and we're going to use it to look at overfitting. So the details of this function don't matter too much. What matters is what we do with it, which is that it allows us to basically pass in the degree of a polynomial.

So for those of you that remember, a first-degree polynomial is just a line: y = ax + b. A second-degree polynomial will be y = ax² + bx + c. A third-degree polynomial will have a cubic term, a fourth-degree a quartic, and so forth. And what I've done here is I've plotted what happens if we try to fit a line to our data.

It doesn't fit very well. So what happened here is we did a linear regression, and what we're using here is a very cool library called scikit-learn. Scikit-learn is something that, you know, I think it'd be fair to say it's mainly designed for kind of classic machine learning methods, like kind of linear regression and stuff like that.

Very advanced versions of these things, but it's also great for doing these quick and dirty things. So in this case, I wanted to do what's called a polynomial regression, which is fitting the polynomial to data, and it's just these two lines of code. It's a super nice library. So in this case, a degree one polynomial is just a line.
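
If you want to play along, a minimal sketch of that kind of polynomial fit with scikit-learn looks like this; the toy data here is just my stand-in for the notebook's x and y.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# noisy quadratic data, standing in for the x, y used earlier in the course
rng = np.random.default_rng(42)
x = np.linspace(-2, 2, 20)
y = 3*x**2 + 2*x + 1 + rng.normal(0, 1.5, len(x))

def fit_poly(degree):
    # the "two lines": expand x into polynomial features, then do a linear regression
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    return model.fit(x[:, None], y)

underfit, overfit = fit_poly(1), fit_poly(10)  # a line vs a 10-degree polynomial
```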

So I fit it, and then I show it with the data, and there it is. Now that's what we call underfit, which is to say there's not enough kind of complexity in this model I fit to match the data that's there. So an underfit model is a problem. It's going to be systematically biased.

All the stuff up here, we're going to be predicting too low. All the stuff down here, we're predicting too low. All the stuff in the middle, we're predicting too high. A common misunderstanding is that simpler models are more reliable in some way, but models that are too simple will be systematically incorrect, as you see here.

What happens if we fit a 10-degree polynomial? That's not great either. In this case, it's not really showing us what the actual relationship is -- remember, this is originally a quadratic, so this is meant to match that -- and particularly at the ends here, it's predicting things that are way above what we would expect in real life.

And it's trying really hard to get through this point, but clearly this point was just some noise. So this is what we call overfit. It's done a good job of fitting to our exact data points, but if we sample some more data points from this distribution, honestly, we probably would suspect they're not going to be very close to this, particularly if they're a bit beyond the edges.

So that's what overfitting looks like. We don't want underfitting or overfitting. Now, underfitting is actually pretty easy to recognize, because we can look at our training data and see that the fit is not very close. Overfitting is a bit harder to recognize, because on the training data the fit is actually very close.

Now, on the other hand, here's what happens if we fit the quadratic. And here I've got both the real line and the fit line, and you can see they're pretty close. And that's, of course, what we actually want. So how do we tell whether we have something more like this or something more like this?

Well, what we do is we do something pretty straightforward, is we take our original dataset, these points, and we remove a few of them, let's say 20% of them. We then fit our model using only those points we haven't removed. And then we measure how good it is by looking at only the points we removed.

So in this case, let's say we had removed, I'm just trying to think, if I'd removed this point here, then it might have kind of gone off down over here. And so then when we look at how well it fits, we would say, oh, this one's miles away. The data that we take away and don't let the model see it when it's training is called the validation set.

So in fast AI, we've seen splitters before, right? The splitters are the things that separate out the validation set. Fast AI won't let you train a model without a validation set. Fast AI always shows you your metrics, so things like accuracy, measured only on the validation set. This is really unusual.

Most libraries make it really easy to shoot yourself in the foot by not having a validation set or accidentally not using it correctly. So fastai won't even let you do that. So you've got to be particularly careful when using other libraries. Hugging Face Transformers is good about this; they make sure that they do show you your metrics on a validation set.

Now creating a good validation set is not generally as simple as just randomly pulling some rows out of the data that you train your model with. To see why, imagine that this was the data you were trying to fit something to, and you randomly remove some, so it looks like this.

That looks very easy, doesn't it? Because you've kind of like still got all the data you would want around the points. And in a time series like this, this is dates and sales, in real life you're probably going to want to predict future dates. So if you created your validation set by randomly removing stuff from the middle, it's not really a good indication of how you're going to be using this model.

Instead you should truncate and remove the last couple of weeks. So if this was your validation set and this is your training set, that's going to be actually testing whether you can use this to predict the future rather than using it to predict the past. Kaggle competitions are a fantastic way to test your ability to create a good validation set.

Because Kaggle competitions generally only allow you to submit a couple of times a day. The dataset that you are scored on in the leaderboard during that time is actually only a small subset. In fact, it's a totally separate subset to the one you'll be scored on at the end of the competition.

And so most beginners on Kaggle overfit. And it's not until you've done it that you will get that visceral feeling of, oh my god, I overfit. In the real world, outside of Kaggle, you will often not even know that you overfit. You just silently destroy value for your organization.

So it's a really good idea to do this kind of stuff on Kaggle a few times first in real competitions to really make sure that you are confident you know how to avoid overfitting, how to find a good validation set, and how to interpret it correctly. And you really don't get that until you screw it up a few times.

Good example of this was there was a distracted driver competition on Kaggle. There were these kind of pictures from inside a car. And the idea was that you had to try and predict whether somebody was driving in a distracted way or not. And on Kaggle, they did something pretty smart.

The test set, so the thing that they scored you on the leaderboard, contained people that didn't exist at all in the competition data that you train the model with. So if you wanted to create an effective validation set in this competition, you would have to make sure that you separated the photos so that your validation set contained photos of people that aren't in the data you're training your model on.

There was another one like that, the Kaggle fisheries competition, which had boats that didn't appear. So they were basically pictures of boats and you're meant to try to guess, predict what fish were in the pictures. And it turned out that a lot of people accidentally figured out what the fish were by looking at the boat because certain boats tended to catch certain kinds of fish.

And so by messing up their validation set, they were really overconfident of the accuracy of their model. I'll mention in passing, if you've been around Kaggle a bit, you'll see people talk about cross validation a lot. I'm just going to mention, be very, very careful. Cross validation is explicitly not about building a good validation set, so you've got to be super, super careful if you ever do that.

Another thing I'll mention is that scikit-learn conveniently offers something called train_test_split, as does Hugging Face Datasets, and in fastai we have something called RandomSplitter. It can almost feel like these libraries are encouraging you to use a randomized validation set, because there are these methods that do it for you.

But yeah, be very, very careful, because very, very often that's not what you want. So if you want what a validation set is, so that's the bit that you pull out of your data that you don't train with, but you do measure your accuracy with, so what's a test set?

It's basically another validation set, but you don't even use it for tracking your accuracy while you build your model. Why not? Well, imagine you tried two new models every day for three months. That's how long a Kaggle competition goes for. So you would have tried 180 models, and then you look at the accuracy on the validation set for each one.

Some of those models, you would have got a good accuracy on the validation set potentially because of pure chance, just a coincidence, and then you get all excited and you submit that to Kaggle, and you think you're going to win the competition, and you mess it up. And that's because you actually overfit using the validation set.

So you actually want to know whether you've really found a good model or not. So in fact, on Kaggle, they have two test sets. They've got the one that gives you feedback on the leaderboard during the competition and a second test set, which you don't get to see until after the competition is finished.

So in real life, you've got to be very careful about this, not to try so many models during your model-building process that you accidentally find one that's good by coincidence. And only if you have a test set that you've held out will you know that. Now that leads to the obvious question, which is very challenging: you spent three months working on a model, it worked well on your validation set, you did a good job of locking that test set away in a safe so you weren't allowed to use it, and at the end of the three months, you finally checked it on the test set and it's terrible.

What do you do? Honestly, you have to go back to square one. There really isn't any choice other than starting again. So this is tough. But it's better to know, right? Better to know than to not know. So that's what a test set's for. So you've got a validation set, what are you going to do with it?

What you're going to do with a validation set is you're going to measure some metrics. So a metric is something like accuracy. It's a number that tells you how good is your model. Now on Kaggle, this is very easy. What metric should we use? Well, they tell us. Go to Overview, click on Evaluation, and find out, and it says, "Oh, we will evaluate on the Pearson correlation coefficient." Therefore, this is the metric you care about.

So one obvious question is, is this the same as the loss function? Is this the thing that we will take the derivative of and find the gradient and use that to improve our parameters during training? And the answer is maybe, sometimes, but probably not. For example, consider accuracy. Now if we were using accuracy to calculate our derivative and get the gradient, you could have a model that's actually slightly better, you know, slightly like it's doing a better job of recognizing dogs and cats, but not so much better that it's actually caused any incorrectly classified cat to become a dog.

So the accuracy doesn't change at all. The gradient is zero. You don't want stuff like that. You don't want bumpy functions, because they don't have nice gradients. Often they don't have gradients at all. They're basically zero nearly everywhere. You want a function that's nice and smooth, something like, for instance, the average absolute error, mean absolute error, which we've used before.

So that's the difference between your metrics and your loss. Now be careful, right, because when you're training, your model's spending all of its time trying to improve the loss, and most of the time that's not the same as the thing you actually care about, which is your metric. So you've got to keep those two different things in mind.

The other thing to keep in mind is that in real life, you can't go to a website and be told what metric to use. In real life, the model that you choose, there isn't one number that tells you whether it's good or bad, and even if there was, you wouldn't be able to find it out ahead of time.

In real life, the model you use is a part of a complex process, often involving humans, both as users or customers, and as people involved as part of the process. There's all kinds of things that are changing over time, and there's lots and lots of outcomes of decisions that are made.

One metric is not enough to capture all of that. Unfortunately, because it's so convenient to pick one metric and use that to say, "I've got a good model," that very often finds its way into industry, into government, where people roll out these things that are good on the one metric that happened to be easy to measure.

Again and again, we've found people's lives turned upside down because of how badly they get screwed up by models that have been incorrectly measured using a single metric. My partner, Rachel Thomas, has written this article, which I recommend you read, about how the problem with metrics is a big problem for AI.

It's not just an AI thing. There's actually this thing called Goodhart's Law that states, "When a measure becomes a target, it ceases to be a good measure." When I was a management consultant 20 years ago, we were always part of these strategic things trying to find key performance indicators and ways to set commission rates for sales people and we were really doing a lot of this stuff, which is basically about picking metrics.

We see that happen, go wrong in industry all the time. AI is dramatically worse because AI is so good at optimizing metrics. That's why you have to be extra, extra, extra careful about metrics when you are trying to use a model in real life. Anyway, as I said in Kaggle, we don't have to worry about any of that.

We are just going to use the Pearson correlation coefficient, which is all very well, as long as you know what the hell the Pearson correlation coefficient is. If you don't, let's learn about it. So Pearson correlation coefficient is usually abbreviated using letter R and it's the most widely used measure of how similar two variables are.

If your predictions are very similar to the real values, then the Pearson correlation coefficient will be high and that's what you want. R can be between minus one and one. Minus one means you predicted exactly the wrong answer, which in a Kaggle competition would be great because then you can just reverse all of your answers and you'll be perfect.

Plus one means you got everything exactly correct. Generally speaking, in courses or textbooks when they teach you about the Pearson correlation coefficient, at this point they will show you a mathematical function. I'm not going to do that, because that tells you nothing about the Pearson correlation coefficient. What we actually care about is not the mathematical function, but how it behaves.

I find most people even who work in data science have not actually looked at a bunch of data sets to understand how R behaves. So let's do that right now so that you're not one of those people. The best way I find to understand how data behaves in real life is to look at real life data.

So there's a data set, Scikit-learn comes with a number of data sets and one of them is called California Housing and it's a data set where each row is a district. And it's kind of demographic information about different districts and about the value of houses in that district. I'm not going to try to plot the whole thing because it's too big and this is a very common question I have from people is how do I plot data sets with far too many points?

The answer is very simple: use fewer points. Whatever you see with a thousand points is going to be the same as what you see with a million points. There's no reason to plot huge amounts of data generally; just grab a random sample. Now NumPy has something called corrcoef to get the correlation coefficient between every variable and every other variable, and it returns a matrix, so I can look down here. So for example, here is the correlation coefficient between variable one and variable one, which of course is exactly, perfectly 1.0, because variable one is the same as variable one.

Here is the small inverse correlation between variable one and variable two and medium sized positive correlation between variable one and variable three and so forth. This is symmetric about the diagonal because the correlation between variable one and variable eight is the same as the correlation between variable eight and variable one.

So this is a correlation coefficient matrix. So that's great when we want to get a bunch of values all at once. For the Kaggle competition we don't want that; we just want a single correlation number. If we just pass in a pair of variables, we still get a matrix, which is not what we want.

So we should grab one of these. So when I want to grab a correlation coefficient, I'll just return the zeroth row, first column. So that's what corr is; that's going to be our single correlation coefficient. So let's look at the correlation between two things, for example median income and median house value: 0.67. Okay, is that high, medium, low? How big is that, what does it look like?
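
Here's a sketch of that, using scikit-learn's California Housing data; the column names (MedInc, MedHouseVal) are whatever fetch_california_housing provides, and the exact numbers will depend on the random sample.

```python
import numpy as np
from sklearn.datasets import fetch_california_housing

# one row per district; sample 1,000 rows so plots stay readable
housing = fetch_california_housing(as_frame=True).frame.sample(1000, random_state=52)

np.set_printoptions(precision=2, suppress=True)
np.corrcoef(housing, rowvar=False)   # correlation matrix: every column vs every other

def corr(x, y):
    # corrcoef on a pair still returns a 2x2 matrix; grab the single number we want
    return np.corrcoef(x, y)[0][1]

corr(housing.MedInc, housing.MedHouseVal)  # roughly 0.67
```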

So the main thing we need to understand is what these things look like. So what I suggest we do is we're going to take a 10 minute break, 9 minute break, we'll come back at half-past and then we're going to look at some examples of correlation coefficients. Okay welcome back.

So what I've done here is I've created a little function called show_corr; I'm going to pass in a DataFrame and a couple of columns as strings, grab each of those columns as series, do a scatter plot, and then show the correlation. So we already mentioned median income and median house value, a correlation of 0.68.
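
A minimal version of that helper, reusing the corr function and housing frame from the sketch above:

```python
import matplotlib.pyplot as plt

def show_corr(df, a, b):
    # scatter plot of two columns, with their correlation coefficient in the title
    x, y = df[a], df[b]
    plt.scatter(x, y, alpha=0.5)   # alpha adds transparency so dense areas look darker
    plt.title(f'{a} vs {b}; r: {corr(x, y):.2f}')
    plt.show()

show_corr(housing, 'MedInc', 'MedHouseVal')
```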

So here it is, here's what 0.68 looks like. So I don't know if you had some intuition about what you expected, but as you can see there's still plenty of variation even at that reasonably high correlation. Also you can see here that visualizing your data is very important if you're working with this dataset, because you can immediately see all these dots along here: that's clearly truncation, right?

It's not until you look at pictures like this that you pick up on that kind of thing; pictures are great. Oh, a little trick: on the scatter plot I set alpha to 0.5, which creates some transparency. For these kinds of scatter plots that really helps, because it creates darker areas in places where there are lots of dots.

So yeah, alpha in scatter plots is nice. Okay, here's another pair. This one's gone down from 0.68 to 0.43: median income versus the number of rooms per house. As you'd expect, more rooms generally goes with more income. But this is a very weird-looking plot. Now, you'll find that a lot of these statistical measures, like correlation, rely on the square of the difference.

And when you have big outliers like this, the square of the difference goes crazy. So this is another place where we'd want to look at the data first and say, oh, that's going to be a bit of an issue. There's probably more correlation here, but there are a few examples of houses with lots and lots of rooms where people who aren't very rich live.

Maybe these are some kind of shared accommodation or something. So r is very sensitive to outliers. Let's get rid of the houses with 15 rooms or more. And now you can see it's gone up from 0.43 to 0.68, even though we only got rid of, let's count, one, two, three, four, five, six, maybe seven data points.
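
As a sketch, dropping those few extreme rows before re-checking the correlation could look like this, again using the illustrative names from the earlier sketches; AveRooms is scikit-learn's average-rooms column.

```python
# Keep only districts with fewer than 15 average rooms per house, then re-plot.
trimmed = subset[subset.AveRooms < 15]
show_corr(trimmed, 'MedInc', 'AveRooms')
```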

So we've got to be very careful of outliers, and that means if you're trying to win a Kaggle competition where the metric is correlation, and you just get a couple of rows really badly wrong, then that's going to be a disaster to your score. So you've got to make sure that you do a pretty good job of every row.

So there's what a correlation of 0.68 looks like. OK, here's a correlation of 0.34. And this is kind of interesting, isn't it, because 0.34 sounds like quite a good relationship, but you almost can't see it. So this is something I strongly suggest, is if you're working with a new metric, draw some pictures of a few different levels of that metric to kind of try to get a feel for, like, what does it mean?

You know, what does 0.6 look like, what does 0.3 look like, and so forth? And here's an example of a correlation of minus 0.2: it's a very slight negative slope. OK, so that's just a general tip, something I like to do when playing with a new metric, and I recommend you do it as well.

I think we've now got a sense of what the correlation feels like. Now you can go look up the equation on Wikipedia if you're into that kind of thing. We need to report the correlation after each epoch, because we want to know how our training's going. HuggingFace expects you to return a dictionary because it's going to use the keys of the dictionary to, like, label each metric.

So here's something that gets the correlation and returns it as a dictionary with the label Pearson. OK, so we've done metrics; we've done our training/validation split. Oh, we might have actually skipped over the bit where we actually did the split. Did I? I did. So to actually do the split: in this Kaggle competition, I've got another notebook we'll look at later where we split this properly, but here we're just going to do a random split to keep things simple for now, with 25 percent of the data as the validation set.
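
Here's a rough sketch of what that metric function and the random split might look like; tok_ds stands for the tokenized dataset from earlier, corr is the NumPy helper sketched above, and the seed is an arbitrary choice, so treat the names as illustrative rather than the exact lesson code.

```python
def corr_d(eval_pred):
    # The Trainer passes (predictions, labels); the dictionary key becomes the metric's label.
    preds, labels = eval_pred
    return {'pearson': corr(preds.reshape(-1), labels)}

# Random 25% validation split; returns a DatasetDict with 'train' and 'test' splits.
dds = tok_ds.train_test_split(test_size=0.25, seed=42)
```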

So if we call train_test_split, it returns a DatasetDict, which has a train and a test. That looks a lot like a Datasets object in fast.ai; it's a very similar idea. So this will be the thing that we'll be able to train with: it's going to train with this dataset and return the metrics on this dataset.

This is really a validation set, but Hugging Face Datasets calls it a test set. OK, we're now ready to train our model. In fast.ai, we use something called a Learner; the equivalent in Hugging Face Transformers is called a Trainer, so we'll bring that in. Something we'll learn about quite shortly is the idea of mini-batches and batch sizes.

In short, each time we pass some data to our model for training, it's going to send a few rows at a time to the GPU so that it can calculate on them in parallel. That bunch of rows is called a batch, or a mini-batch, and the number of rows is called the batch size.

So here we're going to set the batch size to 128. Generally speaking, the larger your batch size, the more it can do in parallel at once, and it will be faster. But if you make it too big, you're going to get an out-of-memory error on your GPU. So it's a bit of trial and error to find a batch size that works.

Epochs we've seen before. Then we've got the learning rate. We'll talk in the next lesson (unless we get to it in this one) about a technique to automatically, or semi-automatically, find a good learning rate. We already know what a learning rate is from the last lesson. I played around and found one that seems to train quite quickly without falling apart.

So I just tried a few. Hugging Face Transformers doesn't have something to help you find the learning rate (the integration we're doing in fastai will let you do that), but if you're using a framework that doesn't have that, you can just start with a really low learning rate and then keep doubling it until it falls apart.
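
Here's a rough illustration of that doubling approach; make_trainer is a hypothetical helper (not from the lesson) that builds a fresh Trainer with the given learning rate, and the candidate values are arbitrary.

```python
# Try doubling learning rates and keep the largest one that still trains smoothly.
candidate_lrs = [1e-6 * 2 ** i for i in range(8)]  # 1e-6, 2e-6, ..., 1.28e-4
for lr in candidate_lrs:
    trainer = make_trainer(lr)       # hypothetical helper that rebuilds the Trainer
    out = trainer.train()            # train briefly; watch for the loss blowing up
    print(lr, out.training_loss)
```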

Hugging Face Transformers uses a class called TrainingArguments, where we provide all of the configuration. So you have to tell it what your learning rate is. This stuff here is basically the same as what we call fit one cycle in fastai. You always want this to be true, because it's going to be faster, pretty much.
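
A sketch of that TrainingArguments setup: the batch size of 128 is from above, while the epoch count and learning rate here are just placeholder values, and exact argument names can vary slightly between transformers versions.

```python
from transformers import TrainingArguments

bs = 128          # batch size discussed above
epochs = 4        # placeholder value
lr = 8e-5         # placeholder value found by trial and error

args = TrainingArguments(
    'outputs',
    learning_rate=lr,
    warmup_ratio=0.1,
    lr_scheduler_type='cosine',      # roughly the fit-one-cycle-style schedule
    fp16=True,                       # mixed precision: almost always faster on a GPU
    evaluation_strategy='epoch',     # report metrics (our Pearson r) after each epoch
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs * 2,
    num_train_epochs=epochs,
    weight_decay=0.01,
    report_to='none',                # silence external experiment loggers
)
```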

And then this stuff here you can probably use exactly the same every time. There's a lot of boilerplate compared to fastai, as you can see. So we now need to create our model. The equivalent of the vision_learner function that we've used to automatically create a reasonable vision model: in Hugging Face Transformers, there are lots of different auto classes, depending on what you're trying to do.

So we're trying to do classification of sequences, as we've discussed. If we call AutoModelForSequenceClassification, it will create a model that is appropriate for classifying sequences, starting from a pre-trained model. And this is the name of the model that we used earlier, the DeBERTa v3 one.

It has to know when it adds that random matrix to the end how many outputs it needs to have. So we have one label, which is the score. So that's going to create our model. And then this is the equivalent of creating a learner. It contains a model and the data, the training data and the test data.

Again, there's a lot more boilerplate here than Fast AI, but you can kind of see the same basic steps here. We just have to do a little bit more manually. But it's nothing too crazy. So it's going to tokenize it for us using that function. And then these are the metrics that it will print out each time.

That's that little function we created which returns a dictionary. At the moment, I find Hugging Face Transformers very verbose; it spits out lots and lots of text, which you can ignore. And we can finally call train, which will spit out much more text again, which you can also ignore.
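
Putting those pieces together, a sketch of the model, the Trainer, and the training call might look like this; model_nm and tokz stand for the pretrained model name and tokenizer from the earlier tokenization step, and dds, args and corr_d are from the sketches above.

```python
from transformers import AutoModelForSequenceClassification, Trainer

# One output ("label") because we're predicting a single score per row.
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)

trainer = Trainer(
    model, args,
    train_dataset=dds['train'],
    eval_dataset=dds['test'],   # Hugging Face calls the validation split "test"
    tokenizer=tokz,             # used so batches get padded consistently
    compute_metrics=corr_d,     # the little dictionary-returning metric function
)

trainer.train()
```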

And as you can see, as it trains, it's printing out the loss. And here's our Pearson correlation coefficient. So it's training. And we've got a 0.834 correlation. That's pretty cool, right? And it took five minutes to run, maybe that's five minutes per epoch on Kaggle which doesn't have particularly great GPUs, but good for free.

And we've got something that has got a very high level of correlation in assessing how similar the two columns are. And the only reason it could do that is because it used a pre-trained model, right? There's no way you could just have that tiny amount of information and figure out whether those two columns are very similar.

This pre-trained model already knows a lot about language. It already has a good sense of whether two phrases are similar or not. And we've just fine-tuned it. You can see, given that after one epoch, it was already at 0.8, you know, this was a model that already did something pretty close to what we needed.

It didn't really need that much extra tuning for this particular task. Have we got any questions here, John? Yeah, we do. It's actually a bit back on the topic from before, where you were showing us the visual interpretation of the Pearson coefficient and you were talking about outliers. Yeah. And we've got a question here from Kevin asking: how do you decide when it's OK to remove outliers?

Like you pointed out something in that data set. And clearly, your model is going to train a lot better if you clean that up. But I think Kevin's point here is, you know, those kinds of outliers will probably exist in the test set as well. So I think he's just looking for some practical advice on how you handle that in a more general sense.

So outliers should never just be removed, like for modeling. So if we take the example of the California housing data set, you know, if I was really working with that data set in real life, I would be saying, oh, that's interesting. It seems like there's a separate group of districts with a different kind of behavior.

My guess is that they're going to be kind of like dorms or something like that, you know, probably low income housing. And so I would be saying like, oh, clearly, from looking at this data set, these two different groups can't be treated the same way. They have very different behaviors.

And I would probably split them into two separate analyses. You know, the word outlier, it kind of exists in a statistical sense, right? There can be things that are well outside our normal distribution and mess up our kind of metrics and things. It doesn't exist in a real sense.

It doesn't exist in the sense of, oh, things that we should ignore or throw away. You know, some of the most useful insights I've had in my life in data projects have come from digging into outliers, so-called outliers, and understanding: well, what are they? Where did they come from?

And it's kind of often in those edge cases that you discover really important things about like where processes go wrong or about, you know, kinds of behaviors you didn't even know existed, or indeed about, you know, kind of labeling problems or process problems, which you really want to fix them at the source, because otherwise when you go into production, you're going to have more of those so-called outliers.

So yeah, I'd say never delete outliers without investigating them and having a strategy for like understanding where they came from and like, what should you do about them? All right. So now that we've got a trained model, you'll see that it actually behaves really a lot like a fast AI learner.

And hopefully the impression you'll get from going through this process is largely a sense of familiarity. It's like, oh, yeah, this looks like stuff I've seen before, you know, like a bit more wordy and some slight changes, but it really is very, very similar to the way we've done it before.

Because now that we've got a trained trainer rather than learner, we can call predict. And now we're going to pass in our data set from the Kaggle test file. And so that's going to give us our predictions, which we can cast to float. And here they are. So here are the predictions we made of similarity.

Now, again, not just for your inputs, but also for your outputs, always look at them. Always. Right? And interestingly, I looked at quite a few Kaggle notebooks from other people for this competition. And nearly all of them had the problem we have right now, which is negative predictions and predictions over one.

So I'll be showing you how to fix this in a more proper way, maybe hopefully in the next lesson. But for now, you know, we could at least just round these off, right? Because we know that none of the scores are going to be bigger than one or smaller than zero.

But our correlation coefficient will definitely improve if we at least clip these: round anything below zero up to zero, and anything above one down to one. As I say, there are better ways to do this, but it's certainly better than nothing. So in PyTorch, you might remember from when we looked at ReLU, there's a thing called clip.

And that will clip everything under zero to zero and everything over one to one. And so now that looks much better. So here are our predictions. Kaggle generally expects submissions to be in a CSV file, and Hugging Face Datasets looks a lot like pandas, really. We can create our submission file with our two columns and call to_csv.
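
A sketch of those last steps: predicting on the Kaggle test set, clipping into [0, 1] (NumPy's clip does the same job as the PyTorch clip mentioned above), and writing the CSV. Here eval_ds stands for the tokenized test dataset, and the id/score column names follow the competition's sample submission format.

```python
import numpy as np
import pandas as pd

# Predictions come back as an (n, 1) array of floats.
preds = trainer.predict(eval_ds).predictions.astype(float)

# Scores can only be between 0 and 1, so clip anything outside that range.
preds = np.clip(preds, 0, 1)

# Two columns, written without the index, as Kaggle expects.
submission = pd.DataFrame({'id': eval_ds['id'], 'score': preds.reshape(-1)})
submission.to_csv('submission.csv', index=False)
```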

And there we go. That's basically it. So yeah, you know, it's kind of nice to see how -- you know, in a sense, how far deep learning has come since we started this course a few years ago that nowadays, you know, there are multiple libraries around to kind of do the same thing.

We can, you know, use them in multiple application areas. They all look kind of pretty familiar. They're reasonably beginner-friendly. And NLP, because it's kind of like the most recent area that's really become effective in the last year or two, is probably where the biggest opportunities are for, you know, big wins both in research and commercialization.

And so if you're looking to build a startup, for example, one of the key things that VCs look for, you know, that they'll ask is like, "Well, why now?" You know, "Why would you build this company now?" And of course, you know, with NLP, the answer is really simple.

It's like -- it can often be like, "Well, until last year, this wasn't possible," you know, or "It took ten times more time," or "It took ten times more money," or whatever. So I think NLP is a huge opportunity area. Okay, so it's worth thinking about both use and misuse of modern NLP.

And I want to show you a subreddit. Here is a conversation on a subreddit from a couple of years ago. I'll let you have a quick read of it. So the question I want you to be thinking about is what subreddit do you think this comes from, this debate about military spending?

And the answer is that it comes from a subreddit that posts automatically generated conversations between GPT-2 models. Now, this is a totally previous generation of model; they're much, much better now. So even then, you could see these models were generating context-appropriate, believable prose. You know, I strongly believe that any of our competent fast.ai alumni would fairly easily be able to create a bot which could produce context-appropriate prose on Twitter or Facebook groups or whatever, arguing for one side of an argument.

And you can scale that up such that 99% of Twitter was these bots and nobody would know, you know, nobody would know. And that's very worrying to me because a lot of, you know, a lot of kind of the way people see the world is now really coming out of their social media conversations, which at this point, they're controllable.

Like it would not be that hard to create something that's kind of optimized towards moving a point of view amongst a billion people, you know, in a very subtle way, very gradually over a long period of time by multiple bots, each pretending to argue with each other and one of them getting the upper hand and so forth.

Here is the start of an article in The Guardian, which I'll let you read. This article was, you know, quite long, these are just the first few paragraphs. And at the end, it explains that this article was written by GPT-3. It was given the instruction, "Please write a short op-ed around 500 words, keep the language simple and concise, focus on why humans have nothing to fear from AI." So GPT-3 produced eight outputs and then they say basically the editors at The Guardian did about the same level of editing that they would do for humans.

In fact, they found it required a bit less editing than human writers do. So, you know, again, you can create longer pieces of context-appropriate prose designed to argue a particular point of view. What kind of things might this be used for? We won't know, probably for decades, if ever, but sometimes we get a clue based on older technology.

Here's something from back in 2017, in the pre-deep-learning NLP days. There were millions of submissions to the FCC about the net neutrality situation in America, very heavily biased towards the point of view of getting rid of net neutrality. An analysis by Jeff Kao showed that something like 99% of them, and in particular nearly all of the ones which were in favor of removing net neutrality, were clearly auto-generated, basically by selecting phrases from a menu. If you look at the green highlighting, we've got "Americans", as opposed to "Washington bureaucrats", "deserve to enjoy the services they desire".

"Individuals", as opposed to "Washington bureaucrats", "should be"; "just people like me", as opposed to "so-called experts", "should be"; and you get the idea. Now, this is an example of a very, very simple approach to auto-generating huge amounts of text. We don't know for sure, but it looks like this might have been successful, because it went through.

You know, despite what seems to be overwhelming disagreement from the public (almost everybody likes net neutrality), the FCC got rid of it, and these comments were a big part of the basis: like, oh, we got all these comments from the public, and everybody said they don't want net neutrality.

So imagine a similar thing where you absolutely couldn't do that analysis, where you couldn't figure it out because every comment was really compelling and all different from each other. It's kind of worrying to think about how we deal with that. I will say, when I talk about this stuff, often people say, oh, no worries, we'll be able to build a model to recognize bot-generated content. But, you know, if I put my black hat on, I'm like, nah, that's not going to work, right?

If you told me to build something that beats the bot classifiers, I'd say, no worries, easy. You know, I will take the code or the service or whatever that does the bot classifying, and I will include beating that in my loss function, and I will fine-tune my model until it beats the bot classifier, you know.

When I used to run an email company, we had a similar problem with spam prevention: spammers could always take a spam prevention algorithm and change their emails until they got past it. So yes, I'm really excited about the opportunities for students in this course to build very valuable businesses, do really cool research, and so forth using these pretty new NLP techniques that are now pretty accessible, and I'm also really worried about the things that might go wrong.

I do think, though, that the more people who understand these capabilities, the less chance they'll go wrong. John, were there some questions? Yeah, I mean, it's a throwback to the workbook that you had before. Yeah, that's the one. The question Manakandan is asking: shouldn't num_labels be 5 (for 0, 0.25, 0.5, 0.75, 1) instead of 1?

Is the target a categorical, or are we considering this as a regression problem? Yeah, it's a good question. So there's one label because there's one column. Even if this was being treated as a categorical problem with five categories, it's still considered one label. In this case, though, we're actually treating it as a regression problem.

It's just one of the things that's a bit tricky; I was trying to figure this out just the other day. It's not documented, as far as I can tell, on the Hugging Face Transformers website. But if you pass num_labels=1 to AutoModelForSequenceClassification, it turns it into a regression problem, which is actually why we ended up with predictions that were less than 0 and bigger than 1.

So we'll be learning next time about the use of sigmoid functions to resolve this problem, and that should fix it up for us. OK, great. Well thanks, everybody. I hope you enjoyed learning about NLP. As much as I enjoyed putting this together, I'm really excited about it, and can't wait for next week's lesson.

See ya.