Lesson 10: Deep Learning Part 2 2018 - NLP Classification and Translation
Chapters
0:18 Review of Last Week
18:12 Segmentation
18:34 Feature Pyramids
18:57 NLP
21:28 Basic Paths for NLP
27:22 Train Test Split
28:05 Tokenization
43:45 Building a Language Model on Wikipedia
48:07 Create an Embedding Matrix
61:24 Averaging the Weights of Embeddings
71:36 Language Model
76:49 Edit Encoder
78:07 Regularizing and Optimizing LSTM Language Models
79:10 Tie Weights
81:56 Measure Accuracy
85:26 What Is Your Ratio of Paper Reading versus Coding in a Week
88:59 Universal Sentence Encoder
99:38 Add More than One Hidden Layer
109:48 Learning Rate
111:57 Concat Pooling
121:26 Trick Number Two Is To Create Python Scripts
122:55 IMDB Scripts
00:00:00.000 |
So, welcome to lesson 10, or as somebody on the forum described it, lesson 10, mod 7, 00:00:07.600 |
which is probably a clearer way to think about this. 00:00:15.860 |
Before we get to that, let's do a quick review of last week. 00:00:21.900 |
There are quite a few people who have flown here to San Francisco for this course; 00:00:27.320 |
I'm seeing them pretty much every day, they're working full-time on this, and quite a few 00:00:31.880 |
of them are still struggling to understand the material from last week. 00:00:35.280 |
So if you're finding it difficult, that's fine. 00:00:37.640 |
One of the reasons I kind of put it up there up front is so that we've got something to 00:00:42.920 |
concentrate on and think about and gradually work towards, so that by lesson 14, mod 7, 00:00:52.960 |
But there's so many pieces, so hopefully you can keep developing better understanding. 00:00:58.360 |
To understand the pieces, you'll need to understand the shapes of convolutional layer outputs, 00:01:03.960 |
and receptive fields, and loss functions, and everything. 00:01:08.520 |
So it's all stuff that you need to understand for all of your deep learning studies anyway. 00:01:15.640 |
So everything you do to develop an understanding of last week's lesson is going to help you 00:01:22.480 |
One key thing I wanted to mention is we started out with something which is really pretty 00:01:26.440 |
simple, which is single object classifier, single object bounding box without a classifier, 00:01:34.040 |
and then single object classifier and bounding box. 00:01:38.080 |
And anybody who's spent some time studying since lesson 8, mod 7, has got to the point 00:01:49.680 |
Now the reason I mention this is because the bit where we go to multiple objects is actually 00:01:55.940 |
almost identical to this, except we first have to solve the matching problem. 00:02:00.800 |
We end up creating far more activations than we need for our number of bounding boxes, 00:02:07.280 |
ground truth bounding boxes, so we match each ground truth object to a subset of those activations. 00:02:12.920 |
And once we've done that, the loss function that we then do to each matched pair is almost 00:02:20.800 |
So if you're feeling stuck, go back to lesson 8 and make sure you understand the data set, 00:02:30.040 |
the data loader, and most importantly the loss function from the end of lesson 8 or 00:02:40.880 |
So once we've got this thing which can predict the class and bounding box for one object, 00:02:47.240 |
we went to multiple objects by creating more activations. 00:02:51.800 |
We had to then deal with the matching problem, we then basically moved each of those anchor 00:02:58.640 |
boxes in and out a little bit and around a little bit, so they tried to line up with 00:03:07.720 |
And we talked about how we took advantage of the convolutional nature of the network 00:03:13.920 |
to try to have activations that had a receptive field that was similar to the ground truth 00:03:23.360 |
And Chloe Sultan provided this fantastic picture, I guess for her own notes, but she shared 00:03:29.800 |
it with everybody, which is lovely, to talk about what does SSD multi-head forward do 00:03:37.880 |
And I partly wanted to show this to help you with your revision, but I also partly wanted 00:03:41.680 |
to show this to kind of say, doing this kind of stuff is very useful for you to do, like 00:03:48.840 |
walk through and in whatever way helps you make sure you understand something. 00:03:53.120 |
You can see what Chloe's done here is she's focused particularly on the dimensions of 00:03:59.760 |
the tensor at each point in the path as we're gradually down-sampling using these 00:04:06.440 |
stride 2 convolutions, making sure she understands why those grid sizes happen, and then understanding 00:04:16.680 |
And so one thing you might be wondering is how did Chloe calculate these numbers? 00:04:23.480 |
So I don't know the answer I haven't spoken to her, but obviously one approach would be 00:04:27.840 |
like from first principles just thinking through it. 00:04:33.040 |
And so this is where you've got to remember this pdb.settrace idea. 00:04:38.560 |
So I just went in just before class and went into SSD multi-head.forward and entered pdb.settrace, 00:04:50.040 |
And so I put the trace at the end, and then I could just print out the size of all of 00:04:58.520 |
So which reminds me, last week there may have been a point where I said 21 + 4 = 26, which 00:05:21.840 |
And by the way, when I code I do that stuff, that's the kind of thing I do all the time. 00:05:25.920 |
So that's why we have debuggers and know how to check things and do things in small little 00:05:32.500 |
So this idea of putting a debugger inside your forward function and printing out the 00:05:36.240 |
sizes is something which is damn super helpful. 00:05:40.440 |
Or you could just put a print statement here as well. 00:05:44.720 |
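As a minimal sketch of that debugging trick (this is a stand-in module, not the actual SSD_MultiHead code, and the layer sizes are made up), you can drop pdb.set_trace or a print into any forward method and read off the activation shapes:

    import pdb
    import torch
    import torch.nn as nn

    class TinyHead(nn.Module):
        # stand-in for a detection head whose activation shapes we want to inspect
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(256, 64, kernel_size=3, stride=2, padding=1)

        def forward(self, x):
            out = self.conv(x)
            pdb.set_trace()          # drops into the debugger: type out.size(), then c to continue
            # or, more simply: print(out.size())
            return out

    TinyHead()(torch.randn(1, 256, 7, 7))   # inspecting this shows torch.Size([1, 64, 4, 4])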
So I actually don't know if that's how Chloe figured it out, but that's how I would if 00:05:50.560 |
And then we talked about increasing k, which is the number of anchor boxes for each convolutional 00:05:55.360 |
grid cell, which we can do with different zooms and different aspect ratios. 00:05:59.760 |
And so that gives us a plethora of activations, and therefore predicted bounding boxes, which 00:06:07.400 |
we then went down to a small number using non-maximum suppression. 00:06:15.920 |
And I'll try to remember to put a link -- there's a really interesting paper that one of our 00:06:18.960 |
students told me about that I hadn't heard about, which is attempting to -- you know 00:06:26.320 |
It's kind of hacky, kind of ugly, it's totally heuristic, I didn't even talk about the code 00:06:34.960 |
So somebody actually came up with a paper recently which attempts to do an end-to-end 00:06:45.320 |
Nobody's created a PyTorch implementation yet, so it would be an interesting project 00:06:55.800 |
One thing I've noticed in our study groups during the week is not enough people reading 00:07:03.200 |
What we are doing in class now is implementing papers. 00:07:10.940 |
And I think from talking to people, a lot of the reason people aren't reading papers 00:07:14.360 |
is because a lot of people don't think they're capable of reading papers, they don't think 00:07:24.160 |
And we started looking at a paper last week and we read the words that were in English 00:07:32.120 |
So if you actually look through this picture from SSD carefully, you'll realize that SSD 00:07:40.800 |
multi-head dot forward is not doing the same as this. 00:07:44.880 |
And then you might think, oh, I wonder if this is better. 00:07:49.240 |
And my answer is probably, because SSD multi-head dot forward was the first thing I tried just 00:07:56.540 |
to get something out there, but between this and the YOLO version, there are probably much 00:08:05.660 |
One thing you'll notice in particular is they use a smaller K, but they have a lot more 00:08:10.600 |
sets of grids - 1x1, 3x3, 5x5, 10x10, 19x19 and 38x38, 8,700-plus boxes in total - so a lot more than 00:08:22.080 |
we had, so that'd be an interesting thing to experiment with. 00:08:25.600 |
Another thing I noticed is that we had 4x4, 2x2, 1x1, which means there's a lot of overlap 00:08:37.360 |
In this case where you've got 1, 3, 5, you don't have that overlap, so it might actually 00:08:44.120 |
So there's lots of interesting things you can play with based on stuff that's either 00:08:48.920 |
trying to make it closer to the paper or think about other things you could try that aren't 00:08:55.720 |
Perhaps the most important thing I would recommend is to put the code and the equations next 00:09:04.520 |
There was a question about whether I could speak about the use_clr (cyclic learning rate) argument. 00:09:13.920 |
So put the code and the equations from the paper next to each other. 00:09:22.560 |
You're either a code person like me who's not that happy about math, in which case I 00:09:29.480 |
start with the code and then I look at the math and I learn about how the math maps to 00:09:34.560 |
the code and end up eventually understanding the math. 00:09:38.520 |
Or you've got a PhD in stochastic differential equations like Rachel, whatever that means, in which 00:09:47.400 |
case you can look at the math and then learn about how the code implements the math. 00:09:52.400 |
But either way, unless you're one of those rare people who is equally comfortable in 00:09:56.800 |
either world, you'll learn about one or the other. 00:10:02.740 |
Now learning about code is pretty easy because there's documentation and we know how to look 00:10:09.320 |
Sometimes learning the math is hard because the notation might seem hard to look up, but 00:10:15.720 |
For example, a list of mathematical symbols on Wikipedia is amazingly great. 00:10:21.560 |
It has examples of them, explanations of what they mean, and tells you what to search for 00:10:31.880 |
And if you Google for math notation cheat sheet, you'll find more of these kinds of 00:10:40.200 |
So over time, you do need to learn the notation, but as you'll see from the Wikipedia page, 00:10:49.600 |
Obviously there's a lot of concepts behind it, but once you know the notation you can 00:10:53.800 |
then quickly look up the concept as it pertains to the particular thing you're studying. 00:10:59.160 |
Nobody learns all of math and then starts machine learning. 00:11:05.160 |
Everybody, even top researchers I know, when they're reading a new paper will very often 00:11:10.800 |
come to bits of math they haven't seen before and they'll have to go away and learn that 00:11:18.800 |
Another thing you should try doing is to recreate things that you see in the papers. 00:11:24.560 |
So here was the key most important figure 1 from the focal loss paper, the RetinaNet paper. 00:11:35.040 |
And very often I put these challenges up on the forums, so keep an eye on the lesson threads 00:11:41.760 |
during the forums, and so I put this challenge up there and within about 3 minutes Serada 00:11:46.840 |
had said "done it" in Microsoft Excel naturally along with actually a lot more information 00:11:55.280 |
A nice thing here is that she was actually able to draw a line showing at a 0.5 ground 00:12:00.600 |
truth probability what's the loss for different amounts of gamma, which is kind of cool. 00:12:07.360 |
And if you want to cheat, she's also provided Python code on the forum too. 00:12:15.760 |
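If you want to recreate that figure yourself, here is a rough sketch (my own plotting code, not the spreadsheet or the paper's): the focal loss is just -(1 - p_t)^gamma * log(p_t), and gamma = 0 gives you back plain cross entropy.

    import numpy as np
    import matplotlib.pyplot as plt

    pt = np.linspace(0.01, 1, 200)                 # predicted probability of the ground-truth class
    for gamma in [0, 0.5, 1, 2, 5]:                # the gammas shown in figure 1 of the paper
        loss = -(1 - pt) ** gamma * np.log(pt)     # focal loss; gamma=0 is ordinary cross entropy
        plt.plot(pt, loss, label=f'gamma = {gamma}')
    plt.axvline(0.5, linestyle='--')               # the 0.5 ground-truth-probability line mentioned above
    plt.xlabel('probability of ground truth class')
    plt.ylabel('loss')
    plt.legend()
    plt.show()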
I did discover a minor bug in my code last week, the way that I was flattening out the 00:12:20.920 |
convolutional activations did not line up with how I was using them in the loss function, 00:12:26.800 |
and fixing that actually made it quite a bit better, so my motorbikes and cows are actually 00:12:33.160 |
So when you go back to the notebook, you'll see it's a little less bad than it was last 00:12:41.160 |
So there's some quick coverage of what's gone before. 00:12:49.200 |
>> Quick question, are you going to put the PowerPoint on GitHub? 00:12:58.200 |
>> And then secondly, usually when we down sample, we increase the number of filters 00:13:03.900 |
When we're downsampling from 7x7 to 4x4, why are we decreasing the number of filters from 512 to 256? 00:13:23.920 |
I guess they've got the stars and the colors. 00:13:32.200 |
>> Oh yes, that's right, they're weird italics. 00:13:35.200 |
>> It's because -- well, largely it's because that's kind of what the papers tend to do. 00:13:38.640 |
We've got a number of -- well, we have a number of out paths and we kind of want each one 00:13:44.720 |
to be the same, so we don't want each one to have a different number of filters. 00:13:51.120 |
And also this is what the papers did, so I was trying to match up with that, having these 00:13:57.760 |
It's a different concept because we're taking advantage of not just the last layer, but 00:14:02.880 |
the layers before that as well. Life's easier if we make them more consistent. 00:14:12.080 |
So we're now going to move to NLP, and so let me kind of lay out where we're going here. 00:14:23.560 |
We've seen a couple of times now this idea of taking a pre-trained model, in fact we've 00:14:28.400 |
seen it in every lesson. Take a pre-trained model, rip off some stuff on the top, replace 00:14:33.280 |
it with some new stuff, get it to do something similar. 00:14:42.000 |
And so what we're going to do -- and so we've kind of dived in a little bit deeper to that, 00:14:48.440 |
to say like okay, with conv_learner.pre_trained, it had a standard way of sticking stuff on 00:14:56.960 |
the top which does a particular thing which was classification. 00:15:01.600 |
And then we learned actually we can stick any PyTorch module we like on the end and 00:15:07.640 |
have it do anything we like with a custom head. And so suddenly you discover, wow, there's 00:15:16.280 |
some really interesting things we can do. In fact, that reminds me, Yang Lu said, well, 00:15:37.960 |
what if we did a different kind of custom head? And so the different custom head was, 00:15:42.200 |
well, let's take the original pictures and rotate them and then make our dependent variable 00:15:51.480 |
the opposite of that rotation basically and see if it can learn to unrotate it. And this 00:15:57.680 |
is like a super useful thing, obviously. In fact, I think Google Photos nowadays has this 00:16:03.160 |
option that it will actually automatically rotate your photos for you. 00:16:09.200 |
But the cool thing is, as Yang Lu shows here, you can build that network right now by doing 00:16:15.000 |
exactly the same as our previous lesson, but your custom head is one that spits out a single 00:16:21.320 |
number which is how much to rotate by, and your dataset has a dependent variable which 00:16:27.120 |
is how much did you rotate by. So you suddenly realize with this idea of a backbone plus 00:16:34.040 |
a custom head, you can do almost anything you can think about. 00:16:41.920 |
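As a rough sketch of that backbone-plus-custom-head idea (this is my guess at the shape of such a model, not Yang Lu's actual code, and it assumes a reasonably recent PyTorch/torchvision), the head just has to end in a single number:

    import torch.nn as nn
    from torchvision import models

    # ResNet backbone with its classifier chopped off, plus a head that emits one number:
    # the predicted rotation. The dependent variable is how much you rotated each image by.
    backbone = nn.Sequential(*list(models.resnet34(pretrained=True).children())[:-2])
    head = nn.Sequential(
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(512, 256), nn.ReLU(),
        nn.Linear(256, 1),               # single output: how much to rotate back by
    )
    model = nn.Sequential(backbone, head)
    # train it with a regression loss, e.g. nn.MSELoss(), against the known rotation angle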
So today we're going to look at the same idea and say, okay, how does that apply to NLP? 00:16:49.640 |
And then in the next lesson, we're going to go further and say, well, if NLP and computer 00:16:56.600 |
vision kind of let you do the same basic ideas, how do we combine them two? And we're going 00:17:02.120 |
to learn about a model that can actually learn to find word structures from images, or images 00:17:11.840 |
from word structures, or images from images. And that will form the basis, if you wanted 00:17:17.860 |
to go further, of doing things like going from an image to a sentence, it's called image 00:17:23.480 |
captioning, or going from a sentence to an image, which starts to let us do phrase-to-image. 00:17:31.120 |
And so from there, we're going to go deeper then into computer vision to think about what 00:17:39.800 |
other kinds of things we can do with this idea of a pre-trained network plus a custom 00:17:44.400 |
head. And so we'll look at various kinds of image enhancement, like increasing the resolution 00:17:49.940 |
of a low-res photo to guess what was missing, or adding artistic filters on top of photos, 00:17:58.200 |
or changing photos of horses into photos of zebras and stuff like that. 00:18:04.280 |
And then finally, that's going to bring us all the way back to bounding boxes again. 00:18:11.240 |
And so to get there, we're going to first of all learn about segmentation, which is 00:18:14.480 |
not just figuring out where a bounding box is, but figuring out what every single pixel 00:18:19.640 |
in an image is part of. So this pixel is part of a person, this pixel is part of a car. 00:18:25.640 |
And then we're going to use that idea, particularly an idea called unet, and it turns out that 00:18:30.640 |
this idea from unet, we can apply to bounding boxes where it's called feature pyramids. 00:18:36.280 |
Everything has to have a different name in every slightly different area. And we'll use 00:18:41.480 |
that to hopefully get some really good results with bounding boxes. 00:18:48.240 |
So that's kind of our path from here. So it's all going to build on each other, but take 00:18:57.480 |
Now for NLP. In the last part, we relied on a pretty great library called TorchText. But as pretty 00:19:06.200 |
great as it was, I've since then found the limitations of it too problematic to keep 00:19:13.000 |
using it. As a lot of you complained on the forums, it's pretty damn slow. Partly because 00:19:21.160 |
it's not doing parallel processing, and partly it's because it doesn't remember what you did 00:19:29.440 |
last time and it does it all over again from scratch. 00:19:34.920 |
And then it's kind of hard to do fairly simple things, like a lot of you were trying to get 00:19:38.840 |
into the toxic comment competition on Kaggle, which was a multi-label problem, and trying 00:19:44.520 |
to do that with TorchText. I eventually got it working, but it took me like a week of 00:19:53.000 |
So to fix all these problems, we've created a new library called FastAI.Text. FastAI.Text 00:19:59.800 |
is a replacement for the combination of TorchText and FastAI.NLP. So don't use FastAI.NLP anymore. 00:20:10.940 |
That's obsolete. It's slower, it's more confusing, it's less good in every way, but there's a 00:20:18.880 |
lot of overlaps. Intentionally, a lot of the classes have the same names, a lot of the 00:20:24.280 |
functions have the same names, but this is the non-TorchText version. 00:20:33.480 |
So we're going to work with IMDB again. For those of you who have forgotten, go back and 00:20:38.440 |
check out lesson 4. Basically this is a data set of movie reviews, and you remember we 00:20:46.840 |
used it to find out whether we might enjoy Zombiegeddon or not, and we thought probably 00:20:56.460 |
So we're going to use the same data set, and by default it calls itself ACLIMDB, so this 00:21:03.080 |
is just the raw data set that you can download. And as you can see, I'm doing from FastAI.Text 00:21:13.560 |
import star. There's no TorchText, and I'm not using FastAI.NLP. 00:21:21.160 |
I'm going to use Pathlib as per usual. We're going to learn about what these tags are later. 00:21:27.720 |
So you might remember the basic path for NLP is that we have to take sentences and turn 00:21:37.360 |
them into numbers, and there's a couple of steps to get there. 00:21:44.280 |
So at the moment, somewhat intentionally, FastAI.Text doesn't provide that many helper functions. 00:21:54.240 |
It's really designed more to let you handle things in a fairly flexible way. So as you 00:22:00.200 |
can see here, I wrote something called get_texts, which goes through each thing in classes, 00:22:08.360 |
and these are the three classes that they have in IMDB. Negative, positive, and then 00:22:13.760 |
there's another folder, unsupervised. That's stuff they haven't gotten around to labeling yet. 00:22:20.400 |
And so I just go through each one of those classes, and then I just find every file in 00:22:27.800 |
that folder with that name, and I open it up and read it and chuck it into the end of 00:22:33.080 |
this array. And as you can see, with Pathlib it's super easy to grab stuff and pull it 00:22:41.080 |
in, and then the label is just whatever class I'm up to so far. So I'll go ahead and do that 00:22:49.120 |
for the train bit, and I'll go ahead and do that for the test bit. 00:22:54.200 |
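Roughly, the get_texts described here looks like this (a sketch along the lines of the notebook; PATH points at the unpacked aclImdb folder):

    from pathlib import Path
    import numpy as np

    CLASSES = ['neg', 'pos', 'unsup']             # the three folders: negative, positive, unsupervised

    def get_texts(path):
        texts, labels = [], []
        for idx, label in enumerate(CLASSES):
            for fname in (path / label).glob('*.*'):
                texts.append(fname.open('r', encoding='utf-8').read())
                labels.append(idx)                # the label is just whichever class we're up to
        return np.array(texts), np.array(labels)

    trn_texts, trn_labels = get_texts(PATH / 'train')
    val_texts, val_labels = get_texts(PATH / 'test')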
So there's 70,000 in train, 25,000 in test, 50,000 of the train ones are unsupervised. 00:23:00.280 |
We won't actually be able to use them when we get to the classification piece. So I actually 00:23:05.320 |
find this much easier than the torch text approach of having lots of layers and wrappers 00:23:12.440 |
and stuff, because in the end reading text files is not that hard. 00:23:20.480 |
One thing that's always a good idea is to sort things randomly. It's useful to know this 00:23:28.640 |
simple trick for sorting things randomly, particularly when you've got multiple things 00:23:32.000 |
you have to sort the same way, in this case I've got labels and texts. np.random.permutation, 00:23:38.040 |
if you give it an integer, it gives you a random list from 0 up to and not including 00:23:44.720 |
the number you give it in some random order. So you can then just pass that in as an indexer 00:23:53.480 |
to give you a list that's sorted in that random order. So in this case it's going to sort 00:23:58.240 |
train texts and train labels in the same random way. So it's a useful little idiom to use. 00:24:08.120 |
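In code, that idiom is just:

    import numpy as np

    trn_idx = np.random.permutation(len(trn_texts))   # e.g. np.random.permutation(5) -> array([3, 0, 4, 1, 2])
    trn_texts  = trn_texts[trn_idx]                   # the same random order applied to both arrays,
    trn_labels = trn_labels[trn_idx]                  # so texts and labels stay lined up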
So now I've got my texts and my labels sorted. I can go ahead and create a data frame from 00:24:13.240 |
them. Why am I doing this? The reason I'm doing this is because there is a somewhat standard 00:24:22.360 |
approach starting to appear for text classification datasets, which is to have your training set 00:24:31.480 |
as a CSV file with the labels first and the text of the NLP document second in a train.csv 00:24:45.280 |
and a test.csv. So basically it looks like this. You've got your labels and your texts. 00:24:50.880 |
And then a file called classes.txt, which just lists the classes. I think it's somewhat 00:24:57.920 |
standard. In a reasonably recent academic paper, Yann LeCun and a team of researchers looked 00:25:05.920 |
at quite a few datasets and they used this format for all of them. And so that's what 00:25:12.960 |
I've started using as well for my recent paper. So what I've done is you'll find that this 00:25:20.360 |
notebook, if you put your data into this format, the whole notebook will work every time. So 00:25:29.400 |
rather than having a thousand different classes or formats and readers and writers and whatever, 00:25:35.360 |
I've just said let's just pick a standard format, and your job - which you can 00:25:39.880 |
do perfectly well - is to put your data in that format, which is the CSV file. The CSV files 00:25:52.560 |
Now you'll notice at the start here that I had two different paths. One was the classification 00:25:58.040 |
path, one was the language model path. In NLP, you'll see LM all the time. LM means language 00:26:05.640 |
model in NLP. So the classification path is going to contain the information that we're 00:26:14.160 |
going to use to create a sentiment analysis model. The language model path is going to 00:26:19.000 |
contain the information we need to create a language model. So they're a little bit different. 00:26:23.880 |
One thing that's different is that when we create the train.csv and the classification 00:26:30.360 |
path, we remove everything that has a label of 2, because a label of 2 means unsupervised. We 00:26:40.640 |
can't use the unsupervised data for the classifier, so we remove it. So that means 00:26:47.320 |
this is going to have actually 25,000 positive, 25,000 negative. The second difference is 00:26:52.720 |
the labels. For the classification path, the labels are the actual labels. But for the 00:26:59.760 |
language model, there are no labels, so we just use a bunch of zeroes. That just makes 00:27:04.920 |
it a little bit easier because we can use a consistent data frame format or CSV format. 00:27:12.900 |
Now the language model, we can create our own validation set. So you've probably come 00:27:19.620 |
across by now sklearn.model_selection.train_test_split, which is a really simple little function 00:27:26.480 |
that grabs a data set and randomly splits it into a training set and a validation set 00:27:32.200 |
according to whatever proportion you specify. So in this case, I can catenate my classification 00:27:38.680 |
training and validation together. So it's going to be 100,000 altogether, split it by 00:27:44.000 |
10%, so now I've got 90,000 training, 10,000 validation for my language model. So go ahead 00:27:52.680 |
So that's my basic get the data in a standard format for my language model and my classifier. 00:28:04.240 |
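Putting those steps into code, a sketch might look like this (CLAS_PATH and LM_PATH stand for the classification and language model paths mentioned at the start; the column layout - labels first, text second, no header - is the standard format described above):

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split

    col_names = ['labels', 'text']
    df_trn = pd.DataFrame({'labels': trn_labels, 'text': trn_texts}, columns=col_names)
    df_val = pd.DataFrame({'labels': val_labels, 'text': val_texts}, columns=col_names)

    # classifier CSVs: drop the unsupervised reviews (label 2)
    df_trn[df_trn['labels'] != 2].to_csv(CLAS_PATH / 'train.csv', header=False, index=False)
    df_val.to_csv(CLAS_PATH / 'test.csv', header=False, index=False)

    # language model split: pool all the texts, hold out 10% as validation, labels all zero
    trn_lm_texts, val_lm_texts = train_test_split(
        np.concatenate([trn_texts, val_texts]), test_size=0.1)
    pd.DataFrame({'labels': [0] * len(trn_lm_texts), 'text': trn_lm_texts},
                 columns=col_names).to_csv(LM_PATH / 'train.csv', header=False, index=False)
    pd.DataFrame({'labels': [0] * len(val_lm_texts), 'text': val_lm_texts},
                 columns=col_names).to_csv(LM_PATH / 'test.csv', header=False, index=False)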
So the next thing we need to do is tokenization. Tokenization means at this stage we've got 00:28:12.400 |
for a document, for a movie review, we've got a big long string, and we want to put it into 00:28:17.460 |
a list of tokens, which are kind of a list of words, but not quite. For example, 'don't', 00:28:26.480 |
we want to be 'do' and 'n't', and we probably want a full stop to be a token. So tokenization is something 00:28:35.680 |
that we passed off to a terrific library called spaCy, partly terrific because an Australian 00:28:43.080 |
wrote it and partly terrific because it's good at what it does. We've put a bit of stuff 00:28:50.760 |
on top of spaCy, but the vast majority of the work is being done by spaCy. 00:28:55.560 |
Before we pass it to Spacey, I've written this simple fixup function, which is basically 00:29:03.440 |
each time I looked at a different dataset, and I've looked at about a dozen in building 00:29:06.880 |
this, everyone had different weird things that needed to be replaced. Here are all the 00:29:15.440 |
ones I've come up with so far. Hopefully this will help you out as well. So I HTML-unescape 00:29:24.320 |
all the entities, and then there's a bunch more things I replace. Have a look at the 00:29:29.320 |
result of running this on text that you put in and make sure there's not more weird tokens 00:29:34.720 |
in there. It's amazing how many weird things people do to text. 00:29:41.120 |
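Here is a trimmed-down sketch of that fixup idea (the real function has a much longer list of replacements; this just shows the shape of it):

    import re, html

    re_spaces = re.compile(r'  +')

    def fixup(x):
        # un-escape HTML entities and clean up a few of the artefacts scraped text tends to contain
        x = x.replace('<br />', '\n').replace('\\"', '"').replace('nbsp;', ' ')
        return re_spaces.sub(' ', html.unescape(x))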
So basically I've got this function called get_all, which is going to go ahead and call 00:29:47.800 |
get_texts, and get_texts is going to go ahead and do a few things, one of which is to apply 00:29:52.520 |
that fixup that we just mentioned. So let's kind of look through this because there's 00:29:58.840 |
some interesting things to point out. So I've got to use pandas to open our train.csv from 00:30:05.000 |
the language model path, but I'm passing in an extra parameter you may not have seen before 00:30:09.840 |
called chunksize. Python and pandas can both be pretty inefficient when it comes to storing 00:30:19.080 |
and using text data. And so you'll see that very few people in NLP are working with large 00:30:31.000 |
corpuses and I think part of the reason is that traditional tools have just made it really 00:30:36.960 |
difficult - you run out of memory all the time. So this process I'm showing you today 00:30:43.680 |
I have used on corpuses of over a billion words successfully using this exact code. And so 00:30:50.720 |
one of the simple tricks is to use this thing called chunksize with pandas. What that means 00:30:55.740 |
is that pandas does not return a data frame, but it returns an iterator that we can iterate 00:31:01.760 |
through chunks of a data frame. And so that's why I don't say "tok_train = get_texts" but 00:31:15.920 |
instead I call "get_all" which loops through the data frame. But actually what it's really 00:31:21.240 |
doing is it's looping through chunks of the data frame. So each of those chunks is basically 00:31:27.280 |
a data frame representing a subset of the data. 00:31:31.940 |
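A minimal sketch of that pattern (LM_PATH is the language model path from earlier; process stands in for whatever per-chunk work you do, such as the tokenization described next):

    import pandas as pd

    chunksize = 24000   # rows per chunk; pick whatever fits comfortably in memory
    # with chunksize set, read_csv returns an iterator of DataFrames rather than one big frame
    df_iter = pd.read_csv(LM_PATH / 'train.csv', header=None, chunksize=chunksize)

    for chunk in df_iter:
        # chunk is an ordinary DataFrame holding the next `chunksize` rows
        process(chunk)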
"When I'm working with NLP data, many times I come across data with foreign text or characters. 00:31:40.000 |
No, no, definitely keep them. And this whole process is Unicode, and I've actually used 00:31:45.800 |
this on Chinese text. This is designed to work on pretty much anything. In general, 00:31:55.920 |
most of the time it's not a good idea to remove anything. Old-fashioned NLP approaches tend 00:32:02.640 |
to do all this lemmatization and all these normalization steps to get rid of stuff - lowercase 00:32:08.640 |
everything, blah blah blah. But that's throwing away information which you don't know ahead 00:32:14.480 |
of time whether it's useful or not. So don't throw away information. 00:32:20.720 |
So we go through each chunk, each of which is a data frame, and we call get_texts. get_texts 00:32:26.560 |
is going to grab the labels and make them into ints. It's going to grab then the texts. 00:32:37.960 |
And I'll point out a couple of things. The first is that before we include the text, 00:32:42.320 |
we have this beginning of stream token, which you might remember we used way back up here. 00:32:49.800 |
There's nothing special about these particular strings of letters, they're just ones I figured 00:32:53.520 |
don't appear in normal texts very often. So every text is going to start with XBOS. Why 00:33:01.940 |
is that? Because it's often really useful for your model to know when a new text is starting. 00:33:08.980 |
For example, if it's a language model, you're going to concatenate all the text together, 00:33:14.280 |
and so it'd be really helpful for it to know this article is finished and a new one started, 00:33:18.200 |
so I should probably forget some of that context now. Ditto, quite often texts have multiple 00:33:27.880 |
fields like a title, an abstract, and then the main document. And so by the same token, 00:33:32.880 |
I've got this thing here which lets us actually have multiple fields in our CSV. So this process 00:33:40.080 |
is designed to be very flexible. And again, at the start of each one, we put a special 00:33:44.200 |
field starts here token, followed by the number of the field that's starting here for as many 00:33:49.920 |
fields as we have. Then we apply our fix up to it, and then most importantly we tokenize 00:33:56.400 |
it, and we tokenize it across multiple processes. So tokenizing tends to be 00:34:10.440 |
pretty slow, but we've all got multiple cores in our machines now and some of the better 00:34:15.280 |
machines on AWS and stuff can have dozens of cores. Here on our university computer, 00:34:21.200 |
we've got 56 cores. So spaCy is not very amenable to multiprocessing, but I finally figured 00:34:31.800 |
out how to get it to work. And the good news is it's all wrapped up in this one function 00:34:36.500 |
now. And so all you need to pass to that function is a list of things to tokenize, which each 00:34:43.320 |
part of that list will be tokenized on a different core. And so I've also created this function 00:34:48.520 |
called partition_by_cores, which takes a list and splits it into sub-lists. The number of 00:34:54.200 |
sub-lists is the number of cores that you have in your computer. So on my machine, without 00:35:03.280 |
multiprocessing, this takes about an hour and a half, and with multiprocessing it takes 00:35:09.280 |
about two minutes. So it's a really handy thing to have. And now that this code's here, feel 00:35:16.080 |
free to look inside it and take advantage of it through your own stuff. Remember, we 00:35:21.880 |
all have multiple cores even in our laptops, and very few things in Python take advantage 00:35:29.040 |
of it unless you make a bit of an effort to make it work. 00:35:35.040 |
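The library wraps this up for you, but here is a rough sketch of the idea using just the standard library and spaCy (this is not the fastai implementation itself; on platforms that spawn worker processes, call it from under an if __name__ == '__main__': guard):

    from concurrent.futures import ProcessPoolExecutor
    from multiprocessing import cpu_count
    import spacy

    def tokenize_chunk(texts):
        nlp = spacy.blank('en')                  # a bare spaCy tokenizer, loaded once per worker
        return [[t.text for t in nlp(txt)] for txt in texts]

    def partition_by_cores(items):
        n = cpu_count()
        sz = (len(items) + n - 1) // n           # split the list into one sub-list per core
        return [items[i:i + sz] for i in range(0, len(items), sz)]

    def tokenize_all(texts):
        with ProcessPoolExecutor(cpu_count()) as ex:
            results = ex.map(tokenize_chunk, partition_by_cores(texts))
        return [tok for chunk in results for tok in chunk]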
So there's a couple of tricks to get things working quickly and reliably. As it runs, 00:35:39.560 |
it prints out how it's going. And so here's the result of the end. Beginning of stream 00:35:47.080 |
token, beginning of field number one token, here's the tokenized text. You'll see that 00:35:53.120 |
the punctuation is on the whole, now a separate token. You'll see there's a few interesting 00:36:02.240 |
little things. One is this: what's this 't_up mgm'? Well, MGM was originally capitalized, 00:36:11.960 |
but the interesting thing is that normally people either lowercase everything or they 00:36:18.120 |
leave the case as is. Now if you leave the case as is, then 'SCREW YOU' in all caps and 'screw 00:36:26.920 |
you' in lowercase are two totally different sets of tokens that have to be learned from 00:36:32.280 |
scratch. Or if you lowercase them all, then there's no difference at all between the two. 00:36:41.200 |
So how do you fix this so that you both get the semantic impact of "I'm shouting now!" 00:36:50.240 |
but not have every single word have to learn the shouted version versus the normal version. 00:36:55.040 |
And so the idea I came up with, and I'm sure other people have done this too, is to come 00:36:59.440 |
up with a unique token to mean the next thing is all uppercase. So then I lowercase it, 00:37:06.600 |
so now whatever used to be uppercase is now lowercase, it's just one token, and then we 00:37:10.240 |
can learn the semantic meaning of all uppercase. 00:37:14.480 |
And so I've done a similar thing. If you've got 29 exclamation marks in a row, we don't 00:37:19.280 |
learn a separate token for 29 exclamation marks. Instead I put in a special token for 00:37:24.840 |
the next thing repeats lots of times, and then I put the number 29, and then I put the 00:37:32.000 |
And so there's a few little tricks like that, and if you're interested in NLP, have a look 00:37:36.120 |
at the code for Tokenizer for these little tricks that I've added in (there's a small sketch of the idea just below), because some of 00:37:45.440 |
So the nice thing with doing things this way is we can now just np.save that and load it 00:37:52.160 |
back up later. We don't have to recalculate all this stuff each time like we tend to have 00:37:57.600 |
to do with TorchText or a lot of other libraries. 00:38:02.160 |
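To give a flavour of those tricks, here is a rough sketch (my own simplification; the exact marker strings and rules in the fastai Tokenizer differ a bit):

    import re

    TOK_UP, TOK_REP = 't_up', 'tk_rep'           # marker tokens; the strings are just conventions

    def mark_caps(tokens):
        # ['SCREW', 'YOU'] -> ['t_up', 'screw', 't_up', 'you']: one shared "shouting" marker
        out = []
        for t in tokens:
            if t.isupper() and len(t) > 1:
                out.append(TOK_UP)
            out.append(t.lower())
        return out

    def mark_repeats(text):
        # '!!!!!!' -> ' tk_rep 6 ! ': a repeat marker, the count, then a single copy of the character
        return re.sub(r'(\S)(\1{3,})',
                      lambda m: f' {TOK_REP} {len(m.group(2)) + 1} {m.group(1)} ',
                      text)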
So we've now got it tokenized. The next thing we need to do is turn it into numbers, which 00:38:08.600 |
we call numericalizing it. And the way we numericalize it is very simple. We make a list of all the 00:38:14.560 |
words that appear in some order, and then we replace every word with its index into that 00:38:20.000 |
list. The list of all the tokens that appear, we call the vocabulary. 00:38:29.160 |
So here's an example of some of the vocabulary. The counter class in Python is very handy 00:38:34.200 |
for this. It basically gives us a list of unique items and their counts. So here are the 25 00:38:42.960 |
most common things in the vocabulary. You can see there are things like apostrophe s and 00:38:48.480 |
double quote and end of paragraph, and also stuff like that. 00:38:54.720 |
Generally speaking, we don't want every unique token in our vocabulary. If it doesn't appear 00:39:01.320 |
at least two times, then it might just be a spelling mistake or a junk word; we can't learn 00:39:06.560 |
anything about it if it doesn't appear that often. Also the stuff that we're going to 00:39:11.120 |
be learning about at least so far on this part gets a bit clunky once you've got a vocabulary 00:39:16.680 |
bigger than 60,000. Time permitting, we may look at some work I've been doing recently 00:39:22.520 |
on handling larger vocabularies, otherwise that might have to come in a future course. 00:39:28.600 |
But actually for classification, I've discovered that doing more than about 60,000 words doesn't 00:39:32.920 |
seem to help anyway. So we're going to limit our vocabulary to 60,000 words, things that 00:39:38.560 |
appear at least twice. So here's a simple way to do that: use the Counter's .most_common, pass 00:39:45.080 |
in the max_vocab size. That'll sort it by the frequency, by the way. And if it appears 00:39:52.360 |
less often than a minimum frequency, then don't bother with it at all. 00:39:57.000 |
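In code, that vocabulary step is roughly this (tok_trn here stands for the list of tokenized training documents from the previous step):

    from collections import Counter

    freq = Counter(tok for sent in tok_trn for tok in sent)
    freq.most_common(25)                                 # eyeball the 25 most frequent tokens

    max_vocab, min_freq = 60000, 2
    itos = [tok for tok, cnt in freq.most_common(max_vocab)
            if cnt >= min_freq]                          # keep tokens that appear at least twice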
So that gives us itos. That's the same name that TorchText used. Remember it means int 00:40:02.920 |
to string. So this is just the list of the unique tokens in the vocab. I'm going to insert 00:40:10.200 |
two more tokens, a token for unknown, a vocab item for unknown, and a vocab item for padding. 00:40:19.960 |
Then we can create the dictionary which goes in the opposite direction, so string to int. 00:40:26.760 |
And that won't cover everything because we intentionally truncated it down to 60,000 words. 00:40:33.440 |
And so if we come across something that's not in the dictionary, we want to replace 00:40:36.960 |
it with 0 for unknown, so we can use a default dict for that, with a lambda function that 00:40:43.600 |
always returns 0. So you can see all these things we're using that keep coming back up. 00:40:51.480 |
So now that we've got our s to i dictionary defined, we can then just call that for every 00:40:56.480 |
word for every sentence. And so there's our numericalized version, and there it is. And 00:41:06.920 |
so of course the nice thing is again, we can save that step as well. So each time we get 00:41:12.840 |
to another step, we can save it. And these are not very big files. Compared to what you 00:41:17.880 |
get used to with images, text is generally pretty small. Very important to also save 00:41:28.560 |
that vocabulary. Because this list of numbers means nothing, unless you know what each number 00:41:36.160 |
refers to, and that's what itos tells you. So you save those three things, and then later 00:41:42.840 |
on you can load them back up. So now our vocab size is 60,002, and our training language 00:42:00.120 |
So that's the preprocessing you do. We can probably wrap a little bit more of that in 00:42:05.520 |
little utility functions if we want to, but it's all pretty straightforward, and basically 00:42:10.400 |
that exact code will work for any dataset you have once you've got it in that CSV format. 00:42:18.240 |
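Pulling the numericalization and saving steps together, a sketch of that last stretch (the file names are just illustrative):

    import collections, pickle
    import numpy as np

    itos.insert(0, '_pad_')                              # special tokens for padding and unknown
    itos.insert(0, '_unk_')                              # so _unk_ ends up at index 0
    stoi = collections.defaultdict(lambda: 0,            # anything unseen maps to 0, i.e. _unk_
                                   {tok: i for i, tok in enumerate(itos)})

    trn_lm = np.array([[stoi[t] for t in sent] for sent in tok_trn], dtype=object)
    val_lm = np.array([[stoi[t] for t in sent] for sent in tok_val], dtype=object)

    np.save(LM_PATH / 'trn_ids.npy', trn_lm)             # reload later with np.load(..., allow_pickle=True)
    np.save(LM_PATH / 'val_ids.npy', val_lm)
    pickle.dump(itos, open(LM_PATH / 'itos.pkl', 'wb'))  # the vocab itself, so the ids stay meaningful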
So here is a kind of a new insight that's not new at all, which is that we'd like to 00:42:31.280 |
pre-train something. Like we know from lesson 4 that if we pre-train our classifier by first 00:42:39.680 |
creating a language model, and then fine-tuning that as a classifier, that was helpful. Remember 00:42:45.520 |
it actually got us a new state-of-the-art result. We got the best IMDB classifier result 00:42:50.720 |
that had ever been published. But quite a bit. Well, we're not going far enough though, 00:42:58.040 |
because IMDB movie reviews are not that different to any other English document compared to 00:43:12.000 |
how different they are to a random string or even to a Chinese document. So just like 00:43:19.940 |
ImageNet allowed us to train things that recognize stuff that kind of looks like pictures, and 00:43:26.760 |
we could use it on stuff that was nothing to do with ImageNet, like satellite images. 00:43:30.680 |
Why don't we train a language model that's just like good at English, and then fine-tune 00:43:37.400 |
it to be good at movie reviews? So this basic insight led me to try building a language 00:43:47.800 |
model on Wikipedia. So my friend Stephen Merity has already processed Wikipedia, creating a subset 00:43:58.920 |
of it - most of it, but throwing away the stupid little articles - and he calls that 00:44:08.240 |
WikiText 103. So I grabbed WikiText 103 and I trained a language model on it. I used exactly 00:44:16.640 |
the same approach I'm about to show you for training an IMDB language model, but instead 00:44:21.760 |
I trained a WikiText 103 language model. And then I saved it and I've made it available 00:44:29.640 |
for anybody who wants to use it at this URL. So this is not a URL for WikiText 103 the 00:44:36.920 |
documents; this is the WikiText 103 language model. So the idea now is let's train an IMDB 00:44:46.160 |
language model which starts with these words. 00:44:50.600 |
Now hopefully to you folks, this is an extremely obvious, extremely non-controversial idea because 00:44:58.720 |
it's basically what we've done in nearly every class so far. But when I first mentioned this 00:45:09.560 |
to people in the NLP community, I guess June/July of last year, there couldn't have been less 00:45:18.920 |
interest. I asked on Twitter, where a lot of the top Twitter researchers are people that 00:45:24.960 |
I follow and they follow me back, I was like "hey, what if we pre-trained a general language 00:45:29.800 |
model?" and they're like "no, all language is different, you can't do that" or "I don't 00:45:36.080 |
know why you would bother anyway, I've talked to people at conferences and I'm pretty sure 00:45:43.280 |
people have tried that and it's stupid." It just kind of went straight past them. I guess 00:45:56.000 |
because I am arrogant and I ignored them even though they know much more about NLP than 00:46:03.960 |
I do and just tried it anyway and let me show you what happened. 00:46:10.400 |
So here's how we do it. Grab the wiki text models, and if you use wget -r it'll actually 00:46:21.280 |
recursively grab the whole directory, it's got a few things in it. We need to make sure 00:46:27.480 |
that our language model has exactly the same embedding size, number of hidden and number 00:46:32.900 |
of layers as my wiki text one did, otherwise you can't load the weights in. So here's our 00:46:41.800 |
pre-trained path, here's our pre-trained language model path, let's go ahead and torch.load in 00:46:48.400 |
those weights from the forward wiki text 103 model. We don't normally use torch.load, but 00:46:58.440 |
that's the PyTorch way of grabbing a file. And it basically gives you a dictionary containing 00:47:07.080 |
the name of the layer and a tensor of those weights or an array of those weights. 00:47:14.760 |
Now here's the problem, that wiki text language model was built with a certain vocabulary which 00:47:21.720 |
was not the same as this one was built on. So my number 40 was not the same as wiki text 00:47:27.680 |
103 models number 40. So we need to map one to the other. That's very, very simple because 00:47:35.120 |
luckily I saved the itos for the WikiText vocab. So here's the list of what each word 00:47:44.280 |
is when I trained the wiki text 103 model, and so we can do the same default dict trick 00:47:50.520 |
to map it in reverse, and I'm going to use -1 to mean that it's not in the wiki text 00:47:56.520 |
dictionary. And so now I can just say my new set of weights is just a whole bunch of zeros 00:48:05.000 |
with vocab size by embedding size, so we're going to create an embedding matrix. I'm then 00:48:10.480 |
going to go through every one of the words in my IMDB vocabulary. I'm going to look it 00:48:17.200 |
up in stoi2, so string-to-int for the WikiText 103 vocabulary, and see if that word is 00:48:24.280 |
there. And if that word is there, then I'm not going to get this -1, so r will be greater 00:48:31.520 |
than or equal to 0. So in that case I will just set that row of the embedding matrix 00:48:36.800 |
to the weight that I just looked at, which was stored inside this named element. So these 00:48:45.520 |
names, you can just look at this dictionary and it's pretty obvious what each name corresponds 00:48:51.360 |
to because it looks very similar to the names that you gave it when you set up your module. 00:48:55.440 |
So here are the encoder weights. So grab it from the encoder weights. If I don't find it, 00:49:05.400 |
then I will use the row mean. In other words, here is the average embedding weight across 00:49:12.400 |
all of the wiki text 103 things. So that's pretty simple, so I'm going to end up with 00:49:18.560 |
an embedding matrix. For every word that's in both my vocabulary for IMDB and the WikiText 00:49:24.040 |
103 vocabulary, I will use the WikiText 103 embedding matrix weights; for anything 00:49:30.240 |
else, I will just use whatever was the average weight from the WikiText 103 embedding matrix. 00:49:36.080 |
And then I'll go ahead and replace the encoder weights with that, turned into a tensor. 00:49:43.600 |
We haven't talked much about weight tying, we might do so later, but basically the decoder, 00:49:48.500 |
so the thing that turns the final prediction back into a word, uses exactly the same weights, 00:49:56.380 |
so I pop it there as well. And then there's a bit of a weird thing with how we do embedding 00:50:01.600 |
dropout that ends up with a whole separate copy of them for a reason that doesn't matter 00:50:06.360 |
much. So we just pop the weights back where they need to go. 00:50:09.960 |
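Here is roughly what that remapping looks like in code. The sizes (400-dimensional embeddings, 1,150 hidden activations, 3 layers) are the ones given later in the lecture; the weight-dictionary key names are the ones the fastai 0.7 notebook uses, so check wgts.keys() against whatever file you actually downloaded:

    import collections, pickle
    import numpy as np
    import torch

    em_sz, nh, nl = 400, 1150, 3                         # must match the WikiText 103 model
    wgts = torch.load(PRE_PATH / 'fwd_wt103.h5',         # file name as downloaded; adjust if yours differs
                      map_location=lambda storage, loc: storage)

    itos2 = pickle.load(open(PRE_PATH / 'itos_wt103.pkl', 'rb'))     # WikiText 103 vocabulary
    stoi2 = collections.defaultdict(lambda: -1, {tok: i for i, tok in enumerate(itos2)})

    enc_wgts = wgts['0.encoder.weight'].numpy()          # pretrained embedding matrix
    row_m = enc_wgts.mean(0)                             # average embedding, for words WikiText never saw

    vs = len(itos)                                       # 60,002: the IMDB vocab built earlier
    new_w = np.zeros((vs, em_sz), dtype=np.float32)
    for i, tok in enumerate(itos):
        r = stoi2[tok]
        new_w[i] = enc_wgts[r] if r >= 0 else row_m

    wgts['0.encoder.weight'] = torch.from_numpy(new_w)
    wgts['0.encoder_with_dropout.embed.weight'] = torch.from_numpy(np.copy(new_w))
    wgts['1.decoder.weight'] = torch.from_numpy(np.copy(new_w))      # tied weights share the matrix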
So this is now a dictionary - a set of torch state - which 00:50:16.920 |
we can load in. So let's go ahead and create our language model. And so the basic approach 00:50:25.240 |
we're going to use, and I'm going to look at this in more detail in a moment, but the 00:50:27.920 |
basic approach we're going to use is I'm going to concatenate all of the documents together 00:50:38.000 |
into a single list of tokens of length 24.998 million. 00:50:47.260 |
So that's going to be what I pass in as my training set. So the language model, we basically 00:50:54.720 |
just take all our documents and just concatenate them back to back. And we're going to be continuously 00:50:59.280 |
trying to predict what's the next word after these words. And we'll look at these details 00:51:06.600 |
in a moment. I'm going to set up a whole bunch of dropout. We'll look at that in detail in 00:51:11.280 |
a moment. Once we've got a model data object, we can then grab the model from it. So that's 00:51:17.680 |
going to give us a learner. And then as per usual, we can call learner.fit. So we first 00:51:27.280 |
of all, as per usual, just do a single epoch on the last layer just to get that okay. And 00:51:34.320 |
the way I've set it up is the last layer is actually the embedding weights. Because that's 00:51:38.680 |
obviously the thing that's going to be the most wrong, because a lot of those embedding 00:51:42.020 |
weights didn't even exist in the vocab, so we're just going to train a single epoch of 00:51:47.200 |
just the embedding weights. And then we'll start doing a few epochs of the full model. 00:51:54.200 |
And so how is that looking? Well here's lesson 4, which was our academic world's best ever 00:52:02.920 |
result. And after 14 epochs we had a 4.23 loss. Here after 1 epoch we have a 4.12 loss. 00:52:19.800 |
So by pre-training on Wikitext 103, in fact let's go and have a look, we kept training 00:52:26.720 |
and training at a different rate. Eventually we got to 4.16. So by pre-training on Wikitext 00:52:32.400 |
103 we have a better loss after 1 epoch than the best loss we got for the language model 00:52:42.200 |
What is the Wikitext 103 model? Is it AWD LSTM again? 00:52:47.320 |
Yeah and we're about to dig into that. The way I trained it was literally the same lines 00:52:54.120 |
of code that you see here, but without pre-training it on Wikitext 103. 00:53:00.760 |
So let's take a 10-minute break, come back at 7.40 and we'll dig in and have a look at 00:53:08.720 |
Ok welcome back. Before we go back into language models and NLP classifiers, a quick discussion 00:53:17.280 |
about something pretty new at the moment which is the FastAI doc project. So the goal of 00:53:23.200 |
the FastAI doc project is to create documentation that makes readers say "Wow, that's the most 00:53:30.320 |
fantastic documentation I've ever read." And so we have some specific ideas about how to 00:53:37.440 |
do that, but it's the same kind of idea of top-down, thoughtful, take-full advantage 00:53:45.800 |
of the medium approach, interactive, experimental code first that we're all familiar with. 00:53:54.040 |
If you're interested in getting involved, the basic approach you can see in the docs 00:54:01.180 |
directory. So this is the readme in the docs directory. In there there is, amongst other 00:54:09.600 |
things, a transforms_template.adoc. What the hell is adoc? Adoc is ASCII doc. How many 00:54:17.600 |
people here have come across ASCII doc? That's awesome. People are laughing because there's 00:54:25.280 |
one hand up and it's somebody who was in our study group today who talked to me about ASCII 00:54:29.560 |
doc. ASCII doc is the most amazing project. It's like Markdown, but it's like what Markdown 00:54:36.280 |
needs to be to create actual books, and a lot of actual books are written in ASCII doc. 00:54:43.280 |
And so it's as easy to use as Markdown, but there's way more cool stuff you can do with 00:54:48.440 |
it. In fact, here is an ASCII doc file here, and as you'll see it looks very normal. There's 00:54:53.720 |
headings and this is pre-formatted text, and there's lists and whatever else. It looks 00:55:05.800 |
pretty standard, and actually I'll show you a more complete ASCII doc thing, a more standard 00:55:13.840 |
ASCII doc thing. But you can do stuff like say put a table of contents here please. You 00:55:20.880 |
can say colon colon means put a definition list here please. Plus means this is a continuation 00:55:28.780 |
of the previous list item. So there's just little things that you can do which are super 00:55:34.600 |
handy or make it slightly smaller than everything else. So it's like turbocharged Markdown. 00:55:43.800 |
And so this ASCII doc creates this HTML. And I didn't add any CSS or do anything myself. 00:55:52.280 |
We literally started this project like 4 hours ago. So this is like just an example basically. 00:55:58.480 |
And so you can see we've got a table of contents, we can jump straight to here, we've got a 00:56:05.920 |
cross-reference we can click on to jump straight to the cross-reference. Each method comes 00:56:11.960 |
along with its details and so on and so forth. And to make things even easier, rather than 00:56:18.380 |
having to know that the argument list is meant to be smaller than the main part, or how do 00:56:25.980 |
you create a cross-reference, or how are you meant to format the arguments to the method 00:56:32.280 |
name and list out each one of its arguments, we've created a special template where you 00:56:38.880 |
can just write various stuff in curly brackets like "please put the arguments here, and here 00:56:43.800 |
is an example of one argument, and here is a cross-reference, and here is a method," and 00:56:49.400 |
so forth. So we're in the process of documenting the documentation template; there's basically 00:56:55.440 |
like 5 or 6 of these little curly bracket things you'll need to learn. But for you to 00:56:59.760 |
create a documentation of a class or a method, you can just copy one that's already there 00:57:05.680 |
and so the idea is we're going to have, it'll almost be like a book. There'll be tables 00:57:12.320 |
and pictures and little video segments and hyperlink throughout and all that stuff. You 00:57:21.320 |
might be wondering what about docstrings, but actually I don't know if you've noticed, 00:57:25.760 |
but if you look at the Python standard library and look at the docstring for example for 00:57:31.320 |
regex compile, it's a single line. Nearly every docstring in Python is a single line. 00:57:38.040 |
And Python then does exactly this. They have a website containing the documentation that 00:57:43.080 |
says like "Hey, this is what regular expressions are and this is what you need to know about 00:57:46.940 |
them and if you want them to go faster, you'll need to use compile and here's lots of information 00:57:51.040 |
about compile and here's the examples." It's not in the docstring. And that's how we're 00:57:55.840 |
doing it as well. Our docstrings will be one line unless you need two sometimes. It's going 00:58:03.640 |
to be very similar to Python, but even better. So everybody is welcome to help contribute 00:58:11.920 |
to the documentation and hopefully by the time you're watching this on the MOOC, it'll 00:58:16.640 |
be reasonably fleshed out and we'll try to keep a list of things to do. 00:58:26.560 |
So I'm going to do one first. So one question that came up in the break was how does this 00:58:35.440 |
compare to Word2Vec? And this is actually a great thing for you to spend time thinking 00:58:41.440 |
about during the week is how does this compare to Word2Vec. I'll give you the summary now, 00:58:46.360 |
but it's a very important conceptual difference. The main conceptual difference is, what is 00:58:51.320 |
Word2Vec? Word2Vec is a single embedding matrix. Each word has a vector and that's it. So in 00:59:00.520 |
other words, it's a single layer from a pre-trained model and specifically that layer is the input 00:59:08.440 |
layer. And also specifically that pre-trained model is a linear model that is pre-trained 00:59:16.960 |
on something called a co-occurrence matrix. So we have no particular reason to believe 00:59:22.600 |
that this model has learned anything much about the English language or that it has 00:59:27.040 |
any particular capabilities because it's just a single linear layer and that's it. 00:59:34.320 |
So what's this WikiText 103 model? It's a language model. It has a 400-dimensional embedding 00:59:45.200 |
matrix, 3 hidden layers with 1,150 activations per layer, and regularization and all of that 00:59:57.560 |
stuff. Tied input-output matrices - it's basically a state-of-the-art AWD LSTM. So what's 01:00:05.920 |
the difference between a single layer of a single linear model versus a three-layer recurrent 01:00:14.800 |
neural network? Everything. They're very different levels of capability. And so you'll see when 01:00:22.200 |
you try using a pre-trained language model versus a Word2vec layer, you'll get very, 01:00:29.240 |
very different results for the vast majority of tasks. 01:00:33.360 |
What if the NumPy array does not fit in memory? Is it possible to write a PyTorch data loader 01:00:42.440 |
It almost certainly won't come up, so I'm not going to spend time on it. These things 01:00:46.200 |
are tiny. They're just ints. Think about how many ints you would need to run out of memory. 01:00:52.680 |
It's not going to happen. They don't have to fit in GPU memory, just in your memory. I've 01:00:57.880 |
actually done another Wikipedia model, which I called GigaWiki, which was on all of Wikipedia, 01:01:06.680 |
and even that easily fits in memory. The reason I'm not using it is because it turned out 01:01:10.480 |
not to really help very much versus WikiText 103, but I've built a bigger model than anybody 01:01:17.800 |
else I found in the academic literature pretty much, and it fits in memory on a single machine. 01:01:24.600 |
What is the idea behind averaging the weights of embeddings? 01:01:27.720 |
They've got to be set to something. There are words that weren't there, so other options 01:01:34.560 |
is we could leave them at 0, but that seems like a very extreme thing to do. 0 is a very 01:01:38.880 |
extreme number. Why would it be 0? We could set it equal to some random numbers, but if 01:01:46.160 |
so, what would be the mean and standard deviation of those random numbers, or should it be uniform? 01:01:50.960 |
If we just average the rest of the embeddings, then we have something that's a reasonable 01:01:57.800 |
Just to clarify, this is how you're initializing words that didn't appear in the training corpus. 01:02:03.040 |
I think you've pretty much just answered this one, but someone had asked if there's a specific 01:02:09.320 |
advantage to creating our own pre-trained embedding over using GloVe or Word2Vec. 01:02:14.520 |
I think I have. We're not creating a pre-trained embedding; we're creating a pre-trained model. 01:02:23.120 |
Let's talk a little bit more. This is a ton of stuff we've seen before, but it's changed 01:02:26.880 |
a little bit. It's actually a lot easier than it was in Part 1, but I want to go a little 01:02:38.000 |
So this is the language model loader, and I really hope that by now you've learned in 01:02:41.280 |
your editor or IDE how to jump to symbols. I don't want it to be a burden for you to 01:02:48.920 |
find out what the source code of a language model loader is. And if it's still a burden, 01:02:54.000 |
please go back and try and learn those keyboard shortcuts in VS Code. If your editor does 01:03:00.760 |
not make it easy, don't use that editor anymore. There's lots of good free editors that make 01:03:10.360 |
So here's the source code for language model loader. It's interesting to notice that it's 01:03:18.720 |
not doing anything particularly tricky. It's not deriving from anything at all. What makes 01:03:30.400 |
it something that's capable of being a data loader is it's something you can iterate over. 01:03:36.640 |
And so specifically, here's the fit function inside fastai.model. This is where everything 01:03:47.680 |
ends up eventually, which goes through each epoch, and then it creates an iterator from 01:03:52.960 |
the data loader, and it just does a for loop through it. So anything you can do a for loop 01:03:57.480 |
through can be a data loader. And specifically, it needs to return tuples of many batches, 01:04:05.800 |
an independent and dependent variable for many batches. 01:04:09.320 |
So anything with a dunder-iter method is something that can act as an iterator. And 01:04:17.600 |
yield is a neat little Python keyword you probably should learn about if you don't already 01:04:22.520 |
know it, but it basically spits out a thing and waits for you to ask for another thing, 01:04:30.720 |
So in this case, we start by initializing the language model, passing it in the numbers. 01:04:38.600 |
So this is a numericalized, big, long list of all of our documents concatenated together. 01:04:46.060 |
And the first thing we do is to batchify it. And this is the thing which quite a few of 01:04:52.280 |
you got confused about last time. If our batch size is 64 and we have 25 million numbers in 01:05:05.320 |
our list, we are not creating items of length 64. We're not doing that. We're creating 64 01:05:15.080 |
items in total. So each of them is of size t/64, which is 390,000. So that's what we 01:05:27.000 |
do here when we reshape it so that this axis here is of length 64, and then this -1 is 01:05:36.400 |
everything else. So that's 390,000 long. And then we transpose it. 01:05:44.560 |
So that means that we now have 64 columns, 390,000 rows, and then what we do each time 01:05:52.560 |
we do an iterate is we grab one batch of some sequence length, we'll look at that in a moment, 01:06:00.120 |
but basically it's approximately equal to bptt, which we set to 70, stands for backprop 01:06:10.160 |
through time, and we just grab that many rows. So from i to i plus 70 rows, and then we try 01:06:23.800 |
to predict that plus 1. So we've got 64 columns, and each of those is 1/64 of our 25 million 01:06:35.880 |
or whatever it was, tokens, hundreds of thousands long, and we just grab 70 at a time. So each 01:06:45.160 |
of those columns each time we grab it is going to hook up to the previous column. So that's 01:06:51.600 |
why we get this consistency, this language model. It's stateful, just really important. 01:06:59.600 |
Pretty much all the cool stuff in the language model is stolen from Stephen Merity's AWD 01:07:06.640 |
LSTM, including this little trick here, which is if we always grab 70 at a time and then 01:07:15.200 |
we go back and do a new epoch, we're going to grab exactly the same batches every time. 01:07:20.480 |
There's no randomness. Now normally we shuffle our data every time we do an epoch, or every 01:07:26.000 |
time we grab some data we grab it at random. You can't do that with a language model because 01:07:30.660 |
this set has to join up to the previous set because it's trying to learn the sentence. 01:07:38.120 |
If you suddenly jump somewhere else, then that doesn't make any sense as a sentence. 01:07:43.400 |
So Stephen's idea is to say, since we can't shuffle the order, let's instead randomly 01:07:51.380 |
change the size, the sequence length. So basically he says, 95% of the time we'll use bptt, 70, 01:08:02.020 |
but 5% of the time we'll use half that. And then he says, you know what, I'm not even 01:08:08.640 |
going to make that the sequence length, I'm going to create a normally distributed random 01:08:13.320 |
number with that average and a standard deviation of 5, and I'll make that the sequence length. 01:08:20.080 |
So the sequence length is 70ish, and that means every time we go through we're getting 01:08:26.600 |
slightly different batches. So we've got that little bit of extra randomness. I asked Stephen 01:08:34.420 |
Merity where he came up with this idea. Did he think of it? He was like, I think I thought 01:08:40.840 |
of it, but it seemed so obvious that I bet I didn't think of it, which is true of every 01:08:46.280 |
time I come up with an idea in deep learning, it always seems so obvious that you assume 01:08:49.640 |
somebody else has thought of it, but I think he thought of it. 01:08:54.860 |
So this is a nice thing to look at if you're trying to do something a bit unusual with 01:09:01.600 |
a data loader. It's like, okay, here's a simple kind of role model you can use as to creating 01:09:07.840 |
a data loader from scratch, something that spits out batches of data. So our language 01:09:16.200 |
model loader just took in all of the documents concatenated together along with the batch 01:09:23.960 |
Now generally speaking, we want to create a learner, and the way we normally do that 01:09:28.700 |
is by getting a model data object and by calling some kind of method which have various names, 01:09:34.360 |
but often we call that method getModel. And so the idea is that the model data object 01:09:39.920 |
has enough information to know what kind of model to give you. So we have to create that 01:09:45.720 |
model data object, which means we need that class, and so that's very easy to do. 01:09:55.860 |
So here are all of the pieces. We're going to create a custom learner, a custom model 01:09:59.900 |
data class and a custom model class. So a model data class, again, this one doesn't inherit 01:10:07.040 |
from anything, so you really see there's almost nothing to do. You need to tell it most importantly 01:10:14.440 |
what's your training set, give it a data loader, what's the validation set, give it a data 01:10:19.680 |
loader, and optionally give it a test set, plus anything else it needs to know. So it 01:10:29.040 |
might need to know the bptt, it needs to know the number of tokens, that's the vocab size, 01:10:38.240 |
it needs to know what is the padding index, and so that it can save temporary files and 01:10:45.360 |
models, model data always needs to know the path. And so we just grab all that stuff and 01:10:50.000 |
we dump it. And that's it, that's the entire initializer, there's no logic there at all. 01:10:55.920 |
So then all of the work happens inside get_model. And so get_model calls something we'll look 01:11:03.120 |
at later which just grabs a normal PyTorch nn.Module architecture and puts it on the GPU. 01:11:14.440 |
Note with PyTorch normally we would say .cuda. With fast.ai, it's better to say to_gpu. And 01:11:21.040 |
the reason is that if you don't have a GPU, it will leave it on the CPU, and it also provides 01:11:27.440 |
a global variable you can set to choose whether it goes on the GPU or not. So it's a better 01:11:35.520 |
So we wrap the model in a language model. And the language model is this. Basically 01:11:40.840 |
a language model is a subclass of basic model. It basically almost does nothing except it 01:11:48.820 |
defines layer groups. And so remember how when we do discriminative learning rates where 01:11:54.660 |
different layers have different learning rates, or we freeze different amounts, we don't provide 01:12:03.300 |
a different learning rate for every layer because there can be like a thousand layers. 01:12:07.680 |
We provide a different learning rate for every layer group. So when you create a custom model, 01:12:13.300 |
you just have to override this one thing which returns a list of all of your layer groups. 01:12:21.840 |
So in this case, my last layer group contains the last part of the model and one bit of 01:12:28.680 |
dropout, and the rest of it, this star here, means pull this apart. So this is basically 01:12:40.200 |
So that's all that is. And then finally, turn that into a learner. And so a learner you 01:12:47.520 |
just pass in the model and it turns it into a learner. In this case we have overridden 01:12:52.480 |
learner and the only thing we've done is to say I want the default loss function to be 01:12:59.160 |
cross-entropy. So this entire set of custom model, custom model data, custom learner all 01:13:07.960 |
fits on a single screen, and they always basically look like this. So that's a kind of little 01:13:15.040 |
dig inside this pretty boring part of the code base. 01:13:19.200 |
So the interesting part of this code base is getLanguageModel. GetLanguageModel is actually 01:13:24.200 |
the thing that gives us our AWD LSTM. And it actually contains the big idea, the big, incredibly 01:13:35.440 |
simple idea that everybody else here thinks it's really obvious, that everybody in the 01:13:40.280 |
NLP community I spoke to thought was insane, which is basically every model can be thought 01:13:47.720 |
of as a backbone plus a head, and if you pre-train the backbone and stick on a random head, you 01:14:00.120 |
And so these two bits of the code are literally right next to each other. There is this bit 01:14:08.520 |
of fastai.lm_rnn. Here's getLanguageModel. Here's getClassifier. getLanguageModel creates 01:14:18.000 |
an RNN encoder and then creates a sequential model that sticks on top of that a linear 01:14:24.200 |
decoder. Classifier creates an RNN encoder and then a sequential model that sticks on 01:14:30.160 |
top of that a pooling linear classifier. We'll see what these differences are in a moment, 01:14:35.440 |
but you get the basic idea. They're basically doing pretty much the same thing. They've 01:14:39.880 |
got this head and then they're sticking on a simple linear layer on top. 01:14:46.280 |
So it's worth digging in a little bit deeper and seeing what's going on here. Yes, Rich? 01:14:52.240 |
>> There was a question earlier about whether any of this translates to other languages. 01:14:59.080 |
>> Yeah, this whole thing works in any language you like. 01:15:02.800 |
>> I mean, would you have to retrain your language model on a corpus from that language? 01:15:12.920 |
>> So the wikitext-103-pre-trained-language-model knows English. You could use it maybe as 01:15:22.080 |
a pre-trained start for a French or German model. Start by retraining the embedding layer 01:15:27.840 |
from scratch. Might be helpful. Chinese, maybe not so much. But given that a language model 01:15:35.560 |
can be trained from any unlabeled documents at all, you'd never have to do that. Because 01:15:42.280 |
almost every language in the world has plenty of documents. You can grab newspapers, web 01:15:51.120 |
pages, parliamentary records, whatever. As long as you've got a few thousand documents 01:15:59.520 |
showing somewhat normal usage of that language, you can create a language model. 01:16:04.640 |
And so I know some of our students, one of our students, whose name I'll have to look 01:16:09.280 |
up afterwards, very embarrassing, tried this approach for Thai. He said the first 01:16:16.600 |
model he built easily beat the previous state-of-the-art Thai classifier. For those of you 01:16:24.160 |
that are international fellows, this is an easy way for you to whip out a paper in which 01:16:31.440 |
you either create the first ever classifier in your language or beat everybody else's 01:16:36.160 |
classifier in your language and then you can tell them that you've been a student of deep 01:16:41.080 |
learning for six months and piss off all the academics in your country. 01:16:47.160 |
So here's our RNN encoder. It's just a standard nn.Module. Most of the text in it is actually 01:16:57.280 |
just documentation, as you can see. It looks like there's more going on in it than there 01:17:03.280 |
actually is, but really all there is is we create an embedding layer, we create an LSTM 01:17:09.640 |
for each layer that's been asked for, and that's it. Everything else in it is dropout. 01:17:19.520 |
Basically all of the interesting stuff in the AWD LSTM paper is all of the places you 01:17:25.320 |
can put dropout. And then the forward is basically the same thing, right? It's call the embedding 01:17:35.240 |
layer, add some dropout, go through each layer, call that RNN layer, append it to our list 01:17:44.960 |
of outputs, add dropout, and that's about it. 01:17:54.320 |
So it's really pretty straightforward. The paper you want to be reading, as I've mentioned, 01:18:05.020 |
is the AWD LSTM paper, which is this one here, regularizing and optimizing LSTM language 01:18:10.440 |
models, and it's well-written and pretty accessible and entirely implemented inside FastAI as 01:18:20.920 |
well, so you can see all of the code for that paper. And like a lot of the code is shamelessly 01:18:29.240 |
plagiarized with Stephen's permission from his excellent GitHub repo, AWD LSTM, and the 01:18:36.880 |
process of which I fixed some of his bugs as well. I even told him about them. 01:18:46.920 |
So I'm talking increasingly about "please read the papers", so here's the paper, "please 01:18:51.320 |
read this paper", and it refers to other papers. So for things like why is it that the encoder 01:19:00.960 |
weight and the decoder weight are the same? Well, it's because there's this thing called 01:19:10.720 |
"tie_weights", this is inside that get_language model, there's a thing called "tie_weights", 01:19:21.040 |
it defaults to true, and if it's true then we literally use the same weight matrix for 01:19:32.280 |
the encoder and the decoder. So they're literally pointing at the same block of memory. And 01:19:39.160 |
so why is that? What's the result of it? That's one of the citations in Stephen's paper, which 01:19:44.920 |
is also a well-written paper, you can go and look up and learn about weight tying. 01:19:53.040 |
So we have basically a standard RNN, the only way it's not standard is it's just got lots 01:19:57.960 |
more types of dropout in it, and then a sequential model, on top of that we stick a linear decoder, 01:20:06.600 |
which is literally half the screen of code. It's got a single linear layer, we initialize 01:20:15.320 |
the weights to some range, we add some dropout, and that's it. So we've got an RNN, on top 01:20:25.040 |
of that we stick a linear layer with dropout and we're finished. So that's the language 01:20:29.880 |
model. So what dropout you choose matters a lot, and through a lot of experimentation 01:20:46.000 |
I found a bunch of dropouts -- you can see here we've got each of these corresponds to 01:20:51.840 |
a particular argument -- a bunch of dropouts that tend to work pretty well for language 01:20:56.480 |
models. But if you have less data for your language model, you'll need more dropout. If 01:21:06.680 |
you have more data, you can benefit from less dropout, you don't want to regularize more 01:21:11.200 |
than you have to. Rather than having to tune every one of these 5 things, my claim is they're 01:21:19.000 |
already pretty good ratios to each other, so just tune this number. I just multiply 01:21:24.000 |
it all by something. So there's really just one number you have to tune. If you're overfitting, 01:21:33.480 |
then you'll need to increase this number. If you're underfitting, you'll need to decrease 01:21:37.040 |
this number. Other than that, these ratios actually seem pretty good. 01:21:45.640 |
So one important idea which may seem pretty minor, but again it's incredibly controversial, 01:21:55.000 |
is that we should measure accuracy when we look at a language model. So normally in language 01:22:01.500 |
models we look at this loss value, which is just cross-entropy loss, but specifically 01:22:08.680 |
where you nearly always take e^ of that, which the NLP community calls perplexity. 01:22:22.240 |
There's a lot of problems with comparing things based on cross-entropy loss. I'm not sure 01:22:29.120 |
I've got time to go into it in detail now, but the basic problem is that it's kind of 01:22:35.400 |
like that thing we learned about focal loss. Cross-entropy loss, if you're right, it wants 01:22:40.240 |
you to be really confident that you're right. So it really penalizes a model that doesn't 01:22:46.720 |
kind of say, I'm so sure this is wrong, whereas accuracy doesn't care at all about how confident 01:22:52.360 |
you are, it just cares about whether you're right. And this is much more often the thing 01:22:56.520 |
which you care about in real life. So this accuracy is how often do we guess the next 01:23:02.760 |
word correctly. And I just find that a much more stable number to keep track of. 01:23:14.720 |
So we trained for a while, and we get down to a 3.9 cross-entropy loss, and if you go 01:23:32.160 |
e^3.9, which is about 50, that kind of gives you a sense of what's happened with language models. If you look 01:23:45.840 |
at academic papers from about 18 months ago, you'll see them talking about state-of-the-art 01:23:54.760 |
perplexities of over 100. The rate at which our ability to kind of understand language, 01:24:04.440 |
and I think measuring language model accuracy or perplexity is not a terrible proxy for 01:24:11.440 |
understanding language. If I can guess what you're going to say next, I pretty much need 01:24:16.640 |
to understand language pretty well, and also the kind of things you might talk about pretty 01:24:20.480 |
well. So this number has just come down so much. It's been amazing. NLP in the last 12 01:24:29.160 |
to 18 months. And it's going to come down a lot more. It really feels like 2011-2012 computer 01:24:35.960 |
vision. We're just starting to understand transfer learning and fine-tuning, and these 01:24:44.880 |
So everything you thought about what NLP can and can't do is very rapidly going out of date. 01:24:53.920 |
But there's still lots of stuff NLP is not good at, to be clear. Just like in 2012 there 01:24:58.560 |
was lots of stuff computer vision wasn't good at. But it's changing incredibly rapidly, 01:25:03.420 |
and now is a very, very good time to be getting very, very good at NLP or starting start-ups 01:25:10.340 |
based on NLP because there's a whole bunch of stuff which computers were absolutely shit 01:25:15.120 |
at two years ago, and now are not quite as good as people, and then next year they'll 01:25:25.140 |
Two questions. One, what is your ratio of paper reading versus coding in a week? 01:25:35.000 |
What do you think, Rachel? You see me. I mean, it's a lot more coding, right? 01:25:39.000 |
It's a lot more coding. I feel like it also really varies from week to week. I feel like 01:25:44.320 |
Like with that bounding box stuff, there was all these papers and no map through them, 01:25:54.040 |
and so I didn't even know which one to read first, and then I'd read the citations and 01:25:58.200 |
didn't understand any of them. So there was a few weeks of just kind of reading papers 01:26:02.600 |
before I even knew what to start coding. That's unusual though. Most of the time, I don't 01:26:10.560 |
know, any time I start reading a paper, I'm always convinced that I'm not smart enough 01:26:15.120 |
to understand it, always, regardless of the paper, and somehow eventually I do. But yeah, 01:26:26.880 |
And then the second question, is your dropout rate the same through the training or do you 01:26:34.680 |
I'll just say one more thing about the last bit, which is very often, like the vast majority, 01:26:42.080 |
nearly always, after I've read a paper, even after I've read the bit that says this is 01:26:49.920 |
the problem I'm trying to solve, I'll kind of stop there and try to implement something 01:26:54.080 |
that I think might solve that problem, and then I'll go back and read the paper and I'll 01:26:57.600 |
read little bits about how I solve these problem bits, and I'll be like, oh that's a good idea, 01:27:04.120 |
And so that's why, for example, I didn't actually implement SSD. My custom head is not the same 01:27:11.320 |
as their head. It's because I kind of read the gist of it and then I tried to create 01:27:15.560 |
something best as I could and then go back to the papers and try to see why. So by the 01:27:20.960 |
time I got to the focal loss paper, I was driving myself crazy with how come I can't 01:27:28.520 |
find small objects, how come it's always predicting background, and I read the focal loss paper 01:27:33.600 |
and I was like, that's why! It's so much better when you deeply understand the problem they're 01:27:42.480 |
trying to solve. And I do find the vast majority of the time, by the time I read that bit of 01:27:46.800 |
the paper which is like solving the problem, I'm then like, yeah but these three ideas I 01:27:51.720 |
came up with, they didn't try. And you suddenly realize that you've got new ideas. Or else 01:27:57.040 |
if you just implement the paper mindlessly, you tend not to have these insights about 01:28:10.120 |
Varying dropout is really interesting and there are some recent papers actually that 01:28:15.080 |
suggest gradually changing dropout and it was either a good idea to gradually make it 01:28:21.600 |
smaller or to gradually make it bigger. I'm not sure which. Maybe one of us can try and 01:28:29.200 |
find it during the week. I haven't seen it widely used. I tried it a little bit with 01:28:34.280 |
the most recent paper I wrote and I had some good results. I think I was gradually making 01:28:45.720 |
And then the next question is, "Am I correct in thinking that this language model is built 01:28:50.000 |
on word embeddings? Would it be valuable to try this with phrase or sentence embeddings?" 01:28:56.120 |
I asked this because I saw from Google the other day universal sentence encoder. 01:29:02.360 |
Yeah, this is much better than that. Do you see what I mean? This is not just an embedding 01:29:07.480 |
of a sentence, this is an entire model. An embedding by definition is like a fixed thing. 01:29:16.920 |
I think they're asking, they're saying that this language, well the first question is, 01:29:21.920 |
is this language model built on word embeddings? 01:29:24.480 |
Right, but it's not saying, a sentence or a phrase embedding is always a model that 01:29:32.160 |
creates that. We've got a model that's like trying to understand language, it's not just 01:29:39.000 |
a phrase, it's not just a sentence, it's a document in the end and it's not just an embedding, 01:29:46.960 |
So this has been a huge problem with NLP for years now is this attachment they have to 01:29:54.120 |
embeddings. So even the paper that the community has been most excited about recently from 01:30:00.280 |
AI2, the Allen Institute, called ELMO, and they found much better results across lots 01:30:07.840 |
of models. But again, it was an embedding. They took a fixed model and created a fixed 01:30:12.720 |
set of numbers which they then fed into a model. But in computer vision, we've known 01:30:19.080 |
for years that that approach of having a fixed set of features, they're called hypercolumns 01:30:26.800 |
in computer vision. People stopped using them like 3 or 4 years ago because fine-tuning 01:30:37.640 |
So for those of you that have spent quite a lot of time with NLP and not much time with 01:30:42.040 |
computer vision, you're going to have to start relearning. All that stuff you have been told 01:30:48.600 |
about this idea that there are these things called embeddings and that you learn them 01:30:53.800 |
ahead of time, and then you apply these fixed things, whether it be word level or phrase 01:31:00.120 |
level or whatever level, don't do that. You want to actually create a pre-trained model 01:31:06.840 |
and fine-tune it end to end. You'll see some specific results. 01:31:16.800 |
For using accuracy instead of perplexity as a metric for the model, could we work that 01:31:26.920 |
into the loss function rather than just use it as a metric? 01:31:30.080 |
No, you never want to do that whether it be computer vision or NLP or whatever. It's too 01:31:34.120 |
bumpy. So cross-entropy is fine as a loss function. And I'm not saying instead of I 01:31:41.040 |
use it in addition, I think it's good to look at the accuracy and to look at the cross-entropy. 01:31:47.480 |
But for your loss function, you need something nice and smooth. Accuracy doesn't work very 01:31:54.480 |
You'll see there's two different versions of save. There's save and save encoder. Save 01:32:00.040 |
saves the whole model as per usual. Save encoder saves just that bit. In other words, in the 01:32:11.520 |
sequential model, it saves just that bit and not that bit. In other words, this bit, which 01:32:18.340 |
is the bit that actually makes it into a language model, we don't care about in the classifier, 01:32:23.520 |
we just care about that bit. So let's now create the classifier. I'm going to go through this 01:32:34.280 |
bit pretty quickly because it's the same. But when you go back during the week and look 01:32:38.120 |
at the code, convince yourself it's the same. We do pd.read_csv again, tokenize again, 01:32:43.880 |
getAll again, save those tokens again. We don't create a new I2S vocabulary. We obviously 01:32:52.900 |
want to use the same vocabulary we had in the language model because we're about to reload 01:32:58.540 |
the same encoder. Same default dict, same way of creating our numericalized list, which 01:33:08.060 |
as per before we can save. So that's all the same. Later on we can reload those rather 01:33:17.000 |
So all of our hyperparameters are the same. We can change the dropout. Same optimizer function. 01:33:31.120 |
Pick a batch size that's as big as you can fit without running out of memory. This bit's 01:33:38.760 |
a bit interesting. There's some fun stuff going on here. The basic idea here is that 01:33:50.000 |
for the classifier we do really want to look at a document. We need to say is this document 01:33:57.040 |
positive or negative. So we do want to shuffle the documents because we like to shuffle things. 01:34:05.480 |
But those documents are different lengths, so if we stick them all into one batch -- this 01:34:11.960 |
is a handy thing that fastAI does for you -- you can stick things at different lengths 01:34:15.360 |
into a batch and it will automatically pad them, so you don't have to worry about that. 01:34:20.920 |
But if they're wildly different lengths, then you're going to be wasting a lot of computation 01:34:25.160 |
times. There might be one thing there that's 2,000 words long and everything else is 50 01:34:29.240 |
words long and that means you end up with a 2,000-wide tensor. That's pretty annoying. 01:34:36.000 |
So James Bradbury, who's actually one of Stephen Merity's colleagues and the guy who came up 01:34:41.480 |
with TorchText, came up with an idea which was let's sort the dataset by length-ish. 01:34:55.120 |
So kind of make it so the first things in the list are on the whole, shorter than the 01:35:03.160 |
things at the end, but a little bit random as well. 01:35:14.820 |
So the first thing we need is a dataset. So we have a dataset passing in the documents 01:35:24.880 |
and their labels. And so here's a text dataset and it inherits from dataset. Here is dataset 01:35:31.800 |
from PyTorch. And actually, dataset doesn't do anything at all. It says you need to get 01:35:38.820 |
item if you don't have one, you're going to get an error, you need a length if you don't 01:35:42.560 |
have one, you're going to get an error. So this is an abstract class. 01:35:48.640 |
So we're going to pass in our x, we're going to pass in our y, and getItem is going to 01:35:54.640 |
grab the x and grab the y and return them. It couldn't be much simpler. Optionally, it 01:36:02.920 |
could reverse it. Optionally it could stick an end of stream at the end. Optionally it 01:36:06.400 |
could stick a start of stream at the beginning. We're not doing any of those things. So literally 01:36:09.640 |
all we're doing is putting in an x, putting in a y, and then grab an item, we're returning 01:36:14.100 |
the x and the y as a tuple. And the length is how long the x array is. So that's all 01:36:22.200 |
the dataset is. Something with a length that you can index. 01:36:27.920 |
So to turn it into a data loader, you simply pass the dataset to the data loader constructor, 01:36:34.300 |
and it's now going to go ahead and give you a batch of that at a time. Normally you can 01:36:39.560 |
say shuffle=true or shuffle=false, it will decide whether to randomize it for you. In 01:36:44.920 |
this case though, we're actually going to pass in a sampler parameter. The sampler is 01:36:50.920 |
a class we're going to define that tells the data loader how to shuffle. So for the validation 01:36:59.120 |
set, we're going to define something that actually just sorts it. It just deterministically 01:37:04.440 |
sorts it so all the shortest documents will be at the start, all the longest documents 01:37:09.840 |
will be at the end, and that's going to minimize the amount of padding. 01:37:13.720 |
For the training sampler, we're going to create this thing I call a sort-ish sampler, which 01:37:22.000 |
also sorts-ish. So this is where I really like PyTorch is that they came up with this 01:37:31.600 |
idea for an API for their data loader where we can hook in new classes to make it behave 01:37:38.280 |
in different ways. So here's a sort-sampler, it's simply something which again has a length, 01:37:46.320 |
which is the length of the data source, and it has an iterator, which is simply an iterator 01:37:52.160 |
which goes through the data source sorted by length of the key, and I pass in as the 01:38:02.080 |
key lambda function which returns the length. 01:38:10.280 |
And so for the sort-ish sampler, I won't go through the details, but it basically does 01:38:16.040 |
the same thing with a little bit of randomness. So it's just another of these beautiful little 01:38:24.960 |
design things in PyTorch that I discovered. I could take James Bradbury's ideas, which 01:38:31.760 |
he had written a whole new set of classes around, and I could actually just use the 01:38:37.760 |
inbuilt hooks inside PyTorch. You will notice that it's not actually PyTorch's data loader, 01:38:46.700 |
it's actually FastAI's data loader, but it's basically almost entirely plagiarized from 01:38:51.600 |
PyTorch but customized in some ways to make it faster, mainly by using multithreading instead of multiprocessing. 01:38:58.520 |
Does the pre-trained LSTM depth and bptt need to match with the new one we are training? 01:39:07.520 |
No, the bptt doesn't need to match at all. That's just like how many things do we look 01:39:11.620 |
at at a time, it's got nothing to do with the architecture. 01:39:16.640 |
So now we can call that function we just saw before, getRNNClassifier. It's going to create 01:39:22.200 |
exactly the same encoder, more or less, and we're going to pass in the same architectural 01:39:28.720 |
details as before. But this time, the head that we add on, you've got a few more things 01:39:37.200 |
you can do. One is you can add more than one hidden layer. So this layer here says this 01:39:43.800 |
is what the input to my classifier section, my head, is going to be. This is the output 01:39:51.440 |
of the first layer, this is the output of the second layer, and you can add as many 01:39:55.720 |
as you like. So you can basically create a little multi-layered neural net classifier 01:40:00.240 |
at the end. And so ditto, these are the dropouts to go after each of these layers. And then 01:40:08.200 |
here are all of the AWD LSTM dropouts, which we're going to basically plagiarize that idea 01:40:13.780 |
for our classifier. We're going to use the RNN learner, just like before. We're going 01:40:21.860 |
to use discriminative learning rates for different layers. You can try using weight decay or not, 01:40:31.640 |
I've been fiddling around a bit with that to see what happens. And so we start out just 01:40:37.240 |
training the last layer and we get 92.9% accuracy, then we unfreeze one more layer, get 93.3 accuracy, 01:40:47.760 |
and then we fine-tune the whole thing. And after 3 epochs, so this was kind of the 01:41:07.120 |
main attempt before our paper came along at using a pre-trained model. And what they did 01:41:14.800 |
is they used a pre-trained translation model. But they didn't fine-tune the whole thing, 01:41:25.460 |
they just took the activations of the translation model. And when they tried IMDB, they got 91.8% 01:41:47.220 |
which we beat easily after only fine-tuning one layer. They weren't state-of-the-art there, 01:41:57.700 |
the state-of-the-art is 94.1, which we beat after fine-tuning the whole thing for 3 epochs. 01:42:07.300 |
And so by the end, we're at 94.8, which is obviously a huge difference because in terms 01:42:15.460 |
of error rate, that's gone down from 5.9 to 5.2, and then I'll tell you a simple little trick. Go 01:42:22.400 |
back to the start of this notebook, and reverse the order of all of the documents, and then 01:42:31.280 |
rerun the whole thing. And when you get to the bit that says wt103, replace this fwd 01:42:41.220 |
for forward with bwd for backward. That's a backward English language model that learns 01:42:47.220 |
to read English backwards. So if you redo this whole thing, put all the documents in reverse, 01:42:54.420 |
and change this to backward, you now have a second classifier which classifies things 01:42:59.300 |
by positive or negative sentiment based on the reverse document. If you then take the 01:43:07.740 |
two predictions and take the average of them, you basically have a bidirectional model that 01:43:13.020 |
you've trained each bit separately. That gets you to 95.4% accuracy. 01:43:22.900 |
So this kind of 20% reduction in error relative to the state-of-the-art is almost unheard of. You have to go back 01:43:32.020 |
to Geoffrey Hinton's ImageNet computer vision result, where they chopped 30% off the state-of-the-art error. 01:43:39.380 |
It doesn't happen very often. So you can see this idea of just use transfer learning is 01:43:47.880 |
ridiculously powerful, but every new field thinks their new field is too special and 01:43:55.140 |
you can't do it. So it's a big opportunity for all of us. 01:44:02.980 |
So we turned this into a paper, and when I say we, I did it with this guy, Sebastian 01:44:07.420 |
Ruder. You might remember his name because in lesson 5 I told you that I actually had 01:44:14.180 |
shared lesson 4 with Sebastian because I think he's an awesome researcher who I thought might 01:44:20.060 |
like it. I didn't know him personally at all. And much to my surprise, he actually watched 01:44:27.100 |
the damn video. I was like, what NLP researcher is going to watch some beginner's video? He 01:44:33.900 |
watched the whole video and he was like, that's actually quite fantastic. Well, thank you 01:44:38.740 |
very much, that's awesome coming from you. And he said, hey, we should turn this into 01:44:44.580 |
a paper. And I said, I don't write papers, I don't care about papers, I'm not interested 01:44:50.700 |
in papers, that sounds really boring. And he said, okay, how about I write the paper 01:44:58.100 |
for you? And I said, you can't really write a paper about this yet because you'd have 01:45:04.780 |
to do studies to compare it to other things, they're called ablation studies to see which 01:45:08.500 |
bits actually work. There's no rigor here, I just put in everything that came in my head 01:45:12.780 |
and chucked it all together and it happened to work. And it's like, okay, what if I write 01:45:17.380 |
all the paper and do all the ablation studies, then can we write the paper? And I said, well, 01:45:23.740 |
it's like a whole library that I haven't documented and I'm not going to yet and you don't know 01:45:31.060 |
how it all works. He said, okay, if I write the paper and do the ablation studies and 01:45:35.300 |
figure out from scratch how the code works without bothering you, then can we write the 01:45:38.860 |
paper? I was like, yeah, if you did all those things, you can write the paper. And he was 01:45:48.740 |
like, okay. And so then two days later he comes back and he says, okay, I've done a 01:45:51.580 |
draft with the paper. So I share this story to say like, if you're some student in Ireland 01:46:02.700 |
and you want to do good work, don't let anybody stop you. I did not encourage him to say the 01:46:10.940 |
least. But in the end he was like, look, I want to do this work, I think it's going to 01:46:16.300 |
be good and I'll figure it out. And he wrote a fantastic paper and he did the ablation 01:46:22.420 |
studies and he figured out how fast AI works and now we're planning to write another paper 01:46:27.420 |
together. You've got to be a bit careful because sometimes I get messages from random people 01:46:36.300 |
saying like, I've got lots of good ideas, can we have coffee? I can have coffee at my 01:46:43.980 |
office any time, thank you. But it's very different to say like, hey, I took your ideas 01:46:49.660 |
and I wrote a paper and I did a bunch of experiments and I figured out how your code works. I added 01:46:53.700 |
documentation to it, should we submit this to a conference? Do you see what I mean? There's 01:47:02.300 |
nothing to stop you doing amazing work and if you do amazing work that helps somebody 01:47:08.660 |
else, like in this case, I'm happy that we have a paper. I don't deeply care about papers, 01:47:15.700 |
but I think it's cool that these ideas now have this rigorous study. Let me show you 01:47:20.220 |
what he did. He took all my code, so I'd already done all the fastai.text stuff and so on. 01:47:29.580 |
As you've seen, it lets us work with large corpuses. Sebastian is fantastically well 01:47:36.660 |
read and he said here's a paper that Yann LeCun and guys just came out with where they tried 01:47:41.620 |
lots of different classification data sets, so I'm going to try running your code on all 01:47:46.500 |
these data sets. These are the data sets. Some of them had many, many hundreds of thousands 01:47:52.940 |
of documents and they were far bigger than anything I had tried, but I thought it should 01:47:57.620 |
work. He had a few good little ideas as we went along and so you should totally make 01:48:07.980 |
sure you read the paper. He said this thing that you called in the lessons differential 01:48:18.100 |
learning rates, differential means something else. Maybe we should rename it. It's now called 01:48:25.100 |
discriminative learning rates. This idea that we had from Part 1 where we used different 01:48:29.620 |
learning rates for different layers, after doing some literature research, it does seem 01:48:34.940 |
like that hasn't been done before so it's now officially a thing, discriminative learning 01:48:41.540 |
So all these ideas, this is something we learned in Lesson 1. It now has an equation with Greek 01:48:46.740 |
and everything. When you see an equation with Greek and everything, that doesn't necessarily 01:48:52.300 |
mean it's more complex than anything we did in Lesson 1 because this one isn't. Again, 01:48:57.420 |
that idea of unfreezing a layer at a time also seems to have never been done before, 01:49:03.540 |
so it's now a thing and it's got the very clever name gradual unfreezing. 01:49:11.180 |
So then, long promised, we're going to look at this, slanted triangular learning rates. 01:49:19.860 |
So this actually was not my idea. Leslie Smith, one of my favorite researchers who you all 01:49:25.780 |
now know about, emailed me a while ago and said I'm so over cyclical learning 01:49:31.780 |
rates, I don't do that anymore, I now do a slightly different version where I have one 01:49:35.280 |
cycle which goes up quickly at the start and then comes down slowly afterwards. And he said 01:49:40.900 |
I often find it works better, I tried going back over all of my old data sets and it worked 01:49:48.060 |
So this is what the learning rate looks like. You can use it in fastAI just by adding 01:49:53.540 |
use_clr= to your fit call. This first number is the ratio between the highest learning rate and 01:50:01.100 |
the lowest learning rate. So here this is 1/32 of that. The second number is the ratio 01:50:07.880 |
between the first peak and the last peak. And so the basic idea is if you're doing a cycle 01:50:15.340 |
length 10 and you want the first epoch to be the upward bit and the other 9 epochs to 01:50:23.700 |
be the downward bit, then you would use 10. And I find that works pretty well, that was 01:50:28.660 |
also Leslie's suggestion, make about 1/10 of it the upward bit and about 9/10 the downward bit. 01:50:36.940 |
Since he told me about it, maybe two days ago, he wrote this amazing paper, a disciplined 01:50:43.880 |
approach to neural network hyperparameters, in which he described something very slightly 01:50:49.440 |
different to this again, but the same basic idea. This is a must-read paper. It's got all 01:50:57.220 |
the kinds of ideas that fastAI talks about a lot in great depth, and nobody else is talking 01:51:05.100 |
about this stuff. It's kind of a slog, unfortunately Leslie had to go away on a trip before he 01:51:12.020 |
really had time to edit it properly, so it's a little bit slow reading, but don't let that 01:51:19.740 |
So this triangle, this is the equation from my paper with Sebastian. Sebastian was like, 01:51:24.220 |
"Jeremy, can you send me the math equation behind that code you wrote?" And I was like, 01:51:29.100 |
"No, I just wrote the code, I could not turn it into math." So he figured out the math 01:51:37.140 |
So you might have noticed the first layer of our classifier was equal to embedding size 01:51:47.820 |
times 3. Why times 3? Times 3 because, and again this seems to be something which people 01:51:54.960 |
haven't done before, so a new idea, concat pooling, which is that we take the average 01:52:04.460 |
pooling over the sequence of the activations, the max pooling of the sequence over the activations, 01:52:10.940 |
and the final set of activations and just concatenate them all together. 01:52:14.820 |
Again, this is something which we talked about in Part 1, but it doesn't seem to be in the 01:52:20.940 |
literature before, so it's now called concat pooling, and again it's now got an equation 01:52:25.940 |
and everything, but this is the entirety of the implementation. Pool with average, pool 01:52:32.580 |
with max, concatenate those two along with the final sequence. 01:52:38.460 |
So you can go through this paper and see how the fastai code implements each piece. 01:52:47.100 |
So then, to me one of the kind of interesting pieces is the difference between RNN encoder, 01:52:55.180 |
which you've already seen, and multibatch RNN encoder. So what's the difference there? 01:53:00.780 |
So the key difference is that the normal RNN encoder for the language model, we could just 01:53:05.900 |
do a bptt-sized chunk at a time, no problem, and predict the next word. 01:53:16.420 |
But for the classifier, we need to do the whole document. We need to do the whole movie 01:53:22.300 |
review before we decide if it's positive or negative. And the whole movie review can easily 01:53:26.780 |
be 2000 words long, and I can't fit 2000 words worth of gradients in my GPU memory for every 01:53:37.700 |
single one of my activations -- sorry, for every one of my weights. So what do I do? 01:53:44.700 |
And so the idea was very simple, which is I go through my whole sequence length one 01:53:52.140 |
batch of bptt at a time, and I call super.forward, so in other words the RNN encoder, to grab 01:54:07.060 |
And then I've got this maximum sequence length parameter where it says, okay, as long as you're 01:54:17.260 |
doing no more than that sequence length, then start appending it to my list of outputs. 01:54:25.380 |
So in other words, the thing that it sends back to this pooling is only as many activations 01:54:37.740 |
as we've asked it to keep. And so that way you can basically figure out what 01:54:45.540 |
max_seq your particular GPU can handle. 01:54:51.940 |
So it's still using the whole document, but let's say max_seq is 1000 words, and your longest 01:54:59.540 |
document length is 2000 words. Then it's still going through the RNN creating state for those 01:55:05.700 |
first 1000 words, but it's not actually going to store the activations for the backprop 01:55:14.180 |
for the first 1000, it's only going to keep the last 1000. 01:55:17.500 |
So that means that it can't backprop the loss back to any state that was created in the 01:55:31.680 |
So it's a really simple piece of code, and honestly when I wrote it, I didn't spend much 01:55:39.500 |
time thinking about it, it seems so obviously the only way that this could possibly work. 01:55:44.500 |
But again, it seems to be a new thing, so we now have backprop through time for text classification. 01:55:50.420 |
So you can see there's lots of little pieces in this paper. 01:55:59.020 |
So the result was on every single dataset we tried, we got a better result than any 01:56:11.460 |
So IMDB, TREC-6, AG News, DBpedia, Yelp, all different types. 01:56:20.820 |
And honestly IMDB was the only one I spent any time trying to optimize the model, so 01:56:25.660 |
like most of them we just did it like whatever came out first, so if we actually spent time 01:56:33.380 |
And the things that these are comparing to, most of them are, you'll see they're different 01:56:40.180 |
on each table because they're optimized, these are like customized algorithms on the whole. 01:56:45.500 |
So this is saying one simple fine-tuning algorithm can beat these really customized algorithms. 01:56:56.420 |
And so here's one of the really cool things that Sebastian did with his ablation studies, 01:57:02.580 |
which is I was really keen that if we were going to publish a paper we had to say why 01:57:08.980 |
So Sebastian went through and tried removing all of those different contributions I mentioned. 01:57:22.340 |
What if we don't use discriminative learning rates? 01:57:24.860 |
What if instead of discriminative learning rates we use cosine annealing? 01:57:28.900 |
What if we don't do any pre-training with Wikipedia? 01:57:40.580 |
And the really interesting one to me was what's the validation error rate on IMDB if we only 01:57:46.980 |
use 100 training examples versus 200 versus 500? 01:57:50.940 |
And you can see, very interestingly, the full version of this approach is nearly as accurate 01:58:01.140 |
on just 100 training examples, like it's still very accurate versus 20,000 training examples. 01:58:09.460 |
Whereas if you're training from scratch on 100, it's almost random. 01:58:14.540 |
So it's what I expected, as I kind of said to Sebastian. 01:58:18.660 |
I really think this is most beneficial when you don't have much data, and this is like 01:58:23.940 |
where FastAI is most interested in contributing, small data regimes, small compute regimes 01:58:33.100 |
So I want to show you a couple of tricks as to how you can run these kinds of studies. 01:58:42.940 |
The first trick is something which I know you're all going to find really handy. 01:58:49.060 |
I know you've all been annoyed when you're running something in a Jupyter notebook and 01:58:52.620 |
you lose your internet connection for long enough that it decides you've gone away and 01:58:57.740 |
then your session disappears and you have to start it again from scratch. 01:59:05.780 |
There's a very simple cool thing called VNC, where basically you can install on your AWS 01:59:13.460 |
instance or Paperspace or whatever: X Windows, a lightweight window manager, a VNC server, 01:59:28.500 |
Tack these lines onto the end of your VNC xstartup configuration file, and then run this command. 01:59:38.700 |
It's now running a server where you can then run a VNC viewer on your computer, 01:59:59.180 |
Specifically, what you do is you use SSH port forwarding to forward port 5913 to localhost 5913. 02:00:13.740 |
And so then you connect to port 5913 on localhost, send it off to port 5913 on your server, which 02:00:25.460 |
is the VNC port because you said colon 13 here, and it will display an xWindows desktop. 02:00:32.460 |
And then you can click on the Linux start like button and click on Firefox, and you 02:00:37.420 |
now have Firefox, and you'll see here in Firefox it says localhost because this Firefox is running on the server itself. 02:00:47.780 |
So you now run Firefox, you start your thing running, and then you close your VNC viewer, 02:00:53.700 |
remembering that Firefox is like displaying on this virtual VNC display, not on a real display. 02:01:00.660 |
And so then later on that day, you log back into VNC viewer and it pops up again, so it's 02:01:08.300 |
And it's shockingly fast, it works really well. 02:01:14.020 |
And there's lots of different VNC servers and clients and whatever, but this one worked 02:01:19.860 |
So you can see here I connect to localhost 5913. 02:01:34.960 |
So I ended up creating a little Python script for Sebastian to say this is the basic steps 02:01:39.960 |
you need to do, and now you need to create different versions for everything else, and 02:01:43.300 |
I suggested to him that he tried using this thing called Google Fire. 02:01:47.140 |
What Google Fire does is you create a function with shitloads of parameters. 02:01:53.100 |
And so these are all the things that Sebastian wanted to try doing. 02:01:56.100 |
Different dropout amounts, different learning rates, do I use pre-training or not, do I 02:02:00.380 |
use CLR or not, do I use discriminative learning rates or not, do I go backwards or not, blah 02:02:06.940 |
So you create a function, and then you add something saying if name equals main, fire.fire, 02:02:11.900 |
and the function name, you do nothing else at all. 02:02:14.580 |
You don't have to add any metadata, any docstrings, anything at all, and you then call that script 02:02:20.460 |
and automatically you now have a command line interface, and that's it. 02:02:27.060 |
So that's a super fantastic easy way to run lots of different variations in a terminal. 02:02:34.700 |
And this ends up being easier if you want to do lots of variations than using a notebook 02:02:40.500 |
because you can just have a bash script that tries all of them and spits them all out. 02:02:48.180 |
You'll find inside the courses/dl2 directory, there's now something called imdb_scripts, 02:02:58.040 |
and I've put there all of the scripts that Sebastian and I used. 02:03:02.780 |
So you'll see because we needed to tokenize every single dataset, we had to turn every 02:03:10.460 |
dataset and numericalize every dataset, we had to train a language model on every dataset, 02:03:15.220 |
we had to train and classify every dataset, we had to do all of those things in a variety 02:03:18.420 |
of different ways to compare them, we had a script for all of those things. 02:03:21.940 |
So you can check out and see all of the scripts that we used. 02:03:32.460 |
When you're doing a lot of scripts and stuff, you've got different code all over the place, 02:03:37.420 |
eventually it might get frustrating that you don't want to symlink your fastai library 02:03:43.340 |
again and again, but you probably don't want to pip-install it because that version tends 02:03:48.460 |
to be a little bit old, we move so fast you want to use the current version in git. 02:03:54.120 |
If you say pip install -e . from the fastai repo base, it does something quite neat which basically 02:04:03.780 |
creates a symlink to the fastai library inside your site packages directory. 02:04:15.540 |
Your site packages directory is like your main Python library. 02:04:20.900 |
And so if you do this, you can then access fastai from anywhere, but every time you do 02:04:28.060 |
git pull, you've got the most recent version. One downside of this is that it installs any 02:04:34.980 |
updated versions of packages from pip which can confuse conda a little bit. 02:04:42.120 |
So another alternative here is just to symlink the fastai library to your site packages library. 02:04:50.980 |
That works just as well. And then you can use fastai again from anywhere, and it's quite 02:04:57.740 |
handy when you want to run scripts that use fastai from different directories on your 02:05:07.420 |
So one more thing before we go, which is something you can try if you like. 02:05:17.660 |
You don't have to tokenize words. Instead of tokenizing words, you can tokenize what are called subword units. 02:05:29.140 |
And so for example, unsupervised could be tokenized as un + supervised. Tokenizer could 02:05:40.820 |
be tokenized as token + izer. And then you can do the same thing, the language model that 02:05:47.780 |
works on subword units, the classifier that works on subword units, etc. 02:05:55.740 |
So how well does that work? I started playing with it and with not too much playing, I was 02:06:04.060 |
getting classification results that were nearly as good as using word-level tokenization. 02:06:14.860 |
I suspect with more careful thinking and playing around, maybe I could have got as good or 02:06:21.060 |
better. But even if I couldn't, if you create a subword unit wiki text model, then IMDB model, 02:06:34.060 |
language model, and then classifier forwards and backwards for subword units, and then 02:06:39.340 |
ensemble it with the forwards and backwards word-level ones, you should be able to beat 02:06:46.220 |
So here's an approach you may be able to beat our state-of-the-art result. 02:06:52.340 |
Google has, as Sebastian told me about this particular project, Google has a project called 02:06:57.780 |
SentencePiece, which actually uses a neural net to figure out the optimal splitting up 02:07:05.900 |
of words, and so you end up with a vocabulary of subword units. In my playing around, I 02:07:12.700 |
found that creating a vocabulary of about 30,000 subword units seems to be about optimal. 02:07:19.940 |
So if you're interested, there's something you can try. It's a bit of a pain to install. 02:07:25.300 |
It's C++. It doesn't have great error messages. But it will work. There is a Python library 02:07:31.780 |
for it, and if anybody tries this, I'm happy to help them get it working. There's been 02:07:38.540 |
little if any experiments with ensembling subword and word-level stuff classification, 02:07:46.060 |
and I do think it should be the best approach. 02:07:48.620 |
Alright, thanks everybody. Have a great week and see you next Monday.