
Lesson 3: Deep Learning 2019 - Data blocks; Multi-label classification; Segmentation


Chapters

0:00
0:54 Machine Learning Course
2:27 Deployment
3:36 Examples of Web Apps
4:37 Guitar Classifier
5:49 American Sign Language
9:54 Satellite Images
11:01 Download the Data
13:53 Conda Installation
14:18 Unzip a 7-Zip File
17:09 The Data Block API
18:32 Dataset
21:35 Data Loader
24:10 Examples of Using the Data Block API
27:51 Object Detection Dataset
28:54 Data Block API Examples
31:25 Perspective Warping
32:21 Data Augmentation
33:46 Metrics
40:11 Partial Function Application
43:01 Fine-Tuning
47:38 What Resources Do You Recommend for Getting Started with Video?
51:44 Transfer Learning
57:59 Build a Segmentation Model
60:43 Reading the Learning Rate Finder Graph
65:53 Create a DataBunch
66:10 Split into Training and Validation
67:33 Transformations
72:34 What Does Accuracy Mean for Pixel-Wise Segmentation?
76:25 Segmentation
77:09 Convolutional Neural Network
79:03 Plot Losses
82:36 Decreasing the Learning Rate during Training
91:10 Mixed Precision Training
95:35 Image Points
96:59 Regression Model
97:12 Create an Image Regression Model
99:22 Cross-Entropy Loss
99:37 Mean Squared Error
102:48 Tokenization
104:05 Vocab
105:27 Language Model
110:35 Activation Function
112:32 Michael Nielsen
113:27 The Universal Approximation Theorem
117:27 N-grams
123:00 The Universal Approximation Theorem

Transcript

Welcome back to lesson three. So we're going to start with a quick correction, which is to let you know that when we referred to this chart as coming from Quora last week, we were correct. It did come from Quora, but actually, we realized originally it came from Andrew Ng's excellent machine learning course on Coursera.

So apologies for the incorrect citation. But in exchange, let's talk about Andrew Ng's excellent machine learning course on Coursera. It's really great, as you can see, people gave it 4.9 out of 5 stars. In some ways, it's a little dated, but a lot of the content really is as appropriate as ever, and taught in a more bottom-up style.

So it can be quite nice to combine Andrew's bottom-up style and our top-down style and meet somewhere in the middle. Also, if you're interested in more machine learning foundations, you should check out our machine learning course as well. If you go to course.fast.ai and click on the machine learning button, that will take you to our course, which is about twice as long as this deep learning course, and kind of takes you much more gradually through some of the foundational stuff around validation sets and model interpretation and how PyTorch tensors work and stuff like that.

So I think all these courses together, if you want to really dig deeply into the material, do all of them. I know a lot of people here have, and they end up saying they got more out of each one by doing the whole lot. Or you can skip backwards and forwards and see which one works for you.

So we started talking about deploying your web app last week. One thing that's gonna make life a lot easier for you is that on the course V3 website, there's a production section, where right now we have one platform, but more will be added by the time this video comes out, showing you how to deploy your web app really, really easily.

And when I say easily, for example, here's the how to deploy on Zeit guide created by San Francisco study group member, Navjot. As you can see, it's just a page. There's almost nothing to do, and it's free. It's not gonna serve 10,000 simultaneous requests, but it'll certainly get you started, and I found it works really well.

It's fast, and so deploying a model doesn't have to be slow or complicated anymore. And the nice thing is, you can kind of use this for an MVP. And if you do find you're starting to get 1,000 simultaneous requests, then you know that things are working out, and you can start to upgrade your instance types or add to a more traditional big engineering approach.

So if you actually use this starter kit, it will actually create my teddy bear finder for you. And this is an example of my teddy bear finder. So the idea is it's as simple as possible, this template. So you can fill in your own style sheets, your own custom logic, and so forth.

This is kind of designed to be a minimal thing, so you can see exactly what's going on. The back end is a simple kind of REST-style interface that sends back JSON. And the front end is a super simple little JavaScript thing. So yeah, it should be a good way to get a sense of how to build a web app which talks to a PyTorch model.

So examples of web apps people have built during the week. Edward Ross built the what car is that app? Or more specifically, the what Australian car is that. I thought it was kind of interesting that Edward said on the forum that the building of the app was actually a great experience in terms of understanding how the model works himself better.

And it's interesting that he's describing trying it out on his phone. A lot of people think like, if I want something on my phone, I have to create some kind of mobile TensorFlow, ONNX, whatever, tricky mobile app. You really don't. You can run it all in the cloud and make it just a web app, or use some kind of simple little GUI front end that talks to a REST back end.

It's not that often that you'll need to actually run stuff on the phone, so this is a good example of that. C. Werner has created a guitar classifier. You can decide whether your food is healthy or not. Apparently, this one is healthy. That can't be right. I would have thought a hamburger is more what we're looking for, but there you go.

Apparently, Trinidad and Tobago is the home of the hummingbird. So if you're visiting, you can find out what kind of hummingbird you're looking at. You can decide whether or not to eat a mushroom. If you happen to be one of the cousins of Charlie Harrington, you can now figure out who is who.

I believe this was actually designed for his fiance. It will even tell you about the interests of this particular cousin. So, a fairly niche application, but apparently, there are 36 people who will appreciate this at least. I have no cousins. That's a lot of cousins. This is an example of an app which actually takes a video feed and turns it into a motion classifier.

That's pretty cool. I like it. Team 26, good job. Here's a similar one for American Sign Language. And so it's not a big step from taking a single image model to taking a video model. You can just grab the occasional frame, put it through your model, and update the UI as the kind of model results come in.

So it's really cool that you can do this kind of stuff either in client or in browser nowadays. Henry Pluchy has built yourcityfrom.space, which he describes as creepy in how accurate it is. So here is where I live, which it figured out was in the United States. It's interesting, he describes here how he actually had to be very thoughtful about the validation set he built, to make sure that the satellite tiles were not overlapping or close to each other.

In doing so, he realized he had to download more data. But once he did, he got this amazingly effective model that can look at satellite imagery and figure out what country it's from. I thought this one was pretty interesting, which was doing univariate time series analysis by converting it into a picture using something I've never heard of, a Gramian angular field.

But he says he's getting close to state of the art results for univariate time series modeling by turning it into a picture. And so I like this idea of turning stuff that's not a picture into a picture. So something really interesting about this project, which was looking at emotion classification from faces, was that he was specifically asking the question, how well does it go without changing anything, just using the default settings?

Which I think is a really interesting experiment because we're all told it's really hard to train models and it takes a lot of specific knowledge. And actually we're finding that that's often not the case. And he looked at this facial expression recognition data set. There was a 2017 paper that he compared his results to, and he got equal or slightly better results than the state of the art paper on face recognition, emotion recognition without doing any custom hyperparameter tuning at all.

So that was really cool. And then Elena Harley, who I featured one of her works last week, has done another really cool work in the genomics space, which is looking at variant analysis, looking at false positives in these kinds of pictures. And she found she was able to decrease the number of false positives coming out of the kind of industry standard software she was using by 500% by using a deep learning workflow.

I think this is a nice example of something where if you're going through spending hours every day looking at something, in this case, looking at these images to get rid of the false positives, maybe you can make that a lot faster by using deep learning to do a lot of the work for you.

And again, this is an example of a computer vision based approach on something which initially wasn't actually images. So that's a really cool application. So really nice to see what people have been building in terms of both web apps, and just classifiers. What we're gonna do today is look at a whole lot more different types of model that you can build.

And we're gonna kind of zip through them pretty quickly. And then we're gonna go back and say, like, how did all these things work? What's the common denominator? But all of these things, you can create web apps from these as well. But you'll have to think about how to slightly change that template to make it work with these different applications.

I think that'll be a really good exercise in making sure you understand the material. So the first one we're gonna look at is a dataset of satellite images. And satellite imaging is a really fertile area for deep learning. There are certainly a lot of people already using deep learning with satellite imaging, but they're only scratching the surface.

And the dataset that we're gonna look at looks like this. It has satellite tiles, and for each one, as you can see, there's a number of different labels for each tile. One of the labels always represents the weather that's shown. So in this case, cloudy or partly cloudy. And then all of the other labels tell you any interesting features that are seen there.

So primary means primary rainforest. Agriculture means there's some farming, road means there's a road, and so forth. And so, as I'm sure you can tell, this is a little different to all the classifiers we've seen so far, cuz there's not just one label, there's potentially multiple labels. So multi-label classification can be done in a very similar way.

But the first thing we're gonna need to do is to download the data. Now this data comes from Kaggle. Kaggle is mainly known for being a competitions website. And it's really great to download data from Kaggle when you're learning, because you can see, how would I have gone in that competition?

And it's a good way to see whether you kind of know what you're doing. I tend to think the goal is to try and get in the top 10%. And in my experience, all the people in the top 10% of a competition really know what they're doing. So if you can get in the top 10%, then that's a really good sign.

Pretty much every Kaggle data set is not available for download outside of Kaggle, at least the competition data sets. So you have to download it through Kaggle. And the good news is that Kaggle provides a Python-based downloader tool, which you can use. So we've got a quick description here of how to download stuff from Kaggle.

So to install stuff, to download stuff from Kaggle, you first have to install the Kaggle download tool. So just pip install Kaggle. And so you can see what we tend to do when there's one off things to do, is we show you the commented out version in the notebook.

And you can just remove the comment. So here's a cool tip for you. If you select a few lines and then hit Ctrl + slash, it uncomments them all. And then when you're done, select them again, Ctrl + slash again, and re-comments them all, okay? So if you run this line, it'll install Kaggle for you.

Depending on your platform, you may need sudo, you may need slash something else slash pip, you may need source activate. So have a look at the setup instructions, actually the returning to work instructions, on the course website; just as when we do conda install, you have to do the same basic steps for your pip install.

So once you've got that module installed, you can then go ahead and download the data. And basically it's as simple as saying Kaggle competitions download, the competition name, and then the files that you want. The only other steps before you do that is that you have to authenticate yourself.

And you'll see there's a little bit of information here on exactly how you can go about downloading from Kaggle the file containing your API authentication information. So I won't bother going through it here, but just follow these steps. Sometimes stuff on Kaggle is not just zipped or tarred, but it's compressed with a program called 7-zip, which will have a .7z extension.

If that's the case, you'll need to either apt install p7zip, or here's something really nice. Some kind person has actually created a conda installation of 7-zip that works on every platform. So you can always just run this conda install, doesn't even require sudo or anything like that. And this is actually a good example of where conda is super handy, is that you can actually install binaries, and libraries, and stuff like that, and it's nicely cross platform.

So if you don't have 7-zip installed, that's a good way to get it. And so this is how you unzip a 7-zip file. In this case, it's tarred and 7-zipped, so you can do this all in one step. So 7za is the name of the 7-zip archiver program that you would run.
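For reference, the one-off setup cells in the notebook look roughly like the following sketch. The competition slug, file names, and conda channel here are the ones from the lesson-era notebook and the competition's Kaggle data page, so double-check them against your own download.

```python
# One-off setup; uncomment (Ctrl-/) to run, then comment back out when done.
# ! pip install kaggle --upgrade
# Put the kaggle.json API token (downloaded from your Kaggle account page) where the tool expects it:
# ! mkdir -p ~/.kaggle && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
# Download the competition files into your data path:
# ! kaggle competitions download -c planet-understanding-the-amazon-from-space -f train_v2.csv.zip -p {path}
# ! kaggle competitions download -c planet-understanding-the-amazon-from-space -f train-jpg.tar.7z -p {path}
# Cross-platform 7-zip via conda (no sudo needed), then extract the tarred-and-7-zipped images in one step:
# ! conda install -y -c haasad eidl7zip
# ! 7za -bd -y -so x {path}/train-jpg.tar.7z | tar xf - -C {path}
```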

Okay, so that's all basic stuff, which if you're not so familiar with the command line and stuff, it might take you a little bit of experimenting to get it working. Feel free to ask on the forum. Make sure you search the forum first to get started, okay. So once you've got the data downloaded and unzipped, you can take a look at it.

So in this case, because we have multiple labels for each tile, we clearly can't have a different folder for each image telling us what the label is. We need some different way to label it. And so the way that Kaggle did it was they provided a CSV file that had each file name along with a list of all of the labels.

In order to just take a look at that CSV file, we can read it using the pandas library. If you haven't used pandas before, it's kind of the standard way of dealing with tabular data in Python. It pretty much always appears in the PD namespace. In this case, we're not really doing anything with it, other than just showing you the contents of this file.
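In code, that's just a couple of lines; train_v2.csv is the labels file that comes with the Planet competition download, and path is assumed to be wherever you unzipped the data.

```python
import pandas as pd

df = pd.read_csv(path/'train_v2.csv')  # each row: an image name plus a space-separated list of tags
df.head()                              # peek at the first few rows
```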

So we can read it, we can take a look at the first few lines, and there it is. So we want to turn this into something we can use for modeling. So the kind of object that we use for modeling is an object of the DataBunch class, so we have to somehow create a DataBunch out of this.

Once we have a data bunch, we'll be able to go .show_batch to take a look at it. And then we'll be able to go create_cnn with it, and then we'll be able to start training, okay? So really, the trickiest step previously in deep learning has often been getting your data into a form that you can get it into a model.

So far, we've been showing you how to do that using various factory methods. So methods where you basically say, I want to create this kind of data from this kind of source with these kinds of options. The problem is, I mean, that works fine sometimes, and we showed you a few ways of doing it over the last couple of weeks.

But sometimes, you want more flexibility. Because there's so many choices that you have to make about where do the files live, and what's the structure they're in, and how do the labels appear, and how do you split out the validation set, and how do you transform it, and so forth.

So we've got this unique API that I'm really proud of called the DataBlock API. And the DataBlock API makes each one of those decisions a separate decision that you make; there are separate methods, with their own parameters, for every choice that you make around how do I create, set up my data.

So for example, to grab the planet data, we would say we've got a list of image files that are in a folder, and they're labeled based on a CSV with this name. They have this separator. Remember, I showed you back here that there's a space between them. So by passing in separator, it's going to create multiple labels.

The images are in this folder, they have this suffix. We're going to randomly split out a validation set with 20% of the data. We're going to create data sets from that, which we're then going to transform with these transformations. And then we're going to create a data bunch out of that, which we'll then normalize using these statistics.
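Put together, that chain of choices looks roughly like this. The data block method names have shifted between fastai v1 releases, so treat the exact identifiers below as approximate (and assume the usual from fastai.vision import * at the top of the notebook); the sequence of decisions is the point.

```python
np.random.seed(42)                                     # make the random validation split reproducible
src = (ImageFileList.from_folder(path)                 # the images are files in this folder
       .label_from_csv('train_v2.csv', sep=' ',        # labels come from this CSV, space-separated tags
                       folder='train-jpg', suffix='.jpg')
       .random_split_by_pct(0.2))                      # hold out 20% at random for validation
data = (src.datasets()                                 # turn that into (PyTorch) datasets
        .transform(tfms, size=128)                     # apply these transforms at this size
        .databunch()                                   # wrap it all up as a DataBunch
        .normalize(imagenet_stats))                    # normalize with ImageNet statistics
```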

So there's all these different steps. So to give you a sense of what that looks like, the first thing I'm going to do is go back and explain what are all of the PyTorch and FastAI classes that you need to know about that are going to appear in this process, because you're going to see them all the time in the FastAI docs and the PyTorch docs.

So the first one you need to know about is a class called a dataset. And the dataset class is part of PyTorch, and this is the source code for the dataset class. As you can see, it actually does nothing at all. So the dataset class in PyTorch defines two things: __getitem__ and __len__.

In Python, these special things that are underscore, underscore something, underscore, underscore, Pythonistas call them dunder something. This would be dunder getitem, dunder len. And they're basically special magical methods that do some special behavior; you can look them up in the Python docs. This particular method, __getitem__, means that your object, if you had an object called o, can be indexed with square brackets, like o[3], right?

So that would call __getitem__ with 3 as the index. And then this one called __len__ means that you can go len(o) and it will call that method. And you can see in this case, they're both not implemented. So that is to say, although PyTorch says to tell PyTorch about your data, you have to create a dataset.

It doesn't really do anything to help you create the dataset. It just defines what the dataset needs to do. So in other words, your data, the starting point for your data, is something where you can say, what is the third item of data in my dataset? So that's what getItem does, and how big is my dataset?

That's what the length does. So FastAI has lots of dataset subclasses that do that for all different kinds of stuff. And so, so far, you've been seeing image classification datasets. And so they're datasets where getItem will return an image and a single label of what is that image. So that's what a dataset is.
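So a minimal dataset of your own only has to implement those two methods. For example, a toy dataset wrapping two parallel lists might look like this (pure PyTorch, nothing fastai-specific):

```python
from torch.utils.data import Dataset

class PairDataset(Dataset):
    "A toy dataset: items and labels held in two parallel lists."
    def __init__(self, items, labels):
        self.items, self.labels = items, labels
    def __getitem__(self, i):            # ds[3] calls this with i=3
        return self.items[i], self.labels[i]
    def __len__(self):                   # len(ds) calls this
        return len(self.items)
```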

Now, a dataset is not enough to train a model. The first thing we know we have to do, if you think back to the gradient descent tutorial last week, is we have to have a few images or a few items at a time so that our GPU can work in parallel.

Remember, we do this thing called a mini-batch. A mini-batch is a few items that we present to the model at a time that we can train from in parallel. So to create a mini-batch, we use another PyTorch class called a data loader. And so a data loader takes a dataset in its constructor.

So it's now saying, this is something I can get the third item and the fifth item and the ninth item. And it's gonna grab items at random and create a batch of whatever size you ask for and pop it on the GPU and send it off to your model for you.

So a data loader is something that grabs individual items, combines them into a mini-batch, pops them on the GPU for modeling. So that's called a data loader and it comes from a dataset. So you can see already there's kind of choices you have to make. What kind of dataset am I creating?

What is the data for it where it's gonna come from? And then when I create my data loader, what batch size do I wanna use, right? This still isn't enough to train a model, not really, because we've got no way to validate the model. If all we have is a training set, then we have no way to know how we're doing, because we need a separate set of held out data.

A validation set to see how we're getting along. So for that, we use a fast AI class called a data bunch. And a data bunch is something which, as it says here, binds together a training data loader, and a valid data loader. And when you look at the fast AI docs, when you see these monospace font things, they're always referring to some symbol you can look up elsewhere.

So in this case, you can see train DL is here. And there's no point knowing that there's an argument with a certain name, unless you know what that argument is. So you should always look after the colon to find out that is a data loader. Okay, so when you create a data bunch, you're basically giving it a training set data loader and a validation set data loader.
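As a sketch of how those pieces fit together (train_ds and valid_ds stand for any Dataset objects, like the toy one above; the DataBunch import path is fastai v1's and may differ in other versions):

```python
from torch.utils.data import DataLoader
from fastai.basic_data import DataBunch       # fastai v1

train_dl = DataLoader(train_ds, batch_size=64, shuffle=True)  # grabs items and collates them into mini-batches
valid_dl = DataLoader(valid_ds, batch_size=64)
data = DataBunch(train_dl, valid_dl)          # binds the two loaders together, ready for a Learner
```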

And that's now an object that you can send off to a learner and start learning, start fitting, right? So they're the basic pieces. So coming back to here. This stuff plus this line is all the stuff which is creating the data set. So it's saying where did the images come from?

Cuz the data set, the index returns two things. It returns the image and the labels, assuming it's an image data set. So where do the images come from? Where do the labels come from? And then I'm gonna create two separate data sets, the training and the validation. This is the thing that actually turns them into PyTorch data sets.

This is the thing that transforms them, okay? And then this is actually gonna create the data loader and the data bunch in one go. So let's look at some examples of this data block API. Because once you understand the data block API, you'll never be lost for how to convert your data set into something you can start modeling with.

So here's some examples of using the data block API. So for example, if you're looking at MNIST, which remember is the pictures and classes of handwritten numerals, you can do something like this. This, what kind of data set is this gonna be? It's gonna come from a list of image files, which are in some folder.

And they're labeled according to the folder name that they're in. And then we're gonna split it into train and validation, according to the folder that they're in, train and validation. You can optionally add a test set. We're gonna be talking more about test sets later in the course. Okay, we'll convert those into PyTorch data sets, now that that's all set up.

We'll then transform them using this set of transforms. And we're gonna transform into something of this size. And then we're gonna convert them into a data bunch. So each of those stages inside these parentheses are various parameters you can pass to customize how that all works, right? But in the case of something like this MNIST data set, all the defaults pretty much work, so this is all fine.
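In code, that MNIST example reads roughly like this; again, the exact data block method names changed across fastai v1 releases, so take the identifiers as approximate.

```python
data = (ImageFileList.from_folder(path)   # 1. what kind of data, and where: image files in a folder
        .label_from_folder()              # 2. label each image by its parent folder's name
        .split_by_folder()                # 3. train/valid split follows the train and valid folders
        .add_test_folder()                # 4. optionally add a test set (more on test sets later)
        .datasets()                       # 5. convert into PyTorch datasets
        .transform(tfms, size=224)        # 6. transform, at this size
        .databunch())                     # 7. wrap as a DataBunch
```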

So here it is, so you can check. Let's grab something. So data.train_ds is the data set, not the data loader, the data set. So I can actually index into it with a particular number. So here is the zero indexed item in the training data set. It's got an image and a label.

We can show batch to see an example of the pictures of it, and we can then start training. Here are the classes that are in that data set. And this little cut down sample of MNIST just has threes and sevens. Here's an example using Planet. This is actually, again, a little subset of Planet we use to make it easy to try things out.

So in this case, again, it's an image file list. Again, we're grabbing it from a folder. This time we're labeling it based on a CSV file. We're randomly splitting it. By default, it's 20%, creating data sets, transforming it using these transforms. We're gonna use a smaller size and then create a data bunch.

There it is. And so data bunches know how to draw themselves, amongst other things. So here's some more examples we're gonna be seeing later today. What if we look at this data set called CamVid? CamVid looks like this. It contains pictures, and every pixel in the picture is color coded, right?

So in this case, we have a list of files in a folder, and we're gonna label them, in this case, using a function. And so this function is basically the thing, we're gonna see it later, which tells it whereabouts the color coding for each pixel is; it's in a different place.

Randomly split it in some way, create some data sets in some way. We can tell it for our particular list of classes. How do we know what pixel value one versus pixel value two is? And that was something that we can basically read in, like so. Again, some transforms, create a data bunch.

You can optionally pass in things like what batch size do you want. And again, it knows how to draw itself, and you can start learning with that. For one more example, what if we wanted to create something like this? It has boxes around things like a chair, and a remote control, and a book.

This is called an object detection data set. So again, we've got a little minimal CoCo data set. CoCo is kind of the most famous academic data set for object detection. We can create it using the same process. Grab a list of files from a folder, label them according to this little function.

Randomly split them, create an object detection data set, create a data bunch. In this case, as you'll learn when we get to object detection, you have to use generally smaller batch sizes, or you'll run out of memory. And as you'll also learn, you have to use something called a collation function.
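The fastai docs example for that tiny COCO subset is roughly the following; get_annotations and bb_pad_collate are fastai v1 helpers, but the item list class and split method names vary a little by release, so treat this as a sketch.

```python
coco = untar_data(URLs.COCO_TINY)
images, lbl_bbox = get_annotations(coco/'train.json')   # bounding boxes plus class names per image
img2bbox = dict(zip(images, lbl_bbox))
get_y_func = lambda o: img2bbox[o.name]                 # label each file from that dictionary

data = (ObjectItemList.from_folder(coco)
        .random_split_by_pct()
        .label_from_func(get_y_func)
        .transform(get_transforms(), tfm_y=True)        # boxes must be transformed along with the images
        .databunch(bs=16, collate_fn=bb_pad_collate))   # smaller batches, plus the special collation function
```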

And once that's all done, we can again show it, and here's our object detection data set. So you get the idea, right? So here's a really convenient notebook, where will you find this? Ah, this notebook is the documentation. Remember how I told you that all of the documentation comes from notebooks?

You'll find them in your fast AI repo in docs_source. So this, which you can play with and experiment with inputs and outputs, and try all the different parameters, you will find the data block API examples of use. If you go to the documentation, here it is, the data block API examples of use.

All right, so remember, everything that you wanna use in fast AI, you can look it up in the documentation. So let's search, data block API. Go straight there, and away you go. And so once you find some documentation that you actually wanna try playing with yourself, just look up the name, data block.

And then you can open up a notebook with the same name in the fast AI repo, and play with it yourself, okay? So that's a quick overview of this really nice data block API. And there's lots of documentation for all of the different ways you can label inputs, and split data, and create data sets, and so forth.

And so that's what we're using for Planet, okay? So we're using that API. You'll see in the documentation these two steps we had all joined up together. We can certainly do that here too, but you'll learn in a moment why it is that we're actually splitting these up into two separate steps, which is also fine as well.

So a few interesting points about this, transforms. So transforms by default. Remember, you can hit Shift + Tab to get all the information, right? Transforms by default will flip randomly each image, right? But they'll actually randomly only flip them horizontally, which makes sense, right? If you're trying to tell if something's a cat or a dog, doesn't matter whether it's pointing left or right, but you wouldn't expect it to be upside down.

On the other hand, satellite imagery, whether something's cloudy or hazy, or whether there's a road there or not, could absolutely be flipped upside down. There's no such thing as a right way up in space. So flip_vert, which defaults to false, we're going to flip over to true. To say like, yeah, randomly, you should actually do that.

And it doesn't just flip it vertically. It actually tries also each possible 90 degree rotation. So there are eight possible kind of symmetries that it tries out. So there's various other things here. I've found that these particular settings work pretty well for Planet. One that's interesting is warp. Perspective warping is something which very few libraries provide, and those that do provide it, it tends to be really slow.

I think fast AI is the first one to provide really fast perspective warping. And basically the reason this is interesting is if I kind of look at you from below versus look at you from above, your shape changes, right? And so when you're taking a photo of a cat or a dog, sometimes you'll be higher, sometimes you'll be lower, then that kind of change of shape is certainly something that you would want to include as you're creating your training batches.

You want to modify it a little bit each time. Not true for satellite images. A satellite always points straight down at the planet. So if you added perspective warping, you would be making changes that aren't going to be there in real life. So I turn that off. So this is all something called data augmentation.
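So the transform settings the lesson uses for Planet come out looking something like this (get_transforms is the fastai v1 helper; the particular max_* values are just the ones that worked well here):

```python
tfms = get_transforms(flip_vert=True,    # satellite tiles have no "right way up": flips plus 90-degree rotations
                      max_lighting=0.1,  # small lighting changes
                      max_zoom=1.05,     # small zooms
                      max_warp=0.)       # no perspective warping: satellites look straight down
```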

We'll be talking a lot more about it later in the course. But you can start to get a feel for the kinds of things that you can do to augment your data. And in general, maybe the most important one is if you're looking at astronomical data or kind of pathology, digital slide data or satellite data.

Data where there isn't really an up or a down, turning on flip_vert equals true is generally going to make your models generalize better. Okay, so here's the steps necessary to create our data bunch. And so now to create a satellite imagery classifier, multi-label classifier, that's going to figure out for each satellite tile what's the weather and what else, what can I see in it.

There's basically nothing else to learn. Everything else that you've already learnt is going to be exactly nearly the same. Here it is, learn equals create_cnn, data, architecture, right? And in this case, when I first built this notebook, I used ResNet 34 as per usual. And I found this was a case, I tried ResNet 50 as I always like to do.

I found ResNet 50 helped a little bit, and I had some time to run it. So in this case, I was using ResNet 50. There's one more change I make, which is metrics. Now to remind you, a metric has got nothing to do with how the model trains. Changing your metrics will not change your resulting model at all.

The only thing that we use metrics for is we print them out during training. So here it's printing out accuracy and it's printing out this other metric called F beta. So if you're trying to figure out how to do a better job with your model, changing the metrics will never be something that you need to do there.

They're just to show you how you're going. So that's the first thing to know. You can have one metric or no metrics or a list of multiple metrics to be printed out as your model's training. In this case, I wanna know two things. The first thing I wanna know is the accuracy.

And the second thing I wanna know is how would I go on Kaggle? And Kaggle told me that I'm gonna be judged on a particular metric called the F score. So I'm not gonna bother telling you about the F score. It's not really interesting enough to be worth spending your time on.

You can look it up, but it's basically this. When you have a classifier, you're gonna have some false positives. You're gonna have some false negatives. How do you weigh up those two things to kind of create a single number? There's lots of different ways of doing that. And something called the F score is basically a nice way of combining that into a single number.

And there are various kinds of F scores, F1, F2, and so forth. And Kaggle said, in the competition rules, we're gonna use a metric called F2. So we have a metric called F beta, which in other words it's F with 1 or 2 or whatever depending on the value of beta.

And we can have a look at its signature. And you can see that it's got a threshold and a beta. Okay, so the beta is 2 by default. And Kaggle said that they're gonna use F2, so I don't have to change that. But there's one other thing that I need to set, which is a threshold.

What does that mean? Well, here's the thing. Do you remember we had a little look the other day at the source code for the accuracy metric? So if you put two question marks, you get the source code. And we found that it used this thing called argmax. And the reason for that, if you remember, was we kind of had this input image that came in, and it went through our model.

And at the end, it came out with a table of ten numbers, right? This is like if we're doing MNIST digit recognition. The ten numbers were like the probability of each of the possible digits. And so then we had to look through all of those and find out which one was the biggest.

And so the function in NumPy or PyTorch or just math notation that finds the biggest and returns its index is called argmax, right? So to get the accuracy for our pet detector, we used this accuracy function, which called argmax to find out behind the scenes which pet class ID was the one that we were looking at.

And then it compared that to the actual and then took the average. And that was the accuracy. We can't do that for satellite recognition in this case, because there isn't one label we're looking for. There's lots. So instead, what we do is we look at, so in this case.

So I don't know if you remember, but a data bunch has a special attribute called c. And c is gonna be basically how many outputs do we want our model to create? And so for any kind of classifier, we want one probability for each possible class. So in other words, data.c for classifiers is always gonna be equal to the length of data.classes, right?

So data.classes, there they all are. There's the 17 possibilities, right? So we're gonna have one probability for each of those. But then we're not just gonna pick out one of those 17. We're gonna pick out n of those 17. And so what we do is we compare each probability to some threshold.

And then we say anything that's higher than that threshold, we're gonna assume that the model's saying it does have that feature. And so we can pick that threshold. I found that for this particular data set, a threshold of 0.2 seems to generally work pretty well. This is the kind of thing you can easily just experiment to find a good threshold.

So I decided I wanted to print out the accuracy at a threshold of 0.2. So the normal accuracy function doesn't work that way. It doesn't argmax. We have to use a different accuracy function called accuracy_thresh. And that's the one that's gonna compare every probability to a threshold and return all the things higher than that threshold and compare accuracy that way.

And so one of the things we would pass in is thresh. Now of course, our metric is gonna be calling our function for us. So we don't get to tell it every time it calls back what threshold we want. So we really wanna create a special version of this function that always uses a threshold of 0.2.

So one way to do that would be to go define something called accuracy_02 that takes some input and some target and returns accuracy threshold with that input and that target and a threshold of 0.2. We could do it that way, okay? But it's so common that you wanna kind of say, create a new function that's just like that other function, but we're always gonna call it with a particular parameter.

That computer science has a term for that. It's called a partial, it's called a partial function application. And so Python 3 has something called partial that takes some function and some list of keywords and values and creates a new function. That is exactly the same as this function, but is always gonna call it with that keyword argument.

So here, this is exactly the same thing as the thing I just typed in. acc_02 is now a new function that calls accuracy_thresh with a threshold of 0.2. And so this is a really common thing to do, particularly with the fastAI library, cuz there's lots of places where you have to pass in functions.

And you very often wanna pass in a slightly customized version of a function. So here's how you do it. So here I've got an accuracy threshold 0.2. I've got a fbeta threshold 0.2. I can pass them both in as metrics. And I can then go ahead and do all the normal stuff.
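Concretely, those two customized metrics and the learner that prints them look something like this; accuracy_thresh and fbeta are fastai v1 metric functions, partial is from the Python standard library, and create_cnn was later renamed cnn_learner.

```python
from functools import partial
from fastai.metrics import accuracy_thresh, fbeta

acc_02 = partial(accuracy_thresh, thresh=0.2)   # accuracy_thresh, with thresh always fixed at 0.2
f_score = partial(fbeta, thresh=0.2)            # fbeta's beta defaults to 2, which is what Kaggle scores

learn = create_cnn(data, models.resnet50, metrics=[acc_02, f_score])
```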

lr_find, recorder.plot, find the thing with the steepest slope. So I don't know, somewhere around 1e-2, so we'll make that our learning rate. And then fit for a while with 5 epochs and slice(lr), and see how we go, okay? And so we've got an accuracy of about 96% and an fbeta of about 0.926.
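As a sketch of that "normal stuff", with the learning rate and save name taken from the lesson's narration rather than anything you have to copy exactly:

```python
learn.lr_find()
learn.recorder.plot()               # pick the steepest downward slope, somewhere around 1e-2
lr = 0.01
learn.fit_one_cycle(5, slice(lr))   # 5 epochs at that learning rate
learn.save('stage-1-rn50')          # illustrative checkpoint name
```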

And so you could then go and have a look at the Planet private leaderboard, okay? And so around 50th place is about 0.93. So we kinda say, we're on the right track, okay, we're doing fine. So as you can see, once you get to the point that the data's there, there's very little extra to do most of the time.

>> So when your model makes an incorrect prediction in a deployed app, is there a good way to record that error and use that learning to improve the model in a more targeted way? >> Yeah, that's a great question. So the first bit, is there a way to record that?

Of course there is, you record it, that's up to you, right? So maybe some of you can try it this week. You'll need to have your user tell you, you were wrong. This Australian car, you said it was a Holden, and actually it's a Falcon. So first of all, you'll need to collect that feedback.

And the only way to do that is to ask the user to tell you when it's wrong. So you now need to record in some log somewhere, something saying, this was the file, I've stored it here. This was the prediction I made. This is the actual that they told me.

And then at the end of the day or at the end of the week, you could set up a little job to run something or you can manually run something. And what are you gonna do? You're gonna do some fine-tuning. What does fine-tuning look like? Good segue, Rachel. It looks like this, right?

So let's pretend here's your saved model, right? And so then we unfreeze, right? And then we fit a little bit more, right? Now in this case, I'm fitting with my original data set. But you could create a new data bunch with just the misclassified instances and go ahead and fit, right?

And the misclassified ones are likely to be particularly interesting. So you might want to fit at a slightly higher learning rate, in order to make them kind of really mean more. Or you might want to run them through a few more epochs. But it's exactly the same thing, right?

You just go fit with your misclassified examples, passing in the correct classification. And that should really help your model quite a lot. There are various other tweaks you can do to this, but that's the basic idea. >> Next question, could someone talk a bit more about the data block ideology?

I'm not quite sure how the blocks are meant to be used. Do they have to be in a certain order? Is there any other library that uses this type of programming that I could look at? >> Yes, they do have to be in a certain order.

And it's basically the order that you see in the example of use, right? What kind of data do you have? Where does it come from? How do you label it? How do you split it? What kind of data sets do you want? Optionally, how do I transform it? And then how do I create a data bunch from it?

So they're the steps. I mean, we invented this API. I don't know if other people have independently invented it. The basic idea of kind of a pipeline of things that dot into each other is pretty common in a number of places. Not so much in Python, but you see it more in JavaScript.

Although this kind of approach of each stage produces something slightly different. You tend to see it more in ETL software, like Extraction Transformation and Loading Software, where there's kind of particular stages in a pipeline. So yeah, I mean, it's been inspired by a bunch of things. But yeah, all you need to know is kind of use this example to guide you, and then look up the documentation to see which particular kind of thing you want.

And in this case, the image file list, you're actually not going to find the documentation or image file list in data blocks documentation, because this is specific to the vision application. So to then go and actually find out how to do something for your particular application, you would then go to look at text and vision and so forth, and that's where you can find out what are the data block API pieces available for that application.

And of course, you can then look at the source code. If you've got some totally new application, you could create your own part of any of these stages. Pretty much all of these functions are, you know, very few lines of code. Maybe we could look at an example of one, image list from folder.

So let's just put that somewhere temporary, and then we're gonna go t.label_from_csv. Then you can look at the documentation to see exactly what that does, and that's gonna call label_from_df. So I mean, this is already useful. If you wanted to create a data frame, a pandas data frame from something other than the CSV, you now know that you could actually just call label_from_df, and you can look up to find what that does.

And as you can see, most fast AI functions are no more than a few lines of code. They're normally pretty straightforward to see what are all the pieces there and how can you use them. And it's probably one of these things that as you play around with it, you'll get a good sense of how it all gets put together.

But if during the week there are particular things where you're thinking, I don't understand how to do this, please let us know and we'll try to help you. Sure. >> What resources do you recommend for getting started with video, for example, being able to pull frames and submit them to your model?

>> I guess, I mean, the answer is it depends. If you're using the web, which I guess probably most of you will be, then there's web APIs that basically do that for you. So you can grab the frames with the web API and then they're just images which you can pass along.

If you're doing it client side, I guess most people tend to use OpenCV for that. But maybe people during the week who are doing these video apps can tell us what have you used and found useful, and we can start to prepare something in the lesson wiki with a list of video resources, since it sounds like some people are interested.

Okay, so just like usual, we unfreeze our model and then we fit some more and we get down to 0.929-ish. So one thing to notice here is that before we unfreeze, you'll tend to get this shape pretty much all the time. If you do your learning rate finder before you unfreeze, it's pretty easy.

You know, find the steepest slope, not the bottom, right? Remember we're trying to find the bit where we can like slide down it quickly. So if you start at the bottom, it's just gonna send you straight off to the end here, so somewhere around here, and then we can call it again after you unfreeze.

And you'll generally get a very different shape, right? And this is a little bit harder to say what to look for, because it tends to be this kind of shape where you get a little bit of upward and then a kind of very gradual downward and then up here.

So, you know, I tend to kind of look for just before it shoots up and go back about 10x, right, as a kind of a rule of thumb, so 1e-5, right? And that is what I do for the first half of my slice. And then for the second half of my slice, I normally do whatever learning rate I used for the frozen part, so lr, which was 0.01, kind of divided by 5, or divided by 10, somewhere around that.

So that's kind of my rule of thumb, right? Look for the bit kind of at the bottom, find about 10x smaller. That's the number that I put here, and then Lr over 5 or Lr over 10 is kind of what I put there. Seems to work most of the time.
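So the unfrozen stage, following that rule of thumb, looks roughly like this; the exact numbers are the ones from the lesson, not a universal recipe.

```python
learn.unfreeze()
learn.lr_find()
learn.recorder.plot()                        # look just before the curve shoots up, then go back ~10x
learn.fit_one_cycle(5, slice(1e-5, lr/5))    # earliest layers: 1e-5; last layers: the frozen-stage lr / 5
learn.save('stage-2-rn50')                   # illustrative checkpoint name
```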

We'll be talking more about exactly what's going on here. This is called discriminative learning rates as the course continues. So how am I gonna get this better than 0.929? Because there are, how many people in this competition? About 1,000 teams, right? So we wanna get into the top 10%.

So the top 5% would be 0.931-ish. The top 10% is gonna be about 0.929-ish. So we're not quite there, right? So here's a trick, right? I don't know if you remember, but when I created my data set, I put size equals 128, and actually the images that Kaggle gave us are 256.

So I used the size of 128 partially cuz I wanted to experiment quickly. It's much quicker and easier to use small images to experiment. But there's a second reason. I now have a model that's pretty good at recognizing the contents of 128 by 128 satellite images. So what am I gonna do if I now wanna create a model that's pretty good at 256 by 256 satellite images?

Well, why don't I use transfer learning? Why don't I start with a model that's good at 128 by 128 images and fine tune that, so don't start again, right? And that's actually gonna be really interesting because if I've trained quite a lot and I'm on the verge of overfitting, which I don't wanna do, right?

Then I'm basically creating a whole new data set effectively, one where my images are twice the size on each axis, right? So four times bigger. So it's really a totally different data set as far as my convolutional neural network's concerned. So I kind of get to lose all that overfitting, I get to start again.

So let's create a new learner, right? Well, let's keep our same learner, but use a new data bunch, where the data bunch is 256 by 256. So that's why I actually stopped here, right, before I created my data sets. Cuz I'm gonna now take this data source, and I'm gonna create a new data bunch with 256 instead.

So let's have a look at how we do that. So here it is, take that source, right, take that source, transform it with the same transforms as before, but this time use size 256. Now that should be better anyway, because this is gonna be higher resolution images. But also I'm gonna start with, I haven't got rid of my learner, it's the same learner I had before, so I'm gonna start with this kind of pre-trained model, and so I'm gonna replace the data inside my learner with this new data bunch.

And then I will freeze again, so that means I'm going back to just training the last few layers. And I will do a new LR find, and because I actually now have a pretty good model, like it's pretty good for 128 by 128, so it's probably gonna be like at least okay for 256 by 256, I don't get that same sharp shape that I did before.

But I can certainly see where it's way too high, right? So, I'm gonna pick something well before where it's way too high. Again, maybe 10x smaller. So here I'm gonna go 1e-2 over 2, that seems well before it shoots up. And so let's fit a little bit more, okay?
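As a sketch of that progressive resizing step, reusing the labelled source from earlier (here called src) and the same learner; method names may differ slightly depending on your fastai v1 version:

```python
data = (src.datasets()
        .transform(tfms, size=256)          # same source, same transforms, but now 256x256
        .databunch().normalize(imagenet_stats))

learn.data = data                           # swap the new DataBunch into the existing learner
learn.freeze()                              # back to training just the last few layers at first
learn.lr_find()
learn.recorder.plot()
learn.fit_one_cycle(5, slice(1e-2/2))       # well before the point where the curve shoots up
```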

So we've frozen again, so we're just training the last few layers and fit a little bit more. And as you can see, very quickly, remember kind of 0.928 was where we got to before, after quite a few epochs. We're straight up there, and suddenly we've passed 0.93, all right?

So we're now already kind of into the top 10%, so we've hit our first goal, right? We're doing, we're at the very least pretty confident at the problem of just recognizing satellite imagery. But of course now, we can do the same thing before. We can unfreeze and train a little more, okay?

Again, using the same kind of approach I described before, we do lr over 5 here, and an even smaller one here, train a little bit more, 0.9314. So, that's actually pretty good, 0.9314. Somewhere around top 20-ish. So you can see actually when my friend Brendan and I entered this competition, we came 22nd with 0.9315.

And we spent, this was a year or two ago, months trying to get here. So using kind of pretty much defaults with the minor tweaks and one trick, which is the resizing tweak, you can kind of get right up into the top of the leaderboard of this very challenging competition.

Now, I should say we don't really know where we'd be. We'd actually have to check it on the test set that Kaggle gave us and actually submit to the competition, which you can do. You can do a late submission. And so later on in the course, we'll learn how to do that.

But we certainly know we're doing well. We're doing very well, so that's great news. And so you can see also as I kind of go along, I tend to save things. I just, you can name your models whatever you like. But I just want to basically know, was it kind of before or after the unfreeze?

So kind of had stage one or two, what size was I training on? What architecture was I training on? So that way, I can kind of always go back and experiment pretty easily. So that's Planet, multi-label classification. Let's look at another example. So the other example next we're going to look at is this data set called Canva.

And it's going to be doing something called Segmentation. We're going to start with a picture like this. And we're going to try and create a color coded picture like this. Where all of the bicycle pixels are the same color. All of the road line pixels are the same color.

All of the tree pixels are the same color. All of the building pixels are the same color. The sky is the same color, and so forth, okay? Now, we're not actually going to make them colors. We're actually going to do it where each of those pixels has a unique number.

So in this case, the top left is buildings. I guess building is number four. The top right is trees, so tree is 26, and so forth, all right? So in other words, this single top left pixel, we're basically, like I mentioned this, we're going to do a classification problem, just like the pet's classification, for the very top left pixel.

We're going to say, what is that top left pixel? Is it bicycle, road lines, sidewalk, building? What is the very top left pixel? And then what is the next pixel along? What is the next pixel along? So we're going to do a little classification problem for every single pixel in every single image.

So that's called segmentation, all right? In order to build a segmentation model, you actually need to download or create a dataset where someone has actually labeled every pixel. So as you can imagine, that's a lot of work, okay? So that's going to be a lot of work. You're probably not going to create your own segmentation datasets, but you're probably going to download or find them from somewhere else.

This is very common in medicine, life sciences. You know, if you're looking through slides at nuclei, it's very likely you already have a whole bunch of segmented cells and segmented nuclei. If you're in radiology, you probably already have lots of examples of segmented lesions and so forth. There's a lot of, you know, kind of different domain areas where there are domain-specific tools for creating these segmented images.

As you could guess from this example, it's also very common in kind of self-driving cars and stuff like that where you need to see, you know, what objects are around and where are they. In this case, there's a nice dataset called CamVid, which we can download, and they have already got a whole bunch of images and segment masks prepared for us, which is pretty cool.

And remember, pretty much all of the datasets that we have provided kind of inbuilt URLs for, you can see their details at course.fast.ai/datasets, and nearly all of them are academic datasets where some very kind people have gone to all of this trouble for us so that we can use this dataset and made it available for us to use.

So if you do use it, one of these datasets for any kind of project, it would be very, very nice if you were to go and find the citation and say, you know, thanks to these people for this dataset, okay, because they've provided it, and all they're asking in return is for us to give them that credit.

Okay, so here is the CamVid dataset, here is the citation, and on our datasets page that will link to the academic paper where it came from. Okay, Rachel, now is a good time for a question. >> Is there a way to use learn.lr_find and have it return a suggested number directly rather than having to plot it as a graph and then pick a learning rate by visually inspecting that graph?

And then there are a few other questions, I think, around more guidance on reading the learning rate finder graph. >> Yeah, I mean, that's a great question. I mean, the short answer is no. And the reason the answer is no is because this is still a bit more artisanal than I would like.

As you can kind of see, I've been kind of saying how I read this learning rate graph depends a bit on what stage I'm at and kind of what the shape of it is. I guess, like, when you're just training the head, so before you unfreeze, it pretty much always looks like this.

And you could certainly create something that kind of creates a slightly, you know, creates a smooth version of this, finds the sharpest negative slope and picked that. You would probably be fine nearly all the time. But then for, you know, these kinds of ones, you know, it requires a certain amount of experimentation.

But the good news is you can experiment, right? You can try. Obviously, if the line's going up, you don't want it. And certainly, at the very bottom point, you don't want it, right, because you need it to be going downwards. But if you kind of start with somewhere around 10x smaller than that, and then also you could try another 10x smaller than that, try a few numbers and find out which ones work best.

And within a small number of weeks, you will find that you're picking the best learning rate most of the time, right? So I don't know. It's kind of -- so at this stage, it still requires a bit of playing around to get a sense of the different kinds of shapes that you see and how to respond to them.

Maybe by the time this video comes out, someone will have a pretty reliable auto learning rate finder. We're not there yet. It's probably not a massively difficult job to do, be an interesting project, collect a whole bunch of different datasets, maybe grab all the datasets from our datasets page, try and come up with some simple heuristic, compare it to all the different lessons I've shown.

That would be a really fun project to do. But at the moment, we don't have that. I'm sure it's possible. But we haven't got there. Okay. So how do we do image segmentation? Same way we do everything else. And so basically we're going to start with some path which has got some information in it of some sort.

So I always start by, you know, untarring my data, doing ls, see what I was given. In this case, there's a folder called labels and a folder called images. So I'll create paths for each of those. We'll take a look inside each of those. And you know, at this point, like, you can see there's some kind of coded file names for the images and some kind of coded file names for the segment masks.

And then you kind of have to figure out how to map from one to the other. You know, normally these kind of datasets will come with a readme you can look at or you can look at their website. Often it's kind of obvious. In this case, I can see, like, these ones always have this kind of particular format.

These ones always have exactly the same format with an underscore P. So I kind of -- when I did this, honestly, I just guessed. I thought, oh, it's probably the same thing, underscore P. And so I created a little function that basically took the file name and added the underscore P and put it in a different place.

And I tried opening it and I noticed it worked. So I've created this little function that converts from the image file names to the equivalent label file names. I opened up that to make sure it works. Normally we use open image to open a file and then you can go .show to take a look at it.

But this -- as we described, this is not a usual image file. It contains integers. So you have to use open mask rather than open image because we want to return integers, not floats. And fast AI knows how to deal with masks, so if you go mask.show, it will automatically color code it for you in some appropriate way.

That's why we say open_mask. So we can kind of have a look inside, look at the data, see what the size is -- it's 720 by 960. We can take a look at the data inside and so forth. The other thing you might have noticed is that they gave us a file called codes.txt and a file called valid.txt.

So codes.txt, we can load it up and have a look inside. And not surprisingly, it's got a list telling us that, for example, number 4 (counting 0, 1, 2, 3, 4) is building. So the top left there is building. There you go. Okay? So just like we had, you know, grizzlies, black bears and teddies, here we've got the coding for what each one of these pixels means.
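The inspection steps just described look roughly like this, assuming the usual from fastai.vision import * and that img_f is one of the image file names (a sketch; exact calls can vary a little between fastai versions):

    img = open_image(img_f)                  # an ordinary photo: float pixel values
    img.show(figsize=(5, 5))

    mask = open_mask(get_y_fn(img_f))        # the segmentation mask: integer class codes
    mask.show(figsize=(5, 5), alpha=1)       # fastai color codes the mask for you
    print(mask.shape, mask.data)             # e.g. 720 by 960, with the integer codes inside

    codes = np.loadtxt(path/'codes.txt', dtype=str)   # the list of what each code means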

So we need to create a data bunch. So to create a data bunch, we can go through the data block API and say okay, we've got a list of image files that are in a folder. We need to create labels, which we can do with that get_y_fn function we just created.

We then need to split into training and validation. In this case, I don't do it randomly. Why not? Because actually the pictures they've given us are frames from videos. So if I did them randomly, I would be having like two frames next to each other, one in the validation set, one in the training set.

That would be far too easy. That's cheating. So the people that created this data set actually gave us a file saying here is the list of file names that are meant to be in your validation set. And they're non-contiguous parts of the video. So here's how you can split your validation and training using a file name file.

So from that, I can create my data sets. And so I actually have a list of class names. So like often with stuff like the planet data set or the pets data set, we actually have a string saying this is a pug or this is a ragdoll or this is a Birman or this is cloudy or whatever.

In this case, you don't have every single pixel labeled with an entire string. That would be incredibly inefficient. They're each labeled with just a number and then there's a separate file telling you what those numbers mean. So here's where we get to tell it and the data block API, this is the list of what the numbers mean.

So these are the kind of parameters that the data block API gives you. Here's our transformations. And so here's an interesting point. Remember I told you that, for example, sometimes we randomly flip an image, right? What if we randomly flip the independent variable image but we don't also randomly flip this one?

They're now not matching anymore, right? So we need to tell fast.ai that I want to transform the Y. So X is our independent variable, Y is our dependent variable. I want to transform the Y as well. So whatever you do to the X, I also want you to do to the Y.

So there's all these little parameters that we can play with and I can create a data bunch. I'm using a smaller batch size because as you can imagine, because I'm creating a classifier for every pixel, that's going to take a lot more GPU. So I found a batch size of 8 is all I could handle and then normalize in the usual way.
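Put together, the data block pipeline he's describing looks roughly like this (a sketch based on the CamVid notebook; path_img, valid.txt, get_y_fn, codes and src_size are the names introduced above, and exact class names can differ between fastai versions):

    size = src_size // 2                                   # half the source image size for now
    src = (SegmentationItemList.from_folder(path_img)      # a list of image files in a folder
           .split_by_fname_file('../valid.txt')            # validation split given by a file of file names
           .label_from_func(get_y_fn, classes=codes))      # labels come from our filename-mapping function

    data = (src.transform(get_transforms(), size=size, tfm_y=True)   # tfm_y: do the same flips etc. to the mask
            .databunch(bs=8)                                         # small batch size; segmentation is memory hungry
            .normalize(imagenet_stats))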

And this is quite nice. Fast.ai, because it knows that you've given it a segmentation problem, when you call show batch, it actually combines the two pieces for you and it will color code the photo. Isn't that nice? So you can see here the green on the trees and the red on the lines and this kind of color on the walls and so forth, right?

So you can see here, here are the pedestrians, this is the pedestrian's backpack. So this is what the ground truth data looks like. So once we've got that, we can go ahead and create a learner, I'll show you some more details in a moment, call lr_find, find the sharpest bit, which looks like about 1e-2, call fit, passing in slice(lr), see the accuracy, save the model, and unfreeze and train a little bit more.

So that's the basic idea, okay? And so we're going to have a break and when we come back, I'm going to show you some little tweaks that we can do and I'm also going to explain this custom metric that we've created and then we'll be able to go on and look at some other cool things.

So let's all come back at 8 o'clock, 6 minutes. Okay, welcome back everybody and we're going to start off with a question we got during the break. >> Could you use unsupervised learning here, pixel classification with the bike example to avoid needing a human to label a heap of images?

>> Well, not exactly unsupervised learning, but you can certainly get a sense of where things are without needing these kind of labels. And time permitting, we'll try and see some examples of how to do that. But you're certainly not going to get such a quality and such a specific output as what you see here, though.

If you want to get this level of segmentation mask, you need a pretty good segmentation mask ground truth to work with. >> Is there a reason we shouldn't deliberately make a lot of smaller data sets to step up from in tuning, let's say 64 by 64, 128 by 128, 256 by 256, and so on?

Yes, you should totally do that. It works great. Try it. This idea is something that I first came up with in the course a couple of years ago, and I kind of thought it seemed obvious and just presented it as a good idea, and then I later discovered that nobody had really published this before. And then we started experimenting with it, and it was basically the main trick that we used to win the DAWNBench ImageNet training competition, and we were like, wow -- not only was this not standard, nobody had even heard of it before.

There's been now a few papers that use this trick for various specific purposes, but it's still largely unknown, and it means that you can train much faster and it generalizes better. There's still a lot of unknowns about exactly how small and how big and how much at each level and so forth, but I guess, in as much as it has a name now, we'd call it progressive resizing.

I found that going much under 64 by 64 tends not to help very much, but yeah, it's a great technique and I definitely try a few different sizes. >> What does accuracy mean for pixel-wise segmentation? Is it correctly classified pixels divided by the total number of pixels? >> Yep, that's it.

So if you imagine each pixel was a separate object you're classifying, it's exactly the same accuracy. And so you actually can just pass in accuracy as your metric, but in this case, we actually don't. We've created a new metric called acc_camvid, and the reason for that is that when they labeled the images, sometimes they labeled a pixel as void.

I'm not quite sure why, maybe some that they didn't know, or somebody felt that they'd made a mistake, or whatever, but some of the pixels are void. And in the CamVid paper, they say when you're reporting accuracy, you should remove the void pixels. So we've created acc_camvid. All metrics take the actual output of the neural net (that's what they call the input to the metric) and the target, i.e.

the labels we're trying to predict. So we then basically create a mask, so we look for the places where the target is not equal to void. And then we just take the input, do the argmax as per usual, just the standard accuracy argmax, but then we just grab those that are not equal to the void code, and we do the same for the target, and we take the mean.

So it's just a standard accuracy, it's almost exactly the same as the accuracy source code we saw before with the addition of this mask. So this quite often happens, that the particular Kaggle competition metric you're using, or the particular way your organization scores things or whatever, there's often little tweaks you have to do, and this is how easy it is.
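As a rough sketch, the metric he's describing looks something like this (the function name and the 'Void' class name follow the CamVid notebook conventions; treat them as assumptions):

    name2id = {v: k for k, v in enumerate(codes)}
    void_code = name2id['Void']                       # assumed name of the void class in codes.txt

    def acc_camvid(input, target):
        target = target.squeeze(1)
        mask = target != void_code                    # only score the pixels that aren't void
        return (input.argmax(dim=1)[mask] == target[mask]).float().mean()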

And so as you'll see, to do this stuff, the main thing you need to know pretty well is how to do basic mathematical operations in PyTorch. So that's just something you kind of need to practice. >> I've noticed that most of the examples in most of my models result in a training loss greater than the validation loss.

What are the best ways to correct that? I should add that this still happens after trying many variations on number of epochs and learning rate. >> Okay, good question. So remember from last week, if your training loss is higher than your validation loss, then you're underfitting, okay? It definitely means that you're underfitting, you want your training loss to be lower than your validation loss.

If you're underfitting, you can train for longer, you can train the last bit at a lower learning rate, but if you're still underfitting, then you're gonna have to decrease regularization, and we haven't talked about that yet. So in the second half of this part of the course, we're gonna be talking quite a lot about regularization and specifically how to avoid overfitting or underfitting by using regularization.

If you wanna skip ahead, weight decay, dropout, and data augmentation will be the key things that we're talking about. Okay, for segmentation, we don't just create a convolutional neural network. We can, but actually, an architecture called U-Net turns out to be better, and actually, let's find it.

Okay, so this is what a U-Net looks like, and this is from the university website where they talk about the U-Net, and so we'll be learning about this both in this part of the course and in part two, if you do it. But basically, this bit down on the left-hand side is what a normal convolutional neural network looks like.

It's something which starts with a big image and gradually makes it smaller and smaller and smaller and smaller until eventually you just have one prediction. What a U-Net does is it then takes that and makes it bigger and bigger and bigger again, and then it takes every stage of the downward path and kind of copies it across, and it creates this U shape.

It was originally actually created or published as a biomedical image segmentation method, but it turns out to be useful for far more than just biomedical image segmentation. So it was presented at MICCAI, which is the main medical imaging conference, and as of just yesterday, it actually just became the most cited paper of all time from that conference.

So it's been incredibly useful, over 3,000 citations. You don't really need to know any of the details at this stage. All you need to know is if you want to create a segmentation model, you want to be saying Learner.create_unet rather than create_cnn. But you pass it the normal stuff: your data bunch, an architecture, and some metrics.
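In recent fastai v1 releases that's spelled unet_learner; either way, it's roughly this (a sketch; the architecture and metric are just the ones used above):

    learn = unet_learner(data, models.resnet34, metrics=acc_camvid)
    # in earlier library versions this was spelled something like Learner.create_unet(data, models.resnet34, ...)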

So having done that, everything else works the same. You can do the LR finder, find the slope, train it for a while, watch the accuracy go up, save it from time to time, unfreeze, probably want to go about 10x less, so it's still going up -- so probably 10x less than that.

So 1e-5 to lr/5, train a bit more, and there we go. Now here's something interesting. learn.recorder is where we keep track of what's going on during training, and it's got a number of nice methods, one of which is plot_losses, and this plots your training loss and your validation loss.
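Both of those plots come straight off the recorder; a minimal sketch:

    learn.recorder.plot_losses()   # training and validation loss over the course of training
    learn.recorder.plot_lr()       # the learning rate schedule: up, then back down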

And you'll see quite often they actually go up a bit before they go down. Why is that? That's because you can also plot your learning rate over time, and you'll see that your learning rate goes up and then it goes down. Why is that? Because we said fit one cycle, and that's what fit one cycle does.

It actually makes the learning rate start low, go up, and then go down again. Why is that a good idea? Well, to find out why that's a good idea, let's first of all look at a really cool project done by Jose Fernandez-Portal during the week. He took our gradient descent demo notebook and actually plotted the weights over time, not just the ground truth and model over time.

And he did it for a few different learning rates. And so remember, we had two weights. We were doing basically y equals ax plus b, or in his nomenclature here, y equals w naught x plus w1. And so we can actually look and see over time what happens to those weights.

And we know this is the correct answer here, right? So at a learning rate of 0.1, it kind of slides on in here, and you can see that it takes a little bit of time to get to the right point, and you can see the loss improving. At a higher learning rate of 0.7, you can see that the ground truth, the model jumps to the ground truth really quickly.

And you can see that the weights jump straight to the right place really quickly. What if we have a learning rate that's really too high? You can see it takes a very, very, very long time to get to the right point. Or if it's really too high, it diverges.
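You can get the flavour of that experiment with a few lines of plain PyTorch; this is just a toy sketch (not Jose's actual notebook), with made-up data:

    import torch

    x = torch.linspace(-1, 1, 100)
    y = 3.0 * x + 2.0 + 0.1 * torch.randn(100)       # "ground truth" data for y = w0*x + w1
    w = torch.tensor([0.0, 0.0], requires_grad=True)
    lr = 0.7                                         # try 0.1, 0.7, and something too high, and watch the weights

    for _ in range(100):
        loss = ((w[0] * x + w[1] - y) ** 2).mean()   # mean squared error
        loss.backward()
        with torch.no_grad():
            w -= lr * w.grad                         # the gradient descent step
            w.grad.zero_()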

So you can see here why getting the right learning rate is important. When you get the right learning rate, it really zooms into the best spot very quickly. Now as you get closer to the final spot, something interesting happens, which is that you really want your learning rate to decrease, because you're getting close to the right spot.

And what actually happens -- so what actually happens is -- I can only draw 2D, sorry. You don't generally actually have some kind of loss function surface that looks like that. Remember, there's lots of dimensions, but it actually tends to look bumpy, like that. And so you kind of want a learning rate that's high enough to jump over the bumps.

But then once you get close to the middle, once you get close to the best answer, you don't want to be just jumping backwards and forwards between bumps. So you really want your learning rate to go down so that as you get closer, you take smaller and smaller steps.

So that's why it is that we want our learning rate to go down at the end. Now this idea of decreasing the learning rate during training has been around forever, and it's just called learning rate annealing. But the idea of gradually increasing it at the start is much more recent, and it mainly comes from a guy called Leslie Smith.

If you're in San Francisco next week, actually, you can come and join me and Leslie Smith. We're having a meetup where we'll be talking about this stuff, so come along to that. What Leslie discovered is that if you gradually increase your learning rate, what tends to happen is -- well, loss function surfaces tend to kind of look something like this: bumpy, bumpy, bumpy, flat, bumpy, bumpy, bumpy, something like this, right?

They have flat areas and bumpy areas. And if you end up in the bottom of a bumpy area, that solution will tend not to generalize very well because you found a solution that's -- it's good in that one place, but it's not very good in other places, whereas if you found one in the flat area, it probably will generalize well because it's not only good in that one spot, but it's good kind of around it as well.

If you have a really small learning rate, it will tend to kind of settle down and stick in these places, right? But if you gradually increase the learning rate, then it will kind of like jump down and then as the learning rate goes up, it's going to start kind of going up again like this, right?

And then the learning rate is now going to be up here. It's going to be bumping backwards and forwards, and eventually the learning rate starts to come down again, and so it will tend to find its way to these flat areas. So it turns out that gradually increasing the learning rate is a really good way of helping the model to explore the whole function surface and try and find areas where both the loss is low and also it's not bumpy, because if it was bumpy, it would get kicked out again.

And so this allows us to train at really high learning rates, so it tends to mean that we solve our problem much more quickly and we tend to end up with much more generalizable solutions. So if you call plot losses and find that it's just getting a little bit worse and then it gets a lot better, you've found a really good maximum learning rate.

So when you actually call fit one cycle, you're not actually passing in a learning rate, you're actually passing in a maximum learning rate. And if it's kind of always going down, particularly after you unfreeze, that suggests you could probably bump your learning rates up a little bit, because you really want to see this kind of shape.

It's going to train faster and generalize better, just a little bit. And you'll tend to particularly see it in the validation set, the orange is the validation set. And again, the difference between knowing the theory and being able to do it is looking at lots of these pictures. So after you train stuff, type learn.recorder.

and hit tab and see what's in there, right? And particularly the things that start with plot and start getting a sense of, like, what are these pictures looking like when you're getting good results? And then try making the learning rate much higher, try making it much lower, more epochs, less epochs and get a sense for what these look like.

So in this case, we used a size in our transforms of the original image size over 2. These two slashes in Python mean integer divide, okay? Because obviously we can't have half-pixel amounts in our sizes. So it's the size, integer-divided by 2. And we used a batch size of 8.

And I found that fits on my GPU. It might not fit on yours. If it doesn't, you can just decrease the batch size down to 4. And this isn't really solving the problem, because the problem is to segment all of the pixels, not half of the pixels. So I'm going to use the same trick that I did last time, which is I'm now going to put the size up to the full size of the source images, which means I now have to halve my batch size, otherwise I run out of GPU memory.

And I'm then going to set my learner. I can either say learn.data equals my new data, or I actually found I've had a lot of trouble with kind of GPU memory, so I generally restarted my kernel, came back here, created a new learner, and loaded up the weights that I saved last time.

But the key thing here being that this learner now has the same weights that I had here, but the data is now the full image size. So I can now do an LR find again, find an area where it's kind of, you know, well before it goes up. So I'm going to use 1e-3 and fit some more.
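In code, the full-size stage looks roughly like this (a sketch; src, acc_camvid and the saved checkpoint name come from the earlier steps and are assumptions here, as is the exact learning rate):

    size = src_size                   # the full source size this time
    data = (src.transform(get_transforms(), size=size, tfm_y=True)
            .databunch(bs=4)          # half the batch size, so it still fits in GPU memory
            .normalize(imagenet_stats))

    learn = unet_learner(data, models.resnet34, metrics=acc_camvid)
    learn.load('stage-2')             # hypothetical name of the weights saved at the smaller size
    learn.lr_find()
    learn.fit_one_cycle(10, slice(1e-3))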

And then unfreeze and fit some more. And you can go learn.show_results to see how your predictions compare to the ground truth. And you've got to say they really look pretty good. Not bad, huh? So, how good is pretty good? An accuracy of 92.15. The best paper I know of for segmentation was a paper called The One Hundred Layers Tiramisu, which developed a convolutional DenseNet, and came out about two years ago.

So after I trained this today, I went back and looked at the paper to find their state of the art accuracy. Here it is. And I looked it up. And their best was 91.5. And we got 92.1. So I've got to say, when this happened today, I was like, wow.

I don't know if better results have come out since this paper. But I remember when this paper came out, and it was a really big deal. And I was like, wow. This is an exceptionally good segmentation result. Like when you compare it to the previous bests that they compared it to, it was a big step up.

And so like in last year's course, we spent a lot of time in the course re-implementing the One Hundred Layers Tiramisu. And now, with our totally default fastai class, I'm easily beating this. And I also remember that it had to train for hours and hours and hours, whereas today's I trained in minutes.

So this is a super strong architecture for segmentation. So yeah, I'm not going to promise that this is the definite state of the art today because I haven't done a complete literature search to see what's happened in the last two years. But it's certainly beating the world's best approach the last time I looked into this, which was in last year's course, basically.

And so these are kind of just all the little tricks I guess we've picked up along the way in terms of like how to train things well. Things like using the pre-trained model and things like using the one cycle convergence and all these little tricks. They work extraordinarily well.

And it's really nice to be able to like show something in class where we can say, we actually haven't published the paper on the exact details of how this variation of the U-Net works. There's a few little tweaks we do. But if you come back for part two, we'll be going into all of the details about how we make this work so well.

But for you, all you have to know at this stage is that you can say Learner.create_unet and you should get great results also. There's another trick you can use if you're running out of memory a lot, which is you can actually do something called mixed precision training. And mixed precision training means that instead of using, for those of you that have done a little bit of computer science, instead of using single precision floating point numbers, you can do most of the calculations in your model with half precision floating point numbers.

So 16 bits instead of 32 bits. Tradition -- I mean, the very idea of this has only been around really for the last couple of years in terms of, like, hardware that actually does this reasonably quickly. And the fastai library, I think, is the first and probably still the only one that makes it actually easy to use this.

If you add .to_fp16() on the end of any learner call, you're actually going to get a model that trains in 16-bit precision. Because it's so new, you'll need to have kind of the most recent CUDA drivers and all that stuff for this even to work. I tried it this morning on some of the platforms.

It just killed the kernel. So you need to make sure you've got the most recent drivers. But if you've got a really recent GPU, like a 2080 Ti, not only will it work, but it will work about twice as fast as otherwise. Now, the reason I'm mentioning it is that it's going to use less GPU RAM.

So even if you don't have like a 2080 Ti, you might find--or you'll probably find that things that didn't fit into your GPU without this then do fit in with this. Now, I actually have never seen people use 16-bit precision floating point for segmentation before. Just for a bit of a laugh, I tried it and actually discovered that I got an even better result.

So I only found this this morning, so I don't have anything more to add here other than: quite often when you make things a little bit less precise in deep learning, it generalizes a little bit better. And I've never seen a 92.5 accuracy on CamVid before. So yeah, not only will this be faster and you'll be able to use bigger batch sizes, but you might even find, like I did, that you get an even better result.

So that's a cool little trick. You just need to make sure that every time you create a learner, you add this .to_fp16(). If your kernel dies, it probably means you have slightly out of date CUDA drivers, or maybe even a too-old graphics card. I'm not sure exactly which cards support FP16.
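A minimal sketch of what that looks like:

    learn = unet_learner(data, models.resnet34, metrics=acc_camvid).to_fp16()
    # needs recent CUDA drivers; on a card with fast fp16 (e.g. a 2080 Ti) it's also roughly twice as fast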

Okay, so one more before we kind of rewind. Sorry, two more. The first one I'm going to show you is an interesting data set called the BIWI head pose data set. And Gabriele Fanelli was kind enough to give us permission to use this in the class. His team created this cool data set.

Here's what the data set looks like. It's pictures. It's actually got a few things in it. We're just going to do a simplified version. And one of the things they do is they have a dot saying this is the center of the face. And so we're going to try and create a model that can find the center of a face.

So for this data set, there's a few data set specific things we have to do which I don't really even understand, but I just know from the readme that you have to. They use some kind of depth sensing camera. I think they actually used a Kinect, you know, Xbox Kinect.

There's some kind of calibration numbers that they provide in a little file which I had to read in. And then they provided a little function that you have to use to take their coordinates to change it from this depth sensor calibration thing to end up with actual coordinates. So when you open this and you see these little conversion routines, that's just, you know, I'm just doing what they told us to do basically.

It's got nothing particularly to do with deep learning to end up with this dot. The interesting bit really is where we create something which is not an image or an image segment, but an image points. And we'll mainly learn about this later in the course. But basically, image points use this idea of kind of the coordinates, right?

They're not pixel values, they're XY coordinates. There's just two numbers. As you can see--let me see. Okay. So here's an example for a particular image file name, this particular image file, and here it is. The coordinates of the center of the face are 263, 428. And here it is.

So there's just two numbers which represent whereabouts on this picture the center of the face is. So if we're going to create a model that can find the center of a face, we need a neural network that spits out two numbers. But note, this is not a classification model.

These are not two numbers that you look up in a list to find out that they're road or building or ragdoll cat or whatever. They're actual locations. So, so far everything we've done has been a classification model, something that's created labels or classes. This, for the first time, is what we call a regression model.

A lot of people think regression means linear regression. It doesn't. Regression just means any kind of model where your output is some continuous number or set of numbers. So, this is, we need to create an image regression model, something that can predict these two numbers. So how do you do that?

Same way as always, right? So we can actually just say I've got a list of image files, it's in a folder, and I want to label them using this function that we wrote that basically does the stuff that the README says to grab the coordinates out of their text files.

So that's going to give me the two numbers for every one, and then I'm going to split it according to some function. And so in this case, the files they gave us, again, they're from videos, and so I picked just one folder to be my validation set -- in other words, a different person.

So again, I was trying to think about, like, how do I validate this fairly? So I said, well, the fair validation would be to make sure that it works well on a person that it's never seen before. So my validation set is all going to be a particular person.

Create a data set, and so for this data set, I just tell it what kind of data set it is. Well, they're going to be a set of points. So points means, you know, specific coordinates. Do some transforms. Again, I have to say transform y equals true (tfm_y=True), because that red dot needs to move if I flip or rotate or whatever, right?

Pick some size, I just picked a size that's going to work pretty quickly. Create a data bunch, normalize it, and again, show batch, there it is. Okay? And notice that their red dots don't always seem to be quite in the middle of the face. I don't know exactly what their kind of internal algorithm for putting dots on was.

It kind of sometimes looks like it's meant to be the nose, but sometimes it's not quite the nose. Anyway, you get the -- it's somewhere around the center of the face, or the nose. So how do we create a model? We create a CNN. But we're going to be learning a lot about loss functions in the next few lessons.

But generally, basically the loss function is that number that says how good is the model. And so for classification, we use this loss function called cross-entropy loss, which says basically -- you remember this from earlier lessons? Did you predict the correct class, and were you confident of that prediction?

Now, we can't use that for regression. So instead, we use something called mean-squared error. And if you remember from last lesson, we actually implemented mean-squared error from scratch. It's just the difference between the two, squared, and averaged. Okay. So we need to tell it this is not classification, so we use mean-squared error.

And then once we've created the learner and told it what loss function to use, we can go ahead and do lr_find. We can then fit. And you can see here, within a minute and a half, our mean-squared error is 0.0004.

Now, the nice thing about mean-squared error is that it's very easy to interpret, right? So we're trying to predict something which is somewhere around a few hundred, and we're getting a squared error on average of 0.0004. So we can feel pretty confident that this is a really good model.

And then we can look at the results by learn.show_results, and we can see predictions, ground truth. It's doing a nearly perfect job. Okay? So that's how you can do image regression models. So any time you've got something you're trying to predict, which is some continuous value, you use an approach that's something like this.
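The whole head-pose pipeline he's just walked through is roughly this (a sketch based on the head-pose notebook; get_ctr, the hold-out folder name, the image size, the learning rate and the epoch count are all illustrative assumptions):

    data = (PointsItemList.from_folder(path)
            .split_by_valid_func(lambda o: o.parent.name == '13')   # hold out one person's folder as validation
            .label_from_func(get_ctr)                               # the function that reads the centre coordinates
            .transform(get_transforms(), tfm_y=True, size=(120, 160))
            .databunch()
            .normalize(imagenet_stats))

    learn = create_cnn(data, models.resnet34, loss_func=MSELossFlat())  # regression, so mean squared error
    learn.lr_find()
    learn.fit_one_cycle(5, 2e-2)
    learn.show_results()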

So last example, before we look at some kind of more foundational theory stuff, NLP. And next week we're going to be looking at a lot more NLP. But let's now do the same thing, but rather than creating a classification of pictures, let's try and classify documents. And so we're going to go through this in a lot more detail next week, but let's do the quick version.

Rather than importing from fastai.vision, I now import, for the first time, from fastai.text. That's where you'll find all the application-specific stuff for analyzing text documents. And in this case, we're going to use a dataset called IMDb. And IMDb has lots of movie reviews. They're generally about a couple of thousand words.

And each movie review has been classified as either negative or positive. So it's just in a CSV file, so we can use pandas to read it, we can take a little look, we can take a look at a review. And basically, as per usual, we can either use factory methods or the data block API to create a data bunch.

So here's the quick way to create a data bunch from a CSV of texts, data bunch from CSV, and that's that. And yeah, at this point, I could create a learner and start training it. But we're going to show you a little bit more detail, which we're mainly going to look at next week.
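The quick factory-method version is about one line (a sketch; the CSV name is the one the course notebook happens to use):

    from fastai.text import *

    data_lm = TextDataBunch.from_csv(path, 'texts.csv')   # tokenizes and numericalizes behind the scenes
    data_lm.save()                                        # so we don't have to redo that work next time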

The steps that actually happen when you create these data bunches is there's a few steps. The first is it does something called tokenization, which is it takes those words and converts them into a standard form of tokens, where basically each token represents a word. But it does things like -- see here, see how "didn't" has been turned into two separate tokens?

And you see how everything's been lower cased? See how "you're" has been turned into two separate words? So tokenization is trying to make sure that each token, each thing that we've got with spaces around it here, represents a single linguistic concept. Also, it finds words that are really rare, like really rare names and stuff like that, and replaces them with a special token called unknown. So anything starting with xx in fastai is some special token.

So anything starting with XX and fast AI is some special token. So that's just tokenization. So we end up with something where we've got a list of tokenized words. You'll also see that things like punctuation end up with spaces around them to make sure that they're separate tokens. The next thing we do is we take a complete unique list of all of the possible tokens, that's called the vocab, and that gets created for us.

And so here's the first ten items of the vocab. So here is every possible token, the first ten of them that appear in all of the movie reviews. And we then replace every movie review with a list of numbers. And the list of numbers simply says what numbered thing in the vocab is in this place.

So here's 6 -- zero, one, two, three, four, five, six -- so this is the word "a." And this is 3 -- zero, one, two, three -- this is a comma, and so forth. So through tokenization and numericalization, this is the standard way in NLP of turning a document into a list of numbers.

We can do that with the data block API, right? So this time it's not an image file list, it's a text list: split the data from a CSV, convert them to data sets, tokenize them, numericalize them, create a data bunch, and at that point we can start to create a model. As we'll learn next week, when we do NLP classification, we actually create two models.

The first model is something called a language model, which, as you can see, we train in a kind of a usual way. We say we want to create a language model learner, we train it, we can save it, we unfreeze, we train some more, and then after we've created a language model, we fine tune it to create the classifier.

So here's the thing where we create the data bunch for the classifier, we create a learner, we train it, and we end up with some accuracy. So that's the really quick version. We're going to go through it in more detail next week. But you can see the basic idea of training an NLP classifier is very, very, very similar to creating every other model we've seen so far.
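In code, the two-model approach looks roughly like this (a sketch; data_clas is the classifier data bunch just mentioned, and the architecture, dropout multipliers, learning rates and epoch counts are illustrative, with exact signatures varying a bit between fastai versions):

    # 1. fine-tune a language model on the movie reviews
    learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
    learn_lm.fit_one_cycle(1, 1e-2)
    learn_lm.unfreeze()
    learn_lm.fit_one_cycle(1, 1e-3)
    learn_lm.save_encoder('ft_enc')            # keep the fine-tuned encoder

    # 2. use that encoder inside a classifier
    learn_clas = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
    learn_clas.load_encoder('ft_enc')
    learn_clas.fit_one_cycle(1, 1e-2)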

And this accuracy, so the current state of the art for IMDB classification is actually the algorithm that we built and published with a colleague named Sebastian Ruder, and this basically, what I just showed you is pretty much the state of the art algorithm with some minor tweaks. You can get this up to about 95% if you try really hard.

So this is very close to the state of the art accuracy that we developed. There's a question. Okay, now's a great time for a question. >> For a dataset very different than ImageNet, like the satellite images or genomic images shown in lesson two, we should use our own stats.

Jeremy once said if you're using a pre-trained model, you need to use the same stats it was trained with. Why is that? Isn't it that normalized data with its own stats will have roughly the same distribution like ImageNet? The only thing I can think of which may differ is skewness.

Is it the possibility of skewness or something else the reason of your statement? And does that mean you don't recommend using pre-trained models with very different datasets like the one-point mutation that you mentioned in lesson two? >> No. As you can see, I've used pre-trained models for all of those things.

Every time I've used an ImageNet-trained model, and every time I've used ImageNet stats. Why is that? Because that model was trained with those stats. So for example, imagine you're trying to classify different types of green frogs. If you were to use your own per-channel means from your dataset, you would end up converting them to a mean of zero, standard deviation of one for each of your red, green and blue channels, which means they don't look like green frogs anymore.

They now look like gray frogs, right? But ImageNet expects frogs to be green, okay? So you need to normalize with the same stats that the ImageNet training people normalized with, otherwise the unique characteristics of your dataset won't appear anymore. You actually normalize them out in terms of the per-channel statistics.

So you should always use the same stats that the model was trained with. Okay. So in every case, what we're doing here is we're using gradient descent with mini-batches, so stochastic gradient descent, to fit some parameters of a model. And those parameters are parameters to basically matrix multiplications. In the second half of this part, we're actually going to learn about a little tweak called convolutions, but it's basically a type of matrix multiplication.

The thing is, though, no amount of matrix multiplications is possibly going to create something that can read IMDB reviews and decide if it's positive or negative, or look at satellite imagery and decide whether it's got a road in it. That's far more than a linear classifier can do. Now, we know these are deep neural networks, and deep neural networks contain lots of these matrix multiplications.

But every matrix multiplication is just a linear model, and a linear function on top of a linear function is just another linear function. If you remember back to your high school math, you might remember that if you have a Y equals AX plus B, and then you stick another CY plus D on top of that, it's still just another slope and another intercept.

So no amount of stacking matrix multiplications is going to help in the slightest. So what are these models actually -- what are we actually doing? And here's the interesting thing. All we're actually doing is we literally do have a matrix multiplication or a slight variation like a convolution that we'll learn about.

But after each one, we do something called a non-linearity or an activation function. An activation function is something that takes the result of that matrix multiplication and sticks it through some function. And these are some of the functions that we use. In the old days, the most common function that we used to use was basically this shape.

These shapes are called sigmoid. And they have, you know, particular mathematical definitions. Nowadays, we almost never use those for these -- between each matrix multiply. Nowadays, we nearly always use this one. It's called a rectified linear unit. It's very important when you're doing deep learning to use big long words that sound impressive, otherwise normal people might think they can do it too.

But just between you and me, a rectified linear unit is defined using the following function. That's it. Okay. So -- and if you want to be really exclusive, of course, you then shorten the long version and you call it a ReLU to show that you're really in the exclusive team.
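In other words, all a ReLU does is this:

    def relu(x):
        return max(x, 0.0)    # replace negatives with zero; in PyTorch, x.clamp_min(0.) does the same elementwise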

So this is a ReLU activation. So here's the crazy thing. If you take your red, green, blue pixel inputs and you chuck them through a matrix multiplication and then you replace the negatives with zero and you put it through another matrix multiplication, place the negatives with zero and you keep doing that again and again and again, you have a deep learning neural network.

That's it. All right. So how the hell does that work? So an extremely cool guy called Michael Nielsen showed how this works. He has a very nice website -- it's actually more than a website, it's a book -- neuralnetworksanddeeplearning.com. And he has these beautiful little JavaScript things where you can get to play around -- because this was back in the old days, this was back when we used to use sigmoids, right?

And what he shows is that if you have enough little -- he shows these little matrix multiplications. If you have enough little matrix multiplications followed by sigmoids -- and exactly the same thing works for a matrix multiplication followed by a ReLU -- you can actually create arbitrary shapes, right? And so this idea that these combinations of linear functions and nonlinearities can create arbitrary shapes actually has a name.

And this name is the universal approximation theorem. And what it says is that if you have stacks of linear functions and nonlinearities, the thing you end up with can approximate any function arbitrarily closely. So you just need to make sure that you have a big enough matrix to multiply by or enough of them.

So if you have, you know, now this function, which is just a sequence of matrix multipliers and nonlinearities, where the nonlinearities can be, you know, basically any of these things. And we normally use this one. If that can approximate anything, then all you need is some way to find the particular values of the weight matrices in your matrix multipliers that solve the problem you want to solve.

And we already know how to find the values of parameters. We can use gradient descent. And so that's actually it, right? And this is the bit I find the hardest thing normally to explain to students is that we're actually done now. People often come up to me after this lesson and they say, what's the rest?

Please explain to me the rest of deep learning. But, like, no, there's no rest. Like, we have a function where we take our input pixels or whatever, we multiply them by some weight matrix, we replace the negatives with zeros, we multiply it by another weight matrix, replace the negatives with zeros, we do that a few times, we see how close it is to our target, and then we use gradient descent to update our weight matrices using the derivatives.

And we do that a few times. And eventually we end up with something that can classify movie reviews or can recognize pictures of ragdoll cats. That's actually it. Okay? So the reason it's hard to understand intuitively is because we're talking about weight matrices that have, you know, once you wrap them all up, something like 100 million parameters.

They're very big weight matrices, right? So your intuition about what multiplying something by a linear model and replacing the negatives with zeros a bunch of times can do, your intuition doesn't hold, right? You just have to accept empirically the truth is doing that works really well. So in part two of the course, we're actually going to build these from scratch, right?

But I mean, just to skip ahead, you'll basically will find that, you know, it's going to be kind of five lines of code, right? It's going to be a little for loop that goes, you know, t equals, you know, x at weight matrix one, t two equals max of t comma zero, stick that in a for loop that goes through each weight matrix and at the end calculate my loss function.
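As a rough sketch of what that for loop might look like (illustrative only; part two builds the real thing):

    import torch

    def forward(x, weight_matrices):
        for w in weight_matrices[:-1]:
            x = (x @ w).clamp_min(0.)    # matrix multiply, then ReLU: replace negatives with zero
        return x @ weight_matrices[-1]   # last layer: no ReLU, just the raw outputs

    # e.g. a tiny 10 -> 50 -> 2 network with random weights
    weights = [torch.randn(10, 50), torch.randn(50, 2)]
    out = forward(torch.randn(64, 10), weights)      # 64 inputs in, 64 pairs of numbers out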

And of course we're not going to calculate the gradients ourselves because PyTorch does that for us. And that's about it. So, okay, question. >> There's a question about tokenization. I'm curious about how tokenizing words works when they depend on each other such as San Francisco. >> Yeah, okay. Okay.

Tokenization, how do you tokenize something like San Francisco? San Francisco contains two tokens, San Francisco. That's it. That's how you tokenize San Francisco. The question may be coming from people who have done, like, traditional NLP often need to kind of use these things called ngrams. And ngrams are kind of this idea of, like, a lot of NLP in the old days was all built on top of linear models where you basically counted how many times particular strings of text appeared, like the phrase San Francisco.

That would be a bigram, or an ngram with an n of two. The cool thing is that with deep learning we don't have to worry about that. Like with many things, a lot of the complex feature engineering disappears when you do deep learning. So with deep learning, each token is literally just a word -- or, in the case that the word really consists of two words, like "you're", you split it into two tokens.

And then what we're going to do is we're going to then let the deep learning model figure out how best to combine words together. Now when we say, like, let the deep learning model figure it out, of course, all we really mean is find the weight matrices using gradient descent to give the right answer.

Like, there's not really much more to it than that. Again, there's some minor tweaks, right? In the second half of the course we're going to be learning about the particular tweak for image models, which is using a convolution -- that'll be a CNN. For language, there's a particular tweak we do called using recurrent models, or an RNN.

But they're very minor tweaks on what we've just described. So basically it turns out with an RNN that it can learn that San plus Francisco has a different meaning when those two things are together. >> Some satellite images have four channels. How can we deal with data that has four channels or two channels when using pre-trained models?

>> Yeah, that's a good question. I think that's something that we're going to try and incorporate into fast AI. So hopefully by the time you watch this video there will be easier ways to do this. But the basic idea is a pre-trained image net model expects red, green, and blue pixels.

So if you've only got two channels there's a few things you can do. But basically you want to create a third channel. And so you can create the third channel as either being all zeros or it could be the average of the other two channels. And so you can just use normal PyTorch arithmetic to create that third channel.

You could either do that ahead of time in a little loop and save your three channel versions or you could create a custom dataset class that does that on demand. For fourth channel you probably don't want to get rid of the fourth channel. So instead what you'd have to do is to actually modify the model itself.

So to know how to do that we'll only know how to do that in a couple more lessons time. But basically the idea is that the initial weight matrix, weight matrix is really the wrong term. They're not weight matrices, they're weight tensors so they can have more than just two dimensions.

So that initial weight matrix in the neural net it's going to have, it's actually a tensor and one of its axes is going to have three slices in it. So you would just have to change that to add an extra slice which I would generally just initialize to zero or to some random numbers.
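As a sketch of both ideas (the conv1 attribute name is what a torchvision ResNet happens to use; treat the details as assumptions until the later lessons cover this properly):

    import torch
    from torchvision.models import resnet34

    # two channels -> three: add a channel that's all zeros, or the mean of the other two
    two_ch_img = torch.randn(2, 224, 224)                 # stand-in for a 2-channel image
    third = two_ch_img.mean(dim=0, keepdim=True)
    three_ch_img = torch.cat([two_ch_img, third], dim=0)  # shape [3, 224, 224]

    # four channels: grow the first layer's weight tensor instead of throwing data away
    model = resnet34(pretrained=True)
    w = model.conv1.weight.detach()                       # shape [64, 3, 7, 7]
    extra = torch.zeros(w.shape[0], 1, *w.shape[2:])      # new slice, initialised to zero (or small random numbers)
    model.conv1 = torch.nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3, bias=False)
    model.conv1.weight = torch.nn.Parameter(torch.cat([w, extra], dim=1))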

So that's the short version. But really to answer this, to understand exactly what I meant by that, we're going to need a couple more lessons to get there. Okay, so wrapping up, what have we looked at today? Basically we started out by saying, hey, it's really easy now to create web apps.

We've got starter kits for you that show you how to create web apps, and people have created some really cool web apps using what we've learned so far, which is single label classification. But the cool thing is, the exact same steps we use to do single label classification you can also use to do multi-label classification, such as in the planet dataset, or you could use to do segmentation, or you could use to do any kind of image regression, or (this is probably a bit early, so don't try this yet) you could do NLP classification, and a lot more.

So in each case, all we're actually doing is gradient descent on not just two parameters but on maybe 100 million parameters -- but still just plain gradient descent, along with a non-linearity, which is normally this one. And the universal approximation theorem tells us that this lets us arbitrarily accurately approximate any given function, including functions such as converting a spoken waveform into the thing the person was saying, or converting a sentence in Japanese to a sentence in English, or converting a picture of a dog into the word dog.

These are all mathematical functions that we can learn using this approach. So this week, see if you can come up with an interesting idea of a problem that you would like to solve which is either multi-label classification or image regression or image segmentation, something like that and see if you can try to solve that problem.

You will probably find the hardest part of solving that problem is creating the data bunch, and so then you'll need to dig into the data block API to try to figure out how to create the data bunch from the data you have. And with some practice you will start to get pretty good at that.

It's not a huge API, there's a small number of pieces, it's also very easy to add your own but for now, you know, ask on the forum if you try something and you get stuck. Okay, great. So next week we're going to come back and we're going to look at some more NLP.

We're going to learn some more details about how we actually train with SGD quickly. We're going to learn about things like Adam and RMSProp and so forth. And hopefully we're also going to show off lots of really cool web apps and models that you've all built during the week.

So I'll see you then. Thanks.