
Lesson 3: Deep Learning 2019 - Data blocks; Multi-label classification; Segmentation


Chapters

0:00
0:54 Machine Learning Course
2:27 Deployment
3:36 Examples of Web Apps
4:37 Guitar Classifier
5:49 American Sign Language
9:54 Satellite Images
11:01 Download the Data
13:53 Conda Installation
14:18 Unzip a 7-Zip File
17:09 The Data Block API
18:32 Data Set
21:35 Data Loader
24:10 Examples of Using the Data Block API
27:51 Object Detection Data Set
28:54 Data Block API Examples
31:25 Perspective Warping
32:21 Data Augmentation
33:46 Metrics
40:11 Partial Function Application
43:1 Fine-Tuning
47:38 What Resources Do You Recommend for Getting Started with Video
51:44 Transfer Learning
57:59 Build a Segmentation Model
60:43 Reading the Learning Rate Finder Graph
65:53 Create a DataBunch
66:10 Split into Training and Validation
67:33 Transformations
72:34 What Does Accuracy Mean for Pixel-Wise Segmentation
76:25 Segmentation
77:9 Convolutional Neural Network
79:3 Plot Losses
82:36 Decreasing the Learning Rate during Training
91:10 Mixed Precision Training
95:35 Image Points
96:59 Regression Model
97:12 Create an Image Regression Model
99:22 Cross-Entropy Loss
99:37 Mean Squared Error
102:48 Tokenization
104:5 Vocab
105:27 Language Model
110:35 Activation Function
112:32 Michael Nielsen
113:27 The Universal Approximation Theorem
117:27 N-grams
123:00 The Universal Approximation Theorem

Whisper Transcript | Transcript Only Page

00:00:00.000 | Welcome back to lesson three.
00:00:03.160 | So we're going to start with a quick correction, which is to let you know that
00:00:09.160 | when we referred to this chart as coming from Quora last week, we were correct.
00:00:13.080 | It did come from Quora, but actually, we realized it originally came from Andrew
00:00:16.920 | Ng's excellent machine learning course on Coursera.
00:00:20.200 | So apologies for the incorrect citation.
00:00:22.960 | But in exchange, let's talk about Andrew Ng's excellent machine learning
00:00:26.160 | course on Coursera.
00:00:28.440 | It's really great, as you can see, people gave it 4.9 out of 5 stars.
00:00:33.400 | In some ways, it's a little dated, but a lot of the content really is
00:00:39.040 | as appropriate as ever, and taught in a more bottom-up style.
00:00:43.880 | So it can be quite nice to combine Andrew's bottom-up style and
00:00:47.520 | our top-down style and meet somewhere in the middle.
00:00:49.840 | Also, if you're interested in more machine learning foundations,
00:00:54.240 | you should check out our machine learning course as well.
00:00:56.800 | If you go to course.fast.ai and click on the machine learning button,
00:01:00.160 | that will take you to our course, which is about twice as long as this deep
00:01:05.080 | learning course, and kind of takes you much more gradually through some of
00:01:08.320 | the foundational stuff around validation sets and model interpretation and
00:01:13.000 | how PyTorch tensors work and stuff like that.
00:01:16.280 | So I think all these courses together, if you want to really dig deeply into
00:01:21.080 | the material, do all of them.
00:01:24.120 | I know a lot of people here have, and end up saying, I got more out of each one
00:01:28.080 | by doing a whole lot, or you can skip backwards and forwards,
00:01:30.560 | see which one works for you.
00:01:31.520 | So we started talking about deploying your web app last week.
00:01:41.040 | One thing that's gonna make life a lot easier for
00:01:44.160 | you is that on the course V3 website, there's a production section,
00:01:49.720 | where right now we have one platform, but more will be added by the time this
00:01:54.560 | video comes out, showing you how to deploy your web app really, really easily.
00:02:00.320 | And when I say easily, for example, here's the how to deploy on
00:02:05.480 | Zeit guide created by San Francisco study group member, Navjot.
00:02:10.240 | As you can see, it's just a page.
00:02:12.560 | There's almost nothing to do, and it's free.
00:02:16.160 | It's not gonna serve 10,000 simultaneous requests, but
00:02:22.080 | it'll certainly get you started, and I found it works really well.
00:02:25.440 | It's fast, and so deploying a model doesn't have to be slow or
00:02:31.360 | complicated anymore.
00:02:32.960 | And the nice thing is, you can kind of use this for an MVP.
00:02:35.760 | And if you do find you're starting to get 1,000 simultaneous requests,
00:02:39.340 | then you know that things are working out, and you can start to upgrade your instance
00:02:43.680 | types or add to a more traditional big engineering approach.
00:02:49.440 | So if you actually use this starter kit,
00:02:53.200 | it will actually create my teddy bear finder for you.
00:02:57.040 | And this is an example of my teddy bear finder.
00:02:59.360 | So the idea is it's as simple as possible, this template.
00:03:03.320 | So you can fill in your own style sheets, your own custom logic, and so forth.
00:03:08.200 | This is kind of designed to be a minimal thing, so
00:03:10.760 | you can see exactly what's going on.
00:03:12.680 | The back end is a simple kind of REST-style interface that sends back JSON.
00:03:20.600 | And the front end is a super simple little JavaScript thing.
00:03:23.760 | So yeah, it should be a good way to get a sense of how to build
00:03:30.220 | a web app which talks to a PyTorch model.
00:03:33.360 | So examples of web apps people have built during the week.
00:03:41.360 | Edward Ross built the what car is that app?
00:03:45.000 | Or more specifically, the what Australian car is that.
00:03:48.040 | I thought it was kind of interesting that Edward said on the forum that the building
00:03:52.760 | of the app was actually a great experience in terms of understanding
00:03:57.040 | how the model works himself better.
00:03:59.640 | And it's interesting that he's describing trying it out on his phone.
00:04:09.040 | A lot of people think like, if I want something on my phone, I have to create
00:04:11.960 | some kind of mobile TensorFlow, ONNX, whatever, tricky mobile app.
00:04:17.480 | You really don't.
00:04:18.120 | You can run it all in the cloud and make it just a web app or
00:04:21.760 | use some kind of simple little GUI front end that talks to a REST back end.
00:04:28.040 | It's not that often that you'll need to actually run stuff on the phone, so
00:04:31.800 | this is a good example of that.
00:04:36.120 | C. Werner has created a guitar classifier.
00:04:39.720 | You can decide whether your food is healthy or not.
00:04:44.280 | Apparently, this one is healthy.
00:04:45.440 | That can't be right.
00:04:46.240 | I would have thought a hamburger is more what we're looking for, but there you go.
00:04:49.560 | Apparently, Trinidad and Tobago is the home of the hummingbird.
00:04:54.560 | So if you're visiting, you can find out what kind of hummingbird you're looking at.
00:04:57.560 | You can decide whether or not to eat a mushroom.
00:05:02.960 | If you happen to be one of the cousins of Charlie Harrington,
00:05:06.640 | you can now figure out who is who.
00:05:08.320 | I believe this was actually designed for his fiancée.
00:05:11.160 | It'll even tell you about the interests of this particular cousin.
00:05:14.960 | So, a fairly niche application, but apparently,
00:05:18.680 | there are 36 people who will appreciate this at least.
00:05:21.760 | I have no cousins.
00:05:24.720 | That's a lot of cousins.
00:05:25.560 | This is an example of an app which actually takes a video feed and
00:05:32.080 | turns it into a motion classifier.
00:05:34.320 | That's pretty cool.
00:05:38.160 | I like it.
00:05:40.900 | Team 26, good job.
00:05:44.880 | Here's a similar one for American Sign Language.
00:05:52.320 | And so it's not a big step from taking a single image model to taking a video model.
00:06:01.600 | You can just grab the occasional frame, put it through your model, and
00:06:05.720 | update the UI as the kind of model results come in.
00:06:11.860 | So it's really cool that you can do this kind of stuff either in client or
00:06:15.920 | in browser nowadays.
00:06:17.400 | Henry Pluchy has built your city from.space,
00:06:28.520 | which he describes as creepy in how accurate it is.
00:06:32.840 | So here is where I live, which it figured out was in the United States.
00:06:35.960 | It's interesting, he describes here how he actually had to be very thoughtful about
00:06:40.800 | the validation set he built, make sure that the satellite tiles were not overlapping or
00:06:45.920 | close to each other.
00:06:47.320 | In doing so, he realized he had to download more data.
00:06:49.920 | But once he did, he got this amazingly effective model that can look at satellite
00:06:54.640 | imagery and figure out what country it's from.
00:06:58.160 | I thought this one was pretty interesting, which was doing univariate time series
00:07:02.880 | analysis by converting it into a picture using something I've never heard of,
00:07:09.480 | a Gramian angular field.
00:07:11.600 | But he says he's getting close to state of the art results for
00:07:14.440 | univariate time series modeling by turning it into a picture.
00:07:17.840 | And so I like this idea of turning stuff that's not a picture into a picture.
00:07:25.880 | So something really interesting about this project, which was looking at
00:07:30.000 | emotion classification from faces, was that he was specifically asking the question,
00:07:35.160 | how well does it go without changing anything, just using the default settings?
00:07:39.000 | Which I think is a really interesting experiment because we're all told it's
00:07:42.120 | really hard to train models and it takes a lot of specific knowledge.
00:07:47.440 | And actually we're finding that that's often not the case.
00:07:50.120 | And he looked at this facial expression recognition data set.
00:07:54.200 | There was a 2017 paper that he compared his results to, and he got equal or
00:08:00.120 | slightly better results than the state of the art paper on face recognition,
00:08:05.080 | emotion recognition without doing any custom hyperparameter tuning at all.
00:08:09.760 | So that was really cool.
00:08:10.520 | And then Elena Harley, who I featured one of her works last week,
00:08:17.080 | has done another really cool work in the genomics space, which is looking at
00:08:21.760 | variant analysis,
00:08:27.480 | looking at false positives in these kinds of pictures.
00:08:33.040 | And she found she was able to decrease the number of false positives coming out of
00:08:37.600 | the kind of industry-standard software she was using five-fold
00:08:42.880 | by using a deep learning workflow.
00:08:46.440 | I think this is a nice example of something where if you're going through
00:08:50.480 | spending hours every day looking at something, in this case,
00:08:54.320 | looking at kind of get rid of the false positives,
00:08:57.360 | maybe you can make that a lot faster by using deep learning to do a lot of the work
00:09:01.440 | for you.
00:09:02.680 | And again, this is an example of a computer vision based approach on something
00:09:06.480 | which initially wasn't actually images.
00:09:09.880 | So that's a really cool application.
00:09:12.400 | So really nice to see what people have been building in terms of both web apps,
00:09:20.440 | and just classifiers.
00:09:22.440 | What we're gonna do today is look at a whole lot more different types of model
00:09:26.560 | that you can build.
00:09:27.560 | And we're gonna kind of zip through them pretty quickly.
00:09:29.800 | And then we're gonna go back and say, like, how did all these things work?
00:09:32.880 | What's the common denominator?
00:09:33.960 | But all of these things, you can create web apps from these as well.
00:09:40.360 | But you'll have to think about how to slightly change that template to make it
00:09:44.160 | work with these different applications.
00:09:47.280 | I think that'll be a really good exercise in making sure you understand the material.
00:09:51.680 | So the first one we're gonna look at is a dataset of satellite images.
00:09:57.080 | And satellite imaging is a really fertile area for deep learning.
00:10:04.880 | It's certainly a lot of people already using deep learning and
00:10:07.560 | satellite imaging, but only scratching the surface.
00:10:10.720 | And the dataset that we're gonna look at looks like this.
00:10:15.240 | It has satellite tiles, and for each one, as you can see,
00:10:20.160 | there's a number of different labels for each tile.
00:10:23.160 | One of the labels always represents the weather that's shown.
00:10:28.560 | So in this case, cloudy or partly cloudy.
00:10:31.160 | And then all of the other labels tell you any interesting features that are seen there.
00:10:37.640 | So primary means primary rainforest.
00:10:40.800 | Agriculture means there's some farming, road, road, and so forth.
00:10:45.200 | And so, as I'm sure you can tell, this is a little different to all the classifiers
00:10:50.080 | we've seen so far, cuz there's not just one label, there's potentially multiple labels.
00:10:55.320 | So multi-label classification can be done in a very similar way.
00:10:59.840 | But the first thing we're gonna need to do is to download the data.
00:11:02.880 | Now this data comes from Kaggle.
00:11:05.640 | Kaggle is mainly known for being a competition's website.
00:11:09.240 | And it's really great to download data from Kaggle when you're learning,
00:11:12.880 | because you can see, how would I have gone in that competition?
00:11:16.000 | And it's a good way to see whether you kind of know what you're doing.
00:11:18.720 | I tend to think the goal is to try and get in the top 10%.
00:11:23.640 | And in my experience, all the people in the top 10% of a competition
00:11:26.960 | really know what they're doing.
00:11:29.560 | So if you can get in the top 10%, then that's a really good sign.
00:11:32.720 | Pretty much every Kaggle data set is not available for
00:11:37.440 | download outside of Kaggle, at least the competition data sets.
00:11:41.360 | So you have to download it through Kaggle.
00:11:43.200 | And the good news is that Kaggle provides a Python-based downloader tool,
00:11:48.480 | which you can use.
00:11:49.800 | So we've got a quick description here of how to download stuff from Kaggle.
00:11:53.400 | So to install stuff, to download stuff from Kaggle,
00:11:59.200 | you first have to install the Kaggle download tool.
00:12:02.920 | So just pip install Kaggle.
00:12:04.720 | And so you can see what we tend to do when there's one off things to do,
00:12:08.400 | is we show you the commented out version in the notebook.
00:12:10.800 | And you can just remove the comment.
00:12:12.280 | So here's a cool tip for you.
00:12:13.920 | If you select a few lines and then hit Ctrl + slash, it uncomments them all.
00:12:19.800 | And then when you're done, select them again, Ctrl + slash again, and
00:12:23.360 | re-comments them all, okay?
00:12:25.000 | So if you run this line, it'll install Kaggle for you.
00:12:28.800 | Depending on your platform, you may need sudo,
00:12:33.880 | you may need slash something else, slash pip, you may need source activate.
00:12:40.040 | So have a look on the setup instructions, actually the returning to work
00:12:44.960 | instructions on the course website to see when we do conda install,
00:12:50.600 | you have to do the same basic steps for your pip install.
00:12:53.640 | So once you've got that module installed, you can then go ahead and
00:13:01.800 | download the data.
00:13:02.960 | And basically it's as simple as saying Kaggle competitions download,
00:13:07.720 | the competition name, and then the files that you want.
00:13:12.120 | The only other steps before you do that is that you have to authenticate yourself.
00:13:17.740 | And you'll see there's a little bit of information here on exactly how you can
00:13:21.160 | go about downloading from Kaggle the file containing your
00:13:26.080 | API authentication information.
00:13:28.280 | So I won't bother going through it here, but just follow these steps.
00:13:33.600 | Sometimes stuff on Kaggle is not just zipped or tarred, but
00:13:38.520 | it's compressed with a program called 7-zip, which will have a 7Z extension.
00:13:45.760 | If that's the case, you'll need to either apt install p7zip, or
00:13:50.880 | here's something really nice.
00:13:52.640 | Some kind person has actually created a conda installation of 7-zip that works
00:13:56.480 | on every platform.
00:13:57.720 | So you can always just run this conda install,
00:14:00.040 | doesn't even require sudo or anything like that.
00:14:02.960 | And this is actually a good example of where conda is super handy,
00:14:06.360 | is that you can actually install binaries, and libraries, and
00:14:09.760 | stuff like that, and it's nicely cross platform.
00:14:12.280 | So if you don't have 7-zip installed, that's a good way to get it.
00:14:17.640 | And so this is how you unzip a 7-zip file.
00:14:22.600 | In this case, it's tarred and 7-zipped, so you can do this all in one step.
00:14:29.920 | So 7-za is the name of the 7-zip archiver program that you would run.
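As a sketch, the one-step unpack of a .tar.7z looks like this (the file name is assumed from the Planet download above):

```python
# 7za writes the decompressed tar to stdout (-so), and tar unpacks it from stdin.
# ! 7za x -so {path}/train-jpg.tar.7z | tar xf - -C {path}
```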
00:14:34.160 | Okay, so that's all basic stuff, which if you're not so familiar with the command
00:14:38.960 | line and stuff, it might take you a little bit of experimenting to get it working.
00:14:42.800 | Feel free to ask on the forum.
00:14:44.400 | Make sure you search the forum first to get started, okay.
00:14:50.720 | So once you've got the data downloaded and unzipped, you can take a look at it.
00:14:56.160 | So in this case, because we have multiple labels for
00:15:04.960 | each tile, we clearly can't have a different folder for
00:15:10.400 | each image telling us what the label is.
00:15:12.440 | We need some different way to label it.
00:15:14.680 | And so the way that Kaggle did it was they provided a CSV file that had
00:15:19.400 | each file name along with a list of all of the labels.
00:15:25.640 | In order to just take a look at that CSV file,
00:15:27.960 | we can read it using the pandas library.
00:15:31.160 | If you haven't used pandas before,
00:15:32.960 | it's kind of the standard way of dealing with tabular data in Python.
00:15:40.280 | It pretty much always appears in the PD namespace.
00:15:43.440 | In this case, we're not really doing anything with it,
00:15:45.600 | other than just showing you the contents of this file.
00:15:48.680 | So we can read it, we can take a look at the first few lines, and there it is.
00:15:53.320 | So we want to turn this into something we can use for modeling.
00:15:59.360 | So the kind of object that we use for modeling is an object of the data bunch
00:16:05.440 | plus, so we have to somehow create a data bunch out of this.
00:16:08.920 | Once we have a data bunch, we'll be able to go .show batch to take a look at it.
00:16:15.000 | And then we'll be able to go create CNN with it, and
00:16:17.600 | then we'll be able to start training, okay?
00:16:19.600 | So really, the trickiest step previously in deep learning has often been
00:16:26.480 | getting your data into a form that you can get it into a model.
00:16:29.000 | So far, we've been showing you how to do that using various factory methods.
00:16:36.000 | So methods where you basically say, I want to create this kind of data from
00:16:39.800 | this kind of source with these kinds of options.
00:16:42.640 | The problem is, I mean, that works fine sometimes, and
00:16:45.560 | we showed you a few ways of doing it over the last couple of weeks.
00:16:49.560 | But sometimes, you want more flexibility.
00:16:53.200 | Because there's so many choices that you have to make about where do the files live,
00:16:58.160 | and what's the structure they're in, and how do the labels appear, and
00:17:01.080 | how do you split out the validation set, and how do you transform it, and so forth.
00:17:05.120 | So we've got this unique API that I'm really proud of called the DataBlock API.
00:17:11.680 | And the DataBlock API makes each one of those decisions a separate decision
00:17:16.160 | that you make, there's separate methods and with their own parameters for
00:17:19.240 | every choice that you make around how do I create, set up my data.
00:17:24.720 | So for example, to grab the planet data,
00:17:28.720 | we would say we've got a list of image files that are in a folder, and
00:17:33.760 | they're labeled based on a CSV with this name.
00:17:37.640 | They have this separator.
00:17:39.040 | Remember, I showed you back here that there's a space between them.
00:17:42.000 | So by passing in separator, it's going to create multiple labels.
00:17:44.920 | The images are in this folder, they have this suffix.
00:17:48.080 | We're going to randomly split out a validation set with 20% of the data.
00:17:52.640 | We're going to create data sets from that,
00:17:55.080 | which we're then going to transform with these transformations.
00:17:58.840 | And then we're going to create a data bunch out of that,
00:18:00.880 | which we'll then normalize using these statistics.
00:18:04.120 | So there's all these different steps.
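Here's what that chain looks like as code, a sketch using the fastai v1 API as it stood when this course was recorded (method names changed in later releases, so treat them as illustrative):

```python
import numpy as np
from fastai.vision import *

np.random.seed(42)
# Build the labeled, split datasets: files from a folder, labels from a CSV
# with a space separator, 20% held out at random for validation.
src = (ImageFileList.from_folder(path)
       .label_from_csv('train_v2.csv', sep=' ', folder='train-jpg', suffix='.jpg')
       .random_split_by_pct(0.2)
       .datasets())
# Then transform, bundle into a DataBunch, and normalize with these statistics.
tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.)
data = (src.transform(tfms, size=128)
        .databunch()
        .normalize(imagenet_stats))
```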
00:18:06.080 | So to give you a sense of what that looks like,
00:18:11.280 | the first thing I'm going to do is go back and explain what are all of
00:18:15.640 | the PyTorch and FastAI classes that you need to know about that are going to
00:18:20.600 | appear in this process, because you're going to see them all the time in
00:18:24.960 | the FastAI docs and the PyTorch docs.
00:18:27.440 | So the first one you need to know about is a class called a dataset.
00:18:33.840 | And the dataset class is part of PyTorch, and
00:18:38.560 | this is the source code for the dataset class.
00:18:42.640 | As you can see, it actually does nothing at all.
00:18:46.080 | So the dataset class in PyTorch defines two things: __getitem__ and __len__.
00:18:59.040 | In Python, these special things that are underscore, underscore something,
00:19:02.680 | underscore, underscore, Pythonistas call them dunder something.
00:19:06.800 | This would be dunder getItem, dunder len.
00:19:09.440 | And they're basically special magical methods that do some special behavior.
00:19:16.000 | This particular method, you can look them up in the Python docs.
00:19:19.320 | This particular method means that your object, if you had an object called o,
00:19:23.600 | can be indexed with square brackets, something like that, right?
00:19:28.200 | So that would call getItem with three as the index.
00:19:32.760 | And then this one called len means that you can go len o and
00:19:39.080 | it will call that method.
00:19:40.240 | And you can see in this case, they're both not implemented.
00:19:43.120 | So that is to say, although PyTorch says to tell PyTorch about your data,
00:19:50.520 | you have to create a dataset.
00:19:52.400 | It doesn't really do anything to help you create the dataset.
00:19:55.520 | It just defines what the dataset needs to do.
00:19:57.800 | So in other words, your data, the starting point for your data,
00:20:01.600 | is something where you can say, what is the third item of data in my dataset?
00:20:07.080 | So that's what __getitem__ does, and how big is my dataset?
00:20:10.200 | That's what the length does.
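So a minimal Dataset subclass only has to implement those two methods; here's a toy sketch that wraps a list of (x, y) pairs:

```python
from torch.utils.data import Dataset

class PairsDataset(Dataset):
    "A toy dataset wrapping a list of (x, y) pairs."
    def __init__(self, pairs): self.pairs = pairs
    def __getitem__(self, i): return self.pairs[i]  # so ds[3] works
    def __len__(self): return len(self.pairs)       # so len(ds) works
```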
00:20:11.040 | So FastAI has lots of dataset subclasses that do that for
00:20:18.680 | all different kinds of stuff.
00:20:20.600 | And so, so far, you've been seeing image classification datasets.
00:20:25.520 | And so they're datasets where getItem will return an image and
00:20:30.760 | a single label of what is that image.
00:20:33.680 | So that's what a dataset is.
00:20:37.280 | Now, a dataset is not enough to train a model.
00:20:40.760 | The first thing we know we have to do, if you think back to the gradient descent
00:20:45.800 | tutorial last week, is we have to have a few images or
00:20:50.720 | a few items at a time so that our GPU can work in parallel.
00:20:55.120 | Remember, we do this thing called a mini-batch.
00:20:56.960 | A mini-batch is a few items that we present to the model at a time that
00:21:00.560 | we can train from in parallel.
00:21:02.200 | So to create a mini-batch, we use another PyTorch class called a data loader.
00:21:12.760 | And so a data loader takes a dataset in its constructor.
00:21:19.280 | So it's now saying, this is something I can get the third item and
00:21:22.080 | the fifth item and the ninth item.
00:21:23.960 | And it's gonna grab items at random and
00:21:27.240 | create a batch of whatever size you ask for and pop it on the GPU and
00:21:34.120 | send it off to your model for you.
00:21:35.840 | So a data loader is something that grabs individual items,
00:21:39.800 | combines them into a mini-batch, pops them on the GPU for modeling.
00:21:43.160 | So that's called a data loader and it comes from a dataset.
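Continuing the toy sketch above:

```python
from torch.utils.data import DataLoader

ds = PairsDataset(pairs)  # `pairs` stands in for your actual (x, y) items
dl = DataLoader(ds, batch_size=64, shuffle=True)
for xb, yb in dl:  # each iteration yields one shuffled mini-batch
    ...            # fastai takes care of moving batches to the GPU for you
```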
00:21:46.720 | So you can see already there's kind of choices you have to make.
00:21:51.360 | What kind of dataset am I creating?
00:21:53.120 | What is the data for it where it's gonna come from?
00:21:55.600 | And then when I create my data loader, what batch size do I wanna use, right?
00:21:59.680 | This still isn't enough to train a model, not really,
00:22:03.520 | because we've got no way to validate the model.
00:22:05.920 | If all we have is a training set, then we have no way to know how we're doing,
00:22:10.040 | because we need a separate set of held out data.
00:22:13.000 | A validation set to see how we're getting along.
00:22:15.800 | So for that, we use a fast AI class called a data bunch.
00:22:21.200 | And a data bunch is something which, as it says here,
00:22:23.800 | binds together a training data loader, and a valid data loader.
00:22:28.520 | And when you look at the fast AI docs,
00:22:31.800 | when you see these monospace font things,
00:22:36.360 | they're always referring to some symbol you can look up elsewhere.
00:22:38.920 | So in this case, you can see train_dl is here.
00:22:42.920 | And there's no point knowing that there's an argument with a certain name,
00:22:47.240 | unless you know what that argument is.
00:22:49.800 | So you should always look after the colon to find out that is a data loader.
00:22:54.600 | Okay, so when you create a data bunch,
00:22:56.280 | you're basically giving it a training set data loader and
00:22:59.840 | a validation set data loader.
00:23:01.880 | And that's now an object that you can send off to a learner and
00:23:06.160 | start learning, start fitting, right?
00:23:09.840 | So they're the basic pieces.
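Schematically, using the hypothetical loaders from the sketch above (fastai v1 names):

```python
# Bind a training DataLoader and a validation DataLoader together,
# then hand the result to a learner and start fitting.
data = DataBunch(train_dl, valid_dl)
learn = create_cnn(data, models.resnet34, metrics=accuracy)
learn.fit_one_cycle(4)
```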
00:23:12.800 | So coming back to here.
00:23:16.200 | This stuff plus this line is all the stuff which is creating the data set.
00:23:25.720 | So it's saying where did the images come from?
00:23:27.720 | Cuz the data set, the index returns two things.
00:23:30.760 | It returns the image and the labels, assuming it's an image data set.
00:23:34.800 | So where do the images come from?
00:23:36.720 | Where do the labels come from?
00:23:38.480 | And then I'm gonna create two separate data sets, the training and
00:23:41.240 | the validation.
00:23:42.640 | This is the thing that actually turns them into PyTorch data sets.
00:23:45.840 | This is the thing that transforms them, okay?
00:23:49.280 | And then this is actually gonna create the data loader and
00:23:53.640 | the data bunch in one go.
00:23:55.840 | So let's look at some examples of this data block API.
00:24:00.280 | Because once you understand the data block API, you'll never be lost for
00:24:04.520 | how to convert your data set into something you can start modeling with.
00:24:08.000 | So here's some examples of using the data block API.
00:24:13.520 | So for example, if you're looking at MNIST, which remember is the pictures and
00:24:19.080 | classes of handwritten numerals, you can do something like this.
00:24:26.920 | This, what kind of data set is this gonna be?
00:24:30.120 | It's gonna come from a list of image files, which are in some folder.
00:24:36.120 | And they're labeled according to the folder name that they're in.
00:24:42.160 | And then we're gonna split it into train and validation,
00:24:46.680 | according to the folder that they're in, train and validation.
00:24:50.040 | You can optionally add a test set.
00:24:53.440 | We're gonna be talking more about test sets later in the course.
00:24:56.120 | Okay, we'll convert those into PyTorch data sets, now that that's all set up.
00:25:01.280 | We'll then transform them using this set of transforms.
00:25:07.200 | And we're gonna transform into something of this size.
00:25:12.360 | And then we're gonna convert them into a data bunch.
00:25:14.240 | So each of those stages inside these parentheses are various parameters you
00:25:19.640 | can pass to customize how that all works, right?
00:25:22.560 | But in the case of something like this MNIST data set,
00:25:25.560 | all the defaults pretty much work, so this is all fine.
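Spelled out, that MNIST example reads roughly like this (again the course-era API, so the exact names are illustrative):

```python
data = (ImageFileList.from_folder(path)  # what: image files; where: this folder
        .label_from_folder()             # labels come from each file's folder name
        .split_by_folder()               # train/valid split by folder
        .add_test_folder()               # optionally add a test set
        .datasets()                      # convert to PyTorch datasets
        .transform(tfms, size=64)        # apply this set of transforms at this size
        .databunch())                    # and wrap it all up as a DataBunch
```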
00:25:27.840 | So here it is, so you can check.
00:25:31.800 | Let's grab something.
00:25:33.120 | So data.train_ds is the data set, not the data loader, the data set.
00:25:37.440 | So I can actually index into it with a particular number.
00:25:40.160 | So here is the zero indexed item in the training data set.
00:25:44.360 | It's got an image and a label.
00:25:47.320 | We can show batch to see an example of the pictures of it, and
00:25:50.640 | we can then start training.
00:25:51.600 | Here are the classes that are in that data set.
00:25:55.600 | And this little cut down sample of MNIST just has threes and sevens.
00:25:59.580 | Here's an example using Planet.
00:26:05.040 | This is actually, again, a little subset of Planet we use to make it easy to
00:26:09.960 | try things out.
00:26:10.640 | So in this case, again, it's an image file list.
00:26:13.760 | Again, we're grabbing it from a folder.
00:26:16.040 | This time we're labeling it based on a CSV file.
00:26:18.440 | We're randomly splitting it.
00:26:19.960 | By default, it's 20%, creating data sets,
00:26:23.160 | transforming it using these transforms.
00:26:25.760 | We're gonna use a smaller size and then create a data bunch.
00:26:30.480 | There it is.
00:26:32.600 | And so data bunches know how to draw themselves, amongst other things.
00:26:38.680 | So here's some more examples we're gonna be seeing later today.
00:26:42.800 | What if we look at this data set called CamVid?
00:26:45.960 | CamVid looks like this.
00:26:48.720 | It contains pictures, and every pixel in the picture is color coded, right?
00:26:54.600 | So in this case, we have a list of files in a folder, and
00:26:59.400 | we're gonna label them, in this case, using a function.
00:27:03.480 | And so this function is basically the thing, we're gonna see it later,
00:27:06.800 | which tells it whereabouts of the color coding for each pixel.
00:27:09.920 | It's in a different place.
00:27:10.760 | Randomly split it in some way, create some data sets in some way.
00:27:17.960 | We can tell it for our particular list of classes.
00:27:21.960 | How do we know what pixel value one versus pixel value two is?
00:27:26.160 | And that was something that we can basically read in, like so.
00:27:28.840 | Again, some transforms, create a data bunch.
00:27:34.440 | You can optionally pass in things like what batch size do you want.
00:27:38.120 | And again, it knows how to draw itself, and you can start learning with that.
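Sketched in the same course-era API (here get_y_fn, the function mapping an image filename to its mask filename, and the codes file are defined in the CamVid notebook; treat the details as assumptions):

```python
codes = np.loadtxt(path/'codes.txt', dtype=str)  # pixel value -> class name
data = (ImageFileList.from_folder(path_img)
        .label_from_func(get_y_fn)                         # mask file for each image
        .random_split_by_pct()
        .datasets(SegmentationDataset, classes=codes)      # tell it the class codes
        .transform(get_transforms(), size=96, tfm_y=True)  # transform the masks too
        .databunch(bs=8))                                  # pick your batch size
```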
00:27:41.960 | For one more example, what if we wanted to create something like this?
00:27:46.480 | It has like bars, and chair, and remote control, and book.
00:27:51.520 | This is called an object detection data set.
00:27:53.680 | So again, we've got a little minimal COCO data set.
00:27:57.120 | COCO is kind of the most famous academic data set for object detection.
00:28:00.640 | We can create it using the same process.
00:28:03.720 | Grab a list of files from a folder, label them according to this little function.
00:28:09.120 | Randomly split them, create an object detection data set, create a data bunch.
00:28:14.400 | In this case, as you'll learn when we get to object detection,
00:28:16.920 | you have to use generally smaller batch sizes, or you'll run out of memory.
00:28:20.520 | And as you'll also learn, you have to use something called a collation function.
00:28:24.760 | And once that's all done, we can again show it, and
00:28:27.480 | here's our object detection data set.
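And a sketch for the object detection case (course-era API; get_annotations and bb_pad_collate are fastai helpers, and get_y_func here is the little labeling function mentioned above):

```python
coco = untar_data(URLs.COCO_TINY)
images, lbl_bbox = get_annotations(coco/'train.json')
img2bbox = dict(zip(images, lbl_bbox))
get_y_func = lambda o: img2bbox[o.name]  # the boxes and labels for each image

data = (ImageFileList.from_folder(coco)
        .label_from_func(get_y_func)
        .random_split_by_pct()
        .datasets(ObjectDetectDataset)
        .transform(get_transforms(), size=128, tfm_y=True)
        .databunch(bs=16, collate_fn=bb_pad_collate))  # small batches + padding collate
```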
00:28:29.760 | So you get the idea, right?
00:28:31.880 | So here's a really convenient notebook, where will you find this?
00:28:35.040 | Ah, this notebook is the documentation.
00:28:38.920 | Remember how I told you that all of the documentation comes from notebooks?
00:28:42.640 | You'll find them in your fast AI repo in docs_source.
00:28:47.160 | So this, which you can play with and experiment with inputs and outputs, and
00:28:51.400 | try all the different parameters, you will find the data block API examples of use.
00:28:56.360 | If you go to the documentation, here it is, the data block API examples of use.
00:29:01.440 | All right, so remember, everything that you wanna use in fast AI,
00:29:05.240 | you can look it up in the documentation.
00:29:06.840 | So let's search, data block API.
00:29:14.600 | Go straight there, and away you go.
00:29:22.840 | And so once you find some documentation that you actually wanna try playing with
00:29:26.800 | yourself, just look up the name, data block.
00:29:29.920 | And then you can open up a notebook with the same name in the fast AI repo,
00:29:33.760 | and play with it yourself, okay?
00:29:36.360 | So that's a quick overview of this really nice data block API.
00:29:42.520 | And there's lots of documentation for all of the different ways you can
00:29:46.760 | label inputs, and split data, and create data sets, and so forth.
00:29:50.000 | And so that's what we're using for Planet, okay?
00:29:56.280 | So we're using that API.
00:29:57.560 | You'll see in the documentation these two steps we had all joined up together.
00:30:04.680 | We can certainly do that here too, but you'll learn in a moment why it is that
00:30:10.240 | we're actually splitting these up into two separate steps, which is also fine as well.
00:30:14.360 | So a few interesting points about this, transforms.
00:30:21.320 | So transforms by default.
00:30:26.640 | Remember, you can hit Shift + Tab to get all the information, right?
00:30:29.720 | Transforms by default will flip randomly each image, right?
00:30:37.000 | But they'll actually randomly only flip them horizontally, which makes sense, right?
00:30:42.120 | If you're trying to tell if something's a cat or a dog,
00:30:44.200 | doesn't matter whether it's pointing left or right, but
00:30:46.360 | you wouldn't expect it to be upside down.
00:30:48.400 | On the other hand, satellite imagery, whether something's cloudy or hazy, or
00:30:52.480 | whether there's a road there or not, could absolutely be flipped upside down.
00:30:55.920 | There's no such thing as a right way up in space.
00:30:58.440 | So flip_vert, which defaults to false, we're going to switch over to true.
00:31:03.320 | To say like, yeah, randomly, you should actually do that.
00:31:06.240 | And it doesn't just flip it vertically.
00:31:07.480 | It actually tries also each possible 90 degree rotation.
00:31:10.680 | So there are eight possible kind of symmetries that it tries out.
00:31:14.480 | So there's various other things here.
00:31:17.880 | I've found that these particular settings work pretty well for Planet.
00:31:24.240 | One that's interesting is warp.
00:31:27.560 | Perspective warping is something which very few libraries provide, and
00:31:31.280 | those that do provide it, it tends to be really slow.
00:31:33.560 | I think fast AI is the first one to provide really fast perspective warping.
00:31:38.600 | And basically the reason this is interesting is if I kind of look at you
00:31:41.880 | from below versus look at you from above, your shape changes, right?
00:31:49.280 | And so when you're taking a photo of a cat or a dog, sometimes you'll be higher,
00:31:54.080 | sometimes you'll be lower, then that kind of change of shape is certainly something
00:31:58.920 | that you would want to include as you're creating your training batches.
00:32:03.680 | You want to modify it a little bit each time.
00:32:06.160 | Not true for satellite images.
00:32:09.360 | A satellite always points straight down at the planet.
00:32:13.360 | So if you added perspective warping,
00:32:16.160 | you would be making changes that aren't going to be there in real life.
00:32:19.640 | So I turn that off.
00:32:21.120 | So this is all something called data augmentation.
00:32:23.720 | We'll be talking a lot more about it later in the course.
00:32:27.320 | But you can start to get a feel for
00:32:28.800 | the kinds of things that you can do to augment your data.
00:32:33.040 | And in general, maybe the most important one is if you're looking at astronomical
00:32:37.440 | data or kind of pathology, digital slide data or satellite data.
00:32:43.920 | Data where there isn't really an up or a down, turning on flip_vert = True
00:32:48.840 | is generally going to make your models generalize better.
00:32:52.800 | Okay, so here's the steps necessary to create our data bunch.
00:32:58.960 | And so now to create a satellite imagery
00:33:03.840 | classifier, multi-label classifier, that's going to figure out for
00:33:08.280 | each satellite tile what's the weather and what else, what can I see in it.
00:33:13.400 | There's basically nothing else to learn.
00:33:14.960 | Everything else that you've already learnt is going to be exactly nearly the same.
00:33:20.720 | Here it is: learn = create_cnn(data, arch), right?
00:33:27.440 | And in this case, when I first built this notebook,
00:33:31.040 | I used ResNet 34 as per usual.
00:33:33.480 | And I found this was a case, I tried ResNet 50 as I always like to do.
00:33:36.720 | I found ResNet 50 helped a little bit, and I had some time to run it.
00:33:39.920 | So in this case, I was using ResNet 50.
00:33:43.680 | There's one more change I make, which is metrics.
00:33:49.080 | Now to remind you, a metric has got nothing to do with how the model trains.
00:33:55.000 | Changing your metrics will not change your resulting model at all.
00:33:59.520 | The only thing that we use metrics for is we print them out during training.
00:34:04.160 | So you hear it's printing out accuracy and
00:34:05.920 | it's printing out this other metric called F beta.
00:34:08.240 | So if you're trying to figure out how to do a better job with your model,
00:34:13.160 | changing the metrics will never be something that you need to do there.
00:34:17.040 | They're just to show you how you're going.
00:34:21.120 | So that's the first thing to know.
00:34:22.760 | You can have one metric or no metrics or
00:34:25.400 | a list of multiple metrics to be printed out as your model's training.
00:34:30.320 | In this case, I wanna know two things.
00:34:32.640 | The first thing I wanna know is the accuracy.
00:34:35.880 | And the second thing I wanna know is how would I go on Kaggle?
00:34:40.240 | And Kaggle told me that I'm gonna be judged on a particular metric called the F score.
00:34:46.600 | So I'm not gonna bother telling you about the F score.
00:34:48.760 | It's not really interesting enough to be worth spending your time on.
00:34:51.560 | You can look it up, but it's basically this.
00:34:54.840 | When you have a classifier, you're gonna have some false positives.
00:34:59.120 | You're gonna have some false negatives.
00:35:01.280 | How do you weigh up those two things to kind of create a single number?
00:35:05.360 | There's lots of different ways of doing that.
00:35:07.040 | And something called the F score is basically a nice way of
00:35:11.520 | combining that into a single number.
00:35:14.040 | And there are various kinds of F scores, F1, F2, and so forth.
00:35:18.880 | And Kaggle said, in the competition rules, we're gonna use a metric called F2.
00:35:24.080 | So we have a metric called F beta,
00:35:31.920 | which, in other words, is F1 or F2 or whatever, depending on the value of beta.
00:35:37.240 | And we can have a look at its signature.
00:35:39.840 | And you can see that it's got a threshold and a beta.
00:35:45.000 | Okay, so the beta is 2 by default.
00:35:47.600 | And Kaggle said that they're gonna use F2, so I don't have to change that.
00:35:51.760 | But there's one other thing that I need to set, which is a threshold.
00:35:58.200 | What does that mean?
00:35:59.600 | Well, here's the thing.
00:36:01.360 | Do you remember we had a little look the other day at the source code for
00:36:06.880 | the accuracy metric?
00:36:08.440 | So if you put two question marks, you get the source code.
00:36:11.520 | And we found that it used this thing called argmax.
00:36:14.360 | And the reason for that, if you remember, was we kind of
00:36:21.320 | had this input image that came in, and it went through our model.
00:36:25.880 | And at the end, it came out with a table of ten numbers, right?
00:36:32.920 | This is like if we're doing MNIST digit recognition.
00:36:35.040 | The ten numbers were like the probability of each of the possible digits.
00:36:42.200 | And so then we had to look through all of those and
00:36:44.200 | find out which one was the biggest.
00:36:47.680 | And so the function in NumPy or PyTorch or
00:36:51.560 | just math notation that finds the biggest and
00:36:54.200 | returns its index is called argmax, right?
00:37:00.120 | So to get the accuracy for our pet detector,
00:37:02.880 | we used this accuracy function called argmax to find out behind the scenes
00:37:08.080 | which class ID pet was the one that we're looking at.
00:37:13.000 | And then it compared that to the actual and then took the average.
00:37:21.400 | And that was the accuracy.
00:37:23.400 | We can't do that for satellite recognition in this case,
00:37:28.200 | because there isn't one label we're looking for.
00:37:31.400 | There's lots.
00:37:32.960 | So instead, what we do is we look at, so in this case.
00:37:37.280 | So I don't know if you remember, but a data bunch has a special attribute called c.
00:37:49.960 | And c is gonna be basically how many outputs do we want our model to create?
00:37:55.440 | And so for any kind of classifier, we want one probability for
00:37:59.280 | each possible class.
00:38:00.800 | So in other words, data.c for classifiers is always gonna be equal to
00:38:05.680 | the length of data.classes, right?
00:38:09.280 | So data.classes, there they all are.
00:38:12.440 | There's the 17 possibilities, right?
00:38:14.520 | So we're gonna have one probability for each of those.
00:38:18.640 | But then we're not just gonna pick out one of those 17.
00:38:21.440 | We're gonna pick out n of those 17.
00:38:24.240 | And so what we do is we compare each probability to some threshold.
00:38:29.000 | And then we say anything that's higher than that threshold,
00:38:31.800 | we're gonna assume that the model's saying it does have that feature.
00:38:35.800 | And so we can pick that threshold.
00:38:38.000 | I found that for this particular data set, a threshold of 0.2
00:38:45.180 | seems to generally work pretty well.
00:38:46.760 | This is the kind of thing you can easily just experiment to find a good threshold.
00:38:50.320 | So I decided I wanted to print out the accuracy at a threshold of 0.2.
00:38:55.880 | So the normal accuracy function doesn't work that way.
00:38:59.640 | It doesn't argmax.
00:39:01.040 | We have to use a different accuracy function called accuracy_thresh.
00:39:05.560 | And that's the one that's gonna compare every probability to a threshold and
00:39:09.040 | return all the things higher than that threshold and compare accuracy that way.
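Conceptually, it's doing something like this (a simplified sketch of the idea, not the library's exact source):

```python
def accuracy_thresh_sketch(y_pred, y_true, thresh=0.2):
    "Treat every probability above `thresh` as a predicted label, then compare."
    return ((y_pred > thresh).float() == y_true).float().mean()
```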
00:39:13.200 | And so one of the things we would pass in is Thresh.
00:39:16.960 | Now of course, our metric is gonna be calling our function for us.
00:39:23.320 | So we don't get to tell it every time it calls back what threshold do we want.
00:39:28.520 | So we really wanna create a special version of this function
00:39:32.360 | that always uses an accuracy of a threshold of 0.2.
00:39:36.520 | So one way to do that would be to go define something called accuracy_02
00:39:41.680 | that takes some input and some target and
00:39:45.760 | returns accuracy threshold with that input and
00:39:51.880 | that target and a threshold of 0.2.
00:39:56.520 | We could do it that way, okay?
00:39:58.760 | But it's so common that you wanna kind of say, create a new function
00:40:03.920 | that's just like that other function, but
00:40:06.200 | we're always gonna call it with a particular parameter.
00:40:08.640 | That computer science has a term for that.
00:40:10.680 | It's called a partial, it's called a partial function application.
00:40:13.480 | And so Python 3 has something called partial that takes some function and
00:40:20.720 | some list of keywords and values and creates a new function.
00:40:25.920 | That is exactly the same as this function, but
00:40:28.280 | is always gonna call it with that keyword argument.
00:40:31.680 | So here, this is exactly the same thing as the thing I just typed in.
00:40:35.560 | acc_02 is now a new function that calls accuracy_thresh with a threshold of 0.2.
00:40:40.760 | And so this is a really common thing to do, particularly with the fastAI library,
00:40:45.120 | cuz there's lots of places where you have to pass in functions.
00:40:49.600 | And you very often wanna pass in a slightly customized version of a function.
00:40:53.800 | So here's how you do it.
00:40:54.960 | So here I've got an accuracy threshold 0.2.
00:40:58.360 | I've got a fbeta threshold 0.2.
00:41:01.560 | I can pass them both in as metrics.
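So the two metrics are built with partial, roughly like this:

```python
from functools import partial

acc_02 = partial(accuracy_thresh, thresh=0.2)  # accuracy at a threshold of 0.2
f_score = partial(fbeta, thresh=0.2)           # F-beta (beta defaults to 2) at 0.2
learn = create_cnn(data, models.resnet50, metrics=[acc_02, f_score])
```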
00:41:03.280 | And I can then go ahead and do all the normal stuff.
00:41:07.320 | lr_find, recorder.plot, find the thing with the steepest slope.
00:41:13.520 | So I don't know, somewhere around 1e-2, so we'll make that our learning rate.
00:41:18.880 | And then fit for a while, five epochs at that learning rate, and see how we go, okay?
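Those steps, sketched:

```python
learn.lr_find()
learn.recorder.plot()  # pick the steepest downward slope, around 1e-2 here
lr = 0.01
learn.fit_one_cycle(5, slice(lr))
```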
00:41:25.000 | And so we've got an accuracy of about 96% and an fbeta of about 0.926.
00:41:31.680 | And so you could then go and have a look at planet,
00:41:36.600 | leaderboard, private leaderboard, okay?
00:41:40.680 | And so the top 50th is about 0.93.
00:41:45.640 | So we kinda say, we're on the right track, okay, with something we're doing fine.
00:41:51.520 | So as you can see, once you get to a point that the data's there,
00:41:56.560 | there's very little extra to do most of the time.
00:42:00.000 | >> So when your model makes an incorrect prediction in a deployed app,
00:42:06.640 | is there a good way to record that error and
00:42:08.760 | use that learning to improve the model in a more targeted way?
00:42:14.040 | >> Yeah, that's a great question.
00:42:16.080 | So the first bit, is there a way to record that?
00:42:18.240 | Of course there is, you record it, that's up to you, right?
00:42:20.640 | So maybe some of you can try it this week.
00:42:22.280 | You'll need to have your user tell you, you were wrong.
00:42:28.440 | This Australian car, you said it was a Holden, and actually it's a Falcon.
00:42:32.920 | So first of all, you'll need to collect that feedback.
00:42:35.320 | And the only way to do that is to ask the user to tell you when it's wrong.
00:42:39.280 | So you now need to record in some log somewhere,
00:42:41.520 | something saying, this was the file, I've stored it here.
00:42:45.720 | This was the prediction I made.
00:42:47.320 | This is the actual that they told me.
00:42:51.080 | And then at the end of the day or at the end of the week,
00:42:54.360 | you could set up a little job to run something or you can manually run something.
00:42:59.000 | And what are you gonna do?
00:42:59.720 | You're gonna do some fine-tuning.
00:43:02.880 | What does fine-tuning look like?
00:43:04.560 | Good segue, Rachel.
00:43:05.920 | It looks like this, right?
00:43:07.680 | So let's pretend here's your safe model, right?
00:43:12.080 | And so then we unfreeze, right?
00:43:14.800 | And then we fit a little bit more, right?
00:43:18.160 | Now in this case, I'm fitting with my original data set.
00:43:21.560 | But you could create a new data bunch with just the misclassified
00:43:26.480 | instances and go ahead and fit, right?
00:43:29.600 | And the misclassified ones are likely to be particularly interesting.
00:43:34.320 | So you might want to fit at a slightly higher learning rate,
00:43:36.880 | in order to make them kind of really mean more.
00:43:39.360 | Or you might want to run them through a few more epochs.
00:43:41.720 | But it's exactly the same thing, right?
00:43:43.640 | You just co-fit with your misclassified examples and
00:43:48.000 | passing in the correct classification.
00:43:50.640 | And that should really help your model quite a lot.
00:43:53.920 | There are various other tweaks you can do to this, but that's the basic idea.
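A hedged sketch of that loop; the misclassified.csv log file and its layout are hypothetical, standing in for whatever feedback your app records:

```python
# Hypothetical: misclassified.csv holds (filename, corrected_label) rows
# logged by your app. Build a small data bunch from just those examples.
data_fix = (ImageFileList.from_folder(path)
            .label_from_csv('misclassified.csv')
            .random_split_by_pct(0.1)
            .datasets()
            .transform(tfms, size=128)
            .databunch())

learn.load('stage-1')    # start from your saved model
learn.data = data_fix
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-4))  # a few epochs, maybe a slightly higher lr
```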
00:43:59.040 | >> Next question, could someone talk a bit more about the data block ideology?
00:44:04.840 | I'm not quite sure how the blocks are meant to be used.
00:44:07.800 | Do they have to be in a certain order?
00:44:09.720 | Is there any other library that uses this type of programming that I could look at?
00:44:13.640 | >> Yes, they do have to be in a certain order.
00:44:20.280 | They do have to be in a certain order.
00:44:25.600 | And it's basically the order that you see in the example of use, right?
00:44:31.040 | What kind of data do you have?
00:44:35.120 | Where does it come from?
00:44:36.680 | How do you label it?
00:44:38.320 | How do you split it?
00:44:39.720 | What kind of data sets do you want?
00:44:41.400 | Optionally, how do I transform it?
00:44:43.800 | And then how do I create a data bunch from it?
00:44:45.240 | So they're the steps.
00:44:47.360 | I mean, we invented this API.
00:44:55.200 | I don't know if other people have independently invented it.
00:44:58.840 | The basic idea of kind of a pipeline of things that dot into each other is
00:45:06.720 | pretty common in a number of places.
00:45:11.240 | Not so much in Python, but you see it more in JavaScript.
00:45:15.160 | Although this kind of approach of each stage produces something slightly different.
00:45:21.080 | You tend to see it more in ETL software, like Extraction Transformation and
00:45:26.960 | Loading Software, where there's kind of particular stages in a pipeline.
00:45:30.000 | So yeah, I mean, it's been inspired by a bunch of things.
00:45:32.600 | But yeah, all you need to know is kind of use this example to guide you, and
00:45:41.240 | then look up the documentation to see which particular kind of thing you want.
00:45:46.200 | And in this case, the image file list, you're actually not going to find
00:45:51.760 | the documentation or image file list in data blocks documentation,
00:45:54.840 | because this is specific to the vision application.
00:45:58.040 | So to then go and actually find out how to do something for
00:46:01.160 | your particular application, you would then go to look at text and vision and
00:46:06.080 | so forth, and that's where you can find out what are the data block API pieces
00:46:10.360 | available for that application.
00:46:12.880 | And of course, you can then look at the source code.
00:46:14.600 | If you've got some totally new application,
00:46:17.040 | you could create your own part of any of these stages.
00:46:21.640 | Pretty much all of these functions are, you know, very few lines of code.
00:46:27.680 | Maybe we could look at an example of one, image list from folder.
00:46:36.560 | So let's just put that somewhere temporary, and
00:46:40.000 | then we're gonna go t.label_from_csv.
00:46:44.280 | Then you can look at the documentation to see exactly what that does, and
00:46:50.560 | that's gonna call label from data frame.
00:46:54.160 | So I mean, this is already useful.
00:46:55.960 | If you wanted to create a data frame, a pandas data frame from something
00:47:00.320 | other than the CSV, you now know that you could actually just call
00:47:03.600 | label from data frame, and you can look up to find what that does.
00:47:07.520 | And as you can see, most fast AI functions are no more than a few lines of code.
00:47:14.160 | They're normally pretty straightforward to see what are all the pieces there and
00:47:17.760 | how can you use them.
00:47:20.920 | And it's probably one of these things that as you play around with it,
00:47:25.320 | you'll get a good sense of how it all gets put together.
00:47:28.560 | But if during the week there are particular things where you're thinking,
00:47:31.800 | I don't understand how to do this, please let us know and we'll try to help you.
00:47:35.200 | Sure.
00:47:37.260 | >> What resources do you recommend for getting started with video, for
00:47:43.200 | example, being able to pull frames and submit them to your model?
00:47:46.240 | >> I guess, I mean, the answer is it depends.
00:47:57.240 | If you're using the web, which I guess probably most of you will be,
00:48:03.800 | then there's web APIs that basically do that for you.
00:48:08.400 | So you can grab the frames with the web API and
00:48:12.840 | then they're just images which you can pass along.
00:48:15.920 | If you're doing it client side, I guess most people tend to use OpenCV for that.
00:48:21.600 | But maybe people during the week who are doing these video apps can tell us what
00:48:26.840 | have you used and found useful, and we can start to prepare something in the lesson
00:48:30.540 | wiki with a list of video resources, since it sounds like some people are interested.
00:48:35.160 | Okay, so just like usual, we unfreeze our model and
00:48:44.360 | then we fit some more and we get down to 0.929-ish.
00:48:51.480 | So one thing to notice here is that before we unfreeze,
00:48:58.740 | you'll tend to get this shape pretty much all the time.
00:49:01.040 | If you do your learning rate finder before you unfreeze, it's pretty easy.
00:49:04.040 | You know, find the steepest slope, not the bottom, right?
00:49:07.320 | Remember we're trying to find the bit where we can like slide down it quickly.
00:49:10.800 | So if you start at the bottom, it's just gonna send you straight off to the end
00:49:14.000 | here, so somewhere around here, and then we can call it again after you unfreeze.
00:49:23.120 | And you'll generally get a very different shape, right?
00:49:25.840 | And this is a little bit harder to say what to look for, because it tends to be
00:49:30.120 | this kind of shape where you get a little bit of upward and
00:49:32.200 | then a kind of very gradual downward and then up here.
00:49:35.120 | So, you know, I tend to kind of look for just before it shoots up and
00:49:40.480 | go back about 10x, right, as a kind of a rule of thumb, so 1e-5, right?
00:49:45.960 | And that is what I do for the first half of my slice.
00:49:50.440 | And then for the second half of my slice, I normally do whatever learning rate I
00:49:55.320 | used for the frozen part, so lr, which was 0.01,
00:50:02.880 | kind of divided by 5, or divided by 10, somewhere around that.
00:50:06.080 | So that's kind of my rule of thumb, right?
00:50:08.960 | Look for the bit kind of at the bottom, find about 10x smaller.
00:50:12.840 | That's the number that I put here, and then lr/5 or lr/10 is kind of
00:50:17.440 | what I put there.
00:50:18.000 | Seems to work most of the time.
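As code, that rule of thumb reads like this (the specific numbers are the ones from this run):

```python
learn.unfreeze()
learn.lr_find()
learn.recorder.plot()
# First half of the slice: ~10x before the point where the loss shoots up
# (1e-5 here); second half: the frozen-stage lr divided by 5 (or 10).
learn.fit_one_cycle(5, slice(1e-5, lr/5))
```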
00:50:21.880 | We'll be talking more about exactly what's going on here.
00:50:24.120 | This is called discriminative learning rates as the course continues.
00:50:27.640 | So how am I gonna get this better than 0.929?
00:50:33.720 | Because there are, how many people in this competition?
00:50:40.280 | About 1,000 teams, right?
00:50:42.760 | So we wanna get into the top 10%.
00:50:47.880 | So the top 5% would be 0.931-ish.
00:50:52.040 | The top 10% is gonna be about 0.929-ish.
00:50:57.340 | So we're not quite there, right?
00:51:01.440 | So here's a trick, right?
00:51:04.200 | I don't know if you remember, but when I created my data set,
00:51:08.960 | I put size equals 128, and actually the images that Kaggle gave us are 256.
00:51:16.360 | So I used the size of 128 partially cuz I wanted to experiment quickly.
00:51:21.800 | It's much quicker and easier to use small images to experiment.
00:51:25.960 | But there's a second reason.
00:51:27.720 | I now have a model that's pretty good at recognizing
00:51:32.920 | the contents of 128 by 128 satellite images.
00:51:35.920 | So what am I gonna do if I now wanna create a model that's pretty good at 256
00:51:41.840 | by 256 satellite images?
00:51:44.320 | Well, why don't I use transfer learning?
00:51:46.640 | Why don't I start with a model that's good at 128 by 128 images and
00:51:51.600 | fine tune that, so don't start again, right?
00:52:00.240 | And that's actually gonna be really interesting because if I've trained
00:52:05.000 | quite a lot and I'm on the verge of overfitting, which I don't wanna do, right?
00:52:05.000 | Then I'm basically creating a whole new data set effectively,
00:52:08.800 | one where my images are twice the size on each axis, right?
00:52:12.920 | So four times bigger.
00:52:14.520 | So it's really a totally different data set as far as my convolutional neural
00:52:17.540 | network's concerned.
00:52:18.960 | So I kinda get to lose all that overfitting, and I get to start again.
00:52:24.240 | So let's create a new learner, right?
00:52:27.740 | Well, let's keep our same learner, but use a new data bunch,
00:52:31.880 | where the data bunch is 256 by 256.
00:52:34.840 | So that's why I actually stopped here, right, before I created my data sets.
00:52:40.960 | Cuz I'm gonna now take this data source, and
00:52:45.440 | I'm gonna create a new data bunch with 256 instead.
00:52:48.960 | So let's have a look at how we do that.
00:52:52.760 | So here it is, take that source, right, take that source,
00:52:58.600 | transform it with the same transforms as before, but this time use size 256.
00:53:04.480 | Now that should be better anyway, because this is gonna be higher resolution images.
00:53:09.040 | But also I'm gonna start with, I haven't got rid of my learner,
00:53:12.240 | it's the same learner I had before, so I'm gonna start with this kind of pre-trained
00:53:15.680 | model, and so I'm gonna replace the data inside my learner with this new data bunch.
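In code, that step is roughly this (a sketch reusing the src object and tfms from when we built the 128-pixel version):

```python
# Rebuild the DataBunch from the same labeled source, but at 256 pixels.
# You may need a smaller batch size than before at this resolution.
data = (src.transform(tfms, size=256)
           .databunch()
           .normalize(imagenet_stats))

learn.data = data   # same pretrained weights, new higher-resolution images
```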
00:53:23.000 | And then I will freeze again, so
00:53:25.200 | that means I'm going back to just training the last few layers.
00:53:29.840 | And I will do a new LR find, and because I actually now have a pretty good model,
00:53:35.560 | like it's pretty good for 128 by 128, so it's probably gonna be like at least okay
00:53:41.800 | for 256 by 256, I don't get that same sharp shape that I did before.
00:53:46.960 | But I can certainly see where it's way too high, right?
00:53:51.240 | So, I'm gonna pick something well before where it's way too high.
00:53:54.740 | Again, maybe 10x smaller.
00:53:57.560 | So here I'm gonna go 1e-2 over 2, that seems well before it shoots up.
00:54:03.440 | And so let's fit a little bit more, okay?
00:54:06.900 | So we've frozen again, so we're just training the last few layers and
00:54:10.120 | fit a little bit more.
00:54:11.160 | And as you can see, very quickly, remember, 0.928 was where we got to
00:54:16.000 | before, after quite a few epochs.
00:54:17.680 | We're straight up there, and suddenly we've passed 0.93, all right?
00:54:22.280 | So we're now already kind of into the top 10%, so we've hit our first goal, right?
00:54:31.160 | We're at the very least pretty competent
00:54:34.960 | at the problem of just recognizing satellite imagery.
00:54:37.440 | But of course now, we can do the same thing as before.
00:54:40.280 | We can unfreeze and train a little more, okay?
00:54:44.840 | Again, using the same kind of approach I described before,
00:54:47.880 | we use LR over 5 here, and an even smaller one here, train a little bit more, 0.9314.
00:54:55.820 | So, that's actually pretty good, 0.9314.
00:55:06.080 | Somewhere around top 20-ish.
00:55:10.840 | So you can see actually when my friend Brendan and
00:55:13.360 | I entered this competition, we came 22nd with 0.9315.
00:55:17.200 | And we spent, this was a year or two ago, months trying to get here.
00:55:22.600 | So using kind of pretty much defaults with the minor tweaks and
00:55:27.560 | one trick, which is the resizing tweak,
00:55:30.960 | you can kind of get right up into the top of the leaderboard of this very
00:55:35.320 | challenging competition.
00:55:36.480 | Now, I should say we don't really know where we'd be.
00:55:41.520 | We'd actually have to check it on the test set that Kaggle gave us and
00:55:44.600 | actually submit to the competition, which you can do.
00:55:46.960 | You can do a late submission.
00:55:48.360 | And so later on in the course, we'll learn how to do that.
00:55:53.200 | But we certainly know we're doing well.
00:55:56.720 | We're doing very well, so that's great news.
00:56:00.400 | And so you can see also as I kind of go along, I tend to save things.
00:56:06.000 | I just, you can name your models whatever you like.
00:56:08.840 | But I just want to basically know, was it kind of before or after the unfreeze?
00:56:12.960 | So kind of had stage one or two, what size was I training on?
00:56:16.760 | What architecture was I training on?
00:56:18.520 | So that way, I can kind of always go back and experiment pretty easily.
00:56:23.280 | So that's Planet, multi-label classification.
00:56:28.640 | Let's look at another example.
00:56:33.480 | So the next example we're going to look at is this data set called CamVid.
00:56:40.040 | And it's going to be doing something called Segmentation.
00:56:42.280 | We're going to start with a picture like this.
00:56:45.200 | And we're going to try and create a color coded picture like this.
00:56:49.000 | Where all of the bicycle pixels are the same color.
00:56:52.280 | All of the road line pixels are the same color.
00:56:55.120 | All of the tree pixels are the same color.
00:56:57.200 | All of the building pixels are the same color.
00:56:58.840 | The sky is the same color, and so forth, okay?
00:57:01.600 | Now, we're not actually going to make them colors.
00:57:04.600 | We're actually going to do it where each of those pixels has a unique number.
00:57:10.440 | So in this case, the top left is buildings.
00:57:12.960 | I guess building is number four.
00:57:14.960 | The top right is trees, so tree is 26, and so forth, all right?
00:57:19.720 | So in other words, this single top left pixel, we're basically, like I mentioned this,
00:57:27.440 | we're going to do a classification problem, just like the pet's classification, for the
00:57:31.640 | very top left pixel.
00:57:32.640 | We're going to say, what is that top left pixel?
00:57:35.720 | Is it bicycle, road lines, sidewalk, building?
00:57:40.840 | What is the very top left pixel?
00:57:42.760 | And then what is the next pixel along?
00:57:44.720 | What is the next pixel along?
00:57:46.080 | So we're going to do a little classification problem for every single pixel in every single
00:57:52.640 | image.
00:57:54.680 | So that's called segmentation, all right?
00:57:59.380 | In order to build a segmentation model, you actually need to download or create a dataset
00:58:08.440 | where someone has actually labeled every pixel.
00:58:13.400 | So as you can imagine, that's a lot of work, okay?
00:58:17.800 | So that's going to be a lot of work.
00:58:18.920 | You're probably not going to create your own segmentation datasets, but you're probably
00:58:23.480 | going to download or find them from somewhere else.
00:58:25.780 | This is very common in medicine, life sciences.
00:58:30.040 | You know, if you're looking through slides at nuclei, it's very likely you already have
00:58:35.560 | a whole bunch of segmented cells and segmented nuclei.
00:58:40.800 | If you're in radiology, you probably already have lots of examples of segmented lesions
00:58:45.520 | and so forth.
00:58:46.520 | There's a lot of, you know, kind of different domain areas where there are domain-specific
00:58:54.680 | tools for creating these segmented images.
00:58:57.680 | As you could guess from this example, it's also very common in kind of self-driving cars
00:59:03.520 | and stuff like that where you need to see, you know, what objects are around and where
00:59:08.960 | are they.
00:59:09.960 | In this case, there's a nice dataset called CamVid, which we can download, and they have
00:59:17.920 | already got a whole bunch of images and segment masks prepared for us, which is pretty cool.
00:59:26.240 | And remember, pretty much all of the datasets that we have provided kind of inbuilt URLs
00:59:33.480 | for, you can see their details at course.fast.ai/datasets, and nearly all of them are academic datasets
00:59:45.440 | where some very kind people have gone to all of this trouble to create these datasets
00:59:50.600 | and make them available for us to use.
00:59:54.260 | So if you do use one of these datasets for any kind of project, it would be very,
00:59:58.960 | very nice if you were to go and find the citation and say, you know, thanks to these people
01:00:05.440 | for this dataset, okay, because they've provided it, and all they're asking in return is for
01:00:11.880 | us to give them that credit.
01:00:12.880 | Okay, so here is the CamVid dataset, here is the citation, and on our datasets page
01:00:17.760 | that will link to the academic paper where it came from.
01:00:21.240 | Okay, Rachel, now is a good time for a question.
01:00:26.440 | >> Is there a way to use learn.lr_find and have it return a suggested number directly
01:00:34.720 | rather than having to plot it as a graph and then pick a learning rate by visually inspecting
01:00:39.240 | that graph?
01:00:40.240 | And then there are a few other questions, I think, around more guidance on reading the
01:00:44.040 | learning rate finder graph.
01:00:45.560 | >> Yeah, I mean, that's a great question.
01:00:48.520 | I mean, the short answer is no.
01:00:51.240 | And the reason the answer is no is because this is still a bit more artisanal than I
01:00:57.440 | would like.
01:00:58.440 | As you can kind of see, I've been kind of saying how I read this learning rate graph
01:01:01.960 | depends a bit on what stage I'm at and kind of what the shape of it is.
01:01:08.720 | I guess, like, when you're just training the head, so before you unfreeze, it pretty much
01:01:16.000 | always looks like this.
01:01:17.760 | And you could certainly create something that, you know, creates
01:01:21.040 | a smooth version of this, finds the sharpest negative slope, and picks that.
01:01:26.560 | You would probably be fine nearly all the time.
01:01:31.160 | But then for, you know, these kinds of ones, you know, it requires a certain amount of
01:01:37.200 | experimentation.
01:01:38.200 | But the good news is you can experiment, right?
01:01:42.240 | You can try.
01:01:43.240 | Obviously, if the line's going up, you don't want it.
01:01:47.320 | And certainly, at the very bottom point, you don't want it, right, because you need it
01:01:51.600 | to be going downwards.
01:01:53.160 | But if you kind of start with somewhere around 10x smaller than that, and then also you could
01:01:57.920 | try another 10x smaller than that, try a few numbers and find out which ones work best.
01:02:03.600 | And within a small number of weeks, you will find that you're picking the best learning
01:02:10.440 | rate most of the time, right?
01:02:12.520 | So I don't know.
01:02:13.520 | It's kind of -- so at this stage, it still requires a bit of playing around to get a
01:02:17.280 | sense of the different kinds of shapes that you see and how to respond to them.
01:02:22.640 | Maybe by the time this video comes out, someone will have a pretty reliable auto learning
01:02:27.120 | rate finder.
01:02:29.000 | We're not there yet.
01:02:30.680 | It's probably not a massively difficult job to do, be an interesting project, collect
01:02:37.760 | a whole bunch of different datasets, maybe grab all the datasets from our datasets page,
01:02:42.980 | try and come up with some simple heuristic, compare it to all the different lessons I've
01:02:49.160 | shown.
01:02:50.160 | That would be a really fun project to do.
01:02:52.440 | But at the moment, we don't have that.
01:02:58.940 | I'm sure it's possible.
01:03:01.320 | But we haven't got there.
01:03:05.160 | Okay.
01:03:06.240 | So how do we do image segmentation?
01:03:11.600 | Same way we do everything else.
01:03:13.160 | And so basically we're going to start with some path which has got some information in
01:03:18.560 | it of some sort.
01:03:19.560 | So I always start by, you know, untarring my data, doing ls, seeing what I was given.
01:03:25.160 | In this case, there's a folder called labels and a folder called images.
01:03:29.860 | So I'll create paths for each of those.
01:03:32.920 | We'll take a look inside each of those.
01:03:36.480 | And you know, at this point, like, you can see there's some kind of coded file names
01:03:41.400 | for the images and some kind of coded file names for the segment masks.
01:03:46.960 | And then you kind of have to figure out how to map from one to the other.
01:03:50.280 | You know, normally these kind of datasets will come with a readme you can look at or
01:03:53.400 | you can look at their website.
01:03:55.680 | Often it's kind of obvious.
01:03:56.680 | In this case, I can see, like, these ones always have this kind of particular format.
01:04:02.160 | These ones always have exactly the same format with an underscore P.
01:04:05.660 | So I kind of -- when I did this, honestly, I just guessed.
01:04:08.440 | I thought, oh, it's probably the same thing, underscore P.
01:04:11.800 | And so I created a little function that basically took the file name and added the underscore
01:04:17.640 | P and put it in a different place.
01:04:20.680 | And I tried opening it and I noticed it worked.
01:04:23.400 | So I've created this little function that converts from the image file names to the
01:04:30.040 | equivalent label file names.
01:04:32.120 | I opened up that to make sure it works.
01:04:35.920 | Normally we use open_image to open a file, and then you can go .show to take a look at it.
01:04:43.640 | But this -- as we described, this is not a usual image file.
01:04:47.760 | It contains integers.
01:04:51.400 | So you have to use open_mask rather than open_image because we want to return integers,
01:04:56.400 | not floats.
01:04:57.960 | And fastai knows how to deal with masks, so if you go mask.show, it will automatically
01:05:04.400 | color code it for you in some appropriate way.
01:05:07.680 | That's why we say open_mask.
01:05:09.080 | So we can kind of have a look inside, look at the data, see what the size is.
01:05:13.520 | So there's 720 by 960.
01:05:16.440 | We can take a look at the data inside and so forth.
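Putting those pieces together, it looks something like this (a sketch; path_lbl and img_f stand for the label folder and an example image file from the ls step above):

```python
# Map an image filename to its '_P' label filename -- the guess that worked.
get_y_fn = lambda x: path_lbl/f'{x.stem}_P{x.suffix}'

mask = open_mask(get_y_fn(img_f))    # open_mask keeps the integer class codes
mask.show(figsize=(5, 5), alpha=1)   # fastai color codes the classes for us
src_size = np.array(mask.shape[1:])  # the (720, 960) size mentioned above
mask.data                            # the tensor of per-pixel class numbers
```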
01:05:22.200 | The other thing you might have noticed is that they gave us a file called codes.txt
01:05:26.620 | and a file called valid.txt.
01:05:28.840 | So codes.txt, we can load it up and have a look inside.
01:05:32.600 | And not surprisingly, it's got a list telling us what each number means: counting
01:05:38.760 | 0, 1, 2, 3, 4, number 4 is building.
01:05:40.760 | So the top left is building.
01:05:42.760 | There you go.
01:05:42.760 | Okay?
01:05:43.760 | So just like we had, you know, grizzlies, black bears and teddies, here we've got the
01:05:48.040 | coding for what each one of these pixels means.
01:05:53.280 | So we need to create a data bunch.
01:05:56.200 | So to create a data bunch, we can go through the data block API and say okay, we've got
01:06:00.720 | a list of image files that are in a folder.
01:06:03.960 | We need to create labels, which we can use with that get y file name function we just
01:06:09.480 | created.
01:06:10.560 | We then need to split into training and validation.
01:06:13.000 | In this case, I don't do it randomly.
01:06:15.800 | Why not?
01:06:17.040 | Because actually the pictures they've given us are frames from videos.
01:06:21.320 | So if I did them randomly, I would be having like two frames next to each other, one in
01:06:25.400 | the validation set, one in the training set.
01:06:27.880 | That would be far too easy.
01:06:28.880 | That's cheating.
01:06:30.360 | So the people that created this data set actually gave us a file saying here is the list
01:06:35.720 | of file names that are meant to be in your validation set.
01:06:38.920 | And they're non-contiguous parts of the video.
01:06:42.260 | So here's how you can split your validation and training using a file name file.
01:06:49.880 | So from that, I can create my data sets.
01:06:53.680 | And so I actually have a list of class names.
01:06:58.000 | So like often with stuff like the planet data set or the pets data set, we actually have
01:07:02.880 | a string saying this is a pug or this is a ragdoll or this is a burman or this is cloudy
01:07:10.120 | or whatever.
01:07:11.120 | In this case, you don't have every single pixel labeled with an entire string.
01:07:15.820 | That would be incredibly inefficient.
01:07:17.680 | They're each labeled with just a number and then there's a separate file telling you what
01:07:21.360 | those numbers mean.
01:07:22.800 | So here's where we get to tell it and the data block API, this is the list of what the
01:07:28.000 | numbers mean.
01:07:29.000 | So these are the kind of parameters that the data block API gives you.
01:07:34.200 | Here's our transformations.
01:07:36.020 | And so here's an interesting point.
01:07:37.480 | Remember I told you that, for example, sometimes we randomly flip an image, right?
01:07:43.640 | What if we randomly flip the independent variable image but we don't also randomly flip the target mask?
01:07:52.800 | They're now not matching anymore, right?
01:07:55.120 | So we need to tell fast.ai that I want to transform the Y.
01:08:00.600 | So X is our independent variable, Y is our dependent variable.
01:08:03.280 | I want to transform the Y as well.
01:08:05.300 | So whatever you do to the X, I also want you to do to the Y.
01:08:08.620 | So there's all these little parameters that we can play with and I can create a data bunch.
01:08:14.640 | I'm using a smaller batch size because as you can imagine, because I'm creating a classifier
01:08:19.080 | for every pixel, that's going to take a lot more GPU.
01:08:22.200 | So I found a batch size of 8 is all I could handle and then normalize in the usual way.
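Here's roughly what that whole pipeline looks like with the data block API (a sketch; codes is the class array loaded from codes.txt, and size is e.g. half the source size, as discussed a bit later):

```python
src = (SegmentationItemList.from_folder(path_img)
       .split_by_fname_file('../valid.txt')        # no random split: video frames
       .label_from_func(get_y_fn, classes=codes))

data = (src.transform(get_transforms(), size=size, tfm_y=True)  # transform y too
           .databunch(bs=8)    # small batch size: per-pixel targets eat GPU RAM
           .normalize(imagenet_stats))

data.show_batch(2, figsize=(10, 7))  # images with their color-coded masks
```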
01:08:28.720 | And this is quite nice.
01:08:30.440 | Fast.ai, because it knows that you've given it a segmentation problem, when you call show
01:08:36.140 | batch, it actually combines the two pieces for you and it will color code the photo.
01:08:40.760 | Isn't that nice?
01:08:42.600 | So you can see here the green on the trees and the red on the lines and this kind of
01:08:49.740 | color on the walls and so forth, right?
01:08:52.080 | So you can see here, here are the pedestrians, this is the pedestrian's backpack.
01:08:56.560 | So this is what the ground truth data looks like.
01:09:00.200 | So once we've got that, we can go ahead and create a learner, I'll show you some more details
01:19:01.800 | in a moment, call lr_find, find the steepest bit which looks like about 1e-2, call fit,
01:09:19.640 | passing in slice lr and see the accuracy and save the model and unfreeze and train a little
01:09:28.160 | bit more.
01:09:30.580 | So that's the basic idea, okay?
01:09:32.480 | And so we're going to have a break and when we come back, I'm going to show you some little
01:09:37.720 | tweaks that we can do and I'm also going to explain this custom metric that we've created
01:09:43.340 | and then we'll be able to go on and look at some other cool things.
01:09:46.720 | So let's all come back at 8 o'clock, in 6 minutes.
01:09:53.200 | Okay, welcome back everybody and we're going to start off with a question we got during
01:09:59.200 | the break.
01:10:00.200 | >> Could you use unsupervised learning here, pixel classification with the bike example
01:10:08.760 | to avoid needing a human to label a heap of images?
01:10:13.120 | >> Well, not exactly unsupervised learning, but you can certainly get a sense of where
01:10:19.560 | things are without needing these kind of labels.
01:10:25.080 | And time permitting, we'll try and see some examples of how to do that.
01:10:30.000 | But you're certainly not going to get such a quality and such a specific output as what
01:10:37.120 | you see here, though.
01:10:38.480 | If you want to get this level of segmentation mask, you need a pretty good segmentation mask
01:10:44.800 | ground truth to work with.
01:10:52.640 | >> Is there a reason we shouldn't deliberately make a lot of smaller data sets to step up
01:10:57.700 | from in tuning, let's say 64 by 64, 128 by 128, 256 by 256, and so on?
01:11:06.160 | >> Yes, you should totally do that.
01:11:09.120 | It works great.
01:11:11.120 | Try it.
01:11:12.120 | I found this idea is something that I first came up with in the course a couple of years
01:11:20.680 | ago and I kind of thought it seemed obvious and just presented it as a good idea and then
01:11:26.200 | I later discovered that nobody had really published this before and then we started
01:11:29.480 | experimenting with it and it was basically the main trick that we used to win the ImageNet
01:11:42.960 | competition, the DAWNBench ImageNet training competition, and we were like, wow, people,
01:11:42.960 | this wasn't -- not only was this not standard, nobody had heard of it before.
01:11:48.320 | There's been now a few papers that use this trick for various specific purposes, but it's
01:11:53.960 | still largely unknown and it means that you can train much faster, it generalizes better.
01:11:59.760 | There's still a lot of unknowns about exactly like how small and how big and how much at
01:12:06.640 | each level and so forth, but I guess in as much as it has a name now, it probably does
01:12:14.680 | and I guess we'd call it progressive resizing.
01:12:17.800 | I found that going much under 64 by 64 tends not to help very much, but yeah, it's a great
01:12:28.240 | technique and I definitely try a few different sizes.
01:12:32.320 | >> What does accuracy mean for pixel-wise segmentation?
01:12:40.200 | Is it correctly classified pixels divided by the total number of pixels?
01:12:44.640 | >> Yep, that's it.
01:12:47.240 | So if you imagine each pixel was a separate object you're classifying, it's exactly the
01:12:53.160 | same accuracy.
01:12:55.560 | And so you actually can just pass in accuracy as your metric, but in this case, we actually
01:13:05.640 | don't.
01:13:06.640 | We've created a new metric called accuracy_camvid, and the reason for that is that when
01:13:13.680 | they labeled the images, sometimes they labeled a pixel as a void.
01:13:19.320 | I'm not quite sure why, maybe some pixels they didn't know, or somebody felt they'd made
01:13:26.120 | a mistake or whatever, but some of the pixels are void, and in the CamVid paper, they say
01:13:31.600 | when you're reporting accuracy, you should remove the void pixels.
01:13:38.380 | So we've created an accuracy_camvid metric. All metrics take the actual output of the neural
01:13:46.340 | net (that's what's called the input to the metric)
01:13:51.000 | and the target, i.e. the labels we're trying to predict.
01:13:54.720 | So we then basically create a mask, so we look for the places where the target is not
01:13:59.940 | equal to void.
01:14:04.040 | And then we just take the input, do the argmax as per usual, just the standard accuracy argmax,
01:14:11.680 | but then we just grab those that are not equal to the void code, and we do the same for the
01:14:15.800 | target, and we take the mean.
01:14:18.840 | So it's just a standard accuracy, it's almost exactly the same as the accuracy source code
01:14:23.400 | we saw before with the addition of this mask.
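The metric itself is only a few lines of PyTorch, along these lines:

```python
# Look up the integer code for 'Void' from the codes.txt class list.
name2id = {v: k for k, v in enumerate(codes)}
void_code = name2id['Void']

def acc_camvid(input, target):
    target = target.squeeze(1)
    mask = target != void_code   # keep only the non-void pixels
    return (input.argmax(dim=1)[mask] == target[mask]).float().mean()
```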
01:14:27.340 | So this quite often happens, that the particular Kaggle competition metric you're using, or
01:14:36.440 | the particular way your organization scores things or whatever, there's often little tweaks
01:14:41.760 | you have to do, and this is how easy it is.
01:14:46.320 | And so as you'll see, to do this stuff, the main thing you need to know pretty well is
01:14:51.320 | how to do basic mathematical operations in PyTorch.
01:14:57.800 | So that's just something you kind of need to practice.
01:15:01.200 | >> I've noticed that most of the examples in most of my models result in a training loss
01:15:08.240 | greater than the validation loss.
01:15:10.560 | What are the best ways to correct that?
01:15:12.560 | I should add that this still happens after trying many variations on number of epochs
01:15:16.720 | and learning rate.
01:15:18.440 | >> Okay, good question.
01:15:21.040 | So remember from last week, if your training loss is higher than your validation loss,
01:15:25.280 | then you're underfitting, okay?
01:15:27.480 | It definitely means that you're underfitting, you want your training loss to be lower than
01:15:32.280 | your validation loss.
01:15:35.860 | If you're underfitting, you can train for longer, you can train the last bit at a lower
01:15:44.400 | learning rate, but if you're still underfitting, then you're gonna have to decrease regularization,
01:15:53.200 | and we haven't talked about that yet.
01:15:55.360 | So in the second half of this part of the course, we're gonna be talking quite a lot
01:16:00.240 | about regularization and specifically how to avoid overfitting or underfitting by using
01:16:07.800 | regularization.
01:16:09.400 | If you wanna skip ahead, we're gonna be learning about weight decay, dropout, and data augmentation
01:16:15.960 | will be the key things that we're talking about.
01:16:22.560 | Okay, for segmentation, we don't just create a convolutional neural network.
01:16:32.000 | We can, but actually, an architecture called UNET turns out to be better, and actually,
01:16:42.360 | let's find it.
01:16:46.360 | Okay, so this is what a UNET looks like, and this is from the university website where
01:16:57.200 | they talk about the UNET, and so we'll be learning about this both in this part of the
01:17:01.440 | course and in part two, if you do it.
01:17:04.320 | But basically, this bit down on the left-hand side is what a normal convolutional neural
01:17:11.080 | network looks like.
01:17:12.080 | It's something which starts with a big image and gradually makes it smaller and smaller
01:17:15.520 | and smaller and smaller until eventually you just have one prediction.
01:17:19.200 | What a UNET does is it then takes that and makes it bigger and bigger and bigger again,
01:17:24.520 | and then it takes every stage of the downward path and kind of copies it across, and it
01:17:29.160 | creates this U-shape.
01:17:31.160 | It was originally actually created or published as a biomedical image segmentation method,
01:17:38.600 | but it turns out to be useful for far more than just biomedical image segmentation.
01:17:43.320 | So it was presented at MICCAI, which is the main medical imaging conference, and as of
01:17:50.640 | just yesterday, it actually just became the most cited paper of all time from that conference.
01:17:57.640 | So it's been incredibly useful, over 3,000 citations.
01:18:01.360 | You don't really need to know any of the details at this stage.
01:18:04.160 | All you need to know is if you want to create a segmentation model, you want to be saying
01:18:11.200 | Learner.create_unet rather than create_cnn.
01:18:15.920 | But you pass it the normal stuff, your data bunch, an architecture, and some metrics.
01:18:24.400 | So having done that, everything else works the same.
01:18:27.320 | You can do the LR finder, find the slope, train it for a while, watch the accuracy go
01:18:34.040 | up, save it from time to time, unfreeze, probably want to go about 10x less; it's still going
01:18:43.200 | down here, so probably 10x less than that.
01:18:44.200 | So 1e-5, LR over 5, train a bit more, and there we go.
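So the whole segmentation recipe is roughly this (a sketch; in the released fastai v1 API the constructor is spelled unet_learner):

```python
learn = unet_learner(data, models.resnet34, metrics=acc_camvid)

learn.lr_find()
lr = 1e-2                                   # read off the finder plot
learn.fit_one_cycle(10, slice(lr))
learn.save('stage-1')

learn.unfreeze()
learn.fit_one_cycle(12, slice(1e-5, lr/5))  # the 10x-less rule of thumb
```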
01:18:53.080 | Now here's something interesting.
01:18:56.900 | learn.recorder is where we keep track of what's going on during training, and it's
01:19:01.800 | got a number of nice methods, one of which is plot losses, and this plots your training
01:19:07.080 | loss and your validation loss.
01:19:10.720 | And you'll see quite often they actually go up a bit before they go down.
01:19:17.320 | Why is that?
01:19:18.920 | That's because you can also plot your learning rate over time, and you'll see that your learning
01:19:25.280 | rate goes up and then it goes down.
01:19:29.040 | Why is that?
01:19:30.540 | Because we said fit one cycle, and that's what fit one cycle does.
01:19:34.920 | It actually makes the learning rate start low, go up, and then go down again.
01:19:40.960 | Why is that a good idea?
01:19:41.960 | Well, to find out why that's a good idea, let's first of all look at a really cool project
01:19:50.480 | done by Jose Fernandez-Portal during the week.
01:19:54.720 | He took our gradient descent demo notebook and actually plotted the weights over time,
01:20:04.800 | not just the ground truth and model over time.
01:20:08.800 | And he did it for a few different learning rates.
01:20:12.020 | And so remember, we had two weights.
01:20:14.120 | We were doing basically y equals ax plus b, or in his nomenclature here, y equals w naught
01:20:20.520 | x plus w1.
01:20:22.960 | And so we can actually look and see over time what happens to those weights.
01:20:28.040 | And we know this is the correct answer here, right?
01:20:31.480 | So at a learning rate of 0.1, it kind of slides on in here, and you can see that it takes
01:20:36.580 | a little bit of time to get to the right point, and you can see the loss improving.
01:20:43.280 | At a higher learning rate of 0.7, you can see that the ground truth, the model jumps
01:20:49.240 | to the ground truth really quickly.
01:20:50.960 | And you can see that the weights jump straight to the right place really quickly.
01:20:56.040 | What if we have a learning rate that's really too high?
01:21:00.680 | You can see it takes a very, very, very long time to get to the right point.
01:21:04.880 | Or if it's really too high, it diverges.
01:21:09.520 | So you can see here why getting the right learning rate is important.
01:21:13.920 | When you get the right learning rate, it really zooms into the best spot very quickly.
01:21:20.480 | Now as you get closer to the final spot, something interesting happens, which is that you really
01:21:32.000 | want your learning rate to decrease, because you're getting close to the right spot.
01:21:38.640 | And what actually happens -- so what actually happens is -- I can only draw 2D, sorry.
01:21:51.500 | You don't generally actually have some kind of loss function surface that looks like that.
01:21:56.520 | Remember, there's lots of dimensions, but it actually tends to look bumpy, like that.
01:22:04.320 | And so you kind of want a learning rate that's high enough to jump over the bumps.
01:22:13.040 | But then once you get close to the middle, once you get close to the best answer, you
01:22:18.600 | don't want to be just jumping backwards and forwards between bumps.
01:22:21.400 | So you really want your learning rate to go down so that as you get closer, you take smaller
01:22:26.240 | and smaller steps.
01:22:28.720 | So that's why it is that we want our learning rate to go down at the end.
01:22:36.440 | Now this idea of decreasing the learning rate during training has been around forever, and
01:22:41.520 | it's just called learning rate annealing.
01:22:45.160 | But the idea of gradually increasing it at the start is much more recent, and it mainly
01:22:49.480 | comes from a guy called Leslie Smith.
01:22:52.520 | If you're in San Francisco next week, actually, you can come and join me and Leslie Smith.
01:22:57.400 | We're having a meetup where we'll be talking about this stuff, so come along to that.
01:23:03.160 | What Leslie discovered is that if you gradually increase your learning rate, what tends to
01:23:10.520 | happen is that actually -- actually what tends to happen is that loss function surfaces tend
01:23:20.040 | to kind of look something like this, bumpy, bumpy, bumpy, bumpy, bumpy, flat, bumpy, bumpy,
01:23:25.640 | bumpy, bumpy, bumpy, something like this, right?
01:23:28.400 | They have flat areas and bumpy areas.
01:23:31.560 | And if you end up in the bottom of a bumpy area, that solution will tend not to generalize
01:23:38.280 | very well because you found a solution that's -- it's good in that one place, but it's not
01:23:43.320 | very good in other places, whereas if you found one in the flat area, it probably will
01:23:48.800 | generalize well because it's not only good in that one spot, but it's good kind of around
01:23:53.040 | it as well.
01:23:54.980 | If you have a really small learning rate, it will tend to kind of plod down and stick
01:24:02.720 | in these places, right?
01:24:05.340 | But if you gradually increase the learning rate, then it will kind of like jump down
01:24:10.320 | and then as the learning rate goes up, it's going to start kind of going up again like
01:24:15.680 | this, right? And then the learning rate is now going to be up here. It's going to be
01:24:19.440 | bumping backwards and forwards, and eventually the learning rate starts to come down again,
01:24:25.440 | and so it will tend to find its way to these flat areas.
01:24:29.080 | So it turns out that gradually increasing the learning rate is a really good way of
01:24:34.120 | helping the model to explore the whole function surface and try and find areas where both
01:24:41.240 | the loss is low and also it's not bumpy, because if it was bumpy, it would get kicked out again.
01:24:49.720 | And so this allows us to train at really high learning rates, so it tends to mean that we
01:24:54.400 | solve our problem much more quickly and we tend to end up with much more generalizable
01:24:59.480 | solutions.
01:25:01.280 | So if you call plot losses and find that it's just getting a little bit worse and then it
01:25:07.960 | gets a lot better, you've found a really good maximum learning rate. So when you actually
01:25:12.720 | call fit one cycle, you're not actually passing in a learning rate, you're actually passing
01:25:17.960 | in a maximum learning rate.
01:25:22.160 | And if it's kind of always going down, particularly after you unfreeze, that suggests you could
01:25:28.120 | probably bump your learning rates up a little bit, because you really want to see this kind
01:25:33.560 | of shape. It's going to train faster and generalize better, just a little bit. And you'll tend
01:25:39.520 | to particularly see it in the validation set, the orange is the validation set.
01:25:43.960 | And again, the difference between knowing the theory and being able to do it is looking
01:25:50.680 | at lots of these pictures. So after you train stuff, type learn.recorder. and hit tab and
01:25:59.520 | see what's in there, right? And particularly the things that start with plot and start
01:26:03.520 | getting a sense of, like, what are these pictures looking like when you're getting good results?
01:26:08.600 | And then try making the learning rate much higher, try making it much lower, more epochs,
01:26:13.080 | less epochs and get a sense for what these look like.
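For example:

```python
# A couple of the recorder plots worth getting familiar with (fastai v1).
learn.recorder.plot_losses()  # training vs. validation loss over the run
learn.recorder.plot_lr()      # the one-cycle schedule: lr goes up, then down
```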
01:26:17.320 | So in this case, we used a size in our transforms of the original image size over 2. These two
01:26:31.320 | slashes in Python mean integer division, okay? Because obviously we can't have half-pixel
01:26:37.520 | amounts in our sizes. So, integer division by 2. And we used a batch size of 8. And I found
01:26:43.400 | that fits on my GPU. It might not fit on yours. If it doesn't, you can just decrease the batch
01:26:48.580 | size down to 4. And this isn't really solving the problem, because the problem is to segment
01:26:55.720 | all of the pixels, not half of the pixels. So I'm going to use the same trick that I
01:26:59.760 | did last time, which is I'm now going to put the size up to the full size of the source
01:27:06.160 | images, which means I now have to halve my batch size, otherwise I run out of GPU memory.
01:27:12.800 | And I'm then going to set my learner. I can either say learn.data equals my new data,
01:27:20.200 | or I actually found I've had a lot of trouble with kind of GPU memory, so I generally restarted
01:27:24.360 | my kernel, came back here, created a new learner, and loaded up the weights that I saved last
01:27:31.080 | time. But the key thing here being that this learner now has the same weights that I had
01:27:36.520 | here, but the data is now the full image size. So I can now do an LR find again, find an
01:27:43.960 | area where it's kind of, you know, well before it goes up. So I'm going to use 1e-3 and fit
01:27:49.800 | some more. And then unfreeze and fit some more. And you can go to learn.show_results
01:28:00.440 | to see how your predictions compare to the ground truth. And you've got to say they really
01:28:05.920 | look pretty good. Not bad, huh? So, how good is pretty good? An accuracy of 92.15. The best
01:28:20.720 | paper I know of for segmentation was a paper called The One Hundred Layers Tiramisu, which developed
01:28:27.680 | a convolutional dense net, came out about two years ago. So after I trained this today,
01:28:34.720 | I went back and looked at the paper to find their state of the art accuracy. Here it is.
01:28:46.860 | And I looked it up. And their best was 91.5. And we got 92.1. So I've got to say, when
01:28:58.240 | this happened today, I was like, wow. I don't know if better results have come out since
01:29:04.080 | this paper. But I remember when this paper came out, and it was a really big deal. And
01:29:08.360 | I was like, wow. This is an exceptionally good segmentation result. Like when you compare
01:29:13.240 | it to the previous bests that they compared it to, it was a big step up. And so like in
01:29:18.800 | last year's course, we spent a lot of time in the course re-implementing the 100 layers
01:29:23.800 | tiramisu. And now, with our totally default fastai class, I'm easily beating this. And
01:29:34.640 | I also remember this I had to train for hours and hours and hours, whereas today's I trained
01:29:40.080 | in minutes. So this is a super strong architecture for segmentation. So yeah, I'm not going to
01:29:50.120 | promise that this is the definite state of the art today because I haven't done a complete
01:29:54.320 | literature search to see what's happened in the last two years. But it's certainly beating
01:30:01.160 | the world's best approach the last time I looked into this, which was in last year's
01:30:06.560 | course, basically. And so these are kind of just all the little tricks I guess we've picked
01:30:12.160 | up along the way in terms of like how to train things well. Things like using the pre-trained
01:30:17.520 | model and things like using the one cycle convergence and all these little tricks. They
01:30:23.160 | work extraordinarily well. And it's really nice to be able to like show something in
01:30:29.320 | class where we can say, we actually haven't published the paper on the exact details of
01:30:34.840 | how this variation of the unit works. There's a few little tweaks we do. But if you come
01:30:41.080 | back for part two, we'll be going into all of the details about how we make this work
01:30:46.400 | so well. But for you, all you have to know at this stage is that you can say Learner.create_unet
01:30:52.600 | and you should get great results also. There's another trick you can use if you're running
01:31:04.180 | out of memory a lot, which is you can actually do something called mixed precision training.
01:31:13.780 | And mixed precision training means that instead of using, for those of you that have done
01:31:18.160 | a little bit of computer science, instead of using single precision floating point numbers,
01:31:23.000 | you can do all the--most of the calculations in your model with half precision floating
01:31:27.080 | point numbers. So 16 bits instead of 32 bits. Tradition--I mean, the very idea of this has
01:31:33.720 | only been around really for the last couple of years in terms of like hardware that actually
01:31:39.280 | does this reasonably quickly. And then fast AI library I think is the first and probably
01:31:45.720 | still the only one that makes it actually easy to use this. If you add to_fp16() on the end
01:31:52.000 | of any learner call, you're actually going to get a model that trains in 16-bit precision.
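Schematically, assuming the same CamVid names as above:

```python
# Mixed precision: just chain to_fp16() onto the learner you were creating.
# Needs recent CUDA drivers; fastest on GPUs with hardware fp16 support.
learn = unet_learner(data, models.resnet34, metrics=acc_camvid).to_fp16()
```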
01:32:00.560 | Because it's so new, you'll need to have kind of the most recent CUDA drivers and all that
01:32:06.200 | stuff for this even to work. I tried it this morning on some of the platforms,
01:32:10.880 | and it just killed the kernel.
01:32:17.240 | But if you've got a really recent GPU, like a 2080 Ti, not only will it work, but it will
01:32:24.800 | work about twice as fast as otherwise. Now, the reason I'm mentioning it is that it's
01:32:30.120 | going to use less GPU RAM. So even if you don't have like a 2080 Ti, you might find--or you'll
01:32:38.280 | probably find that things that didn't fit into your GPU without this then do fit in
01:32:44.240 | with this. Now, I actually have never seen people use 16-bit precision floating point
01:32:52.240 | for segmentation before. Just for a bit of a laugh, I tried it and actually discovered
01:32:58.880 | that I got an even better result. So I only found this this morning so I don't have anything
01:33:06.720 | more to add here other than quite often when you make things a little bit less precise
01:33:12.120 | in deep learning, it generalizes a little bit better. And I've never seen a 92.5 accuracy
01:33:20.440 | on CamVid before. So yeah, not only will this be faster, you'll be able to use bigger batch
01:33:26.760 | sizes, but you might even find like I did that you get an even better result. So that's
01:33:33.000 | a cool little trick. You just need to make sure that every time you create a learner,
01:33:36.720 | you add this to_fp16(). If your kernel dies, it probably means you have slightly out of
01:33:41.920 | date CUDA drivers or maybe even an old--too old graphics card. I'm not sure exactly which
01:33:49.680 | cards support FP16. Okay, so one more before we kind of rewind. Sorry, two more. The first
01:34:04.400 | one I'm going to show you is an interesting data set called the BIWI head pose data set.
01:34:12.520 | And Gabrielle Fanelli was kind enough to give us permission to use this in the class. His
01:34:17.600 | team created this cool data set. Here's what the data set looks like. It's pictures. It's
01:34:24.540 | actually got a few things in it. We're just going to do a simplified version. And one
01:34:27.200 | of the things they do is they have a dot saying this is the center of the face. And so we're
01:34:35.160 | going to try and create a model that can find the center of a face. So for this data set,
01:34:44.000 | there's a few data set specific things we have to do which I don't really even understand
01:34:49.480 | but I just know from the read me that you have to. They use some kind of depth sensing
01:34:53.880 | camera. I think they actually used a Kinect, you know, Xbox Kinect. There's some kind
01:34:58.120 | of calibration numbers that they provide in a little file which I had to read in. And
01:35:02.320 | then they provided a little function that you have to use to take their coordinates
01:35:07.120 | to change it from this depth sensor calibration thing to end up with actual coordinates.
01:35:14.440 | So when you open this and you see these little conversion routines, that's just, you know,
01:35:20.640 | I'm just doing what they told us to do basically. It's got nothing particularly to do with deep
01:35:24.440 | learning to end up with this dot. The interesting bit really is where we create something which
01:35:33.000 | is not an image or an image segment, but an image points. And we'll mainly learn about
01:35:39.720 | this later in the course. But basically, image points use this idea of kind of the coordinates,
01:35:48.760 | right? They're not pixel values, they're XY coordinates. There's just two numbers. As
01:35:54.760 | you can see--let me see. Okay. So here's an example for a particular image file name,
01:36:12.240 | this particular image file, and here it is. The coordinates of the center of the face
01:36:18.000 | are 263, 428. And here it is. So there's just two numbers which represent whereabouts on
01:36:26.400 | this picture as the center of the face. So if we're going to create a model that can
01:36:30.720 | find the center of a face, we need a neural network that spits out two numbers. But note,
01:36:37.240 | this is not a classification model. These are not two numbers that you look up in a
01:36:41.680 | list to find out that they're road or building or ragdoll, cat or whatever. They're actual
01:36:48.320 | locations. So, so far everything we've done has been a classification model, something
01:36:55.560 | that's created labels or classes. This, for the first time, is what we call a regression
01:37:00.680 | model. A lot of people think regression means linear regression. It doesn't. Regression
01:37:05.480 | just means any kind of model where your output is some continuous number or set of numbers.
01:37:11.600 | So, this is, we need to create an image regression model, something that can predict these two
01:37:16.720 | numbers. So how do you do that? Same way as always, right? So we can actually just say
01:37:25.560 | I've got a list of image files, it's in a folder, and I want to label them using this
01:37:32.320 | function that we wrote that basically does the stuff that the README says to grab the
01:37:37.040 | coordinates out of their text files. So that's going to give me the two numbers for everyone,
01:37:42.360 | and then I'm going to split it according to some function. And so in this case, the files
01:37:49.360 | they gave us, again, they're from videos, and so I picked just one folder to be my validation
01:37:56.520 | set, in other words, a different person. So again, I was trying to think about, like,
01:38:00.080 | how do I validate this fairly? So I said, well, the fair validation would be to make
01:38:04.780 | sure that it works well on a person that it's never seen before. So my validation set is
01:38:09.740 | all going to be a particular person. Create a data set, and so this data set, I just tell
01:38:15.460 | it what kind of data set is it. Well, they're going to be a set of points. So points means,
01:38:19.440 | you know, specific coordinates. Do some transforms. Again, I have to say transform Y equals true,
01:38:26.680 | because that red dot needs to move if I flip or rotate or what, right? Pick some size,
01:38:32.960 | I just picked a size that's going to work pretty quickly. Create a data bunch, normalize
01:38:36.280 | it, and again, show batch, there it is. Okay? And notice that their red dots don't always
01:38:43.240 | seem to be quite in the middle of the face. I don't know exactly what their kind of internal
01:38:48.520 | algorithm for putting dots on. It kind of sometimes looks like it's meant to be the
01:38:53.160 | nose, but sometimes it's not quite the nose. Anyway, you get it: it's somewhere around
01:38:57.360 | the center of the face, or the nose.
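The data block pipeline for this looks roughly like so (a sketch; get_ctr is the coordinate-extracting function described above, and holding out person '13' as the validation folder is an assumption about this dataset's layout):

```python
data = (PointsItemList.from_folder(path)
        .split_by_valid_func(lambda o: o.parent.name == '13')  # one held-out person
        .label_from_func(get_ctr)                              # the (x, y) face center
        .transform(get_transforms(), tfm_y=True, size=(120, 160))  # move the dot too
        .databunch()
        .normalize(imagenet_stats))

data.show_batch(3, figsize=(9, 6))
```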
01:39:07.840 | So how do we create a model? We create a CNN. But we're going to be learning a lot about loss functions in the next few lessons. But generally,
01:39:13.760 | basically the loss function is that number that says how good is the model. And so for
01:39:19.560 | classification, we use this loss function called cross-entropy loss, which says basically
01:39:25.600 | -- you remember this from earlier lessons? Did you predict the correct class, and were
01:39:31.680 | you confident of that prediction? Now, we can't use that for regression. So instead,
01:39:37.040 | we use something called mean-squared error. And if you remember from last lesson, we actually
01:39:42.960 | implemented mean-squared error from scratch. It's just the difference between the two squared
01:39:47.800 | and added up together. Okay. So we need to tell it this is not classification, so we
01:39:53.440 | have to use mean-squared error.
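In code, that's just a regular CNN learner with the loss function swapped out (a sketch in the fastai v1 API):

```python
learn = create_cnn(data, models.resnet34)
learn.loss_func = MSELossFlat()   # mean squared error, flattened over the batch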
01:40:07.800 | And then once we've created the learner and told it what loss function to use, we can
01:40:11.200 | go ahead and do lr_find. We can then fit. And you can see here, within a minute and a half,
01:40:18.520 | our mean-squared error is 0.0004. Now, the nice thing about mean-squared error is
01:40:25.120 | that it's very easy to interpret, right?
01:40:30.560 | around a few hundred. And we're getting a squared error on average of 0.0004. So we
01:40:38.560 | can feel pretty confident that this is a really good model. And then we can look at the results
01:40:42.240 | by learn.show_results, and we can see predictions, ground truth. It's doing a nearly perfect
01:40:50.760 | job. Okay? So that's how you can do image regression models. So any time you've got
01:40:57.440 | something you're trying to predict, which is some continuous value, you use an approach
01:41:01.320 | that's something like this. So last example, before we look at some kind of more foundational
01:41:10.560 | theory stuff, NLP. And next week we're going to be looking at a lot more NLP. But let's
01:41:18.240 | now do the same thing, but rather than creating a classification of pictures, let's try and
01:41:24.280 | classify documents. And so we're going to go through this in a lot more detail next
01:41:31.000 | week, but let's do the quick version. Rather than importing from fastai.vision, I now
01:41:36.200 | import for the first time from fastai.text. That's where you'll find all the application-specific
01:41:41.200 | stuff for analyzing text documents. And in this case, we're going to use a dataset called
01:41:46.460 | imdb. And imdb has lots of movie reviews. They're generally about a couple of thousand
01:41:54.720 | words. And each movie review has been classified as either negative or positive. So it's just
01:42:04.160 | in a CSV file, so we can use pandas to read it, we can take a little look, we can take
01:42:08.200 | a look at a review. And basically, as per usual, we can either use factory methods or
01:42:17.920 | the data block API to create a data bunch. So here's the quick way to create a data bunch
01:42:23.120 | from a CSV of texts: TextDataBunch.from_csv, and that's that.
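That quick way is literally one line (fastai.text, v1 API; 'texts.csv' being the IMDb sample file):

```python
data_lm = TextDataBunch.from_csv(path, 'texts.csv')
```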
01:42:34.680 | And yeah, at this point, I could create a learner and start training it. But we're going to show you a little bit
01:42:39.280 | more detail, which we're mainly going to look at next week. The steps that actually happen
01:42:44.320 | when you create these data bunches is there's a few steps. The first is it does something
01:42:48.600 | called tokenization, which is it takes those words, and it converts them into a standard
01:42:55.560 | form of tokens, where there's basically each token represents a word. But it does things
01:43:02.040 | like, see here, see how "didn't" has been turned into two separate tokens? And you see
01:43:07.480 | how everything's been lowercased? See how "you're" has been turned into two separate tokens?
01:43:13.400 | So tokenization is trying to make sure that each token, each thing that we've got with
01:43:20.600 | spaces around it here, represents a single linguistic concept. Also, it finds words that
01:43:31.920 | are really rare, like really rare names and stuff like that and replaces them with a special
01:43:37.320 | token called unknown. So anything starting with xx in fastai is some special token.
01:43:45.080 | So that's just tokenization. So we end up with something where we've got a list of tokenized
01:43:49.960 | words. You'll also see that things like punctuation end up with spaces around them to make sure
01:43:55.160 | that they're separate tokens. The next thing we do is we take a complete unique list of
01:44:03.800 | all of the possible tokens, that's called the vocab, and that gets created for us. And
01:44:09.800 | so here's the first ten items of the vocab. So here is every possible token, the first
01:44:15.160 | ten of them that appear in all of the movie reviews. And we then replace every movie review
01:44:22.360 | with a list of numbers. And the list of numbers simply says what numbered thing in the vocab
01:44:29.520 | is in this place. So here's six is zero, one, two, three, four, five, six. So this is the
01:44:36.080 | word "a." And this is three, zero, one, two, three, this is a comma, and so forth. So through
01:44:43.320 | tokenization and numericalization, this is the standard way in NLP of turning a document
01:44:49.680 | into a list of numbers. We can do that with the data block API, right? So this time it's
01:44:57.180 | not image files list, it's text, split data from a CSV, convert them to data sets, tokenize
01:45:06.600 | them, numericalize them, create a data bunch, and at that point we can start to create a
01:45:17.560 | model. As we learn about next week, when we do NLP classification, we actually create
01:45:25.080 | two models. The first model is something called a language model, which, as you can see, we
01:45:33.800 | train in a kind of a usual way. We say we want to create a language model learner, we
01:45:37.960 | train it, we can save it, we unfreeze, we train some more, and then after we've created
01:45:44.480 | a language model, we fine tune it to create the classifier. So here's the thing where
01:45:49.000 | we create the data bunch for the classifier, we create a learner, we train it, and we end
01:46:00.740 | up with some accuracy. So that's the really quick version. We're going to go through it
01:46:05.040 | in more detail next week. But you can see the basic idea of training an NLP classifier
01:46:10.520 | is very, very, very similar to creating every other model we've seen so far. And this accuracy,
01:46:18.640 | so the current state of the art for IMDB classification is actually the algorithm that we built and
01:46:26.360 | published with a colleague named Sebastian Ruder, and this basically, what I just showed
01:46:33.000 | you is pretty much the state of the art algorithm with some minor tweaks. You can get this up
01:46:37.240 | to about 95% if you try really hard. So this is very close to the state of the art accuracy
01:46:43.200 | that we developed. There's a question. Okay, now's a great time for a question.
01:46:53.880 | >> For a dataset very different than ImageNet, like the satellite images or genomic images
01:46:58.480 | shown in lesson two, we should use our own stats. Jeremy once said if you're using a
01:47:03.560 | pre-trained model, you need to use the same stats it was trained with. Why is that? Isn't
01:47:08.920 | it that normalized data with its own stats will have roughly the same distribution like
01:47:13.200 | ImageNet? The only thing I can think of which may differ is skewness. Is it the possibility
01:47:18.720 | of skewness or something else the reason of your statement? And does that mean you don't
01:47:23.240 | recommend using pre-trained models with very different datasets like the one-point mutation
01:47:28.080 | that you mentioned in lesson two? >> No. As you can see, I've used pre-trained
01:47:36.400 | models for all of those things. Every time I've used an imageNet trained model and every
01:47:41.280 | time I've used ImageNet stats. Why is that? Because that model was trained with those
01:47:47.400 | stats. So for example, imagine you're trying to classify different types of green frogs.
01:47:56.100 | So if you were to use your own per-channel means from your dataset, you would end up
01:48:01.080 | converting them to a mean of zero, standard deviation of one for each of your red, green
01:48:06.320 | and blue channels, which means they don't look like green frogs anymore. They now look
01:48:11.120 | like gray frogs, right? But ImageNet expects frogs to be green, okay? So you need to normalize
01:48:18.240 | with the same stats that the ImageNet training people normalized with, otherwise the unique
01:48:23.240 | characteristics of your dataset won't appear anymore. You actually normalize them out in
01:48:27.480 | terms of the per-channel statistics. So you should always use the same stats that the
01:48:32.080 | model was trained with. Okay.
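In other words, finish the pipeline like this whenever you're fine-tuning an ImageNet model (a sketch; src and tfms stand for whatever item list and transforms you're using):

```python
# Normalize with the pretraining stats, not your own dataset's stats.
data = (src.transform(tfms, size=224)
           .databunch()
           .normalize(imagenet_stats))
```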
01:48:44.920 | So in every case, what we're doing here is we're using gradient descent with mini-batches, so stochastic gradient descent, to fit some
01:48:51.040 | parameters of a model. And those parameters are parameters to basically matrix multiplications.
01:48:59.400 | In the second half of this part, we're actually going to learn about a little tweak called
01:49:03.000 | convolutions, but it's basically a type of matrix multiplication. The thing is, though,
01:49:10.120 | no amount of matrix multiplications is possibly going to create something that can read IMDB
01:49:18.120 | reviews and decide if it's positive or negative, or look at satellite imagery and decide whether
01:49:23.920 | it's got a road in it. That's far more than a linear classifier can do. Now, we know these
01:49:29.080 | are deep neural networks, and deep neural networks contain lots of these matrix multiplications.
01:49:35.520 | But every matrix multiplication is just a linear model, and a linear function on top
01:49:42.520 | of a linear function is just another linear function. If you remember back to your high
01:49:50.360 | school math, you might remember that if you have y = ax + b, and then you stick
01:49:55.960 | another cy + d on top of that, it's still just another slope and another intercept.
01:50:05.600 | So no amount of stacking matrix multiplications is going to help in the slightest.
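To spell out that bit of high school algebra: if the first function is y1 = ax + b and you stack y2 = c·y1 + d on top, then

y2 = c(ax + b) + d = (ca)x + (cb + d)

which is still one slope, ca, and one intercept, cb + d, no matter how many layers you stack.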
01:50:11.800 | So what are these models actually -- what are we actually doing? And here's the interesting
01:50:19.040 | thing. All we're actually doing is we literally do have a matrix multiplication or a slight
01:50:27.040 | variation like a convolution that we'll learn about. But after each one, we do something
01:50:32.720 | called a non-linearity or an activation function. An activation function is something that takes
01:50:39.800 | the result of that matrix multiplication and sticks it through some function. And these
01:50:47.920 | are some of the functions that we use. In the old days, the most common function that
01:50:56.800 | we used to use was basically this shape. These shapes are called sigmoids. And they have,
01:51:09.720 | you know, particular mathematical definitions. Nowadays, we almost never use those for these
01:51:17.400 | -- between each matrix multiply. Nowadays, we nearly always use this one. It's called
01:51:25.400 | a rectified linear unit. It's very important when you're doing deep learning to use big
01:51:30.680 | long words that sound impressive, otherwise normal people might think they can do it too.
01:51:35.480 | But just between you and me, a rectified linear unit is defined using the following function: max(x, 0).
01:51:47.560 | That's it. Okay. So -- and if you want to be really exclusive, of course, you then shorten
01:51:54.440 | the long version and you call it a ReLU to show that you're really in the exclusive team.
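In PyTorch terms that is just a clamp at zero. A minimal sketch (torch.relu does the same thing):

```python
import torch

def relu(x):
    # replace every negative element with zero
    return torch.clamp(x, min=0)  # equivalent to torch.relu(x)

print(relu(torch.tensor([-2.0, -0.5, 0.0, 3.0])))
# tensor([0., 0., 0., 3.])
```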
01:51:59.760 | So this is a ReLU activation. So here's the crazy thing. If you take your red, green,
01:52:08.440 | blue pixel inputs and you chuck them through a matrix multiplication and then you replace
01:52:15.160 | the negatives with zero and you put it through another matrix multiplication, place the negatives
01:52:19.160 | with zero and you keep doing that again and again and again, you have a deep learning
01:52:23.760 | neural network. That's it. All right. So how the hell does that work? So an extremely cool
01:52:32.360 | guy called Michael Nielsen showed how this works. He has a very nice website. There's
01:52:39.320 | actually more than a website. It's a book: neuralnetworksanddeeplearning.com.
01:52:44.680 | And he has these beautiful little JavaScript things where you can get to play around -- because
01:52:50.360 | this was back in the old days, this was back when we used to use sigmoids, right? And what
01:52:54.320 | he shows is that if you have enough little -- he shows these little matrix multiplications.
01:53:00.840 | If you have enough little matrix multiplications followed by sigmoids, and exactly the same
01:53:05.720 | thing works for a matrix multiplication followed by a ReLU, you can actually create arbitrary
01:53:11.440 | shapes, right? And so this idea that these combinations of linear functions and nonlinearities
01:53:24.160 | can create arbitrary shapes actually has a name. And this name is the universal approximation
01:53:30.040 | theorem. And what it says is that if you have stacks of linear functions and nonlinearities,
01:53:38.040 | the thing you end up with can approximate any function arbitrarily closely. So you just
01:53:46.280 | need to make sure that you have a big enough matrix to multiply by or enough of them. So
01:53:52.680 | if you have, you know, now this function, which is just a sequence of matrix multiplications
01:53:58.980 | and nonlinearities, where the nonlinearities can be, you know, basically any of these things.
01:54:04.200 | And we normally use the ReLU. If that can approximate anything, then all you need is
01:54:08.880 | some way to find the particular values of the weight matrices in your matrix multiplications
01:54:14.840 | that solve the problem you want to solve. And we already know how to find the values
01:54:19.360 | of parameters. We can use gradient descent. And so that's actually it, right? And this
01:54:25.400 | is the bit I normally find hardest to explain to students: that we're actually
01:54:33.200 | done now. People often come up to me after this lesson and they say, what's the rest?
01:54:40.080 | Please explain to me the rest of deep learning. But, like, no, there's no rest. Like, we have
01:54:45.200 | a function where we take our input pixels or whatever, we multiply them by some weight
01:54:49.320 | matrix, we replace the negatives with zeros, we multiply it by another weight matrix, replace
01:54:53.760 | the negatives with zeros, we do that a few times, we see how close it is to our target,
01:54:59.640 | and then we use gradient descent to update our weight matrices using the derivatives.
01:55:03.760 | And we do that a few times. And eventually we end up with something that can classify
01:55:09.160 | movie reviews or can recognize pictures of ragdoll cats. That's actually it. Okay? So
01:55:19.380 | the reason it's hard to understand intuitively is because we're talking about weight matrices
01:55:27.880 | that have, you know, once you wrap them all up, something like 100 million parameters.
01:55:33.040 | They're very big weight matrices, right? So your intuition about what multiplying something
01:55:39.880 | by a linear model and replacing the negatives with zeros a bunch of times can do, your intuition
01:55:45.680 | doesn't hold, right? You just have to accept, empirically, the truth that doing this works
01:55:52.920 | really well. So in part two of the course, we're actually going to build these from scratch,
01:56:00.680 | right? But I mean, just to skip ahead, you'll basically find that, you know, it's going
01:56:06.480 | to be kind of five lines of code, right? It's going to be a little for loop that goes
01:56:11.640 | t = x @ w1, then t2 = max(t, 0), stick
01:56:23.840 | that in a for loop that goes through each weight matrix, and at the end calculate my
01:56:29.480 | loss function. And of course we're not going to calculate the gradients ourselves because
01:56:33.680 | PyTorch does that for us. And that's about it.
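Here is a minimal sketch of that loop in plain PyTorch. The toy data, layer sizes, and learning rate are made up for illustration; the point is just matrix multiply, ReLU, loss, gradient step:

```python
import torch

x = torch.randn(100, 10)   # toy inputs: 100 samples, 10 features
y = torch.randn(100, 1)    # toy regression targets

# two weight matrices, tracked by autograd
weights = [torch.randn(10, 50, requires_grad=True),
           torch.randn(50, 1, requires_grad=True)]

for epoch in range(100):
    t = x
    for i, w in enumerate(weights):
        t = t @ w                       # matrix multiply
        if i < len(weights) - 1:
            t = torch.clamp(t, min=0)   # ReLU: replace negatives with zero
    loss = ((t - y) ** 2).mean()        # mean squared error
    loss.backward()                     # PyTorch computes the gradients
    with torch.no_grad():
        for w in weights:
            w -= 1e-3 * w.grad          # gradient descent step
            w.grad.zero_()
```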
01:56:45.680 | So, okay, question. >> There's a question about tokenization. I'm curious about how tokenizing words works when
01:56:50.040 | they depend on each other such as San Francisco. >> Yeah, okay. Okay. Tokenization, how do you
01:57:05.400 | tokenize something like San Francisco? San Francisco contains two tokens: San and Francisco.
01:57:15.680 | That's it. That's how you tokenize San Francisco. The question may be coming from people who
01:57:22.000 | have done, like, traditional NLP, where you often need to use these things called n-grams.
01:57:28.840 | And n-grams are kind of this idea that a lot of NLP in the old days was all built
01:57:34.640 | on top of linear models where you basically counted how many times particular strings
01:57:40.040 | of text appeared, like the phrase San Francisco. That would be a bigram: an n-gram with an
01:57:48.040 | n of two. The cool thing is that with deep learning we don't have to worry about that.
01:57:52.480 | Like with many things, a lot of the complex feature engineering disappears when you do
01:57:56.920 | deep learning. So with deep learning, each token is literally just a word or in the case
01:58:04.280 | that the word really consists of two words, like "you're", you split it into two tokens. And
01:58:10.520 | then what we're going to do is we're going to then let the deep learning model figure
01:58:16.680 | out how best to combine words together. Now when we say, like, let the deep learning model
01:58:22.040 | figure it out, of course, all we really mean is find the weight matrices using gradient
01:58:28.040 | descent to give the right answer. Like, there's not really much more to it than that. Again,
01:58:34.280 | there's some minor tweaks, right? In the second half of the course we're going to be learning
01:58:38.280 | about the particular tweak for image models, which is using a convolution; that'll be
01:58:42.920 | a CNN. For language, there's a particular tweak we do called using recurrent models
01:58:49.200 | or an RNN. But they're very minor tweaks on what we've just described. So basically it
01:58:55.000 | turns out with an RNN that it can learn that San plus Francisco has a different meaning
01:59:05.000 | when those two things are together.
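As a quick sketch with the fastai v1 tokenizer (the exact special tokens and casing in the output vary by version, so treat the printed result as illustrative):

```python
from fastai.text import Tokenizer

tok = Tokenizer()
print(tok.process_all(['I love San Francisco.']))
# -> something like [['xxmaj', 'i', 'love', 'xxmaj', 'san', 'xxmaj', 'francisco', '.']]
# "san" and "francisco" stay separate tokens; the model itself learns
# that they mean something different when they appear together.
```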
01:59:12.280 | >> Some satellite images have four channels. How can we deal with data that has four channels or two channels when using pre-trained models?
01:59:17.640 | >> Yeah, that's a good question. I think that's something that we're going to try and incorporate
01:59:26.480 | into fastai. So hopefully by the time you watch this video there will be easier ways
01:59:30.800 | to do this. But the basic idea is a pre-trained ImageNet model expects red, green, and blue
01:59:39.280 | pixels. So if you've only got two channels there's a few things you can do. But basically
01:59:48.120 | you want to create a third channel. And so you can create the third channel as either
01:59:54.080 | being all zeros or it could be the average of the other two channels. And so you can
01:59:59.600 | just use normal PyTorch arithmetic to create that third channel. You could either do that
02:00:07.760 | ahead of time in a little loop and save your three channel versions or you could create
02:00:13.080 | a custom dataset class that does that on demand. For a fourth channel, you probably don't want
02:00:23.160 | to get rid of the fourth channel. So instead what you'd have to do is to actually modify
02:00:29.240 | the model itself. We'll only know how to do that in a couple
02:00:33.640 | more lessons' time. But basically the idea is that the initial weight matrix -- weight matrix
02:00:41.480 | is really the wrong term. They're not weight matrices, they're weight tensors so they can
02:00:45.960 | have more than just two dimensions. So that initial weight matrix in the neural net is
02:00:51.680 | actually a tensor, and one of its axes is going to have three slices
02:01:00.720 | in it. So you would just have to change that to add an extra slice which I would generally
02:01:06.480 | just initialize to zero or to some random numbers. So that's the short version. But
02:01:12.900 | really to answer this, to understand exactly what I meant by that, we're going to need
02:01:24.360 | a couple more lessons to get there.
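Here is a rough sketch of both ideas in plain PyTorch. The layer shapes and names are illustrative stand-ins, not fastai's actual internals, and fastai may offer a more direct route by the time you read this:

```python
import torch
import torch.nn as nn

# Two channels -> three: add a third channel that is all zeros,
# or the average of the other two.
img2 = torch.randn(2, 224, 224)                 # (channels, height, width)
third = img2.mean(dim=0, keepdim=True)          # or torch.zeros(1, 224, 224)
img3 = torch.cat([img2, third], dim=0)          # now (3, 224, 224)

# Four channels -> modify the model's first layer instead. A pre-trained
# first conv layer has a weight tensor of shape (out_ch, 3, kh, kw); we add
# a fourth slice along the input-channel axis, initialized to zero.
conv1 = nn.Conv2d(3, 64, kernel_size=7)         # stands in for the real first layer
new_conv = nn.Conv2d(4, 64, kernel_size=7)
with torch.no_grad():
    extra = torch.zeros(64, 1, 7, 7)            # the new slice for channel 4
    new_conv.weight.copy_(torch.cat([conv1.weight, extra], dim=1))
    new_conv.bias.copy_(conv1.bias)
```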
02:01:40.840 | Okay, so wrapping up, what have we looked at today? Basically we started out by saying, hey, it's really easy now to create web apps.
02:01:40.840 | We've got starter kits for you that show you how to create web apps and people have created
02:01:46.600 | some really cool web apps using what we've learned so far which is single label classification.
02:01:54.220 | But the cool thing is the exact same steps we use to do single label classification,
02:02:00.760 | you can also use to do multi-label classification such as in the planet dataset, or you could use to
02:02:13.560 | do segmentation, or you could use to do any kind of image regression,
02:02:28.160 | or (this is probably a bit early, so don't try this yet) you could use for NLP classification,
02:02:34.920 | and a lot more. So in each case, all we're actually doing is we're doing gradient
02:02:43.420 | descent on not just two parameters but on maybe 100 million parameters, but still just
02:02:50.920 | plain gradient descent along with a non-linearity, which is normally the ReLU, which it turns
02:03:01.160 | out the universal approximation theorem tells us, lets us arbitrarily accurately approximate
02:03:07.800 | any given function including functions such as converting a spoken waveform into the thing
02:03:15.200 | the person was saying or converting a sentence in Japanese to a sentence in English or converting
02:03:20.680 | a picture of a dog into the word dog. These are all mathematical functions that we can
02:03:26.500 | learn using this approach. So this week, see if you can come up with an interesting idea
02:03:34.880 | of a problem that you would like to solve which is either multi-label classification
02:03:40.760 | or image regression or image segmentation, something like that and see if you can try
02:03:49.520 | to solve that problem. You will probably find the hardest part of solving that problem is
02:03:56.160 | creating the data bunch, and so then you'll need to dig into the data block API
02:04:01.600 | to try to figure out how to create the data bunch from the data you have. And with some
02:04:07.800 | practice you will start to get pretty good at that. It's not a huge API, there's a small
02:04:12.280 | number of pieces, it's also very easy to add your own but for now, you know, ask on the
02:04:18.200 | forum if you try something and you get stuck.
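As a starting point, a typical data block pipeline looks roughly like this in fastai v1. Class and method names shifted a little between releases (for example, ImageList was earlier called ImageItemList), so treat this as a sketch:

```python
from fastai.vision import *

data = (ImageList.from_folder(path)              # where the data lives
        .split_by_rand_pct(0.2)                  # how to split train/validation
        .label_from_folder()                     # how to label it
        .transform(get_transforms(), size=128)   # data augmentation
        .databunch(bs=64)                        # wrap it into a DataBunch
        .normalize(imagenet_stats))              # pre-trained model's stats
```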
02:04:28.080 | Okay, great. So next week we're going to come back and we're going to look at some more NLP. We're going to learn some more about some
02:04:34.120 | details about how we actually train with SGD quickly. We're going to learn about things
02:04:38.360 | like Adam and RMSprop and so forth. And hopefully we're also going to show off lots of really
02:04:44.720 | cool web apps and models that you've all built during the week. So I'll see you then. Thanks.
02:04:50.320 | [APPLAUSE]