
Lesson 6 - Deep Learning for Coders (2020)


Chapters

0:00
31:10 Multi-label classification
48:35 One hot encoding
69:08 Regression
109:05 Embedding
110:40 Collaborative filtering from scratch
121:40 Regularisation (Data augmentation for regression)

Whisper Transcript

00:00:00.000 | Hi everybody and welcome to lesson 6 where we're going to continue looking at training
00:00:08.400 | convolutional neural networks for computer vision. And so we last looked at this the
00:00:13.640 | lesson before last and specifically we were looking at how to train an image classifier
00:00:19.360 | to pick out breeds of pet, one of 37 breeds of pet. And we've gotten as far as training
00:00:26.660 | a model but we also had to look and figure out what loss function was actually being
00:00:32.460 | used in this model. And so we talked about cross entropy loss which is actually a really
00:00:37.660 | important concept and some of the things we're talking about today depend a bit on you understanding
00:00:42.560 | this concept. So if you were at all unsure about where we got to with that go back and
00:00:48.480 | have another look have a look at the questionnaire in particular and make sure that you're comfortable
00:00:53.840 | with cross entropy loss. If you're not you may want to go back to the 04 MNIST basics
00:01:00.240 | notebook and remind yourself about MNIST loss because it's very very similar that's what
00:01:04.640 | we built on to build up cross entropy loss. So having trained our model the next thing
00:01:11.840 | we're going to do is look at model interpretation. There's not much point having a model if you
00:01:17.200 | don't see what it's doing. And one thing we can do is use a confusion matrix which in
00:01:25.160 | this case is not terribly helpful. There's kind of a few too many and it's not too bad
00:01:29.320 | we can kind of see some colored areas. And so this diagonal here are all the ones that
00:01:33.280 | are classified correctly. So for Persians there were 31 classified as Persians. But
00:01:40.440 | we can see there's some bigger numbers here like Siamese, six were misclassified, they're
00:01:45.160 | actually considered a Birman. But when you've got a lot of classes like this it might
00:01:53.440 | be better instead to use the most confused method and that tells you the combinations
00:02:02.560 | which it got wrong the most often. In other words which numbers are the biggest so actually
00:02:06.960 | here's the biggest one, ten, and that's confusing an American pit bull terrier with a Staffordshire
00:02:13.960 | bull terrier, that's happened ten times. And Ragdoll is getting confused with the Birman
00:02:19.880 | eight times. And so I'm not a dog or cat expert and so I don't know what this stuff means
00:02:27.880 | so I looked it up on the internet and I found that American pit bull terriers and Staffordshire
00:02:32.320 | bull terriers are almost identical, that I think they sometimes have a slightly different
00:02:36.600 | colored nose, if I remember correctly. And Ragdolls and Birmans are types of cats that are so
00:02:42.920 | similar to each other that there are whole long threads on cat lover forums about is this
00:02:47.040 | a Ragdoll or is this a Birman, and experts disagreeing with each other. So no surprise
00:02:52.600 | that these things are getting confused. So when you see your model making sensible mistakes
00:02:59.620 | the kind of mistakes that humans make that's a pretty good sign that it's picking up the
00:03:03.440 | right kind of stuff and that the kinds of errors you're getting also might be pretty
00:03:07.600 | tricky to fix. But you know let's see if we can make it better. And one way to try and
00:03:15.880 | make it better is to improve our learning rate. Why would we want to improve the learning
00:03:21.920 | rate? Well one thing we'd like to do is to try to train it faster, get more done in less
00:03:28.080 | epochs. And so one way to do that would be to call our fine-tune method with a higher
00:03:35.360 | learning rate. So last time we used the default which I think is, there you go, 1e-2.
00:03:49.800 | And so if we pump that up to 0.1 it's going to jump further each time. So remember the
00:03:54.680 | learning rate and if you've forgotten this have a look again at notebook 4. That's the
00:03:59.520 | thing we multiply the gradients by to decide how far to step. And unfortunately when we
00:04:05.820 | use this higher learning rate the error rate goes from 0.083 to 0.83, so we're getting
00:04:16.760 | the vast majority of them wrong now. So that's not a good sign. So why did that happen? Well
00:04:22.400 | what happened is rather than this gradual move towards the minimum, we had this thing
00:04:30.840 | where we step too far and we get further, further away. So when you see this happening
00:04:38.580 | which looks in practice like this, your error rate getting worse right from the start, that's
00:04:44.360 | a sign your learning rate is too high. So we need to find something just right, not
00:04:49.160 | too small that we take tiny jumps and it takes forever and not too big that we you know either
00:04:55.680 | get worse and worse or we just jump backwards and forwards quite slowly. So to find a good
00:05:01.840 | learning rate we can use something that the researcher Leslie Smith came up with called
00:05:06.800 | the learning rate finder. And the learning rate finder is pretty simple. All we do, remember
00:05:13.800 | when we do stochastic gradient descent, we look at one mini batch at a time or a few
00:05:19.080 | images in this case at a time, find the gradient for that set of images for the mini batch
00:05:24.800 | and jump, step our weights based on the learning rate and the gradient. Well what Leslie Smith
00:05:32.600 | said was okay let's do the very first mini batch at a really, really low learning rate
00:05:37.120 | like 10 to the minus 7 and then let's increase by a little bit, like maybe 25% higher,
00:05:46.560 | and do another step and then 25% higher and do another step. So these are not epochs,
00:05:53.880 | these are just single mini batches, and then we can plot on this chart here. Okay
00:05:57.960 | at 10 to the minus 7 what was the loss and at 25% higher than that what was the loss
00:06:04.000 | and the 25% higher than that what was the loss. And so not surprisingly if you do that
00:06:08.080 | at the low learning rates the loss doesn't really come down because the learning rate
00:06:11.960 | is so small that these steps are tiny, tiny, tiny. And then gradually we get to the point
00:06:21.320 | where they're big enough to make a difference and the loss starts coming down because we've
00:06:24.600 | plotted here the learning rate against the loss, right. So here the loss is coming down
00:06:31.120 | as we continue to increase the learning rate the loss comes down until we get to a point
00:06:36.160 | where our learning rates too high and so it flattens out and then oh it's getting worse
00:06:40.720 | again so here's the point above like 0.1 where we're in this territory. So what we really
00:06:49.560 | want is somewhere around here where it's kind of nice and steep. So you can actually ask
00:06:57.540 | it the learning rate finder, so we used lr_find to get this plot and we can get back
00:07:03.400 | from it the minimum and steep. And so steep is where it was steepest, and the steepest
00:07:08.720 | point was 5e-3, and the minimum point divided by 10, that's quite a good rule of thumb, is
00:07:17.280 | 1e-2. So somewhere around this range might be pretty good. So each time you run it you'll
00:07:25.960 | get different values; a different time we ran it we thought that maybe 3e-3 would be
00:07:30.080 | good so we picked that, and you'll notice the learning rate finder uses a logarithmic scale,
00:07:35.920 | so be careful when interpreting it. So we can now rerun fine-tune, setting
00:07:41.720 | the learning rate to a number we picked from the learning rate finder, which in this case
00:07:45.680 | was 3e-3. And we can see now that's looking good, right, we've got an 8.3% error rate after
00:07:53.360 | 3 epochs. So this idea of the learning rate finder is very straightforward I can describe
00:08:00.560 | it to you in a couple of sentences it doesn't require any complex math and yet it was only
00:08:05.200 | invented in 2015 which is super interesting right it just shows that there's so many interesting
00:08:13.960 | things just to learn and discover. I think part of the reason perhaps for this it took
00:08:18.520 | a while is that you know engineers kind of love using lots and lots of computers. So
00:08:25.680 | before the learning rate finder came along people would like run lots of experiments
00:08:29.320 | on big clusters to find out which learning rate was the best rather than just doing a
00:08:33.080 | batch at a time. And I think partly also the idea of having a thing where a human is in
00:08:39.240 | the loop where we look at something and make a decision is also kind of unfashionable a
00:08:43.280 | lot of folks in research and industry love things which are fully automated. But anyway
00:08:48.400 | it's great we now have this tool because it makes our life easier and fastai is certainly
00:08:55.400 | the first library to have this and I don't know if it's still the only one to have it
00:08:58.400 | built in, at least to the base library. So now we've got a good learning rate, how
00:09:05.760 | do we fine-tune the weights so so far we've just been running this fine-tune method without
00:09:11.360 | thinking much about what it's actually doing. But we did mention in chapter one lesson one
00:09:20.600 | briefly basically what's happening with fine-tune what is transfer learning doing. And before
00:09:28.820 | we look at that let's take a question. Is the learning rate plot in LR find plotted against
00:09:37.880 | one single mini batch? No it's not, it's actually just the standard kind of walking
00:09:49.800 | through the data loader, so just getting the usual mini batches of
00:09:57.400 | the shuffled data. And so it's kind of just normal training and the only thing that's
00:10:01.600 | being different is that we're increasing the learning rate a little bit after each mini
00:10:07.160 | batch and keeping track of it. Along with that, is the network reset to the initial
00:10:18.720 | status after each trial? No certainly not we actually want to see how it learns we want
00:10:26.520 | to see it improving so we don't reset it to its initial state until we're done. So at
00:10:33.640 | the end of it we go back to the random weights we started with or whatever the weights were
00:10:37.440 | at the time we ran this. So what we're seeing here is something that's actually the actual
00:10:45.600 | learning that's happening as we at the same time increase the learning rate. Why would
00:10:53.640 | an ideal learning rate found with a single mini batch at the start of training keep being
00:10:58.000 | a good learning rate even after several epochs and further loss reductions? Great question
00:11:04.520 | it absolutely wouldn't so let's look at that too shall we? And ask one more? This is an
00:11:14.720 | important point, so I'll ask it, it's very important. For the learning rate finder why use the steepest
00:11:20.440 | and not the minimum? We certainly don't want the minimum because the minimum is the point
00:11:26.640 | at which it's not learning anymore. Right so so this flat section at the bottom here
00:11:31.800 | means in this mini batch it didn't get better. So we want the steepest because that's the
00:11:35.920 | mini batch where it got the most improved and that's what we want we want the weights
00:11:40.280 | to be moving as fast as possible. As a rule of thumb though we do find that the minimum
00:11:47.000 | divided by 10 works pretty well that's Sylvain's favorite approach and he's generally pretty
00:11:52.760 | spot-on with that, so that's why we actually print out those two things. lr_min is actually
00:11:59.000 | the minimum divided by 10 and lr_steep suggests the steepest point. Great, good
00:12:08.480 | questions all. So remind ourselves what transfer learning does. So with transfer learning remember
00:12:16.880 | what our neural network is. It's a bunch of linear models basically with activation
00:12:26.360 | functions between them, and our activation functions are generally ReLUs, rectified linear
00:12:32.280 | units. If any of this is fuzzy have a look at the 04 notebook again to remind yourself.
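
To make that concrete, here is a minimal sketch (plain PyTorch, not the lesson's notebook code, with made-up layer sizes) of a "bunch of linear layers with ReLUs between them":

```python
import torch
from torch import nn

# A tiny fully connected network: linear layers with ReLU nonlinearities between them.
# The sizes here are purely illustrative.
simple_net = nn.Sequential(
    nn.Linear(28 * 28, 30),  # a linear layer: its parameters are a weight matrix and a bias
    nn.ReLU(),               # rectified linear unit: max(0, x)
    nn.Linear(30, 1),        # another linear layer producing one activation per item
)

x = torch.randn(64, 28 * 28)  # a pretend mini-batch of 64 flattened inputs
out = simple_net(x)           # shape: (64, 1)
```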
00:12:41.800 | And so each of those linear layers has a bunch of parameters, so the whole neural network has
00:12:47.320 | a bunch of parameters. And so after we train a neural network on something like ImageNet
00:12:56.120 | we have a whole bunch of parameters that aren't random anymore they're actually useful for
00:12:59.880 | something. And we've also seen that the early layers seem to learn about fairly general ideas
00:13:07.080 | like gradients and edges and the later layers learn about more sophisticated ideas like
00:13:12.520 | what do eyes look like or what does fur look like or what does text look like. So with transfer
00:13:18.520 | learning we take a model, so in other words a set of parameters, which has already been
00:13:23.520 | trained on something like ImageNet. We throw away the very last layer because the very
00:13:28.960 | last layer is the bit that specifically says which one of those, in the case of ImageNet
00:13:33.600 | 1000 categories, this image is in. We throw that away and we replace it with random weights
00:13:41.240 | sometimes with more than one layer of random weights and then we train that. Now yes.
00:13:51.200 | Oh I just wanted to make a comment and that's that I think the learning rate finder I think
00:13:58.200 | after you learn about it the idea almost seems kind of so simple or approximate that it's
00:14:04.200 | like wait this shouldn't work like or you know shouldn't you have to do something more
00:14:09.000 | more complicated or more precise that it's like I just want to highlight that this is
00:14:13.160 | a very surprising result that some kind of a such a simple approximate method would be
00:14:19.400 | so helpful. Yeah I would particularly say it's surprising to people who are not practitioners
00:14:25.680 | or have not been practitioners for long. I've noticed that a lot of my students at USF have
00:14:36.680 | a tendency to kind of jump in to try to doing something very complex where they account
00:14:40.320 | for every possible imperfection from the start and it's very rare that that's necessary so
00:14:45.560 | one of the cool things about this is it's good example of trying the easiest thing first
00:14:50.440 | and seeing how well it works. And this was a very big innovation when it came out that
00:14:55.920 | I think it's kind of easy to take for granted now but this was super super helpful when
00:15:00.600 | it came out, and it was also nearly entirely ignored. None of the research
00:15:07.320 | community cared about it and it wasn't until fast AI I think in our first course talked
00:15:11.680 | about it that people started noticing and we had quite a few years in fact it's still
00:15:16.880 | a bit the case where super fancy researchers still don't know about the learning rate finder
00:15:22.200 | and, you know, kept getting beaten by first-lesson fast.ai students on practical problems
00:15:30.160 | because they can pick learning rates better and they can do it without a cluster of thousands
00:15:35.360 | of computers. Okay so transfer learning so we've got our pre-trained network and so it's
00:15:44.280 | really important every time you hear the word pre-trained network you're thinking a bunch
00:15:47.840 | of parameters which have particular numeric values and go with a particular architecture
00:15:54.200 | like ResNet-34. We've thrown away the final layer and replaced it with random numbers
00:16:02.600 | and so now we want to train to fine-tune this set of parameters for a new set of images
00:16:08.920 | in this case pets. So fine-tune is the method we call to do that and to see what it does
00:16:18.440 | we can go learn.fine_tune?? and we can see the source code and here is the signature of the
00:16:27.280 | function and so the first thing that happens is we call freeze. So freeze is actually the
00:16:36.560 | method which makes it so only the last layers weights will get stepped by the optimizer.
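
In plain PyTorch terms, freezing amounts to something like the following sketch (continuing the hypothetical torchvision model above; fastai's actual freeze works on its own parameter groups, this is just the gist):

```python
import torch

# Freeze: stop the pre-trained body from being updated, so only the new head's
# weights get gradients and get stepped by the optimizer.
for name, param in model.named_parameters():
    if not name.startswith("fc."):     # everything except the new final layer
        param.requires_grad = False

# The optimizer then only needs the parameters that still require gradients.
opt = torch.optim.SGD((p for p in model.parameters() if p.requires_grad), lr=1e-3)
```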
00:16:45.600 | So the gradients are calculated just for those last layers of parameters and the step is
00:16:49.580 | done just for those last layers of parameters. So then we call fit and we fit for some number
00:16:57.720 | of epochs which by default is 1. We don't change that very often and what that fit is doing
00:17:07.400 | is it's just fitting those randomly added weights which makes sense right they're the
00:17:12.040 | ones that are going to need the most work because at the time which we add them they're
00:17:16.680 | doing nothing at all they're just random. So that's why we spend one epoch trying to
00:17:22.240 | make them better. After you've done that you now have a model which is much better than
00:17:30.180 | we started with it's not random anymore. All the layers except the last are the same as
00:17:35.240 | the pre-trained network the last layer has been tuned for this new data set. So the closer
00:17:41.040 | you get to the right answer as you can kind of see in this picture the smaller the steps
00:17:46.360 | you want to create sorry the smaller the steps you want to take generally speaking. The next
00:17:51.440 | thing we do is we divide our learning rate by 2 and then we unfreeze so that means we
00:17:56.760 | make it so that all the parameters can now be stepped and all of them will have gradients
00:18:01.660 | calculated and then we fit for some more epochs and this is something we have to pass to the
00:18:07.600 | method. And so that's now going to train the whole network. So if we want to we can kind
00:18:16.860 | of do this by hand right and actually CNN learner will by default freeze the model for us freeze
00:18:27.200 | the parameters for us so we actually don't have to call freeze. So if we just create
00:18:31.900 | a learner and then fit for a while this is three epochs of training just the last layer
00:18:40.100 | and so then we can just manually do it ourselves unfreeze. And so now at this point as the
00:18:46.480 | question earlier suggested maybe this is not the right learning rate anymore so we can
00:18:51.280 | run LR find again and this time you don't see the same shape you don't see this rapid
00:18:58.160 | drop because it's much harder to train a model that's already pretty good. But instead you
00:19:03.560 | just see a very gentle little gradient. So generally here what we do is we kind of try
00:19:09.840 | to find the bit where it starts to get worse again, which is about here, and
00:19:14.320 | go about, you know, a multiple of 10 less than that, so about 1e-5 I would guess
00:19:19.160 | which yep that's what we picked. So then after unfreezing finding our new learning rate and
00:19:25.320 | then we can do a bunch more and so here we are we're getting down to 5.9 percent error
00:19:32.360 | which is okay but there's there's better we can do. And the reason we can do better is
00:19:39.200 | that at this point here we're training the whole model at a 1e-5, so 10 to the minus
00:19:45.800 | 5 learning rate which doesn't really make sense because we know that the last layer
00:19:52.320 | is still not that great it's only had three epochs of training from random so it probably
00:19:56.640 | needs more work. We know that the second last layer was probably pretty specialized to image
00:20:02.720 | net and less specialized to pet breeds so that probably needs a lot of work. Whereas
00:20:07.280 | the early layers, which learn kind of gradients and edges, probably don't need to be changed much
00:20:11.880 | at all. But what we'd really like is to have a small learning rate for the early layers
00:20:16.760 | and a bigger learning rate for the later layers. And this is something that we developed at
00:20:22.160 | fast AI and we call it discriminative learning rates. And Jason Yosinski actually is a guy
00:20:30.400 | who wrote a great paper that some of these ideas are based on which is he actually showed
00:20:35.080 | that different layers of the network really want to be trained at different rates. Although
00:20:39.400 | he didn't kind of go as far as trying that out and seeing how it goes it was more of
00:20:44.240 | a theoretical thing. So in fast AI if we want to do that we can pass to our learning rate
00:20:51.040 | rather than just passing a single number we can pass a slice. Now a slice is a special
00:20:58.640 | built-in feature of Python. It's just an object which basically can have a few different numbers
00:21:04.640 | in it. In this case it's being passed two numbers. And the way we read those, basically
00:21:11.160 | what this means in fast AI as a learning rate is: the very first layer will have this learning
00:21:17.320 | rate 10 to the minus 6. The very last layer will be 10 to the minus 4. And the layers
00:21:22.800 | between the two will be kind of equal multiples. So they'll kind of be equally spaced learning
00:21:28.720 | rates from the start to the end. So here we can see basically doing our kind of own version
00:21:38.160 | of fine-tune. We create the learner, we fit with that automatically frozen version, we
00:21:45.520 | unfreeze, we fit some more. And so when we do that you can see this works a lot better
00:21:51.200 | we're getting down to 5.3, 5.1, 5.4 error. So that's pretty great. One thing we'll notice
00:21:59.840 | here is that we did kind of overshoot a bit. It seemed like more like epoch number 8 was
00:22:05.280 | better. So kind of back before you know well actually let me explain something about fit
00:22:12.200 | one cycle. So fit one cycle is a bit different to just fit. So what fit one cycle does is
00:22:20.920 | it actually starts at a low learning rate. It increases it gradually for the first one-third
00:22:29.360 | or so of the batches until it gets to a high learning rate. The highest, this is why it's
00:22:35.360 | called LR max. It's the highest learning rate we get to. And then for the remaining two-thirds
00:22:40.520 | or so of the batches it gradually decreases the learning rate. And the reason for that
00:22:46.680 | is just that well actually it's kind of like empirically researchers have found that works
00:22:51.640 | the best. In fact this was developed again by Leslie Smith the same guy that did the
00:22:55.340 | learning rate finder. Again it was a huge step you know it really dramatically accelerated
00:23:01.680 | the speed at which we can train neural networks and also made them much more accurate. And
00:23:06.320 | again the academic community basically ignored it. In fact the key publication that developed
00:23:13.360 | this idea was not even did not even pass peer review. And so the reason I mention this now
00:23:20.960 | is to say that we don't really just want to go back and pick the model that was
00:23:25.280 | trained back here because we could probably do better because we really want to pick a
00:23:30.880 | model that's got a low learning rate. But what I would generally do here is I change
00:23:35.640 | this 12 to an 8 because this is looking good. And then I would retrain it from scratch.
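
Putting that whole recipe into code, here is a sketch of the manual version being described (assuming dls is the pets DataLoaders from earlier; the epoch counts and learning rates are the ones discussed in the lesson, not a universal prescription):

```python
from fastai.vision.all import *

learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fit_one_cycle(3, 3e-3)     # train just the randomly added head for a few epochs

learn.unfreeze()                 # now every parameter can be stepped
learn.lr_find()                  # re-check the learning rate on the unfrozen model

# Discriminative learning rates: the earliest layers get the smallest learning rate,
# the final layers the largest, with the layer groups in between spread across the slice.
learn.fit_one_cycle(8, lr_max=slice(1e-6, 1e-4))
```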
00:23:42.680 | Normally you'd find a better result. You can plot the loss and you can see how the training
00:23:49.580 | and validation loss moved along. And you can see here that you know the error rate was
00:23:59.760 | starting to get worse here. And what you'll often see is often the validation loss will
00:24:07.860 | get worse a bit before the error rate gets worse. We're not really seeing it so much
00:24:13.480 | in this case but the error rate and the validation loss don't always or they're not always kind
00:24:17.740 | of in lockstep. So what we're plotting here is the loss but you actually kind of want
00:24:23.840 | to look to see mainly what's happening with the error rate because that's actually the
00:24:26.920 | thing we care about. Remember the loss is just like an approximation of what we care
00:24:31.480 | about that just happens to have a gradient that works out nicely. So how do you make
00:24:42.200 | it better now? We're already down to just 5.4 or if we'd stopped a bit earlier maybe
00:24:49.640 | we could get down to 5.1 or less error. On 37 categories that's pretty remarkable. That's
00:24:56.080 | a very, very good pet breed predictor. If you want to do something even better you could
00:25:02.720 | try creating a deeper architecture. So a deeper architecture is just literally putting more
00:25:11.080 | pairs of an activation function, also known as a non-linearity, followed by these little
00:25:15.880 | linear models put more pairs onto the end. And basically the number of these sets of
00:25:21.800 | layers you have is the number that you'll see at the end of an architecture. So there's
00:25:26.960 | ResNet 18, ResNet 34, ResNet 50, so forth. Having said that you can't really pick ResNet
00:25:35.760 | 19 or ResNet 38. I mean you could make one but nobody's created a pre-trained version
00:25:45.000 | of that for you so you won't be able to do any fine-tuning. So like you can theoretically
00:25:50.120 | create any number of layers you like but in practice most of the time you'll want to pick
00:25:57.800 | a model that has a pre-trained version. So you kind of have to select from the sizes
00:26:02.760 | people have pre-trained and there's nothing special about these sizes they're just ones
00:26:06.580 | that people happen to have picked out. For the bigger models there's more parameters
00:26:13.520 | and more gradients that are going to be stored on your GPU and you will get used to the idea
00:26:19.480 | of seeing this error unfortunately out of memory. So that's not out of memory in your
00:26:26.360 | RAM, that's out of memory in your GPU. CUDA is referring to the language and the system
00:26:32.760 | used for your GPU. So if that happens unfortunately you actually have to restart your notebook
00:26:38.540 | so that's kernel restart and try again and that's a really annoying thing but such is
00:26:45.480 | life. One thing you can do if you get an out-of-memory error is, after your cnn_learner call,
00:26:52.720 | add this magic incantation, to_fp16(). What that does is it uses, for most of the operations,
00:27:01.500 | numbers that use half as many bits as usual so they're less accurate; this is half precision
00:27:07.000 | floating point or FP16 and that will use less memory and on pretty much any NVIDIA card
00:27:18.400 | created in 2020 or later and some more expensive cards even created in 2019 that's often going
00:27:27.080 | to result in a two to three times speed up in terms of how long it takes as well. So
00:27:33.000 | here if I add in to_fp16() I will often be seeing much faster training and in this case
00:27:42.040 | what I actually did is I switched to a ResNet-50 which would normally take about twice as long
00:27:47.200 | and my per epoch time has gone from 25 seconds to 26 seconds. So the fact that we used a
00:27:54.780 | much bigger network and it was no slower is thanks to to_fp16(). But you'll see the error rate
00:28:01.120 | hasn't improved it's pretty similar to what it was and so it's important to realize that
00:28:07.260 | just because we increase the number of layers it doesn't always get better. So it tends
00:28:13.180 | to require a bit of experimentation to find what's going to work for you and of course
00:28:19.280 | don't forget the trick is use small models for as long as possible and to do all of your
00:28:25.920 | cleaning up and testing and so forth and wait until you're all done to try some bigger models
00:28:30.800 | because they're going to take a lot longer. Okay questions. How do you know or suspect
00:28:39.740 | when you can quote do better? You have to always assume you can do better because you
00:28:48.320 | never know. So you just have to I mean part of it though is do you need to do better or
00:28:54.760 | do you already have a good enough result to handle the actual task you're trying to do.
00:29:00.480 | People do spend too much time fiddling around with their models rather than actually
00:29:04.440 | trying to see whether it's already going to be super helpful. So as soon as you can actually
00:29:09.560 | try to use your model to do something practical the better. But yeah how much can you improve
00:29:16.560 | it? Who knows? I you know go through the techniques that we're teaching this course and try them
00:29:23.520 | and see which ones help. Unless it's a problem that somebody has already tried before and
00:29:30.960 | written down their results in a paper or a Kaggle competition or something, there's no
00:29:34.920 | way to know how good it can get. So don't forget after you do the questionnaire to check out
00:29:44.920 | the further research section. And one of the things we've asked you to do here is to read
00:29:49.880 | a paper. So find the learning rate finder paper and read it and see if you can kind
00:29:57.040 | of connect what you read up to the things that we've learned in this lesson. And see
00:30:02.120 | if you can maybe even implement your own learning rate finder you know as manually as you need
00:30:10.000 | to see if you can get something that you know based on reading the paper to work yourself.
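
For reference, here is one possible shape for such a from-scratch learning rate finder. Everything about it (the function name, the defaults, the 1.3 multiplier) is made up for illustration rather than taken from the paper or from fastai:

```python
import copy
import math
import torch

def manual_lr_find(model, loss_func, opt_func, dl, start_lr=1e-7, end_lr=10, mult=1.3):
    "Raise the learning rate a little after every mini-batch and record the loss."
    model = copy.deepcopy(model)          # leave the real model's weights untouched
    opt = opt_func(model.parameters(), lr=start_lr)
    lr, lrs, losses = start_lr, [], []
    for xb, yb in dl:
        for g in opt.param_groups:
            g["lr"] = lr                  # this mini-batch's learning rate
        loss = loss_func(model(xb), yb)
        loss.backward()
        opt.step()
        opt.zero_grad()
        lrs.append(lr)
        losses.append(loss.item())
        if lr > end_lr or math.isnan(loss.item()):
            break                         # stop once the loss blows up
        lr *= mult                        # geometric increase, hence the log-scale plot
    return lrs, losses
```

You would then plot the recorded losses against the learning rates with a log-scaled x axis and look for the steep downward section.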
00:30:16.940 | You can even look at the source code of fastai's learning rate finder of course. And then can
00:30:23.520 | you make this classifier better? And so this is further research right? So maybe you can
00:30:28.200 | start doing some reading to see what else could you do. Have a look on the forums see
00:30:33.200 | what people are trying. Have a look on the book website or the course website to see
00:30:37.600 | what other people have achieved and what they did and play around. So we've got some tools
00:30:43.480 | in our toolbox now for you to experiment with. So that is that is pet breeds that is a you
00:30:53.760 | know a pretty tricky computer vision classification problem. And we kind of have seen most of
00:31:01.960 | the pieces of what goes into the training of it. We haven't seen how to build the actual
00:31:05.360 | architecture but other than that we've kind of worked our way up to understanding what's
00:31:09.820 | going on. So let's build from there into another kind of data set one that involves multi-label
00:31:19.800 | classification. So what's multi-label classification? Well maybe so maybe let's look at an example.
00:31:31.000 | Here is a multi-label data set where you can see that it's not just one label on each image
00:31:36.360 | but sometimes there's three, bicycle, car, person. I don't actually see the car here
00:31:41.480 | I guess it's being dropped out. So a multi-label data set is one where you still got one image
00:31:47.760 | per row but you can have zero one two or more labels per row. So we're going to have a think
00:31:55.280 | about and look at how we handle that. But first of all let's take another question.
00:32:01.400 | Does dropping floating point number precision, switching from FP32 to FP16, have an impact
00:32:10.640 | on the final result? Yes it does. Often it makes it better, believe it or not. It seems like, you know,
00:32:23.920 | it's doing a little bit of rounding off, is one way to think of it, dropping some of that
00:32:28.200 | precision. And so that creates a bit more bumpiness, a bit more uncertainty, a bit more
00:32:35.280 | you know of a stochastic nature. And you know when you introduce more slightly random stuff
00:32:41.080 | into training it very often makes it a bit better. And so yeah FP16 training often gives
00:32:47.800 | us a slightly better result but I you know I wouldn't say it's generally a big deal either
00:32:53.320 | way and certainly it's not always better. Would you say this is a bit of a pattern in
00:32:58.240 | learning less exact and stochastic way? For sure not just in deep learning but machine
00:33:09.320 | learning more generally. You know there's been some interesting research looking at
00:33:13.840 | like matrix factorization techniques which if you want them to go super fast you can
00:33:17.840 | use lots of machines, you can use randomization, and when you then use the results you
00:33:23.920 | often find you actually get better outcomes. Just a brief plug for the fast AI computational
00:33:30.600 | linear algebra course which talks a little bit about randomization. Does it really? Well that
00:33:37.160 | sounds like a fascinating course and look at that it's number one hit here on Google
00:33:43.040 | so easy to find. Well, by somebody called Rachel Thomas. Hey, that person's got the same name
00:33:49.920 | as you Rachel Thomas. All right so how are we going to do multi-label classification?
00:33:58.440 | So let's look at a data set called Pascal which is a pretty famous data set. We'll look
00:34:02.640 | at the version that goes back to 2007, it's been around for a long time. And it comes with
00:34:08.400 | a CSV file which we will read in CSV is comma separated values and let's take a look. Each
00:34:15.320 | row has a file name, one or more labels and something telling you whether it's in the
00:34:21.280 | validation set or not. So the list of categories in each image is a space delimited string
00:34:28.020 | but it doesn't have a 'horse person', it has a horse and a person. PD here stands for pandas.
00:34:36.000 | Pandas is a really important library for any kind of data processing and you use it all
00:34:43.520 | the time in machine learning and deep learning. So let's have a quick chat about it. Not a
00:34:47.720 | real panda it's the name of a library and it creates things called data frames. That's
00:34:52.720 | what the DF here stands for and a data frame is a table containing rows and columns. Pandas
00:34:58.680 | can also do some slightly more sophisticated things than that but we'll treat it that way
00:35:02.240 | for now. So you can read in a data frame by saying PD for pandas. Pandas read CSV, give
00:35:08.000 | it a file name, you've now got a data frame you can call head to see the first few rows
00:35:12.400 | of it for instance. A data frame has an iloc (integer location) property which you can index
00:35:22.320 | into as if it was an array, in fact it looks just like numpy. So colon means every row remember
00:35:30.360 | it's row comma column and zero means zeroth column and so here is the first column of
00:35:36.080 | the data frame. You can do the exact opposite so the zeroth row and every column is going
00:35:43.120 | to give us the first row and you can see the row has column headers and values. So it's
00:35:49.040 | a little bit different to numpy and remember if there's a comma colon or a bunch of comma
00:35:54.680 | colons at the end of indexing in numpy or pytorch or pandas, whatever, you can get rid
00:36:00.960 | of it and these two are exactly the same. You could do the same thing here by grabbing
00:36:08.020 | the column by name, the first column is fname, so you can say df['fname'] and you get that first
00:36:13.360 | column. You can create new columns so here's a tiny little data frame I've created from
00:36:18.120 | a dictionary and I could create a new column by for example adding two columns and you
00:36:25.340 | can see there it is. So it's like a lot like numpy or pytorch except you have this idea
00:36:31.680 | of kind of rows and columns, named columns, and so it's all about kind of tabular data.
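
Here is a compact sketch of the pandas operations just described (assuming path points at the Pascal folder from the notebook and that the CSV has the fname column mentioned above):

```python
import pandas as pd

df = pd.read_csv(path/'train.csv')   # read the CSV into a DataFrame
df.head()                            # first few rows

df.iloc[:, 0]                        # every row, zeroth column
df.iloc[0, :]                        # zeroth row, every column (same as df.iloc[0])
df['fname']                          # grab a column by name

# Creating a new column from two existing ones, as in the tiny example mentioned.
tmp_df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
tmp_df['c'] = tmp_df['a'] + tmp_df['b']
```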
00:36:41.080 | I find its API pretty unintuitive a lot of people do but it's fast and powerful so it
00:36:46.680 | takes a while to get familiar with it but it's worth taking a while and the creator
00:36:50.020 | of pandas wrote a fantastic book called Python for data analysis which I've read both versions
00:36:57.380 | and I found it fantastic. It doesn't just cover pandas it covers other stuff as well
00:37:01.680 | like IPython and numpy and matplotlib so highly recommend this book. This is our table so
00:37:13.440 | what we want to do now is construct data loaders that we can train with and we've talked about
00:37:22.000 | the data block API is being a great way to create data loaders but let's use this as
00:37:26.600 | an opportunity to create a data data loaders or a data so create a data block and then
00:37:31.120 | data loaders for this and let's try to do it like right from square one. So let's see
00:37:38.720 | exactly what's going on with data block. So first of all let's remind ourselves about
00:37:44.640 | what a data set and a data loader is. A data set is an abstract idea of a class you can
00:37:53.040 | create a data set. A data set is anything which you can index into it like so or and
00:37:59.040 | you can take the length of it like so. So for example the list of the lowercase letters
00:38:05.140 | along with a number saying which lowercase letter it is I can index into it to get 0,
00:38:11.080 | a I can get the length of it to get 26 and so therefore this qualifies as a data set
00:38:18.440 | and in particular data sets normally you would expect that when you index into it you would
00:38:22.600 | get back a tuple because you've got the independent and dependent variables not necessarily always
00:38:30.800 | just two things it could be more there could be less but two is the most common. So once
00:38:36.700 | we have a data set we can pass it to a data loader we can request a particular batch size
00:38:46.960 | we can shuffle or not and so there's our data loader from A we could grab the first value
00:38:52.480 | from that iterator and here is the shuffled 7 is H 4 is E 20 is U and so forth and so
00:39:00.580 | remember a mini batch has a bunch of a mini batch of the independent variable and a mini
00:39:06.520 | batch of the dependent variable. If you want to see how the two correspond to each other
00:39:12.320 | you can use zip so if I zip passing in this list and then this list so B0 and B1 you can
00:39:20.480 | see what zip does in Python is it grabs one element from each of those in turn and gives
00:39:26.640 | you back the tuples of the corresponding elements. Since we're just passing in all of the elements
00:39:34.240 | of B to this function Python has a convenient shortcut for that which is just say star B
00:39:42.000 | and so star means insert into this parameter list each element of B just like we did here
00:39:49.760 | so these are the same thing. So this is a very handy idiom that we use a lot in Python zip
00:39:56.100 | star something is kind of a way of like transposing something from one orientation to another.
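
In code, the example just walked through looks roughly like this (using fastai's DataLoader and first; the exact shuffled output will of course differ from run to run):

```python
import string
from fastai.data.load import DataLoader
from fastcore.basics import first

# Anything you can index into and take the length of can act as a dataset:
a = list(enumerate(string.ascii_lowercase))
a[0], len(a)            # ((0, 'a'), 26)

dl_a = DataLoader(a, bs=8, shuffle=True)
b = first(dl_a)         # one mini-batch: (tensor of indices, tuple of letters)

# zip(*b) "transposes" the batch back into (index, letter) pairs;
# it is the same as zip(b[0], b[1]).
list(zip(*b))
```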
00:40:07.040 | All right so we've got a data set we've got a data loader and then what about datasets
00:40:12.400 | Datasets is an object which has a training dataset and a validation dataset, so let's
00:40:17.960 | look at one. Now normally you don't start with kind of an enumeration like this like
00:40:26.280 | with an independent variable and a dependent variable normally you start with like a file
00:40:32.400 | name for example and then you you kind of calculate or compute or transform your file
00:40:40.660 | name into an image by opening it and a label by for example looking at the file name and
00:40:46.520 | grabbing something out of it. So for example we could do something similar here this is
00:40:50.320 | what datasets does so we could start with just the lowercase letters so this is still
00:40:56.200 | a data set right because we can index into it and we can get the length of it although
00:41:00.280 | it's not giving us tuples yet. So if we now pass that list to the datasets class and index
00:41:09.000 | into it we get back the tuple and it's actually a tuple with just one item this is how Python
00:41:15.180 | shows a tuple with one item is it puts it in parentheses and a comma and then nothing okay.
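
That step looks roughly like this, a sketch using fastai's Datasets class with no transforms yet:

```python
import string
from fastai.data.core import Datasets

a = list(string.ascii_lowercase)
dsets = Datasets(a)     # wrap a plain indexable collection
dsets[0]                # ('a',) - a one-element tuple
```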
00:41:20.880 | So in practice what we really want to do is to say like okay we'll take this and do something
00:41:26.780 | to compute an independent variable and do something to compute a dependent variable
00:41:30.880 | but here's a function we could use to compute an independent variable which is to stick
00:41:34.380 | an A on the end and our dependent variable might just be the same thing with a B on the
00:41:38.400 | end. So here's two functions so for example now we can call datasets passing in A and
00:41:46.320 | then we can pass in a list of transformations to do and so in this case I've just got one
00:41:54.000 | which is this function add an A on the end so now if I index into it I don't get A anymore
00:41:58.760 | I get AA. If you pass multiple functions then it's going to do multiple things so here I've
00:42:08.640 | got F1 then F2 AAB that's this one then that's this one and you'll see this is a list of
00:42:15.520 | lists and the reason for that is that you can also pass something like this a list containing
00:42:21.280 | F1 a list containing F2 and this will actually take each element of A pass it through this
00:42:27.920 | list of functions and there's just one of them to give you AA and then start again and
00:42:34.360 | separately pass it through this list of functions there's just one to get A B and so this is
00:42:40.360 | actually kind of the main way we build up independent variables and dependent variables
00:42:46.680 | in fast AI is we start with something like a file name and we pass it through two lists
00:42:51.160 | of functions one of them will generally kind of open up the image for example and the other
00:42:55.460 | one will kind of pass the file name for example and give you a independent variable and a
00:43:00.200 | dependent variable. So you can then create a data loaders object from data sets by passing
00:43:07.800 | in the data sets and a batch size and so here you can see I've got, shuffled, 'oa', 'ia', etc. and 'ob',
00:43:16.640 | 'ib', etc. so this is worth studying to make sure you understand what data sets and data loaders
00:43:23.480 | are we don't often have to create them from scratch we can create a data block to do it
00:43:28.480 | for us but now we can see what the data block has to do let's see how it does it so we can
00:43:35.760 | start by creating an empty data block so an empty data block is going to take our data
00:43:41.460 | frame we're going to go back to looking at data frame which remember was this guy and
00:43:52.800 | so if we pass in our data frame we can now we'll now find that this data block has created
00:44:01.640 | so if we pass in our data frame we'll now find that this data block has created
00:44:08.480 | set it'll give us back an independent variable and a dependent variable and we'll see that
00:44:13.920 | they are both the same thing so this is the first row of the table that's actually shuffled
00:44:20.960 | so it's a random row of the table repeated twice and the reason for that is by default
00:44:26.660 | the data block assumes that we have two things the independent variable and the dependent
00:44:31.040 | or the input in the target and by default it just copies it just keeps exactly whatever
00:44:36.720 | you gave it to create the training set and the validation set by default it just randomly
00:44:43.000 | splits the data with a 20% validation set so that's what's happened here so this is
00:44:50.160 | not much use and what we what we actually want to do if we look at X for example is
00:44:55.400 | grab the the F name the file name field because we want to open this image that's going to
00:45:00.880 | be our independent variable and then for the label we're gonna want this here person cat
00:45:12.280 | so we can actually pass these as parameters get X and get Y functions that return the
00:45:20.600 | bit of data that we want and so you can create and use a function in the same line of code
00:45:27.200 | in Python by saying lambda so lambda R means create a function doesn't have a name it's
00:45:34.080 | going to take a parameter called R we don't even have to say return it's going to return
00:45:39.200 | the F name column in this case and get Y is something which is a function that takes an
00:45:47.080 | R and returns the labels column so now we can do the same thing called dblock.datasets
00:45:55.080 | R and returns the labels column. So now we can do the same thing, call dblock.datasets,
00:45:59.320 | is the image file name and there is the space delimited list of labels so here's exactly
00:46:08.880 | the same thing again but done with functions so now the one line of code above has become
00:46:15.200 | three lines of code but it does exactly the same thing okay we don't get back the same
00:46:22.480 | result because the training set... well wait, why don't we get the same result? Oh, I know
00:46:32.920 | why, because it's randomly shuffled, it's randomly picking a different validation set, because
00:46:38.920 | the random split is done differently each time, so that's why we don't get the same result.
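
As a sketch of what was just built (lambdas are used here for brevity even though, as noted next, they get in the way of saving the data block):

```python
from fastai.vision.all import *

# get_x pulls the file name column out of each DataFrame row,
# get_y pulls the space-delimited labels column.
dblock = DataBlock(get_x=lambda r: r['fname'],
                   get_y=lambda r: r['labels'])
dsets = dblock.datasets(df)   # df is the Pascal DataFrame from above
dsets.train[0]                # a (file name, space-delimited labels) pair
```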
00:46:44.360 | one thing to note be careful of lambdas if you want to save this data block for use later
00:46:51.400 | you won't be able to Python doesn't like saving things that contain lambdas so most of the
00:46:56.800 | time in the book and the course we normally avoid lambdas for that reason because
00:47:01.400 | it's often very convenient to be able to save things we use the word here serialization
00:47:06.280 | that just means basically it means saving something this is not enough to open an image
00:47:15.000 | because we don't have the path so to turn this into so rather than just using this function
00:47:20.700 | to grab the F name column we should actually use path lib to go path slash train and then
00:47:27.660 | column and then for the Y again the labels is not quite enough we actually have to split
00:47:35.440 | on space but this is Python we can use any function we like and so then we use the same
00:47:40.240 | three lines of code is here and now we've got a path and a list of labels so that's
00:47:47.000 | looking good. So we want this path to be opened as an image, so the data block API lets you
00:47:56.600 | pass a blocks argument where you tell it, for each of the things in your tuple, so there's
00:48:03.080 | two of them what kind of block do you need so we need an image block to open an image
00:48:09.680 | and then in the past we've used a category block or categorical variables but this time
00:48:14.920 | we don't have a single category we've got multiple categories so we have to use a multi
00:48:19.480 | category block so once we do that and have a look we now have an 500 by 375 image as
00:48:27.740 | our independent variable and as a dependent variable we have a long lists of zeros and
00:48:34.160 | ones the long list of zeros and ones is the labels as a one hot encoded vector a rank
00:48:47.360 | one tensor and specifically there will be a zero in every location where in the vocab
00:48:57.880 | where there is not that kind of object in this image and a one in every location where
00:49:04.200 | there is so for this one there's just a person so this must be the location in the vocab
00:49:09.660 | where there's a person you have any questions so one hot encoding is a very important concept
00:49:17.640 | and we didn't have to use it before right we could just have a single integer saying
00:49:25.800 | which one thing is it but when we've got lots of things lots of potential labels it's it's
00:49:32.360 | convenient to use this one hot encoding and it's kind of what it's actually what's going
00:49:36.600 | to happen with them with the actual matrices anyway when we actually compare the activations
00:49:48.440 | of our neural network to the target, it's actually going to be comparing each one of these, okay.
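
In code, inspecting that one-hot target and turning it back into labels (the vocab lookup described in the next couple of sentences) looks roughly like this, assuming dsets is the Datasets built with the ImageBlock/MultiCategoryBlock data block just described, and torch comes in with the usual fastai import:

```python
x, y = dsets.train[0]             # image, one-hot encoded target
idxs = torch.where(y == 1.)[0]    # positions in the vocab that are 1 for this image
dsets.train.vocab[idxs]           # the corresponding labels, e.g. just 'person' here
```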
00:49:57.800 | So the categories, as I mentioned, are based on the vocab, so we can grab the vocab from
00:50:04.200 | our Datasets object and then we can say okay let's look at the first row and let's look
00:50:11.360 | at the dependent variable and let's look for where the dependent variable is one okay and
00:50:22.200 | then we can look up those indexes in the vocab and get back a list of what actually
00:50:27.160 | was there and again each time I run this I'm going to get different results so each time
00:50:33.880 | we run this we're going to get different results because I called dot data sets again here
00:50:38.120 | so it's going to give me different train tests split and so this time it turns out that this
00:50:42.760 | is actually a chair. And we have a question: shouldn't the tensor be of integers, why is it a tensor
00:50:50.800 | of floats yeah conceptually this is a tensor of integers they can only be 0 or 1 but we
00:51:04.880 | are going to be using a cross-entropy style loss function but we're going to actually
00:51:11.200 | need to do floating point calculations on them that's going to be faster to just store
00:51:17.920 | them as float in the first place rather than converting backwards and forwards even though
00:51:21.760 | they're conceptually an int we're not going to be doing kind of int style calculations
00:51:26.520 | with them good question I mentioned that by default the data block uses a random split
00:51:39.320 | you might have noticed in the data frame though it said here's a column saying what validation
00:51:49.120 | set to use and if the data set you're given tells you what validation set to use you should
00:51:54.160 | generally use it because that way you can compare your validation set results to somebody
00:51:58.760 | else's so you can pass a splitter argument which again is a function and so we're going
00:52:06.040 | to pass it a function that's also called splitter and the function is going to return the indexes
00:52:12.720 | where it's not valid and that's going to be the training set and the indexes where it
00:52:18.520 | is valid that's going to be the validation set and so the splitter argument is expected
00:52:23.120 | to return two lists of integers and so if we do that we get again the same thing but
00:52:29.680 | now we're using the correct train and validation sets another question sure any particular
00:52:40.480 | reason we don't use floating point eight is it just that the precision is too low yeah
00:52:47.800 | trying to train with 8-bit precision is super difficult it's it's so flat and bumpy it's
00:52:57.320 | pretty difficult to get decent gradients there but you know it's an area of research the
00:53:03.440 | main thing people do with 8-bit or even 1-bit data types is they take a model that's already
00:53:10.440 | been trained with 16-bit or 32-bit floating point and then they kind of round it off it's
00:53:15.160 | called discretizing to create a kind of purely integer or even binary network which can do
00:53:23.680 | inference much faster figuring out how to train with such low precision data is an
00:53:31.820 | area of active research I suspect it's possible and I suspect I mean people have fiddled around
00:53:42.280 | it and had some success I think you know it could turn out to be super interesting particularly
00:53:46.960 | for stuff that's being done on like low-powered devices that might not even have a floating
00:53:51.120 | point unit. Right, so the last thing we need to do is to add our item transforms, random
00:54:01.000 | resized crop, we've talked about that enough so I won't go into it but basically that means
00:54:05.000 | we now are going to ensure that everything has the same shape so that we can collate
00:54:09.460 | it into a data loader. Then, rather than going dot datasets, go dot dataloaders and
00:54:15.480 | display our data, and remember if something goes wrong, as we saw last week, you can call summary
00:54:22.120 | to find out exactly what's happening in your data block so now you know this is something
00:54:27.400 | really worth studying this section because data blocks are super handy and if you haven't
00:54:32.280 | used fastai 2 before they won't be familiar to you because no other library uses them
00:54:39.440 | and so like this is really showing you how to go right back to the start and gradually
00:54:42.840 | build them up so hopefully that'll make a whole lot of sense now we're going to need
00:54:49.520 | a loss function again and to do that let's start by just creating a learner it's created
00:54:57.200 | resnet 18 from the data loaders object that we just created and let's grab one batch of
00:55:04.720 | data and then let's put that into our mini batch of independent variables,
00:55:12.360 | and then learn dot model is the thing that actually contains the model itself, in this
00:55:18.880 | case a CNN, and you can treat it as a function and so therefore we can just pass something
00:55:24.400 | to it and so if we pass a mini batch of the independent variable to learn dot model it
00:55:31.540 | will return the activations from the final layer and that is shape 64 by 20 so anytime
00:55:40.600 | you get a tensor back look at its shape and in fact before you look at its shape predict
00:55:45.640 | what the shape should be and then make sure that you're all right if you're not I think
00:55:51.200 | you guessed wrong so try to understand where you made a mistake or there's a problem with
00:55:55.680 | your code in this case 64 by 20 makes sense because we have a mini batch size of 64 and
00:56:04.840 | for each of those we're going to make predictions about what probability is each of these 20
00:56:10.760 | possible categories. And we have a question, two questions actually. All right, is the data
00:56:17.120 | block API compatible with out of core data sets like Dask yeah the data block API can
00:56:25.400 | do anything you want it to do so you're passing it if we go back to the start so you can create
00:56:37.360 | an empty one and then you can pass it anything that is indexable and yeah so that can be
00:56:48.120 | anything you you like and pretty much anything can be made indexable in Python and that's
00:56:54.880 | something like Dask is certainly indexable so that works perfectly fine if it's not indexable
00:57:03.920 | like it's a it's a network stream or something like that then the data loaders data sets
00:57:10.840 | API's directly which we'll learn about either in this course or the next one but yeah anything
00:57:16.320 | that you can index into it certainly includes Dask you can use with data blocks next question
00:57:23.240 | where do you put images for multi-label with that CSV table should they be in the same
00:57:28.080 | directory there can be anywhere you like so in this case we used a pathlib object like
00:57:37.120 | so and in this case the the by default it's going to be using I think about this so what's
00:57:57.240 | happening here is the path is oh it's saying dot okay the reason for that is that path
00:58:04.760 | dot base path is currently set to path and so that displays things relative oh let's
00:58:09.560 | rid of that okay so the path we set is here right and so then when we said get X it's
00:58:19.760 | saying path slash change slash whatever right so this is an absolute path and so here is
00:58:27.000 | the exact path so you can put them anywhere you like you just have to say what the path
00:58:31.240 | is and then if you want to not get confused by having this big long prefix that we can
00:58:38.920 | don't want to see all the time just set base path to the path you want everything to be
00:58:43.360 | relative to and then it'll just print things out in this more convenient manner right so
00:58:54.200 | this is really important that you can do this that you can create a learner you can grab
00:58:58.640 | a batch of data that you can pass it to the model is this is just plain pytorch this line
00:59:03.480 | a batch of data that you can pass to the model, this is just plain PyTorch, this line
00:59:11.360 | and so now if you have a look here are the 20 activations now this is not a trained model
00:59:21.040 | it's a pre-trained model with a random set of final layer weights so these specific numbers
00:59:26.200 | don't mean anything but it's just worth remembering this is what activations look like and most
00:59:32.800 | importantly they're not between 0 and 1 and if you remember from the MNIST notebook we
00:59:38.400 | know how to scale things between 0 and 1 we can pop them into the sigmoid function so
00:59:43.520 | the sigmoid function is something that scales everything to be between 0 and 1 so let's
00:59:49.720 | use that you'll also hopefully remember from the MNIST notebook that the MNIST loss the
00:59:58.760 | MNIST loss function first did sigmoid and then it did torch.where so and then it did
01:00:07.040 | dot mean so we're going to use exactly the same thing as the MNIST loss function and
01:00:11.520 | we're just going to do one thing which is going to add dot log for the same reason that
01:00:16.120 | we talked about when we were looking at softmax we talked about why log is a good idea as
01:00:25.880 | a transformation we saw in the MNIST notebook we didn't need it but we're going to train
01:00:33.280 | faster and more accurately if we use it, because it's just going to be better behaved,
01:00:37.520 | as we've seen. So this particular function, which is identical to MNIST loss plus dot
01:00:43.520 | log, has a specific name: it's called binary cross entropy, and we used it for the threes
01:00:52.200 | versus sevens problem to decide, for that column, is it a three or not. But because we
01:01:01.480 | can use broadcasting in PyTorch and element-wise arithmetic, this function, when we pass
01:01:09.360 | it a whole matrix, is going to be applied to every column, you know,
01:01:17.680 | so it'll basically do a torch.where on every column separately, and on every item separately,
01:01:27.480 | so that's great it basically means that this binary cross entropy function is going to
01:01:31.120 | be just like MNIST loss but rather than just being is this the number three it'll be is
01:01:37.880 | this a dog is this a cat is this a car is this a person is this the bicycle and so forth
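As a rough sketch, the loss being described here looks something like this (the leading minus sign is there because the log of a probability is negative, and we want a positive loss to minimize):

    import torch

    def binary_cross_entropy(inputs, targets):
        # sigmoid squashes activations into (0,1); where() picks p for present
        # categories and 1-p for absent ones; then log and take the mean
        inputs = inputs.sigmoid()
        return -torch.where(targets == 1, inputs, 1 - inputs).log().mean()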
01:01:43.240 | This is where it's so cool in PyTorch: we can kind of write one thing and then kind
01:01:48.040 | of have it expand to handle higher-dimensional tensors without doing any extra work. We don't
01:01:55.880 | have to write this ourselves, of course, because PyTorch has one, and it's called
01:02:04.240 | F.binary_cross_entropy, so we can just use PyTorch. As we've talked about, there's always an equivalent
01:02:10.520 | module version, so this is exactly the same thing as a module, nn.BCELoss, and these
01:02:21.040 | ones don't include the initial sigmoid. If you want to include this initial sigmoid,
01:02:26.700 | you need F.binary_cross_entropy_with_logits, or the equivalent nn.BCEWithLogitsLoss.
01:02:32.720 | So BCE is binary cross entropy, and so those are two functions plus two equivalent classes
01:02:42.680 | for multi-label or binary problems. And then the equivalent for single-label, like MNIST
01:02:49.360 | and Pets, is NLLLoss and cross entropy loss, so that's the equivalent of binary cross entropy
01:02:55.880 | and binary cross entropy with logits. So these are pretty awful names, I think we can all
01:02:59.920 | agree, but it is what it is. So in our case we have a one-hot encoded target and we want
01:03:09.640 | the one with a sigmoid in, so the equivalent built-in is called BCEWithLogitsLoss, so
01:03:17.120 | we can make that our loss function: we can compare the activations to our targets,
01:03:22.840 | get back a loss, and that's what we can use to train.
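A minimal sketch of that, using the batch activations from the earlier sketch (activs and y are the names assumed there; nn comes from torch, which fastai's star import provides):

    loss_func = nn.BCEWithLogitsLoss()
    loss = loss_func(activs, y)   # raw activations vs. one-hot float targets; sigmoid is applied inside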
01:03:31.060 | before we take our break we also need a metric now previously we've been using as a metric
01:03:35.880 | accuracy or actually error rate error rate is one minus accuracy accuracy only works
01:03:42.800 | for single label datasets like MNIST and PETS because what it does is it takes the input
01:03:52.860 | which is the final layer activations and it does argmax what argmax does is it says what
01:03:59.520 | is the index of the largest number in those activations so for example for MNIST you know
01:04:04.400 | maybe the largest the highest probability is seven so this argmax would return seven
01:04:11.080 | and then it says okay there's those are my predictions and then it says okay is the prediction
01:04:16.520 | equal to the target or not and then take the floating point mean so that's what accuracy
01:04:22.400 | is so argmax only makes sense when there's a single maximum thing you're looking for
01:04:30.040 | in this case we've got multilabel so instead we have to compare each activation to some
01:04:38.260 | threshold by default it's 0.5 and so we basically say if the sigmoid of the activation is greater
01:04:45.980 | than 0.5 let's assume that means that category is there and if it's not let's assume it means
01:04:53.040 | it's not there. And so this is going to give us a list of trues and falses for the categories
01:04:58.280 | that, based on the activations, it thinks are there, and we can compare that to the target
01:05:04.920 | and then again take the floating point mean. We can use the default threshold of 0.5.
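This metric is accuracy_multi; a sketch of it (this mirrors the fastai definition) looks like:

    def accuracy_multi(inp, targ, thresh=0.5, sigmoid=True):
        "Compute accuracy when `inp` and `targ` are the same size."
        if sigmoid: inp = inp.sigmoid()
        return ((inp > thresh) == targ.bool()).float().mean()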
01:05:13.080 | but we don't necessarily want to use 0.5 we might want to use a different threshold and
01:05:18.480 | remember, when we create our learner we have to pass to the
01:05:23.760 | metrics argument a function. So what if we want to use a threshold other than 0.5? Well,
01:05:29.960 | we'd like to create a special function which is accuracy multi with some different threshold
01:05:36.680 | and the way we do that is we use a special built-in in Python called partial let me show
01:05:44.040 | you how partial works. Here's a function called say_hello: say hello to somebody, with something.
01:05:54.120 | So say_hello('Jeremy'): well, the default is Hello, so it says hello Jeremy. say_hello('Jeremy',
01:06:00.240 | 'Ahoy!') is going to be ahoy Jeremy. Let's create a special version of this function that will be more
01:06:06.660 | suitable for Sylvain; it's going to use French. So we can say partial: create a new function
01:06:14.000 | that's based on the say_hello function, but it's always going to set say_what to Bonjour,
01:06:19.440 | and we'll call that f. But now f('Jeremy') is bonjour Jeremy and f('Sylvain') is bonjour Sylvain.
01:06:28.360 | So you see, we've created a new function from an existing function by fixing one of its
01:06:33.560 | parameters. So we can do the same thing for accuracy_multi: let's use a threshold of
01:06:40.600 | 0.2, and we can pass that to metrics.
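A sketch of those two steps (partial lives in functools; the architecture passed to cnn_learner below is an assumption):

    from functools import partial

    def say_hello(to, say_what="Hello"):
        return f"{say_what} {to}."

    say_hello("Jeremy")                        # 'Hello Jeremy.'
    f = partial(say_hello, say_what="Bonjour")
    f("Jeremy"), f("Sylvain")                  # ('Bonjour Jeremy.', 'Bonjour Sylvain.')

    # the same trick fixes the threshold of our metric
    learn = cnn_learner(dls, resnet18, metrics=partial(accuracy_multi, thresh=0.2))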
01:06:47.880 | So let's create a cnn_learner, and you'll notice here we don't actually pass a loss function; that's because fastai knows enough to
01:06:52.880 | realize, hey, you're doing a classification model with a multi-label dependent variable,
01:07:01.840 | so I know what loss function you probably want, so it does it for us. And we can call
01:07:05.760 | fine_tune, and here we have an accuracy of 94.5 after the first epoch and eventually 95.1.
01:07:14.060 | that's pretty good we've got an accuracy of over 95 percent was 0.2 a good threshold to
01:07:19.520 | pick? Who knows. Let's try 0.1: oh, that's a worse accuracy. So I guess in this case we could
01:07:28.080 | try a higher threshold: 94, hmm, also not good. So what's the best threshold? Well, what we
01:07:34.520 | could do is call get_preds to get all of the predictions and all of the targets, and then
01:07:40.000 | we could calculate the accuracy at some threshold. And then we could say, okay, let's grab lots
01:07:48.680 | of numbers between 0.05 and 0.95 and, with a list comprehension, calculate the accuracy
01:07:54.880 | for all of those different thresholds and plot them.
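A sketch of that sweep (get_preds applies the sigmoid for us, hence sigmoid=False in the metric; the number of points is an arbitrary choice):

    import matplotlib.pyplot as plt

    preds, targs = learn.get_preds()
    xs = torch.linspace(0.05, 0.95, 29)
    accs = [accuracy_multi(preds, targs, thresh=i, sigmoid=False) for i in xs]
    plt.plot(xs, accs)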
01:08:03.040 | Ah, looks like we want a threshold somewhere a bit above 0.5. So, cool, we can just use that, and it's going to give us 96 and a
01:08:09.200 | bit, which is going to give us a better accuracy. Um, this is, you know, something that a lot
01:08:17.640 | of theoreticians would be uncomfortable about I've used the validation set to pick a hyper
01:08:23.520 | parameter threshold right and so people might be like oh you're overfitting using the validation
01:08:30.160 | set to pick a hyper parameter but if you think about it this is a very smooth curve right
01:08:34.560 | it's not some bumpy thing where we've accidentally kind of randomly grabbed some unexpectedly
01:08:40.120 | good value when you're picking a single number from a smooth curve you know this is where
01:08:46.040 | the theory of, like, don't use a validation set for hyperparameter tuning doesn't
01:08:51.320 | really apply. So it's always good to be practical, right: don't treat these things as rules, but
01:08:57.320 | as rules of thumb. Okay, so let's take a break for five minutes and we'll see you back here
01:09:05.640 | in five minutes time all right welcome back so I want to show you something really cool
01:09:14.560 | image regression so we are not going to learn how to use a fast AI image regression application
01:09:23.680 | because we don't need one. Now that we know how to build stuff up with loss functions
01:09:30.880 | and the data block API ourselves we can invent our own applications so there is no image
01:09:37.800 | regression application per se but we can do image regression really easily what do we
01:09:46.600 | mean by image regression well remember back to lesson I think it's lesson one we talked
01:09:51.400 | about the two basic types of machine learning or supervised machine learning regression
01:09:59.720 | and classification classification is when our dependent variable is a discrete category
01:10:04.920 | or set of categories and regression is when our dependent variable is a continuous number
01:10:13.520 | like an age or x y coordinate or something like that so image regression means our independent
01:10:20.720 | variable is an image and our dependent variable is one or more continuous
01:10:28.440 | values. And so here's what that can look like. The Biwi head pose dataset has a
01:10:38.120 | number of things in it, but one of the things we can do is find the midpoint of a person's
01:10:42.640 | face. So the Biwi head pose dataset comes from this
01:11:01.440 | paper, Random Forests for Real Time 3D Face Analysis, so thank you to those authors. And we can grab
01:11:11.440 | it in the usual way with untar_data, and we can have a look at what's in there, and we can
01:11:16.600 | see there's 24 directories numbered from one to 24 there's one two three and each one also
01:11:23.080 | has a .obj file; we're not going to be using the .obj files, just the directories. So let's
01:11:28.080 | look at one of the directories and as you can see there's a thousand things in the first
01:11:31.840 | directory so each one of these 24 directories is one different person that they've photographed
01:11:38.520 | and you can see for each person there's frame 3 pose, frame 3 RGB, frame 4 pose,
01:11:46.440 | frame 4 RGB, and so forth. So in each case we've got the image, which is the RGB, and we've
01:11:52.880 | got the pose, which is the pose.txt. So, as we've seen, we can use get_image_files to get a list
01:12:00.920 | of all of the image files recursively in a path, and so once we have an image file
01:12:06.680 | name like this one, sorry, like this one, we can turn it into a pose file name by removing
01:12:16.320 | the last one, two, three, four, five, six, seven letters and adding back on pose.txt. And so here is
01:12:24.540 | a function that does that, and you can see I can pass in an image file to img2pose
01:12:30.720 | and get back a pose file.
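A sketch of that helper (this matches the pattern described here; the example filename is illustrative):

    img_files = get_image_files(path)

    def img2pose(x): return Path(f'{str(x)[:-7]}pose.txt')

    img2pose(img_files[0])
    # e.g. .../01/frame_00003_rgb.jpg -> .../01/frame_00003_pose.txt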
01:12:41.400 | So PILImage.create is the fastai way to create an image, at least a PILImage, and it has a shape. In computer vision they're normally backwards: they normally do
01:12:48.080 | columns by rows, so that's why it's this way around, whereas PyTorch and NumPy tensors and
01:12:54.680 | arrays are rows by columns. So that's confusing, but that's just how things are, I'm afraid.
01:13:00.400 | and so here's an example of an image when you look at the readme from the dataset website
01:13:08.100 | they tell you how to get the center point from one of the text files, and it's just
01:13:14.680 | this function, so it doesn't matter, it just is what it is. We call it get_ctr, and
01:13:19.560 | it will return the x,y coordinates of the center of the person's face, so we can pass this
01:13:27.840 | as get_y, because get_y, remember, is the thing that gives us back the label. Okay, so here's
01:13:38.720 | the thing right we can create a data block and we can pass in as the independent variables
01:13:44.480 | block image block as usual and then the dependent variables block we can say point block which
01:13:50.160 | is a tensor with two values in and now by combining these two things this says we want
01:13:55.680 | to do image regression with a dependent variable with two continuous values to get the items
01:14:04.020 | you call get_image_files; to get the y, we'll call the get_ctr function. To split it, so,
01:14:11.840 | this is important we should make sure that the validation set contains one or more people
01:14:21.140 | that don't appear in the training set so I'm just going to grab person number 13 just grabbed
01:14:26.440 | it randomly and I'll use all of those images as the validation set because I think they
01:14:32.200 | did this with an Xbox Kinect, you know, video thing, so there's a lot of images that look
01:14:39.160 | almost identical so if you randomly assigned them then you would be massively overestimating
01:14:44.960 | how effective you are. You want to make sure that you're actually doing a good job with
01:14:49.880 | a new set of people, not just a new set of frames; that's why we use this.
01:14:55.840 | and so a func splitter is a splitter that takes a function and in this case we're using
01:15:00.400 | lambda to create the function. We will use data augmentation, and we will also normalize;
01:15:09.720 | this is actually done automatically now, but in this case we're doing it manually, so this
01:15:15.080 | is going to subtract the mean and divide by the standard deviation of the original dataset
01:15:22.920 | that the pre-trained model used, which is ImageNet. So that's our DataBlock.
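A sketch of that DataBlock, following the steps described above (the augmentation size and the choice of person 13 for validation are as discussed; treat the exact arguments as a sketch rather than the definitive version):

    biwi = DataBlock(
        blocks=(ImageBlock, PointBlock),               # image in, point (two continuous values) out
        get_items=get_image_files,
        get_y=get_ctr,                                 # the center-of-face function from the readme
        splitter=FuncSplitter(lambda o: o.parent.name == '13'),   # hold out person 13
        batch_tfms=[*aug_transforms(size=(240, 320)),
                    Normalize.from_stats(*imagenet_stats)]
    )
    dls = biwi.dataloaders(path)
    dls.show_batch(max_n=9)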
01:15:32.400 | we can call data loaders to get our data loaders passing in the path and show batch and we
01:15:37.960 | can see that looks good right here's our faces and the points and so let's like particularly
01:15:44.760 | for as a student, don't just look at the pictures, look at the actual data. So grab a batch, put
01:15:51.160 | it into an xb and a yb, an x batch and a y batch, and have a look at the shapes and make sure they
01:15:56.680 | make sense. The ys is 64 by 1 by 2: so it's 64 in the mini-batch, 64 rows, and then each coordinate
01:16:11.520 | is a 1 by 2 tensor so this is a single point with two things in it it's like you could
01:16:21.120 | have like hands face and armpits or whatever or nose and ears and mouth so in this case
01:16:27.720 | we're just using one point and the point is represented by two values the x and the y
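For example (the variable names follow the batch grabbed above):

    xb, yb = dls.one_batch()
    xb.shape, yb.shape
    # (torch.Size([64, 3, 240, 320]), torch.Size([64, 1, 2]))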
01:16:34.320 | And then xb is this 64 by 3 by 240 by 320: well, there's 240 rows by 320 columns, that's the
01:16:40.920 | pixels, that's the size of the images that we're using; the mini-batch is 64 items; and what's
01:16:48.440 | the three the three is the number of channels which in this case means the number of colors
01:16:54.840 | if we open up some random grizzly bear image and then we go through each of the elements
01:17:07.720 | of the first axis and do a show image you can see that it's got the red the green and
01:17:17.040 | the blue as the three channels. So that's how we store a three-channel image: it's stored
01:17:24.560 | as a 3 by number-of-rows by number-of-columns rank-3 tensor, and so a mini-batch
01:17:31.940 | of those is a rank-4 tensor; that's why this is that shape.
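A hypothetical illustration of that first axis (the filename here is a stand-in; only the iteration over channels is the point):

    from fastai.vision.all import PILImage, image2tensor, show_image

    im = image2tensor(PILImage.create('grizzly.jpg'))   # shape: [3, rows, cols]
    for channel in im:                                  # iterating the first axis gives R, G, B
        show_image(channel)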
01:17:39.600 | And here's a row from the dependent variable: there's that x,y location we talked about. So we can now go ahead and
01:17:47.720 | create a learner passing in our data loaders as usual passing in a pre-trained architecture
01:17:53.160 | as usual and if you think back you may just remember in lesson one we learned about y
01:18:00.400 | range y range is where we tell fastai what range of data we expect to see in the dependent
01:18:10.040 | variable so we want to use this generally when we're doing regression though the range
01:18:15.320 | of our coordinates is between minus one and one that's how fastai and pytorch treats coordinates
01:18:22.800 | the left-hand side is minus one, the top is minus one, and the bottom and the right are
01:18:28.040 | one, so there's no point predicting something that's smaller than minus one or bigger than
01:18:33.520 | one, because that is not in the area that we use for our coordinates. We have a question? Sure,
01:18:39.460 | just a moment. So how does y_range work? Well, it actually uses this function called sigmoid_range,
01:18:47.320 | which takes the sigmoid of x, multiplies by high minus low, and adds low. And here is
01:18:54.200 | what sigmoid_range looks like for minus one to one: it's just a sigmoid where the bottom
01:19:02.680 | is the low and the top is the high, and so that way all of our activations are going
01:19:08.840 | to be mapped to the range from minus one to one.
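As a sketch, that function is just:

    def sigmoid_range(x, lo, hi):
        "Scale sigmoid output to the range (lo, hi)."
        return torch.sigmoid(x) * (hi - lo) + lo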
01:19:17.240 | Yes, Rachel? Can you provide images with an arbitrary number of channels as inputs, specifically more than three channels? Yeah,
01:19:23.360 | you can have as many channels as you like. We've certainly seen images with fewer than three,
01:19:28.760 | because we've used grayscale; more than three is common as well. You could have like an infrared
01:19:34.920 | band or like satellite images often have multispectral there's some kinds of medical images where
01:19:41.520 | there are bands that are kind of outside the visible range your pre-trained model will
01:19:47.800 | generally have three channels; fastai does some tricks to use three-channel pre-trained
01:19:55.800 | models for non three channel data but that's the only tricky bit other than that it's just
01:20:02.640 | an axis, you know, that happens to have four things or two things or one thing
01:20:08.120 | instead of three things there's nothing special about it okay we didn't specify a loss function
01:20:18.440 | here, so we get whatever it gave us, which is MSELoss. MSE loss is mean squared error,
01:20:24.800 | and that makes perfect sense, right: you would expect mean squared error to be a reasonable
01:20:30.160 | thing to use for regression. We're just testing how close we are to the target, and then
01:20:35.880 | taking the square and taking the mean. We didn't specify any metrics, and that's because mean
01:20:42.840 | squared error is already a good metric: it has nice gradients, it
01:20:50.760 | behaves well, and it's also the thing that we care about, so we don't need a separate
01:20:54.760 | metric to track. So let's go ahead and use lr_find, and we can pick a learning rate,
01:21:02.880 | maybe about 10 to the minus 2; we can call fine_tune and we get a valid loss of 0.0001.
01:21:11.920 | And so that's the mean squared error, so we should take the square root: on average we're
01:21:15.920 | about 0.01 off, in a coordinate space that goes between minus 1 and 1, so that sounds
01:21:21.280 | super accurate, and it took about three and a bit minutes to run.
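A sketch of how this regression learner is put together and trained (the epoch count and the exact learning rate are assumptions picked to match the description above):

    learn = cnn_learner(dls, resnet18, y_range=(-1, 1))
    learn.lr_find()
    learn.fine_tune(3, 1e-2)
    learn.show_results(max_n=3)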
01:21:28.920 | In fastai we can always call show_results, and we always should, to see what our results look like, and as you can see fastai has automatically
01:21:34.280 | figured out how to display the combination of an image independent variable and a point
01:21:40.160 | dependent variable: on the left is the target and on the right is the prediction,
01:21:46.400 | and as you can see it is pretty close to perfect you know one of the really interesting things
01:21:52.280 | here is we used fine_tune, even though, think about it, the thing we're fine-tuning,
01:21:58.880 | ImageNet, isn't even an image regression model. So we're actually fine-tuning an image classification
01:22:07.340 | model to become something totally different, an image regression model. Why does that work
01:22:12.840 | so well? Well, because an ImageNet classification model must have learnt a lot about kind of
01:22:25.000 | how images look, what things look like, and where the pieces of them are, to kind of know
01:22:29.560 | how to figure out what breed of animal something is, even if it's partly obscured,
01:22:35.800 | or it's in the shade, or it's turned at different angles. You know, these pre-trained image models
01:22:41.760 | are incredibly powerful computer vision algorithms. So built into every ImageNet pre-trained
01:22:51.680 | model is all this capability that it had to learn for itself so asking it to use that
01:22:57.000 | capability to figure out where something is just actually not that hard for it and so
01:23:03.000 | that's why we can actually fine-tune an image net classification model to create something
01:23:09.360 | completely different which is a point image regression model so I find that incredibly
01:23:18.840 | cool I got to say so again look at the further research after you've done the questionnaire
01:23:26.840 | and particularly if you haven't used data frames before please play with them because
01:23:30.320 | we're going to be using them more and more. A question? I'll just do this last bit first. And
01:23:39.560 | also, go back and look at the bear classifier from notebook 2, or whatever hopefully you
01:23:45.840 | created some other classifier for your own data because remember we talked about how
01:23:51.760 | it would be better if the bear classifier could also recognize that there's no bear
01:23:56.000 | at all or maybe there's both a grizzly bear and a black bear or a grizzly bear and a teddy
01:24:01.940 | bear so if you retrain it using multi-label classification see what happens see how well
01:24:08.000 | it works when there's no bears and see whether it changes the accuracy of the single label
01:24:14.960 | model when you turn it into a multi-label problem so have a fiddle around and tell us
01:24:20.560 | on the forum what you find I've got a question Rachel is there a tutorial showing how to
01:24:25.480 | use pre-trained models on four channel images also how can you add a channel to a normal
01:24:31.920 | image?
01:24:32.920 | Well, the last one first: how do you add a channel to an image? I don't know what that means. Okay,
01:24:41.000 | I don't know, you can't; like, an image is an image, you can't add a channel to an image, it
01:24:48.640 | is what it is. I don't know if there's a tutorial, but we can certainly make sure somebody on
01:24:59.240 | the forum has learned how to do it; it's super straightforward, it should be pretty
01:25:05.840 | much automatic okay we're going to talk about collaborative filtering what is collaborative
01:25:21.360 | filtering?
01:25:22.360 | Well think about on Netflix or whatever you might have watched a lot of movies that are
01:25:29.840 | sci-fi and have a lot of action and were made in the 70s and Netflix might not know anything
01:25:40.200 | about the properties of movies you watched it might just know that they're movies with
01:25:45.080 | titles and IDs but what it could absolutely see without any manual work is find other
01:25:52.520 | people that watched the same movies that you watched and it could see what other movies
01:26:02.240 | those people watched that you haven't, and you
01:26:06.680 | would probably find they're also science fiction and full of action and made in the 70s. So
01:26:13.880 | we can use an approach where we recommend things even if we don't know anything about
01:26:20.440 | what those things are as long as we know who else has used or recommended things that are
01:26:30.200 | similar you know the same kind you know many of the same things that that you've liked
01:26:34.520 | or used. This doesn't necessarily mean users and products; in fact in collaborative filtering,
01:26:43.760 | rather than saying products, we normally say items, and items could be links you click on, a diagnosis
01:26:50.960 | for a patient, and so forth. So there's a key idea here, which is that in the underlying
01:26:59.480 | items, and we're going to be using movies in this example, there are some
01:27:05.400 | features. They may not be labeled, but there's some underlying concept of features of
01:27:12.960 | those movies, like the fact that there's an action concept and a sci-fi concept and a
01:27:19.160 | 1970s concept. Now, you never actually told Netflix you like these kinds of movies,
01:27:24.720 | and maybe Netflix never actually added columns to their movies saying which movies are those
01:27:28.320 | types but as long as like you know in the real world there's this concept of sci-fi
01:27:35.680 | and action and movie age and that those concepts are relevant for at least some people's movie
01:27:42.520 | watching decisions, as long as this is true, then we can actually uncover these: they're
01:27:50.240 | called latent factors, these things that kind of decide what kind of movies you want
01:27:56.600 | to watch, and they're latent because nobody necessarily ever wrote them down or labeled
01:28:02.560 | them or communicated them in any way so let me show you what this looks like so there's
01:28:11.240 | a great data set we can use called movie lens which contains tens of millions of movie rankings
01:28:17.640 | and so a movie ranking looks like this it has a user number a movie number a rating
01:28:28.000 | and a timestamp. So we don't know anything about who user number 196 is; I don't know
01:28:34.160 | if that is Rachel or somebody else I don't know what movie number 242 is I don't know
01:28:41.520 | if that's Casablanca or Lord of the Rings or The Mask, and then rating is a number between,
01:28:49.120 | I think it was, one and five. A question? Sure. In traditional machine learning we perform cross
01:28:56.760 | validations and K fold training to check for variance and bias trade-off is this common
01:29:03.040 | in training deep learning models as well.
01:29:09.820 | So cross validation is a technique where you don't just split your data set into one training
01:29:15.120 | set and one validation set but you basically do it five or so times like five training
01:29:21.920 | sets and like five validation sets representing different overlapping subsets and basically
01:29:29.840 | This used to be done a lot because people often used to not have enough data to
01:29:35.320 | get a good result, and so this way, rather than kind of having 20% that you would leave out
01:29:43.600 | each time, you could just leave out like 10% each time.
01:29:47.080 | Nowadays it's less common that we have so little data that we need to worry about the
01:29:53.120 | complexity and extra time of lots of models. It's done on Kaggle a lot; on Kaggle every
01:30:01.040 | little fraction of a percent matters. But it's not a deep learning thing or a machine
01:30:07.240 | learning thing or whatever; it's just a, you know, lots-of-data-or-not-very-much-data thing,
01:30:12.920 | and do you care about the last decimal place or not. It's not something we're going
01:30:18.940 | to talk about certainly in this part of the course if ever because it's not something
01:30:24.680 | that comes up in practice that often as being that important.
01:30:30.040 | There are two more questions.
01:30:34.800 | What would be some good applications of collaborative filtering outside of recommender systems?
01:30:42.480 | Well I mean depends how you define recommender system if you're trying to figure out what
01:30:50.880 | kind of other diagnoses might be applicable to a patient I guess that's kind of a recommender
01:30:56.080 | system or you're trying to figure out where somebody is going to click next or whatever
01:31:02.320 | it's kind of a recommender system but you know really conceptually it's anything where
01:31:08.500 | you're trying to learn from from past behavior where that behavior is kind of like a thing
01:31:17.500 | happened to an entity.
01:31:20.800 | What is an approach to training using video streams i.e. from drone footage instead of
01:31:26.520 | images would you need to break up the footage into image frames?
01:31:31.800 | In practice quite often you would because images just tend to be pretty big so videos
01:31:36.520 | tend to be pretty big.
01:31:39.880 | There's a lot of... so, I mean, theoretically time could be the fourth channel, yeah, or fifth
01:31:49.080 | channel. So if it's a full-color movie you can absolutely have, well, I guess fourth, because
01:31:57.000 | you can have a rank-5 tensor being batch by time by color by row by column, but often
01:32:08.920 | that's too computationally and too memory intensive so sometimes people just look at
01:32:22.460 | one frame at a time sometimes people use a few frames around kind of the keyframe like
01:32:30.600 | three or five frames at a time and sometimes people use something called a recurrent neural
01:32:36.160 | network, which we'll be seeing in the next week or two, treating it as sequence data. Yeah,
01:32:41.440 | there's all kinds of tricks you can do to try and work with that. Conceptually, though,
01:32:49.240 | there's no reason you can't just add an additional axis to your tensors and everything will work;
01:32:54.180 | it's just a practical issue around time and memory.
01:32:59.240 | And someone else noted that it's pretty fitting that you mentioned the movie The Mask.
01:33:03.880 | Yes it was not an accident because I've got masks on the brain.
01:33:12.880 | I'm not sure if we're allowed to like that movie anymore though I kind of liked it when
01:33:16.560 | it came out I don't know what I think nowadays it's a while okay so let's take a look so
01:33:28.480 | we can untar_data ML_100k. So ml-100k is a small subset of the full set; there's another
01:33:35.360 | one that we can grab which has got the whole lot, 25 million, but 100k is good enough for
01:33:41.840 | messing around. So if you look at the readme you'll find the main table; the main table
01:33:46.540 | is in a file called u.data. So let's open it up with read_csv. Again, this one is actually
01:33:51.760 | not comma-separated values, it's tab-separated; rather confusingly, we still use read_csv, and to
01:33:57.000 | say the delimiter is a tab, \t is tab. There's no row at the top saying what the columns are
01:34:04.480 | called, so we say header is None and then pass in a list of what the columns are called, and .head
01:34:11.080 | will give us the first five rows.
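A sketch of that call (this follows the course notebook):

    ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                          names=['user', 'movie', 'rating', 'timestamp'])
    ratings.head()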
01:34:19.800 | We mentioned this before, what it looks like; it's not a particularly friendly way to look at it. So what I'm going to do is I'm going to cross-
01:34:25.800 | tab it and so what I've done here is I've grabbed the top I can't remember how many
01:34:32.960 | it was 15 or 20 movies based on the most popular movies and the top bunch of users who watched
01:34:44.000 | the most movies and so I've basically kind of reoriented this so for each user I have
01:34:51.280 | all the movies they've watched and the rating they gave them so empty spots represent users
01:34:56.240 | that have not seen that movie so this is just another way of looking at this same data so
01:35:09.640 | basically what we want to do is guess what movies we should tell people they might want
01:35:15.640 | to watch, and so it's basically filling in these gaps: to tell user 212, which do we
01:35:21.280 | think they might like best to watch next, movie 49 or 79 or 99?
01:35:32.720 | So let's assume that we actually had columns for every movie that represented say how much
01:35:43.980 | sci-fi they are how much action they are and how old they are and maybe they're between
01:35:49.440 | minus one and one and so like the last Skywalker is very sci-fi fairly action and definitely
01:35:57.480 | not old and then we could do the same thing for users so we could say user one really
01:36:05.760 | likes sci-fi quite likes action and really doesn't like old and so now if you multiply
01:36:13.120 | those together and remember in PyTorch and NumPy you have element wise calculations so
01:36:19.200 | this is going to multiply each corresponding item it's not matrix multiplication if you're
01:36:24.800 | a mathematician don't go there this is element wise multiplication if we want matrix multiplication
01:36:29.960 | be an at sign so if we multiply each element together next to with the equivalent element
01:36:37.440 | in the other one and then sum them up that's going to give us a number which will basically
01:36:42.800 | tell us how much do these two correspond because remember two negatives multiply together to
01:36:47.640 | get a positive so user one likes exactly the kind of stuff that last guy was that the last
01:36:55.160 | Skywalker has in it and so we get two point one multiplying things together element wise
01:37:01.680 | and adding them up is called the dot product and we use it a lot and it's the basis of
01:37:06.640 | matrix multiplication so make sure you know what a dot product is it's this so Casablanca
01:37:23.760 | is not at all sci-fi, not much action, and is certainly old, so if we do user one times Casablanca
01:37:32.040 | we get a negative number, so we might think, okay, user one most likely won't like this movie.
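A small worked example of that dot product, with made-up latent factors for (sci-fi, action, old), in the spirit of the numbers used in the course notebook:

    import torch

    last_skywalker = torch.tensor([0.98, 0.9, -0.9])
    user1 = torch.tensor([0.9, 0.8, -0.6])
    (user1 * last_skywalker).sum()    # about 2.1: looks like a good match

    casablanca = torch.tensor([-0.99, -0.3, 0.8])
    (user1 * casablanca).sum()        # about -1.6: probably not their thing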
01:37:39.640 | problem is we don't know what the latent factors are and even if we did we don't know how to
01:37:44.840 | label a particular user or a particular movie with them so we have to learn them how do
01:37:53.400 | we learn them? Well, we can actually look at a spreadsheet: I've got a spreadsheet version,
01:38:06.920 | which is basically, what I did was I popped this table
01:38:15.800 | into Excel and then I randomly created, let's count this now, one, two, three, four, five, six,
01:38:24.520 | seven, eight, nine, ten, eleven, twelve... I randomly created a 15 by 5 table here, so these are
01:38:32.520 | just random numbers, and I randomly created a 5 by 15 table here. And I basically said,
01:38:39.480 | okay well let's just pretend let's just assume that every movie and every user has five latent
01:38:45.320 | factors I don't know what they are and let's then do a matrix multiply of this set of factors
01:38:54.160 | by this set of factors and a matrix multiply of a row by a column is identical to a dot
01:39:00.280 | product of two vectors so that's why I can just use matrix multiply so this is just what
01:39:05.600 | this first cell contains, and then I copied it to the whole thing, so all these numbers
01:39:11.180 | here are being calculated from the row latent factors dot product with, or matrix multiplied
01:39:20.700 | with a column latent factors so in other words I'm doing exactly this calculation but I'm
01:39:28.680 | doing them with random numbers and so that gives us a whole bunch of values right and
01:39:39.000 | then what I could do is I could calculate a loss by comparing every one of these numbers
01:39:45.120 | here to every one of these numbers here and then I could do mean squared error and then
01:39:54.560 | I could use stochastic gradient descent to find the best set of numbers in each of these
01:40:00.920 | two locations and that is what collaborative filtering is so that's actually all we need
01:40:11.180 | So rather than doing it in Excel (and you can review the Excel version later if you're interested, because
01:40:18.280 | we can actually do this whole thing and it works in Excel), let's jump in and do it in
01:40:25.200 | PyTorch. Now, one thing that might just make this more fun is actually to know what the movies
01:40:30.120 | are and movie lens tells us in u.item what the movies are called and that uses the delimiter
01:40:37.040 | of the pipe sign weirdly enough so here are the names of each movie and so one of the
01:40:43.880 | nice things about pandas is it can do joins, just like SQL, and so you can use the merge
01:40:52.460 | method to combine the ratings table and the movies table, and since they both have a column
01:40:57.920 | called movie, by default it will join on those. And so now here we have the ratings table
01:41:03.800 | with actual movie names; that's going to be a bit more fun. We don't need it for modeling,
01:41:07.940 | but it's just going to be better for looking at stuff.
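Roughly, as in the course notebook (the encoding and usecols arguments are there because u.item is a latin-1, pipe-delimited file with many columns we don't need):

    movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1',
                         usecols=(0, 1), names=('movie', 'title'), header=None)
    ratings = ratings.merge(movies)
    ratings.head()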
01:41:15.760 | So we could use the data block API at this point, or we can just use the built-in application factory method; since it's there,
01:41:20.240 | we may as well use it so we can create a collaborative filtering data loaders object from a data
01:41:25.440 | frame by passing in the ratings table by default the user column is called user and ours is
01:41:34.200 | so that's fine; by default the item column is called item, and ours is not, it's called title, so
01:41:41.000 | let's pick title, and choose a batch size. And so if we now say show_batch, here is some of
01:41:50.480 | that data, and the rating column is called rating by default, so that worked fine too.
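A sketch of that factory method (the batch size is an assumption):

    dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
    dls.show_batch()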
01:41:57.840 | But here's some data. So we now need to create our latent factors; let's assume we're going to use five
01:42:08.380 | factors. So the number of users is however many classes there are for user, and the number
01:42:16.640 | of movies is however many classes there are for title. And so we don't just
01:42:26.600 | have a vocab now right we've actually got a list of classes for each categorical variable
01:42:36.400 | for each set of discrete choices. So we've got a whole bunch of users, 944, and a whole
01:42:43.600 | bunch of titles 1635 so for our randomized latent factor parameters we're going to need
01:42:55.680 | to create those matrices, so we can just create them with random numbers. So this is normally
01:43:00.320 | distributed random numbers, that's what randn is, and that will be n_users, so 944, by
01:43:07.680 | the number of factors, which is 5. That's exactly the same as this, except this is just 15. So let's do
01:43:15.920 | exactly the same thing for movies: random numbers, n_movies by 5.
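A sketch of those two matrices (following the notebook; dls.classes gives the list of classes per categorical variable mentioned above):

    n_users = len(dls.classes['user'])       # 944
    n_movies = len(dls.classes['title'])     # 1635
    n_factors = 5

    user_factors = torch.randn(n_users, n_factors)
    movie_factors = torch.randn(n_movies, n_factors)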
01:43:24.360 | So to calculate the result for some movie and some user, we have to look up the index of the movie in our movie
01:43:30.440 | latent factors and the index of the user in our user latent factors, and then do a dot product.
01:43:38.520 | So in other words, we would say, like, oh okay, for this particular combination we would have
01:43:43.640 | to look up that numbered user over here and that numbered movie over here to get the two
01:43:51.500 | appropriate sets of latent factors but this is a problem because look up in an index is
01:44:02.320 | not a linear model like remember our deep learning models really only know how to just
01:44:12.400 | multiply matrices together and do simple element wise nonlinearities like ReLU there isn't
01:44:17.640 | a thing called look up in an index okay I'll just finish this bit here's the cool thing
01:44:27.240 | though: the lookup in an index can actually be represented as a matrix product, believe
01:44:36.440 | it or not. So if you replace our indices with one-hot encoded vectors, then a one-hot encoded
01:44:47.200 | vector times something is identical to looking up in an index and let me show you so if we
01:44:59.440 | grab... if we call the one_hot function, that creates, as it says here, a one-hot encoding,
01:45:11.260 | and we're going to one-hot encode the value three with n_users classes, and n_users,
01:45:21.700 | as we've just discussed, is 944. Then, if we one-hot encode the number
01:45:35.640 | three into n_users classes, one_hot(3, n_users), we get this big tensor, and as you can see, at
01:45:54.400 | index 3 (0, 1, 2, 3) we have a 1, and the size of that is 944. So if we then multiply that by
01:46:10.720 | user_factors (user_factors, remember, is that random matrix of this size), what's going
01:46:22.920 | to happen? We're going to go 0 times the first row, and so that's going to be all zeros, and
01:46:35.280 | then we're going to go 0 again, and 0 again, and then finally
01:46:38.860 | go 1, right on the index 3 row, so it's going to return each of the values in that row, and then we'll
01:46:47.160 | go back to 0 again. So if we do that (remember, the at sign is matrix multiply) and compare it to user_factors[3], it's the same thing.
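As a sketch (this follows the notebook; the .float() is needed so the one-hot vector can be matrix-multiplied with the float factors):

    one_hot_3 = one_hot(3, n_users).float()
    user_factors.t() @ one_hot_3    # matrix multiply by the one-hot vector...
    user_factors[3]                 # ...gives the same values as just indexing row 3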
01:46:57.520 | Isn't that crazy? So it's a kind of weird, inefficient
01:47:13.360 | way to do it right but matrix multiplication is a way to index into an array and this is
01:47:22.080 | the thing that we know how to do SGD with and we know how to build models with so it
01:47:27.320 | turns out that anything that we can do with indexing into an array, we now have a way to optimize,
01:47:33.760 | and we have a question there are two questions one how different in practice is collaborative
01:47:40.520 | filtering with sparse data compared to dense data we are not doing sparse data in this
01:47:48.020 | course, but there's an excellent course, I hear, called Computational Linear Algebra for Coders,
01:47:53.680 | a fast.ai course; it has a lot of information about sparse matrices. And second question: in practice,
01:48:00.800 | do we tune the number of latent factors? Absolutely we do, yes; it's just like the number of filters
01:48:10.280 | we have in pretty much any kind of deep learning model. All right, so now that we know that the
01:48:22.400 | procedure of finding out which set of latent factors is the right one, looking
01:48:27.980 | something up in an index, is the same as matrix multiplication with a one-hot vector, I already
01:48:35.320 | had it over here we can go ahead and build a model with that so basically if we do this
01:48:45.360 | for a few indices at once, then we have a matrix of one-hot encoded vectors,
01:48:49.840 | so the whole thing is just one big matrix multiplication now the thing is as I said
01:48:58.600 | this is a pretty inefficient way to do an index lookup, so there is a computational
01:49:07.160 | shortcut which is called an embedding an embedding is a layer that has the computational speed
01:49:18.400 | of an array lookup and the same gradients as a matrix multiplication how does it do
01:49:27.440 | that well just internally it uses an index lookup to actually grab the values and it
01:49:35.120 | also knows what the gradient of a matrix multiplication by a one-hot encoded vector is or matrix is
01:49:44.720 | without having to go to all this trouble and so an embedding is a matrix multiplication
01:49:50.360 | with a one-hot encoded vector where you never actually have to create the one-hot encoded
01:49:54.320 | vector you just need the indexes this is important to remember because a lot of people have heard
01:50:00.000 | about embeddings and they think there's something special and magical and and they're absolutely
01:50:06.240 | not you can do exactly the same thing by creating a one-hot encoded matrix and doing a matrix
01:50:11.360 | multiply it is just a computational shortcut nothing else I often find when I talk to people
01:50:18.600 | about this in person I have to tell them this six or seven times before they believe me
01:50:25.120 | because they think embeddings are something more clever and they're not it's just a computational
01:50:29.440 | shortcut to do a matrix multiplication more quickly with a one-hot encoded matrix by instead
01:50:34.800 | doing an array lookup okay so let's try and create a collaborative filtering model in
01:50:46.280 | PyTorch. A model, or an architecture, or really an nn.Module, is a class, so to use PyTorch to
01:50:57.320 | its fullest you need to understand object-oriented programming, because we have to create classes.
01:51:01.600 | there's a lot of tutorials about this so I won't go into detail about it but I'll give
01:51:06.740 | you a quick overview a class could be something like dog or resnet or circle and it's
01:51:16.200 | something that has some data attached to it, and it has some functionality attached to
01:51:20.520 | it. Here is a class called Example: the data it has attached to it is a, and the functionality
01:51:28.360 | attached to it is say. And so we can, for example, create an instance of this class, an object
01:51:36.040 | of this type Example; you pass in Sylvain, so Sylvain will now be in ex.a, and we can then
01:51:44.800 | say ex.say, and it will call say, passing in nice to meet you, so that will
01:51:50.680 | be x, and so it'll say hello self.a, so that's Sylvain, nice to meet you. Here it is.
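A sketch of that class (this matches the example described here):

    class Example:
        def __init__(self, a):
            self.a = a                      # stash the data on the object
        def say(self, x):
            return f'Hello {self.a}, {x}.'

    ex = Example('Sylvain')
    ex.say('nice to meet you')              # 'Hello Sylvain, nice to meet you.'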
01:52:02.800 | In Python, the way you create a class is to say class and its name, then to say what is
01:52:09.360 | passed to it when you create that object, in a special method called dunder init. As we've
01:52:15.520 | briefly mentioned before in Python there are all kinds of special method names that have
01:52:21.720 | special behavior they start with two underscores they end with two underscores and we pronounce
01:52:27.160 | that dunder, so, dunder init. All regular methods, instance methods, in Python
01:52:38.240 | always get passed the actual object itself first; we normally call that self, and
01:52:44.000 | then optionally anything else and so you can then change the contents of the current object
01:52:49.720 | by just setting self.whatever to whatever you like so after this self.a is now equal
01:52:55.880 | to silver so we call a method same thing it's passed self optionally anything you pass to
01:53:03.400 | it and then you can access the contents of self which you stashed away back here when
01:53:09.200 | we initialized it so that's basically how object or you know the basics of object-oriented
01:53:14.920 | programming works in Python there's something else you can do when you create a new class
01:53:24.640 | which is you can pop something in parentheses after its name and that means we're going
01:53:29.120 | to use something called inheritance and what inheritance means is I want to have all the
01:53:34.160 | functionality of this class plus I want to add some additional functionality so module
01:53:41.000 | is a PyTorch class which fast.ai has customized so it's kind of a fast.ai version of a PyTorch
01:53:49.920 | class, and probably in the next course we'll see exactly how it works, but it
01:54:00.200 | acts almost exactly like just a regular Python class. We have an init,
01:54:07.440 | and we can set attributes to whatever we like and one of the things we can use is an embedding
01:54:15.120 | And so an Embedding is just this class that does what I just described: it's the same
01:54:19.880 | as a linear layer with a one-hot encoded matrix, but it does it with this computational
01:54:25.920 | shortcut. You can say how many, in this case, users there are and how many factors will
01:54:31.000 | they have now there is one very special thing about things that inherit from module which
01:54:38.400 | is that when you call them it will actually call a method called forward so forward is
01:54:44.240 | a special PyTorch method name it's the most important PyTorch method name this is where
01:54:49.820 | you put the actual computation. So to grab the factors from an embedding we just
01:54:58.160 | call it like a function, right. So this is going to get passed, here, the user IDs and the movie
01:55:05.640 | IDs as two columns so let's grab the zero index column and grab the embeddings by passing
01:55:12.520 | them to user factors and then we'll do the same thing for the index one column that's
01:55:17.640 | the movie IDs pass them to the movie factors and then here is our element wise multiplication
01:55:25.280 | and then sum. And now remember, we've got another dimension: the first axis is the mini-batch
01:55:33.320 | dimension, so we want to sum over the other dimension, the index 1 dimension. So that's
01:55:40.040 | going to give us a dot product for each user... sorry, for each rating, for each user-movie
01:55:47.960 | combination. So this is the DotProduct class.
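Here it is as a sketch (this matches the model walked through above; Module and Embedding come from fastai's star import):

    class DotProduct(Module):
        def __init__(self, n_users, n_movies, n_factors):
            self.user_factors = Embedding(n_users, n_factors)
            self.movie_factors = Embedding(n_movies, n_factors)

        def forward(self, x):
            users = self.user_factors(x[:, 0])      # column 0 holds the user ids
            movies = self.movie_factors(x[:, 1])    # column 1 holds the movie ids
            return (users * movies).sum(dim=1)      # dot product per row of the mini-batch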
01:55:58.440 | You can see, if we look at one batch of our data, it's of shape 64 by 2, because there are 64 items in the mini-batch and each
01:56:06.240 | one has... this is the independent variables, so it's got the user ID and the movie ID. A question: do
01:56:20.840 | deep neural network based models for collaborative filtering work better than more traditional
01:56:25.400 | approaches like SVD or other matrix factorization? Let's wait until we get there. So here's x, right, so here
01:56:39.160 | is one user ID, movie ID combination, and then each one of those 64... here are the ratings.
01:56:56.400 | so now we've created a dot product module from scratch so we can instantiate it passing
01:57:03.600 | in the number of users the number of movies and let's use 50 factors and now we can create
01:57:07.920 | a learner now this time we're not creating a CNN learner or a specific application learner
01:57:13.320 | it's just a totally generic learner so this is a learner that doesn't really know how
01:57:17.160 | to do anything clever it just stores away the data you give it and the model you give
01:57:26.280 | it. And so when we're not using an application-specific learner, it doesn't know what loss
01:57:34.680 | function to use, so we'll tell it to use MSE, and fit, and that's it.
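A sketch of that (the epoch count and learning rate here follow the course notebook and are just a starting point; MSELossFlat is fastai's flattened version of MSE):

    model = DotProduct(n_users, n_movies, 50)
    learn = Learner(dls, model, loss_func=MSELossFlat())
    learn.fit_one_cycle(5, 5e-3)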
01:57:40.920 | So we've just fitted our own collaborative filtering model, where we literally created the entire architecture,
01:57:40.920 | it's a pretty simple one from scratch so that's pretty amazing now the results aren't great
01:57:50.600 | if you look at the movie lens data set benchmarks online you'll see this is not actually a great
01:57:57.360 | result so one of the things we should do is take advantage of the tip we just mentioned
01:58:02.520 | earlier in this lesson which is when you're doing regression which we are here right the
01:58:07.320 | number between one and five is like a continuous value we're trying to get as close to it as
01:58:11.040 | possible, we should tell fastai what the range is, so we can use y_range as before. So here's
01:58:22.400 | exactly the same thing: we've got a y_range, we've stored it away, and then at the end we
01:58:28.400 | use, as we discussed, sigmoid_range, passing in, and look here, we pass in *self.y_range,
01:58:35.200 | which is going to pass in, by default, 0 comma 5.5.
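A sketch of the updated model (same as the DotProduct above, with the range stored and applied at the end; the class name here is just to distinguish it from the earlier version):

    class DotProductRange(Module):
        def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
            self.user_factors = Embedding(n_users, n_factors)
            self.movie_factors = Embedding(n_movies, n_factors)
            self.y_range = y_range

        def forward(self, x):
            users = self.user_factors(x[:, 0])
            movies = self.movie_factors(x[:, 1])
            return sigmoid_range((users * movies).sum(dim=1), *self.y_range)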
01:58:48.160 | And we can see: yeah, not really any better. It's worth a try; normally this is a little bit better, but it always depends
01:58:56.560 | on when you run it, so I'll just run it a second time... well, it's worth a look. Now, there is
01:59:04.840 | something else we can do though which is that if we look back at our little Excel version
01:59:14.240 | the thing is here when we multiply you know these latent factors by these latent factors
01:59:22.080 | and add them up it's not really taking account of the fact that this user may just rate movies
01:59:31.220 | really badly in general regardless of what kind of movie they are and this movie might
01:59:38.260 | be just a great movie in general just everybody likes it regardless of what kind of stuff
01:59:42.960 | they like and so it'd be nice to be able to represent this directly and we can do that
01:59:48.400 | using something we've already learned about which is bias we could have another single
01:59:53.400 | number for each movie which we just add, and another single number for each user which
01:59:59.440 | we just add, right. And we've already seen this for linear models, you know, this idea that
02:00:04.200 | it's nice to be able to add a bias value so let's do that so that means that we're going
02:00:13.280 | to need another embedding for each user which is a size one it's just a single number we're
02:00:18.520 | going to add so in other words it's just an array lookup but remember to do an array lookup
02:00:24.160 | that we can kind of take a gradient of we have to say embedding so we do the same thing
02:00:29.420 | for movie bias. And so then all of this is identical to before, and we just add this one
02:00:36.440 | extra line, which is to add the user and movie bias values.
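A sketch of the model with bias (keepdim=True keeps the summed dot product as a column so the bias lookups can be added to it):

    class DotProductBias(Module):
        def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
            self.user_factors = Embedding(n_users, n_factors)
            self.user_bias = Embedding(n_users, 1)
            self.movie_factors = Embedding(n_movies, n_factors)
            self.movie_bias = Embedding(n_movies, 1)
            self.y_range = y_range

        def forward(self, x):
            users = self.user_factors(x[:, 0])
            movies = self.movie_factors(x[:, 1])
            res = (users * movies).sum(dim=1, keepdim=True)
            res += self.user_bias(x[:, 0]) + self.movie_bias(x[:, 1])
            return sigmoid_range(res, *self.y_range)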
02:00:45.680 | So let's train that and see how it goes. Well, that was a shame, it got worse: we used to have it finish here at 0.87,
02:00:59.200 | 0.88, 0.89, so it's a little bit worse. Why is that? Well, if you look earlier on it was quite
02:01:09.240 | a bit better, it was 0.86, so it's overfitting very quickly. And so what we need to do is we need
02:01:19.720 | to find a way that we can train more epochs without overfitting now we've already learned
02:01:26.760 | about data augmentation right like rotating images and changing their brightness and color
02:01:31.640 | and stuff but it's not obvious how we would do data augmentation for collaborative filtering
02:01:37.720 | right so how are we going to make it so that we can train lots of epochs without overfitting
02:01:46.480 | and to do that we're going to have to use something called regularization and regularization
02:01:51.360 | is a set of techniques which basically allow us to use models with lots of parameters and
02:01:57.520 | train them for a long period of time but penalize them effectively for overfitting or in some
02:02:04.540 | way cause them to try to stop overfitting. And so that is what we will look at next week.
02:02:13.160 | okay well thanks everybody so there's a lot to take in there so please remember to practice
02:02:18.920 | to experiment listen to the lessons again because you know for the next couple of lessons
02:02:26.080 | things are going to really quickly build on top of all the stuff that we've learned so
02:02:30.440 | please be as comfortable with it as you can feel free to go back and re-listen and go
02:02:36.080 | through and follow through the notebooks and then try to recreate as much of them yourself
02:02:40.920 | thanks everybody and I will see you next week or see you in the next lesson whenever you