Lesson 6 - Deep Learning for Coders (2020)
Chapters
0:00
31:10 Multi-label classification
48:35 One hot encoding
69:08 Regression
109:05 Embedding
110:40 Collaborative filtering from scratch
121:40 Regularisation (Data augmentation for regression)
00:00:00.000 |
Hi everybody and welcome to lesson 6 where we're going to continue looking at training 00:00:08.400 |
convolutional neural networks for computer vision. And so we last looked at this the 00:00:13.640 |
lesson before last and specifically we were looking at how to train an image classifier 00:00:19.360 |
to pick out breeds of pet, one of 37 breeds of pet. And we've gotten as far as training 00:00:26.660 |
a model but we also had to look and figure out what loss function was actually being 00:00:32.460 |
used in this model. And so we talked about cross entropy loss which is actually a really 00:00:37.660 |
important concept and some of the things we're talking about today depend a bit on you understanding 00:00:42.560 |
this concept. So if you were at all unsure about where we got to with that go back and 00:00:48.480 |
have another look, at the questionnaire in particular, and make sure that you're comfortable 00:00:53.840 |
with cross entropy loss. If you're not you may want to go back to the 04 MNIST basics 00:01:00.240 |
notebook and remind yourself about MNIST loss because it's very very similar that's what 00:01:04.640 |
we built on to build up cross entropy loss. So having trained our model the next thing 00:01:11.840 |
we're going to do is look at model interpretation. There's not much point having a model if you 00:01:17.200 |
don't see what it's doing. And one thing we can do is use a confusion matrix which in 00:01:25.160 |
this case is not terribly helpful. There's kind of a few too many and it's not too bad 00:01:29.320 |
we can kind of see some colored areas. And so this diagonal here are all the ones that 00:01:33.280 |
are classified correctly. So for Persians there were 31 classified as Persians. But 00:01:40.440 |
we can see there's some bigger numbers here, like Siamese: six were misclassified, they're 00:01:45.160 |
actually considered a Birman. But when you've got a lot of classes like this it might 00:01:53.440 |
be better instead to use the most confused method and that tells you the combinations 00:02:02.560 |
which it got wrong the most often. In other words which numbers are the biggest so actually 00:02:06.960 |
here's the biggest one, ten, and that's confusing an American pit bull terrier with a Staffordshire 00:02:13.960 |
bull terrier, that's happened ten times. And Ragdoll is getting confused with the Birman 00:02:19.880 |
eight times. And so I'm not a dog or cat expert and so I don't know what this stuff means 00:02:27.880 |
so I looked it up on the internet and I found that American pit bull terriers and Staffordshire 00:02:32.320 |
bull terriers are almost identical, I think they sometimes have a slightly different 00:02:36.600 |
colored nose if I remember correctly. And Ragdolls and Birmans are types of cats that are so 00:02:42.920 |
similar to each other that there are whole long threads on cat lover forums about is this 00:02:47.040 |
a Ragdoll or is this a Birman, with experts disagreeing with each other. So no surprise 00:02:52.600 |
that these things are getting confused. So when you see your model making sensible mistakes 00:02:59.620 |
the kind of mistakes that humans make that's a pretty good sign that it's picking up the 00:03:03.440 |
right kind of stuff and that the kinds of errors you're getting also might be pretty 00:03:07.600 |
tricky to fix. But you know let's see if we can make it better. And one way to try and 00:03:15.880 |
make it better is to improve our learning rate. Why would we want to improve the learning 00:03:21.920 |
rate? Well one thing we'd like to do is to try to train it faster, get more done in fewer 00:03:28.080 |
epochs. And so one way to do that would be to call our fine-tune method with a higher 00:03:35.360 |
learning rate. So last time we used the default which I think is, there you go, 1e-2. 00:03:49.800 |
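As a rough sketch, the call being described might look like this, assuming `dls` is the pets DataLoaders built earlier in the lesson (not shown in this excerpt):

```python
from fastai.vision.all import *

# assumes `dls` is the pets DataLoaders from earlier in the lesson
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1, base_lr=0.1)   # explicitly pass a (deliberately too high) learning rate
```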
And so if we pump that up to 0.1 it's going to jump further each time. So remember the 00:03:54.680 |
learning rate and if you've forgotten this have a look again at notebook 4. That's the 00:03:59.520 |
thing we multiply the gradients by to decide how far to step. And unfortunately when we 00:04:05.820 |
use this higher learning rate the error rate goes from 0.083 to 0.83, so we're getting 00:04:16.760 |
the vast majority of them wrong now. So that's not a good sign. So why did that happen? Well 00:04:22.400 |
what happened is rather than this gradual move towards the minimum, we had this thing 00:04:30.840 |
where we step too far and we get further, further away. So when you see this happening 00:04:38.580 |
which looks in practice like this, your error rate getting worse right from the start, that's 00:04:44.360 |
a sign your learning rate is too high. So we need to find something just right, not 00:04:49.160 |
too small that we take tiny jumps and it takes forever and not too big that we you know either 00:04:55.680 |
get worse and worse or we just jump backwards and forwards quite slowly. So to find a good 00:05:01.840 |
learning rate we can use something that the researcher Leslie Smith came up with called 00:05:06.800 |
the learning rate finder. And the learning rate finder is pretty simple. All we do, remember 00:05:13.800 |
when we do stochastic gradient descent, we look at one mini batch at a time or a few 00:05:19.080 |
images in this case at a time, find the gradient for that set of images for the mini batch 00:05:24.800 |
and jump, step our weights based on the learning rate and the gradient. Well what Leslie Smith 00:05:32.600 |
said was okay let's do the very first mini batch at a really, really low learning rate 00:05:37.120 |
like 10 to the minus 7 and then let's increase it by a little bit, like maybe 25% higher, 00:05:46.560 |
and do another step and then 25% higher and do another step. So these are not epochs, 00:05:53.880 |
these are just single mini batches, and then we can plot on this chart here. Okay 00:05:57.960 |
at 10 to the minus 7 what was the loss and at 25% higher than that what was the loss 00:06:04.000 |
and then 25% higher than that what was the loss. And so not surprisingly if you do that 00:06:08.080 |
at the low learning rates the loss doesn't really come down because the learning rate 00:06:11.960 |
is so small that these steps are tiny, tiny, tiny. And then gradually we get to the point 00:06:21.320 |
where they're big enough to make a difference and the loss starts coming down because we've 00:06:24.600 |
plotted here the learning rate against the loss, right. So here the loss is coming down 00:06:31.120 |
as we continue to increase the learning rate the loss comes down until we get to a point 00:06:36.160 |
where our learning rates too high and so it flattens out and then oh it's getting worse 00:06:40.720 |
again so here's the point above like 0.1 where we're in this territory. So what we really 00:06:49.560 |
want is somewhere around here where it's kind of nice and steep. So you can actually ask 00:06:57.540 |
it, the learning rate finder, so we used lr_find to get this plot, and we can get back 00:07:03.400 |
from it the minimum and steep. And so steep is where it was steepest, and the steepest 00:07:08.720 |
point was 5e-3, and the minimum point divided by 10, that's quite a good rule of thumb, is 00:07:17.280 |
1e-2. So somewhere around this range might be pretty good. So each time you run it you'll 00:07:25.960 |
get different values; the time we ran it we thought that maybe 3e-3 would be 00:07:30.080 |
good so we picked that, and you'll notice the learning rate finder plot uses a logarithmic scale, so 00:07:35.920 |
be careful interpreting it. So we can now rerun the training, setting 00:07:41.720 |
the learning rate to the number we picked from the learning rate finder, which in this case 00:07:45.680 |
was 3e-3. 00:07:53.360 |
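A minimal sketch of that workflow, assuming `dls` exists; in the fastai version used in the course, `lr_find` returns the two suggestions discussed above:

```python
# assumes `dls` is the pets DataLoaders
learn = cnn_learner(dls, resnet34, metrics=error_rate)
lr_min, lr_steep = learn.lr_find()      # plots loss against learning rate, one mini batch per step
print(f"minimum/10: {lr_min:.2e}, steepest point: {lr_steep:.2e}")

# start again with a fresh model and train at the rate we picked from the plot
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(2, base_lr=3e-3)
```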
And we can see now that's looking good: we've got an 8.3% error rate after 3 epochs. So this idea of the learning rate finder is very straightforward, I can describe 00:08:00.560 |
it to you in a couple of sentences it doesn't require any complex math and yet it was only 00:08:05.200 |
invented in 2015 which is super interesting right it just shows that there's so many interesting 00:08:13.960 |
things just to learn and discover. I think part of the reason perhaps for this it took 00:08:18.520 |
a while is that you know engineers kind of love using lots and lots of computers. So 00:08:25.680 |
before the learning rate finder came along people would like run lots of experiments 00:08:29.320 |
on big clusters to find out which learning rate was the best rather than just doing a 00:08:33.080 |
batch at a time. And I think partly also the idea of having a thing where a human is in 00:08:39.240 |
the loop where we look at something and make a decision is also kind of unfashionable a 00:08:43.280 |
lot of folks in research and industry love things which are fully automated. But anyway 00:08:48.400 |
it's great we now have this tool because it makes our life easier and fastai is certainly 00:08:55.400 |
the first library to have this and I don't know if it's still the only one to have it 00:08:58.400 |
built in, at least to the base library. So now we've got a good learning rate how 00:09:05.760 |
do we fine-tune the weights so so far we've just been running this fine-tune method without 00:09:11.360 |
thinking much about what it's actually doing. But we did mention in chapter one lesson one 00:09:20.600 |
briefly basically what's happening with fine-tune what is transfer learning doing. And before 00:09:28.820 |
we look at that let's take a question. Is the learning rate plot in LR find plotted against 00:09:37.880 |
one single mini batch? No, it's actually just the standard kind of walking 00:09:49.800 |
through the data loader, so just getting the usual mini batches of 00:09:57.400 |
the shuffled data. And so it's kind of just normal training and the only thing that's 00:10:01.600 |
being different is that we're increasing the learning rate a little bit after each mini 00:10:07.160 |
batch and keeping track of it. Along with that: is the network reset to the initial 00:10:18.720 |
status after each trial? No certainly not we actually want to see how it learns we want 00:10:26.520 |
to see it improving so we don't reset it to its initial state until we're done. So at 00:10:33.640 |
the end of it we go back to the random weights we started with or whatever the weights were 00:10:37.440 |
at the time we ran this. So what we're seeing here is something that's actually the actual 00:10:45.600 |
learning that's happening as we at the same time increase the learning rate. Why would 00:10:53.640 |
an ideal learning rate found with a single mini batch at the start of training keep being 00:10:58.000 |
a good learning rate even after several epochs and further loss reductions? Great question 00:11:04.520 |
it absolutely wouldn't so let's look at that too shall we? And shall we ask one more? This is an 00:11:14.720 |
important point, so yes, it's very important. For the learning rate finder why use the steepest 00:11:20.440 |
and not the minimum? We certainly don't want the minimum because the minimum is the point 00:11:26.640 |
at which it's not learning anymore. Right so so this flat section at the bottom here 00:11:31.800 |
means in this mini batch it didn't get better. So we want the steepest because that's the 00:11:35.920 |
mini batch where it got the most improved and that's what we want we want the weights 00:11:40.280 |
to be moving as fast as possible. As a rule of thumb though we do find that the minimum 00:11:47.000 |
divided by 10 works pretty well that's Sylvain's favorite approach and he's generally pretty 00:11:52.760 |
spot-on with that so that's why we actually print out those two things. lr_min is actually 00:11:59.000 |
the minimum divided by 10, and lr_steep suggests the steepest point. Great good 00:12:08.480 |
questions all. So remind ourselves what transfer learning does. So with transfer learning remember 00:12:16.880 |
what our neural network is. It's a bunch of linear models basically with activation 00:12:26.360 |
functions between them, and our activation functions are generally ReLUs, rectified linear 00:12:32.280 |
units. If any of this is fuzzy have a look at the 04 notebook again to remind yourself. 00:12:41.800 |
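For example, a minimal PyTorch sketch of that structure (the layer sizes here are arbitrary):

```python
import torch.nn as nn

# linear layers with ReLU activation functions between them
net = nn.Sequential(
    nn.Linear(28*28, 100),
    nn.ReLU(),
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Linear(50, 10),
)
```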
And so each of those linear layers has a bunch of parameters, so the whole neural network has 00:12:47.320 |
a bunch of parameters. And so after we train a neural network on something like ImageNet 00:12:56.120 |
we have a whole bunch of parameters that aren't random anymore they're actually useful for 00:12:59.880 |
something. And we've also seen that the early layers seem to learn about fairly general ideas 00:13:07.080 |
like gradients and edges, and the later layers learn about more sophisticated ideas like 00:13:12.520 |
what do eyes look like, or what does fur look like, or what does text look like. So with transfer 00:13:18.520 |
learning we take a model so in other words a set of parameters which has already been 00:13:23.520 |
trained on something like ImageNet. We throw away the very last layer because the very 00:13:28.960 |
last layer is the bit that specifically says which one of those, in the case of ImageNet, 00:13:33.600 |
1000 categories this image is in. We throw that away and we replace it with random weights 00:13:41.240 |
sometimes with more than one layer of random weights and then we train that. Now yes. 00:13:51.200 |
Oh I just wanted to make a comment and that's that I think the learning rate finder I think 00:13:58.200 |
after you learn about it the idea almost seems kind of so simple or approximate that it's 00:14:04.200 |
like wait this shouldn't work like or you know shouldn't you have to do something more 00:14:09.000 |
more complicated or more precise that it's like I just want to highlight that this is 00:14:13.160 |
a very surprising result that some kind of a such a simple approximate method would be 00:14:19.400 |
so helpful. Yeah I would particularly say it's surprising to people who are not practitioners 00:14:25.680 |
or have not been practitioners for long. I've noticed that a lot of my students at USF have 00:14:36.680 |
a tendency to kind of jump in and try doing something very complex where they account 00:14:40.320 |
for every possible imperfection from the start and it's very rare that that's necessary so 00:14:45.560 |
one of the cool things about this is it's a good example of trying the easiest thing first 00:14:50.440 |
and seeing how well it works. And this was a very big innovation when it came out that 00:14:55.920 |
I think it's kind of easy to take for granted now but this was super super helpful when 00:15:00.600 |
it came out, and it was also nearly entirely ignored. None of the research 00:15:07.320 |
community cared about it and it wasn't until fast AI I think in our first course talked 00:15:11.680 |
about it that people started noticing and we had quite a few years in fact it's still 00:15:16.880 |
a bit the case where super fancy researchers still don't know about the learning rate finder 00:15:22.200 |
and you know kept getting beaten by first-lesson fast AI students on practical problems 00:15:30.160 |
because they can pick learning rates better and they can do it without a cluster of thousands 00:15:35.360 |
of computers. Okay so transfer learning so we've got our pre-trained network and so it's 00:15:44.280 |
really important every time you hear the word pre-trained network you're thinking a bunch 00:15:47.840 |
of parameters which have particular numeric values and go with a particular architecture 00:15:54.200 |
like resnet 34. We've thrown away the final layer and replaced it with random numbers 00:16:02.600 |
and so now we want to train to fine-tune this set of parameters for a new set of images 00:16:08.920 |
in this case pets. So fine-tune is the method we call to do that and to see what it does 00:16:18.440 |
we can go learn.fine_tune?? and we can see the source code, and here is the signature of the 00:16:27.280 |
function and so the first thing that happens is we call freeze. So freeze is actually the 00:16:36.560 |
method which makes it so only the last layer's weights will get stepped by the optimizer. 00:16:45.600 |
So the gradients are calculated just for those last layers of parameters and the step is 00:16:49.580 |
done just for those last layer of parameters. So then we call fit and we fit for some number 00:16:57.720 |
of epochs which by default is 1. We don't change that very often and what that fit is doing 00:17:07.400 |
is it's just fitting those randomly added weights which makes sense right they're the 00:17:12.040 |
ones that are going to need the most work because at the time which we add them they're 00:17:16.680 |
doing nothing at all they're just random. So that's why we spend one epoch trying to 00:17:22.240 |
make them better. After you've done that you now have a model which is much better than 00:17:30.180 |
we started with it's not random anymore. All the layers except the last are the same as 00:17:35.240 |
the pre-trained network the last layer has been tuned for this new data set. So the closer 00:17:41.040 |
you get to the right answer as you can kind of see in this picture the smaller the steps 00:17:46.360 |
you want to take, generally speaking. The next 00:17:51.440 |
thing we do is we divide our learning rate by 2 and then we unfreeze so that means we 00:17:56.760 |
make it so that all the parameters can now be stepped and all of them will have gradients 00:18:01.660 |
calculated, and then we fit for some more epochs, and this is something we have to pass to the 00:18:07.600 |
method. And so that's now going to train the whole network. 00:18:16.860 |
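Roughly, the steps just described amount to something like this simplified sketch (the real method also handles details such as discriminative learning rates, covered shortly); it assumes `learn` already exists:

```python
# a simplified sketch of what fine_tune does, assuming `learn` already exists
base_lr = 3e-3
learn.freeze()                     # only the head's parameters will be stepped
learn.fit_one_cycle(1, base_lr)    # one epoch to train the randomly added head
base_lr /= 2                       # halve the learning rate
learn.unfreeze()                   # now every parameter gets gradients and steps
learn.fit_one_cycle(6, base_lr)    # train the whole network for the requested epochs
```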
So if we want to, we can kind of do this by hand, right, and actually cnn_learner will by default freeze 00:18:27.200 |
the parameters for us, so we actually don't have to call freeze. So if we just create 00:18:31.900 |
a learner and then fit for a while this is three epochs of training just the last layer 00:18:40.100 |
and so then we can just manually do it ourselves unfreeze. And so now at this point as the 00:18:46.480 |
question earlier suggested maybe this is not the right learning rate anymore so we can 00:18:51.280 |
run LR find again and this time you don't see the same shape you don't see this rapid 00:18:58.160 |
drop because it's much harder to train a model that's already pretty good. But instead you 00:19:03.560 |
just see a very gentle little gradient. So generally here what we do is we kind of try 00:19:09.840 |
to find the bit where it starts to get worse again, which is about here, and 00:19:14.320 |
go about a multiple of 10 less than that, so about 1e-5 I would guess, 00:19:19.160 |
which yep that's what we picked. So then after unfreezing finding our new learning rate and 00:19:25.320 |
then we can do a bunch more and so here we are we're getting down to 5.9 percent error 00:19:32.360 |
which is okay but there's better we can do. And the reason we can do better is 00:19:39.200 |
that at this point here we're training the whole model at a 1e-5, so 10 to the minus 00:19:45.800 |
5 learning rate which doesn't really make sense because we know that the last layer 00:19:52.320 |
is still not that great it's only had three epochs of training from random so it probably 00:19:56.640 |
needs more work. We know that the second last layer was probably pretty specialized to image 00:20:02.720 |
net and less specialized to pet breeds so that probably needs a lot of work. Whereas 00:20:07.280 |
the early layers, the kind of gradients and edges, probably don't need to be changed much 00:20:11.880 |
at all. But what we'd really like is to have a small learning rate for the early layers 00:20:16.760 |
and a bigger learning rate for the later layers. And this is something that we developed at 00:20:22.160 |
fast AI and we call it discriminative learning rates. And Jason Yosinski actually is a guy 00:20:30.400 |
who wrote a great paper that some of these ideas are based on which is he actually showed 00:20:35.080 |
that different layers of the network really want to be trained at different rates. Although 00:20:39.400 |
he didn't kind of go as far as trying that out and seeing how it goes it was more of 00:20:44.240 |
a theoretical thing. So in fast AI if we want to do that we can pass to our learning rate 00:20:51.040 |
rather than just passing a single number we can pass a slice. Now a slice is a special 00:20:58.640 |
built-in feature of Python. It's just an object which basically can have a few different numbers 00:21:04.640 |
in it. In this case we're passing it two numbers. And the way we read those basically 00:21:11.160 |
what this means in fast AI as a learning rate is the very first layer will have this learning 00:21:17.320 |
rate 10 to the minus 6. The very last layer will be 10 to the minus 4. And the layers 00:21:22.800 |
between the two will be kind of equal multiples. So they'll kind of be equally spaced learning 00:21:28.720 |
rates from the start to the end. So here we can see basically doing our kind of own version 00:21:38.160 |
of fine-tune. We create the learner, we fit with that automatically frozen version, we 00:21:45.520 |
unfreeze, we fit some more with discriminative learning rates. 00:21:51.200 |
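A sketch of that hand-rolled version with discriminative learning rates, assuming `dls` exists:

```python
# assumes `dls` is the pets DataLoaders
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fit_one_cycle(3, 3e-3)                        # cnn_learner freezes the body by default
learn.unfreeze()
learn.fit_one_cycle(12, lr_max=slice(1e-6, 1e-4))   # earliest layers get 1e-6, last layers 1e-4
```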
And so when we do that you can see this works a lot better: we're getting down to 5.3, 5.1, 5.4 percent error. So that's pretty great. One thing we'll notice 00:21:59.840 |
here is that we did kind of overshoot a bit. It seemed like more like epoch number 8 was 00:22:05.280 |
better. So kind of back before you know well actually let me explain something about fit 00:22:12.200 |
one cycle. So fit one cycle is a bit different to just fit. So what fit one cycle does is 00:22:20.920 |
it actually starts at a low learning rate. It increases it gradually for the first one-third 00:22:29.360 |
or so of the batches until it gets to a high learning rate. The highest, this is why it's 00:22:35.360 |
called LR max. It's the highest learning rate we get to. And then for the remaining two-thirds 00:22:40.520 |
or so of the batches it gradually decreases the learning rate. And the reason for that 00:22:46.680 |
is just that well actually it's kind of like empirically researchers have found that works 00:22:51.640 |
the best. In fact this was developed again by Leslie Smith the same guy that did the 00:22:55.340 |
learning rate finder. Again it was a huge step you know it really dramatically accelerated 00:23:01.680 |
the speed at which we can train neural networks and also made them much more accurate. And 00:23:06.320 |
again the academic community basically ignored it. In fact the key publication that developed 00:23:13.360 |
this idea did not even pass peer review. And so the reason I mention this now 00:23:20.960 |
is to say that we don't really just want to go back and pick the model that was 00:23:25.280 |
trained back here, because we could probably do better, because we really want to pick a 00:23:30.880 |
model that's got a low learning rate. But what I would generally do here is change 00:23:35.640 |
this 12 to an 8 because this is looking good. And then I would retrain it from scratch. 00:23:42.680 |
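The loss plot mentioned next comes from the recorder; a minimal sketch:

```python
learn.recorder.plot_loss()   # training and validation loss recorded during training
```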
Normally you'd find a better result. You can plot the loss and you can see how the training 00:23:49.580 |
and validation loss moved along. And you can see here that you know the error rate was 00:23:59.760 |
starting to get worse here. And what you'll often see is often the validation loss will 00:24:07.860 |
get worse a bit before the error rate gets worse. We're not really seeing it so much 00:24:13.480 |
in this case but the error rate and the validation loss don't always or they're not always kind 00:24:17.740 |
of in lockstep. So what we're plotting here is the loss but you actually kind of want 00:24:23.840 |
to look to see mainly what's happening with the error rate because that's actually the 00:24:26.920 |
thing we care about. Remember the loss is just like an approximation of what we care 00:24:31.480 |
about that just happens to have a gradient that works out nicely. So how do you make 00:24:42.200 |
it better now? We're already down to just 5.4 or if we'd stopped a bit earlier maybe 00:24:49.640 |
we could get down to 5.1 or less error. On 37 categories that's pretty remarkable. That's 00:24:56.080 |
a very, very good pet breed predictor. If you want to do something even better you could 00:25:02.720 |
try creating a deeper architecture. So a deeper architecture is just literally putting more 00:25:11.080 |
pairs of activation function, also known as a non-linearity, followed by these little 00:25:15.880 |
linear models put more pairs onto the end. And basically the number of these sets of 00:25:21.800 |
layers you have is the number that you'll see at the end of an architecture. So there's 00:25:26.960 |
ResNet 18, ResNet 34, ResNet 50, so forth. Having said that you can't really pick ResNet 00:25:35.760 |
19 or ResNet 38. I mean you could make one but nobody's created a pre-trained version 00:25:45.000 |
of that for you so you won't be able to do any fine-tuning. So like you can theoretically 00:25:50.120 |
create any number of layers you like but in practice most of the time you'll want to pick 00:25:57.800 |
a model that has a pre-trained version. So you kind of have to select from the sizes 00:26:02.760 |
people have pre-trained and there's nothing special about these sizes they're just ones 00:26:06.580 |
that people happen to have picked out. For the bigger models there's more parameters 00:26:13.520 |
and more gradients that are going to be stored on your GPU and you will get used to the idea 00:26:19.480 |
of seeing this error unfortunately out of memory. So that's not out of memory in your 00:26:26.360 |
RAM, that's out of memory in your GPU. CUDA is referring to the language and the system 00:26:32.760 |
used for your GPU. So if that happens unfortunately you actually have to restart your notebook 00:26:38.540 |
so that's kernel restart and try again and that's a really annoying thing but such is 00:26:45.480 |
life. One thing you can do if you get an out-of-memory error is, after your cnn_learner call, 00:26:52.720 |
add this magic incantation, to_fp16. What that does is it uses, for most of the operations, 00:27:01.500 |
numbers that use half as many bits as usual, so they're less accurate; this is half precision 00:27:07.000 |
floating point, or FP16, and that will use less memory, and on pretty much any NVIDIA card 00:27:18.400 |
created in 2020 or later and some more expensive cards even created in 2019 that's often going 00:27:27.080 |
to result in a two to three times speed up in terms of how long it takes as well. So 00:27:33.000 |
here if I add in to_fp16 I will often be seeing much faster training, and in this case 00:27:42.040 |
what I actually did is I switched to a ResNet-50 which would normally take about twice as long 00:27:47.200 |
and my per-epoch time has gone from 25 seconds to 26 seconds. So the fact that we used a 00:27:54.780 |
much bigger network and it was no slower is thanks to to_fp16. 00:28:01.120 |
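A sketch of that combination, mixed precision plus a deeper architecture, assuming `dls` exists:

```python
from fastai.vision.all import *

# assumes `dls` is the pets DataLoaders
learn = cnn_learner(dls, resnet50, metrics=error_rate).to_fp16()   # half-precision training
learn.fine_tune(6, freeze_epochs=3)
```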
But you'll see the error rate hasn't improved, it's pretty similar to what it was, and so it's important to realize that 00:28:07.260 |
just because we increase the number of layers it doesn't always get better. So it tends 00:28:13.180 |
to require a bit of experimentation to find what's going to work for you and of course 00:28:19.280 |
don't forget the trick is use small models for as long as possible and to do all of your 00:28:25.920 |
cleaning up and testing and so forth and wait until you're all done to try some bigger models 00:28:30.800 |
because they're going to take a lot longer. Okay questions. How do you know or suspect 00:28:39.740 |
when you can quote do better? You have to always assume you can do better because you 00:28:48.320 |
never know. So you just have to I mean part of it though is do you need to do better or 00:28:54.760 |
do you already have a good enough result to handle the actual task you're trying to do. 00:29:00.480 |
People often spend too much time fiddling around with their models rather than actually 00:29:04.440 |
trying to see whether it's already going to be super helpful. So the sooner you can actually 00:29:09.560 |
try to use your model to do something practical the better. But yeah how much can you improve 00:29:16.560 |
it? Who knows? I you know go through the techniques that we're teaching this course and try them 00:29:23.520 |
and see which ones help. Unless it's a problem that somebody has already tried before and 00:29:30.960 |
written down their results in a paper or a Kaggle competition or something, there's no 00:29:34.920 |
way to know how good it can get. So don't forget after you do the questionnaire to check out 00:29:44.920 |
the further research section. And one of the things we've asked you to do here is to read 00:29:49.880 |
a paper. So find the learning rate finder paper and read it and see if you can kind 00:29:57.040 |
of connect what you read up to the things that we've learned in this lesson. And see 00:30:02.120 |
if you can maybe even implement your own learning rate finder you know as manually as you need 00:30:10.000 |
to see if you can get something that you know based on reading the paper to work yourself. 00:30:16.940 |
You can even look at the source code of fastai's learning rate finder of course. And then can 00:30:23.520 |
you make this classifier better? And so this is further research right? So maybe you can 00:30:28.200 |
start doing some reading to see what else could you do. Have a look on the forums see 00:30:33.200 |
what people are trying. Have a look on the book website or the course website to see 00:30:37.600 |
what other people have achieved and what they did and play around. So we've got some tools 00:30:43.480 |
in our toolbox now for you to experiment with. So that is that is pet breeds that is a you 00:30:53.760 |
know a pretty tricky computer vision classification problem. And we kind of have seen most of 00:31:01.960 |
the pieces of what goes into the training of it. We haven't seen how to build the actual 00:31:05.360 |
architecture but other than that we've kind of worked our way up to understanding what's 00:31:09.820 |
going on. So let's build from there into another kind of data set one that involves multi-label 00:31:19.800 |
classification. So what's multi-label classification? Well maybe let's look at an example. 00:31:31.000 |
Here is a multi-label data set where you can see that it's not just one label on each image 00:31:36.360 |
but sometimes there's three, bicycle, car, person. I don't actually see the car here 00:31:41.480 |
I guess it's being dropped out. So a multi-label data set is one where you still got one image 00:31:47.760 |
per row but you can have zero one two or more labels per row. So we're going to have a think 00:31:55.280 |
about and look at how we handle that. But first of all let's take another question. 00:32:01.400 |
Does dropping floating point number precision, switching from FP32 to FP16, have an impact 00:32:10.640 |
on the final result? Yes it does. Often it makes it better believe it or not. It seems like you know 00:32:23.920 |
it's doing a little bit of rounding off, is one way to think of it, dropping some of that 00:32:28.200 |
precision. And so that creates a bit more bumpiness, a bit more uncertainty, bit more 00:32:35.280 |
you know of a stochastic nature. And you know when you introduce more slightly random stuff 00:32:41.080 |
into training it very often makes it a bit better. And so yeah FP16 training often gives 00:32:47.800 |
us a slightly better result but I you know I wouldn't say it's generally a big deal either 00:32:53.320 |
way and certainly it's not always better. Would you say this is a bit of a pattern in 00:32:58.240 |
deep learning, this less exact and stochastic way? For sure, not just in deep learning but machine 00:33:09.320 |
learning more generally. You know there's been some interesting research looking at 00:33:13.840 |
like matrix factorization techniques which if you want them to go super fast you can use 00:33:17.840 |
lots of machines, you can use randomization, and when you then use the results you 00:33:23.920 |
often find you actually get better outcomes. Just a brief plug for the fast AI computational 00:33:30.600 |
linear algebra course which talks a little bit about randomness. Does it really? Well that 00:33:37.160 |
sounds like a fascinating course and look at that it's number one hit here on Google 00:33:43.040 |
so easy to find. Well, by somebody called Rachel Thomas. Hey, that person's got the same name 00:33:49.920 |
as you Rachel Thomas. All right so how are we going to do multi-label classification? 00:33:58.440 |
So let's look at a data set called Pascal which is a pretty famous data set. We'll look 00:34:02.640 |
at the version that goes back to 2007 been around for a long time. And it comes with 00:34:08.400 |
a CSV file which we will read in CSV is comma separated values and let's take a look. Each 00:34:15.320 |
row has a file name, one or more labels and something telling you whether it's in the 00:34:21.280 |
validation set or not. So the list of categories in each image is a space delimited string 00:34:28.020 |
but it doesn't have a horse person it has a horse and a person. PD here stands for pandas. 00:34:36.000 |
Pandas is a really important library for any kind of data processing and you use it all 00:34:43.520 |
the time in machine learning and deep learning. So let's have a quick chat about it. Not a 00:34:47.720 |
real panda it's the name of a library and it creates things called data frames. That's 00:34:52.720 |
what the DF here stands for and a data frame is a table containing rows and columns. Pandas 00:34:58.680 |
can also do some slightly more sophisticated things than that but we'll treat it that way 00:35:02.240 |
for now. So you can read in a data frame by saying PD for pandas. Pandas read CSV, give 00:35:08.000 |
it a file name, you've now got a data frame you can call head to see the first few rows 00:35:12.400 |
of it for instance. A data frame has an iloc, integer location, property which you can index 00:35:22.320 |
into as if it was an array, in fact it looks just like numpy. So colon means every row remember 00:35:30.360 |
it's row comma column and zero means zeroth column and so here is the first column of 00:35:36.080 |
the data frame. You can do the exact opposite so the zeroth row and every column is going 00:35:43.120 |
to give us the first row and you can see the row has column headers and values. So it's 00:35:49.040 |
a little bit different to numpy and remember if there's a comma colon or a bunch of comma 00:35:54.680 |
colons at the end of indexing in numpy or pytorch or pandas, whatever, you can get rid 00:36:00.960 |
of it and these two are exactly the same. You could do the same thing here by grabbing 00:36:08.020 |
the column by name; the first column is fname so you can say df['fname'] and you get that first 00:36:13.360 |
column. You can create new columns so here's a tiny little data frame I've created from 00:36:18.120 |
a dictionary and I could create a new column by for example adding two columns and you 00:36:25.340 |
can see there it is. So it's like a lot like numpy or pytorch except you have this idea 00:36:31.680 |
of kind of rows and named columns, and so it's all about kind of tabular data. 00:36:41.080 |
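A short sketch of the pandas operations described above (the CSV name and `path` are assumptions standing in for the lesson's Pascal download):

```python
import pandas as pd

df = pd.read_csv(path/'train.csv')         # path is assumed to point at the Pascal data
df.head()                                   # first few rows of the DataFrame
df.iloc[:, 0]                               # every row, column 0
df.iloc[0, :]                               # row 0, every column (same as df.iloc[0])
df['fname']                                 # grab a column by name
tmp_df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
tmp_df['c'] = tmp_df['a'] + tmp_df['b']     # create a new column from two others
```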
I find its API pretty unintuitive a lot of people do but it's fast and powerful so it 00:36:46.680 |
takes a while to get familiar with it but it's worth taking a while and the creator 00:36:50.020 |
of pandas wrote a fantastic book called Python for data analysis which I've read both versions 00:36:57.380 |
and I found it fantastic. It doesn't just cover pandas it covers other stuff as well 00:37:01.680 |
like IPython and numpy and matplotlib so highly recommend this book. This is our table so 00:37:13.440 |
what we want to do now is construct data loaders that we can train with and we've talked about 00:37:22.000 |
the data block API as being a great way to create data loaders, but let's use this as 00:37:26.600 |
an opportunity to create a data block and then 00:37:31.120 |
data loaders for this and let's try to do it like right from square one. So let's see 00:37:38.720 |
exactly what's going on with data block. So first of all let's remind ourselves about 00:37:44.640 |
what a data set and a data loader is. A data set is an abstract idea of a class you can 00:37:53.040 |
create a data set. A data set is anything which you can index into, like so, and 00:37:59.040 |
take the length of, like so. So for example the list of the lowercase letters 00:38:05.140 |
along with a number saying which lowercase letter it is: I can index into it to get (0, 00:38:11.080 |
'a'), I can get the length of it to get 26, and so therefore this qualifies as a data set 00:38:18.440 |
and in particular data sets normally you would expect that when you index into it you would 00:38:22.600 |
get back a tuple because you've got the independent and dependent variables not necessarily always 00:38:30.800 |
just two things it could be more there could be less but two is the most common. So once 00:38:36.700 |
we have a data set we can pass it to a data loader we can request a particular batch size 00:38:46.960 |
we can shuffle or not and so there's our data loader from A we could grab the first value 00:38:52.480 |
from that iterator, and here is the shuffled result: 7 with 'h', 4 with 'e', 20 with 'u' and so forth, and so 00:39:00.580 |
remember a mini batch has a bunch of a mini batch of the independent variable and a mini 00:39:06.520 |
batch of the dependent variable. If you want to see how the two correspond to each other 00:39:12.320 |
you can use zip, so if I zip, passing in this list and then this list, so b[0] and b[1], you can 00:39:20.480 |
see what zip does in Python is it grabs one element from each of those in turn and gives 00:39:26.640 |
you back the tuples of the corresponding elements. Since we're just passing in all of the elements 00:39:34.240 |
of B to this function Python has a convenient shortcut for that which is just say star B 00:39:42.000 |
and so star means insert into this parameter list each element of B just like we did here 00:39:49.760 |
so these are the same thing. So this is a very handy idiom that we use a lot in Python zip 00:39:56.100 |
star something is kind of a way of like transposing something from one orientation to another. 00:40:07.040 |
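A minimal sketch of that dataset / data loader / zip-star demonstration:

```python
import string
from fastai.vision.all import *   # brings in fastai's DataLoader and `first`

a = list(enumerate(string.ascii_lowercase))
a[0], len(a)                       # (0, 'a') and 26: indexable and has a length, so a "dataset"

dl_a = DataLoader(a, batch_size=8, shuffle=True)
b = first(dl_a)                    # one shuffled mini batch: the numbers and the letters
list(zip(b[0], b[1]))              # pair each independent value with its dependent value
list(zip(*b))                      # the same thing: *b splats the batch into zip's arguments
```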
All right, so we've got a data set, we've got a data loader, and then what about Datasets? 00:40:12.400 |
Datasets is an object which has a training dataset and a validation dataset, so let's 00:40:17.960 |
look at one. Now normally you don't start with kind of an enumeration like this like 00:40:26.280 |
with an independent variable and a dependent variable normally you start with like a file 00:40:32.400 |
name for example and then you you kind of calculate or compute or transform your file 00:40:40.660 |
name into an image by opening it and a label by for example looking at the file name and 00:40:46.520 |
grabbing something out of it. So for example we could do something similar here this is 00:40:50.320 |
what datasets does so we could start with just the lowercase letters so this is still 00:40:56.200 |
a data set right because we can index into it and we can get the length of it although 00:41:00.280 |
it's not giving us tuples yet. So if we now pass that list to the datasets class and index 00:41:09.000 |
into it we get back the tuple and it's actually a tuple with just one item this is how Python 00:41:15.180 |
shows a tuple with one item: it puts it in parentheses with a comma and then nothing, okay. 00:41:20.880 |
So in practice what we really want to do is to say like okay we'll take this and do something 00:41:26.780 |
to compute an independent variable and do something to compute a dependent variable 00:41:30.880 |
but here's a function we could use to compute an independent variable which is to stick 00:41:34.380 |
an A on the end and our dependent variable might just be the same thing with a B on the 00:41:38.400 |
end. So here's two functions so for example now we can call datasets passing in A and 00:41:46.320 |
then we can pass in a list of transformations to do and so in this case I've just got one 00:41:54.000 |
which is this function add an A on the end so now if I index into it I don't get A anymore 00:41:58.760 |
I get AA. If you pass multiple functions then it's going to do multiple things so here I've 00:42:08.640 |
got f1 then f2, giving 'aab': that's this one, then that's this one, and you'll see this is a list of 00:42:15.520 |
lists and the reason for that is that you can also pass something like this a list containing 00:42:21.280 |
F1 a list containing F2 and this will actually take each element of A pass it through this 00:42:27.920 |
list of functions, and there's just one of them, to give you 'aa', and then start again and 00:42:34.360 |
separately pass it through this list of functions, there's just one, to get 'ab', and so this is 00:42:40.360 |
actually kind of the main way we build up independent variables and dependent variables 00:42:46.680 |
in fast AI is we start with something like a file name and we pass it through two lists 00:42:51.160 |
of functions one of them will generally kind of open up the image for example and the other 00:42:55.460 |
one will kind of parse the file name for example, and give you an independent variable and a 00:43:00.200 |
dependent variable. So you can then create a data loaders object from the datasets by passing 00:43:07.800 |
in the datasets and a batch size, and so here you can see I've got shuffled batches: 'oa', 'ia', etc. and 'ob', 'ib', etc. 00:43:16.640 |
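A sketch of the Datasets-to-DataLoaders step; note the explicit `splits` argument here is an assumption added so `.dataloaders` has a train/validation split to work with, which the spoken walkthrough glosses over:

```python
def f1(o): return o + 'a'
def f2(o): return o + 'b'

a = list(string.ascii_lowercase)
dss = Datasets(a, [[f1], [f2]], splits=RandomSplitter()(a))   # one pipeline per tuple element
dss[0]                       # ('aa', 'ab')
dls = dss.dataloaders(bs=8)
first(dls.train)             # a shuffled batch of the 'a'-suffixed and 'b'-suffixed items
```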
So this is worth studying, to make sure you understand what datasets and data loaders 00:43:23.480 |
are we don't often have to create them from scratch we can create a data block to do it 00:43:28.480 |
for us but now we can see what the data block has to do let's see how it does it so we can 00:43:35.760 |
start by creating an empty data block so an empty data block is going to take our data 00:43:41.460 |
frame we're going to go back to looking at data frame which remember was this guy and 00:43:52.800 |
so if we pass in our data frame we can now we'll now find that this data block has created 00:44:01.640 |
data sets a training and a validation data set for us and if we look at the training 00:44:08.480 |
set it'll give us back an independent variable and a dependent variable and we'll see that 00:44:13.920 |
they are both the same thing so this is the first row of the table that's actually shuffled 00:44:20.960 |
so it's a random row of the table repeated twice and the reason for that is by default 00:44:26.660 |
the data block assumes that we have two things the independent variable and the dependent 00:44:31.040 |
or the input in the target and by default it just copies it just keeps exactly whatever 00:44:36.720 |
you gave it to create the training set and the validation set by default it just randomly 00:44:43.000 |
splits the data with a 20% validation set so that's what's happened here so this is 00:44:50.160 |
not much use, and what we actually want to do, if we look at x for example, is 00:44:55.400 |
grab the fname, the file name field, because we want to open this image, that's going to 00:45:00.880 |
be our independent variable and then for the label we're gonna want this here person cat 00:45:12.280 |
so we can actually pass these as parameters get X and get Y functions that return the 00:45:20.600 |
bit of data that we want and so you can create and use a function in the same line of code 00:45:27.200 |
in Python by saying lambda; so lambda r means create a function that doesn't have a name, it's 00:45:34.080 |
going to take a parameter called R we don't even have to say return it's going to return 00:45:39.200 |
the F name column in this case and get Y is something which is a function that takes an 00:45:47.080 |
R and returns the labels column so now we can do the same thing called dblock.datasets 00:45:55.080 |
we can grab a row from that from the training set and you can see look here it is there 00:45:59.320 |
is the image file name and there is the space delimited list of labels so here's exactly 00:46:08.880 |
the same thing again but done with functions so now the one line of code above has become 00:46:15.200 |
three lines of code but it does exactly the same thing okay we don't get back the same 00:46:22.480 |
result because the training set well wait why don't we get the same result oh I know 00:46:32.920 |
why because it's randomly shuffle it's randomly picking a different validation set because 00:46:38.920 |
the random split is done differently each time so that's why we don't get the same result 00:46:44.360 |
one thing to note be careful of lambdas if you want to save this data block for use later 00:46:51.400 |
you won't be able to Python doesn't like saving things that contain lambdas so most of the 00:46:56.800 |
time in the book and the course we normally avoid lambdas for that reason, because 00:47:01.400 |
it's often very convenient to be able to save things we use the word here serialization 00:47:06.280 |
that just means basically it means saving something this is not enough to open an image 00:47:15.000 |
because we don't have the path. So rather than just using this function 00:47:20.700 |
to grab the fname column, we should actually use pathlib to go path slash train slash the fname 00:47:27.660 |
column, and then for the y, again the labels is not quite enough, we actually have to split 00:47:35.440 |
on space, but this is Python, we can use any function we like. And so then we use the same 00:47:40.240 |
three lines of code as here, and now we've got a path and a list of labels. 00:47:47.000 |
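A sketch of those named functions and the resulting DataBlock (named functions rather than lambdas, so the block can be saved later):

```python
def get_x(r): return path/'train'/r['fname']     # build the full path to the image
def get_y(r): return r['labels'].split(' ')      # space-delimited string -> list of labels

dblock = DataBlock(get_x=get_x, get_y=get_y)
dsets = dblock.datasets(df)
dsets.train[0]   # (a Path to one image, a list of labels such as ['horse', 'person'])
```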
So that's looking good. Now we want this path to be opened as an image, so in the data block API you 00:47:56.600 |
pass a blocks argument where you tell it, for each of the things in your tuple, so there's 00:48:03.080 |
two of them what kind of block do you need so we need an image block to open an image 00:48:09.680 |
and then in the past we've used a category block for categorical variables, but this time 00:48:14.920 |
we don't have a single category we've got multiple categories so we have to use a multi 00:48:19.480 |
category block. So once we do that and have a look we now have a 500 by 375 image as 00:48:27.740 |
our independent variable, and as a dependent variable we have a long list of zeros and 00:48:34.160 |
ones the long list of zeros and ones is the labels as a one hot encoded vector a rank 00:48:47.360 |
one tensor and specifically there will be a zero in every location where in the vocab 00:48:57.880 |
where there is not that kind of object in this image and a one in every location where 00:49:04.200 |
there is so for this one there's just a person so this must be the location in the vocab 00:49:09.660 |
where there's a person. Do we have any questions? So one hot encoding is a very important concept 00:49:17.640 |
and we didn't have to use it before right we could just have a single integer saying 00:49:25.800 |
which one thing is it, but when we've got lots of things, lots of potential labels, it's 00:49:32.360 |
convenient to use this one hot encoding, and it's actually what's going 00:49:36.600 |
to happen with the actual matrices anyway: when we actually compare the activations 00:49:48.440 |
of our neural network to the target, it's actually going to be comparing each one of these, okay. 00:49:57.800 |
so the categories as I mentioned is based on the vocab so we can grab the vocab from 00:50:04.200 |
our datasets object, and then we can say okay, let's look at the first row, and let's look 00:50:11.360 |
at the dependent variable and let's look for where the dependent variable is one okay and 00:50:22.200 |
then we can have a look past those indexes with a vocab and get back a list of what actually 00:50:27.160 |
was there and again each time I run this I'm going to get different results so each time 00:50:33.880 |
we run this we're going to get different results, because I called .datasets again here, 00:50:38.120 |
so it's going to give me a different train/test split, and so this time it turns out that this is actually a chair. 00:50:42.760 |
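A sketch of reading that one-hot target back through the vocab, assuming the `dsets` built above:

```python
idxs = torch.where(dsets.train[0][1] == 1.)[0]   # positions of the ones in the target
dsets.train.vocab[idxs]                          # the corresponding label names, e.g. a chair here
```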
And we have a question: shouldn't the tensor be of integers, why is it a tensor 00:50:50.800 |
of floats yeah conceptually this is a tensor of integers they can only be 0 or 1 but we 00:51:04.880 |
are going to be using a cross-entropy style loss function but we're going to actually 00:51:11.200 |
need to do floating point calculations on them that's going to be faster to just store 00:51:17.920 |
them as float in the first place rather than converting backwards and forwards even though 00:51:21.760 |
they're conceptually an int we're not going to be doing kind of int style calculations 00:51:26.520 |
with them good question I mentioned that by default the data block uses a random split 00:51:39.320 |
you might have noticed in the data frame though it said here's a column saying what validation 00:51:49.120 |
set to use and if the data set you're given tells you what validation set to use you should 00:51:54.160 |
generally use it because that way you can compare your validation set results to somebody 00:51:58.760 |
else's so you can pass a splitter argument which again is a function and so we're going 00:52:06.040 |
to pass it a function that's also called splitter and the function is going to return the indexes 00:52:12.720 |
where it's not valid and that's going to be the training set and the indexes where it 00:52:18.520 |
is valid that's going to be the validation set and so the splitter argument is expected 00:52:23.120 |
to return two lists of integers, and so if we do that we get again the same thing but 00:52:29.680 |
now we're using the correct train and validation sets. 00:52:40.480 |
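A sketch of that splitter function and passing it to the DataBlock:

```python
# use the CSV's is_valid column rather than a random 20% split
def splitter(df):
    train = df.index[~df['is_valid']].tolist()   # rows where is_valid is False
    valid = df.index[df['is_valid']].tolist()    # rows where is_valid is True
    return train, valid

dblock = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
                   splitter=splitter, get_x=get_x, get_y=get_y)
dsets = dblock.datasets(df)
```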
Another question? Sure. Any particular reason we don't use floating point 8, is it just that the precision is too low? Yeah, 00:52:47.800 |
trying to train with 8-bit precision is super difficult, it's so flat and bumpy it's 00:52:57.320 |
pretty difficult to get decent gradients there but you know it's an area of research the 00:53:03.440 |
main thing people do with 8-bit or even 1-bit data types is they take a model that's already 00:53:10.440 |
been trained with 16-bit or 32-bit floating point and then they kind of round it off it's 00:53:15.160 |
called discretizing to create a kind of purely integer or even binary network which can do 00:53:23.680 |
inference much faster figuring out how to train with such low precision data is an 00:53:31.820 |
area of active research. I suspect it's possible, and I mean people have fiddled around with 00:53:42.280 |
it and had some success I think you know it could turn out to be super interesting particularly 00:53:46.960 |
for stuff that's being done on like low-powered devices that might not even have a floating 00:53:51.120 |
point unit. Right, so the last thing we need to do is to add our item transforms, RandomResizedCrop; 00:54:01.000 |
we've talked about that enough so I won't go into it, but basically that means 00:54:05.000 |
we now are going to ensure that everything has the same shape so that we can collate 00:54:09.460 |
it into a data loader. Then rather than going .datasets we go .dataloaders and 00:54:15.480 |
display our data, and remember if something goes wrong, as we saw last week, you can call summary 00:54:22.120 |
to find out exactly what's happening in your data block so now you know this is something 00:54:27.400 |
really worth studying this section because data blocks are super handy and if you haven't 00:54:32.280 |
used fastai 2 before they won't be familiar to you because no other library uses them 00:54:39.440 |
and so like this is really showing you how to go right back to the start and gradually 00:54:42.840 |
build them up, so hopefully that'll make a whole lot of sense. 00:54:49.520 |
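Putting the pieces together, the finished DataBlock looks roughly like this (the crop size and min_scale are illustrative values for this sketch):

```python
dblock = DataBlock(
    blocks=(ImageBlock, MultiCategoryBlock),           # open the image / one-hot the labels
    splitter=splitter,                                 # use the CSV's is_valid column
    get_x=get_x,
    get_y=get_y,
    item_tfms=RandomResizedCrop(128, min_scale=0.35),  # every item gets the same shape
)
dls = dblock.dataloaders(df)
dls.show_batch(nrows=1, ncols=3)
# if something goes wrong, dblock.summary(df) walks through every step
```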
Now we're going to need a loss function again, and to do that let's start by just creating a learner; it creates a 00:54:57.200 |
resnet 18 from the data loaders object that we just created and let's grab one batch of 00:55:04.720 |
data, and then let's put that into our mini batch of independent variables, 00:55:12.360 |
and then learn dot model is the thing that actually contains the model itself in this 00:55:18.880 |
case a CNN, and you can treat it as a function, and so therefore we can just pass something 00:55:24.400 |
to it and so if we pass a mini batch of the independent variable to learn dot model it 00:55:31.540 |
will return the activations from the final layer and that is shape 64 by 20 so anytime 00:55:40.600 |
you get a tensor back look at its shape and in fact before you look at its shape predict 00:55:45.640 |
what the shape should be, and then make sure that you're right. If you're not, either 00:55:51.200 |
you guessed wrong, so try to understand where you made a mistake, or there's a problem with 00:55:55.680 |
your code. In this case 64 by 20 makes sense because we have a mini batch size of 64 and 00:56:04.840 |
for each of those we're going to make a prediction for each of these 20 possible categories. 00:56:10.760 |
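A sketch of grabbing a batch and calling the model on it, assuming the multi-label `dls` built above:

```python
learn = cnn_learner(dls, resnet18)
x, y = dls.train.one_batch()       # one mini batch of independent and dependent variables
activs = learn.model(x)            # plain PyTorch: call the model like a function
activs.shape                       # torch.Size([64, 20]): batch of 64, one activation per category
activs[0]                          # raw activations, not yet scaled to be between 0 and 1
```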
And we have a question, two questions. All right: is the data 00:56:17.120 |
block API compatible with out of core data sets like Dask yeah the data block API can 00:56:25.400 |
do anything you want it to do so you're passing it if we go back to the start so you can create 00:56:37.360 |
an empty one and then you can pass it anything that is indexable and yeah so that can be 00:56:48.120 |
anything you like, and pretty much anything can be made indexable in Python, and 00:56:54.880 |
something like Dask is certainly indexable, so that works perfectly fine. If it's not indexable, 00:57:03.920 |
like it's a network stream or something like that, then you can use the data loaders and datasets 00:57:10.840 |
APIs directly, which we'll learn about either in this course or the next one. But yeah anything 00:57:16.320 |
that you can index into it certainly includes Dask you can use with data blocks next question 00:57:23.240 |
where do you put images for multi-label with that CSV table should they be in the same 00:57:28.080 |
directory? They can be anywhere you like, so in this case we used a pathlib object like 00:57:37.120 |
so, and in this case, by default it's going to be using, let me think about this, so what's 00:57:57.240 |
happening here is the path is, oh, it's saying dot, okay, the reason for that is that 00:58:04.760 |
Path.BASE_PATH is currently set to path, and so that displays things relative to it; oh, let's get 00:58:09.560 |
rid of that. Okay so the path we set is here, right, and so then when we said get_x it's 00:58:19.760 |
saying path slash train slash whatever, right, so this is an absolute path, and so here is 00:58:27.000 |
the exact path so you can put them anywhere you like you just have to say what the path 00:58:31.240 |
is, and then if you want to not get confused by having this big long prefix that we 00:58:38.920 |
don't want to see all the time, just set BASE_PATH to the path you want everything to be 00:58:43.360 |
relative to and then it'll just print things out in this more convenient manner right so 00:58:54.200 |
this is really important that you can do this that you can create a learner you can grab 00:58:58.640 |
a batch of data, that you can pass it to the model, and this is just plain PyTorch, this line 00:59:03.480 |
here, right, no fastai; you can see the shape, right, you can recognize why it has this shape, 00:59:11.360 |
and so now if you have a look here are the 20 activations now this is not a trained model 00:59:21.040 |
it's a pre-trained model with a random set of final layer weights so these specific numbers 00:59:26.200 |
don't mean anything but it's just worth remembering this is what activations look like and most 00:59:32.800 |
importantly they're not between 0 and 1 and if you remember from the MNIST notebook we 00:59:38.400 |
know how to scale things between 0 and 1 we can pop them into the sigmoid function so 00:59:43.520 |
the sigmoid function is something that scales everything to be between 0 and 1 so let's 00:59:49.720 |
use that. You'll also hopefully remember from the MNIST notebook that the MNIST loss function first did sigmoid, then torch.where, and then .mean(). We're going to use exactly the same thing as the MNIST loss function, and we're just going to do one extra thing, which is to add .log(), for the same reason that
we talked about when we were looking at softmax we talked about why log is a good idea as 01:00:25.880 |
a transformation we saw in the MNIST notebook we didn't need it but we're going to train 01:00:33.280 |
faster and more accurately if we use it, because it's just going to be better behaved, as we've seen. So this particular function - which is identical to MNIST loss plus .log() - has a specific name: it's called binary cross entropy. We used it for the threes
versus sevens problem, to decide whether that column is a three or not. But because we can use broadcasting in PyTorch and element-wise arithmetic, this function, when we pass it a whole matrix, is going to be applied to every column - it'll basically do a torch.where on every column separately, on every item separately,
so that's great it basically means that this binary cross entropy function is going to 01:01:31.120 |
be just like MNIST loss but rather than just being is this the number three it'll be is 01:01:37.880 |
this a dog is this a cat is this a car is this a person is this the bicycle and so forth 01:01:43.240 |
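Written out, it's just the MNIST loss with a log added - a sketch:

```python
import torch

def binary_cross_entropy(inputs, targets):
    # sigmoid to get probabilities, pick the probability assigned to the correct
    # answer in each position, take (negated) logs, and average
    inputs = inputs.sigmoid()
    return -torch.where(targets == 1, inputs, 1 - inputs).log().mean()

# e.g. 64 items, 20 possible labels, one-hot encoded targets
acts  = torch.randn(64, 20)
targs = torch.randint(0, 2, (64, 20)).float()
binary_cross_entropy(acts, targs)
```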
this is where it's so cool in PyTorch: we can write one thing and then kind of have it expand to handle higher-dimensional tensors without doing any extra work. We don't have to write this ourselves, of course, because PyTorch has one, and it's called F.binary_cross_entropy, so we can just use PyTorch's. As we've talked about, there's always an equivalent module version, so this is exactly the same thing as a module: nn.BCELoss. These ones don't include the initial sigmoid; if you want to include the initial sigmoid, you need F.binary_cross_entropy_with_logits, or the equivalent nn.BCEWithLogitsLoss. So BCE is binary cross entropy, and those are two functions plus two equivalent classes for multi-label or binary problems. Then the equivalent for single-label, like MNIST and PETS, is nll_loss and cross_entropy - that's the equivalent of binary cross entropy
and binary cross entropy with logits so these are pretty awful names I think we can all 01:02:59.920 |
agree but it is what it is so in our case we have a one-hot encoded target and we want 01:03:09.640 |
the one with a sigmoid in so the equivalent built-in is called BCE with logits loss so 01:03:17.120 |
that we can make that our loss function we can compare the activations to our targets 01:03:22.840 |
and we can get back a loss and then that's what we can use to train and then finally 01:03:31.060 |
before we take our break we also need a metric now previously we've been using as a metric 01:03:35.880 |
accuracy or actually error rate error rate is one minus accuracy accuracy only works 01:03:42.800 |
for single label datasets like MNIST and PETS because what it does is it takes the input 01:03:52.860 |
which is the final layer activations and it does argmax what argmax does is it says what 01:03:59.520 |
is the index of the largest number in those activations so for example for MNIST you know 01:04:04.400 |
maybe the largest the highest probability is seven so this argmax would return seven 01:04:11.080 |
and then it says okay there's those are my predictions and then it says okay is the prediction 01:04:16.520 |
equal to the target or not and then take the floating point mean so that's what accuracy 01:04:22.400 |
is so argmax only makes sense when there's a single maximum thing you're looking for 01:04:30.040 |
in this case we've got multilabel so instead we have to compare each activation to some 01:04:38.260 |
threshold by default it's 0.5 and so we basically say if the sigmoid of the activation is greater 01:04:45.980 |
than 0.5 let's assume that means that category is there and if it's not let's assume it means 01:04:53.040 |
it's not there. And so this is going to give us a list of trues and falses for the ones that, based on the activations, it thinks are there, and we can compare that to the target and then again take the floating point mean. So we can use the default threshold of 0.5,
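Here's roughly what that metric looks like (it mirrors fastai's accuracy_multi):

```python
def accuracy_multi(inp, targ, thresh=0.5, sigmoid=True):
    "Accuracy when `inp` and `targ` are the same size (multi-label)."
    if sigmoid: inp = inp.sigmoid()          # scale activations to (0,1)
    return ((inp > thresh) == targ.bool()).float().mean()
```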
but we don't necessarily want to use 0.5 we might want to use a different threshold and 01:05:18.480 |
remember, when we create our learner, we have to pass the metrics argument a function. So what if we want to use a threshold other than 0.5? Well,
we'd like to create a special version of accuracy_multi with a different threshold, and the way we do that is with a function from Python's functools module called partial. Let me show
you how partial works here's a function called say hello say hello to somebody with something 01:05:54.120 |
so say hello Jeremy well the default is hello so it says hello Jeremy say hello Jeremy comma 01:06:00.240 |
'ahoy' is going to be 'Ahoy Jeremy'. Let's create a special version of this function that will be more suitable for Sylvain - it's going to use French. So we can say partial: create a new function
that's based on the say hello function but it's always going to set say what to bonjour 01:06:19.440 |
and we'll call that f but now f Jeremy is bonjour Jeremy and f sylvain is bonjour sylvain 01:06:28.360 |
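That example, roughly as it appears on the slide:

```python
from functools import partial

def say_hello(name, say_what='Hello'): return f'{say_what} {name}.'

say_hello('Jeremy')            # 'Hello Jeremy.'
say_hello('Jeremy', 'Ahoy!')   # 'Ahoy! Jeremy.'

f = partial(say_hello, say_what='Bonjour')
f('Jeremy'), f('Sylvain')      # ('Bonjour Jeremy.', 'Bonjour Sylvain.')
```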
so you see we've created a new function from an existing function by fixing one of its 01:06:33.560 |
parameters so we can do the same thing for accuracy multi say let's use a threshold of 01:06:40.600 |
0.2, and we can pass that to metrics. So let's create a cnn_learner - and you'll notice here we don't actually pass a loss function. That's because fastai is smart enough to realize, 'hey, you're doing a classification model with a multi-label dependent variable, so I know what loss function you probably want' - so it does it for us. We can call fine_tune, and here we have an accuracy of 94.5% after the first few epochs and eventually 95.1%.
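Put together, it looks something like this (a sketch; the architecture and epoch counts are assumptions based on the notebook's defaults):

```python
learn = cnn_learner(dls, resnet50, metrics=partial(accuracy_multi, thresh=0.2))
learn.fine_tune(3, base_lr=3e-3, freeze_epochs=4)
```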
That's pretty good - we've got an accuracy of over 95 percent. Was 0.2 a good threshold to pick? Who knows - let's try 0.1. Oh, that's a worse accuracy. So I guess in this case we could try a higher threshold... 94... hmm, also not good. So what's the best threshold? Well, what we could do is call get_preds to get all of the predictions and all of the targets, and then we could calculate the accuracy at some threshold. Then we could grab lots of numbers between 0.05 and 0.95 and, with a list comprehension, calculate the accuracy for all of those different thresholds and plot them. Ah - looks like we want a threshold somewhere a bit above 0.5, so, cool, we can just use that.
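A sketch of that sweep (get_preds applies the sigmoid for us by default, so we pass sigmoid=False; the exact number of points is an assumption):

```python
import torch
import matplotlib.pyplot as plt

preds, targs = learn.get_preds()
xs = torch.linspace(0.05, 0.95, 29)
accs = [accuracy_multi(preds, targs, thresh=i, sigmoid=False) for i in xs]
plt.plot(xs, accs)
```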
It's going to give us 96-and-a-bit, which is a better accuracy. This is, you know, something that a lot of theoreticians would be uncomfortable about: I've used the validation set to pick a hyperparameter (the threshold), and so people might be like, 'oh, you're overfitting, using the validation set to pick a hyperparameter'. But if you think about it, this is a very smooth curve, right?
it's not some bumpy thing where we've accidentally kind of randomly grabbed some unexpectedly 01:08:40.120 |
good value when you're picking a single number from a smooth curve you know this is where 01:08:46.040 |
the theory of like don't use a validation set for for hyper parameter tuning doesn't 01:08:51.320 |
really apply so it's always good to be practical right don't treat these things as rules but 01:08:57.320 |
as rules of thumb. Okay, so let's take a break for five minutes and we'll see you back here
in five minutes time all right welcome back so I want to show you something really cool 01:09:14.560 |
image regression so we are not going to learn how to use a fast AI image regression application 01:09:23.680 |
because we don't need one. Now that we know how to build stuff up with loss functions
and the data block API ourselves we can invent our own applications so there is no image 01:09:37.800 |
regression application per se but we can do image regression really easily what do we 01:09:46.600 |
mean by image regression well remember back to lesson I think it's lesson one we talked 01:09:51.400 |
about the two basic types of machine learning or supervised machine learning regression 01:09:59.720 |
and classification classification is when our dependent variable is a discrete category 01:10:04.920 |
or set of categories and regression is when our dependent variable is a continuous number 01:10:13.520 |
like an age or x y coordinate or something like that so image regression means our independent 01:10:20.720 |
variable is an image and our dependent variable is one or more continuous values. And so here's what that can look like: the BIWI head pose dataset. It has a number of things in it, but one of the things we can do is find the midpoint of a person's face. The BIWI head pose dataset comes from the paper 'Random Forests for Real Time 3D Face Analysis', so thank you to those authors. We can grab it in the usual way with untar_data, we can have a look at what's in there, and we can
see there's 24 directories, numbered from 01 to 24 - there's one, two, three - and each one also has a .obj file; we're not going to be using the .obj files, just the directories. So let's look at one of the directories, and as you can see there's a thousand things in the first directory. Each one of these 24 directories is one different person that they've photographed, and you can see for each person there's frame 3 pose, frame 3 RGB, frame 4 pose, frame 4 RGB, and so forth. So in each case we've got the image, which is the RGB, and we've
got the pose, which is the pose.txt. As we've seen, we can use get_image_files to get a list of all of the image files, recursively, in a path. And once we have an image file name like this one, we can turn it into a pose file name by removing the last seven letters and adding 'pose.txt' back on, and so here is
a function that does that: you can see I can pass an image file to img2pose and get back a pose file.
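A sketch of that helper, consistent with the description (strip the trailing seven characters, 'rgb.jpg', and append 'pose.txt'):

```python
from fastai.vision.all import *

path = untar_data(URLs.BIWI_HEAD_POSE)
img_files = get_image_files(path)

def img2pose(x):
    # e.g. '.../frame_00003_rgb.jpg' -> '.../frame_00003_pose.txt'
    return Path(f'{str(x)[:-7]}pose.txt')

img2pose(img_files[0])
```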
PILImage.create is the fastai way to create an image - at least a PIL image - and it has a shape. In computer vision, shapes are normally backwards: they normally do columns by rows, and that's why it's this way around, whereas PyTorch and NumPy tensors and
arrays are rows by columns so that's confusing but that's just how things are I'm afraid 01:13:00.400 |
and so here's an example of an image when you look at the readme from the dataset website 01:13:08.100 |
they tell you how to get the center point from one of the text files, and it's just this function - it doesn't matter, it is what it is. We call it get_ctr, and it will return the x,y coordinate of the center of the person's face. So we can pass this as get_y, because get_y, remember, is the thing that gives us back the label. Okay, so here's
the thing right we can create a data block and we can pass in as the independent variables 01:13:44.480 |
block image block as usual and then the dependent variables block we can say point block which 01:13:50.160 |
is a tensor with two values in and now by combining these two things this says we want 01:13:55.680 |
to do image regression with a dependent variable with two continuous values to get the items 01:14:04.020 |
you call get image files to get the Y we'll call the get center function to split it so 01:14:11.840 |
this is important we should make sure that the validation set contains one or more people 01:14:21.140 |
that don't appear in the training set so I'm just going to grab person number 13 just grabbed 01:14:26.440 |
it randomly and I'll use all of those images as the validation set because I think they 01:14:32.200 |
did this with an Xbox Kinect - you know, a video thing - so there's a lot of images that look almost identical, and if you randomly assigned them then you would be massively overestimating how effective you are. You want to make sure that you're actually doing a good job with a new set of people, not just a new set of frames - that's why we use this,
and so a func splitter is a splitter that takes a function and in this case we're using 01:15:00.400 |
lambda to create the function we will use data augmentation and we will also normalize 01:15:09.720 |
so this is actually done automatically now, but in this case we're doing it manually: it's going to subtract the mean and divide by the standard deviation of the original dataset that the pre-trained model used, which is ImageNet. So that's our data block.
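The whole data block, roughly as described (a sketch following the notebook; the augmentation size is an assumption matching the image size discussed below):

```python
biwi = DataBlock(
    blocks=(ImageBlock, PointBlock),                 # image in, (x,y) point out
    get_items=get_image_files,
    get_y=get_ctr,                                   # the centre-point function above
    splitter=FuncSplitter(lambda o: o.parent.name == '13'),  # person 13 -> validation set
    batch_tfms=[*aug_transforms(size=(240, 320)),
                Normalize.from_stats(*imagenet_stats)])
dls = biwi.dataloaders(path)
dls.show_batch(max_n=8)
```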
we can call data loaders to get our data loaders passing in the path and show batch and we 01:15:37.960 |
can see that looks good - here are our faces and the points. And, particularly as a student, don't just look at the pictures, look at the actual data: grab a batch, put it into an xb and a yb - an x batch and a y batch - and have a look at the shapes, and make sure they make sense. The ys are 64 by 1 by 2: there are 64 in the mini-batch - 64 rows - and then each coordinate is a 1 by 2 tensor, so this is a single point with two things in it. It's like you could
have like hands face and armpits or whatever or nose and ears and mouth so in this case 01:16:27.720 |
we're just using one point and the point is represented by two values the x and the y 01:16:34.320 |
and then the x is 64 by 3 by 240 by 320 - well, there's 240 rows by 320 columns, that's the
pixels that's the size of the images that we're using mini batches 64 items and what's 01:16:48.440 |
the three the three is the number of channels which in this case means the number of colors 01:16:54.840 |
if we open up some random grizzly bear image and then we go through each of the elements 01:17:07.720 |
of the first axis and do a show image you can see that it's got the red the green and 01:17:17.040 |
the blue as the three channels so that's how we store a three-channel image is it stored 01:17:24.560 |
as a three by number of rows by number of columns rank three tensor and so a mini batch 01:17:31.940 |
of those is a rank four tensor that's why this is that shape but here's a row from the 01:17:39.600 |
dependent variable okay there's that XY location we talked about so we can now go ahead and 01:17:47.720 |
create a learner passing in our data loaders as usual passing in a pre-trained architecture 01:17:53.160 |
as usual and if you think back you may just remember in lesson one we learned about y 01:18:00.400 |
range y range is where we tell fastai what range of data we expect to see in the dependent 01:18:10.040 |
variable so we want to use this generally when we're doing regression though the range 01:18:15.320 |
of our coordinates is between minus one and one that's how fastai and pytorch treats coordinates 01:18:22.800 |
the left-hand side and the top are minus one, and the bottom and the right are one. So there's no point predicting something that's smaller than minus one or bigger than
one, because that is not in the area that we use for our coordinates. (We have a question? Sure - just a moment.) So how does y_range work? Well, it actually uses a function called sigmoid_range, which takes the sigmoid of x, multiplies it by (high minus low), and adds low. Here is what sigmoid_range looks like for minus one to one: it's just a sigmoid where the bottom is the low and the top is the high, and so that way all of our activations are going to be mapped to the range from minus one to one.
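As a sketch, that function is simply:

```python
def sigmoid_range(x, lo, hi):
    "Sigmoid of `x`, rescaled to the range (lo, hi)."
    return torch.sigmoid(x) * (hi - lo) + lo

# with y_range=(-1, 1), every activation lands in [-1, 1]
```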
Yes, Rachel? 'Can you provide images with an arbitrary number of channels as inputs, specifically more than three channels?' Yeah, you can have as many channels as you like. We've certainly seen images with fewer than three, because we've seen grayscale; more than three is common as well - you could have, like, an infrared
band or like satellite images often have multispectral there's some kinds of medical images where 01:19:41.520 |
there are bands that are kind of outside the visible range your pre-trained model will 01:19:47.800 |
generally have three channels the fast AI does some tricks to use three channel pre-trained 01:19:55.800 |
models for non three channel data but that's the only tricky bit other than that it's just 01:20:02.640 |
just a you know it's just an axis that happens to have four things or two things or one thing 01:20:08.120 |
instead of three things there's nothing special about it okay we didn't specify a loss function 01:20:18.440 |
here so we get whatever it gave us which is a MSE loss so MSE losses mean squared error 01:20:24.800 |
and that makes perfect sense right you would expect mean squared error to be a reasonable 01:20:30.160 |
thing to use for regression we're just testing how close we are through the target and then 01:20:35.880 |
taking the square taking the mean we didn't specify any metrics and that's because mean 01:20:42.840 |
squared error is already a good metric: it has nice gradients, it behaves well, and it's also the thing that we care about, so we don't need a separate
metric to track. So let's go ahead and use lr_find and pick a learning rate - maybe about 10 to the minus 2 - and call fine_tune, and we get a valid loss of 0.0001. That's the mean squared error, so we should take the square root: on average we're about 0.01 off, in a coordinate space that goes between minus 1 and 1, so that sounds super accurate; it took about three-and-a-bit minutes to run.
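A sketch of the learner and the training calls (the architecture and the exact learning rate are assumptions based on the notebook):

```python
learn = cnn_learner(dls, resnet18, y_range=(-1, 1))   # squash predictions into [-1, 1]
learn.lr_find()
learn.fine_tune(3, 1e-2)
```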
We can always - and in fastai we always should - call show_results to see what our results look like, and as you can see fastai has automatically figured out how to display the combination of an image independent variable and a point dependent variable: on the left is the target and on the right is the prediction, and as you can see it is pretty close to perfect. One of the really interesting things here is that we used fine_tune even though - think about it - the thing we're fine-tuning, an ImageNet model, isn't even an image regression model. So we're actually fine-tuning an image classification
model to become something totally different an image regression model why does that work 01:22:12.840 |
so well? Well, because an ImageNet classification model must have learnt a lot about kind of
how images look what things look like and where the pieces of them are to kind of know 01:22:29.560 |
how to figure out what breed of animal something is, even if it's partly obscured, or it's in the shade, or it's turned at different angles. You know, these pre-trained image models are incredibly powerful computer vision algorithms, so built into every ImageNet pre-trained
model is all this capability that it had to learn for itself so asking it to use that 01:22:57.000 |
capability to figure out where something is just actually not that hard for it and so 01:23:03.000 |
that's why we can actually fine-tune an image net classification model to create something 01:23:09.360 |
completely different which is a point image regression model so I find that incredibly 01:23:18.840 |
cool I got to say so again look at the further research after you've done the questionnaire 01:23:26.840 |
and particularly if you haven't used data frames before please play with them because 01:23:30.320 |
we're going to be using them more and more good question I'll just do the last one and 01:23:39.560 |
also go back and look at the bear classifier from notebook 2 or whatever hopefully you 01:23:45.840 |
created some other classifier for your own data because remember we talked about how 01:23:51.760 |
it would be better if the bear classifier could also recognize that there's no bear 01:23:56.000 |
at all or maybe there's both a grizzly bear and a black bear or a grizzly bear and a teddy 01:24:01.940 |
bear so if you retrain it using multi-label classification see what happens see how well 01:24:08.000 |
it works when there's no bears and see whether it changes the accuracy of the single label 01:24:14.960 |
model when you turn it into a multi-label problem so have a fiddle around and tell us 01:24:20.560 |
on the forum what you find I've got a question Rachel is there a tutorial showing how to 01:24:25.480 |
use pre-trained models on four-channel images? Also, how can you add a channel to a normal image?' Well, as to the last one - how do you add a channel to an image? - I don't know what that means. You can't: an image is an image, you can't add a channel to it; an image
is what it is I don't know if there's a tutorial but we can certainly make sure somebody on 01:24:59.240 |
the forum has learned how to do it it's it's super straightforward it should be pretty 01:25:05.840 |
much automatic. Okay - we're going to talk about collaborative filtering. What is collaborative filtering?
Well think about on Netflix or whatever you might have watched a lot of movies that are 01:25:29.840 |
sci-fi and have a lot of action and were made in the 70s and Netflix might not know anything 01:25:40.200 |
about the properties of movies you watched it might just know that they're movies with 01:25:45.080 |
titles and IDs but what it could absolutely see without any manual work is find other 01:25:52.520 |
people that watched the same movies that you watched and it could see what other movies 01:26:02.240 |
those people watched that you haven't, and you would probably find they're also science fiction and full of action and made in the 70s. So
we can use an approach where we recommend things even if we don't know anything about 01:26:20.440 |
what those things are as long as we know who else has used or recommended things that are 01:26:30.200 |
similar you know the same kind you know many of the same things that that you've liked 01:26:34.520 |
or used. This doesn't necessarily mean users and products - in fact, in collaborative filtering, instead of products we normally say items, and items could be links you click on, a diagnosis for a patient, and so forth. So there's a key idea here, which is that in the underlying items - and we're going to be using movies in this example - there are some features; they may not be labeled, but there's some underlying concept of features of
those movies like the fact that there's a action concept and a sci-fi concept in the 01:27:19.160 |
1970s concept now you were never actually told Netflix you like these kinds of movies 01:27:24.720 |
and maybe Netflix never actually added columns to their movies saying what movies are those 01:27:28.320 |
types but as long as like you know in the real world there's this concept of sci-fi 01:27:35.680 |
and action and movie age and that those concepts are relevant for at least some people's movie 01:27:42.520 |
watching decisions as long as this is true then we can actually uncover these they're 01:27:50.240 |
called latent factors - these things that kind of decide what kind of movies you want
to watch and they're latent because nobody necessarily ever wrote them down or labeled 01:28:02.560 |
them or communicated them in any way so let me show you what this looks like so there's 01:28:11.240 |
a great data set we can use called movie lens which contains tens of millions of movie rankings 01:28:17.640 |
and so a movie ranking looks like this it has a user number a movie number a rating 01:28:28.000 |
and a time step so we don't know anything about who user number 196 is I don't know 01:28:34.160 |
if that is Rachel or somebody else I don't know what movie number 242 is I don't know 01:28:41.520 |
if that's Casablanca or Lord of the Rings or the mask and then rating is a number between 01:28:49.120 |
I think it was one and five. (A question? Sure.) 'In traditional machine learning we perform cross-validation and k-fold training to check the variance/bias trade-off - is this common here?'
So cross validation is a technique where you don't just split your data set into one training 01:29:15.120 |
set and one validation set but you basically do it five or so times like five training 01:29:21.920 |
sets and like five validation sets representing different overlapping subsets and basically 01:29:29.840 |
this used to be done a lot, because people often used to not have enough data to get a good result, and so this way, rather than having 20% that you would leave out each time, you could just leave out, like, 10% each time.
Nowadays it's less common that we have so little data that we need to worry about the 01:29:53.120 |
complexity and extra time of lots of models. It's done on Kaggle a lot - on Kaggle, every little fraction of a percent matters - but it's not a deep learning thing or a machine
learning thing or whatever it's just a you know lots of data or not very much data thing 01:30:12.920 |
and do you care about the last decimal place of them or not it's not something we're going 01:30:18.940 |
to talk about certainly in this part of the course if ever because it's not something 01:30:24.680 |
that comes up in practice that often as being that important. 01:30:34.800 |
What would be some good applications of collaborative filtering outside of recommender systems? 01:30:42.480 |
Well I mean depends how you define recommender system if you're trying to figure out what 01:30:50.880 |
kind of other diagnoses might be applicable to a patient I guess that's kind of a recommender 01:30:56.080 |
system or you're trying to figure out where somebody is going to click next or whatever 01:31:02.320 |
it's kind of a recommender system but you know really conceptually it's anything where 01:31:08.500 |
you're trying to learn from past behavior where that behavior is kind of like a thing
What is an approach to training using video streams i.e. from drone footage instead of 01:31:26.520 |
images would you need to break up the footage into image frames? 01:31:31.800 |
In practice quite often you would because images just tend to be pretty big so videos 01:31:39.880 |
There's a lot of options. I mean, theoretically, time could just be another dimension - so for a full colour movie you can absolutely have a rank-five tensor: batch by time by colour by row by column - but often
that's too computationally and too memory intensive so sometimes people just look at 01:32:22.460 |
one frame at a time sometimes people use a few frames around kind of the keyframe like 01:32:30.600 |
three or five frames at a time and sometimes people use something called a recurrent neural 01:32:36.160 |
network which we'll be seeing in the next week or two treated as a sequence data yeah 01:32:41.440 |
there's all kinds of tricks you can do to try and work with that conceptually though 01:32:49.240 |
there's no reason you can't just add an additional axis to your tensors and everything will work;
it's just a practical issue around time and memory. 01:32:59.240 |
And someone else noted that it's pretty fitting that you mentioned the movie The Mask. 01:33:03.880 |
Yes it was not an accident because I've got masks on the brain. 01:33:12.880 |
I'm not sure if we're allowed to like that movie anymore though I kind of liked it when 01:33:16.560 |
it came out; I don't know what I think nowadays - it's been a while. Okay, so let's take a look. We can untar_data ML_100k; ml-100k is a small subset of the full set - there's another
one that we can grab which has got the whole lot 25 million but 100k is good enough for 01:33:41.840 |
messing around so if you look at the readme you'll find the main table the main table 01:33:46.540 |
is in a file called u.data, so let's open it up with read_csv. Again, this one is actually not comma-separated values - it's tab-separated - but rather confusingly we still use read_csv and just say the delimiter is a tab ('\t'). There's no row at the top saying what the columns are called, so we say header=None and then pass in a list of what the columns are called. Calling .head() will give us the first five rows - we mentioned before what this looks like.
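A sketch of that read (the column names follow the description; the dataset comes from fastai's URLs registry):

```python
from fastai.collab import *
import pandas as pd

path = untar_data(URLs.ML_100k)
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user', 'movie', 'rating', 'timestamp'])
ratings.head()
```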
It's not a particularly friendly way to look at it, so what I'm going to do is cross-tab it. And so what I've done here is I've grabbed the top - I can't remember how many
it was 15 or 20 movies based on the most popular movies and the top bunch of users who watched 01:34:44.000 |
the most movies and so I've basically kind of reoriented this so for each user I have 01:34:51.280 |
all the movies they've watched and the rating they gave them so empty spots represent users 01:34:56.240 |
that have not seen that movie so this is just another way of looking at this same data so 01:35:09.640 |
basically what we want to do is guess what movies we should tell people they might want 01:35:15.640 |
to watch, and so it's basically filling in these gaps: to tell user 212, do we think they might like movie 49 or 79 or 99 best to watch next?
So let's assume that we actually had columns for every movie that represented say how much 01:35:43.980 |
sci-fi they are how much action they are and how old they are and maybe they're between 01:35:49.440 |
minus one and one and so like the last Skywalker is very sci-fi fairly action and definitely 01:35:57.480 |
not old and then we could do the same thing for users so we could say user one really 01:36:05.760 |
likes sci-fi quite likes action and really doesn't like old and so now if you multiply 01:36:13.120 |
those together and remember in PyTorch and NumPy you have element wise calculations so 01:36:19.200 |
this is going to multiply each corresponding item it's not matrix multiplication if you're 01:36:24.800 |
a mathematician don't go there this is element wise multiplication if we want matrix multiplication 01:36:29.960 |
be an at sign. So if we multiply each element together with the equivalent element
in the other one and then sum them up that's going to give us a number which will basically 01:36:42.800 |
tell us how much do these two correspond because remember two negatives multiply together to 01:36:47.640 |
get a positive. So user one likes exactly the kind of stuff that The Last
Skywalker has in it and so we get two point one multiplying things together element wise 01:37:01.680 |
and adding them up is called the dot product and we use it a lot and it's the basis of 01:37:06.640 |
matrix multiplication so make sure you know what a dot product is it's this so Casablanca 01:37:23.760 |
is not at all sci-fi not much action and is certainly old so if we do user one times Casablanca 01:37:32.040 |
we get a negative number, so we might think, okay, user one probably won't like this movie. The
problem is we don't know what the latent factors are and even if we did we don't know how to 01:37:44.840 |
label a particular user or a particular movie with them so we have to learn them how do 01:37:53.400 |
we learn them? Well, we can actually look at a spreadsheet - I've got a spreadsheet version. What I did was I popped this table
into Excel, and then I randomly created - let's count this now - a 15 by 5 table here, so these are
just random numbers and I randomly created a 5 by 15 table here and I basically said 01:38:39.480 |
okay well let's just pretend let's just assume that every movie and every user has five latent 01:38:45.320 |
factors I don't know what they are and let's then do a matrix multiply of this set of factors 01:38:54.160 |
by this set of factors and a matrix multiply of a row by a column is identical to a dot 01:39:00.280 |
product of two vectors so that's why I can just use matrix multiply so this is just what 01:39:05.600 |
this first cell contains so they then copied it to the whole thing so all these numbers 01:39:11.180 |
there are being calculated from the row latent factors dot product with or matrix multiply 01:39:20.700 |
with a column latent factors so in other words I'm doing exactly this calculation but I'm 01:39:28.680 |
doing them with random numbers and so that gives us a whole bunch of values right and 01:39:39.000 |
then what I could do is I could calculate a loss by comparing every one of these numbers 01:39:45.120 |
here to every one of these numbers here and then I could do mean squared error and then 01:39:54.560 |
I could use stochastic gradient descent to find the best set of numbers in each of these 01:40:00.920 |
two locations and that is what collaborative filtering is so that's actually all we need 01:40:11.180 |
So rather than doing it in Excel - you can view the Excel version later if you're interested, because we can actually do this whole thing and it works in Excel - let's jump in and do it in PyTorch. Now, one thing that might just make this more fun is actually to know what the movies
are and movie lens tells us in u.item what the movies are called and that uses the delimiter 01:40:37.040 |
of the pipe sign weirdly enough so here are the names of each movie and so one of the 01:40:43.880 |
nice things about pandas is it can do joins just like SQL and so you can use the merge 01:40:52.460 |
method to combine the ratings table and the movies table and since they both have a column 01:40:57.920 |
called movie by default it will join on those and so now here we have the ratings table 01:41:03.800 |
with actual movie names that's going to be a bit more fun we don't need it for modeling 01:41:07.940 |
but it's just going to be better for looking at stuff so we could use data blocks API at 01:41:15.760 |
this point or we can just use the built-in application factory method since it's there 01:41:20.240 |
we may as well use it. We can create a CollabDataLoaders object from a data frame by passing in the ratings table. By default the user column is called 'user', and ours is, so that's fine; by default the item column is called 'item', and ours is not - it's called 'title' - so let's specify 'title', and choose a batch size. And if we now say show_batch, here is some of that data; the rating column is called 'rating' by default, so that worked fine too.
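A sketch of the merge and the factory method just described (the u.item parsing details, like the latin-1 encoding, are assumptions based on the notebook):

```python
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1',
                     usecols=(0, 1), names=('movie', 'title'), header=None)
ratings = ratings.merge(movies)          # joins on the shared 'movie' column

dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()
```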
So now we need to create our latent factors - let's assume we're going to use five. The number of users is however many classes there are for 'user', and the number of movies is however many classes there are for 'title'. And so these are - we don't just
have a vocab now right we've actually got a list of classes for each categorical variable 01:42:36.400 |
for each set of discrete choices so we've got a whole bunch of users at 944 and a whole 01:42:43.600 |
bunch of titles 1635 so for our randomized latent factor parameters we're going to need 01:42:55.680 |
to create those matrices so we can just create them with random numbers so this is normally 01:43:00.320 |
distributed random numbers - that's what randn is - and that will be n_users (944) by n_factors (which is 5); that's exactly the same as the spreadsheet version, except that one had just 15. Let's do exactly the same thing for movies: random numbers, n_movies by 5. And so to calculate the
result for some movie and some user, we have to look up the index of the movie in our movie latent factors and the index of the user in our user latent factors, and then take a dot product.
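As a sketch, with the user/movie numbers borrowed from the sample row shown earlier (in the real DataLoaders the raw ids get mapped to indices first):

```python
n_users   = len(dls.classes['user'])     # 944
n_movies  = len(dls.classes['title'])    # 1635 in this subset
n_factors = 5

user_factors  = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)

# prediction for one (user, movie) pair: look up each row, then take the dot product
u, m = 196, 242
pred = (user_factors[u] * movie_factors[m]).sum()
```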
so in other words we would say like oh okay for this particular combination we would have 01:43:43.640 |
to look up that numbered user over here and that numbered movie over here to get the two 01:43:51.500 |
appropriate sets of latent factors but this is a problem because look up in an index is 01:44:02.320 |
not a linear model like remember our deep learning models really only know how to just 01:44:12.400 |
multiply matrices together and do simple element wise nonlinearities like ReLU there isn't 01:44:17.640 |
a thing called 'look up in an index'. (Okay, I'll just finish this bit.) Here's the cool thing though: looking up in an index can actually be represented as a matrix product, believe
it or not so if you replace our indices with one hot encoded vectors then a one hot encoded 01:44:47.200 |
vector times something is identical to looking up in an index and let me show you so if we 01:44:59.440 |
call the one_hot function - which creates, as it says here, a one-hot encoding - and we one-hot encode the value three with n_users classes (n_users, as we've just discussed, is 944), then one_hot(3, n_users) gives us this big tensor, and as you can see at index 3 (counting 0, 1, 2, 3) we have a 1, and the size of that is 944. So if we then multiply that by user_factors - user_factors, remember, is that random matrix of this size - what's going to happen? We're going to multiply 0 by the first row, so that's going to be all zeros; then 0 again, and 0 again, and then finally a 1, right on the index-3 row, so that row gets returned, and then we go back to 0 again. So if we do that (remember, the at sign is matrix multiply) and compare it to user_factors[3]: same thing. Isn't that crazy? It's a kind of weird, inefficient
way to do it right but matrix multiplication is a way to index into an array and this is 01:47:22.080 |
the thing that we know how to do SGD with and we know how to build models with so it 01:47:27.320 |
turns out that anything we can do with indexing into an array, we now have a way to optimize.
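A minimal sketch of that equivalence in plain PyTorch (using torch's one_hot rather than fastai's helper; n_users and user_factors as above):

```python
import torch
import torch.nn.functional as F

one_hot_3 = F.one_hot(torch.tensor(3), num_classes=n_users).float()  # 944 long, 1 at index 3

# matrix multiply with the one-hot vector...
via_matmul = one_hot_3 @ user_factors
# ...returns exactly the same thing as indexing directly
via_index = user_factors[3]
assert torch.allclose(via_matmul, via_index)
```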
and we have a question there are two questions one how different in practice is collaborative 01:47:40.520 |
filtering with sparse data compared to dense data we are not doing sparse data in this 01:47:48.020 |
course but there's an excellent course I hear called computational linear algebra for coders 01:47:53.680 |
it's a fast.ai course, and it has a lot of information about sparse matrices. And the second question: 'In practice, do we tune the number of latent factors?' Absolutely we do, yes - it's just like the number of filters we have in pretty much any kind of deep learning model. All right, so now that we know that the
procedure for finding our set of latent factors - looking something up in an index - is the same as matrix multiplication with a one-hot vector (I already had it over here), we can go ahead and build a model with that. So basically, if we do this for a few indices at once, then we have a matrix of one-hot encoded vectors,
so the whole thing is just one big matrix multiplication now the thing is as I said 01:48:58.600 |
this is a pretty inefficient way to do an index lookup, so there is a computational
shortcut which is called an embedding an embedding is a layer that has the computational speed 01:49:18.400 |
of an array lookup and the same gradients as a matrix multiplication how does it do 01:49:27.440 |
that well just internally it uses an index lookup to actually grab the values and it 01:49:35.120 |
also knows what the gradient of a matrix multiplication by a one-hot encoded vector is or matrix is 01:49:44.720 |
without having to go to all this trouble and so an embedding is a matrix multiplication 01:49:50.360 |
with a one-hot encoded vector where you never actually have to create the one-hot encoded 01:49:54.320 |
vector you just need the indexes this is important to remember because a lot of people have heard 01:50:00.000 |
about embeddings and they think there's something special and magical and and they're absolutely 01:50:06.240 |
not you can do exactly the same thing by creating a one-hot encoded matrix and doing a matrix 01:50:11.360 |
multiply it is just a computational shortcut nothing else I often find when I talk to people 01:50:18.600 |
about this in person I have to tell them this six or seven times before they believe me 01:50:25.120 |
because they think embeddings are something more clever and they're not it's just a computational 01:50:29.440 |
shortcut to do a matrix multiplication more quickly with a one-hot encoded matrix by instead 01:50:34.800 |
doing an array lookup okay so let's try and create a collaborative filtering model in 01:50:46.280 |
PyTorch a model or an architecture or really an nn.module is a class so to use PyTorch through 01:50:57.320 |
its fullest you need to understand object-oriented programming because we have to create classes 01:51:01.600 |
there's a lot of tutorials about this so I won't go into detail about it but I'll give 01:51:06.740 |
you a quick overview a class could be something like dog or resnet or circle and it's 01:51:16.200 |
something that has some data attached to it and it has some functionality attached to 01:51:20.520 |
it. Here is a class called Example: the data it has attached to it is 'a', and the functionality attached to it is 'say'. So we can, for example, create an instance of this class - an object of type Example - and pass in 'Sylvain', so 'Sylvain' will now be in ex.a. We can then say ex.say('nice to meet you'), and it will call say, passing in 'nice to meet you' (that will be x), so it'll say 'Hello', self.a (that's 'Sylvain'), 'nice to meet you'. Here it is.
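The class, roughly as shown on the slide:

```python
class Example:
    def __init__(self, a): self.a = a
    def say(self, x): return f'Hello {self.a}, {x}'

ex = Example('Sylvain')
ex.say('nice to meet you')    # 'Hello Sylvain, nice to meet you'
```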
In Python, the way you create a class is to say 'class' and its name; then, to say what is passed to it when you create that object, there's a special method called dunder init. As we've
briefly mentioned before in Python there are all kinds of special method names that have 01:52:21.720 |
special behavior they start with two underscores they end with two underscores and we pronounce 01:52:27.160 |
that 'dunder' - so, 'dunder init'. All regular methods - instance methods - in Python always get passed the actual object itself first (we normally call that self), and
then optionally anything else and so you can then change the contents of the current object 01:52:49.720 |
by just setting self.whatever to whatever you like so after this self.a is now equal 01:52:55.880 |
to 'Sylvain'. So when we call a method, same thing: it's passed self, plus optionally anything you pass to
it and then you can access the contents of self which you stashed away back here when 01:53:09.200 |
we initialized it so that's basically how object or you know the basics of object-oriented 01:53:14.920 |
programming works in Python there's something else you can do when you create a new class 01:53:24.640 |
which is you can pop something in parentheses after its name and that means we're going 01:53:29.120 |
to use something called inheritance and what inheritance means is I want to have all the 01:53:34.160 |
functionality of this class plus I want to add some additional functionality so module 01:53:41.000 |
is a PyTorch class which fast.ai has customized so it's kind of a fast.ai version of a PyTorch 01:53:49.920 |
class, and probably in the next course we'll see exactly how it works - but it looks and acts almost exactly like a regular Python class. We have an init
and we can set attributes to whatever we like and one of the things we can use is an embedding 01:54:15.120 |
and so an embedding is just a class that does what I just described: it's the same as a linear layer with a one-hot encoded matrix, but it does it with this computational
shortcut you can say how many in this case users are there and how many factors will 01:54:31.000 |
they have now there is one very special thing about things that inherit from module which 01:54:38.400 |
is that when you call them it will actually call a method called forward so forward is 01:54:44.240 |
a special PyTorch method name it's the most important PyTorch method name this is where 01:54:49.820 |
you put the actual computation. So to grab the factors from an embedding, we just
call it like a function right so this is going to get passed here the user IDs and the movie 01:55:05.640 |
IDs as two columns so let's grab the zero index column and grab the embeddings by passing 01:55:12.520 |
them to user factors and then we'll do the same thing for the index one column that's 01:55:17.640 |
the movie IDs pass them to the movie factors and then here is our element wise multiplication 01:55:25.280 |
and then sum. Now, remember we've got another dimension now: the first axis is the mini-batch dimension, so we want to sum over the other dimension, the index-1 dimension. That's going to give us a dot product for each - sorry, for each rating, for each user/movie combination. So this is the DotProduct class.
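Here's a sketch of that class (Module and Embedding come from the fastai star import; fastai's Module means we don't need to call super().__init__() ourselves):

```python
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors  = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)

    def forward(self, x):
        users  = self.user_factors(x[:, 0])   # x[:,0] holds the user indices
        movies = self.movie_factors(x[:, 1])  # x[:,1] holds the movie indices
        return (users * movies).sum(dim=1)    # one dot product per rating
```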
If we look at one batch of our data, it's of shape 64 by 2, because there are 64 items in the mini-batch and each one has the independent variables: the user ID and the movie ID. (A question: 'Do deep neural network based models for collaborative filtering work better than more traditional approaches like SVD or other matrix factorization methods?' Let's wait until we get there.) So here's x - here is one user ID / movie ID combination - and then each one of those 64 values here are the ratings.
so now we've created a dot product module from scratch so we can instantiate it passing 01:57:03.600 |
in the number of users the number of movies and let's use 50 factors and now we can create 01:57:07.920 |
a learner now this time we're not creating a CNN learner or a specific application learner 01:57:13.320 |
it's just a totally generic learner so this is a learner that doesn't really know how 01:57:17.160 |
to do anything clever it just stores away the data you give it and the model you give 01:57:21.920 |
it and so when we're not using an application specific learner it doesn't know what loss 01:57:26.280 |
function to use so we'll tell it to use MSE and fit and that's it right so we've just 01:57:34.680 |
fitted our own collaborative filtering model where we literally created the entire architecture 01:57:40.920 |
it's a pretty simple one from scratch so that's pretty amazing now the results aren't great 01:57:50.600 |
if you look at the movie lens data set benchmarks online you'll see this is not actually a great 01:57:57.360 |
result so one of the things we should do is take advantage of the tip we just mentioned 01:58:02.520 |
earlier in this lesson which is when you're doing regression which we are here right the 01:58:07.320 |
number between one and five is like a continuous value we're trying to get as close to it as 01:58:11.040 |
possible we should tell fastai what the range is so we can use y range as before so here's 01:58:22.400 |
exactly the same thing we've got a y range we've stored it away and then at the end we 01:58:28.400 |
use, as we discussed, sigmoid_range - and look, here we pass in *self.y_range, which by default is (0, 5.5) - and we can see... yeah, not really
any better it's worth a try normally this is a little bit better but it always depends 01:58:56.560 |
on when you run it I'll just run it a second time well it's worth looking now there is 01:59:04.840 |
something else we can do though which is that if we look back at our little Excel version 01:59:14.240 |
the thing is here when we multiply you know these latent factors by these latent factors 01:59:22.080 |
and add them up it's not really taking account of the fact that this user may just rate movies 01:59:31.220 |
really badly in general regardless of what kind of movie they are and this movie might 01:59:38.260 |
be just a great movie in general just everybody likes it regardless of what kind of stuff 01:59:42.960 |
they like and so it'd be nice to be able to represent this directly and we can do that 01:59:48.400 |
using something we've already learned about which is bias we could have another single 01:59:53.400 |
number for each movie which we just add and it's another single number for each user which 01:59:59.440 |
we just add right and we've already seen this for linear models you know this idea that 02:00:04.200 |
it's nice to be able to add a bias value so let's do that so that means that we're going 02:00:13.280 |
to need another embedding for each user, of size one - it's just a single number we're going to add. In other words it's just an array lookup, but remember, to do an array lookup that we can take a gradient of, we have to say Embedding. We do the same thing for movie bias, and then all of this is identical to before, and we just add this one extra line, which adds in the user and movie bias values. So let's train that and see how it goes.
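A sketch of the model with the bias terms added (keepdim=True keeps the dot products as a column so the bias embeddings broadcast correctly; the y_range shown is the one used with sigmoid_range above):

```python
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        self.user_factors  = Embedding(n_users, n_factors)
        self.user_bias     = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias    = Embedding(n_movies, 1)
        self.y_range = y_range

    def forward(self, x):
        users  = self.user_factors(x[:, 0])
        movies = self.movie_factors(x[:, 1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:, 0]) + self.movie_bias(x[:, 1])   # the one extra line
        return sigmoid_range(res, *self.y_range)
```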
Well, that was a shame - it got worse. We used to finish around 0.87, and now it's 0.88, 0.89, so it's a little bit worse. Why is that? Well, if you look earlier on in training it was quite a bit better - it was 0.86 - so it's overfitting very quickly. And so what we need to do is we need
to find a way that we can train more epochs without overfitting now we've already learned 02:01:26.760 |
about data augmentation right like rotating images and changing their brightness and color 02:01:31.640 |
and stuff but it's not obvious how we would do data augmentation for collaborative filtering 02:01:37.720 |
right so how are we going to make it so that we can train lots of epochs without overfitting 02:01:46.480 |
and to do that we're going to have to use something called regularization and regularization 02:01:51.360 |
is a set of techniques which basically allow us to use models with lots of parameters and 02:01:57.520 |
train them for a long period of time but penalize them effectively for overfitting or in some 02:02:04.540 |
way cause them to try to stop overfitting - and so that is what we will look at next week.
okay well thanks everybody so there's a lot to take in there so please remember to practice 02:02:18.920 |
to experiment listen to the lessons again because you know for the next couple of lessons 02:02:26.080 |
things are going to really quickly build on top of all the stuff that we've learned so 02:02:30.440 |
please be as comfortable with it as you can feel free to go back and re-listen and go 02:02:36.080 |
through and follow through the notebooks and then try to recreate as much of them yourself 02:02:40.920 |
thanks everybody and I will see you next week or see you in the next lesson whenever you