back to indexLesson 7: Practical Deep Learning for Coders 2022
Chapters
0:0 Tweaking first and last layers
2:47 What are the benefits of using larger models
5:58 Understanding GPU memory usage
8:4 What is GradientAccumulation?
20:52 How to run all the models with specifications
22:55 Ensembling
37:51 Multi-target models
41:24 What does `F.cross_entropy` do
45:43 When do you use softmax and when not to?
46:15 Cross_entropy loss
49:53 How to calculate binary-cross-entropy
52:19 Two versions of cross-entropy in pytorch
54:24 How to create a learner for prediction two targets
62:0 Collaborative filtering deep dive
68:55 What are latent factors?
71:28 Dot product model
78:37 What is embedding
82:18 How do you choose the number of latent factors
87:13 How to build a collaborative filtering model from scratch
89:57 How to understand the `forward` function
92:47 Adding a bias term
94:29 Model interpretation
99:6 What is weight decay and How does it help
103:47 What is regularization
00:00:00.000 |
All right, welcome to lesson seven, the penultimate lesson of practical deep learning for coders 00:00:12.920 |
And today we're going to be digging into what's inside a neural net. 00:00:19.160 |
We've already seen what's inside a kind of the most basic possible neural net, which 00:00:24.720 |
is a sandwich of fully connected layers, or linear layers, and rellues. 00:00:36.520 |
And so we built that from scratch, but there's a lot of tweaks that we can do. 00:00:43.580 |
And so most of the tweaks actually that we probably care about are the tweaking the very 00:00:58.180 |
But over the next couple of weeks, we'll look at some of the tricks we can do inside as 00:01:05.720 |
So I'm going to do this through the lens of the patty, rice patty competition we've been 00:01:14.240 |
talking about, and we got to a point where-- let's have a look. 00:01:35.000 |
We tried a few different types of basic preprocessing. 00:01:46.040 |
And then we scaled that up to larger images and rectangular images. 00:01:56.440 |
And that got us into the top 25% of the competition. 00:02:04.180 |
So that's part two of the so-called road to the top series, which is increasingly misnamed 00:02:17.620 |
More and more of our students have been passing me on the leaderboard. 00:02:22.520 |
So currently first and second place for both people from this class, Kurian and Nick. 00:02:47.280 |
So in part three, I'm going to show you a really interesting trick, a very simple trick 00:02:59.800 |
Watch or discover if you've tried to use larger models. 00:03:02.700 |
So you can replace the word small with the word large in those architectures and try 00:03:11.320 |
More parameters means it can find more tricky little features. 00:03:16.240 |
And broadly speaking models with more parameters therefore ought to be more accurate. 00:03:21.340 |
Problem is that those activations or more specifically the gradients that have to be 00:03:34.000 |
And your GPU is not as clever as your CPU, that kind of sticking stuff it doesn't need 00:03:40.000 |
right now into virtual memory on the hard drive. 00:03:42.600 |
When it runs out of memory, it runs out of memory. 00:03:45.760 |
And it also doesn't do such a good job as your CPU, it kind of shuffling things around 00:03:50.580 |
It just allocates blocks of memory and it stays allocated until you remove them. 00:03:56.660 |
So if you try to scale up your models to bigger models unless you have very expensive GPUs, 00:04:07.840 |
And you'll get an error, something like CUDA out of memory error. 00:04:12.120 |
So if that happens, first thing I'll mention is it's not a bad idea to restart your notebook 00:04:18.080 |
because they can be a bit tricky to recover from otherwise. 00:04:22.320 |
And then I'll show you how you can use as large a model as you like. 00:04:27.640 |
Almost is, you know, basically you'll be able to use an X large model on Kaggle. 00:04:40.920 |
Now when you run something on Kaggle, like actually on Kaggle, you're generally going 00:04:51.840 |
And you don't have to run stuff on Kaggle, you can run stuff on your home computer or 00:04:58.080 |
But sometimes if you want to do Kaggle competition, sometimes you have to run stuff on Kaggle 00:05:02.600 |
because a lot of competitions are what they call code competitions, which is where the 00:05:06.560 |
only way to submit is from a notebook that you're running on Kaggle. 00:05:10.920 |
And then a second reason to run stuff on Kaggle is that, you know, your notebooks will appear, 00:05:20.560 |
you know, with the leaderboard score on them. 00:05:22.600 |
And so people can see which notebooks are actually good. 00:05:26.920 |
And I kind of like, even in things that aren't code competitions, I love trying to be the 00:05:30.800 |
person who's number one on the notebook score leaderboard, because that's something which, 00:05:36.960 |
you know, you can't just work at Nvidia and use 1000 GPUs and win a competition through 00:05:46.760 |
Everybody has the same nine hour timeout to work with. 00:05:51.680 |
So I think it's a good way of keeping the, you know, things a bit more fair. 00:06:02.720 |
So I wanted to find out what can I get away with, you know, in 16 gig. 00:06:07.800 |
And the way I did that is I think a useful thing to discuss because again, it's all about 00:06:14.800 |
So I wanted to really quickly find out how much memory will a model use. 00:06:22.440 |
So there's a really quick hacky way I can do that, which is to say, okay, for the training 00:06:26.480 |
set, let's not use, so here's the value counts of labels, so the number of each disease. 00:06:34.720 |
Let's just pick one, the smallest one, right? 00:06:39.360 |
Our training set is the bacterial panicle blight images. 00:06:43.640 |
And now I can train a model with just 337 images without changing anything else. 00:06:49.200 |
Not that I care about that model, but then I can see how much memory it used. 00:06:54.920 |
It's important to realize that, you know, each image you pass through is the same size, 00:06:58.840 |
each batch size is the same size, so training for longer won't use more memory. 00:07:03.640 |
So that'll tell us how much memory we're going to need. 00:07:09.680 |
So what I then did was I then tried training different models to see how much memory they 00:07:21.120 |
Now, what happens if we train a model, so obviously ConvNec Small doesn't use too much 00:07:25.440 |
memory, so here's something that reports the amount of GPU memory just by basically printing 00:07:31.240 |
out CUDA's GPU processes, and you can see ConvNec Small took up 4 gig. 00:07:43.400 |
If you then call Python's garbage collection, gc.collect, and then call PyTorch's empty 00:07:50.920 |
cache, that should basically get your GPU back to a clean state of not using any more 00:07:57.480 |
memory than it needs to when you can start training the next model without restarting 00:08:04.920 |
So what would happen if we tried to train this little model and it crashed with a CUDA 00:08:13.200 |
We can use a cool little trick called gradient accumulation. 00:08:24.080 |
Well I added this parameter to my train method here. 00:08:28.040 |
So my train method creates by data loaders, creates my learner, and then depending on 00:08:36.360 |
whether I'm fine-tuning or not either fits or fine-tunes it. 00:08:50.560 |
I set my batch size, so that's the number of images that I pass through to the GPU all 00:08:55.960 |
at once, to 64, which is my default, divided by -- // means integer divide in Python -- divided 00:09:06.860 |
So if I pass 2, it's going to use a batch size of 32. 00:09:15.560 |
Now that obviously should let me cure any memory problems, use a smaller batch size. 00:09:22.460 |
But the problem is that now the dynamics of my training are different, right? 00:09:27.240 |
The smaller your batch size, the more volatility there is from batch to batch. 00:09:31.240 |
So now your learning rates are all messed up. 00:09:33.480 |
You don't want to be messing around with trying to find a different set of optimal parameters 00:09:38.760 |
for every batch size, for every architecture. 00:09:44.200 |
So what we want to do is find a way to run just let's say accumulate equals 2. 00:09:51.400 |
Let's say we just want to run 32 images at a time through. 00:09:55.740 |
How do we make it behave as if it was 64 images? 00:10:00.520 |
Well the solution to that problem is to consider our training loop. 00:10:05.320 |
This is basically the training loop we used from a couple of lessons ago, the one we created 00:10:10.880 |
So each x, y pair in the data loader, we calculate the loss using some coefficients based on 00:10:20.280 |
And then we call backward on that loss to calculate the gradients. 00:10:25.120 |
And then we subtract from the coefficients the gradients times the learning rate. 00:10:31.920 |
So I've skipped a bit of stuff like the with torch.nograd thing, actually no I don't need 00:10:41.400 |
I've skipped out printing the loss that's about it. 00:10:46.640 |
So here is a variation of that loop where I do not always subtract the gradient times 00:10:59.320 |
Instead I go through each x, y pair in the data loader, I calculate the loss, I look 00:11:11.120 |
So initially I start at zero and this count is going to be 32, say if I've divided the 00:11:17.720 |
And then if count is greater than 64, I do my coefficients update, well it's not. 00:11:32.560 |
And if you remember there was this interesting subtlety in PyTorch which is if you call back 00:11:37.280 |
word again without zeroing out the gradients then it adds this set of gradients to the 00:11:49.360 |
So by doing these two half size batches without zeroing out the gradients between them it's 00:11:56.940 |
So I'm going to end up with the total gradient of a 64 image batch size but passing only 00:12:07.920 |
If I used accumulate equals four it would go through this four times adding them up 00:12:13.400 |
before it subtracted out the coefficients.grad times learning rate and zeroed it out. 00:12:21.180 |
If I put in a Q equals 64 it would go through into a single image one at a time. 00:12:28.360 |
And after 64 passes through eventually count would be greater than 64 and we would do the 00:12:37.600 |
It's a very simple idea which is that you don't have to actually update your weights 00:12:55.200 |
But it has quite significant implications which I find most people seem not to realize 00:13:02.360 |
which is if you look on like Twitter or Reddit or whatever people can say oh I need to buy 00:13:08.140 |
a bigger GPU to train bigger models but they don't. 00:13:16.040 |
And so given the huge price differential between say a RTX 3080 and an RTX 3090 Ti huge price 00:13:27.120 |
differential the performance is not that different. 00:13:35.040 |
Just put in a bit smaller batch size and do gradient accumulation. 00:13:37.960 |
So there's actually not that much reason to buy giant GPUs. 00:13:47.160 |
Other results with gradient accumulation numerically identical? 00:13:50.760 |
They're numerically identical for this particular architecture. 00:13:59.200 |
There is something called batch normalization which we will look at in part two of the course 00:14:06.840 |
which keeps track of the moving average of standard deviations and averages and does it 00:14:20.040 |
in a mathematically slightly incorrect way as a result of which if you've got batch normalization 00:14:25.640 |
then it basically will introduce more volatility which is not necessarily a bad thing but because 00:14:31.120 |
it's not mathematically identical you won't necessarily get the same results. 00:14:35.240 |
One next doesn't use batch normalization so it is the same. 00:14:40.640 |
And in fact a lot of the models people want to use really big versions of which is NLP 00:14:46.000 |
ones transformers tend not to use batch normalization but instead they use something called layer 00:14:51.720 |
normalization which doesn't have the same issue. 00:15:01.360 |
In practice I found adding gradient accumulation for ConfNEXT has not caused any issues for 00:15:10.400 |
I don't have to change any parameters when I do it. 00:15:18.600 |
Tamori asking shouldn't it be count greater than equal to 64 if BS equals 64? 00:15:31.160 |
So we start at zero then it's going to be 32 then it's going to be, yeah, yeah, probably. 00:15:37.160 |
You can probably tell I didn't actually run this code. 00:15:39.320 |
Madhav is asking does this mean that LR find is based on the batch size set during the 00:15:47.400 |
Yeah, so LR find just uses your data loaders batch size. 00:15:55.400 |
Edward is asking why do we need gradient accumulation rather than just using a smaller batch size 00:16:02.040 |
and follows up with how would we pick a good batch size? 00:16:05.080 |
Well just if you use a smaller batch size, here's the thing, right? 00:16:09.760 |
But architectures have different amounts of memory, you know, which they take up. 00:16:19.000 |
And so you'll end up with different batch sizes for different architectures. 00:16:26.140 |
Which is not necessarily a bad thing but each of them is going to then need a different 00:16:29.360 |
learning rate and maybe even different weight decay or whatever. 00:16:33.640 |
Like the kind of the settings that's working really well for batch size 64 won't necessarily 00:16:40.880 |
And you know you want to be able to experiment as easily and quickly as possible. 00:16:46.760 |
I think the second part of your question was how do you pick an optimal batch size? 00:16:51.200 |
Honestly the standard approach is to pick the largest one you can just because it's 00:16:57.720 |
faster that way you're getting more parallel processing going on. 00:17:05.200 |
So to be honest I quite often use batch sizes that are quite a bit smaller than I need because 00:17:11.560 |
quite often it doesn't make that much difference. 00:17:14.000 |
But yeah the rule of thumb would be you know pick a batch size that fits in your GPU and 00:17:22.360 |
for performance reasons I think it's generally a good idea to have it be a multiple of eight. 00:17:28.720 |
Everybody seems to always use powers of two I don't know like I don't think it actually 00:17:33.640 |
Look there's one other just a clarification or a check if the learning rate should be 00:17:40.320 |
Yeah so generally speaking the rule of thumb is that if you divide the batch size by two 00:17:50.360 |
Did you have a question Nick if you do you can okay cool yeah now that's us all caught 00:18:02.680 |
So gradient accumulation in fast AI is very straightforward. 00:18:08.840 |
You just divide the batch size by however much you want to divide it by and then add 00:18:14.920 |
something called a callback and a callback is something which changes the way the model 00:18:18.680 |
trains this call that's called gradient accumulation and you pass in the effective batch size you 00:18:26.520 |
And then you say when you create the learner you say these are the callbacks I want and 00:18:31.360 |
so it's going to pass in gradient accumulation callbacks so it's going to only update the 00:18:45.280 |
So if we pass in a Q equals one it won't do any gradient accumulation and that uses four 00:18:51.720 |
gig if we use Q equals two about three gig Q equals four about two and a half gig and 00:19:01.600 |
generally the bigger the model the closer you'll get to a kind of a linear scaling because 00:19:06.360 |
models have a kind of a bit of overhead that they have anyway. 00:19:13.240 |
So what I then did was I just went through all the different models I wanted to try so 00:19:16.600 |
I wanted to try conf next large add a 320 by 240, VIT large, SWIN V2 large, SWIN large 00:19:26.960 |
and on each of these I just tried running it with a Q equals one and actually every single 00:19:31.800 |
time for all of these I got them out of memory error and then I tried each of them independently 00:19:36.200 |
with a Q equals two and it turns out that all of these worked with a Q equals two and 00:19:41.200 |
it only took me 12 seconds each time so that was a very quick thing for me then okay and 00:19:46.440 |
I now know how to train all of these models on a 16 gigabyte card so I can check here they're 00:19:56.360 |
So then I just created a little dictionary of all the architectures I wanted and for 00:20:03.880 |
each architecture all of the resize methods I wanted and final sizes I wanted. 00:20:11.560 |
Now these models VIT, SWIN V2 and SWIN are all transformers models which means that well 00:20:23.400 |
most transformers models nearly all of them have a fixed size this one's 224 this one's 00:20:27.960 |
192 this one's 224 so I have to make sure that my final size is a square of the size 00:20:33.600 |
required otherwise I get an error. There is a way of working around this but I haven't 00:20:43.160 |
experimented with it enough to know when it works well and when it doesn't so we'll probably 00:20:49.600 |
So for now it's going to use the size that they ask us to use so with this dictionary 00:20:54.680 |
of architectures and for each architecture kind of pre-processing details we switch the 00:21:00.400 |
training path back to using all of our images and then we can loop through each architecture 00:21:06.960 |
and loop through each item transforms and sizes and train the model and then the training 00:21:21.480 |
script if you're fine-tuning returns the TTA predictions. 00:21:38.520 |
So I append all those TTA predictions for each model for each type into a list and after 00:21:44.800 |
each one it's a good idea to do this garbage collection and empty cache that because otherwise 00:21:49.780 |
I find what happens is your GPU memory kind of I don't know I think it gets fragmented 00:21:55.920 |
or something and after a while it runs out of memory even when you thought it wouldn't 00:21:59.680 |
so this way you can really do as much as you like without running out of memory. 00:22:03.960 |
So they all train train train train and one key thing to note here is that in my train 00:22:11.800 |
script my data loaders does not have the seed equals parameter so I'm using a different 00:22:24.820 |
training set every time so that means that for each of these different runs they're using 00:22:35.560 |
also different validation sets so they're not directly comparable but you can kind of 00:22:39.880 |
see they're all doing pretty well 2.1 percent 2.3 percent 1.7 percent and so forth. 00:22:47.400 |
So why am I using different training and validation sets for each of these that's because I want 00:22:54.060 |
to ensemble them so I'm going to use bagging which is I am going to take the average of 00:23:06.040 |
their predictions now I mean really when we talked about random forest bagging we were 00:23:10.240 |
taking the average of like intentionally weak models these are not intentionally weak models 00:23:15.120 |
they have to be good models but they're all different they're using different architectures 00:23:18.720 |
and different pre-processing approaches and so in general we would hope that these different 00:23:23.240 |
approaches some might work well for some images and some might work well for other images 00:23:28.000 |
and so when we average them out hopefully we'll get a good blend of kind of different 00:23:32.120 |
ideas which is kind of what you want in bagging so we can stack up that list of different 00:23:41.080 |
of all the different probabilities and take their main and so that's going to give us 00:23:46.680 |
3469 predictions that's our test set size and each one has 10 probabilities the probability 00:23:55.080 |
of each disease and so then we can use arg max to find which probability index is the 00:24:04.760 |
highest so that's what it's going to give us our list of indexes so this is basically 00:24:10.560 |
the same steps as we used before to create our CSV submission file so at the time of 00:24:18.460 |
creating this analysis that got me to the top of the leaderboard and in fact these are 00:24:24.360 |
my four submissions and you can see each one got better now you're not always going to 00:24:30.080 |
get this nice monotonic improvement right but you want to be trying to submit something 00:24:34.200 |
every day to kind of like try out something new right and the more you practice the more 00:24:42.880 |
you'll get a good intuition of what's going to help right so partly I'm showing you this 00:24:47.640 |
to say it's not like purely random as to whether things work or don't once you've been doing 00:24:53.200 |
this for a while you know you will generally be improving things most of the time so as 00:25:02.000 |
you can see from the descriptions my first submission was our conf next small for 12 00:25:06.440 |
epochs with TTA and then a ensemble of confidence so it's basically this exact same thing but 00:25:14.120 |
just retraining a few with different training subsets and then this is the same thing again 00:25:20.200 |
this is the thing we just saw basically the ensemble of large bottles with TTA and then 00:25:27.400 |
the last one was something I skipped over which was I the the VIT models were the best 00:25:35.800 |
in my testing so I basically weighted them as double in the ensemble a pretty unscientific 00:25:42.560 |
but again it gave it a another boost and so that was that was it all right John yes thanks 00:25:55.520 |
Jeremy so no particular order Korean is asking would trying out cross-validation with K folds 00:26:02.120 |
with the same architecture makes sense okay so and sombling of models yes a popular thing 00:26:08.400 |
is to do K fold cross-validation so K fold cross-validation is something very very similar 00:26:15.080 |
to what I've done here so what I've done here is I've trained a bunch of models with different 00:26:24.000 |
training sets each one is a different random 80% of the data five fold cross-validation 00:26:31.320 |
does something as similar but what it says is rather than picking like say five samples 00:26:38.440 |
out with different random subsets in fact instead first like do all except for the first 00:26:46.600 |
20% of the data and then all but the second 20% and then all but the third third and so 00:26:50.760 |
forth and so you end up with five subsets each of which have non-overlapping validation 00:26:57.200 |
sets and then you'll ensemble those you know in a thin theory maybe that could be slightly 00:27:05.920 |
better because you're kind of guaranteed that every row is appears four times you know effectively 00:27:16.440 |
it also has a benefit that you could average those five validation sets because there's 00:27:22.200 |
no kind of overlap between them to get a cross-validation personally I generally don't bother but the 00:27:29.640 |
reason I don't is because this way I can add and remove models very easily I don't you 00:27:39.600 |
know I can just you know add another architecture and whatever to my ensemble without trying 00:27:46.200 |
to find a different overlapping non-overlapping subset so yeah cross-validation is therefore 00:27:53.820 |
something that I use probably less than most people or almost or almost never awesome thank 00:28:01.880 |
you are there any just come back to gradient accumulation any other kind of drawbacks or 00:28:07.760 |
potential gotchas with gradient accumulation no not really yeah like amazingly it doesn't 00:28:17.600 |
even really slow things down much you know going from a batch size of 64 to a batch size 00:28:22.320 |
of 32 by definition you had to do it because your GPU is full so you're obviously giving 00:28:28.040 |
a lot of data so it's probably going to be using its processing speed pretty effectively 00:28:33.640 |
so yeah no it's just it's just a good technique that we should all be buying cheaper graphics 00:28:42.600 |
cards with less memory in them and using you know have like I don't know the prices I suspect 00:28:48.620 |
like you could probably buy like 230 80s for the price of 130 90 ti or something that would 00:28:54.760 |
be a very good deal yes clearly you're not on the nvidia payroll so look this is a good 00:29:02.480 |
segue then we did have a question about sort of GPU recommendations and there's been a bit 00:29:07.280 |
of chat on that as well I bet so any any you know commentary any additional commentary 00:29:12.520 |
around GPU recommendations no not really I mean obviously at the moment nvidia is the 00:29:21.040 |
only game in town you know if you buy if you trying to use a you know apple m1 or m2 or 00:29:28.200 |
or an AMD card you're basically in for a world of pain in terms of compatibility and stuff 00:29:33.880 |
and unoptimized libraries and whatever the the nvidia consumer cards so the ones that 00:29:45.360 |
start with RTX are much cheaper but are just as good as the expensive enterprise cards 00:29:56.720 |
so you might be wondering why anybody would buy the expensive enterprise cards and the 00:30:00.760 |
reason is that there's a licensing issue that nvidia will not allow you to use an RTX consumer 00:30:07.640 |
card in a data center which is also why cloud computing is more expensive than they kind 00:30:15.600 |
of ought to be because everybody selling cloud computing GPUs is selling these cards that 00:30:21.600 |
are like I can't remember I think they're like three times more expensive for kind of 00:30:24.760 |
the same features so yeah if you do get serious about deep learning to the point that you're 00:30:31.200 |
prepared to invest you know a few days in administering a box and you know I guess depending 00:30:39.880 |
on prices hopefully will start to come down but currently a thousand or two thousand or 00:30:43.320 |
two thousand dollars on buying a GPU then you know that'll probably pay you back pretty 00:30:48.680 |
quickly great thank you um let's see another one's come in uh if you have a back on models 00:30:58.320 |
not hardware if you have a well functioning but large model can it make sense to train 00:31:03.240 |
a smaller model to produce the same final activations as the larger model oh yeah absolutely 00:31:10.120 |
i'm not sure we'll get into that this time around but um yeah um we'll cover that in 00:31:17.400 |
part two i think but yeah basically there's a kind of teacher student models and model 00:31:23.000 |
distillation which broadly speaking there there are ways to make inference faster by 00:31:29.680 |
training small models that work the same way as large models great thank you all right 00:31:37.240 |
so that is the actual real end of road to the top because beyond that we don't actually 00:31:45.120 |
cover how to get closer to the top you'd have to ask korean to share his techniques to find 00:31:50.080 |
out that or nick to get the second place in the top um part four is actually um something 00:31:58.900 |
that i think is very useful to know about for for learning and it's going to teach us 00:32:02.120 |
a whole lot about how the last layer of a neural networks and specifically what we're 00:32:10.600 |
going to try to do is we're going to try to build a model that doesn't just predict the 00:32:17.320 |
disease but also predicts the type of rice so how would you do that so here's the data 00:32:25.640 |
loader we're going to try to build it's going to be something that for each image it tells 00:32:32.320 |
us the disease and the type of rice i say disease sometimes normal i guess some of them 00:32:37.920 |
are not diseased so to build a model that can predict two things the first thing is 00:32:46.720 |
you're going to need data loaders that have two dependent variables and that is shockingly 00:32:53.560 |
easy to do in fastai thanks to the data block so we've seen the data block before we haven't 00:33:02.920 |
been using it for the patty competition so far because we haven't needed it we could 00:33:06.680 |
just use image data loader dot from folder so that's like the the highest level api the 00:33:12.920 |
simplest api if we go down a level deeper into the data block we have a lot more flexibility 00:33:19.960 |
so if you've been following the walkthroughs you'll know that as i built this the first 00:33:24.720 |
thing i actually did was to simply replicate the previous notebook but replace the image 00:33:31.040 |
data loader dot from folders with the data block to try to do first of all exactly the 00:33:35.240 |
same thing and then i added the second dependent variable so if we look at the previous image 00:33:45.360 |
data loader from folders thingy here it is where you're passing in some item transforms 00:33:55.440 |
and some batch transforms and we had something saying what percentage should be the validation 00:34:01.600 |
set so in a data block if you remember we have to pass in a blocks argument saying what 00:34:12.920 |
kind of data is the independent variable and what is the dependent variable so to replicate 00:34:18.280 |
what we had before we would just pass in image block comma category block because we've got 00:34:22.440 |
an image as our independent variable and a category one type of rice is the dependent 00:34:27.400 |
variable so the new thing i'm going to show you here is that you don't have to only put 00:34:31.600 |
in two things you can put in as many as you like so if you put in three things we're going 00:34:37.200 |
to generate one image and two categories now fastai if you're saying i want three things 00:34:44.360 |
fastai doesn't know which of those is the independent variable and which is the dependent 00:34:48.800 |
variable so the next thing you have to tell it is how many inputs are there number of 00:34:53.120 |
inputs and so here i've said there's one input so that means this is the input and therefore 00:34:58.240 |
by definition two categories will be the output because remember we're trying to predict two 00:35:02.960 |
things the type of rice and the disease okay this is the same as what we've seen before 00:35:09.000 |
to find out to get our list of items we'll call get image files now here's something 00:35:14.280 |
we haven't seen before get y is our labeling function normally we pass to get y a single 00:35:20.720 |
thing such as the parent label function which looks at the name of the parent directory 00:35:27.520 |
which remember is how these images are structured and that would tell us the label but get y 00:35:34.320 |
can also take an array and in this case we want two different labels one is the name 00:35:41.640 |
of the parent directory because that's the disease the second is the variety so what's 00:35:46.880 |
get variety get variety is a function so let me explain how this function works so we can 00:35:54.480 |
create a data frame containing our training data that came from Kaggle so for each image 00:36:01.240 |
it tells us the disease and the variety and what i did is something i haven't shown before 00:36:09.720 |
in pandas you can set one column to be the index and when you do that in this case image 00:36:15.560 |
id it makes this series this sorry this data frame kind of like a dictionary i can index 00:36:23.200 |
into it by saying tell me the row for this image and to do that you use the lock attribute 00:36:30.560 |
the location so we want in the data frame the location of this image and then you can 00:36:39.600 |
also say optionally what column you want this column and so here's this image and here's 00:36:47.320 |
this column and as you can see it returns that thing so hopefully now you can see it's 00:36:53.160 |
pretty easy for us to create a function that takes a row sorry a path and returns the location 00:37:06.480 |
in the data frame of the name of that file because remember these are the names of files 00:37:14.080 |
for the variety column so that's our second get y okay and then we've seen this before 00:37:22.800 |
randomly split the data into the 20 percent and so 80 percent and so we could just squish 00:37:30.720 |
them all to 192 just for this example and then use data augmentation to get us down 00:37:35.920 |
to 128 square images just for this example um and so that's what we get when we say show 00:37:44.880 |
batch we get what we just discussed so now we need a model that predicts two things how 00:38:00.000 |
do we create a model that predicts two things well the key thing to realize is we never 00:38:05.920 |
actually had a model that predicts two things we had a model that predicts 10 things before 00:38:12.880 |
the 10 things we predicted is the probability of each disease so we don't actually now want 00:38:19.520 |
a model that predicts two things we want a model that predicts 20 things the probability 00:38:24.280 |
of each of the 10 diseases and the probability of each of the 10 varieties so how could we 00:38:35.480 |
do that well let's first of all try to just create the same disease model we had before 00:38:44.320 |
with our new data loader and so this is going to be reasonably straightforward the key thing 00:38:50.240 |
to know is that since we told fastai that there's one input and therefore by definition 00:38:57.960 |
there's two outputs it's going to pass to our metrics and to our um loss functions three 00:39:07.560 |
things instead of two the predictions from the model and the disease and the variety 00:39:15.760 |
so if we're gonna so we can't just use error rate as our metric anymore because error rate 00:39:20.680 |
takes two things instead we have to create a function that takes three things and return 00:39:26.320 |
error rate are the two things we want which is the predictions from the model and the 00:39:33.120 |
disease okay so predictions the model this is the target so that's actually all we need 00:39:38.760 |
to do to define a metric that's going to work with our new data set with a new data loader 00:39:46.880 |
this is not going to actually tell us anything about variety first it's going to try to replicate 00:39:50.680 |
something that can do just disease so when we create our learner we'll pass in this new 00:39:56.840 |
disease error function okay so halfway there the other thing we're going to need is to 00:40:04.740 |
change our loss function now we never actually talked about what loss function to use and 00:40:12.320 |
that's because vision learner guessed what loss function to use vision learner saw that 00:40:20.120 |
our dependent variable is a single category and it knows the best loss function that's 00:40:25.200 |
probably going to be the case for things with a single category and it knows how big the 00:40:28.440 |
category is so it just didn't bother us at all just said okay I'll figure it out for 00:40:33.520 |
you so the only time we've provided our own loss function is when we were kind of doing 00:40:41.880 |
linear models and neural nets from scratch and we did I think mean squared error we might 00:40:47.160 |
also have done mean absolute error neither of those work when the dependent variable 00:40:55.720 |
is a category how would you use mean squared error or mean absolute error to say how close 00:41:02.760 |
were these 10 probability predictions to this one correct answer so in this case we have 00:41:11.120 |
to use a different loss function we have to use something called cross entropy loss and 00:41:15.440 |
this is actually the loss function that fast AI picked for us before without us knowing 00:41:21.680 |
but now that we are having to pick it out manually I'm going to explain to you exactly 00:41:26.480 |
what cross entropy loss does okay and you know these details are very important indeed 00:41:37.720 |
like remember I said at the start of this class the stuff that happens in the middle 00:41:41.120 |
of the model you're not going to have to care about much in your life if ever but the stuff 00:41:46.080 |
that happens in the first layer and the last layer including the loss function that sits 00:41:50.440 |
between the last layer and the loss you're going to have to care about a lot right this 00:41:54.340 |
stuff comes up all the time so you definitely want to know about cross entropy loss and 00:42:00.640 |
so I'm going to explain it using a spreadsheet the spreadsheets in the course repo and so 00:42:09.520 |
let's say you are predicting something like a kind of a mini image net thing where you're 00:42:15.360 |
trying to predict whether something an image is a cat a dog a plane a fish or a building 00:42:20.480 |
so you set up some model whatever it is a conf next model or just a big bunch of linear 00:42:27.320 |
layers connected up or whatever and initially you've got some random weights and it spits 00:42:34.640 |
out at the end five predictions right so remember to predict something with five categories 00:42:41.840 |
your model will spit out five probabilities now it doesn't initially spit out probabilities 00:42:46.920 |
there's nothing making them probabilities it just spits out five numbers could be negative 00:42:52.680 |
could be positive okay so here's the output of the model so what we want to do is we want 00:43:02.720 |
to convert these into probabilities and so we do that in two steps the first thing we 00:43:12.600 |
do is we go exp that's e to the power of we go e to the power of each of those things 00:43:23.360 |
like so okay and so here's the mathematical formula we're using this is called the soft 00:43:27.920 |
max what we're working through we're going to go through each of the categories so these 00:43:38.040 |
are our five categories so here K is five we're going to go through each of our categories 00:43:41.600 |
and we're going to go e to the power of the output so ZJ is the output for the jth category 00:43:50.880 |
so here's that and then we're going to sum them all together here it is sum up together 00:43:57.320 |
okay so this is the denominator and then the numerator is just e to the power of the thing 00:44:06.480 |
we care about so this row so the numerator is e to the power of cat on this row e to 00:44:17.240 |
the power of dog on this row and so forth now if you think about it since the denominator 00:44:25.580 |
adds up all the e to the power ofs and when we do each one divided by the sum that means 00:44:33.280 |
the sum of these will equal one by definition right and so now we have things that can be 00:44:42.120 |
treated as probabilities they're all numbers between zero and one numbers that were bigger 00:44:48.720 |
in the output will be bigger here but there's something else interesting which is because 00:44:53.560 |
we did either the power of it means that the bigger numbers will be like pushed up to numbers 00:45:00.200 |
closer to one like we're saying like oh really try to pick one thing as having most of the 00:45:06.680 |
probability because we are trying to predict you know one thing we're trying to predict 00:45:12.240 |
which one is it and so this is called softmax so sometimes you'll see people complaining 00:45:20.640 |
about the fact that their model which they said let's say is it a teddy bear or a grizzly 00:45:27.560 |
bear or a black bear and they faded a picture of the cat and they say oh the models wrong 00:45:33.440 |
because it predicted grizzly bear it's not a grizzly bear as you can see there's no way 00:45:37.440 |
for this to predict anything other than the categories we're giving it we're forcing it 00:45:42.680 |
to that now we don't if you want that like it's something else you could do which is 00:45:48.640 |
you could actually have them not add up to one right you could instead have something 00:45:53.880 |
which simply says what's the probability it's a cat what's probably it's a dog was put it 00:45:57.680 |
by totally separately and they could add up to less than one out of that situation you 00:46:03.280 |
can say you know or more than one which case you could have like more than one thing being 00:46:07.160 |
true or zero things being true but in this particular case where we want to predict one 00:46:12.760 |
and one thing only we use softmax the first part of the cross entropy formula the first 00:46:24.560 |
part of the cross entropy formula in fact let's look it up and end up cross entropy loss the 00:46:34.840 |
first part of what cross entropy loss in PyTorch does is to calculate the softmax it's actually 00:46:46.640 |
the log of the softmax but don't worry about that too much it's just a slightly faster 00:46:51.240 |
to do the log okay so now for each one of our five things we've got a probability the 00:47:03.920 |
next step is the actual cross entropy calculation which is we take our five things we've got 00:47:09.360 |
our five probabilities and then we've got our axials now the truth is the actual you 00:47:16.640 |
know the five things would have indices right zero one two three or four the actual turned 00:47:22.040 |
out to be the number one but what we tend to do is we think of it as being one hot encoded 00:47:27.880 |
which is we put a one next to the thing for which it's true and a zero everywhere else 00:47:36.020 |
and so now we can compare these five numbers to these five numbers and we would expect 00:47:42.920 |
to have a smaller loss if the softmax was high where the actual is high and so here's 00:47:53.520 |
how we calculate this is the formula the cross entropy loss we sum up they switch to M this 00:48:03.000 |
time for some reason but the same thing we sum up across the five categories M is five 00:48:08.240 |
and for each one we multiply the actual target value so that's zero so here it is here the 00:48:16.000 |
actual target value and we multiply that by the log of the predicted 00:48:27.840 |
probability the log of red the predicted probability and so of course for four of these that value 00:48:37.960 |
is zero because here yj equals zero by definition for all but one of them because it's one hot 00:48:47.520 |
encoded so for the one that it's not we've got our actual times the log softmax okay 00:49:00.640 |
and so now actually you can see why PyTorch prefers to use log softmax because that it 00:49:06.640 |
kind of skips over having to do this log at all so this equation looks slightly frightening 00:49:16.260 |
but when you think about it all it's actually doing is it's finding the probability for 00:49:21.680 |
the one that is one and taking its log right it's kind of weird doing it as a sum but in 00:49:28.040 |
math it can be a little bit tricky to kind of say oh look this up in an array which is 00:49:31.440 |
basically all it's doing but yeah basically at least in this case first for a single result 00:49:37.320 |
where it's softmax this is all it's doing because it's finding the point eight seven 00:49:41.040 |
where it's one four and taking the log and then finally negative so that is what cross 00:49:49.960 |
entropy loss does we add that together for every row so here's what it looks like if 00:50:04.440 |
we add it together over every row right so n is the number of rows and here's a special 00:50:11.080 |
case this is called binary cross entropy what happens if we're not predicting which of five 00:50:17.840 |
things it is but we're just predicting is it a cat so that case if you look at this approach 00:50:24.040 |
you end up with this formula which it's it's this is identical to this formula but in for 00:50:33.840 |
just two cases which is you've either you either are a cat or you're not a cat right 00:50:41.480 |
and so if you're not a cat it's one minus you are a cat and same with the probability 00:50:46.320 |
you've got the probability you are a cat and then not a cat is one minus that so here's 00:50:52.360 |
this special case of binary cross entropy and now our rows represent rows of data okay so 00:50:59.200 |
each one of these is a different image a different prediction and so for each one I'm just predicting 00:51:04.760 |
are you a cat and this is the actual and so the actual are you not a cat is just one minus 00:51:10.720 |
that and so then these are the predictions that came out of the model again we can use 00:51:18.720 |
softmax or it's it's binary equivalent and so that will give you a prediction that you're 00:51:24.580 |
a cat and the prediction that it's not a cat is one minus that and so here is each of the 00:51:35.880 |
part yi times log of p yi and here is why did I subtract that's weird oh because I've 00:51:49.960 |
got minus of both so I just do it this way avoids parentheses yeah minus the are you 00:51:56.880 |
not a cat times the log of the prediction of are you not a cat and then we can add those 00:52:02.640 |
together and so that would be the binary cross entropy loss of this data set of five cat 00:52:09.320 |
or not cat images now if you've got an eagle eye you may have noticed that I am currently 00:52:28.920 |
looking at the documentation for something called Anna and cross entropy loss but over 00:52:34.680 |
here I had something called f cross entropy basically it turns out that all of the loss 00:52:41.400 |
functions in pytorch have two versions there's a version which is a class this is a class 00:52:50.320 |
which you can instantiate passing in various tweaks you might want and there's also a version 00:52:57.120 |
which is just a function and so if you don't need any of these tweaks you can just use 00:53:02.720 |
the function the functions live in a kind of remember what the sub module called I think 00:53:10.480 |
it might be like torch dot n n dot functional but everybody including the pytorch official 00:53:15.560 |
docs just calls a capital F so that's what this capital F refers to so our loss if we 00:53:23.400 |
just care about disease we're going to be past the three things but just going to calculate 00:53:26.880 |
cross entropy on our input versus disease all right so that's all fine we passed so 00:53:34.680 |
now when we create a vision learner you can't rely on fast AI to know what loss function 00:53:39.400 |
to use because we've got multiple targets so you have to say this is the loss function 00:53:43.860 |
I want to use this is the metrics I want to use and the other thing you can't rely on 00:53:48.280 |
is that fast AI no longer knows how many activations to create because again it there's more than 00:53:55.100 |
one target so you have to say the number of outputs to create at the last layer is 10 00:54:00.160 |
this is just saying what's the size of the last matrix and once we've done that we can 00:54:08.120 |
train it and we get you know basically the same kind of result as we always get because 00:54:13.280 |
this model at this point is identical to our previous conf next small model we've just done 00:54:21.040 |
it in a slightly more roundabout way so finally before our break I'll show you how to expand 00:54:29.520 |
this now into a multi-target model and the trick is actually very simple and you might 00:54:36.120 |
have almost got the idea of it when I talked about it earlier our vision learner now requires 00:54:42.680 |
20 outputs we now need that last matrix to have to produce 20 activations not 10 10 of 00:54:51.360 |
those activations are going to predict the disease and 10 of the activations are going 00:54:57.800 |
to predict the variety so you might be then asking like well how does the model know what 00:55:03.360 |
it's meant to be predicting and the answer is with the loss function you're going to 00:55:08.920 |
have to tell it so for example disease loss remember it's going to get the input the disease 00:55:18.000 |
in the variety this is now going to have 20 columns in so we're just going to decide all 00:55:25.480 |
right we're just going to decide the first 10 columns we're going to decide are the prediction 00:55:29.640 |
of what the disease is which of the probability of each disease so we can now passed across 00:55:34.560 |
entropy the first 10 columns and the disease target so the way you read this colon means 00:55:44.460 |
every row and then colon 10 means every column up to the 10th so these are the first 10 columns 00:55:56.400 |
and that will that's a loss function that just works on predicting disease using the 00:56:01.120 |
first 10 columns for variety we'll use cross entropy loss with the target of variety and 00:56:08.480 |
this time we'll use the second 10 columns so here's column 10 onwards so then the overall 00:56:15.960 |
loss function is the sum of those two things disease loss plus variety loss 00:56:27.200 |
and that's actually it that's all the model needs to basically it's now going to if you 00:56:34.080 |
kind of think through the manual neural nets we've created this loss function will be reduced 00:56:42.160 |
when the first 10 columns are doing a good job of predicting disease probabilities and 00:56:46.760 |
the second 10 columns are doing a good job of predicting the variety probabilities and 00:56:50.240 |
therefore the gradients will point in an appropriate direction that the coefficients will get better 00:56:56.160 |
and better at using those columns for those purposes it would be nice to see the error 00:57:04.520 |
rate as well for each of disease and variety so we can call error rate passing in the first 00:57:10.280 |
10 columns and disease and then variety the second 10 columns and variety and we may as 00:57:18.240 |
well also add to the metrics the losses and so now when we create a learner we're going 00:57:24.560 |
to pass in as the loss function the combined loss and as the metrics our list of all the 00:57:31.480 |
metrics and n out equals 20 and now look what happens when we train as well as telling us 00:57:40.000 |
the overall train in valid loss it also tells us the disease and variety error and the disease 00:57:44.880 |
and variety loss and you can see our disease error is getting down to similar levels it 00:57:50.680 |
was before it's slightly less good but it's similar it's not surprising it's slightly 00:57:59.920 |
less good because we've only given it the same number of epochs and we're now asking 00:58:05.720 |
it to try to do more stuff which is to learn to recognize what the rice variety looks like 00:58:10.760 |
and also learns to recognize what the disease looks like here's the counterintuitive thing 00:58:16.600 |
though if we train it for longer it may well turn out that this model which is trying to 00:58:23.680 |
predict two things actually gets better at predicting disease than our disease specific 00:58:30.240 |
model why is that like that sounds weird right because we're trying to have to do more stuff 00:58:37.080 |
that's the same size model well the reason is that quite often it'll turn out that the 00:58:44.400 |
kinds of features that help you recognize a variety of rice are also useful for recognizing 00:58:51.760 |
the disease you know maybe there are certain textures right or maybe some diseases impact 00:58:59.440 |
different varieties different ways so it'd be really helpful to know what variety it was 00:59:05.280 |
so I haven't tried training this for a long time and I don't know the answer is in this 00:59:10.400 |
particular case does a multi-target model do better than a single target model at predicting 00:59:15.200 |
disease but I just want to let you know sometimes it does right so for example a few years ago 00:59:20.960 |
there was a Kaggle competition for recognizing the kinds of fish on a boat and I remember we 00:59:28.400 |
ended up doing a multi-target model where we tried to predict a second thing I can't even remember 00:59:35.040 |
what it was maybe it was a type of boat or something and it definitely turned out in that 00:59:38.560 |
Kaggle competition that predicting two things helped you predict the type of fish better 00:59:42.880 |
than predicting just the type of fish so there's at least you know there's two reasons to learn 00:59:49.840 |
about multi-target models one is that sometimes you just want to be able to predict more than one 00:59:55.040 |
thing so this is useful and the second is sometimes this will actually be better at predicting just 01:00:00.560 |
one thing than a just one thing model and of course the third reason is it really forced us to 01:00:07.840 |
dig quite deeply into these loss functions and activations in a way we haven't quite done before 01:00:13.920 |
so it's okay it's absolutely okay if this is confusing 01:00:23.760 |
the way to make it not confusing is well the first thing I do is like go back to our earlier 01:00:32.640 |
models where we did stuff by hand on like the Titanic data set and built our own architectures 01:00:39.600 |
and maybe you could try to build a model that predicts two things in the Titanic dataset maybe 01:00:46.080 |
you could try to predict both sex and survival or something like that or or class and survival 01:00:54.320 |
because that's kind of kind of forced you to look at it on very small data sets and then the other 01:00:59.520 |
thing I'd say is run this notebook and really experiment at trying to see what kind of outputs 01:01:08.000 |
you get like actually look at the inputs and look at the outputs and look at the data loaders and so 01:01:12.000 |
forth all right let's have a six minute break um so I'll see you back here at 10 past seven 01:01:24.480 |
okay welcome back um oh before I continue I very rudely forgot to mention this very nice 01:01:32.560 |
equation image here is from an article by Chris said called things that confused me about cross 01:01:40.560 |
entropy it's a very good article so I recommend you check it out if you want to go a bit deeper 01:01:46.960 |
there there's a link to it inside the spreadsheet so the next notebook we're going to be looking at 01:01:59.440 |
is this one called collaborative filtering deep dive and this is going to cover our last 01:02:07.040 |
of the four major application areas collaborative filtering 01:02:15.200 |
and this is actually the first time I'm going to be presenting a chapter of the book largely 01:02:21.360 |
without variation um because this is one where I looked back at the chapter and I was like oh I 01:02:28.160 |
can't think of any way to improve this so I thought I'll just leave it as is um but we have 01:02:34.800 |
put the whole chapter up on Kaggle um so that's for the way I'm going to be showing it to you 01:02:42.960 |
and so we're going to be looking at a data set called the movie lens 01:02:46.480 |
data set which is a data set of movie ratings and we're going to grab a 01:02:55.360 |
smaller version of it 100,000 record version of it 01:02:59.280 |
and it comes as a csv file which we can read in well it's not really a csv file it's a tsv file 01:03:09.840 |
this here means a tab in python um these are the names of the columns so here's what it looks like 01:03:21.600 |
it's got a user a movie a rating and a timestamp we're not going to use the timestamp 01:03:26.320 |
at all so basically three columns we care about this is a user id so maybe 196 is Jeremy and maybe 01:03:34.880 |
186 is Rachel and 22 is John I don't know um maybe this movie is Return of the Jedi and this one's 01:03:44.480 |
Casablanca this one's LA Confidential and then this rating says how did Jeremy feel about Return 01:03:51.600 |
of the Jedi he gave it a three out of five that's how we can read this data set um this kind of data 01:04:00.000 |
is very common uh anytime you've got a user and a product or service and you might not even have 01:04:11.040 |
ratings maybe just the fact that they bought that product you could have a similar table with zeros 01:04:16.400 |
and ones um so for example um uh Radek who's in the audience here is now at nvidia doing like 01:04:26.960 |
basically does this right recommendation systems so recommendation systems you know 01:04:31.120 |
it's it's a huge industry um and so what we're learning today is you know a really key foundation 01:04:37.600 |
of it um so these are the first few rows this is not a particularly great way to see it i prefer to 01:04:46.560 |
kind of cross tabulate it like that like this this is the same information 01:04:51.600 |
uh so for each movie for each user here's the rating so user 212 never watched movie 49 01:05:03.760 |
why there's so few empty cells here i actually grabbed the the most watched movies 01:05:13.280 |
and the most movie watching users for this particular sample matrix so that's why it's 01:05:19.600 |
particularly full so yeah so this is what kind of a collaborative filtering data set looks like 01:05:26.560 |
when we cross tabulate it so how do we fill in this gap so maybe user 212 is nick 01:05:38.400 |
and movie 49 what's the movie you haven't seen nick and you'd quite like to maybe not sure about it 01:05:47.280 |
the new elvis movie baz limon good choice australian director filmed in queen's land 01:05:52.640 |
yeah okay so that's movie two and that's movie number 49 so is nick gonna like the new elvis 01:06:00.720 |
movie well to figure this out what we could do ideally would like to know for each movie 01:06:15.680 |
what kind of movie is it like what are the kind of features of it is it like 01:06:19.760 |
actiony science fictiony dialogue driven critical acclaimed you know um so let's say for example we 01:06:28.480 |
were trying to look at The Last Skywalker maybe that's the movie nick's wondering about 01:06:33.600 |
watching and so if we like had three categories being science fiction action or kind of classic 01:06:43.120 |
old movies we'd say The Last Skywalker is very science fiction let's say this is from like negative 01:06:48.240 |
one to one pretty action definitely not an old classic or at least not yet 01:06:55.760 |
and so then maybe we then could say like okay well maybe like nick's tastes in movies are that he 01:07:06.640 |
really likes science fiction quite likes action movies and doesn't really like old classics 01:07:12.480 |
right so then we could kind of like match these up to see how much we think this user might like 01:07:19.840 |
this movie to calculate the match we could just multiply the corresponding values user one times 01:07:29.600 |
The Last Skywalker and add them up point nine times point nine eight plus point eight times point nine 01:07:34.800 |
plus negative point six times negative point nine that's going to give us a pretty high number 01:07:39.280 |
right with a maximum of three so that would suggest nick probably would like The Last 01:07:45.680 |
Skywalker on the other hand the movie casablanca we would say definitely not very science fiction 01:07:55.680 |
not really very action definitely very old classic so then we'd do exactly the same 01:08:01.520 |
calculation and get this negative result here so you probably wouldn't like casablanca 01:08:08.720 |
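Here is a minimal sketch of that match calculation in Python; the user and The Last Skywalker numbers are the ones just mentioned, and the Casablanca values are assumed for illustration.

import numpy as np

user1          = np.array([0.9, 0.8, -0.6])    # sci-fi, action, old classics
last_skywalker = np.array([0.98, 0.9, -0.9])
casablanca     = np.array([-0.99, -0.3, 0.8])  # assumed values, just for illustration

print((user1 * last_skywalker).sum())   # about 2.1 out of a maximum of 3: probably a good match
print((user1 * casablanca).sum())       # negative: probably not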
this thing here when we multiply the corresponding parts of a vector together and add them up is 01:08:16.240 |
called a dot product in math so this is the dot product of the user's preferences and the type 01:08:25.040 |
of movie now the problem is we weren't given that information we know nothing about these users 01:08:33.600 |
or about the movies so what are we going to do we want to try to create these factors 01:08:42.320 |
without knowing ahead of time what they are we wouldn't even know what factors to create 01:08:49.440 |
what are the things that really matter when people decide what movies they want to watch 01:08:53.680 |
what we can do is we can create things called latent factors latent factors is this weird idea 01:09:00.160 |
that we can say i don't know what things about movies matter to people but there's probably 01:09:07.600 |
something and let's just try like using sgd to find them and we can do it in everybody's 01:09:19.760 |
favorite mathematical optimization software, microsoft excel so here is that table 01:09:31.760 |
and what we can do let's head over here actually here's that table so what we could do is we could 01:09:42.640 |
say for each of those movies so let's say for movie 27 let's assume there are five 01:09:49.920 |
latent factors i don't know what they're for they're just five latent factors we'll figure 01:09:57.760 |
them out later and for now i certainly don't know what the value of those five latent factors for 01:10:03.920 |
movie 27 so we're going to just chuck a little random numbers in them 01:10:11.120 |
and we're going to do the same thing for movie 49 pick another five random numbers and the same 01:10:16.000 |
thing for movie 57 pick another five numbers and you might not be surprised to hear we're 01:10:21.840 |
going to do the same thing for each user so for user 14 we're going to pick five random numbers 01:10:27.840 |
for them and for user 29 we'll pick five random numbers for them and so the idea is that this 01:10:34.800 |
number here 0.19 is saying that, if these factors were right, user id 14 feels not very strongly about whatever 01:10:44.960 |
factor it is that for movie 27 has a value of 0.71 so therefore in here we do the dot product 01:10:52.960 |
the details of why i don't matter too much but well actually you can figure this out from what 01:11:00.640 |
we've said so far if you go back to our definition of matrix product you might notice that the 01:11:07.040 |
matrix product of a row with a column is the same thing as a dot product and so here in excel i 01:11:15.760 |
have a row in a column so therefore i say matrix multiply that by that that gives us the dot product 01:11:21.280 |
so here's the dot product of that by that or the matrix multiply given that they're row and column 01:11:29.680 |
the only other slight quirk here is that if the actual rating is empty i'm not just going to 01:11:41.360 |
leave it blank i'm going to set the prediction to zero actually so here is everybody's predicted rating of 01:11:52.240 |
movies i say predicted of course these are currently random numbers so they are terrible 01:11:57.920 |
predictions but when we have some way to predict things and we start with terrible random predictions 01:12:03.920 |
we know how to make them better don't we we use stochastic gradient descent now to do that we're 01:12:10.160 |
going to need a loss function so that's easy enough we can just calculate the sum of x minus 01:12:19.520 |
y squared divided by the count that is the mean squared error and if we take the square root that 01:12:25.680 |
is the root mean squared error so here is the root mean squared error in excel between these 01:12:32.880 |
predictions and these actuals and so now that we have a loss function we can optimize it: data, 01:12:43.040 |
solver, set objective to this one here, by changing cells, these ones here and these ones here 01:12:56.400 |
solve okay and initially our loss is 2.81 so we hope it's going to go down 01:13:08.720 |
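For reference, a minimal sketch of that root mean squared error calculation in Python, with made-up values.

import numpy as np

def rmse(preds, actuals):
    # mean of the squared differences, then the square root
    return np.sqrt(((preds - actuals) ** 2).mean())

print(rmse(np.array([3.2, 4.9, 2.1]), np.array([3.0, 5.0, 1.0])))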
and as it solves not a great choice of background color but it says 0.68 so this number is going 01:13:14.720 |
down so this is using um actually in excel it's not quite using stochastic gradient descent because 01:13:22.640 |
excel doesn't know how to calculate gradients there are actually optimization techniques that 01:13:27.440 |
don't need gradients they calculate them numerically as they go but that's a minor quirk um one thing 01:13:34.640 |
you'll notice is it's doing it very very slowly um there's not much data here and it's still going 01:13:40.640 |
um one reason for that is that if it's because it's not using gradients it's much slower and 01:13:47.200 |
the second is excel is much slower than pytorch anyway it's come up with an answer and look at 01:13:53.520 |
that it's got to 0.42 so it's got a pretty good prediction and so we can kind of get a sense of 01:14:02.480 |
this for example um looking at the last three movies, user 14 likes, dislikes, likes, let's see 01:14:02.480 |
somebody else like that here's somebody else this person likes dislikes likes so based on our kind 01:14:25.440 |
of approach we're saying okay since they have the same feeling about these three movies maybe they'll 01:14:30.160 |
feel the same about these three movies so this person likes all three of those movies and 01:14:35.440 |
this person likes two out of three of them so you know you kind of this is the idea right as if 01:14:42.240 |
somebody says to you i like this movie this movie this movie and you're like oh they like those 01:14:46.560 |
movies too what other movies do you like and they'll say oh how about this there's a chance 01:14:51.760 |
good chance that you're going to like the same thing that's the basis of collaborative filtering 01:14:56.640 |
okay um and mathematically we call this matrix completion so this matrix is missing values 01:15:04.560 |
we just want to complete them so the core of collaborative filtering is that it's a matrix completion problem 01:15:22.800 |
my question was is with um the dot products right so if we think about the math of that 01:15:28.640 |
for a minute is yeah if we think about the cosine of the angle between the two vectors 01:15:33.040 |
that's going to roughly approximate the correlation is that essentially what's going on here in one 01:15:38.480 |
sense with the way that we're so is the cosine of the angle between the vectors much the same thing 01:15:43.680 |
as the dot product um the answer is yes um they're the same once you normalize them so yeah 01:15:54.480 |
so it's correlation that we're doing here, at scale, as well? yeah, you can, yeah 01:16:02.000 |
you can think of it that way okay cool 01:16:07.840 |
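As a quick sketch of that point with illustrative vectors, cosine similarity is just the dot product divided by the two vector lengths.

import torch
import torch.nn.functional as F

a = torch.tensor([0.9, 0.8, -0.6])
b = torch.tensor([0.98, 0.9, -0.9])

dot = (a * b).sum()                      # the dot product
cos = F.cosine_similarity(a, b, dim=0)   # the same thing after normalizing by the vector lengths
print(dot, cos)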
this looks pretty different to how pytorch looks pytorch has things in rows 01:16:18.160 |
right we've got a user a movie rating user movie rating right so how do we do the same kind of 01:16:25.520 |
thing in pytorch so let's do the same kind of thing in excel but using the table in the same 01:16:31.760 |
format that pytorch has it okay so to do that in excel the first thing i'm going to do is i'm 01:16:38.560 |
going to see okay this i've got to look at user number 14 and i want to know what index like how 01:16:46.080 |
far down this list is 14 okay so we'll just match means find the index so this is user index one 01:16:51.520 |
and then what i'm going to do is i'm going to say the 01:16:57.040 |
these five numbers is basically i want to find row one over here and in excel that's called offset 01:17:07.920 |
so we're going to offset from here by one row and so you can see here it is 0.19 0.63 0.19 0.63 etc 01:17:18.400 |
right so here's the second user 0.25 0.03 etc and we can do the same thing for movies right so movie 01:17:27.920 |
four one seven is index 14 that's going to be 0.75 0.47 etc and so same thing right but now we're 01:17:38.960 |
going to offset from here by 14 to get this row which is 0.75 0.47 etc and so the prediction now 01:17:54.080 |
is the dot product is called sum product in excel this is sum product of those two things 01:18:01.760 |
so this is exactly the same as we had before right but when we kind of put everything next 01:18:08.400 |
to each other we have to like manually look up the index and so then for each one we can calculate 01:18:16.400 |
the error squared prediction minus rating squared and then we could add those all up 01:18:22.960 |
and if you remember this is actually the same root mean squared error we had before we optimized 01:18:27.280 |
before 2.81 because we've got the same numbers as before and so this is mathematically identical 01:18:33.760 |
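A minimal sketch of that same per-row calculation in Python; the matrix sizes and the particular indices below are made up for illustration.

import torch

user_factors  = torch.randn(944, 5)         # random latent factors, sizes are illustrative
movie_factors = torch.randn(1682, 5)
user_idx, movie_idx, rating = 14, 27, 3.0   # one made-up row of the ratings table

u = user_factors[user_idx]             # like MATCH + OFFSET into the user table
m = movie_factors[movie_idx]           # the same lookup for the movie table
pred = (u * m).sum()                   # SUMPRODUCT, i.e. the dot product
sq_err = (pred - rating) ** 2          # this row's contribution to the loss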
so what's this weird word up here embedding you've probably heard it before 01:18:41.200 |
and you might have come across the impression it's some very complex fancy mathematical thing 01:18:47.760 |
but actually it turns out that it is just looking something up in an array that is what an embedding 01:18:54.720 |
is so we call this an embedding matrix and these are our user embeddings and our movie embeddings 01:19:11.440 |
so let's take a look at that in pytorch and you know at this point if you've heard about embeddings 01:19:17.920 |
before you might be thinking that can't be it and yeah it's just as complex as the rectified 01:19:25.520 |
linear unit which turned out to be: replace negatives with zeros; embedding actually means 01:19:31.600 |
look something up in an array so there's a lot of things that we use as deep learning practitioners 01:19:37.520 |
to try to make you as intimidated as possible so that you don't wander into our territory and 01:19:45.200 |
start winning our kaggle competitions and unfortunately once you discover the simplicity 01:19:49.920 |
of it you might start to think that you can do it yourself and then it turns out you can 01:19:54.240 |
so yeah that's what basically it turns out pretty much all of this jargon turns out to be 01:20:03.280 |
so we're going to try to learn these latent factors which is exactly what we just did in excel we just 01:20:10.560 |
learned the latent factors all right so if we're going to learn things in pytorch we're going to 01:20:18.400 |
need data loaders one thing i did is there is actually a movies table as well with the names 01:20:26.880 |
of the movies so i merged that together with the ratings so that then we've now got the user id and 01:20:33.120 |
the actual name of the movie we don't need that obviously for the model but it's just going to 01:20:36.880 |
make it a bit more fun to interpret later so this is called ratings we have something called 01:20:46.080 |
collaborative data loaders so collaborative filtering data loaders and we can get that from 01:20:50.560 |
a data frame by passing in the data frame and it expects a user column and an item column so the 01:20:58.960 |
user column is what it sounds like the the person that is rating this thing and the item column is 01:21:04.880 |
the product or service that they're rating in our case the user column is called user so we don't 01:21:09.680 |
have to pass that in and the item column is called title so we do have to pass this in because by 01:21:16.000 |
default the user column should be called user and the item column will be called item give it a batch 01:21:22.560 |
size and as usual we can call show batch and so here's a batch from our data loaders 01:21:22.560 |
or at least a bit of it and so now, since we told it about the names, we actually get to see the 01:21:32.240 |
names which is nice all right so now we're going to create the user factors and movie factors i.e. 01:21:54.400 |
this one and this one so the number of rows of the movie factors will be equal to the 01:22:04.800 |
number of movies and the number of rows of the user factors will be equal to the number of 01:22:09.120 |
users and the number of columns will be whatever we want however many factors we want to create 01:22:14.800 |
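In code that looks something like this; it assumes the ratings DataFrame described above, and the batch size and five factors are just example choices.

import torch
from fastai.collab import *

# assumes a DataFrame called ratings with user, title and rating columns, as described above
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)

n_users   = len(dls.classes['user'])
n_movies  = len(dls.classes['title'])
n_factors = 5

user_factors  = torch.randn(n_users, n_factors)    # one row of random factors per user
movie_factors = torch.randn(n_movies, n_factors)   # and one per movie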
john this might be a pertinent time to jump in with a question: any comments about how to choose the number of latent factors? 01:22:25.200 |
um not really um we um we have defaults that we use for embeddings in fastai 01:22:38.160 |
um it's a very obscure formula and people often ask me for like the mathematical derivation of 01:22:44.880 |
where it came from but what actually happened is it's i wrote down how many factors i think is 01:22:50.240 |
appropriate for different size categories on a piece of paper in a table well actually in excel 01:22:55.520 |
and then i fitted a function to that and that's the function so it's basically a mathematical 01:23:00.480 |
function that fits my intuition about what works well um but it seems to work pretty well i said 01:23:06.000 |
it used in lots of other places now lots of papers will be like using fastai's rule of thumb for 01:23:11.840 |
embedding sizes here's the formula cool thank you um it's pretty fast to train these things so you 01:23:20.880 |
can try a few so we're going to create um so the number of users is just the length of how many 01:23:28.400 |
users there are number of movies is the length of how many titles there are so create a matrix of 01:23:33.200 |
random numbers of size number of users by five and another of size number of movies by five 01:23:33.200 |
and now we need to look up the index of the movie in our movie latent factor matrix 01:23:48.240 |
um the thing is when we've learned about deep learning we learned that we do matrix 01:23:56.160 |
multiplications not look something up in a matrix in an array so in excel 01:24:04.240 |
we were saying offset which is to say find element number 14 in the table which that's not 01:24:15.600 |
a matrix multiply how does that work well actually it is um it actually is for the same reason 01:24:28.560 |
here which is we can represent find the element number one thing in this list is actually the 01:24:41.920 |
same as multiplying by a one hot encoded matrix so remember how if we let's just take off the log 01:24:58.560 |
look this is returned 0.87 um and particularly if i take the negative off here if i add this up 01:25:05.680 |
this is 0.87 which is the result of finding the index number one thing in this list 01:25:12.880 |
but we didn't do it that way we did this by taking the dot product of this 01:25:20.240 |
sorry of this and this but that's actually the same thing taking the dot product of a one hot 01:25:28.480 |
encoded vector with something is the same as looking up this index in the vector so that means that 01:25:39.280 |
this exercise here of looking up the 14th thing is the same as doing a matrix multiply 01:25:48.800 |
with a one hot encoded vector and we can see that here 01:25:53.120 |
this is how we create a one hot encoded vector of length and users in which the third element is 01:26:03.760 |
set to one and everything else is zero and if we multiply that so at means do you remember 01:26:11.200 |
matrix multiply in python so if we multiply that by our user factors we get back this answer 01:26:18.640 |
and if we just ask for user factors number three we get back the exact same answer 01:26:23.840 |
they're the same thing so you can think of an embedding as being a computational shortcut 01:26:33.680 |
for multiplying something by a one hot encoded vector and so if you think back to what we did 01:26:40.400 |
with dummy variables right this basically means embeddings are like a cool math trick for speeding 01:26:50.320 |
up doing matrix multiplies with dummy variables not just speeding up we never even have to create 01:26:50.320 |
the dummy variables we never have to create the one hot encoded vectors we can just look up in an array 01:27:08.480 |
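A tiny sketch of that equivalence in plain PyTorch, with an illustrative number of users.

import torch
import torch.nn.functional as F

n_users = 944                            # illustrative
user_factors = torch.randn(n_users, 5)

one_hot_3 = F.one_hot(torch.tensor(3), num_classes=n_users).float()   # 1.0 at index 3, zeros elsewhere

print(one_hot_3 @ user_factors)   # matrix multiply by the one-hot vector...
print(user_factors[3])            # ...gives exactly the same thing as just indexing into the array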
all right so we're now ready to build a collaborative filtering model 01:27:17.440 |
and we're going to create one from scratch and as we've discussed before in PyTorch that means writing a class 01:27:32.880 |
and so we briefly touched on this but i've got to touch on it again 01:27:39.040 |
this is how we create a class in python you give it a name 01:27:46.080 |
and then you say how to initialize it how to construct it so in python remember they call 01:27:53.280 |
these things dunder whatever this is dunder init these are magic methods that python will call for 01:28:00.400 |
you at certain times the method called dunder init is called when you create an object of this 01:28:10.400 |
class so we could pass it a value and so now we set the attribute called a equal to that value 01:28:20.240 |
and so then later on we could call a method called say that will say hello to whatever you passed in 01:28:27.600 |
here and this is what it will say so for example if you construct an object of type example passing 01:28:35.600 |
in Sylvain, self.a now equals Sylvain, so if you then use the dot say method with nice to meet 01:28:44.400 |
you, x is now nice to meet you, so it will say hello Sylvain nice to meet you so that's that's kind of 01:28:54.480 |
all you need to know about object-oriented programming in PyTorch to create a model 01:29:00.640 |
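Here is a sketch of that toy class; the exact name passed in is just an example.

class Example:
    def __init__(self, a): self.a = a                    # __init__ runs when the object is constructed
    def say(self, x): return f'hello {self.a}, {x}'      # an ordinary method using the stored attribute

ex = Example('Sylvain')
print(ex.say('nice to meet you'))   # hello Sylvain, nice to meet you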
oh there is one more thing we need to know sorry which is you can put something in parentheses 01:29:08.560 |
after your class name and that's called the superclass it's basically gonna give you some stuff for 01:29:16.080 |
free give you some functionality for free and if you create a model in PyTorch you have to make 01:29:24.320 |
module your superclass this is actually fastai's version of module but it's nearly the same as PyTorch's 01:29:33.520 |
so when we create this dot product object it's going to call dunder init and we have to say well how 01:29:39.600 |
many users are going to be in our model and how many movies and how many factors and so we can 01:29:45.920 |
now create an embedding of users by factors for users and an embedding of movies by factors 01:29:53.120 |
for movies and so then PyTorch does something quite magic which is that if you create an 01:30:07.200 |
object of this class then you can treat it like a function you can call it and it can calculate values on it 01:30:14.560 |
and when you do that it's really important to know PyTorch is going to call a method called 01:30:21.360 |
forward in your class so this is where you put your calculation of your model it has to be called 01:30:26.400 |
forward and it's going to be passed the object itself and the thing you're calculating on 01:30:33.520 |
in this case the user and movie for a batch so this is your batch of data 01:30:44.800 |
each row will be one user and movie combination and the columns will be users and movies 01:30:53.440 |
so we can grab the first column right so this is every row of the first column 01:31:00.240 |
and look it up in the user factors embedding to get our users embeddings so that is the same 01:31:09.120 |
as doing this let's say this is one mini batch and then we do exactly the same thing for the 01:31:16.960 |
second column passing it into our movie factors to look up the movie embeddings and then 01:31:25.120 |
take the dot product, dim equals one, because we're summing across the columns for each row 01:31:39.360 |
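Pulled together, that model looks something like this; it leans on fastai's Module and Embedding as just described.

from fastai.collab import *

class DotProduct(Module):                       # fastai's Module, so no super().__init__() call needed
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors  = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)

    def forward(self, x):                       # x is a batch: column 0 is the user, column 1 is the movie
        users  = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return (users * movies).sum(dim=1)      # the dot product, row by row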
so once we've got that we can pass it to a learner passing in our data loaders 01:31:47.760 |
and our model and our loss function means squared error and we can call fit 01:31:54.400 |
and away it goes and this by the way is running on cpu now these are very fast 01:32:06.160 |
to run so this is doing 100 000 rows in 10 seconds which is a whole lot faster than our 01:32:12.880 |
few dozen rows in excel and so you can see the loss going down and so we've trained a model 01:32:29.840 |
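And the training setup is along these lines, reusing the dls, n_users and n_movies from the earlier sketch; 50 factors and the learning rate are just example values.

model = DotProduct(n_users, n_movies, 50)               # number of factors is a choice you get to make
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)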
it's not going to be a great model and one of the problems is that let's see if we can see this in 01:32:37.200 |
excel look at this one here this prediction is bigger than five 01:32:45.280 |
but nothing's bigger than five so that seems like a problem we're predicting things that are bigger 01:32:53.200 |
than the highest possible number and in fact these are very much movie enthusiasts that 01:33:01.040 |
nobody gave anything a one yeah nobody even gave anything a one here so 01:33:07.840 |
do you remember when we learned about sigmoid the idea of squishing things between zero and one 01:33:16.080 |
we could do stuff still without a sigmoid but when we added a sigmoid it trained better 01:33:21.600 |
because the model didn't have to work so hard to get it kind of into the right zone 01:33:24.960 |
now if you think about it if you take something and put it through a sigmoid 01:33:29.600 |
and then multiply it by five now you've got something that's going to be between zero and 01:33:34.800 |
five used to have something which is between zero and one so we could do that in fact we could do 01:33:41.280 |
that in excel i'll leave that as an exercise to the reader let's do it over here in pytorch 01:33:50.080 |
so if we take the exact same class as before and this time we call sigmoid range and so sigmoid 01:33:59.920 |
range is something which will take our prediction and then squash it into our range and by default 01:34:09.280 |
we'll use a range of zero through to 5.5 so it can't be smaller than zero it can't be bigger than 01:34:14.880 |
5.5 why don't i use five that's because a sigmoid can never hit one right and a sigmoid times five 01:34:23.760 |
can never hit five but some people do give things movies five so you want to make it a bit bigger 01:34:29.040 |
than our highest so this one got a loss of 0.8628 oh it's not better isn't that always the way 01:34:43.120 |
all right didn't actually help doesn't always so be it 01:34:45.760 |
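For reference, here is roughly what that sigmoid version of the model looks like, with the 0 to 5.5 range discussed above.

from fastai.collab import *

class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        self.user_factors  = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.y_range = y_range

    def forward(self, x):
        users  = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        # squash the raw dot product into the 0 to 5.5 range with a scaled, shifted sigmoid
        return sigmoid_range((users * movies).sum(dim=1), *self.y_range)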
um let's keep trying to improve it um let me show you something i noticed 01:34:53.840 |
um some of the users like this one this person here just loved movies 01:35:06.320 |
they give nearly everything a four or five their worst score is a three all right this person oh 01:35:13.040 |
here's a one this person's got much more range some things are twos some ones some fives 01:35:18.640 |
um this person doesn't seem to like movies very much considering how many they watch nothing 01:35:25.280 |
gets a five they've got discerning tastes i guess at the moment we don't have any way 01:35:33.520 |
in our kind of formulation of this model to say this user tends to give low scores and this user 01:35:41.840 |
tends to give high scores there's just nothing like that right but that would be very easy to add 01:35:47.600 |
let's add one more number to our five factors just here for each user 01:36:00.000 |
and now rather than doing just the matrix multiply let's add 01:36:05.920 |
oh it's actually the top one let's add this number to it h19 01:36:16.000 |
and so for this one let's add i19 to it yeah so i've got it wrong this one here so this 01:36:24.800 |
this row here we're going to add to each rating and then we're going to do the same thing here 01:36:34.480 |
each movie's now got an extra number here that again we're going to 01:36:42.640 |
add a26 so it's our matrix multiplication plus we call it the bias the user bias plus 01:36:53.600 |
the movie bias so effectively that's like making it so we don't have an intercept of zero anymore 01:37:12.640 |
so previously we got to 0.42 okay and so we're going to let that go along for a while 01:37:21.840 |
and then let's also go back and look at the pytorch version so for pytorch now 01:37:27.280 |
we're going to have a user bias which is an embedding of n users by one right remember 01:37:34.960 |
there was just one number for each user and movie bias is an embedding of n movies also by one 01:37:43.120 |
and so we can now look up the user embedding the movie embedding do the dot product and then look 01:37:53.120 |
up the user bias and the movie bias and add them chuck that through the sigmoid 01:38:11.280 |
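And a sketch of the version with biases; keepdim=True here is just to keep the shapes lined up so the one-column bias embeddings can be added.

from fastai.collab import *

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        self.user_factors  = Embedding(n_users, n_factors)
        self.user_bias     = Embedding(n_users, 1)       # one extra number per user
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias    = Embedding(n_movies, 1)      # and one extra number per movie
        self.y_range = y_range

    def forward(self, x):
        users  = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)           # keep a column so the biases line up
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])   # add the two bias terms
        return sigmoid_range(res, *self.y_range)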
wow we're not training very well are we still not too great 0.894 01:38:15.520 |
i think excel normally does do better though let's see 01:38:18.480 |
okay excel oh excel's done a lot better it's going from 0.42 to 0.35 01:38:26.320 |
okay so what happened here why did it get worse well look at this the valid loss got better 01:38:41.600 |
and then it started getting worse again so we think we might be overfitting 01:38:46.880 |
which you know we have got a lot of parameters in our embeddings so how do we avoid overfitting 01:38:58.720 |
so a classic way to avoid overfitting is to use something called weight decay 01:39:07.680 |
also known as l2 regularization which sounds much more fancy 01:39:12.960 |
what we're going to do is when we can compute the gradients 01:39:20.560 |
we're going to first add to our loss function the sum of the weights squared this is something you 01:39:30.000 |
should go back and add to your titanic model not that it's overfitting but just to try it right 01:39:34.960 |
so previously our gradients have just been and our loss function has just been about the difference 01:39:43.360 |
between our predictions and our actuals right and so our gradients were based on the derivative of 01:39:49.040 |
that with respect to the coefficients but we're saying now 01:39:56.400 |
let's add the sum of the square of the weights times some small number so what would make that 01:40:06.880 |
loss function go down that loss function would go down if we reduce our weights 01:40:12.240 |
for example if we reduce all of our weights to zero, i should say if we reduce the magnitude of our 01:40:21.440 |
weights, if we reduce them all to zero, that part of the loss function will be zero because the sum 01:40:28.000 |
of zero squared is zero now problem is if our weights are all zero our model doesn't do anything 01:40:34.480 |
right so we'd have crappy predictions so it would want to increase the weights 01:40:40.240 |
so that's actually predicting something useful 01:40:47.200 |
but if it increases the weights too much then it starts overfitting so how is it going to 01:40:53.680 |
actually get the lowest possible value of the loss function by finding the right mix 01:40:58.800 |
weights not too high right but high enough to be useful at predicting 01:41:04.000 |
if there's some parameter that's not useful for example say we asked for five factors and we 01:41:13.600 |
only need four it can just set the weights for the fifth factor to zero right and then 01:41:20.000 |
problem solved right it won't be used to predict anything 01:41:25.120 |
but it also won't contribute to our weight decay part 01:41:29.840 |
so previously we had something calculating the loss function so now we're going to do exactly 01:41:41.920 |
the same thing but we're going to square the parameters we're going to sum them up 01:41:46.480 |
and we're going to multiply them by some small number like 0.01 or 0.001 01:41:52.560 |
um and in fact we don't even need to do this because remember the whole purpose of the loss 01:42:03.200 |
is to take its gradient right and to print it out um the gradient of parameters squared 01:42:12.480 |
is two times parameters it's okay if you don't remember that from high school but the 01:42:17.120 |
you can take my word for it the gradient of y equals x squared is 2x so actually all we need 01:42:25.920 |
to do is take our gradient and add the weight decay coefficient 0.01 or whatever times two 01:42:33.680 |
times parameters and given this is just number some number we get to pick we might as well fold 01:42:38.880 |
the two into it and just get rid of it so when you call fit you can pass in a wd parameter 01:42:48.000 |
which does adds this times the parameters to the gradient for you and so that's going to ask the 01:42:57.040 |
model it's going to say to the model please don't make the weights any bigger than they have to 01:43:01.440 |
be and yay finally our loss actually improved okay you can see getting better and better 01:43:16.320 |
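As a sketch of what that looks like when fitting, continuing from the earlier pieces; 0.1 is just one of the values you might try.

# conceptually: loss_with_wd = loss + wd * (parameters ** 2).sum()
# taking the gradient of that extra term amounts to adding a multiple of the parameters
# to the gradients, and passing wd to fit asks fastai to do that for us
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)   # try a few multiples of 10 and keep the best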
in fastai applications like vision we try to set this for you 01:43:21.120 |
appropriately and we generally do a reasonably good job just the defaults are normally fine 01:43:26.640 |
um but in things like tabular and collaborative filtering we don't really know enough about your 01:43:32.640 |
data to know what to use here so you should just try a few things let's try a few multiples of 10 01:43:39.120 |
start at point one and then divide by 10 a few times you know and just see which one gives you 01:43:45.200 |
the best result so this is called regularization so regularization is about making your 01:43:54.400 |
model no more complex than it has to be right it has a lower capacity and so the higher the weights 01:44:02.480 |
the more they're moving the model around right so we want to keep the weights down but not so far 01:44:09.120 |
down that they don't make good predictions and so the value of this if it's higher will keep the 01:44:14.160 |
weights down more it will reduce overfitting but it will also reduce the capacity of your model 01:44:19.920 |
to make good predictions and if it's lower it increases the capacity of model and increases 01:44:25.840 |
overfitting all right i'm going to take this bit for next time before we wrap up john are there 01:44:38.800 |
any more questions uh yeah there are there's some from from back at the start of the collaborative 01:44:51.360 |
filtering so um we had a bit of a conversation a while back about this the size of the embedding 01:44:59.680 |
vectors um and you talked about your your fast ai rule of thumb so there was a question if anyone 01:45:05.120 |
has ever done a kind of a hyper parameter search, an exploration, for that? i mean people often will do a 01:45:11.280 |
hyper parameter search, sure, for a bigger problem people will often do a hyper parameter search for 01:45:16.000 |
their model but i haven't seen a, i haven't seen any other rules other than my rule of thumb 01:45:22.160 |
right so not not productively to your knowledge oh productively for an individual model that 01:45:28.240 |
somebody's building right um and then there's a there's a question here from zaki which i 01:45:36.720 |
didn't quite wrap my head around so zaki if you want to maybe clarify in the in the chat as well 01:45:41.760 |
but can recommendation systems be built based on average ratings of users experience rather than 01:45:48.480 |
collaborative filtering not really right i mean if you've got lots of metadata you could 01:45:54.480 |
right so if you've got a lot you know like lots of information about demographic data about where 01:45:59.360 |
the user's from and you know what loyalty scheme results they've had and blah blah blah and then 01:46:06.320 |
for products there's metadata about that as well then sure averages would be fine but if all you've 01:46:12.880 |
got is kind of purchasing history then you really want the granular data otherwise how could you say 01:46:20.400 |
like they like this movie, this movie and this movie, therefore they might also like that movie or 01:46:20.400 |
you've got it's like oh they kind of like movies there's just not enough information there yeah 01:46:30.080 |
great that's about it thanks okay great all right thanks everybody see you next time for our