Back to Index

Lesson 6 - Deep Learning for Coders (2020)


Chapters

0:00
31:10 Multi-label classification
48:35 One-hot encoding
69:08 Regression
109:05 Embedding
110:40 Collaborative filtering from scratch
121:40 Regularisation (Data augmentation for regression)

Transcript

Hi everybody and welcome to lesson 6 where we're going to continue looking at training convolutional neural networks for computer vision. And so we last looked at this the lesson before last and specifically we were looking at how to train an image classifier to pick out breeds of pet, one of 37 breeds of pet.

And we've gotten as far as training a model but we also had to look and figure out what loss function was actually being used in this model. And so we talked about cross entropy loss which is actually a really important concept and some of the things we're talking about today depend a bit on you understanding this concept.

So if you were at all unsure about where we got to with that go back and have another look have a look at the questionnaire in particular and make sure that you're comfortable with cross entropy loss. If you're not you may want to go back to the 04 MNIST basics notebook and remind yourself about MNIST loss because it's very very similar that's what we built on to build up cross entropy loss.

So having trained our model the next thing we're going to do is look at model interpretation. There's not much point having a model if you don't see what it's doing. And one thing we can do is use a confusion matrix which in this case is not terribly helpful. There's kind of a few too many and it's not too bad we can kind of see some colored areas.

And so this diagonal here are all the ones that are classified correctly. So for Persians there were 31 classified as Persians. But we can see there are some bigger numbers here: for Siamese, six were misclassified, they were actually predicted as Birman. But when you've got a lot of classes like this it might be better instead to use the most_confused method, and that tells you the combinations which it got wrong the most often.

In other words, which numbers are the biggest. So actually here's the biggest one, ten, and that's confusing an American pit bull terrier and a Staffordshire bull terrier; that's happened ten times. And Ragdoll is getting confused with the Birman eight times. And so I'm not a dog or cat expert and I don't know what this stuff means, so I looked it up on the internet and I found that American pit bull terriers and Staffordshire bull terriers are almost identical; I think they sometimes have a slightly different colored nose, if I remember correctly.

And Ragdolls and Birmans are types of cats that are so similar to each other that there are whole long threads on cat lover forums about "is this a Ragdoll or is this a Birman", with experts disagreeing with each other. So no surprise that these things are getting confused. So when you see your model making sensible mistakes, the kind of mistakes that humans make, that's a pretty good sign that it's picking up the right kind of stuff, and that the kinds of errors you're getting also might be pretty tricky to fix.
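In the notebook, the interpretation steps above look roughly like this (a sketch following the course notebook; `learn` is the pet classifier trained earlier):

```python
from fastai.vision.all import *

# Build an interpretation object from the trained learner
interp = ClassificationInterpretation.from_learner(learn)

# The full 37x37 confusion matrix (a bit crowded with this many classes)
interp.plot_confusion_matrix(figsize=(12, 12), dpi=60)

# The most frequently confused pairs of classes, with at least 5 mistakes
interp.most_confused(min_val=5)
```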

But you know let's see if we can make it better. And one way to try and make it better is to improve our learning rate. Why would we want to improve the learning rate? Well one thing we'd like to do is to try to train it faster, get more done in less epochs.

And so one way to do that would be to call our fine_tune method with a higher learning rate. So last time we used the default, which I think is, there you go, 1e-2. And so if we pump that up to 0.1 it's going to jump further each time.

So remember the learning rate, and if you've forgotten this have a look again at notebook 4: that's the thing we multiply the gradients by to decide how far to step. And unfortunately when we use this higher learning rate the error rate goes from 0.083 to 0.83, so we're getting the vast majority of them wrong now.

So that's not a good sign. So why did that happen? Well what happened is rather than this gradual move towards the minimum, we had this thing where we step too far and we get further, further away. So when you see this happening which looks in practice like this, your error rate getting worse right from the start, that's a sign your learning rate is too high.

So we need to find something just right, not too small that we take tiny jumps and it takes forever and not too big that we you know either get worse and worse or we just jump backwards and forwards quite slowly. So to find a good learning rate we can use something that the researcher Leslie Smith came up with called the learning rate finder.

And the learning rate finder is pretty simple. All we do, remember when we do stochastic gradient descent, we look at one mini batch at a time or a few images in this case at a time, find the gradient for that set of images for the mini batch and jump, step our weights based on the learning rate and the gradient.

Well what Leslie Smith said was, okay, let's do the very first mini batch at a really, really low learning rate, like 10 to the minus 7, and then let's increase it by a little bit, like maybe 25% higher, and do another step, and then 25% higher and do another step.

So these are not epochs, these are just single mini batches, and then we can plot on this chart here: okay, at 10 to the minus 7 what was the loss, and at 25% higher than that what was the loss, and then 25% higher than that what was the loss.

And so not surprisingly if you do that at the low learning rates the loss doesn't really come down because the learning rate is so small that these steps are tiny, tiny, tiny. And then gradually we get to the point where they're big enough to make a difference and the loss starts coming down because we've plotted here the learning rate against the loss, right.
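The idea is simple enough to sketch in plain PyTorch (this is not fastai's implementation; `model`, `train_dl` and `loss_func` are assumed to already exist):

```python
import math
import torch

def lr_find_sketch(model, train_dl, loss_func, start_lr=1e-7, end_lr=10, num_steps=100):
    "Train for a few mini batches, multiplying the learning rate by a fixed factor each step."
    lrs, losses = [], []
    lr = start_lr
    mult = (end_lr / start_lr) ** (1 / num_steps)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for i, (xb, yb) in enumerate(train_dl):
        if i >= num_steps: break
        for g in opt.param_groups: g['lr'] = lr   # set this step's learning rate
        loss = loss_func(model(xb), yb)
        loss.backward()
        opt.step()
        opt.zero_grad()
        lrs.append(lr); losses.append(loss.item())
        if not math.isfinite(loss.item()): break  # stop once the loss blows up
        lr *= mult
    return lrs, losses  # plot losses against lrs on a log scale and pick a point on the steep part
```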

So here the loss is coming down: as we continue to increase the learning rate the loss comes down, until we get to a point where our learning rate's too high, and so it flattens out and then, oh, it's getting worse again — so here's the point, above about 0.1, where we're in this territory.

So what we really want is somewhere around here where it's kind of nice and steep. So you can actually ask the learning rate finder — we used lr_find to get this plot — and we can get back from it the minimum and the steepest point. And so steep is where it was steepest; the steepest point was 5e-3, and the minimum point divided by 10, which is quite a good rule of thumb, is 1e-2.

So somewhere around this range might be pretty good. So each time you run it you'll get different values; a different time we ran it we thought that maybe 3e-3 would be good, so we picked that. And you'll notice the learning rate finder uses a logarithmic scale, so be careful when interpreting it.

So we can now rerun the training, setting the learning rate to a number we picked from the learning rate finder, which in this case was 3e-3. And we can see now that's looking good, right: we've got an 8.3% error rate after 3 epochs. So this idea of the learning rate finder is very straightforward, I can describe it to you in a couple of sentences, it doesn't require any complex math, and yet it was only invented in 2015, which is super interesting, right — it just shows that there's so many interesting things just to learn and discover.
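In fastai (as used in this 2020 course), that workflow is roughly:

```python
from fastai.vision.all import *

# dls is the pets DataLoaders built earlier in the lesson
learn = cnn_learner(dls, resnet34, metrics=error_rate)

# lr_find returns two suggestions: the minimum-loss point divided by 10, and the steepest point
lr_min, lr_steep = learn.lr_find()

# pick something in the suggested range and train with it
learn.fine_tune(2, base_lr=3e-3)
```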

I think part of the reason perhaps for this it took a while is that you know engineers kind of love using lots and lots of computers. So before the learning rate finder came along people would like run lots of experiments on big clusters to find out which learning rate was the best rather than just doing a batch at a time.

And I think partly also the idea of having a thing where a human is in the loop where we look at something and make a decision is also kind of unfashionable a lot of folks in research and industry love things which are fully automated. But anyway it's great we now have this tool because it makes our life easier and fastai is certainly the first library to have this and I don't know if it's still the only one to have it built in at least to the basic the base library.

So now we've got a good learning rate how do we fine-tune the weights so so far we've just been running this fine-tune method without thinking much about what it's actually doing. But we did mention in chapter one lesson one briefly basically what's happening with fine-tune what is transfer learning doing.

And before we look at that let's take a question. Is the learning rate plot in lr_find plotted against one single mini batch? No it's not; it's actually just the standard walking through the data loader, so just getting the usual mini batches of the shuffled data.

And so it's kind of just normal training and the only thing that's being different is that we're increasing the learning rate a little bit after each mini batch and and keeping track of it. Along with that is is the network reset to the initial status after each trial? No certainly not we actually want to see how it learns we want to see it improving so we don't reset it to its initial state until we're done.

So at the end of it we go back to the random weights we started with or whatever the weights were at the time we ran this. So what we're seeing here is something that's actually the actual learning that's happening as we at the same time increase the learning rate.

Why would an ideal learning rate found with a single mini batch at the start of training keep being a good learning rate even after several epochs and further loss reductions? Great question — it absolutely wouldn't, so let's look at that too, shall we? And ask one more? This is an important point, so do ask — it's very important.

For the learning rate finder why use the steepest and not the minimum? We certainly don't want the minimum because the minimum is the point at which it's not learning anymore. Right so so this flat section at the bottom here means in this mini batch it didn't get better. So we want the steepest because that's the mini batch where it got the most improved and that's what we want we want the weights to be moving as fast as possible.

As a rule of thumb though we do find that the minimum divided by 10 works pretty well; that's Sylvain's favorite approach and he's generally pretty spot-on with that, so that's why we actually print out those two things: lr_min is actually the minimum divided by 10, and lr_steep suggests the steepest point.

Great, good questions all. So let's remind ourselves what transfer learning does. So with transfer learning, remember what our neural network is: it's a bunch of linear models, basically, with activation functions between them, and our activation functions are generally ReLUs, rectified linear units. If any of this is fuzzy, have a look at the 04 notebook again to remind yourself.

And so each of those linear layers has a bunch of parameters, so the whole neural network has a bunch of parameters. And so after we train a neural network on something like ImageNet we have a whole bunch of parameters that aren't random anymore, they're actually useful for something. And we've also seen that the early layers seem to learn about fairly general ideas like gradients and edges, and the later layers learn about more sophisticated ideas like what do eyes look like, what does fur look like, or what does text look like.

So with transfer learning we take a model, in other words a set of parameters, which has already been trained on something like ImageNet. We throw away the very last layer, because the very last layer is the bit that specifically says which one of those (in the case of ImageNet, 1,000) categories this image is in.

We throw that away and we replace it with random weights sometimes with more than one layer of random weights and then we train that. Now yes. Oh I just wanted to make a comment and that's that I think the learning rate finder I think after you learn about it the idea almost seems kind of so simple or approximate that it's like wait this shouldn't work like or you know shouldn't you have to do something more more complicated or more precise that it's like I just want to highlight that this is a very surprising result that some kind of a such a simple approximate method would be so helpful.

Yeah, I would particularly say it's surprising to people who are not practitioners or have not been practitioners for long. I've noticed that a lot of my students at USF have a tendency to kind of jump in and try to do something very complex where they account for every possible imperfection from the start, and it's very rare that that's necessary, so one of the cool things about this is it's a good example of trying the easiest thing first and seeing how well it works.

And this was a very big innovation when it came out. I think it's kind of easy to take for granted now, but this was super, super helpful when it appeared — and it was also nearly entirely ignored. None of the research community cared about it, and it wasn't until fastai, I think in our first course, talked about it that people started noticing. And for quite a few years, in fact it's still a bit the case, super fancy researchers still didn't know about the learning rate finder and kept getting beaten by first-lesson fastai students on practical problems, because they could pick learning rates better and they could do it without a cluster of thousands of computers.

Okay, so transfer learning. So we've got our pre-trained network, and it's really important that every time you hear the words pre-trained network you're thinking: a bunch of parameters which have particular numeric values and go with a particular architecture, like ResNet-34. We've thrown away the final layer and replaced it with random numbers, and so now we want to train, to fine-tune, this set of parameters for a new set of images, in this case pets.

So fine_tune is the method we call to do that, and to see what it does we can go learn.fine_tune?? and we can see the source code, and here is the signature of the function. And so the first thing that happens is we call freeze. So freeze is actually the method which makes it so only the last layer's weights will get stepped by the optimizer.

So the gradients are calculated just for those last layers of parameters and the step is done just for those last layer of parameters. So then we call fit and we fit for some number of epochs which by default is 1. We don't change that very often and what that fit is doing is it's just fitting those randomly added weights which makes sense right they're the ones that are going to need the most work because at the time which we add them they're doing nothing at all they're just random.

So that's why we spend one epoch trying to make them better. After you've done that you now have a model which is much better than we started with it's not random anymore. All the layers except the last are the same as the pre-trained network the last layer has been tuned for this new data set.

So the closer you get to the right answer, as you can kind of see in this picture, the smaller the steps you want to take, generally speaking. The next thing we do is we divide our learning rate by 2, and then we unfreeze, so that means we make it so that all the parameters can now be stepped and all of them will have gradients calculated, and then we fit for some more epochs, and this is something we have to pass to the method.

And so that's now going to train the whole network. So if we want to we can kind of do this by hand right and actually CNN learner will by default freeze the model for us freeze the parameters for us so we actually don't have to call freeze. So if we just create a learner and then fit for a while this is three epochs of training just the last layer and so then we can just manually do it ourselves unfreeze.

And so now at this point as the question earlier suggested maybe this is not the right learning rate anymore so we can run LR find again and this time you don't see the same shape you don't see this rapid drop because it's much harder to train a model that's already pretty good.

But instead you just see a very gentle little gradient. So generally here what we do is we kind of try to find the bit where it starts to get worse again, which is about here, and go about a multiple of 10 less than that, so about 1e-5 I would guess — which, yep, that's what we picked.

So then after unfreezing and finding our new learning rate we can do a bunch more, and so here we are, we're getting down to 5.9 percent error, which is okay, but there's better we can do. And the reason we can do better is that at this point here we're training the whole model at a 1e-5, so 10 to the minus 5, learning rate, which doesn't really make sense, because we know that the last layer is still not that great — it's only had three epochs of training from random, so it probably needs more work.
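The sequence just described looks roughly like this in fastai (a sketch following the book; the epoch counts are the ones mentioned above):

```python
learn = cnn_learner(dls, resnet34, metrics=error_rate)  # cnn_learner freezes the body by default
learn.fit_one_cycle(3, 3e-3)          # train only the random final layer for three epochs
learn.unfreeze()                      # now every parameter can be stepped
learn.lr_find()                       # the curve is much gentler now; pick something around 1e-5
learn.fit_one_cycle(6, lr_max=1e-5)   # train the whole network at that lower rate
```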

We know that the second last layer was probably pretty specialized to ImageNet and less specialized to pet breeds, so that probably needs a lot of work. Whereas the early layers, the kind of gradients and edges, probably don't need to be changed much at all. But what we'd really like is to have a small learning rate for the early layers and a bigger learning rate for the later layers.

And this is something that we developed at fastai and we call it discriminative learning rates. And Jason Yosinski is actually a guy who wrote a great paper that some of these ideas are based on; he actually showed that different layers of the network really want to be trained at different rates.

Although he didn't kind of go as far as trying that out and seeing how it goes it was more of a theoretical thing. So in fast AI if we want to do that we can pass to our learning rate rather than just passing a single number we can pass a slice.

Now a slice is a special built-in feature of Python. It's just an object which basically can have a few different numbers in it; in this case we're passing it two numbers. And the way we read those — basically what this means in fastai as a learning rate — is that the very first layer will have this learning rate, 10 to the minus 6.

The very last layer will be 10 to the minus 4. And the layers between the two will be kind of equal multiples. So they'll kind of be equally spaced learning rates from the start to the end. So here we can see basically doing our kind of own version of fine-tune.

We create the learner, we fit with that automatically frozen version, we unfreeze, we fit some more. And so when we do that you can see this works a lot better we're getting down to 5.3, 5.1, 5.4 error. So that's pretty great. One thing we'll notice here is that we did kind of overshoot a bit.
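Put together, the whole sequence is roughly (a sketch following the book):

```python
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fit_one_cycle(3, 3e-3)      # train the frozen model (just the head)
learn.unfreeze()
# discriminative learning rates: the earliest layers get 1e-6, the last layers 1e-4,
# with the layers in between spread evenly between the two
learn.fit_one_cycle(12, lr_max=slice(1e-6, 1e-4))
```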

It seemed like more like epoch number 8 was better. So kind of back before you know well actually let me explain something about fit one cycle. So fit one cycle is a bit different to just fit. So what fit one cycle does is it actually starts at a low learning rate.

It increases it gradually for the first one-third or so of the batches until it gets to a high learning rate. The highest, this is why it's called LR max. It's the highest learning rate we get to. And then for the remaining two-thirds or so of the batches it gradually decreases the learning rate.

And the reason for that is just that well actually it's kind of like empirically researchers have found that works the best. In fact this was developed again by Leslie Smith the same guy that did the learning rate finder. Again it was a huge step you know it really dramatically accelerated the speed at which we can train neural networks and also made them much more accurate.

And again the academic community basically ignored it. In fact the key publication that developed this idea did not even pass peer review. And so the reason I mention this now is to say that we don't really just want to go back and pick the model that was trained back here, because we could probably do better — we really want a model where the learning rate has had a chance to come all the way back down by the end of training.

So what I would generally do here is change this 12 to an 8, because this is looking good, and then I would retrain it from scratch. Normally you'd find a better result. You can plot the loss and you can see how the training and validation loss moved along.
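That plot comes from the learner's recorder; in fastai it's just:

```python
learn.recorder.plot_loss()   # training and validation loss over the course of training
```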

And you can see here that you know the error rate was starting to get worse here. And what you'll often see is often the validation loss will get worse a bit before the error rate gets worse. We're not really seeing it so much in this case but the error rate and the validation loss don't always or they're not always kind of in lockstep.

So what we're plotting here is the loss but you actually kind of want to look to see mainly what's happening with the error rate because that's actually the thing we care about. Remember the loss is just like an approximation of what we care about that just happens to have a gradient that works out nicely.

So how do you make it better now? We're already down to just 5.4 or if we'd stopped a bit earlier maybe we could get down to 5.1 or less error. On 37 categories that's pretty remarkable. That's a very, very good pet breed predictor. If you want to do something even better you could try creating a deeper architecture.

So a deeper architecture is just literally putting more pairs of an activation function (also known as a non-linearity) followed by these little linear models — put more pairs onto the end. And basically the number of these sets of layers you have is the number that you'll see at the end of an architecture's name.

So there's ResNet-18, ResNet-34, ResNet-50, and so forth. Having said that, you can't really pick ResNet-19 or ResNet-38. I mean, you could make one, but nobody's created a pre-trained version of that for you, so you won't be able to do any fine-tuning. So you can theoretically create any number of layers you like, but in practice most of the time you'll want to pick a model that has a pre-trained version.

So you kind of have to select from the sizes people have pre-trained and there's nothing special about these sizes they're just ones that people happen to have picked out. For the bigger models there's more parameters and more gradients that are going to be stored on your GPU and you will get used to the idea of seeing this error unfortunately out of memory.

So that's not out of memory in your RAM, that's out of memory in your GPU; CUDA is referring to the language and system used for your GPU. So if that happens, unfortunately you actually have to restart your notebook — that's kernel restart — and try again, and that's a really annoying thing, but such is life.

One thing you can do if you get an out-of-memory error is, after you've created your CNN learner, add this magic incantation: to_fp16. What that does is it uses, for most of the operations, numbers that use half as many bits as usual, so they're less accurate — this is half-precision floating point, or FP16 — and that will use less memory, and on pretty much any NVIDIA card created in 2020 or later, and some more expensive cards even created in 2019, that's often going to result in a two to three times speed up in terms of how long it takes as well.
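For example, roughly what the notebook does here (a sketch; resnet50 is the bigger architecture discussed next):

```python
from fastai.vision.all import *

# same as before, but with a bigger architecture and half-precision training
learn = cnn_learner(dls, resnet50, metrics=error_rate).to_fp16()
learn.fine_tune(6, freeze_epochs=3)
```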

So here, if I add in to_fp16, I will often be seeing much faster training, and in this case what I actually did is I switched to a ResNet-50, which would normally take about twice as long, and my per-epoch time has gone from 25 seconds to 26 seconds.

So the fact that we used a much bigger network and it was no slower is thanks to to_fp16. But you'll see the error rate hasn't improved, it's pretty similar to what it was, and so it's important to realize that just because we increase the number of layers it doesn't always get better.

So it tends to require a bit of experimentation to find what's going to work for you and of course don't forget the trick is use small models for as long as possible and to do all of your cleaning up and testing and so forth and wait until you're all done to try some bigger models because they're going to take a lot longer.

Okay questions. How do you know or suspect when you can quote do better? You have to always assume you can do better because you never know. So you just have to I mean part of it though is do you need to do better or do you already have a good enough result to handle the actual task you're trying to do.

People often spend too much time fiddling around with their models rather than actually trying to see whether the model is already going to be super helpful. So the sooner you can actually try to use your model to do something practical, the better. But yeah, how much can you improve it?

Who knows? You know, go through the techniques that we're teaching in this course and try them and see which ones help. Unless it's a problem that somebody has already tried before and written down their results in a paper or a Kaggle competition or something, there's no way to know how good it can get.

So don't forget after you do the questionnaire to check out the further research section. And one of the things we've asked you to do here is to read a paper. So find the learning rate finder paper and read it and see if you can kind of connect what you read up to the things that we've learned in this lesson.

And see if you can maybe even implement your own learning rate finder you know as manually as you need to see if you can get something that you know based on reading the paper to work yourself. You can even look at the source code of fastai's learning rate finder of course.

And then can you make this classifier better? And so this is further research right? So maybe you can start doing some reading to see what else could you do. Have a look on the forums see what people are trying. Have a look on the book website or the course website to see what other people have achieved and what they did and play around.

So we've got some tools in our toolbox now for you to experiment with. So that is that is pet breeds that is a you know a pretty tricky computer vision classification problem. And we kind of have seen most of the pieces of what goes into the training of it.

We haven't seen how to build the actual architecture but other than that we've kind of worked our way up to understanding what's going on. So let's build from there into another kind of data set one that involves multi-label classification. So what's multi-label classification? Well maybe so maybe let's look at an example.

Here is a multi-label dataset where you can see that it's not just one label on each image — sometimes there are three: bicycle, car, person. I don't actually see the car here, I guess it's been cropped out. So a multi-label dataset is one where you've still got one image per row, but you can have zero, one, two or more labels per row.

So we're going to have a think about and look at how we handle that. But first of all let's take another question. Does dropping floating point precision by switching from FP32 to FP16 have an impact on the final result? Yes it does. Often it makes it better, believe it or not.

It seems like the little bit of rounding off it does when it drops some of that precision creates a bit more bumpiness, a bit more uncertainty, a bit more of a stochastic nature. And you know, when you introduce more slightly random stuff into training it very often makes it a bit better.

And so yeah, FP16 training often gives us a slightly better result, but I wouldn't say it's generally a big deal either way, and certainly it's not always better. Would you say this is a bit of a pattern in deep learning — that less exact, more stochastic approaches often win? For sure, and not just in deep learning but machine learning more generally.

You know, there's been some interesting research looking at things like matrix factorization techniques, where if you want them to go super fast you can use lots of machines, you can use randomization, and when you then use the results you often find you actually get better outcomes. Just a brief plug for the fast.ai computational linear algebra course, which talks a little bit about randomization.

Does it really? Well, that sounds like a fascinating course, and look at that, it's the number one hit here on Google, so easy to find. By somebody called Rachel Thomas — hey, that person's got the same name as you, Rachel Thomas. All right, so how are we going to do multi-label classification?

So let's look at a data set called Pascal which is a pretty famous data set. We'll look at the version that goes back to 2007 been around for a long time. And it comes with a CSV file which we will read in CSV is comma separated values and let's take a look.

Each row has a file name, one or more labels, and something telling you whether it's in the validation set or not. So the list of categories in each image is a space-delimited string — but that doesn't mean it has a "horse person", it has a horse and a person. PD here stands for pandas.

Pandas is a really important library for any kind of data processing and you use it all the time in machine learning and deep learning. So let's have a quick chat about it. Not a real panda it's the name of a library and it creates things called data frames. That's what the DF here stands for and a data frame is a table containing rows and columns.

Pandas can also do some slightly more sophisticated things than that but we'll treat it that way for now. So you can read in a data frame by saying PD for pandas. Pandas read CSV, give it a file name, you've now got a data frame you can call head to see the first few rows of it for instance.

A data frame has an iloc (integer location) property which you can index into as if it was an array; in fact it looks just like numpy. So colon means every row — remember it's row comma column — and zero means the zeroth column, and so here is the first column of the data frame.

You can do the exact opposite, so the zeroth row and every column is going to give us the first row, and you can see the row has column headers and values. So it's a little bit different to numpy, and remember, if there's a comma colon, or a bunch of comma colons, at the end of indexing in numpy or pytorch or pandas or whatever, you can get rid of it, and these two are exactly the same.

You could do the same thing here by grabbing the column by name; the first column is fname, so you can say df['fname'] and you get that first column. You can create new columns, so here's a tiny little data frame I've created from a dictionary, and I could create a new column by, for example, adding two columns, and you can see there it is.
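Roughly, the pandas operations just described (a sketch; `path` is assumed to be the untarred Pascal dataset path from above):

```python
import pandas as pd

df = pd.read_csv(path/'train.csv')   # read the CSV into a DataFrame
df.head()                            # first few rows

df.iloc[:, 0]    # every row, column 0
df.iloc[0, :]    # row 0, every column (same as df.iloc[0])
df['fname']      # grab a column by name

tmp_df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
tmp_df['c'] = tmp_df['a'] + tmp_df['b']   # create a new column from existing ones
tmp_df
```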

So it's like a lot like numpy or pytorch except you have this idea of kind of rows and and column named columns and so it's all about kind of tabular data. I find its API pretty unintuitive a lot of people do but it's fast and powerful so it takes a while to get familiar with it but it's worth taking a while and the creator of pandas wrote a fantastic book called Python for data analysis which I've read both versions and I found it fantastic.

It doesn't just cover pandas, it covers other stuff as well, like IPython and numpy and matplotlib, so I highly recommend this book. This is our table, so what we want to do now is construct data loaders that we can train with. We've talked about the data block API as being a great way to create data loaders, so let's use this as an opportunity to create a data block and then data loaders for this, and let's try to do it right from square one.

So let's see exactly what's going on with DataBlock. So first of all let's remind ourselves about what a dataset and a data loader is. A dataset is a fairly abstract idea: a dataset is anything which you can index into, like so, and which you can take the length of, like so.

So for example the list of the lowercase letters along with a number saying which lowercase letter it is I can index into it to get 0, a I can get the length of it to get 26 and so therefore this qualifies as a data set and in particular data sets normally you would expect that when you index into it you would get back a tuple because you've got the independent and dependent variables not necessarily always just two things it could be more there could be less but two is the most common.

So once we have a dataset we can pass it to a DataLoader; we can request a particular batch size, we can shuffle or not, and so there's our data loader from a. We could grab the first value from that iterator, and here is the shuffled result: 7 is 'h', 4 is 'e', 20 is 'u' and so forth. And so remember, a mini batch has a mini batch of the independent variable and a mini batch of the dependent variable.

If you want to see how the two correspond to each other you can use zip so if I zip passing in this list and then this list so B0 and B1 you can see what zip does in Python is it grabs one element from each of those in turn and gives you back the tuples of the corresponding elements.

Since we're just passing in all of the elements of B to this function Python has a convenient shortcut for that which is just say star B and so star means insert into this parameter list each element of B just like we did here so these are the same thing.
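A minimal sketch of those last few steps (using fastai's DataLoader and first, as the notebook does):

```python
import string
from fastai.vision.all import *   # DataLoader, first, etc.

a = list(enumerate(string.ascii_lowercase))   # [(0, 'a'), (1, 'b'), ...]
len(a), a[0]                                  # indexable and has a length, so it qualifies as a dataset

dl = DataLoader(a, batch_size=8, shuffle=True)
b = first(dl)          # one mini batch: (a tensor of indices, a tuple of letters)

list(zip(b[0], b[1]))  # pair each independent value with its dependent value
list(zip(*b))          # exactly the same thing: *b unpacks b into zip's arguments
```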

So this is a very handy idiom that we use a lot in Python: zip star something is kind of a way of transposing something from one orientation to another. All right, so we've got a dataset, we've got a data loader, and then what about Datasets? Datasets is an object which has a training dataset and a validation dataset, so let's look at one.

Now normally you don't start with kind of an enumeration like this like with an independent variable and a dependent variable normally you start with like a file name for example and then you you kind of calculate or compute or transform your file name into an image by opening it and a label by for example looking at the file name and grabbing something out of it.

So for example we could do something similar here this is what datasets does so we could start with just the lowercase letters so this is still a data set right because we can index into it and we can get the length of it although it's not giving us tuples yet.

So if we now pass that list to the datasets class and index into it we get back the tuple and it's actually a tuple with just one item this is how Python shows a tuple with one item is it puts it in parentheses and a comma and then nothing okay.

So in practice what we really want to do is to say like okay we'll take this and do something to compute an independent variable and do something to compute a dependent variable but here's a function we could use to compute an independent variable which is to stick an A on the end and our dependent variable might just be the same thing with a B on the end.

So here's two functions so for example now we can call datasets passing in A and then we can pass in a list of transformations to do and so in this case I've just got one which is this function add an A on the end so now if I index into it I don't get A anymore I get AA.

If you pass multiple functions then it's going to do multiple things, so here I've got f1 then f2: 'aab', that's this one then that's this one. And you'll see this is a list of lists, and the reason for that is that you can also pass something like this, a list containing f1 and a list containing f2, and this will actually take each element of a, pass it through this list of functions (there's just one of them) to give you 'aa', and then start again and separately pass it through this list of functions (there's just one) to get 'ab'. And so this is actually kind of the main way we build up independent variables and dependent variables in fastai: we start with something like a file name and we pass it through two lists of functions — one of them will generally kind of open up the image, for example, and the other one will kind of parse the file name, for example — and give you an independent variable and a dependent variable.
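As a small sketch of that, following the notebook's toy example:

```python
import string
from fastai.vision.all import *   # Datasets

a = list(string.ascii_lowercase)

def f1(o): return o + 'a'
def f2(o): return o + 'b'

Datasets(a, [[f1]])[0]        # ('aa',): one pipeline, so a one-element tuple
Datasets(a, [[f1, f2]])[0]    # ('aab',): two functions composed in one pipeline
Datasets(a, [[f1], [f2]])[0]  # ('aa', 'ab'): two pipelines, one builds x and one builds y
```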

So you can then create a DataLoaders object from Datasets by passing in the datasets and a batch size, and so here you can see I've got, shuffled, 'oa', 'ia', etc. and 'ob', 'ib', etc. So this is worth studying to make sure you understand what Datasets and DataLoaders are. We don't often have to create them from scratch, we can create a DataBlock to do it for us, but now we can see what the data block has to do — so let's see how it does it.

So we can start by creating an empty DataBlock. An empty DataBlock is going to take our data frame — we're going back to looking at the data frame, which remember was this guy — and so if we pass in our data frame we'll now find that this DataBlock has created datasets, a training and a validation dataset, for us. And if we look at the training set it'll give us back an independent variable and a dependent variable, and we'll see that they are both the same thing: this is the first row of the table — it's actually shuffled, so it's a random row of the table — repeated twice. And the reason for that is that by default the DataBlock assumes that we have two things, the independent variable and the dependent, or the input and the target, and by default it just keeps exactly whatever you gave it. To create the training set and the validation set, by default it just randomly splits the data with a 20% validation set. So that's what's happened here, and this is not much use.

What we actually want to do, if we look at x for example, is grab the fname, the file name field, because we want to open this image — that's going to be our independent variable — and then for the label we're going to want this here, person and cat. So we can actually pass these as parameters: get_x and get_y, functions that return the bit of data that we want. And you can create and use a function in the same line of code in Python by saying lambda. So lambda r means create a function that doesn't have a name; it's going to take a parameter called r, and we don't even have to say return — it's going to return the fname column in this case. And get_y is a function that takes an r and returns the labels column. So now we can call dblock.datasets, we can grab a row from the training set, and you can see, look, here it is: there is the image file name and there is the space-delimited list of labels.

So here's exactly the same thing again but done with named functions, so now the one line of code above has become three lines of code, but it does exactly the same thing. Okay, we don't get back the same result because — wait, why don't we get the same result? Oh, I know why: because it's randomly picking a different validation set, because the random split is done differently each time. That's why we don't get the same result. One thing to note: be careful of lambdas. If you want to save this data block for use later, you won't be able to — Python doesn't like saving things that contain lambdas — so most of the time in the book and the course we avoid lambdas for that reason, because it's often very convenient to be able to save things. We use the word serialization here; that basically just means saving something.

This is not enough to open an image, because we don't have the path. So rather than just using this function to grab the fname column, we should actually use pathlib to go path slash train slash that column. And then for the y, again, the labels field is not quite enough, we actually have to split on space — but this is Python, we can use any function we like. And so then we use the same three lines of code as here, and now we've got a path and a list of labels, so that's looking good.
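The notebook's version of those three lines looks roughly like this (a sketch following the book; `path` and `df` are the Pascal path and DataFrame from above):

```python
from fastai.vision.all import *

def get_x(r): return path/'train'/r['fname']   # build the full path to the image file
def get_y(r): return r['labels'].split(' ')    # split the space-delimited label string

dblock = DataBlock(get_x=get_x, get_y=get_y)
dsets = dblock.datasets(df)
dsets.train[0]   # (a Path to an image, a list of label strings)
```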
So we want this path to be opened as an image. In the data block API you pass a blocks argument, where you tell it, for each of the things in your tuple — there's two of them — what kind of block you need. So we need an ImageBlock to open an image, and then, where in the past we've used a CategoryBlock for categorical variables, this time we don't have a single category, we've got multiple categories, so we have to use a MultiCategoryBlock.

So once we do that and have a look, we now have a 500 by 375 image as our independent variable, and as a dependent variable we have a long list of zeros and ones. The long list of zeros and ones is the labels as a one-hot encoded vector, a rank one tensor: specifically, there will be a zero in every location in the vocab where there is not that kind of object in this image, and a one in every location where there is. So for this one there's just a person, so this must be the location in the vocab where there's a person. Do you have any questions?

So one-hot encoding is a very important concept, and we didn't have to use it before, right — we could just have a single integer saying which one thing it is. But when we've got lots of things, lots of potential labels, it's convenient to use this one-hot encoding, and it's actually what's going to happen with the actual matrices anyway: when we actually compare the activations of our neural network to the target, it's actually going to be comparing each one of these.

Okay, so the categories, as I mentioned, are based on the vocab, so we can grab the vocab from our datasets object, and then we can say, okay, let's look at the first row, let's look at the dependent variable, and let's look for where the dependent variable is one. And then we can pass those indexes to the vocab and get back a list of what actually was there. And again, each time I run this I'm going to get different results, because I called .datasets again here, so it's going to give me a different train/validation split — and this time it turns out that it's actually a chair.
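That decoding step is roughly (following the notebook; `dsets` is the Datasets object just built):

```python
import torch

x, y = dsets.train[0]            # y is the one-hot encoded target
idxs = torch.where(y == 1.)[0]   # positions where the one-hot vector is 1
dsets.train.vocab[idxs]          # the category names at those positions
```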
And we have a question: shouldn't the tensor be of integers — why is it a tensor of floats? Yeah, conceptually this is a tensor of integers, they can only be 0 or 1, but we are going to be using a cross-entropy-style loss function, and we're going to actually need to do floating point calculations on them, so it's going to be faster to just store them as float in the first place rather than converting backwards and forwards. Even though they're conceptually an int, we're not going to be doing int-style calculations with them. Good question.

I mentioned that by default the DataBlock uses a random split. You might have noticed in the data frame, though, that there's a column saying what validation set to use, and if the dataset you're given tells you what validation set to use you should generally use it, because that way you can compare your validation set results to somebody else's. So you can pass a splitter argument, which again is a function, and so we're going to pass it a function that's also called splitter, and the function is going to return the indexes where is_valid is false — that's going to be the training set — and the indexes where it is valid — that's going to be the validation set. So the splitter argument is expected to return two lists of integers, and if we do that we get the same thing again, but now we're using the correct training and validation sets.

Another question? Sure: any particular reason we don't use 8-bit floating point — is it just that the precision is too low? Yeah, trying to train with 8-bit precision is super difficult; it's so flat and bumpy, it's pretty difficult to get decent gradients there. But you know, it's an area of research. The main thing people do with 8-bit or even 1-bit data types is they take a model that's already been trained with 16-bit or 32-bit floating point and then they kind of round it off — it's called discretizing — to create a kind of purely integer or even binary network which can do inference much faster. Figuring out how to train with such low-precision data is an area of active research. I suspect it's possible, and people have fiddled around with it and had some success; I think it could turn out to be super interesting, particularly for stuff that's being done on low-powered devices that might not even have a floating point unit.

Right, so the last thing we need to do is to add our item transforms, RandomResizedCrop. We've talked about that enough so I won't go into it, but basically that means we're now going to ensure that everything has the same shape so that we can collate it into a data loader. And now, rather than going .datasets, we go .dataloaders and display our data — and remember, if something goes wrong, as we saw last week, you can call summary to find out exactly what's happening in your data block. So this is a section that's really worth studying, because data blocks are super handy, and if you haven't used fastai2 before they won't be familiar to you, because no other library uses them. And this is really showing you how to go right back to the start and gradually build them up, so hopefully that'll make a whole lot of sense.
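Put together, the finished multi-label data block is roughly (a sketch following the book):

```python
def get_x(r): return path/'train'/r['fname']
def get_y(r): return r['labels'].split(' ')

def splitter(df):
    "Use the is_valid column instead of a random split."
    train = df.index[~df['is_valid']].tolist()
    valid = df.index[df['is_valid']].tolist()
    return train, valid

dblock = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
                   splitter=splitter,
                   get_x=get_x,
                   get_y=get_y,
                   item_tfms=RandomResizedCrop(128, min_scale=0.35))
dls = dblock.dataloaders(df)
dls.show_batch(nrows=1, ncols=3)
# dblock.summary(df)   # if something goes wrong, this shows each step of the pipeline
```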
Now we're going to need a loss function again, and to do that let's start by just creating a learner. This creates a ResNet-18 from the DataLoaders object that we just created. And let's grab one batch of data, and put that into our mini batch of independent variables. Then learn.model is the thing that actually contains the model itself, in this case the CNN, and you can treat it as a function, and so therefore we can just pass something to it. So if we pass a mini batch of the independent variable to learn.model it will return the activations from the final layer, and that is of shape 64 by 20.

So any time you get a tensor back, look at its shape — and in fact, before you look at its shape, predict what the shape should be, and then make sure that you're right. If you're not, you guessed wrong, so try to understand where you made a mistake or whether there's a problem with your code. In this case 64 by 20 makes sense, because we have a mini batch size of 64, and for each of those we're going to make predictions about the probability of each of these 20 possible categories.

And we have a question — two questions. All right: is the data block API compatible with out-of-core datasets like Dask? Yeah, the data block API can do anything you want it to do. If we go back to the start, you can create an empty one and then you can pass it anything that is indexable, and that can be anything you like, and pretty much anything can be made indexable in Python. Something like Dask is certainly indexable, so that works perfectly fine. If it's not indexable, like it's a network stream or something like that, then you'd use the DataLoaders and Datasets APIs directly, which we'll learn about either in this course or the next one. But yeah, anything that you can index into, which certainly includes Dask, you can use with data blocks.

Next question: where do you put the images for multi-label with that CSV table — should they be in the same directory? They can be anywhere you like. So in this case we used a pathlib object, like so, and — let me think about this — what's happening here is the path is, oh, it's saying dot. The reason for that is that Path.BASE_PATH is currently set to path, and so that displays things relative to it — let's get rid of that. Okay, so the path we set is here, right, and then when we said get_x it's saying path slash train slash whatever, so this is an absolute path, and here is the exact path. So you can put them anywhere you like, you just have to say what the path is, and then if you don't want to get confused by having this big long prefix that you don't want to see all the time, just set BASE_PATH to the path you want everything to be relative to, and then it'll just print things out in this more convenient manner.

Right, so this is really important, that you can do this: you can create a learner, you can grab a batch of data, you can pass it to the model — this is just plain pytorch, this line here, no fastai — you can see the shape, and you can recognize why it has this shape. And so now if you have a look, here are the 20 activations. Now this is not a trained model, it's a pre-trained model with a random set of final layer weights, so these specific numbers don't mean anything, but it's just worth remembering this is what activations look like — and most importantly, they're not between 0 and 1.

If you remember from the MNIST notebook, we know how to scale things between 0 and 1: we can pop them into the sigmoid function. So the sigmoid function is something that scales everything to be between 0 and 1, so let's use that. You'll also hopefully remember from the MNIST notebook that the MNIST loss function first did sigmoid, then it did torch.where, and then it did .mean. So we're going to use exactly the same thing as the MNIST loss function, and we're just going to do one thing, which is to add .log, for the same reason that we talked about when we were looking at softmax — we talked about why log is a good idea as a transformation. We saw in the MNIST notebook we didn't need it, but we're going to train faster and more accurately if we use it, because it's just going to be better behaved, as we've seen.

So this particular function, which is identical to MNIST loss plus .log, has a specific name: it's called binary cross entropy. We used it for the threes versus sevens problem, to decide for that column, is it a three or not. But because we can use broadcasting in PyTorch and element-wise arithmetic, this function, when we pass it a whole matrix, is going to be applied to every column — it'll basically do a torch.where on every column separately, in every item separately. And that's great: it basically means that this binary cross entropy function is going to be just like MNIST loss, but rather than just being "is this the number three", it'll be "is this a dog, is this a cat, is this a car, is this a person, is this a bicycle", and so forth. This is where PyTorch is so cool: we can write one thing and then have it expand to handle higher-dimensional tensors without doing any extra work.
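Written out, that loss is roughly (following the book's version):

```python
import torch

def binary_cross_entropy(inputs, targets):
    "Sigmoid the activations, then take the log-likelihood of each 0/1 target and average."
    inputs = inputs.sigmoid()
    return -torch.where(targets == 1, inputs, 1 - inputs).log().mean()
```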
We don't have to write this ourselves, of course, because PyTorch has one, and it's called F.binary_cross_entropy, so we can just use PyTorch's. As we've talked about, there's always an equivalent module version, so this is exactly the same thing as a module: nn.BCELoss. These ones don't include the initial sigmoid; if you want to include that initial sigmoid you need F.binary_cross_entropy_with_logits, or the equivalent nn.BCEWithLogitsLoss. So BCE is binary cross entropy, and those are two functions plus two equivalent classes for multi-label or binary problems. And then the equivalent for single-label, like MNIST and pets, is nll_loss and cross_entropy — those are the equivalents of binary cross entropy and binary cross entropy with logits. These are pretty awful names, I think we can all agree, but it is what it is.

So in our case we have a one-hot encoded target and we want the one with a sigmoid in it, so the equivalent built-in is called BCEWithLogitsLoss. So we can make that our loss function, we can compare the activations to our targets, and we can get back a loss, and that's what we can use to train.

And then finally, before we take our break, we also need a metric. Now previously we've been using as a metric accuracy, or actually error rate — error rate is one minus accuracy. Accuracy only works for single-label datasets like MNIST and pets, because what it does is it takes the input, which is the final layer activations, and it does argmax. What argmax does is it says: what is the index of the largest number in those activations? So for example for MNIST, maybe the highest probability is seven, so this argmax would return seven, and then it says, okay, those are my predictions, and then it says, okay, is the prediction equal to the target or not, and then takes the floating point mean. So that's what accuracy is.

So argmax only makes sense when there's a single maximum thing you're looking for. In this case we've got multi-label, so instead we have to compare each activation to some threshold — by default it's 0.5 — and we basically say: if the sigmoid of the activation is greater than 0.5, let's assume that means that category is there, and if it's not, let's assume it means it's not there. And so this is going to give us a list of trues and falses for the ones that, based on the activations, it thinks are there, and we can compare that to the target and then again take the floating point mean.

So we can use the default threshold of 0.5, but we don't necessarily want to use 0.5, we might want to use a different threshold — and remember, when we create our learner we have to pass to the metrics argument a function. So what if we want to use a threshold other than 0.5? Well, we'd like to create a special function which is accuracy_multi with some different threshold, and the way we do that is we use a special built-in in Python called partial. Let me show you how partial works. Here's a function called say_hello: say hello to somebody with something. So say_hello('Jeremy') — well, the default is 'Hello', so it says 'Hello Jeremy'; say_hello('Jeremy', 'Ahoy!') is going to be 'Ahoy! Jeremy'. Let's create a special version of this function that will be more suitable for Sylvain: it's going to use French. So we can say partial: create a new function that's based on the say_hello function, but it's always going to set say_what to 'Bonjour', and we'll call that f. And now f('Jeremy') is 'Bonjour Jeremy' and f('Sylvain') is 'Bonjour Sylvain'. So you see, we've created a new function from an existing function by fixing one of its parameters.
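A sketch of that, following the notebook (`dls` is the multi-label DataLoaders built above):

```python
from functools import partial   # fastai also re-exports partial

def say_hello(name, say_what="Hello"): return f"{say_what} {name}."

say_hello('Jeremy'), say_hello('Jeremy', 'Ahoy!')   # ('Hello Jeremy.', 'Ahoy! Jeremy.')

f = partial(say_hello, say_what="Bonjour")
f("Jeremy"), f("Sylvain")                           # ('Bonjour Jeremy.', 'Bonjour Sylvain.')

# the same trick applied to the metric: accuracy_multi with a fixed threshold
learn = cnn_learner(dls, resnet18, metrics=partial(accuracy_multi, thresh=0.2))
learn.fine_tune(3, base_lr=3e-3, freeze_epochs=4)
```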
So we can do the same thing for accuracy_multi: say, let's use a threshold of 0.2, and we can pass that to metrics. And so let's create a cnn_learner, and you'll notice here we don't actually pass a loss function; that's because fastai knows enough to realize, hey, you're doing a classification model with a multi-label dependent variable, so I know what loss function you probably want — it does it for us. And we can call fine_tune, and here we have an accuracy of 94.5 after the first few epochs and eventually 95.1. That's pretty good, we've got an accuracy of over 95 percent.

Was 0.2 a good threshold to pick? Who knows — let's try 0.1. Oh, that's a worse accuracy. So I guess in this case we could try a higher threshold — 94-something, hmm, also not good. So what's the best threshold? Well, what we could do is call get_preds to get all of the predictions and all of the targets, and then we could calculate the accuracy at some threshold, and then we could say, okay, let's grab lots of numbers between 0.05 and 0.95 and, with a list comprehension, calculate the accuracy for all of those different thresholds and plot them. Ah — looks like we want a threshold somewhere a bit above 0.5. So cool, we can just use that, and it's going to give us 96 and a bit, which is a better accuracy.

This is something that a lot of theoreticians would be uncomfortable about: I've used the validation set to pick a hyperparameter, the threshold. And so people might say, oh, you're overfitting, using the validation set to pick a hyperparameter. But if you think about it, this is a very smooth curve, right — it's not some bumpy thing where we've accidentally randomly grabbed some unexpectedly good value. When you're picking a single number from a smooth curve, this is where the theory of "don't use a validation set for hyperparameter tuning" doesn't really apply. So it's always good to be practical — don't treat these things as rules, but as rules of thumb.
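That threshold sweep is roughly (following the book; get_preds has already applied the sigmoid, so we turn it off in the metric):

```python
import matplotlib.pyplot as plt
import torch

preds, targs = learn.get_preds()
xs = torch.linspace(0.05, 0.95, 29)
accs = [accuracy_multi(preds, targs, thresh=i, sigmoid=False) for i in xs]
plt.plot(xs, accs)
```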
We can have a look at what's in there, and we can see there are 24 directories, numbered from 1 to 24, and each one also has a .obj file — we're not going to use the .obj files, just the directories. Let's look inside one of them: there are about a thousand files in the first directory. Each of these 24 directories is a different person that they photographed, and for each person you can see pairs of files — frame 3 pose, frame 3 RGB, frame 4 pose, frame 4 RGB, and so forth. So in each case we've got the image, which is the RGB file, and the pose, which is the pose.txt file.

As we've seen, we can use get_image_files to recursively grab a list of all the image files in a path. And once we have an image file name, we can turn it into the corresponding pose file name by removing the last seven letters ("rgb.jpg") and adding "pose.txt". Here is a little function, img2pose, that does exactly that, so I can pass in an image file and get back a pose file.

PILImage.create is the fastai way to create an image — at least a PIL image — and it has a size. In computer vision, sizes are normally backwards: they're given as columns by rows, whereas PyTorch and NumPy tensors and arrays are rows by columns. That's confusing, but it's just how things are, I'm afraid. So here's an example of one of the images. When you look at the readme on the dataset website, they tell you how to extract the centre point from one of those text files, and it's just this function — it doesn't really matter exactly how it works, it is what it is. We call it get_ctr, and it returns the x,y coordinate of the centre of the person's face. So we can pass it as get_y, because get_y, remember, is the thing that gives us back the label.

So here's the thing: we can create a DataBlock, and for the independent variable's block we pass ImageBlock as usual, and for the dependent variable's block we say PointBlock — a tensor with two values in it. By combining those two things, this says: we want to do image regression with a dependent variable containing two continuous values. To get the items it calls get_image_files; to get the y it calls our get_ctr function. To split it — and this is important — we should make sure the validation set contains one or more people who don't appear in the training set at all. I've just grabbed person number 13, more or less at random, and I'll use all of their images as the validation set. They recorded this with an Xbox Kinect — a video device — so there are lots of images that look almost identical; if you assigned images to the validation set randomly, you would be massively overestimating how effective you are. You want to make sure you're actually doing a good job on a new set of people, not just a new set of frames — that's why we split this way. FuncSplitter is a splitter that takes a function, and in this case we're using a lambda to create that function. We'll also use data augmentation and normalization — normalization is actually done automatically now, but here we're doing it manually — which subtracts the mean and divides by the standard deviation of the original dataset the pre-trained model used, which is ImageNet.

So that's our DataBlock, and we can call dataloaders on it, passing in the path, to get our DataLoaders, and then show_batch — and that looks good: here are our faces with their points.
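Putting those pieces together, here's a sketch of the DataBlock roughly as the notebook builds it. img2pose and get_ctr follow the notebook and the dataset readme — treat the exact calibration arithmetic in get_ctr as an assumption rather than gospel:

```python
def img2pose(x): return Path(f'{str(x)[:-7]}pose.txt')   # swap the trailing 'rgb.jpg' for 'pose.txt'

cal = np.genfromtxt(path/'01'/'rgb.cal', skip_footer=6)   # camera calibration, per the readme
def get_ctr(f):
    ctr = np.genfromtxt(img2pose(f), skip_header=3)
    c1 = ctr[0] * cal[0][0]/ctr[2] + cal[0][2]
    c2 = ctr[1] * cal[1][1]/ctr[2] + cal[1][2]
    return tensor([c1, c2])                               # x,y of the centre of the face

biwi = DataBlock(
    blocks=(ImageBlock, PointBlock),                      # image in, one (x,y) point out
    get_items=get_image_files,
    get_y=get_ctr,
    splitter=FuncSplitter(lambda o: o.parent.name == '13'),  # person 13 -> validation set
    batch_tfms=[*aug_transforms(size=(240, 320)),
                Normalize.from_stats(*imagenet_stats)])
dls = biwi.dataloaders(path)
dls.show_batch(max_n=9, figsize=(8, 6))
```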
And — particularly as a student — don't just look at the pictures; look at the actual data. Grab a batch, put it into an xb and a yb (x batch and y batch), and look at the shapes to make sure they make sense. The yb is 64 by 1 by 2: 64 is the mini-batch size, so 64 rows, and each set of coordinates is a 1-by-2 tensor — a single point with two things in it. You could have several points — hands and face, or nose and ears and mouth, or whatever — but in this case we're using one point, and the point is represented by two values, the x and the y. The xb is 64 by 3 by 240 by 320: 240 rows by 320 columns is the pixel size of the images we're using, the mini-batch has 64 items, and the 3 is the number of channels, which here means the number of colours. If we open up some random grizzly bear image, go through each element of the first axis, and call show_image, you can see it has red, green, and blue as its three channels. So that's how a three-channel image is stored: as a rank-3 tensor of shape 3 by number-of-rows by number-of-columns — and a mini-batch of those is a rank-4 tensor, which is why xb has that shape. And here's a row from the dependent variable: there's that x,y location we talked about.

So we can now go ahead and create a learner, passing in our DataLoaders as usual and a pre-trained architecture as usual. If you think back, you may remember from lesson one we learned about y_range: y_range is where we tell fastai what range of values we expect to see in the dependent variable, and we generally want to use it when we're doing regression. The range of our coordinates is between minus one and one — that's how fastai and PyTorch treat coordinates: the left-hand side and the top are minus one, the right-hand side and the bottom are one. There's no point predicting something smaller than minus one or bigger than one, because that's outside the area our coordinates cover. (We have a question — sure, just a moment.) How does y_range work? It actually uses a function called sigmoid_range, which takes the sigmoid of x, multiplies it by high minus low, and adds low. Here's what sigmoid_range looks like for minus one to one: it's just a sigmoid where the bottom is the low and the top is the high, so all of our activations get mapped into the range from minus one to one.

Yes, Rachel? "Can you provide images with an arbitrary number of channels as inputs — specifically more than three channels?" Yes, you can have as many channels as you like. We've certainly seen images with fewer than three, because we've used grayscale; more than three is common as well — you could have an infrared band, satellite images often have multispectral bands, and there are some kinds of medical images with bands outside the visible range. Your pre-trained model will generally have three channels; fastai does some tricks to use three-channel pre-trained models with non-three-channel data, but that's the only tricky bit. Other than that it's just an axis that happens to have four things, or two things, or one thing, instead of three things — there's nothing special about it.

We didn't specify a loss function here, so we get the default, which is MSE loss — mean squared error. That makes perfect sense: you would expect mean squared error to be a reasonable thing to use for regression, since we're just measuring how close we are to the target, squaring the difference, and taking the mean. We didn't specify any metrics either, and that's because mean squared error is already a good metric: it has nice gradients, it behaves well, and it's also the thing we actually care about, so we don't need a separate metric to track.
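Pulling those pieces together, here's a sketch of sigmoid_range (this matches the formula just described, and fastai provides a function of that name) and of the training code, continuing from the imports and dls above and following the notebook's choices of resnet18 and a learning rate around 1e-2:

```python
def sigmoid_range(x, lo, hi):
    "Sigmoid squashed into the range (lo, hi) — what y_range uses under the hood"
    return torch.sigmoid(x) * (hi - lo) + lo

learn = cnn_learner(dls, resnet18, y_range=(-1, 1))   # coordinates live in [-1, 1]
learn.lr_find()                                       # suggests a learning rate around 1e-2 here
learn.fine_tune(3, 1e-2)
```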
So let's go ahead and use lr_find, and we can pick a learning rate — maybe about 10 to the minus 2. We call fine_tune and get a validation loss of 0.0001. That's the mean squared error, so we should take the square root: on average we're about 0.01 off, in a coordinate space that goes from minus one to one — that sounds super accurate. It took about three and a bit minutes to run.

We can always — and always should — call show_results to see what our results look like, and as you can see, fastai has automatically figured out how to display the combination of an image independent variable and a point dependent variable: on the left is the target and on the right is the prediction, and it's pretty close to perfect. One of the really interesting things here is that we used fine_tune, even though — think about it — the thing we're fine-tuning, an ImageNet classifier, isn't even an image regression model. We're actually fine-tuning an image classification model to become something totally different: an image regression model. Why does that work so well? Because an ImageNet classification model must have learned a lot about how images look, what things look like and where their pieces are, in order to figure out what breed of animal something is even when it's partly obscured, or in the shade, or turned at different angles. These pre-trained image models are incredibly powerful. Built into every ImageNet pre-trained model is all this capability it had to learn for itself, so asking it to use that capability to figure out where something is just isn't that hard for it. That's why we can fine-tune an ImageNet classification model into something completely different — an image point-regression model. I find that incredibly cool, I have to say.

So again, look at the further research section after you've done the questionnaire, and particularly if you haven't used DataFrames before, please play with them, because we're going to be using them more and more. (Good question — I'll just finish this last bit first.) Also go back and look at the bear classifier from notebook 2 — or whatever other classifier you hopefully created on your own data — because remember we talked about how it would be better if the bear classifier could also recognize that there's no bear at all, or that there's both a grizzly bear and a black bear, or a grizzly bear and a teddy bear. So retrain it using multi-label classification: see how well it works when there are no bears, and see whether turning it into a multi-label problem changes the accuracy on the single-label cases. Have a fiddle around and tell us on the forum what you find.

"I've got a question: is there a tutorial showing how to use pre-trained models on four-channel images? Also, how can you add a channel to a normal image?"

Well, to the last one — how do you add a channel to an image — I don't know what that means. An image is what it is; you can't really add a channel to an image. And I don't know if there's a tutorial for four-channel inputs, but we can certainly make sure somebody on the forum can show how to do it — it's super straightforward, and it should be pretty much automatic.

Okay, we're going to talk about collaborative filtering. What is collaborative filtering?

Well, think about Netflix, or whatever you use. You might have watched a lot of movies that are sci-fi, have a lot of action, and were made in the 70s. Netflix might not know anything about the properties of the movies you watched — it might just know that they're movies, with titles and IDs. But what it could absolutely do, without any manual work, is find other people who watched the same movies you watched, and then look at what other movies those people watched that you haven't — and it would probably find that those are also science fiction, full of action, and made in the 70s. So we can use an approach where we recommend things even if we don't know anything about what those things are, as long as we know who else has used or liked many of the same things you've liked or used. And this doesn't necessarily mean users and products — in fact, in collaborative filtering we don't normally even say products, we say items, and items could be links you click on, a diagnosis for a patient, and so forth.

There's a key idea here, which is that in the underlying items — and we're going to use movies in this example — there are some features. They may not be labelled, but there's some underlying concept of features of those movies: an action concept, a sci-fi concept, a 1970s concept. Now, you never actually told Netflix that you like these kinds of movies, and maybe Netflix never actually added columns to its movies saying which movies are of those types. But as long as, in the real world, the concepts of sci-fi, action, and movie age exist, and those concepts are relevant to at least some people's movie-watching decisions, then we can actually uncover them. They're called latent factors — the things that in effect decide which kinds of movies you want to watch — and they're latent because nobody ever necessarily wrote them down, labelled them, or communicated them in any way.

So let me show you what this looks like. There's a great dataset we can use called MovieLens, which contains tens of millions of movie ratings, and a rating looks like this: it has a user number, a movie number, a rating, and a timestamp. We don't know anything about who user number 196 is — I don't know if that's Rachel or somebody else — and I don't know what movie number 242 is: I don't know if it's Casablanca or Lord of the Rings or The Mask. The rating is a number between one and five, I think.

(A question? Sure.) "In traditional machine learning we perform cross-validation and k-fold training to check the bias–variance trade-off. Is this common in training deep learning models as well?"

So cross-validation is a technique where you don't just split your dataset into one training set and one validation set — you basically do it five or so times, so you have something like five training sets and five validation sets, covering different, overlapping subsets of the data. This used to be done a lot because people often didn't have enough data to get a good result, and this way, rather than permanently leaving out, say, 20% of your data for validation, you might only leave out 10% at a time, and every example still gets used for training in most of the folds.

Nowadays it's less common that we have so little data that we need to worry about the complexity and extra time of training lots of models. It is done on Kaggle a lot, because on Kaggle every little fraction of a percent matters, but it's not really a deep learning thing or a machine learning thing — it's just a question of whether you have lots of data or not very much, and whether you care about the last decimal place or not. It's not something we're going to talk about in this part of the course, if ever, because it doesn't come up in practice that often as being that important.
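Purely for reference — since, as noted, this isn't something the course relies on — here's a minimal scikit-learn sketch of k-fold splitting; the array and fold count are made up for illustration:

```python
from sklearn.model_selection import KFold
import numpy as np

X = np.arange(20)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in kf.split(X):
    # Each pass holds out a different 20% as validation and trains on the rest
    print(len(train_idx), len(valid_idx))   # 16 4, five times, with non-overlapping validation sets
```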

There are two more questions. "What would be some good applications of collaborative filtering outside of recommender systems?" Well, it depends how you define a recommender system. If you're trying to figure out what other diagnoses might be applicable to a patient, I guess that's kind of a recommender system; if you're trying to figure out where somebody is going to click next, that's kind of a recommender system too. But conceptually it's anything where you're trying to learn from past behaviour, where that behaviour is essentially "a thing happened to an entity".

"What is an approach to training using video streams, i.e. from drone footage, instead of images? Would you need to break up the footage into image frames?" In practice, quite often you would, simply because videos tend to be pretty big. Theoretically, time could just be another axis — a fourth, or really a fifth, because a full-colour video mini-batch would be a rank-5 tensor: batch by time by colour by rows by columns. But that's often too computationally and memory intensive, so sometimes people just look at one frame at a time; sometimes people use a few frames around a keyframe, like three or five at a time; and sometimes people use something called a recurrent neural network — which we'll be seeing in the next week or two — to treat it as sequence data. There are all kinds of tricks you can use to work with video. Conceptually, though, there's no reason you can't just add an additional axis to your tensors and have everything still work — it's just a practical issue of time and memory.
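Just to make that shape concrete, a rank-5 video mini-batch might look like this (the sizes are made up for illustration):

```python
import torch

# batch x time (frames) x channels x rows x columns
video_batch = torch.randn(8, 16, 3, 224, 224)
video_batch.ndim   # 5 — a rank-5 tensor
```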

And someone else noted that it's pretty fitting that you mentioned the movie The Mask. Yes, that was not an accident — I've got masks on the brain. I'm not sure we're allowed to like that movie any more, though; I kind of liked it when it came out, but I don't know what I think nowadays — it's been a while.

Okay, so let's take a look. We can use untar_data to grab ml-100k. ml-100k is a small subset of the full MovieLens dataset; there's another one we can grab which has the whole lot, 25 million ratings, but 100k is good enough for messing around. If you look at the readme, you'll find the main table is in a file called u.data, so let's open it up with read_csv. This one is actually not comma-separated values — rather confusingly it's tab-separated, but we still use read_csv and just say the delimiter is a tab, '\t'. There's no row at the top saying what the columns are called, so we say header=None and pass in a list of column names. .head() gives us the first five rows. And as we mentioned before, this isn't a particularly friendly way to look at the data, so what I'm going to do is cross-tabulate it: I've grabbed the most popular movies — the top 15 or 20, I can't remember exactly — and the users who watched the most movies, and reoriented the table so that for each user I have all the movies they've rated and the ratings they gave. The empty spots represent movies that user hasn't seen. It's just another way of looking at the same data. Basically what we want to do is guess which movies we should suggest people might want to watch — in other words, fill in those gaps: for user 212, do we think they would like movie 49 or 79 or 99 best to watch next?
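Here's a sketch of loading that table, following the usual fastai/pandas pattern (URLs.ML_100k is fastai's key for this dataset, and the column names match the readme):

```python
from fastai.collab import *
from fastai.tabular.all import *
import pandas as pd

path = untar_data(URLs.ML_100k)
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user', 'movie', 'rating', 'timestamp'])
ratings.head()
```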

So let's assume we actually had columns for every movie that said, say, how much sci-fi it is, how much action it is, and how old it is — maybe each between minus one and one. The Last Skywalker, say, is very sci-fi, fairly action-y, and definitely not old. We could do the same thing for users: user one really likes sci-fi, quite likes action, and really doesn't like old movies. Now, if you multiply those together — and remember, in PyTorch and NumPy this kind of calculation is element-wise: it multiplies each item by the corresponding item in the other tensor; it's not matrix multiplication (if you're a mathematician, don't go there — matrix multiplication would be the @ sign) — so if we multiply each element by the corresponding element and then sum them up, we get a single number that basically tells us how much these two correspond. Remember that two negatives multiplied together give a positive, so user one likes exactly the kind of stuff The Last Skywalker has in it, and we get about 2.1. Multiplying things together element-wise and adding them up is called the dot product, and we use it a lot — it's the basis of matrix multiplication — so make sure you know what a dot product is: it's this. Casablanca, on the other hand, is not at all sci-fi, not much action, and certainly old, so if we take user one times Casablanca we get a negative number, and we might think: okay, user one probably won't like this movie.

The problem is, we don't know what the latent factors are, and even if we did, we don't know how to label a particular user or a particular movie with them. So we have to learn them. How do we learn them? Well, we can look at a spreadsheet — I've got a spreadsheet version, which I built by popping this cross-tab table into Excel. I randomly created a 15-by-5 table of numbers here, and a 5-by-15 table here — in other words, let's just assume that every movie and every user has five latent factors; I don't know what they are, so I start with random numbers. Then I do a matrix multiply of one set of factors by the other — and a matrix multiply of a row by a column is identical to a dot product of two vectors, which is why I can just use matrix multiply. That's what the first cell contains, and I copied it across the whole table, so every one of these numbers is calculated from a row of latent factors dot-producted (matrix-multiplied) with a column of latent factors. In other words, I'm doing exactly the calculation we just described, but with random numbers. That gives us a whole grid of values, and then I can calculate a loss by comparing every one of these predicted numbers to the corresponding actual rating, take the mean squared error, and use stochastic gradient descent to find the best set of numbers in each of the two factor tables. And that is what collaborative filtering is — that's actually all we need. So, rather than doing it in Excel — have a look at the Excel version later if you're interested, because this whole thing really does work in Excel — let's jump in and do it in PyTorch.
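Here's the element-wise-multiply-and-sum (dot product) idea as a tiny PyTorch sketch — the factor values are hypothetical hand-picked numbers, roughly the ones used in the notebook:

```python
import torch

# (sci-fi, action, old), each between -1 and 1
last_skywalker = torch.tensor([0.98, 0.9, -0.9])
casablanca     = torch.tensor([-0.99, -0.3, 0.8])
user1          = torch.tensor([0.9, 0.8, -0.6])

(user1 * last_skywalker).sum()   # ~2.1: element-wise multiply then sum — the dot product
(user1 * casablanca).sum()       # ~-1.6: a probable mismatch
```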
One thing that will make this a bit more fun is to actually know what the movies are called, and MovieLens tells us that in u.item — which, weirdly enough, uses the pipe sign as its delimiter. So there are the names of each movie. One of the nice things about pandas is that it can do joins, just like SQL, so we can use the merge method to combine the ratings table and the movies table; since they both have a column called movie, by default it will join on that. So now we have the ratings table with actual movie names — we don't need that for modelling, but it makes it nicer to look at.

We could use the data block API at this point, or we can just use the built-in application factory method — since it's there, we may as well use it. We can create a collaborative-filtering DataLoaders object from a DataFrame by passing in the ratings table. By default the user column is expected to be called "user", and ours is, so that's fine; by default the item column is expected to be called "item", but ours is called "title", so we tell it to use title, and we choose a batch size. If we now call show_batch, here's some of that data — the rating column is called "rating" by default, so that worked fine too.

Now we need to create our latent factors. Let's say we're going to use five factors. The number of users is however many classes there are for "user", and the number of movies is however many classes there are for "title" — we don't just have a vocab now; we've got a list of classes for each categorical variable, for each set of discrete choices. So we've got 944 users and 1,635 titles. For our randomized latent-factor parameters we need to create those matrices, and we can just fill them with random numbers — normally distributed random numbers, which is what randn gives us. That will be n_users by n_factors, so 944 by 5 — exactly the same as the Excel version, except there it was 15 rather than 944. And we do exactly the same thing for movies: random numbers, n_movies by 5.

To calculate the result for some particular user and movie, we have to look up the index of the user in our user latent factors and the index of the movie in our movie latent factors, and then take the dot product. In other words, for this particular combination we'd look up that numbered user over here and that numbered movie over here to get the two appropriate sets of latent factors. But this is a problem, because "look up in an index" is not a linear model — remember, our deep learning models really only know how to multiply matrices together and apply simple element-wise nonlinearities like ReLU; there isn't an operation called "look up in an index". (I'll just finish this bit.) Here's the cool thing, though: looking up in an index can actually be represented as a matrix product, believe it or not. If you replace the indices with one-hot-encoded vectors, then a one-hot-encoded vector times a matrix is identical to looking up an index in that matrix. Let me show you. If we call the one_hot function — which, as it says, creates a one-hot encoding — and one-hot encode the value 3 with n_users classes (n_users, as we just discussed, is 944), then one_hot(3, n_users) gives us this big tensor: at index 3 (0, 1, 2, 3) there's a 1, and its length is 944. So what happens if we then multiply that by user_factors — remember, user_factors is that random 944-by-5 matrix?
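A minimal sketch of that comparison, assuming dls is the CollabDataLoaders built above (one_hot is fastai's helper):

```python
n_users  = len(dls.classes['user'])    # 944
n_movies = len(dls.classes['title'])   # 1635
n_factors = 5

user_factors  = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)

one_hot_3 = one_hot(3, n_users).float()   # a length-944 vector with a 1 at index 3
user_factors.t() @ one_hot_3              # the same values as user_factors[3]
```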
We multiply zero by the first row, so that contributes nothing; zero again, and zero again; then finally we hit the 1 on the index-3 row, so that row's values come through; and then it's back to zeros. So if we do that — remember, the @ sign is matrix multiply — and compare it to user_factors[3], it's the same thing. Isn't that crazy? It's a kind of weird, inefficient way to do it, but matrix multiplication is a way to index into an array, and matrix multiplication is the thing we know how to do SGD with and build models with. So it turns out that anything we can do with indexing into an array, we now have a way to optimize.

And we have a question — two questions, actually. One: how different in practice is collaborative filtering with sparse data compared to dense data? We're not covering sparse data in this course, but there's an excellent course, I hear, called Computational Linear Algebra for Coders — it's a fast.ai course — which has a lot of information about sparse data. And the second question: in practice, do we tune the number of latent factors? Absolutely we do, yes — it's just like the number of filters we have in pretty much any kind of deep learning model.

All right. So now that we know that the procedure for picking out a set of latent factors — looking something up in an index — is the same as matrix multiplication with a one-hot vector, we can go ahead and build a model with that. Basically, if we do this for a few indices at once, we have a matrix of one-hot-encoded vectors, so the whole thing is just one big matrix multiplication. Now, as I said, this is a pretty inefficient way to do an index lookup, so there is a computational shortcut, which is called an embedding. An embedding is a layer that has the computational speed of an array lookup and the same gradients as a matrix multiplication. How does it do that? Internally it uses an index lookup to actually grab the values, and it also knows what the gradient of a matrix multiplication by a one-hot-encoded vector (or matrix) is, without having to go through all this trouble. So an embedding is a matrix multiplication with a one-hot-encoded vector where you never actually have to create the one-hot-encoded vector — you just need the indices. This is important to remember, because a lot of people have heard about embeddings and think they're something special and magical, and they're absolutely not. You can do exactly the same thing by creating a one-hot-encoded matrix and doing a matrix multiply; an embedding is just a computational shortcut, nothing else. I often find when I talk to people about this in person that I have to tell them six or seven times before they believe me, because they think embeddings are something more clever — they're not. It's just a shortcut for doing a matrix multiplication with a one-hot-encoded matrix more quickly, by doing an array lookup instead.

Okay, so let's try to create a collaborative filtering model in PyTorch. A model — or an architecture, or really an nn.Module — is a class, so to use PyTorch to its fullest you need to understand object-oriented programming, because we have to create classes. There are lots of tutorials about this, so I won't go into great detail, but I'll give you a quick overview. A class could be something like Dog, or ResNet, or Circle: it's something that has some data attached to it and some functionality attached to it. Here's a tiny class called Example.
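The Example class looks roughly like this — essentially the toy class from the notebook:

```python
class Example:
    def __init__(self, a): self.a = a                   # the data attached to the object
    def say(self, x): return f'Hello {self.a}, {x}.'    # the functionality attached to it

ex = Example('Sylvain')
ex.say('nice to meet you')   # 'Hello Sylvain, nice to meet you.'
```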
The data attached to it is a, and the functionality attached to it is say. We can create an instance of this class — an object of type Example — by passing in 'Sylvain', so 'Sylvain' will now be in ex.a. Then we can call ex.say, passing in 'nice to meet you' — that becomes x — and it will say hello, self.a (that's Sylvain), nice to meet you. There it is.

In Python, the way you create a class is to write class and its name. To say what gets passed in when you create an object of that class, you define a special method called __init__, pronounced "dunder init". As we've briefly mentioned before, Python has all kinds of special method names with special behaviour: they start with two underscores and end with two underscores, and we pronounce that "dunder". All regular instance methods in Python get passed the object itself as their first argument, which we conventionally call self, and then optionally anything else. You can change the contents of the current object just by setting self.whatever to whatever you like — so after this, self.a is 'Sylvain'. Calling a method is the same deal: it's passed self, plus whatever you pass to it, and it can access the contents of self that you stashed away when you initialized it. So that's the basics of how object-oriented programming works in Python.

There's something else you can do when you create a new class, which is to pop something in parentheses after its name. That means we're using something called inheritance: I want all the functionality of that class, plus I want to add some additional functionality. Module is a PyTorch class that fastai has customized — it's kind of a fastai version of PyTorch's nn.Module, and probably in the next course we'll see exactly how it works — but it acts almost exactly like a regular Python class: we have an __init__, and we can set attributes to whatever we like. One of the things we can use is an Embedding, which is the class that does what I just described: it's the same as a linear layer applied to a one-hot-encoded matrix, but computed with the shortcut. You tell it how many users (in this case) there are and how many factors each will have.

There is one very special thing about classes that inherit from Module: when you call the object, it actually calls a method named forward. forward is a special PyTorch method name — the most important one — and it's where you put the actual computation. To grab the factors from an embedding, we just call it like a function. The model is going to be passed the user IDs and the movie IDs as two columns, so we grab the index-0 column and get the embeddings by passing it to user_factors, then do the same with the index-1 column — the movie IDs — passing it to movie_factors. Then we do our element-wise multiplication and sum. Remember we've got an extra dimension now: the first axis is the mini-batch dimension, so we want to sum over the other dimension, the index-1 dimension. That gives us one dot product per rating — per user/movie combination. This is the DotProduct class. If we look at one batch of our data, it's of shape 64 by 2, because there are 64 items in the mini-batch and each one has two things in it — the user ID and the movie ID, the independent variables.
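Put together, the DotProduct module described above looks roughly like this (Module and Embedding here are the fastai versions that come in with from fastai.collab import *):

```python
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors  = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)

    def forward(self, x):
        users  = self.user_factors(x[:,0])    # look up each row's user factors
        movies = self.movie_factors(x[:,1])   # look up each row's movie factors
        return (users * movies).sum(dim=1)    # one dot product per rating in the mini-batch
```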
(A question came in: do deep neural-network-based models for collaborative filtering work better than more traditional approaches like SVD or other matrix factorization methods? Let's wait until we get there.) So here's x: each row is one user ID / movie ID combination, and then each of those 64 rows has a corresponding rating.

So now we've created a dot-product module from scratch. We can instantiate it, passing in the number of users and the number of movies, and let's use 50 factors. Now we can create a learner — and this time we're not creating a cnn_learner or any application-specific learner, just a totally generic Learner. This is a learner that doesn't know how to do anything clever; it just stores away the data you give it and the model you give it. And because we're not using an application-specific learner, it doesn't know what loss function to use, so we tell it to use MSE — and fit. That's it: we've just fitted our own collaborative filtering model where we literally created the entire architecture — a pretty simple one — from scratch, which is pretty amazing.

Now, the results aren't great: if you look at the MovieLens benchmarks online, you'll see this is not actually a great result. One thing we should do is take advantage of the tip from earlier in this lesson: when you're doing regression — which we are here, since the rating between one and five is a continuous value we're trying to get as close to as possible — we should tell fastai what the range is. So we use y_range as before: it's exactly the same model, but we store a y_range, and at the end of forward we apply sigmoid_range, passing in *self.y_range, which by default is (0, 5.5) — 5.5 rather than 5, because a sigmoid can never quite reach the top of its range. And we can see... yeah, not really any better. It was worth a try; normally this is a little bit better, but it always depends on the run — I'll just run it a second time. Still worth looking at.

There is something else we can do, though. Look back at the little Excel version: when we multiply these latent factors by those latent factors and add them up, it doesn't account for the fact that a particular user might just rate movies badly in general, regardless of what kind of movies they are, or that a particular movie might just be a great movie that everybody likes, regardless of the kind of stuff they usually like. It would be nice to represent that directly, and we can, using something we've already learned about: bias. We can have one extra number for each movie which we simply add, and one extra number for each user which we simply add — we've already seen this idea with linear models, that it's nice to be able to add a bias value. So let's do that. It means we need another embedding for each user, of size one — it's just a single number we're going to add. In other words it's just an array lookup, but remember, to do an array lookup that we can take a gradient through, we use Embedding. We do the same thing for the movie bias, everything else stays identical, and we add one extra line that adds in the user and movie bias values. Let's train that and see how it goes... well, that's a shame, it got worse: it used to finish around 0.87, and now it's 0.88, 0.89 — a little bit worse. Why is that?
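The bias-plus-y_range version described above, roughly as the notebook writes it (the learning rate and epoch count are just the notebook's choices, nothing special):

```python
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        self.user_factors  = Embedding(n_users, n_factors)
        self.user_bias     = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias    = Embedding(n_movies, 1)
        self.y_range = y_range

    def forward(self, x):
        users  = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])   # per-user and per-movie offsets
        return sigmoid_range(res, *self.y_range)                  # squash into (0, 5.5)

model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
```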
Well, if you look earlier in training, it was quite a bit better — around 0.86 — so it's overfitting very quickly. What we need is a way to train for more epochs without overfitting. We've already learned about data augmentation — rotating images, changing their brightness and colour, and so on — but it's not obvious how you'd do data augmentation for collaborative filtering. So how are we going to train lots of epochs without overfitting? To do that, we're going to use something called regularization. Regularization is a set of techniques that let us use models with lots of parameters and train them for a long time, while effectively penalizing them for overfitting — or in some way encouraging them not to overfit. And that is what we'll look at next week.

Okay, thanks everybody. There's a lot to take in there, so please remember to practice, to experiment, and to listen to the lessons again, because over the next couple of lessons things are going to build really quickly on top of everything we've learned. Please get as comfortable with it as you can: feel free to go back and re-listen, follow through the notebooks, and then try to recreate as much of them as you can yourself. Thanks everybody, and I'll see you next week — or see you in the next lesson, whenever you get to it.