So today we're going to continue working on object detection, which means that for every object in a photo in one of 20 classes, we're going to try and figure out what the object is and what its bounding box is, such that we can apply that model to a new dataset of unlabeled data and add those labels to it.
The general approach we're going to use is to start simple and gradually make it more complicated. So we started last week with a simple classifier, the three-lines-of-code classifier, and we then made it slightly more complex to turn it into a bounding box without a classifier. Today we're going to put those two pieces together to make a classifier plus a bounding box. All of these are just for a single object, the largest object, and then from there we're going to build up to something closer to our final goal.
You should go back and make sure that you understand all of these concepts from last week before you move on. If you don't, go back and re-go through the notebooks carefully. I won't read them all to you because you can see them in the video easily enough. Perhaps this is the most important, knowing how to jump around source code in whatever editor you prefer to use.
The matplotlib API and lambda functions are also particularly important, they come up everywhere, and this idea of a custom head is also going to come up in pretty much every lesson. I've also added here a reminder of what you should know from part one of the course, because quite often I see questions on the forum asking, basically, why isn't my model working?
Why doesn't it start training, or, having trained, why doesn't it seem to be any good? And nearly always, the answer to the question is, did you print out the inputs to it from a data loader? Did you print out the outputs from it after evaluating it? And normally the answer is no. Then they try printing it and it turns out all the inputs are zero or all of the outputs are negative or something else that's really obvious.
So that's just something I wanted to remind you about, you need to know how to do these two things. If you can't do that, then it's going to be very hard to debug models, and if you can do that, but you're not doing it, then it's going to be very hard for you to debug models.
You don't debug models by staring at the source code hoping your error pops out, you debug models by checking all of the intermediate steps, looking at the data, printing it out, plotting its histogram, making sure it makes sense. We were working through the Pascal notebook, and we just quickly zipped through the bounding-box-of-the-largest-object-without-a-classifier part, and there was one bit that I skipped over and said I'd come back to, so let's do that now.
Which is to talk about data augmentation of the y, the dependent variable. Before I do, I'll just mention something pretty awkward in all this, which is that I've got here ImageClassifierData with continuous=True. This makes no sense whatsoever. A classifier is anything where the dependent variable is categorical or binomial, as opposed to regression, which is anything where the dependent variable is continuous.
And yet this parameter here, continuous equals true, says that the dependent variable is continuous. So this claims to be creating data for a classifier where the dependent is continuous. This is the kind of awkward rough edge that you see when we're kind of at the edge of the fast AI code that's not quite solidified yet.
So probably by the time you watch this in the MOOC, this will be sorted out, and this will be called image regressor data or something like that, but I just wanted to kind of point out this issue, and also because sometimes people are getting confused between regression vs. classification, and this is not going to help one bit.
So let's create some data augmentations. Normally when we create data augmentations, we tend to type in transforms_side_on or transforms_top_down. But if you look inside the fastai.transforms module, you'll see that they are simply defined as a list. So transforms_basic is 10-degree rotations plus 0.05 brightness and contrast; side_on adds to that random horizontal flips, while top_down adds to that random dihedral group of symmetry flips, which basically means every possible 90-degree rotation, optionally with a flip, so eight possibilities.
So these are just little shortcuts that I added because they seem to be useful a lot of the time, but you can always create your own list of augmentations. And if you're not sure what augmentations are there, you can obviously check the fast_ai source, or if you just start typing random, they all start with random, so you can see them easily enough.
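Roughly, those shortcut lists in fastai.transforms look like this (paraphrased from memory, so check the source for the exact definitions):

```python
transforms_basic    = [RandomRotate(10), RandomLighting(0.05, 0.05)]  # 10-degree rotations, 0.05 brightness/contrast
transforms_side_on  = transforms_basic + [RandomFlip()]               # plus random horizontal flips
transforms_top_down = transforms_basic + [RandomDihedral()]           # plus the eight 90-degree rotation/flip combinations
```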
So let's take a look at what happens if we create some data augmentations. Let's create a model data object, and let's just go through and rerun the iterator a bunch of times. And we'll do two things, we'll print out the bounding boxes, and we'll also draw the pictures. So you'll see this lady is, as we would expect, flipping around and spinning around and getting darker and lighter, but the bounding box, A, is not moving, and B is in the wrong spot.
So this is the problem with data augmentation when your dependent variable is pixel values or is in some way connected to your independent variable, the two need to be augmented together. And in fact, you can see that from the printout these numbers are bigger than 224, but these images are of size 224, that's what we requested in these transforms.
And so it's not even being scaled or cropped or anything. So you can see that our dependent variable needs to go through all of the same geometric transformations as our independent variable. To do that, every transformation has an optional tfm_y parameter. It takes a TfmType enum, and the TfmType enum has a few options, all of which we'll cover in this course.
The COORD option says that the y values represent coordinates, in this case bounding box coordinates. And so therefore if you flip, you need to change the coordinates to represent that flip, or if you rotate, you have to change the coordinates to represent that rotation. So I can add tfm_y=TfmType.COORD to all of my augmentations.
I also have to add the exact same thing to my tfms_from_model function, because that's the thing that does the cropping and zooming and padding and resizing, and all of those things need to happen to the dependent variable as well. So if we add all of those together and rerun this, you'll see the bounding box changes each time, and you'll see it's in the right spot.
Now you'll see sometimes it looks a little odd, like here, why is that bounding box there? And the problem is, this is just a constraint of the information we have. The bounding box does not tell us that actually her head isn't way over here in the top left corner, but actually if you do a 30 degree rotation and her head was over here in the top left corner, then the new bounding box would go really high.
So this is actually the correct bounding box based on the information it has available, which is to say this is how high she might have been. So basically you've got to be careful of not doing too high rotations with bounding boxes because there's not enough information for them to stay totally accurate, just the fundamental limitation of the information we're given.
If we were doing polygons, or segmentations, or whatever, we wouldn't have this problem. So I'm going to do a maximum of 3 degree rotations to avoid that problem. I'm also going to only rotate half the time, I'm going to have my random flip and my brightness and contrast changing, so there's my set of transformations that I can use.
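Here's a sketch of that setup, based on the pascal notebook; the exact values and the names f_model, sz, PATH, JPEGS and BB_CSV are assumed to be as defined earlier in the notebook:

```python
# Augmentations that also transform the dependent variable (bounding box coords)
augs = [RandomRotate(3, p=0.5, tfm_y=TfmType.COORD),      # small rotations, only half the time
        RandomLighting(0.05, 0.05, tfm_y=TfmType.COORD),  # brightness/contrast
        RandomFlip(tfm_y=TfmType.COORD)]                  # random horizontal flips

# tfm_y is also passed to tfms_from_model so cropping/resizing hits the y too
tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO,
                       tfm_y=TfmType.COORD, aug_tfms=augs)
md = ImageClassifierData.from_csv(PATH, JPEGS, BB_CSV, tfms=tfms, continuous=True)
```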
So we briefly looked at this custom head idea, but basically, if you look at learn.summary(), it does something pretty cool: it runs a small batch of data through the model and prints out how big the output is at every layer. And we can see that at the end of the convolutional section, before we hit the flatten, it's 512 x 7 x 7. So a rank-3 tensor of size 512 x 7 x 7, if we flatten it out into a single rank-1 tensor, into a vector, is going to be 25,088 long.
So that's why we had this linear layer, 25,088 to 4, because there are 4 bounding box coordinates. So stick that on top of a pre-trained ResNet and train it for a while. That's where we got to last time. So let's now put those two pieces together so that we can get something that classifies and does bounding boxes. There are three things that we need to do, basically, to train any neural network.
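As a reminder, last week's head was a sketch like this, assuming the standard fastai notebook imports and the f_model and md objects from the notebook:

```python
# Bounding-box-only head: flatten the 512x7x7 backbone output,
# then a single linear layer straight to the 4 coordinates
head_reg4 = nn.Sequential(Flatten(), nn.Linear(512*7*7, 4))
learn = ConvLearner.pretrained(f_model, md, custom_head=head_reg4)
learn.crit = nn.L1Loss()
```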
We need to provide data, we need to pick some kind of architecture, and we need a loss function. So the loss function says anything that gives a lower number here is a better network using this data and this architecture. So we're going to need to create those three things for our classification plus bounding box regression.
So that means we need a model data object which has the independents, the images, and the dependents, and I want the dependent to be a tuple, where the first element of the tuple is the bounding box coordinates and the second element of the tuple is the class. There are lots of different ways you could do this.
The particularly lazy and convenient way I came up with was to create two model data objects representing the two different dependent variables I want, one with the bounding box coordinates, one with the classes, just using the CSVs. And now I'm going to merge them together. So I create a new dataset class, and a dataset class is anything which has a length and an indexer, so something that lets you use it in square brackets like a list.
And so in this case I can have a constructor which takes an existing dataset, which is going to have both an independent and a dependent, plus the second dependent that I want. __len__ is then obviously just the length of the first dataset, the one I passed in. And then __getitem__ grabs the x and the y from the dataset that I passed in, and returns that x, that y, and the i-th value of the second dependent variable that I passed in.
So there's a data set that basically adds in a second dependent variable. As I said, there's lots of ways you could do this, but it's kind of convenient because now what I could do is create a training data set and a validation data set based on that. So here's an example, it's got a tuple of the bounding box coordinates in the class.
We can then take the existing training and validation data loaders and actually replace their data sets with these, and I'm done. So we can now test it by grabbing a mini-batch of data and checking that we have something that makes sense. So there's one way to customize a data set.
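A minimal sketch of that dataset, following the notebook (md and md2 are assumed to be the two model data objects created from the two CSVs):

```python
from torch.utils.data import Dataset

class ConcatLblDataset(Dataset):
    """Wraps an existing (x, y) dataset and appends a second dependent variable."""
    def __init__(self, ds, y2):
        self.ds, self.y2 = ds, y2
    def __len__(self):
        return len(self.ds)
    def __getitem__(self, i):
        x, y = self.ds[i]
        return (x, (y, self.y2[i]))   # x, then a tuple of (bbox coords, class)

trn_ds2 = ConcatLblDataset(md.trn_ds, md2.trn_y)
val_ds2 = ConcatLblDataset(md.val_ds, md2.val_y)
md.trn_dl.dataset = trn_ds2   # swap the datasets inside the existing data loaders
md.val_dl.dataset = val_ds2
```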
So what we're going to do this time now is we've got the data, so now we need an architecture. So the architecture is going to be the same as the architectures that we used for the classifier and for the bounding box regression, but we're just going to combine them.
So in other words, if there are c classes, then the number of activations we need in the final layer is 4 plus c: we've got the 4 bounding box coordinates and the c probabilities, one per class. So this is the final layer, a linear layer that has 4 plus len(cats) activations.
The first layer, as before, is a Flatten. We could just join those up together, but in general I want my custom head to hopefully be capable of solving the problem that I give it on its own, if the pre-trained backbone it's connected to is appropriate. And in this case I'm thinking I'm trying to do quite a bit here, two different things, the classifier and the bounding box regression.
So just a single linear layer doesn't sound like enough, so I put in a second linear layer. And so you can see we basically go ReLU, Dropout, Linear, ReLU, BatchNorm, Dropout, Linear. If you're wondering why there's no BatchNorm back here at the start, I checked the ResNet backbone, and it already has a BatchNorm as its final layer.
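Here's roughly what that head looks like in the notebook; 25,088 is 512*7*7, cats is the list of categories, and the 256 hidden units are just a choice:

```python
head_reg4 = nn.Sequential(
    Flatten(),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(25088, 256),          # flattened 512x7x7 backbone output -> 256
    nn.ReLU(),
    nn.BatchNorm1d(256),
    nn.Dropout(0.5),
    nn.Linear(256, 4 + len(cats))   # 4 bbox coords plus one activation per class
)
learn = ConvLearner.pretrained(f_model, md, custom_head=head_reg4)
```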
So this is basically nearly the same custom head as before, it's just got two linear layers rather than one, and the appropriate nonlinearities. So that's piece two. We've got the data, we've got the architecture, now we need a loss function. The loss function needs to look at these 4 plus c activations and decide, are they good?
Are these numbers accurately reflecting the position and class of the largest object in this image? We know how to do that. For the first 4 we use L1 loss, just like we did in the bounding box regression before. Remember, L1 loss is like mean squared error, except rather than the sum of squares it's the sum of the absolute values.
And then for the rest of the activations we can use cross-entropy loss. So let's go ahead and do that. So we're going to create something called detection_loss, and loss functions always take an input and a target, that's what PyTorch always calls them. So this is the activations, this is the ground truth.
So remember that our custom dataset returns a tuple containing the bounding box coordinates and the classes. So we can destructure that, use destructuring assignment to grab the bounding boxes and the classes of the target. And then the bounding boxes and the classes of the input are simply the first 4 elements of the input and the 4 onwards elements of the input.
And remember we've also got a batch dimension, so we need to grab the whole of that. So that's it. We've now got the bounding box target, bounding box input, class target, class input. For the bounding boxes we know that the coordinates are going to be between 0 and 224, because that's how big our image is.
So let's take a sigmoid to force it between 0 and 1 and multiply it by 224, and that's just helping our neural net be in the range we know it has to be. As a general rule, is it better to put batch norm before or after a ReLU?
I would suggest that you should put it after a relu, because batch norm is meant to move towards a 0 and 1 random variable, and if you put relu after it, then you're truncating it at 0. So there's no way to create negative numbers. But if you put relu and then batch norm, it does have that ability.
Having said that -- and I think that way of doing it gives slightly better results. Having said that, it's not too big a deal either way, and you'll see during this part of the course, most of the time I go relu and then batch norm, but sometimes I go batch norm and then relu if I'm trying to be consistent with a paper or something like that.
I think originally the batch norm was put after the activation, so there's still people who do that. So this is kind of to help our data or force our data into the right range, which if you can do stuff like that, it makes it easier to train. Yes, Rachel?
One more question. What's the intuition behind using dropout with p=0.5 after a batch norm? Doesn't batch norm already do a good job of regularizing? Batch norm does an okay job of regularizing, but if you think back to part 1, we've got that list of things we do to avoid overfitting, and adding batch norm is one of them, as is data augmentation, but it's perfectly possible that you'll still be overfitting.
So one nice thing about dropout is that it has a parameter to say how much to drop out, so parameters are great, or specifically parameters that decide how much to regularize are great, because it lets you build a nice, big over-parameterized model and then decide how much to regularize it.
So I tend to always include dropout. I'll start with p=0, and then later, if I want to add more regularization, I can just change my dropout parameter. I also want to be able to save a model and load it back, and if I had dropout layers in one version but not in another, it wouldn't load, so this way it stays consistent.
So now that I've got my inputs and targets, I can just go calculate the L1 loss and add to it the cross entropy. So that's our loss function, surprisingly easy perhaps. Now of course the cross entropy and the L1 loss may be of wildly different scales, in which case in the loss function the larger one is going to dominate.
And so I just ran this in a debugger, checked how big each of the two pieces were, and found that if I multiply the cross entropy by 20, that makes them about the same scale. As you're training, it's nice to print out information as you go, so I also grabbed the L1 part of this and put it in a function, and I also created a function for accuracy, so that I could add them as metrics and print them out as it goes.
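Here's a sketch of that loss and those metrics, following the notebook; V is fastai's variable wrapper, accuracy is fastai's metric, and the 20 is the empirically found scaling factor just mentioned:

```python
def detn_loss(input, target):
    bb_t, c_t = target                      # ground truth: (bbox coords, class)
    bb_i, c_i = input[:, :4], input[:, 4:]  # activations: first 4 are bbox, rest are classes
    bb_i = F.sigmoid(bb_i) * 224            # force bbox activations into [0, 224]
    return F.l1_loss(bb_i, bb_t) + F.cross_entropy(c_i, c_t) * 20

def detn_l1(input, target):
    bb_t, _ = target
    bb_i = F.sigmoid(input[:, :4]) * 224
    return F.l1_loss(V(bb_i), V(bb_t)).data

def detn_acc(input, target):
    _, c_t = target
    return accuracy(input[:, 4:], c_t)

learn.crit = detn_loss
learn.metrics = [detn_acc, detn_l1]
```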
So we've now got something which is printing out our detection loss, detection accuracy, and detection L1, and we've trained it for a while, and it's looking good. The detection accuracy is in the low 80s, which is the same as what it was before. That doesn't surprise me, because ResNet was designed to do classification, so I wouldn't expect us to be able to improve things in such a simple way.
But it certainly wasn't designed to do bounding box regression, it was explicitly actually designed in such a way as to kind of not care about geometry. It takes that last 7x7 grid of activations and averages them all together. It throws away all of the information about where things came from.
So you can see that when we only trained the last layers, the detection L1 is pretty bad, it's 24, and it then improves a lot, whereas the accuracy doesn't improve, it stays exactly the same. Interestingly, the L1, when we do accuracy and bounding box at the same time, seems like it's a little bit better than when we just did bounding box regression.
And if that's counterintuitive to you, then that would be one of the main things to think about after this lesson, so it's a really important idea. And the idea is this, figuring out what the main object in an image is, is kind of the hard part, and then figuring out exactly where the bounding box is and what class it is is kind of the easy part in a way.
And so when you've got a single network that's both saying what is the object and where is the object, it's going to share all of the computation about finding the object. And so all that shared computation is very efficient. And so when we backpropagate the errors in the class and in the place, that's all information that's going to help the computation around finding the biggest object.
So anytime you've got multiple tasks which kind of share some concept of what those tasks would need to do to complete their work, it's very likely they should share at least some layers of the network together. And we'll look later today at a place where most of the layers are shared but the last one isn't.
So you can see this is doing a good job, as before, any time there's just a single major object. Sometimes it gets a little confused: it thinks the main object here is the dog and it's put the box around the dog, although it's kind of recognized that actually the main object is a sofa.
So the classifier is doing the right thing but the bounding box is labeling the wrong thing, which is kind of curious. When there are two birds it can only pick one, so it's just kind of hedging in the middle, ditto when there are lots of cows, and so forth. It's doing a good job with this kind of thing.
So that's that. There's not much new there, although in that last bit we did learn about simple custom datasets and simple custom loss functions. Hopefully you can see now how easy that is to do. So the next stage for me would be to do multi-label classification. This is this idea that I just want to keep building models that are slightly more complex than the last model, but hopefully don't require too many extra concepts, so I can keep seeing things working.
And if something stops working, I know exactly where it stopped working. I'm not going to try and build everything at the same time. So multi-label classification is so easy, there's not much to mention. We've moved to the Pascal Multi notebook now, this is where we're going to do the multi-object stuff.
So for the multi-object stuff, I've just copied and pasted the functions from the previous notebook that we used, so they're all at the top. So we can create a multi-class CSV file using the same basic approach that we did last time. And I'll mention, by the way, that one of our students who's visiting from India, Fani, pointed out to me that all this stuff we're doing with defaultdicts and so forth, he actually showed a way of doing it which was much simpler using pandas, and he shared that on the forum.
So I totally bow to his much better approach, a simpler, more concise approach. It's definitely true, like the more you get to know pandas, the more often you realize it's a good way to solve lots of different problems. So definitely check that out. When you're building out the smaller models and you're iterating, do you reuse those models as pre-trained weights for this larger one or do you just toss it all away and then retrain from scratch?
When I'm figuring stuff out as I go like this, I would generally lean towards tossing away because the reusing pre-trained weights introduces complexities that I'm not really thinking about. However if I'm trying to get to a point where I can run something on really big images, I'll generally start on much smaller ones and often I will reuse those weights.
So in this case what we're doing is joining up all of the classes with a space which gives us a CSV in a normal format and once we've got the CSV in a normal format it's the usual three lines of code and we train it and we print out the results.
So there's literally nothing to show you there. And as you can see it's done a great job. The only mistake I think it made was it called this dog where it should have been dog and sofa. I think everything else is correct. So multi-class classification is pretty straightforward. A minor tweak here is to note that I used a set here because I don't want to list all of the objects.
I only want each object type to appear once, and so using a set is a way of deduplicating a list. So that's why it doesn't say person, person, person, person, person, it just appears once. So these object classification pre-trained networks we have are really pretty good at recognizing multiple objects, as long as you only have to mention each one once.
So that works pretty well. So we've got this idea that we've got an input image that goes through a ConvNet which spits out a vector of size 4+c, where c is the number of classes. So that's what we've got, and that gives us an object detector for a single object, the largest object in our case.
So let's now create one which doesn't find a single object but finds 16 objects. An obvious way to do that would be to take this last layer, which is just an nn.Linear with however many inputs and 4+c outputs, and rather than having 4+c outputs, we could have 16 times (4+c) outputs.
So it's now spitting out enough things to give us 16 sets of class probabilities and 16 sets of bounding box coordinates. And then we would just need a loss function that would check whether those 16 sets of bounding boxes correctly represented the up to 16 objects that were represented in the image.
Now there's a lot of hand waving about the loss function, we'll go into it later as to what that is, but let's pretend we have one. Assuming we had a reasonable loss function, that's totally going to work. That is an architecture which has the necessary output activations, but with the correct loss function we should be able to train it to do what we want it to do.
But that's just one way to do it. There's a second way we could do it. Rather than an nn.Linear, what if instead we took from our ResNet convolutional backbone and added an nn.Conv2d with stride 2? The final layer of ResNet gives you a 7x7x512 result.
So this would give us a 4 by 4 by whatever, the number of filters, let's say we pick 256. So 4 by 4 by 256 has, well actually, no, let's change that. Let's not make it 4 by 4 by 256, that is still, let's do it all in one step.
Let's make it 4 by 4 by 4 plus C because now we've got a tensor where the number of elements is exactly equal to the number of elements we wanted. So in other words, we could now, this would work too, if we created a loss function that took a 4 by 4 by 4 plus C tensor and mapped it to 16 objects in the image and checked whether each one was correctly represented by those 4 plus C activations.
That would work. These are two exactly equivalent sets of activations, because they've got the same number of elements, they're just reshaped. So it turns out that both of these approaches are actually used. The approach where you basically just spit out one big long vector from a fully connected linear layer is used by a class of models known as YOLO.
Whereas the approach of the convolutional activations is used by models which started with something called SSD or single shot detector. What I will say is that since these things came out at very similar times in late 2015, things have very much moved towards here, to the point where this morning YOLO version 3 came out and is now doing it the SSD way.
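Just to make that equivalence concrete, here's a tiny illustration (shapes only, using random tensors):

```python
import torch

bs, c = 64, 20
# YOLO-style: a fully connected layer spits out one long vector per image
yolo_out = torch.randn(bs, 16 * (4 + c))
# SSD-style: a stride-2 conv gives a 4x4 grid with 4+c channels per cell
ssd_out = torch.randn(bs, 4 + c, 4, 4)
# Same number of activations per image, just arranged differently
assert yolo_out[0].numel() == ssd_out[0].numel()   # 16*(4+c) == (4+c)*4*4
```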
So that's what we're going to do. We're going to do this, and we're going to learn about why this makes more sense as well. And so the basic idea is this. Let's imagine that underneath this we had another Conv2d, stride 2, and we'd have something which was 2x2, again let's say it's 4+c, that's nice and simple.
And so basically it's creating a grid that looks something like this, 1, 2, 3, 4. So that would be the geometry of the activations of that second extra stride-2 convolutional layer. Remember, a stride-2 convolution does the same thing to the geometry of the activations as a stride-1 convolution followed by a max-pooling, assuming the padding is handled correctly.
So let's talk about what we might do here, because the basic idea is we want to kind of say this top left grid cell is responsible for identifying any object that's in the top left, and this one in the top right is responsible for identifying something in the top right, this one in the bottom left, and this one in the bottom right.
So in this case you can actually see it's done and it's said, okay, this one is going to try and find the chair, this one, it's actually made a mistake, it should have said table, but there are actually 1, 2, 3 chairs here as well, so it makes sense.
So basically each of these grid cells is going to be told, in the loss function, your job is to find the big object that's in that part of the image. So what -- >> So for multi-label classification, I saw you had a threshold on there, which I guess is a hyperparameter, is there a way to -- >> You're getting well ahead of us, let's work through this.
So why do we care about the idea that we would like this convolutional grid cell to be responsible for finding things that were in this part of the image? And the reason is because of something called the receptive field of that convolutional grid cell. The basic idea is that throughout your convolutional layers, every piece of those tensors has a receptive field, which means: which part of the input image was responsible for calculating that cell.
And like all things in life, the easiest way to see this is with Microsoft Excel. So do you remember our convolutional neural net? And this was MNIST, we had the number 7. And it went through a two-channel filter, channel 1, channel 2, which therefore created a two-channel output. And then the next layer was another convolution, so this tensor is now a 3D tensor, which then creates a two-channel output.
And then after that, we had our max-pooling layer. So let's look at this part of this output. Given that this is a conv followed by a max-pool, let's just pretend it's a stride-2 conv, it's basically the same thing. So let's see where this number 27 came from. If you've got Excel, you can go Formulas, Trace Precedents, and you can see this came from these four.
Now where did those four come from? These four came from, obviously, the convolutional filter kernels, and from these four parts of the first conv layer's output, because we've got four things here, each one of which has a 3x3 filter, and so we have 3x3, 3x3, 3x3, 3x3, and all together that makes up a 4x4 area.
Where did those four come from? Those four came from obviously our filter, and this entire part of the input image. And what's more, you can see that these bits in the middle have lots of weights coming out, whereas these bits on the outside only have one weight coming out.
So we call this here the receptive field of this activation. But note that the receptive field is not just saying it's this here box, but also that the center of the box has more dependencies. So this is a critically important concept when it comes to understanding architectures and understanding why convnets work the way they do, the idea of the receptive field.
And there are some great articles, if you just google for convolution receptive field, you can find lots of terrific articles. I'm sure some of you will write much better ones during the week as well. So that's the basic idea there, right, is that the receptive field of this convolutional activation is generally centered around this part of the input image, so it should be responsible for finding objects that are here.
So that's the architecture. The architecture is that we're going to have a ResNet backbone followed by one or more 2D convolutions. And for now we're just going to do one, which is going to give us a 4x4 grid. So let's take a look at that. So here it is.
We start with our ReLU and Dropout. Then, let's start at the output... actually, let's go through and see what we've got here. We start with a stride-1 convolution. And the reason we start with a stride-1 convolution is that it doesn't change the geometry at all, it just lets us add an extra layer of calculations, it lets us create not just a linear layer, but a little mini neural network in our custom head.
So we start with a stride-1 convolution. And StdConv is just something I defined up here, which does convolution, ReLU, BatchNorm, Dropout. Most research code you see won't define a class like this; instead they'll write the entire thing again and again and again: convolution, ReLU, BatchNorm, Dropout. Don't be like that.
That kind of duplicate code leads to errors and leads to poor understanding. I mention that also because this week I released the first draft of the FastAI style guide. And the FastAI style guide is very heavily oriented towards the idea of expository programming, which is the idea that programming code should be something you can use to explain an idea, ideally as readily as mathematical notation to somebody that understands your coding method.
And so the idea actually goes back a very long way, but it was best described in the Turing Award lecture, this is like the Nobel of Computer Science, the Turing Award lecture of 1979 by probably my greatest computer science hero, Ken Iverson. He had been working on it well before in 1964, but 1964 was the first example of this approach to programming.
He released something called APL, and then 25 years later he won the Turing Award. He then passed on the baton to his son, Eric Iverson, and there's been basically 50 or 60 years now of continuous development of this idea of what does programming look like when it's designed to be a notation as a tool for thought for expository programming.
And so I've made a very shoddy attempt at taking some of these ideas and thinking about how can they be applied to Python programming with all the limitations by comparison that Python has. So here's a very simple example, if you write all of these things again and again and again, then it really hides the fact that you've got two convolutional layers, one of stride 1, one of stride 2.
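For reference, the StdConv class mentioned above looks roughly like this in the notebook:

```python
import torch.nn as nn
import torch.nn.functional as F

class StdConv(nn.Module):
    """3x3 convolution followed by ReLU, BatchNorm and Dropout; stride 2 by default."""
    def __init__(self, nin, nout, stride=2, drop=0.1):
        super().__init__()
        self.conv = nn.Conv2d(nin, nout, 3, stride=stride, padding=1)
        self.bn = nn.BatchNorm2d(nout)
        self.drop = nn.Dropout(drop)
    def forward(self, x):
        return self.drop(self.bn(F.relu(self.conv(x))))
```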
So my default for StdConv is stride 2; this one is stride 1, this one is stride 2, and then at the end, the output of this is going to be 4x4, and I've got an OutConv. OutConv is interesting: you can see it's got two separate convolutional layers, each of which is stride 1, so it's not changing the geometry of the input.
One of them outputs something of length equal to the number of classes (just ignore k for now, k is equal to 1 at this point of the code), and one outputs something of length 4. And so this is the idea of, rather than having a single conv layer that outputs 4+c, let's have two conv layers, one of which outputs 4 and one of which outputs c.
And then I will just return them as a list of two items. That's nearly the same thing as having a single conv layer that outputs 4 + c, but it lets these layers specialize just a little bit. So like we talked about this idea that when you've got multiple tasks, they can share layers, but they don't have to share all the layers.
So in this case, our two tasks, which is create a classifier and create bound box regression, share every single layer except the very last one. And so this is going to spit out two separate tenses of activations, one of the classes and one of the bounding box coordinates. Why am I adding 1?
That's because I'm going to have one more class for background. So if there aren't actually 16 objects to detect, or if there isn't an object in this corner represented by this convolutional grid cell, then I want it to predict background. So that's the entirety of our architecture, it's incredibly simple, but the point is now that we have this convolutional layer at the end.
One thing I do do is that at the very end I flatten out the convolution, basically because I wrote the loss function to expect a flattened-out tensor, but we could totally rewrite it to not do that. I might even try doing that during the week and see which one looks easier to understand.
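Here's a sketch of that head, based on the pascal-multi notebook; id2cat is assumed to be the list of category names, and flatten_conv reshapes each conv output to (batch, cells, channels) because the loss function expects a flattened-out tensor:

```python
def flatten_conv(x, k):
    bs, nf, gx, gy = x.size()
    x = x.permute(0, 2, 3, 1).contiguous()
    return x.view(bs, -1, nf // k)

class OutConv(nn.Module):
    """Two stride-1 convs: one for classes (+1 for background), one for bbox coords."""
    def __init__(self, k, nin, bias):
        super().__init__()
        self.k = k
        self.oconv1 = nn.Conv2d(nin, (len(id2cat) + 1) * k, 3, padding=1)  # classes
        self.oconv2 = nn.Conv2d(nin, 4 * k, 3, padding=1)                  # bbox coords
        self.oconv1.bias.data.zero_().add_(bias)
    def forward(self, x):
        return [flatten_conv(self.oconv1(x), self.k),
                flatten_conv(self.oconv2(x), self.k)]

class SSD_Head(nn.Module):
    def __init__(self, k, bias):
        super().__init__()
        self.drop = nn.Dropout(0.25)
        self.sconv0 = StdConv(512, 256, stride=1)  # stride 1: extra computation, same geometry
        self.sconv2 = StdConv(256, 256)            # stride 2: 7x7 -> 4x4
        self.out = OutConv(k, 256, bias)
    def forward(self, x):
        x = self.drop(F.relu(x))
        x = self.sconv0(x)
        x = self.sconv2(x)
        return self.out(x)
```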
So we've got our data, we've got our architecture. So now all we need is a loss function. So the loss function needs to look at each of these 16 sets of activations, each of which are going to have 4 bounding box coordinates and c+1 class probabilities, and decide are those activations close or far away from the object closest to this grid cell in the image?
And if nothing's there, then are you predicting background correctly? So that turns out to be very hard to do. Let's go back to the 2x2 example to keep it simple. The loss function actually needs to take each of the objects in the image and match them to one of these convolutional grid cells, to say this grid cell is responsible for this particular object, this grid cell is responsible for this particular object, so then it can go ahead and say how close are the 4 coordinates and how close are the class probabilities?
So this is called the matching problem. In order to explain it, I'm going to show it to you. But what I'm going to do first is I'm going to take a break and we're going to come back and understand the matching problem. So during the break, have a think about how would you design a loss function here?
How would you design a function which has a lower value if these 16 x (4+c) activations somehow better reflect the up to 16 objects which are actually in the ground truth image? And we'll come back at 7:40. So here's our goal. Our dependent variable basically looks like that, and it's just an extract from our CSV file.
And our final convolutional layer is going to be a bunch of numbers which is initially 4 by 4 by, in this case, I think c is equal to 20 plus 1 for background, so 4 plus 21 equals 25, so 4 by 4 by 25. And then we flatten that out into a vector.
We flatten that out into a vector, and so basically our goal then is to say to some particular set of activations that ended up coming out of this model, let's pick some particular dependent variable. We need some function that takes in that and that, and where it feeds back a higher number if these activations aren't a good reflection of the ground truth bounding boxes, or a lower number if it is a good reflection of the ground truth bounding boxes.
That's our goal. We need to create that function. And so the general approach to creating that function will be to first of all, to simplify it down with a 2 by 2 version, will be to first of all, well actually, I'll show you. Here's a model I trained earlier, and let's run through, I've taken the loss function and I've split it line by line so that you can see every line that goes into making it.
So let's grab our validation set data loader, grab a batch from it, turn them into variables so we can stick them into a model, put the model in evaluation mode, and stick that data into our model to grab a batch of activations. And remember that the final output convolution returned two items, the classes and the bounding boxes, so we can use destructuring assignment to grab the two pieces, the batch of class outputs and the batch of bounding box outputs.
And so as expected, the batch of class outputs is batch size 64 by 16 grid cells by 21 classes and then 64 by 16 by 4 for the bounding box coordinates. Hopefully that all makes sense and after class go back and just make sure if it's not obvious why these are the shapes, make sure you get to the point where you understand where they are.
So let's now go back and look at the ground truth, so the ground truth is in this Y variable. So let's grab the bounding box part and the class part and put them into these two Python variables and print them out. And so there's our ground truth bounding boxes and there's our ground truth classes.
So this image apparently has three objects in it. So let's draw a picture of the three objects, and there they are. We already have a show ground truth function; the torch_gt function here simply converts the tensors into numpy and passes them along so that we can print them out.
So here we've got the bounding box coordinates. So notice that they've all been scaled between 0 and 1, so basically we're treating the image as being 1 by 1, so these are all relative to the size of the image, there's our three classes, and so here they are, chair is 0, dining table is 1, and 2 is sofa.
This is not a model, this is the ground truth. Here is our 4 by 4 grid cells from our final convolutional layer. So each of these square boxes, different papers call them different things, the three terms you'll hear are anchor boxes, prior boxes, or default boxes. And through this explanation you'll get a sense of what they are, but for now think of them as just these 16 squares, I'm going to stick with the term anchor boxes.
These 16 squares are our anchor boxes. So what we're going to do for this loss function is we're going to go through a matching problem where we're going to take every one of these 16 boxes and we're going to see which one of these three ground truth objects has the highest amount of overlap with this square.
So to do that, we're going to have to have some way of measuring an amount of overlap, and there's a standard function for this called the Jaccard index. The Jaccard index is very simple, and I'll do it through an example. Let's take this sofa, and let's take the Jaccard index of this sofa with this grid cell here. What we do is we find the area of their intersection, so here is the area of their intersection, and then we find the area of their union, so here is the area of their union, and then we take the intersection divided by the union.
And so that's the Jaccard index, also known as IoU, intersection over union. So if two things overlap by more, compared to their total sizes together, they have a higher Jaccard index. So we're going to go through and find the Jaccard overlap for each one of these three objects versus each of these 16 anchor boxes, and that's going to give us a 3x16 matrix.
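A minimal sketch of that computation, assuming boxes are given as (top-left, bottom-right) corner coordinates:

```python
def intersect(box_a, box_b):
    # Pairwise intersection areas: box_a is (n,4), box_b is (m,4) -> (n,m)
    max_xy = torch.min(box_a[:, None, 2:], box_b[None, :, 2:])
    min_xy = torch.max(box_a[:, None, :2], box_b[None, :, :2])
    inter = torch.clamp(max_xy - min_xy, min=0)
    return inter[:, :, 0] * inter[:, :, 1]

def box_area(b):
    return (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])

def jaccard(box_a, box_b):
    inter = intersect(box_a, box_b)
    union = box_area(box_a)[:, None] + box_area(box_b)[None, :] - inter
    return inter / union   # in our case: (3 ground truth objects) x (16 anchor boxes)
```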
For every ground truth object, for every anchor box, how much overlap is there? So here are the coordinates of all of our anchor boxes, in this case they're printed as center and height and width. And so here is the amount of overlap between, and as you can see it's 3x16, so for each of the three ground truth objects, for each of the 16 anchor boxes, how much do they overlap?
So you can see here, 0, 1, 2, 3, 4, 5, 6, 7, 8, the 8th, anchor box overlaps a little bit with the second ground truth object. So what we could do now is we could take the max of dimension 1, so the max of each row, and that will tell us for each ground truth object what's the maximum amount that it overlaps with some grid cell.
And it also tells us, remember PyTorch when you say max returns two things, it says what is the max and what is the index of the max. So for each of these things, the 14th grid cell is the largest overlap for the first ground truth, 13 for the second, and 11 for the third.
So that tells us a pretty good way of assigning each of these ground truth objects to a grid cell, what the max is, which one is the highest overlap. But we're going to do a second thing, we're also going to look at max over dimension 0, and max over dimension 0 is going to tell us what's the maximum amount of overlap for each grid cell across all of the ground truth objects.
And so particularly interesting here tells us for every grid cell of 16, what's the index of the ground truth object which overlaps with it the most. Zero is a bit overloaded here, zero could either mean the amount of overlap was zero, or it could mean its largest overlap is with object index 0.
It's going to turn out not to matter, I just wanted to explain why this would be zero. So there's a function called map to ground truth, which I'm not going to worry about for now, it's super simple code but it's slightly awkward to think about, but basically what it does is it combines these two sets of overlaps in a way described in the SSD paper to assign every anchor box to a ground truth object.
Basically the way it assigns it is each of these ones, each of these three, gets assigned in this way, so this object is assigned to anchor box 14, this one to 13, and this one to 11, and then of the rest of the anchor boxes they get assigned to anything which they have an overlap of at least 0.5 with.
If anything which isn't in either of those criteria, i.e. which either isn't a maximum or doesn't have a greater than 0.5 overlap, is considered to be a cell which contains background. So that's all the map to ground truth function does. And so after we go through it, you can see now a list of all of the assignments, and you can also see anywhere that there's a 0 here, it means it was assigned to background.
In fact anywhere it's less than 0.5 here, it was assigned to background. So you can see those three which are kind of forced assignments that puts a high number in just to make sure that they're assigned. So we can now go ahead and convert those to classes, and then we can make sure we just grab those which are at least 0.5 in size, and so finally that allows us to spit out the three classes that are being predicted.
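The matching logic just described is small enough to sketch here; the forced assignments are done by writing in an artificially high overlap value:

```python
def map_to_ground_truth(overlaps):
    prior_overlap, prior_idx = overlaps.max(1)   # best anchor box for each ground truth object
    gt_overlap, gt_idx = overlaps.max(0)         # best ground truth object for each anchor box
    gt_overlap[prior_idx] = 1.99                 # force-assign each object's best anchor
    for i, o in enumerate(prior_idx):
        gt_idx[o] = i
    return gt_overlap, gt_idx
```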
We can then put that back into the bounding boxes, and so here are what each of those anchor boxes is meant to be predicting. So you can see sofa, dining room table, chair, this is meant to be predicting sofa, this is meant to be predicting dining room table, this is meant to be predicting chair, and everything else is meant to be predicting background.
So that's the matching stage. Once we've done the matching stage, we're basically done. We can take the activations, grab just those which matched (that's what these positive indexes are), subtract from those the ground truth bounding boxes, take the absolute value of the difference, take the mean of that, and that's the L1 loss.
And then for the classifications, we can just do cross-entropy, and then as before we can add them together. So that's the basic idea. There's a few, and so this is what's going to happen. We're going to end up with 16 recommended predicted bounding boxes coming out. Most of them will be background, see all these ones that say bg, but from time to time they'll say this is a cow, this is potted plant, this is a cow.
If you're wondering what does it predict in terms of the bounding box of background, the answer is it totally ignores it. That's why we had this only positive index thing here. So if it's background, there's no sense of where's the correct bounding box of background. So the only ones where the bounding box makes sense out of all of these are the ones that aren't background.
There are some important little tweaks. One is, how do we interpret the activations? The way we interpret the activations is defined here in actn_to_bb, activations to bounding boxes. We grab the activations and stick them through tanh, and remember tanh has the same shape as sigmoid except it's scaled to be between -1 and 1, not between 0 and 1.
So it's basically a sigmoid function that goes between -1 and 1. And so that forces it to be within that range. And we then say let's grab the actual position of the anchor boxes and we will move them around according to the value of the activations divided by 2.
So in other words, each predicted bounding box can be moved by up to 50% of a grid size from where its default position is, and ditto for its height and width can be up to twice as big or half as big as its default size. So that's one thing is we have to convert the activations into some kind of way of scaling those default anchor box positions.
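Here's a sketch of that conversion; hw2corners is the small helper from the notebook that turns center/size into corner coordinates, and anchors and grid_sizes are the anchor definitions set up elsewhere:

```python
def hw2corners(ctr, hw):
    return torch.cat([ctr - hw/2, ctr + hw/2], dim=1)

def actn_to_bb(actn, anchors, grid_sizes):
    actn_bbs = torch.tanh(actn)                                      # squash activations into (-1, 1)
    actn_ctrs = (actn_bbs[:, :2] / 2 * grid_sizes) + anchors[:, :2]  # shift center by up to half a cell
    actn_hw = (actn_bbs[:, 2:] / 2 + 1) * anchors[:, 2:]             # scale height/width around the anchor's size
    return hw2corners(actn_ctrs, actn_hw)
```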
Another thing is we don't actually use cross-entropy, we actually use binary cross-entropy loss. So remember binary cross-entropy loss is what we normally use for multi-label classification, like in the planet Amazon satellite competition. Each satellite image could have multiple things in it. So if it's got multiple things in it, you can't use softmax, because softmax kind of really encourages just one thing to have the high number.
In our case, each anchor box can only have one object associated with it. So it's not for that reason that we're avoiding softmax, it's something else, which is it's possible for an anchor box to have nothing associated with it. So there'd be two ways to handle that, this idea of background.
One would be to say, you know what, background's just a class, so let's use softmax and just treat background as one of the classes that the softmax could predict. A lot of people have done it this way. I don't like that though, because that's a really hard thing to ask a neural network to do: basically to say, can you tell whether this grid cell doesn't have any of the 20 objects that I'm interested in with a Jaccard overlap of more than 0.5?
That's a really hard thing to put into a single computation. On the other hand, what if we just had for each class, is it a motorbike, is it a bus, is it a person, is it a bird, is it a dining room table? And then it can check each of those and be no, no, no, no, no, and if it's no to all of them, it's like, oh, then it's background.
So that's the way I'm doing it, it's not that we could have multiple true labels, but we can have zero true labels. And so that's what's going on here. We take our target and we do a one-hot embedding with number of classes plus one, so at this stage we do have the idea of background for the one-hot embedding.
But then we remove the last column, so the background column's now gone. And so now this vector is either all zeros, basically meaning there's nothing here, or it has at most one one. And so then we can use binary cross-entropy to compare our predictions with that target. That is a minor tweak, but it's the kind of minor tweak that I want you to think about and understand, because it makes a really big difference in practice to your training.
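A sketch of that loss, loosely following the notebook; the one_hot_embedding helper and the handling of the extra background column are assumptions about the exact details, so treat this as illustrative:

```python
def one_hot_embedding(labels, num_classes):
    return torch.eye(num_classes)[labels.data.cpu()]

class BCE_Loss(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.num_classes = num_classes
    def forward(self, pred, targ):
        # targ holds class indices, where index num_classes means "background"
        t = one_hot_embedding(targ, self.num_classes + 1)
        t = V(t[:, :-1].contiguous())   # drop the background column: background = all zeros
        x = pred[:, :-1]                # ignore the model's background column too
        return F.binary_cross_entropy_with_logits(
            x, t, size_average=False) / self.num_classes
```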
And it's the kind of thing that you'll see a lot of papers talk about, like often when there's some increment over some previous paper, it'll be something like this. It'll be somebody who realizes like, oh, trying to predict a background category using a softmax is a really hard thing to do, what if we use the binary cross-entropy instead.
And so it's kind of like, if you understand what this is doing, and more importantly why we're doing it, that's a really good test of your understanding of the material. And if you don't, that's fine, it just shows you this is something that you need to go back and rewatch this part of the video, talk to some of your classmates, and if necessary ask on the forum, until you understand what we are doing and why we are doing it.
So that's what this binary cross-entropy loss function is doing. So basically in this part of the code we've got this custom loss function, we've got the thing that calculates the Jaccard index, we've got the thing that converts activations to bounding boxes, we've got the thing that does map_to_ground_truth that we looked at, and that's it.
All that's left is the SSD loss function. The SSD loss function, ssd_loss, is what we set as our criterion. What ssd_loss does is it loops through each image in the minibatch and calls ssd_1_loss, the SSD loss for one image. So that function is really where it's all happening, it's calculating the SSD loss for one image. We destructure our bounding boxes and classes, and, this is worth mentioning, a lot of code you find out there on the internet doesn't work with minibatches, it only does one thing at a time, which we really don't want.
So in this case, all of this stuff is working, it's not exactly on a minibatch at a time, it's on a whole bunch of ground truth objects at a time, and the data loader is being fed a minibatch at a time to do the convolutional layers. Because we could have different numbers of ground truth objects in each image, but a tensor has to be a strict rectangular shape, fastai automatically pads it with zeros, anything that's not the same length.
I think I fairly recently added it, but it's super handy, almost no other libraries do that. But that does mean that you then have to make sure that you get rid of those zeros. So you can see here I'm checking to find all of the non-zeros, and I'm only keeping those.
This is just getting rid of any of the bounding boxes that are actually just padding. So: get rid of the padding, turn the activations into bounding boxes, do the Jaccard, do the map to ground truth, this is all the stuff we just went through, it's all line by line underneath. Then check that there's an overlap greater than something around 0.4 or 0.5 (different papers use different values for this).
Find the things that match, assign the background class to those that don't, and then finally get the L1 loss for the localization part, get the binary cross-entropy loss for the classification part, return those two pieces, and then finally add them together. So there's a lot going on, and it might take a few watches of the video, alongside the code, to fully understand it.
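Here's a condensed sketch of those two functions, following the notebook; globals like anchors, anchor_cnr, grid_sizes, id2cat, sz and loss_f = BCE_Loss(len(id2cat)) are assumed to be set up as above, and details may differ slightly from what's on screen:

```python
def get_y(bbox, clas):
    # fastai pads variable-length targets with zeros; strip those out
    bbox = bbox.view(-1, 4) / sz
    bb_keep = ((bbox[:, 2] - bbox[:, 0]) > 0).nonzero()[:, 0]
    return bbox[bb_keep], clas[bb_keep]

def ssd_1_loss(b_c, b_bb, bbox, clas):
    bbox, clas = get_y(bbox, clas)
    a_ic = actn_to_bb(b_bb, anchors, grid_sizes)      # activations -> predicted boxes
    overlaps = jaccard(bbox.data, anchor_cnr.data)
    gt_overlap, gt_idx = map_to_ground_truth(overlaps)
    gt_clas = clas[gt_idx]
    pos = gt_overlap > 0.4                            # matching threshold; papers vary
    pos_idx = torch.nonzero(pos)[:, 0]
    gt_clas[1 - pos] = len(id2cat)                    # everything else becomes background (old PyTorch idiom for ~pos)
    gt_bbox = bbox[gt_idx]
    loc_loss = (a_ic[pos_idx] - gt_bbox[pos_idx]).abs().mean()   # L1 on matched boxes only
    clas_loss = loss_f(b_c, gt_clas)                  # binary cross-entropy on classes
    return loc_loss, clas_loss

def ssd_loss(pred, targ):
    lcs, lls = 0., 0.
    for b_c, b_bb, bbox, clas in zip(*pred, *targ):   # loop over the minibatch
        loc_loss, clas_loss = ssd_1_loss(b_c, b_bb, bbox, clas)
        lls += loc_loss
        lcs += clas_loss
    return lls + lcs
```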
But the basic idea now is that we now have the things we need. We have the data, we have the architecture, and we have the loss function. So now we've got those three things we can train. So do my normal learning rate finder and train for a bit, and we get down to 25, and then at the end we can see how we went.
So obviously this isn't quite what we want, I mean in practice we kind of remove the background ones or some threshold, but it's on the right track, there's a dog in the middle, 0.34, there's a bird here in the middle, 0.94, something's working okay, I've got a few concerns, I don't see anything saying motorcycle here, it says bicycle, which isn't great.
There's nothing for the potted plant that's big enough, but that's not surprising because all of our anchor boxes were small, they were on a 4x4 grid. So to go from here to something that's going to be more accurate, all we're going to do is create way more anchor boxes. So there's a couple of ways we can create -- >> Quick question, I'm just getting lost in the fact that the anchor boxes and the bounding boxes are, how are they not the same?
Isn't that how we wrote the loss? I must be missing something. >> Anchor boxes are the fixed square grid cells; these are the anchor boxes, they're in an exact, specific, unmoving location. The bounding boxes are these three things here; these 16 things are the anchor boxes.
Okay. So we're going to create lots more anchor boxes. There are three ways to do that, and I've kind of drawn some of them here. One is to create anchor boxes of different sizes and aspect ratios. So here you can see there's an upright rectangle, there's a lying-down rectangle, and there's a square.
>> Just a question, for the multi-label classification, why aren't we multiplying the categorical loss by a constant like we did before? >> That's a great question. It's because later on it'll turn out we don't need to. So yeah, you can see here there's a square. And I don't know if you can see this, but if you look, you've basically got one, two, three squares of different sizes, and for each of those three squares you've also got a lying-down rectangle and an upright rectangle to go with them.
So we've got three aspect ratios at three zoom levels, so that's one way we can do this. And this is for the one-by-one grid. So in other words, if we added two more stride two convolutional layers, you'll eventually get to the one-by-one grid, and this is for the one-by-one grid.
Another thing we could do is to use more convolutional layers as sources of anchor boxes, and I've randomly jittered these a little bit so it's easier to see. So as well as our 16 4x4 grid cells, we've also got 2x2 grid cells, and we've also got the 1x1 grid cell.
So in other words, if we add three stride two convolutions to the end, we'll have four-by-four, two-by-two, and one-by-one grid cells, all of which have anchor boxes. And then for every one of those, we can have all of these different shapes and sizes. So obviously those two are combined with each other to create lots of anchor boxes, and if I try to print that on the screen, it's just one big blur of color, so I'm not going to do that.
So that's all this code is, right? It says: what are all the grid cell sizes I have for the anchor boxes, what are all the zoom levels I have for the anchor boxes, and what are all the aspect ratios I have for the anchor boxes? And the rest of this code then just goes away and creates the top left and bottom right corners in anchor_cnr, and the center, height and width in anchors.
So that's all this does, and you can go through it and print out the anchors and anchor corners. So the key is to remember this basic idea that we have a vector of ground truth stuff, right? Where that stuff is like sets of four bounding boxes, but this is what we were given in the JSON files.
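To show the idea, here's the simple single-scale version of that anchor construction, the plain 4x4 grid before zooms and aspect ratios are added; names follow the notebook and V is fastai's variable wrapper:

```python
import numpy as np

anc_grid = 4
anc_offset = 1 / (anc_grid * 2)
anc_x = np.repeat(np.linspace(anc_offset, 1 - anc_offset, anc_grid), anc_grid)
anc_y = np.tile(np.linspace(anc_offset, 1 - anc_offset, anc_grid), anc_grid)
anc_ctrs = np.stack([anc_x, anc_y], axis=1)                      # 16 cell centers, scaled 0-1
anc_sizes = np.array([[1/anc_grid, 1/anc_grid]] * anc_grid**2)   # one cell-sized box per center
anchors = V(np.concatenate([anc_ctrs, anc_sizes], axis=1), requires_grad=False).float()
grid_sizes = V(np.array([1/anc_grid]), requires_grad=False).unsqueeze(1)
anchor_cnr = hw2corners(anchors[:, :2], anchors[:, 2:])          # same boxes, as corner coordinates
```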
It's the ground truth, it's a dependent variable. Sets of four bounding boxes, and for each one, also a class. So this is a person in this location, this is a dog in this location, and that's the ground truth that we're given. "Just to clarify, each set of four is one box, top left, bottom right, top left, xy, bottom right, xy." So that's what we printed here, we printed out, this is what we call the ground truth.
There's no model, this is what we're told is what the answer is meant to be. And so remember, any time we train a neural net, we have a dependent variable, and then we have a neural net, some black box neural net, that takes some input and spits out some output activations, and we take those activations and we compare them to the ground truth.
We calculate a loss, we find the derivative of that, and adjust the weights according to the derivative times the learning rate, okay? So the loss is calculated using a loss function. Something I wanted to say is, I think one of the challenges with this problem is that part of what's going on here is we're having to come up with an architecture that lets us predict this ground truth.
Like it's not, because you can have, you know, any number of objects in your picture, it's not immediately obvious what's the correct architecture that's going to let us predict that sort of ground truth. I guess so, but I'm going to kind of make this plain, as we saw when we looked at the YOLO versus SSD, that there are only two possible architectures.
The last layer is fully connected, or the last layer is convolutional. And both of them work perfectly well. I'm sorry, I meant in terms of by creating this idea of anchor boxes and anchor boxes with different locations and sizes, that's giving you a format that kind of lets you get to the activations.
You're right, like high level. You see, okay, so that's really entirely in the loss function, not in the architecture. Like if we used the YOLO architecture where we had a fully connected layer, like literally there would be no concept of geometry in it at all. So I would suggest kind of forgetting the architecture and just treat it as just a given.
It's a thing that is spitting out 16×(4+c) activations. And then I would say our job is to figure out how to take those 16×(4+c) activations and compare them to our ground truth, which is like 4+1 per object, or 4+c if the class were one-hot encoded, and I think that's easier to think about, so call it (4+c) times however many ground truth objects there are for that particular image.
So let's call that m. So we need a loss function that can take these two things and spit out a number that says how good are these activations. That's what we're trying to do. So to do it, we need to take each one of these m ground truth objects and decide which set of 4+c activations is responsible for that object.
Which one should we be comparing against and saying, is it the right class or not, and yes, is it close or not? The way we do that is basically to say, let's decide the first 4+c activations are going to be responsible for predicting the bounding box of the thing that's closest to the top left, and the last 4+c will be responsible for predicting the thing that's furthest to the bottom right.
And then of course we're not using the YOLO approach where we have a single vector, we're using the SSD approach where we spit out a convolutional output, which means that it's not arbitrary as to which we match up, but actually we want to match up the set of activations whose receptive field has the maximum density from where this real object is.
But that's a minor tweak. I guess the easier way to have taught this would have been to start with the YOLO approach, where it's just an arbitrary vector and we can decide which activations correspond to which ground truth object. As long as it's consistent, it's got to be a consistent rule, because if in the first image the top left object corresponds to the first 4+c activations, and then in the second image we shuffled things around and suddenly it's going with the last 4+c activations, the neural net doesn't know what to learn. The loss function needs to encode some consistent task, which in this case is: try to make these activations reflect the bounding box in this general area.
That's basically what this loss function is trying to do. Is it purely coincidence that the 4x4 in the conv2d is the same as the 16 here? No, not at all a coincidence. That 4x4 conv is going to give us activations whose receptive fields correspond to those locations in the input image, so it's carefully designed to make that as effective as possible.
Now remember I told you before part 2 that the stuff we learn in part 2 is going to assume that you are extremely comfortable with everything you learned in part 1? And for a lot of you, you might be realizing now, maybe I wasn't quite as familiar with the stuff in part 1 as I first thought. That's fine, but just realize you might have to go back and really think deeply and experiment more with understanding what the inputs and outputs to each layer in a convolutional network are, how big they are, what their rank is, exactly how they're calculated, so that you really fully understand the idea of a receptive field.
What's the loss function really, how does backpropagation work exactly? These things all need to be deeply felt intuitions, which you only get through practice. And once they're all deeply felt intuitions, then you can rewatch this video and you'll be like, oh, I see, okay, I see that these activations just need some way of understanding what task they're being given, and that is being done by the loss function, and the loss function is encoding a task.
And so the task of the SSD loss function is basically two parts. Part 1 is figure out which ground truth object is closest to which grid cell or which anchor box. When we started doing this, the grid cells of the convolution and the anchor boxes were the same, but now we're starting to introduce the idea that we can have multiple anchor boxes per grid cell.
So this is why it starts to get a little bit more complicated. So for every ground truth object we have to figure out which anchor boxes are closest to it, and for every anchor box we have to decide which ground truth object it's responsible for, if any. And once we've done that matching, it's trivial.
Now we just basically go through and do, going back to the single object detection, now it's just this. Once we've got every ground truth object matched to an anchor box, to a set of activations, we can basically then say, OK, what's the cross-entropy loss of the categorical part? What's the L1 loss of the coordinate part?
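If it helps to see the shape of that, here's a rough sketch of the two-part loss for a single image. It assumes pred_boxes have already been decoded into box coordinates, treats class 0 as background, and is definitely not the exact fast.ai implementation:

```python
import torch
import torch.nn.functional as F

def jaccard(a, b):
    """Pairwise IoU between two sets of (y1, x1, y2, x2) boxes."""
    tl = torch.max(a[:, None, :2], b[None, :, :2])
    br = torch.min(a[:, None, 2:], b[None, :, 2:])
    inter = (br - tl).clamp(min=0).prod(2)
    area_a = (a[:, 2:] - a[:, :2]).prod(1)
    area_b = (b[:, 2:] - b[:, :2]).prod(1)
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def map_to_ground_truth(overlaps):
    """overlaps: (num_gt, num_anchors) Jaccard matrix."""
    gt_val, gt_idx = overlaps.max(1)      # best anchor for each ground truth object
    anc_val, anc_idx = overlaps.max(0)    # best ground truth object for each anchor
    anc_val[gt_idx] = 2.0                 # force those anchors to keep their object
    for gt_i, anc_i in enumerate(gt_idx):
        anc_idx[anc_i] = gt_i
    return anc_val, anc_idx

def ssd_item_loss(pred_boxes, pred_scores, gt_boxes, gt_classes, anchor_cnr, thresh=0.5):
    overlaps = jaccard(gt_boxes, anchor_cnr)
    anc_val, anc_idx = map_to_ground_truth(overlaps)
    pos = anc_val > thresh                # anchors responsible for some object
    # localisation: L1 loss on the matched anchors only
    loc_loss = F.l1_loss(pred_boxes[pos], gt_boxes[anc_idx[pos]])
    # classification: unmatched anchors are assigned the background class (0 here)
    targets = torch.where(pos, gt_classes[anc_idx], torch.zeros_like(anc_idx))
    clas_loss = F.cross_entropy(pred_scores, targets)
    return loc_loss + clas_loss
```

The notebook's version additionally converts raw activations into box coordinates relative to the anchors, but the match-then-score structure is the same idea.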
So really it's the matching part which is the slightly surprising bit. And then this idea of picking them in a way that gives the convolutional network's receptive fields the best opportunity to predict that part of the space is the final cherry on top. And I'll tell you something else: this class is by far going to be the most conceptually challenging.
And part of the reason for that is that after this, we're going to go and do some different stuff, and we're going to come back to it in lesson 14 and do it again with some tweaks. And we're going to add in some of the new stuff we learned afterwards.
So you're going to get a whole second run through of this material, effectively, once we add some extra stuff at the end. So we're going to revise it, as we normally do. So in part one, we kind of went through computer vision, NLP, structured data, back to NLP, back to computer vision.
So we revised everything from the start to the end, and it'll be kind of similar here. So don't worry if it's a bit challenging at first, you'll get there. So for every grid cell, which can be of different sizes, we can have different aspect ratios and zooms representing different anchor boxes, which are just conceptual ideas: every one of them is associated with one set of 4+c activations in our model.
So however many of these anchor boxes we have, we need to have that many sets of 4+c activations in the model. Now that does not mean that each convolutional layer needs that many filters, because remember, the 4x4 convolutional layer already gives us 16 sets of activations, and the 2x2 convolutional layer already gives us four sets of activations.
And then finally the 1x1 has one set of activations. So we basically get 1+4+16 for free, just because that's how a convolution works, it calculates things at different locations. So we actually only need to know k, where k is the number of zooms by the number of aspect ratios, whereas the grids we're going to get for free through our architecture.
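Just to spell out the arithmetic in that paragraph, using the same illustrative numbers as before:

```python
k = 3 * 3                                 # zooms x aspect ratios per grid cell
n_cells = 4 * 4 + 2 * 2 + 1 * 1           # grid locations we get for free: 21
n_classes = 20                            # Pascal VOC classes (background handled by the loss)
print(k * n_cells, k * n_cells * (4 + n_classes))   # 189 anchors, 4536 activations
```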
So let's check out that architecture. So the model is nearly identical to what we had before, but we're going to have a number of stride 2 convolutions, which is going to take us through to 4x4, 2x2, 1x1. Each stride 2 convolution halves our grid size in both directions. And then after we do our first convolution to get to 4x4, we're going to grab a set of outputs from that, because we want to save away the 4x4 grid's anchors.
And then once we get to 2x2, we grab another set of outputs for our 2x2 anchors, and then finally we get to 1x1, so we get another set of outputs. So you can see we've got a whole bunch of these output convolutions; this first one we're not using. So at the end of that we can then concatenate them all together.
So we've got the 4x4 activations, the 2x2 activations, and the 1x1 activations. That's going to give us the correct number of activations, one set of activations for every anchor box that we have. So then we just set our criterion, as before, to SSD loss, and we go ahead and train, and away we go.
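Here's a hand-wavy sketch of a head along those lines, not the exact fast.ai SSD_MultiHead. It assumes a backbone that hands us a 7x7 feature map with 512 channels; stride 2 convs take it to 4x4, 2x2 and 1x1, an output conv at each scale produces the box and class activations for k anchors per cell, and everything is flattened and concatenated along the anchor dimension:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def flatten_conv(x, k):
    bs, nf, gx, gy = x.size()
    x = x.permute(0, 2, 3, 1).contiguous()
    return x.view(bs, -1, nf // k)          # one row per grid location per anchor

class OutConv(nn.Module):
    def __init__(self, k, nin, n_clas):
        super().__init__()
        self.k = k
        self.oconv1 = nn.Conv2d(nin, n_clas * k, 3, padding=1)   # class scores
        self.oconv2 = nn.Conv2d(nin, 4 * k, 3, padding=1)        # box activations

    def forward(self, x):
        return flatten_conv(self.oconv1(x), self.k), flatten_conv(self.oconv2(x), self.k)

class SSDHead(nn.Module):
    def __init__(self, k, n_clas, nin=512):
        super().__init__()
        self.sconv1 = nn.Conv2d(nin, 256, 3, stride=2, padding=1)   # 7x7 -> 4x4
        self.sconv2 = nn.Conv2d(256, 256, 3, stride=2, padding=1)   # 4x4 -> 2x2
        self.sconv3 = nn.Conv2d(256, 256, 3, stride=2, padding=1)   # 2x2 -> 1x1
        self.out1 = OutConv(k, 256, n_clas)
        self.out2 = OutConv(k, 256, n_clas)
        self.out3 = OutConv(k, 256, n_clas)

    def forward(self, x):
        x = F.relu(self.sconv1(x)); c1, b1 = self.out1(x)
        x = F.relu(self.sconv2(x)); c2, b2 = self.out2(x)
        x = F.relu(self.sconv3(x)); c3, b3 = self.out3(x)
        # concatenate the 4x4, 2x2 and 1x1 predictions along the anchor axis
        return torch.cat([c1, c2, c3], dim=1), torch.cat([b1, b2, b3], dim=1)
```

Sized this way, the concatenated outputs line up with the 4x4, 2x2 and 1x1 anchor grids in that order, which is what lets the loss function index them consistently.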
So in this case I'm just printing out those things which have at least a probability of 0.1, and you can see we've got -- some things look okay, some things don't. For our big objects like the bird, we've got a box here with a 0.93 probability and it looks to be in about the right spot; our person's looking pretty hopeful; but our motorbike has nothing at all with a probability over 0.1.
Our potted plant is looking pretty horrible, our bus is all the wrong size, so what's going on? What's going on here will tell us a lot about the history of object detection. These five papers are the key steps in the recent modern history of object detection. They go back to about, I think, 2013, with this paper called Scalable Object Detection Using Deep Neural Networks.
This is what basically set everything up. And when people refer to the multi-box method, they're talking about this paper. And this is the basic one that came up with this idea that you can have a loss function that has this matching process, and then you can kind of use that to do object detection.
So everything since that time has been trying to figure out basically how to make this better. In parallel, there's a guy called Ross Girshick who was going down a totally different direction: he had these two-stage processes, where the first stage used classical computer vision approaches to find edges and changes of gradients and stuff to guess which parts of the image might represent distinct objects, and then fed each of those into a convolutional neural network, which was basically designed to figure out, is that actually the kind of object I'm interested in?
And so this was called R-CNN, and then fast R-CNN, this kind of hybrid of traditional computer vision and deep learning. So what Ross and his team then did was they basically took this multi-box idea and replaced the traditional, non-deep-learning computer vision part of their two-stage process with a ConvNet.
So they now had two ConvNets: one ConvNet that basically spat out something like this, which he called region proposals, all of the things that might be objects. And then the second part was the same as his earlier work; it basically took each of those, fed it into a separate ConvNet which was designed to classify whether or not that particular thing really is an interesting object.
At a similar time, these two papers came out, YOLO and SSD, and both of these did something pretty cool, which is they got the same kind of performance as fast R-CNN, but with one stage. And so they basically took the multi-box idea and tried to figure out how to deal with the messy output we just saw, and the basic ideas were to use, for example, a technique called hard-negative mining, where they would go through and find all of the matches that didn't look that good and throw them away, plus some very tricky and complex data augmentation methods, all kinds of hackery, basically.
But they got it to work pretty well. But then something really cool happened late last year, which is this thing called focal loss for dense object detection, where they actually realized why this messy crap wasn't working. And I'll describe why this messy crap wasn't working, by trying to describe why it is that we can't find a motorbike.
So here's the thing. When we look at this, we have three different granularities of convolutional grids: 4x4, 2x2, 1x1. The 1x1 is quite likely to have a reasonable overlap with some object, because most photos have some kind of main subject. On the other hand, in the 4x4, those 16 grid cells are unlikely to.
Most of them are not going to have much of an overlap with anything. Like in this motorbike case, it's going to be sky, sky, sky, sky, sky, sky, sky, ground, ground, ground, ground, ground, ground, and finally motorbike. So if somebody was to say to you like, "20 buck bet, what do you reckon this little clip is?" And you're not sure, you're going to say, "background." Because most of the time it is background.
And so here's the thing. I understand why we have a 4x4 grid of receptive fields with one anchor box each to coarsely localize objects in the image, but what I think I'm missing is why we need multiple receptive fields at different sizes. The first version already included 16 receptive fields, each with a single anchor box associated; with the addition, there are now many more anchor boxes to consider.
Is this because you constrained how much a receptive field could move or scale from its original size, or is there another reason? It's kind of backwards. The reason I did the constraining is because I knew I was going to be adding more anchor boxes later, but really the reason is that the Jaccard overlap between one of those 4x4 grid cells and a single object that takes up most of the image is never going to be 0.5, because the intersection is much smaller than the union, since the object is too big.
So for this general idea of work where we're saying you're responsible for something that you've got a better than 50% overlap with, we need anchor boxes which will on a regular basis have a 50% or higher overlap, which means we need to have a variety of sizes and shapes and scales.
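A quick back-of-the-envelope check of that claim, in normalised image coordinates:

```python
# One cell of a 4x4 grid versus an object that covers the whole image.
cell_area, obj_area = 0.25 * 0.25, 1.0   # the cell lies entirely inside the object
intersection = cell_area
union = cell_area + obj_area - intersection
print(intersection / union)              # 0.0625, nowhere near the 0.5 threshold
```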
So this all happens in the loss function. Basically, the vast majority of the interesting stuff in all of object detection is the loss function, because there are really only three things: the loss function, the architecture, and the data. So this is the focal loss paper, Focal Loss for Dense Object Detection, from August 2017.
Here's Ross Girshick still doing this stuff, and Kaiming He, who you might recognize as the ResNet guy, so a bit of an all-star cast here. And the key thing is this very first picture. The blue line is a picture of binary cross-entropy loss. The x-axis is the probability, the activation: what is the probability of the ground truth class?
So if it's actually a motorbike and I said with 0.6 chance it's a motorbike, or it's actually not a motorbike and I said with 0.6 chance it's not a motorbike, then this blue line shows the value of the cross-entropy loss. You can draw this in Excel or Python or whatever; it's just a simple plot of cross-entropy loss.
So the point is, because remember we're doing binary cross-entropy loss, if the answer is not a motorbike, and I said, yeah, I think it's not a motorbike, I'm 0.6 sure it's not a motorbike, this blue line is still at a loss of about 0.5. That's still pretty bad, so I have to keep getting more and more confident that it's not a motorbike.
So if I want to get my loss down, then for all of these things which are actually background, I have to be saying like, I am sure that's background, or I'm sure it's not a motorbike or a bus or a person or a dining room table. Because if I don't say I'm sure it's not any of these things, then I still get loss.
So that's why this doesn't work, because even when it gets to here, and it wants to say, I think it's a motorbike, there's no payoff for it to say so, because if it's wrong, it gets killed. And the vast majority of the time, it's not anything. The vast majority of the time it's background.
And even if it's not background, it's not enough just to say it's not background; you have to say which of the 20 things it is. So for the really big things, it's fine, because that's the one-by-one grid, so it generally is a thing, and you just have to figure out which thing it is.
Or else for these small ones, generally it's not anything, so generally small ones would just prefer to be like, I've got nothing to say, no comment. So that's why this is empty, and that's why even when we do have a bus, it's using a really big grid cell to say it's a bus, because these are the only ones where it's confident enough to make a call that it's something, because the small grid cells very rarely are something.
So the trick is to try and find a different loss function, instead of binary cross-entropy loss, that doesn't look like the blue line but looks more like the green or purple line. And they actually end up suggesting the purple line. So it turns out that cross-entropy loss is -log(p_t), and focal loss is simply (1 - p_t)^gamma, where gamma is some parameter and they recommend using 2, times the cross-entropy loss.
So it's literally just a scaling of it. And so that takes you to, if you use gamma equals 2, that takes you to this purple line. So now if we say, I'm 0.6 sure that it's not a motorbike, then the loss function is like, good for you, no worries.
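If you want to see those curves yourself, here's a quick plot, with the probability of the ground truth class, p_t, on the x-axis:

```python
import numpy as np
import matplotlib.pyplot as plt

p_t = np.linspace(0.01, 1, 100)
ce = -np.log(p_t)                          # cross-entropy: -log(p_t)
for gamma in [0, 0.5, 1, 2]:               # gamma=0 recovers plain cross-entropy
    plt.plot(p_t, (1 - p_t) ** gamma * ce, label=f'gamma={gamma}')
plt.xlabel('probability of ground truth class (p_t)')
plt.ylabel('loss')
plt.legend()
plt.show()
```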
So that's what we want to do. We want to replace cross-entropy loss with focal loss. And I want to mention a couple of things about this fantastic paper. The first is that the actual contribution of this paper is to add (1 - p_t)^gamma to the front of this equation, which sounds like nothing.
But actually people had been trying to figure out this damn problem for years, and I'm not even sure they'd realized it was a problem; there was just this assumption that object detection is really hard, and you have to do all of these complex data augmentations and hard-negative mining and blah, blah, blah to get the damn thing to work.
So A, it's like this recognition of, why are we doing all of those things? And then this realization of, oh, if I do that it goes away, it's fixed. So when you come across a paper like this, which is like game-changing, you shouldn't assume that you're going to have to write 100,000 lines of code.
Very often it's one line of code, or the change of a single constant, or adding log to a single place. So let's go down to the bit where it all happens, where they describe focal loss. And I just wanted to point out a couple of terrific things about this paper.
The first is, here is their definition of cross-entropy. And if you're not able to write cross-entropy on a piece of paper right now, then you need to go back and study it, because we're going to be assuming you know what it is, what it means, why it's that, and what the shape of it looks like. Cross-entropy appears everywhere: binary cross-entropy, categorical cross-entropy, and the softmax that goes with it.
Most of the time we'll see cross-entropy written with indicator functions on y, which for the binary case comes down to -(y log(p) + (1 - y) log(1 - p)). This is a kind of awkward notation; often people will use something like a Dirac delta function, stupid stuff like that.
Or else this paper just says, you know what, it's just a conditional. Cross-entropy simply is negative log P if Y is 1, negative log 1 minus P, otherwise. So Y is 1 if it's a motorbike, 0 if not. In this paper they say 1 if it's a motorbike, or negative 1 if not.
And then they do something which mathematicians never do, they refactor, check this out. Hey, what if we replace, what if we define a new term called PT which is equal to the probability if Y is 1 or 1 minus P otherwise, if we did that we could now redefine CE as that, which is super cool, like it's such an obvious thing to do, but as soon as you do it all of the other equations get simpler as well.
Because later on, in the very next paragraph, they say, hey, one way to deal with class imbalance, i.e. lots of stuff is background, would just be to have a different weighting factor for background versus not. So for class 1 we'll have some number alpha, and for class 0 we'll have 1 minus alpha.
But then they're like, hey, let's define alpha_t the same way, and so now our cross-entropy with a weighting factor can be written like this. And so then they can write their focal loss with the same concept, and then eventually they say, hey, let's take focal loss and combine it with class weighting, like so.
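The refactor they walk through reads naturally as code. Here's a tiny sketch using y = 1 for the positive class and 0 otherwise (the paper itself uses +1/-1), with alpha and gamma as the paper's balancing and focusing parameters:

```python
import math

def p_t(p, y):      return p if y == 1 else 1 - p
def alpha_t(a, y):  return a if y == 1 else 1 - a

def ce(p, y):                              # CE(p, y) = -log(p_t)
    return -math.log(p_t(p, y))

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    # FL = alpha_t * (1 - p_t)**gamma * CE
    return alpha_t(alpha, y) * (1 - p_t(p, y)) ** gamma * ce(p, y)

# "I'm 0.6 sure it's not a motorbike": p(motorbike) = 0.4, y = 0
print(ce(0.4, 0), focal_loss(0.4, 0))      # ~0.51 versus ~0.06
```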
So often when you see huge big equations in a paper, it's just because mathematicians don't know how to refactor, and you'll see the same pieces repeated all over the place, very, very often. And by the time you've turned it into numpy code, suddenly it's super simple.
So this is a million times better than nearly any other paper. So it's a great paper to read to understand how papers should be, a terrible paper to read to understand what most papers look like. So let's try this. We're going to use this here. Now remember negative log p is the cross-entropy loss, so therefore this is just equal to some number times the cross-entropy loss.
And when I defined the binary cross-entropy loss, I don't know if you remember or if you noticed, but I had a weight which by default was None. And when you call binary cross-entropy with logits, the PyTorch function, you can optionally pass in a weight, which is something that gets multiplied by every element.
And if it's None, then there's no weighting. So since we're just going to multiply cross-entropy by something, we can just define get_weight. So here's the entirety of focal loss. This is the thing that suddenly made object detection make sense. So this was late last year, and suddenly it got rid of all of the complex, messy hackery.
And so we do our sigmoid, here's our p(t), here's our w, and here you can see 1 minus p(t)^gamma, and so we're going to set gamma of 2, alpha of 0.25. If you're wondering why, here's another excellent thing about this paper, because they tried lots of different values of gamma and alpha, and they found that 2 and 0.25 work well consistently.
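In case the notebook isn't in front of you, here's a sketch of that weighting in the same spirit (not necessarily the exact fast.ai code): sigmoid the activations, form p_t and alpha_t, and pass alpha_t * (1 - p_t)**gamma as the weight to PyTorch's binary cross-entropy with logits.

```python
import torch
import torch.nn.functional as F

def focal_bce_loss(logits, targets, gamma=2.0, alpha=0.25):
    p = torch.sigmoid(logits)
    pt = p * targets + (1 - p) * (1 - targets)                # p_t
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)   # class-balance weight
    w = (alpha_t * (1 - pt) ** gamma).detach()                # treat the weight as a constant
    return F.binary_cross_entropy_with_logits(logits, targets, weight=w)
```

Detaching the weight is a design choice in this sketch: the focal term rescales each element's loss rather than contributing gradients of its own.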
So there's our new loss function, focal loss, which derives from our BCE loss by adding a weight to it. Other than that, there's nothing else to do; we can just train our model again. And so this time, things are looking quite a bit better. We now have motorbike, bicycle, person, motorbike... it's actually having a go at finding something here.
It's still doing a good job with big ones, in fact it's looking quite a lot better. It's finding quite a few people, it's finding a couple of different birds, it's looking pretty good. So our last step is to basically figure out how to pull out just the interesting stuff out of...let's take this dog and this sofa, how do we pick out our dog and our sofa?
And the answer is incredibly simple, all we're going to do is we're going to go through every pair of these bounding boxes. And if they overlap by more than some amount, say 0.5 using jaccard, and they both are predicting the same class, we're going to assume they're the same thing.
And we're just going to pick the one with the higher p-value. And we just keep doing that repeatedly. That's really boring code, I actually didn't write it myself, I copied it off somebody else. Somebody else's code, non-maximum suppression, NMS, no reason particularly to go through it, but that's all it does.
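For what it's worth, the greedy idea being described fits in a few lines. This is a simple O(n²) sketch, not the notebook's NMS code, reusing the pairwise jaccard() helper from the loss sketch earlier: keep the highest-scoring box, drop anything of the same class that overlaps it by more than the threshold, repeat.

```python
import torch

def nms(boxes, scores, classes, thresh=0.5):
    """Greedy per-class suppression; boxes is an (n, 4) tensor of corners."""
    overlaps = jaccard(boxes, boxes)      # pairwise IoU, defined in the loss sketch above
    cls = classes.tolist()
    keep = []
    for i in scores.argsort(descending=True).tolist():
        # keep i only if it doesn't overlap an already-kept box of the same class
        if all(cls[i] != cls[j] or overlaps[i, j] <= thresh for j in keep):
            keep.append(i)
    return keep
```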
So we can now show the results of the non-maximum suppression, and here's the sofa, here's the dog, here's the bird, here's the person. This person's cigarette looks like it's like a firework or something, I don't know what's going on there. But it's fine, it's okay but not great, it's found a person in his bicycle, and a person in his bicycle with his bicycle is in the wrong place, and this person is in the wrong place.
You can also see that some of these smaller things have lower p-values than they ought to; the motorbike is just 0.16, and same with the bus. So there are still some problems, and the trick to fixing them will be to use something called feature pyramids. That's what we're going to do in lesson 14, or thereabouts, and that'll fix this up.
What I wanted to do in the last few minutes of class was to talk a little bit more about the papers, and specifically to go back to the SSD paper. So this is single shot multi-box detector, and when this came out I was very excited because it was kind of, you know, it and YOLO were like the first kind of single pass, good quality object detection methods that had come along.
And so I had kind of ignored object detection until this time, all this two-stage stuff with R-CNN, and fast R-CNN, and faster R-CNN, because there's been this continuous repetition of history in the deep learning world: things that involve multiple passes of multiple different pieces, particularly where they involve some non-deep-learning pieces, like R-CNN and fast R-CNN did, over time basically always get turned into a single end-to-end deep learning model.
So I tend to kind of ignore them until that happens, because that's the point where it's like okay, now people have figured out how to show this as a deep learning problem. As soon as people do that, they generally end up with something that's much faster and much more accurate.
And so SSD and YOLO are really important. So here's the SSD paper. Let's go down to the key piece, which is where they describe the model. And let's try and understand it. So the model is basically 1, 2, 3, 4 paragraphs. So papers are really concise, which means you kind of need to read them pretty carefully.
Partly, though, you need to know which bits to read carefully. So the bits where they say here we're going to prove the error bounds on this model, you can ignore that, because you don't care about proving the error bounds. But the bit which says here is what the model is, is the bit you need to read really carefully.
So here's the bit called Model. And so hopefully you'll find we can now read this together and understand it. SSD is a feed-forward ConvNet that creates a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes. So fixed-size, i.e. the convolutional grid times k, the different aspect ratios and zooms, and each one of those has 4+c activations. That's followed by a non-maximum suppression step to take that massive dump of predictions and turn it into just a couple of non-overlapping distinct objects.
The early layers are based on a standard architecture, so we just use ResNet. This is pretty standard, as you can kind of see this consistent theme, particularly in how the fast-ai library tries to do things, which is grab a pre-trained network that already does something, pull off the end bits, stick on a new end bit.
So: early network layers based on a standard classification architecture, truncated before the classification layers, as we always do; that happens automatically when we use ConvLearner, and we call this the base network. Some papers call that the backbone. We then add an auxiliary structure, which we call the custom head, and it has multi-scale feature maps.
So we add convolutional layers to the end of this base network, and they decrease in size progressively, so a bunch of stride 2 conv layers. That allows predictions of detections at multiple scales; the grid cells are a different size at each of these. The convolutional model is different for each feature layer, compared to YOLO, which operates on a single feature map, so YOLO is one vector, whereas we have different conv layers.
Each added feature layer gives you a fixed set of predictions using a bunch of filters. For a feature layer where the grid size is n by n with p channels (take our 4 by 4 one, say), the basic element is going to be a 3 by 3 by p kernel, which in our case produces either the 4 outputs for the box offsets or the c outputs for the score of a category.
So those are those two pieces. At each of those grid cell locations, it's going to produce an output value. And the bounding box offsets are measured relative to a default box position, which we've been calling an anchor box position, relative to the feature map location, which we've been calling the grid cell, as opposed to YOLO, which has a fully connected layer.
And then they go on to describe the default boxes: what they are for each feature map cell, or what we would call grid cell. They tile the feature map in a convolutional manner, so the position of each box relative to its grid cell is fixed. So hopefully you can see we end up with (c+4)k filters if there are k boxes at each location.
So these are similar to the anchor boxes described in the last class. So if you jump straight in and read a paper like this without knowing what problem they're solving, why they're solving it, what the nomenclature is, and so forth, those four paragraphs would probably make almost no sense.
But now that we've gone through it, you read those four paragraphs and hopefully you're thinking, "Oh, that's just what Jeremy said, only they said it better than Jeremy, in fewer words." I had the same problem: when I started reading the SSD paper and I read those four paragraphs, I didn't have much of a background in object detection at that point, because I had decided to wait until this two-stage stuff wasn't needed anymore, so I read this and I was like, "What the hell?" And so the trick is to then start reading back over the citations.
So for example, and you should go back and read this paper now, look, here's the matching strategy. And that whole matching strategy that I somehow spent an hour talking about, that's just a paragraph, but it really is all there. For each ground truth, we select from default boxes based on location, aspect ratio, and scale.
We match each ground truth to the default box with the best Jaccard overlap, and then we match default boxes to anything with a Jaccard overlap higher than 0.5. That's it. That's the one-sentence version. And then we've got the loss function, which is basically to say, take the average of the loss based on the classes plus the loss based on localization, with some weighting factor.
Now with focal loss, I found I didn't really need the weighting factor anymore. They both had about the same scale, just a coincidence perhaps. But in this case, as I started reading this, I didn't really understand exactly what L and G and all this stuff was, but it says, well, this is derived from the multibox objective.
So then I went back to the paper that defined multibox, and I found in their proposed approach, they've also got a section called training objective, also known as loss function. And here I can see it's the same notation, L, G, blah, blah, blah. And so this is where I can go back and see the detail.
And after you read a bunch of papers, you'll start to see things very quickly. For example, when you see these double bars with a subscript 2 and a superscript 2, you'll realize that's how mean squared error gets written. The double bars with the subscript 2 are the 2-norm, which is the square root of the sum of squared differences, and the superscript 2 undoes the square root, so the whole thing is just the sum of squared differences, i.e. MSE up to the averaging. Any time you see a log(c) here and a log(1 - c) there, you know that's basically a binary cross-entropy. So you're not actually going to have to read every bit of every equation.
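A quick numerical check of that notation point, if it helps:

```python
import numpy as np

x, y = np.array([1.0, 2.0, 3.0]), np.array([1.5, 1.0, 2.0])
sq_two_norm = np.sum((x - y) ** 2)                    # || x - y ||_2 ^ 2
print(sq_two_norm, np.mean((x - y) ** 2) * len(x))    # same number: MSE times n
```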
You do a bit at first, but after a while your brain just immediately knows basically what's going on. And then I say, oh, I've got a log(c) and a log(1 - c), and as expected I should have my x here and my 1 - x there.
Okay, so all the pieces are there that I would expect to see in a binary cross-entropy. So having done that, those two pieces then get combined, and oh, there's the multiplier I expected, and so now I can come back here and understand what's going on.
So we're going to be looking at a lot more papers, but maybe this week, go through the code and go through the paper and be like, what's going on? And remember, what I did to make it easier for you was I took that loss function, I copied it into a cell, and then I split it up so that each bit was in a separate cell, and then after every cell, I either printed or plotted that value.
So if I hadn't done that for you, you should do it yourself, because there's no way you can understand these functions without putting things in and seeing what comes out. So hopefully this is a good starting point. Thanks everybody, have a great week, and see you next Monday.