
Lesson 8: Deep Learning Part 2 2018 - Single object detection


Chapters

0:0 Intro
1:58 Key takeaways
2:11 Differentiable programming
3:21 Transfer learning
5:14 Architecture design
6:26 Overfitting
7:45 Categorical Data
8:50 Cutting Edge Deep Learning
14:17 How to use provided notebooks
16:50 A good time to build a deep learning box
21:34 Reading papers
30:12 Generative models
35:07 Object detection
38:32 Stage 1 classification
40:12 Logbook
42:44 pathlib
45:52 path.open
49:12 Contents
54:17 Dictionary

Transcript

Okay, welcome to Part 2 of Deep Learning for Coders. Part 1 was Practical Deep Learning for Coders, Part 2 is not impractical, but it is a little different as we'll discuss. This is probably a really dumb idea, but last year I started not starting Part 2 with Part 2 Lesson 1, but Part 2 Lesson 8 because it's kind of part of the same sequence.

I've done that again, but sometimes I'll probably forget and call things Lesson 1. So Part 2 Lesson 1 and Part 2 Lesson 8 are the same thing if I ever make that mistake. So we're going to be talking about object detection today, which refers to not just finding out what a picture is a picture of, but also whereabouts that thing is.

But in general, the idea of each lesson in this part is not so much because I particularly want you to care about object detection, but rather because I'm trying to pick topics which allow me to teach you some foundational skills that you haven't got yet. So for example, object detection is going to be all about creating much richer convolutional network structures, which have a lot more interesting stuff going on and a lot more stuff going on in the fast.ai library that we have to customize to get there.

So at the end of these 7 weeks, I can't possibly cover the hundreds of interesting things people are doing with deep learning right now, but the good news is that all of those hundreds of things, you'll see once you read the papers, are like minor tweaks on a reasonably small number of concepts.

So we covered a bunch of those concepts in Part 1, and we're going to go a lot deeper into those concepts and build on them to get some deeper concepts in Part 2. So in terms of what we covered in Part 1, there's a few key takeaways. We'll go through each of these takeaways in turn.

One is the idea -- and you might have seen recently Yann LeCun has been promoting the idea that we don't call this deep learning, but differentiable programming. And the idea is that, you'll have noticed, all the stuff we did in Part 1 was really about setting up a differentiable function and a loss function that describes how good the parameters are, and then pressing Go and it kind of makes it work.

And so I think it's quite a good way of thinking about it, differentiable programming, this idea that if you can configure a loss function that scores how good something is at doing your task, and you have a reasonably flexible neural network architecture, you're kind of done. So that's one key way of thinking about this.

This example here comes from Playground.TensorFlow.org, which is a cool website where you can play interactively with creating your own little differentiable functions manually. The second thing we learned is about transfer learning. And it's basically that transfer learning is the most important single thing to be able to do, to use deep learning effectively.

Nearly all courses, nearly all papers, nearly everything in deep learning education and research focuses on starting with random weights, which is ridiculous because you almost never would want to do that. You would only want to do that if nobody had ever trained a model on a vaguely similar set of data, with an even remotely connected kind of problem to solve to the one you're working on now, which almost never happens.

So this is where the fast.ai library and the stuff we talk about in this class is vastly different to any other library or course, it's all focused on transfer learning and it turns out that you do a lot of things quite differently. So the basic idea of transfer learning is here's a network that does thing A, remove the last layer or so, replace it with a few random layers at the end, fine-tune those layers to do thing B, taking advantage of the features that the original network learned, and then optionally fine-tune the whole thing end-to-end.
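
As a rough sketch of that recipe in plain PyTorch (this uses the torchvision API of that era; the model choice, layer name and number of classes are just illustrative assumptions, not the fast.ai implementation):

```python
import torch.nn as nn
from torchvision import models

# Start from a network that already does "thing A" (ImageNet classification).
model = models.resnet34(pretrained=True)

# Freeze the pretrained body so only the new head trains at first.
for p in model.parameters():
    p.requires_grad = False

# Replace the last layer with a fresh, randomly initialized head for "thing B".
model.fc = nn.Linear(model.fc.in_features, 10)  # 10 classes is just an example

# ...train the new head for a while, then optionally unfreeze and fine-tune end-to-end.
for p in model.parameters():
    p.requires_grad = True
```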

And you've now got something which probably uses orders of magnitude less data than if you started with random weights. It's probably a lot more accurate and probably trained a lot faster. We didn't talk a hell of a lot about architecture design in Part 1, and that's because architecture design is getting less and less interesting.

There's a pretty small range of architectures that generally work pretty well quite a lot of the time. We've been focusing on using CNNs for generally fixed size, somehow ordered data, RNNs for sequences that have some kind of state, fiddling around a tiny bit with activation functions like softmax if you've got a single categorical outcome or sigmoid if you've got multiple outcomes and so forth.

Some of the architecture design we'll be doing in this part gets more interesting, particularly this first session about object detection. But on the whole, I think we probably spend less time talking about architecture design than most courses or papers because it's generally not the hard bit in my opinion.

The third thing we looked at was how to avoid overfitting. The general idea that I tried to explain is the way I like to build a model is to first of all create something that's definitely terribly over-parameterized, will massively overfit for sure, train it and make sure it does overfit.

Because at that point you know, okay, I've got a model that is capable of reflecting the training set and then it's as simple as doing these things to then reduce that overfitting. If you don't start with something that's overfitting, then you're kind of lost. So you start with something that's overfitting and then to make it overfit less, you can add more data, you can add more data augmentation, you can do things like more batch norm layers or dense nets or various things that can handle basically less data, you can add regularization like weight decay and dropout.

And then finally, this is often the thing people do first, this should be the thing you do last, is reduce the complexity of your architecture, have less layers or less activations. We talked quite a bit about embeddings, both for NLP and the general idea of any kind of categorical data as being something you can now model with neural nets.

It's been interesting to see how since Part 1 came out, at which point there were almost no examples of papers or blogs or anything about using tabular data or categorical data in deep learning, suddenly it's kind of taken off and it's kind of everywhere. So this is becoming a more and more popular approach.

It's still little enough known that when I say to people, we use neural nets for time series and tabular data analysis, it's often kind of like, wait, really? But it's definitely not such a far out idea, and there's more and more resources available, including recent Kaggle competition winning approaches using this technique.

So Part 1, which particularly had those five messages, really was all about introducing you to best practices in deep learning. And so it's like trying to show you techniques which were mature enough that they definitely work reasonably reliably for practical real-world problems, and that I had researched and tuned enough over quite a long period of time that I could kind of say, OK, here's a sequence of steps and architectures and whatever that, if you use this, you'll almost certainly get pretty good results, and then had kind of put that into the fast.ai library in a way that you could do it pretty quickly and easily.

So that's kind of what Practical Deep Learning for Coders was designed to do. So this Part 2 is Cutting Edge Deep Learning for Coders, and what that means is I often don't know the exact best parameters, architecture details and so forth to solve a particular problem. We don't necessarily know if it's going to solve a problem well enough to be practically useful, and it almost certainly won't be integrated well enough into fast.ai or any other library that you can just press a few buttons and it will start working.

It's all about stuff which I'm not going to teach unless I'm very confident that it either is now, or will be soon, a very practically useful technique. So I don't take stuff which just appeared and which I don't know enough about to know what the trajectory is going to be.

So if I'm teaching it in this course, I'm saying it either works well in the research literature now and it's going to be well worth learning about, or we're pretty close to being there, but it's often going to take a lot of tweaking and experimenting to get it to work on your particular problem, because we don't know the details well enough to know how to make it work for every data set or every example.

So it's kind of exciting to be working at this point. It means that rather than fast.ai and PyTorch being obscure black boxes which you just know these recipes for, you're going to learn the details of them well enough that you can customize them exactly the way you want, that you can debug them, that you can read the source code of them to see what's happening and so forth.

And so if you're not pretty confident of object-oriented Python and stuff like that, then that's something you're going to want to focus on studying during this course because we assume that. I will be trying to introduce you to some tools that I think are particularly helpful like the Python Debugger, like how to use your editor to jump through the code, stuff like that.

And in general there will be a lot more detailed specific code walkthroughs, coding technique discussions and stuff like that, as well as more detailed walkthroughs of papers and stuff. And so anytime we cover one of these things, if you notice something where you're like, this is assuming some knowledge that I don't have, that's fine.

It just means that's something you could ask on the forum and say hey, Jeremy was talking about static methods in Python, I don't really know what a static method is, or why he was using it here, could someone give me some resources. These are things that are not rocket science, just because you don't happen to have come across it yet doesn't mean it's hard, it's just something you need to learn.

I will mention that as I cover these research-level topics and develop these courses, I often refer to code that academics have put up to go along with their papers, or kind of example code that somebody else has written on GitHub. I nearly always find that there's some massive critical flaw in it.

So be careful of taking code from online resources and assuming that if it doesn't work for you that you've made a mistake or something, this kind of research-level code, it's just good enough that they were able to run their particular experiments every second Tuesday. So you should be ready to do some debugging and so forth.

So in that sense, I just wanted to remind you about something from our old course wiki that we sometimes talk about, which is that people often ask what should I do after the lesson, like how do I know if I've got it, and we basically have this thing called how to use the provided notebooks, and the idea is this.

Don't open up the notebook, I know I said this in part 1 as well, but I'll say it again, then go shift, enter, shift, enter, shift, enter until a bug appears and then go to the forums and say the notebook's broken. The idea of the notebook is to kind of be like a little crutch to help you get through each step.

The idea is that you start with an empty notebook and think I now want to complete this process. And that might initially require you alt-tabbing to the notebook and reading it, figuring out what it says, but whatever you do, don't copy and paste it to your notebook. Type it out yourself.

So try to make sure you can repeat the process, and as you're typing it out, you need to be thinking, what am I typing, why am I typing it? So if you can get to the point where you can solve an object detection problem yourself in a new empty notebook, even if it's using the exact same data set we used in the course, that's a great sign that you're getting it.

That will take a while, but the idea is that with practice, the second time you try to do it, the third time you try to do it, you'll check the notebook less and less. And if there's anything in the notebook where you think, I don't know what it's doing, I hope to teach you enough techniques in this course, in this class, that you'll know how to experiment to find out what it's doing, so you shouldn't have to ask that.

But you may well want to ask, why is it doing that? That's the conceptual bit, and that's something which you may need to go to the forums and say, before this step, Jeremy had done this, after this step, Jeremy had done that, there's this bit in the middle where he does this other thing, I don't quite know why.

So you can say here are my hypotheses as to why, try and work through it as much as possible, that way you'll both be helping yourself and other people will help you fill in the gaps. If you wish, and you have the financial resources, now is a good time to build a deep learning box for yourself.

When I say a good time, I don't mean a good time in the history of the pricing of GPUs. GPUs are currently, as I say this, by far the most expensive they've ever been because of the cryptocurrency mining boom. I mean it's a good time in your study cycle.

The fact is that you might be paying somewhere between $0.60 and $0.90 an hour for doing your deep learning on a cloud provider, particularly if you're still on a K80 like an Amazon P2, or on Google Colab, which actually, if you haven't come across it, now lets you train on a K80 for free.

But those are very slow GPUs. You can buy one that's going to be like three times faster for maybe $600 or $700. You need a box to put it in, of course, but the example in the bottom right here from the forum was something that somebody put together in last year's course, so like a year ago they were able to put together a pretty decent box for a bit over $500.

Realistically, you're probably looking at more like $1,000 or $1,500. I created a new forum thread where you can talk about options and parts and ask questions and so forth. If you can afford it right now, the GTX 1080 Ti is almost certainly what you want in terms of the best price/performance mix.

If you can't afford it, a 1070 is fine. If you can't afford that, you should probably be looking for a second-hand 980 or a second-hand 970, something like that. If you can afford to spend more money, it's worth getting a second GPU so you can do what I do, which is to have one GPU training and another GPU which I'm running an interactive Jupyter notebook session in.

RAM is very useful; try and get 32GB if you can, RAM is not terribly expensive. A lot of people find that their vendor tries to sell them on one of these business-class server CPUs; that's a total waste of money. You can get one of the Intel i5 or i7 consumer CPUs far, far cheaper, and actually a lot of them are faster.

Often you'll hear CPU speed doesn't matter. If you're doing computer vision, that's definitely not true. It's very common now with these 1080TIs and so forth to find that the speed of the data augmentation is actually the slow bit that's happening on the CPU, so it's worth getting a decent CPU.

Your GPU, if it's running quickly but the hard drive's not fast enough to give it data, then that's a waste as well. So if you can afford an NVMe drive that's super, super fast, you don't have to get a big one. You can just get a little one that you just copy your current set of data onto and have some big RAID array that sits there for the rest of your data when you're not using it.

There's a slightly arcane thing about PCI lanes, which is basically like the size of the highway that connects your GPU to your computer, and a lot of people claim that you need to have 16 lanes to feed your GPU. It actually turns out, based on some analysis that I've seen recently, that that's not true; you need 8 lanes per GPU.

So again, hopefully it'll help you save some money on your motherboard. If you've never heard of PCI lanes before, trust me, by the end of putting together this box you'll be sick of hearing about them. You can buy all the parts and put it together yourself. It's not that hard, it can be a useful learning experience, it can also be kind of frustrating and annoying, so you can always go to Central Computers and they'll put it together for you; there's lots of online vendors that will do the same thing, and they'll generally make sure it turns on and runs properly, generally without much of a mark-up, so it's not a bad idea.

We're going to be doing a lot of reading papers. Basically each week we'll be implementing a paper, or a few papers, and if you haven't looked at papers before, they look something like on the left. The thing on the left is an extract from the paper that implements Adam.

You may also have seen Adam as a single Excel formula on the spreadsheet that I've written. They're the same thing. The difference is in academic papers, people love to use Greek letters, they also hate to refactor. So you'll often see like a page-long formula where when you actually look at it carefully you'll realize the same kind of sub-equation appears 8 times.

They didn't think to say above it, let t equal this sub-equation, and then just use t. I don't know why this is a thing, but I guess all this is to say, once you've read and understood a paper, you then go back to it and you look at it and you're just like, wow, how did they make such a simple thing so complicated?

Like Adam is like momentum on the gradient and momentum on the square of the gradient. That's it. And it's a big long thing. And the other reason it's a big long thing is because they have things like this where they have theorems and corollaries and stuff where they're kind of saying here's all our theoretical reasoning behind why this ought to work or whatever.
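
To make the "momentum on the gradient and momentum on the square of the gradient" point concrete, here's roughly what a single Adam step looks like once the Greek letters are stripped away (a simplified sketch with the usual default hyperparameters; t is the step count, used for bias correction):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # momentum on the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2      # momentum on the squared gradient
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # the actual update
    return w, m, v
```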

And for whatever reason, a lot of conferences and journals don't like to accept papers that don't have a lot of this theoretical justification. Geoffrey Hinton has talked about this a bit, particularly a decade or two ago when no conferences would really accept any neural network papers. Then there was this one abstract theoretical result that came out where suddenly they could show this practically unimportant but theoretically interesting thing, and then suddenly they could start submitting things to journals because they had this theoretical justification.

So academic papers are a bit weird, but in the end it's the way that the research community communicates their findings and so we need to learn to read them. But something that can be a great thing to do is to take a paper, put in the effort to understand it, and then write a blog where you explain it in code and normal English.

And lots of people who do that end up getting quite a following, end up getting some pretty great job offers and so forth because it's such a useful skill to be able to show I can understand these papers, I can implement them in code, I can explain them in English.

One thing I will mention is it's very hard to read or understand something which you can't vocalize, which means if you don't know the names of the Greek letters, it sounds weird, but it's actually very difficult to understand, remember, and take in a formula full of squiggles that appear again and again.

You need to know that this squiggle is called delta, or that squiggle is called sigma, or whatever. Just spending some time learning the names of the Greek letters sounds like a strange thing to do, but suddenly you don't look at these things anymore and go squiggle-a over squiggle-b plus some other weird squiggle that looks like a y.

They've all got names. So now that we're kind of at the cutting edge stage, a lot of the stuff we'll be learning in this class is stuff that almost nobody else knows about. So that's a great opportunity for you to be the first person to create an understandable and generalizable code library that implements it, or the first person to write a blog post that explains it in clear English, or the first person to try applying it to this slightly different area which is obviously going to work just as well, or whatever.

So when we say cutting edge research, that doesn't mean you have to come up with the next batch norm, or the next Adam, or the next dilated convolution. It can mean take this thing that was used for translation and apply it to this very similar other parallel NLP task, or take this thing that was tested on skin lesions and test it on a data set of some other kind of lesions.

That kind of stuff is a super great learning experience and incredibly useful, because to the vast majority of the world, which knows nothing about this whole field, it just looks like magic. You'll be like, hey, I've for the first time shown greater than 90% accuracy at finding this kind of lesion in this kind of data.

So when I say here experiment in your area of expertise, one of the things we particularly look for in this class is to bring in people who are pretty good at something else, pretty good at meteorology, or pretty good at de novo drug design, or pretty good at goat dairy farming, or whatever, these are all examples of people we've had in the class.

So probably the thing you can do the best would be to take that thing you're already pretty good at and add on these new skills, because otherwise if you're trying to go into some different domain, you're going to have to figure out how do I get data for that domain, how do I know what are the problems to solve in that domain, and so forth.

Whereas often it will seem pretty trivial to you to take this technique applied to this data set that you've already got sitting on your hard drive, but that's often going to be a super interesting thing for the rest of the world to see like oh, that's interesting when you apply it to meteorology data and use this RNN or whatever, suddenly it allows you to forecast over larger areas or longer time periods.

So communicating what you're doing is super helpful, we've talked about that before. When somebody has written a blog, people in the forums will often ask them, how did you get up the guts to do that, or what point in the process had you got to before you decided to start publishing something, and the answer is always the same.

It's always just, I was sure I wasn't good enough to do it, I felt terrified and intimidated about doing it, but I wrote it and posted it anyway. I don't think there's ever a time any of us stops feeling like a total fraud and imposter, but we all know more about what we're doing than we did six months ago.

And there's somebody else in the world who knows as much as you did six months ago, so if you write something now that would have helped you six months ago, you're helping some people. Honestly, if you wait another six months, then the you of 12 months ago probably won't even understand it anymore because it's too advanced.

It's great to communicate wherever you're up to in a way that you think would be helpful to the person you were before you knew that thing. And of course something that the forums have been useful for is getting feedback about drafts and if you post a draft of something that you're thinking of releasing, then other folks here can point out things that they find unclear or they think need some corrections.

So the kind of overarching theme of Part 2 I've described as generative models, but unfortunately then Rachel asked me this afternoon exactly what I meant by generative models, and I realized I don't really know. So what I really mean is in Part 1, the output of our neural networks was generally like a number or a category, whereas the outputs of a lot of the stuff in Part 2 are going to be like a whole lot of things, like the top left and bottom right location of every object in an image along with what the object is, or a complete picture with a class of every single pixel in that picture, or an enhanced super-resolution version of the input image, or the entire original input paragraph translated into French.

It's kind of like, often it just requires some different ways of thinking about things and some kind of different architectures and so forth, and so that's kind of like I guess the main theme of the kind of techniques we'll be looking at. The vast majority, possibly all, of the data we'll be looking at will be either text or image data.

It would be fairly trivial to do most of these things with audio as well, it's just not something I've spent much time on myself yet. Somebody asked on the forum whether we can do more stuff with time series and tabular data, and my answer was, I've already taught you everything I know about that and I'm not sure there's much else to say, particularly if you check out the machine learning course, which goes into a lot of that in a lot more detail.

I don't feel like there's more stuff to tell you, I think that's a super-important area, but I think we're done with that. We'll be looking at some larger data sets, both in terms of the number of objects in the data set and the size of each of those objects.

For those of you that are working with limited computational resources, please don't let that put you off, feel free to replace it with something smaller and simpler. In fact, when I was designing this course, I did quite a lot of it in Australia when I went to visit my mum, and my mum decided to book a nice holiday house for us with fast WiFi.

We turned up at the holiday house with fast WiFi, and indeed it did have WiFi, and it was fast, but the WiFi was not connected to the internet. So I called up the agent and I said, "I found the ADSL router and it's got an ADSL thing plugged in, and I followed the cable down, and the other end of the cable has nothing to plug into." So she called the people renting the house and the owner and called me back the next day, and she said, "Actually, Point Leo has no internet." The good old Australian government had decided to replace ADSL in Point Leo with the new national broadband network, and therefore they had disconnected the ADSL but had not yet connected the national broadband network.

So we had fast WiFi, which we could use to Skype chat from one side of the house to the other, but I had no internet. Luckily, I did have a new Surface Book 15-inch, which has a GTX 1070 in it, and so I wrote a large amount of this course entirely on my laptop, which means I had to practice with relatively small resources, I mean not tiny, but 16GB RAM and 6GB GPU, and it was all in Windows by the way.

So I can tell you that pretty much all of this course works well on Windows, on a laptop. So you can always use smaller batch sizes, you could use a cut-down version of the dataset, whatever. So if you have the resources, you'll get better results if you can use the bigger datasets when they're available.

Now's a good time, I think, to take a somewhat early break so we can fix the forums. So the forums are still down. Okay, let's come back at 7:25. So let's start talking about object detection, and so here is an example of object detection. So hopefully you'll see two main differences from what we're used to when it comes to classification.

The first is that we have multiple things that we're classifying, which is not unheard of. We did that in the Planet Satellite data, for example, but what is kind of unheard of is that as well as saying what we see, we've also got what's called bounding boxes around what we see.

A bounding box has a very specific definition, which is it's a box, it's a rectangle, and the rectangle has the object entirely fitting within it, but it's no bigger than it has to be. You'll see this bounding box is perhaps, for the horse at least, slightly imperfect in that it looks like there's a bit of tail here.

So it probably should be a bit wider, and maybe there's even a little bit of hoof here, maybe it should be a bit longer. So the bounding boxes won't be perfect, but they're generally pretty good in most data sets that you can find. So our job will be to take data that has been labeled in this way and, on data that is unlabeled, generate the classes of the objects and a bounding box for each one of them.

One thing I'll note to start with is that labeling this kind of data is generally more expensive. It's generally quicker to say horse, person, person, horse, car, dog, jumbo jet, than it is to say if there's a whole horse race going on to label the exact location of every rider and every horse.

And then of course it also depends on what classes do you want to label, do you want to label every fence post or whatever. So generally always, just like in ImageNet, it's not like tell me any object you see in this picture. In ImageNet it's like here are the 1000 classes that we ask you to look for, tell us which one of those 1000 classes you find, just tell me one thing.

For these object detection data sets, it is a list of object classes that we want you to tell us about and find every single one of them of any type in the picture along with where they are. So in this case, why isn't there a tree or jump labeled?

That's because for this particular data set they weren't one of the classes that the annotators were asked to find and therefore were not part of this particular problem. So that's kind of the specification of the object detection problem. So let me describe stage 1. And stage 1 is actually going to be surprisingly straightforward.

And we're going to start at the top and work down. We're going to start out by classifying the largest object in each image. So we're going to try and say person, actually this one is wrong, dog is not the largest object, sofa is the largest object. So here's an example of a misclassified one, bird, correct, person, correct.

That will be the first thing we try to do, that's not going to require anything new, so it'll just be a bit of a warm-up for us. The second thing will be to tell us the location of the largest object in each image. Again here, this is actually incorrect, it should have labeled the sofa, but you can see where it's coming from.

And then finally we will try and do both at the same time, which is to label what it is and where it is for the largest thing in the picture. And this is going to be relatively straightforward, actually, so it will be a good warm-up to get us going again.

But what I'm going to do is I'm going to use it as an opportunity to show you some useful coding techniques, really, and a couple of little fast.ai handy details before we then get on to multi-label classification and then multiple object classification. So let's start here. The notebook that we're using is Pascal notebook, and all of the notebooks are in the DL2 folder.

One thing you'll see in some of my notebooks is torch.cuda.set_device; you may have even seen it in the last part, just in case you're wondering why that's there. I have four GPUs on the university server that I use, and so I can put a number from 0 to 3 in here to pick one.

This is how I prefer to use multiple GPUs, rather than running a single model across multiple GPUs, which doesn't always speed it up that much and is kind of awkward. I generally like to have different GPUs running different things, so in this case I was running something in this notebook on device 1 and doing something else in another notebook on device 2.
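
For reference, the line in question is just this (pick whichever GPU index you want that notebook to use):

```python
import torch
torch.cuda.set_device(1)  # use GPU 1; change to 0 (or delete the line) if you only have one GPU
```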

Now obviously if you see this in a notebook left behind, that was a mistake. If you don't have more than one GPU, you're going to get an error, so you can just change it to 0 or delete that line entirely. So there's a number of standard object detection datasets, just like ImageNet is a standard object classification dataset, and kind of the old classic ImageNet equivalent, if you like, is Pascal VOC, Visual Object Classes.

The actual main website for it is like, I don't know, it's running on somebody's coffee warmer or something, it goes down all the time every time he makes coffee. So some folks have mirrored it, which is very kind of them, so you might find it easier to grab from the mirror.

You'll see when you download it that there's a 2007 dataset and a 2012 dataset; there basically were academic competitions in those different years, just like the ImageNet dataset we tend to use is actually the ImageNet 2012 competition dataset. We'll be using the 2007 version in this particular notebook. Feel free to use the 2012 instead, it's a bit bigger, you might get better results.

A lot of people, in fact most people now in research papers actually combine the two. You do have to be careful because there's some leakage between the validation sets between the two, so if you do decide to do that, make sure you do some reading about the dataset to make sure you know how to combine them correctly.

The first thing you'll notice in terms of coding here is this, we haven't used this before. I'm going to be using this all the time now. This is part of the Python 3 standard library called pathlib, and it's super handy. It basically gives you an object-oriented access to a directory or a file.

So you can see, if I go path.something, there's lots of things I can do. One of them is iterdir; however, path.iterdir() returns a generator. You've basically come across generators by now, because we did quite a lot of stuff that used them behind the scenes without talking about them too much, but basically a generator is something in Python 3 which you can iterate over.

So basically you could go "for o in that: print(o)", for instance, or of course you could do the same thing as a list comprehension, or you can just stick list() around it to turn the generator into a list. Any time you see me put list around something, that's normally because it returned a generator.

It's not particularly interesting. The reason that things generally return generators is that what if the directory had 10 million items in? You don't necessarily want a 10 million long list, so with a for loop, you'll just grab one, do the thing, throw it away, grab a second, throw it away.
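
As a small sketch of what that looks like (the data directory name here is just a placeholder):

```python
from pathlib import Path

PATH = Path('data/pascal')            # hypothetical data directory
g = PATH.iterdir()                    # returns a generator, not a list

for o in g:                           # iterate over it one item at a time...
    print(o)

files = list(PATH.iterdir())          # ...or wrap it in list() to get everything at once
files2 = [o for o in PATH.iterdir()]  # or do the same thing as a list comprehension
```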

You'll see that the things it's returning aren't actually strings, but some kind of object. If you're using Windows, it'll be a WindowsPath; on Linux it'll be a PosixPath. Most of the time you can use them as if they were strings; if you pass one to any of the os.path.whatever functions in Python, it'll just work.

But with some external libraries it won't work, and that's fine. So let's just grab one of these. In general, you can change data types in Python just by naming the data type that you want and treating it like a function, and that will cast it.

So anytime you try to use one of these pathlib objects and you pass it to something which says, I was expecting a string, this is not a string, that's how you do it. So you'll see there's quite a lot of convenient things you can do. One kind of fun thing is that the slash operator here is not divided-by, it's a path separator.

So they've overridden the slash operator in Python so that it works like this: you can say path/whatever, and you'll notice that it's not inside a string. So this is actually applying not the division operator but the overridden slash operator, which means get a child of that path. And you'll see if you run that, it doesn't return a string, it returns a pathlib object.
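
A quick sketch of those two conveniences (the directory and file names are just placeholders):

```python
from pathlib import Path

PATH = Path('data/pascal')
p = PATH / 'pascal_train2007.json'  # the overridden slash operator returns another Path, not a string
print(type(p))                      # PosixPath on Linux, WindowsPath on Windows
s = str(p)                          # cast it to a plain string for libraries that insist on one
```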

And so one of the things a pathlib object can do is it has an open method, so it's actually pretty cool once you start getting the hang of it. And you'll also find that the open method takes all the kind of arguments you're familiar with, you can say write, or binary, or encoding, or whatever.

So in this case, I want to load up these JSON files, which contain not the images but the bounding boxes and the classes of the objects. And in Python, the easiest way to do that is with the json library; there are some faster, API-equivalent versions, but this is pretty small so you won't need them.

So you go json.load, and you pass it an open file object, and the easy way to do that, since we're using pathlib, is just to go path.open. As for these JSON files that we're going to look inside in a moment: if you haven't used it before, JSON is JavaScript Object Notation, and it's kind of the most standard way to pass around hierarchical structured data now, obviously not just with JavaScript.
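
So loading one of the annotation files looks something like this (the exact file name depends on which JSON files you downloaded, so treat it as a placeholder):

```python
import json
from pathlib import Path

PATH = Path('data/pascal')
trn_j = json.load((PATH / 'pascal_train2007.json').open())  # json.load takes an open file object
print(trn_j.keys())  # the top level is just a dictionary
```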

You'll see I've got some JSON files in here, they actually did not come from the mirror I mentioned. The original Pascal annotations were in XML format, but cool kids can't use XML anymore, we have to use JSON, so somebody's converted them all to JSON, and so you'll find the second link here has all the JSON files.

So if you just pop them in the same location that I've put them here, everything will work for you. So these annotation files, JSONs, basically contain a dictionary. Once you open up the JSON, it becomes a Python dictionary, and they've got a few different things in. The first is we can look at images, it's got a list of all of the images, how big they are, and a unique ID for each one.

One thing you'll notice here is I've taken the string 'images' and put it inside a constant. That seems kind of weird, but if you're using a notebook or any kind of IDE, it means I can tab-complete all of my strings and I won't accidentally type them slightly wrong, so that's just a handy trick.
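
The trick is just something like this (the particular constant names are illustrative; trn_j is the dictionary loaded from the JSON file above):

```python
# Keep the JSON's string keys in constants so the editor can tab-complete them
# and a typo shows up as a NameError rather than a silent KeyError.
IMAGES, ANNOTATIONS, CATEGORIES = 'images', 'annotations', 'categories'
FILE_NAME, ID, IMG_ID, CAT_ID, BBOX = 'file_name', 'id', 'image_id', 'category_id', 'bbox'

first_images = trn_j[IMAGES][:3]  # instead of typing trn_j['images'][:3] by hand
```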

So here's the contents of the first few things in the images. More interestingly, here are some of the annotations. So you'll see basically an annotation contains a bounding box, and the bounding box tells you the column and row of the top left, and its height and width. And then it tells you that that particular bounding box is for this particular image, so you'd have to join that up over here to find it's actually O12.jpg.

And it's of category ID 7. Also some of them at least have a polygon segmentation, not just a bounding box. We're not going to be using that. Some of them have an ignore flag, so we'll ignore the ignore flags. Some of them have something telling you it's a crowd of that object, not just one of them.

So that's what these annotations look like. So then you saw here there's a category ID, so then we can look at the categories, and here's a few examples; basically each ID has a name, there we go. So what I did then was turn this category list into a dictionary from ID to name, I created a dictionary from image ID to file name, and I created a list of all of the image IDs, just to make life easier.
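
In code, that tidying-up step is just a few comprehensions, something like this (assuming trn_j and the key constants from the sketches above):

```python
# category id -> category name, image id -> file name, plus a plain list of image ids
cats    = {o[ID]: o['name']    for o in trn_j[CATEGORIES]}
trn_fns = {o[ID]: o[FILE_NAME] for o in trn_j[IMAGES]}
trn_ids = [o[ID]               for o in trn_j[IMAGES]]
```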

So generally when you're working with a new dataset, I try to make it look the way I would want it to if I kind of designed that dataset, so I just kind of do a quick bit of manipulation. And so the steps you see here, and you'll see in each class, are basically the sequence of steps I took as I started working with this new dataset, except without the thousands of screw-ups that I did.

I find the one thing people most comment on when they see me working in real time, having seen my classes, is like "wow, you actually don't know what you're doing, do you?" It's like 99% of the things I do don't work, and then the small percentage of the things that do work end up here.

So I mentioned that because machine learning, and particularly deep learning, is kind of incredibly frustrating, because in theory you just define the correct loss function and a flexible enough architecture, and you press train and you're done. But if that was actually all it took, then nothing would take any time, and the problem is that all the steps along the way, until it works, it doesn't work.

Like it goes straight to infinity, or it crashes with an incorrect tensor size, or whatever. And I will endeavor to show you some debugging techniques as we go, but it's one of the hardest things to teach because I don't know, maybe I just haven't quite figured it out yet.

The main thing it requires is tenacity. I find the biggest difference between the people I've worked with who are super effective and the ones who don't seem to go very far has never been about intellect, it's always been about sticking with it, basically never giving up. It's particularly important with this deep learning stuff because you don't get that continuous reward cycle.

With normal programming, you've got like 12 things to do until you've got your Flask endpoint stood up. You know where you are at each stage: okay, we've successfully parsed the JSON, and now we've successfully got the callback from that promise, and now we've successfully created the authentication system. It's this constant sequence of stuff that works, whereas generally with training a model, it's a constant stream of "it doesn't work, it doesn't work, it doesn't work" until eventually it does.

So it's kind of annoying. So let's now look at the images. You'll find inside the VOC devkit there's 2007 and 2012 directories, and in there there's a whole bunch of stuff, mainly these XML files; the one we care about is the JPEG images, and so again here you've got the pathlib slash operator, and inside there's a few examples of images.

So what I wanted to do was to create a dictionary where the key was the image ID, and the value was a list of all of its annotations. So basically what I wanted to do was go through each of the annotations that doesn't say to ignore it, and append it, the bounding box and the class, to the appropriate dictionary item where that dictionary item is a list.

But the annoying thing is if that dictionary item doesn't exist yet, then there's no list to append to. So one super handy trick in Python is that there's a class called collections.defaultdict, which is just like a dictionary, but if you try and access a key that doesn't exist, it magically makes itself exist and sets itself equal to the return value of this function.

Now this could be the name of some function that you've defined, or it can be a lambda function. A lambda function simply means a function that you define in place. We'll be seeing lots of them. So here's an example of one: all the arguments to the function are listed on the left of the colon, and in this case there are no arguments.

And lambda functions are special in that you don't have to write return; the return is assumed. So in this case, this is a lambda function that takes no arguments and returns an empty list. So in other words, every time I try to access something in the training annotations dictionary that doesn't exist, it now does exist and it's an empty list, which means I can append to it.
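
Here's the general pattern as a minimal sketch, reusing trn_j and the key constants from earlier (the coordinate conversion discussed a little later would also happen inside this loop):

```python
import collections

trn_anno = collections.defaultdict(lambda: [])  # missing keys spring into existence as empty lists

for o in trn_j[ANNOTATIONS]:
    if not o.get('ignore', 0):                  # skip anything flagged as ignore
        trn_anno[o[IMG_ID]].append((o[BBOX], o[CAT_ID]))  # safe to append even on the first hit
```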

One comment on variable naming is when I read through these notebooks, I'll generally try and speak out the English words that the variable name is a noun for. A reasonable question would be why didn't I write the full name of the variable in English rather than using a short mnemonic.

It's a personal preference I have based on a number of programming communities where the basic thesis is that the more that you can see in a single eye grab of the screen, the more you can understand intuitively at one go. Every time your eye has to jump around, it's kind of like a context change that reduces your understanding.

It's a style of programming I found super helpful, and so generally speaking I particularly try to reduce the vertical height, so things don't scroll off the screen, but I also try to reduce the size of things so that there's a mnemonic there which if you know it's training annotations, it doesn't take long for you to see training annotations, but you don't have to write the whole thing in.

So I'm not saying you have to do it this way, I'm just saying there's some very large programming communities, some of which have been around for 50 or 60 years which have used this approach and I find it works well. It's interesting to compare, I guess my philosophy is somewhere between math and Java.

In math, everything's a single character. The same single character can be used in the same paper for five different things, and depending on whether it's in italics or boldface or capitals, it's another five different things. I find that less than ideal. In Java, variable names sometimes require a few pages to print out, and I find that less than ideal as well.

So for me, I personally like names which are short enough to not take too much of my perception to see at once, but long enough to have a mnemonic. Also, however, a lot of the time the variable will be describing a mathematical object as it exists in a paper, and there isn't really an English name for it, and so in those cases I will use the same, often single letter that the paper uses.

And so if you see something called delta or A or something, and it's like something inside an equation from a paper, I generally try to use the same thing just to explain that. By no means do you have to do the same thing. I will say, however, if you contribute to fast.ai, I'm not particularly fastidious about coding style or whatever, but if you write things more like the way I do than the way Java people do, I'll certainly appreciate it.

So by the end of this we now have a dictionary from file names to a tuple, and so here's an example of looking at that dictionary and we get back a bounding box and a class. You'll see when I create the bounding box, I've done a couple of things.

The first is I've switched the x and y coordinates. The reason for this, and I think we mentioned this briefly in the last course, is that in the computer vision world, when you say my screen is 640x480, that's width by height; whereas in the math world, when you say my array is 640x480, it's rows by columns, i.e. height by width.

So you'll see that a lot of things like PIL, or Pillow, the Python Imaging Library, tend to do things in this kind of width-by-height, columns-by-rows way, while NumPy is the opposite way around. My view is don't put up with this kind of incredibly annoying inconsistency, fix it.

So I've decided that for fast.ai the NumPy/PyTorch way is the right way, so I'm always rows by columns. So you'll see here I've switched to rows by columns. I've also decided that we're going to describe a bounding box by its top-left coordinate and its bottom-right coordinate, rather than by the x, y and the height and width.
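
Here's my reconstruction of that conversion as a small sketch (VOC gives you [column, row, width, height]; we want [top-left row, top-left column, bottom-right row, bottom-right column]):

```python
import numpy as np

def hw_bb(bb):
    # bb = [col, row, width, height]  ->  [top_row, left_col, bottom_row, right_col]
    return np.array([bb[1], bb[0], bb[3] + bb[1] - 1, bb[2] + bb[0] - 1])
```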

So you'll see here I'm just converting the height and width to the top left and bottom right. So again, I often find dealing with junior programmers, in particular junior data scientists, that they get given data sets that are in shitty formats with crappy APIs, and they just act as if everything has to be that way.

But your life will be much easier if you take a couple of moments to make things consistent and make them the way you want them to be. So earlier on I took all of our classes and created a categories list, and so if we look up category number 7, which is what this is, category number 7 is a car.

Let's have a look at another example. Image number 17 has two bounding boxes, one of them is type 15, one is type 13, that is a person and a horse. So this will be much easier to understand if we can see a picture of these things. So let's create some pictures.

So having just turned our height-width stuff into top-left, bottom-right stuff, we're now going to create a function to do the exact opposite, because any time I want to call some library that expects the opposite format, I'm going to need to pass it the opposite. So here is something that converts a bounding box back to height-width: bb_hw, bounding box to height-width.
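
And the inverse, again as a sketch consistent with the description:

```python
import numpy as np

def bb_hw(a):
    # a = [top_row, left_col, bottom_row, right_col]  ->  [col, row, width, height]
    return np.array([a[1], a[0], a[3] - a[1] + 1, a[2] - a[0] + 1])
```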

So it's again reversing the order and giving us the height and width. So we can now open an image in order to display it, and where we're going to get to is we're going to get it to show this, that's that car. So one thing that I often get asked on the forums or through GitHub is, well, how did I find out about this open_image thing?

Where did it come from, what does it mean, who uses it. And so I wanted to take a moment because one of the things you're going to be doing a lot, and I know a lot of you aren't professional coders, you have backgrounds in statistics or meteorology or physics or whatever, and I apologize for those of you who are professional coders, you know all this already.

Because we're going to be doing a lot of stuff with the fastai library and other libraries, you need to be able to navigate very quickly through them. And so let me give you a quick overview of how to navigate through code, and for those of you who haven't used an editor properly before, this is going to blow your minds.

For those of you that have, you're going to be like, check this out guys, check this out. For the demo I'm going to show you in Visual Studio Code, personally my view is that on pretty much every platform, unless you're prepared to put in the decades of your life to learn Vim or Emacs well, Visual Studio Code is probably the best editor out there.

It's free, it's open source, there are other perfectly good ones as well. So if you download a recent version of Anaconda, it will offer to install Visual Studio Code for you, it integrates with Anaconda, sets it up with your Python interpreter and comes with the Python extensions and everything.

So it's a good choice if you're not sure. If you've got some other editor you like, search for the right keywords for the help. So if I fire up Visual Studio Code, the first thing to do of course is do a git clone of the fastai library to your laptop.

You'll find in the root of the repo, as well as the environment.yml file that sets up the Anaconda environment for GPU, an environment-cpu.yml file that one of the students has been kind enough to create, and perhaps one of you that knows how to do this can add some notes to the wiki, but basically you can use that to create a local CPU-only fastai installation.

The reason you might want to do that is so that as you navigate the code, you'll be able to navigate into PyTorch, you'll see all the stuff is there. So I opened up Visual Studio Code, and it's as simple as saying open folder, and then you can just point it at the fastai github folder that you just downloaded.

And so the next thing you need to do is to set up Visual Studio Code to say I want to use the fastai conda environment, please. So the way you do that is with the select interpreter command, and there's a really nice idea which is kind of like the best of both worlds between a command-line interface and a GUI, which is this is the only command you need to know, Ctrl+Shift+P.

You hit Ctrl+Shift+P, and then you start typing what you want to do and watch what happens. I want to change my interpreter into... okay, and it appears. If you're not sure, you can kind of try a few different things. So here we are, Python: Select Interpreter, and you can see generally you can type stuff in and it will give you a list of things to pick from.

And so here's a list of all of the environments and interpreters I have set up, and here's my fastai environment. So that's basically the only setup that you have to do. The only other thing you might want to do is to know there's an integrated terminal, so if you hit Ctrl+Backtick, it brings up the terminal.

And the first time you do it, it will ask you which terminal you want. If you're on Windows, it will be like PowerShell or Command Prompt or Bash. If you're on Linux and you've got multiple shells installed, it will ask. So as you can see, I've got it set up to use Bash.

And you'll see it automatically goes to the directory that I'm in. So the main thing we want to do right now is find out what open_image is. So the only thing you need to know to do that is Ctrl+T. If you hit Ctrl+T, you can now type the name of a class, a function, pretty much anything, and you can find out about it.

So open_image, you can see it appears. And it's kind of cool: if there's something that's camel-cased or has underscores in it, you can just type the first few letters of each bit. I do that and it's found the function; it's also found some other things that match.

There it is. So that's kind of a good way you can see exactly where it's come from and you can find out exactly what it is. And then the next thing I guess would be like, well, what's it used for? So if it's used inside fast.ai, you could say find all references, which is Shift+F12. So open_image, Shift+F12, and it brings up something saying, oh, it's used twice in this code base, and I can go and have a look at each of those examples.

If it's used in multiple different files, it will tell you all of the different files that it's used in. Another thing that's really handy is that as you look at the code, you'll find that certain bits of the code call other parts of the code. So for example, if you're inside FilesDataset and you're like, oh, this is calling something called open_image, what is that?

You can wave your pointer over it and it will give you the docstring. Or you can hit F12, and it jumps straight to its definition. So it's easy to get a bit lost as things call things which call other things, and if you have to manually go to each bit, it's infuriating, whereas this way it's always one button away.

Ctrl+T to go to something that you specifically know the name of, or F12 to jump to the definition of something that you're clicking on. When you're done, you probably want to go back to where you came from, so Alt+Left takes you back to where you were.

So whatever you use, Vim, Emacs, Atom, whatever, they all have this functionality as long as you have an appropriate extension installed. If you use PyCharm, you get that out of the box; it doesn't need any extensions because it's built for Python. Whatever you're using, you want to know how to do this stuff.

Finally I'll mention there's a nice thing called Zen mode, Ctrl+K Z, which basically gets rid of everything else so you can focus, but it does keep this nice little thing on the right-hand side which shows you where you are. That's something that you should practice with during the week if you haven't played around with it before, because we're increasingly going to be digging deeper and deeper into the fast.ai and PyTorch libraries.

As I say, if you're already a professional coder and know all this stuff, apologies for telling you stuff you already know. So we're going to -- well actually, since we did that, let's just talk about open_image. You'll see that we're using cv2; cv2 is actually the OpenCV library.

You might wonder why we're using OpenCV, and I want to explain some of the internals of fast.ai to you because some of them are kind of interesting and might be helpful to you. Torchvision, the standard PyTorch vision library, actually uses PyTorch tensors for all of its data augmentation and stuff like that.

A lot of people use Pillow, the standard Python imaging library. I did a lot of testing of all of these. I found OpenCV was about 5 to 10 times faster than TorchVision, so early on I actually teamed up with one of the students from an earlier class to do the Planet Lab satellite competition back when that was on, and we used TorchVision.

Because it was so slow, we could only get 25% GPU utilization because we were doing a lot of data augmentation. So then I used the Profiler to find out what was going on and realized it was all in TorchVision. Pillow or PIL is quite a bit faster, but it's not as fast as OpenCV, and also is not nearly as thread-safe.

So I actually talked to the guy who developed it. Python has this thing called the global interpreter lock, the GIL, which basically means that two threads can't do Pythonic things at the same time. It makes Python a really shitty language for modern programming, but we're stuck with it.

So I spoke to the guy on Twitter who actually made it so that OpenCV releases the GIL. So one of the reasons the fast.ai library is so amazingly fast is that we don't use multiple processes for our data augmentation like every other library does, we actually use multiple threads.

And the reason we can do multiple threads is because we use OpenCV. Unfortunately, OpenCV has a really shitty API; it's kind of inscrutable, and a lot of what it does is poorly documented. When I say poorly documented, I mean it's documented in really obtuse kinds of ways. So that's why I try to make it so that no one using fast.ai needs to know that it's using OpenCV.

If you want to open an image, do you really need to know that you have to pass these flags to open to actually make it work? Do you actually need to know that if the reading fails it doesn't show an exception, it just silently returns none? It's these kinds of things that we try to do to actually make it work nicely.

But as you start to dig into it, you'll find yourself in these places and you'll want to know why. And I mentioned this in particular to say don't start using PyTorch for your data augmentation, don't start bringing in Pillow, you'll find suddenly things slow down horribly or the multithreading won't work anymore, try to stick to using OpenCV for your processing.

So we've got our image; we're just going to use it to demonstrate things as we work through the Pascal dataset. And so the next thing I wanted to show you in terms of important coding stuff we're going to be using throughout this course is using matplotlib a lot better. Matplotlib is so named because it was originally a clone of Matlab's plotting library.

Unfortunately, Matlab's plotting library is awful, but at the time it was what everybody knew. So at some point, the Matplotlib folks realized that the Matlab plotting library is awful, so they added a second API to it which was an object-oriented API. Unfortunately, because nobody who originally learned Matplotlib learned the OO API, they then taught the next generation of people the old Matlab-style API, and now there's basically no examples or tutorials online I'm aware of that use the much, much better, easier to understand, simpler OO API.

So one of the things I'm going to try and show you, because plotting is so important in deep learning, is how to use this API, and I've discovered some simple little tricks. One simple little trick is that plt.subplots is just a super handy wrapper I'm going to use a lot, and what it does is return two things.

One of them, the figure, you often don't need to care about; the other is an axes object, and basically anywhere you used to say plt.something, you now say ax.something, and it will do that plotting on that particular subplot. So a lot of the time you'll use this, or I'll use this during this course, to plot multiple plots that we can compare next to each other, but even in this case I'm creating a single plot.

But it's just nice to only know one thing rather than lots of things, so regardless of whether you're doing one plot or lots of plots, I always start now with plt.subplots. And the nice thing is that this way I can pass in an axes object if I want to plot into a figure I've already created, or if it hasn't been passed in, I can create one.

So this is also a nice way to make your matplotlib functions really versatile, and you'll see this used throughout the course. So now rather than plt.imshow, it's ax.imshow. And then rather than the kind of weird stateful settings in the old-style API, you can use OO calls: ax.get_xaxis() returns an object, .set_visible() sets a property, it's all pretty normal, straightforward stuff.
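
To make that concrete, here is a minimal sketch of the pattern just described, not the exact notebook code; the function name show_image and the random placeholder image are purely illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def show_image(im, figsize=None, ax=None):
    # If no axes was passed in, create one with plt.subplots; either way,
    # return the axes so the caller can keep drawing on the same subplot.
    if ax is None:
        fig, ax = plt.subplots(figsize=figsize)
    ax.imshow(im)
    ax.get_xaxis().set_visible(False)   # OO style: get the object, set the property
    ax.get_yaxis().set_visible(False)
    return ax

ax = show_image(np.random.rand(224, 224, 3))   # placeholder image
```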

So once you start getting the hang of a small number of these OO matplotlib things, hopefully you'll find life a lot easier, so I'm going to show you a few right now. So let me show you what I think is a cool example. One thing that kind of drives me crazy with people putting text on images, whether it be subtitles on TV or people doing stuff with computer vision, is that it's like white text on a white background or black text on a dark background, and you can't read it.

And so a really simple thing that I like to do every time I draw on an image is to either make my text and boxes white with a little black border, or vice versa. And so here's a cool little thing you can do in matplotlib: you can take a matplotlib plotting object, call set_path_effects, and say add a black stroke around it.

And you can see that when you draw that, it doesn't matter that here it's white on a white background or here it's on a black background, it's equally visible. And I know it's a simple little thing, but it kind of just makes life so much better when you can actually see your bounding boxes and actually read the text.

So you can see, rather than just saying add a rectangle, I get the object that it creates and then pass that object to draw_outline. Now everything I draw is going to get this nice path effect on it. You can see matplotlib is a perfectly convenient way of drawing stuff.

So when I want to draw a rectangle, matplotlib calls that a patch, and then you can pass in all different kinds of patches. So here's -- again, rather than having to remember all that every time, please stick it in a function. And now you can use that function every time.

You don't have to put it in a library somewhere, I always put lots of functions inside my notebook. If I use it in like three notebooks, then I know it's useful enough that I'll stick it in a separate library. You can draw text, and notice all of these take an axis object, so this is always going to be added to whatever thing I want to add it to.

So I can add text and draw an outline around it. So having done all that, I can now take my show_image, and notice that if you didn't pass it an axes, it returns the axes it created. So show_image returns the axes that the image is on, I then turn this particular image's bounding box into height and width, and then I can draw the rectangle and draw the text in the top left corner.

So remember the bounding box x and y are the first two coordinates, so b[:2] is the top left. And remember the tuple contains two things, the bounding box and then the class, so this is the class, and to get the text of it I just pass it into my categories list, and there we go.
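
A hedged sketch of those outlined-drawing helpers; the names draw_outline, draw_rect and draw_text follow the lecture, but treat the exact signatures and the white-with-black-stroke styling choices as illustrative.

```python
from matplotlib import patches, patheffects

def draw_outline(o, lw):
    # Add a black stroke behind whatever artist we just drew, so white text
    # and boxes stay readable on light backgrounds (and vice versa).
    o.set_path_effects([patheffects.Stroke(linewidth=lw, foreground='black'),
                        patheffects.Normal()])

def draw_rect(ax, b):
    # b is [x, y, width, height]; Rectangle is the matplotlib "patch" for boxes.
    patch = ax.add_patch(patches.Rectangle(b[:2], *b[-2:], fill=False,
                                           edgecolor='white', lw=2))
    draw_outline(patch, 4)

def draw_text(ax, xy, txt, sz=14):
    text = ax.text(*xy, txt, verticalalignment='top',
                   color='white', fontsize=sz, weight='bold')
    draw_outline(text, 1)
```

Because every helper takes an axes object, they can be pointed at whichever subplot you happen to be building.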

So now that I've kind of got all that set up, I can use that for all of my object detection stuff from here on. What I really want to do though is to kind of package all that up, so here it is, packaging it all up, so here's something that draws an image with some annotations, so it shows the image, it goes through each annotation, turns it into height and width, draws the rectangle, draws the text.

If you haven't seen this before: each annotation, remember, contains a bounding box and a class, so rather than writing for o in ann and then indexing o[0] and o[1], I can destructure it. If you put two names on the left, that's going to put the two parts of the tuple or list into those two things, super handy.

So for the bounding box and the class in the annotations, go ahead and do all that, and so then I can then say okay, draw an image at a particular index by grabbing the image ID, opening it up and then calling that draw, and so let's test it out, and there it is.
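
Packaged up, the loop looks roughly like this; bb_hw, the row/column convention of the stored boxes, and the cats id-to-name mapping are assumptions carried over from the notebook rather than guaranteed names.

```python
import numpy as np

def bb_hw(a):
    # Assumes boxes are stored as [top-left row, top-left col,
    # bottom-right row, bottom-right col]; convert to [x, y, width, height].
    return np.array([a[1], a[0], a[3] - a[1], a[2] - a[0]])

def draw_im(im, ann):
    ax = show_image(im, figsize=(16, 8))
    for bb, cls in ann:                          # destructure each (bbox, class) pair
        b = bb_hw(bb)
        draw_rect(ax, b)
        draw_text(ax, b[:2], cats[cls], sz=16)   # cats: class id -> name (assumed)
```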

So that kind of seems like quite a few steps, but to me, when you're working with a new data set, getting to the point that you can rapidly explore it, it pays off. You'll see as we start building our model, we're going to keep using these functions now to kind of see how things are going.

So step 1 from our presentation is to do a classifier. And I think it's always good; for me, I didn't really have much experience in doing this kind of object detection stuff before I started preparing this course a few months ago, so I wanted to get this feeling, even though it's deep learning, of continual progress.

So I'm like, what can I make work? I thought, alright, why don't I find the biggest object in each image and classify it? I know how to do that. This is one of the biggest problems I find, particularly with younger students, is they figure out the whole big solution they want, generally which involves a whole lot of new speculative ideas that nobody's ever tried before, and they spend 6 months doing it, and then the day before the presentation, none of it works, and they're screwed.

I've talked about my approach to Kaggle competitions before, which is like half an hour a day. At the end of that half an hour, submit something, and try to make it a little bit better than yesterday's. So I've kind of tried to do the same thing in preparing this lesson, which is to try to create something that's a bit better than the last thing.

So the first thing, the easiest thing I could come up with was my largest item classifier. So the first thing I needed to do was to go through each of the bounding boxes in an image and get the largest one. So I actually didn't write that first, I actually wrote this first.

So normally I pretend that somebody else has created the exact API I want, and then go back and write it. So I wrote this line first, and it's like, okay, I need something which takes all of the bounding boxes for a particular image and finds the largest, and that's pretty straightforward.

I can just sort the bounding boxes, and here again we've got a lambda function. So again, if you haven't used lambda functions before, this is something you should study during the week, they're used all over the place to quickly define a once-off function. And in this case, the Python built-in sorted function lets you pass in a function to say, how do you decide whether something's earlier or later in the sort order?

And in this case, I took the product of the last two items of my bounding box list, i.e. the bottom right hand corner, minus the first two items of my bounding box list, i.e. the top left corner. So bottom right minus top left is the size, the two sizes, and if you take the product of those two things you get the size of the bounding box.

And so then that's the function, do that in descending order. Often you can take something that's going to be a few lines of code and turn it into one line of code, and sometimes you can take that too far, but for me, I like to do that where I reasonably can, because again, having to understand a whole big chain of things, my brain can just say, I can just look at that at once, and say okay, there it is.

And also I find that over time, my brain kind of builds up this little library of idioms, and more and more things I can look at a single line and know what's going on. So this now is a dictionary, and it's a dictionary because this is a dictionary comprehension.

A dictionary comprehension is just like a list comprehension, and I'm going to use it a lot in this part of the course, except it goes inside curly brackets and it's got a key, colon, value. So here the key is going to be the image ID, and the value is the largest bounding box.
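
Roughly, the largest-annotation logic looks like this; trn_anno stands in for the image-id to list-of-(bounding box, class) dictionary built earlier, so treat the variable names as assumptions.

```python
import numpy as np

def get_lrg(annotations):
    # Sort by box area: product of (bottom-right minus top-left), descending.
    return sorted(annotations,
                  key=lambda x: np.prod(np.array(x[0][-2:]) - np.array(x[0][:2])),
                  reverse=True)[0]

# Dictionary comprehension: image id -> its largest (bounding box, class) pair.
trn_lrg_anno = {img_id: get_lrg(anno) for img_id, anno in trn_anno.items()}
```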

So now that we've got that, we can look at an example, and here's an example of the largest bounding box for this image. So obviously there's a lot of objects here, there's three bicycles and three people, but here's the largest bounding box. I feel like this ought to go without saying, but it definitely needs to be said because so many people don't do it.

You need to look at every stage when you've got any kind of processing pipeline, if you're as bad at coding as I am, everything you do will be wrong the first time you do it. But there's lots of people that are as bad as me at coding, and yet lots of people write lines and lines of code assuming they're all correct, and then at the very end they've got a mistake and they don't know where it came from.

So particularly when you're working with images or text, things that humans can look at and understand, keep looking at it. So here I have it: yep, that looks like the biggest thing, and that certainly looks like a person. So let's move on. Here's another nice thing in pathlib, mkdir; it's a handy little method.

So I'm going to create a path called CSV, which is a path to my large objects CSV file. Why am I going to create a CSV file? Pure laziness, right? We have an image classifier from CSV, I could go through a whole lot of work to create a custom data set and blah blah blah to use this particular format I have.

But why? It's so easy to create the CSV, chuck it inside a temporary folder, and then use something you already have. Something I've seen a lot of times on the forum is people asking how do I convert this weird structure into a form that fast.ai can accept, and normally somebody on the forum will say, print it to a CSV file.

So that's a good simple tip. And the easiest way to create a CSV file is to create a pandas dataframe. So here's my pandas dataframe, I can just give it a dictionary with the name of a column and the list of things in that column, so there's the file name, there's the category, and then you'll see here, why do I have this?

I've already named the columns in the dictionary, so why is it here? Because the order of columns matters, and the dictionary does not have a guaranteed order. So this says the file name comes first and the category comes second. That's a good trick for creating your CSVs. So now it's just like dogs vs. cats.
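
A minimal sketch of that trick, assuming names like filenames, img_ids, cats and trn_lrg_anno from the earlier steps; the paths are placeholders too.

```python
import pandas as pd
from pathlib import Path

PATH = Path('data/pascal')                      # placeholder path
(PATH / 'tmp').mkdir(exist_ok=True)             # pathlib's handy mkdir
CSV = PATH / 'tmp' / 'lrg.csv'

df = pd.DataFrame({'fn': [filenames[i] for i in img_ids],
                   'cat': [cats[trn_lrg_anno[i][1]] for i in img_ids]},
                  columns=['fn', 'cat'])        # fix the column order explicitly
df.to_csv(CSV, index=False)
```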

I have a CSV file that contains a bunch of file names, and for each one it contains the class of that object. So this is the same two lines of code you've seen a thousand times. What we will do though is take a look at this: the one thing that's different is crop_type.

So you might remember the default strategy for creating a 224x224 image in fastai is to first of all resize it, so the largest side is 224, and then to take a random square crop during training, and then during validation we take the center crop unless we use data augmentation in which case we do a few random crops.

For bounding boxes, we don't want to do that, because unlike ImageNet, where the thing we care about is pretty much in the middle and pretty big, a lot of the stuff in object detection is quite small and close to the edge. So we could crop it out, and that would be bad.

So when you create your transforms you can choose crop_type=CropType.NO, and NO means don't crop, and therefore to make it square it squishes it instead. So you'll see this guy now looks a bit strangely wide, and that's because he's been squished rather than cropped.

Generally speaking, a lot of computer vision models work a little bit better if you crop rather than squish, but they still work pretty well if you squish, and in this case we definitely don't want to crop, so this is perfectly fine. If you had very long or very tall images, such that if a human looked at the squished version they'd say that looks really weird, then that might be difficult to model, but in this case it just looks a little bit strange, and the computer won't mind.
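
Put together, the standard lines look roughly like this in the old fastai (0.7-era) API, reusing PATH and CSV from the sketch above; the image folder name, size and batch size are assumptions, and the exact signatures are from memory rather than gospel.

```python
from fastai.conv_learner import *   # old fastai 0.7-style star import

JPEGS = 'VOCdevkit/VOC2007/JPEGImages'   # image folder relative to PATH (assumed layout)
f_model = resnet34
sz, bs = 224, 64

# crop_type=CropType.NO: squish to a square rather than cropping,
# so small objects near the edges don't get thrown away.
tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO)
md = ImageClassifierData.from_csv(PATH, JPEGS, CSV, tfms=tfms, bs=bs)
```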

So I'm going to quite often dig a little bit into some more depths of fast.ai and PyTorch, and in this case I want to look at data loaders a little bit more. So you already know that inside a model data object (and there are lots of model data subclasses, like ImageClassifierData) we have a bunch of things which include a training data loader and a training data set, and we'll talk much more about this soon.

The main thing to know about a data loader is that it's an iterator, that each time you grab the next iteration of stuff from it, you get a mini-batch. And the mini-batch you get is of whatever size you asked for, and by default the batch size is 64. In Python, the way you grab the next thing from an iterator is with next, but you can't just do that.

The reason you can't do that is because you need to say, start a new epoch now. In general, this isn't just in PyTorch, but for any Python iterator, you kind of need to say start at the beginning of the sequence, please. So the way you do that, and this is a general Python concept, is you write iter.

And iter says please grab an iterator out of this object. Specifically, as we'll learn later, it means this class has to have defined an __iter__ method, which returns some different object which then has a __next__ method. So that's how I do that.

And so if you want to grab just a single batch, this is how you do it: x, y = next(...) on the data loader. Why x, y? Because our datasets behind the data loaders always have an x independent variable and a y dependent variable. So here we can grab a mini-batch of x's and y's.
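
Here is what that looks like in general, shown with a plain PyTorch DataLoader and random tensors so the snippet is self-contained; the fastai training data loader behaves the same way.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(256, 3, 224, 224),     # fake images
                   torch.randint(0, 20, (256,)))       # fake labels
dl = DataLoader(ds, batch_size=64)

it = iter(dl)        # calls dl.__iter__(): "start at the beginning of an epoch"
x, y = next(it)      # calls __next__() on the iterator: one mini-batch
print(x.shape, y.shape)   # torch.Size([64, 3, 224, 224]) torch.Size([64])
```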

I now want to pass that to that show image command we had earlier, but we can't send that straight to show image. Here it is. For one thing, it's not a NumPy array, it's not on the CPU, and its shape is all wrong. It's not 224x224x3, it's 3x224x224. Furthermore these are not numbers between 0 and 1, why not?

Because, remember, all of the standard ImageNet pre-trained models expect our data to be normalized to have a zero mean and a one standard deviation. So if you look inside, let's use Visual Studio Code for this since that's what we've been doing, so Ctrl+T, tfms_from_model, which in turn, F12, calls tfms_from_stats, and here you can see Normalize.

And it normalizes with some set of image statistics, and the set of image statistics, they're basically hard-coded. This is the ImageNet statistics, this is the statistics used for inception models. So there's a whole bunch of stuff that's been done to the input to get it ready to be passed to a pre-trained model.

So we have a function called denorm, for denormalize. It doesn't only denormalize, it also fixes up the dimension order and all that stuff. The denormalization depends on the transform, and the dataset knows what transform was used to create it. So that's why you have to go model data, dot, some dataset, dot, denorm, and that's a function stored for you that will undo everything.

And then you can pass that a mini-batch, but you have to turn it into NumPy first. So this is all the stuff that you need to be able to do to grab batches and look at them. And after you've done all that, you can show the image, and we've got our image back.
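
As a hand-rolled sketch of what such a denorm step has to do (not the library's own denorm), using the standard ImageNet statistics and the x we grabbed with next above:

```python
import numpy as np

imagenet_mean = np.array([0.485, 0.456, 0.406])   # standard ImageNet stats
imagenet_std  = np.array([0.229, 0.224, 0.225])

def denorm_batch(x):
    ims = x.cpu().numpy().transpose(0, 2, 3, 1)   # to NumPy, NCHW -> NHWC
    return ims * imagenet_std + imagenet_mean     # undo (im - mean) / std

ims = denorm_batch(x)
ax = show_image(ims[0])   # now it's a plottable HxWx3 array of roughly 0-1 floats
```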

So that's looking good. So in the end, we've just got the standard four lines of code: we've got our transforms, we've got our model data, ConvLearner.pretrained (we're using a ResNet-34 here), I'm going to add accuracy as a metric, pick an optimization function, do an lr_find, and that looks kind of weird, not particularly helpful.

Normally we would expect to see an uptick on the right. The reason we don't see it is because we intentionally remove the first few points and the last few points. The reason is that often the last few points shoot so high up towards infinity that you basically can't see anything, so the vast majority of the time removing the last few points is a good idea.

However, when you've got very few mini-batches, sometimes that's not a good idea, and a lot of people ask about this on the forum, so here's how you fix it: just pass the skip arguments. By default it skips 10 points at the start and 5 at the end; here we tell it to skip fewer at the start, so now we can see the shape properly.

If your data set is really tiny, you may need to use a smaller batch size; if you only have three or four batches' worth, there's nothing to see. But in this case it's fine, we just have to plot a little bit more. So we pick a learning rate, we say fit, and after one epoch, just training the last layer, it's 80%; let's unfreeze a couple of layers, do another epoch, 82%; unfreeze the whole thing, and it's not really improving.
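
The training flow, in the same old fastai API sketched above; the learning rates and epoch counts here are placeholders, and the exact method defaults are from memory, so take it as a sketch rather than the notebook verbatim.

```python
from torch import optim

learn = ConvLearner.pretrained(f_model, md, metrics=[accuracy])
learn.opt_fn = optim.Adam            # optimizer choice, as described

learn.lr_find()
learn.sched.plot()                   # accepts arguments to skip points at the
                                     # start/end of the curve, as discussed above
lr = 2e-2                            # placeholder learning rate
learn.fit(lr, 1, cycle_len=1)        # train just the new head
learn.freeze_to(-2)                  # then unfreeze a couple of layer groups
learn.fit(lr / 10, 1, cycle_len=1)
learn.unfreeze()                     # finally unfreeze everything
learn.fit(lr / 100, 1, cycle_len=1)
```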

Why are we stuck at 80%? It kind of makes sense, right? Unlike ImageNet or dogs vs. cats, where each image has one major thing (they were picked because they had one major thing, and that one major thing is what you're asked to look for), a lot of the Pascal data set has lots of little things, and so a largest-object classifier is not necessarily going to do great.

But of course, we really need to be able to see the results to see whether it makes sense. So we're going to write something that creates this, and in this case, after working with this a while, I know what the 20 Pascal classes are. So I know there's a person on the bicycle, I know there's a dog on a sofa, so I know this one's wrong, it should be sofa; that's correct; bird, yes; yes; chair, that's wrong, I think the table's bigger; motorbike's correct; that should be a bus; person's correct; bird's correct; cow's correct; plant's correct; cow's correct.

So it's looking pretty good. So when you see a piece of code like this, if you're not familiar with all the steps to get there, it can be a little overwhelming. And I feel the same way when I see a few lines of code and something I'm not terribly familiar with, I feel overwhelmed as well, but it turns out there's two ways to make it super simple to understand the code.

Or there's one high-level way. The high-level way is: run each line of code step by step, print out the inputs, print out the outputs. Most of the time that'll be enough. If there's a line of code where you don't understand how the outputs relate to the inputs, go and have a look at the source.

So now all you need to know is what are the two ways you can step through the lines of code one at a time. The way I use perhaps the most often is to take the contents of the loop, copy it, create a cell above it, paste it, outdent it, write i=0, and then put them all in separate cells, and then run each one one at a time, printing out the input samples.

I know that's obvious, but the number of times I actually see people do that when they ask me for help is basically zero, because if they had done that, they wouldn't be asking for help. Another method that's super handy and there's particular situations where it's super handy is to use the Python Debugger.

Who here has used a debugger before? So half to two-thirds. So for the other half of you, this will be life-changing. Actually, a guy I know this morning who's actually a deep learning researcher wrote on Twitter, and his message on Twitter was "How come nobody told me about the Python Debugger before?

My life has changed." And this guy's an expert, but because nobody teaches basic software engineering skills in academic courses, nobody thought to say to him, "Hey Mark, do you know what? There's something that shows you everything your code does one step at a time." So I replied on Twitter and I said, "Good news Mark, not only that, every single language in existence, in every single operating system also has a debugger, and if you Google for language name debugger, it will tell you how to use it." So there's a meta piece of information for you.

In Python, the standard debugger is called PDB. And there's two main ways to use it. The first is to go into your code. And the reason I'm mentioning this now is because during the next few weeks, if you're anything like me, 99% of the time you'll be in a situation where your code's not working.

And very often it will have been on the 14th mini-batch, inside the forward method of your custom module. It's like, what do you do? And the answer is you go inside your module and you write pdb.set_trace(). And if you know it's only happening on the 14th iteration, you type if i == 13.

So you can set a conditional breakpoint. pdb is the Python debugger; fastai imports it for you, and if you get a message that pdb isn't defined, you can just say import pdb. So let's try that. And you'll see it's not the most user-friendly experience, it just pops up a box.
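
As a tiny runnable sketch of that pattern (the loop and the 14th-iteration condition are just illustrative):

```python
import pdb

for i in range(20):
    loss = (i - 10) ** 2          # stand-in for a forward pass / loss computation
    if i == 13:                   # conditional breakpoint: only on the 14th iteration
        pdb.set_trace()           # pauses here; n = next, s = step into, c = continue,
                                  # p <expr> = print, l = list, u/d = up/down the stack
```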

But the first cool thing to notice is, the debugger even works in a notebook. So that's pretty nifty. It will also work in the terminal. And so what can you do? You can type h for help. And there are plenty of tutorials here. The main thing to know is this is one of these situations where you definitely want to know the one-letter mnemonics.

So you could type next, but you definitely want to type n. You could type continue, but you definitely want to type c. I've listed the main ones you need. So now that I'm sitting here, it shows me the line it's about to run. One thing I might want to do is print something out, and I can write any Python expression, hit Enter, and see its value.

So that's a useful thing to do. I might want to find out more about where am I in the code more generally. I don't just want to see this line, but what's before it and after it, in which case I want L for list. And so you can see I'm about to run that line, these are the lines above it, and below it.

So I might be now like, let's run this line and see what happens. So go to the next line, here's n, and you can see now it's about to run the next line. One handy tip, you don't even have to type n. If you just hit enter, it repeats the last thing you did, so that's another n.

So I now should have a thing called b. Unfortunately, single letters are often used for debugger commands. So if I just type b, it'll run the b command rather than print b for me. So to force it to print, you use print b. So there's a bird. Alright, fine, let's do next again.

At this point, if I hit next, it'll draw the text. But I don't want to just draw the text, I want to know how it's going to draw the text. So I don't want to know next over it, I want to s step into it. So if I now hit s, step into it, I'm now inside draw text, and I now hit n, I can see draw text, and so forth.

And then I'm like, okay, I know everything I want to know about this, I will continue until I hit the next breakpoint; so c will continue until I'm back at the breakpoint again. What if I was zipping along, and this happens quite often... let's step into denorm. Here I am inside denorm.

And what will often happen is if you're debugging something in your PyTorch module, and it's hit an exception, and you're trying to debug, you'll find yourself like six layers deep inside PyTorch. You want to actually see back up what's happening where you called it from. So in this case, I'm inside this property, but I actually want to know what was going on up the call stack, I just hit u, and that doesn't actually run anything, it just changes the context of the debugger to show me what called it, and now I can type things to find out about that environment.

And then if I want to go down again, it's d. I'm not going to show you everything about the debugger, but I've just shown you all of those commands. Yes, Azar? Something that we've found helpful as we've been doing this is using from IPython.core.debugger import set_trace, and then you get it all prettily colored.

You do indeed, thank you. Excellent tip. Let's learn about some of our students here. Azar, tell us, I know you're doing an interesting project, can you tell us about it? Sure. Okay. Hello everyone, I'm Azar, here with my collaborator Britt, and we're using this kind of stuff to try to build a Google Translate for animal communication.

So that involves playing around a lot with unsupervised neural machine translation and doing it on top of audio. Where do you get data for that from? That's sort of the hard problem. We're talking to a number of researchers to try to collect and collate large data sets, but if we can't get them that way, we're thinking about building a living library of the audio of the species of Earth, which involves going out and collecting 100,000 hours of gelada monkey vocalization.

I didn't know that, that's pretty cool. All right, that's great. So let's get rid of that set_trace. The other place the debugger comes in particularly handy is, as I say, if you've got an exception, particularly if it's deep inside PyTorch. So if I, say, multiply i by 100 here, obviously that's going to be an exception; I've got rid of the set_trace.

So if I run this now, something's wrong. In this case it's easy to see what's wrong, but often it's not, so what do I do? %debug pops open the debugger at the point the exception happened. So now I can check: okay, len(preds) is 64; i times 100, I have to print that because it clashes with a command, is 100; oh, no wonder. And you can go down, you can go up, you can list, whatever.
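
For reference, the post-mortem pattern looks like this; %debug is the IPython magic mentioned here, and pdb.pm() is the plain-Python equivalent.

```python
# In the notebook, after a cell raises an exception, run %debug in the next
# cell to open pdb at the point the exception was raised. Outside IPython,
# the standard-library equivalent is:
import pdb
pdb.pm()   # post-mortem: drop into the frame where the last exception happened
```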

I do all of my development, both with the library and the lessons in Jupyter Notebook. I do it all interactively and I use percent debug all the time along with this idea of copying stuff out of a function, putting it into separate cells, running it step by step. There are similar things you can do inside, for example, Visual Studio Code.

There's actually a Jupyter extension which says you select any line of code inside Visual Studio Code and say run in Jupyter, and it will run it in Jupyter and create a little window showing you the output. There's neat little stuff like that. Actually I think Jupyter Notebook is better, and perhaps by the time you watch this on the video, Jupyter Lab will be the main thing.

Jupyter Lab is like the next version of Jupyter Notebook, pretty similar. Well, I just broke it totally. We know exactly how to fix it, so we will worry about that another time. We'll debug it this evening. So to kind of do the next stage, we want to create the bounding box.

And now creating the bounding box around the largest object may seem like something you haven't done before, but actually it's totally something you've done before. And the reason is that we know we can create a regression rather than a classification neural net. In other words, a classification neural net is just one that has a sigmoid or softmax output, and we use a cross-entropy or binary cross-entropy (negative log likelihood) loss function.

That's basically what makes it a classifier. If we don't have the softmax or sigmoid at the end, and we use mean squared error as a loss function, it's now a regression model, so we can use it to predict a continuous number rather than a category. We also know that we can have multiple outputs, like in the Planet competition where we did a multiple-label classification.

What if we combine the two ideas and do a multiple-column regression? In this case we've got four numbers: top-left x and y, bottom-right x and y, and we could create a neural net with four activations. We could have no softmax or sigmoid and use a mean squared error loss function.

And this is kind of like where you're thinking about it like differentiable programming. It's not like how do I create a bounding box model, it's like what do I need? I need four numbers, therefore I need a neural network with four activations. That's half of what I need to know.

The other half I need to know is a loss function. In other words, what's a function that when it is lower means that the four numbers are better? Because if I can do those two things, I'm done. If the x is close to the first activation and the y is close to the second, then I'm done.

So that's it. I just need to create a model with four activations and a mean squared error loss function, and that should be it. We don't need anything new, so let's try it. So again, we'll use a CSV. And if you remember from Part 1, to do a multiple-label classification, your multiple labels have to be space-separated, and then separated from the file name by a comma.

So I'll take my largest item dictionary, create a bunch of bounding boxes for each one separated by a space using a list comprehension, then create a data frame like I did before, I'll turn that into a CSV, and now I've got something that's got the file name and the four bounding box coordinates.
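
Same CSV trick as before, except the label column is now the four coordinates joined with spaces; the variable names carry over the assumptions from the earlier sketch.

```python
BB_CSV = PATH / 'tmp' / 'bb.csv'

bbs = [' '.join(str(int(coord)) for coord in trn_lrg_anno[i][0]) for i in img_ids]
df = pd.DataFrame({'fn': [filenames[i] for i in img_ids], 'bbox': bbs},
                  columns=['fn', 'bbox'])    # file name first, then the label column
df.to_csv(BB_CSV, index=False)
```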

I will then pass that to from_csv, and again I will use crop_type=CropType.NO. Next week we'll look at the coordinate transform type; for now, just realize that when we're doing scaling and data augmentation, that needs to happen to the bounding boxes, not just to the images. ImageClassifierData.from_csv gets us to a situation where we can now grab one mini-batch of data, we can denormalize it, we can turn the bounding box back into a height and width so that we can show it, and here it is.

Remember we're not doing classification, so I don't know what kind of thing this is, it's just a thing, but there is a thing. So I now want to create a ConvNet based on ResNet-34, but I don't want to add the standard set of fully connected layers that create a classifier; I want to add just a single linear layer with four outputs.

So FastAI has this concept of a custom head, if you say my model has a custom head, the head being the thing that's added to the top of the model, then it's not going to create any of that fully connected network for you, it's not going to add the adaptive average pooling for you, but instead it will add whatever model you ask for.

So in this case I've created a tiny model. It's a model that flattens out the previous layer; normally it would be a 7 by 7 by, I think, 512 previous layer in ResNet-34, so it just flattens that out into a single vector of length 25,088, and then I just add a linear layer that goes from 25,088 to 4, and there are my 4 outputs.

So that's the simplest possible kind of final layer you could add. I stick that on top of my pre-trained ResNet-34 model, so this is exactly the same as usual except I've just got this custom head. Optimize it with Adam, and use a criterion; I'm actually not going to use MSE, I'm going to use L1 loss. I can't remember if we covered this last week, we can revise it next week if we didn't, but L1 loss means rather than adding up the squared errors, you add up the absolute values of the errors.
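
A sketch of that head and loss in code; the head is plain PyTorch (nn.Flatten is the modern spelling, the old library defined its own Flatten module), while the custom_head and learn.crit wiring follows the old fastai API as I remember it, so treat those names as assumptions.

```python
import torch.nn as nn

head_reg4 = nn.Sequential(
    nn.Flatten(),              # 7 x 7 x 512 feature map -> 25,088 activations
    nn.Linear(25088, 4),       # 4 outputs: top-left and bottom-right coordinates
)

learn = ConvLearner.pretrained(f_model, md, custom_head=head_reg4)
learn.opt_fn = optim.Adam
learn.crit = nn.L1Loss()       # mean absolute error rather than MSE
```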

It's normally actually what you want; adding up the squared errors penalizes bad misses by too much, so L1 loss is generally better to work with. I'll come back to this next week, but basically you can see what we do now: we do our lr_find, find our learning rate, learn for a while, freeze to -2, learn a bit more, freeze to -3, learn a bit more, and you can see this validation loss, which remember is the mean of the absolute value of the number of pixels we're off by, gets lower and lower, and then when we're done we can print out the bounding boxes, and lo and behold, it's done a damn good job.

So we'll revise this a bit more next week, but you can see this idea of like if I said to you before this class, do you know how to create a bounding box model? You might have said, no, nobody's taught me that. But the question actually is, can you create a model with 4 continuous outputs?

Yes. Can you create a loss function that is lower if those 4 outputs are near to 4 other numbers? Yes. Then you're done. Now you'll see if I scroll a bit further down, it starts looking a bit crappy. Anytime you've got more than one object. And that's not surprising, because how the hell do you decide which bird, so it's just said I'll just pick the middle, which cow, I'll pick the middle.

How much of this is actually potted plant? I'll pick the middle. This one it could probably improve; it's got close to the car, but it's a pretty weird car. But nonetheless, for the ones that are reasonably clear, I would say it's done a pretty good job. Alright, so that's it for this week.

I think it's been a kind of gentle introduction for the first lesson. If you're a professional coder, there's probably not heaps of new stuff here for you. And so in that case I would suggest practicing learning about bounding boxes and stuff. If you aren't so experienced with things like debuggers and matplotlib API and stuff like that, there's going to be a lot for you to practice because we're going to be really assuming you know it well from next week.

Thanks everybody, see you next Monday. (audience applauds)