Lesson 3: Deep Learning 2018

Welcome back, everybody. I'm sure you've noticed there's been a lot of cool activity on the forum this week, and one of the things that's been really great to see is that a lot of you have started creating really helpful materials, both for your classmates to better understand stuff and also for you to better understand stuff by trying to teach what you've learned.

I just wanted to highlight a few. I've actually posted a few of these to the wiki thread, but there's lots more. Reshma has posted a whole bunch of nice introductory tutorials. For example, if you're having any trouble getting connected with AWS, she's got a whole step-by-step how-to on logging in and getting everything working, which I think is a really terrific thing. It's the kind of thing where, if you're writing some notes for yourself to remind you how to do it, you may as well post them for others as well. And by using a Markdown file like this, it's actually good practice if you haven't used GitHub before: if you put it up on GitHub, everybody can now use it. Or of course you can just put it in the forum.

A more advanced thing that Reshma wrote up is that she noticed that I like using tmux, which is a handy little thing which lets me basically have a window. Let's see if I've got one. I'll show you. As soon as I log into my computer, if I run tmux, you'll see that all of my windows pop straight up, and I can continue running stuff in the background. I've got vim over here, and I can zoom into it, or I can move over to the top, where there's a Jupyter kernel running, and so forth. So if that sounds interesting, Reshma has a tutorial here on how you can use tmux, and she's actually got a whole bunch of stuff in her GitHub, so that's really cool.

Apil Tamang has written a very nice summary of our last lesson, which covers what the key things we did were and why we did them. So if you're wondering how it all fits together, I think this is a really helpful summary: what did those couple of hours look like if we summarize it all into a page or two?

I also really like that Pavel has done a deep dive on the learning rate finder, which is a topic that a lot of you have been interested in learning more about. Particularly those of you who have done deep learning before have realized that this is a solution to a problem that you've been having for a long time and haven't seen before, and it's something which hasn't really been blogged about before. This is the first time I've seen it blogged about, so when I put a link to Pavel's post on Twitter, it was shared hundreds of times. It's been really, really popular and viewed many thousands of times, so that's some great content. Radek has posted lots of cool stuff.

I really like this practitioner's guide to PyTorch, which again is more for more advanced students. It's aimed at people who have never used PyTorch before but know a bit about numerical programming in general, and it's a quick introduction to how PyTorch is different. And then there's been some interesting little bits of research, like what's the relationship between learning rate and batch size. One of the students actually asked me this before class, and I said, well, one of the other students has written an analysis of exactly that. What he's done is basically looked through and tried different batch sizes and different learning rates and tried to see how they seem to relate together. These are all cool experiments which you can try yourself. Radek again has written something, a kind of research into this question.

I made a claim that stochastic gradient descent with restarts finds more generalizable parts of the function surface because they're kind of flatter, and he's been trying to figure out whether there's a way to measure that more directly. Not quite successful yet, but a really interesting piece of research. We've got some introductions to convolutional neural networks, and then something that we'll be learning about towards the end of this course: I'm sure you've noticed we're using something called ResNet, and Anand Saha actually posted a pretty impressive analysis of what a ResNet is and why it's interesting.

And this one's actually already been shared very widely around the internet, I've seen. So more advanced students who are interested in jumping ahead can look at that, and Apil Tamang has also done something similar. So, yeah, lots of stuff going on on the forums.

I'm sure you've also noticed we have a beginner forum now, specifically for asking questions. It's always the case that there are no dumb questions, but when there's lots of people around you talking about advanced topics, it might not feel that way, so hopefully the beginners forum is just a less intimidating space. If you're a more advanced student who can help answer those questions, please do, but remember when you do answer those questions, try to answer in a way that's friendly to people that maybe have no more than a year of programming experience and haven't done any machine learning before.

I hope other people in the class feel like you can contribute as well. Just remember, all of the people we just looked at, or many of them I believe, have never posted anything to the internet before. You don't have to be a particular kind of person to be allowed to blog or something. You can just jot down your notes and throw them up there. One handy thing is if you just put it on the forum and you're not quite sure of some of the details, then you have an opportunity to get feedback, like "ah, well, that's not quite how that works, actually it works this way instead", or "that's a really interesting insight, had you thought about taking this further?", and so forth.

So what we've done so far is a kind of introduction, just as a practitioner, to convolutional neural networks for images. We haven't really talked much at all about the theory, or why they work, or the math of them, but on the other hand what we have done is seen how to build a model which actually works exceptionally well, in fact world-class level models. We'll review a little bit of that today, and then also today we're going to dig in quite a lot more, actually, into the underlying theory of what a CNN is.

What's a convolution? How does this work? And then we're going to go through this cycle where we do a little intro into a whole bunch of application areas: using neural nets for structured data, so logistics or forecasting or financial data, that kind of thing; then language applications, NLP applications, using recurrent neural nets; and then collaborative filtering for recommendation systems. These will all be similar to what we've done for CNNs for images: here's how you can get a state-of-the-art result without digging into the theory, but knowing how to actually make it work. And then we're going to go back through those almost in reverse order. So then we're going to dig right into collaborative filtering in a lot of detail and see how to write the code underneath and how the math works underneath, and then we're going to do the same thing for structured data analysis, the same thing for convnets for images, and finally an in-depth deep dive into recurrent neural networks. So that's where we're heading.

Let's start by doing a little bit of a review, and I also want to provide a bit more detail on some steps that we only skimmed over briefly. I want to make sure that we're all able to complete last week's assignment, which was the dog breeds one: basically, to apply what you've learned to another dataset, and I thought the easiest one to do would be the dog breeds Kaggle competition. So I want to make sure everybody has everything you need to do this right now.

The first thing is to make sure that you know how to download data. There's two main places at the moment we're downloading data from: one is Kaggle, and the other is anywhere else. I'll first of all do the Kaggle version. To download from Kaggle, we use something called kaggle-cli, which is here. To install it... I think it's already in, let's just double check. Yeah, so it should already be in your environment. But one thing that happens is, because this is downloading from the Kaggle website through screen scraping, every time Kaggle changes the website, it breaks. So any time you try to use it, if Kaggle's website has changed recently, you'll need to make sure you get the most recent version. You can always go pip install kaggle-cli --upgrade, and that'll just make sure that you've got the latest version of it and everything that it depends on. Having done that, you can follow the instructions.

Actually, I think Reshma was kind enough to... there you go, there's a kaggle-cli one. I feel like everything you need to know can be found at Reshma's GitHub. So basically, for the next step you go kg download, and then you provide your username with -u, your password with -p, and then -c and the competition name. A lot of people on the forum have been confused about what to enter here. The key thing to note is that when you're at a Kaggle competition, after the /c in the URL there's a specific name, planet-understanding- et cetera. That's the name you need.

The other thing you'll need to make sure of is that, on your own computer, you've attempted to click download at least once, because when you do it will ask you to accept the rules. If you've forgotten to do that, kg download will give you a hint: it'll say it looks like you might have forgotten to accept the rules. If you log into Kaggle with a Google account, or anything other than a username and password, this won't work, so you'll need to click "forgot password" on Kaggle and get them to send you a normal password. So that's the Kaggle version, and when you do that you end up with a whole folder created for you with all of that competition's data in it.

There are a couple of reasons you might not want to use that. The first is that you're using a dataset that's not on Kaggle. The second is that you don't want all of the datasets in a Kaggle competition. For example, the planet competition that we've been looking at a little bit, and that we'll look at again today, has data in two formats, TIFF and JPEG. The TIFF is 19 gigabytes and the JPEG is 600 megabytes, so you probably don't want to download both. So I'll show you a really cool kit, which actually somebody on the forum taught me. I think it was one of the MSAN students here at USF.

There's a Chrome extension called CurlWget. You can just search for CurlWget and then install it by clicking on install, if you haven't installed an extension before. From now on, every time you try to download something... so I'll try and download this file, and I'll just go ahead and cancel it. Now you see this little yellow button that's been added up here, and there's a whole command here. I can copy that, paste it into my window, hit go, and there it goes. What that does is, all of your cookies and headers and everything else needed to download that file are saved. So this is not just useful for downloading data. It's also useful if you're trying to download, I don't know, some TV show or something, anything where it's hidden behind a login. You can grab it, and actually that is very useful for data science, because quite often we want to analyze things like videos on our consoles. So this is a good trick.
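To make concrete what "cookies and headers" means here, the following is a minimal Python sketch that attaches them to a download request. It is not what CurlWget generates; the URL, the cookie and the header values are made-up placeholders, and the request is only built, never sent.

```python
import urllib.request


def build_download_request(url, cookies, headers):
    """Build (but do not send) a request carrying browser-style cookies and headers."""
    req = urllib.request.Request(url)
    for name, value in headers.items():
        req.add_header(name, value)
    # Cookies travel as one "Cookie" header made of name=value pairs.
    req.add_header("Cookie", "; ".join(f"{k}={v}" for k, v in cookies.items()))
    return req


# All values below are hypothetical placeholders.
req = build_download_request(
    "https://example.com/data/train-jpg.tar",
    cookies={"sessionid": "abc123"},
    headers={"User-Agent": "Mozilla/5.0", "Referer": "https://example.com/data"},
)
print(req.get_header("Cookie"))
```

A site that only serves the file to a logged-in browser sees the same session cookie on this request, which is why the copied command works from a server that has never visited the login page.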

All right, so there's two ways to get the data. Having got the data, you then need to build your model. You'll notice that I tend to assume that the data is in a directory called data that's a subdirectory of wherever your notebook is.

Now, you don't necessarily want to actually put your data there. You might want to put it directly in your home directory, or you might want to put it on another drive, or whatever. So what I do is, if you look inside my courses/dl1 folder, you'll see that data is actually a symbolic link to a different drive.

So you can put it anywhere you like and then just add a symbolic link, or you can just put it there directly. It's up to you. If you haven't used symlinks before, they're like aliases or shortcuts on the Mac or Windows. Very handy, and there are some threads on the forum about how to use them if you want help with that. That, for example, is also how we actually have the fast.ai modules available from the same place as our notebooks.
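As a small illustration of the symlink idea, here is a sketch in Python that creates a data link inside a notebook folder and points it at a directory elsewhere. The folder names are stand-ins created in a temporary directory, and it assumes a system where creating symlinks is permitted (Linux or macOS).

```python
import os
import tempfile

# Stand-ins for "another drive" and the notebook folder (both hypothetical).
root = tempfile.mkdtemp()
big_drive = os.path.join(root, "big_drive", "datasets")
notebooks = os.path.join(root, "courses", "dl1")
os.makedirs(big_drive)
os.makedirs(notebooks)

# Make notebooks/data point at the real location.
link = os.path.join(notebooks, "data")
os.symlink(big_drive, link)

is_link = os.path.islink(link)   # it is a link, not a real folder
target = os.readlink(link)       # where the link points
print(is_link, target)
```

Code that opens files under the data folder does not need to know the files live somewhere else, which is the whole point of the trick.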

It's just a symlink to where they come from. Any time you want to see where things actually point to in Linux, you can just use the -l flag when listing a directory, and it'll show you where the symlinks point and also show you which things are directories, and so forth.

One thing which may be a little unclear based on what we've done so far is how little code you actually need to do this end to end. What I've got here, in a single window, is an entire end-to-end process to get a state-of-the-art result for cats versus dogs.

The only step I've skipped is the bit where we downloaded it from Kaggle and then unzipped it. These are literally all the steps. We import our libraries, and actually if you import this one, conv_learner, that basically imports everything else. We need to tell it the path where things are, the size that we want, and the batch size that we want. Then, and we're going to learn a lot more about what these do very shortly, we say how we want to transform our data: we want to transform it in a way that's suitable for this particular kind of model, assuming that the photos are side-on photos and that we're going to zoom in up to 10% each time.

We say that we want to get some data based on paths. Remember, this is the idea that there's a path called cats and a path called dogs, and they're inside a path called train and a path called valid. Note that you can always override these with other things, so if your things are in differently named folders you could either rename them or, as you can see here, there's a trn_name and a val_name, and you can always pick something else. Also notice there's a test_name, so if you want to submit something to Kaggle you'll need to fill in the name of the folder where the test set is, and obviously those won't be labeled. So then we create a model from a pre-trained model.
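On the folder convention mentioned above (train and valid, each containing one folder per class), a minimal sketch of how labels can be read off the folder names might look like this. It is not the library's loader; the helper name and the tiny fake dataset are invented for illustration.

```python
import os
import tempfile


def labels_from_folders(split_dir):
    """Map each file under split_dir/<class>/ to its class name."""
    labels = {}
    for cls in sorted(os.listdir(split_dir)):
        cls_dir = os.path.join(split_dir, cls)
        if os.path.isdir(cls_dir):
            for fname in sorted(os.listdir(cls_dir)):
                labels[fname] = cls
    return labels


# Build a tiny fake dataset: train/ and valid/, each with cats/ and dogs/.
root = tempfile.mkdtemp()
for split in ("train", "valid"):
    for cls in ("cats", "dogs"):
        d = os.path.join(root, split, cls)
        os.makedirs(d)
        open(os.path.join(d, f"{cls[:-1]}.0.jpg"), "w").close()

train_labels = labels_from_folders(os.path.join(root, "train"))
print(train_labels)
```

Because the label comes only from the folder a file sits in, renaming the split folders changes nothing about the labels, which is why the folder names can be overridden freely.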

It's from a ResNet-50 model, using this data, and then we call fit. Remember, by default that has all of the layers but the last few frozen, and again, we'll learn a lot more about what that means. That took two and a half minutes.

Notice here I didn't say precompute=True. Again, there's been some confusion on the forums about what that means. It's only something that makes it a little faster for this first step, so you can always skip it, and if you're at all confused about it, or it's causing you any problems, just leave it off, because it's just a shortcut which caches some of the intermediate steps that don't have to be recalculated each time. And remember that when we are using precomputed activations, data augmentation doesn't work. Even if you ask for data augmentation, if you've got precompute=True it doesn't actually do any, because it's using the cached, non-augmented activations. So in this case, to keep this as simple as possible, I have no precomputed anything going on.

So I do three cycles of length one, and then I can unfreeze, so it's now going to train the whole thing. Something we haven't seen before, and which we'll learn about in the second half, is called bn_freeze. For now, all you need to know is that if you're using a bigger, deeper model like ResNet-50 or ResNeXt-101 on a dataset that's very, very similar to ImageNet, like this cats and dogs dataset (in other words, side-on photos of standard objects of a similar size to ImageNet, somewhere between 200 and 500 pixels), you should probably add this line when you unfreeze. For those of you that are more advanced, what it's doing is causing the batch normalization moving averages to not be updated, but in the second half of this course you're going to learn all about why we do that. It's something that's not supported by any other library, but it turns out to be super important. Anyway, we do one more epoch training the whole network, and then at the end we use test time augmentation to ensure that we get the best predictions we can, and that gives us 99.45%.

So that's it. When you try a new dataset, these are basically the minimum set of steps that you would need to follow. You'll notice this is assuming I already know what learning rate to use, so you'd use the learning rate finder for that. It's assuming that I know the directory layout, and so forth. So that's kind of a minimum set.

Now, one of the things that I wanted to make sure you had an understanding of is how to use libraries other than fast.ai, and I feel like the best thing to look at is Keras, because Keras is a library that sits on top of other things, just like fast.ai sits on top of PyTorch. Keras sits on top of a whole variety of different back ends. Mainly people nowadays use it with TensorFlow. There's also an MXNet version.

There's also a Microsoft CNTK version. If you do a git pull, you'll see that there's something called Keras lesson 1, where I've attempted to replicate at least parts of lesson one in Keras, just to give you a sense of how that works. I'm not going to talk more about batch norm freeze now, other than to say: if you're using something which has got a number larger than 34 at the end, so like ResNet-50 or ResNeXt-101, and you're training on a dataset that is very similar to ImageNet, so normal photos of normal sizes where the thing of interest takes up most of the frame, then you probably should add bn_freeze(True) after unfreeze. If in doubt, try training with it and then try training without it. More advanced students can certainly talk about it on the forums this week, and we will be talking about the details of it in the second half of the course, when we come back to our CNN in-depth section in the second-last lesson.

So with Keras, again, we import a bunch of stuff. Remember I mentioned that this idea, that you've got a thing called train and a thing called valid, and inside those you've got a thing called dogs and a thing called cats, is a standard way of providing labeled images. Keras does that too. So we're going to tell it where the training set and the validation set are, what size to use, and what batch size to use.

Now, you'll notice that in Keras we need much, much more code to do the same thing. More importantly, each part of that code has many, many more things you have to set, and if you set them wrong, everything breaks. So I'll give you a summary of what they are. Rather than creating a single data object, in Keras we first of all have to define something called a data generator, to say how to generate the data. For a data generator, we have to say what kind of data augmentation we want to do, and we also have to say what kind of normalization we want to do. Whereas with fast.ai we just say "whatever ResNet-50 requires, just do that for me, please", here we actually have to know a little bit about what's expected of us. Generally speaking, copying and pasting Keras code from the internet is a good way to make sure you've got the right stuff to make that work. And again, it doesn't have a standard set of "here are the best data augmentation parameters to use for photos". I've copied and pasted all of this from the Keras documentation, so I don't think it's the best set to use at all, but it's the set that they're using in their docs.

Having said how I want to generate data (horizontally flip sometimes, zoom sometimes, shear sometimes), we then create a generator from that, by taking that data generator and saying I want to generate images by looking in a directory. We pass in the directory, which has the same directory structure that fast.ai uses, and you'll see there's some overlap with how fast.ai works here. You tell it what size images you want to create, and you tell it what batch size you want in your mini-batches. Then there's something here not to worry about too much, but basically if you've just got two possible outcomes you would generally say binary here, and if you've got multiple possible outcomes you would say categorical.
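The difference between the binary and categorical settings comes down to how labels are encoded. The sketch below shows the two encodings in plain Python; the function names are invented for illustration and this is not Keras's own code.

```python
def encode_binary(labels, classes):
    """Two classes: one number per image, 0 for classes[0] and 1 for classes[1]."""
    assert len(classes) == 2
    return [classes.index(lbl) for lbl in labels]


def encode_categorical(labels, classes):
    """Any number of classes: a one-hot row per image."""
    return [[1 if c == lbl else 0 for c in classes] for lbl in labels]


labels = ["dogs", "cats", "dogs"]
classes = ["cats", "dogs"]
binary = encode_binary(labels, classes)
one_hot = encode_categorical(labels, classes)
print(binary)
print(one_hot)
```

With two classes the single-number form carries the same information as the one-hot form, which is why binary is the usual choice for cats versus dogs.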

So we've only got cats or dogs, so it's binary. An example of where things get a little more complex is that you have to do the same thing for the validation set. It's up to you to create a data generator that doesn't have data augmentation, because obviously for the validation set, unless you're using TTA, that's going to stuff things up. Also, when you train, you randomly reorder the images so that they're always shown in different orders, to make it more random. But with validation it's vital that you don't do that, because if you shuffle the validation set you then can't track how well you're doing: it's in a different order from the labels.
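Why reordering the validation set breaks the bookkeeping can be seen in a toy sketch. Everything here is made up: the file names, the labels, and a stand-in "model" that is deliberately perfect. A fixed reordering stands in for a random shuffle so the result is reproducible.

```python
# Hypothetical validation set: file names and their true labels, in stored order.
files = [f"img{i}.jpg" for i in range(8)]
labels = [i % 2 for i in range(8)]  # 0, 1, 0, 1, ...


def predict(fname):
    """Stand-in 'model' that is always right: it reads the label off the file index."""
    return int(fname[3:-4]) % 2


def accuracy(preds, targets):
    return sum(p == y for p, y in zip(preds, targets)) / len(targets)


# Same order as the labels: predictions line up with the targets.
acc_in_order = accuracy([predict(f) for f in files], labels)

# Reordered files, labels left in stored order: same model, meaningless score.
reordered = files[1:] + files[:1]
acc_reordered = accuracy([predict(f) for f in reordered], labels)
print(acc_in_order, acc_reordered)
```

The model has not changed between the two measurements; only the pairing of predictions with labels has, and the reported accuracy collapses.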

Basically, these are the kind of steps you have to do every time with Keras. Again, the reason I was using ResNet-50 before is that Keras doesn't have ResNet-34, unfortunately, and I wanted to compare like with like, so we've got to use ResNet-50 here. There isn't the same idea in Keras of saying "construct a model that is suitable for this dataset for me", so you have to do it by hand.

The way you do it is to basically say "this is my base model", and then you have to manually construct on top of that the layers that you want to add. By the end of this course, you'll understand why it is that these particular three layers are the layers that we add. Having done that, in Keras you basically say "okay, this is my model", and then again there isn't a concept of automatically freezing things, or an API for that, so you just have to loop through the layers that you want to freeze and set trainable to False on them.

In Keras, there's a concept we don't have in fast.ai or PyTorch of compiling a model. Once your model's ready to use, you have to compile it, passing in what kind of optimizer to use, what kind of loss to look for, and what metrics. Again, with fast.ai you don't have to pass this in, because we know what loss is the right loss to use. You can always override it, but for a particular model we give you good defaults.

Having done all that, rather than calling fit you call fit_generator, passing in those two generators that you saw earlier, the train generator and the validation generator. For reasons I don't quite understand, Keras expects you to also tell it how many batches there are per epoch, so the number of batches is equal to the size of the generator divided by the batch size. You can tell it how many epochs, just like in fast.ai, and you can say how many processes, or how many workers, to use for pre-processing.
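The batches-per-epoch arithmetic is simple enough to sketch. The image count and batch size below are example numbers only, and rounding up (so a final partial batch still counts) is an assumption of this sketch rather than something stated in the lesson.

```python
import math


def steps_per_epoch(n_images, batch_size):
    """Number of mini-batches needed to see every image once."""
    return math.ceil(n_images / batch_size)


# Example numbers only: 23,000 training images, batches of 64.
train_steps = steps_per_epoch(23000, 64)
print(train_steps)
```

Rounding down instead would silently skip the last few images in every epoch whenever the dataset size is not a multiple of the batch size.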

Unlike fast.ai, the default in Keras is basically not to use any, so to get good speed you've got to make sure you include this. And that's basically enough to start fine-tuning the last layers. As you can see, I got to a validation accuracy of 95%. But as you can also see, something really weird happened, where after one epoch it was like 49, and then it was 69, and then 95. I don't know why these are so low.

That's not normal. There may be a bug in Keras, or there may be a bug in my code. I reached out on Twitter to see if anybody could figure it out, but they couldn't. I guess this is one of the challenges with using something like this, and it's one of the reasons I wanted to use fast.ai for this course: it's much harder to screw things up. So I don't know if I screwed something up or somebody else did. Yes? This is using the TensorFlow back end, yeah. And if you want to run this to try it out yourself, you can just go pip install tensorflow-gpu keras, because it's not part of the fast.ai environment by default, but that should be all you need to do to get that working.

Then, there isn't a concept of layer groups or differential learning rates or partial unfreezing or whatever, so I had to print out all of the layers and decide manually how many I wanted to fine-tune. I decided to fine-tune everything from layer 140 onwards, so that's why I just looped through like this. After you change that, you have to recompile the model. After that I ran another step, and again, I don't know what happened here: the accuracy on the training set stayed about the same, but the validation set totally fell in a hole. I guess the main thing to note is there's a hell of a lot more code here, which is kind of annoying, but also the performance is very different.
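The "loop through the layers" step can be sketched without any framework. The Layer class below is a stand-in that only carries a trainable flag, the total of 175 layers is invented for illustration, and only the split point of 140 comes from the lesson.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Stand-in for a framework layer; only the trainable flag matters here."""
    name: str
    trainable: bool = True


def fine_tune_from(layers, split_idx):
    """Freeze layers before split_idx and make the rest trainable."""
    for layer in layers[:split_idx]:
        layer.trainable = False
    for layer in layers[split_idx:]:
        layer.trainable = True


# Pretend network with 175 layers (the count is made up for illustration).
layers = [Layer(f"layer_{i}") for i in range(175)]
fine_tune_from(layers, 140)
n_trainable = sum(layer.trainable for layer in layers)
print(n_trainable)
```

In Keras the flags only take effect once the model is compiled again, which is the recompile step mentioned above.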

Here, even on the training set, we're getting like 97% after four epochs, and that took a total of about eight minutes. Over here we had 99.5% on the validation set and it ran a lot faster, like four or five minutes. So, depending on what you do, particularly if you end up wanting to deploy stuff to mobile devices: at the moment, the PyTorch on mobile situation is very early, so you may find yourself wanting to use TensorFlow, or you may work for a company that's settled on TensorFlow. If you need to redo something you've learned here in TensorFlow, you probably want to do it with Keras, but just recognize it's going to take a bit more work to get there, and by default it's much harder to get the same state-of-the-art results you get with fast.ai. You'd have to replicate all of the state-of-the-art algorithms that are in fast.ai, so it's hard to get the same level of results, but you can see the basic ideas are similar. And it's certainly possible. There's nothing I'm doing in fast.ai that would be impossible, but you would have to implement stochastic gradient descent with restarts.

You would have to implement differential learning rates, and you would have to implement batch norm freezing, which you probably don't want to do. Well, that's not quite true: I think at least one person on the forum is attempting to create a Keras-compatible, or TensorFlow-compatible, version of fast.ai, which I hope will get there. I actually spoke to Google about this a few weeks ago, and they're very interested in getting fast.ai ported to TensorFlow, so maybe by the time you're looking at this on the MOOC, that will exist.

I certainly hope so. We will see. Anyway, Keras and TensorFlow are certainly not that difficult to handle, so I don't think you should worry if you're told you have to learn them after this course for some reason. It'll only take you a couple of days, I'm sure.

I'm sure So that's kind of most of the stuff you would need to Kind of complete this is kind of assignment from last week Which was like try to do everything you've seen already, but on the dog breeds data set and just to remind you The kind of last few minutes of last week's lesson I show you how to do much of that Including like how I actually explored the data to find out like what the classes were and how big the images were and stuff like That right so if you've forgotten that or didn't quite follow it all last week check out the video from last week to see One thing that we didn't talk about is how do you actually submit to Kaggle?

So how do you actually get predictions? I just wanted to show you that last piece as well, and on the wiki thread this week I've already put a little image of this to show you these steps. If you go to the Kaggle website, for every competition there's a section called Evaluation, and they tell you what to submit. I just copied and pasted these two lines from there. It says we're expected to submit a file where the first line contains the word id and then a comma-separated list of all of the possible dog breeds, and every line after that contains the ID itself followed by the probabilities of all the different dog breeds. So how do you create that?
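Before getting to how the lesson builds it, the target layout itself can be sketched with nothing but the standard library. The ids, the two breed names and the probabilities are made up, the lowercase id header is an assumption, and this is a format illustration rather than the approach used in the notebook.

```python
import csv
import io


def write_submission(out, ids, classes, probs):
    """Write a header row ('id' + class names), then one row of probabilities per id."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id"] + list(classes))
    for image_id, row in zip(ids, probs):
        writer.writerow([image_id] + list(row))


# Made-up ids, two made-up breed columns and made-up probabilities.
buf = io.StringIO()
write_submission(buf, ["abc", "def"], ["beagle", "pug"], [[0.9, 0.1], [0.25, 0.75]])
text = buf.getvalue()
print(text)
```

The real file has the same shape, only with 120 breed columns and one row per test image.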

I recognized that inside our data object there's a .classes, which has all of the classes in alphabetical order, and then inside data.test_ds you can also see all the file names. Just to remind you, dog breeds was not provided in the Keras-style format where the dogs and cats are in different folders. Instead it was provided as a CSV file of labels.

When you get a CSV file of labels, you use ImageClassifierData.from_csv rather than ImageClassifierData.from_paths. There isn't an equivalent in Keras, so you'll see on the Kaggle forums people share scripts for how to convert it to Keras-style folders. In our case we don't have to: we just go ImageClassifierData.from_csv, passing in that CSV file, and so the CSV file has automatically told the data what the classes are. We can also see from the folder of test images what the file names of those are. With those two pieces of information, we're ready to go.

I always think it's a good idea to use TTA. As you saw with that dogs and cats example just now, it can really improve things, particularly when your model is less good. So I can say learn.TTA, and if you pass in is_test=True, then it's going to give you predictions on the test set rather than the validation set. Obviously we can't now get an accuracy or anything, because by definition we don't know the labels for the test set.

By default, most PyTorch models give you back the log of the predictions, so we just have to go exp of that to get back our probabilities. In this case the test set had 10,357 images in it, and there are 120 possible breeds, so we get back a matrix of that size. We now need to turn that into something that looks like this, and the easiest way to do that is with pandas. If you're not familiar with pandas, there's lots of information online about it, or check out the Intro to Machine Learning course that we have, where we do lots of stuff with pandas.

but basically we can just go PD dot data frame and pass in that matrix and then we can say the names of the columns are equal to data dot classes and Then finally we can insert a new column at position zero called ID that contains the file names But you'll notice that the file names contain Five letters at the end with a start we don't want and four letters at the end.

We don't want so I just Subset in like so right so at that point I've got a data frame that looks like this Which is what we want so you can now Call data frame data. I should have used a DF not DS Let's fix it now data Frame Okay, so you can now call data frame to CSV and Quite often you'll find these files actually get quite big so it's a good idea to say compression equals G zip and that'll zip it up on the server for you and that's going to create a zipped up CSV file on the server on wherever you're running this Jupiter notebook, so you need absent You now need to get that back to your computer so you can upload it Or you can use Kaggle CLI so you can type KG submit and do it that way I?

generally download it to my computer, because I often like to double-check it all looks okay. To do that, there's a cool little thing called FileLink: if you run FileLink with a path on your server, it gives you back a URL which you can click on, and it'll download that file from the server onto your computer. So if I click on that now, I can go ahead and save it, and then I can see in my downloads, there it is, here's my submission file. If I open it, you can see it's exactly what I asked for: there's my id and the 120 different dog breeds, and here's my first row containing the file name and the 120 different probabilities. So then you can go ahead and submit that to Kaggle through their regular form. You can see we've now got a good way of grabbing any file off the internet and getting it to our AWS instance or Paperspace or whatever, by using the cool little extension in Chrome, and we've also got a way of grabbing stuff off our server easily. Those of you who are more command-line oriented can also use scp, of course, but I kind of like doing everything through the notebook. All right, one other question

I had during the week was: what if I want to get a prediction for just a single file? For example, maybe I want to get the first file from my validation set. So there's its name. You can always look at a file just by calling Image.open; that just uses the regular Python Imaging Library. I'll show you the shortest version: you can just call learn.predict_array, passing in your image. Now, the image needs to have been transformed. You've seen tfms_from_model before. Normally we just put it all in one variable, but actually, behind the scenes,

it was returning two things: training transforms and validation transforms, so I can split them apart. Here you can see I'm applying, for example, my training transforms, or, probably more likely, I want to apply validation transforms. That gives me back an array containing the transformed image, which I can then pass to predict_array. Everything that gets passed to or returned from our models is generally assumed to be a mini-batch, a bunch of images. We'll talk more about some numpy tricks later, but basically, in this case,

we only have one image, so we have to turn that into a mini-batch of images. In other words, we need to create a tensor that is not just rows by columns by channels, but number of images by rows by columns by channels, and it has one image, so it becomes a four-dimensional tensor. There's a cool little trick in numpy: if you index into an array with None, that adds an additional unit axis to the start, so it turns it from an image into a mini-batch of one image, and that's why we had to do that. So if you're trying to do things with a single image with any kind of PyTorch or fastai thing, and it says something like "expecting four dimensions, only got three", it probably means that. The same goes if you get back a return value from something that has some weird first axis.
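A minimal sketch of the None-indexing trick, using a blank array as a stand-in for a transformed image. The 224 by 224 size is an assumption for illustration, and the axis order simply follows the spoken description; the layout a particular library expects may differ.

```python
import numpy as np

# A single image: rows x columns x channels (a blank stand-in).
image = np.zeros((224, 224, 3), dtype=np.float32)

# Indexing with None adds a unit axis at the front,
# turning one image into a mini-batch containing one image.
batch = image[None]  # equivalent to image[np.newaxis]

print(image.shape)  # (224, 224, 3)
print(batch.shape)  # (1, 224, 224, 3)
```

Going the other way, batch[0] drops the unit axis again, which is handy when a function hands back a mini-batch of one.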

That's probably why: it's probably giving you back a mini-batch. We'll learn a lot more about this, but it's just something to be aware of. Okay, so that's everything you need to do in practice. Now we're going to get into a little bit of theory: what's actually going on behind the scenes with these convolutional neural networks? You might remember that back in lesson one we saw our first little bit of theory, which we stole from this fantastic website, setosa.io/ev, Explained Visually. We learned that a convolution is something where we have a little matrix (in deep learning, nearly always three by three) and we multiply every element of that matrix by the corresponding element of a three-by-three section of an image, and add them all together to get the result of that convolution at one point. Now let's see how that all gets put together to create these various layers that we saw in the Zeiler and Fergus paper. To do that, again, I'm going to steal from somebody who's much smarter than I am: a guy called Otavio Good. Otavio Good was the guy who created Word Lens, which nowadays is part of Google Translate. If on Google Translate you've ever done that thing where you point your camera at something

that has any kind of foreign language on it, and in real time it overlays it with the translation, that was Otavio's company that built that. He was kind enough to share this fantastic video he created (he's at Google now), and I want to step you through it, because I think it explains really well what's going on. Then, after we look at the video,

we're going to see how to implement an entire set of layers of a convolutional neural network in Microsoft Excel. So whether you're a visual learner or a spreadsheet learner, hopefully you'll be able to understand all this. We're going to start with an image. Something we're going to do later in the course is learn to recognize digits, and we'll do it end to end.

We'll do the whole thing. This is pretty similar: in this case we're going to try to recognize letters. So here's an A, which is actually a grid of numbers, and there's the grid of numbers. What we do is take our first convolutional filter; this is assuming that these are already learnt. And you can see this one:

it's got white down the right-hand side and black down the left, so it's something like 0 0 0, or maybe -1 -1 -1, then 0 0 0, then 1 1 1. We're taking each 3 by 3 part of the image and multiplying it by that 3 by 3 matrix, not as a matrix product but as an element-wise product. You can see that everywhere the white edge is matching the edge of the A and the black edge isn't, we're getting green, a positive, and everywhere it's the opposite,

we're getting a negative, a red. That's the first filter creating the result of the first kernel. Here's a new kernel; this one has a white stripe along the top. We literally scan it through every three-by-three part of the matrix, multiplying those nine bits of the A by the nine bits of the filter to find out whether it's red or green, and how red or green it is. So this is assuming we had two filters: one was a bottom edge, one was a left edge. And you can see here the top edge, not surprisingly, it's red here.

Sorry, the bottom edge was red here and green here, and the right edge red here and green here. In the next step we add a non-linearity, the rectified linear unit, which literally means throw away the negatives, so here the red's all gone. So here's layer one, the input; here's layer two, the result of two convolutional filters; here's layer three, which is throw away all of the red stuff, and that's called a rectified linear unit. Layer four is something called a max pool: we replace every two-by-two part of this grid with its maximum, so it's basically the same thing but half the size. Then we can go through and do exactly the same thing: we can have some new three-by-three filter that we put through each of the two results of the previous layer, and again we can throw away the red bits, so we get rid of all the negatives and just keep the positives.
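To make the green and red responses concrete, here is a small sketch of the element-wise multiply-and-sum for a single 3 by 3 patch. The kernel values are the "maybe -1, 0, 1" guess from the description, and the two patches are invented for illustration; they are not taken from the video.

```python
import numpy as np

# Kernel with "black" (-1) down the left and "white" (+1) down the right.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

# A patch that is dark on the left and bright on the right lines up with the kernel.
matching = np.array([[0, 0, 1],
                     [0, 0, 1],
                     [0, 0, 1]], dtype=float)
# The mirror image: bright on the left, dark on the right.
opposite = matching[:, ::-1]

def response(patch, kernel):
    """Element-wise product (not a matrix product), summed to one number."""
    return float((patch * kernel).sum())

green = response(matching, kernel)   # 3.0, a positive response
red = response(opposite, kernel)     # -3.0, a negative response

# Throwing away the negatives leaves the positive alone and zeroes the red.
print(max(0.0, green), max(0.0, red))  # 3.0 0.0
```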

That's called applying a rectified linear unit, and that gets us to our next layer of this convolutional neural network. You can see that at this layer back here it was very interpretable: we've either got bottom edges or left edges. But the next layer was combining the results of convolutions, so it's starting to become a lot less clear intuitively what's happening. It's doing the same thing, though. Then we do another max pool, so we replace every two-by-two or three-by-three section with a single number. So here, this two by two,

it's all black, so we replaced it with a black. Then we take that and compare it to a kind of template of what we would expect to see if it was an A, if it was a B, if it was a C,

if it was a D, if it was an E, and we see how closely it matches. We can do it in exactly the same way: we can multiply every one of the values in this four-by-eight matrix with every one of the four-by-eight in this one, and this one, and this one, and we just add them together to say how often does it match

versus how often does it not match, and then that can be converted to give us a percentage probability that this is an A. In this case this particular template matched well with A. Notice we're not doing any training here; this is how it would work if we had a pre-trained model. When we download a pre-trained ImageNet model off the internet and use it on an image without any changes to it, this is what's happening. Or if we take a model that you've trained and you're applying it to some test set or to some new image, this is what it's doing: it's basically taking it through.

It's applying multiple convolutional filters to each layer, then doing the rectified linear unit (so throwing away the negatives), then doing the max pool, and then repeating that a bunch of times. Then we can do it with a new letter A or letter B or whatever, and keep going through that process.

As you can see, that's a far nicer visualization than I could have created, because I'm not Otavio. So thanks to him for sharing this with us, because it's totally awesome. This is not done by hand: he actually wrote a piece of computer software to do these convolutions. It's actually being done dynamically.

It's pretty cool. I'm more of a spreadsheet guy personally; I'm a simple person. So here is the same thing, now in spreadsheet form. You'll find this in the GitHub repo, so you can either git clone the repo to your own computer to open up the spreadsheet, or you can just go to github.com/fastai and click on this. If you go to our repo and go to courses as usual, then dl1 as usual, you'll see there's an excel section there, and here they all are. You can download them by clicking them, or you can clone the whole repo. We're looking at conv-example, the convolution example. You can see I have here an input; in this case the input is the number seven. I grabbed this from a data set called MNIST, which we'll be looking at in a lot of detail. I just took one of those digits at random and put it into Excel, and you can see every pixel is just a number between nought and one. Very often it'll actually be a byte between nought and 255, or sometimes it might be a float between nought and one. It doesn't really matter; by the time it gets to PyTorch we're generally dealing with floats, so one of the steps we often take is to convert it to a number between nought and one. You can see I've just used conditional formatting in Excel to make the higher numbers more red, so you can clearly see that this is a seven, but it's just a bunch of numbers that have been imported into Excel. So here's our input. Remember, what Otavio did was then apply two filters with different shapes. So here

I've created a filter which is designed to detect top edges. This is a 3 by 3 filter, and I've got ones along the top, zeros in the middle, and minus ones at the bottom. Let's take a look at an example that's here. If I hit F2, you can see highlighted the 3 by 3 part of the input that this particular cell is calculated from. Here you can see 1, 1, 1 are all being multiplied by 1, and 0.1, 0, 0 are all being multiplied by negative 1. In other words, all the positive bits are getting a lot of positive, and the negative bits are getting nearly nothing at all, so we end up with a high number. Whereas on the other side of this bit of the seven,

you can see this is basically zeros here. Or, perhaps more interestingly, on the top of it here, we've got high numbers at the top, but we've also got high numbers at the bottom, which are negating it. So you can see that the only place we end up activating is where we're actually at an edge. In this case, this number three here is called an activation. When I say an activation, I mean a number that is calculated by taking some numbers from the input and applying some kind of linear operation, in this case a convolutional kernel, to calculate an output.

You'll notice that, other than going inputs multiplied by kernel and summing it together (here's my sum and here's my multiply), I then take that and go max of zero comma that, and that's my rectified linear unit. It sounds very fancy, rectified linear unit, but what they actually mean is: open up Excel and type =MAX(0, thing).
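The spreadsheet formula (multiply a 3 by 3 patch by the kernel, sum, then take max with zero) can be sketched in numpy as below. The kernel is the one described, with ones along the top, zeros in the middle, and minus ones at the bottom; the small test image is made up, and the helper names conv2d_valid and relu are just illustrative.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over every position where it fits fully inside the image."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise multiply the patch by the kernel, then sum.
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

def relu(x):
    """The spreadsheet's MAX(0, thing), applied to every cell."""
    return np.maximum(0, x)

kernel = np.array([[ 1,  1,  1],
                   [ 0,  0,  0],
                   [-1, -1, -1]], dtype=float)

# A made-up 7x6 image containing a solid block of ones, four rows thick.
image = np.zeros((7, 6))
image[1:5, 1:5] = 1.0

raw = conv2d_valid(image, kernel)
activations = relu(raw)
print(raw)
print(activations)
```

Inside the solid block the high numbers in the top row of the window are cancelled by high numbers in the bottom row, so those outputs are zero; the kernel only responds where bright pixels sit above dark ones, and the negative responses are removed by the max with zero.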

That's all it is, and you'll see people in the biz say ReLU. So ReLU means rectified linear unit, which means max(0, thing). And I'm not simplifying it; I really mean it. If I'm simplifying, I always say I'm simplifying, but if I'm not saying I'm simplifying, that's the entirety.

So a rectified linear unit in its entirety is this, and a convolution in its entirety is this. A single layer of a convolutional neural network is being implemented in its entirety here in Excel. You can see what it's done is delete pretty much the vertical edges and highlight the horizontal edges. Again, this is assuming that our network is trained, and that at the end of training it had created a convolutional filter with these specific nine numbers in it. And here is a second convolutional filter; it's just a different nine numbers. Now, PyTorch doesn't store them as two separate nine-number arrays. It stores them as a tensor.

Remember, a tensor just means an array with more dimensions. You can use the word array as well; it's the same thing, but in PyTorch they always use the word tensor, so I'm going to say tensor. So it's just a tensor with an additional axis, which allows us to stack each of these filters together. A filter and a kernel pretty much mean the same thing.

It refers to one of these three-by-three matrices, or one of these three-by-three slices of a three-dimensional tensor. So if I take this one (here I've literally just copied the formulas in Excel from above), you can see this one is now finding a vertical edge, as we would expect.
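As a rough sketch of what "a tensor with an additional axis" looks like, the two 3 by 3 filters can be stacked into one array of shape (2, 3, 3), and the layer's output then gets a matching axis of size 2. The second filter's values, the random 28 by 28 input, and the function name conv_layer are assumptions for illustration, not the spreadsheet's actual numbers.

```python
import numpy as np

top_edge = np.array([[ 1,  1,  1],
                     [ 0,  0,  0],
                     [-1, -1, -1]], dtype=float)
# A second, vertical-edge filter (here simply the transpose; the values are assumed).
side_edge = top_edge.T

# Stack the two 3x3 filters along a new first axis: one tensor, shape (2, 3, 3).
filters = np.stack([top_edge, side_edge])

# A random stand-in for a 28x28 single-channel digit image.
image = np.random.default_rng(0).random((28, 28))

def conv_layer(image, filters):
    """Apply each 3x3 slice of `filters` to the image, then ReLU."""
    n, kh, kw = filters.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((n, out_h, out_w))
    for f in range(n):
        for i in range(out_h):
            for j in range(out_w):
                out[f, i, j] = (image[i:i + kh, j:j + kw] * filters[f]).sum()
    return np.maximum(0, out)

hidden = conv_layer(image, filters)
print(filters.shape)  # (2, 3, 3)
print(hidden.shape)   # (2, 26, 26)
```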

Okay, so we've now created one layer. This here is a layer, and specifically we'd say it's a hidden layer, which means it's not an input layer and it's not an output layer; everything else is a hidden layer. This particular hidden layer has size 2 on this dimension, because it has two filters, two kernels. So what happens next? Well, let's do another one. As we go along, things can multiply a little bit in complexity, because my next filter is going to have to contain two of these three-by-threes: I'm going to have to say how do I want to

weight these three things, and at the same time, how do I want to weight the corresponding three things down here? Because in PyTorch this whole thing here is going to be stored as a multi-dimensional tensor, you shouldn't really think of this as two three-by-three kernels, but as one two-by-three-by-three kernel. To calculate this value here, I've got the sum product of all of that, plus (scroll down) the sum product of all of that. So the top ones are being multiplied by this part of the kernel, and the bottom ones are being multiplied by this part of the kernel. Over time, you want to start to get very comfortable with the idea of these higher-dimensional

linear combinations. It's harder to draw on the screen (I had to put one above the other), but conceptually just stack it in your mind like this; that's really how you want to think. Actually, Geoffrey Hinton, in his original 2012 neural nets Coursera class, has a tip, which is how all computer scientists deal with very high-dimensional spaces: they just visualize the two-dimensional space and then say "12 dimensions" really fast in their head lots of times. So that's it: we can see two dimensions on the screen, and then you've just got to trust that you can have more dimensions. There's nothing different about the concepts. Excel doesn't have the ability to handle three-dimensional tensors, so I had to say: take this two-dimensional dot product and add on this two-dimensional dot product. But if there was some kind of 3D Excel, I could have just done that in a single formula. Then again I apply max(0, ...), otherwise known as rectified linear unit, otherwise known as ReLU. So here is my second layer. When people create different architectures, an architecture means things like: how big is your kernel at layer one, and how many filters are in your kernel at layer one? So here

I've got a 3 by 3, that's number one, and a 3 by 3, that's number two. So this architecture I've created starts off with two three-by-three convolutional kernels, and then my second layer has another two kernels, of size two by three by three. There's the first one, and then down here

here's the second two-by-three-by-three kernel. Remember, any one of these numbers is an activation. This activation is being calculated from these three things here and the other three things up there, using this two-by-three-by-three kernel. What tends to happen is people give names to their layers, so let's call this layer here conv1 and this layer here conv2. Generally, when you print out a summary of a network, every layer will have some kind of name. So then what happens next?
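Here is a hedged sketch of the second layer: each kernel has shape 2 by 3 by 3, and each output activation is the sum product over the top channel plus the sum product over the bottom channel, followed by max with zero. The random values, the 26 by 26 size of the conv1 output, and the names conv1_out, conv2_kernels, and conv_multi are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the conv1 output: two channels (one per first-layer filter).
conv1_out = rng.random((2, 26, 26))

# Two second-layer kernels, each of size 2 x 3 x 3:
# axes are (number of filters, input channels, rows, columns).
conv2_kernels = rng.normal(size=(2, 2, 3, 3))

def conv_multi(x, kernels):
    """Multi-channel convolution followed by ReLU."""
    n_out, n_in, kh, kw = kernels.shape
    out_h = x.shape[1] - kh + 1
    out_w = x.shape[2] - kw + 1
    out = np.empty((n_out, out_h, out_w))
    for f in range(n_out):
        for i in range(out_h):
            for j in range(out_w):
                patch = x[:, i:i + kh, j:j + kw]      # shape (n_in, 3, 3)
                out[f, i, j] = (patch * kernels[f]).sum()
    return np.maximum(0, out)

conv2_out = conv_multi(conv1_out, conv2_kernels)

# The spreadsheet's version of one cell: two 2-D sum products added together.
top = (conv1_out[0, :3, :3] * conv2_kernels[0, 0]).sum()
bottom = (conv1_out[1, :3, :3] * conv2_kernels[0, 1]).sum()
one_cell = max(0.0, top + bottom)

print(conv2_out.shape)  # (2, 24, 24)
```

The single 3-D sum inside conv_multi is the "3D Excel" formula; one_cell is the same number computed the two-step spreadsheet way.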

Well, part of the architecture is: do you have some max pooling, and whereabouts does that max pooling happen? In this architecture we're inventing, the next step is max pooling. Max pooling is a little hard to show in Excel, but we've got it. If I do a two-by-two max pooling, it's going to halve the resolution, both height and width. You can see here that I've replaced these four numbers with the maximum of those four numbers. Because I'm halving the resolution, it only makes sense to have something every two cells. So you can see here the way

I've got the same-looking shape as I had back here, but it's now half the resolution, because I've replaced every two by two with its max. And you'll notice it's not every possible two by two; I skip over from here. So this one is starting at BQ, and then the next one starts at BS, so they're non-overlapping.
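A small sketch of non-overlapping 2 by 2 max pooling, using a reshape so that each 2 by 2 block gets its own pair of axes. The input values are invented, and the helper name max_pool_2x2 is hypothetical.

```python
import numpy as np

def max_pool_2x2(x):
    """Replace every non-overlapping 2x2 block with its maximum."""
    h, w = x.shape
    assert h % 2 == 0 and w % 2 == 0, "this sketch assumes even height and width"
    # After the reshape, axes 1 and 3 index the position inside each 2x2 block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 2, 0, 1],
              [3, 4, 1, 0],
              [0, 0, 5, 6],
              [0, 1, 7, 8]], dtype=float)

pooled = max_pool_2x2(x)
print(pooled)
# [[4. 1.]
#  [1. 8.]]
```

Because the blocks start at every second row and every second column, a 4 by 4 input comes out as 2 by 2.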

That's why it's decreasing the resolution. So anybody who's comfortable with spreadsheets can open this and have a look. After our max pooling, there are a number of different things we could do next, and I'm going to show you a classic, old-style approach. What generally happens nowadays is we do a max pool where we max across the entire size. But on older architectures, and also on all the structured data stuff we do, we actually do something called a fully connected layer. So here's a fully connected layer: I'm going to take every single one of these activations and give every single one of them a weight. Then over here is the sum product of every one of the activations by every one of the weights, for both of the two levels of my three-dimensional tensor. This is called a fully connected layer. Notice

it's different to a convolution: I'm not going through a few at a time; I'm creating a really big weight matrix. Rather than having a couple of little three-by-three kernels, my weight matrix is now as big as the entire input. As you can imagine, architectures that make heavy use of fully connected layers can have a lot of weights, which means they can have trouble with overfitting, and they can also be slow. You're going to see a lot of an architecture called VGG, because it was the first successful deeper architecture. It has up to 19 layers, and VGG contains a fully connected layer with 4,096 activations connected to a hidden layer with 4,096 activations, so you've got 4,096 by 4,096, multiplied by, remember, the number of kernels that we've calculated. In VGG there are, I think, something like 300 million weights, of which something like 250 million are in these fully connected layers. We'll learn later in the course how we can avoid using these big fully connected layers; behind the scenes, all the stuff you've seen us using, like ResNet and ResNeXt, none of them use very large fully connected layers. Yannet, you had a question? "Can you tell us more about, for example, if we had three channels of input, what would be the shape of these filters?" That's a great question. If we had three channels of input, it would look exactly like conv1. Conv1 has two channels, and you can see that with conv1

we had two channels, so our filters had to have two channels per filter. You could imagine that this input didn't exist and that this was actually the input. When you have a multi-channel input, it just means that your filters look like this. Images are often full color, so they have three channels, red, green, and blue; sometimes

they also have an alpha channel. However many you have, that's how many input channels your filters need. Something I know Yannet was playing with recently was using a full-color ImageNet model in medical imaging, for something called bone age calculation, which has a single channel. What she did was take the single-channel input and make three copies of it, so you end up with one, two, three versions of the same thing. It's not ideal, since it's redundant information that we don't quite want, but it does mean that if you had something that expected a three-channel convolutional filter, you can use it. At the moment

there's a Kaggle competition for iceberg detection using some funky satellite-specific data format that has two channels. Here's how you could do that: you could either copy one of those two channels into the third channel, or, I think what people on Kaggle are doing, take the average of the two. Again, it's not ideal, but it's a way that you can use pre-trained networks. I've done a lot of fiddling around like that. I've also done things where I wanted to use a three-channel ImageNet network on four-channel data.

I had satellite data where the fourth channel was near infrared, so I added an extra level to my convolutional kernels that was all zeros, so it started off by ignoring the near-infrared band. What happens, and you'll see this next week, is that rather than having these carefully trained filters, when you're training something from scratch we start with random numbers. That's actually what we do: we start with random numbers, and then we use this thing called stochastic gradient descent, which we've seen conceptually, to slightly improve those random numbers to make them less random, and we do that again and again and again. Okay, great.
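The three channel workarounds mentioned here can be sketched with plain arrays. This is only an illustration under assumed shapes (64 by 64 images, eight first-layer filters) with random numbers standing in for real images and pre-trained kernels; it is not how any particular library loads weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1) Single-channel image fed to a three-channel model: make three copies.
gray = rng.random((1, 64, 64))                 # (channels, rows, columns)
rgb_like = np.repeat(gray, 3, axis=0)          # shape (3, 64, 64)

# 2) Two-channel image: use the average of the two as a third channel.
two_band = rng.random((2, 64, 64))
third = two_band.mean(axis=0, keepdims=True)
three_band = np.concatenate([two_band, third], axis=0)   # shape (3, 64, 64)

# 3) Four-channel image with three-channel kernels: add an all-zero slice
#    to each kernel so the extra band is ignored to begin with.
kernels_rgb = rng.normal(size=(8, 3, 3, 3))    # (filters, in channels, 3, 3)
zeros = np.zeros((8, 1, 3, 3))
kernels_4ch = np.concatenate([kernels_rgb, zeros], axis=1)  # shape (8, 4, 3, 3)

# With the zero slice, a 4-channel patch gives the same response
# as its first three channels did with the original kernel.
patch = rng.random((4, 3, 3))
before = (patch[:3] * kernels_rgb[0]).sum()
after = (patch * kernels_4ch[0]).sum()
print(np.isclose(before, after))  # True
```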

Let's take a seven-minute break, and we'll come back at 7:50. All right, so what happens next? We've got as far as doing a fully connected layer: the results of our max pooling layer got fed to a fully connected layer. Those of you who remember your linear algebra might notice that the fully connected layer is actually doing a classic, traditional matrix product. It's just going through each pair in turn, multiplying them together, and adding them up. In practice, if we want to calculate which one of the 10 digits we're looking at, this single number we've calculated isn't enough. We would actually calculate 10 numbers, so rather than just having one set of fully connected weights like this (and I say set because, remember,

there's a whole 3D tensor of them), we would actually need 10 of those. You can see that these tensors start to get a little bit high-dimensional, and this is where my patience with doing it in Excel ran out. But imagine that I had done this 10 times: I could now have 10 different numbers all being calculated here using exactly the same process. It would just be 10 of these fully connected 2 by n by n arrays, basically. Then we would have 10 numbers being spat out. So what happens next?
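A sketch of the fully connected layer as described: one weight per activation, summed to a single number, and then ten such sets of weights to get ten numbers. The 2 by 12 by 12 size of the pooled activations and all the values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the max-pooled activations: 2 channels of 12 x 12 (assumed size).
pooled = rng.random((2, 12, 12))

# One set of fully connected weights: one weight per activation.
weights_one = rng.normal(size=pooled.shape)
single_output = (pooled * weights_one).sum()

# Ten sets of weights give ten numbers, one per digit.
weights = rng.normal(size=(10, 2, 12, 12))
outputs = (weights * pooled).sum(axis=(1, 2, 3))     # shape (10,)

# The same thing written as a classic matrix product on flattened arrays.
outputs_matmul = weights.reshape(10, -1) @ pooled.reshape(-1)

print(outputs.shape)                          # (10,)
print(np.allclose(outputs, outputs_matmul))   # True
```

Flattening shows why the weight count grows quickly: here each of the ten outputs needs 2 x 12 x 12 = 288 weights of its own.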

Next up, we can open a different Excel workbook, entropy_example.xlsx, which has two different worksheets; one of them is called softmax. What happens here? Sorry, I've changed domains: rather than predicting whether it's a number from nought to nine, I'm going to predict whether something is a cat, a dog, a plane, a fish, or a building. Out of that fully connected layer, in this case,

we'd have five numbers. Notice that at this point there's no ReLU; in the last layer there's no ReLU, so I can have negatives. I want to turn these five numbers each into a probability, from nought to one, that it's a cat, that it's a dog,

that it's a plane, that it's a fish, that it's a building. And I want those probabilities to have a couple of characteristics. The first is that each of them should be between zero and one, and the second is that together they should add up to one; it's definitely one of these five things. To do that, we use a different kind of activation function. What's an activation function? An activation function is a function that is applied to activations.

For example, max(0, something) is a function that I applied to an activation. An activation function always takes in one number and spits out one number: max(0, x) takes in a number x and spits out some different number. That's all an activation function is. If you remember back to that PowerPoint we saw in lesson one, each of our layers was just a linear function, and after every layer we said we needed some non-linearity, because if you stack a bunch of linear layers together, all you end up with is a linear layer. If somebody's talking, could you not? It's slightly distracting.

Thank you. If you stack a number of linear functions together, you just end up with a linear function, and nobody does any cool deep learning with just linear functions. But remember, we also learned that by stacking linear functions with a non-linearity in between each one, we could create arbitrarily complex shapes. The non-linearity that we're using after every hidden layer is a ReLU, a rectified linear unit. A non-linearity is an activation function; an activation function is a non-linearity, within deep learning.

Obviously there are lots of other non-linearities in the world, but in deep learning this is what we mean. An activation function is any function that takes some activation in (that's a single number) and spits out some new activation, like max(0, x). I'm now going to tell you about a different activation function.

It's slightly more complicated than ReLU, but not too much. It's called softmax. Softmax only ever occurs in the final layer, at the very end, and the reason is that softmax, as an activation function, always spits out numbers between 0 and 1, and it always spits out a bunch of numbers that add to one. So a softmax gives us what we want.

In theory, this isn't strictly necessary: we could ask our neural net to learn a set of kernels which give probabilities that line up as closely as possible with what we want. But in general with deep learning, if you can construct your architecture so that the desired characteristics are as easy to express as possible, you'll end up with better models: they'll learn more quickly, with fewer parameters. In this case, we know that our probabilities should end up being between 0 and 1, and we know that they should end up adding to one. So if we construct an activation function which always has those features, we're going to make our neural network do a better job.

It's going to make it easier for it: it doesn't have to learn to do those things, because they all happen automatically. To make this work, we first of all have to get rid of all of the negatives; we can't have negative probabilities. One way to make things not be negative is to just go e to the power of.

So here you can see my first step is to go exp of the previous one. I think I've mentioned this before, but of all the math that you just need to be super familiar with to do deep learning, the one you really need is logarithms and exps. In all of deep learning and all of machine learning, they appear all the time.

For example, you absolutely need to know that log(x times y) equals log(x) plus log(y), and not just know that that's a formula that exists, but have a sense of what it means. Why is that interesting? Oh, I can turn multiplications into additions.

That could be really handy. And therefore log(x / y) equals log(x) minus log(y). Again, that's going to come in pretty handy: rather than dividing, I can just subtract things. Also remember that if I've got log(x) equals y, then that means e to the y equals x; in other words, log and e-to-the are the inverse of each other. Again, you need to really understand these things. If you haven't spent much time with logs and exps for a while, try plotting them in Excel or a little notebook, get a sense of what shape they are and how they combine together, and just make sure you're really comfortable with them.
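In the spirit of trying these out in a little notebook, the three identities can be checked numerically. The values 4 and 2.8 are arbitrary, and natural logarithms are assumed throughout.

```python
import math

x, y = 4.0, 2.8

# log turns a multiplication into an addition ...
product_rule = math.isclose(math.log(x * y), math.log(x) + math.log(y))

# ... and a division into a subtraction.
quotient_rule = math.isclose(math.log(x / y), math.log(x) - math.log(y))

# log and exp are inverses of each other: if log(x) = y then e**y = x.
inverse_rule = math.isclose(math.exp(math.log(x)), x)

print(product_rule, quotient_rule, inverse_rule)  # True True True
```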

So we're using it here. One of the things we know is that e to the power of something is positive, so that's great. The other thing you'll notice about e to the power of something is that, because it's a power, numbers that are slightly bigger than other numbers (like 4 is a little bit bigger than 2.8), when you go e to the power of it, it really accentuates that difference. We're going to take advantage of both of these features for the purpose of deep learning.

Okay, so we take the results of this fully connected layer, we go e to the power of for each of them, and then we're going to add them up. So here is the sum of e to the power of. So then here, we're going to take e to the power of, divided by the sum of e to the power of. If you take all of these things divided by their sum, then by definition all of those things must add up to 1. And furthermore, since they're always positive and we're dividing by their sum, they must always vary between 0 and 1. Alright, and that's it.
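As a minimal sketch of the steps just described (exponentiate, sum, divide), assuming NumPy and made-up activation values rather than the spreadsheet used in the lesson:

```python
import numpy as np

def softmax(activations):
    """Exponentiate each activation, then divide by the sum of the exponentials."""
    exps = np.exp(activations)      # e to the power of each value: always positive
    return exps / exps.sum()        # divide by the sum so the results add up to 1

# Hypothetical outputs of a fully connected layer for five classes
# (cat, dog, plane, fish, building); the values are made up for illustration.
activations = np.array([-3.0, 4.0, 2.8, -1.5, 0.5])
probs = softmax(activations)
print(probs.round(3))
print(probs.sum())
```

Library implementations usually subtract the largest activation before exponentiating, to avoid overflow; that step is left out here to keep the sketch close to the description.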

So that's what softmax is. I've got this doing random numbers each time, and you can see as I look through, my softmax generally has quite a few things that are so close to 0 that they round down to 0, and maybe one thing that's nearly 1. The reason for that is what we just talked about: with the exp, just having one number a bit bigger than the others tends to push it out further.

So even though my inputs here are random numbers between negative 5 and 5, my outputs from the softmax don't really look that random at all, in the sense that they tend to have one big number and a bunch of small numbers. And that's what we want.

We want to say, in terms of "is this a cat, a dog, a plane, a fish, or a building", we really want it to say "it's a dog" or "it's a plane", not "I don't know". So softmax has lots of these cool properties: it's going to return a probability that adds up to one, and it's going to tend to want to pick one thing particularly strongly. Okay, so that's softmax. Yannet?

Could you pass that up?

How would we do something like that if, let's say, you have an image and you want to categorize it as both cat and dog, or as multiple things? What kind of function would we try to use?

It so happens we're going to do that right now. So we have to think about why we might want to do that, and one reason we might want to do that is to do multi-label classification. So we're looking now at the lesson 2 image models notebook, and specifically we're going to take a look at the Planet competition, a satellite imaging competition. Now the satellite imaging competition has some similarities to stuff we've seen before. Before, we've seen cat versus dog, and those images are a cat or a dog. They're not neither.

They're not both. But the satellite imaging competition has images that look like this, and in fact every single one of the images is classified by weather. There are four kinds of weather, one of which is haze and another of which is clear. In addition to which, there is a list of features that may be present, including agriculture, which is some cleared area used for agriculture; primary, which means primary rainforest; and water, which means a river or a creek. So here is a clear-day satellite image showing some agriculture, some primary rainforest, and some water features. And here's one which is in haze and is entirely primary rainforest.

So in this case we're going to want to be able to predict multiple things, and so softmax wouldn't be good, because softmax doesn't like predicting multiple things. I would definitely recommend anthropomorphizing your activation functions. They have personalities, and the personality of the softmax is that it wants to pick a thing. People forget this all the time.

I've seen many people, even well-regarded researchers in famous academic papers, using softmax for multi-label classification. It happens all the time, and it's kind of ridiculous, because they're not understanding the personality of their activation function. So for multi-label classification, where each sample can belong to one or more classes, we have to change a few things. But here's the good news: in fastai, we don't have to change anything. fastai will look at the labels in the CSV, and if there is more than one label ever for any item, it will automatically switch into multi-label mode. So I'm going to show you how it works behind the scenes, but the good news is you don't actually have to care. It happens anyway.

If you have multi-label images, multi-label objects, you obviously can't use the classic Keras-style approach where things are in folders, because something can't conveniently be in multiple folders at the same time. So that's why you basically have to use the from_csv approach. So if we look at an example... actually, I'll take you through it. We can say: this is the CSV file containing our labels. This looks exactly the same as it did before, but rather than side-on transforms, it's top-down. With top-down, I've mentioned before that it can do vertical flips. It actually does more than that. There are actually eight possible symmetries for a square, which is: it can be rotated through 90, 180, 270, or 0 degrees.

And for each of those, it can be flipped. If you think about it for a while, you'll realize that that's a complete enumeration of everything that you can do in terms of symmetries to a square. It's called the dihedral group of order eight. So if you see in the code, there's actually a transform called dihedral.
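The eight symmetries can be enumerated directly. This is a small illustration with NumPy on a tiny array, not the fastai dihedral transform itself; the helper name dihedral_symmetries is made up for this sketch:

```python
import numpy as np

def dihedral_symmetries(image):
    """Return the 8 symmetries of a square array: 4 rotations, each with and without a flip."""
    results = []
    for quarter_turns in range(4):              # 0, 90, 180, 270 degrees
        rotated = np.rot90(image, quarter_turns)
        results.append(rotated)                 # the rotation on its own
        results.append(np.fliplr(rotated))      # the same rotation, then a left-right flip
    return results

# A tiny 2x2 "image" with four distinct values, so every symmetry looks different.
image = np.array([[1, 2],
                  [3, 4]])
symmetries = dihedral_symmetries(image)
distinct = {s.tobytes() for s in symmetries}
print(len(symmetries), len(distinct))
```

Because all four values differ, no two of the eight results coincide, which is a quick way to convince yourself that the enumeration is complete and has no repeats.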

That's why it's called that. So these transforms will basically do the full set of eight dihedral rotations and flips, plus everything which we can do to dogs and cats: small 10-degree rotations, a little bit of zooming, a little bit of contrast and brightness adjustment. These images are of size 256 by 256, so I just created a little function here to let me quickly grab a data loader of any size. So here's a 256 by 256. Once you've got a data object, inside it we've already seen that there are things called val_ds, test_ds, trn_ds. They're things that you can just index into and grab a particular image, so you just use square brackets zero. You'll also see that all of those things have a dl.

That's a data loader. So ds is dataset, dl is data loader. These are concepts from PyTorch, so if you google "PyTorch dataset" or "PyTorch data loader" you can see what it means. The basic idea is that a dataset gives you a single image or a single object back, and a data loader gives you back a mini-batch, and specifically it gives you back a transformed mini-batch. That's why, when we create our data object, we can pass in num_workers and transforms: how many processes do you want to use, and what transforms do you want?

With a data loader you can't ask for an individual image. You can only get back a mini-batch, and you can't get back a particular mini-batch; you can only get back the next mini-batch. So what you do is loop through, grabbing a mini-batch at a time. In Python, the thing that does that is called a generator, or an iterator; they're slightly different versions of the same thing. To turn a data loader into an iterator you use the standard Python function called iter. That's just a regular part of the basic Python language, and it returns an iterator. An iterator is something you can pass to the standard Python function next, and that just says: give me another batch from this iterator.

This is one of the things I really like about PyTorch: it really leverages modern Python. In TensorFlow they invent their whole new world of ways of doing things, so in a sense it's more cross-platform, but in another sense it's not a good fit to any platform. So it's nice: if you know Python well, PyTorch comes very naturally, and if you don't know Python well, PyTorch is a good reason to learn Python well. A PyTorch nn.Module (neural network module) is a standard Python class, for example. So any work you put into learning Python better will pay off with PyTorch.

So here I am using standard Python iter and next to grab my next mini-batch from the validation set's data loader, and that's going to return two things: the images in the mini-batch and the labels in the mini-batch. Standard Python approach: I can pull them apart like so. And so here is one mini-batch of labels. Not surprisingly, since my batch size... actually, the batch size by default is 64, and I didn't pass in a batch size. Just remember Shift+Tab to see what things you can pass and what the defaults are. So by default my batch size is 64, and I've got back something of size 64 by 17, so there are 17 possible classes.

So let's take a look at the zeroth set of labels, the zeroth image's labels. I can zip them; again, a standard Python thing. It takes two lists and combines them, so you get the zeroth thing from the first list with the zeroth thing from the second list, the first thing from the first list with the first thing from the second list, and so forth. So I can zip them together, and that way I can find out, for the zeroth image in the validation set: it's agriculture. It's clear. It's primary rainforest.

It's slash and burn. It's water. So as you can see here, this is multi-label; here's a way to do multi-label classification. By the same token, if we go back to our single-label classification (it's a cat, dog, plane, fish, or building), behind the scenes, though we haven't actually looked at it, fastai and PyTorch are turning our labels into something called one-hot encoded labels. So if it was actually a dog, then the actual values would be like that. These are the actuals. Okay, so do you remember at the very end of Otavio's video?
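The mini-batch walkthrough above (iter, next, pulling the pair apart, then zip) can be sketched in plain Python. This is only an illustration: mini_batches is a hypothetical generator standing in for a data loader, and the class names and label rows are made up rather than taken from the Planet data.

```python
def mini_batches(images, labels, batch_size):
    """A generator standing in for a data loader: yields (images, labels) one mini-batch at a time."""
    for start in range(0, len(images), batch_size):
        yield images[start:start + batch_size], labels[start:start + batch_size]

# Hypothetical stand-in data: 6 "images" (just names) and their 0/1 label rows.
classes = ["agriculture", "clear", "haze", "primary", "water"]
images = ["img0", "img1", "img2", "img3", "img4", "img5"]
labels = [
    [1, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 1, 1],
]

batch_iterator = iter(mini_batches(images, labels, batch_size=4))  # turn it into an iterator
x, y = next(batch_iterator)         # grab the next mini-batch and pull the pair apart

# zip pairs each class name with the matching 0/1 flag for the zeroth image in the batch.
zeroth_image_labels = [name for name, flag in zip(classes, y[0]) if flag == 1]
print(len(x), zeroth_image_labels)
```

Calling next(batch_iterator) again would return the remaining two items; there is no way to ask this iterator for a particular batch, which mirrors the data loader behaviour described above.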

He showed how the template had to match one of the five A, B, C, D, or E templates. What it's actually doing is comparing. When I said it's basically doing a dot product, it's actually a fully connected layer at the end that calculates an output activation, which goes through a softmax, and then the softmax is compared to the one-hot encoded label. So if it was a dog, there would be a one here. Then we take the difference between the actuals and the softmax activations, and add up those differences to say how much error there is, essentially.

We're skipping over something called a loss function that we'll learn about next week, but essentially we're doing that. Now, if it's one-hot encoded, if there's only one thing which has a one in it, then actually storing it as 0 1 0 0 0 is terribly inefficient. We can basically say: what is the index of each of these things?

So we can say it's 0, 1, 2, 3, 4, like so, and rather than storing it as 0 1 0 0 0, we actually just store the index value. So if you look at the y values for the cats and dogs competition, or the dog breeds competition, you won't actually see a big list of ones and zeros like this.
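The relationship between the two encodings is easy to check by hand. A small sketch, assuming NumPy and the five example classes from earlier in the lesson:

```python
import numpy as np

classes = ["cat", "dog", "plane", "fish", "building"]

# One-hot encoded "actuals" for a dog: a 1 in the dog position, 0 everywhere else.
one_hot = np.array([0, 1, 0, 0, 0])

# The compact form: just store the position of the 1.
class_index = int(one_hot.argmax())

# Going back the other way rebuilds the one-hot vector from the index.
rebuilt = np.zeros(len(classes), dtype=int)
rebuilt[class_index] = 1

print(class_index, classes[class_index], rebuilt)
```

The index form only works when exactly one entry is 1, which is why the multi-label case keeps the full row of ones and zeros.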

You'll see a single integer, which is the class index. Internally, inside PyTorch, it will actually turn that into a one-hot encoded vector, but you will literally never see it. PyTorch has different loss functions where you basically say this thing is one-hot encoded or this thing is not, and it uses different loss functions. That's all hidden by the fastai library, so you don't have to worry about it. But the cool thing to realize is that this approach for multi-label encoding with these ones and zeros, behind the scenes the exact same thing happens for single-label classification.

Does it make sense to change the pickiness of the softmax function by changing the base?

No, because when you change the base (more math): log base a of b equals log b over log a. So changing the base is just a linear scaling, and a linear scaling is something which the neural net can learn very easily. Good question.

Okay, so here is that image; here is the image with slash and burn, water, etc. One of the things to notice here is that when I first displayed this image, it was so washed out I really couldn't see it. But remember, we now know images are just matrices of numbers, and so you can see here I just said times 1.4, just to make it more visible. The kind of thing I want you to get familiar with is the idea that this stuff you're dealing with is just matrices of numbers, and you can fiddle around with them. So if you're looking at something and it's a bit washed out, you can just multiply it by something to brighten it up a bit. Okay, so here we can see: I guess this is the slash and burn, here's the river (that's the water), here's the primary rainforest, maybe that's the agriculture, and so forth.

So with all that background, how do we actually use this? Exactly the same way as everything we've done before. The interesting thing about playing around with this Planet competition is that these images are not at all like ImageNet. And I would guess that the vast majority of the stuff that the vast majority of you do involving convolutional neural nets won't actually be anything like ImageNet. It'll be medical imaging, or it'll be classifying different kinds of steel tube, or figuring out whether a weld is going to break or not, or looking at satellite images, or whatever.

So it's good to experiment with stuff like this Planet competition to get a sense of what you want to do. You'll see here I start out by resizing my data to 64 by 64. It starts out at 256 by 256. Now, I wouldn't want to do this for the cats and dogs competition, because in the cats and dogs competition we start with a pre-trained ImageNet network.

It starts off nearly perfect. So if we resized everything to 64 by 64 and then retrained the whole set, we'd basically destroy the weights that are already pre-trained to be very good. Remember, most ImageNet models are trained at either 224 by 224 or 299 by 299, so if we retrain them at 64 by 64, we're going to kill it. On the other hand, there's nothing in ImageNet that looks anything like this. There are no satellite images. So the only useful bits of the ImageNet network for us are layers like this one, finding edges and gradients, and this one, finding textures and repeating patterns, and maybe these ones, finding more complex textures, but that's probably about it. In other words, starting out by training on very small images works pretty well when you're using stuff like satellites.

So in this case I started right back at 64 by 64, grabbed some data, built my model, and found out what learning rate to use. Interestingly, it turned out to be quite high. It seems that because it's so unlike ImageNet, I needed to do quite a bit more fitting with just that last layer before it started to flatten out. Then I unfroze it, and again, this is the difference to ImageNet-like datasets: my learning rate in the initial layers I set to divided by 9, and the middle layers I set to divided by 3, whereas for stuff that's like ImageNet I had a multiple of 10 for each of those. Again, the idea being that the earlier layers probably are not as close to what they need to be, compared to the ImageNet-like datasets. So again: unfreeze, train for a while, and you can see here there's cycle one, there's cycle two, there's cycle three. Then I doubled the size of my images, fit for a while, unfroze, fit for a while, doubled the size of the images again, fit for a while, unfroze, fit for a while, and then added TTA. As I mentioned last time we looked at this, this process ends up getting us about 30th place in this competition, which is really cool, because a lot of very, very smart people just a few months ago worked very, very hard on this competition.

A couple of things people have asked about. One is: what does this data.resize do? There are a couple of different pieces here. The first is that when we say, back here, what transforms do we apply (and here are our transforms), we actually pass in a size.

So one of the things that the data loader does is to resize the images on demand, every time it sees them. This has got nothing to do with that .resize method. This is the thing that happens at the end: whatever's passed in, before our data loader spits it out, it's going to resize it to this size. If the initial input is a thousand by a thousand, reading that JPEG and resizing it to 64 by 64 turns out to actually take more time than training the convnet does for each batch. So basically all resize does is say: hey, I'm not going to be using any images bigger than size times 1.3, so just go through once and create new JPEGs of this size. They're rectangular, so new JPEGs where the smallest edge is of this size. And again, you never have to do this. There's no reason to ever use it if you don't want to; it's just a speed-up. But if you've got really big images coming in, it saves you a lot of time. You'll often see on Kaggle kernels or forum posts people with bash scripts and stuff like that to loop through and resize images to save time. You never have to do that. You can just say .resize, and as a once-off it'll go through and create that, and if it's already there, it'll use the resized ones for you.

Okay, so it's just a speed-up convenience function, no more. So for those of you that are past dog breeds, I would be looking at Planet next. Play around with trying to get a sense of how you can get this to be an accurate model.

One thing to mention, and I'm not really going to go into it in detail because it's nothing to do with deep learning particularly, is that I'm using a different metric. I didn't use metrics equals accuracy; I said metrics equals f2. Just remember from last week that confusion matrix, that two-by-two of correct and incorrect for each of dogs and cats. There are a lot of different ways you could turn that confusion matrix into a score. Do you care more about false negatives, or do you care more about false positives, and how do you weight them?
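One common way to turn those confusion-matrix counts into a single score is the F-beta family; the f2 metric mentioned here is the beta = 2 case. The function below is a from-scratch sketch of the standard formula for 0/1 vectors, assuming NumPy. It is not the f2 from planet.py, and the name fbeta and the example vectors are made up for illustration.

```python
import numpy as np

def fbeta(preds, targs, beta=2.0):
    """F-beta for binary 0/1 vectors; beta > 1 weights recall (false negatives) more heavily."""
    preds, targs = np.asarray(preds), np.asarray(targs)
    true_pos = np.sum((preds == 1) & (targs == 1))
    false_pos = np.sum((preds == 1) & (targs == 0))
    false_neg = np.sum((preds == 0) & (targs == 1))
    if true_pos == 0:
        return 0.0                              # no correct positives: score is zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    b2 = beta ** 2
    return float((1 + b2) * precision * recall / (b2 * precision + recall))

targs = [1, 1, 1, 0, 0, 0]
missed_one = [1, 1, 0, 0, 0, 0]     # one false negative
extra_one = [1, 1, 1, 1, 0, 0]      # one false positive
print(fbeta(missed_one, targs), fbeta(extra_one, targs))
```

With beta = 2 the prediction with one false negative scores lower than the one with one false positive, which is the sense in which this metric cares more about missing a label than about adding a wrong one. The function takes a vector of predictions and a vector of targets and returns a single number.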

And how do you combine them together? There's basically a function called F-beta, where the beta says how much you weight false negatives versus false positives, and so f2 is F-beta with beta equals 2. It's basically a particular way of weighting false negatives and false positives. The reason we use it is because Kaggle told us that Planet, who were running this competition, wanted to use this particular F-beta metric.

The important thing for you to know is that you can create custom metrics. In this case you can see here it says from planet import f2, and really I've got this here so that you can see how to do it. If you look inside courses/dl1, you can see there's something called planet.py, and if I look at planet.py, you'll see there's a function there called f2. So f2 simply calls fbeta_score from scikit or scipy (I can't remember where it came from) and does a couple of little tweaks that aren't particularly important. But the important thing is you can write any metric you like, as long as it takes in a set of predictions and a set of targets. They're both going to be numpy arrays, one-dimensional numpy arrays, and then you return back a number. So as long as you create a function that takes two vectors and returns a number, you can use it as a metric. Then when we said metrics equals and passed in that array, which just contains a single function f2, it's just going to be printed out after every epoch for you. In general, with the fastai library everything is customizable. The idea is that everything gives you what you might want by default, but everything can be changed as well.

Yes, Yannet?

We have a little bit of confusion about the difference between multi-label and just single-label.

Uh-huh.

Do you by any chance have an example in which you compute it, similar to the example you just showed us?

Oh, I didn't get to that activation function. Yeah, I'm so sorry. I said I'd do that and then I didn't. So the output activation function for single-label classification is softmax, for all the reasons that we talked about. But if we were trying to predict something that was like 0 0 1 1 0, then softmax would be a terrible choice, because it's very hard to come up with something where both of these are high. In fact, it's impossible, because they have to add up to one.
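The point about two classes being unable to be high at once can be checked numerically. A minimal sketch, assuming NumPy and made-up activations, comparing softmax with a per-class sigmoid (the activation the lesson turns to next):

```python
import numpy as np

def softmax(x):
    exps = np.exp(x)
    return exps / exps.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # same as e^x / (1 + e^x)

# Made-up activations where the third and fourth classes should both be "on" (target 0 0 1 1 0).
activations = np.array([-4.0, -3.0, 5.0, 5.0, -4.0])

print(softmax(activations).round(3))   # the two large entries have to share the total of 1
print(sigmoid(activations).round(3))   # each entry is squashed independently
```

However large the two activations are made, softmax cannot push both outputs above 0.5, while the sigmoid treats each class on its own and lets both approach 1.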

So the closest they could be would be 0.5. So for multi-label classification, our activation function is called sigmoid. Again, the fastai library does this automatically for you if it notices you have a multi-label problem, and it does that by checking your dataset to see if anything has more than one label applied to it. Sigmoid is a function which is basically the same thing, except that we never add up all of these exps. Instead we just take this exp and say it's just equal to it divided by 1 plus it. The nice thing about that is that now multiple things can be high at once. Generally, if something is less than zero, its sigmoid is going to be less than 0.5, and if it's greater than zero, its sigmoid is going to be greater than 0.5. The important thing to know about a sigmoid function is that its shape is something which asymptotes at the top to one and asymptotes (oh, I drew that) at the bottom to zero. Therefore it's a good thing to model a probability with. Anybody who has done any logistic regression will be familiar with this; it's what we do in logistic regression. So it appears everywhere in machine learning, and you'll see that a sigmoid and a softmax are very close to each other conceptually. But this is what we want as our activation function for multi-label, and this is what we want for single-label, and again, fastai does it all for you.

There was a question over here. Yes?

I have a question about the initial training that you do. If I understand correctly, we have frozen the pre-trained model, and you initially only tried to train the last layer, right?

Right.

But on the other hand, we said that only the initial layers, so let's say probably the first layer, are important to us, and the other two are more like features that are ImageNet-related and don't apply in this case.

Well, the layers are very important, but the pre-trained weights in them aren't. So it's the later layers that we really want to train the most. The earlier layers are likely to be already closer to what we want.

Okay, so you start with the last one, and then you go...

Right. So if you go back to our quick dogs and cats: when we create a model from a pre-trained model, it returns something where all of the convolutional layers are frozen, and some randomly set fully connected layers we add to the end are unfrozen. So when we go fit, at first it just trains the randomly initialized fully connected layers.

And if something is really close to ImageNet, that's often all we need, because the early layers are already good at finding edges, gradients, repeating patterns for ears and dogs' heads. So then when we unfreeze, we set the learning rates for the early layers to be really low, because we don't want to change them much, whereas for the later ones we set them to be higher. Whereas for satellite data, this is no longer true. The early layers are still better than the later layers, but we still probably need to change them quite a bit. So that's why this learning rate is nine times smaller than the final learning rate, rather than a thousand times smaller than the final learning rate.

Okay, so you play with the weights of the layers, with the learning rates?
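To make the two learning-rate schedules concrete, here is a small sketch assuming NumPy and three layer groups (early, middle, final). The base value 0.2 is hypothetical, since the lesson only says the rate turned out to be quite high; the divide-by-9 and divide-by-3 factors, and the factor of 10 per group for ImageNet-like data, are the ones mentioned earlier in the lesson.

```python
import numpy as np

lr = 0.2   # hypothetical final-layer learning rate; the lesson only says it was "quite high"

# Satellite-style data: earlier layer groups still need to move a fair amount.
satellite_lrs = np.array([lr / 9, lr / 3, lr])

# ImageNet-like data: a factor of 10 between successive layer groups.
imagenet_like_lrs = np.array([lr / 100, lr / 10, lr])

for name, lrs in [("satellite", satellite_lrs), ("imagenet-like", imagenet_like_lrs)]:
    print(name, lrs, "ratio last/first:", lrs[-1] / lrs[0])
```

The smaller the spread between the first and last entries, the more the early layers are allowed to change relative to the randomly initialized final layers.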

Yeah. Normally, most of the stuff you see online, if they talk about this at all, they'll talk about unfreezing different subsets of layers. And indeed we do unfreeze our randomly generated ones. But what I've found is that although in the fastai library you can type learn.freeze_to and just freeze a subset of layers, this approach of using differential learning rates seems to be more flexible, to the point that I never find myself unfreezing subsets of layers.

But what I don't understand is that I would expect you to start with the differential learning rates, rather than trying to learn the last layer.

Okay, so you could skip training just the last layers and go straight to differential learning rates, but you probably don't want to. The reason you probably don't want to is that there's a difference. The convolutional layers all contain pre-trained weights, so they're not random. For things that are close to ImageNet, they're actually really good; for things that are not close to ImageNet, they're better than nothing. All of our fully connected layers, however, are totally random. So you would always want to make the fully connected weights better than random by training them a bit first, because otherwise, if you go straight to unfreeze, you're actually going to be fiddling around with those early layer weights when the later ones are still random. That's probably not what you want.

I think there's another question here.

So when we unfreeze, what are the things we're trying to change there? Will it change the kernels themselves?

That's always what SGD does. Yeah, so the only thing training means is setting these numbers, and these numbers, and these numbers: the weights. So the weights are the weights of the fully connected layers and the weights in those kernels in the convolutions.

So that's what training means, and we'll learn about how to do it with SGD. But training literally is setting those numbers. These numbers, on the other hand, are activations. They're calculated from the weights and the previous layer's activations or inputs.

I have a question.

Can you lift that up higher and speak loudly?

So in your example of training on the satellite images, you start with a very small size, 64. Does it literally mean that the model takes a small area from the entire image, that is 64 by 64?

So how do we get that 64 by 64? It depends on the transforms. By default, our transform takes the smallest edge and zooms the whole thing out, resamples it so the smallest edge is the size 64, and then it takes a center crop of that. Although when we're using data augmentation, it actually takes a randomly chosen crop.

In the case where the image has multiple objects, like in this case, would it be possible that you would just lose the other things that you're trying to predict?

Yeah, which is why data augmentation is important. And test time augmentation in particular is going to be important, because there may be an artisanal mine out in the corner which, if you take a center crop, you don't see. So data augmentation becomes very important.
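The default described above (scale so the smallest edge is the target size, then crop a square, centred or randomly placed) comes down to a little arithmetic. The sketch below works on sizes only and does not touch pixels; resized_shape and crop_box are hypothetical helper names, not fastai functions.

```python
import random

def resized_shape(width, height, size):
    """Scale so the smallest edge equals `size`, keeping the aspect ratio."""
    scale = size / min(width, height)
    return round(width * scale), round(height * scale)

def crop_box(width, height, size, random_crop=False):
    """Return (left, top, right, bottom) of a size x size crop: centred, or randomly placed."""
    max_left, max_top = width - size, height - size
    if random_crop:
        left, top = random.randint(0, max_left), random.randint(0, max_top)
    else:
        left, top = max_left // 2, max_top // 2
    return left, top, left + size, top + size

# Hypothetical 400x300 input, target size 64.
w, h = resized_shape(400, 300, 64)
print((w, h), crop_box(w, h, 64), crop_box(w, h, 64, random_crop=True))
```

A centre crop always discards the same strips at the edges, which is how a feature in the corner can be missed; a random crop moves the window around, so over several passes different parts of the image are seen.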

Yeah, sure.

So when we talk about the metrics that we use here, like F2, that's not really what the model tries to...

That's a great point. That's not the loss function. Yeah, right. The loss function is something we'll be learning about next week, and it uses cross entropy, otherwise known as negative log likelihood. The metric is just the thing that's printed so we can see what's going on.

Just next to that.

So in the context of multi-class modeling, does the training data also have to be multi-class?

So can I train on just images of pure cats and pure dogs and expect it at prediction time to predict correctly if I give it a picture having both a cat and a dog?

I've never tried that, and I've never seen an example of something that needed it. I guess conceptually there's no reason it wouldn't work, but it's kind of out there.

And you'd still use a sigmoid activation?

You would have to make sure you're using a sigmoid loss function. So in this case fastai's default would not work, because by default fastai would say your training data never has both a cat and a dog, so you would have to override the loss function.

When you use the differential learning rates, those three learning rates, do they just spread evenly across the layers?

Yeah, we'll talk more about this later in the course, but in the fastai library there's a concept of layer groups. In something like a ResNet-50 there are hundreds of layers, and I figured you don't want to write down hundreds of learning rates, so I've basically decided for you how to split them. The last one always refers just to the fully connected layers that we've randomly initialized and added to the end, and then these ones are split generally about halfway through. Basically, I've tried to make it so that these ones are the ones which you hardly want to change at all, and these are the ones you might want to change a little bit. I don't think we'll cover it in the course, but if you're interested we can talk about it in the forum: there are ways you can override this behavior to define your own layer groups if you want to.

And is there any way to visualize the model easily, or dump the layers of the model?

Yeah, absolutely. Let me make sure we've got one here. Okay. So if you just type learn, it doesn't tell you much at all, but what you can do is go learn.summary, and that spits out basically everything. There are all the layers, and you can see in this case these are the names. I mentioned how they all get names, so the first layer is called Conv2d-1, and it's going to take as input... this is useful to actually look at: it's taking 64 by 64 images, which is what we told it we're going to transform things to. This is three channels. Where most libraries have channels at the end and would say 64 by 64 by 3, PyTorch moves it to the front, so it's 3 by 64 by 64. That's because it turns out that some of the GPU computations run faster when it's in that order. But that happens behind the scenes automatically; part of that transformation stuff that's all done automatically is to do that. The -1 means however big the batch size is. In Keras they use a special number, None; in PyTorch they use -1. So this is a four-dimensional mini-batch. The number of images in the mini-batch is dynamic (you can change that), the number of channels is 3, and the image size is 64 by 64.

Okay, and so then you can basically see that this particular convolutional layer apparently has 64 kernels in it. And it's also halving the size. We haven't talked about this, but convolutions can have something called a stride, which, like max pooling, changes the size. So it's returning a 32 by 32 by 64-kernel tensor, and so on and so forth. So that's summary, and we'll learn all about what that's doing in detail in the second half of the course. One more?

I collected my own dataset, and it's a really small dataset, of currencies from images, and I tried to do a learning rate find and then the plot, and it just gave me some numbers which I didn't understand on the learning rate finder, and then the plot was empty.

So yeah, let's talk about that on the forum, but basically the learning rate finder is going to go through a mini-batch at a time. If you've got a tiny data set, there's just not enough mini-batches. So the trick is to make your batch size really small, like try making it four or eight or something.

Okay, they were great questions. Nothing online to add? We've got a little bit past where I hoped to, but let's quickly talk about structured data so we can start thinking about it for next week.
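The point about the learning rate finder and tiny data sets comes down to simple arithmetic, sketched below. The data set size of 50 images is made up for illustration, and the sketch assumes the last, partially filled mini-batch is kept rather than dropped.

```python
import math


def num_minibatches(n_items: int, batch_size: int) -> int:
    """Number of mini-batches in one pass over the data (last partial batch kept)."""
    return math.ceil(n_items / batch_size)


n_images = 50  # hypothetical tiny data set

for bs in (64, 8, 4):
    # One learning rate is tried per mini-batch, so this is also the
    # number of points the finder would have available to plot.
    print(bs, num_minibatches(n_images, bs))
```

With a batch size of 64 the whole data set fits in a single mini-batch, so there is only one point and nothing useful to plot; dropping the batch size to 4 gives 13 mini-batches.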

There's basically two types of data set we use in machine learning. There's a type of data like audio, images, natural language text, where all of the things inside an object, like all of the pixels inside an image, are all the same kind of thing. They're all pixels, or they're all amplitudes of a waveform, or they're all words. I call this kind of data unstructured. And then there's data sets like a profit-and-loss statement or the information about a Facebook user, where each column is structurally quite different. You know, one thing is representing how many page views last month, another one is their sex, another one is what zip code they're in, and I call this structured data. That particular terminology is not unusual, lots of people use that terminology, but lots of people don't. There's no particularly agreed upon terminology, so when I say structured data I'm referring to columnar data as you might find in a database or a spreadsheet, where different columns represent different kinds of things and each row represents an observation.

So structured data is probably what most of you are analyzing most of the time. Funnily enough, academics in the deep learning world don't really give a shit about structured data, because it's pretty hard to get published in fancy conference proceedings if you've got a better logistics model. You know, it's the thing that makes the world go round, it's the thing that makes everybody money and efficiency and makes stuff work, but it's largely ignored, sadly. So we're not going to ignore it, because we're practical deep learning, and Kaggle doesn't ignore it either, because people put prize money up on Kaggle to solve real-world problems. So there are some great Kaggle competitions we can look at. There's one running right now, which is the grocery sales forecasting competition for Ecuador's largest chain.

I've got to be a little careful about how much I show you about currently running competitions, because I don't want to, you know, help you cheat. But it so happens there was a competition a year or two ago for one of Germany's largest grocery chains, which is almost identical.

So I'm going to show you how to do that. That was called the Rossmann stores data, and so I would suggest, first of all, try practicing what we're learning on Rossmann, but then see if you can get it working on grocery, because currently on the leaderboard no one seems to basically know what they're doing in the groceries competition.

If you look at the leaderboard, see here, these ones around .529, .530 are people that are literally finding group averages and submitting those. I know because those are the kernels that they're using. So basically the people around 20th place are not actually doing any machine learning. So yeah, let's see if we can improve things. You'll see there's a lesson 3 Rossmann notebook. Make sure you git pull.

Okay, in fact, just a reminder: before you start working, git pull in your fastai repo, and from time to time conda env update. For you guys doing the in-person course, the conda env update you should do more often, because we're kind of changing things a little bit. Folks in the MOOC, more like once a month should be fine. So anyway, I just changed this a little bit, so make sure you git pull to get lesson 3 Rossmann.

There's a couple of new libraries here. One is fastai.structured. fastai.structured contains stuff which is actually not at all PyTorch specific, and we actually use that in the machine learning course as well for doing random forests with no PyTorch at all. I mention that because you can use that particular library without any of the other parts of fastai, so that can be handy. And then we're also going to use fastai.column_data, which is basically some stuff that allows us to do fastai PyTorch stuff with columnar structured data.

For structured data we need to use pandas a lot. Anybody who's used R data frames will be very familiar with pandas. pandas is basically an attempt to kind of replicate data frames in Python, you know, and a bit more. If you're not entirely familiar with pandas, there's a great book, which I think I might have mentioned before: Python for Data Analysis by Wes McKinney.

There's a new edition that just came out a couple of weeks ago. Obviously, being by the pandas author, its coverage of pandas is excellent, but it also covers NumPy, SciPy, matplotlib, scikit-learn, IPython and Jupyter really well. And so I'm kind of going to assume that you know your way around these libraries to some extent. Also, there was the workshop we did before this started, and there's a video of that online where we have a brief mention of all of those tools.

Structured data is generally shared as CSV files.

It was no different in this competition. As you'll see, there's a hyperlink to the Rossmann data set here. Now if you look at the bottom of my screen, you'll see this goes to files.fast.ai. Because this doesn't require any login or anything to grab this data set, it's as simple as right-clicking, copy link address, head over to wherever you want it, and just type wget and the URL. And you can always read a CSV file with just pandas.read_csv.
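As a minimal sketch of reading a CSV with pandas, the snippet below parses a small in-memory string instead of a downloaded file. The column names and the two rows are made up for illustration and are not taken from the competition files; with a real download you would pass the file path to `read_csv` instead of the `StringIO` object.

```python
import io

import pandas as pd

# Made-up rows standing in for a downloaded CSV file.
csv_text = """Store,Date,Sales,Promo
1,2015-07-31,5000,1
2,2015-07-31,6000,1
"""

# parse_dates turns the Date column into proper datetime values.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(df.head())
```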

There's a lot of pre-processing that we do in this particular case, and what I've actually done here is I've stolen the entire pipeline from the third-place winner of Rossmann. They're really great: they've had a GitHub available with everything that we need, and I've ported it all across and simplified it and tried to make it pretty easy to understand. This course is about deep learning, not about data processing.

So I'm not going to go through it, but we will be going through it in the machine learning course in some detail, because feature engineering is really important. So if you're interested, check out the machine learning course for that. I will, however, show you kind of what it looks like.

So once we read the CSVs in, you can see basically what's there. The key one is: for a particular store, we have the date and we have the sales for that particular store. We know whether that thing is on promo or not. We know the number of customers that that particular store had. We know whether that date was a school holiday. We also know what kind of store it is. This is pretty common, right? You'll often get data sets where there's some column with just some kind of code.

We don't really know what the code means. Most of the time I find it doesn't matter what it means. Normally you get given a data dictionary when you start on a project, and obviously if you're working on an internal project you can ask the people at your company, "What does this column mean?"

I kind of stay away from learning too much about it. I prefer to see what the data says first. There's something about what kind of product we are selling in this particular row, and then there's information about how far away the nearest competitor is, how long they have been open for, how long the promo has been on for. For each store we can find out what state it's in, and for each state we can find out the name of the state (this is in Germany). Interestingly, they were allowed to download any external data they wanted in this competition. It's very common, as long as you share it with everybody else. So some folks tried downloading data from Google Trends. I'm not sure exactly what it was that they were checking the trend of, but we have this information from Google Trends. Somebody downloaded the weather for every day in Germany for every state. And yeah, that's about it.

You can get a data frame summary with pandas, which lets you see how many observations there are, and means and standard deviations. Again, I don't do a hell of a lot with that early on, but it's nice to know it's there.

So this is called a relational data set. A relational data set is one where there's quite a few tables we have to join together.
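Joining tables like these is typically a left join on a shared key. The sketch below uses pandas `DataFrame.merge` wrapped in a small helper; the helper name, the column names, and the two tiny tables are assumptions for illustration, not the competition's real schema.

```python
import pandas as pd


def join_df(left, right, left_on, right_on=None, suffix="_y"):
    """Left-join `right` onto `left`, keeping every row of `left`.

    Columns that exist in both tables keep their name on the left side
    and get `suffix` appended on the right side.
    """
    if right_on is None:
        right_on = left_on
    return left.merge(right, how="left", left_on=left_on,
                      right_on=right_on, suffixes=("", suffix))


# Made-up tables: which state each store is in, and each state's full name.
stores = pd.DataFrame({"Store": [1, 2], "State": ["HE", "TH"]})
states = pd.DataFrame({"State": ["HE", "TH"],
                       "StateName": ["Hessen", "Thueringen"]})

joined = join_df(stores, states, "State")
print(joined)
```

A left join is the cautious choice here: a store with no match in the second table is kept with missing values rather than silently dropped, so it is worth checking for nulls after each join.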

It's very easy to do that in pandas. There's a thing called merge, so I create a little function to do that, and so I just start joining everything together: join in the weather, the Google Trends, the stores. Yeah, that's about everything, I guess. You'll see there's one thing that I'm using from the fastai library, which is called add_datepart. We talk about this a lot in the machine learning course, but basically this is going to take a date and pull out of it a bunch of columns (day of week, is it the start of a quarter, month of year, so on and so forth) and add them all in to the data set. Okay, so this is all standard pre-processing. As we join everything together we fiddle around with some of the dates a little bit. Some of them are in month and year format; we turn it into date format. We spend a lot of time trying to take information about, for example, holidays, and add a column for how long until the next holiday, and how long it has been since the last holiday.

Ditto for promos, so on and so forth. Okay, so we do all that, and at the very end we basically save a big structured data file that contains all that stuff.

Something that those of you that use pandas may not be aware of is that there's a very cool new format called feather, which you can save a pandas data frame into. It pretty much takes it as it sits in RAM and dumps it to the disk, and so it's really, really fast. The reason that you need to know this is because the Ecuadorian grocery competition that's on now has 350 million records, so you will care about how long things take. It took, I believe, about six seconds for me to save 350 million records to feather format, so it's pretty cool. So at the end of all that I save it as feather format, and for the rest of this discussion I'm just going to take it as given that we've got this nicely processed, feature-engineered file, and I can just go read_feather.
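The date features mentioned in this part of the lecture (day of week, start of quarter, time since the last holiday) can be sketched with the standard library alone. This is not the fastai add_datepart implementation; the function names, the output keys, and the sample dates are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional, Sequence


def date_parts(d: date) -> Dict[str, object]:
    """Pull a few simple columns out of a single date."""
    return {
        "Year": d.year,
        "Month": d.month,
        "Day": d.day,
        "Dayofweek": d.weekday(),  # Monday is 0, Tuesday is 1, ...
        "Is_quarter_start": d.day == 1 and d.month in (1, 4, 7, 10),
    }


def days_since_last_event(dates: Sequence[date],
                          is_event: Sequence[bool]) -> List[Optional[int]]:
    """For dates sorted ascending, days elapsed since the latest event so far.

    Returns None for rows that come before the first event.
    """
    elapsed, last = [], None
    for d, flag in zip(dates, is_event):
        if flag:
            last = d
        elapsed.append(None if last is None else (d - last).days)
    return elapsed


parts = date_parts(date(2015, 7, 1))

# Five consecutive days, with a made-up holiday on the 2nd and the 5th day.
days = [date(2015, 7, 1) + timedelta(days=i) for i in range(5)]
holiday = [False, True, False, False, True]
since_holiday = days_since_last_event(days, holiday)
```

Running the same function over the dates in reverse order would give the "how long until the next holiday" column, and the same pattern applies to promos.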

Okay, but for you to play along at home, you will have to run those previous cells. Oh, except these ones that are commented out. You don't have to run those, because the file that you download from files.fast.ai has already done that for you. All right, so we basically have all these columns. It's basically going to tell us how many of this thing was sold on this date at this store, and so the goal of this competition is to find out how many things will be sold for each store, for each type of thing, in the future. That's basically what we're going to be trying to do. And so here's an example of what some of the data looks like.

Next week we're going to see how to go through these steps, but basically what we're going to learn is to split the columns into two types. Some columns we're going to treat as categorical, which is to say store ID 1 and store ID 2 are not numerically related to each other; they're categories. We're going to treat day of week like that too: Monday and Tuesday, day zero and day one, are not numerically related. Whereas distance in kilometers to the nearest competitor, that's a number that we're going to treat numerically. So in other words, the categorical variables we basically are going to one-hot encode (you can think of it as one-hot encoding them), whereas the continuous variables we're going to be feeding into fully connected layers just as is.

So what we'll be doing is we'll be basically creating a validation set, and you'll see a lot of these start to look familiar. This is the same function we used on planet and dog breeds to create a validation set. There's some stuff that you haven't seen before, where, rather than saying image data from_csv, we're going to say columnar data from_data_frame. So you can see the basic API concepts will be the same, but they're a little different. But just like before, we're going to get a learner, and we're going to go lr_find to find our best learning rate, and then we're going to go .fit with a metric, with a cycle length. So the basic sequence is going to end up looking, hopefully, very familiar.
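The categorical versus continuous split described above can be illustrated with a small pure-Python sketch. It takes the "you can think of it as one-hot encoding" description literally; the column names, the category levels, and the sample row are hypothetical, and none of this is the fastai API.

```python
from typing import Dict, List, Sequence


def one_hot(value, categories: Sequence) -> List[int]:
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]


def encode_row(row: Dict[str, object],
               cat_vars: Sequence[str],
               cont_vars: Sequence[str],
               levels: Dict[str, Sequence]) -> List[float]:
    """One-hot encode the categorical columns, pass continuous ones through."""
    features: List[float] = []
    for name in cat_vars:
        features.extend(one_hot(row[name], levels[name]))
    for name in cont_vars:
        features.append(float(row[name]))
    return features


# Hypothetical column split: IDs and day of week are categories,
# distance to the nearest competitor is a plain number.
cat_vars = ["Store", "DayOfWeek"]
cont_vars = ["CompetitionDistance"]
levels = {"Store": [1, 2, 3], "DayOfWeek": list(range(7))}

row = {"Store": 2, "DayOfWeek": 0, "CompetitionDistance": 570.0}
features = encode_row(row, cat_vars, cont_vars, levels)
```

The encoded row has one slot per store, one slot per day of the week, and a single slot for the distance, so store 1 and store 2 end up as different positions rather than as the numbers 1 and 2.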

Okay, so we're out of time. What I suggest you do this week is try to enter as many Kaggle image competitions as possible, to really get this feel for cycle lengths, learning rates, plotting things. You know that post I showed you at the start of class today that took you through lesson one: really go through that on as many image data sets as you can, to just feel really comfortable with it, right?

Because you want to get to the point where next week, when we start talking about structured data, this idea of how learners work, and data, and data loaders and data sets, and looking at pictures, should be really intuitive. All right, good luck. See you next week. (audience applauding)
