Machine Learning 1: Lesson 10
Chapters
0:00 Fast AI
1:22 Feature Engineering
4:25 Structured Data
8:34 Recap
11:53 AutoGrad
13:12 Variables
15:03 Iterators
24:08 Gradients
37:08 Data Loader
40:18 Parameters
47:30 Weight Decay
55:08 Discussion
00:00:00.000 |
Well, welcome back to Machine Learning. One of the most exciting things this week — 00:00:05.480 |
almost certainly the most exciting thing this week — is that fastai is now on pip, so you can pip install fastai. 00:00:14.760 |
And so thank you to Prince and to Karem for making that happen, 00:00:21.360 |
two USF students who had never published a pip package before — and this is one of the harder ones to publish because it's got a lot to it. 00:00:30.160 |
So it's, you know, probably still easiest just to do the conda env update thing, 00:00:36.720 |
But a couple of places that it would be handy instead to pip install fastai would be well obviously if you're working 00:00:42.880 |
outside of the repo and the notebooks — then this gives you access to fastai everywhere. 00:00:49.680 |
Also, I believe they submitted a pull request to Kaggle to try and get it added to the Kaggle kernels 00:00:55.960 |
So hopefully you'll be able to use it on Kaggle kernels 00:01:00.080 |
Yeah, you can use it at your work or whatever else 00:01:04.160 |
So that's exciting. I mean, I'm not going to say it's like officially released yet. You know, it's still — 00:01:13.880 |
You're helping add documentation and all that kind of stuff, but it's great that that's now there 00:01:21.560 |
a couple of cool kernels from USF students this week thought I'd highlight two that were both from the 00:01:37.840 |
written out, you know, written as standard English text — they also had one for Russian. 00:01:45.920 |
And you're trying to kind of identify things that could be like a first second third and say like that's a cardinal number 00:01:52.760 |
Or if this is a phone number or whatever and I did a quick little bit of searching and I saw that 00:01:57.820 |
There had been some attempts in academia to use 00:02:01.840 |
deep learning for this, but they hadn't managed to make much progress and 00:02:09.200 |
kernel here, which gets 0.992 on the leaderboard — which I think is like top 20 — 00:02:14.520 |
is, yeah, kind of entirely heuristic, and it's a great example of 00:02:18.640 |
kind of feature engineering — in this case the whole thing is basically entirely feature engineering. 00:02:23.780 |
So it's basically looking through and using lots of regular expressions to figure out for each token 00:02:29.600 |
What is it you know and I think she's done a great job here kind of laying it all out 00:02:34.240 |
clearly as to what all the different pieces are and how they all fit together and 00:02:38.560 |
She mentioned that she's maybe hoping to turn this into a library which I think would be great 00:02:45.480 |
grab a piece of text and pull out what all the pieces in it are. 00:02:53.400 |
The natural language processing community hopes to be able to do this 00:02:58.640 |
without, like, lots of handwritten code like this, but for now — 00:03:03.260 |
it'll be interesting to see what the winners turn out to have done, but I haven't seen 00:03:09.200 |
machine learning being used really to do this particularly well. 00:03:13.520 |
Perhaps the best approaches are the ones which combine this kind of feature engineering along with some machine learning. 00:03:19.600 |
But I think this is a great example of effective feature engineering, and this is a another USF student 00:03:27.460 |
who has done much the same thing and got a similar kind of score. 00:03:36.160 |
Again, this would get you a good leaderboard position as well. 00:03:40.480 |
so I thought that was interesting to see examples of some of our students entering a 00:03:45.800 |
competition and getting kind of top 20 ish results by you know basically just handwritten heuristics, and this is where 00:03:59.640 |
Six years ago still basically all the best approaches were a whole lot of, like, carefully handwritten heuristics, 00:04:06.640 |
often combined with some simple machine learning and 00:04:18.360 |
Automating much more of this and actually interestingly 00:04:25.600 |
the Safe Driver Prediction competition just finished — 00:04:29.200 |
One of the Netflix prize winners won this competition and he 00:04:35.320 |
Invented a new algorithm for dealing with structured data which basically doesn't require any feature engineering at all 00:04:49.760 |
deep learning models and one gradient boosting machine 00:04:54.200 |
And his his basic approach was very similar to what we've been learning in this class so far 00:05:02.180 |
which is using fully connected neural networks and one-hot encoding, 00:05:07.720 |
and specifically embeddings, which we'll learn about — but he had a very clever technique, 00:05:13.280 |
Which was there was a lot of data in this competition which was unlabeled so in other words 00:05:24.200 |
Or whatever so unlabeled data so when you've got some labeled and some unlabeled data 00:05:29.080 |
We call that semi supervised learning and in real life 00:05:32.960 |
Most learning is semi supervised learning like in real life normally you have some things that are labeled and some things that are unlabeled 00:05:40.100 |
so this is kind of the most practically useful kind of learning and 00:05:44.160 |
Then structured data is it's the most common kind of data that companies deal with day to day 00:05:53.620 |
Structured data competition made it incredibly practically useful 00:05:57.460 |
And so what his technique for winning this was, was to 00:06:01.780 |
Do data augmentation which those of you doing the deep learning course have learned about which is basically the idea like if you had 00:06:09.760 |
Pictures you would like flip them horizontally or rotate them a bit data augmentation means creating new data examples 00:06:19.240 |
different versions of ones you already have. And the way he did it was, for each row in the data, he would — 00:06:32.880 |
so each row now would represent like a mix of, like, 85 percent of the original row, 00:06:38.320 |
but 15 percent randomly selected from a different row — 00:06:44.760 |
randomly changing the data a little bit and then he used something called an autoencoder which we will 00:06:51.080 |
Probably won't study until part two of the deep learning course 00:06:54.960 |
But the basic idea of an autoencoder is your dependent variable is the same as your independent variable 00:07:01.020 |
so in other words you try to predict your input, which obviously is 00:07:07.520 |
trivial if you're allowed to — like, you know, the identity transform, for example, trivially predicts the input. 00:07:14.640 |
But the trick with an autoencoder is to have less activations in 00:07:18.640 |
At least one of your layers than your input right so if your input was like a hundred-dimensional vector, and you put it through a 00:07:27.560 |
100 by 10 matrix to create 10 activations, and then have to recreate the original hundred-long vector from that, 00:07:36.600 |
then you basically — you have to have compressed it, effectively. And so it turns out that 00:07:47.400 |
Correlations and features and interesting relationships in the data even when it's not labeled so he used that 00:07:55.280 |
Rather than doing any — he didn't do any hand engineering, he just used an autoencoder. 00:08:00.120 |
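
A minimal sketch of the denoising-autoencoder idea just described, under my own assumptions (the shapes, names, and noise level are illustrative, not the winner's actual code): swap roughly 15% of each row's values with values from other rows, then train a network to reconstruct the original row through a narrower middle layer.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def swap_noise(x, p=0.15):
    """Replace a random ~15% of each cell with the value from the same column of another row."""
    mask = np.random.rand(*x.shape) < p
    shuffled = x[np.random.permutation(len(x))]
    return np.where(mask, shuffled, x)

# Autoencoder: the target is the input itself, and the middle layer is narrower than
# the input, so the network is forced to compress -- i.e. to find structure in the data.
autoencoder = nn.Sequential(
    nn.Linear(100, 10),   # 100 inputs squeezed down to 10 activations
    nn.ReLU(),
    nn.Linear(10, 100),   # ...then reconstructed back to 100
)

x = torch.randn(256, 100)                                   # a fake batch of rows
noisy = torch.tensor(swap_noise(x.numpy()), dtype=torch.float32)
loss = F.mse_loss(autoencoder(noisy), x)                    # predict the *original* row
```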
So you know these are some interesting kind of directions that if you keep going with your machine learning studies 00:08:08.920 |
do part two of the deep learning course next year. 00:08:18.840 |
Feature engineering is going away, and this was just 00:08:21.440 |
Yeah, an hour ago — so this is very recent news indeed — but this is one of the most important 00:08:36.960 |
Simple logistic regression trained with SGD for MNIST 00:08:44.880 |
And here's the summary of where we got to: we had nearly built a module — 00:08:57.920 |
a model module — and a training loop from scratch, and we were going to kind of try and finish that. And after we finish that, 00:09:04.640 |
I'm then going to go through this entire notebook 00:09:06.640 |
Backwards right so having gone like top to bottom, but I'm going to go back through 00:09:17.600 |
handwritten nn.Module class we created. 00:09:22.280 |
We defined our loss we defined our learning rate, and we defined our optimizer 00:09:26.840 |
And this is the thing that we're going to try and write by hand in a moment 00:09:32.440 |
That and that we're still using from PyTorch, but that we've written ourselves, and this we've written ourselves. 00:09:38.760 |
So the basic idea was we're going to go through some number of epochs, so let's go through one epoch 00:09:43.460 |
Right, and we're going to keep track of, for each mini-batch, what the loss was, so that we can report it at the end. 00:09:51.840 |
We're going to turn our training data loader into an iterator 00:09:55.140 |
so that we can loop through it — loop through every mini-batch — and so now we can go ahead and say: for t in range of 00:10:01.680 |
the length of the data loader, and then we can call next to grab the next independent variables and dependent variables 00:10:11.300 |
From our data loader from that iterator, okay? 00:10:15.960 |
So then remember we can then pass the X tensor into our model by calling the model as if it was a function 00:10:23.560 |
But first of all we have to turn it into a variable 00:10:28.880 |
blah dot cuda to turn it into a variable — a shorthand for that is just the capital V. Now, 00:10:34.360 |
it's a capital T for a tensor, capital V for a variable — that's just a shortcut in fastai. 00:10:43.240 |
And so the next thing we needed was to calculate our loss 00:10:45.560 |
because we can't calculate the derivatives of the loss if we haven't calculated the loss. 00:10:50.940 |
So the loss takes the predictions and the actuals 00:10:54.240 |
Okay, so the actuals, again, are the Y tensor, and again we have to turn that into a variable. 00:10:59.760 |
Now can anybody remind me what a variable is and why we would want to use a variable here? 00:11:11.320 |
"I think once you turn it into a variable, then it tracks it, so then you can do .backward on that, so you can get it — 00:11:18.400 |
it can track, like, its process — like, you know, as the functions are chained within each other, 00:11:23.440 |
you can track it, and then when we do backward on it, it backpropagates and does the —" Yeah, right. So — 00:11:40.720 |
So there's actually a fantastic tutorial on the Pytorch website 00:11:45.480 |
So on the Pytorch website there's a tutorial section 00:11:56.680 |
And there's a tutorial there about autograd autograd is the name of the automatic 00:12:01.680 |
differentiation package that comes with PyTorch, and it's an implementation of automatic differentiation. And so the Variable class is 00:12:12.400 |
the key class here, because that's the thing that turns a tensor into something where we can keep track of its gradients. 00:12:19.240 |
So basically here they show how to create a variable do an operation to a variable 00:12:25.780 |
And then you can go back and actually look at the grad function 00:12:30.360 |
which is the function that it's keeping track of, basically, to calculate the gradient. Right, so as we do 00:12:38.160 |
more and more operations to this variable, and to the variables calculated from that variable, it keeps track of it, 00:12:45.320 |
So later on we can go dot backward and then print dot grad and find out the gradient 00:12:52.560 |
Right and so you notice we never defined the gradient. We just defined it as being x plus 2 00:12:58.520 |
Squared times 3 whatever and it can calculate the gradient 00:13:04.640 |
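
A tiny version of that autograd demo (a sketch; in older PyTorch you would wrap the tensor in a Variable, in current versions requires_grad=True on a plain tensor does the same job):

```python
import torch

x = torch.ones(2, 2, requires_grad=True)   # a tensor that tracks gradients
y = x + 2                                  # y remembers how it was made (y.grad_fn)
z = (y * y * 3).mean()                     # more operations; still being tracked

z.backward()    # apply the chain rule backwards through the recorded history
print(x.grad)   # dz/dx -- all 4.5 here -- even though we never wrote the derivative
```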
Okay, so that's why we need to turn that into a variable so L is now a 00:13:16.800 |
Variable containing the loss so it contains a single number for this mini batch 00:13:22.600 |
which is the loss for this mini-batch — but it's not just a number, it's a number as a variable, 00:13:29.160 |
So it's a number that knows how it was calculated all right 00:13:32.920 |
so we're going to append that loss to our array just so we can 00:13:39.920 |
And now we're going to calculate the gradient so L dot backward is the thing that says 00:13:46.840 |
calculate the gradient. So remember, when we call the network, it's actually calling our forward function — 00:13:55.560 |
so that's like: go through it forward — and then backward is like using the chain rule to calculate the gradients 00:14:02.400 |
Backwards okay, and then this is the thing we're about to write which is update the weights based on the gradients and the learning rate 00:14:11.840 |
Zero grad will explain when we write this out by hand 00:14:16.640 |
and so then at the end we can turn our validation data loader into an iterator and 00:14:35.840 |
Which thing did you predict which thing was actual and so check whether they're equal right and then the 00:14:44.480 |
mean of that is going to be our accuracy, okay? 00:14:51.760 |
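
Putting those pieces together, here is a minimal sketch of the loop just described. The names — net, loss, opt, md with its trn_dl / val_dl data loaders, and a score helper for the accuracy check — are assumed from the notebook, not defined by this snippet:

```python
import numpy as np
from torch.autograd import Variable

for epoch in range(1):
    losses = []
    dl = iter(md.trn_dl)                         # training data loader -> iterator
    for t in range(len(md.trn_dl)):
        xt, yt = next(dl)                        # next mini-batch
        y_pred = net(Variable(xt).cuda())        # forward pass
        l = loss(y_pred, Variable(yt).cuda())    # loss for this mini-batch
        losses.append(l)

        opt.zero_grad()                          # reset accumulated gradients
        l.backward()                             # backprop: fill in the gradients
        opt.step()                               # update the weights

    # accuracy on the validation set
    val_dl = iter(md.val_dl)
    val_scores = [score(*next(val_dl)) for i in range(len(md.val_dl))]
    print(np.mean(val_scores))
```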
"What's the advantage that you found converting it into an iterator, rather than, like, using a normal —?" 00:15:04.120 |
So it's still and this is a normal Python loop so the question really is like 00:15:12.560 |
The alternative perhaps you're thinking it would be like we could choose like a something like a list with an indexer 00:15:19.040 |
Okay, so you know the problem there is that we want 00:15:23.560 |
There were a few things. I mean, one key one is, each time we grab a new mini-batch, we want it to be random — 00:15:29.600 |
we want a different shuffled thing. So this 00:15:35.960 |
Forever you know you can loop through it as many times as you like so 00:15:40.120 |
There's this kind of idea. It's called different things in different languages 00:15:44.160 |
But a lot of languages are called like stream processing 00:15:47.480 |
And it's this basic idea that rather than saying I want the third thing or the ninth thing 00:15:51.720 |
It's just like I want the next thing right it's great for like network programming. It's like grab the next thing from the network 00:16:00.320 |
UI programming it's like grab the next event where somebody clicked a button it also turns out to be great for 00:16:06.360 |
This kind of numeric programming. It's like I just want the next batch of data 00:16:10.340 |
It means that the data like can be kind of arbitrarily long as we're describing one piece at a time 00:16:18.540 |
Yeah, so, you know — I mean, I guess the short answer is also because it's how PyTorch works: 00:16:27.440 |
PyTorch's data loaders are designed to be 00:16:30.460 |
Called in this way, and then so Python has this concept of a generator 00:16:38.480 |
Different type of generator. I wonder if this is gonna be a snake generator or a computer generator, okay? 00:16:44.480 |
A generator is a way that you can create a function that as it says behaves like an iterator 00:16:50.920 |
So like Python has recognized that this stream processing approach to programming is like super handy and helpful and 00:16:57.840 |
Supports it everywhere so basically anywhere that you use a for in loop anywhere you use a list comprehension 00:17:05.760 |
those things can always be generators or iterators, so by programming this way we just get a lot of 00:17:11.880 |
flexibility, I guess. Does that sound about right, Terrence? You're the programming language expert. 00:17:21.680 |
So Terrence actually does programming languages for a living so we should ask him 00:17:26.440 |
Yeah, I mean the short answer is what you said 00:17:32.400 |
"But in this case all that data has to be in memory anyway, because we've got —" No, it doesn't have to be in memory. 00:17:39.000 |
So in fact most of the time we could pull a mini batch from something in fact most of the time with pytorch 00:17:44.160 |
The mini batch will be read from like separate images spread over your disk on demand 00:17:51.800 |
But in general you want to keep as little in memory as possible at a time 00:17:56.440 |
And so the idea of stream processing also is great because you can do compositions you can 00:18:00.640 |
Pipe the data to a different machine you can yeah 00:18:05.000 |
You can grab the next thing from here and then send it off to the next stream which can then grab it and do something 00:18:09.400 |
Else which you guys all recognize of course in the command-line pipes and redirection 00:18:17.200 |
The benefit of working with people that actually know what they're talking about 00:18:21.840 |
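
As a toy illustration of the generator/iterator idea (plain Python, not fastai's DataLoader — just a sketch of "give me the next thing"):

```python
def batches(xs, bs):
    """Yield successive mini-batches of size bs -- the consumer just asks for the next one."""
    for i in range(0, len(xs), bs):
        yield xs[i:i + bs]

it = iter(batches(list(range(10)), 4))
print(next(it))    # [0, 1, 2, 3]
print(next(it))    # [4, 5, 6, 7]

# Anywhere a for-in loop or list comprehension works, a generator works too:
for b in batches(list(range(10)), 4):
    print(b)
```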
All right, so let's now take that and get rid of the optimizer 00:18:28.320 |
Okay, so the only thing that we're going to be left with is the negative log likelihood loss function 00:18:33.920 |
which we could also replace, actually — we have an 00:18:38.560 |
implementation of that from scratch, that Yannet wrote, in the 00:18:41.160 |
notebooks. So it's only one line of code; as we learned earlier, you can do it with a single if statement, okay? 00:18:48.340 |
So I don't know why I was so lazy as to not include this. 00:18:51.840 |
So what we're going to do is we're going to again grab this module that we've written ourselves the logistic regression module 00:18:58.880 |
We're going to have one epoch again. We're going to loop through each thing in our iterator again 00:19:05.020 |
we're going to grab our independent and dependent variables for the mini-batch again, 00:19:11.760 |
Calculate the loss, so this is all the same as before 00:19:14.360 |
But now we're going to get rid of this optimizer dot step 00:19:23.480 |
As I mentioned we're not going to do the calculus by hand so we'll call L dot backward to calculate the gradients automatically 00:19:30.940 |
and that's going to fill in the gradients on our weight matrix. So, do you remember when we created our — 00:19:40.760 |
here's that module we built — so the weight matrix for the 00:19:48.420 |
linear layer weights we called l1w, and for the bias we called l1b, right? So they were the attributes we created. 00:20:00.360 |
I've just put them into things called W and B just to save some typing basically so W is our weights 00:20:10.400 |
So the weights remember the weights are a variable and to get the tensor out of the variable 00:20:16.800 |
We have to use dot data right so we want to update the actual tensor that's in this variable, so we say weights dot data 00:20:22.920 |
Minus equals so we want to go in the opposite direction to the gradient the gradient tells us which way is up 00:20:36.400 |
times the learning rate so that is the formula for 00:20:43.200 |
All right, so as you can see, it's like as easy a thing as you can possibly imagine — 00:20:48.680 |
it's literally: update the weights to be equal to whatever they are now, minus the gradients 00:21:01.760 |
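
A sketch of that hand-written step, with w and b standing for the module's l1w / l1b parameter attributes, lr for the learning rate, and net, loss, Variable, xt, yt as in the loop above (names illustrative):

```python
l = loss(net(Variable(xt).cuda()), Variable(yt).cuda())
l.backward()                     # fills in w.grad and b.grad

w.data -= w.grad.data * lr       # step in the opposite direction to the gradient
b.data -= b.grad.data * lr

w.grad.data.zero_()              # reset the accumulated gradients before the next mini-batch
b.grad.data.zero_()
```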
So anybody have any questions about that step in terms of like why we do it or how did you have a question? 00:21:07.960 |
"So that step — but when we do the next on the DL —" 00:21:14.720 |
So when it is the end of the loop. How do you grab the next element? 00:21:23.880 |
Each index in range of length, so this is going 0 1 2 3 at the end of this loop 00:21:29.960 |
It's going to print out the mean of the validation set go back to the start of the epoch at which point 00:21:39.680 |
Okay, so basically, behind the scenes in Python, when you call iter 00:21:44.440 |
on this, it basically tells it to, like, reset its state — to create a new iterator. 00:21:58.680 |
The code is all, you know, available for you to look at. So we could look at, like — md dot trn 00:22:07.680 |
dl is a fastai.dataset ModelDataLoader, so we could, like, take a look at the code of that, 00:22:18.560 |
And see exactly how it's being built right and so you can see here that here's the next function 00:22:27.440 |
Keeping track of how many times it's been through in the self dot I 00:22:31.240 |
And here's the __iter__ function, which is the thing that gets called when you create a new iterator, 00:22:37.200 |
And you can see it's basically passing it off to something else 00:22:39.760 |
Which is a type data loader and then you can check out data loader if you're interested to see how that's implemented 00:22:49.240 |
Basically uses multi-threading to allow it to have multiple of these going on at the same time 00:22:55.120 |
It's actually great — it's really simple; it's only about a screen full of code. 00:23:00.340 |
So if you're interested in simple multi-threaded programming. It's a good thing to look at 00:23:10.160 |
Why have you wrapped this in a for epoch in range one since that'll only run once? 00:23:16.240 |
Because in real life we would normally be running multiple epochs 00:23:20.520 |
So like in this case because it's a linear model it actually basically trains to 00:23:26.880 |
As good as it's going to get in one epoch so if I type three here 00:23:34.200 |
It actually won't really improve after the first epoch much at all as you can see right 00:23:41.280 |
But when we go back up to the top we're going to look at some slightly deeper and more interesting 00:23:46.480 |
Versions which will take more epochs, so you know if I was turning this into a into a function 00:23:52.380 |
you know, I'd be going, like, you know, def train_model, 00:23:56.800 |
And one of the things you would pass in is like number of epochs 00:24:13.720 |
When you're you know creating these neural network layers 00:24:20.880 |
As far as PyTorch is concerned, this is just an nn.Module. 00:24:25.000 |
We could be using it as a layer, we could be using it as a function, 00:24:28.600 |
we could be using it as a neural net — PyTorch doesn't think of those as different things, right? 00:24:33.760 |
So this could be a layer inside some other network, right? 00:24:38.080 |
So how do gradients work? So if you've got a layer — which, remember, is just a bunch of — we can think of it basically 00:24:44.160 |
as its activations, right, or some activations that get computed through some 00:24:48.880 |
non-linear activation function, or through some linear function, and 00:24:57.840 |
it's very likely that we're then, let's say, putting it through a matrix product, right? 00:25:08.560 |
So each one of these so if we were to grab like 00:25:11.160 |
One of these activations right is actually going to be 00:25:27.000 |
The derivative you have to know how this weight matrix 00:25:34.880 |
Impacts that output and that output and that output and that output 00:25:38.960 |
Right and then you have to add all of those together to find like the total impact of this 00:25:51.800 |
You have to tell it when to set the gradients to zero 00:25:56.680 |
Right because the idea is that you know you could be like having lots of different loss functions or lots of different outputs in your next 00:26:02.560 |
Activation set of activations or whatever all adding up 00:26:06.600 |
Increasing or decreasing your gradients right so you basically have to say okay. This is a new 00:26:15.720 |
Reset okay, so here is where we do that right so before we do L dot backward we say 00:26:25.320 |
Let's take the gradients. Let's take the tensor that they point to and 00:26:31.000 |
then zero underscore — does anybody remember from last week what underscore does as a suffix in PyTorch? 00:26:42.680 |
"I forgot the word, but basically it changes it in place." Right — the word is "in place", yeah. 00:26:50.760 |
Exactly so it sounds like a minor technicality 00:26:55.000 |
But it's super useful to remember every function pretty much has an underscore version suffix 00:27:05.840 |
Tensor of zeros of a particular size so zero underscore means replace the contents of this with a bunch of zeros, okay? 00:27:18.340 |
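
A quick sketch of the in-place suffix convention:

```python
import torch

t = torch.ones(3)
t.add(1)       # returns a new tensor; t itself is unchanged
t.add_(1)      # trailing underscore = in place; t is now tensor([2., 2., 2.])
t.zero_()      # in place again; t is now tensor([0., 0., 0.])
# ...which is what w.grad.data.zero_() does to the gradient tensor between mini-batches.
```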
That's it right, so that's like SGD from scratch 00:27:24.240 |
And if I get rid of my menu bar we can officially say it fits within a screen, okay? 00:27:31.360 |
Of course, we haven't got our definition of logistic regression here — that's another half a screen — but basically there's not much to it. 00:27:39.160 |
"So later on, if we have to do this more — the gradient — is it because you might find, like, a wrong 00:27:44.920 |
minimum, a local minimum, and that way you have to kick it out? 00:27:47.560 |
And is that why you have to do it multiple times, when the surface gets more —" Why do you need multiple epochs? 00:27:51.680 |
Is that your question? Well, I mean, a simple way to answer that would be: let's say our learning rate was tiny. 00:28:04.920 |
Right there's nothing that says going through one epoch is enough to get you all the way there 00:28:09.800 |
So then you'd be like okay. Well, let's increase our learning rate, and it's like yeah, sure 00:28:13.960 |
we'll increase our learning rate — but who's to say that the highest learning rate that learns stably is enough to 00:28:21.120 |
Learn this as well as it can be learned and for most data sets for most architectures one epoch is 00:28:36.680 |
they're very nicely behaved, you know, so you can often use higher learning rates and learn more quickly. Also they — 00:28:43.520 |
you can't, like, generally get as good an accuracy, 00:28:48.380 |
so there's not as far to take them either. So, yeah, doing one epoch is going to be the rarity. All right, 00:28:56.680 |
so going backwards, we're basically going to say: all right, let's not write 00:29:01.200 |
those two lines again and again and again; let's not write those three lines again and again and again. 00:29:09.960 |
So the only difference between that version and this version is, rather than saying .zero_ ourselves, 00:29:16.800 |
rather than saying minus gradient times lr ourselves 00:29:32.280 |
the weights, is actually pretty inefficient. It doesn't take advantage of 00:29:43.520 |
In the DL course we learn about how to do momentum from scratch as well, okay. So 00:30:01.680 |
learns much slower. So now that I've typed just plain old SGD here, this is now literally doing exactly the same thing 00:30:07.680 |
As our slow version so I have to increase the learning rate 00:30:11.880 |
Okay there we go so this this is now the same as the the one we wrote by hand 00:30:23.800 |
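
The same step with the hand-written pieces handed off to torch.optim (a sketch; net, loss, md, Variable as before):

```python
from torch import optim

opt = optim.SGD(net.parameters(), lr=1e-1)   # plain SGD, no momentum

for xt, yt in iter(md.trn_dl):
    l = loss(net(Variable(xt).cuda()), Variable(yt).cuda())
    opt.zero_grad()                          # replaces zeroing each .grad by hand
    l.backward()
    opt.step()                               # replaces w.data -= w.grad.data * lr for every parameter
```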
Let's do a little bit more stuff automatically 00:30:29.020 |
let's not, you know — given that every time we train something, we have to loop through epoch, 00:30:37.640 |
loop through batch, do forward, get the loss, zero the gradient, do backward, do a step of the optimizer — 00:30:51.000 |
All right there it is okay, so let's take a look at fit 00:31:01.980 |
Fit go through each epoch go through each batch 00:31:14.440 |
Keep track of the loss and at the end calculate the validation all right and so then 00:31:23.200 |
So if you're interested in looking at this this stuff's all inside fastai.model 00:31:46.040 |
Zero the gradients calculate the loss remember PyTorch tends to call it criterion rather than loss 00:31:55.720 |
And then there's something else we haven't learned here, but we do learn the deep learning course 00:31:59.520 |
which is gradient clipping, so you can ignore that. 00:32:01.720 |
All right, so you can see now like all the stuff that we've learnt when you look inside the actual frameworks 00:32:15.000 |
So then the next step would be like okay. Well this idea of like having some 00:32:19.640 |
weights and a bias and doing a matrix product and addition — 00:32:30.520 |
let's put that in a function. And then the very idea of, like, first doing this and then doing that — 00:32:36.800 |
this idea of, like, chaining functions together — let's put that into a function. And 00:32:46.720 |
okay, so Sequential simply means: do this function, take the result, send it to this function, etc., right? 00:32:55.080 |
And linear means create the weight matrix create the biases 00:33:05.400 |
So we can then you know as we started to talk about like turn this into a deep neural network 00:33:13.800 |
by saying you know rather than sending this straight off into 00:33:19.100 |
10 activations, let's put it into, say, 100 activations — we could pick whatever number we like — 00:33:30.060 |
put it through another linear layer, another ReLU, and then our final output with our final activation function. Right, and so this is now 00:33:54.940 |
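
A sketch of that deeper version: 28×28 inputs into 100 activations, another 100, then 10 outputs, with ReLUs in between and log-softmax as the final activation:

```python
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(28 * 28, 100),   # weight matrix + bias
    nn.ReLU(),
    nn.Linear(100, 100),
    nn.ReLU(),
    nn.Linear(100, 10),
    nn.LogSoftmax(dim=-1),     # final activation for the 10 digit classes
)
```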
I'm actually going to run a few more epochs right and you can see the accuracy 00:34:00.740 |
Increasing right so if you try and increase the learning rate here, it's like zero point one 00:34:14.740 |
This is called learning rate annealing and the trick is this 00:34:20.860 |
Trying to fit to a function right you've been taking a few steps 00:34:25.740 |
Step step step as you get close to the middle like get close to the bottom 00:34:32.900 |
Your steps probably want to become smaller right otherwise what tends to happen is you start finding you're doing this 00:34:40.100 |
All right, and so you can actually see it here, right? They've got 93, 94 and a bit, 94.6, 00:34:48.100 |
94.8 — like, it's kind of starting to flatten out. 00:34:50.820 |
Right now that could be because it's kind of done as well as it can 00:34:55.420 |
Or it could be that it's going to going backwards and forwards 00:34:58.620 |
So what is a good idea is, later on in training, to decrease your learning rate and take smaller steps. 00:35:07.100 |
Okay, that's called learning rate annealing. So there's a function in fastai called set learning rates — 00:35:12.780 |
you can pass in your optimizer and your new learning rate and 00:35:16.540 |
You know see if that helps right and very often it does 00:35:27.780 |
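
The set-learning-rates helper mentioned here boils down to editing the optimizer's parameter groups; a minimal sketch in plain PyTorch (not the fastai source):

```python
def set_lr(opt, lr):
    # every parameter group carries its own learning rate; overwrite them all
    for pg in opt.param_groups:
        pg['lr'] = lr

set_lr(opt, 1e-2)   # e.g. drop from 0.1 to 0.01 later in training
```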
In the deep learning course we learn a much, much better technique than this, to do this all automatically and at a more granular 00:35:34.460 |
level. But if you're doing it by hand, you know, like an order of magnitude at a time is what — 00:35:42.060 |
So you'll see people in papers talk about learning rate schedules 00:35:46.780 |
this is like a learning rate schedule. So this schedule — just a moment, Erica, 00:35:51.140 |
I'll just come to Ernest first — has got us to 97, right? And I tried 00:35:55.720 |
Kind of going further and we don't seem to be able to get much better than that 00:35:59.860 |
So yeah, so here we've got something where we can get 97 percent 00:36:04.380 |
accuracy. Yes, Erica? "So it seems like you changed the learning rate —" 00:36:11.820 |
ten times smaller than we started with — so we had 0.1, now it's 0.01. Yeah. 00:36:15.780 |
But that makes the whole model train really slow 00:36:19.540 |
So I was wondering if you can make it so that it changes dynamically as it approaches 00:36:24.180 |
closer to the minima?" Yeah, pretty much. Yeah, so that's some of the stuff we learn in the deep learning course. 00:36:34.140 |
"So how is it different from using the Adam optimizer or something?" That's the kind of stuff we can do — 00:36:39.780 |
I mean you still need annealing as I say we do this kind of stuff in the deep learning course 00:36:43.780 |
So for now, we're just going to stick to standard SGD. I 00:36:46.540 |
Had a question about the data loading. Yeah, I know it's a fast AI function 00:36:53.580 |
But could you go into a little bit detail of how it's creating batches how it's learning data and how it's making those decisions 00:37:03.460 |
It would be good to ask that on Monday night, so we can talk about it in detail in the deep learning class. 00:37:16.300 |
where they basically say: let's create a thing called a Dataset. 00:37:21.140 |
Right and a data set is basically something that looks like a list. It has a length 00:37:28.740 |
right and so that's like how many images are in the data set and it has the ability to 00:37:35.780 |
Index into it like a list right so if you had like D equals data set 00:37:41.820 |
You can do length D, and you can do D of some index right that's basically all the data set 00:37:47.860 |
Is as far as pytorch is concerned and so you start with a data set, so it's like okay? 00:37:53.220 |
D 3 gives you the third image. You know or whatever 00:37:58.140 |
And so then the idea is that you can take a data set and you can pass that into a constructor for a data loader 00:38:12.020 |
That gives you something which is now iterable, right? So you can now say iter on the 00:38:17.220 |
DL, and that's something that you can call next on, and 00:38:23.660 |
What that now is going to do is if when you do this you can choose to have shuffle on or shuffle off shuffle on 00:38:31.060 |
Means give me random mini-batch shuffle off means go through it sequentially 00:38:38.980 |
What the data loader does now, when you say next — basically, assuming you said shuffle equals true — is it's going to grab, 00:38:45.220 |
You know if you've got a batch size of 64 64 random integers between 0 and length and call this 00:38:53.220 |
64 times to get 64 different items and jam them together 00:39:06.540 |
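
A minimal sketch of the two concepts in plain PyTorch (illustrative data): a Dataset is anything with a length and integer indexing; a DataLoader wraps it into an iterable of (optionally shuffled) mini-batches.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ArrayDataset(Dataset):
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __len__(self):              # "how many items are in the data set"
        return len(self.x)
    def __getitem__(self, i):       # "give me the i-th item"
        return self.x[i], self.y[i]

d = ArrayDataset(torch.randn(1000, 28 * 28), torch.randint(0, 10, (1000,)))
dl = DataLoader(d, batch_size=64, shuffle=True)   # shuffle=True -> random mini-batches

xb, yb = next(iter(dl))
print(xb.shape, yb.shape)   # torch.Size([64, 784]) torch.Size([64])
```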
We just do some of the details differently so specifically particularly with computer vision 00:39:15.940 |
there's so much pre-processing — data augmentation like flipping, changing the colors a little bit, rotating — those turn out to be really 00:39:23.340 |
computationally expensive; even just reading the JPEGs turns out to be computationally expensive. 00:39:27.820 |
So PyTorch uses an approach where it fires off multiple processes to do that in parallel, 00:39:34.020 |
whereas the fastai library instead does something called multi-threading, which can be a much faster way of doing it. 00:39:46.140 |
"So an epoch — is it really an epoch in the sense that all of the elements — so it shuffles at the beginning of the 00:39:53.900 |
epoch, something like that?" Yeah, yeah — I mean, not all libraries work the same way; some do sampling with replacement. 00:40:01.660 |
We actually — the fastai library hands the shuffling off to the actual PyTorch version, 00:40:09.260 |
and I believe the PyTorch version, yeah, actually shuffles, and an epoch covers everything once, I believe. 00:40:15.220 |
Okay, now the thing is when you start to get these bigger networks 00:40:25.100 |
Potentially you're getting quite a few parameters 00:40:32.860 |
I want to ask you to calculate how many parameters there are — but let's remember, here we've got 00:40:37.880 |
28 by 28 input into 100 output and then 100 into 100 and then 100 into 10 00:40:44.740 |
all right, and then for each of those we've got weights and biases. 00:40:56.180 |
returns a list where each element of the list is a matrix — actually a tensor — of 00:41:02.100 |
the parameters for that — not just for that layer: 00:41:05.860 |
if it's a layer with both weights and biases, that would be two parameter tensors, right? 00:41:09.820 |
So it basically returns us a list of all of the tensors containing the parameters. 00:41:14.980 |
numel in PyTorch tells you how big that is, right? So if I run this — 00:41:27.900 |
So I've got seven hundred and eighty four inputs and the first layer has a hundred outputs 00:41:32.900 |
So therefore the first weight matrix is of size seventy eight thousand four hundred 00:41:37.300 |
Okay, and the first bias vector is of size a hundred and then the next one is a hundred by a hundred 00:41:42.900 |
Okay, and there's a hundred and then the next one is a hundred by ten, and then there's my bias, okay? 00:41:48.820 |
So there's the number of elements in each layer, and if I add them all up. It's nearly a hundred thousand 00:41:54.420 |
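
The count just described, as a one-liner over the 784→100→100→10 net defined above: net.parameters() yields one tensor per weight matrix or bias vector, and numel() gives the number of elements in each.

```python
sizes = [p.numel() for p in net.parameters()]
print(sizes)        # [78400, 100, 10000, 100, 1000, 10]
print(sum(sizes))   # 89,610 -- "nearly a hundred thousand" parameters
```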
Okay, and so I'm possibly at risk of overfitting. Yeah, all right, so 00:42:01.620 |
We might want to think about using regularization 00:42:05.020 |
So a really simple common approach to regularization in all of machine learning 00:42:19.980 |
It's super important super handy. You can use it with just about anything right and the basic idea 00:42:31.540 |
L2 regularization — the basic idea is this. Normally we'd say our loss is 00:42:35.700 |
equal to — let's just do RMSE to keep things kind of simple — 00:42:39.660 |
it's equal to our predictions minus our actuals, 00:42:43.180 |
you know, squared, and then we sum them up and take the average. 00:42:52.620 |
What if we then want to say you know what like if I've got lots and lots of parameters? 00:42:58.660 |
Don't use them unless they're really helping enough right like if you've got a million parameters, and you only really needed 10 00:43:05.940 |
Parameters to be useful just use 10 right so how could we like tell the loss function to do that? 00:43:12.820 |
And so basically what we want to say is hey if a parameter is zero 00:43:17.220 |
That's no problem. It's like it doesn't exist at all so let's penalize a parameter 00:43:26.740 |
Right so what would be a way we could measure that? 00:43:29.940 |
How can we like calculate how unzero our parameters are 00:43:42.940 |
"You calculate the average of all the parameters." That's my first — it can't quite be the average... 00:43:53.780 |
Close yes, Taylor. Yeah. Yes, you figured it out. Okay? 00:43:59.900 |
Assuming all of our data has been normalized standardized however you want to call it 00:44:03.900 |
we want to check that they're, like, significantly different from zero, right?" Would that be — not the data, the parameters, 00:44:09.460 |
rather, would be significantly — and the parameters don't have to be normalized or anything; they're just calculated, right? 00:44:14.780 |
"Yeah, so significantly different from zero, right, as well — 00:44:17.340 |
I just meant, assuming that the data has been normalized, so that we can compare them." Oh, yeah, got it. Yeah, right. 00:44:23.820 |
And then those that are not significantly different from zero we can probably just drop 00:44:28.460 |
And I think Chen she's going to tell us how to do that. You just figured it out, right? 00:44:31.380 |
"The mean of the absolute —" We could do that; that would be called L1, which is great. So L1 00:44:43.180 |
is the absolute value of the weights, averaged; L2 is actually the sum 00:44:51.060 |
Yeah, yeah exactly so we just take this we can just we don't even have to square root 00:44:55.340 |
So we just take the squares of the weights themselves, and then like we want to be able to say like okay 00:45:06.580 |
Not being zero right because if we actually don't have that many parameters 00:45:10.740 |
We don't want to regularize much at all if we've got heaps. We do want to regularize a lot right so then we put a 00:45:18.580 |
Parameter yeah, right except I have a rule in my classes. Which is never to use Greek letters, so normally people use alpha 00:45:27.540 |
So this is some number, which you often see something around kind of 1e-6 to 1e-4. 00:45:48.020 |
When you think about it, we don't actually care about the loss other than like maybe to print it out 00:45:51.800 |
All we actually care about is the gradient of the loss 00:46:07.220 |
Right so there are two ways to do this we can actually modify our loss function to add in this square 00:46:18.340 |
We could modify that thing where we said weights equals weights minus 00:46:23.760 |
Gradient times learning rate to subtract that 00:46:34.780 |
These are roughly — these are kind of basically equivalent, but they have different names. This is called L2 regularization, 00:46:34.780 |
was how it was first posed in the neural network literature, whereas this other version is kind of 00:46:56.140 |
how it was posed in the statistics literature, and, yeah, you know, they're equivalent. 00:47:03.060 |
As we talked about in the deep learning class it turns out 00:47:06.380 |
they're not exactly equivalent, because when you have things like momentum and Adam it can behave differently. And two weeks ago a researcher 00:47:16.820 |
showed how to do proper weight decay in modern optimizers, and one of our fastai students just implemented that in the fastai library. 00:47:39.380 |
But actually, it turns out, based on this paper from two weeks ago, it's actually L2 regularization — 00:47:43.980 |
it's not quite correct, but it's close enough. So here we can say weight decay is 1e-3, 00:47:48.780 |
so it's going to set our penalty multiplier a to 1e-3, and it's going to add that to the loss function. 00:47:57.020 |
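
In symbols, the penalized loss is roughly loss = MSE(preds, actuals) + a * sum(w^2), and differentiating the penalty gives the weight-decay form of the update, w <- w - lr * (grad + 2*a*w). A sketch of both routes, with net, loss, x, y, optim as in the snippets above (PyTorch's optimizers fold the penalty into the update via their weight_decay argument, equivalent up to constant factors):

```python
# Route 1 -- weight decay: let the optimizer apply the penalty inside each update step
opt = optim.SGD(net.parameters(), lr=1e-1, weight_decay=1e-3)

# Route 2 -- L2 regularization: add the penalty term to the loss yourself
a = 1e-3
l = loss(net(x), y) + a * sum((p ** 2).sum() for p in net.parameters())
```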
Okay, and so let's make a copy of these cells 00:48:01.020 |
Just so we can compare hope this actually works 00:48:06.180 |
Okay, and we'll set this running okay, so this is now optimizing 00:48:13.460 |
If you're actually — so, I've made a mistake here, which is I didn't rerun 00:48:17.820 |
this cell. This is an important thing to kind of remember: since I didn't rerun this cell 00:48:23.340 |
here, when it created the optimizer and said net.parameters, 00:48:27.700 |
It started with the parameters that I had already trained right so I actually hadn't recreated my network 00:48:33.820 |
Okay, so I actually need to go back and rerun this cell first to recreate the network 00:48:49.500 |
So you might notice some notice something kind of kind of counterintuitive here 00:49:02.580 |
that's our training error right now. You would expect our training error with regularization to be worse — 00:49:10.860 |
that makes sense, right? Because we're, like, we're penalizing — 00:49:26.980 |
So the reason that can happen is that if you have a function 00:49:35.540 |
Right it takes potentially a really long time to train 00:49:38.900 |
or else if you have a function that kind of looks more like 00:49:45.740 |
And there are certain things that you can do which sometimes just like can take a function 00:49:50.820 |
That's kind of horrible and make it less horrible, and it's sometimes weight decay can actually 00:49:56.500 |
Make your functions a little more nicely behaved, and that's actually happened here 00:50:01.060 |
So like I just mentioned that to say like don't let that confuse you right like weight decay really does 00:50:07.060 |
penalize the training set. And so, strictly speaking, 00:50:10.140 |
the final number we get to for the training set shouldn't end up being better. 00:50:26.260 |
"I don't get it. Okay — why is it making it faster? Like, does the time matter — like the training time?" 00:50:32.500 |
No — this is after one epoch. Yeah, right, so after one epoch. 00:50:38.020 |
Now congratulations for saying I don't get it. That's like the best thing anybody can say you know so helpful 00:50:58.420 |
Okay, and this here is our training with weight decay, okay, so this is not related to time 00:51:08.420 |
Right — after one epoch, my claim was that you would expect the training set, all other things being equal, 00:51:23.260 |
to be worse, because we're penalizing it — you know, this has no penalty, this has a penalty, so the thing with a penalty should be worse. And 00:51:37.980 |
Because in a single epoch it matters a lot as to whether you're trying to optimize something 00:51:44.380 |
That's very bumpy or whether you're trying to optimize something. That's kind of nice and smooth 00:51:50.340 |
If you're trying to optimize something that's really bumpy like imagine in some high-dimensional space, right? 00:51:56.220 |
You end up kind of rolling around through all these different tubes and tunnels and stuff 00:52:01.940 |
you know — or else, if it's just smooth, you just go boom. 00:52:04.800 |
It's like, imagine a marble rolling down a hill, where one of them, you've got, like — 00:52:09.980 |
there's a road called Lombard Street in San Francisco — it's like backwards, forwards, backwards, forwards; 00:52:15.260 |
it takes a long time to drive down the road, right? 00:52:17.980 |
Whereas, you know, if you kind of took a motorbike and just went straight over the top, you're just going boom, right. So 00:52:23.500 |
so the kind of shape of the loss function surface, 00:52:28.580 |
you know, impacts — or kind of defines — how easy it is to optimize, and therefore how 00:52:34.500 |
far it can get in a single epoch. And based on these results, 00:52:39.100 |
it would appear that weight decay here has made this function easier to optimize. 00:52:48.180 |
"The penalizing is making the optimizer more likely to reach the global minimum?" 00:52:54.120 |
No, I wouldn't say that my claim actually is that at the end 00:52:58.180 |
it's probably going to be less good on the training set — and indeed, that does look to be the case at the end: 00:53:07.900 |
the training set is now worse with weight decay. Now, that's what I would expect, right? 00:53:12.820 |
I would expect like if you actually find like I never use the term global optimum because 00:53:17.300 |
It's just not something we have any guarantees about we don't really care about we just care like where do we get to after? 00:53:25.620 |
We hope that we found somewhere. That's like a good solution 00:53:28.660 |
And so by the time we get to like a good solution the training set with weight decay the loss is worse 00:53:41.860 |
Right because we penalized the training set in order to kind of try and create something that generalizes better 00:53:49.900 |
You know that the parameters that are kind of pointless are now zero and it generalizes better 00:53:54.060 |
Right — so all I'm saying is that it just got to a good point. 00:54:09.700 |
"But if — by 'it' you mean just weight decay — do you always make the function surface smoother?" 00:54:14.020 |
No, it's not always true, but it's like it's worth remembering that 00:54:21.620 |
if you're having trouble training a function adding a little bit of weight decay may 00:54:29.780 |
"So by regularizing the parameters, what it does is it smooths out the loss —" 00:54:39.740 |
you know the reason why we do it is because we want to penalize things that aren't zero to say like 00:54:44.780 |
Don't make this parameter a high number unless it's really helping the loss a lot right set it to zero if you can 00:54:51.660 |
Because setting as many parameters to zero as possible means it's going to generalize better, right? 00:54:59.060 |
network, right? So that's why we do it. 00:55:07.800 |
So let's — okay, just one moment. Okay, so I just wanted to check how we actually went here. 00:55:13.180 |
So after the second epoch — yeah, so you can see here, it really has helped, right? After the second epoch, 00:55:17.780 |
Before we got to 97% accuracy now. We're nearly up to about 98% accuracy 00:55:23.660 |
Right and you can see that the loss was 0.08 versus 0.13 right so adding regularization 00:55:38.340 |
Solution yes Erica, so there are two pieces to this right one is L2 regularization and the weight decay 00:55:47.500 |
No, there's so my claim was they're the same thing, right? 00:55:50.940 |
So weight decay is the version if you just take the derivative of L2 regularization you get weight decay 00:55:58.020 |
So you can implement it either by changing the loss function with a squared loss penalty, 00:56:06.540 |
or by adding the weights themselves as part of the gradient, okay? 00:56:11.820 |
Yeah, I was just going to finish the questions. Yes. Okay pass it to division 00:56:16.820 |
"Can we use regularization on a convolutional layer as well?" Absolutely — a convolutional layer is just weights, so yep. 00:56:28.140 |
And Jeremy can you explain why you thought you needed weight decay in this particular problem? 00:56:34.580 |
Not easily — I mean, other than to say it's something that I would always try. "You were overfitting, found —" Well, yeah, I mean, okay, so — 00:56:45.220 |
even if I — yeah, okay, that's a good point, Yannet. So if my training loss 00:56:56.660 |
was higher than my validation loss, then I'm underfitting, 00:57:00.340 |
Right, so there's definitely no point regularizing right if like that would always be a bad thing 00:57:06.920 |
That would always mean you need like more parameters in your model 00:57:09.900 |
In this case, I'm overfitting — that doesn't necessarily mean regularization will help, but it's certainly worth trying. 00:57:18.620 |
Thank you, and that's a great point. There's one more question. Yeah 00:57:24.620 |
"So how do you choose the optimal number of epochs?" 00:57:37.140 |
It's — that's a long story, and lots and lots of — 00:57:41.820 |
it's a bit of both. We just don't — as I say, we don't have time to cover 00:57:50.900 |
best practices in this class; we're going to learn the kind of fundamentals. Yeah, okay, so let's take a — 00:58:12.020 |
So something that we cover in great detail in the deep learning course 00:58:18.060 |
but it's, like, really important to mention here: the secret, in my opinion, to kind of modern machine learning techniques is to massively over-parameterize 00:58:28.900 |
the solution to your problem, right? Like, as we've done here — you know, we've got like a hundred thousand weights 00:58:34.740 |
when we only had a small number of 28 by 28 images, 00:58:49.500 |
statistics and learning was done for decades before 00:58:55.580 |
senior lecturers at most universities, in most areas, have this background where they've learned the correct way to build a model is 00:59:05.460 |
Right and so hopefully we've learned two things so far. You know one is we can build 00:59:11.780 |
Very accurate models even when they have lots and lots of parameters 00:59:17.420 |
Like a random forest has a lot of parameters and you know this here deep network has a lot of parameters 00:59:25.860 |
And we can do that by either using bagging or by using 00:59:34.180 |
Okay, and regularization in neural nets means either weight decay or 00:59:42.660 |
dropout, which we won't worry too much about here. 00:59:55.260 |
Building useful models and like I just wanted to kind of warn you that once you leave this classroom 01:00:02.020 |
like, even possibly when you go to the next faculty member's talk — like, there'll be people at USF as well who — 01:00:13.340 |
models with small numbers of parameters — you know, your next boss is very likely to have been trained in the world of, like, models 01:00:23.780 |
being more pure or easier or better or more interpretable or whatever. I 01:00:29.140 |
am convinced that that is not true — probably not ever true, certainly very rarely true. 01:00:43.260 |
Models with lots of parameters can be extremely interpretable as we learn from our whole lesson of random forest interpretation 01:00:50.980 |
You can use most of the same techniques with neural nets, but with neural nets are even easier right remember how we did feature importance 01:01:01.020 |
randomizing a column to see how changes in that column would impact the output? 01:01:04.860 |
Well, that's just like a kind of dumb way of calculating its gradient — 01:01:10.100 |
how much does varying this input change the output? With a neural net we can actually calculate its gradient. 01:01:15.340 |
Right so with PI torch you could actually say what's the gradient of the output with respect to this column? 01:01:27.900 |
And you know I'll mention for those of you interested in making a real impact 01:01:35.020 |
Basically any of these things the neural nets all right so that that that whole area 01:01:40.180 |
Needs like libraries to be written blog posts to be written 01:01:46.620 |
But only in very narrow domains like computer vision as far as I know nobody's written the paper saying 01:01:54.420 |
Neural networks you know interpretation methods 01:02:03.620 |
So what we're going to do though is we're going to start with applying this 01:02:13.460 |
And this is mildly terrifying for me because we're going to do NLP and our NLP 01:02:17.980 |
Faculty expert is in the room so David just yell at me if I screw this up too badly 01:02:22.500 |
And so NLP refers to you know any any kind of modeling where we're working with with natural language text 01:02:40.860 |
a linear model is pretty close to the state-of-the-art for solving a particular problem. It's actually something where I 01:02:50.020 |
actually surpassed the state-of-the-art in this, using a — 01:02:56.980 |
but this is actually going to show you pretty close to the state-of-the-art with a linear model. 01:03:03.200 |
We're going to be working with the IMDB data set — so this is a data set of movie reviews. 01:03:16.000 |
Once you download it you'll see that you've got a train and a test 01:03:26.000 |
In your train directory you'll see there's a negative and a positive directory and in your positive directory 01:03:39.920 |
So somehow we've managed to pick out a story of a man who has unnatural feelings for a pig as our first choice 01:03:48.680 |
So we're going to look at these movie reviews 01:03:56.280 |
And for each one, we're going to look to see whether they were positive or negative 01:04:00.000 |
So they've been put into one of these folders; they were downloaded from IMDB, the movie database and review site. 01:04:06.840 |
The ones that were strongly positive went in 'positive', strongly negative went in 'negative', and the rest they didn't label at all. 01:04:14.600 |
So these are only highly polarized reviews so in this case. You know 01:04:18.080 |
We have an insane violent mob which unfortunately just too absurd 01:04:23.920 |
Too off-putting those in the area we turned off so the label for this was a zero which is 01:04:39.280 |
In the fastai library there's lots of little things for 01:04:44.600 |
most kinds of domains that you do machine learning on; for NLP, one of the simple things we have is texts from folders. 01:04:51.600 |
That's just going to go ahead and go through and find all of the folders in here 01:04:56.200 |
With these names and create a labeled data set and you know don't let these things 01:05:04.120 |
Ever stop you from understanding. What's going on behind the scenes? 01:05:07.680 |
Right, we can grab its source code, and as you can see it's tiny — you know, it's like five lines. 01:05:12.860 |
Okay, so I don't like to write these things out in full, 01:05:16.460 |
you know, but hide them behind little functions so you can reuse them. 01:05:19.880 |
But basically it's just going to go through each directory, and then within that, go through each file, and append its text to 01:05:34.240 |
this array of texts, and figure out what folder it's in and stick that into the array of labels, okay. So 01:05:41.280 |
that's how we basically end up with something where we have an array of 01:05:47.520 |
the reviews and an array of the labels, okay — so that's our data, so our job will be to take the reviews and predict the labels. 01:05:59.120 |
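
A sketch of what a "texts from folders" helper like that does (illustrative, not the actual fastai source; the path is hypothetical): walk the neg and pos sub-folders and build parallel arrays of review texts and 0/1 labels.

```python
import os
import numpy as np

def texts_from_folders(path, folders=('neg', 'pos')):
    texts, labels = [], []
    for label, folder in enumerate(folders):        # 0 = negative, 1 = positive
        for fname in sorted(os.listdir(os.path.join(path, folder))):
            with open(os.path.join(path, folder, fname), encoding='utf-8') as f:
                texts.append(f.read())
            labels.append(label)
    return np.array(texts), np.array(labels)

trn_texts, trn_labels = texts_from_folders('data/aclImdb/train')   # hypothetical path
val_texts, val_labels = texts_from_folders('data/aclImdb/test')
```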
Okay, and the way we're going to do it is we're going to throw away 01:06:04.920 |
Like all of the interesting stuff about language 01:06:08.640 |
Which is the order in which the words are in right now? This is very often not a good idea 01:06:15.480 |
But in this particular case it's going to turn out to work like not too badly 01:06:19.040 |
So let me show what I mean by, like, throwing away the order of the words. Like, normally the order of the words matters a lot — 01:06:26.280 |
if "not" comes before something, then that "not" refers to that thing, right? But the thing is, in this case, 01:06:32.840 |
We're trying to predict whether something's positive or negative if you see the word absurd appear a lot 01:06:38.040 |
Right then maybe that's a sign that this isn't very good 01:06:44.600 |
so, you know, "cryptic" — maybe that's a sign that it's not very good. So the idea is that we're going to turn it into something called a term-document matrix, 01:06:53.960 |
where for each document — i.e. each review — we're going to create a list of what words are in it, 01:06:58.880 |
Rather than what order they're in so let me give an example 01:07:12.640 |
This movie is good. The movie is good. They're both positive this movie is bad. The movie is bad 01:07:18.360 |
They're both negative right so I'm going to turn this into a term document matrix 01:07:23.400 |
So the first thing I need to do is create something called a vocabulary a vocabulary is a list of all the unique words 01:07:28.960 |
That appear okay, so here's my vocabulary this movie is good the bad. That's all the words 01:07:34.900 |
Okay, and so now I'm going to take each one of my movie reviews and turn it into a 01:07:41.280 |
Vector of which words appear and how often do they appear right and in this case none of my words appear twice 01:07:47.440 |
So this movie is good has those four words in it 01:08:03.440 |
Right and this representation we call a bag of words 01:08:08.680 |
representation, right? So this here is a bag-of-words representation of the review — 01:08:13.860 |
It doesn't contain the order of the text anymore. It's just a bag of the words 01:08:21.820 |
"movie", "this". Okay, so that's the first thing we're going to do — we're going to turn it into a bag-of-words 01:08:21.820 |
representation. And the reason that this is convenient 01:08:36.280 |
is that it gives us a nice rectangular matrix that we can, like, do math on. 01:08:39.200 |
Okay, and specifically we can do a logistic regression, and that's what we're going to do is we're going to get to a point 01:08:46.880 |
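
The tiny worked example above, written out as an actual term-document matrix (a sketch):

```python
import numpy as np

vocab = ['bad', 'good', 'is', 'movie', 'the', 'this']
docs = ['this movie is good', 'the movie is good',
        'this movie is bad',  'the movie is bad']

# one row per document, one column per vocabulary word, word counts in the cells
term_doc = np.array([[doc.split().count(w) for w in vocab] for doc in docs])
print(term_doc)
# [[0 1 1 1 0 1]
#  [0 1 1 1 1 0]
#  [1 0 1 1 0 1]
#  [1 0 1 1 1 0]]
labels = np.array([1, 1, 0, 0])   # positive, positive, negative, negative
```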
Before we get there, though, we're going to do something else, which is called naive Bayes, okay? 01:08:54.640 |
Scikit-learn has something which will create a term-document matrix for us — it's called CountVectorizer. Okay, so we'll just use it now. 01:09:04.680 |
You have to turn your text into a list of words first — and that's not as simple as it sounds, 01:09:13.880 |
because, like, what if this was actually 'This movie is good.' with 01:09:27.440 |
punctuation? Or, perhaps more interestingly, what if it was 'This movie isn't good'? 01:09:35.800 |
How you turn a piece of text into a list of tokens is called tokenization, right? 01:09:41.400 |
And so a good tokenizer would turn 'this movie isn't good.' into something like 01:09:49.560 |
'this' space 'movie' space 'is' space 'n't' space 'good' space '.' — so you can see in this version here, 01:09:57.320 |
if I now split this on spaces, every token is either a word, a single piece of punctuation, or something like this suffix n't that is 01:10:04.840 |
considered like a word. Right, and that's kind of how we would probably want to tokenize that piece of text, because you wouldn't want 01:10:14.200 |
'good.' to be, like, an object — right, because there's no concept of 'good-full-stop' — or 'isn't' treated as one word. 01:10:25.720 |
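Just to illustrate the idea (this is not fastai's tokenizer, which handles far more cases), a crude regex tokenizer along these lines would split out punctuation and the n't suffix:

```python
import re

def simple_tokenize(text):
    # peel "n't" off contractions, then grab word characters or single
    # punctuation marks as separate tokens -- a rough approximation only
    text = re.sub(r"n't\b", " n't", text.lower())
    return re.findall(r"n't|\w+|[^\w\s]", text)

print(simple_tokenize("This movie isn't good."))
# ['this', 'movie', 'is', "n't", 'good', '.']
```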
Tokenization is something we hand off to a tokenizer 01:10:27.720 |
Fast AI has a tokenizer in it that we can use 01:10:31.680 |
So this is how we create our term document matrix with a tokenizer 01:10:37.560 |
sklearn has a pretty standard API, which is nice — 01:10:45.840 |
I'm sure you've seen it a few times before now. Once we've built some kind of model — 01:10:55.800 |
this is just defining what it's going to do — we can call fit_transform 01:11:00.780 |
to do that. So in this case fit_transform is going to create the vocabulary and create the term document matrix from the training set. 01:11:11.920 |
transform is a little bit different: it says use the previously fitted model, which in this case means use the previously created vocabulary. 01:11:21.600 |
We wouldn't want the validation set and the training set to have 01:11:24.400 |
the words in different orders in the matrices, right, because then the columns would have different meanings. 01:11:29.480 |
So this here is saying: use the same vocabulary 01:11:32.200 |
to create a bag of words for the validation set. Could you pass that back, please? 01:11:38.280 |
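In code, that pattern looks roughly like this — the variable names trn_texts and val_texts are placeholders for the arrays of review strings, and a plain string-split tokenizer is used here as a stand-in for fastai's:

```python
from sklearn.feature_extraction.text import CountVectorizer

veczr = CountVectorizer(tokenizer=str.split)      # stand-in tokenizer

trn_term_doc = veczr.fit_transform(trn_texts)     # builds the vocabulary and transforms the training set
val_term_doc = veczr.transform(val_texts)         # reuses that same vocabulary for the validation set
```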
What if the validation set has a different set of words than the training set? Yeah, that's a great question. So generally most 01:11:47.960 |
of these kinds of vocab-creating approaches will have a special token for unknown. 01:11:52.940 |
Sometimes you'll also say, like, hey, if a word appears less than three times, call it unknown; 01:12:00.640 |
but otherwise it's like, if you see something you haven't seen before, call it unknown. 01:12:05.000 |
So that would just become a column in the bag of words — 'unknown'. 01:12:09.080 |
Good question. All right, so when we create this 01:12:16.160 |
term document matrix of the training set, we have 25,000 rows, because there are 25,000 movie reviews, 01:12:25.620 |
and 75,132 columns. What does that represent? What does that mean, that there are 75,132 of them? 01:12:33.560 |
It's all the vocabulary? Yeah, go on — what do you mean? 01:12:38.880 |
Like, the number of unique words. Yeah, exactly, good — 01:12:54.040 |
the number of unique words. All right, so we don't want to actually store that as 01:12:59.040 |
a normal array in memory, because it's going to be very wasteful. Instead we store it as a sparse 01:13:06.520 |
matrix. All right, and what a sparse matrix does is it just stores the 01:13:16.560 |
whereabouts of the non-zeros. So it says, like, okay: document number one, term number such-and-such 01:13:25.560 |
appears, and it has four of them; you know, document one, term number something-else, 01:13:35.120 |
that appears and it's a one; and so forth. That's basically how it's stored. 01:13:41.000 |
There's actually a number of different ways of storing sparse matrices, 01:13:43.520 |
and if you do Rachel's computational linear algebra course, you'll learn about the different types and why you'd choose them and how to convert between them 01:13:50.560 |
and so forth. But they're all kind of something like this, and on the whole you don't really have to worry about the details. 01:13:57.640 |
The important thing to know is that it's efficient. Okay, and so we could grab the first review, and that gives us back a 01:14:11.400 |
one-row-long sparse matrix with 93 stored elements. So in other words, 01:14:16.980 |
93 of those words are actually used in the first document, okay? 01:14:22.820 |
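If you want to see what that storage looks like, here's a tiny sketch with scipy (not taken from the notebook): only the coordinates and values of the non-zero cells are kept.

```python
import numpy as np
from scipy import sparse

# a dense row that is mostly zeros...
dense = np.array([[0, 2, 0, 0, 1, 0, 0, 3]])

# ...stored sparsely: just the (row, column) positions and their values
row = sparse.csr_matrix(dense)
print(row.nnz)   # 3 stored elements
print(row)       # lists (0, 1) 2, (0, 4) 1, (0, 7) 3
```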
We can have a look at the vocabulary by saying vectorizer.get_feature_names() — that gives us the vocab, 01:14:29.440 |
and so here's an example of a few of the elements of get_feature_names. 01:14:36.880 |
I didn't intentionally pick the bit that had 'aussie', but, you know, those are the important words, obviously. 01:14:44.280 |
I haven't used the tokenizer here — I'm just splitting on space — so this isn't quite the same as what the vectorizer does. 01:14:55.360 |
By making it a set we make them unique, so this is 01:14:58.920 |
roughly the list of words that would appear, and that length is 91, 01:15:04.720 |
which is pretty similar to 93, and the difference will just be that I didn't use a real tokenizer. Yeah. 01:15:13.080 |
So that's basically all that's been done there: it's created this unique list of words and mapped them to integers. 01:15:19.600 |
We could check by calling vectorizer.vocabulary_ to find the ID of a particular word. 01:15:27.240 |
So this is like the reverse map of the previous one: that was integer to word, 01:15:31.760 |
here is word to integer. And so we saw 'absurd' appeared twice in the first document, 01:15:38.040 |
so let's check train_term_doc[0, 1297] — there 01:15:42.120 |
it is, it's 2. Whereas, unfortunately, 'aussie' didn't appear in the unnatural-relationship-with-a-pig movie, 01:15:49.720 |
so [0, 5000] is 0. Okay, so that's our term document matrix. 01:15:59.340 |
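In code, that check is roughly the following — assuming the veczr and trn_term_doc names from the sketch above, and assuming 'absurd' and 'aussie' are in the vocabulary:

```python
# word -> column index (the reverse of get_feature_names)
idx = veczr.vocabulary_['absurd']

# how many times 'absurd' appears in the first review
print(trn_term_doc[0, idx])                            # e.g. 2

# a word that doesn't appear in that review gives 0
print(trn_term_doc[0, veczr.vocabulary_['aussie']])    # e.g. 0
```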
Yes — so does it care about the relative relationship between the words, 01:16:08.480 |
as in the ordering of the words? No, we've thrown away the orderings — that's why it's a bag of words — and, as I said, that's not 01:16:16.520 |
necessarily a good idea. What I will say is that the vast majority of NLP work 01:16:23.880 |
that's been done over the last few decades generally uses this representation, because we didn't really know much better. 01:16:29.800 |
Nowadays, increasingly, we're using recurrent neural networks instead, which we'll learn about in our deep learning course. 01:16:40.080 |
But sometimes this representation works pretty well, and it's actually going to work pretty well in this case. 01:16:46.920 |
Okay, so in fact, you know, back when I was at FastMail, my email company, a 01:16:55.400 |
lot of the spam filtering we did used this next technique, naive Bayes, 01:17:00.440 |
which is a bag of words approach. It's just kind of like, you know, if you're getting a lot of 01:17:05.960 |
email containing the word Viagra, and it's always been spam, 01:17:09.760 |
and you never get email from your friends talking about Viagra, 01:17:13.480 |
then it's very likely that something that says Viagra, regardless of the detail of the language, is probably from a spammer. 01:17:19.880 |
All right, so that's the basic theory about classification using a term document matrix. Okay, so let's talk about naive Bayes, 01:17:28.280 |
and here's the basic idea. We're going to start with our term document matrix: 01:17:41.680 |
the first two rows are our corpus of positive reviews, these next two are our corpus of negative reviews, and so here's our whole corpus of all reviews. 01:17:55.720 |
We tend to call these, more generically, features rather than words, right? 01:18:00.480 |
'this' is a feature, 'movie' is a feature, 'is' is a feature, right? 01:18:04.800 |
So it's kind of more like machine learning language now: a column is a feature. 01:18:08.660 |
We'll often call those f. So we can basically say: the probability that you see the word 'this', 01:18:17.880 |
given that the class is one — given that it's a positive review — 01:18:23.620 |
is just the average of how often you see 'this' in the positive reviews. 01:18:39.880 |
Now, we need to be a bit careful, because a word might just never have appeared in a particular class. Right, so if I've never received an email from a friend that said Viagra, 01:18:47.040 |
all right, that doesn't actually mean the probability of a friend sending me an email about Viagra is zero. 01:18:56.260 |
I hope I don't get an email, you know, from Terrence tomorrow saying, like, 01:19:02.160 |
'Jeremy, you probably could use this advertisement for Viagra' — but, you know, it could happen, and, you know, 01:19:08.660 |
I'm sure it'd be in my best interest. 01:19:12.400 |
So what we do is we say: actually, what we've seen so far is not the full sample of everything that could happen; 01:19:20.120 |
it's just a sample of what's happened so far. So let's assume that the next email you get 01:19:26.480 |
actually does mention Viagra and every other possible word — so basically we're going to add a row of ones. 01:19:35.480 |
Okay, so that's like the email that contains every possible word, so that way nothing's ever 01:19:40.160 |
infinitely unlikely. Okay, so I take the average of the number of 01:19:48.360 |
times that 'this' appears in my positive corpus, plus the ones, and that gives me the probability that 01:19:59.320 |
feature 'this' appears in a document, given that class equals one. 01:20:06.840 |
And so, not surprisingly, here's the same thing 01:20:10.320 |
for the probability that the feature 'this' appears given class equals zero — the same calculation, except over the class-zero 01:20:18.080 |
rows. And obviously these two are the same, because 'this' appears 01:20:24.000 |
once in the positives and once in the negatives, okay. 01:21:04.880 |
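Written as a formula, with the extra row of ones showing up as the +1s, that per-feature probability is:

$$ p(f \mid c) \;=\; \frac{(\text{number of class-}c\ \text{documents containing } f) + 1}{(\text{number of class-}c\ \text{documents}) + 1} $$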
Now, what we want to know is: given that I've got this particular document — so somebody sent me this particular email, or I have this particular IMDB review — 01:21:21.080 |
what's the probability that it's, I don't know, positive? Right, so for this particular movie review, what's the probability that its class is 01:21:26.240 |
positive? And so we can say, well, that's equal to the probability of getting this particular movie review given that its class is positive, 01:21:42.400 |
multiplied by the probability that any movie review's class is positive, 01:21:46.960 |
divided by the probability of getting this particular movie review. 01:21:51.960 |
All right, that's just Bayes' rule. Okay, and so we can calculate that. 01:22:00.280 |
But actually what we really want to know is: is it more likely that this is class zero or class one? 01:22:12.000 |
So what if we took the probability that it's class one and divided it by the probability that it's class zero? 01:22:16.120 |
What if we did that? Right, and so then we could say, like, okay: 01:22:21.560 |
if this number is bigger than one, then it's more likely to be class one; if it's smaller than one, 01:22:26.760 |
it's more likely to be class zero. Right, so in that case we could just divide the whole expression 01:22:35.280 |
by the same version for class zero, which is the same as multiplying it by the reciprocal. 01:22:41.680 |
And the nice thing is, that's now going to put a probability of getting the data, P(d), on top here, which we can get rid of, 01:22:46.960 |
and a probability of getting the data given class zero down here, and the probability of class 01:22:55.560 |
zero here. Right, and so basically what that means is: we want to calculate 01:23:02.560 |
the probability that we would get this particular document given that the class is one, 01:23:08.760 |
times the probability that the class is one, divided by the probability of getting this particular document given that the class is zero, times the probability that the class is zero. 01:23:26.360 |
Right, and the probability that the class is zero is just one minus the probability that it's one. 01:23:36.640 |
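In symbols, for a document $d$, that ratio is:

$$ \frac{P(c=1 \mid d)}{P(c=0 \mid d)} \;=\; \frac{P(d \mid c=1)\,P(c=1)\,/\,P(d)}{P(d \mid c=0)\,P(c=0)\,/\,P(d)} \;=\; \frac{P(d \mid c=1)\,P(c=1)}{P(d \mid c=0)\,P(c=0)} $$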
So there are those two numbers — I've got an equal amount of both, so they're both 0.5. 01:23:40.960 |
What is the probability of getting this document given that the class is one? Can anybody tell me how I would calculate that? 01:24:02.640 |
'Look at all the documents which have class equal to one... uh-huh... and one divided by that will give you...' 01:24:08.500 |
So remember, though, it's going to be for a particular document. So, for example, we'd be saying, like, what's the probability that 01:24:14.960 |
this review is positive? Right, so you're on the right track, 01:24:20.200 |
but what we're going to have to do is say: let's just look at the per-word probabilities 01:24:31.320 |
for class equals one. Right, so the probability that a class-one review has 'this' is 01:24:40.120 |
two-thirds; the probability it has 'movie' is one, 'is' is one, and 'good' is one. 01:24:47.960 |
So the probability it has all of them is all of those multiplied together. 01:24:51.880 |
Well, kind of. Tyler, why is it only kind of? Can you pass it to Tyler? 01:24:58.240 |
I'm so glad you look horrified and skeptical. 'Word choice is not independent.' 01:25:16.160 |
Exactly — because this is what happens if you take Bayes' theorem in a naive way, and Tyler is not naive: he knows better. Right, so 01:25:23.880 |
naive Bayes says: let's assume that if you have 'this movie is bloody stupid', 01:25:30.400 |
then the probability of seeing each of those words is independent — the probability of 'bloody' is independent of the probability of 'stupid', right? 01:25:36.880 |
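That independence assumption is exactly the 'naive' part; written out, it says the document probability factorizes into a product over its features:

$$ P(d \mid c) \;\approx\; \prod_{f \in d} P(f \mid c) $$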
That assumption is definitely not true, and so naive Bayes isn't actually very good, 01:25:41.920 |
but I'm kind of teaching it to you because it's going to turn out to be a convenient 01:25:46.560 |
piece for something we're about to learn later. 01:25:51.120 |
It's okay, right? I mean, I would never choose it — 01:25:55.760 |
like, I don't think it's better than any other technique that's equally fast and equally easy — 01:25:59.520 |
but, you know, it's a thing you can do, and it's certainly going to be a useful foundation. 01:26:08.080 |
So here is our calculation of the probability 01:26:15.360 |
that we get this particular document assuming it's a positive review; here's the probability given 01:26:21.160 |
it's a negative one; and here's the ratio. And this ratio is above one, 01:26:25.280 |
so we're going to say: I think that this is probably a positive review. Okay, so that's the Excel version. 01:26:34.080 |
You can tell that I let Yannet touch this, because it's got LaTeX in it — we've got actual math. So 01:26:51.960 |
here it is written out as Python. Okay, so our independent variable is our term document matrix, 01:26:58.520 |
and our dependent variable is just the labels, the y. 01:27:09.680 |
Okay, and so then, for the positive reviews, we can sum over the rows to get the total count 01:27:16.720 |
for each feature across those documents, 01:27:21.040 |
plus one — because that's the email Terrence is totally going to send me, something about Viagra, today. 01:27:26.240 |
I can tell. Yeah, okay. So I'll do the same thing for the negative reviews. 01:27:31.700 |
Right, and then of course it's nicer to take the log, 01:27:36.560 |
because if we take the log then we can add things together rather than multiply them together, 01:27:41.680 |
and once you multiply enough of these things together, 01:27:44.280 |
it's going to get so close to zero that you'll probably run out of floating point. Right, so we take the log of the ratio. 01:27:54.560 |
Then, as I say, since we're in log space, instead of multiplying things we add them; so to 01:28:11.800 |
multiply the Bayes probabilities by the counts, we can just use a matrix multiply, and to add in 01:28:21.880 |
the log of the class ratios, we can just use + b. And so we end up with something that looks a lot like our 01:28:31.520 |
logistic regression, right? But we're not learning anything — not in an SGD point of view; 01:28:38.280 |
we're just calculating it using this theoretical model. 01:28:41.560 |
Okay, and so, as I said, we can then check whether that's bigger or smaller than zero — 01:28:46.400 |
not one anymore, because we're now in log space — 01:28:48.720 |
and then we can compare the predictions to the labels, take the mean, and say, okay, that's about 80%, 81% accurate. 01:28:56.000 |
Right, so naive Bayes, you know, is not nothing — it gave us something. Okay? 01:29:05.320 |
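Here's a minimal sketch of that whole calculation, assuming x is the training term document matrix as a scipy sparse matrix, y is the array of 0/1 labels, and x_val / y_val are the validation equivalents — it mirrors the idea described above rather than being the notebook's exact code:

```python
import numpy as np

def pr(x, y, y_i):
    # smoothed count of each feature in class y_i; the +1s are the added row of ones
    p = np.asarray(x[y == y_i].sum(0)).ravel() + 1
    return p / ((y == y_i).sum() + 1)

r = np.log(pr(x, y, 1) / pr(x, y, 0))          # log ratio of per-class feature probabilities
b = np.log((y == 1).mean() / (y == 0).mean())  # log ratio of the class priors

pre_preds = x_val @ r + b                      # looks just like a linear model: x @ w + b
preds = pre_preds > 0                          # compare to 0 because we're in log space
print((preds == y_val).mean())                 # accuracy
```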
Now, this is the version where we're actually looking at how often a word appears — like, 'absurd' appeared twice. 01:29:13.040 |
It turns out, at least for this problem and quite often, that it doesn't matter whether 'absurd' appeared twice or once; all that matters is that it appeared at all. 01:29:21.200 |
So what people tend to try doing is to take the sign of the term document matrix, which 01:29:31.060 |
replaces anything positive with one and anything negative with negative one (we don't have any negative counts, obviously). So this 01:29:38.200 |
binarizes it — it says: I don't care that you saw 'absurd' twice, 01:29:42.720 |
I just care that you saw it. Right, so we can do exactly the same thing with the binarized version. 01:29:55.060 |
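Continuing the sketch above, the binarized version just runs the same calculation on the sign of the matrices:

```python
# .sign() keeps only presence/absence: every positive count becomes 1
r_b = np.log(pr(x.sign(), y, 1) / pr(x.sign(), y, 0))
preds_b = (x_val.sign() @ r_b + b) > 0
print((preds_b == y_val).mean())
```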
Okay, now, this is the difference between theory and practice. In theory, 01:30:05.680 |
naive Bayes sounds okay, but it's naive — unlike Tyler, it's naive. Right, so what Tyler would probably do instead is say: rather than assuming 01:30:17.120 |
that I should use these coefficients r, why don't we learn them? Does that sound reasonable, Tyler? 01:30:24.480 |
Yeah, okay, so let's learn them — we can totally learn them. So let's create a logistic regression and fit 01:30:33.800 |
some coefficients, and that's going to literally give us something with exactly the same functional form that we had before, 01:30:43.160 |
but rather than a theoretical r and a theoretical b, we're going to calculate the two things based on logistic regression — and that's better. 01:30:57.640 |
Why do something based on some theoretical model? Because theoretical models are pretty much never 01:31:03.160 |
going to be as accurate as a data-driven model — 01:31:11.080 |
unless it's, I don't know, some physics thing or something where you're like, okay, 01:31:13.840 |
this is actually how the world works, 01:31:17.360 |
we're working in a vacuum, and this is the exact gravity, and blah blah blah. But for most of the real world, 01:31:23.080 |
this is how things are: it's better to learn your coefficients than to calculate them. Yes? 01:31:41.120 |
Our term document matrix is much wider than it is tall, and there's 01:31:47.780 |
basically an almost mathematically equivalent reformulation of logistic regression that happens to be a lot faster when it's wider than it is tall. 01:31:55.680 |
So the short answer is: any time 01:31:58.760 |
it's wider than it is tall, put dual=True and it will run fast — this runs in like two seconds; 01:32:03.200 |
if you don't have it here, it'll take a few minutes. 01:32:06.880 |
So, like, in math there's this concept of dual versions of problems, which are kind of 01:32:12.480 |
equivalent versions that sometimes work better for certain situations. 01:32:17.360 |
Okay, here is the binarized version, 01:32:26.600 |
and it's about the same — you can see I've fitted it with the sign of the term document matrix. 01:32:37.840 |
Now, the thing is that this is going to learn a coefficient for every term — 01:32:45.320 |
there were about 75,000 terms in our vocabulary — 01:32:49.480 |
and that seems like a lot of coefficients, given that we've only got 01:32:53.400 |
25,000 reviews. So maybe we should try regularizing this. 01:33:00.400 |
We can use the regularization built into sklearn's LogisticRegression class: C is the parameter that they use, and — 01:33:07.640 |
this is slightly weird — a smaller parameter is more regularization, right? 01:33:12.120 |
That's why I used a really big number to basically turn off regularization before. So if I turn on regularization, setting it to 0.1, 01:33:18.440 |
then now it's 88 percent. Okay, which makes sense: you would think, like, 01:33:25.400 |
25,000 parameters for 25,000 documents, you know, it's likely to overfit. Indeed, it did overfit. 01:33:30.880 |
So this is adding L2 regularization to avoid overfitting 01:33:37.880 |
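In sklearn terms, the fits described above look roughly like this — same placeholder names as before; the huge C is just a stand-in for 'effectively no regularization', and newer sklearn versions need solver='liblinear' for dual=True:

```python
from sklearn.linear_model import LogisticRegression

# essentially unregularized: a huge C, dual formulation because the matrix is wide
m = LogisticRegression(C=1e8, dual=True, solver='liblinear')
m.fit(trn_term_doc.sign(), y)
print((m.predict(val_term_doc.sign()) == y_val).mean())

# regularized version: smaller C means more regularization
m = LogisticRegression(C=0.1, dual=True, solver='liblinear')
m.fit(trn_term_doc.sign(), y)
print((m.predict(val_term_doc.sign()) == y_val).mean())
```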
I mentioned earlier that as well as L2, which is looking at the square of the weights, there's also L1, 01:33:44.440 |
which is looking at just the absolute value of the weights, right? I was 01:33:55.760 |
kind of sloppy in my wording before — I said that L2 tries to make things zero. 01:34:00.920 |
That's kind of true, but if you've got two things that are highly correlated, 01:34:04.960 |
then L2 regularization will move them both down together; 01:34:10.040 |
it won't make one of them zero and one of them non-zero, right? 01:34:13.440 |
So L1 regularization actually has the property that it'll try to make as many things zero as possible, 01:34:20.680 |
whereas L2 regularization has the property that it tends to try to make everything smaller. 01:34:25.560 |
We actually don't care about that difference much in 01:34:28.640 |
really any modern machine learning, because we very rarely try to directly interpret the coefficients; we try to understand our models through 01:34:37.660 |
interrogation, using the kinds of techniques that we've learned. 01:34:40.800 |
The reason that we would care about L1 versus L2 is simply which one ends up with a better error on the validation set. 01:34:52.320 |
Also, with sklearn's logistic regression, L2 turns out to be a lot faster here, because you can't use dual=True unless you have L2 — 01:34:58.680 |
and L2 is the default, so I didn't really worry too much about that difference here. 01:35:21.260 |
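Concretely, the two penalties added to the loss are (with $\lambda$ controlling the strength; sklearn's C is roughly the inverse of $\lambda$, which is why smaller C means more regularization):

$$ \text{L2:}\;\; \lambda \sum_j w_j^2 \qquad\qquad \text{L1:}\;\; \lambda \sum_j \lvert w_j \rvert $$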
Earlier we learned about elastic net, right — like, combining L1 and L2? Yeah, yeah, you can do that, but, I mean, 01:35:34.080 |
yeah, I've never seen anybody find that useful. 01:35:42.880 |
Now, one trick is that when you do your count vectorizer — 01:35:48.200 |
wherever that was — when you do your CountVectorizer, you can also ask for n-grams. By default we get 01:35:54.960 |
unigrams, that is, single words, but if we say 01:35:59.560 |
ngram_range=(1, 3), that's also going to give us 01:36:04.920 |
bigrams and trigrams. By which I mean, if I now say, okay, let's go ahead and 01:36:10.920 |
do the CountVectorizer get_feature_names again, now my vocabulary includes bigrams — 01:36:18.280 |
right, 'by fast', 'by vengeance' — and trigrams — 'by vengeance .', 01:36:23.480 |
'by vera miles'. Right, so this is now doing the same thing, but after tokenizing 01:36:28.960 |
it's not just grabbing each word and saying that's part of our vocabulary, 01:36:32.600 |
but also each two words next to each other and each three words next to each other. And this turns out to be, like, 01:36:38.560 |
super helpful in taking advantage of bag-of-words 01:36:44.280 |
approaches, because we now can see the difference between, say, 'good' and 'not good', 01:36:54.840 |
or even like double-quote 'good' double-quote, which is probably going to be sarcastic. Right, so using trigram features 01:37:04.120 |
actually is going to turn out to make both naive Bayes 01:37:08.840 |
and logistic regression quite a lot better — it really takes us quite a lot further and makes them quite useful. 01:37:21.320 |
You're setting max_features — so how does that work? I just 01:37:36.440 |
didn't want to create too many features. I mean, it actually worked fine even without max_features — I think I had something like, I 01:37:42.080 |
can't remember, 70 million coefficients; it still worked, but there's just no need to have 70 million coefficients. 01:37:51.760 |
The CountVectorizer will sort the vocabulary by how often everything appears, whether it be unigram, bigram, or trigram, and cut it off 01:38:03.440 |
after the first 800,000 most common n-grams — n-gram is just the generic word for unigrams, bigrams, and trigrams. 01:38:12.800 |
So that's why train_term_doc.shape is now 25,000 by 01:38:18.320 |
800,000. And, like, if you're not sure what number this should be, I 01:38:22.240 |
just picked something that was really big and, you know, didn't worry about it too much, and it seemed to be fine. 01:38:32.840 |
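Putting that together, the vectorizer call looks roughly like this — same placeholder names as before, with a plain string-split tokenizer standing in for a real one:

```python
from sklearn.feature_extraction.text import CountVectorizer

veczr = CountVectorizer(ngram_range=(1, 3),    # unigrams, bigrams and trigrams
                        tokenizer=str.split,   # stand-in tokenizer
                        max_features=800000)   # keep only the most common n-grams

trn_term_doc = veczr.fit_transform(trn_texts)
val_term_doc = veczr.transform(val_texts)
print(trn_term_doc.shape)                      # e.g. (25000, 800000)
```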
All right, okay — well, we're out of time, so here's what we're going to see 01:38:38.200 |
next week. By the way, you know, we could have 01:38:41.680 |
replaced this logistic regression with our PyTorch version, and next week 01:38:47.880 |
we'll actually see something in the fastai library that does exactly that. 01:38:51.480 |
But also, what we'll see next week — well, tomorrow — is 01:38:57.480 |
how to combine logistic regression and naive Bayes together to get something that's better than either, 01:39:03.080 |
and then we'll learn how to move from there to create a 01:39:06.800 |
deeper neural network, to get a pretty much state-of-the-art result for structured learning. All right, so we'll see you then.