
Machine Learning 1: Lesson 10


Chapters

0:00 Fast AI
1:22 Feature Engineering
4:25 Structured Data
8:34 Recap
11:53 AutoGrad
13:12 Variables
15:03 Iterators
24:08 Gradients
37:08 Data Loader
40:18 Parameters
47:30 Weight Decay
55:08 Discussion

Whisper Transcript | Transcript Only Page

00:00:00.000 | Well, welcome back to machine learning one of the most exciting things this week
00:00:05.480 | Almost certainly the most exciting thing this week is that fastai is now on pip so you can pip install fastai
00:00:14.760 | And so thank you to Prince and to Kerem for making that happen
00:00:21.360 | To USF students who had never published a pip package before and this is one of the harder ones to publish because it's got a lot
00:00:28.160 | of dependencies
00:00:30.160 | So it's you know probably still easiest just to do the conda env update thing
00:00:36.720 | But a couple of places that it would be handy instead to pip install fastai would be well obviously if you're working
00:00:42.880 | Outside of the the repo and the notebooks then this gives you access to fastai everywhere
00:00:49.680 | Also, I believe they submitted a pull request to Kaggle to try and get it added to the Kaggle kernels
00:00:55.960 | So hopefully you'll be able to use it on Kaggle kernels
00:00:58.080 | soon and
00:01:00.080 | Yeah, you can use it at your work or whatever else
00:01:04.160 | So that's that's exciting. I mean I'm not going to say it's like officially released yet. You know it's still
00:01:11.240 | very early obviously and we're still
00:01:13.880 | You're helping add documentation and all that kind of stuff, but it's great that that's now there
00:01:21.560 | a couple of cool kernels from USF students this week thought I'd highlight two that were both from the
00:01:29.240 | text normalization
00:01:31.960 | competition which was about
00:01:34.360 | Trying to take text which was
00:01:37.840 | Written out, you know, as standard English text; they also had one for Russian
00:01:45.920 | And you're trying to kind of identify things that could be like a first second third and say like that's a cardinal number
00:01:52.760 | Or if this is a phone number or whatever and I did a quick little bit of searching and I saw that
00:01:57.820 | There had been some attempts in academia to use
00:02:01.840 | deep learning for this, but they hadn't managed to make much progress and
00:02:06.280 | I actually noticed this
00:02:09.200 | kernel here which gets 0.992 on the leaderboard, which I think is like top 20.
00:02:14.520 | It's, yeah, kind of entirely heuristic, and it's a great example of
00:02:18.640 | feature engineering; in this case the whole thing is basically entirely feature engineering
00:02:23.780 | So it's basically looking through and using lots of regular expressions to figure out for each token
00:02:29.600 | What is it you know and I think she's done a great job here kind of laying it all out
00:02:34.240 | clearly as to what all the different pieces are and how they all fit together and
00:02:38.560 | She mentioned that she's maybe hoping to turn this into a library which I think would be great
00:02:43.480 | right you know you could use this to
00:02:45.480 | Grab a piece of text and pull out. What are all the pieces in it?
00:02:50.080 | It's the kind of thing that
00:02:53.400 | the natural language processing community hopes to be able to do
00:02:58.640 | without, like, lots of handwritten code like this, but for now
00:03:03.260 | it'll be interesting to see what the winners turn out to have done, but I haven't seen
00:03:09.200 | Machine learning being used really to do this particularly well
00:03:13.520 | Perhaps the best approach is the ones which combine this kind of feature engineering along with some machine learning
00:03:19.600 | But I think this is a great example of effective feature engineering, and this is another USF student
00:03:27.460 | who has done much the same thing and got a similar kind of score,
00:03:30.980 | but used her own different set of rules.
00:03:36.160 | Again, this would get you a good leaderboard position as well,
00:03:40.480 | so I thought that was interesting to see examples of some of our students entering a
00:03:45.800 | competition and getting kind of top 20 ish results by you know basically just handwritten heuristics, and this is where
00:03:54.740 | for example computer vision was
00:03:59.640 | Six years ago still, basically all the best approaches were a whole lot of carefully handwritten heuristics
00:04:06.640 | often combined with some simple machine learning and
00:04:10.680 | So I think over time
00:04:13.160 | You know the field is kind of
00:04:15.520 | Definitely trying to move towards
00:04:18.360 | Automating much more of this and actually interestingly
00:04:22.240 | very interestingly in the
00:04:25.600 | Safe driver prediction competition was just finished
00:04:29.200 | One of the Netflix prize winners won this competition and he
00:04:35.320 | Invented a new algorithm for dealing with structured data which basically doesn't require any feature engineering at all
00:04:44.040 | So he came first place using nothing but
00:04:49.760 | deep learning models and one gradient boosting machine
00:04:54.200 | And his his basic approach was very similar to what we've been learning in this class so far
00:05:00.180 | And what we'll be learning also tomorrow
00:05:02.180 | Which is using fully connected neural networks and one-hot encoding
00:05:07.720 | And specifically embedding which we'll learn about but he had a very clever technique
00:05:13.280 | Which was there was a lot of data in this competition which was unlabeled so in other words
00:05:18.000 | Where they didn't know whether that?
00:05:22.200 | Driver would go on to claim or not
00:05:24.200 | Or whatever so unlabeled data so when you've got some labeled and some unlabeled data
00:05:29.080 | We call that semi supervised learning and in real life
00:05:32.960 | Most learning is semi supervised learning like in real life normally you have some things that are labeled and some things that are unlabeled
00:05:40.100 | so this is kind of the most practically useful kind of learning and
00:05:44.160 | Then structured data is it's the most common kind of data that companies deal with day to day
00:05:49.460 | so the fact that this competition was a
00:05:52.000 | semi supervised
00:05:53.620 | Structured data competition made it incredibly practically useful
00:05:57.460 | And so his technique for winning this was to
00:06:01.780 | Do data augmentation which those of you doing the deep learning course have learned about which is basically the idea like if you had
00:06:09.760 | Pictures you would like flip them horizontally or rotate them a bit data augmentation means creating new data examples
00:06:16.360 | Which are kind of slightly?
00:06:19.240 | Different versions of ones you already have and the way he did it was for each row in the data. He would like
00:06:25.800 | at random replace 15% of the
00:06:29.600 | variables with a different row
00:06:32.880 | So each row now would represent like a mix of 85 percent of the original row
00:06:38.320 | But 15 percent randomly selected from a different row
00:06:41.120 | and so this was a way of like
00:06:44.760 | randomly changing the data a little bit and then he used something called an autoencoder which we will
00:06:51.080 | Probably won't study until part two of the deep learning course
00:06:54.960 | But the basic idea of an autoencoder is your dependent variable is the same as your independent variable
00:07:01.020 | so in other words you try to predict your input, which obviously is
00:07:07.520 | Trivial if you're allowed to, like, you know, the identity transform for example trivially predicts the input.
00:07:14.640 | But the trick with an autoencoder is to have fewer activations in
00:07:18.640 | at least one of your layers than your input, right? So if your input was like a hundred-dimensional vector, and you put it through a
00:07:27.560 | 100 by 10 matrix to create 10 activations and then have to recreate the original hundred-long vector from that,
00:07:36.600 | then you basically have to have compressed it effectively, and so it turns out that
00:07:42.000 | That kind of neural network
00:07:44.600 | You know it's forced to find
00:07:47.400 | Correlations and features and interesting relationships in the data even when it's not labeled so he used that
00:07:55.280 | So he didn't do any hand engineering; he just used an autoencoder.
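To make the idea concrete, here is a minimal sketch of the two tricks being described: swap-based augmentation for tabular rows and a bottlenecked autoencoder. This is not the winner's actual code; every name and size here is made up for illustration.

```python
import torch
import torch.nn as nn

def swap_augment(x, frac=0.15):
    """For each row, replace about `frac` of the columns with values taken
    from other randomly chosen rows of the same batch."""
    n, d = x.shape
    mask = torch.rand(n, d) < frac               # which cells to replace
    donor_rows = torch.randint(0, n, (n, d))     # which row to steal each value from
    donors = x[donor_rows, torch.arange(d)]      # column-aligned donor values
    return torch.where(mask, donors, x)

class AutoEncoder(nn.Module):
    """Predict the input from itself through a narrow bottleneck."""
    def __init__(self, n_in, n_hidden=10):
        super().__init__()
        self.enc = nn.Linear(n_in, n_hidden)     # compress
        self.dec = nn.Linear(n_hidden, n_in)     # reconstruct
    def forward(self, x):
        return self.dec(torch.relu(self.enc(x)))

x = torch.randn(64, 100)                         # a fake batch of structured data
model = AutoEncoder(100)
# train it to reconstruct the clean row from the 15%-swapped version
loss = nn.functional.mse_loss(model(swap_augment(x)), x)
```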
00:08:00.120 | So you know these are some interesting kind of directions that if you keep going with your machine learning studies
00:08:06.920 | you know, particularly if you
00:08:08.920 | do part two of the deep learning course next year,
00:08:12.260 | you'll learn about this.
00:08:16.840 | You can kind of see how
00:08:18.840 | Feature engineering is going away, and this was just
00:08:21.440 | Yeah, an hour ago, so this is very recent news indeed, but it's one of this is one of the most important
00:08:28.400 | breakthroughs I've seen in a long time
00:08:30.400 | Okay, so we were working through a
00:08:36.960 | Simple logistic regression trained with SGD for MNIST
00:08:44.880 | And here's the summary of where we got to we have nearly built a module
00:08:57.920 | A model, and a training loop from scratch, and we were going to kind of try and finish that, and after we finish that
00:09:04.640 | I'm then going to go through this entire notebook
00:09:06.640 | Backwards right so having gone like top to bottom, but I'm going to go back through
00:09:11.200 | bottom to top okay, so
00:09:13.920 | You know this was that little
00:09:17.600 | handwritten nn.Module class we created
00:09:22.280 | We defined our loss we defined our learning rate, and we defined our optimizer
00:09:26.840 | And this is the thing that we're going to try and write by hand in a moment
00:09:29.680 | so that stuff
00:09:32.440 | That and that we're still using from PyTorch, but that we've written ourselves, and this we've written ourselves
00:09:38.760 | So the basic idea was we're going to go through some number of epochs, so let's go through one epoch
00:09:43.460 | Right and we're going to keep track of how much for each mini batch. What was the loss so that we can report it at the end
00:09:51.840 | We're going to turn our training data loader into an iterator
00:09:55.140 | so that we can loop through every mini-batch, and so now we can go ahead and say: for t in range of
00:10:01.680 | the length of the data loader, and then we can call next to grab the next independent variables and the dependent variables
00:10:11.300 | From our data loader from that iterator, okay?
00:10:15.960 | So then remember we can then pass the X tensor into our model by calling the model as if it was a function
00:10:23.560 | But first of all we have to turn it into a variable
00:10:26.120 | Last week we were typing variable
00:10:28.880 | blah dot cuda to turn it into a variable; a shorthand for that is just the capital V now.
00:10:34.360 | It's a capital T for a tensor, capital V for a variable. That's just a shortcut in fastai
00:10:39.600 | Okay, so that returns our predictions
00:10:43.240 | And so the next thing we needed was to calculate our loss
00:10:45.560 | Because we can't calculate the derivatives of the loss if you haven't calculated the loss
00:10:50.940 | So the loss takes the predictions and the actuals
00:10:54.240 | Okay, so the actuals again are the the Y tensor and again. We have to turn that into a variable
00:10:59.760 | Now can anybody remind me what a variable is and why we would want to use a variable here?
00:11:11.320 | I think once you turn it into a variable, then it tracks it, so then you can do backward on that, so you can get it.
00:11:16.680 | Sorry, when you turn it into a variable, it...?
00:11:18.400 | It can track, like, its process, like, you know, as the functions are nested within each other,
00:11:23.440 | you can track it, and then when we do backward on it, it backpropagates and does the... yeah, right, so
00:11:28.080 | Right so a variable
00:11:33.120 | keeps
00:11:35.280 | track of all of the steps
00:11:37.480 | to get computed and
00:11:40.720 | So there's actually a fantastic tutorial on the Pytorch website
00:11:45.480 | So on the Pytorch website there's a tutorial section
00:11:56.680 | And there's a tutorial there about autograd. Autograd is the name of the automatic
00:12:01.680 | differentiation package that comes with PyTorch, and it's an implementation of automatic differentiation, and so the Variable class
00:12:10.400 | is really the key
00:12:12.400 | the key class here, because that's the thing that turns a tensor into something where we can keep track of its gradients
00:12:19.240 | So basically here they show how to create a variable do an operation to a variable
00:12:25.780 | And then you can go back and actually look at the grad function
00:12:30.360 | which is the function that it's keeping track of, basically, to calculate the gradient. Right, so as we do
00:12:38.160 | more and more operations to this variable, and to the variables calculated from that variable, it keeps track of all of it,
00:12:45.320 | So later on we can go dot backward and then print dot grad and find out the gradient
00:12:52.560 | Right and so you notice we never defined the gradient. We just defined it as being x plus 2
00:12:58.520 | Squared times 3 whatever and it can calculate the gradient
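A minimal autograd example in the spirit of that tutorial (in the PyTorch version used in the course you would wrap tensors in Variable; in current PyTorch, requires_grad on a plain tensor plays the same role):

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
y = x + 2                # y remembers it was computed from x
z = (y * y * 3).mean()

z.backward()             # chain rule, applied automatically
print(y.grad_fn)         # the function autograd recorded for y
print(x.grad)            # dz/dx, which we never wrote down ourselves
```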
00:13:04.640 | Okay, so that's why we need to turn that into a variable so L is now a
00:13:16.800 | Variable containing the loss so it contains a single number for this mini batch
00:13:22.600 | Which is the loss for this mini batch, but it's not just a number. It's a it's a number as a variable
00:13:29.160 | So it's a number that knows how it was calculated all right
00:13:32.920 | so we're going to append that loss to our array just so we can
00:13:37.040 | Get the average of it later basically
00:13:39.920 | And now we're going to calculate the gradient so L dot backward is the thing that says
00:13:46.840 | Calculate the gradient so remember when we call the the network. It's actually calling our forward function
00:13:55.560 | So that's like, go through it forwards, and then backward is like using the chain rule to calculate the gradients
00:14:02.400 | Backwards okay, and then this is the thing we're about to write which is update the weights based on the gradients and the learning rate
00:14:11.840 | Zero grad, we'll explain that when we write this out by hand
00:14:16.640 | and so then at the end we can turn our validation data loader into an iterator and
00:14:22.160 | We can then go through its length
00:14:25.400 | grabbing each
00:14:28.360 | x and y out of that and
00:14:30.960 | asking for the score
00:14:32.960 | which we defined up here to be equal to:
00:14:35.840 | which thing did you predict, which thing was actual, and so check whether they're equal, right? And then the
00:14:44.480 | Main of that is going to be our accuracy, okay?
00:14:48.960 | Could you pass that over to Chenxi?
00:14:51.760 | What's the advantage that you found of converting it into an iterator rather than, like, using a normal
00:15:00.000 | Python loop or...
00:15:02.000 | We're using a normal Python loop
00:15:04.120 | So it's still and this is a normal Python loop so the question really is like
00:15:08.640 | Compared to what right so like?
00:15:12.560 | The alternative perhaps you're thinking it would be like we could choose like a something like a list with an indexer
00:15:19.040 | Okay, so you know, the problem there is that we want,
00:15:23.560 | well, a few things. I mean, one key one is that each time we grab a new mini-batch we want it to be random,
00:15:29.600 | we want a different shuffled thing. So this,
00:15:33.120 | You can actually kind of iterate from
00:15:35.960 | Forever you know you can loop through it as many times as you like so
00:15:40.120 | There's this kind of idea. It's called different things in different languages
00:15:44.160 | But a lot of languages are called like stream processing
00:15:47.480 | And it's this basic idea that rather than saying I want the third thing or the ninth thing
00:15:51.720 | It's just like I want the next thing right it's great for like network programming. It's like grab the next thing from the network
00:15:58.320 | It's great for
00:16:00.320 | UI programming it's like grab the next event where somebody clicked a button it also turns out to be great for
00:16:06.360 | This kind of numeric programming. It's like I just want the next batch of data
00:16:10.340 | It means that the data like can be kind of arbitrarily long as we're describing one piece at a time
00:16:18.540 | Yeah, so, you know, I mean, also I guess the short answer is because it's how PyTorch works.
00:16:27.440 | PyTorch's data loaders are designed to be
00:16:30.460 | called in this way. And then, so, Python has this concept of a generator,
00:16:35.680 | Which is like an and and?
00:16:38.480 | Different type of generator. I wonder if this is gonna be a snake generator or a computer generator, okay?
00:16:44.480 | A generator is a way that you can create a function that as it says behaves like an iterator
00:16:50.920 | So like Python has recognized that this stream processing approach to programming is like super handy and helpful and
00:16:57.840 | Supports it everywhere so basically anywhere that you use a for in loop anywhere you use a list comprehension
00:17:05.760 | those things can always be generators or iterators. So by programming this way we just get a lot of
00:17:11.880 | flexibility, I guess. Does that sound about right, Terrence? You're the programming language expert. Did you
00:17:19.680 | Want to grab that box so we can hear
00:17:21.680 | So Terrence actually does programming languages for a living so we should ask him
00:17:26.440 | Yeah, I mean the short answer is what you said
00:17:29.360 | You might say something about space
00:17:32.400 | But in this case all that data has to be in memory anyway because we've got... no, it doesn't have to be in memory.
00:17:39.000 | So in fact most of the time we could pull a mini batch from something in fact most of the time with pytorch
00:17:44.160 | The mini batch will be read from like separate images spread over your disk on demand
00:17:50.000 | So most of the time it's not in memory
00:17:51.800 | But in general you want to keep as little in memory as possible at a time
00:17:56.440 | And so the idea of stream processing also is great because you can do compositions you can
00:18:00.640 | Pipe the data to a different machine you can yeah
00:18:03.400 | Yeah, the composition is great
00:18:05.000 | You can grab the next thing from here and then send it off to the next stream which can then grab it and do something
00:18:09.400 | Else which you guys all recognize of course in the command-line pipes and redirection
00:18:14.120 | Yes, okay, thanks Terrence
00:18:17.200 | The benefit of working with people that actually know what they're talking about
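As a rough illustration of that "just give me the next thing" style, a Python generator can stream shuffled mini-batches indefinitely without materialising them all at once (the sizes here are arbitrary):

```python
import random

def batches(n_items, bs=64):
    """Yield a fresh random mini-batch of indices, forever."""
    while True:
        yield random.sample(range(n_items), bs)

it = batches(60000)     # calling it gives us an iterator
first = next(it)        # grab the next mini-batch of indices
second = next(it)       # ...and the next, whenever we're ready
```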
00:18:21.840 | All right, so let's now take that and get rid of the optimizer
00:18:28.320 | Okay, so the only thing that we're going to be left with is the negative log likelihood loss function
00:18:33.920 | which we could also replace; actually we have an
00:18:38.560 | implementation of that from scratch that Yannet wrote in
00:18:41.160 | the notebooks, so it's only one line of code. As we learned earlier, you can do it with a single if statement, okay?
00:18:48.340 | So I don't know why I was so lazy as to include this
00:18:51.840 | So what we're going to do is we're going to again grab this module that we've written ourselves the logistic regression module
00:18:58.880 | We're going to have one epoch again. We're going to loop through each thing in our iterator again
00:19:05.020 | We're going to grab our independent variable for the mini-batch again
00:19:09.280 | Pass it into our network again
00:19:11.760 | Calculate the loss, so this is all the same as before
00:19:14.360 | But now we're going to get rid of this optimizer dot step
00:19:18.000 | And we're going to do it by hand
00:19:21.040 | so the basic trick is
00:19:23.480 | As I mentioned we're not going to do the calculus by hand so we'll call L dot backward to calculate the gradients automatically
00:19:30.940 | And that's going to fill in the gradients on our weight matrix. So do you remember when we created our
00:19:36.320 | Let's go back and
00:19:38.760 | Look at the code for
00:19:40.760 | Here's that module we built so the weight matrix for the for the
00:19:48.420 | Linear layer weights we called l1w and for the bias we called l1b right so they were the attributes we created
00:20:00.360 | I've just put them into things called W and B just to save some typing basically so W is our weights
00:20:08.000 | B is our biases and
00:20:10.400 | So the weights remember the weights are a variable and to get the tensor out of the variable
00:20:16.800 | We have to use dot data right so we want to update the actual tensor that's in this variable, so we say weights dot data
00:20:22.920 | Minus equals so we want to go in the opposite direction to the gradient the gradient tells us which way is up
00:20:28.980 | We want to go down
00:20:30.980 | Whatever is currently in
00:20:34.040 | the gradients
00:20:36.400 | times the learning rate so that is the formula for
00:20:40.320 | gradient descent
00:20:43.200 | All right, so as you can see, it's as easy a thing as you can possibly imagine.
00:20:48.680 | It's literally: update the weights to be equal to whatever they are now, minus the gradients
00:20:55.600 | times the learning rate and
00:20:57.640 | Do the same thing?
00:20:59.640 | for the bias
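Putting those pieces together, the hand-written loop being described looks roughly like this (a sketch: net, loss_fn, w, b, lr and the batch iterable are assumed names in the spirit of the notebook, and V is the fastai shorthand for wrapping a tensor in a variable):

```python
# Sketch of the hand-written SGD loop (names assumed, not the exact notebook code)
for epoch in range(1):
    for xt, yt in train_batches:      # assumed iterable of (x, y) mini-batch tensors
        y_pred = net(V(xt))           # forward pass through our module
        l = loss_fn(y_pred, V(yt))    # loss for this mini-batch, as a variable

        l.backward()                  # fill in .grad for w and b via the chain rule

        # gradient descent: step downhill by learning rate * gradient
        w.data -= lr * w.grad.data
        b.data -= lr * b.grad.data

        # reset the gradients so the next mini-batch starts from zero (see below)
        w.grad.data.zero_()
        b.grad.data.zero_()
```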
00:21:01.760 | So anybody have any questions about that step in terms of like why we do it or how did you have a question?
00:21:05.960 | Do you want to grab that?
00:21:07.960 | So at that step, when we do the next of dl,
00:21:11.520 | The next(dl)? Yes, yes.
00:21:14.720 | So when it is the end of the loop, how do you grab the next element?
00:21:18.920 | So this is going through each
00:21:23.880 | Each index in range of length, so this is going 0 1 2 3 at the end of this loop
00:21:29.960 | It's going to print out the mean of the validation set go back to the start of the epoch at which point
00:21:36.880 | It's going to recreate a new a new iterator
00:21:39.680 | Okay, so basically behind the scenes in Python, when you call iter
00:21:44.440 | on this, it basically tells it to, like, reset its state to create a new iterator
00:21:52.240 | And if you're interested in how that works
00:21:58.680 | the code is all, you know, available for you to look at. So we could look at, like, md.trn_dl;
00:22:07.680 | it's a fastai.dataset.ModelDataLoader, so we could, like, take a look at the code of that
00:22:14.940 | So we could take a look at the code of that
00:22:18.560 | And see exactly how it's being built right and so you can see here that here's the next function
00:22:24.760 | right which basically is
00:22:27.440 | keeping track of how many times it's been through, in self.i,
00:22:31.240 | and here's the __iter__ function, which is the thing that gets called when you create a new iterator,
00:22:37.200 | And you can see it's basically passing it off to something else
00:22:39.760 | which is of type DataLoader, and then you can check out DataLoader if you're interested to see how that's implemented
00:22:44.680 | as well
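The shape of that iterator protocol, stripped of all the real work, looks something like this (a toy illustration, not the fastai implementation):

```python
class ToyLoader:
    def __init__(self, n_batches):
        self.n = n_batches
    def __iter__(self):            # called by iter(dl): resets state
        self.i = 0
        return self
    def __next__(self):            # called by next(it): hands back one batch
        if self.i >= self.n:
            raise StopIteration
        self.i += 1
        return f"batch {self.i}"

dl = ToyLoader(3)
it = iter(dl)
print(next(it), next(it))          # batch 1 batch 2
```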
00:22:47.240 | So the data loader that we wrote
00:22:49.240 | Basically uses multi-threading to allow it to have multiple of these going on at the same time
00:22:55.120 | It's actually great, it's really simple, it's only about a screenful of code.
00:23:00.340 | So if you're interested in simple multi-threaded programming. It's a good thing to look at
00:23:03.900 | Okay now um oh
00:23:10.160 | Why have you wrapped this in a for epoch in range one since that'll only run once?
00:23:16.240 | Because in real life we would normally be running multiple epochs
00:23:20.520 | So like in this case because it's a linear model it actually basically trains to
00:23:26.880 | As good as it's going to get in one epoch so if I type three here
00:23:31.260 | it actually
00:23:34.200 | It actually won't really improve after the first epoch much at all as you can see right
00:23:41.280 | But when we go back up to the top we're going to look at some slightly deeper and more interesting
00:23:46.480 | Versions which will take more epochs, so you know if I was turning this into a into a function
00:23:52.380 | You know I'd be going like you know death train model
00:23:56.800 | And one of the things you would pass in is like number of epochs
00:24:00.400 | kind of
00:24:03.840 | Okay great
00:24:07.040 | So one thing to remember is that
00:24:13.720 | When you're you know creating these neural network layers
00:24:17.820 | and remember like
00:24:20.880 | This, as far as PyTorch is concerned, is just an nn.Module.
00:24:25.000 | We could be using it as a layer, we could be using it as a function,
00:24:28.600 | we could be using it as a neural net; PyTorch doesn't think of those as different things, right?
00:24:33.760 | So this could be a layer inside some other network, right?
00:24:38.080 | So how do gradients work so if you've got a layer which remember is just a bunch of we can think of it basically
00:24:44.160 | as its activations right or some activations that get computed through some
00:24:48.880 | other non-linear activation function or through some linear function and
00:24:53.400 | From that layer
00:24:57.840 | We it's very likely that we're then like let's say putting it through a matrix product right
00:25:03.120 | to create some new layer
00:25:08.560 | So each one of these so if we were to grab like
00:25:11.160 | One of these activations right is actually going to be
00:25:17.240 | Used to calculate
00:25:20.520 | every one of these outputs
00:25:22.960 | Right and so if you want to calculate the
00:25:27.000 | The derivative you have to know how this weight matrix
00:25:34.880 | Impacts that output and that output and that output and that output
00:25:38.960 | Right and then you have to add all of those together to find like the total impact of this
00:25:44.600 | you know across all of its outputs and
00:25:47.560 | So that's why in PyTorch
00:25:51.800 | You have to tell it when to set the gradients to zero
00:25:56.680 | Right because the idea is that you know you could be like having lots of different loss functions or lots of different outputs in your next
00:26:02.560 | Activation set of activations or whatever all adding up
00:26:06.600 | Increasing or decreasing your gradients right so you basically have to say okay. This is a new
00:26:12.800 | calculation
00:26:15.720 | Reset okay, so here is where we do that right so before we do L dot backward we say
00:26:22.320 | Reset okay, so let's take our weights
00:26:25.320 | Let's take the gradients. Let's take the tensor that they point to and
00:26:31.000 | then zero underscore. Does anybody remember from last week what underscore does as a suffix in PyTorch?
00:26:38.040 | Yeah, I
00:26:42.680 | forgot the word, but basically it changes it in place. Right, the word is 'in place', yeah
00:26:50.760 | Exactly so it sounds like a minor technicality
00:26:55.000 | But it's super useful to remember every function pretty much has an underscore version suffix
00:27:00.680 | Which does it in place?
00:27:02.680 | Yeah, so normally zero returns a
00:27:05.840 | Tensor of zeros of a particular size so zero underscore means replace the contents of this with a bunch of zeros, okay?
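A tiny demonstration of why the reset matters: gradients accumulate across backward calls until you zero them in place.

```python
import torch

w = torch.ones(3, requires_grad=True)

(w * 2).sum().backward()
print(w.grad)            # tensor([2., 2., 2.])

(w * 2).sum().backward()
print(w.grad)            # tensor([4., 4., 4.]) -- accumulated, not replaced

w.grad.data.zero_()      # the in-place (underscore) version: reset to zeros
print(w.grad)            # tensor([0., 0., 0.])
```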
00:27:14.720 | All right, so that's
00:27:18.340 | That's it right, so that's like SGD from scratch
00:27:24.240 | And if I get rid of my menu bar we can officially say it fits within a screen, okay?
00:27:31.360 | Of course we haven't got our definition of logistic regression here. That's another half a screen, but basically there's there's not much to it
00:27:37.240 | Yes, fish
00:27:39.160 | So later on, if we have to do this more with the gradient, is it because you might find, like, a wrong
00:27:44.920 | minimum, a local minimum, that way, so you have to kick it out,
00:27:47.560 | and that's what you have to do multiple times when the surface gets more... Why do you need multiple epochs,
00:27:51.680 | is that your question? Well, I mean, a simple way to answer that would be: let's say our learning rate was tiny.
00:27:56.520 | right
00:28:01.440 | It's just not going to get very far
00:28:04.920 | Right there's nothing that says going through one epoch is enough to get you all the way there
00:28:09.800 | So then you'd be like okay. Well, let's increase our learning rate, and it's like yeah, sure
00:28:13.960 | We'll increase our learning rate, but who's to say that the highest learning rate that learns stably is is enough to
00:28:21.120 | Learn this as well as it can be learned and for most data sets for most architectures one epoch is
00:28:27.840 | Very rarely enough to get you
00:28:29.840 | To the best result you can get to
00:28:32.560 | You know linear models are just
00:28:36.680 | They're very nicely behaved. You know so you can often use higher learning rates and learn more quickly also they
00:28:43.520 | They don't, you can't, like, generally get as good an accuracy,
00:28:48.380 | so there's not as far to take them either. So yeah, doing one epoch is going to be the rarity. All right,
00:28:54.820 | So let's go backwards
00:28:56.680 | So going backwards. We're basically going to say all right. Let's not write
00:29:01.200 | Those two lines again and again again. Let's not write those three lines again and again and again
00:29:06.920 | Let's have somebody do that for us, right?
00:29:09.960 | So that's like that's the only difference between that version and this version is rather than saying dot zero ourselves
00:29:16.800 | Rather than saying minus gradient times lr ourselves,
00:29:21.240 | These are wrapped up for us, okay
00:29:26.080 | There is another wrinkle here, which is
00:29:29.160 | this approach to updating
00:29:32.280 | The the weights is actually pretty inefficient. It doesn't take advantage of
00:29:38.660 | momentum
00:29:40.720 | and curvature and so
00:29:43.520 | In the DL course we learn about how to do momentum from scratch as well, okay, so
00:29:51.000 | if we
00:29:53.400 | Actually, just use plain old SGD
00:29:56.400 | Then you'll see that this
00:30:01.680 | Learns much slower so now that I've typed just plain old SGD here. This is now literally doing exactly the same thing
00:30:07.680 | As our slow version so I have to increase the learning rate
00:30:11.880 | Okay there we go so this this is now the same as the the one we wrote by hand
00:30:21.800 | So then all right
00:30:23.800 | Let's do a little bit more stuff automatically
00:30:29.020 | Let's not you know given that every time we train something we have to loop through epoch
00:30:37.640 | loop through each batch, do forward, get the loss, zero the gradients, do backward, do a step of the optimizer,
00:30:45.960 | Let's put all that in a function
00:30:48.320 | Okay, and that function is called fit
00:30:51.000 | All right there it is okay, so let's take a look at fit
00:31:01.980 | Fit go through each epoch go through each batch
00:31:11.080 | Do one step?
00:31:14.440 | Keep track of the loss and at the end calculate the validation all right and so then
00:31:23.200 | So if you're interested in looking at this this stuff's all inside fastai.model
00:31:42.760 | So here is step right?
00:31:46.040 | Zero the gradients calculate the loss remember PyTorch tends to call it criterion rather than loss
00:31:53.160 | Right do backward
00:31:55.720 | And then there's something else we haven't learned here, but we do learn the deep learning course
00:31:59.520 | which is gradient clipping, so you can ignore that
00:32:01.720 | All right, so you can see now like all the stuff that we've learnt when you look inside the actual frameworks
00:32:07.160 | That's the code you see okay?
00:32:09.680 | So that's what fit does.
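A hedged sketch of what fit boils down to; the real fastai.model.fit handles metrics, callbacks and gradient clipping on top, but the core per-mini-batch step is essentially this:

```python
def step(model, xs, ys, crit, opt):
    opt.zero_grad()              # reset accumulated gradients
    loss = crit(model(xs), ys)   # forward pass + loss ("criterion")
    loss.backward()              # backprop via the chain rule
    opt.step()                   # let the optimizer update the weights
    return loss.item()

def fit(model, batches, crit, opt, epochs=1):
    for epoch in range(epochs):
        losses = [step(model, xs, ys, crit, opt) for xs, ys in batches]
        print(epoch, sum(losses) / len(losses))
```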
00:32:15.000 | So then the next step would be like okay. Well this idea of like having some
00:32:19.640 | weights and a bias and doing a matrix product and addition,
00:32:25.300 | Let's put that in a function
00:32:28.320 | This thing of doing the log softmax
00:32:30.520 | Let's put that in a function and then the very idea of like first doing this and then doing that
00:32:36.800 | This idea of like chaining functions together. Let's put that into a function and
00:32:41.720 | that finally gets us to
00:32:46.720 | Okay, so sequential simply means do this function take the result send it to this function etc, right?
00:32:55.080 | And linear means create the weight matrix create the biases
00:33:02.680 | So that's that's it right
00:33:05.400 | So we can then you know as we started to talk about like turn this into a deep neural network
00:33:13.800 | by saying you know rather than sending this straight off into
00:33:19.100 | 10 activations, let's let's put it into say 100 activations. We could pick whatever one number we like
00:33:26.420 | Put it through a relu to make it nonlinear
00:33:30.060 | Put it through another linear layer another relu and then our final output with our final activation function right and so this is now
00:33:39.260 | a deep network
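As PyTorch sees it, that deeper version is just a chain of functions; the layer sizes below follow the ones mentioned in the lecture (784 into 100, 100 into 100, 100 into 10):

```python
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(28 * 28, 100),  # weight matrix + bias
    nn.ReLU(),                # non-linearity between the linear layers
    nn.Linear(100, 100),
    nn.ReLU(),
    nn.Linear(100, 10),
    nn.LogSoftmax(dim=-1),    # final activation for the 10-class output
)
```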
00:33:43.940 | We could fit that and
00:33:50.940 | This time now because it's like deeper
00:33:54.940 | I'm actually going to run a few more epochs right and you can see the accuracy
00:34:00.740 | Increasing right so if you try and increase the learning rate here, it's like zero point one
00:34:06.420 | further
00:34:08.860 | it actually
00:34:10.860 | Starts to become unstable
00:34:12.740 | Now I'll show you a trick
00:34:14.740 | This is called learning rate annealing and the trick is this
00:34:18.060 | when you're
00:34:20.860 | Trying to fit to a function right you've been taking a few steps
00:34:25.740 | Step step step as you get close to the middle like get close to the bottom
00:34:32.900 | Your steps probably want to become smaller right otherwise what tends to happen is you start finding you're doing this
00:34:40.100 | All right, and so you can actually see it here, right: they've got 93, 94 and a bit, 94.6,
00:34:48.100 | 94.8, like it's kind of starting to flatten out
00:34:50.820 | Right now that could be because it's kind of done as well as it can
00:34:55.420 | Or it could be that it's going to going backwards and forwards
00:34:58.620 | So what is a good idea is is later on in training is to decrease your learning rate and to take smaller steps
00:35:07.100 | Okay, that's called learning rate annealing. So there's a function in fastai called set_lrs
00:35:12.780 | you can pass in your optimizer and your new learning rate and
00:35:16.540 | You know see if that helps right and very often it does
00:35:22.780 | About about an order of magnitude
00:35:27.780 | In the deep learning course we learn a much much better technique than this to do this all automatically and about a more granular
00:35:34.460 | Level, but if you're doing it by hand, you know like an order of magnitude at a time is what?
00:35:39.300 | people generally do
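If you are doing it by hand in plain PyTorch rather than with the fastai helper mentioned above, dropping the learning rate is just a matter of rewriting each param group (a sketch; opt is assumed to be the optimizer created earlier):

```python
def set_lr(opt, lr):
    # every PyTorch optimizer keeps its settings in param_groups
    for pg in opt.param_groups:
        pg['lr'] = lr

# e.g. drop by an order of magnitude once the metric flattens out
set_lr(opt, 0.01)
```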
00:35:42.060 | So you'll see people in papers talk about learning rate schedules
00:35:46.780 | This is like a learning rate schedule. So this schedule, just a moment Erica,
00:35:51.140 | I'll just come to Ernest first, has got us to 97, right, and I tried
00:35:55.720 | Kind of going further and we don't seem to be able to get much better than that
00:35:59.860 | So yeah, so here we've got something where we can get 97 percent
00:36:04.380 | Accuracy. Yes, Erica. So it seems like you change the learning rate
00:36:08.620 | to something very small
00:36:11.820 | Ten times smaller than we started with, so we had 0.1, now it's 0.01. Yeah.
00:36:15.780 | But that makes the whole model train really slow
00:36:19.540 | So I was wondering if you can make it so that it changes dynamically as it approaches
00:36:24.180 | Closer to the minima. Yeah, pretty much. Yeah, so so that's some of the stuff we learn in the deep learning course
00:36:29.820 | There's these more advanced approaches. Yeah
00:36:32.140 | the fish
00:36:34.140 | So how is it different from using the Adam optimizer or something? That's the kind of stuff we can do.
00:36:39.780 | I mean, you still need annealing. As I say, we do this kind of stuff in the deep learning course,
00:36:43.780 | So for now, we're just going to stick to standard SGD. I
00:36:46.540 | I had a question about the data loading. Yeah, I know it's a fastai function,
00:36:53.580 | but could you go into a little bit of detail of how it's creating batches, how it's loading data, and how it's making those decisions?
00:37:03.460 | Would be good to ask that on Monday night so we can talk about in detail in the deep learning class
00:37:08.220 | But let's let's do the quick version here
00:37:10.980 | so basically
00:37:13.780 | There's a really nice design in pytorch
00:37:16.300 | Where they basically say let's let's create a thing called a data set
00:37:21.140 | Right and a data set is basically something that looks like a list. It has a length
00:37:28.740 | right and so that's like how many images are in the data set and it has the ability to
00:37:35.780 | Index into it like a list right so if you had like D equals data set
00:37:41.820 | You can do len(d), and you can do d of some index. Right, that's basically all a dataset
00:37:47.860 | is as far as PyTorch is concerned. And so you start with a dataset, so it's like, okay,
00:37:53.220 | d[3] gives you the third image, you know, or whatever.
00:37:58.140 | And so then the idea is that you can take a data set and you can pass that into a constructor for a data loader
00:38:12.020 | That gives you something which is now iterable, right, so you can now say iter of
00:38:17.220 | dl, and that's something that you can call next on, and
00:38:23.660 | What that now is going to do is if when you do this you can choose to have shuffle on or shuffle off shuffle on
00:38:31.060 | Means give me random mini-batch shuffle off means go through it sequentially
00:38:35.380 | And so
00:38:38.980 | What the data loader does now when you say next is, basically, assuming you said shuffle equals true, it's going to grab,
00:38:45.220 | you know, if you've got a batch size of 64, 64 random integers between 0 and the length, and call this
00:38:53.220 | 64 times to get 64 different items and jam them together
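A minimal illustration of that Dataset/DataLoader contract (toy data, not the MNIST code from the notebook):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ArrayDataset(Dataset):
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __len__(self):               # "how many items are there?"
        return len(self.x)
    def __getitem__(self, i):        # "give me item i"
        return self.x[i], self.y[i]

d = ArrayDataset(torch.randn(1000, 784), torch.randint(0, 10, (1000,)))
dl = DataLoader(d, batch_size=64, shuffle=True)  # shuffle=True -> random mini-batches
xb, yb = next(iter(dl))              # one mini-batch: shapes (64, 784) and (64,)
```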
00:38:58.320 | So fast AI uses the exact same
00:39:02.940 | terminology and the exact same API
00:39:06.540 | We just do some of the details differently so specifically particularly with computer vision
00:39:13.540 | You often want to do a lot of pre-processing,
00:39:15.940 | so much pre-processing: data augmentation like flipping, changing the colors a little bit, rotating. Those turn out to be really
00:39:23.340 | computationally expensive; even just reading the JPEGs turns out to be computationally expensive.
00:39:27.820 | So PyTorch uses an approach where it fires off multiple processes to do that in parallel,
00:39:34.020 | whereas the fastai library instead does something called multithreading, which can be a much faster way of doing it.
00:39:41.460 | Yes, Yannet?
00:39:46.140 | So an epoch, is it really an epoch, in the sense that all of the elements... so it's a shuffle at the beginning of the
00:39:53.900 | epoch, something like that? Yeah. Yeah, I mean, not all libraries work the same way; some do sampling with replacement,
00:39:59.580 | Some don't
00:40:01.660 | We, actually the fastai library hands the shuffling off to the actual PyTorch version,
00:40:09.260 | and I believe the PyTorch version, yeah, actually shuffles, and an epoch covers everything once, I believe.
00:40:15.220 | Okay, now the thing is when you start to get these bigger networks
00:40:25.100 | Potentially you're getting quite a few parameters
00:40:29.900 | right, so
00:40:32.860 | I won't ask you to calculate how many parameters there are, but let's remember here. We've got
00:40:37.880 | 28 by 28 input into 100 output and then 100 into 100 and then 100 into 10
00:40:44.740 | All right, and then for each of those who got weights and biases
00:40:47.900 | So we can actually
00:40:51.060 | Do this
00:40:54.180 | net.parameters()
00:40:56.180 | returns a list where each element of the list is a matrix, actually a tensor, of
00:41:02.100 | the parameters, not just one per layer;
00:41:05.860 | if it's a layer with both weights and biases, that would be two parameter tensors, right?
00:41:09.820 | So basically it returns us a list of all of the tensors containing the parameters.
00:41:14.980 | numel in PyTorch tells you how big that is, right? So if I run this,
00:41:21.980 | Here is the number of
00:41:25.780 | parameters in each layer
00:41:27.900 | So I've got 784 inputs and the first layer has 100 outputs,
00:41:32.900 | so therefore the first weight matrix is of size 78,400.
00:41:37.300 | Okay, and the first bias vector is of size 100, and then the next one is 100 by 100,
00:41:42.900 | okay, and there's 100, and then the next one is 100 by 10, and then there's my bias, okay?
00:41:48.820 | So there's the number of elements in each layer, and if I add them all up. It's nearly a hundred thousand
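Counting them the same way in code (net is assumed to be the 784-100-100-10 network above):

```python
# one entry per weight or bias tensor returned by net.parameters()
sizes = [p.numel() for p in net.parameters()]
print(sizes)       # [78400, 100, 10000, 100, 1000, 10] for the 784->100->100->10 net
print(sum(sizes))  # 89610 -- "nearly a hundred thousand" parameters
```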
00:41:54.420 | Okay, and so I'm possibly at risk of overfitting. Yeah, all right, so
00:42:01.620 | We might want to think about using regularization
00:42:05.020 | So a really simple common approach to regularization in all of machine learning
00:42:10.860 | is something called L2
00:42:15.940 | regularization, and
00:42:19.980 | It's super important super handy. You can use it with just about anything right and the basic idea
00:42:25.620 | Anyway so
00:42:31.540 | L2 regularization the basic idea is this normally we'd say our loss is
00:42:35.700 | Equal to let's just do RMSE to keep things kind of simple
00:42:39.660 | It's equal to our predictions minus our actuals
00:42:43.180 | You know squared, and then we sum them up take the average
00:42:47.740 | Take the square root, okay?
00:42:52.620 | What if we then want to say you know what like if I've got lots and lots of parameters?
00:42:58.660 | Don't use them unless they're really helping enough right like if you've got a million parameters, and you only really needed 10
00:43:05.940 | Parameters to be useful just use 10 right so how could we like tell the loss function to do that?
00:43:12.820 | And so basically what we want to say is hey if a parameter is zero
00:43:17.220 | That's no problem. It's like it doesn't exist at all so let's penalize a parameter
00:43:22.980 | for not being zero
00:43:26.740 | Right so what would be a way we could measure that?
00:43:29.940 | How can we like calculate how unzero our parameters are
00:43:37.140 | Can you pass that to Chenxi?
00:43:42.940 | You calculate the average of all the parameters? That's my first guess. It can't quite be the average...
00:43:53.780 | Close. Yes, Taylor? Yeah. Yes, you figured it out. Okay,
00:43:56.740 | so I think if we like
00:43:59.900 | Assuming all of our data has been normalized standardized however you want to call it
00:44:03.900 | we want to check that they're, like, significantly different from zero. Right, would that be, not the data, the parameters
00:44:09.460 | is rather what would be significant? And the parameters don't have to be normalized or anything, they're calculated, right?
00:44:14.780 | Yeah, so significantly different from zero, right, as well;
00:44:17.340 | I just meant assuming that the data has been normalized so that we can compare them. Oh, yeah, got it. Yeah, right.
00:44:23.820 | And then those that are not significantly different from zero we can probably just drop
00:44:28.460 | And I think Chenxi's going to tell us how to do that. You just figured it out, right?
00:44:31.380 | The mean of the absolute values could do that. That would be called L1, which is great. So L1
00:44:38.460 | would be the
00:44:41.020 | absolute
00:44:43.180 | value of the weights, averaged. L2 is actually the sum
00:44:51.060 | Yeah, yeah exactly so we just take this we can just we don't even have to square root
00:44:55.340 | So we just take the squares of the weights themselves, and then like we want to be able to say like okay
00:45:02.140 | how much do we want to penalize
00:45:06.580 | Not being zero right because if we actually don't have that many parameters
00:45:10.740 | We don't want to regularize much at all if we've got heaps. We do want to regularize a lot right so then we put a
00:45:18.580 | Parameter yeah, right except I have a rule in my classes. Which is never to use Greek letters, so normally people use alpha
00:45:24.860 | I'm going to use a okay, so
00:45:27.540 | So this is some number which you often see being around kind of 1e-6 to 1e-4
00:45:37.380 | ish all right
00:45:43.660 | We actually don't care about the loss
00:45:48.020 | When you think about it, we don't actually care about the loss other than like maybe to print it out
00:45:51.800 | All we actually care about is the gradient of the loss
00:45:54.220 | Okay, so the gradient of
00:46:00.860 | a times w squared, right, is 2aw.
00:46:07.220 | Right so there are two ways to do this we can actually modify our loss function to add in this square
00:46:16.420 | penalty or
00:46:18.340 | We could modify that thing where we said weights equals weights minus
00:46:23.760 | Gradient times learning rate to subtract that
00:46:27.420 | as well
00:46:29.740 | Right back so to add that as well
00:46:34.780 | These are roughly these are kind of basically equivalent, but they have different names. This is called L2 regularization
00:46:39.900 | Right this is called weight decay
00:46:44.700 | So in the neural network literature
00:46:46.700 | You know that version kind of
00:46:49.700 | Was the how it was first posed in the neural network literature whereas this other version is kind of
00:46:56.140 | How it was posed in the statistics literature, and yeah, you know they're they're equivalent
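In code, the two equivalent formulations look roughly like this (a sketch; net, loss_fn, x, y and lr are assumed names, and a is the penalty multiplier from above):

```python
a = 1e-4  # penalty multiplier ("a" above), typically somewhere around 1e-6 to 1e-4

# 1. L2 regularization: add the penalty to the loss and let autograd handle it
loss = loss_fn(net(x), y) + a * sum((p ** 2).sum() for p in net.parameters())
loss.backward()  # every parameter's gradient now includes an extra 2*a*p term

# 2. Weight decay: leave the loss alone and fold the same term into the update
for p in net.parameters():
    p.data -= lr * (p.grad.data + 2 * a * p.data)
```

For plain SGD the two give the same step; the difference only shows up once momentum or Adam are involved, which is what the discussion below is about.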
00:47:03.060 | As we talked about in the deep learning class it turns out
00:47:06.380 | They're not exactly equivalent because when you have things like momentum and Adam it can behave differently and two weeks ago a researcher
00:47:14.660 | figured out a way to actually
00:47:16.820 | Do proper weight decay in modern optimizers and one of our fast AI students just implemented that in the fast AI library
00:47:24.460 | So fast AI is now the first
00:47:26.460 | Library to actually support this properly
00:47:28.980 | so anyway, so for now, let's do the
00:47:32.020 | The version which
00:47:35.820 | Pie torch calls weight decay
00:47:39.380 | But actually it turns out based on this paper two weeks ago is actually L2 regularization
00:47:43.980 | It's not quite correct, but it's close enough so here. We can say weight decay is 1e neg 3
00:47:48.780 | So it's going to set our cons out our penalty multiplier a to 1e neg 3 and it's going to add that to the loss function
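In the optimizer call that is just the weight_decay keyword (the learning rate here is only a placeholder):

```python
from torch import optim

opt = optim.SGD(net.parameters(), lr=0.1, weight_decay=1e-3)  # a = 1e-3
```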
00:47:57.020 | Okay, and so let's make a copy of these cells
00:48:01.020 | Just so we can compare hope this actually works
00:48:06.180 | Okay, and we'll set this running okay, so this is now optimizing
00:48:10.100 | Well except
00:48:13.460 | If you're, actually, so I've made a mistake here, which is I didn't rerun
00:48:17.820 | this cell. This is an important thing to kind of remember: since I didn't rerun this cell
00:48:23.340 | here, when it created the optimizer and said net.parameters(),
00:48:27.700 | It started with the parameters that I had already trained right so I actually hadn't recreated my network
00:48:33.820 | Okay, so I actually need to go back and rerun this cell first to recreate the network
00:48:39.020 | Then go through and run this
00:48:41.580 | Okay there we go, so let's see what happens
00:48:49.500 | So you might notice something kind of counterintuitive here,
00:48:58.340 | Which is that?
00:49:02.580 | That's our training error right now. You would expect our training error with regularization
00:49:07.900 | to be worse
00:49:10.860 | That makes sense right because we're like we're penalizing
00:49:14.120 | parameters that
00:49:17.300 | Specifically can make it better and yet
00:49:19.700 | Actually it started out better not worse
00:49:23.460 | So why could that be?
00:49:26.980 | So the reason that can happen is that if you have a function
00:49:32.780 | That looks like that
00:49:35.540 | Right it takes potentially a really long time to train
00:49:38.900 | or else if you have a function that kind of looks more like
00:49:42.300 | That it's going to train a lot more quickly
00:49:45.740 | And there are certain things that you can do which sometimes just like can take a function
00:49:50.820 | That's kind of horrible and make it less horrible, and it's sometimes weight decay can actually
00:49:56.500 | Make your functions a little more nicely behaved, and that's actually happened here
00:50:01.060 | So like I just mentioned that to say like don't let that confuse you right like weight decay really does
00:50:07.060 | penalize the training set, and look, so strictly speaking
00:50:10.140 | the final number we get to for the training set shouldn't end up being better,
00:50:15.860 | But it can train sometimes more quickly
00:50:17.860 | Yes, can you pass it to Chenxi?
00:50:26.260 | I don't get it. Okay, why is it making it faster? Like, does the time matter, like the training time?
00:50:32.500 | No, it's this is after one epoch. Yeah, right so after one epoch
00:50:38.020 | Now congratulations for saying I don't get it. That's like the best thing anybody can say you know so helpful
00:50:53.100 | This here was our training
00:50:55.100 | without
00:50:56.980 | weight decay
00:50:58.420 | Okay, and this here is our training with weight decay, okay, so this is not related to time
00:51:05.820 | This is related to just an epoch
00:51:08.420 | Right after one epoch my claim was that you would expect the training set all other things being equal
00:51:17.360 | to have a
00:51:19.700 | worse
00:51:20.900 | loss with weight decay
00:51:23.260 | Because we're penalizing it you know this has no penalty this has a penalty so the thing with a penalty should be worse and
00:51:30.180 | I'm saying oh, it's not that's weird
00:51:34.020 | right, and so the reason it's not is
00:51:37.980 | Because in a single epoch it matters a lot as to whether you're trying to optimize something
00:51:44.380 | That's very bumpy or whether you're trying to optimize something. That's kind of nice and smooth
00:51:50.340 | If you're trying to optimize something that's really bumpy like imagine in some high-dimensional space, right?
00:51:56.220 | You end up kind of rolling around through all these different tubes and tunnels and stuff
00:52:01.940 | You know or else if it's just smooth you just go boom
00:52:04.800 | And it's like, imagine a marble rolling down a hill, where one of them, you've got, like,
00:52:09.980 | it's called Lombard Street in San Francisco, it's like backwards, forwards, backwards, forwards;
00:52:15.260 | it takes a long time to drive down the road, right,
00:52:17.980 | whereas, you know, if you kind of took a motorbike and just went straight over the top, you're just going boom, right? So,
00:52:23.500 | so the kind of shape of the loss function surface,
00:52:28.580 | you know, impacts, or kind of defines, how easy it is to optimize, and therefore how
00:52:34.500 | far it can get in a single epoch, and based on these results
00:52:39.100 | it would appear that weight decay here has made this function easier to optimize.
00:52:44.140 | So just to make sure,
00:52:48.180 | the penalizing is making the optimizer more likely to reach the global minimum?
00:52:54.120 | No, I wouldn't say that my claim actually is that at the end
00:52:58.180 | It's probably going to be less good on the training set indeed. This doesn't look to be the case at the end
00:53:03.220 | after five epochs
00:53:07.900 | Training set is now worse with weight decay now. That's what I would expect right?
00:53:12.820 | I would expect like if you actually find like I never use the term global optimum because
00:53:17.300 | It's just not something we have any guarantees about we don't really care about we just care like where do we get to after?
00:53:23.620 | a certain number of epochs
00:53:25.620 | We hope that we found somewhere. That's like a good solution
00:53:28.660 | And so by the time we get to like a good solution the training set with weight decay the loss is worse
00:53:34.180 | Because it's penalty right but on
00:53:38.900 | The validation set the loss is better
00:53:41.860 | Right because we penalized the training set in order to kind of try and create something that generalizes better
00:53:48.820 | So we've got, you know,
00:53:49.900 | the parameters that are kind of pointless are now zero, and it generalizes better.
00:53:54.060 | Right, so all you're saying is that it just got to a good point
00:54:00.100 | after one epoch, is really all you're saying?
00:54:03.660 | So is it always true?
00:54:07.700 | No, no
00:54:09.700 | But if, by 'it' you mean, does weight decay always make the function surface smoother?
00:54:14.020 | No, it's not always true, but it's like it's worth remembering that
00:54:21.620 | if you're having trouble training a function adding a little bit of weight decay may
00:54:27.180 | may help
00:54:29.780 | So by regularizing the parameters, what it does is it smooths out the loss?
00:54:37.100 | I mean it's not it's not why we do it
00:54:39.740 | you know the reason why we do it is because we want to penalize things that aren't zero to say like
00:54:44.780 | Don't make this parameter a high number unless it's really helping the loss a lot right set it to zero if you can
00:54:51.660 | Because setting as many parameters to zero as possible means it's going to generalize better, right?
00:54:57.060 | It's like the same as having a smaller
00:54:59.060 | Network, right so that's that's we do that's why we do it
00:55:04.260 | But it can change how it learns as well
00:55:07.800 | So let's okay. That's one moment. Okay, so I just wanted to check how we actually went here
00:55:13.180 | So after the second epoch, yeah, so you can see here, it really has helped, right? After the second epoch,
00:55:17.780 | before, we got to 97% accuracy; now we're nearly up to about 98% accuracy.
00:55:23.660 | Right and you can see that the loss was 0.08 versus 0.13 right so adding regularization
00:55:30.500 | Has allowed us to find a you know
00:55:33.740 | 3% versus 2% so like a 50% better
00:55:38.340 | Solution yes Erica, so there are two pieces to this right one is L2 regularization and the weight decay
00:55:47.500 | No, there's so my claim was they're the same thing, right?
00:55:50.940 | So weight decay is the version if you just take the derivative of L2 regularization you get weight decay
00:56:03.360 | So you can implement it either by changing the loss function with a squared loss
00:56:03.360 | Penalty or you can implement it by adding
00:56:06.540 | The weights themselves as part of the gradient, okay?
00:56:11.820 | Yeah, I was just going to finish the questions. Yes. Okay pass it to division
00:56:28.140 | Can we use regularization for a convolutional layer as well? Absolutely. A convolutional layer is just weights, so yep.
00:56:28.140 | And Jeremy can you explain why you thought you needed weight decay in this particular problem?
00:56:34.580 | Not easily, I mean, other than to say it's something that I would always try. You were overfitting, though. Well, yeah, I mean okay, so
00:56:45.220 | even if I, yeah, okay, that's a good point Yannet. So if my training loss
00:56:56.660 | was higher than my validation loss, then I'm underfitting,
00:57:00.340 | right, so there's definitely no point regularizing, right, that would always be a bad thing;
00:57:06.920 | that would always mean you need, like, more parameters in your model.
00:57:09.900 | In this case I'm overfitting. That doesn't necessarily mean regularization will help, but it's certainly worth trying.
00:57:18.620 | Thank you, and that's a great point. There's one more question. Yeah
00:57:21.740 | Tyler gonna pass over there
00:57:24.620 | So how do you choose the optimal number of epochs?
00:57:31.000 | You do my deep learning course
00:57:37.140 | It's - that's a long story, and lots of
00:57:41.820 | It's a bit of both. We just don't - as I say, we don't have time to cover
00:57:50.900 | Best practices in this class; we're going to learn the fundamentals. Yeah, okay, so let's take a
00:57:58.140 | Six minute break and come back at 11:10
00:58:01.700 | All right
00:58:12.020 | So something that we cover in great detail in the deep learning course
00:58:18.060 | But it's really important to mention here: the secret, in my opinion, to modern machine learning techniques is to
00:58:25.980 | massively over parameterize
00:58:28.900 | The solution to your problem right like as we've done here. You know we've got like a hundred thousand weights
00:58:34.740 | When we only had a small number of 28 by 28 images
00:58:38.420 | And then use regularization
00:58:40.980 | okay, it's like the
00:58:43.220 | direct opposite of
00:58:47.700 | nearly all
00:58:49.500 | statistics and learning was done for decades before
00:58:52.580 | and still most kind of like
00:58:55.580 | Senior lecturers at most universities in most areas of have this background where they've learned the correct way to build a model is
00:59:03.460 | To like have as few parameters as possible
00:59:05.460 | Right and so hopefully we've learned two things so far. You know one is we can build
00:59:11.780 | Very accurate models even when they have lots and lots of parameters
00:59:17.420 | Like a random forest has a lot of parameters and you know this here deep network has a lot of parameters
00:59:23.700 | And they can be accurate right?
00:59:25.860 | And we can do that by either using bagging or by using
00:59:32.260 | regularization
00:59:34.180 | Okay, and regularization in neural nets means either weight decay
00:59:39.100 | also known as kind of L2 regularization or
00:59:42.660 | Drop out which we won't worry too much about here
00:59:49.260 | So like it's a
00:59:51.260 | It's a very different way of thinking
00:59:53.260 | about
00:59:55.260 | Building useful models and like I just wanted to kind of warn you that once you leave this classroom
01:00:02.020 | Like even possibly when you go to the next faculty member's talk - there'll be people at USF as well who are
01:00:08.300 | Entirely trained in the world of
01:00:13.340 | Models with small numbers of parameters. You know, your next boss is very likely to have been trained in the world of models
01:00:19.540 | with small numbers of parameters
01:00:21.300 | The idea that they are somehow
01:00:23.780 | More pure or easier or better or more interpretable or whatever - I
01:00:29.140 | Am convinced that that is not true - probably not ever true, certainly very rarely true
01:00:41.260 | that actually
01:00:43.260 | Models with lots of parameters can be extremely interpretable as we learn from our whole lesson of random forest interpretation
01:00:50.980 | You can use most of the same techniques with neural nets, but with neural nets they're even easier. Remember how we did feature importance?
01:01:01.020 | Randomizing a column to see how changes in that column would impact the output
01:01:04.860 | Well, that's just like a kind of dumb way of calculating its gradient -
01:01:10.100 | How much does varying this input change the output? With a neural net we can actually calculate its gradient
01:01:15.340 | Right, so with PyTorch you could actually say: what's the gradient of the output with respect to this column?
01:01:20.780 | All right
01:01:22.580 | You can do the same kind of thing to do
01:01:24.580 | partial dependence plot with a neural net
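As a rough sketch of that idea - not code from the lesson; model is assumed to be a fitted nn.Module and x a float tensor of inputs - you could ask PyTorch for the gradient of the output with respect to each input column like this:

import torch

def input_sensitivities(model, x):
    # x: (n_samples, n_features) float tensor
    x = x.clone().requires_grad_(True)   # track gradients with respect to the inputs
    model(x).sum().backward()            # summing gives one scalar to backpropagate
    return x.grad.abs().mean(dim=0)      # average |d(output)/d(input)| per column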
01:01:27.900 | And you know I'll mention for those of you interested in making a real impact
01:01:32.700 | Nobody's written
01:01:35.020 | Basically any of these things for neural nets, all right, so that whole area
01:01:40.180 | Needs libraries to be written, blog posts to be written
01:01:44.140 | You know some papers have been written
01:01:46.620 | But only in very narrow domains like computer vision. As far as I know, nobody's written the paper saying
01:01:51.780 | Here's how to do structured data
01:01:54.420 | Neural network interpretation methods
01:01:58.380 | So it's a really exciting, big area
01:02:03.620 | So what we're going to do though is we're going to start with applying this
01:02:09.940 | With a simple linear model
01:02:13.460 | And this is mildly terrifying for me because we're going to do NLP and our NLP
01:02:17.980 | Faculty expert is in the room so David just yell at me if I screw this up too badly
01:02:22.500 | And so NLP refers to you know any any kind of modeling where we're working with with natural language text
01:02:33.220 | right and it interestingly enough
01:02:35.860 | We're going to look at a situation where a
01:02:40.860 | Linear model is pretty close to the state-of-the-art for solving a particular problem. It's actually something where I
01:02:50.020 | actually surpassed the previous state-of-the-art on this using a
01:02:54.260 | Recurrent neural network a few weeks ago
01:02:56.980 | But this is actually going to show you something pretty close to the state of the art with a linear model
01:03:03.200 | We're going to be working with the IMDB IMDB data set so this is a data set of movie reviews
01:03:10.520 | You can download it by following these steps
01:03:16.000 | Once you download it you'll see that you've got a train and a test
01:03:23.080 | directory and
01:03:26.000 | In your train directory you'll see there's a negative and a positive directory and in your positive directory
01:03:33.160 | You'll see there's a bunch of text files
01:03:35.160 | And here's an example of a text file
01:03:39.920 | So somehow we've managed to pick out a story of a man who has unnatural feelings for a pig as our first choice
01:03:45.960 | That wasn't intentional, but it'll be fine
01:03:48.680 | So we're going to look at these movie reviews
01:03:56.280 | And for each one, we're going to look to see whether they were positive or negative
01:04:00.000 | So they've been put into one of these folders. They were downloaded from IMDB, the movie database and review site
01:04:06.840 | The ones that were strongly positive went in "positive", strongly negative went in "negative", and the rest they didn't label at all
01:04:14.600 | So these are only highly polarized reviews. So in this case, you know,
01:04:18.080 | We have "an insane violent mob" which is unfortunately "just too absurd" and
01:04:23.920 | Too off-putting, so the label for this was a zero, which is
01:04:31.800 | Negative okay, so this is a negative review
01:04:39.280 | In the fastai library there's lots of little
01:04:41.520 | functions and classes to help with
01:04:44.600 | Most kinds of domains that you do machine learning on. For NLP, one of the simple things we have is texts from folders
01:04:51.600 | That's just going to go ahead and go through and find all of the folders in here
01:04:56.200 | With these names and create a labeled dataset - and you know, don't let these things
01:05:04.120 | Ever stop you from understanding what's going on behind the scenes
01:05:07.680 | Right, we can grab its source code and, as you can see, it's tiny - you know, it's like five lines
01:05:12.860 | Okay, so I don't like to write these things out in full
01:05:16.460 | You know but hide them behind at all functions so you can reuse them
01:05:19.880 | But basically it's just going to go through each directory and then within that so it goes through
01:05:24.280 | Yeah, go through each directory
01:05:27.440 | And then go through each
01:05:29.560 | file in that directory
01:05:31.960 | and then stick that into
01:05:34.240 | This array of texts and figure out what folder it's in and stick that into the array of labels, okay, so
01:05:41.280 | That's how we basically end up with something where we have an array of
01:05:47.520 | The reviews and an array of the labels, okay, so that's our data so our job will be to take
01:05:54.660 | that and
01:05:57.120 | to predict that
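As a rough sketch of what such a helper does - this is reconstructed from the description, with made-up names and paths, not the fastai library's actual function or signature:

from pathlib import Path
import numpy as np

def texts_and_labels_from_folders(path, folders=('neg', 'pos')):
    # Walk e.g. train/neg and train/pos; the label is the index of the folder.
    texts, labels = [], []
    for label, name in enumerate(folders):
        for fname in (Path(path) / name).glob('*.txt'):
            texts.append(fname.read_text(encoding='utf-8'))
            labels.append(label)
    return np.array(texts), np.array(labels)

# e.g. trn_texts, trn_y = texts_and_labels_from_folders('data/aclImdb/train/')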
01:05:59.120 | Okay, and the way we're going to do it is we're going to throw away
01:06:04.920 | Like all of the interesting stuff about language
01:06:08.640 | Which is the order that the words are in. Now, this is very often not a good idea
01:06:15.480 | But in this particular case it's going to turn out to work not too badly
01:06:19.040 | So let me show what I mean by throwing away the order of the words. Normally the order of the words
01:06:23.960 | Matters a lot: if you've got a "not"
01:06:26.280 | Before something, then that "not" refers to that thing. But the thing is, in this case,
01:06:32.840 | We're trying to predict whether something's positive or negative if you see the word absurd appear a lot
01:06:38.040 | Right then maybe that's a sign that this isn't very good
01:06:44.600 | So you know cryptic maybe that's a sign that it's not very good. So the idea is that we're going to turn it into something called a
01:06:51.280 | term document matrix
01:06:53.960 | Where for each document I each review what is going to create a list of what words are in it?
01:06:58.880 | Rather than what order they're in so let me give an example
01:07:02.320 | Can you see this okay?
01:07:06.480 | So here are four
01:07:08.480 | Movie reviews that I made up
01:07:12.640 | This movie is good. The movie is good. They're both positive this movie is bad. The movie is bad
01:07:18.360 | They're both negative right so I'm going to turn this into a term document matrix
01:07:23.400 | So the first thing I need to do is create something called a vocabulary a vocabulary is a list of all the unique words
01:07:28.960 | That appear. Okay, so here's my vocabulary: "this", "movie", "is", "good", "the", "bad". That's all the words
01:07:34.900 | Okay, and so now I'm going to take each one of my movie reviews and turn it into a
01:07:41.280 | Vector of which words appear and how often do they appear right and in this case none of my words appear twice
01:07:47.440 | So this movie is good has those four words in it
01:07:55.320 | Whereas "this movie is bad" has
01:07:55.320 | Those four words in it
01:07:58.040 | Okay, so this
01:08:00.380 | Is called a term document matrix
01:08:03.440 | Right and this representation we call a bag of words
01:08:08.680 | Representation. Right, so this here is a bag of words representation of the review
01:08:13.860 | It doesn't contain the order of the text anymore. It's just a bag of the words -
01:08:19.080 | What words are in it: it contains "bad", "is",
01:08:21.820 | "movie", "this". Okay, so that's the first thing we're going to do is we're going to turn it into a bag of words
01:08:27.800 | Representation and the reason that this is convenient
01:08:30.720 | For linear models is that this is a nice
01:08:36.280 | rectangular matrix that we can like do math on
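Here is that spreadsheet example written out as a few lines of Python (a toy sketch with the made-up reviews, not the IMDB data):

docs = ['this movie is good', 'the movie is good',
        'this movie is bad',  'the movie is bad']

vocab = sorted(set(word for doc in docs for word in doc.split()))
# ['bad', 'good', 'is', 'movie', 'the', 'this']

term_doc = [[doc.split().count(word) for word in vocab] for doc in docs]
# the first row is [0, 1, 1, 1, 0, 1]: "this movie is good" contains good, is,
# movie and this once each - a bag of words, with the word order thrown away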
01:08:39.200 | Okay, and specifically we can do a logistic regression, and that's what we're going to do is we're going to get to a point
01:08:44.880 | We do a logistic regression
01:08:46.880 | Before we get there though, we're going to do something else, which is called naive Bayes, okay?
01:08:53.040 | sklearn
01:08:54.640 | Has something which will create a term document matrix for us. It's called CountVectorizer. Okay, so we'll just use it now
01:09:01.760 | in NLP
01:09:04.680 | You have to turn your text into a list of words
01:09:08.900 | And that's called tokenization
01:09:11.760 | Okay, and that's actually non-trivial
01:09:13.880 | Because, like, what if this was actually "this movie is good
01:09:17.700 | ." - with a full stop - or if it was "this
01:09:20.520 | movie is
01:09:23.360 | good" - like,
01:09:25.440 | How do you deal with that kind of
01:09:27.440 | Punctuation? Well, perhaps more interestingly, what if it was "this movie isn't good"?
01:09:33.920 | right, so
01:09:35.800 | How you turn a piece of text into a list of tokens is called tokenization, right?
01:09:41.400 | And so a good tokenizer would turn "this movie isn't good."
01:09:46.160 | Into something like: this, space,
01:09:49.560 | movie, space, is, space, n't, space, good, space, and so on. Right, so you can see in this version
01:09:57.320 | If I now split this on spaces, every token is either a word, a single piece of punctuation, or a suffix like n't, which is
01:10:04.840 | Considered like a word. Right, that's kind of how we would probably want to tokenize that piece of text, because you wouldn't want
01:10:12.140 | "good." - good full stop -
01:10:14.200 | to be like an object, right, because there's no concept of "good full stop", or
01:10:20.080 | "Double-quote movie" - that's not like an object
01:10:25.720 | Tokenization is something we hand off to a tokenizer
01:10:27.720 | Fast AI has a tokenizer in it that we can use
01:10:31.680 | So this is how we create our term document matrix with a tokenizer
01:10:37.560 | SK learn has a pretty standard API which is nice
01:10:45.840 | I'm sure you've seen it a few times now before so once we've built some kind of model
01:10:52.220 | We can kind of think of this as a model
01:10:54.320 | Just ish
01:10:55.800 | This is just defining what it's going to do. We can call fit transform to
01:11:00.780 | To do that right so in this case fit transform is going to create the vocabulary
01:11:05.600 | Okay, and create the term document matrix
01:11:09.200 | based on the training set
01:11:11.920 | Transform is a little bit different that says use the previously fitted model which in this case means use the previously created vocabulary
01:11:21.600 | We wouldn't want the validation set and the training set to have
01:11:24.400 | You know, the words in different orders in the matrices, right, because then they'd have different meanings
01:11:29.480 | So this is here saying use the same vocabulary
01:11:32.200 | To create a bag of words for the validation set could you pass that back please?
01:11:38.280 | What if the validation set has a different set of words than the training set? Yeah, that's a great question. So generally most
01:11:47.960 | Of these kind of vocab-creating approaches will have a special token for unknown
01:11:52.940 | Sometimes you'll also say, like, hey, if a word appears less than three times, call it unknown
01:12:00.640 | But otherwise it's like: if you see something you haven't seen before, call it unknown
01:12:05.000 | So that would just become a column in the bag of words - "unknown"
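A minimal sketch of that fit/transform pattern with sklearn's CountVectorizer - trn_texts and val_texts are assumed names for the arrays of review texts, and str.split is just a stand-in for the real tokenizer:

from sklearn.feature_extraction.text import CountVectorizer

veczr = CountVectorizer(tokenizer=str.split)    # stand-in tokenizer
trn_term_doc = veczr.fit_transform(trn_texts)   # builds the vocabulary and the matrix
val_term_doc = veczr.transform(val_texts)       # reuses the training vocabulary
# note: sklearn's CountVectorizer simply drops validation words it never saw in
# training, whereas the approaches described above map them to an "unknown" column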
01:12:09.080 | Good question all right, so when we create this
01:12:16.160 | Term document matrix of the training set we have 25,000 rows because there are 25,000 movie reviews
01:12:21.680 | And there are 75,132 columns
01:12:25.620 | What does that represent? What does it mean that there are seventy-five thousand one hundred and thirty-two?
01:12:29.880 | What - can you pass that to the back?
01:12:31.560 | Just a moment, you can pass it back
01:12:33.560 | The vocabulary? Yeah, go on, what do you mean?
01:12:38.880 | So like the number of words - the number of unique words? Yeah, exactly, good
01:12:46.280 | okay, now
01:12:48.280 | most documents
01:12:51.240 | Don't have most of these 75,000
01:12:54.040 | Words all right, so we don't want to actually store that as
01:12:59.040 | A normal array in memory because it's going to be very wasteful
01:13:03.480 | So instead we store it as a sparse
01:13:06.520 | Matrix, all right, and what a sparse matrix does is it just stores it
01:13:12.520 | as something that says
01:13:16.560 | Whereabouts the non-zeros are. Right, so it says, like: okay, document number one,
01:13:22.860 | word number four
01:13:25.560 | Appears, and it has four of them; you know, document one, some other word number
01:13:35.120 | Has that - that appears and it's a one, right, and so forth. That's basically how it's stored
01:13:41.000 | There's actually a number of different ways of storing
01:13:43.520 | And if you do Rachel's computational linear algebra course you'll learn about the different types and why you choose them and how to convert
01:13:50.560 | And so forth, but they're all kind of something like this right and you don't really on the whole have to worry about the details
01:13:57.640 | The important thing to know is it's it's efficient. Okay, and so we could grab the first review
01:14:05.540 | right and that gives us
01:14:08.080 | a 75,000-long sparse
01:14:11.400 | matrix - one row long - with 93 stored elements, so in other words
01:14:16.980 | 93 of those words are actually used in the first document, okay?
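To see concretely what "stored elements" means, here is a tiny illustration of the compressed sparse row idea with scipy (a toy row, not the real matrix):

import numpy as np
from scipy import sparse

row = sparse.csr_matrix(np.array([[0, 4, 0, 1, 0, 2]]))  # one tiny "document"
print(row.indices)   # column numbers that hold non-zeros: [1 3 5]
print(row.data)      # the counts stored at those columns:  [4 1 2]
# the first real review is the same idea: ~75,000 columns, only 93 stored elements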
01:14:22.820 | We can have a look at the vocabulary by saying vectorizer.get_feature_names - that gives us the vocab
01:14:29.440 | And so here's an example of a few of the elements of get feature names
01:14:36.880 | Didn't intentionally pick the one that had Aussie, but you know that's the important words obviously
01:14:44.280 | Haven't used the tokenizer here. I'm just bidding on space so this isn't quite the same as what the
01:14:48.540 | Vectorizer did but to simplify things
01:14:51.120 | Let's grab a set of all the lowercase words
01:14:55.360 | By making it a set we make them unique so this is
01:14:58.920 | Roughly the list of words that would appear right and that length is 91
01:15:04.720 | Which is pretty similar to 93 and just the difference will be that I didn't use a real tokenizer. Yeah
01:15:11.080 | All right
01:15:13.080 | So that's basically all that's been done there. It's kind of created this unique list of words and map them
01:15:19.600 | We could check by calling vectorizer dot vocabulary underscore to find the idea of a particular word
01:15:27.240 | So this is like the reverse map of this one right this is like integer to word
01:15:31.760 | Here is word to integer, and so we saw absurd appeared twice in the first document
01:15:38.040 | So let's check train term doc 0 comma 1 2 9 7 there
01:15:42.120 | It is is 2 right or else unfortunately Aussie didn't appear in the unnatural relationship with a pig movie
01:15:49.720 | So 0 comma 5,000 is 0 okay, so that's that's our term document matrix
01:15:59.340 | Yes, so does it care about the relative relationship between the words
01:16:08.480 | As in the ordering of the words no, we've thrown away the orderings. That's why it's a bag of words
01:16:12.560 | And I'm not claiming that this is like
01:16:16.520 | Necessarily a good idea what I will say is that like the vast majority of NLP work
01:16:23.880 | That's been done over the last few decades generally uses this representation because we didn't really know much better
01:16:29.800 | Nowadays increasingly we're using recurrent neural networks instead which we'll learn about in our
01:16:37.880 | deep learning lesson of part one
01:16:40.080 | But sometimes this representation works pretty well, and it's actually going to work pretty well in this case
01:16:46.920 | Okay, so in fact, you know, back when I was at FastMail, my email company, a
01:16:55.400 | Lot of the spam filtering we did used this next technique naive Bayes
01:17:00.440 | Which is a bag of words approach - just, kind of, you know, if you're getting a lot of
01:17:05.960 | Email containing the word Viagra, and it's always been spam
01:17:09.760 | And you never get email from your friends talking about Viagra
01:17:13.480 | Then it's very likely something that says Viagra regardless of the detail of the language is probably from a spammer
01:17:19.880 | Alright, so that's the basic theory about like classification using a term document matrix, okay, so let's talk about naive Bayes
01:17:28.280 | And here's the basic idea. We're going to start with our term document matrix
01:17:34.920 | right and
01:17:36.920 | These first two is
01:17:39.200 | our corpus of positive reviews
01:17:41.680 | These next two is our corpus of negative reviews, and so here's our whole corpus of all reviews
01:17:48.260 | So what I could do is now to create a
01:17:52.160 | Probability. I'm
01:17:55.720 | Going to call these - as we tend to call them more generically - features rather than words, right?
01:18:00.480 | "this" is a feature, "movie" is a feature, "is" is a feature, right?
01:18:04.800 | So it's kind of more, now, like machine learning language: a column is a feature
01:18:08.660 | We often call those f, so we can basically say the probability
01:18:15.160 | That you would see the word this
01:18:17.880 | Given that the class is one given that it's a positive review
01:18:23.620 | It's just the average of how often do you see this in the positive reviews?
01:18:29.240 | right
01:18:31.660 | Now we've got to be a bit careful though
01:18:34.160 | because
01:18:36.160 | If you never ever see a particular word
01:18:39.880 | In a particular class right so if I've never received an email from a friend that said Viagra
01:18:47.040 | All right, that doesn't actually mean the probability of a friend sending me an email about Viagra is zero
01:18:53.600 | It's not really zero, right? I
01:18:56.260 | Hope I don't get an email. You know from Terrence tomorrow saying like
01:19:02.160 | Jeremy you probably could use this you know advertisement for Viagra, but you know it could happen and you know
01:19:08.660 | You know, I'm sure it'd be in my best interest
01:19:12.400 | So so what we do is we say actually what we've seen so far is not the full sample of everything that could happen
01:19:20.120 | It's like a sample of what's happened so far. So let's assume that the next email you get
01:19:26.480 | Actually does mention Viagra and every other possible word. Right, so basically we're going to add a row of ones -
01:19:35.480 | Okay, so that's like the email that contains every possible word so that way nothing's ever
01:19:40.160 | infinitely unlikely okay, so I take the average of
01:19:44.680 | All of the
01:19:48.360 | Times that this appears in my positive corpus plus the ones
01:19:53.680 | okay, so that's like the
01:19:56.040 | the probability that
01:19:59.320 | Feature equals this appears in a document given that class equals one
01:20:06.840 | And so not surprisingly here's the same thing
01:20:10.320 | For probability that this feature this appears given class equals zero right same calculation except for the zero
01:20:18.080 | Rows and obviously these are the same because this appears
01:20:24.000 | twice in the positives sorry once in the positives and once in the negatives, okay
01:20:29.720 | Let's just put this back to what it was
01:20:33.440 | All right
01:20:42.240 | So we can do that for every feature
01:20:45.080 | for every class
01:20:48.560 | Right so our trick now is to basically use
01:20:53.840 | Base rule to kind of fill this in
01:20:57.920 | So what we want is the probability that
01:21:04.880 | Given that I've got this particular document so somebody sent me this particular email or I have this particular IMDB review
01:21:15.160 | What's the probability that its class is?
01:21:18.640 | equal to I
01:21:21.080 | Don't know positive right so for this particular movie review. What's the probability that its class is?
01:21:26.240 | Positive right and so we can say well that's equal to the probability
01:21:31.440 | That we got this particular
01:21:35.240 | movie review
01:21:38.320 | Given that its class is positive
01:21:42.400 | Multiplied by the probability that any movie reviews class is positive
01:21:46.960 | Divided by the probability of getting this particular movie review
01:21:51.960 | All right, that's just Bayes' rule, okay, and so we can calculate
01:21:57.080 | All of those things
01:22:00.280 | But actually what we really want to know is is it more likely that this is class zero or class one?
01:22:07.720 | Right so what if we actually took?
01:22:12.000 | Probability that's plus one and divided by a probability that's plus zero
01:22:16.120 | What if we did that right and so then we could say like okay?
01:22:21.560 | If this number is bigger than one then it's more likely to be class one if it's smaller than one
01:22:26.760 | It's more likely to be class zero right so in that case we could just divide
01:22:31.480 | This whole thing
01:22:35.280 | Right by the same version for class zero right which is the same as multiplying it by the reciprocal
01:22:41.680 | And so the nice thing is, now that's going to put a probability of the data on top here, which we can cancel out
01:22:46.960 | Right, and a probability of getting the data given class zero down here, and the probability of class
01:22:55.560 | Zero here. Right, and so basically what that means is we want to calculate
01:23:02.560 | The probability that we would get this particular document given that the class is one
01:23:08.760 | Times the probability that the class is one, divided by the probability of getting this particular document given the class is zero
01:23:16.280 | times the probability that the class is zero
01:23:19.040 | so the probability that the class is one is
01:23:22.640 | Just equal to the average of the labels
01:23:26.360 | Right probability that the class is zero is just one minus that right so
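Written out as a formula, the ratio being compared (for a particular document d) is:

$$ \frac{P(c=1 \mid d)}{P(c=0 \mid d)} \;=\; \frac{P(d \mid c=1)\,P(c=1)}{P(d \mid c=0)\,P(c=0)} $$

The P(d) terms have cancelled, and the naive Bayes assumption then approximates each P(d | c) as the product of the per-feature probabilities P(f | c) computed above.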
01:23:36.640 | So there are those two numbers right I've got an equal amount of both so it's both point five
01:23:40.960 | What is the probability of getting this document given that the class is one can anybody tell me how I would calculate that
01:23:51.600 | Can somebody pass that please
01:24:02.640 | Look at all the documents which have class equal to one uh-huh and one divided by that will give you
01:24:08.500 | So remember, though, it's going to be for a particular document. So, for example, we'd be saying: what's the probability that
01:24:14.960 | This review is positive right so what so you're on the right track
01:24:20.200 | But what we're going to have to do is say: let's just look at
01:24:23.300 | the words it has and
01:24:26.360 | Then multiply the probabilities together
01:24:31.320 | For class equals one. Right, so the probability that a class-one review has "this" is
01:24:40.120 | Two-thirds, the probability it has "movie" is one, "is" is one, and "good" is one
01:24:47.960 | So the probability it has all of them is all of those multiplied together
01:24:51.880 | Kinda. And the "kinda" - Tyler, why is it not really? Can you pass it to Tyler?
01:24:58.240 | So glad you look horrified and skeptical. "Word choice is not independent"
01:25:08.200 | So nobody can call Tyler naive
01:25:12.760 | Because the reason this is naive Bayes is
01:25:16.160 | Because this is what happens if you take Bayes's theorem in a naive way - and Tyler is anything but naive, right? So
01:25:23.880 | Naive Bayes says: let's assume that if you have "this movie is bloody stupid,
01:25:29.520 | I hate it"
01:25:30.400 | Then the probability of "hate" is independent of the probability of "bloody", which is independent of the probability of "stupid", right?
01:25:36.880 | Which is definitely not true, right, and so naive Bayes ain't actually very good
01:25:41.920 | But I'm kind of teaching it to you because it's going to turn out to be a convenient
01:25:46.560 | Piece for something we're about to learn later
01:25:51.120 | It's okay, right? I mean, it's - I would never choose it
01:25:55.760 | Like I don't think it's better than any other technique. That's equally fast and equally easy
01:25:59.520 | But you know, it's a thing you can do and it's certainly going to be a useful foundation
01:26:08.080 | so here is our calculation, right, of the probability
01:26:15.360 | That we get this particular document assuming it's a positive review. Here's the probability given
01:26:21.160 | It's a negative and here's the ratio and this ratio is above one
01:26:25.280 | So we're going to say I think that this is probably a positive review. Okay, so that's the Excel version
01:26:34.080 | So you can tell that I let Yannet touch this because it's got LaTeX in it - we've got actual math. So
01:26:40.240 | So here is the here is the same thing
01:26:44.440 | the log count ratio for each
01:26:46.440 | feature F each word F
01:26:48.960 | and so here it is
01:26:51.960 | Written out as Python. Okay, so our independent variable is our term document matrix
01:26:58.520 | Dependent variable is just the labels, the y
01:27:01.480 | So using NumPy
01:27:04.160 | This is going to grab the rows
01:27:06.160 | Where the dependent variable is one?
01:27:09.680 | Okay, and so then we can sum them over the rows to get the total word count
01:27:16.720 | For that feature across all the documents, right?
01:27:21.040 | Plus one, right, because that's the email - Terrence is totally going to send me something about Viagra today
01:27:26.240 | I can tell that's that's that yeah, okay, so I'll do the same thing for the negative reviews
01:27:31.700 | Right and then of course it's nicer to take the log
01:27:36.560 | Right because if we take the log then we can add things together rather than multiply them together
01:27:41.680 | And once you like multiply enough of these things together
01:27:44.280 | It's going to get kind of so close to zero that you'll probably run out of floating point, right? So we take the log
01:27:49.680 | of the ratios
01:27:54.560 | Then we can, as I say - we then multiply that, or in log space add that, to the
01:28:01.780 | log of the ratio of the whole-class
01:28:06.360 | probabilities, right
01:28:08.360 | So in order to say, for each document,
01:28:11.800 | Multiply the Bayes probabilities by the counts, we can just use matrix multiply
01:28:19.040 | okay, and then to add on the
01:28:21.880 | The log of the class ratios we can just use plus B and so we end up with something that looks a lot like our
01:28:31.520 | Logistic regression right, but we're not learning anything right not in kind of a SGD point of view
01:28:38.280 | We're just we're calculating it using this theoretical model
01:28:41.560 | Okay, and so as I said we can then compare that as to whether it's bigger or smaller than zero
01:28:46.400 | Not one anymore because we're now in log space
01:28:48.720 | Right, and then we can compare that to the labels and take the mean, and we say okay, that's 80% - 81% - accurate
01:28:56.000 | Right so naive Bayes, you know is not is not nothing. It gave us something. Okay?
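Here is a rough reconstruction of the kind of code being described - a sketch with assumed variable names (trn_term_doc / val_term_doc for the sparse term-document matrices, y / val_y for the 0/1 label arrays), not the exact notebook cell:

import numpy as np

x = trn_term_doc
p = x[y == 1].sum(0) + 1                         # feature counts in positive docs, plus the row of ones
q = x[y == 0].sum(0) + 1                         # feature counts in negative docs, plus the row of ones
r = np.log((p / p.sum()) / (q / q.sum()))        # log-count ratio for each feature
b = np.log((y == 1).mean() / (y == 0).mean())    # log of the ratio of class priors

pre_preds = val_term_doc @ r.T + b               # one score per validation document
preds = np.asarray(pre_preds).squeeze() > 0      # compare to 0, since we are in log space
accuracy = (preds == val_y).mean()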
01:29:02.600 | it turns out that
01:29:05.320 | This version where we're actually looking at how often a word appears
01:29:10.280 | Like absurd appeared twice
01:29:13.040 | It turns out at least for this problem and quite often it doesn't matter whether absurd appeared twice or once all that matters
01:29:19.940 | Is that it appeared?
01:29:21.200 | So what people tend to try doing is to say: take the term
01:29:27.520 | Document matrix and go .sign() - .sign()
01:29:31.060 | Replaces anything positive with one and anything negative with negative one (we don't have any negative counts, obviously), so this
01:29:38.200 | Binarizes it. So it says: I don't care that you saw "absurd" twice,
01:29:42.720 | I just care that you saw it right so if we do exactly the same thing
01:29:49.040 | With the binarized version
01:29:51.040 | Then you get a better result, okay?
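The binarized version is the same calculation with the counts squashed to 0/1 first - again just a sketch; .sign() is available on scipy sparse matrices:

x = trn_term_doc.sign()                          # 1 wherever a word appears at all, else 0
p = x[y == 1].sum(0) + 1
q = x[y == 0].sum(0) + 1
r = np.log((p / p.sum()) / (q / q.sum()))
b = np.log((y == 1).mean() / (y == 0).mean())    # class-prior term, unchanged
preds = np.asarray(val_term_doc.sign() @ r.T + b).squeeze() > 0
accuracy = (preds == val_y).mean()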
01:29:55.060 | Okay now this is the difference between theory and practice right in theory
01:30:05.680 | Naive Bayes sounds okay, but it's naive - unlike Tyler, it's naive. Right, so what Tyler would probably do instead is say: rather than assuming
01:30:17.120 | That I should use these coefficients r, why don't we learn them? Does that sound reasonable, Tyler?
01:30:24.480 | Yeah, okay, so let's learn them so we can you know we can totally learn them, so let's create a logistic regression
01:30:31.240 | Right and let's fit
01:30:33.800 | Some coefficients, and that's going to literally give us something with exactly the same functional form that we had before
01:30:39.800 | But now rather than using a theoretical
01:30:43.160 | R and a theoretical B. We're going to calculate the two things based on logistic regression, and that's better
01:30:49.540 | okay, so
01:30:52.720 | So it's kind of like yeah, why
01:30:57.640 | Why do something based on some theoretical model because theoretical models are never
01:31:03.160 | Going to be as accurate pretty much as a data-driven model right because theoretical models
01:31:07.920 | unless you're dealing with some I
01:31:11.080 | Don't know like physics thing or something where you're like okay?
01:31:13.840 | This is actually how the world works there really is no I don't know
01:31:17.360 | We're working in a vacuum, and this is the exact gravity and blah blah blah right, but most of the real world
01:31:23.080 | This is how things are - it's better to learn your coefficients than calculate them. Yes - you know,
01:31:28.420 | Generally - what's this dual=True?
01:31:33.120 | I was hoping you'd ignore that and not notice, but you saw it
01:31:39.120 | basically in this case our
01:31:41.120 | Term document matrix is much wider than it is tall
01:31:44.720 | There is a reformulation
01:31:47.780 | Mathematically basically almost a mathematically equivalent reformulation of logistic regression that happens to be a lot faster when it's wider than it is tall
01:31:55.680 | So the short answer is, anytime
01:31:58.760 | It's wider than it is tall, put dual=True and it will run fast - this runs in like two seconds
01:32:03.200 | If you don't have it here, it'll take a few minutes
01:32:06.880 | So like in math there's this kind of concept of dual versions of problems which are kind of like
01:32:12.480 | Equivalent versions that sometimes work better for certain situations
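A sketch of that fit, with the same assumed names as before; the huge C just turns regularization off in effect, and recent sklearn versions require solver='liblinear' when dual=True:

from sklearn.linear_model import LogisticRegression

m = LogisticRegression(C=1e8, dual=True, solver='liblinear')  # dual form: fast when wide
m.fit(trn_term_doc, y)
accuracy = (m.predict(val_term_doc) == val_y).mean()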
01:32:17.360 | Okay, here is so here is the binarized version
01:32:24.480 | right
01:32:26.600 | and it's about the same, right. So you can see I've fitted it with the sign of the term-document matrix and
01:32:35.840 | Predicted it with this right
01:32:37.840 | Now the thing is that this is going to be a coefficient for every term
01:32:45.320 | There was about 75,000 terms in our vocabulary
01:32:49.480 | And that seems like a lot of coefficients given that we've only got
01:32:53.400 | 25,000 reviews, so maybe we should try regularizing this
01:32:57.480 | So we can use
01:33:00.400 | Regularization built into sklearn's logistic regression class - C is the parameter that they use. A smaller -
01:33:07.640 | This is slightly weird - a smaller parameter is more regularization, right?
01:33:12.120 | So that's why I used a really big number to basically turn off regularization here. So if I turn on regularization, set it to 0.1,
01:33:18.440 | Then now it's 88 percent, okay, which makes sense. You know, you would think
01:33:25.400 | 75,000 parameters for 25,000 documents, you know, it's likely to overfit - indeed, it did overfit
01:33:30.880 | So this is adding L2 regularization to avoid overfitting
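The regularized, binarized fit is the same call with a small C (again a sketch, not the exact notebook cell):

from sklearn.linear_model import LogisticRegression

m = LogisticRegression(C=0.1, dual=True, solver='liblinear')  # smaller C = more regularization
m.fit(trn_term_doc.sign(), y)
accuracy = (m.predict(val_term_doc.sign()) == val_y).mean()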
01:33:37.880 | Mentioned earlier that as well as L2, which is looking at the weight squared. There's also L1
01:33:44.440 | Which is looking at just the absolute value of the weights, right? I
01:33:55.760 | Was kind of pretty sloppy in my wording before - I said that L2 tries to make things zero
01:34:00.920 | That's kind of true. But if you've got two things that are highly correlated
01:34:04.960 | Then L2 regularization will like move them both down together
01:34:10.040 | It won't make one of them zero and one of them non-zero, right?
01:34:13.440 | So L1 regularization actually has the property that it'll try to make as many things zero as possible
01:34:20.680 | Whereas L2 regularization has a property that it tends to try to make kind of everything smaller
01:34:25.560 | we actually don't care about that difference in
01:34:28.640 | Really any modern machine learning because we very rarely try to directly interpret the coefficients. We try to understand our models through
01:34:37.660 | Interrogation using the kind of techniques that we've learned
01:34:40.800 | The reason that we would care about L1 versus L2 is simply like which one ends up with a better error on the validation
01:34:47.320 | Set okay, and you can try both
01:34:50.320 | With SK learns logistic regression
01:34:52.320 | L2 actually turns out to be a lot faster because you can't use dual equals true unless you have L2
01:34:58.680 | So you know and L2 is the default so I didn't really worry too much about that difference here
01:35:03.560 | So you can see here if we use
01:35:06.720 | regularization and binarized
01:35:10.320 | We actually do pretty well
01:35:18.840 | Yes, can you pass that back to w please
01:35:21.260 | Before, we learned about elastic net, right - like combining L1 and L2? Yeah, yeah, you can do that, but I mean
01:35:30.240 | It's like you know with with deeper models
01:35:34.080 | Yeah, I've never seen anybody find that useful
01:35:37.480 | Okay, so the last thing I mentioned is
01:35:42.880 | That you can, when you do your CountVectorizer -
01:35:48.200 | Wherever that was - when you do your CountVectorizer you can also ask for n-grams. By default we get
01:35:54.960 | unigrams, that is, single words, but if we say
01:35:59.560 | ngram_range=(1,3), that's also going to give us
01:36:04.920 | Bigrams and trigrams. By which I mean, if I now say okay, let's go ahead and
01:36:10.920 | Do the CountVectorizer get_feature_names, now my vocabulary includes bigrams,
01:36:18.280 | Right - "by fast", "by vengeance" - and trigrams - "by vengeance .",
01:36:23.480 | "by vera miles". Right, so this is now doing the same thing, but after tokenizing,
01:36:28.960 | It's not just grabbing each word and saying that's part of our vocabulary
01:36:32.600 | But each two words next to each other, and each three words next to each other, and this turns out to be like
01:36:38.560 | Super helpful in taking advantage of bag of word
01:36:44.280 | Approaches, because we now can see the difference between, like,
01:36:48.040 | You know, "not good" versus "not bad"
01:36:51.920 | versus "not terrible"
01:36:54.840 | Right, or even like "good" in double quotes, which is probably going to be sarcastic. Right, so using trigram features
01:37:04.120 | Actually is going to turn out to make both naive Bayes
01:37:08.840 | And logistic regression quite a lot better. It really takes us quite a lot further and makes them quite useful
01:37:15.400 | I have a question about
01:37:21.320 | Tokenizers - so you are setting max_features, so how are these
01:37:26.200 | Bigrams and trigrams selected? Right, so
01:37:30.480 | Since I'm using a linear model I
01:37:36.440 | Didn't want to create too many features. I mean it actually worked fine even without max features. I think I had something like I
01:37:42.080 | Can't remember 70 million coefficients. It still worked right, but just there's no need to have 70 million coefficients
01:37:48.960 | So if you say max_features=800,000,
01:37:51.760 | The CountVectorizer will sort the vocabulary by how often everything appears, whether it be unigram, bigram or trigram,
01:38:00.620 | And it will cut it off
01:38:03.440 | After the first 800,000 most common n-grams. N-gram is just the generic word for
01:38:09.820 | unigram, bigram and trigram
01:38:12.800 | so that's why the trn_term_doc.shape is now 25,000 by
01:38:18.320 | 800,000 and like if you're not sure what number this should be I
01:38:22.240 | Just picked something that was really big and you know didn't didn't worry about it too much, and it seemed to be fine
01:38:28.840 | Like it's not terribly
01:38:30.840 | sensitive
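Putting those pieces together, a sketch of the n-gram vectorizer (with str.split again standing in for the lesson's tokenizer):

from sklearn.feature_extraction.text import CountVectorizer

veczr = CountVectorizer(ngram_range=(1, 3), tokenizer=str.split, max_features=800000)
trn_term_doc = veczr.fit_transform(trn_texts)   # 25,000 x 800,000
val_term_doc = veczr.transform(val_texts)
# the vocabulary now keeps the 800,000 most frequent unigrams, bigrams and trigrams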
01:38:38.200 | All right, okay, well, we're out of time, so what we're going to see
01:38:38.200 | Next week and by the way you know we could have
01:38:41.680 | Replaced this logistic regression with our pytorch version and next week
01:38:47.880 | We'll actually see something in the fastai library that does exactly that
01:38:57.480 | but also what we'll see next week - so "next week" meaning tomorrow - is
01:38:57.480 | How to combine logistic regression and naive Bayes together to get something that's better than either
01:39:03.080 | and then we'll learn how to move from there to create a
01:39:06.800 | Deeper neural network to get a pretty much state-of-the-art result for structured learning all right, so we'll see them