
Machine Learning 1: Lesson 10


Chapters

0:0 Fast AI
1:22 Feature Engineering
4:25 Structured Data
8:34 Recap
11:53 AutoGrad
13:12 Variables
15:3 Iterators
24:8 Gradients
37:8 Data Loader
40:18 Parameters
47:30 Weight Decay
55:8 Discussion

Transcript

Well, welcome back to machine learning. One of the most exciting things this week, almost certainly the most exciting thing this week, is that fastai is now on pip, so you can pip install fastai. So thank you to Prince and to Karem for making that happen: two USF students who had never published a pip package before, and this is one of the harder ones to publish because it's got a lot of dependencies. So it's probably still easiest just to do the conda env update thing. But there are a couple of places where it would be handy instead to pip install fastai. Obviously, if you're working outside of the repo and the notebooks, then this gives you access to fastai everywhere. Also, I believe they submitted a pull request to Kaggle to try and get it added to the Kaggle kernels, so hopefully you'll be able to use it on Kaggle kernels soon. And you can use it at your work or wherever else. So that's exciting.

I'm not going to say it's officially released yet. It's still very early, obviously, and you're still helping add documentation and all that kind of stuff, but it's great that it's now there.

A couple of cool kernels from USF students this week. I thought I'd highlight two that were both from the text normalization competition, which was about trying to take text that was written out in standard English (they also had one for Russian) and identify things like a first, second, third and say that's a cardinal number, or this is a phone number, or whatever. I did a quick bit of searching and saw that there had been some attempts in academia to use deep learning for this, but they hadn't managed to make much progress.

I actually noticed Alvira's kernel here, which gets 0.992 on the leaderboard, which I think is top 20. It's entirely heuristic, and it's a great example of feature engineering; in this case the whole thing is basically entirely feature engineering. It's looking through and using lots of regular expressions to figure out, for each token, what it is. I think she's done a great job of laying it all out clearly as to what all the different pieces are and how they fit together. She mentioned that she's maybe hoping to turn this into a library, which I think would be great: you could use it to grab a piece of text and pull out all the pieces in it.

It's the kind of thing that the natural language processing community hopes to be able to do without lots of handwritten code like this. For now, it'll be interesting to see what the winners turn out to have done, but I haven't seen machine learning being used to do this particularly well. Perhaps the best approaches are the ones that combine this kind of feature engineering with some machine learning. But I think this is a great example of effective feature engineering. And this is another USF student who has done much the same thing and got a similar kind of score.

But she used her own, different set of rules. Again, this would get you a good leaderboard position as well. So I thought it was interesting to see examples of some of our students entering a competition and getting top-20-ish results with basically just handwritten heuristics. This is where, for example, computer vision was six years ago: basically all the best approaches were a whole lot of carefully handwritten heuristics, often combined with some simple machine learning. So I think over time the field is definitely trying to move towards automating much more of this.

And actually, very interestingly, in the Safe Driver Prediction competition that just finished, one of the Netflix Prize winners won, and he invented a new algorithm for dealing with structured data which basically doesn't require any feature engineering at all. He came first using nothing but five deep learning models and one gradient boosting machine. His basic approach was very similar to what we've been learning in this class so far, and what we'll also be learning tomorrow, which is using fully connected neural networks and one-hot encoding, and specifically embeddings, which we'll learn about. But he had a very clever technique. There was a lot of data in this competition which was unlabeled, in other words, where they didn't know whether that driver would go on to claim or not.

When you've got some labeled and some unlabeled data, we call that semi-supervised learning. In real life, most learning is semi-supervised: normally you have some things that are labeled and some that are unlabeled, so this is the most practically useful kind of learning. And structured data is the most common kind of data that companies deal with day to day. So the fact that this competition was a semi-supervised, structured data competition made it incredibly practically useful.

His technique for winning was to do data augmentation, which those of you doing the deep learning course have learned about. The idea is, if you had pictures, you would flip them horizontally or rotate them a bit. Data augmentation means creating new data examples which are slightly different versions of ones you already have. The way he did it was, for each row in the data, he would at random replace 15% of the variables with values from a different row. So each row would now represent a mix of 85 percent of the original row and 15 percent randomly selected from a different row. This was a way of randomly changing the data a little bit.

Then he used something called an autoencoder, which we probably won't study until part two of the deep learning course. The basic idea of an autoencoder is that your dependent variable is the same as your independent variable. In other words, you try to predict your input, which obviously is trivial if you're allowed to; the identity transform, for example, trivially predicts the input. But the trick with an autoencoder is to have fewer activations in at least one of your layers than in your input. So if your input was a hundred-dimensional vector, and you put it through a 100 by 10 matrix to create 10 activations, and then have to recreate the original hundred-long vector from that, then you effectively have to have compressed it. It turns out that that kind of neural network is forced to find correlations and features and interesting relationships in the data, even when it's not labeled. So he used that; he didn't do any hand engineering.
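As a rough illustration of the row-mixing augmentation described here, the following framework-free sketch replaces each cell, with probability 0.15, by the value in the same column of a randomly chosen row. The function name `swap_noise`, the per-cell (rather than exactly-15%-per-row) sampling, and the toy data are assumptions made for illustration; the winner's actual implementation is not shown in the lecture.

```python
import random

def swap_noise(rows, p=0.15, seed=0):
    """Return a noisy copy of `rows`: each cell is replaced, with probability p,
    by the value in the same column of a randomly chosen row."""
    rng = random.Random(seed)
    n = len(rows)
    noisy = []
    for row in rows:
        new_row = list(row)
        for col in range(len(row)):
            if rng.random() < p:
                donor = rows[rng.randrange(n)]  # may occasionally be the same row
                new_row[col] = donor[col]
        noisy.append(new_row)
    return noisy

# Toy table: 8 rows, 3 columns (hypothetical data).
data = [[i, 10 * i, 100 * i] for i in range(1, 9)]
augmented = swap_noise(data)
```

Because values are only ever swapped within a column, each column keeps a realistic distribution; an autoencoder could then be trained to reconstruct the original row from the noisy one.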

He just used an autoencoder. So these are some interesting directions that, if you keep going with your machine learning studies, particularly if you do part two of the deep learning course next year, you'll learn about. And you can see how feature engineering is going away. This was just an hour ago, so it's very recent news indeed, but it's one of the most important breakthroughs I've seen in a long time.

Okay, so we were working through a simple logistic regression trained with SGD for MNIST, and here's the summary of where we got to. We have nearly built a model module and a training loop from scratch, and we were going to try and finish that. After we finish that, I'm going to go through this entire notebook backwards: having gone top to bottom, I'm going to go back through bottom to top.

So this was that little handwritten nn.Module class we created. We defined our loss, we defined our learning rate, and we defined our optimizer, and this is the thing that we're going to try and write by hand in a moment. So that and that we're still using from PyTorch, but that we've written ourselves, and this we've written ourselves.

The basic idea was we're going to go through some number of epochs, so let's go through one epoch, and we're going to keep track of, for each mini-batch, what the loss was, so that we can report it at the end. We're going to turn our training data loader into an iterator so that we can loop through every mini-batch. So now we can go ahead and say for t in range of the length of the data loader, and then we can call next to grab the next independent variables and dependent variables from that iterator.

Remember, we can then pass the x tensor into our model by calling the model as if it was a function. But first of all we have to turn it into a variable. Last week we were typing Variable(blah).cuda() to turn it into a variable; a shorthand for that is just the capital V. It's a capital T for a tensor, capital V for a variable. That's just a shortcut in fastai. Okay, so that returns our predictions.

The next thing we needed was to calculate our loss, because we can't calculate the derivatives of the loss if we haven't calculated the loss. The loss takes the predictions and the actuals. The actuals again are the y tensor, and again we have to turn that into a variable.

Now, can anybody remind me what a variable is and why we would want to use a variable here?

I think once you turn it into a variable, then it tracks it, so then you can do backward on that, so you can get...

What, sorry?

When you turn it into a variable, it can track its process, as the functions are nested within each other. You can track it, and then when we do backward on it, it backpropagates and does the...

Yeah, right. So a variable keeps track of all of the steps that got computed. There's actually a fantastic tutorial on the PyTorch website. On the PyTorch website there's a tutorial section, and there's a tutorial there about autograd. Autograd is the name of the automatic differentiation package that comes with PyTorch. The Variable class is really the key class here, because that's the thing that turns a tensor into something where we can keep track of its gradients. So here they show how to create a variable, do an operation on a variable, and then you can go back and actually look at the grad function, which is the function that it's keeping track of to calculate the gradient. As we do more and more operations on this variable, and on the variables calculated from that variable, it keeps keeping track of it. So later on we can go .backward() and then print .grad and find out the gradient. You'll notice we never defined the gradient; we just defined it as being x plus 2, squared, times 3, or whatever, and it can calculate the gradient.

Okay, so that's why we need to turn that into a variable. So l is now a variable containing the loss. It contains a single number for this mini-batch, which is the loss for this mini-batch, but it's not just a number.
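To make the "x plus 2, squared, times 3" example concrete, here is a hand-written forward and backward pass for a single scalar, checked against a finite-difference estimate. This is only a sketch of what an automatic differentiation package does for you, not PyTorch code; the function names `forward` and `backward` are chosen here for illustration.

```python
def forward(x):
    y = x + 2          # first operation
    z = 3 * y * y      # second operation: 3 * (x + 2)^2
    return y, z

def backward(y):
    dz_dy = 6 * y      # derivative of 3*y^2 with respect to y
    dy_dx = 1.0        # derivative of x + 2 with respect to x
    return dz_dy * dy_dx   # chain rule: dz/dx = dz/dy * dy/dx

x = 1.0
y, z = forward(x)
grad = backward(y)

# Numerical check with a central finite difference.
eps = 1e-6
numeric = (forward(x + eps)[1] - forward(x - eps)[1]) / (2 * eps)
```

At x = 1 the analytic gradient is 6 * (1 + 2) = 18, and the numerical estimate agrees to within rounding error. An autograd system records the two operations during the forward pass so that it can apply the same chain rule automatically.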

It's a number as a variable, so it's a number that knows how it was calculated. We're going to append that loss to our array just so we can get the average of it later. And now we're going to calculate the gradient. l.backward() is the thing that says calculate the gradient. Remember, when we call the network, it's actually calling our forward function. So that's like, go through it forward, and then backward is like using the chain rule to calculate the gradients backwards. And then this is the thing we're about to write, which is update the weights based on the gradients and the learning rate. zero_grad we'll explain when we write this out by hand.

Then at the end we can turn our validation data loader into an iterator, and we can go through its length, grabbing each x and y out of that and asking for the score, which we defined up here to be: which thing did you predict, which thing was actual, and check whether they're equal. The mean of that is going to be our accuracy.

Could you pass that over to Chenxi?

What's the advantage of converting it into an iterator rather than using a normal Python loop?

We're using a normal Python loop. This is a normal Python loop, so the question really is, compared to what? The alternative you're perhaps thinking of would be something like a list with an indexer. The problem there is a few things. One key one is that each time we grab a new mini-batch, we want it to be random; we want a different, shuffled thing. And this you can iterate from forever; you can loop through it as many times as you like. There's this kind of idea, it's called different things in different languages, but in a lot of languages it's called stream processing. It's this basic idea that rather than saying I want the third thing or the ninth thing, it's just, I want the next thing. It's great for network programming.
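The "I just want the next thing" idea can be sketched with a Python generator that hands out freshly shuffled mini-batches for as long as the caller keeps asking. The name `batch_stream` and its parameters are hypothetical, chosen only for this illustration.

```python
import random

def batch_stream(items, batch_size, seed=0):
    """Yield mini-batches forever, reshuffling at the start of every pass."""
    rng = random.Random(seed)
    while True:
        order = list(items)
        rng.shuffle(order)                       # a different order on each pass
        for start in range(0, len(order), batch_size):
            yield order[start:start + batch_size]

stream = batch_stream(range(10), batch_size=4)
first_pass = [next(stream) for _ in range(3)]    # batches of 4, 4 and 2 items
```

The caller never indexes into the data; it only asks for the next batch, which is why the same pattern works whether the data sits in memory, on disk, or arrives over a network.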

It's like, grab the next thing from the network. It's great for UI programming: grab the next event where somebody clicked a button. It also turns out to be great for this kind of numeric programming: I just want the next batch of data. It means that the data can be arbitrarily long, as we're just grabbing one piece at a time. I guess the short answer is, it's how PyTorch works. PyTorch's data loaders are designed to be called in this way.

And Python has this concept of a generator, which is a... different type of generator. I wonder if this is going to be a snake generator or a computer generator. Okay. A generator is a way that you can create a function that, as it says, behaves like an iterator. Python has recognized that this stream processing approach to programming is super handy and helpful, and supports it everywhere. So basically anywhere you use a for-in loop, anywhere you use a list comprehension, those things can always be generators or iterators. By programming this way, we just get a lot of flexibility, I guess. Does that sound about right, Terrence? You're the programming language expert. Did you want to grab that box so we can hear? Terrence actually does programming languages for a living, so we should ask him.

Yeah, I mean, the short answer is what you said. You might say something about space, but in this case all that data has to be in memory anyway, because we've got...

No, it doesn't have to be in memory. In fact, most of the time with PyTorch the mini-batch will be read from separate images spread over your disk, on demand, so most of the time it's not in memory.

But in general you want to keep as little in memory as possible at a time. And the idea of stream processing is also great because you can do compositions; you can pipe the data to a different machine.

Yeah, the composition is great. You can grab the next thing from here and then send it off to the next stream, which can then grab it and do something else.

Which you all recognize, of course, from command-line pipes and redirection.

Yes. Okay, thanks, Terrence. The benefit of working with people who actually know what they're talking about.

All right, so let's now take that and get rid of the optimizer. The only thing we're going to be left with is the negative log likelihood loss function, which we could also replace, actually. We have an implementation of that from scratch that Yannet wrote in the notebooks. It's only one line of code; as we learned earlier, you can do it with a single if statement. So I don't know why I was so lazy as to include this.

So what we're going to do is again grab this module that we've written ourselves, the logistic regression module. We're going to have one epoch again.

We're going to loop through each thing in our iterator again. We're going to grab our independent and dependent variables for the mini-batch again, pass it into our network again, calculate the loss. So this is all the same as before. But now we're going to get rid of this optimizer.step() and do it by hand. The basic trick is, as I mentioned, we're not going to do the calculus by hand, so we'll call l.backward() to calculate the gradients automatically, and that's going to fill in the gradients for our weight matrix.

Let's go back and look at the code. Here's that module we built. The weight matrix for the linear layer we called l1_w, and the bias we called l1_b; they were the attributes we created. So I've just put them into things called w and b to save some typing. w is our weights, b is our biases. Remember, the weights are a variable, and to get the tensor out of the variable we have to use .data. We want to update the actual tensor that's in this variable, so we say w.data minus-equals (we want to go in the opposite direction to the gradient; the gradient tells us which way is up, and we want to go down) whatever is currently in the gradients times the learning rate. That is the formula for gradient descent. As you can see, it's as easy a thing as you can possibly imagine: literally update the weights to be equal to whatever they are now minus the gradients times the learning rate. And do the same thing for the bias.

So does anybody have any questions about that step, in terms of why we do it or how? Did you have a question? Do you want to grab that?

So that step, when we do the next of dl...

The next here? Yes.

When it is at the end of the loop, how do you grab the next element?

So this is going through each index in range of length, so this is going 0, 1, 2, 3. At the end of this loop it's going to print out the mean of the validation set and go back to the start of the epoch, at which point it's going to create a new iterator. Basically, behind the scenes in Python, when you call iter on this, it tells it to reset its state to create a new iterator. If you're interested in how that works, the code is all available for you to look at. md.trn_dl is a fastai.dataset.ModelDataLoader, so we could take a look at the code of that and see exactly how it's being built. You can see here's the __next__ function, which is basically keeping track of how many times it's been through in self.i, and here's the __iter__ function, which is the thing that gets called when you create a new iterator. You can see it's basically passing it off to something else, which is of type DataLoader, and then you can check out DataLoader if you're interested to see how that's implemented as well. The DataLoader that we wrote basically uses multi-threading to allow it to have multiple of these going on at the same time. It's really simple, only about a screenful of code, so if you're interested in simple multi-threaded programming, it's a good thing to look at.

Yes?

Why have you wrapped this in a for epoch in range(1), since that'll only run once?

Because in real life we would normally be running multiple epochs. In this case, because it's a linear model, it actually basically trains to as good as it's going to get in one epoch. If I type 3 here, it won't really improve after the first epoch much at all, as you can see. But when we go back up to the top we're going to look at some slightly deeper and more interesting versions which will take more epochs. If I was turning this into a function, I'd be going def train_model, and one of the things you would pass in is the number of epochs.

Okay, great. So one thing to remember is that when you're creating these neural network layers, as far as PyTorch is concerned, this is just an nn.Module. We could be using it as a layer, we could be using it as a function, we could be using it as a neural net. PyTorch doesn't think of those as different things. So this could be a layer inside some other network.

So how do gradients work? If you've got a layer, which we can think of basically as its activations, some activations that get computed through some non-linear activation function or through some linear function, then from that layer it's very likely that we're putting it through a matrix product to create some new layer. So if we were to grab one of these activations, it's actually going to be used to calculate every one of these outputs. If you want to calculate the derivative, you have to know how this weight matrix impacts that output, and that output, and that output, and that output, and then you have to add all of those together to find the total impact of this across all of its outputs.

That's why in PyTorch you have to tell it when to set the gradients to zero. The idea is that you could be having lots of different loss functions, or lots of different outputs in your next set of activations, or whatever, all adding up, increasing or decreasing your gradients. So you basically have to say, okay, this is a new calculation: reset. Here is where we do that. Before we do l.backward() we say reset. So let's take our weights, let's take the gradients, let's take the tensor that they point to, and then zero_(). Does anybody remember from last week what underscore does as a suffix in PyTorch?

Yeah, I forgot the language, but basically it changes it within the place.

Right, the language is "in place". Exactly. It sounds like a minor technicality, but it's super useful to remember: pretty much every function has an underscore-suffix version which does it in place. Normally zeros returns a tensor of zeros of a particular size, so zero_ means replace the contents of this with a bunch of zeros.
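The whole by-hand update (reset the gradient buffers, accumulate gradients, step against them) can be sketched in plain Python on a tiny linear model. The data, the mean-squared-error loss, and the `params`/`grads` dictionaries are assumptions for illustration; the lecture's version uses PyTorch variables and a negative log likelihood loss instead.

```python
# Fit y = w*x + b to noise-free points on the line y = 2x + 1 (toy data).
batches = [([0.0, 1.0], [1.0, 3.0]), ([2.0, 3.0], [5.0, 7.0])]

params = {"w": 0.0, "b": 0.0}
grads = {"w": 0.0, "b": 0.0}
lr = 0.05

def backward(xs, ys):
    """Add the mean-squared-error gradients for one batch into `grads`."""
    n = len(xs)
    for x, y in zip(xs, ys):
        err = params["w"] * x + params["b"] - y
        grads["w"] += 2 * err * x / n    # += : gradients accumulate
        grads["b"] += 2 * err / n

for epoch in range(2000):
    for xs, ys in batches:
        for name in grads:
            grads[name] = 0.0            # reset: this is a new calculation
        backward(xs, ys)
        for name in params:
            params[name] -= grads[name] * lr   # step against the gradient
```

Because `backward` adds into the buffers rather than overwriting them, leaving out the reset would mix gradients from earlier mini-batches into every update, which is the reason for the explicit zeroing step.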

All right, so that's it. That's SGD from scratch. And if I get rid of my menu bar we can officially say it fits within a screen. Of course we haven't got our definition of logistic regression here; that's another half a screen, but basically there's not much to it.

Yes?

So later on, if we have to do this more, is it because you might find a wrong minimum, a local minimum, that way, so you have to kick it out? And that's why you have to do it multiple times when the surface gets more...

Why do you need multiple epochs, is that your question? Well, a simple way to answer that would be: let's say our learning rate was tiny. Then it's just not going to get very far. There's nothing that says going through one epoch is enough to get you all the way there. So then you'd say, okay, let's increase our learning rate. Sure, we'll increase our learning rate, but who's to say that the highest learning rate that learns stably is enough to learn this as well as it can be learned? For most data sets and most architectures, one epoch is very rarely enough to get you to the best result you can get to. Linear models are just very nicely behaved, so you can often use higher learning rates and learn more quickly. Also, you generally can't get as good an accuracy with them, so there's not as far to take them either. So doing one epoch is going to be the rarity.

All right, so let's go backwards. Going backwards, we're basically going to say: let's not write those two lines again and again, let's not write those three lines again and again. Let's have somebody do that for us. That's the only difference between that version and this version: rather than saying .zero_() ourselves, rather than saying minus gradient times lr ourselves, these are wrapped up for us. There is another wrinkle here, which is that this approach to updating the weights is actually pretty inefficient.

It doesn't take advantage of momentum and curvature. In the DL course we learn about how to do momentum from scratch as well. If we actually just use plain old SGD, then you'll see that this learns much slower. Now that I've typed just plain old SGD here, this is literally doing exactly the same thing as our slow version, so I have to increase the learning rate. Okay, there we go. This is now the same as the one we wrote by hand.

So then, let's do a little bit more stuff automatically. Given that every time we train something we have to loop through epochs, loop through batches, do forward, get the loss, zero the gradient, do backward, do a step of the optimizer, let's put all that in a function. That function is called fit. There it is. So let's take a look at fit: go through each epoch, go through each batch, do one step, keep track of the loss, and at the end calculate the validation.

And then step. If you're interested in looking at this, this stuff's all inside fastai.model. Here is step: zero the gradients, calculate the loss (remember, PyTorch tends to call it criterion rather than loss), do backward. Then there's something else we haven't learned here, but we do learn in the deep learning course, which is gradient clipping, so you can ignore that. So you can see now, all the stuff that we've learned, when you look inside the actual frameworks, that's the code you see.
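The division of labour between fit and step can be sketched as below. This is not the fastai source: the signatures, the one-parameter toy model, and the `validate` callback are all hypothetical, and only the overall shape (loop over epochs, loop over batches, one step per batch, validation at the end of each epoch) follows the description.

```python
def fit(step, train_batches, epochs, validate):
    """Hypothetical training driver: for each epoch, run `step` on every batch,
    then record the average training loss and a validation metric."""
    history = []
    for epoch in range(epochs):
        losses = [step(x, y) for x, y in train_batches]
        history.append((sum(losses) / len(losses), validate()))
    return history

state = {"w": 0.0}   # one-parameter model: prediction = w * x

def step(x, y, lr=0.1):
    """Forward pass, loss, gradient, and weight update for one example."""
    pred = state["w"] * x
    loss = (pred - y) ** 2
    grad = 2 * (pred - y) * x        # d(loss)/dw, recomputed fresh each step
    state["w"] -= lr * grad
    return loss

def validate():
    return abs(state["w"] - 3.0)     # distance from the true slope of 3

history = fit(step, [(1.0, 3.0), (2.0, 6.0)], epochs=20, validate=validate)
```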

So that's what fit does. The next step would be: this idea of having some weights and a bias and doing a matrix product and addition, let's put that in a function. This thing of doing the log softmax, let's put that in a function. And then the very idea of first doing this and then doing that, this idea of chaining functions together, let's put that into a function, and that finally gets us to that. So Sequential simply means do this function, take the result, send it to this function, and so on. And Linear means create the weight matrix, create the biases. So that's it.

We can then, as we started to talk about, turn this into a deep neural network by saying, rather than sending this straight off into 10 activations, let's put it into, say, 100 activations.

We could pick whatever number we like. Put it through a ReLU to make it nonlinear, put it through another linear layer, another ReLU, and then our final output with our final activation function. This is now a deep network, so we could fit that. This time, because it's deeper, I'm going to run a few more epochs, and you can see the accuracy increasing. If you try to increase the learning rate here, which is 0.1, further, it actually starts to become unstable.

Now I'll show you a trick. This is called learning rate annealing, and the trick is this. When you're trying to fit to a function, you've been taking a few steps, step, step, step. As you get close to the bottom, your steps probably want to become smaller; otherwise what tends to happen is you start finding you're doing this, going back and forth. You can actually see it here: we've got 93, 94 and a bit, 94.6, 94.8, so it's starting to flatten out. That could be because it's done as well as it can, or it could be that it's going backwards and forwards. So a good idea later on in training is to decrease your learning rate and take smaller steps. That's called learning rate annealing.
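A simple step schedule of this kind, dropping the learning rate by a constant factor every few epochs, might look like the sketch below. The function `annealed_lr` and its parameters are hypothetical; the lecture changes the rate by hand rather than with a formula.

```python
def annealed_lr(base_lr, epoch, drop_every, factor=0.1):
    """Hypothetical step schedule: multiply the rate by `factor`
    once every `drop_every` epochs."""
    return base_lr * factor ** (epoch // drop_every)

# Start at 0.1 and drop by an order of magnitude after four epochs.
schedule = [annealed_lr(0.1, epoch, drop_every=4) for epoch in range(8)]
```

The first four epochs use the starting rate and the next four use a rate ten times smaller, which mirrors the manual "order of magnitude at a time" approach.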

There's a function in fastai called set_lrs (set learning rates). You can pass in your optimizer and your new learning rate and see if that helps, and very often it does. About an order of magnitude. In the deep learning course we learn a much, much better technique than this to do it all automatically and at a more granular level, but if you're doing it by hand, an order of magnitude at a time is what people generally do. You'll see people in papers talk about learning rate schedules; this is a learning rate schedule. So this schedule (just a moment, Erica, I'll just come to Ernest first) has got us to 97. I tried going further and we don't seem to be able to get much better than that. So here we've got something where we can get 97 percent accuracy.

Yes, Erica?

So it seems like you changed the learning rate to something very small, ten times smaller than we started with. We had 0.1, now it's 0.01.

Yeah.

But that makes the whole model train really slowly, so I was wondering if you can make it so that it changes dynamically as it approaches closer to the minimum.

Yeah, pretty much. That's some of the stuff we learn in the deep learning course. There are these more advanced approaches. Yes?

So how is it different from using the Adam optimizer or something?

That's the kind of stuff we can do. You still need annealing. As I say, we do this kind of stuff in the deep learning course, so for now we're just going to stick to standard SGD.

I had a question about the data loading. I know it's a fastai function, but could you go into a little bit of detail on how it's creating batches, how it's loading data, and how it's making those decisions?

Sure. It would be good to ask that on Monday night so we can talk about it in detail in the deep learning class, but let's do the quick version here. Basically, there's a really nice design in PyTorch where they say, let's create a thing called a dataset. A dataset is basically something that looks like a list. It has a length, so that's how many images are in the dataset, and it has the ability to index into it like a list. So if you had d = Dataset, you can do len(d), and you can do d of some index. That's basically all a dataset is as far as PyTorch is concerned. So you start with a dataset, and d[3] gives you the third image, or whatever.

Then the idea is that you can take a dataset and pass it into a constructor for a data loader, and that gives you something which is now iterable. So you can now say iter(dl), and that's something that you can call next on. When you do this, you can choose to have shuffle on or shuffle off. Shuffle on means give me random mini-batches; shuffle off means go through it sequentially. What the data loader does when you say next, assuming you said shuffle equals true, is, if you've got a batch size of 64, it grabs 64 random integers between 0 and the length, and calls this 64 times to get 64 different items and jam them together.

fastai uses the exact same terminology and the exact same API; we just do some of the details differently. Particularly with computer vision, you often want to do a lot of pre-processing and data augmentation, like flipping, changing the colors a little bit, rotating. Those turn out to be really computationally expensive. Even just reading the JPEGs turns out to be computationally expensive. So PyTorch uses an approach where it fires off multiple processes to do that in parallel, whereas the fastai library instead does something called multi-threading, which can be a much faster way of doing it.

Yes, Yannet?

So an epoch, is it really an epoch in the sense that it covers all of the elements? So it's a shuffle at the beginning of the epoch, something like that?

Yeah. I mean, not all libraries work the same way: some do sampling with replacement, some don't. The fastai library actually hands the shuffling off to the actual PyTorch version, and I believe the PyTorch version actually shuffles, so an epoch covers everything once, I believe. Okay, now the thing is, when you start to get these bigger networks, potentially you're getting quite a few parameters. So I want to ask you to calculate how many parameters there are, but let's remember what we have here.
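To make the dataset and data loader idea from the quick version above concrete, here is a minimal sketch in plain Python rather than PyTorch. The names ListDataset and SimpleLoader are made up for illustration and are not the PyTorch or fastai classes. The sketch assumes the shuffle-once-per-epoch behaviour just described, so every item is seen exactly once per epoch.

```python
import random


class ListDataset:
    """Hypothetical dataset: anything with a length that can be indexed like a list."""

    def __init__(self, items):
        self.items = list(items)

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        return self.items[i]


class SimpleLoader:
    """Hypothetical loader: an iterable that yields mini-batches from a dataset."""

    def __init__(self, dataset, batch_size, shuffle=False, seed=None):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = random.Random(seed)

    def __iter__(self):
        idxs = list(range(len(self.dataset)))
        if self.shuffle:
            # Shuffle the indexes once per epoch, so each item appears exactly once.
            self.rng.shuffle(idxs)
        for start in range(0, len(idxs), self.batch_size):
            batch_idxs = idxs[start:start + self.batch_size]
            # Index into the dataset once per item and jam the items together.
            yield [self.dataset[i] for i in batch_idxs]


d = ListDataset(range(10))
dl = SimpleLoader(d, batch_size=4, shuffle=True, seed=0)

first_batch = next(iter(dl))                             # one mini-batch of 4 items
epoch_items = sorted(x for batch in dl for x in batch)   # one full pass over the data
```

A real loader would also stack the items into tensors and do the loading work in parallel, which is where the process versus thread difference mentioned above comes in.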

We've got 28 by 28 inputs into 100 outputs, and then 100 into 100, and then 100 into 10, and then for each of those we've got weights and biases. So we can actually do this: net.parameters() returns a list where each element of the list is a matrix, or actually a tensor, of the parameters, and not just one per layer: if it's a layer with both weights and biases, that would be two parameters, right?

So basically it returns us a list of all of the tensors containing the parameters. numel ("num elements") in PyTorch tells you how big that is. So if I run this, here is the number of parameters in each layer. I've got 784 inputs and the first layer has 100 outputs, so therefore the first weight matrix is of size 78,400, and the first bias vector is of size 100. Then the next one is 100 by 100, and there's 100, and then the next one is 100 by 10, and then there's my bias, okay?
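The sizes just listed can be checked with a few lines of arithmetic. This is only a sketch of the counting, assuming the three fully connected layers described above (784 to 100, 100 to 100, 100 to 10), each with a weight matrix and a bias vector; it does not touch PyTorch at all.

```python
# Layer widths: 28*28 inputs, two hidden layers of 100, and 10 outputs.
layer_sizes = [28 * 28, 100, 100, 10]

counts = []
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    counts.append(n_in * n_out)  # weight matrix for this layer
    counts.append(n_out)         # bias vector for this layer

total = sum(counts)
print(counts)  # [78400, 100, 10000, 100, 1000, 10]
print(total)   # 89610
```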

So there's the number of elements in each layer, and if I add them all up, it's nearly a hundred thousand, and so I'm possibly at risk of overfitting. So we might want to think about using regularization. A really simple, common approach to regularization in all of machine learning is something called L2 regularization, and it's super important, super handy.

You can use it with just about anything. The basic idea of L2 regularization is this. Normally we'd say our loss is equal to, let's just do RMSE to keep things kind of simple, our predictions minus our actuals, squared, and then we sum them up, take the average, take the square root, okay?

So what if we then want to say: you know what, if I've got lots and lots of parameters, don't use them unless they're really helping enough. Like if you've got a million parameters and you only really needed 10 parameters to be useful, just use 10. So how could we tell the loss function to do that?

And so basically what we want to say is: hey, if a parameter is zero, that's no problem, it's like it doesn't exist at all. So let's penalize a parameter for not being zero. So what would be a way we could measure that? How can we calculate how un-zero our parameters are? Can you pass that to Chenxi? You calculate the average of all the parameters. That's my first... it can't quite be the average. Close. Yes, Taylor?

Yeah. Yes, you figured it out? Okay. So I think, assuming all of our data has been normalized, standardized, however you want to call it, we want to check that they're significantly different from zero, right? Not the data; that would be the parameters, rather, that would be significantly different from zero, and the parameters don't have to be normalized or anything, they're just calculated, right?

Yeah, so significantly different from zero, right. Well, I just meant assuming that the data has been normalized so that we can compare them. Oh, yeah, got it. Yeah, right. And then those that are not significantly different from zero we can probably just drop. And I think Chenxi is going to tell us how to do that.

You just figured it out, right? The mean of the absolute values. We could do that; that would be called L1, which is great. So L1 would be the average of the absolute values of the weights. L2 is actually the sum of the squares. Yeah, exactly. So we just take this, and we don't even have to take the square root: we just take the squares of the weights themselves, and then we want to be able to say, okay, how much do we want to penalize not being zero? Because if we actually don't have that many parameters, we don't want to regularize much at all; if we've got heaps, we do want to regularize a lot.

So then we put a parameter... yeah, right, except I have a rule in my classes, which is never to use Greek letters. So normally people use alpha; I'm going to use a. So this is some number, which you often see as something around 1e-6 to 1e-4-ish. Now, we actually don't care about the loss. When you think about it, we don't actually care about the loss other than maybe to print it out; all we actually care about is the gradient of the loss. So the gradient of that penalty, a times the sum of the squared weights, is 2 times a times the weights. So there are two ways to do this: we can actually modify our loss function to add in this squared penalty, or we could modify that thing where we said weights equals weights minus gradient times learning rate, to subtract that as well, or rather, to add that into the gradient as well. These are kind of basically equivalent, but they have different names.
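As a rough illustration of the penalty idea above, here is a small plain-Python sketch. The function names and the example numbers are made up for illustration; the sketch assumes the L1 penalty is a sum of absolute values, the L2 penalty is a sum of squares, and the multiplier is called a, as in the discussion.

```python
import math


def rmse(preds, actuals):
    """Root mean squared error: squared differences, averaged, then square-rooted."""
    squared = [(p - y) ** 2 for p, y in zip(preds, actuals)]
    return math.sqrt(sum(squared) / len(squared))


def l1_penalty(weights):
    """How un-zero the weights are, measured with absolute values."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """How un-zero the weights are, measured with squares (no square root needed)."""
    return sum(w * w for w in weights)


def penalized_loss(preds, actuals, weights, a=1e-4):
    """Ordinary loss plus a times the squared-weight penalty."""
    return rmse(preds, actuals) + a * l2_penalty(weights)


weights = [0.0, 3.0, -4.0]          # the zero weight contributes nothing to either penalty
print(l1_penalty(weights))          # 7.0
print(l2_penalty(weights))          # 25.0
print(penalized_loss([1.0, 2.0], [1.0, 4.0], weights, a=0.01))
```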

This is called L2 regularization; this is called weight decay. So in the neural network literature, that version was how it was first posed, whereas this other version is how it was posed in the statistics literature, and, you know, they're equivalent. As we talked about in the deep learning class, it turns out they're not exactly equivalent, because when you have things like momentum and Adam it can behave differently. Two weeks ago a researcher figured out a way to actually do proper weight decay in modern optimizers, and one of our fastai students just implemented that in the fastai library, so fastai is now the first library to actually support this properly. So anyway, for now, let's do the version which PyTorch calls weight decay, but which, it turns out, based on this paper from two weeks ago, is actually L2 regularization. It's not quite correct, but it's close enough.

We can say weight decay is 1e-3, so it's going to set our penalty multiplier a to 1e-3, and it's going to add that to the loss function. Okay, so let's make a copy of these cells just so we can compare; hope this actually works. Okay, and we'll set this running. So this is now optimizing... well, except I've made a mistake here, which is I didn't rerun this cell. This is an important thing to remember: since I didn't rerun this cell here, when it created the optimizer and said net.parameters(), it started with the parameters that I had already trained, so I actually hadn't recreated my network. So I actually need to go back and rerun this cell first to recreate the network, then go through and run this. Okay, there we go, so let's see what happens.

So you might notice something kind of counterintuitive here, which is that that's our training error, right? Now, you would expect our training error with regularization to be worse. That makes sense, right, because we're penalizing parameters that specifically can make it better. And yet actually it started out better, not worse. So why could that be? The reason that can happen is that if you have a function that looks like that, it takes potentially a really long time to train, or else if you have a function that looks more like that, it's going to train a lot more quickly. And there are certain things that you can do which sometimes can take a function that's kind of horrible and make it less horrible, and sometimes weight decay can actually make your functions a little more nicely behaved, and that's actually happened here. So I just mention that to say: don't let that confuse you. Weight decay really does penalize the training set, and so, strictly speaking, the final number we get to for the training set shouldn't end up being better, but it can sometimes train more quickly. Yes, can you pass it to Chenxi? I don't get it.

Okay. Why is it making it faster? Like, does the time matter, the training time? No, this is after one epoch. Yeah, right, so after one epoch. Now, congratulations for saying "I don't get it"; that's like the best thing anybody can say, it's so helpful. This here was our training without weight decay, and this here is our training with weight decay, so this is not related to time, this is related to just an epoch. After one epoch, my claim was that you would expect the training set, all other things being equal, to have a worse loss with weight decay, because we're penalizing it. This has no penalty, this has a penalty, so the thing with a penalty should be worse, and I'm saying: oh, it's not, that's weird.

And so the reason it's not is because in a single epoch it matters a lot whether you're trying to optimize something that's very bumpy or whether you're trying to optimize something that's kind of nice and smooth. If you're trying to optimize something that's really bumpy, like imagine in some high-dimensional space, you end up kind of rolling around through all these different tubes and tunnels and stuff, or else if it's just smooth you just go boom. I mean, it's like, imagine a marble rolling down a hill, where on one of them you've got, what's it called, Lombard Street in San Francisco.

It's like backwards, forwards, backwards, forwards; it takes a long time to drive down the road. Whereas if you took a motorbike and just went straight over the top, you just go boom, right? So the shape of the loss function surface impacts, or kind of defines, how easy it is to optimize, and therefore how far it can get in a single epoch. And based on these results...

It would appear that weight decay here has made this function easier to optimize. So just to make sure: the penalizing is making the optimizer more likely to reach the global minimum? No, I wouldn't say that. My claim actually is that at the end it's probably going to be less good on the training set, and indeed this does look to be the case: at the end, after five epochs, our training set is now worse with weight decay.

Now, that's what I would expect, right? I never use the term global optimum, because it's just not something we have any guarantees about, and we don't really care about it; we just care where we get to after a certain number of epochs. We hope that we've found somewhere that's a good solution.

And so by the time we get to a good solution, the training set loss with weight decay is worse, because it's penalized, but on the validation set the loss is better, because we penalized the training set in order to try and create something that generalizes better. So the parameters that are kind of pointless are now zero, and it generalizes better. So all we're saying is that it just got to a good point after one epoch; that's really all we're saying. So is it always true?

No, no. But if by "it" you mean weight decay, does it always make the function surface smoother? No, it's not always true, but it's worth remembering that if you're having trouble training a function, adding a little bit of weight decay may help. So by regularizing the parameters, what it does is it smooths out the loss? I mean, that's not why we do it. The reason why we do it is because we want to penalize things that aren't zero, to say: don't make this parameter a high number unless it's really helping the loss a lot; set it to zero if you can. Because setting as many parameters to zero as possible means it's going to generalize better, right?

It's like the same as having a smaller network. So that's why we do it, but it can change how it learns as well. So, okay, just one moment. I just wanted to check how we actually went here.

So after the second epoch, yeah, you can see here it really has helped. After the second epoch, before, we got to 97% accuracy; now we're nearly up to about 98% accuracy. And you can see that the loss was 0.08 versus 0.13. So adding regularization has allowed us to find a, you know, 3% versus 2% error, so like a 50% better solution. Yes, Erica? So there are two pieces to this, right: one is L2 regularization and the other is weight decay? No, my claim was they're the same thing, right?

So weight decay is the version where, if you just take the derivative of L2 regularization, you get weight decay. So you can implement it either by changing the loss function with a squared loss penalty, or you can implement it by adding the weights themselves as part of the gradient, okay?
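The two implementations just described can be compared side by side for plain SGD. This is a minimal sketch with a single made-up weight and a made-up squared-error loss, not the notebook's code; as noted earlier, the equivalence shown here holds for standard SGD and is not exact once momentum or Adam is involved.

```python
import math


def grad_data_loss(w, x, y):
    """Gradient of the data loss (w*x - y)**2 with respect to w."""
    return 2 * (w * x - y) * x


def step_l2_in_loss(w, x, y, lr, a):
    """L2 regularization: loss is (w*x - y)**2 + a*w**2, so the gradient gains 2*a*w."""
    grad = grad_data_loss(w, x, y) + 2 * a * w
    return w - lr * grad


def step_weight_decay(w, x, y, lr, a):
    """Weight decay: leave the loss alone and shrink the weight in the update itself."""
    return w - lr * grad_data_loss(w, x, y) - lr * 2 * a * w


w_l2 = w_wd = 1.5
for _ in range(20):
    w_l2 = step_l2_in_loss(w_l2, x=2.0, y=1.0, lr=0.05, a=0.1)
    w_wd = step_weight_decay(w_wd, x=2.0, y=1.0, lr=0.05, a=0.1)

print(w_l2, w_wd)  # the two versions track each other up to floating point rounding
```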

Yeah, I was just going to finish the questions. Yes? Okay, pass it to Devesh. Can we use regularization for a convolutional layer as well? Absolutely, a convolutional layer just is weights, so yep. And Jeremy, can you explain why you thought you needed weight decay in this particular problem? Not easily, I mean, other than to say it's something that I would always try. You're overfitting? Well...

Yeah, I mean, okay, that's a good point, Yannet. So if my training loss was higher than my validation loss, then I'm underfitting, so there's definitely no point regularizing; that would always be a bad thing. That would always mean you need more parameters in your model. In this case I'm overfitting. That doesn't necessarily mean regularization will help, but it's certainly worth trying. Thank you, that's a great point.

There's one more question. Yeah, Tyler, can you pass it over there? So how do you choose the optimal number of epochs? You do my deep learning course. That's a long story, and lots of... it's a bit of both. As I say, we just don't have time to cover best practices in this class; we're going to learn the kind of fundamentals.

Yeah, okay, so let's take a six-minute break and come back at 11:10. All right. So something that we cover in great detail in the deep learning course, but it's really important to mention here, is that the secret, in my opinion, to modern machine learning techniques is to massively over-parameterize the solution to your problem, as we've done here.

You know, we've got like a hundred thousand weights when we only had a small number of 28 by 28 images, and then use regularization. It's like the direct opposite of how nearly all statistics and learning was done for decades before, and still most senior lecturers at most universities in most areas have this background where they've learned that the correct way to build a model is to have as few parameters as possible. And so hopefully we've learned two things so far.

You know, one is we can build very accurate models even when they have lots and lots of parameters. A random forest has a lot of parameters, and this deep network here has a lot of parameters, and they can be accurate, right? And we can do that by either using bagging or by using regularization. And regularization in neural nets means either weight decay, also known as kind of L2 regularization, or dropout, which we won't worry too much about here.

So it's a very different way of thinking about building useful models, and I just wanted to warn you that once you leave this classroom, even possibly when you go to the next faculty member's talk, there'll be people at USF as well who are entirely trained in the world of models with small numbers of parameters. Your next boss is very likely to have been trained in the world of models with small numbers of parameters. The idea that they are somehow more pure or easier or better or more interpretable or whatever, I am convinced that that is not true: probably not ever true, certainly very rarely true.

And actually, models with lots of parameters can be extremely interpretable, as we learned from our whole lesson on random forest interpretation. You can use most of the same techniques with neural nets, but with neural nets it's even easier. Remember how we did feature importance by randomizing a column to see how changes in that column would impact the output? Well, that's just a kind of dumb way of calculating its gradient: how much does varying this input change the output? With a neural net we can actually calculate its gradient.

Right, so with PyTorch you could actually say: what's the gradient of the output with respect to this column? You can do the same kind of thing to do a partial dependence plot with a neural net. And I'll mention, for those of you interested in making a real impact, nobody's written basically any of these things for neural nets. So that whole area needs libraries to be written, blog posts to be written. Some papers have been written, but only in very narrow domains like computer vision; as far as I know, nobody's written the paper saying here's how to do interpretation methods for structured data neural networks. So it's a really exciting, big area.

What we're going to do, though, is start by applying this with a simple linear model. And this is mildly terrifying for me, because we're going to do NLP and our NLP faculty expert is in the room, so David, just yell at me if I screw this up too badly. NLP refers to any kind of modeling where we're working with natural language text. And interestingly enough, we're going to look at a situation where a linear model is pretty close to the state of the art for solving a particular problem.

It's actually something where I surpassed the state of the art using a recurrent neural network a few weeks ago, but this is actually going to show you something pretty close to the state of the art with a linear model. We're going to be working with the IMDB dataset, so this is a dataset of movie reviews. You can download it by following these steps, and once you download it you'll see that you've got a train and a test directory, and in your train directory you'll see there's a negative and a positive directory, and in your positive directory you'll see there's a bunch of text files. And here's an example of a text file. So somehow we've managed to pick out a story of a man who has unnatural feelings for a pig as our first choice. That wasn't intentional, but it'll be fine. So we're going to look at these movie reviews, and for each one we're going to look to see whether they were positive or negative. So they've been put into one of these folders.

They were downloaded from IMDb, the movie database and review site. The ones that were strongly positive went in positive, strongly negative went in negative, and the rest they didn't label at all, so these are only highly polarized reviews. So in this case, you know, we have "an insane violent mob", which unfortunately is "just too absurd", "too off-putting", those in the area we turned off, so the label for this was a zero, which is negative. Okay, so this is a negative review. So in the fastai library...

There's lots of little functions and classes to help with most kinds of domains that you do machine learning on. For NLP, one of the simple things we have is texts from folders. That's just going to go ahead and go through and find all of the folders in here with these names and create a labeled dataset. And, you know, don't let these things ever stop you from understanding what's going on behind the scenes. We can grab its source code, and as you can see it's tiny, it's like five lines. So I don't like to write these things out in full, but hide them behind little functions so you can reuse them.

But basically it's just going to go through each directory, and then go through each file in that directory, and stick that into this array of texts, and figure out what folder it's in and stick that into the array of labels. So that's how we basically end up with something where we have an array of the reviews and an array of the labels, so that's our data. So our job will be to take that and to predict that. And the way we're going to do it is we're going to throw away all of the interesting stuff about language, which is the order in which the words are in, right?
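The folder-walking helper described above can be sketched in a few lines of standard-library Python. This is not the fastai source; the function name read_texts_and_labels, the folder names neg and pos, and the two tiny files created in a temporary directory are all assumptions made so that the example runs on its own.

```python
import os
import tempfile


def read_texts_and_labels(path, folders):
    """Hypothetical helper: one text per file, labelled by the index of its folder."""
    texts, labels = [], []
    for label, folder in enumerate(folders):
        folder_path = os.path.join(path, folder)
        for fname in sorted(os.listdir(folder_path)):
            with open(os.path.join(folder_path, fname), encoding="utf-8") as f:
                texts.append(f.read())   # the review itself
            labels.append(label)         # which folder it came from
    return texts, labels


# Build a tiny stand-in for the train directory so the sketch is self-contained.
with tempfile.TemporaryDirectory() as root:
    for folder, text in [("neg", "The movie is bad"), ("pos", "This movie is good")]:
        os.makedirs(os.path.join(root, folder))
        with open(os.path.join(root, folder, "0.txt"), "w", encoding="utf-8") as f:
            f.write(text)
    texts, labels = read_texts_and_labels(root, ["neg", "pos"])

print(texts)   # ['The movie is bad', 'This movie is good']
print(labels)  # [0, 1]
```

Listing the negative folder first makes label 0 mean negative, which matches the labelling mentioned for the example review.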

This is very often not a good idea, but in this particular case it's going to turn out to work not too badly. So let me show what I mean by throwing away the order of the words. Normally the order of the words matters a lot: if you've got a "not" before something, then that "not" refers to that thing. But the thing is, in this case we're trying to predict whether something's positive or negative. If you see the word "absurd" appear a lot, then maybe that's a sign that this isn't very good; "cryptic", maybe that's a sign that it's not very good.

So the idea is that we're going to turn it into something called a term document matrix, where for each document, i.e. each review, we're just going to create a list of what words are in it, rather than what order they're in. So let me give an example. Can you see this okay?

Okay. So here are four movie reviews that I made up: "This movie is good", "The movie is good", they're both positive; "This movie is bad", "The movie is bad", they're both negative. So I'm going to turn this into a term document matrix. The first thing I need to do is create something called a vocabulary. A vocabulary is a list of all the unique words that appear. So here's my vocabulary: this, movie, is, good, the, bad.

That's all the words. And so now I'm going to take each one of my movie reviews and turn it into a vector of which words appear and how often they appear, and in this case none of my words appear twice. So "This movie is good" has those four words in it, whereas "This movie is bad" has those four words in it.

Okay, so this is called a term document matrix, and this representation we call a bag of words representation. So this here is a bag of words representation of the review. It doesn't contain the order of the text anymore; it's just a bag of the words, what words are in it. It contains bad, is, movie, this.

Okay, so that's the first thing we're going to do: we're going to turn it into a bag of words representation. And the reason that this is convenient for linear models is that this is a nice rectangular matrix that we can do math on, and specifically we can do a logistic regression. That's what we're going to do: we're going to get to a point where we do a logistic regression. Before we get there, though...

We're going to do something else, which is called naive Bayes. So sklearn has something which will create a term document matrix for us; it's called CountVectorizer, so we'll just use it. Now, in NLP you have to turn your text into a list of words, and that's called tokenization. And that's actually non-trivial, because if this was actually "This movie is good." with a full stop, or if it was this "movie" is good, with quotes, how do you deal with that punctuation? Well, perhaps more interestingly, what if it was "This movie isn't good"? So how you turn a piece of text into a list of tokens is called tokenization.

And so a good tokenizer would turn "this movie isn't good" into this: this, space, quote, movie, space, is, space, n't, space, good, space. So you can see in this version here, if I now split this on spaces, every token is either a single piece of punctuation or, like this suffix n't, is considered like a word. That's kind of how we would probably want to tokenize that piece of text, because you wouldn't want "good full-stop" to be like an object, because there's no concept of "good full-stop", or "double-quote movie" is not like an object. So tokenization is something we hand off to a tokenizer; fastai has a tokenizer in it that we can use.

So this is how we create our term document matrix with a tokenizer. sklearn has a pretty standard API, which is nice; I'm sure you've seen it a few times before. So once we've built some kind of model, and we can kind of think of this as a model, just-ish, this is just defining what it's going to do.
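The tokenization behaviour described above (punctuation as its own token, and the n't suffix split off like a word) can be approximated with a couple of regular expressions. This is only a rough sketch, not the fastai tokenizer; the function name simple_tokenize and the example sentence are made up for illustration, and real tokenizers handle many more cases.

```python
import re


def simple_tokenize(text):
    """Hypothetical tokenizer: words, single punctuation marks, and the n't suffix."""
    # Put a space in front of the n't suffix so "isn't" becomes "is n't".
    text = re.sub(r"n't\b", " n't", text)
    # A token is the n't suffix, a run of word characters, or one punctuation mark.
    return re.findall(r"n't|\w+|[^\w\s]", text)


tokens = simple_tokenize('This "movie" isn\'t good.')
print(tokens)  # ['This', '"', 'movie', '"', 'is', "n't", 'good', '.']
```

With this split, "good" and the full stop are separate tokens, so there is no "good full-stop" object.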

We can call fit_transform to do that. So in this case fit_transform is going to create the vocabulary and create the term document matrix based on the training set. transform is a little bit different: that says use the previously fitted model, which in this case means use the previously created vocabulary. We wouldn't want the validation set and the training set to have the words in different orders in the matrices, because then they'd have different meanings. So this here is saying: use the same vocabulary to create a bag of words for the validation set. Could you pass that back, please?
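To show what fit_transform and transform are doing with the vocabulary, here is a toy stand-in written in plain Python. The class TinyCountVectorizer is made up for illustration and is not sklearn's CountVectorizer; it splits on whitespace instead of using a tokenizer, and it simply drops words it has not seen before rather than mapping them to an unknown column.

```python
class TinyCountVectorizer:
    """Hypothetical, minimal imitation of the fit / transform split."""

    def fit(self, docs):
        vocab = {}
        for doc in docs:
            for word in doc.lower().split():
                vocab.setdefault(word, len(vocab))  # first-seen order gives the column
        self.vocabulary_ = vocab
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for word in doc.lower().split():
                if word in self.vocabulary_:        # unseen words are ignored here
                    row[self.vocabulary_[word]] += 1
            rows.append(row)
        return rows

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)


train = ["This movie is good", "The movie is good",
         "This movie is bad", "The movie is bad"]

veczr = TinyCountVectorizer()
trn_term_doc = veczr.fit_transform(train)             # builds the vocabulary
val_term_doc = veczr.transform(["This film is bad"])  # reuses the same columns

print(veczr.vocabulary_)
# {'this': 0, 'movie': 1, 'is': 2, 'good': 3, 'the': 4, 'bad': 5}
print(trn_term_doc[0])  # [1, 1, 1, 1, 0, 0]
print(val_term_doc[0])  # [1, 0, 1, 0, 0, 1]
```

Because the validation row is built from the training vocabulary, column 5 means "bad" in both matrices.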

What if the validation set has a different set of words from the training set? Yeah, that's a great question. So generally, most of these kinds of vocab-creating approaches will have a special token for unknown. Sometimes you'll also say: hey, if a word appears less than three times, call it unknown. But otherwise it's like, if you see something you haven't seen before, call it unknown, so that would just become a column in the bag of words: unknown. Good question. All right, so when we create this term document matrix of the training set, we have 25,000 rows, because there are 25,000 movie reviews, and there are 75,132 columns. What does that represent?

What does that mean, there are 75,132? Can you pass that to Devesh? Just a moment, you can pass it to Devesh. All vocabulary? Yeah, go on, what do you mean? So like the number of words... the union of the words... the number of unique words. Yeah, exactly, good.

Okay, now most documents don't have most of these 75,000 words, so we don't want to actually store that as a normal array in memory, because it's going to be very wasteful. So instead we store it as a sparse matrix, and what a sparse matrix does is it just stores it as something that says whereabouts the non-zeros are. So it says, okay, document number one, word number four appears, and it has four of them; document one, term number 123 appears, and it's a one, and so forth. That's basically how it's stored. There's actually a number of different ways of storing it, and if you do Rachel's computational linear algebra course you'll learn about the different types and why you choose them and how to convert and so forth, but they're all kind of something like this, and you don't really, on the whole, have to worry about the details. The important thing to know is it's efficient.
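One simple way to picture "store whereabouts the non-zeros are" is a list of (row, column, value) triples. The sketch below is only an illustration of that idea with made-up numbers; it is not how any particular sparse matrix library lays out its data, and the helper names are hypothetical.

```python
def to_sparse(dense_rows):
    """Keep only (row, column, value) triples for the non-zero entries."""
    return [(r, c, v)
            for r, row in enumerate(dense_rows)
            for c, v in enumerate(row)
            if v != 0]


def lookup(triples, r, c):
    """Anything not stored is implicitly zero."""
    for rr, cc, v in triples:
        if (rr, cc) == (r, c):
            return v
    return 0


dense = [
    [0, 0, 0, 0, 4, 0, 0, 0],  # word number four appears four times
    [0, 1, 0, 0, 0, 0, 0, 0],  # word number one appears once
]
triples = to_sparse(dense)

print(triples)                 # [(0, 4, 4), (1, 1, 1)]
print(lookup(triples, 0, 4))   # 4
print(lookup(triples, 1, 0))   # 0
```

Sixteen cells shrink to two stored entries here; with tens of thousands of columns and only a few dozen distinct words per review, the saving is far larger.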

Okay, and so we could grab the first review, and that gives us a 75,000-long sparse, one-row-long matrix with 93 stored elements. So in other words, 93 of those words are actually used in the first document. We can have a look at the vocabulary by saying vectorizer.get_feature_names(); that gives us the vocab. And so here's an example of a few of the elements of get_feature_names. I didn't intentionally pick the one that had "aussie", but, you know, that's the important words, obviously. I haven't used the tokenizer here.

I'm just splitting on space, so this isn't quite the same as what the vectorizer did, but to simplify things, let's grab a set of all the lowercase words. By making it a set we make them unique, so this is roughly the list of words that would appear, and that length is 91, which is pretty similar to 93, and the difference will just be that I didn't use a real tokenizer.
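The quick check just described, lowercasing, splitting on spaces, and taking a set, looks roughly like this. The review text below is made up for illustration and is not from the dataset; it is chosen to show why splitting on spaces gives a slightly different count from a real tokenizer.

```python
# A made-up review; the real one comes from the training set.
review = "This movie is good. The plot is absurd, truly absurd"

# Lowercase every space-separated chunk; the set keeps each one only once.
words = {w.lower() for w in review.split(" ")}
n_unique = len(words)

print(sorted(words))
# ['absurd', 'absurd,', 'good.', 'is', 'movie', 'plot', 'the', 'this', 'truly']
print(n_unique)  # 9
```

Note that "absurd," and "absurd" count as two different words here because the comma stays attached, which is the kind of small discrepancy a proper tokenizer removes.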

Yeah. All right, so that's basically all that's been done there: it's created this unique list of words and mapped them. We could check by calling vectorizer.vocabulary_ to find the ID of a particular word. So this is like the reverse map of this one: this is integer to word, here is word to integer. And so we saw "absurd" appeared twice in the first document, so let's check the train term doc at [0, 1297]: there it is, it's 2. Or else, unfortunately, "aussie" didn't appear in the unnatural-relationship-with-a-pig movie, so [0, 5000] is 0. Okay, so that's our term document matrix. Yes? So does it care about the relative relationship between the words, as in the ordering of the words? No, we've thrown away the orderings.

That's why it's a bag of words. And I'm not claiming that this is necessarily a good idea. What I will say is that the vast majority of NLP work that's been done over the last few decades generally uses this representation, because we didn't really know much better. Nowadays, increasingly, we're using recurrent neural networks instead, which we'll learn about in our last deep learning lesson of part one. But sometimes this representation works pretty well, and it's actually going to work pretty well in this case.

In fact, back when I was at FastMail, my email company, a lot of the spam filtering we did used this next technique, naive Bayes, which is a bag of words approach. Just kind of like, you know, if you're getting a lot of email containing the word Viagra, and it's always been spam, and you never get email from your friends talking about Viagra, then it's very likely that something that says Viagra, regardless of the detail of the language, is probably from a spammer. All right, so that's the basic theory about classification using a term document matrix. So let's talk about naive Bayes, and here's the basic idea.

We're going to start with our term document matrix. These first two are our corpus of positive reviews, these next two are our corpus of negative reviews, and so here's our whole corpus of all reviews. So what I could do now is create a probability. We tend to call these more generically features rather than words, right?

"This" is a feature, "movie" is a feature, "is" is a feature, right? So it's kind of more like machine learning language now: a column is a feature. We often call those f in naive Bayes. So we can basically say the probability that you would see the word "this", given that the class is one, given that it's a positive review, is just the average of how often you see "this" in the positive reviews, right?

Now, we've got to be a bit careful, though, because if you never ever see a particular word in a particular class, so if I've never received an email from a friend that said Viagra, that doesn't actually mean the probability of a friend sending me an email about Viagra is zero. It's not really zero, right?

I hope I don't get an email from Terrence tomorrow saying, "Jeremy, you probably could use this advertisement for Viagra," but it could happen, and, you know, I'm sure it'd be in my best interest. So what we do is we say: actually, what we've seen so far is not the full sample of everything that could happen. It's a sample of what's happened so far.

So let's assume that the next email you get actually does mention Viagra, and every other possible word. Basically, we're going to add a row of ones. That's like the email that contains every possible word, so that way nothing's ever infinitely unlikely. So I take the average of all of the times that "this" appears in my positive corpus, plus the ones. That's the probability that the feature "this" appears in a document, given that class equals one. And, not surprisingly, here's the same thing for the probability that the feature "this" appears given class equals zero: same calculation, except for the zero rows. And obviously these are the same, because "this" appears once in the positives and once in the negatives. Let's just put this back to what it was.

So we can do that for every feature, for every class. Our trick now is to use Bayes' rule to fill this in. What we want is: given that I've got this particular document (somebody sent me this particular email, or I have this particular IMDb review), what's the probability that its class is, I don't know, positive?
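The row-of-ones trick described above can be sketched in a few lines of NumPy. The four-review corpus and the helper name `smoothed_feature_probs` are assumptions made up for illustration (chosen so that the numbers match the two-thirds quoted later in the lesson); this is not the lesson's spreadsheet itself.

```python
import numpy as np

# Hypothetical toy corpus; columns follow `vocab`, rows are reviews.
vocab = ["this", "movie", "is", "good", "the", "bad"]
x = np.array([
    [1, 1, 1, 1, 0, 0],  # "this movie is good"  (positive)
    [0, 1, 1, 1, 1, 0],  # "the movie is good"   (positive)
    [1, 1, 1, 0, 0, 1],  # "this movie is bad"   (negative)
    [0, 1, 1, 0, 1, 1],  # "the movie is bad"    (negative)
])
y = np.array([1, 1, 0, 0])

def smoothed_feature_probs(x, y, cls):
    """Average occurrence of each feature in class `cls`, after adding
    one imaginary document that contains every word once."""
    rows = x[y == cls]
    return (rows.sum(axis=0) + 1) / (len(rows) + 1)

p_pos = smoothed_feature_probs(x, y, 1)  # p(f | class = 1)
p_neg = smoothed_feature_probs(x, y, 0)  # p(f | class = 0)

for word, pp, pn in zip(vocab, p_pos, p_neg):
    print(f"{word:>5}: p(f|1)={pp:.3f}  p(f|0)={pn:.3f}")
```

Because of the extra row, no feature ends up with probability zero in either class, even "bad" in the positive reviews.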

So for this particular movie review, what's the probability that its class is positive? We can say that's equal to the probability that we got this particular movie review given that its class is positive, multiplied by the probability that any movie review's class is positive, divided by the probability of getting this particular movie review. That's just Bayes' rule. And so we can calculate all of those things. But actually, what we really want to know is: is it more likely that this is class zero or class one?

So what if we took the probability that it's class one and divided by the probability that it's class zero? Then we could say: if this number is bigger than one, it's more likely to be class one; if it's smaller than one, it's more likely to be class zero. So in that case, we could just divide this whole thing by the same version for class zero, which is the same as multiplying it by the reciprocal. The nice thing is that's going to put a probability of the document, p(d), on top here, which we can get rid of, and the probability of getting the data given class zero down here, and the probability of class zero here. What that means is we want to calculate the probability that we would get this particular document given that the class is one, times the probability that the class is one, divided by the probability of getting this particular document given that the class is zero, times the probability that the class is zero.

The probability that the class is one is just equal to the average of the labels. The probability that the class is zero is just one minus that. So there are those two numbers. I've got an equal amount of both, so it's both 0.5. What is the probability of getting this document given that the class is one? Can anybody tell me how I would calculate that? Can somebody pass that, please?

"Look at all the documents which have class equal to one, and one divided by that will give you..."

So remember, though, it's going to be for a particular document. For example, we'd be asking: what's the probability that this review is positive? So you're on the right track, but what we're going to have to do is say: let's just look at the words it has, and then multiply the probabilities together for class equals one. So the probability that a class-one review has "this" is two-thirds, the probability it has "movie" is one, "is" is one, and "good" is one. So the probability it has all of them is all of those multiplied together. Kind of. And Tyler, why is it "not really"? Can you pass it to Tyler?

So glad you look horrified and skeptical. "Word choice is not independent." So nobody can call Tyler naive, because the reason this is naive Bayes is that this is what happens if you take Bayes' theorem in a naive way, and Tyler is not naive, anything but. So naive Bayes says: let's assume that if you have "this movie is bloody stupid, I hate it", the probability of "hate" is independent of the probability of "bloody", is independent of the probability of "stupid", right?
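Written out, the argument so far looks roughly like this. The symbols are a notational choice for this sketch rather than anything shown in the lesson: d is the document, c its class, and f ranges over the features (words) the document contains.

```latex
% Bayes' rule for the positive class
P(c = 1 \mid d) = \frac{P(d \mid c = 1)\, P(c = 1)}{P(d)}

% Ratio of the two classes: P(d) cancels
\frac{P(c = 1 \mid d)}{P(c = 0 \mid d)}
  = \frac{P(d \mid c = 1)\, P(c = 1)}{P(d \mid c = 0)\, P(c = 0)}

% The "naive" step: treat the words as independent given the class
P(d \mid c) \approx \prod_{f \in d} P(f \mid c)
```

If the ratio is above one, class one is the more likely label; the last line is the step Tyler objects to.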

Which is definitely not true. And so naive Bayes ain't actually very good, but I'm teaching it to you because it's going to turn out to be a convenient piece for something we're about to learn later. It's okay, right? I mean, I would never choose it. I don't think it's better than any other technique that's equally fast and equally easy. But it's a thing you can do, and it's certainly going to be a useful foundation.

So here is our calculation of the probability that we get this particular document assuming it's a positive review, here's the probability given it's a negative, and here's the ratio. This ratio is above one, so we're going to say: I think that this is probably a positive review.
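As a rough check of that spreadsheet calculation, here is the same ratio computed for one document. The toy corpus and the helper `smoothed` are hypothetical, invented for this sketch; only the procedure (multiply the per-word probabilities, then multiply by the class probabilities, then divide) comes from the lesson.

```python
import numpy as np

# Hypothetical toy corpus: columns are this, movie, is, good, the, bad.
x = np.array([
    [1, 1, 1, 1, 0, 0],  # "this movie is good"  (positive)
    [0, 1, 1, 1, 1, 0],  # "the movie is good"   (positive)
    [1, 1, 1, 0, 0, 1],  # "this movie is bad"   (negative)
    [0, 1, 1, 0, 1, 1],  # "the movie is bad"    (negative)
])
y = np.array([1, 1, 0, 0])

def smoothed(cls):
    """p(f | class) with the extra row of ones added."""
    rows = x[y == cls]
    return (rows.sum(axis=0) + 1) / (len(rows) + 1)

doc = x[0]           # the review "this movie is good"
present = doc > 0    # only the words the document actually has

# Naive assumption: multiply the per-word probabilities together.
p_doc_given_pos = smoothed(1)[present].prod()
p_doc_given_neg = smoothed(0)[present].prod()

# Class probabilities: the average of the labels, and one minus that.
p_pos = y.mean()
p_neg = 1 - p_pos

ratio = (p_doc_given_pos * p_pos) / (p_doc_given_neg * p_neg)
print("ratio:", ratio, "-> positive" if ratio > 1 else "-> negative")
```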

Okay, so that's the Excel version. And you can tell that I let Yannet touch this, because it's got LaTeX in it. We've got actual math. So here is the same thing: the log-count ratio for each feature f, each word f. And here it is written out as Python.

Our independent variable is our term-document matrix; the dependent variable is just the labels, y. So, using NumPy, this is going to grab the rows where the dependent variable is one. And then we can sum them over the rows to get the total word count for that feature across all the documents, right?

Plus one, right, because that's the email: Terrence is totally going to send me something about Viagra today, I can tell. So I'll do the same thing for the negative reviews. And then, of course, it's nicer to take the log, because if we take the log, we can add things together rather than multiply them together. Once you multiply enough of these things together, it's going to get so close to zero that you'll probably run out of floating-point precision, right?

So we take the log of the ratios. And then, as I say, we multiply that by (or, in log space, add that to) the ratio of the whole class probabilities. So, in order to say, for each document, multiply the Bayes probabilities by the counts, we can just use matrix multiply. And then, to add on the log of the class ratios, we can just use plus b. And so we end up with something that looks a lot like our logistic regression. But we're not learning anything, not from an SGD point of view; we're just calculating it using this theoretical model. And, as I said, we can then compare that as to whether it's bigger or smaller than zero (not one anymore, because we're now in log space), and then we can compare that to the labels, take the mean, and see how accurate it is.
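The steps just described can be sketched as follows. The names `r` and `b` follow the lecture's wording; the helper `pr` and the toy matrix are assumptions for illustration, and the real term-document matrix would be a large sparse matrix rather than a small dense array.

```python
import numpy as np

# Hypothetical toy term-document matrix: this, movie, is, good, the, bad.
x = np.array([
    [1, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 1],
])
y = np.array([1, 1, 0, 0])

def pr(y_i):
    """Smoothed feature frequencies for class y_i (the +1 is the row of ones)."""
    counts = x[y == y_i].sum(axis=0)
    return (counts + 1) / ((y == y_i).sum() + 1)

# Log-count ratio per feature, and log ratio of the class probabilities.
r = np.log(pr(1) / pr(0))
b = np.log((y == 1).mean() / (y == 0).mean())

# One matrix multiply scores every document; in log space the threshold is 0.
scores = x @ r + b
preds = (scores > 0).astype(int)
accuracy = (preds == y).mean()
print(preds, accuracy)
```

Here the accuracy is measured on the same four documents used to compute `r`, so it only shows the mechanics; the lesson's figure comes from a separate validation set.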

That's 80% accurate, 81% accurate. So naive Bayes is not nothing. It gave us something. It turns out that with this version, where we're actually looking at how often a word appears (like "absurd" appeared twice), at least for this problem, and quite often, it doesn't matter whether "absurd" appeared twice or once. All that matters is that it appeared.

So what people tend to try doing is to take the term-document matrix and go .sign(). .sign() replaces anything positive with one and anything negative with negative one. We don't have any negative counts, obviously, so this binarizes it. It says: I don't care that you saw "absurd" twice, I just care that you saw it. So if we do exactly the same thing with the binarized version, then you get a better result.
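A tiny illustration of the binarizing step, using `np.sign` on a made-up dense count matrix. The lesson applies the equivalent `.sign()` method to its sparse term-document matrix; the numbers below are invented for the sketch.

```python
import numpy as np

# Made-up word counts: two documents, three vocabulary terms.
counts = np.array([
    [2, 0, 1],   # first term appears twice
    [0, 3, 0],
])

# sign() maps every positive count to 1 and leaves zeros alone,
# so "appeared twice" and "appeared once" become the same thing.
binarized = np.sign(counts)
print(binarized)
```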

Okay, now this is the difference between theory and practice. In theory, naive Bayes sounds okay, but it's naive; unlike Tyler, it's naive. So what Tyler would probably do instead is say: rather than assuming that I should use these coefficients r, why don't we learn them? Does that sound reasonable, Tyler?

Yeah, okay, so let's learn them. We can totally learn them. So let's create a logistic regression and fit some coefficients, and that's going to literally give us something with exactly the same functional form that we had before. But now, rather than using a theoretical r and a theoretical b, we're going to calculate the two things based on logistic regression, and that's better.

So it's kind of like: why do something based on some theoretical model? Because theoretical models are pretty much never going to be as accurate as a data-driven model, unless you're dealing with some, I don't know, physics thing or something, where you're like: okay, this is actually how the world works, we're working in a vacuum, and this is the exact gravity, and blah blah blah. But for most of the real world, this is how things are: it's better to learn your coefficients than to calculate them.
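To make the "same functional form" point concrete, here is a minimal sketch with scikit-learn on a made-up toy matrix. The variable names `r_learned` and `b_learned` are hypothetical labels for this example; `coef_`, `intercept_`, `decision_function`, and `predict` are scikit-learn attributes and methods.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy term-document matrix: this, movie, is, good, the, bad.
x = np.array([
    [1, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 1],
])
y = np.array([1, 1, 0, 0])

m = LogisticRegression().fit(x, y)

# The fitted model is still "matrix multiply, plus b, compare with zero";
# the coefficients are learned rather than derived from count ratios.
r_learned = m.coef_[0]
b_learned = m.intercept_[0]
scores = x @ r_learned + b_learned
preds = (scores > 0).astype(int)
print(preds)
```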

Yes? Generally, what's this dual=True? I was hoping you'd ignore it, not notice, but you saw it. Basically, in this case our term-document matrix is much wider than it is tall. There is a mathematically almost equivalent reformulation of logistic regression that happens to be a lot faster when it's wider than it is tall. So the short answer is: anytime it's wider than it is tall, put dual=True, and this runs in like two seconds. If you don't have it here, it'll take a few minutes. In math there's this concept of dual versions of problems, which are equivalent versions that sometimes work better for certain situations.

Okay, so here is the binarized version, and it's about the same. You can see I've fitted it with the sign of the term-document matrix and predicted it with this.

Now, the thing is that this is going to be a coefficient for every term. There were about 75,000 terms in our vocabulary, and that seems like a lot of coefficients given that we've only got 25,000 reviews, so maybe we should try regularizing this. We can use the regularization built into sklearn's LogisticRegression class: C is the parameter that they use. This is slightly weird: a smaller parameter is more regularization, right?

So that's why I used 1e8 to basically turn off regularization here. So if I turn on regularization, set it to 0.1, then now it's 88 percent. Which makes sense: you would think, with 75,000 parameters for 25,000 documents, it's likely to overfit, indeed.
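A small sketch of the two settings side by side, on random made-up data that is wider than it is tall. The `solver="liblinear"` argument is an assumption added here: in recent scikit-learn versions the dual formulation is only available with that solver and an L2 penalty. The data, shapes, and variable names are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: 30 "documents", 300 binary "terms" (wider than tall).
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(30, 300)).astype(float)
y = np.array([0, 1] * 15)

# A very large C effectively switches regularization off.
unregularized = LogisticRegression(C=1e8, dual=True, solver="liblinear").fit(x, y)

# A small C means stronger L2 regularization.
regularized = LogisticRegression(C=0.1, dual=True, solver="liblinear").fit(x, y)

unreg_norm = np.linalg.norm(unregularized.coef_)
reg_norm = np.linalg.norm(regularized.coef_)
print("coefficient norm, C=1e8:", unreg_norm)
print("coefficient norm, C=0.1:", reg_norm)
```

The regularized fit ends up with much smaller coefficients; whether that helps validation accuracy depends on the data, which is why the lesson checks it on the validation set.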

It did overfit. So this is adding L2 regularization to avoid overfitting. I mentioned earlier that, as well as L2, which is looking at the weights squared, there's also L1, which is looking at just the absolute value of the weights. I was pretty sloppy in my wording before: I said that L2 tries to make things zero. That's kind of true, but if you've got two things that are highly correlated, then L2 regularization will move them both down together. It won't make one of them zero and one of them non-zero. L1 regularization actually has the property that it'll try to make as many things zero as possible, whereas L2 regularization has the property that it tends to try to make everything smaller. We actually don't care about that difference in really any modern machine learning, because we very rarely try to directly interpret the coefficients.
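For reference, the two penalties can be written as follows. The notation is an assumption for this sketch: the unpenalized loss is written as a script L, w_j are the coefficients, and lambda is a generic regularization strength (sklearn's C behaves like an inverse of it, which is why a smaller C means more regularization).

```latex
% L2: penalize the squared weights
\mathcal{L}_{\text{L2}}(w) = \mathcal{L}(w) + \lambda \sum_j w_j^2

% L1: penalize the absolute values of the weights
\mathcal{L}_{\text{L1}}(w) = \mathcal{L}(w) + \lambda \sum_j |w_j|
```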

We try to understand our models through interrogation, using the kinds of techniques that we've learned. The reason we would care about L1 versus L2 is simply which one ends up with a better error on the validation set, and you can try both. With sklearn's LogisticRegression, L2 actually turns out to be a lot faster, because you can't use dual=True unless you have L2. And L2 is the default, so I didn't really worry too much about that difference here. So you can see here, if we use regularization and binarized, we actually do pretty well.

Yes? Can you pass that back to W, please?

"Before, we learned about elastic net, right, like combining L1 and L2?"

Yeah, yeah, you can do that. But, I mean, with deeper models, I've never seen anybody find that useful.

Okay, so the last thing I'll mention is that when you do your CountVectorizer (wherever that was), you can also ask for n-grams. By default we get unigrams, that is, single words. But if we say ngram_range=(1,3), that's also going to give us bigrams and trigrams. By which I mean, if I now go ahead and do the CountVectorizer and get the feature names, now my vocabulary includes bigrams, "by vast" and "by vengeance", and trigrams, "by vengeance ." and "by vera miles". So this is now doing the same thing, but after tokenizing, it's not just grabbing each word and saying that's part of our vocabulary, but each two words next to each other and each three words next to each other. And this turns out to be super helpful in taking advantage of bag-of-words approaches, because we can now see the difference between, you know, "not good" versus "not bad" versus "not terrible", or even "good" in double quotes, which is probably going to be sarcastic. So using trigram features is actually going to turn out to make both naive Bayes and logistic regression quite a lot better.

It really takes us quite a lot further and makes them quite useful I have a question about the Tokenizers so you are saying some marks features, so how are these? Bigrams and trigrams selected right so Since I'm using a linear model I Didn't want to create too many features.

I mean it actually worked fine even without max features. I think I had something like I Can't remember 70 million coefficients. It still worked right, but just there's no need to have 70 million coefficients So if you say max features equals 800,000 The count vectorizer will sort the vocabulary by how often everything appears whether it be unigram by gram trigram And it will cut it off After the first 800,000 most common n grams n gram is just the generic word for unigram by gram and trigram so that's why the the train term doc dot shape is now 25,000 by 800,000 and like if you're not sure what number this should be I Just picked something that was really big and you know didn't didn't worry about it too much, and it seemed to be fine Like it's not terribly sensitive All right, okay, well, that's we're out of time so what we're going to see Next week and by the way you know we could have Replaced this logistic regression with our pytorch version and next week We'll actually see something in the fastai library that does exactly that but also what we'll see next week so next week tomorrow is How to combine logistic regression and naive Bayes together to get something that's better than either and then we'll learn how to move from there to create a Deeper neural network to get a pretty much state-of-the-art result for structured learning all right, so we'll see them