Machine Learning 1: Lesson 12
Chapters
0:00 Introduction
1:00 Recap
4:30 Durations
8:55 Zip
21:45 Embedding
25:50 Scaling
30:00 Validation
31:10 Check against
32:30 Create model
33:07 Define embedding dimensionality
38:41 Embedding matrices
45:11 Categorical data
00:00:00.000 |
I thought what we might do today is to finish off where we were in this Rossman notebook 00:00:10.720 |
looking at time series forecasting and structured data analysis. 00:00:16.640 |
And then we might do a little mini-review of everything we've learnt, because believe 00:00:23.720 |
it or not, this is the end, there's nothing more to know about machine learning other 00:00:28.880 |
than everything that you're going to learn next semester and for the rest of your life. 00:00:36.460 |
But anyway, I've got nothing else to teach, so we'll do a little review and then we'll 00:00:42.360 |
cover the most important part of the course, which is thinking about how to use this kind of 00:00:50.640 |
technology appropriately and effectively in a way that's hopefully beneficial. 00:01:02.160 |
So last time we got to the point where we talked a bit about this idea that when we 00:01:08.720 |
were looking at building this competition months open derived variable, that we actually 00:01:15.400 |
truncated it down to be no more than 24 months, and we talked about the reason why being that 00:01:20.240 |
we actually wanted to use it as a categorical variable, because categorical variables, thanks 00:01:24.760 |
to embeddings, have more flexibility in how the neural net can use them. 00:01:37.720 |
So let's keep working through this, because what's happening in this notebook is stuff 00:01:46.080 |
which is probably going to apply to most time series data sets that you work with. 00:01:53.840 |
And as we talked about, although we use df.apply here, this is something where it's running 00:01:59.160 |
a piece of Python code over every row, and that's horrifically slow. 00:02:06.520 |
So we only do that if we can't find a vectorized pandas or numpy function that can do it to 00:02:12.720 |
the whole column at once, but in this case I couldn't find a way to convert a year and 00:02:19.400 |
a week number into a date without using arbitrary Python. 00:02:29.160 |
Also worth remembering this idea of a lambda function, any time you're trying to apply 00:02:35.120 |
a function to every row of something, or every element of a tensor, or something like that, 00:02:40.240 |
if there isn't a vectorized version already, you're going to have to call something like 00:02:44.680 |
dataframe.apply, which will run a function you pass to every element. 00:02:52.040 |
So this is basically a map in functional programming. 00:03:00.120 |
Since very often the function that you want to pass to it is something you're just going 00:03:03.760 |
to use once and then throw it away, it's really common to use this lambda approach. 00:03:09.600 |
So this lambda is creating a function just for the purpose of telling df.apply what to do. 00:03:16.900 |
So we could also have written this in a different way, which would have been to define the function separately and give it a name. 00:04:09.380 |
So one approach is to define the function and then pass it by name, or the other is 00:04:14.060 |
to define the function in place using lambda. 00:04:18.440 |
And so if you're not comfortable creating and using lambdas, it's a good thing to practice, 00:04:25.640 |
and playing around with df.apply is a good way to practice it. 00:04:33.280 |
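As a rough illustration of that (a minimal sketch rather than the exact notebook code; the Year and Week column names and the date format are assumptions), here is the same df.apply done once with a named function and once with a throwaway lambda:

```python
from datetime import datetime

import pandas as pd

# Illustrative data; stands in for the Rossmann year/week columns
df = pd.DataFrame({'Year': [2015, 2015, 2016], 'Week': [1, 30, 52]})

# Option 1: define a named function and pass it by name
def week_to_date(row):
    # ISO year / ISO week / weekday 1 (Monday) -> a date; this runs as a
    # Python-level loop over rows, which is why df.apply is slow
    return datetime.strptime(f"{row['Year']}-W{int(row['Week']):02d}-1", "%G-W%V-%u")

df['Date'] = df.apply(week_to_date, axis=1)

# Option 2: the same thing as a throwaway lambda defined in place
df['Date'] = df.apply(
    lambda row: datetime.strptime(f"{row['Year']}-W{int(row['Week']):02d}-1",
                                  "%G-W%V-%u"),
    axis=1)
```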
So let's talk about this durations section, which may at first seem a little specific, 00:04:46.360 |
What we're going to do is we're going to look at three fields: promo, state holiday and school holiday. 00:04:55.200 |
And so basically what we have is a table for each store, for each date, does that store have a promo running on that date? 00:05:07.420 |
Is there a school holiday in that region of that store at that date? 00:05:12.600 |
Is there a state holiday in that region for that store at that date? 00:05:17.080 |
And so this kind of thing is, there are events, and time series with events are very common. 00:05:26.840 |
If you're looking at oil and gas drilling data, you're trying to say the flow through 00:05:32.200 |
this pipe, here's an event representing when it set off some alarm, or here's an event 00:05:42.800 |
And so most time series at some level will tend to represent some events. 00:05:49.760 |
So the fact that an event happened at a time is interesting itself, but very often a time 00:05:59.600 |
series will also show something happening before and after the event. 00:06:05.800 |
So for example, in this case we're doing grocery sales prediction. 00:06:10.660 |
If there's a holiday coming up, it's quite likely that sales will be higher before and 00:06:15.920 |
after the holiday, and lower during the holiday if this is a city-based store, because you're 00:06:23.960 |
going to stock up before you go away to bring things with you, and when you come back you've run out, so you need to stock up again. 00:06:34.360 |
So although we don't necessarily have to do this kind of feature engineering to create 00:06:41.240 |
features specifically about this is before or after a holiday, the more 00:06:48.200 |
we can give the neural net the kind of information it needs, the less it's going to have to learn 00:06:54.120 |
it itself, the more we can do with the data we already 00:06:58.360 |
have, and the more we can do with the size of architecture we already have. 00:07:03.880 |
So feature engineering, even with stuff like neural nets, is still important because it 00:07:10.920 |
means that we'll be able to get better results with whatever limited data we have, whatever 00:07:21.240 |
So the basic idea here, therefore, is when we have events in our time series as we want 00:07:26.480 |
to create two new columns for each event, how long is it going to be until the next 00:07:32.240 |
time this event happens, and how long has it been since the last time that event happened. 00:07:38.160 |
So in other words, how long until the next state holiday, how long since the previous 00:07:44.840 |
So that's not something which I'm aware of as existing as a library or anything like that. 00:07:55.360 |
And so importantly, I need to do this by store. 00:08:03.120 |
For this store, when was this store's last promo, so how long has it been since the last 00:08:08.720 |
time it had a promo, how long it will be until the next time it has a promo, for instance. 00:08:17.560 |
So here's what I'm going to do, I'm going to create a little function that's going to 00:08:23.720 |
take a field name, and I'm going to pass it each of promo and then state holiday and then school holiday. 00:08:31.200 |
So we'll say field = school holiday, and then we'll call get_elapsed with that field and 'After'. 00:08:41.920 |
So we've got a first of all sort by store and date. 00:08:47.000 |
So now when we loop through this, we're going to be looping through within a store, so store 00:08:51.120 |
number 1, January the 1st, January the 2nd, January the 3rd, and so forth. 00:08:57.120 |
And as we loop through each store, we're basically going to say, is this row a school holiday 00:09:05.920 |
And if it is a school holiday, then we'll keep track of this variable called last_date, which 00:09:10.040 |
says this is the last date where we saw a school holiday. 00:09:15.720 |
And so then we're basically going to append to our result the number of days since the last school holiday. 00:09:32.600 |
So I could actually write this much more simply. 00:09:36.800 |
I could basically go through for row in df.iterrows() and then grab the fields we want from each row. 00:09:55.360 |
It turns out this is 300 times slower than the version that I have. 00:10:01.280 |
And basically iterating through a data frame and extracting specific fields out of a row like that is really slow. 00:10:14.800 |
What's much faster is to iterate through a numpy array. 00:10:22.080 |
So if you take a series like df.store and add .values after it, that grabs a numpy array out of that series, so I create three of those arrays. 00:10:32.920 |
One is the store IDs, one is whatever field we're interested in, in this case let's say school_holiday, and one is the dates. 00:10:43.600 |
So now what I want to do is loop through the first one of each of those lists, and then 00:10:50.520 |
the second one of each of those lists, and then the third one of each of those lists. 00:10:55.920 |
I need to do something like this in basically every notebook I write, and the way to do that is with zip. 00:11:02.640 |
So zip means loop through each of these lists one at a time, and then this here is where 00:11:09.920 |
we can grab that element out of the first list, the second list and the third list. 00:11:16.440 |
So if you haven't played around much with zip, that's a really important function to get familiar with. 00:11:22.360 |
Like I say, I use it in pretty much every notebook I write all the time. 00:11:28.660 |
Any time you have to loop through a bunch of lists at the same time, zip is the way to do it. 00:11:36.080 |
So we're going to loop through every store, every school_holiday, every date, yes. 00:11:45.580 |
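Here is a tiny, hedged illustration of that zip pattern; the column names are just stand-ins for the real Rossmann columns:

```python
import pandas as pd

# Illustrative data frame standing in for the real training set
df = pd.DataFrame({'Store':         [1, 1, 2],
                   'SchoolHoliday': [0, 1, 0],
                   'Date':          pd.to_datetime(['2015-01-01',
                                                    '2015-01-02',
                                                    '2015-01-01'])})

# .values gives plain numpy arrays, which are far faster to loop over
# than df.iterrows()
for store, holiday, date in zip(df.Store.values,
                                df.SchoolHoliday.values,
                                df.Date.values):
    print(store, holiday, date)
```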
So in this case we basically want to say let's grab the first store, the first school_holiday, and the first date. 00:12:03.240 |
So for store_1, January 1st school_holiday was true or false. 00:12:10.520 |
And so if it is a school_holiday, I'll keep track of that fact by saying the last date I saw a school_holiday was this date. 00:12:19.360 |
And then append, how long has it been since the last school_holiday? 00:12:25.320 |
And if the store_id is different to the last store_id I saw, then I've now got to a whole 00:12:32.040 |
new store, in which case I have to basically reset everything. 00:12:40.280 |
What will happen to the first points that we don't have the last holiday? 00:12:46.460 |
I basically set this to some arbitrary starting point, so it's going to end up with a very large number of days for those first rows. 00:13:00.280 |
And you may need to replace this with a missing value afterwards, or some zero, or whatever. 00:13:15.320 |
The nice thing is, thanks to values, it's very easy for a neural net to kind of cut 00:13:24.860 |
So in this case I didn't do anything special with it, I ended up with a negative a billion 00:13:36.300 |
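Putting those pieces together, here is a sketch in the spirit of that get_elapsed function (the details may differ from the actual notebook); it assumes the data frame is sorted by store and then date:

```python
import numpy as np
import pandas as pd

def get_elapsed(df, fld, prefix):
    """For each row, days since the last time df[fld] was true, per store."""
    one_day = np.timedelta64(1, 'D')
    start = np.datetime64('1900-01-01')   # arbitrary starting point -> huge values early on
    last_date = start
    last_store = None
    res = []
    for s, v, d in zip(df.Store.values, df[fld].values, df.Date.values):
        if s != last_store:                # moved on to a new store: reset everything
            last_date = start
            last_store = s
        if v:                              # the event happened on this row
            last_date = d
        res.append((d - last_date) / one_day)   # days since the last event
    df[prefix + fld] = res

# Sorting by Store and Date ascending gives "days since the last event";
# sorting the dates descending and calling it again gives "days until the next".
```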
So we can go through, and the next thing to note is there's a whole bunch of stuff that 00:13:42.720 |
I need to do to both the training set and the test set. 00:13:46.400 |
So in the previous section I actually kind of added this little loop where I go for each 00:13:51.760 |
of the training data frame and the test data frame do these things. 00:13:58.640 |
So each cell I did for each of the data frames. 00:14:02.920 |
I've now got a whole series of cells that I want to run first of all for the training set, and then for the test set. 00:14:12.720 |
So in this case the way I did that was I had two different cells here. 00:14:16.340 |
One which sets df to be the training set, and one which sets it to be the test set. 00:14:20.760 |
So the way I use this is I run just this cell, and then I run all the cells underneath, so 00:14:28.280 |
it does it all for the training set, and then I come back and run just this cell and then run everything underneath again for the test set. 00:14:35.240 |
So this notebook is not designed to be just run from top to bottom, but it's designed to be used interactively. 00:14:43.280 |
And I mention that because this can be a handy trick to know, you could of course put all 00:14:49.720 |
the stuff underneath in a function that you pass the data frame to and call it once with 00:14:54.640 |
the test set and once with the training set, but I kind of like to experiment a bit more 00:15:00.640 |
interactively, look at each step as I go, so this way is an easy way to kind of run something 00:15:05.620 |
on two different data frames without turning it into a function. 00:15:12.560 |
So if I sort by store and by date, then this is keeping track of the last time something 00:15:20.160 |
happened, and so this is therefore going to end up telling me how many days it has been since the last event. 00:15:28.080 |
So now if I sort date descending and call the exact same function, then it's going to tell me how many days it will be until the next event. 00:15:41.600 |
So that's a nice little trick for adding these kind of arbitrary event timers into your time 00:15:49.880 |
So if you're doing, for example, the Ecuadorian Groceries competition right now, maybe this 00:15:55.840 |
kind of approach would be useful for various events in that as well. 00:16:01.160 |
Do it for state holiday, do it for promo, here we go. 00:16:11.760 |
The next thing that we look at here is rolling functions. 00:16:19.120 |
So rolling in pandas is how we create what we call windowing functions. 00:16:32.600 |
Let's say I had some data, something like this, and this is like date, and I don't know 00:16:51.960 |
this is like sales or whatever, what I could do is I could say let's create a window around 00:17:00.640 |
this point of like 7 days, so it would be like okay, this is a 7 day window. 00:17:11.480 |
And so then I could take the average sales in that 7 day window, and I could do the same 00:17:17.880 |
thing like I don't know, over here, take the average sales over that 7 day window. 00:17:26.180 |
And so if we do that for every point and join up those averages, you're going to end up with a moving average. 00:17:35.820 |
So the more generic version of the moving average is a window function, i.e. something 00:17:46.020 |
where you apply some function to some window of data around each point. 00:17:52.440 |
Now very often the windows that I've shown here are not actually what you want, if you're 00:17:58.240 |
trying to build a predictive model you can't include the future as part of a moving average. 00:18:04.800 |
So quite often you actually need a window that ends here, so that would be our window 00:18:14.880 |
And so Pandas lets you create arbitrary window functions using this rolling here. 00:18:25.640 |
This here says how many time steps do I want to apply the function to. 00:18:32.400 |
This here says if I'm at the edge, so in other words if I'm like out here, should you make 00:18:39.200 |
that a missing value because I don't have 7 days to average over, or what's the minimum number of time steps to use. 00:18:48.480 |
So here I said 1, and then optionally you can also say do you want to set the window at 00:18:55.080 |
the start of a period, or the end of a period, or the middle of the period. 00:19:02.120 |
And then within that you can apply whatever function you like. 00:19:13.000 |
So there's a nice easy way of getting moving averages, or whatever else. 00:19:20.920 |
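As a minimal example of that (assuming a simple integer window of 7 observations, which in pandas is a trailing window by default, so it doesn't look into the future):

```python
import pandas as pd

# Made-up daily sales series
sales = pd.Series([10, 12, 9, 14, 20, 18, 15, 22],
                  index=pd.date_range('2015-01-01', periods=8))

# 7-observation trailing rolling mean; min_periods=1 means the first few
# points use whatever data is available instead of becoming missing values
rolling_mean = sales.rolling(7, min_periods=1).mean()
print(rolling_mean)
```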
And I should mention, in Pandas, if you go to the time series page of the documentation, there's 00:19:29.400 |
a huge amount there; just look at the index of the time series functionality section, all of this. 00:19:40.880 |
Because Wes McKinney, who created this, was originally in hedge fund trading, I believe, 00:19:50.840 |
And so I think Pandas originally was very focused on time series, and it's still perhaps strongest there. 00:19:58.640 |
So if you're playing around with time series computations, you definitely owe it to yourself to read through this whole page. 00:20:08.880 |
And there's a lot of conceptual pieces around time stamps, and date offsets, and resampling, 00:20:19.280 |
and stuff like that to kind of get your head around, but it's totally worth it because 00:20:23.880 |
otherwise you'll be writing this stuff as loops by hand, it's going to take you a lot 00:20:28.240 |
longer than leveraging what Pandas already does, and of course Pandas will do it in highly 00:20:34.760 |
optimized C code for you, vectorized C code, whereas your version is going to loop in Python. 00:20:41.040 |
So definitely worth, if you're doing stuff in time series learning, the full Pandas time 00:20:47.840 |
series API is about as strong as any time series API out there. 00:20:54.960 |
Okay, so at the end of all that, you can see here's those kind of starting point values 00:21:01.280 |
I mentioned, slightly on the extreme side, and so you can see here the 17th of September 00:21:10.560 |
store 1 was 13 days after the last school holiday, the 16th was 12, then 11, 10, and so forth. 00:21:19.080 |
We're currently in a promotion, here this is one day before the promotion, here we've 00:21:26.560 |
got 9 days after the last promotion, and so forth. 00:21:32.640 |
So that's how we can add kind of event counters to a time series, and probably always a good 00:21:40.840 |
idea when you're doing work with time series. 00:21:46.780 |
So now that we've done that, we've got lots of columns in our dataset, and so we split 00:21:53.360 |
them out into categorical versus continuous columns, we'll talk more about that in a moment 00:22:01.480 |
So these are going to be all the things I'm going to create an embedding for. 00:22:05.480 |
And these are all of the things that I'm going to feed directly into the model. 00:22:11.960 |
So for example, we've got competition distance, that's distance to the nearest competitor, 00:22:18.860 |
maximum temperature, and here we've got day of week. 00:22:31.320 |
So here we've got maximum temperature, maybe it's like 22.1, centigrade in Germany, we've 00:22:41.160 |
got distance to nearest competitor, might be 321 kilometers, 0.7, and then we've got day of week. 00:22:58.760 |
So these numbers here are going to go straight into our vector, the vector that we're going to feed into our neural net. 00:23:16.480 |
We'll see in a moment we'll normalize them, but more or less. 00:23:29.200 |
But this categorical variable we're not, we need to put it through an embedding. 00:23:34.280 |
So we'll have some embedding matrix of, if there are 7 days, maybe dimension 4 embedding, 00:23:45.280 |
and so this will look up the 6th row to get back the 4 items. 00:23:52.920 |
And so this is going to turn into length 4 vector, which we'll then add here. 00:24:09.280 |
So that's how our continuous and categorical variables are going to work. 00:24:22.120 |
So then all of our categorical variables, we'll turn them into Pandas categorical variables. 00:24:33.920 |
And then we're going to apply the same mappings to the test set. 00:24:38.000 |
So if Saturday is a 6 in the training set, this apply_cats makes sure that Saturday is a 6 in the test set as well. 00:24:48.040 |
For the continuous variables, make sure they're all floats because PyTorch expects everything to be a float. 00:24:57.960 |
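Here is a plain-pandas sketch of that idea (apply_cats itself is a fast.ai helper; this just shows the underlying mechanism with made-up data):

```python
import pandas as pd

train = pd.DataFrame({'Day': ['Mon', 'Sat', 'Sun']})
test = pd.DataFrame({'Day': ['Sat', 'Mon']})

# Train defines the categories (and hence the integer codes)
train['Day'] = train['Day'].astype('category')

# Reuse exactly the same categories on the test set
test['Day'] = pd.Categorical(test['Day'],
                             categories=train['Day'].cat.categories)

print(train['Day'].cat.codes.tolist())   # e.g. [0, 1, 2]
print(test['Day'].cat.codes.tolist())    # Sat and Mon map to the same codes as in train
```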
So then, this is another little trick that I use. 00:25:03.080 |
Both of these cells define something called joined_samp. 00:25:07.760 |
One of them defines it as a random sample, and the other defines it as the whole training set. 00:25:16.200 |
And so the idea is that I do all of my work on the sample, make sure it all works well, 00:25:21.560 |
play around with different hyperparameters and architectures, and then I'm like, "Okay, 00:25:26.880 |
I then go back and run this line of code to say, "Okay, now make joined_samp be the whole data set." 00:25:36.000 |
This is a good way, again, similar to what I showed you before, it lets you use the same 00:25:40.400 |
cells in your notebook to run first of all on a sample, and then go back later and run 00:25:52.280 |
So now that we've got that joined_samp, we can then pass it to proc_df as we've done 00:25:57.200 |
before to grab the dependent variable and to deal with missing values, and in this case we also pass do_scale=True. 00:26:09.080 |
do_scale=True will subtract the mean and divide by the standard deviation. 00:26:18.240 |
And so the reason for that is that if our first layer is just a matrix multiply, so 00:26:25.720 |
here's our set of weights, and our input is like, I don't know, it's got something which 00:26:32.520 |
is like 0.001, and then it's got something which is like 10^6, and then our weight matrix 00:26:41.440 |
has been initialized to be like random numbers between 0 and 1, so we've got 0.6, 0.1, etc. 00:26:50.720 |
Then basically this thing here is going to have gradients that are 9 orders of magnitude 00:26:57.240 |
bigger than this thing here, which is not going to be good for optimization. 00:27:03.720 |
So by normalizing everything to be mean of 0, standard deviation of 1 to start with, 00:27:10.520 |
then that means that all of the gradients are going to be on the same kind of scale. 00:27:19.320 |
We didn't have to do that in random forests, because in random forests we only cared about the sort order. 00:27:26.220 |
We didn't care about the values at all, but with linear models and things that are built 00:27:33.920 |
out of layers of linear models, like neural nets, we care very much about the scale. 00:27:45.840 |
Now since it normalizes our data for us, it returns one extra object, which is a mapper, 00:27:52.080 |
which is an object that contains for each continuous variable what was the mean and 00:27:56.960 |
standard deviation it was normalized with, the reason being that we're going to have 00:28:02.740 |
to use the same mean and standard deviation on the test set, because we need our test 00:28:09.480 |
set and our training set to be scaled in the exact same way, otherwise they're going to mean different things. 00:28:16.000 |
And so these details about making sure that your test and training set have the same categorical 00:28:23.240 |
codings, the same missing value replacement and the same scaling normalization are really 00:28:30.560 |
important to get right, because if you don't get it right then your test set is not going to be treated the same way as your training set. 00:28:41.080 |
But if you follow these steps, it'll work fine. 00:28:45.280 |
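A minimal sketch of that scaling step, with made-up numbers, to show that the test set is scaled with the training set's statistics rather than its own:

```python
import numpy as np

train_col = np.array([0.001, 1e6, 500.0, 2.5])   # one continuous column, training set
test_col = np.array([3.0, 2e5])                  # same column, test set

# The "mapper" stores these training-set statistics
mean, std = train_col.mean(), train_col.std()

train_scaled = (train_col - mean) / std
test_scaled = (test_col - mean) / std            # reuse train's stats, not the test set's own
```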
We also take the log of the dependent variable, and that's because in this Kaggle competition 00:28:51.340 |
the evaluation metric was root mean squared percent error. 00:28:55.960 |
So root mean squared percent error means we're being penalized based on the ratio between the actual value and the prediction. 00:29:07.120 |
We don't have a loss function in PyTorch called root mean squared percent error. 00:29:12.160 |
We could write one, but easier is just to take the log of the dependent, because the 00:29:17.040 |
difference between logs is the same as the log of the ratio. 00:29:24.180 |
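A quick numeric check of that claim, with made-up numbers:

```python
import numpy as np

# log(pred) - log(actual) = log(pred / actual), so RMSE on the logged
# dependent variable penalizes ratios, much like RMSPE does.
actual = np.array([100.0, 1000.0])
pred = np.array([110.0, 1100.0])        # both predictions are 10% too high

rmspe = np.sqrt(np.mean(((actual - pred) / actual) ** 2))
rmse_of_logs = np.sqrt(np.mean((np.log(pred) - np.log(actual)) ** 2))
print(rmspe, rmse_of_logs)              # both treat the two errors as equally bad
```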
You'll notice the vast majority of regression competitions on Kaggle use either root mean 00:29:32.360 |
squared percent error or root mean squared error of the log as their evaluation metric, 00:29:37.680 |
and that's because in real-world problems most of the time we care more about ratios than about absolute differences. 00:29:46.140 |
So if you're designing your own project, it's quite likely that you'll want to think about 00:30:01.080 |
So then we create a validation set, and as we've learned before, most of the time if you've 00:30:06.700 |
got a problem involving a time component, your validation set probably wants to be the most 00:30:12.720 |
recent time period rather than a random subset, so that's what I do here. 00:30:20.240 |
When I finished modeling and I found an architecture and a set of hyperparameters and a number 00:30:24.880 |
of epochs and all that stuff that works really well, if I want to make my model as good as 00:30:29.640 |
possible I'll retrain on the whole thing, including the validation set. 00:30:36.400 |
Now currently at least fastAI assumes that you do have a validation set, so my kind of 00:30:41.640 |
hacky workaround is to set my validation set to just be one index, which is the first row, 00:30:48.000 |
in that way all the code keeps working but there's no real validation set. 00:30:53.000 |
So obviously if you do this you need to make sure that your final training is like the 00:30:59.040 |
exact same hyperparameters, the exact same number of epochs, exactly the same as the thing 00:31:04.040 |
that worked, because you don't actually have a proper validation set now to check against. 00:31:08.960 |
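A sketch of both of those ideas with a stand-in data frame (the 10% cutoff is arbitrary):

```python
import numpy as np
import pandas as pd

# Stand-in for the date-sorted training frame
df = pd.DataFrame({'Date': pd.date_range('2015-01-01', periods=100),
                   'Sales': np.random.rand(100)}).sort_values('Date')

n = len(df)
val_idx = list(range(int(n * 0.9), n))   # last 10% of rows = most recent dates

# For the final retrain on all the data (keeping every other setting identical):
# a dummy one-row "validation set" so the library's plumbing still works.
# val_idx = [0]
```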
I have a question regarding get elapsed function which we discussed before, so in get elapsed 00:31:17.680 |
function we are trying to find when will the next holiday come? 00:31:28.160 |
So every year the holidays are more or less fixed, like there will be holiday on 4th of 00:31:33.520 |
July, 25th of December and there's hardly any change. 00:31:37.440 |
So can't we just look at previous years and just get a list of all the holidays that are going to happen? 00:31:46.480 |
Maybe, in this case I guess that's not true of promo, and some holidays change, like Easter, 00:31:56.200 |
so this way I get to write one piece of code that works for all of them, and it doesn't take too long to run anyway. 00:32:08.400 |
So there might be ways, if your dataset was so big that this took too long you could maybe 00:32:13.160 |
do it on one year and then somehow copy it, but in this case there was no need to. 00:32:18.160 |
And I always value my time over my computer's time, so I try to keep things as simple as I can. 00:32:31.320 |
So now we can create our model, and so to create our model we have to create a model 00:32:37.160 |
data object, as we always do with fast.ai, so a columnar model object is just a model 00:32:42.680 |
data object that represents a training set, a validation set, and an optional test set 00:32:53.560 |
We just have to tell it which of the variables should we treat as categorical. 00:33:07.720 |
So for each of our categorical variables, here is the number of categories it has. 00:33:17.540 |
So for each of our embedding matrices, this tells us the number of rows in that embedding matrix. 00:33:26.560 |
And so then we define what embedding dimensionality we want. 00:33:34.800 |
If you're doing natural language processing, then the number of dimensions you need to 00:33:39.680 |
capture all the nuance of what a word means and how it's used has been found empirically to be about 600. 00:33:48.560 |
It turns out that when you do NLP models with embedding matrices that are smaller than 600, 00:33:59.400 |
you don't get as good results as you do with size 600, and beyond 600 it doesn't seem to help much. 00:34:07.280 |
I would say that human language is one of the most complex things that we model, so 00:34:14.040 |
I wouldn't expect you to come across many, if any, categorical variables that need embedding dimensionalities bigger than that. 00:34:25.120 |
At the other end, some things may have pretty simple kind of causality. 00:34:33.960 |
So for example, state holiday, maybe if something's a holiday, then it's just a case of stores 00:34:49.860 |
that are in the city behaving one way, and stores that are in the country behaving another way. 00:35:02.040 |
So ideally when you decide what embedding size to use, you would kind of use your knowledge 00:35:11.280 |
about the domain to decide how complex the relationship is, and so how big the embedding needs to be. 00:35:24.920 |
You would only know that because maybe somebody else has previously done that research and 00:35:32.160 |
So in practice, you probably need to use some rule of thumb, and then having tried your 00:35:38.680 |
rule of thumb, you could then maybe try a little bit higher and a little bit lower and 00:35:43.000 |
see what helps, so it's kind of experimental. 00:35:46.880 |
My rule of thumb is look at how many discrete values the category has, i.e. the number of 00:35:54.760 |
rows in the embedding matrix, and make the dimensionality of the embedding half of that. 00:36:00.840 |
So for day of week, which is the second one, 8 rows and 4 columns. 00:36:10.160 |
So here it is there, the number of categories divided by 2. 00:36:18.000 |
So here you can see for stores, there's 1000 stores, I only have a dimensionality of 50. 00:36:23.000 |
I don't know, it seems to have worked okay so far. 00:36:26.100 |
You may find you need something a little different. 00:36:29.400 |
Actually for the Ecuadorian groceries competition, I haven't really tried playing with this, but 00:36:34.400 |
I think we may need some larger embedding sizes, but it's something to fiddle with. 00:36:44.840 |
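As a sketch of that rule of thumb (the cap of 50 and the exact (name, cardinality) pairs below are illustrative, not necessarily what the notebook uses):

```python
# Embedding width of roughly half the cardinality, capped at 50
cat_sz = [('Store', 1116), ('DayOfWeek', 8), ('StateHoliday', 5)]

emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]
print(emb_szs)   # [(1116, 50), (8, 4), (5, 3)]
```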
So as your variables, the cardinality size becomes larger and larger, you're creating 00:36:50.200 |
more and more or wider embedding matrices, aren't you therefore massively risking overfitting, 00:36:57.200 |
because you're just choosing so many parameters that the model can never possibly capture 00:37:00.280 |
all that variation unless your data is absolutely huge? 00:37:04.800 |
And so let me remind you about my kind of golden rule of the difference between modern machine learning and old machine learning. 00:37:13.280 |
In old machine learning, we control complexity by reducing the number of parameters. 00:37:18.160 |
In modern machine learning, we control complexity by regularization. 00:37:22.240 |
So the answer is no, I'm not concerned about overfitting, because the way I avoid overfitting 00:37:27.640 |
is not by reducing the number of parameters, but by increasing my dropout or increasing my weight decay. 00:37:37.960 |
Having said that, there's no point using more parameters for a particular embedding than 00:37:44.240 |
I need, because regularization is penalizing a model by giving it more random data or by 00:37:52.720 |
actually penalizing weights, so we'd rather not use more than we have to. 00:37:59.520 |
But my general rule of thumb for designing an architecture is to be generous on the side of more parameters. 00:38:08.400 |
But in this case, if after doing some work we felt like, you know what, the store doesn't 00:38:16.560 |
actually seem to be that important, then I might manually go and change this to make its embedding smaller. 00:38:23.360 |
Or if I was really finding there's not enough data here, I'm either overfitting or I'm using 00:38:28.880 |
more regularization than I'm comfortable with, again, then you might go back. 00:38:32.960 |
But I would always start with being generous with parameters, and in this case, this model worked fine. 00:38:42.440 |
So now we've got a list of tuples containing the number of rows and columns of each of our embedding matrices. 00:38:48.240 |
And so when we call get-learner to create our neural net, that's the first thing we 00:38:52.480 |
pass in, is how big is each of our embeddings. 00:38:58.360 |
And then we tell it how many continuous variables we have. 00:39:02.760 |
We tell it how many activations to create for each layer, and we tell it what dropout to use. 00:39:18.600 |
So then we fit for a while, and we're kind of getting something around the 0.1 mark. 00:39:25.620 |
So I tried running this on the test set and I submitted it to Kaggle during the week, actually 00:39:46.580 |
So let's have a look and see how that would go. 00:39:48.860 |
So we got 0.107 private and 0.103 public, so let's start on the public leaderboard, which is 0.103: not up near the top at all. 00:40:25.960 |
Let's try the private leaderboard, which is 0.107. Oh, 5th. 00:40:36.960 |
So hopefully you're now thinking, oh, there are some Kaggle competitions finishing soon, 00:40:42.600 |
which I entered, and I spent a lot of time trying to get good results on the public leaderboard. 00:40:51.600 |
The Kaggle public leaderboard is not meant to be a replacement for your carefully developed validation set. 00:41:01.300 |
So for example, if you're doing the iceberg competition, which ones are ships, which ones 00:41:06.600 |
are icebergs, then they've actually put something like 4000 synthetic images into the public 00:41:13.520 |
leaderboard and none into the private leaderboard. 00:41:18.080 |
So this is one of the really good things that Kaggle tests you on, which is, are you creating 00:41:27.880 |
a good validation set and are you trusting it? 00:41:30.880 |
Because if you're trusting your leaderboard feedback more than your validation feedback, 00:41:36.960 |
then you may find yourself in 350th place when you thought you were in 5th. 00:41:43.200 |
So in this case, we actually had a pretty good validation set, because as you can see, 00:41:47.920 |
it's saying somewhere around 0.1, and we actually did get somewhere around 0.1. 00:41:55.840 |
And so in this case, the public leaderboard in this competition was entirely useless. 00:42:07.760 |
So in regards to that, how much does the top of the public leaderboard actually correspond to the private leaderboard? 00:42:14.880 |
Because in the churn prediction challenge, there's like four people who are just completely ahead of everybody else. 00:42:26.960 |
If they randomly sample the public and private leaderboard, then it should be extremely indicative. 00:42:45.360 |
So in this case, the person who was second on the public leaderboard did end up winning. 00:43:02.440 |
So in fact you can see the little green thing here, whereas this guy jumped 96 places. 00:43:11.520 |
If we had entered with a neural net, we just looked at it, we would have jumped 350 places. 00:43:18.060 |
And so often you can figure out whether the public leaderboard -- like sometimes they'll 00:43:24.440 |
tell you the public leaderboard was randomly sampled, sometimes they'll tell you it's not. 00:43:29.000 |
Generally you have to figure it out by looking at the correlation between your validation 00:43:33.440 |
set results and the public leaderboard results to see how well they're correlated. 00:43:40.880 |
Sometimes if two or three people are way ahead of everybody else, they may have found some 00:43:57.040 |
So that's Rossman, and that brings us to the end of all of our material. 00:44:06.440 |
So let's come back after the break and do a quick review, and then we will talk about using this technology appropriately. 00:44:29.480 |
So there are two main ways we've learned to train a model: one is by building a tree, and one is with SGD. 00:44:36.280 |
And so the SGD approach is a way we can train a model which is a linear model or a stack 00:44:46.240 |
of linear layers with nonlinearities between them, whereas tree building specifically will 00:44:55.800 |
And then tree building we can combine with bagging to create a random forest, or with 00:45:01.680 |
boosting to create a GBM, or various other slight variations such as extremely randomized trees. 00:45:12.200 |
So it's worth reminding ourselves of what these things do. 00:45:33.320 |
So if we've got some data like so, actually let's look specifically at categorical data. 00:45:48.640 |
So categorical data, there's a couple of possibilities of what categorical data might look like. 00:45:54.500 |
It could be like, let's say we've got zip code, so we've got 94003 as our zip code, 00:46:01.440 |
and then we've got sales, and it's like 50, and 94131, sales of 22, and so forth. 00:46:18.080 |
So there's a couple of ways we could represent that categorical variable. 00:46:23.340 |
One would be just to use the number, and maybe it wasn't a number at all, maybe our categorical 00:46:30.960 |
variable is like San Francisco, New York, Mumbai, and Sydney. 00:46:39.960 |
But we can turn it into a number just by arbitrarily deciding to give them numbers. 00:46:47.840 |
So we could just use that kind of arbitrary number. 00:46:51.080 |
So if it turns out that zip codes that are numerically next to each other have somewhat 00:46:59.660 |
similar behavior, then the zip code versus sales chart might look something like this. 00:47:16.340 |
Or alternatively, if the two zip codes next to each other didn't have in any way similar 00:47:28.720 |
sales behavior, you would expect to see something that looked more like this, just all over 00:47:44.960 |
So what a random forest would do if we had just encoded zip in this way is it's going 00:47:50.740 |
to say, alright, I need to find my single best split point. 00:47:56.880 |
The split point is going to make the two sides have as small a standard deviation as possible, 00:48:03.520 |
or mathematically equivalently have the lowest root mean squared error. 00:48:08.040 |
So in this case it might pick here as our first split point, because on this side there's 00:48:18.240 |
one average, and on the other side there's the other average. 00:48:23.880 |
And then for its second split point it's going to say, okay, how do I split this? 00:48:29.400 |
And it's probably going to say I would split here, because now we've got this average versus that average. 00:48:40.200 |
And then finally it's going to say, okay, how do we split here? 00:48:44.080 |
And it's going to say, okay, I'll split there. 00:48:47.080 |
So now I've got that average and that average. 00:48:50.200 |
So you can see that it's able to kind of hone in on the set of splits it needs, even though 00:48:56.360 |
it kind of does it greedily, top down one at a time. 00:48:59.760 |
The only reason it wouldn't be able to do this is if it was just such bad luck that 00:49:05.080 |
the two halves were kind of always exactly balanced, but even if that happens it's not 00:49:10.680 |
going to be the end of the world, it will split on something else, some other variable, 00:49:15.080 |
and next time around it's very unlikely that it's still going to be exactly balanced in the same way. 00:49:26.640 |
In the second case, it can do exactly the same thing. 00:49:31.120 |
It'll say, okay, which is my best first split, even though there's no relationship between 00:49:38.040 |
one zip code and its neighboring zip code numerically. 00:49:41.280 |
We can still see here if it splits here, there's the average on one side, and the average on the other. 00:49:54.800 |
Probably here, because here's the average on one side, here's the average on the other side. 00:50:00.320 |
So again, it can do the same thing, it's going to need more splits because it's going to 00:50:04.000 |
end up having to kind of narrow down on each individual large zip code and each individual 00:50:08.680 |
small zip code, but it's still going to be fine. 00:50:12.320 |
So when we're dealing with building decision trees for random forests or GBMs or whatever, 00:50:20.600 |
we tend to encode our variables just as ordinals. 00:50:27.640 |
On the other hand, if we're doing a neural network, or like a simplest version, like a 00:50:35.000 |
linear regression or a logistic regression, the best it could do is that, which is no 00:50:43.640 |
good at all, and ditto with this one, it's going to be like that. 00:50:48.360 |
So an ordinal is not going to be a useful encoding for a linear model or something that's built out of linear layers. 00:51:01.640 |
So instead, what we do is we create a one-hot encoding. 00:51:05.840 |
So we'll say, here's 1, 0, 0, 0; here's 0, 1, 0, 0; here's 0, 0, 1, 0; and here's 0, 0, 0, 1. 00:51:18.360 |
And so with that encoding, it can effectively create a little histogram where it's going 00:51:24.800 |
to have a different coefficient for each level. 00:51:29.000 |
So that way it can do exactly what it needs to do. 00:51:32.240 |
At what point does that become too tedious for your system, or does it not? 00:51:48.000 |
Because remember, in real life we don't actually have to create that matrix, instead we can 00:51:55.920 |
just have the 4 coefficients and just do an index lookup to grab the second one, which 00:52:02.920 |
is mathematically equivalent to multiplying by the one-hot encoding. 00:52:15.400 |
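A tiny numeric demonstration of that equivalence:

```python
import numpy as np

# Multiplying a one-hot vector by a coefficient vector is the same as just
# indexing into it; an embedding is exactly that lookup.
coeffs = np.array([0.2, -1.3, 0.7, 0.05])   # one coefficient per category

one_hot = np.array([0.0, 0.0, 1.0, 0.0])    # category number 2
via_matmul = one_hot @ coeffs               # 0.7
via_lookup = coeffs[2]                      # 0.7, no big matrix needed

assert via_matmul == via_lookup
```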
One thing to mention, I know you guys have been taught quite a bit of more analytical 00:52:24.920 |
And in analytical solutions to linear regression, you can't solve something with this amount of collinearity. 00:52:37.880 |
In other words, you know something is Sydney if it's not Mumbai or New York or San Francisco. 00:52:45.980 |
In other words, there's 100% collinearity between the fourth of these classes versus 00:52:52.560 |
And so if you try to solve a linear regression analytically that way, the whole thing falls apart. 00:52:57.520 |
Now note, with SGD we have no such problem, like SGD, why would it care? 00:53:03.480 |
We're just taking one step along the derivative. 00:53:07.000 |
It cares a little, because in the end the main problem with collinearity is that there's 00:53:13.760 |
an infinite number of equally good solutions. 00:53:17.680 |
So in other words, we could increase all of these and decrease this, or decrease all of 00:53:23.400 |
these and increase this, and they're going to balance out. 00:53:28.240 |
And when there's an infinitely large number of good solutions, that means there's a lot 00:53:32.600 |
of kind of flat spots in the loss surface, and it can be harder to optimize. 00:53:38.960 |
So there's a really easy way to get rid of all of those flat spots, which is to add a little bit of regularization. 00:53:43.600 |
So if we added a little bit of weight decay, like 1e-7 even, then that basically says 00:53:50.120 |
these are not all equally good anymore, the one which is the best is the one where the 00:53:54.400 |
parameters are the smallest and the most similar to each other, and so that will again move us back to a single best solution. 00:54:02.280 |
Could you just clarify that point you made about why one-hot encoding wouldn't be that much of a problem? 00:54:13.920 |
If we have a one-hot encoded vector, and we are multiplying it by a set of coefficients, 00:54:27.520 |
then that's exactly the same thing as simply saying let's grab the thing where the 1 is. 00:54:33.440 |
So in other words, if we had stored this as a 0, and this one as a 1, and this one as 00:54:40.120 |
a 2, then it's exactly the same as just saying look up that thing in the array. 00:54:50.480 |
So an embedding is a weight matrix you can multiply by a one-hot encoding, and it's just 00:55:03.840 |
a computational shortcut, but it's mathematically the same. 00:55:03.840 |
So there's a key difference between solving linear-type models analytically versus with SGD. 00:55:14.200 |
With SGD we don't have to worry about collinearity and stuff, or at least not nearly to the same 00:55:18.880 |
degree, and then the difference between solving a linear or a single layer or multilayer model 00:55:27.760 |
with SGD versus a tree, a tree is going to complain about less things. 00:55:34.040 |
So in particular you can just use ordinals as your categorical variables. 00:55:39.140 |
And as we learned just before, we also don't have to worry about normalizing continuous 00:55:45.200 |
variables for a tree, but we do have to worry about it for these SGD-trained models. 00:55:54.120 |
So then we also learned a lot about interpreting random forests in particular. 00:56:00.800 |
And if you're interested, you might try to use those same techniques to interpret neural nets. 00:56:11.840 |
So if you want to know which of my features are important in a neural net, you could try the same thing. 00:56:16.920 |
Try shuffling each column in turn and see how much it changes your accuracy, and that's 00:56:23.400 |
going to be your feature importance for your neural net. 00:56:26.840 |
And then if you really want to have fun, recognize then that shuffling that column is just a 00:56:33.300 |
way of calculating how sensitive the output is to that input, which in other words is 00:56:38.760 |
the derivative of the output with respect to that input. 00:56:43.640 |
And so therefore maybe you could just ask PyTorch to give you the derivatives with respect 00:56:47.840 |
to the input directly, and see if that gives you the same kind of answers. 00:56:55.120 |
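Here is a hedged sketch of what that shuffling approach could look like for any fitted model; model and score are placeholders rather than any particular library's API:

```python
import numpy as np
import pandas as pd

def permutation_importance(model, X: pd.DataFrame, y, score):
    """Shuffle one column at a time and measure how much the score drops.

    model: anything with a .predict method; score: a metric where higher is better.
    """
    base = score(y, model.predict(X))
    importances = {}
    for col in X.columns:
        X_shuffled = X.copy()
        X_shuffled[col] = np.random.permutation(X_shuffled[col].values)
        importances[col] = base - score(y, model.predict(X_shuffled))
    return importances
```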
You could do the same kind of thing for a partial dependence plot, you could try doing 00:56:59.200 |
the exact same thing with your neural net, replace everything in a column with the same 00:57:03.560 |
value, do it for 1960, 1961, 1962, plot that. 00:57:08.880 |
I don't know of anybody who's done these things before, not because it's rocket science, but 00:57:13.600 |
just because I don't know, maybe no one thought of it, or it's not in a library, but if somebody 00:57:20.240 |
tried it, I think you should find it useful, it would make a great blog post, maybe even 00:57:24.600 |
a paper if you wanted to take it a bit further. 00:57:27.680 |
So that's a thought for something you could do. 00:57:29.320 |
So most of those interpretation techniques are not particularly specific to random forests. 00:57:34.720 |
Things like the tree interpreter certainly are, because they're all about what's inside the tree. 00:57:43.400 |
Say we were applying something like the tree interpreter to neural nets. 00:57:46.200 |
How are we going to make inference out of activations that the path follows, for example? 00:57:55.240 |
We're looking at the paths and their contributions of the features. 00:58:02.360 |
In this case, it will be same with activations, I guess, the contributions of each activation 00:58:10.880 |
How can we make inference out of the activations? 00:58:14.680 |
So I'd be careful saying the word inference, because people normally use the word inference 00:58:17.960 |
specifically to mean the same as a test time prediction. 00:58:22.820 |
You might say you want some kind of way to interrogate the model. 00:58:28.640 |
Actually Hinton and one of his students just published a paper on how to approximate a 00:58:32.900 |
neural net with a tree for this exact reason, which I haven't read the paper yet. 00:58:44.040 |
So in linear regression and traditional statistics, one of the things that we focused on was statistical 00:58:50.440 |
significance of like the changes and things like that. 00:58:53.600 |
And so when thinking about a tree interpreter or even like the waterfall chart, which I 00:58:58.160 |
guess is just a visualization, I guess where does that fit in? 00:59:02.920 |
Because we can see like, oh, yeah, this looks important in the sense that it causes large 00:59:08.440 |
But how do we know that it's like traditionally statistically significant or anything of that sort? 00:59:14.800 |
So most of the time I don't care about the traditional statistical significance, and 00:59:18.720 |
the reason why is that nowadays the main driver of statistical significance is data volume, 00:59:29.460 |
And nowadays most of the models you build will have so much data that like every tiny 00:59:34.380 |
thing will be statistically significant, but most of them won't be practically significant. 00:59:39.960 |
So my main focus therefore is practical significance, which is, is the size of this influence big enough to matter in practice? 00:59:50.720 |
Statistical significance was much more important when we had a lot less data to work with. 00:59:57.680 |
If you do need to know statistical significance, because for example you have a very small 01:00:01.760 |
dataset because it's like really expensive to label or hard to collect or whatever, or 01:00:06.080 |
it's a medical dataset for a rare disease, you can always get statistical significance 01:00:11.240 |
by bootstrapping, which is to say that you can randomly resample your dataset a number 01:00:17.980 |
of times, train your model a number of times, and you can then see the actual variation in your predictions. 01:00:25.840 |
So with bootstrapping, you can turn any model into something that gives you confidence intervals. 01:00:31.800 |
There's a paper by Michael Jordan which has a technique called the bag of little bootstraps 01:00:37.480 |
which actually kind of takes this a little bit further, well worth reading if you're interested. 01:00:46.920 |
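A sketch of that bootstrapping idea; fit_and_predict is a placeholder for training whatever model you like and returning the quantity you want an interval for:

```python
import numpy as np

def bootstrap_interval(X, y, fit_and_predict, n_boot=100):
    """Resample with replacement, refit, and return a rough 95% interval.

    Assumes X and y are numpy arrays indexable by an integer array.
    """
    n = len(y)
    results = []
    for _ in range(n_boot):
        idx = np.random.randint(0, n, n)          # sample n rows with replacement
        results.append(fit_and_predict(X[idx], y[idx]))
    results = np.array(results)
    return np.percentile(results, [2.5, 97.5], axis=0)
```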
So you said we don't need a one-hot encoding matrix if we are doing a random forest or another tree-based model. 01:00:55.640 |
What will happen if we do that and how bad can a model be? 01:01:02.960 |
We actually did do it, remember we had that maximum category size and we did create one-hot 01:01:07.920 |
encodings and the reason why we did it was that then our feature importance would tell 01:01:13.920 |
us the importance of the individual levels, and in our partial dependence plot we could see the individual levels as well. 01:01:20.120 |
So it doesn't necessarily make the model worse, it may make it better, but it probably won't make much difference. 01:01:30.840 |
This is something that we have noticed on real data also that if cardinality is higher, 01:01:36.200 |
let's say 50 levels, and if you do one-hot encoding, the random forest performs very poorly. 01:01:44.080 |
That's why in fast.ai we have that maximum categorical size because at some point your 01:01:59.360 |
Also because when you get past that it becomes less useful because of the feature importance 01:02:04.160 |
there's going to be too many levels to really look at. 01:02:07.960 |
So can it not look at those levels which are not important and just give those significant 01:02:30.800 |
Once the cardinality increases too high you're just splitting your data up too much basically. 01:02:33.800 |
And so in practice your ordinal version is likely to be better. 01:02:47.160 |
There's no time to kind of review everything, but I think that's the key concepts and then 01:02:50.840 |
of course remembering that the embedding matrix that we can use is likely to have more than 01:02:55.680 |
just one coefficient, we'll actually have a dimensionality of a few coefficients which 01:03:00.440 |
isn't going to be useful for most linear models, but once you've got multi-layer models that's 01:03:05.700 |
now creating a representation of your category which is quite a lot richer, and you can do a lot more with it. 01:03:17.560 |
We started off early in this course talking about how actually a lot of machine learning is focused on the wrong thing. 01:03:28.880 |
People focus on predictive accuracy like Amazon has a collaborative filtering algorithm for 01:03:35.280 |
recommending books, and they end up recommending the book which it thinks you're most likely to buy. 01:03:42.800 |
And so what they end up doing is probably recommending a book that you already have 01:03:47.360 |
or that you already know about and would have bought anyway, which isn't very valuable. 01:03:51.800 |
What they should instead have done is to figure out which book can I recommend that would change your buying behavior. 01:04:00.480 |
And so that way we actually maximize our lift in sales due to recommendations. 01:04:06.640 |
And so this idea of the difference between optimizing and influencing your actions versus 01:04:13.860 |
just improving predictive accuracy is a really important distinction which is very rarely 01:04:23.140 |
discussed, crazily enough. 01:04:28.700 |
It's more discussed in industry, it's particularly ignored in most of academia. 01:04:33.920 |
So it's a really important idea, which is that in the end the goal of your model is presumably to influence what somebody does. 01:04:44.680 |
And remember I actually mentioned a whole paper I have about this where I introduce 01:04:48.840 |
this thing called the drivetrain approach where I talk about ways to think about how 01:04:53.160 |
to incorporate machine learning into how do we actually influence behavior. 01:05:01.080 |
So that's a starting point, but then the next question is like okay if we're trying to influence 01:05:06.040 |
behavior, what kind of behavior should we be influencing and how and what might it mean 01:05:17.000 |
Because nowadays a lot of the companies that you're going to end up working at are big 01:05:24.100 |
ass companies and you'll be building stuff that can influence millions of people. 01:05:33.640 |
So I'm actually not going to tell you what it means because I don't know, all I'm going 01:05:39.080 |
to try and do is make you aware of some of the issues and make you believe two things 01:05:46.560 |
First, that you should care, and second, that they're big current issues. 01:05:54.640 |
The main reason I want you to care is because I want you to want to be a good person and 01:06:00.400 |
show you that not thinking about these things will make you a bad person. 01:06:04.840 |
But if you don't find that convincing I will tell you this: Volkswagen were found to be cheating on their emissions tests. 01:06:16.240 |
The person who was sent to jail for it was the programmer that implemented that piece of code. 01:06:25.740 |
And so if you're coming in here thinking, "Hey, I'm just a techie, I'll just do what 01:06:30.320 |
I'm told, that's my job is to do what I'm told." 01:06:34.320 |
I'm telling you if you do that you can be sent to jail for doing what you're told. 01:06:40.280 |
So A) don't just do what you're told because you can be a bad person, and B) you can go to jail. 01:06:49.720 |
Second thing to realize is in the heat of the moment you're in a meeting with 20 people 01:06:55.120 |
at work and you're all talking about how you're going to implement this new feature and everybody's 01:07:00.140 |
discussing it, and everybody's like, "We can do this, and here's a way of modeling it, 01:07:04.800 |
and then we can implement it, and here's these constraints." 01:07:06.480 |
And there's some part of you that's thinking, "Am I sure we should be doing this?" 01:07:12.280 |
That's not the right time to be thinking about that, because it's really hard to step up 01:07:17.320 |
then and say, "Excuse me, I'm not sure this is a good idea." 01:07:22.520 |
You actually need to think about how you would handle that situation ahead of time. 01:07:27.280 |
So I want you to think about these issues now and realize that by the time you're in 01:07:34.720 |
the middle of it, you might not even realize it's happening. 01:07:40.120 |
It'll just be a meeting, like every other meeting, and a bunch of people will be talking 01:07:46.960 |
And you need to be able to recognize, "Oh, this is actually something with ethical implications." 01:07:53.480 |
So Rachel actually wrote all of these slides, I'm sorry she can't be here to present this 01:07:59.120 |
because she's studied this in depth, and she's actually been in difficult environments herself 01:08:06.520 |
where she's kind of seen these things happening. 01:08:12.440 |
We know how hard it is, but let me give you a sense of what happens. 01:08:17.880 |
So engineers trying to solve engineering problems and causing problems is not a new thing. 01:08:28.040 |
So in Nazi Germany, IBM, the group known as Hollerith, Hollerith was the original name 01:08:37.640 |
of IBM, and it comes from the guy who actually invented the use of punch cards for tracking 01:08:42.440 |
the US Census, the first mass, wide-scale use of punch cards for data collection in 01:08:51.200 |
So at this point, this unit was still called Hollerith. 01:08:53.800 |
So Hollerith sold a punch card system to Nazi Germany. 01:09:01.680 |
And so each punch card would code, you know, this is a Jew, 8, Gypsy, 12, general execution, and so forth. 01:09:12.680 |
And so here's one of these cards describing the right way to kill these various people. 01:09:17.900 |
And so a Swiss judge ruled that IBM's technical assistance facilitated the tasks of the Nazis 01:09:25.040 |
in commission of their crimes against humanity. 01:09:27.520 |
This led to the death of something like 20 million civilians. 01:09:33.180 |
So according to the Jewish Virtual Library, where I got these pictures and quotes from, 01:09:38.280 |
their view is that the destruction of the Jewish people became even less important because 01:09:43.040 |
of the invigorating nature of IBM's technical achievement, only heightened by the fantastical profits to be made. 01:09:51.760 |
So this was a long time ago, and hopefully you won't end up working at companies that are involved in anything like that. 01:09:59.920 |
But perhaps you will, because perhaps you'll go to Facebook, who are facilitating a genocide right now. 01:10:07.240 |
And I know people at Facebook who are doing this, and they had no idea they were doing it. 01:10:14.840 |
So right now, the Rohingya are in the middle of a genocide, a Muslim population in Myanmar. 01:10:24.240 |
Babies are being grabbed out of their mother's arms and thrown into fires. 01:10:28.640 |
People are being killed, hundreds of thousands of refugees. 01:10:32.520 |
When interviewed, the Myanmar generals doing this say, "We are so grateful to Facebook 01:10:40.280 |
for letting us know about the Rohingya fake news, that these people are actually not human." 01:10:51.760 |
Now Facebook did not set out to enable the genocide of the Rohingya people in Myanmar. 01:10:58.320 |
No, instead what happened is they wanted to maximize impressions and clicks. 01:11:03.840 |
And so it turns out that for the data scientists at Facebook, their algorithms kind of learned 01:11:08.960 |
that if you take the kinds of stuff people are interested in and feed them slightly more 01:11:13.960 |
extreme versions of that, you're actually going to get a lot more impressions. 01:11:18.560 |
And the project managers are saying maximize these impressions, and people are clicking 01:11:26.720 |
And so the potential implications are extraordinary and global. 01:11:34.680 |
And this is something that is literally happening, this is October 2017, it's happening now. 01:11:48.640 |
So I just want to clarify what was happening here. 01:11:51.120 |
So it was the facilitation of fake news or inaccurate media? 01:12:00.080 |
So what happened was in mid-2016, Facebook fired its human editors. 01:12:07.720 |
So it was humans that decided how to order things on your homepage. 01:12:12.880 |
Those people got fired and replaced with machine learning algorithms. 01:12:16.840 |
And so the machine learning algorithms written by data scientists like you, they had nice 01:12:25.000 |
clear metrics and they were trying to maximize their predictive accuracy and be like okay, 01:12:30.100 |
we think if we put this thing higher up than this thing, we'll get more clicks. 01:12:35.280 |
And so it turned out that these algorithms for putting things on the Facebook news feed 01:12:41.280 |
had a tendency to say like oh, human nature is that we tend to click on things which stimulate 01:12:48.360 |
our views and therefore like more extreme versions of things we already see. 01:12:53.520 |
So this is great for the Facebook revenue model of maximizing engagement. 01:13:02.600 |
And so at the time, there was some negative press about like I'm not sure that the stuff 01:13:10.160 |
that Facebook is now putting on their trending section is actually that accurate, but from 01:13:16.280 |
the point of view of the metrics that people were optimizing at Facebook, it looked terrific. 01:13:22.600 |
And so way back to October 2016, people started noticing some serious problems. 01:13:29.120 |
For example, it is illegal to target housing to people of certain races in America. 01:13:38.160 |
And yet a news organization discovered that Facebook was doing exactly that in October 2016. 01:13:45.480 |
Again, not because somebody in that data science team said let's make sure black people can't 01:13:50.760 |
live in nice neighborhoods, but because their automatic clustering and 01:13:58.480 |
segmentation algorithm found there was a cluster of people who didn't like African Americans, 01:14:04.960 |
and that if you targeted them with these kinds of ads then they would be more likely to click. 01:14:12.400 |
But the interesting thing is that even after being told about this three times, Facebook still hadn't fixed it. 01:14:20.780 |
And that is to say these are not just technical issues, they're also economic issues. 01:14:25.200 |
When fixing it means changing the thing that you get paid for, that is ads, you have to change 01:14:30.840 |
the way that you structure those so that you either use more people, which costs money, or 01:14:37.280 |
you are less aggressive in how your algorithms target people based on minority group status, and that costs revenue. 01:14:48.440 |
So the reason I mention this is you will likely at some point in your career find yourself 01:14:53.880 |
in a conversation where you're thinking I'm not confident that this is morally okay, 01:15:01.000 |
the person you're talking to is thinking in their head this is going to make us a lot 01:15:04.160 |
of money, and you don't quite ever manage to have a successful conversation because you're each optimizing for different things. 01:15:13.940 |
And so when you're talking to somebody who may be more experienced and more senior than 01:15:17.800 |
you and they may sound like they know what they're talking about, just realize that their 01:15:21.960 |
incentives are not necessarily going to be focused on like how do I be a good person. 01:15:28.880 |
They're not thinking how do I be a bad person, but in my experience the more time you spend in industry, 01:15:34.200 |
the more desensitized you get, and the easier it becomes to forget that getting 01:15:40.760 |
promotions and making money isn't the most important thing. 01:15:45.240 |
So for example, I've got a lot of friends who are very good at computer vision and some 01:15:50.840 |
of them have gone on to create startups that seem like they're almost tailor-made to help 01:15:57.000 |
authoritarian governments surveil their citizens. 01:16:01.880 |
And when I ask my friends like have you thought about how this could be used in that way, 01:16:08.200 |
they're generally kind of offended that I ask, but I'm asking you to think about this. 01:16:17.280 |
Wherever you end up working, if you end up creating a startup, tools can be used for 01:16:23.600 |
good or for evil, and so I'm not saying don't create excellent object tracking and detection 01:16:31.400 |
tools from computer vision, because you could go on and use that to create a much better 01:16:39.200 |
surgical intervention robot toolkit, just saying be aware of it, think about it, talk to people about it. 01:16:50.320 |
So here's one I find fascinating. There's this really cool thing that meetup.com 01:16:55.560 |
did; this is from a meetup.com talk that's online. 01:17:00.640 |
They actually thought about this: you know what, if we built a collaborative 01:17:05.000 |
filtering system like we learned about in class to help people decide what meetup to 01:17:11.360 |
go to, it might notice that on the whole in San Francisco, a few more men than women tend to go to techie meetups. 01:17:21.560 |
And so it might then start to decide to recommend techie meetups to more men than women, as 01:17:28.320 |
a result of which, more men will go to techie meetups. 01:17:32.600 |
As a result of which, when women go to techie meetups, they'll be like oh, this is all men, and maybe they won't come back. 01:17:38.760 |
As a result of which, the algorithm will get new data saying that men like techie meetups more than women do. 01:17:46.300 |
And so a little bit of that initial push from the algorithm can create this runaway feedback 01:17:54.560 |
loop and you end up with almost all male techie meetups, for instance. 01:18:00.840 |
And so this kind of feedback loop is a subtle issue that you really want to think 01:18:07.120 |
about when you're asking yourself, what is the behavior that I'm changing with this algorithm? 01:18:18.440 |
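To make that loop concrete, here is a minimal sketch in Python of the dynamic being described. It is not anything meetup.com actually runs; the attendance numbers, the "show it to whoever looks like past attendees" rule, and the drop-off factor for women are all invented purely for illustration.

```python
# Minimal sketch of a runaway feedback loop in a recommender.
# All numbers and the update rule below are invented for illustration only.

def simulate_feedback_loop(men_interested=500, women_interested=500, rounds=10):
    # Start from a tiny historical imbalance in who attended the first event.
    men_attended, women_attended = 52.0, 48.0
    for t in range(rounds):
        total = men_attended + women_attended
        # The recommender shows the techie meetup to each group in proportion
        # to how often that group attended in the past (it is optimizing clicks).
        p_show_men = men_attended / total
        p_show_women = women_attended / total
        # Next round's attendance is roughly interest x exposure; women who do
        # attend see a mostly male room and are slightly less likely to return.
        men_attended = men_interested * p_show_men
        women_attended = women_interested * p_show_women * 0.9
        share_men = men_attended / (men_attended + women_attended)
        print(f"round {t}: {share_men:.0%} of attendees are men")

simulate_feedback_loop()
```

Under these made-up assumptions, a 52/48 starting split keeps drifting towards a male-dominated audience round after round, even though the underlying interest in the two groups never changed.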
So another example, which is kind of terrifying, is in this paper where the authors describe 01:18:28.080 |
how a lot of police departments in the US are now using predictive policing algorithms. 01:18:34.800 |
So where can we go to find somebody who's about to commit a crime? 01:18:40.420 |
And so you know that the algorithm simply feeds back to you basically the data that you gave it. 01:18:49.400 |
So if your police department has engaged in racial profiling at all in the past, then 01:18:57.240 |
it might suggest slightly more often maybe you should go to the black neighborhoods to 01:19:04.040 |
As a result of which, more of your police officers go to the black neighborhoods. 01:19:07.360 |
As a result of which, they arrest more black people. 01:19:10.080 |
As a result of which, the data says that the black neighborhoods are less safe. 01:19:14.120 |
As a result of which, the algorithm says to the policeman, maybe you should go to the 01:19:17.120 |
black neighborhoods more often, and so forth. 01:19:21.080 |
And this is not like vague possibilities of something that might happen in the future, 01:19:29.720 |
this is like documented work from top academics who have carefully studied the data and the algorithms. 01:19:37.200 |
This is like serious scholarly work, it's like no, this is happening right now. 01:19:42.560 |
And so again, I'm sure the people that started creating this predictive policing algorithm 01:19:49.580 |
didn't think like how do we arrest more black people, hopefully they were actually thinking 01:19:54.860 |
gosh I'd like my children to be safer on the streets, how do I create a safer society? 01:20:02.720 |
But they didn't think about this nasty runaway feedback loop. 01:20:09.060 |
So this one about social network algorithms is actually from an article in the New 01:20:13.720 |
York Times recently about one of my friends, Renee DiResta, and she did something kind of interesting. 01:20:20.920 |
She set up a second Facebook account, like a fake Facebook account, and she was very 01:20:27.200 |
interested in the anti-vax movement at the time. 01:20:30.680 |
So she started following a couple of anti-vaxxers and visited a couple of anti-vaxxer links. 01:20:39.400 |
And so suddenly her news feed starts getting full of anti-vaxxer news, along with other 01:20:46.240 |
stuff like chemtrails, and deep state conspiracy theories, and all this stuff. 01:20:53.440 |
And so she's like, 'huh', starts clicking on those. 01:20:57.180 |
And the more she clicked, the more hardcore far-out conspiracy stuff Facebook recommended. 01:21:05.280 |
So now when Renee goes to that Facebook account, the whole thing is just full of angry, crazy, 01:21:14.360 |
far-out conspiracy stuff, like that's all she sees. 01:21:18.180 |
And so if that was your world, then as far as you're concerned, it's just like this continuous stream of that stuff. 01:21:30.200 |
And so again, to answer your question, this is the kind of runaway feedback loop that 01:21:37.580 |
ends up telling the Myanmar generals, you know, throughout their Facebook homepage, that these people are animals. 01:22:01.960 |
So bias in image software comes from bias in data. 01:22:08.160 |
And so most of the folks I know at Google Brain building computer vision algorithms 01:22:19.460 |
are training those algorithms with photos of their families and friends, 01:22:24.440 |
which means they are training them with very few people of color. 01:22:27.180 |
And so when FaceApp then decided, we're going to try looking at lots of Instagram photos 01:22:34.300 |
to see which ones are upvoted the most, without them necessarily realizing it, the answer 01:22:44.160 |
they got skewed towards white faces. So then they built a generative model to make you hotter. 01:22:48.440 |
And so this is the actual photo, and here is the hotter version. 01:22:53.040 |
So the hotter version is more white, less nostrils, more European looking. 01:23:01.240 |
And so this did not go down well, to say the least. 01:23:07.720 |
So again, I don't think anybody at FaceApp said, let's create something that makes people look whiter. 01:23:15.840 |
They just trained it on a bunch of images of the people that they had around them. 01:23:21.560 |
And this has kind of serious commercial implications as well. 01:23:27.360 |
They had to pull this feature, and they got a huge amount of negative pushback. 01:23:33.460 |
Here's another example: Google Photos created this photo classifier with labels like airplanes and skyscrapers, and it labeled a photo of two black people as gorillas. 01:23:45.980 |
So think about how this looks to most people. 01:23:50.920 |
Most people look at this, they don't know about machine learning, and they conclude that 01:23:56.740 |
somebody at Google wrote some code to take black people and call them gorillas. 01:24:05.960 |
We know what actually happened: the team of computer vision experts at Google, who have 01:24:15.800 |
few or no people of color working on the team, built a classifier using all the photos they had to hand. 01:24:23.720 |
And so when the system came across a person with dark skin, it was like, I've mainly 01:24:32.480 |
seen that before amongst gorillas, so I'll put it in that category. 01:24:36.920 |
So again, the bias in the data creates a bias in the software, and again, the commercial implications were serious. 01:24:44.920 |
Google really got a lot of bad PR from this, as they should. 01:24:49.800 |
This was a photo that somebody put in their Twitter feed. 01:24:53.080 |
They said, look what Google Photos just decided to do. 01:24:59.560 |
You can imagine what happened with the first international beauty contest judged by artificial intelligence. 01:25:05.240 |
Basically it turns out all the beautiful people are white. 01:25:08.840 |
So you kind of see this bias in image software, thanks to bias in the data, thanks to lack 01:25:16.680 |
of diversity in the teams building it, you see the same thing in natural language processing. 01:25:24.480 |
So here is an example from Turkish: 'o' is the pronoun in Turkish which has no gender, but of course in English 01:25:39.520 |
we don't really have a widely used un-gendered singular pronoun, so Google Translate has to pick 'he' or 'she', and it picks based on the stereotypes in its training data. 01:25:50.400 |
Now there are plenty of people who saw this online and said, literally, so what? 01:25:59.640 |
It is correctly feeding back the usual usage in English. 01:26:06.000 |
I know how this is trained, this is like Word2Vec vectors, it was trained on the Google News corpus, 01:26:11.480 |
Google Books corpus, it's just telling us how things are. 01:26:15.880 |
And from a point of view, that's entirely true. 01:26:20.120 |
The biased data used to create this biased algorithm is the actual data of how people have written over the years. 01:26:32.080 |
But does that mean that this is the product that you want to create? 01:26:38.080 |
Does this mean this is the product you have to create? 01:26:41.340 |
Just because the particular way you've trained the model means it ends up doing this, is that really okay? 01:26:49.340 |
And can you think of potential negative implications and feedback loops this could create? 01:26:55.660 |
And if any of these things bother you, then now, lucky you, you have a new cool engineering 01:27:01.120 |
problem to work on, like how do I create unbiased NLP solutions? 01:27:06.380 |
And now there are some start-ups starting to do that and starting to make some money. 01:27:11.520 |
These are opportunities for you, like here's some stuff where people are creating screwed 01:27:16.560 |
up societal outcomes because of their shitty models, like okay, well you can go and build something better. 01:27:23.520 |
So like another example of the bias in word2vec word vectors is that restaurant review systems rank Mexican 01:27:30.880 |
restaurants worse, because the word Mexican tends to be associated with words relating to crime in the data they were trained on. 01:27:40.800 |
Again, this is like a real problem that is happening right now. 01:27:48.720 |
So Rachel actually did some interesting analysis of just the plain word2vec word vectors where 01:27:56.680 |
she basically pulled them out and looked at these analogies, based on some research that others had published. 01:28:03.960 |
And so you can see word2vec, the vector directions show that father is to doctor as mother is to 01:28:10.480 |
nurse, and man is to computer programmer as woman is to homemaker, and so forth. 01:28:16.680 |
So it's really easy to see what's in these word vectors, and they're kind of fundamental 01:28:24.200 |
to much of the NLP or probably just about all of the NLP software we use today. 01:28:31.080 |
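If you want to see these directions for yourself, here is a small sketch of the kind of probe being described, using gensim and a publicly available GloVe embedding. The embedding chosen and the word pairs are just examples, not a reproduction of Rachel's exact analysis, and the completions you get will depend on which vectors you load.

```python
# Probing pretrained word vectors for stereotyped analogy directions.
# The embedding and the word pairs are illustrative choices, nothing more.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # any pretrained KeyedVectors works

def analogy(a, b, c):
    """Return the word completing 'a is to b as c is to ?' via vector arithmetic."""
    return vectors.most_similar(positive=[b, c], negative=[a], topn=1)[0][0]

for a, b, c in [("man", "doctor", "woman"),
                ("man", "programmer", "woman"),
                ("father", "doctor", "mother")]:
    print(f"{a} : {b} :: {c} : {analogy(a, b, c)}")
```

Note that most_similar excludes the query words themselves, so the completion is always a different word; running this on different corpora is an easy way to compare how much stereotype each embedding has absorbed.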
So ProPublica has actually done a lot of good work in this area. 01:28:42.600 |
Many judges now have access to Sentencing Guidelines software. 01:28:46.840 |
And so Sentencing Guidelines software says to the judge, for this individual we would predict this risk of re-offending. 01:28:56.080 |
And now of course a judge doesn't understand machine learning. 01:29:00.000 |
So like they have two choices, which is either do what it says or ignore it entirely, and there's not much in between. 01:29:08.920 |
And so for the ones that fall into the 'do what it says' category, here's what happens. 01:29:13.640 |
For those that were labeled higher risk, the subset that 01:29:19.120 |
actually turned out not to re-offend was about a quarter of whites and about a half of African Americans. 01:29:28.080 |
So like nearly twice as often, people who didn't re-offend were marked as higher risk 01:29:36.560 |
if they were African Americans, and vice versa. 01:29:39.280 |
Amongst those that were labeled lower risk but actually did re-offend, it turned out 01:29:44.800 |
to be about half of the whites and only 28% of the African Americans. 01:29:49.440 |
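The disparity being described here is exactly the kind of thing you can audit for in a few lines if you have the predictions and the outcomes: compute the false positive rate (labelled higher risk but did not re-offend) and the false negative rate (labelled lower risk but did re-offend) separately per group. Below is a minimal sketch with pandas; the column names and the tiny made-up table are placeholders for your own data, not real figures.

```python
# Auditing a risk model for unequal error rates across groups.
# The column names and the toy data below are placeholders, not real results.
import pandas as pd

df = pd.DataFrame({
    "group":      ["a", "a", "a", "a", "b", "b", "b", "b"],
    "high_risk":  [1,   0,   1,   0,   1,   1,   0,   0],   # model's label
    "reoffended": [1,   0,   0,   1,   0,   1,   0,   1],   # actual outcome
})

def error_rates(g):
    # False positive rate: flagged high risk among those who did not re-offend.
    fpr = ((g.high_risk == 1) & (g.reoffended == 0)).sum() / (g.reoffended == 0).sum()
    # False negative rate: flagged low risk among those who did re-offend.
    fnr = ((g.high_risk == 0) & (g.reoffended == 1)).sum() / (g.reoffended == 1).sum()
    return pd.Series({"false_positive_rate": fpr, "false_negative_rate": fnr})

print(df.groupby("group")[["high_risk", "reoffended"]].apply(error_rates))
```

If those rates differ substantially between groups, that is the kind of disparity ProPublica was pointing at, regardless of how good the overall accuracy looks.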
So this is a situation where I would like to think nobody set out to create something like this deliberately. 01:29:57.040 |
But when you start with biased data, and the data says that whites and blacks smoke marijuana 01:30:09.200 |
at about the same rate, but blacks are jailed something like five times more often than 01:30:16.320 |
whites, the nature of the justice system in America at the moment is that it's not equal. 01:30:24.840 |
And therefore the data that's fed into the machine learning model is going to basically reflect that bias. 01:30:31.800 |
And then because of the runaway feedback loop, it's just going to get worse and worse. 01:30:35.760 |
I'll tell you something else interesting about this one, which a researcher called Abe Gong has 01:30:40.320 |
pointed out, which is some of the questions that are being asked of offenders. 01:30:53.920 |
So your answers to those questions are going to decide whether you're locked up and for how long. 01:31:00.880 |
Now as a machine learning researcher, do you think that might improve the predictive accuracy 01:31:05.400 |
of your algorithm and get you a better R-squared? 01:31:12.160 |
You try it out and say oh, I've got a better R-squared. 01:31:16.800 |
Well there's another question, do you think it's reasonable to lock somebody up for longer because of the answer to a question like that? 01:31:25.400 |
And yet these are actually the examples of questions that we are asking right now to 01:31:31.320 |
offenders and then putting into a machine learning system to decide what happens to 01:31:37.360 |
So again, whoever designed this, presumably they were laser focused on technical excellence, 01:31:44.360 |
getting the maximum area under the ROC curve, thinking I found these great predictors that give 01:31:49.280 |
me another .02, and I guess didn't start to think like, well, is that a reasonable way to decide how long somebody gets locked up for. 01:32:03.840 |
So like putting this together, you can kind of see how this can get more and more scary. 01:32:12.160 |
We take a company like Taser, and Tasers are these devices that kind of give you a big electric shock. 01:32:19.920 |
And Tasers managed to do a great job of creating strong relationships with some academic researchers 01:32:26.680 |
who seem to say whatever they tell them to say, to the extent where now if you look at 01:32:32.400 |
the data it turns out that there's a pretty high probability that if you get tased, you die shortly afterwards. 01:32:41.240 |
That happens not unusually, and yet the researchers who they've paid to look into this have consistently 01:32:49.520 |
come back and said oh no, it was nothing to do with the Taser, the fact that they died 01:32:53.880 |
immediately afterwards was totally unrelated, it was just a random thing that happened. 01:33:01.600 |
So this company now owns 80% of the market for body cameras, and they started buying 01:33:09.260 |
computer vision AI companies, and they're going to try and now use these police body 01:33:14.280 |
camera videos to anticipate criminal activity. 01:33:22.220 |
So is that like okay, I now have some augmented reality display saying tase this person because the algorithm says they're about to commit a crime? 01:33:31.880 |
So it's kind of like a worrying direction, and so I'm sure nobody who's a data scientist 01:33:40.920 |
at Taser or at the companies that they bought out is thinking like this is the world I want 01:33:46.360 |
to help create, but they could find themselves, or you could find yourself in the middle of 01:33:53.000 |
this kind of discussion, where it's not explicitly about that topic but there's part of you that 01:33:58.000 |
says I wonder if this is how this could be used, and I don't know exactly what the right 01:34:05.760 |
thing to do in that situation is, because you can ask, and of course people are going to tell you it's fine. 01:34:17.200 |
You could ask for some kind of written promise, you could decide to leave, you could start 01:34:25.320 |
doing some research into the legality of things to say I would at least protect my own legal position. 01:34:32.960 |
I don't know, have a think about how you would respond to that. 01:34:39.840 |
So these are some questions that Rachel put together as things to think about. 01:34:45.480 |
So if you're looking at building a data product or using a model, if you're building a machine 01:34:51.320 |
learning model, it's for a reason: you're trying to do something. So the first question to ask is, what bias might be in the data? 01:34:59.280 |
Because whatever bias is in that data ends up being a bias in your predictions, potentially 01:35:03.360 |
then biases the actions you're influencing, and potentially then biases the data that 01:35:07.560 |
comes back to you, and so you may create a feedback loop. 01:35:10.560 |
If the team that built it isn't diverse, what might you be missing? 01:35:15.920 |
So for example, one senior executive at Twitter raised the alarm about major Russian bot problems 01:35:28.760 |
at Twitter way back well before the election. 01:35:34.340 |
That was the one black person in the exec team at Twitter, the one. 01:35:47.940 |
Definitely having a more diverse team means having a more diverse set of opinions and 01:35:54.000 |
beliefs and ideas and things to look for and so forth. 01:35:57.080 |
So non-diverse teams seem to make more of these bad mistakes. 01:36:02.640 |
Can we audit the code? Is it open source? Can we check for different error rates amongst 01:36:08.640 |
different groups? Is there a simple rule we could use instead that's extremely interpretable 01:36:14.540 |
and easy to communicate? And if something goes wrong, do we have a good way to deal with it? 01:36:22.080 |
So when we've talked to people about this and a lot of people have come to Rachel and 01:36:29.400 |
said I'm concerned about something my organization is doing, what do I do, or I'm just concerned in general. 01:36:42.200 |
And very often Rachel will say, have you considered leaving? 01:36:47.960 |
And they will say, I don't want to lose my job. 01:36:52.520 |
But actually if you can code, you're in 0.3% of the population. 01:36:57.420 |
If you can code and do machine learning, you're in probably 0.01% of the population. 01:37:09.240 |
So realistically, obviously an organization does not want you to feel like you're somebody 01:37:15.480 |
who could just leave and get another job, that's not in their interest, but that is the reality. 01:37:22.000 |
And so one of the things I hope you'll leave this course with is enough self-confidence 01:37:28.680 |
to recognize that you have the skills to get a job, and particularly once you've got your 01:37:36.040 |
first job, your second job is an order of magnitude easier. 01:37:40.200 |
And so this is important not just so that you feel like you actually have the ability 01:37:44.260 |
to act ethically, but it's also important to realize if you find yourself in a toxic 01:37:51.320 |
environment, which is pretty damn common, unfortunately, there are a lot of shitty tech cultures and environments out there. 01:38:02.480 |
If you find yourself in one of those environments, the best thing to do is to get the hell out. 01:38:09.560 |
And if you don't have the self-confidence to think you can get another job, you can end up stuck there. 01:38:17.640 |
So it's really important, it's really important to know that you are leaving this program 01:38:24.160 |
with very in-demand skills, and particularly after you have that first job, you're now 01:38:28.680 |
somebody with in-demand skills and a track record of being employed in that area. 01:38:36.720 |
This is kind of just a broad question, but what are some things that you know of that people are doing to address these kinds of bias? 01:38:52.440 |
It's kind of a controversial subject at the moment, and some people are trying 01:38:59.320 |
to use an algorithmic approach, where they're basically trying to 01:39:03.200 |
say how can we identify the bias and kind of subtract it out, but the most effective 01:39:10.920 |
ways I know of are ones that try to treat it at the data level. 01:39:15.000 |
So start with a more diverse team, particularly a team involving people from the humanities, 01:39:21.560 |
like sociologists, psychologists, economists, people that understand feedback loops and implications 01:39:27.560 |
for human behavior, and they tend to be equipped with good tools for kind of identifying and 01:39:35.460 |
tracking these kinds of problems, and then trying to incorporate the solutions into your process. 01:39:43.200 |
Let's just say there isn't some standard process I can point you to and say, here's how to fix it. 01:39:52.120 |
If there is such a thing, we haven't found it yet, it requires a diverse team of smart 01:39:58.440 |
people to be aware of the problems and work hard at them, is the short answer. 01:40:02.720 |
This is just kind of a general thing I guess for the whole class. 01:40:12.640 |
If you're interested in this stuff, I read a pretty cool book, Jeremy you've probably 01:40:16.480 |
heard of it, Weapons of Math Destruction by Cathy O'Neil, it covers a lot of the same issues. 01:40:24.960 |
Thanks for the recommendation, Cathy's great, she's also got a TED talk, I didn't manage 01:40:31.120 |
to finish the book because it's so damn depressing, I was just like, no more. 01:40:47.920 |
This has been really intense for me, obviously this was meant to be something that I was 01:40:56.280 |
sharing with Rachel, so I've ended up doing one of the hardest things in my life, which 01:41:01.920 |
is to teach two people's worth of course on my own and also look after a sick wife and 01:41:07.880 |
have a toddler and also do a deep learning course and also do all this with a new library 01:41:14.920 |
So I'm looking forward to getting some sleep, but it's been totally worth it because you've 01:41:20.760 |
been amazing, like I'm thrilled with how you've reacted to the kind of opportunities I've 01:41:31.320 |
given you and also to the feedback that I've given you.