
Machine Learning 1: Lesson 12


Chapters

0:00 Introduction
1:00 Recap
4:30 Durations
8:55 Zip
21:45 Embedding
25:50 Scaling
30:00 Validation
31:10 Check against
32:30 Create model
33:07 Define embedding dimensionality
38:41 Embedding matrices
45:11 Categorical data

Transcript

I thought what we might do today is to finish off where we were in this Rossman notebook looking at time series forecasting and structured data analysis. And then we might do a little mini-review of everything we've learnt, because believe it or not, this is the end, there's nothing more to know about machine learning other than everything that you're going to learn next semester and for the rest of your life.

But anyway, I've got nothing else to teach, so we'll do a little review and then we'll cover the most important part of the course, which is thinking about how to use this kind of technology appropriately and effectively, in a way that hopefully has a positive impact on society.

So last time we got to the point where we talked a bit about this idea that when we were looking at building this CompetitionMonthsOpen derived variable, we actually truncated it down to be no more than 24 months, and we talked about the reason being that we actually wanted to use it as a categorical variable, because categorical variables, thanks to embeddings, have more flexibility in how the neural net can use them.

And so that was kind of where we left off. So let's keep working through this, because what's happening in this notebook is stuff which is probably going to apply to most time series data sets that you work with. And as we talked about, although we use df.apply here, this is something where it's running a piece of Python code over every row, and that's horrifically slow.

So we only do that if we can't find a vectorized pandas or numpy function that can do it to the whole column at once, but in this case I couldn't find a way to convert a year and a week number into a date without using arbitrary Python. Also worth remembering this idea of a lambda function, any time you're trying to apply a function to every row of something, or every element of a tensor, or something like that, if there isn't a vectorized version already, you're going to have to call something like dataframe.apply, which will run a function you pass to every element.

So this is basically a map in functional programming. Since very often the function that you want to pass to it is something you're just going to use once and then throw it away, it's really common to use this lambda approach. So this lambda is creating a function just for the purpose of telling df.apply what to use.

So we could also have written this in a different way, which would have been to define a named function, say one that takes a row and returns the Promo2Since date, and then put that function name in here. So those two are the same thing. So one approach is to define the function and then pass it by name, and the other is to define the function in place using lambda.
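To make that concrete, here is a minimal sketch of the two equivalent styles. The column names follow the Rossmann notebook, it assumes the isoweek package is available, and df stands in for the joined dataframe:

```python
import pandas as pd
from isoweek import Week  # used in the notebook to turn (year, week) into a date

# style 1: a named function, passed to apply by name
def promo2since(row):
    return Week(row.Promo2SinceYear, row.Promo2SinceWeek).monday()

df['Promo2Since'] = pd.to_datetime(df.apply(promo2since, axis=1))

# style 2: exactly the same thing as a throwaway lambda, defined in place
df['Promo2Since'] = pd.to_datetime(
    df.apply(lambda row: Week(row.Promo2SinceYear, row.Promo2SinceWeek).monday(), axis=1))
```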

And so if you're not comfortable creating and using lambdas, it's a good thing to practice, and playing around with df.apply is a good way to practice it. So let's talk about this durations section, which may at first seem a little specific, but actually it turns out not to be.

What we're going to do is we're going to look at three fields, promo, state holiday and school holiday. And so basically what we have is a table for each store, for each date, does that store have a promo going on at that date? Is there a school holiday in that region of that store at that date?

Is there a state holiday in that region for that store at that date? And so this kind of thing is, there are events, and time series with events are very common. If you're looking at oil and gas drilling data, you're trying to say the flow through this pipe, here's an event representing when it set off some alarm, or here's an event where the drill got stuck, or whatever.

And so most time series at some level will tend to represent some events. So the fact that an event happened at a time is interesting itself, but very often a time series will also show something happening before and after the event. So for example, in this case we're doing grocery sales prediction.

If there's a holiday coming up, it's quite likely that sales will be higher before and after the holiday, and lower during the holiday if this is a city-based store, because you're going to stock up before you go away to bring things with you, and when you come back you've got to refill the fridge, for instance.

So although we don't necessarily have to do this kind of feature engineering to create features specifically saying this is before or after a holiday, the more we can give the neural net the kind of information it needs, the less it's going to have to learn that itself, and the more we can do with the data we already have and the size of architecture we already have.

So feature engineering, even with stuff like neural nets, is still important because it means that we'll be able to get better results with whatever limited data we have, whatever limited computation we have. So the basic idea here, therefore, is that when we have events in our time series, we want to create two new columns for each event: how long is it going to be until the next time this event happens, and how long has it been since the last time that event happened.

So in other words, how long until the next state holiday, how long since the previous state holiday. So that's not something which I'm aware of as existing as a library or anything like that, so I wrote it here by hand. And so importantly, I need to do this by store.

For this store, when was this store's last promo, so how long has it been since the last time it had a promo, how long it will be until the next time it has a promo, for instance. So here's what I'm going to do, I'm going to create a little function that's going to take a field name and I'm going to pass it each of promo and then state holiday and then school holiday.

So let's do school holiday, for example. So we'll say the field is school holiday, and then we'll call get_elapsed on school holiday with 'After'. So let me show you what that's going to do. So first of all we sort by store and date. So now when we loop through this, we're going to be looping through within a store, so store number 1, January the 1st, January the 2nd, January the 3rd, and so forth.

And as we loop through each store, we're basically going to say, is this row a school holiday or not? And if it is a school holiday, then we'll keep track of this variable called last_date, which says this is the last date where we saw a school holiday. And so then we're basically going to append to our result the number of days since the last school holiday.

That's the kind of basic idea here. So there's a few interesting features. One is the use of zip. So I could actually write this much more simply. I could basically go for row in df.iterrows() and then grab the fields we want from each row. It turns out this is 300 times slower than the version that I have.

And basically iterating through a data frame and extracting specific fields out of a row has a lot of overhead. What's much faster is to iterate through a numpy array. So if you take a series like df.store and add .values after it, that grabs a numpy array of that series.

So here are three numpy arrays. One is the store IDs, one is whatever field is, in this case, let's say school_holiday, and one is the date. So now what I want to do is loop through the first one of each of those lists, and then the second one of each of those lists, and then the third one of each of those lists.

This is a really, really common pattern. I need to do something like this in basically every notebook I write, and the way to do it is with zip. So zip means loop through each of these lists one at a time, and then this here is where we can grab that element out of the first list, the second list and the third list.

So if you haven't played around much with zip, that's a really important function to practice with. Like I say, I use it in pretty much every notebook I write all the time. You have to loop through a bunch of lists at the same time. So we're going to loop through every store, every school_holiday, every date, yes.

So in this case we basically want to say let's grab the first store, the first school_holiday, the first date. So for store_1, January 1st school_holiday was true or false. And so if it is a school_holiday, I'll keep track of that fact by saying the last time I saw a school_holiday was that day.

And then append, how long has it been since the last school_holiday? And if the store_id is different to the last store_id I saw, then I've now got to a whole new store, in which case I have to basically reset everything. Could you pass that to Karen? What will happen to the first points that we don't have the last holiday?

I basically set this to some arbitrary starting point, it's going to end up with the largest or the smallest possible date. And you may need to replace this with a missing value afterwards, or some zero, or whatever. The nice thing is, thanks to values, it's very easy for a neural net to kind of cut off extreme values.
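Here's a rough sketch of what that elapsed-time loop looks like, in the spirit of the notebook's get_elapsed. The column names follow the Rossmann data, and the details are illustrative rather than the exact notebook code:

```python
import numpy as np

def get_elapsed(df, fld, prefix):
    """For each row, days since `fld` was last true, computed per store.

    Assumes df is sorted by Store then Date; sort the dates descending instead
    to get "days until the next event"."""
    day1 = np.timedelta64(1, 'D')
    last_date = np.datetime64()   # NaT to start; becomes an extreme value if later cast to int
    last_store = None
    res = []
    # zip over the raw numpy arrays (.values) -- much faster than df.iterrows()
    for store, val, date in zip(df.Store.values, df[fld].values, df.Date.values):
        if store != last_store:            # new store: reset the tracker
            last_date = np.datetime64()
            last_store = store
        if val:                            # the event happened on this row
            last_date = date
        res.append((date - last_date).astype('timedelta64[D]') / day1)
    df[prefix + fld] = res
```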

So in this case I didn't do anything special with it, I ended up with a negative a billion day time stamps and it still worked fine. So we can go through, and the next thing to note is there's a whole bunch of stuff that I need to do to both the training set and the test set.

So in the previous section I actually kind of added this little loop where I go for each of the training data frame and the test data frame do these things. So each cell I did for each of the data frames. I've now got a whole series of cells that I want to run first of all for the training set and then for the test set.

So in this case the way I did that was I had two different cells here. One which set df to be the training set, one which set to be the test set. So the way I use this is I run just this cell, and then I run all the cells underneath, so it does it all for the training set, and then I come back and run just this cell and then run all the cells underneath.

So this notebook is not designed to be just run from top to bottom, but it's designed to be run in this particular way. And I mention that because this can be a handy trick to know, you could of course put all the stuff underneath in a function that you pass the data frame to and call it once with the test set and once with the training set, but I kind of like to experiment a bit more interactively, look at each step as I go, so this way is an easy way to kind of run something on two different data frames without turning it into a function.

So if I sort by store and by date, then this is keeping track of the last time something happened and so this is therefore going to end up telling me how many days was it since the last school holiday. So now if I sort date descending and call the exact same function, then it's going to say how long until the next school holiday.
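Schematically, the same function gives you both columns just by flipping the sort order; using the sketch above, something like:

```python
# days since the most recent school holiday
df = df.sort_values(['Store', 'Date'])
get_elapsed(df, 'SchoolHoliday', 'After')

# days until the next school holiday: same function, dates sorted descending
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(df, 'SchoolHoliday', 'Before')
```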

So that's a nice little trick for adding these kind of arbitrary event timers into your time series models. So if you're doing, for example, the Ecuadorian Groceries competition right now, maybe this kind of approach would be useful for various events in that as well. Do it for state holiday, do it for promo, here we go.

The next thing that we look at here is rolling functions. So rolling in pandas is how we create what we call windowing functions. Let's say I had some data, something like this, and this is like date, and I don't know this is like sales or whatever, what I could do is I could say let's create a window around this point of like 7 days, so it would be like okay, this is a 7 day window.

And so then I could take the average sales in that 7 day window, and I could do the same thing like I don't know, over here, take the average sales over that 7 day window. And so if we do that for every point and join up those averages, you're going to end up with a moving average.

So the more generic version of the moving average is a window function, i.e. something where you apply some function to some window of data around each point. Now very often the windows that I've shown here are not actually what you want, if you're trying to build a predictive model you can't include the future as part of a moving average.

So quite often you actually need a window that ends here, so that would be our window function. And so Pandas lets you create arbitrary window functions using this rolling here. This here says how many time steps do I want to apply the function to. This here says if I'm at the edge, so in other words if I'm like out here, should you make that a missing value because I don't have 7 days to average over, or what's the minimum number of time periods to use?

So here I said 1, and then optionally you can also say do you want to set the window at the start of a period, or the end of a period, or the middle of the period. And then within that you can apply whatever function you like. So here I've got my weekly by store sums.
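For example, a backward-looking 7-day sum per store might be written along these lines (a sketch only; the event columns are assumed to already be 0/1 flags):

```python
# 7-day rolling sum of each event flag, computed separately for each store;
# min_periods=1 gives a partial sum for the first few days rather than NaN
columns = ['SchoolHoliday', 'StateHoliday', 'Promo']
bwd = (df.sort_values(['Store', 'Date'])
         .set_index('Date')
         .groupby('Store')[columns]
         .rolling(7, min_periods=1)
         .sum())
```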

So there's a nice easy way of getting moving averages, or whatever else. And I should mention, if you go to the time series page in the Pandas documentation, there's literally, just look at the index here, time series functionality, all of this, there's lots. Because Wes McKinney, who created Pandas, was originally in hedge fund trading, I believe, and his work was all about time series.

And so I think Pandas originally was very focused on time series, and still it's perhaps the strongest part of Pandas. So if you're playing around with time series computations, you definitely owe it to yourself to try to learn this entire API. And there's a lot of conceptual pieces around time stamps, and date offsets, and resampling, and stuff like that to kind of get your head around, but it's totally worth it because otherwise you'll be writing this stuff as loops by hand, it's going to take you a lot longer than leveraging what Pandas already does, and of course Pandas will do it in highly optimized C code for you, vectorized C code, whereas your version is going to loop in Python.

So if you're doing stuff with time series, it's definitely worth learning the full Pandas time series API, which is about as strong as any time series API out there. Okay, so at the end of all that, you can see here's those kind of starting point values I mentioned, slightly on the extreme side, and so you can see here that on the 17th of September store 1 was 13 days after the last school holiday, the 16th was 12, then 11, 10, and so forth.

We're currently in a promotion, here this is one day before the promotion, here we've got 9 days after the last promotion, and so forth. So that's how we can add kind of event counters to a time series, and probably always a good idea when you're doing work with time series.

So now that we've done that, we've got lots of columns in our dataset, and so we split them out into categorical versus continuous columns, we'll talk more about that in a moment in the review section. So these are going to be all the things I'm going to create an embedding for.

And these are all of the things that I'm going to feed directly into the model. So for example, we've got competition distance, that's distance to the nearest competitor, maximum temperature, and here we've got day of week. So here we've got maximum temperature, maybe it's like 22.1 centigrade in Germany, we've got distance to the nearest competitor, which might be 321.7 kilometers, and then we've got day of week, which might be Saturday, encoded as a 6.

So these numbers here are going to go straight into our vector, the vector that we're going to be feeding into our neural net. We'll see in a moment we'll normalize them, but more or less. But this categorical variable we're not, we need to put it through an embedding. So we'll have some embedding matrix of, if there are 7 days, maybe dimension 4 embedding, and so this will look up the 6th row to get back the 4 items.
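Here's a tiny PyTorch sketch of that step, just to make the shapes concrete; the numbers are the made-up examples from above, and the normalization is skipped:

```python
import torch
import torch.nn as nn

emb_dayofweek = nn.Embedding(7, 4)    # 7 days of week -> 4-dimensional embedding

day_of_week = torch.tensor([6])       # Saturday, encoded as 6, looks up row 6
cont = torch.tensor([[22.1, 321.7]])  # max temperature, competition distance

# shape (1, 6): 4 embedding values plus 2 continuous values, fed to the first linear layer
x = torch.cat([emb_dayofweek(day_of_week), cont], dim=1)
```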

And so this is going to turn into a length 4 vector, which we'll then add in here. So that's how our continuous and categorical variables are going to work. So then all of our categorical variables, we'll turn them into pandas categorical variables in the same way that we've done before. And then we're going to apply the same mappings to the test set.

So if Saturday is a 6 in the training set, this apply_cats makes sure that Saturday is also a 6 in the test set. For the continuous variables, make sure they're all floats because PyTorch expects everything to be a float. So then, this is another little trick that I use.

Both of these cells define something called joined_samp. One of them defines them as the whole training set. One of them defines them as a random subset. And so the idea is that I do all of my work on the sample, make sure it all works well, play around with different hyperparameters and architectures, and then I'm like, "Okay, I'm very happy with this." I then go back and run this line of code to say, "Okay, now make the whole data set be the sample," and then rerun it.

This is a good way, again, similar to what I showed you before, it lets you use the same cells in your notebook to run first of all on a sample, and then go back later and run it on the full data set. So now that we've got that joined_samp, we can then pass it to proc_df as we've done before to grab the dependent variable and deal with missing values, and in this case we pass one more thing, which is do_scale=True.

do_scale=True will subtract the mean and divide by the standard deviation. And so the reason for that is that if our first layer is just a matrix multiply, so here's our set of weights, and our input has got something which is like 0.001, and then something which is like 10^6, and our weight matrix has been initialized to be random numbers between 0 and 1, so we've got 0.6, 0.1, etc.

Then basically this thing here is going to have gradients that are 9 orders of magnitude bigger than this thing here, which is not going to be good for optimization. So by normalizing everything to be mean of 0, standard deviation of 1 to start with, then that means that all of the gradients are going to be on the same kind of scale.

We didn't have to do that in random forests, because in random forests we only cared about the sort order. We didn't care about the values at all, but with linear models and things that are built out of layers of linear models, like neural nets, we care very much about the scale.

So do_scale=True normalizes our data for us. Now since it normalizes our data for us, it returns one extra object, which is a mapper: an object that contains, for each continuous variable, the mean and standard deviation it was normalized with, the reason being that we're going to have to use the same mean and standard deviation on the test set, because we need our test set and our training set to be scaled in the exact same way, otherwise the same values would mean different things in each.
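The key point, whether you use the returned mapper or do it by hand, is that the test set is scaled with the training set's statistics, roughly like this (a sketch; the dataframe and column names are hypothetical):

```python
cont_cols = ['CompetitionDistance', 'Max_TemperatureC']   # whatever your continuous variables are

means = train_df[cont_cols].mean()
stds  = train_df[cont_cols].std()

train_df[cont_cols] = (train_df[cont_cols] - means) / stds
test_df[cont_cols]  = (test_df[cont_cols]  - means) / stds   # same means/stds, never recomputed on test
```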

And so these details about making sure that your test and training set have the same categorical codings, the same missing value replacement and the same scaling normalization are really important to get right, because if you don't get it right then your test set is not going to work at all.

But if you follow these steps, it'll work fine. We also take the log of the dependent variable, and that's because in this Kaggle competition the evaluation metric was root mean squared percent error. So root mean squared percent error means we're being penalized based on the ratio between our answer and the correct answer.

We don't have a loss function in PyTorch called root mean squared percent error. We could write one, but it's easier to just take the log of the dependent variable, because the difference between logs is the log of the ratio. So by taking the log we get that for free. You'll notice the vast majority of regression competitions on Kaggle use either root mean squared percent error or root mean squared error of the log as their evaluation metric, and that's because in real-world problems most of the time we care more about ratios than about raw differences.
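In other words, something like this sketch: train on the log and exponentiate the predictions afterwards (train_df is a hypothetical stand-in for the training dataframe):

```python
import numpy as np

y_log = np.log(train_df['Sales'])        # train against log(Sales)

# RMSE on the log scale approximates root mean squared *percent* error,
# because log(pred) - log(actual) = log(pred / actual), i.e. it measures the ratio
def rmse_of_logs(pred_log, actual_log):
    return np.sqrt(np.mean((pred_log - actual_log) ** 2))

# at prediction time, map back: sales_pred = np.exp(pred_log)
```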

So if you're designing your own project, it's quite likely that you'll want to think about using the log of your dependent variable. So then we create a validation set, and as we've learned before, most of the time if you've got a problem involving a time component, your validation set probably wants to be the most recent time period rather than a random subset, so that's what I do here.

When I finished modeling and I found an architecture and a set of hyperparameters and a number of epochs and all that stuff that works really well, if I want to make my model as good as possible I'll retrain on the whole thing, including the validation set. Now currently at least fastAI assumes that you do have a validation set, so my kind of hacky workaround is to set my validation set to just be one index, which is the first row, in that way all the code keeps working but there's no real validation set.
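Both pieces might look roughly like this (a sketch; df_train and the six-week cutoff are hypothetical stand-ins):

```python
import numpy as np
import pandas as pd

# validation set = the most recent chunk of time, not a random sample
cutoff = df_train['Date'].max() - pd.Timedelta(weeks=6)
val_idx = np.flatnonzero((df_train['Date'] > cutoff).values)

# for the final fit on the whole data set: a hacky one-row "validation set"
# so the library still has something to evaluate, while effectively training on everything
val_idx = [0]
```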

So obviously if you do this you need to make sure that your final training is like the exact same hyperparameters, the exact same number of epochs, exactly the same as the thing that worked, because you don't actually have a proper validation set now to check against. I have a question regarding get elapsed function which we discussed before, so in get elapsed function we are trying to find when will the next holiday come?

How many days away is it? So every year the holidays are more or less fixed, like there will be holiday on 4th of July, 25th of December and there's hardly any change. So can't we just look from previous years and just get a list of all the holidays that are going to occur this year?

Maybe, in this case I guess that's not true of promo, and some holidays change, like Easter, so this way I get to write one piece of code that works for all of them, and it doesn't take very long to run. So there might be ways, if your dataset was so big that this took too long you could maybe do it on one year and then somehow copy it, but in this case there was no need to.

And I always value my time over my computer's time, so I try to keep things as simple as I can. So now we can create our model, and to create our model we have to create a model data object, as we always do with fast.ai; a columnar model data object is just a model data object that represents a training set, a validation set, and an optional test set of standard columnar structured data.

We just have to tell it which of the variables should we treat as categorical. And then pass in our dataframes. So for each of our categorical variables, here is the number of categories it has. So for each of our embedding matrices, this tells us the number of rows in that embedding matrix.

And so then we define what embedding dimensionality we want. If you're doing natural language processing, then the number of dimensions you need to capture all the nuance of what a word means and how it's used has been found empirically to be about 600. It turns out that when you do NLP models with embedding matrices smaller than 600, you don't get results as good as you do with size 600, and beyond 600 it doesn't seem to improve much.

I would say that human language is one of the most complex things that we model, so I wouldn't expect you to come across many if any categorical variables that need embedding matrices with more than 600 dimensions. At the other end, some things may have pretty simple kind of causality.

So for example, state holiday: maybe if something's a holiday, then stores that are in the city show one behavior, stores that are in the country show some other behavior, and that's about it. Maybe it's a pretty simple relationship. So ideally, when you decide what embedding size to use, you would use your knowledge about the domain to decide how complex the relationship is, and so how big an embedding you need.

In practice, you almost never know that. You would only know that because maybe somebody else has previously done that research and figured it out, like in NLP. So in practice, you probably need to use some rule of thumb, and then having tried your rule of thumb, you could then maybe try a little bit higher and a little bit lower and see what helps, so it's kind of experimental.

So here's my rule of thumb. My rule of thumb is look at how many discrete values the category has, i.e. the number of rows in the embedding matrix, and make the dimensionality of the embedding half of that. So for day of week, which is the second one, 8 rows and 4 columns.

So here it is there, the number of categories divided by 2. But then I say, don't go more than 50. So here you can see for stores, there's 1000 stores, I only have a dimensionality of 50. Why 50? I don't know, it seems to have worked okay so far.
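So the rule of thumb, in code, is something like this, following the notebook's approach; cat_vars here is assumed to be the list of categorical column names, and the +1 leaves a row for unknown or missing categories:

```python
# number of rows in each embedding matrix: cardinality of the category (+1 for unknown)
cat_sz = [(c, len(joined_samp[c].cat.categories) + 1) for c in cat_vars]

# number of columns: half the number of rows, capped at 50
emb_szs = [(n_cat, min(50, (n_cat + 1) // 2)) for _, n_cat in cat_sz]
```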

You may find you need something a little different. Actually for the Ecuadorian groceries competition, I haven't really tried playing with this, but I think we may need some larger embedding sizes; it's something to fiddle with. Prince, can you pass that left? So as your variables' cardinality becomes larger and larger, you're creating wider and wider embedding matrices; aren't you therefore massively risking overfitting, because you're choosing so many parameters that the model can never possibly capture all that variation unless your data is absolutely huge?

That's a great question. And so let me remind you about my kind of golden rule of the difference between modern machine learning and old machine learning. In old machine learning, we control complexity by reducing the number of parameters. In modern machine learning, we control complexity by regularization. So the answer is no, I'm not concerned about overfitting, because the way I avoid overfitting is not by reducing the number of parameters, but by increasing my dropout or increasing my weight decay.

Having said that, there's no point using more parameters for a particular embedding than I need, because regularization is penalizing a model by giving it more random data or by actually penalizing weights, so we'd rather not use more than we have to. But my general rule of thumb for designing an architecture is to be generous on the side of the number of parameters.

But in this case, if after doing some work we felt like, you know what, the store doesn't actually seem to be that important, then I might manually go and change this to make it smaller. Or if I was really finding there's not enough data here, I'm either overfitting or I'm using more regularization than I'm comfortable with, again, then you might go back.

But I would always start with being generous with parameters, and in this case, this model turned out pretty good. So now we've got a list of tuples containing the number of rows and columns of each of our embedding matrices. And so when we call get_learner to create our neural net, the first thing we pass in is how big each of our embeddings is.

And then we tell it how many continuous variables we have. We tell it how many activations to create for each layer, and we tell it what dropout to use for each layer. And so then we can go ahead and call fit. So then we fit for a while, and we're kind of getting something around the 0.1 mark.
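Since the fastai get_learner call wraps all of this up, here is a rough plain-PyTorch sketch of the kind of model it builds: one embedding per categorical variable, concatenated with the normalized continuous variables, followed by a couple of fully connected layers with dropout. This is an illustration of the architecture, not the library's actual implementation, and the layer sizes and dropout values are just examples:

```python
import torch
import torch.nn as nn

class TabularNet(nn.Module):
    def __init__(self, emb_szs, n_cont, hidden=(1000, 500), drops=(0.001, 0.01)):
        super().__init__()
        # one embedding matrix per categorical variable: (rows, dims) pairs
        self.embs = nn.ModuleList([nn.Embedding(r, d) for r, d in emb_szs])
        n_emb = sum(d for _, d in emb_szs)
        self.layers = nn.Sequential(
            nn.Linear(n_emb + n_cont, hidden[0]), nn.ReLU(), nn.Dropout(drops[0]),
            nn.Linear(hidden[0], hidden[1]),      nn.ReLU(), nn.Dropout(drops[1]),
            nn.Linear(hidden[1], 1),              # single output, e.g. log(Sales)
        )

    def forward(self, x_cat, x_cont):
        # x_cat: (batch, n_cat_vars) of integer codes; x_cont: (batch, n_cont) of floats
        x = torch.cat([e(x_cat[:, i]) for i, e in enumerate(self.embs)] + [x_cont], dim=1)
        return self.layers(x)
```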

So I tried running this on the test set and I submitted it to Kaggle during the week, actually last week, and here it is. Private score 0.107, public score 0.103. So let's have a look and see how that would go. So 0.107 private, 0.103 public; let's start on the public leaderboard, which is 0.103, not there, out of 3000, got to go back a long way.

There it is, 0.103, okay, 340th. That's not good. So on the public leaderboard, 340th. Let's try the private leaderboard, which is 0.107, oh, 5th. So hopefully you're now thinking, oh, there are some Kaggle competitions finishing soon which I entered, and I spent a lot of time trying to get good results on the public leaderboard.

I wonder if that was a good idea. And the answer is, probably not. The Kaggle public leaderboard is not meant to be a replacement for your carefully developed validation set. So for example, if you're doing the iceberg competition, which ones are ships, which ones are icebergs, they've actually put something like 4000 synthetic images into the public leaderboard and none into the private leaderboard.

So this is one of the really good things that Kaggle tests you on: are you creating a good validation set, and are you trusting it? Because if you're trusting your leaderboard feedback more than your validation feedback, then you may find yourself in 350th place when you thought you were in 5th.

So in this case, we actually had a pretty good validation set, because as you can see, it's saying somewhere around 0.1, and we actually did get somewhere around 0.1. And so in this case, the public leaderboard in this competition was entirely useless. Can you use the box please? So in regards to that, how much does the top of the public leaderboard actually correspond to the top of the private leaderboard?

Because in the churn prediction challenge, there's like four people who are just completely above everyone else. It totally depends. If they randomly sample the public and private leaderboard, then it should be extremely indicative. So in this case, the person who was second on the public leaderboard did end up winning.

SDNT came 7th. So in fact you can see the little green thing here, whereas this guy jumped 96 places. If we had entered with the neural net we just looked at, we would have jumped about 350 places. So it just depends. And so often you can figure out whether the public leaderboard -- like sometimes they'll tell you the public leaderboard was randomly sampled, sometimes they'll tell you it's not.

Generally you have to figure it out by looking at the correlation between your validation set results and the public leaderboard results to see how well they're correlated. Sometimes if two or three people are way ahead of everybody else, they may have found some kind of leakage or something like that.

That's often a sign that there's some trick. So that's Rossman, and that brings us to the end of all of our material. So let's come back after the break and do a quick review, and then we will talk about ethics and machine learning. So let's come back in 5 minutes.

So we've learnt two ways to train a model. One is by building a tree, and one is with SGD. And so the SGD approach is a way we can train a model which is a linear model or a stack of linear layers with nonlinearities between them, whereas tree building specifically will give us a tree.

And then tree building we can combine with bagging to create a random forest, or with boosting to create a GBM, or various other slight variations such as extremely randomized trees. So it's worth reminding ourselves of what these things do. So let's look at some data. So if we've got some data like so, actually let's look specifically at categorical data.

So categorical data, there's a couple of possibilities of what categorical data might look like. Let's say we've got zip code, so we've got 94003 as our zip code, and then we've got sales, and it's like 50, and 94131, with sales of 22, and so forth. So we've got some categorical variable.

So there's a couple of ways we could represent that categorical variable. One would be just to use the number, and maybe it wasn't a number at all, maybe our categorical variable is like San Francisco, New York, Mumbai, and Sydney. But we can turn it into a number just by arbitrarily deciding to give them numbers.

So it ends up being a number. So we could just use that kind of arbitrary number. So if it turns out that zip codes that are numerically next to each other have somewhat similar behavior, then the zip code versus sales chart might look something like this. Or alternatively, if the two zip codes next to each other didn't have in any way similar sales behavior, you would expect to see something that looked more like this, just all over the place.

So they're the kind of two possibilities. So what a random forest would do if we had just encoded zip in this way is it's going to say, alright, I need to find my single best split point. The split point is going to make the two sides have as small a standard deviation as possible, or mathematically equivalently have the lowest root mean squared error.

So in this case it might pick here as our first split point, because on this side there's one average, and on the other side there's the other average. And then for its second split point it's going to say, okay, how do I split this? And it's probably going to say I would split here, because now we've got this average versus this average.

And then finally it's going to say, okay, how do we split here? And it's going to say, okay, I'll split there. So now I've got that average and that average. So you can see that it's able to kind of hone in on the set of splits it needs, even though it kind of does it greedily, top down one at a time.

The only reason it wouldn't be able to do this is if it was just such bad luck that the two halves were kind of always exactly balanced, but even if that happens it's not going to be the end of the world, it will split on something else, some other variable, and next time around it's very unlikely that it's still going to be exactly balanced in both parts of the tree.

So in practice this works just fine. In the second case, it can do exactly the same thing. It'll say, okay, which is my best first split, even though there's no relationship between one zip code and its neighboring zip code numerically. We can still see here if it splits here, there's the average on one side, and the average on the other side is probably about here.

And then where would it split next? Probably here, because here's the average on one side, here's the average on the other side. So again, it can do the same thing, it's going to need more splits because it's going to end up having to kind of narrow down on each individual large zip code and each individual small zip code, but it's still going to be fine.

So when we're dealing with building decision trees for random forests or GBMs or whatever, we tend to encode our variables just as ordinals. On the other hand, if we're doing a neural network, or the simplest version of one, a linear regression or a logistic regression, the best it could do is that, which is no good at all, and ditto with this one, it's going to be like that.

So an ordinal is not going to be a useful encoding for a linear model or something that stacks linear and nonlinear models together. So instead, what we do is we create a one-hot encoding. So we'll say, here's 1, 0, 0, 0, here's 0, 1, 0, 0, here's 0, 0, 1, 0, and here's 0, 0, 0, 1.

And so with that encoding, it can effectively create a little histogram where it's going to have a different coefficient for each level. So that way it can do exactly what it needs to do. At what point does that become too tedious for your system, or does it not? Pretty much never.

Because remember, in real life we don't actually have to create that matrix, instead we can just have the 4 coefficients and just do an index lookup to grab the second one, which is mathematically equivalent to multiplying by the one-hot encoding. So that's no problem. One thing to mention, I know you guys have been taught quite a bit of more analytical solutions to things.

And in analytical solutions to linear regression, you can't solve something with this amount of collinearity. In other words, you know something is Sydney if it's not Mumbai or New York or San Francisco. In other words, there's 100% collinearity between the fourth of these classes versus the other three. And so if you try to solve a linear regression analytically, that way the whole thing falls apart.

Now note, with SGD we have no such problem, like SGD, why would it care? We're just taking one step along the derivative. It cares a little, because in the end the main problem with collinearity is that there's an infinite number of equally good solutions. So in other words, we could increase all of these and decrease this, or decrease all of these and increase this, and they're going to balance out.

And when there's an infinitely large number of good solutions, that means there are a lot of flat spots in the loss surface, and it can be harder to optimize. So there's a really easy way to get rid of all of those flat spots, which is to add a little bit of regularization.

So if we added a little bit of weight decay, like 1e-7 even, then that basically says these are not all equally good anymore; the one which is the best is the one where the parameters are the smallest and the most similar to each other, and so that will move it back to being a nice loss function again.
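In PyTorch, that tiny bit of weight decay is just an argument to the optimizer; for instance (the model and learning rate here are placeholders):

```python
import torch

model = torch.nn.Linear(4, 1)   # stand-in for whatever network you're training
# even a tiny weight decay (L2 penalty) breaks the tie between the collinear solutions
opt = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-7)
```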

Could you just clarify that point you made about why one-hot encoding wouldn't be that tedious? Sure. If we have a one-hot encoded vector, and we are multiplying it by a set of coefficients, then that's exactly the same thing as simply saying let's grab the thing where the 1 is.

So in other words, if we had stored this as a 0, and this one as a 1, and this one as a 2, then it's exactly the same as just saying look up that thing in the array. And so we call that version an embedding. So an embedding is a weight matrix you can multiply by a one-hot encoding, and it's just a computational shortcut, but it's mathematically the same.
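Here's a two-line numpy check of that equivalence, with made-up numbers:

```python
import numpy as np

W = np.random.randn(4, 3)             # "embedding" matrix: 4 categories, 3 coefficients each
onehot = np.array([0., 0., 1., 0.])   # category stored as 2, one-hot encoded

assert np.allclose(onehot @ W, W[2])  # matrix multiply == just looking up row 2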

So there's a key difference between solving linear type models analytically versus with SGD. With SGD we don't have to worry about collinearity and stuff, or at least not nearly to the same degree, and then the difference between solving a linear or a single layer or multilayer model with SGD versus a tree, a tree is going to complain about less things.

So in particular you can just use ordinals as your categorical variables. And as we learned just before, we also don't have to worry about normalizing continuous variables for a tree, but we do have to worry about it for these SGD-trained models. So then we also learned a lot about interpreting random forests in particular.

And if you're interested, you may be interested in trying to use those same techniques to interpret neural nets. So if you want to know which of my features are important in a neural net, you could try the same thing. Try shuffling each column in turn and see how much it changes your accuracy, and that's going to be your feature importance for your neural net.
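For instance, a model-agnostic permutation importance could be sketched like this; predict, X_val, y_val and metric are placeholders for whatever model and scoring you're using, with a higher metric meaning better:

```python
import numpy as np

def permutation_importance(predict, X_val, y_val, metric):
    """Drop in any model's predict function; works for a neural net as well as a forest."""
    base = metric(y_val, predict(X_val))
    importances = {}
    for col in X_val.columns:
        X_shuf = X_val.copy()
        X_shuf[col] = np.random.permutation(X_shuf[col].values)   # break the column's relationship
        importances[col] = base - metric(y_val, predict(X_shuf))  # how much the score drops
    return importances
```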

And then if you really want to have fun, recognize then that shuffling that column is just a way of calculating how sensitive the output is to that input, which in other words is the derivative of the output with respect to that input. And so therefore maybe you could just ask PyTorch to give you the derivatives with respect to the input directly, and see if that gives you the same kind of answers.

You could do the same kind of thing for a partial dependence plot, you could try doing the exact same thing with your neural net, replace everything in a column with the same value, do it for 1960, 1961, 1962, plot that. I don't know of anybody who's done these things before, not because it's rocket science, but just because I don't know, maybe no one thought of it, or it's not in a library, but if somebody tried it, I think you should find it useful, it would make a great blog post, maybe even a paper if you wanted to take it a bit further.

So that's something somebody could do. So most of those interpretation techniques are not particularly specific to random forests. Things like the tree interpreter certainly are, because they're all about what's inside the tree. Can you pass it to Karen? If we were applying something like the tree interpreter to neural nets, how are we going to make inferences out of the activations along the path it follows, for example?

In the tree interpreter we're looking at the paths and the contributions of the features; in this case, I guess it would be the same with activations, the contributions of each activation along the path. Yeah, maybe. I don't know. I haven't thought about it. How can we make inferences out of the activations?

So I'd be careful saying the word inference, because people normally use the word inference specifically to mean a test time prediction. You might want some kind of way to interrogate the model. I'm not sure. We should think about that. Actually Hinton and one of his students just published a paper on how to approximate a neural net with a tree for this exact reason, though I haven't read the paper yet.

Could you pass that? So in linear regression and traditional statistics, one of the things that we focused on was statistical significance of like the changes and things like that. And so when thinking about a tree interpreter or even like the waterfall chart, which I guess is just a visualization, I guess where does that fit in?

Because we can see like, oh, yeah, this looks important in the sense that it causes large changes. But how do we know that it's like traditionally statistically significant or anything of that sort? Yeah. So most of the time I don't care about the traditional statistical significance, and the reason why is that nowadays the main driver of statistical significance is data volume, not kind of practical importance.

And nowadays most of the models you build will have so much data that like every tiny thing will be statistically significant, but most of them won't be practically significant. So my main focus therefore is practical significance, which is does the size of this influence impact your business? Statistical significance, it was much more important when we had a lot less data to work with.

If you do need to know statistical significance, because for example you have a very small dataset because it's like really expensive to label or hard to collect or whatever, or it's a medical dataset for a rare disease, you can always get statistical significance by bootstrapping, which is to say that you can randomly resample your dataset a number of times, train your model a number of times, and you can then see the actual variation in predictions.
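A sketch of that bootstrap idea; train_df, fit_model and X_test are hypothetical stand-ins for whatever training data, training routine and held-out data you're using:

```python
import numpy as np

n_boot = 100
all_preds = []
for _ in range(n_boot):
    boot = train_df.sample(frac=1.0, replace=True)   # resample the training set with replacement
    model = fit_model(boot)                          # hypothetical: retrain on the resample
    all_preds.append(model.predict(X_test))

all_preds = np.stack(all_preds)                          # shape: (n_boot, n_test)
lo, hi = np.percentile(all_preds, [2.5, 97.5], axis=0)   # 95% interval for each prediction
```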

So with bootstrapping, you can turn any model into something that gives you confidence intervals. There's a paper by Michael Jordan which has a technique called the bag of little bootstraps which actually kind of takes this a little bit further, well worth reading if you're interested. Can you pass it to Prince?

So you said we don't need one-hot encoding matrix if we are doing random forest or if we are doing any tree-based models. What will happen if we do that and how bad can a model be? If you do do one-hot encoding? We actually did do it, remember we had that maximum category size and we did create one-hot encodings and the reason why we did it was that then our feature importance would tell us the importance of the individual levels and our partial dependence plot, we could include the individual levels.

So it doesn't necessarily make the model worse, it may make it better, but it probably won't change it much at all. In this case it hardly changed it. This is something that we have noticed on real data also that if cardinality is higher, let's say 50 levels, and if you do one-hot encoding, the random forest performs very badly.

Yeah, that's right. That's why in fast.ai we have that maximum categorical size, because at some point your one-hot encoded variables become too sparse. So I generally cut it off at 6 or 7. Also because when you get past that it becomes less useful: for the feature importance there are going to be too many levels to really look at.

So can it not look at those levels which are not important and just give those significant features as important? Yeah, it'll be okay. Once the cardinality increases too high you're just splitting your data up too much basically. And so in practice your ordinal version is likely to be better.

There's no time to kind of review everything, but I think that's the key concepts and then of course remembering that the embedding matrix that we can use is likely to have more than just one coefficient, we'll actually have a dimensionality of a few coefficients which isn't going to be useful for most linear models, but once you've got multi-layer models that's now creating a representation of your category which is quite a lot richer and you can do a lot more with it.

Let's now talk about the most important bit. We started off early in this course talking about how actually a lot of machine learning is kind of misplaced. People focus on predictive accuracy, like Amazon has a collaborative filtering algorithm for recommending books, and they end up recommending the book which it thinks you're most likely to rate highly.

And so what they end up doing is probably recommending a book that you already have or that you already know about and would have bought anyway, which isn't very valuable. What they should instead have done is to figure out which book can I recommend that would cause you to change your behavior.

And so that way we actually maximize our lift in sales due to recommendations. And so this idea of the difference between optimizing and influencing your actions versus just improving predictive accuracy is a really important distinction which is very rarely discussed in academia or industry kind of crazy enough. It's more discussed in industry, it's particularly ignored in most of academia.

So it's a really important idea: in the end, the goal of your model presumably is to influence behavior. And remember I actually mentioned a whole paper I have about this where I introduce this thing called the drivetrain approach, where I talk about ways to think about how to incorporate machine learning into actually influencing behavior.

So that's a starting point, but then the next question is like okay if we're trying to influence behavior, what kind of behavior should we be influencing and how and what might it mean when we start influencing behavior? Because nowadays a lot of the companies that you're going to end up working at are big ass companies and you'll be building stuff that can influence millions of people.

So what does that mean? So I'm actually not going to tell you what it means because I don't know, all I'm going to try and do is make you aware of some of the issues and make you believe two things about them. First, that you should care, and second, that they're big current issues.

The main reason I want you to care is because I want you to want to be a good person and show you that not thinking about these things will make you a bad person. But if you don't find that convincing I will tell you this, Volkswagen were found to be cheating on their emissions tests.

The person who was sent to jail for it was the programmer that implemented that piece of code. They did exactly what they were told to do. And so if you're coming in here thinking, "Hey, I'm just a techie, I'll just do what I'm told, that's my job is to do what I'm told." I'm telling you if you do that you can be sent to jail for doing what you're told.

So A) don't just do what you're told because you can be a bad person, and B) you can go to jail. Second thing to realize is in the heat of the moment you're in a meeting with 20 people at work and you're all talking about how you're going to implement this new feature and everybody's discussing it, and everybody's like, "We can do this, and here's a way of modeling it, and then we can implement it, and here's these constraints." And there's some part of you that's thinking, "Am I sure we should be doing this?" That's not the right time to be thinking about that, because it's really hard to step up then and say, "Excuse me, I'm not sure this is a good idea." You actually need to think about how you would handle that situation ahead of time.

So I want you to think about these issues now and realize that by the time you're in the middle of it, you might not even realize it's happening. It'll just be a meeting, like every other meeting, and a bunch of people will be talking about how to solve this technical question.

And you need to be able to recognize, "Oh, this is actually something with ethical implications." So Rachel actually wrote all of these slides, I'm sorry she can't be here to present this because she's studied this in depth, and she's actually been in difficult environments herself where she's kind of seen these things happening.

We know how hard it is, but let me give you a sense of what happens. So engineers trying to solve engineering problems and causing problems is not a new thing. So in Nazi Germany there was IBM, the group known as Hollerith. Hollerith was the original name of IBM, and it comes from the guy who actually invented the use of punch cards for tracking the US Census, the first mass, wide-scale use of punch cards for data collection in the world.

And that turned into IBM. So at this point, this unit was still called Hollerith. So Hollerith sold a punch card system to Nazi Germany. And so each punch card would code things like: this is a Jew, 8; a Gypsy, 12; general execution, 4; death by gas chamber, 6. And so here's one of these cards describing the way these various people were to be killed.

And so a Swiss judge ruled that IBM's technical assistance facilitated the tasks of the Nazis in commission of their crimes against humanity. This led to the death of something like 20 million civilians. So according to the Jewish Virtual Library, where I got these pictures and quotes from, their view is that the destruction of the Jewish people became even less important because of the invigorating nature of IBM's technical achievement, only heightened by the fantastical profits to be made.

So this was a long time ago, and hopefully you won't end up working at companies that facilitate genocide. But perhaps you will, because perhaps you'll go to Facebook, who are facilitating genocide right now. And I know people at Facebook who are doing this, and they had no idea they were doing this.

So right now the Rohingya, a Muslim population of Myanmar, are in the middle of a genocide. Babies are being grabbed out of their mothers' arms and thrown into fires. People are being killed, there are hundreds of thousands of refugees. When interviewed, the Myanmar generals doing this say, "We are so grateful to Facebook for letting us know about the Rohingya fake news, that these people are actually not human, that they're actually animals." Now Facebook did not set out to enable the genocide of the Rohingya people in Myanmar.

No, instead what happened is they wanted to maximize impressions and clicks. And so it turns out that for the data scientists at Facebook, their algorithms kind of learned that if you take the kinds of stuff people are interested in and feed them slightly more extreme versions of that, you're actually going to get a lot more impressions.

And the project managers are saying maximize these impressions, and people are clicking and it creates this thing. And so the potential implications are extraordinary and global. And this is something that is literally happening, this is October 2017, it's happening now. Could you pass that back there? So I just want to clarify what was happening here.

So it was the facilitation of fake news or inaccurate media? Let me go into it in more detail. So what happened was in mid-2016, Facebook fired its human editors. So it was humans that decided how to order things on your homepage. Those people got fired and replaced with machine learning algorithms.

And so the machine learning algorithms written by data scientists like you, they had nice clear metrics and they were trying to maximize their predictive accuracy and be like okay, we think if we put this thing higher up than this thing, we'll get more clicks. And so it turned out that these algorithms for putting things on the Facebook news feed had a tendency to say like oh, human nature is that we tend to click on things which stimulate our views and therefore like more extreme versions of things we already see.

So this is great for the Facebook revenue model of maximizing engagement. It looked good on all of their KPIs. And so at the time, there was some negative press about like I'm not sure that the stuff that Facebook is now putting on their trending section is actually that accurate, but from the point of view of the metrics that people were optimizing at Facebook, it looked terrific.

And so way back to October 2016, people started noticing some serious problems. For example, it is illegal to target housing to people of certain races in America. That is illegal. And yet a news organization discovered that Facebook was doing exactly that in October 2016. Again, not because somebody in that data science team said let's make sure black people can't live in nice neighborhoods, but instead they found that their automatic clustering and segmentation algorithm found there was a cluster of people who didn't like African Americans and that if you targeted them with these kinds of ads then they would be more likely to select this kind of housing or whatever.

But the interesting thing is that even after being told about this three times, Facebook still hasn't fixed it. And that is to say these are not just technical issues, they're also economic issues. When you start saying the thing that you get paid for, that is ads, you have to change the way that you structure those so that you either use more people that cost money or you are less aggressive on your algorithms to target people based on minority group status or whatever, that can impact revenues.

So the reason I mention this is that you will likely at some point in your career find yourself in a conversation where you're thinking, I'm not confident this is morally okay, while the person you're talking to is thinking, this is going to make us a lot of money, and you don't ever quite manage to have a successful conversation because you're talking about different things.

And so when you're talking to somebody who may be more experienced and more senior than you and they may sound like they know what they're talking about, just realize that their incentives are not necessarily going to be focused on like how do I be a good person. They're not thinking how do I be a bad person, but the more time you spend in industry in my experience, the more desensitized you kind of get to this stuff of like okay maybe getting promotions and making money isn't the most important thing.

So for example, I've got a lot of friends who are very good at computer vision and some of them have gone on to create startups that seem like they're almost handmade to help authoritarian governments surveil their citizens. And when I ask my friends like have you thought about how this could be used in that way, they're generally kind of offended that I ask, but I'm asking you to think about this.

Wherever you end up working, if you end up creating a startup, tools can be used for good or for evil, and so I'm not saying don't create excellent object tracking and detection tools from computer vision, because you could go on and use that to create a much better surgical intervention robot toolkit, just saying be aware of it, think about it, talk about it.

So here's one I find fascinating, and there's this really cool thing that meetup.com did; this is from a meetup.com talk that's online, where they think about this. They actually thought, you know what, if we built a collaborative filtering system like we learned about in class to help people decide what meetup to go to, it might notice that on the whole in San Francisco, a few more men than women tend to go to techie meetups.

And so it might then start to decide to recommend techie meetups to more men than women, as a result of which, more men will go to techie meetups. As a result of which, when women go to techie meetups, they'll be like oh, this is all men, I don't really want to go to techie meetups.

As a result of which, the algorithm will get new data saying that men like techie meetups better, and so it continues. And so a little bit of that initial push from the algorithm can create this runaway feedback loop and you end up with almost all male techie meetups, for instance.

And so this kind of feedback loop is a subtle issue that you really want to think about when you're considering what behavior you're changing with the algorithm you're building. Another example, which is kind of terrifying, is in a paper where the authors describe how a lot of police departments in the US are now using predictive policing algorithms.

So where can we send officers to find somebody who's about to commit a crime? The algorithm basically feeds back to you the data that you've given it. So if your police department has engaged in racial profiling at all in the past, then it might suggest, slightly more often, maybe you should go to the black neighborhoods to check for people committing crimes.

As a result of which, more of your police officers go to the black neighborhoods. As a result of which, they arrest more black people. As a result of which, the data says that the black neighborhoods are less safe. As a result of which, the algorithm says to the policeman, maybe you should go to the black neighborhoods more often, and so forth.

And this is not some vague possibility of something that might happen in the future; this is documented work from top academics who have carefully studied the data and the theory. This is serious scholarly work saying no, this is happening right now. And so again, I'm sure the people who started creating these predictive policing algorithms didn't think, how do we arrest more black people; hopefully they were actually thinking, gosh, I'd like my children to be safer on the streets, how do I create a safer society?

But they didn't think about this nasty runaway feedback loop. So this one about social network algorithms comes from a recent article in the New York Times about one of my friends, Renee DiResta, and she did something kind of amazing. She set up a second, fake Facebook account, and she was very interested in the anti-vax movement at the time.

So she started following a couple of anti-vaxxers and visited a couple of anti-vaxxer links. And so suddenly her news feed starts getting full of anti-vaxxer news, along with other stuff like chemtrails, and deep state conspiracy theories, and all this stuff. And so she's like, 'huh', starts clicking on those.

And the more she clicked, the more hardcore far-out conspiracy stuff Facebook recommended. So now when Renee goes to that Facebook account, the whole thing is just full of angry, crazy, far-out conspiracy stuff, like that's all she sees. And so if that was your world, then as far as you're concerned, it's just like this continuous reminder and proof of all this stuff.

And so again, to answer your question, this is the kind of runaway feedback loop that ends up telling Myanmar's generals, you know, throughout their Facebook homepage, that these people are animals, along with fake news and whatever else. So a lot of this also comes from bias. And so let's talk about bias specifically.

So bias in image software comes from bias in data. Of the folks I know at Google Brain building computer vision algorithms, very few are people of color. And so when they're training the algorithms with photos of their families and friends, they're training them with very few people of color.

And so when FaceApp decided, we're going to look at lots of Instagram photos to see which ones are upvoted the most, without them necessarily realizing it, the answer was light-colored faces. So then they built a generative model to make you "hotter". And so this is the actual photo, and here is the "hotter" version.

So the "hotter" version is whiter, with narrower nostrils and more European-looking features. And so this did not go down well, to say the least. So again, I don't think anybody at FaceApp said, let's create something that makes people look more white. They just trained it on a bunch of images of the people they had around them.

And this has serious commercial implications as well. They had to pull the feature, and they got a huge amount of negative pushback, as they should. Here's another example: Google Photos created a photo classifier with categories like airplanes, skyscrapers, cars, graduation, and gorillas. So think about how this looks to most people.

Most people look at this, they don't know about machine learning, and they say, what the fuck? Somebody at Google wrote some code to take black people and call them gorillas. That's what it looks like. Now we know that's not what happened. We know what happened is that a team of computer vision experts at Google, with few or no people of color on the team, built a classifier using all the photos they had available to them.

And so when the system came across a person with dark skin, it was like, I've only mainly seen that before amongst gorillas, so I'll put it in that category. So again, the bias in the data creates a bias in the software, and again, the commercial implications were very significant.

Google got a lot of bad PR from this, as they should. This was a photo that somebody put in their Twitter feed, saying, look what Google Photos just decided to do. And you can imagine what happened with the first international beauty contest judged by artificial intelligence: basically it turned out that nearly all the people it picked as beautiful were white.

So you see this bias in image software, thanks to bias in the data, thanks to lack of diversity in the teams building it, and you see the same thing in natural language processing. So here is Turkish: "o" is the third-person pronoun in Turkish, which has no gender, but of course English doesn't really have a widely used un-gendered singular pronoun, so Google Translate converts it to this.

Now there were plenty of people who saw this online and said, literally, so what? It is correctly feeding back the usual usage in English. I know how this is trained, these are like word2vec vectors, it was trained on the Google News corpus and the Google Books corpus, it's just telling us how things are.

And from one point of view, that's entirely true. The biased data used to create this biased algorithm is the actual data of how people have written books and newspaper articles for decades. But does that mean this is the product you want to create? Does it mean this is the product you have to create?

Just because the particular way you've trained the model means it ends up doing this, is this actually the design you want? And can you think of potential negative implications and feedback loops this could create? And if any of these things bother you, then now, lucky you, you have a new cool engineering problem to work on, like how do I create unbiased NLP solutions?

And now there are some start-ups starting to do that and starting to make some money. These are opportunities for you: here's some stuff where people are creating screwed-up societal outcomes because of their shitty models, so okay, you can go and build something better. Another example of the bias in word2vec word vectors: restaurant review systems rank Mexican restaurants worse, because words associated with Mexican tend to co-occur with criminal words in the US press and books more often.

Again, this is a real problem that is happening right now. So Rachel actually did some interesting analysis of the plain word2vec word vectors, where she basically pulled them out and looked at these analogies, based on some research that had been done elsewhere. And so you can see that the word2vec vector directions show that father is to doctor as mother is to nurse, man is to computer programmer as woman is to homemaker, and so forth.
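
If you want to poke at this yourself, here is a minimal sketch, assuming you have gensim installed and the pretrained GoogleNews word2vec vectors downloaded locally; the file name below is an assumption, so substitute whatever vectors you actually have.

```python
# Probe analogy directions in pretrained word2vec vectors with gensim.
from gensim.models import KeyedVectors

# Assumed local path to the pretrained GoogleNews vectors (not provided here).
vecs = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

# "man is to computer_programmer as woman is to ...?"
# most_similar solves the analogy by adding and subtracting word vectors.
print(vecs.most_similar(positive=['woman', 'computer_programmer'],
                        negative=['man'], topn=3))

# "father is to doctor as mother is to ...?"
print(vecs.most_similar(positive=['mother', 'doctor'],
                        negative=['father'], topn=3))
```

The point is just that these associations are sitting right there in the vectors; you don't need anything fancier than vector arithmetic to surface them.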

So it's really easy to see what's in these word vectors, and they're kind of fundamental to much of the NLP software, probably just about all of the NLP software, we use today. ProPublica has actually done a lot of good work in this area. Many judges now have access to sentencing guidelines software.

And so sentencing guidelines software says to the judge, for this individual we would recommend this kind of sentence. And of course a judge doesn't understand machine learning. So they have two choices: either do what it says or ignore it entirely, and some judges fall into each category.

And so for the ones that fall into the do-what-it-says category, here's what happens. Among those that were labeled higher risk, the subset who actually turned out not to re-offend was about a quarter of whites and about half of African Americans.

So like nearly twice as often, people who didn't re-offend were marked as higher risk if they were African Americans, and vice versa. Amongst those that were labeled lower risk but actually did re-offend, it turned out to be about half of the whites and only 28% of the African Americans.
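
This kind of check, comparing error rates across groups, is simple to do once you have predictions and outcomes. Here is a minimal sketch; the dataframe and column names are made up for illustration and are not the actual ProPublica/COMPAS schema.

```python
import pandas as pd

# Toy data: group membership, the model's risk label, and the actual outcome.
df = pd.DataFrame({
    'group':        ['a', 'a', 'a', 'b', 'b', 'b'],
    'labeled_high': [1, 0, 0, 1, 1, 0],
    'reoffended':   [1, 0, 0, 0, 1, 1],
})

def error_rates(g):
    did_not = g[g.reoffended == 0]
    did     = g[g.reoffended == 1]
    return pd.Series({
        # Labeled higher risk but did not re-offend (false positive rate).
        'false_positive_rate': (did_not.labeled_high == 1).mean(),
        # Labeled lower risk but did re-offend (false negative rate).
        'false_negative_rate': (did.labeled_high == 0).mean(),
    })

print(df.groupby('group').apply(error_rates))
```

If the rows come out very different for different groups, as they did in the ProPublica analysis, that's exactly the kind of disparity you want to surface before the model goes anywhere near a courtroom.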

So this is data where I would like to think nobody set out to create something that does this. But when you start with biased data, and the data says that whites and blacks smoke marijuana at about the same rate, yet blacks are jailed for it something like five times more often than whites, then the nature of the justice system in America at the moment is that it's not equal, it's not fair.

And therefore the data that's fed into the machine learning model is basically going to support that status quo. And then, because of the runaway feedback loop, it's just going to get worse and worse. I'll tell you something else interesting about this one, which a researcher called Abe Gong has pointed out: here are some of the questions that are being asked.

So let's take one. Was your father ever arrested? So your answer to that question is going to decide whether you're locked up and for how long. Now as a machine learning researcher, do you think that might improve the predictive accuracy of your algorithm and get you a better R-squared?

It could well, but I don't know. Maybe it does. You try it out and say oh, I've got a better R-squared. So does that mean you should use it? Well there's another question, do you think it's reasonable to lock somebody up for longer because of who their dad was?

And yet these are actual examples of questions that we are asking offenders right now and then putting into a machine learning system to decide what happens to them. So again, whoever designed this was presumably laser-focused on technical excellence, getting the maximum area under the ROC curve, thinking I've found these great predictors that give me another .02, and I guess didn't stop to think, well, is that a reasonable way to decide who goes to jail and for how long?

So putting this together, you can see how this can get more and more scary. Take a company like Taser; Tasers are these devices that basically give you a big electric shock. And Taser has managed to do a great job of creating strong relationships with some academic researchers who seem to say whatever they're told to say, to the extent that now, if you look at the data, it turns out there's a pretty high probability that if you get tased, you will die.

That happens not unusually, and yet the researchers they've paid to look into this have consistently come back and said oh no, it was nothing to do with the Taser; the fact that they died immediately afterwards was totally unrelated, just a random thing that happened. This company now owns 80% of the market for police body cameras, they've started buying computer vision AI companies, and they're going to try to use these police body camera videos to anticipate criminal activity.

And so what does that mean? Is it, okay, I now have some augmented reality display saying tase this person because they're about to do something bad? It's a worrying direction. I'm sure nobody who's a data scientist at Taser, or at the companies they bought, is thinking this is the world I want to help create. But they could find themselves, or you could find yourself, in the middle of this kind of discussion, where it's not explicitly about that topic, but part of you wonders, is this how this could be used? And I don't know exactly what the right thing to do in that situation is, because you can ask, and of course people are going to say no, no, no, no.

So what could you do? You could ask for some kind of written promise, you could decide to leave, you could start doing some research into the legality of things so that you at least protect your own legal situation. I don't know; have a think about how you would respond to that.

So these are some questions that Rachel created as things to think about. If you're looking at building a data product or using a model, you're building that machine learning model for a reason; you're trying to do something. So what bias may be in that data?

Because whatever bias is in that data ends up being a bias in your predictions, which potentially biases the actions you're influencing, which potentially biases the data that comes back to you, and you may create a feedback loop. If the team that built it isn't diverse, what might you be missing?

So for example, the one senior executive at Twitter who raised the alarm about major Russian bot problems, well before the election, was the one black person on the exec team at Twitter, the one. And shortly afterwards they lost their job. Having a more diverse team definitely means having a more diverse set of opinions and beliefs and ideas and things to look for, and so forth.

So non-diverse teams seem to make more of these bad mistakes. Can we audit the code? Is it open source? Have we checked the error rates amongst different groups? Is there a simple rule we could use instead that's extremely interpretable and easy to communicate? And if something goes wrong, do we have a good way to deal with it?

So when we've talked to people about this, a lot of people have come to Rachel and said, I'm concerned about something my organization is doing, what do I do, or, I'm concerned about my toxic workplace, what do I do. And very often Rachel will say, have you considered leaving?

And they will say, I don't want to lose my job. But actually if you can code, you're in 0.3% of the population. If you can code and do machine learning, you're in probably 0.01% of the population. You are massively, massively in demand. So realistically, obviously an organization does not want you to feel like you're somebody who could just leave and get another job, that's not in their interest, but that is absolutely true.

And so one of the things I hope you'll leave this course with is enough self-confidence to recognize that you have the skills to get a job, and particularly once you've got your first job, your second job is an order of magnitude easier. And this is important not just so that you feel you actually have the ability to act ethically, but also because toxic environments are pretty damn common, unfortunately; there are a lot of shitty tech cultures and environments, particularly in the Bay Area.

If you find yourself in one of those environments, the best thing to do is to get the hell out. And if you don't have the self-confidence to think you can get another job, you can get trapped. So it's really important, it's really important to know that you are leaving this program with very in-demand skills, and particularly after you have that first job, you're now somebody with in-demand skills and a track record of being employed in that area.

This is kind of just a broad question, but what are some things you know of that people are doing to treat bias in data? It's a bit of a controversial subject at the moment. Some people are trying to use an algorithmic approach, where they basically try to identify the bias and subtract it out, but the most effective ways I know of are the ones that try to treat it at the data level.

So start with a more diverse team, particularly a team involving people from the humanities, like sociologists, psychologists, economists, people who understand feedback loops and implications for human behavior; they tend to be equipped with good tools for identifying and tracking these kinds of problems, and then you try to incorporate the solutions into the process itself.

Let's just say there isn't some standard process I can point you to and say, here's how to solve it. If there is such a thing, we haven't found it yet. It requires a diverse team of smart people to be aware of the problems and work hard at them, is the short answer.

This is just kind of a general thing, I guess, for the whole class. If you're interested in this stuff, I read a pretty cool book, Jeremy, you've probably heard of it, Weapons of Math Destruction by Cathy O'Neil; it covers a lot of the same stuff, just in more depth.

Thanks for the recommendation, Cathy's great, she's also got a TED talk. I didn't manage to finish the book because it's so damn depressing, I was just like, no more. But yeah, it's very good. Well, that's it, thank you everybody. This has been really intense for me; obviously this was meant to be something I was sharing with Rachel, so I've ended up doing one of the hardest things in my life, which is to teach two people's worth of course on my own, while also looking after a sick wife, having a toddler, doing a deep learning course, and doing all this with a new library that I just wrote.

So I'm looking forward to getting some sleep, but it's been totally worth it because you've been amazing, like I'm thrilled with how you've reacted to the kind of opportunities I've given you and also to the feedback that I've given you. So congratulations. (audience applauds)