Machine Learning 1: Lesson 12
Chapters
0:00 Introduction
1:00 Recap
4:30 Durations
8:55 Zip
21:45 Embedding
25:50 Scaling
30:00 Validation
31:10 Check against
32:30 Create model
33:07 Define embedding dimensionality
38:41 Embedding matrices
45:11 Categorical data
00:00:00.000 |
I thought what we might do today is to finish off where we were in this Rossman notebook 00:00:10.720 |
looking at time series forecasting and structured data analysis. 00:00:16.640 |
And then we might do a little mini-review of everything we've learnt, because believe 00:00:23.720 |
it or not, this is the end, there's nothing more to know about machine learning other 00:00:28.880 |
than everything that you're going to learn next semester and for the rest of your life. 00:00:36.460 |
But anyway, I've got nothing else to teach, so we'll do a little review and then we'll 00:00:42.360 |
cover the most important part of the course, which is thinking about how to use this kind of 00:00:50.640 |
technology appropriately and effectively in a way that's hopefully beneficial. 00:01:02.160 |
So last time we got to the point where we talked a bit about this idea that when we 00:01:08.720 |
were looking at building this competition months open derived variable, that we actually 00:01:15.400 |
truncated it down to be no more than 24 months, and we talked about the reason why being that 00:01:20.240 |
we actually wanted to use it as a categorical variable, because categorical variables, thanks 00:01:24.760 |
to embeddings, have more flexibility in how the neural net can use them. 00:01:37.720 |
So let's keep working through this, because what's happening in this notebook is stuff 00:01:46.080 |
which is probably going to apply to most time series data sets that you work with. 00:01:53.840 |
And as we talked about, although we use df.apply here, this is something where it's running 00:01:59.160 |
a piece of Python code over every row, and that's horrifically slow. 00:02:06.520 |
So we only do that if we can't find a vectorized pandas or numpy function that can do it to 00:02:12.720 |
the whole column at once, but in this case I couldn't find a way to convert a year and 00:02:19.400 |
a week number into a date without using arbitrary Python. 00:02:29.160 |
Also worth remembering this idea of a lambda function, any time you're trying to apply 00:02:35.120 |
a function to every row of something, or every element of a tensor, or something like that, 00:02:40.240 |
if there isn't a vectorized version already, you're going to have to call something like 00:02:44.680 |
dataframe.apply, which will run a function you pass to every element. 00:02:52.040 |
So this is basically a map in functional programming. 00:03:00.120 |
Since very often the function that you want to pass to it is something you're just going 00:03:03.760 |
to use once and then throw it away, it's really common to use this lambda approach. 00:03:09.600 |
So this lambda is creating a function just for the purpose of telling df.apply what to do. 00:03:16.900 |
So we could also have written this in a different way, which would have been to define the function separately and give it a name. 00:04:09.380 |
So one approach is to define the function and then pass it by name, or the other is 00:04:14.060 |
to define the function in place using lambda. 00:04:18.440 |
And so if you're not comfortable creating and using lambdas, it's a good thing to practice, 00:04:25.640 |
and playing around with df.apply is a good way to practice it. 00:04:33.280 |
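As a rough illustration of that (a minimal sketch rather than the exact notebook code; the Year and Week column names and the date format are assumptions), here is the same df.apply done once with a named function and once with a throwaway lambda:

```python
from datetime import datetime

import pandas as pd

# Illustrative data; stands in for the Rossmann year/week columns
df = pd.DataFrame({'Year': [2015, 2015, 2016], 'Week': [1, 30, 52]})

# Option 1: define a named function and pass it by name
def week_to_date(row):
    # ISO year / ISO week / weekday 1 (Monday) -> a date; this runs as a
    # Python-level loop over rows, which is why df.apply is slow
    return datetime.strptime(f"{row['Year']}-W{int(row['Week']):02d}-1", "%G-W%V-%u")

df['Date'] = df.apply(week_to_date, axis=1)

# Option 2: the same thing as a throwaway lambda defined in place
df['Date'] = df.apply(
    lambda row: datetime.strptime(f"{row['Year']}-W{int(row['Week']):02d}-1",
                                  "%G-W%V-%u"),
    axis=1)
```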
So let's talk about this durations section, which may at first seem a little specific, 00:04:46.360 |
What we're going to do is we're going to look at three fields: promo, state holiday and school holiday. 00:04:55.200 |
And so basically what we have is a table for each store, for each date, does that store have a promo running on that date? 00:05:07.420 |
Is there a school holiday in that region of that store at that date? 00:05:12.600 |
Is there a state holiday in that region for that store at that date? 00:05:17.080 |
And so this kind of thing is, there are events, and time series with events are very common. 00:05:26.840 |
If you're looking at oil and gas drilling data, you're trying to say the flow through 00:05:32.200 |
this pipe, here's an event representing when it set off some alarm, or here's an event 00:05:42.800 |
And so most time series at some level will tend to represent some events. 00:05:49.760 |
So the fact that an event happened at a time is interesting itself, but very often a time 00:05:59.600 |
series will also show something happening before and after the event. 00:06:05.800 |
So for example, in this case we're doing grocery sales prediction. 00:06:10.660 |
If there's a holiday coming up, it's quite likely that sales will be higher before and 00:06:15.920 |
after the holiday, and lower during the holiday if this is a city-based store, because you're 00:06:23.960 |
going to stock up before you go away to bring things with you, and when you come back you've run out, so you need to stock up again. 00:06:34.360 |
So although we don't necessarily have to do this kind of feature engineering to create 00:06:41.240 |
features specifically about this is before or after a holiday, the more 00:06:48.200 |
we can give the neural net the kind of information it needs, the less it's going to have to learn 00:06:54.120 |
it itself, the more we can do with the data we already 00:06:58.360 |
have, and the more we can do with the size of architecture we already have. 00:07:03.880 |
So feature engineering, even with stuff like neural nets, is still important because it 00:07:10.920 |
means that we'll be able to get better results with whatever limited data we have, whatever 00:07:21.240 |
So the basic idea here, therefore, is when we have events in our time series as we want 00:07:26.480 |
to create two new columns for each event, how long is it going to be until the next 00:07:32.240 |
time this event happens, and how long has it been since the last time that event happened. 00:07:38.160 |
So in other words, how long until the next state holiday, how long since the previous 00:07:44.840 |
So that's not something which I'm aware of as existing as a library or anything like that. 00:07:55.360 |
And so importantly, I need to do this by store. 00:08:03.120 |
For this store, when was this store's last promo, so how long has it been since the last 00:08:08.720 |
time it had a promo, how long it will be until the next time it has a promo, for instance. 00:08:17.560 |
So here's what I'm going to do, I'm going to create a little function that's going to 00:08:23.720 |
take a field name, and I'm going to pass it each of promo and then state holiday and then school holiday. 00:08:31.200 |
So we'll say field = school holiday, and then we'll call get_elapsed with that field and 'After'. 00:08:41.920 |
So we've got a first of all sort by store and date. 00:08:47.000 |
So now when we loop through this, we're going to be looping through within a store, so store 00:08:51.120 |
number 1, January the 1st, January the 2nd, January the 3rd, and so forth. 00:08:57.120 |
And as we loop through each store, we're basically going to say, is this row a school holiday 00:09:05.920 |
And if it is a school holiday, then we'll keep track of this variable called last_date, which 00:09:10.040 |
says this is the last date where we saw a school holiday. 00:09:15.720 |
And so then we're basically going to append to our result the number of days since the last school holiday. 00:09:32.600 |
So I could actually write this much more simply. 00:09:36.800 |
I could basically go through for row in df.iterrows() and then grab the fields we want from each row. 00:09:55.360 |
It turns out this is 300 times slower than the version that I have. 00:10:01.280 |
And basically iterating through a data frame and extracting specific fields out of a row like that is really slow. 00:10:14.800 |
What's much faster is to iterate through a numpy array. 00:10:22.080 |
So if you take a series like df.store and add .values after it, that grabs a numpy array out of that series, so I create three of those arrays. 00:10:32.920 |
One is the store IDs, one is whatever field we're interested in, in this case let's say school_holiday, and one is the dates. 00:10:43.600 |
So now what I want to do is loop through the first one of each of those lists, and then 00:10:50.520 |
the second one of each of those lists, and then the third one of each of those lists. 00:10:55.920 |
I need to do something like this in basically every notebook I write, and the way to do that is with zip. 00:11:02.640 |
So zip means loop through each of these lists one at a time, and then this here is where 00:11:09.920 |
we can grab that element out of the first list, the second list and the third list. 00:11:16.440 |
So if you haven't played around much with zip, that's a really important function to get familiar with. 00:11:22.360 |
Like I say, I use it in pretty much every notebook I write all the time. 00:11:28.660 |
Any time you have to loop through a bunch of lists at the same time, zip is the way to do it. 00:11:36.080 |
So we're going to loop through every store, every school_holiday, every date, yes. 00:11:45.580 |
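Here is a tiny, hedged illustration of that zip pattern; the column names are just stand-ins for the real Rossmann columns:

```python
import pandas as pd

# Illustrative data frame standing in for the real training set
df = pd.DataFrame({'Store':         [1, 1, 2],
                   'SchoolHoliday': [0, 1, 0],
                   'Date':          pd.to_datetime(['2015-01-01',
                                                    '2015-01-02',
                                                    '2015-01-01'])})

# .values gives plain numpy arrays, which are far faster to loop over
# than df.iterrows()
for store, holiday, date in zip(df.Store.values,
                                df.SchoolHoliday.values,
                                df.Date.values):
    print(store, holiday, date)
```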
So in this case we basically want to say let's grab the first store, the first school_holiday, and the first date. 00:12:03.240 |
So for store_1, January 1st school_holiday was true or false. 00:12:10.520 |
And so if it is a school_holiday, I'll keep track of that fact by saying the last date I saw a school_holiday was this date. 00:12:19.360 |
And then append, how long has it been since the last school_holiday? 00:12:25.320 |
And if the store_id is different to the last store_id I saw, then I've now got to a whole 00:12:32.040 |
new store, in which case I have to basically reset everything. 00:12:40.280 |
What will happen to the first points that we don't have the last holiday? 00:12:46.460 |
I basically set this to some arbitrary starting point, so it's going to end up with a very large number of days for those first rows. 00:13:00.280 |
And you may need to replace this with a missing value afterwards, or some zero, or whatever. 00:13:15.320 |
The nice thing is, thanks to values, it's very easy for a neural net to kind of cut 00:13:24.860 |
So in this case I didn't do anything special with it, I ended up with a negative a billion 00:13:36.300 |
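Putting those pieces together, here is a sketch in the spirit of that get_elapsed function (the details may differ from the actual notebook); it assumes the data frame is sorted by store and then date:

```python
import numpy as np
import pandas as pd

def get_elapsed(df, fld, prefix):
    """For each row, days since the last time df[fld] was true, per store."""
    one_day = np.timedelta64(1, 'D')
    start = np.datetime64('1900-01-01')   # arbitrary starting point -> huge values early on
    last_date = start
    last_store = None
    res = []
    for s, v, d in zip(df.Store.values, df[fld].values, df.Date.values):
        if s != last_store:                # moved on to a new store: reset everything
            last_date = start
            last_store = s
        if v:                              # the event happened on this row
            last_date = d
        res.append((d - last_date) / one_day)   # days since the last event
    df[prefix + fld] = res

# Sorting by Store and Date ascending gives "days since the last event";
# sorting the dates descending and calling it again gives "days until the next".
```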
So we can go through, and the next thing to note is there's a whole bunch of stuff that 00:13:42.720 |
I need to do to both the training set and the test set. 00:13:46.400 |
So in the previous section I actually kind of added this little loop where I go for each 00:13:51.760 |
of the training data frame and the test data frame do these things. 00:13:58.640 |
So each cell I did for each of the data frames. 00:14:02.920 |
I've now got a whole series of cells that I want to run first of all for the training set, and then for the test set. 00:14:12.720 |
So in this case the way I did that was I had two different cells here. 00:14:16.340 |
One which sets df to be the training set, and one which sets it to be the test set. 00:14:20.760 |
So the way I use this is I run just this cell, and then I run all the cells underneath, so 00:14:28.280 |
it does it all for the training set, and then I come back and run just this cell and then run everything underneath again for the test set. 00:14:35.240 |
So this notebook is not designed to be just run from top to bottom, but it's designed to be used interactively. 00:14:43.280 |
And I mention that because this can be a handy trick to know, you could of course put all 00:14:49.720 |
the stuff underneath in a function that you pass the data frame to and call it once with 00:14:54.640 |
the test set and once with the training set, but I kind of like to experiment a bit more 00:15:00.640 |
interactively, look at each step as I go, so this way is an easy way to kind of run something 00:15:05.620 |
on two different data frames without turning it into a function. 00:15:12.560 |
So if I sort by store and by date, then this is keeping track of the last time something 00:15:20.160 |
happened, and so this is therefore going to end up telling me how many days it has been since the last event. 00:15:28.080 |
So now if I sort date descending and call the exact same function, then it's going to tell me how many days it will be until the next event. 00:15:41.600 |
So that's a nice little trick for adding these kind of arbitrary event timers into your time 00:15:49.880 |
So if you're doing, for example, the Ecuadorian Groceries competition right now, maybe this 00:15:55.840 |
kind of approach would be useful for various events in that as well. 00:16:01.160 |
Do it for state holiday, do it for promo, here we go. 00:16:11.760 |
The next thing that we look at here is rolling functions. 00:16:19.120 |
So rolling in pandas is how we create what we call windowing functions. 00:16:32.600 |
Let's say I had some data, something like this, and this is like date, and I don't know 00:16:51.960 |
this is like sales or whatever, what I could do is I could say let's create a window around 00:17:00.640 |
this point of like 7 days, so it would be like okay, this is a 7 day window. 00:17:11.480 |
And so then I could take the average sales in that 7 day window, and I could do the same 00:17:17.880 |
thing like I don't know, over here, take the average sales over that 7 day window. 00:17:26.180 |
And so if we do that for every point and join up those averages, you're going to end up with a moving average. 00:17:35.820 |
So the more generic version of the moving average is a window function, i.e. something 00:17:46.020 |
where you apply some function to some window of data around each point. 00:17:52.440 |
Now very often the windows that I've shown here are not actually what you want, if you're 00:17:58.240 |
trying to build a predictive model you can't include the future as part of a moving average. 00:18:04.800 |
So quite often you actually need a window that ends here, so that would be our window 00:18:14.880 |
And so Pandas lets you create arbitrary window functions using this rolling here. 00:18:25.640 |
This here says how many time steps do I want to apply the function to. 00:18:32.400 |
This here says if I'm at the edge, so in other words if I'm like out here, should you make 00:18:39.200 |
that a missing value because I don't have 7 days to average over, or what's the minimum number of time steps to use. 00:18:48.480 |
So here I said 1, and then optionally you can also say do you want to set the window at 00:18:55.080 |
the start of a period, or the end of a period, or the middle of the period. 00:19:02.120 |
And then within that you can apply whatever function you like. 00:19:13.000 |
So there's a nice easy way of getting moving averages, or whatever else. 00:19:20.920 |
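As a minimal example of that (assuming a simple integer window of 7 observations, which in pandas is a trailing window by default, so it doesn't look into the future):

```python
import pandas as pd

# Made-up daily sales series
sales = pd.Series([10, 12, 9, 14, 20, 18, 15, 22],
                  index=pd.date_range('2015-01-01', periods=8))

# 7-observation trailing rolling mean; min_periods=1 means the first few
# points use whatever data is available instead of becoming missing values
rolling_mean = sales.rolling(7, min_periods=1).mean()
print(rolling_mean)
```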
And I should mention, in Pandas, if you go to the time series page of the documentation, there's 00:19:29.400 |
a huge amount there; just look at the index of the time series functionality section, all of this. 00:19:40.880 |
Because Wes McKinney, who created this, was originally in hedge fund trading, I believe, 00:19:50.840 |
And so I think Pandas originally was very focused on time series, and it's still perhaps strongest there. 00:19:58.640 |
So if you're playing around with time series computations, you definitely owe it to yourself to read through this whole page. 00:20:08.880 |
And there's a lot of conceptual pieces around time stamps, and date offsets, and resampling, 00:20:19.280 |
and stuff like that to kind of get your head around, but it's totally worth it because 00:20:23.880 |
otherwise you'll be writing this stuff as loops by hand, it's going to take you a lot 00:20:28.240 |
longer than leveraging what Pandas already does, and of course Pandas will do it in highly 00:20:34.760 |
optimized C code for you, vectorized C code, whereas your version is going to loop in Python. 00:20:41.040 |
So definitely worth, if you're doing stuff in time series learning, the full Pandas time 00:20:47.840 |
series API is about as strong as any time series API out there. 00:20:54.960 |
Okay, so at the end of all that, you can see here's those kind of starting point values 00:21:01.280 |
I mentioned, slightly on the extreme side, and so you can see here the 17th of September 00:21:10.560 |
store 1 was 13 days after the last school holiday, the 16th was 12, then 11, 10, and so forth. 00:21:19.080 |
We're currently in a promotion, here this is one day before the promotion, here we've 00:21:26.560 |
got 9 days after the last promotion, and so forth. 00:21:32.640 |
So that's how we can add kind of event counters to a time series, and probably always a good 00:21:40.840 |
idea when you're doing work with time series. 00:21:46.780 |
So now that we've done that, we've got lots of columns in our dataset, and so we split 00:21:53.360 |
them out into categorical versus continuous columns, we'll talk more about that in a moment 00:22:01.480 |
So these are going to be all the things I'm going to create an embedding for. 00:22:05.480 |
And these are all of the things that I'm going to feed directly into the model. 00:22:11.960 |
So for example, we've got competition distance, that's distance to the nearest competitor, 00:22:18.860 |
maximum temperature, and here we've got day of week. 00:22:31.320 |
So here we've got maximum temperature, maybe it's like 22.1, centigrade in Germany, we've 00:22:41.160 |
got distance to nearest competitor, might be 321 kilometers, 0.7, and then we've got day of week. 00:22:58.760 |
So these numbers here are going to go straight into our vector, the vector that we're going to feed into our neural net. 00:23:16.480 |
We'll see in a moment we'll normalize them, but more or less. 00:23:29.200 |
But this categorical variable we're not, we need to put it through an embedding. 00:23:34.280 |
So we'll have some embedding matrix of, if there are 7 days, maybe dimension 4 embedding, 00:23:45.280 |
and so this will look up the 6th row to get back the 4 items. 00:23:52.920 |
And so this is going to turn into length 4 vector, which we'll then add here. 00:24:09.280 |
So that's how our continuous and categorical variables are going to work. 00:24:22.120 |
So then all of our categorical variables, we'll turn them into Pandas categorical variables. 00:24:33.920 |
And then we're going to apply the same mappings to the test set. 00:24:38.000 |
So if Saturday is a 6 in the training set, this apply_cats makes sure that Saturday is a 6 in the test set as well. 00:24:48.040 |
For the continuous variables, make sure they're all floats because PyTorch expects everything to be a float. 00:24:57.960 |
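Here is a plain-pandas sketch of that idea (apply_cats itself is a fast.ai helper; this just shows the underlying mechanism with made-up data):

```python
import pandas as pd

train = pd.DataFrame({'Day': ['Mon', 'Sat', 'Sun']})
test = pd.DataFrame({'Day': ['Sat', 'Mon']})

# Train defines the categories (and hence the integer codes)
train['Day'] = train['Day'].astype('category')

# Reuse exactly the same categories on the test set
test['Day'] = pd.Categorical(test['Day'],
                             categories=train['Day'].cat.categories)

print(train['Day'].cat.codes.tolist())   # e.g. [0, 1, 2]
print(test['Day'].cat.codes.tolist())    # Sat and Mon map to the same codes as in train
```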
So then, this is another little trick that I use. 00:25:03.080 |
Both of these cells define something called joined_samp. 00:25:07.760 |
One of them defines it as a random sample, and the other defines it as the whole training set. 00:25:16.200 |
And so the idea is that I do all of my work on the sample, make sure it all works well, 00:25:21.560 |
play around with different hyperparameters and architectures, and then I'm like, "Okay, 00:25:26.880 |
I then go back and run this line of code to say, "Okay, now make joined_samp be the whole data set." 00:25:36.000 |
This is a good way, again, similar to what I showed you before, it lets you use the same 00:25:40.400 |
cells in your notebook to run first of all on a sample, and then go back later and run 00:25:52.280 |
So now that we've got that joined_samp, we can then pass it to proc_df as we've done 00:25:57.200 |
before to grab the dependent variable and to deal with missing values, and in this case we also pass do_scale=True. 00:26:09.080 |
do_scale=True will subtract the mean and divide by the standard deviation. 00:26:18.240 |
And so the reason for that is that if our first layer is just a matrix multiply, so 00:26:25.720 |
here's our set of weights, and our input is like, I don't know, it's got something which 00:26:32.520 |
is like 0.001, and then it's got something which is like 10^6, and then our weight matrix 00:26:41.440 |
has been initialized to be like random numbers between 0 and 1, so we've got 0.6, 0.1, etc. 00:26:50.720 |
Then basically this thing here is going to have gradients that are 9 orders of magnitude 00:26:57.240 |
bigger than this thing here, which is not going to be good for optimization. 00:27:03.720 |
So by normalizing everything to be mean of 0, standard deviation of 1 to start with, 00:27:10.520 |
then that means that all of the gradients are going to be on the same kind of scale. 00:27:19.320 |
We didn't have to do that in random forests, because in random forests we only cared about the sort order. 00:27:26.220 |
We didn't care about the values at all, but with linear models and things that are built 00:27:33.920 |
out of layers of linear models, like neural nets, we care very much about the scale. 00:27:45.840 |
Now since it normalizes our data for us, it returns one extra object, which is a mapper, 00:27:52.080 |
which is an object that contains for each continuous variable what was the mean and 00:27:56.960 |
standard deviation it was normalized with, the reason being that we're going to have 00:28:02.740 |
to use the same mean and standard deviation on the test set, because we need our test 00:28:09.480 |
set and our training set to be scaled in the exact same way, otherwise they're going to mean different things. 00:28:16.000 |
And so these details about making sure that your test and training set have the same categorical 00:28:23.240 |
codings, the same missing value replacement and the same scaling normalization are really 00:28:30.560 |
important to get right, because if you don't get it right then your test set is not going to be treated the same way as your training set. 00:28:41.080 |
But if you follow these steps, it'll work fine. 00:28:45.280 |
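A minimal sketch of that scaling step, with made-up numbers, to show that the test set is scaled with the training set's statistics rather than its own:

```python
import numpy as np

train_col = np.array([0.001, 1e6, 500.0, 2.5])   # one continuous column, training set
test_col = np.array([3.0, 2e5])                  # same column, test set

# The "mapper" stores these training-set statistics
mean, std = train_col.mean(), train_col.std()

train_scaled = (train_col - mean) / std
test_scaled = (test_col - mean) / std            # reuse train's stats, not the test set's own
```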
We also take the log of the dependent variable, and that's because in this Kaggle competition 00:28:51.340 |
the evaluation metric was root mean squared percent error. 00:28:55.960 |
So root mean squared percent error means we're being penalized based on the ratio between the actual value and the prediction. 00:29:07.120 |
We don't have a loss function in PyTorch called root mean squared percent error. 00:29:12.160 |
We could write one, but easier is just to take the log of the dependent, because the 00:29:17.040 |
difference between logs is the same as the log of the ratio. 00:29:24.180 |
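A quick numeric check of that claim, with made-up numbers:

```python
import numpy as np

# log(pred) - log(actual) = log(pred / actual), so RMSE on the logged
# dependent variable penalizes ratios, much like RMSPE does.
actual = np.array([100.0, 1000.0])
pred = np.array([110.0, 1100.0])        # both predictions are 10% too high

rmspe = np.sqrt(np.mean(((actual - pred) / actual) ** 2))
rmse_of_logs = np.sqrt(np.mean((np.log(pred) - np.log(actual)) ** 2))
print(rmspe, rmse_of_logs)              # both treat the two errors as equally bad
```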
You'll notice the vast majority of regression competitions on Kaggle use either root mean 00:29:32.360 |
squared percent error or root mean squared error of the log as their evaluation metric, 00:29:37.680 |
and that's because in real-world problems most of the time we care more about ratios than about absolute differences. 00:29:46.140 |
So if you're designing your own project, it's quite likely that you'll want to think about 00:30:01.080 |
So then we create a validation set, and as we've learned before, most of the time if you've 00:30:06.700 |
got a problem involving a time component, your validation set probably wants to be the most 00:30:12.720 |
recent time period rather than a random subset, so that's what I do here. 00:30:20.240 |
When I finished modeling and I found an architecture and a set of hyperparameters and a number 00:30:24.880 |
of epochs and all that stuff that works really well, if I want to make my model as good as 00:30:29.640 |
possible I'll retrain on the whole thing, including the validation set. 00:30:36.400 |
Now currently at least fastAI assumes that you do have a validation set, so my kind of 00:30:41.640 |
hacky workaround is to set my validation set to just be one index, which is the first row, 00:30:48.000 |
in that way all the code keeps working but there's no real validation set. 00:30:53.000 |
So obviously if you do this you need to make sure that your final training is like the 00:30:59.040 |
exact same hyperparameters, the exact same number of epochs, exactly the same as the thing 00:31:04.040 |
that worked, because you don't actually have a proper validation set now to check against. 00:31:08.960 |
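A sketch of both of those ideas with a stand-in data frame (the 10% cutoff is arbitrary):

```python
import numpy as np
import pandas as pd

# Stand-in for the date-sorted training frame
df = pd.DataFrame({'Date': pd.date_range('2015-01-01', periods=100),
                   'Sales': np.random.rand(100)}).sort_values('Date')

n = len(df)
val_idx = list(range(int(n * 0.9), n))   # last 10% of rows = most recent dates

# For the final retrain on all the data (keeping every other setting identical):
# a dummy one-row "validation set" so the library's plumbing still works.
# val_idx = [0]
```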
I have a question regarding get elapsed function which we discussed before, so in get elapsed 00:31:17.680 |
function we are trying to find when will the next holiday come? 00:31:28.160 |
So every year the holidays are more or less fixed, like there will be holiday on 4th of 00:31:33.520 |
July, 25th of December and there's hardly any change. 00:31:37.440 |
So can't we just look at previous years and just get a list of all the holidays that are going to happen? 00:31:46.480 |
Maybe, in this case I guess that's not true of promo, and some holidays change, like Easter, 00:31:56.200 |
so this way I get to write one piece of code that works for all of them, and it doesn't take too long to run anyway. 00:32:08.400 |
So there might be ways, if your dataset was so big that this took too long you could maybe 00:32:13.160 |
do it on one year and then somehow copy it, but in this case there was no need to. 00:32:18.160 |
And I always value my time over my computer's time, so I try to keep things as simple as I can. 00:32:31.320 |
So now we can create our model, and so to create our model we have to create a model 00:32:37.160 |
data object, as we always do with fast.ai, so a columnar model object is just a model 00:32:42.680 |
data object that represents a training set, a validation set, and an optional test set 00:32:53.560 |
We just have to tell it which of the variables should we treat as categorical. 00:33:07.720 |
So for each of our categorical variables, here is the number of categories it has. 00:33:17.540 |
So for each of our embedding matrices, this tells us the number of rows in that embedding matrix. 00:33:26.560 |
And so then we define what embedding dimensionality we want. 00:33:34.800 |
If you're doing natural language processing, then the number of dimensions you need to 00:33:39.680 |
capture all the nuance of what a word means and how it's used has been found empirically to be about 600. 00:33:48.560 |
It turns out that when you do NLP models with embedding matrices that are smaller than 600, 00:33:59.400 |
you don't get as good results as you do with size 600, and beyond 600 it doesn't seem to help much. 00:34:07.280 |
I would say that human language is one of the most complex things that we model, so 00:34:14.040 |
I wouldn't expect you to come across many, if any, categorical variables that need embedding dimensionalities bigger than that. 00:34:25.120 |
At the other end, some things may have pretty simple kind of causality. 00:34:33.960 |
So for example, state holiday, maybe if something's a holiday, then it's just a case of stores 00:34:49.860 |
that are in the city behaving one way, and stores that are in the country behaving another way. 00:35:02.040 |
So ideally when you decide what embedding size to use, you would kind of use your knowledge 00:35:11.280 |
about the domain to decide how complex the relationship is, and so how big the embedding needs to be. 00:35:24.920 |
You would only know that because maybe somebody else has previously done that research and 00:35:32.160 |
So in practice, you probably need to use some rule of thumb, and then having tried your 00:35:38.680 |
rule of thumb, you could then maybe try a little bit higher and a little bit lower and 00:35:43.000 |
see what helps, so it's kind of experimental. 00:35:46.880 |
My rule of thumb is look at how many discrete values the category has, i.e. the number of 00:35:54.760 |
rows in the embedding matrix, and make the dimensionality of the embedding half of that. 00:36:00.840 |
So for day of week, which is the second one, 8 rows and 4 columns. 00:36:10.160 |
So here it is there, the number of categories divided by 2. 00:36:18.000 |
So here you can see for stores, there's 1000 stores, I only have a dimensionality of 50. 00:36:23.000 |
I don't know, it seems to have worked okay so far. 00:36:26.100 |
You may find you need something a little different. 00:36:29.400 |
Actually for the Ecuadorian groceries competition, I haven't really tried playing with this, but 00:36:34.400 |
I think we may need some larger embedding sizes, but it's something to fiddle with. 00:36:44.840 |
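As a sketch of that rule of thumb (the cap of 50 and the exact (name, cardinality) pairs below are illustrative, not necessarily what the notebook uses):

```python
# Embedding width of roughly half the cardinality, capped at 50
cat_sz = [('Store', 1116), ('DayOfWeek', 8), ('StateHoliday', 5)]

emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]
print(emb_szs)   # [(1116, 50), (8, 4), (5, 3)]
```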
So as your variables, the cardinality size becomes larger and larger, you're creating 00:36:50.200 |
more and more or wider embedding matrices, aren't you therefore massively risking overfitting, 00:36:57.200 |
because you're just choosing so many parameters that the model can never possibly capture 00:37:00.280 |
all that variation unless your data is absolutely huge? 00:37:04.800 |
And so let me remind you about my kind of golden rule of the difference between modern machine learning and old machine learning. 00:37:13.280 |
In old machine learning, we control complexity by reducing the number of parameters. 00:37:18.160 |
In modern machine learning, we control complexity by regularization. 00:37:22.240 |
So the answer is no, I'm not concerned about overfitting, because the way I avoid overfitting 00:37:27.640 |
is not by reducing the number of parameters, but by increasing my dropout or increasing my weight decay. 00:37:37.960 |
Having said that, there's no point using more parameters for a particular embedding than 00:37:44.240 |
I need, because regularization is penalizing a model by giving it more random data or by 00:37:52.720 |
actually penalizing weights, so we'd rather not use more than we have to. 00:37:59.520 |
But my general rule of thumb for designing an architecture is to be generous on the side of more parameters. 00:38:08.400 |
But in this case, if after doing some work we felt like, you know what, the store doesn't 00:38:16.560 |
actually seem to be that important, then I might manually go and change this to make its embedding smaller. 00:38:23.360 |
Or if I was really finding there's not enough data here, I'm either overfitting or I'm using 00:38:28.880 |
more regularization than I'm comfortable with, again, then you might go back. 00:38:32.960 |
But I would always start with being generous with parameters, and in this case, this model worked fine. 00:38:42.440 |
So now we've got a list of tuples containing the number of rows and columns of each of our embedding matrices. 00:38:48.240 |
And so when we call get-learner to create our neural net, that's the first thing we 00:38:52.480 |
pass in, is how big is each of our embeddings. 00:38:58.360 |
And then we tell it how many continuous variables we have. 00:39:02.760 |
We tell it how many activations to create for each layer, and we tell it what dropout to use. 00:39:18.600 |
So then we fit for a while, and we're kind of getting something around the 0.1 mark. 00:39:25.620 |
So I tried running this on the test set and I submitted it to Kaggle during the week, actually 00:39:46.580 |
So let's have a look and see how that would go. 00:39:48.860 |
So we got 0.107 private and 0.103 public, so let's start on the public leaderboard, which is 0.103: not up near the top at all. 00:40:25.960 |
Let's try the private leaderboard, which is 0.107. Oh, 5th. 00:40:36.960 |
So hopefully you're now thinking, oh, there are some Kaggle competitions finishing soon, 00:40:42.600 |
which I entered, and I spent a lot of time trying to get good results on the public leaderboard. 00:40:51.600 |
The Kaggle public leaderboard is not meant to be a replacement for your carefully developed validation set. 00:41:01.300 |
So for example, if you're doing the iceberg competition, which ones are ships, which ones 00:41:06.600 |
are icebergs, then they've actually put something like 4000 synthetic images into the public 00:41:13.520 |
leaderboard and none into the private leaderboard. 00:41:18.080 |
So this is one of the really good things that Kaggle tests you on, which is, are you creating 00:41:27.880 |
a good validation set and are you trusting it? 00:41:30.880 |
Because if you're trusting your leaderboard feedback more than your validation feedback, 00:41:36.960 |
then you may find yourself in 350th place when you thought you were in 5th. 00:41:43.200 |
So in this case, we actually had a pretty good validation set, because as you can see, 00:41:47.920 |
it's saying somewhere around 0.1, and we actually did get somewhere around 0.1. 00:41:55.840 |
And so in this case, the public leaderboard in this competition was entirely useless. 00:42:07.760 |
So in regards to that, how much does the top of the public leaderboard actually correspond to the private leaderboard? 00:42:14.880 |
Because in the churn prediction challenge, there's like four people who are just completely ahead of everybody else. 00:42:26.960 |
If they randomly sample the public and private leaderboard, then it should be extremely indicative. 00:42:45.360 |
So in this case, the person who was second on the public leaderboard did end up winning. 00:43:02.440 |
So in fact you can see the little green thing here, whereas this guy jumped 96 places. 00:43:11.520 |
If we had entered with a neural net, we just looked at it, we would have jumped 350 places. 00:43:18.060 |
And so often you can figure out whether the public leaderboard -- like sometimes they'll 00:43:24.440 |
tell you the public leaderboard was randomly sampled, sometimes they'll tell you it's not. 00:43:29.000 |
Generally you have to figure it out by looking at the correlation between your validation 00:43:33.440 |
set results and the public leaderboard results to see how well they're correlated. 00:43:40.880 |
Sometimes if two or three people are way ahead of everybody else, they may have found some 00:43:57.040 |
So that's Rossman, and that brings us to the end of all of our material. 00:44:06.440 |
So let's come back after the break and do a quick review, and then we will talk about using this technology appropriately. 00:44:29.480 |
So there are two main ways we've learned to train a model: one is by building a tree, and one is with SGD. 00:44:36.280 |
And so the SGD approach is a way we can train a model which is a linear model or a stack 00:44:46.240 |
of linear layers with nonlinearities between them, whereas tree building specifically will 00:44:55.800 |
And then tree building we can combine with bagging to create a random forest, or with 00:45:01.680 |
boosting to create a GBM, or various other slight variations such as extremely randomized trees. 00:45:12.200 |
So it's worth reminding ourselves of what these things do. 00:45:33.320 |
So if we've got some data like so, actually let's look specifically at categorical data. 00:45:48.640 |
So categorical data, there's a couple of possibilities of what categorical data might look like. 00:45:54.500 |
It could be like, let's say we've got zip code, so we've got 94003 as our zip code, 00:46:01.440 |
and then we've got sales, and it's like 50, and 94131, sales of 22, and so forth. 00:46:18.080 |
So there's a couple of ways we could represent that categorical variable. 00:46:23.340 |
One would be just to use the number, and maybe it wasn't a number at all, maybe our categorical 00:46:30.960 |
variable is like San Francisco, New York, Mumbai, and Sydney. 00:46:39.960 |
But we can turn it into a number just by arbitrarily deciding to give them numbers. 00:46:47.840 |
So we could just use that kind of arbitrary number. 00:46:51.080 |
So if it turns out that zip codes that are numerically next to each other have somewhat 00:46:59.660 |
similar behavior, then the zip code versus sales chart might look something like this. 00:47:16.340 |
Or alternatively, if the two zip codes next to each other didn't have in any way similar 00:47:28.720 |
sales behavior, you would expect to see something that looked more like this, just all over 00:47:44.960 |
So what a random forest would do if we had just encoded zip in this way is it's going 00:47:50.740 |
to say, alright, I need to find my single best split point. 00:47:56.880 |
The split point is going to make the two sides have as small a standard deviation as possible, 00:48:03.520 |
or mathematically equivalently have the lowest root mean squared error. 00:48:08.040 |
So in this case it might pick here as our first split point, because on this side there's 00:48:18.240 |
one average, and on the other side there's the other average. 00:48:23.880 |
And then for its second split point it's going to say, okay, how do I split this? 00:48:29.400 |
And it's probably going to say I would split here, because now we've got this average versus that average. 00:48:40.200 |
And then finally it's going to say, okay, how do we split here? 00:48:44.080 |
And it's going to say, okay, I'll split there. 00:48:47.080 |
So now I've got that average and that average. 00:48:50.200 |
So you can see that it's able to kind of hone in on the set of splits it needs, even though 00:48:56.360 |
it kind of does it greedily, top down one at a time. 00:48:59.760 |
The only reason it wouldn't be able to do this is if it was just such bad luck that 00:49:05.080 |
the two halves were kind of always exactly balanced, but even if that happens it's not 00:49:10.680 |
going to be the end of the world, it will split on something else, some other variable, 00:49:15.080 |
and next time around it's very unlikely that it's still going to be exactly balanced in the same way. 00:49:26.640 |
In the second case, it can do exactly the same thing. 00:49:31.120 |
It'll say, okay, which is my best first split, even though there's no relationship between 00:49:38.040 |
one zip code and its neighboring zip code numerically. 00:49:41.280 |
We can still see here if it splits here, there's the average on one side, and the average on the other. 00:49:54.800 |
Probably here, because here's the average on one side, here's the average on the other side. 00:50:00.320 |
So again, it can do the same thing, it's going to need more splits because it's going to 00:50:04.000 |
end up having to kind of narrow down on each individual large zip code and each individual 00:50:08.680 |
small zip code, but it's still going to be fine. 00:50:12.320 |
So when we're dealing with building decision trees for random forests or GBMs or whatever, 00:50:20.600 |
we tend to encode our variables just as ordinals. 00:50:27.640 |
On the other hand, if we're doing a neural network, or like a simplest version, like a 00:50:35.000 |
linear regression or a logistic regression, the best it could do is that, which is no 00:50:43.640 |
good at all, and ditto with this one, it's going to be like that. 00:50:48.360 |
So an ordinal is not going to be a useful encoding for a linear model or something that's built out of linear layers. 00:51:01.640 |
So instead, what we do is we create a one-hot encoding. 00:51:05.840 |
So we'll say, here's 1, 0, 0, 0; here's 0, 1, 0, 0; here's 0, 0, 1, 0; and here's 0, 0, 0, 1. 00:51:18.360 |
And so with that encoding, it can effectively create a little histogram where it's going 00:51:24.800 |
to have a different coefficient for each level. 00:51:29.000 |
So that way it can do exactly what it needs to do. 00:51:32.240 |
At what point does that become too tedious for your system, or does it not? 00:51:48.000 |
Because remember, in real life we don't actually have to create that matrix, instead we can 00:51:55.920 |
just have the 4 coefficients and just do an index lookup to grab the second one, which 00:52:02.920 |
is mathematically equivalent to multiplying by the one-hot encoding. 00:52:15.400 |
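A tiny numeric demonstration of that equivalence:

```python
import numpy as np

# Multiplying a one-hot vector by a coefficient vector is the same as just
# indexing into it; an embedding is exactly that lookup.
coeffs = np.array([0.2, -1.3, 0.7, 0.05])   # one coefficient per category

one_hot = np.array([0.0, 0.0, 1.0, 0.0])    # category number 2
via_matmul = one_hot @ coeffs               # 0.7
via_lookup = coeffs[2]                      # 0.7, no big matrix needed

assert via_matmul == via_lookup
```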
One thing to mention, I know you guys have been taught quite a bit of more analytical 00:52:24.920 |
And in analytical solutions to linear regression, you can't solve something with this amount of collinearity. 00:52:37.880 |
In other words, you know something is Sydney if it's not Mumbai or New York or San Francisco. 00:52:45.980 |
In other words, there's 100% collinearity between the fourth of these classes versus 00:52:52.560 |
And so if you try to solve a linear regression analytically that way, the whole thing falls apart. 00:52:57.520 |
Now note, with SGD we have no such problem, like SGD, why would it care? 00:53:03.480 |
We're just taking one step along the derivative. 00:53:07.000 |
It cares a little, because in the end the main problem with collinearity is that there's 00:53:13.760 |
an infinite number of equally good solutions. 00:53:17.680 |
So in other words, we could increase all of these and decrease this, or decrease all of 00:53:23.400 |
these and increase this, and they're going to balance out. 00:53:28.240 |
And when there's an infinitely large number of good solutions, that means there's a lot 00:53:32.600 |
of kind of flat spots in the loss surface, and it can be harder to optimize. 00:53:38.960 |
So there's a really easy way to get rid of all of those flat spots, which is to add a little bit of regularization. 00:53:43.600 |
So if we added a little bit of weight decay, like 1e-7 even, then that basically says 00:53:50.120 |
these are not all equally good anymore, the one which is the best is the one where the 00:53:54.400 |
parameters are the smallest and the most similar to each other, and so that will again move us back to a single best solution. 00:54:02.280 |
Could you just clarify that point you made about why one-hot encoding wouldn't be that much of a problem? 00:54:13.920 |
If we have a one-hot encoded vector, and we are multiplying it by a set of coefficients, 00:54:27.520 |
then that's exactly the same thing as simply saying let's grab the thing where the 1 is. 00:54:33.440 |
So in other words, if we had stored this as a 0, and this one as a 1, and this one as 00:54:40.120 |
a 2, then it's exactly the same as just saying look up that thing in the array. 00:54:50.480 |
So an embedding is a weight matrix you can multiply by a one-hot encoding, and it's just 00:55:03.840 |
a computational shortcut, but it's mathematically the same. 00:55:03.840 |
So there's a key difference between solving linear-type models analytically versus with SGD. 00:55:14.200 |
With SGD we don't have to worry about collinearity and stuff, or at least not nearly to the same 00:55:18.880 |
degree, and then the difference between solving a linear or a single layer or multilayer model 00:55:27.760 |
with SGD versus a tree, a tree is going to complain about less things. 00:55:34.040 |
So in particular you can just use ordinals as your categorical variables. 00:55:39.140 |
And as we learned just before, we also don't have to worry about normalizing continuous 00:55:45.200 |
variables for a tree, but we do have to worry about it for these SGD-trained models. 00:55:54.120 |
So then we also learned a lot about interpreting random forests in particular. 00:56:00.800 |
And if you're interested, you might try to use those same techniques to interpret neural nets. 00:56:11.840 |
So if you want to know which of my features are important in a neural net, you could try the same thing. 00:56:16.920 |
Try shuffling each column in turn and see how much it changes your accuracy, and that's 00:56:23.400 |
going to be your feature importance for your neural net. 00:56:26.840 |
And then if you really want to have fun, recognize then that shuffling that column is just a 00:56:33.300 |
way of calculating how sensitive the output is to that input, which in other words is 00:56:38.760 |
the derivative of the output with respect to that input. 00:56:43.640 |
And so therefore maybe you could just ask PyTorch to give you the derivatives with respect 00:56:47.840 |
to the input directly, and see if that gives you the same kind of answers. 00:56:55.120 |
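Here is a hedged sketch of what that shuffling approach could look like for any fitted model; model and score are placeholders rather than any particular library's API:

```python
import numpy as np
import pandas as pd

def permutation_importance(model, X: pd.DataFrame, y, score):
    """Shuffle one column at a time and measure how much the score drops.

    model: anything with a .predict method; score: a metric where higher is better.
    """
    base = score(y, model.predict(X))
    importances = {}
    for col in X.columns:
        X_shuffled = X.copy()
        X_shuffled[col] = np.random.permutation(X_shuffled[col].values)
        importances[col] = base - score(y, model.predict(X_shuffled))
    return importances
```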
You could do the same kind of thing for a partial dependence plot, you could try doing 00:56:59.200 |
the exact same thing with your neural net, replace everything in a column with the same 00:57:03.560 |
value, do it for 1960, 1961, 1962, plot that. 00:57:08.880 |
I don't know of anybody who's done these things before, not because it's rocket science, but 00:57:13.600 |
just because I don't know, maybe no one thought of it, or it's not in a library, but if somebody 00:57:20.240 |
tried it, I think you should find it useful, it would make a great blog post, maybe even 00:57:24.600 |
a paper if you wanted to take it a bit further. 00:57:27.680 |
So that's a thought for something you could do. 00:57:29.320 |
So most of those interpretation techniques are not particularly specific to random forests. 00:57:34.720 |
Things like the tree interpreter certainly are, because they're all about what's inside the tree. 00:57:43.400 |
Say we were applying something like the tree interpreter to neural nets. 00:57:46.200 |
How are we going to make inference out of activations that the path follows, for example? 00:57:55.240 |
We're looking at the paths and their contributions of the features. 00:58:02.360 |
In this case, it will be same with activations, I guess, the contributions of each activation 00:58:10.880 |
How can we make inference out of the activations? 00:58:14.680 |
So I'd be careful saying the word inference, because people normally use the word inference 00:58:17.960 |
specifically to mean the same as a test time prediction. 00:58:22.820 |
You might say you want some kind of way to interrogate the model. 00:58:28.640 |
Actually Hinton and one of his students just published a paper on how to approximate a 00:58:32.900 |
neural net with a tree for this exact reason, which I haven't read the paper yet. 00:58:44.040 |
So in linear regression and traditional statistics, one of the things that we focused on was statistical 00:58:50.440 |
significance of like the changes and things like that. 00:58:53.600 |
And so when thinking about a tree interpreter or even like the waterfall chart, which I 00:58:58.160 |
guess is just a visualization, I guess where does that fit in? 00:59:02.920 |
Because we can see like, oh, yeah, this looks important in the sense that it causes large 00:59:08.440 |
But how do we know that it's like traditionally statistically significant or anything of that sort? 00:59:14.800 |
So most of the time I don't care about the traditional statistical significance, and 00:59:18.720 |
the reason why is that nowadays the main driver of statistical significance is data volume, 00:59:29.460 |
And nowadays most of the models you build will have so much data that like every tiny 00:59:34.380 |
thing will be statistically significant, but most of them won't be practically significant. 00:59:39.960 |
So my main focus therefore is practical significance, which is, is the size of this influence big enough to matter in practice? 00:59:50.720 |
Statistical significance was much more important when we had a lot less data to work with. 00:59:57.680 |
If you do need to know statistical significance, because for example you have a very small 01:00:01.760 |
dataset because it's like really expensive to label or hard to collect or whatever, or 01:00:06.080 |
it's a medical dataset for a rare disease, you can always get statistical significance 01:00:11.240 |
by bootstrapping, which is to say that you can randomly resample your dataset a number 01:00:17.980 |
of times, train your model a number of times, and you can then see the actual variation in your predictions. 01:00:25.840 |
So with bootstrapping, you can turn any model into something that gives you confidence intervals. 01:00:31.800 |
There's a paper by Michael Jordan which has a technique called the bag of little bootstraps 01:00:37.480 |
which actually kind of takes this a little bit further, well worth reading if you're interested. 01:00:46.920 |
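A sketch of that bootstrapping idea; fit_and_predict is a placeholder for training whatever model you like and returning the quantity you want an interval for:

```python
import numpy as np

def bootstrap_interval(X, y, fit_and_predict, n_boot=100):
    """Resample with replacement, refit, and return a rough 95% interval.

    Assumes X and y are numpy arrays indexable by an integer array.
    """
    n = len(y)
    results = []
    for _ in range(n_boot):
        idx = np.random.randint(0, n, n)          # sample n rows with replacement
        results.append(fit_and_predict(X[idx], y[idx]))
    results = np.array(results)
    return np.percentile(results, [2.5, 97.5], axis=0)
```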
So you said we don't need a one-hot encoding matrix if we are doing a random forest or another tree-based model. 01:00:55.640 |
What will happen if we do that and how bad can a model be? 01:01:02.960 |
We actually did do it, remember we had that maximum category size and we did create one-hot 01:01:07.920 |
encodings and the reason why we did it was that then our feature importance would tell 01:01:13.920 |
us the importance of the individual levels, and in our partial dependence plot we could see the individual levels as well. 01:01:20.120 |
So it doesn't necessarily make the model worse, it may make it better, but it probably won't make much difference. 01:01:30.840 |
This is something that we have noticed on real data also that if cardinality is higher, 01:01:36.200 |
let's say 50 levels, and if you do one-hot encoding, the random forest performs very poorly. 01:01:44.080 |
That's why in fast.ai we have that maximum categorical size because at some point your 01:01:59.360 |
Also because when you get past that it becomes less useful because of the feature importance 01:02:04.160 |
there's going to be too many levels to really look at. 01:02:07.960 |
So can it not look at those levels which are not important and just give those significant 01:02:30.800 |
Once the cardinality increases too high you're just splitting your data up too much basically. 01:02:33.800 |
And so in practice your ordinal version is likely to be better. 01:02:47.160 |
There's no time to kind of review everything, but I think that's the key concepts and then 01:02:50.840 |
of course remembering that the embedding matrix that we can use is likely to have more than 01:02:55.680 |
just one coefficient, we'll actually have a dimensionality of a few coefficients which 01:03:00.440 |
isn't going to be useful for most linear models, but once you've got multi-layer models that's 01:03:05.700 |
now creating a representation of your category which is quite a lot richer, and you can do a lot more with it. 01:03:17.560 |
We started off early in this course talking about how actually a lot of machine learning is focused on the wrong thing. 01:03:28.880 |
People focus on predictive accuracy like Amazon has a collaborative filtering algorithm for 01:03:35.280 |
recommending books, and they end up recommending the book which it thinks you're most likely to buy. 01:03:42.800 |
And so what they end up doing is probably recommending a book that you already have 01:03:47.360 |
or that you already know about and would have bought anyway, which isn't very valuable. 01:03:51.800 |
What they should instead have done is to figure out which book can I recommend that would change your buying behavior. 01:04:00.480 |
And so that way we actually maximize our lift in sales due to recommendations. 01:04:06.640 |
And so this idea of the difference between optimizing and influencing your actions versus 01:04:13.860 |
just improving predictive accuracy is a really important distinction which is very rarely 01:04:23.140 |
discussed, crazily enough. 01:04:28.700 |
It's more discussed in industry, it's particularly ignored in most of academia. 01:04:33.920 |
So it's a really important idea, which is that in the end the goal of your model is presumably to influence what somebody does. 01:04:44.680 |
And remember I actually mentioned a whole paper I have about this where I introduce 01:04:48.840 |
this thing called the drivetrain approach where I talk about ways to think about how 01:04:53.160 |
to incorporate machine learning into how do we actually influence behavior. 01:05:01.080 |
So that's a starting point, but then the next question is like okay if we're trying to influence 01:05:06.040 |
behavior, what kind of behavior should we be influencing and how and what might it mean 01:05:17.000 |
Because nowadays a lot of the companies that you're going to end up working at are big 01:05:24.100 |
ass companies and you'll be building stuff that can influence millions of people. 01:05:33.640 |
So I'm actually not going to tell you what it means because I don't know, all I'm going 01:05:39.080 |
to try and do is make you aware of some of the issues and make you believe two things 01:05:46.560 |
First, that you should care, and second, that they're big current issues. 01:05:54.640 |
The main reason I want you to care is because I want you to want to be a good person and 01:06:00.400 |
show you that not thinking about these things will make you a bad person. 01:06:04.840 |
But if you don't find that convincing I will tell you this: Volkswagen were found to be cheating on their emissions tests. 01:06:16.240 |
The person who was sent to jail for it was the programmer that implemented that piece of code. 01:06:25.740 |
And so if you're coming in here thinking, "Hey, I'm just a techie, I'll just do what 01:06:30.320 |
I'm told, that's my job is to do what I'm told." 01:06:34.320 |
I'm telling you if you do that you can be sent to jail for doing what you're told. 01:06:40.280 |
So A) don't just do what you're told because you can be a bad person, and B) you can go to jail. 01:06:49.720 |
Second thing to realize is in the heat of the moment you're in a meeting with 20 people 01:06:55.120 |
at work and you're all talking about how you're going to implement this new feature and everybody's 01:07:00.140 |
discussing it, and everybody's like, "We can do this, and here's a way of modeling it, 01:07:04.800 |
and then we can implement it, and here's these constraints." 01:07:06.480 |
And there's some part of you that's thinking, "Am I sure we should be doing this?" 01:07:12.280 |
That's not the right time to be thinking about that, because it's really hard to step up 01:07:17.320 |
then and say, "Excuse me, I'm not sure this is a good idea." 01:07:22.520 |
You actually need to think about how you would handle that situation ahead of time. 01:07:27.280 |
So I want you to think about these issues now and realize that by the time you're in 01:07:34.720 |
the middle of it, you might not even realize it's happening. 01:07:40.120 |
It'll just be a meeting, like every other meeting, and a bunch of people will be talking 01:07:46.960 |
And you need to be able to recognize, "Oh, this is actually something with ethical implications." 01:07:53.480 |
So Rachel actually wrote all of these slides, I'm sorry she can't be here to present this 01:07:59.120 |
because she's studied this in depth, and she's actually been in difficult environments herself 01:08:06.520 |
where she's kind of seen these things happening. 01:08:12.440 |
We know how hard it is, but let me give you a sense of what happens. 01:08:17.880 |
So engineers trying to solve engineering problems and causing problems is not a new thing. 01:08:28.040 |
So in Nazi Germany, IBM, the group known as Hollerith, Hollerith was the original name 01:08:37.640 |
of IBM, and it comes from the guy who actually invented the use of punch cards for tracking 01:08:42.440 |
the US Census, the first mass, wide-scale use of punch cards for data collection in 01:08:51.200 |
So at this point, this unit was still called Hollerith. 01:08:53.800 |
So Hollerith sold a punch card system to Nazi Germany. 01:09:01.680 |
And so each punch card would code, you know, this is a Jew, 8, Gypsy, 12, general execution, and so forth. 01:09:12.680 |
And so here's one of these cards describing the right way to kill these various people. 01:09:17.900 |
And so a Swiss judge ruled that IBM's technical assistance facilitated the tasks of the Nazis 01:09:25.040 |
in commission of their crimes against humanity. 01:09:27.520 |
This led to the death of something like 20 million civilians. 01:09:33.180 |
So according to the Jewish Virtual Library, where I got these pictures and quotes from, 01:09:38.280 |
their view is that the destruction of the Jewish people became even less important because 01:09:43.040 |
of the invigorating nature of IBM's technical achievement, only heightened by the fantastical profits to be made. 01:09:51.760 |
So this was a long time ago, and hopefully you won't end up working at companies that are involved in anything like that. 01:09:59.920 |
But perhaps you will, because perhaps you'll go to Facebook, who are facilitating a genocide right now. 01:10:07.240 |
And I know people at Facebook who are doing this, and they had no idea they were doing it. 01:10:14.840 |
So right now, the Rohingya are in the middle of a genocide, a Muslim population in Myanmar. 01:10:24.240 |
Babies are being grabbed out of their mother's arms and thrown into fires. 01:10:28.640 |
People are being killed, hundreds of thousands of refugees. 01:10:32.520 |
When interviewed, the Myanmar generals doing this say, "We are so grateful to Facebook 01:10:40.280 |
for letting us know about the Rohingya fake news, that these people are actually not human." 01:10:51.760 |
Now Facebook did not set out to enable the genocide of the Rohingya people in Myanmar. 01:10:58.320 |
No, instead what happened is they wanted to maximize impressions and clicks. 01:11:03.840 |
And so it turns out that for the data scientists at Facebook, their algorithms kind of learned 01:11:08.960 |
that if you take the kinds of stuff people are interested in and feed them slightly more 01:11:13.960 |
extreme versions of that, you're actually going to get a lot more impressions. 01:11:18.560 |
And the project managers are saying maximize these impressions, and people are clicking 01:11:26.720 |
And so the potential implications are extraordinary and global. 01:11:34.680 |
And this is something that is literally happening, this is October 2017, it's happening now. 01:11:48.640 |
So I just want to clarify what was happening here. 01:11:51.120 |
So it was the facilitation of fake news or inaccurate media? 01:12:00.080 |
So what happened was in mid-2016, Facebook fired its human editors. 01:12:07.720 |
So it was humans that decided how to order things on your homepage. 01:12:12.880 |
Those people got fired and replaced with machine learning algorithms. 01:12:16.840 |
And so the machine learning algorithms written by data scientists like you, they had nice 01:12:25.000 |
clear metrics and they were trying to maximize their predictive accuracy and be like okay, 01:12:30.100 |
we think if we put this thing higher up than this thing, we'll get more clicks. 01:12:35.280 |
And so it turned out that these algorithms for putting things on the Facebook news feed 01:12:41.280 |
had a tendency to say like oh, human nature is that we tend to click on things which stimulate 01:12:48.360 |
our views and therefore like more extreme versions of things we already see. 01:12:53.520 |
So this is great for the Facebook revenue model of maximizing engagement. 01:13:02.600 |
And so at the time, there was some negative press about like I'm not sure that the stuff 01:13:10.160 |
that Facebook is now putting on their trending section is actually that accurate, but from 01:13:16.280 |
the point of view of the metrics that people were optimizing at Facebook, it looked terrific. 01:13:22.600 |
And so way back to October 2016, people started noticing some serious problems. 01:13:29.120 |
For example, it is illegal to target housing to people of certain races in America. 01:13:38.160 |
And yet a news organization discovered that Facebook was doing exactly that in October 2016. 01:13:45.480 |
Again, not because somebody in that data science team said let's make sure black people can't 01:13:50.760 |
live in nice neighborhoods, but because their automatic clustering and 01:13:58.480 |
segmentation algorithm found there was a cluster of people who didn't like African Americans, 01:14:04.960 |
and that if you targeted them with these kinds of ads then they would be more likely to click. 01:14:12.400 |
But the interesting thing is that even after being told about this three times, Facebook still hadn't fixed it. 01:14:20.780 |
And that is to say these are not just technical issues, they're also economic issues. 01:14:25.200 |
When fixing it means changing the thing that you get paid for, that is ads, you have to change 01:14:30.840 |
the way that you structure those so that you either use more people, which costs money, or 01:14:37.280 |
you are less aggressive in how your algorithms target people based on minority group status, and that costs revenue. 01:14:48.440 |
So the reason I mention this is you will likely at some point in your career find yourself 01:14:53.880 |
in a conversation where you're thinking I'm not confident that this is morally okay, 01:15:01.000 |
the person you're talking to is thinking in their head this is going to make us a lot 01:15:04.160 |
of money, and you don't quite ever manage to have a successful conversation because you're each optimizing for different things. 01:15:13.940 |
And so when you're talking to somebody who may be more experienced and more senior than 01:15:17.800 |
you and they may sound like they know what they're talking about, just realize that their 01:15:21.960 |
incentives are not necessarily going to be focused on like how do I be a good person. 01:15:28.880 |
They're not thinking how do I be a bad person, but in my experience the more time you spend in industry, 01:15:34.200 |
the more desensitized you get, and the easier it becomes to forget that getting 01:15:40.760 |
promotions and making money isn't the most important thing. 01:15:45.240 |
So for example, I've got a lot of friends who are very good at computer vision and some 01:15:50.840 |
of them have gone on to create startups that seem like they're almost tailor-made to help 01:15:57.000 |
authoritarian governments surveil their citizens. 01:16:01.880 |
And when I ask my friends like have you thought about how this could be used in that way, 01:16:08.200 |
they're generally kind of offended that I ask, but I'm asking you to think about this. 01:16:17.280 |
Wherever you end up working, if you end up creating a startup, tools can be used for 01:16:23.600 |
good or for evil, and so I'm not saying don't create excellent object tracking and detection 01:16:31.400 |
tools from computer vision, because you could go on and use that to create a much better 01:16:39.200 |
surgical intervention robot toolkit, just saying be aware of it, think about it, talk to people about it. 01:16:50.320 |
So here's one I find fascinating. There's this really cool thing that meetup.com 01:16:55.560 |
did; this is from a meetup.com talk that's online. 01:17:00.640 |
They actually thought about this: you know what, if we built a collaborative 01:17:05.000 |
filtering system like we learned about in class to help people decide what meetup to 01:17:11.360 |
go to, it might notice that on the whole in San Francisco, a few more men than women tend to go to techie meetups. 01:17:21.560 |
And so it might then start to decide to recommend techie meetups to more men than women, as 01:17:28.320 |
a result of which, more men will go to techie meetups. 01:17:32.600 |
As a result of which, when women go to techie meetups, they'll be like oh, this is all men, and maybe they won't come back. 01:17:38.760 |
As a result of which, the algorithm will get new data saying that men like techie meetups more than women do. 01:17:46.300 |
And so a little bit of that initial push from the algorithm can create this runaway feedback 01:17:54.560 |
loop and you end up with almost all male techie meetups, for instance. 01:18:00.840 |
And so this kind of feedback loop is a subtle issue that you really want to think 01:18:07.120 |
about when you're asking yourself, what is the behavior that I'm changing with this algorithm? 01:18:18.440 |
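To make that loop concrete, here is a minimal sketch in Python of the dynamic being described. It is not anything meetup.com actually runs; the attendance numbers, the "show it to whoever looks like past attendees" rule, and the drop-off factor for women are all invented purely for illustration.

```python
# Minimal sketch of a runaway feedback loop in a recommender.
# All numbers and the update rule below are invented for illustration only.

def simulate_feedback_loop(men_interested=500, women_interested=500, rounds=10):
    # Start from a tiny historical imbalance in who attended the first event.
    men_attended, women_attended = 52.0, 48.0
    for t in range(rounds):
        total = men_attended + women_attended
        # The recommender shows the techie meetup to each group in proportion
        # to how often that group attended in the past (it is optimizing clicks).
        p_show_men = men_attended / total
        p_show_women = women_attended / total
        # Next round's attendance is roughly interest x exposure; women who do
        # attend see a mostly male room and are slightly less likely to return.
        men_attended = men_interested * p_show_men
        women_attended = women_interested * p_show_women * 0.9
        share_men = men_attended / (men_attended + women_attended)
        print(f"round {t}: {share_men:.0%} of attendees are men")

simulate_feedback_loop()
```

Under these made-up assumptions, a 52/48 starting split keeps drifting towards a male-dominated audience round after round, even though the underlying interest in the two groups never changed.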
So another example, which is kind of terrifying, is in this paper where the authors describe 01:18:28.080 |
how a lot of police departments in the US are now using predictive policing algorithms. 01:18:34.800 |
So where can we go to find somebody who's about to commit a crime? 01:18:40.420 |
And so you know that the algorithm simply feeds back to you basically the data that you gave it. 01:18:49.400 |
So if your police department has engaged in racial profiling at all in the past, then 01:18:57.240 |
it might suggest slightly more often maybe you should go to the black neighborhoods to 01:19:04.040 |
As a result of which, more of your police officers go to the black neighborhoods. 01:19:07.360 |
As a result of which, they arrest more black people. 01:19:10.080 |
As a result of which, the data says that the black neighborhoods are less safe. 01:19:14.120 |
As a result of which, the algorithm says to the policeman, maybe you should go to the 01:19:17.120 |
black neighborhoods more often, and so forth. 01:19:21.080 |
And this is not like vague possibilities of something that might happen in the future, 01:19:29.720 |
this is like documented work from top academics who have carefully studied the data and the algorithms. 01:19:37.200 |
This is like serious scholarly work, it's like no, this is happening right now. 01:19:42.560 |
And so again, I'm sure the people that started creating this predictive policing algorithm 01:19:49.580 |
didn't think like how do we arrest more black people, hopefully they were actually thinking 01:19:54.860 |
gosh I'd like my children to be safer on the streets, how do I create a safer society? 01:20:02.720 |
But they didn't think about this nasty runaway feedback loop. 01:20:09.060 |
So this one about social network algorithms is actually from an article in the New 01:20:13.720 |
York Times recently about one of my friends, Renee DiResta, and she did something kind of interesting. 01:20:20.920 |
She set up a second Facebook account, like a fake Facebook account, and she was very 01:20:27.200 |
interested in the anti-vax movement at the time. 01:20:30.680 |
So she started following a couple of anti-vaxxers and visited a couple of anti-vaxxer links. 01:20:39.400 |
And so suddenly her news feed starts getting full of anti-vaxxer news, along with other 01:20:46.240 |
stuff like chemtrails, and deep state conspiracy theories, and all this stuff. 01:20:53.440 |
And so she's like, 'huh', starts clicking on those. 01:20:57.180 |
And the more she clicked, the more hardcore far-out conspiracy stuff Facebook recommended. 01:21:05.280 |
So now when Renee goes to that Facebook account, the whole thing is just full of angry, crazy, 01:21:14.360 |
far-out conspiracy stuff, like that's all she sees. 01:21:18.180 |
And so if that was your world, then as far as you're concerned, it's just like this continuous stream of that stuff. 01:21:30.200 |
And so again, to answer your question, this is the kind of runaway feedback loop that 01:21:37.580 |
ends up telling the Myanmar generals, you know, throughout their Facebook homepage, that these people are animals. 01:22:01.960 |
So bias in image software comes from bias in data. 01:22:08.160 |
And so most of the folks I know at Google Brain building computer vision algorithms 01:22:19.460 |
are training those algorithms with photos of their families and friends, 01:22:24.440 |
which means they are training them with very few people of color. 01:22:27.180 |
And so when FaceApp then decided, we're going to try looking at lots of Instagram photos 01:22:34.300 |
to see which ones are upvoted the most, without them necessarily realizing it, the answer 01:22:44.160 |
they got skewed towards white faces. So then they built a generative model to make you hotter. 01:22:48.440 |
And so this is the actual photo, and here is the hotter version. 01:22:53.040 |
So the hotter version is more white, less nostrils, more European looking. 01:23:01.240 |
And so this did not go down well, to say the least. 01:23:07.720 |
So again, I don't think anybody at FaceApp said, let's create something that makes people look whiter. 01:23:15.840 |
They just trained it on a bunch of images of the people that they had around them. 01:23:21.560 |
And this has kind of serious commercial implications as well. 01:23:27.360 |
They had to pull this feature, and they got a huge amount of negative pushback. 01:23:33.460 |
Here's another example: Google Photos created this photo classifier with labels like airplanes and skyscrapers, and it labeled a photo of two black people as gorillas. 01:23:45.980 |
So think about how this looks to most people. 01:23:50.920 |
Most people look at this, they don't know about machine learning, and they conclude that 01:23:56.740 |
somebody at Google wrote some code to take black people and call them gorillas. 01:24:05.960 |
We know what actually happened: the team of computer vision experts at Google, who have 01:24:15.800 |
few or no people of color working on the team, built a classifier using all the photos they had to hand. 01:24:23.720 |
And so when the system came across a person with dark skin, it was like, I've mainly 01:24:32.480 |
seen that before amongst gorillas, so I'll put it in that category. 01:24:36.920 |
So again, the bias in the data creates a bias in the software, and again, the commercial implications were serious. 01:24:44.920 |
Google really got a lot of bad PR from this, as they should. 01:24:49.800 |
This was a photo that somebody put in their Twitter feed. 01:24:53.080 |
They said, look what Google Photos just decided to do. 01:24:59.560 |
You can imagine what happened with the first international beauty contest judged by artificial intelligence. 01:25:05.240 |
Basically it turns out all the beautiful people are white. 01:25:08.840 |
So you kind of see this bias in image software, thanks to bias in the data, thanks to lack 01:25:16.680 |
of diversity in the teams building it, you see the same thing in natural language processing. 01:25:24.480 |
So here is an example from Turkish: 'o' is the pronoun in Turkish which has no gender, but of course in English 01:25:39.520 |
we don't really have a widely used un-gendered singular pronoun, so Google Translate has to pick 'he' or 'she', and it picks based on the stereotypes in its training data. 01:25:50.400 |
Now there are plenty of people who saw this online and said, literally, so what? 01:25:59.640 |
It is correctly feeding back the usual usage in English. 01:26:06.000 |
I know how this is trained, this is like Word2Vec vectors, it was trained on the Google News corpus, 01:26:11.480 |
Google Books corpus, it's just telling us how things are. 01:26:15.880 |
And from a point of view, that's entirely true. 01:26:20.120 |
The biased data used to create this biased algorithm is the actual data of how people have written over the years. 01:26:32.080 |
But does that mean that this is the product that you want to create? 01:26:38.080 |
Does this mean this is the product you have to create? 01:26:41.340 |
Just because the particular way you've trained the model means it ends up doing this, is that really okay? 01:26:49.340 |
And can you think of potential negative implications and feedback loops this could create? 01:26:55.660 |
And if any of these things bother you, then now, lucky you, you have a new cool engineering 01:27:01.120 |
problem to work on, like how do I create unbiased NLP solutions? 01:27:06.380 |
And now there are some start-ups starting to do that and starting to make some money. 01:27:11.520 |
These are opportunities for you, like here's some stuff where people are creating screwed 01:27:16.560 |
up societal outcomes because of their shitty models, like okay, well you can go and build something better. 01:27:23.520 |
So like another example of the bias in word2vec word vectors is that restaurant review systems rank Mexican 01:27:30.880 |
restaurants worse, because the word Mexican tends to be associated with words relating to crime in the data they were trained on. 01:27:40.800 |
Again, this is like a real problem that is happening right now. 01:27:48.720 |
So Rachel actually did some interesting analysis of just the plain word2vec word vectors where 01:27:56.680 |
she basically pulled them out and looked at these analogies, based on some research that others had published. 01:28:03.960 |
And so you can see word2vec, the vector directions show that father is to doctor as mother is to 01:28:10.480 |
nurse, and man is to computer programmer as woman is to homemaker, and so forth. 01:28:16.680 |
So it's really easy to see what's in these word vectors, and they're kind of fundamental 01:28:24.200 |
to much of the NLP or probably just about all of the NLP software we use today. 01:28:31.080 |
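If you want to see these directions for yourself, here is a small sketch of the kind of probe being described, using gensim and a publicly available GloVe embedding. The embedding chosen and the word pairs are just examples, not a reproduction of Rachel's exact analysis, and the completions you get will depend on which vectors you load.

```python
# Probing pretrained word vectors for stereotyped analogy directions.
# The embedding and the word pairs are illustrative choices, nothing more.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # any pretrained KeyedVectors works

def analogy(a, b, c):
    """Return the word completing 'a is to b as c is to ?' via vector arithmetic."""
    return vectors.most_similar(positive=[b, c], negative=[a], topn=1)[0][0]

for a, b, c in [("man", "doctor", "woman"),
                ("man", "programmer", "woman"),
                ("father", "doctor", "mother")]:
    print(f"{a} : {b} :: {c} : {analogy(a, b, c)}")
```

Note that most_similar excludes the query words themselves, so the completion is always a different word; running this on different corpora is an easy way to compare how much stereotype each embedding has absorbed.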
So ProPublica has actually done a lot of good work in this area. 01:28:42.600 |
Many judges now have access to Sentencing Guidelines software. 01:28:46.840 |
And so Sentencing Guidelines software says to the judge, for this individual we would predict this risk of re-offending. 01:28:56.080 |
And now of course a judge doesn't understand machine learning. 01:29:00.000 |
So like they have two choices, which is either do what it says or ignore it entirely, and there's not much in between. 01:29:08.920 |
And so for the ones that fall into the 'do what it says' category, here's what happens. 01:29:13.640 |
For those that were labeled higher risk, the subset that 01:29:19.120 |
actually turned out not to re-offend was about a quarter of whites and about a half of African Americans. 01:29:28.080 |
So like nearly twice as often, people who didn't re-offend were marked as higher risk 01:29:36.560 |
if they were African Americans, and vice versa. 01:29:39.280 |
Amongst those that were labeled lower risk but actually did re-offend, it turned out 01:29:44.800 |
to be about half of the whites and only 28% of the African Americans. 01:29:49.440 |
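The disparity being described here is exactly the kind of thing you can audit for in a few lines if you have the predictions and the outcomes: compute the false positive rate (labelled higher risk but did not re-offend) and the false negative rate (labelled lower risk but did re-offend) separately per group. Below is a minimal sketch with pandas; the column names and the tiny made-up table are placeholders for your own data, not real figures.

```python
# Auditing a risk model for unequal error rates across groups.
# The column names and the toy data below are placeholders, not real results.
import pandas as pd

df = pd.DataFrame({
    "group":      ["a", "a", "a", "a", "b", "b", "b", "b"],
    "high_risk":  [1,   0,   1,   0,   1,   1,   0,   0],   # model's label
    "reoffended": [1,   0,   0,   1,   0,   1,   0,   1],   # actual outcome
})

def error_rates(g):
    # False positive rate: flagged high risk among those who did not re-offend.
    fpr = ((g.high_risk == 1) & (g.reoffended == 0)).sum() / (g.reoffended == 0).sum()
    # False negative rate: flagged low risk among those who did re-offend.
    fnr = ((g.high_risk == 0) & (g.reoffended == 1)).sum() / (g.reoffended == 1).sum()
    return pd.Series({"false_positive_rate": fpr, "false_negative_rate": fnr})

print(df.groupby("group")[["high_risk", "reoffended"]].apply(error_rates))
```

If those rates differ substantially between groups, that is the kind of disparity ProPublica was pointing at, regardless of how good the overall accuracy looks.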
So this is a situation where I would like to think nobody set out to create something like this deliberately. 01:29:57.040 |
But when you start with biased data, and the data says that whites and blacks smoke marijuana 01:30:09.200 |
at about the same rate, but blacks are jailed something like five times more often than 01:30:16.320 |
whites, the nature of the justice system in America at the moment is that it's not equal. 01:30:24.840 |
And therefore the data that's fed into the machine learning model is going to basically reflect that bias. 01:30:31.800 |
And then because of the runaway feedback loop, it's just going to get worse and worse. 01:30:35.760 |
I'll tell you something else interesting about this one, which a researcher called Abe Gong has 01:30:40.320 |
pointed out, which is some of the questions that are being asked of offenders. 01:30:53.920 |
So your answers to those questions are going to decide whether you're locked up and for how long. 01:31:00.880 |
Now as a machine learning researcher, do you think that might improve the predictive accuracy 01:31:05.400 |
of your algorithm and get you a better R-squared? 01:31:12.160 |
You try it out and say oh, I've got a better R-squared. 01:31:16.800 |
Well there's another question, do you think it's reasonable to lock somebody up for longer because of the answer to a question like that? 01:31:25.400 |
And yet these are actually the examples of questions that we are asking right now to 01:31:31.320 |
offenders and then putting into a machine learning system to decide what happens to 01:31:37.360 |
So again, whoever designed this, presumably they were laser focused on technical excellence, 01:31:44.360 |
getting the maximum area under the ROC curve, thinking I found these great predictors that give 01:31:49.280 |
me another .02, and I guess didn't start to think like, well, is that a reasonable way to decide how long somebody gets locked up for. 01:32:03.840 |
So like putting this together, you can kind of see how this can get more and more scary. 01:32:12.160 |
We take a company like Taser, and Tasers are these devices that kind of give you a big electric shock. 01:32:19.920 |
And Tasers managed to do a great job of creating strong relationships with some academic researchers 01:32:26.680 |
who seem to say whatever they tell them to say, to the extent where now if you look at 01:32:32.400 |
the data it turns out that there's a pretty high probability that if you get tased, you die shortly afterwards. 01:32:41.240 |
That happens not unusually, and yet the researchers who they've paid to look into this have consistently 01:32:49.520 |
come back and said oh no, it was nothing to do with the Taser, the fact that they died 01:32:53.880 |
immediately afterwards was totally unrelated, it was just a random thing that happened. 01:33:01.600 |
So this company now owns 80% of the market for body cameras, and they started buying 01:33:09.260 |
computer vision AI companies, and they're going to try and now use these police body 01:33:14.280 |
camera videos to anticipate criminal activity. 01:33:22.220 |
So is that like okay, I now have some augmented reality display saying tase this person because the algorithm says they're about to commit a crime? 01:33:31.880 |
So it's kind of like a worrying direction, and so I'm sure nobody who's a data scientist 01:33:40.920 |
at Taser or at the companies that they bought out is thinking like this is the world I want 01:33:46.360 |
to help create, but they could find themselves, or you could find yourself in the middle of 01:33:53.000 |
this kind of discussion, where it's not explicitly about that topic but there's part of you that 01:33:58.000 |
says I wonder if this is how this could be used, and I don't know exactly what the right 01:34:05.760 |
thing to do in that situation is, because you can ask, and of course people are going to tell you it's fine. 01:34:17.200 |
You could ask for some kind of written promise, you could decide to leave, you could start 01:34:25.320 |
doing some research into the legality of things to say I would at least protect my own legal position. 01:34:32.960 |
I don't know, have a think about how you would respond to that. 01:34:39.840 |
So these are some questions that Rachel put together as things to think about. 01:34:45.480 |
So if you're looking at building a data product or using a model, if you're building a machine 01:34:51.320 |
learning model, it's for a reason: you're trying to do something. So the first question to ask is, what bias might be in the data? 01:34:59.280 |
Because whatever bias is in that data ends up being a bias in your predictions, potentially 01:35:03.360 |
then biases the actions you're influencing, and potentially then biases the data that 01:35:07.560 |
comes back to you, and so you may create a feedback loop. 01:35:10.560 |
If the team that built it isn't diverse, what might you be missing? 01:35:15.920 |
So for example, one senior executive at Twitter raised the alarm about major Russian bot problems 01:35:28.760 |
at Twitter way back well before the election. 01:35:34.340 |
That was the one black person in the exec team at Twitter, the one. 01:35:47.940 |
Definitely having a more diverse team means having a more diverse set of opinions and 01:35:54.000 |
beliefs and ideas and things to look for and so forth. 01:35:57.080 |
So non-diverse teams seem to make more of these bad mistakes. 01:36:02.640 |
Can we audit the code? Is it open source? Can we check for different error rates amongst 01:36:08.640 |
different groups? Is there a simple rule we could use instead that's extremely interpretable 01:36:14.540 |
and easy to communicate? And if something goes wrong, do we have a good way to deal with it? 01:36:22.080 |
So when we've talked to people about this and a lot of people have come to Rachel and 01:36:29.400 |
said I'm concerned about something my organization is doing, what do I do, or I'm just concerned in general. 01:36:42.200 |
And very often Rachel will say, have you considered leaving? 01:36:47.960 |
And they will say, I don't want to lose my job. 01:36:52.520 |
But actually if you can code, you're in 0.3% of the population. 01:36:57.420 |
If you can code and do machine learning, you're in probably 0.01% of the population. 01:37:09.240 |
So realistically, obviously an organization does not want you to feel like you're somebody 01:37:15.480 |
who could just leave and get another job, that's not in their interest, but that is the reality. 01:37:22.000 |
And so one of the things I hope you'll leave this course with is enough self-confidence 01:37:28.680 |
to recognize that you have the skills to get a job, and particularly once you've got your 01:37:36.040 |
first job, your second job is an order of magnitude easier. 01:37:40.200 |
And so this is important not just so that you feel like you actually have the ability 01:37:44.260 |
to act ethically, but it's also important to realize if you find yourself in a toxic 01:37:51.320 |
environment, which is pretty damn common, unfortunately, there are a lot of shitty tech cultures and environments out there. 01:38:02.480 |
If you find yourself in one of those environments, the best thing to do is to get the hell out. 01:38:09.560 |
And if you don't have the self-confidence to think you can get another job, you can end up stuck there. 01:38:17.640 |
So it's really important, it's really important to know that you are leaving this program 01:38:24.160 |
with very in-demand skills, and particularly after you have that first job, you're now 01:38:28.680 |
somebody with in-demand skills and a track record of being employed in that area. 01:38:36.720 |
This is kind of just a broad question, but what are some things that you know of that people are doing to address these kinds of bias? 01:38:52.440 |
It's kind of a controversial subject at the moment, and some people are trying 01:38:59.320 |
to use an algorithmic approach, where they're basically trying to 01:39:03.200 |
say how can we identify the bias and kind of subtract it out, but the most effective 01:39:10.920 |
ways I know of are ones that try to treat it at the data level. 01:39:15.000 |
So start with a more diverse team, particularly a team involving people from the humanities, 01:39:21.560 |
like sociologists, psychologists, economists, people that understand feedback loops and implications 01:39:27.560 |
for human behavior, and they tend to be equipped with good tools for kind of identifying and 01:39:35.460 |
tracking these kinds of problems, and then trying to incorporate the solutions into your process. 01:39:43.200 |
Let's just say there isn't some standard process I can point you to and say, here's how to fix it. 01:39:52.120 |
If there is such a thing, we haven't found it yet, it requires a diverse team of smart 01:39:58.440 |
people to be aware of the problems and work hard at them, is the short answer. 01:40:02.720 |
This is just kind of a general thing I guess for the whole class. 01:40:12.640 |
If you're interested in this stuff, I read a pretty cool book, Jeremy you've probably 01:40:16.480 |
heard of it, Weapons of Math Destruction by Cathy O'Neil, it covers a lot of the same issues. 01:40:24.960 |
Thanks for the recommendation, Cathy's great, she's also got a TED talk, I didn't manage 01:40:31.120 |
to finish the book because it's so damn depressing, I was just like, no more. 01:40:47.920 |
This has been really intense for me, obviously this was meant to be something that I was 01:40:56.280 |
sharing with Rachel, so I've ended up doing one of the hardest things in my life, which 01:41:01.920 |
is to teach two people's worth of course on my own and also look after a sick wife and 01:41:07.880 |
have a toddler and also do a deep learning course and also do all this with a new library 01:41:14.920 |
So I'm looking forward to getting some sleep, but it's been totally worth it because you've 01:41:20.760 |
been amazing, like I'm thrilled with how you've reacted to the kind of opportunities I've 01:41:31.320 |
given you and also to the feedback that I've given you.