
Machine Learning 1: Lesson 12


Chapters

0:00 Introduction
1:00 Recap
4:30 Durations
8:55 Zip
21:45 Embedding
25:50 Scaling
30:00 Validation
31:10 Check against
32:30 Create model
33:07 Define embedding dimensionality
38:41 Embedding matrices
45:11 Categorical data

Whisper Transcript

00:00:00.000 | I thought what we might do today is to finish off where we were in this Rossman notebook
00:00:10.720 | looking at time series forecasting and structured data analysis.
00:00:16.640 | And then we might do a little mini-review of everything we've learnt, because believe
00:00:23.720 | it or not, this is the end, there's nothing more to know about machine learning other
00:00:28.880 | than everything that you're going to learn next semester and for the rest of your life.
00:00:36.460 | But anyway, I've got nothing else to teach, so we'll do a little review and then we'll
00:00:42.360 | cover the most important part of the course, which is thinking about how are ways to think
00:00:50.640 | about how to use this kind of technology appropriately and effectively in a way that's hopefully
00:00:58.080 | a positive impact on society.
00:01:02.160 | So last time we got to the point where we talked a bit about this idea that when we
00:01:08.720 | were looking at building this competition months open derived variable, that we actually
00:01:15.400 | truncated it down to be no more than 24 months, and we talked about the reason why being that
00:01:20.240 | we actually wanted to use it as a categorical variable, because categorical variables, thanks
00:01:24.760 | to embeddings, have more flexibility in how the neural net can use them.
00:01:33.220 | And so that was kind of where we left off.
00:01:37.720 | So let's keep working through this, because what's happening in this notebook is stuff
00:01:46.080 | which is probably going to apply to most time series data sets that you work with.
00:01:53.840 | And as we talked about, although we use df.apply here, this is something where it's running
00:01:59.160 | a piece of Python code over every row, and that's horrifically slow.
00:02:06.520 | So we only do that if we can't find a vectorized pandas or numpy function that can do it to
00:02:12.720 | the whole column at once, but in this case I couldn't find a way to convert a year and
00:02:19.400 | a week number into a date without using arbitrary Python.
00:02:29.160 | Also worth remembering this idea of a lambda function, any time you're trying to apply
00:02:35.120 | a function to every row of something, or every element of a tensor, or something like that,
00:02:40.240 | if there isn't a vectorized version already, you're going to have to call something like
00:02:44.680 | dataframe.apply, which will run a function you pass to every element.
00:02:52.040 | So this is basically a map in functional programming.
00:03:00.120 | Since very often the function that you want to pass to it is something you're just going
00:03:03.760 | to use once and then throw it away, it's really common to use this lambda approach.
00:03:09.600 | So this lambda is creating a function just for the purpose of telling df.apply what to do.
00:03:16.900 | So we could also have written this in a different way, which would have been to say, define
00:03:32.880 | a named function that takes the row and returns the date.
00:03:50.560 | And then we could put that in here.
00:04:06.080 | So that and that are the same thing.
00:04:09.380 | So one approach is to define the function and then pass it by name, or the other is
00:04:14.060 | to define the function in place using lambda.
00:04:18.440 | And so if you're not comfortable creating and using lambdas, it's a good thing to practice,
00:04:25.640 | and playing around with df.apply is a good way to practice it.
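As a concrete, hedged sketch of those two equivalent styles, here is a year-plus-week-number conversion like the one described above; the column names promo_year/promo_week are illustrative placeholders, not the notebook's actual field names.

```python
import datetime
import pandas as pd

# Illustrative data: a year and a week number per row (placeholder column names).
df = pd.DataFrame({'promo_year': [2014, 2015], 'promo_week': [10, 23]})

def week_to_date(row):
    # Monday of the given week; there's no obvious vectorized pandas/numpy version,
    # so we fall back on "arbitrary Python" run per row via apply.
    return datetime.datetime.strptime(f"{row.promo_year}-{row.promo_week}-1", "%Y-%W-%w")

# Style 1: define the function, then pass it by name.
df['promo_since'] = df.apply(week_to_date, axis=1)

# Style 2: the same thing written in place as a lambda.
df['promo_since'] = df.apply(
    lambda r: datetime.datetime.strptime(f"{r.promo_year}-{r.promo_week}-1", "%Y-%W-%w"),
    axis=1)
```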
00:04:33.280 | So let's talk about this durations section, which may at first seem a little specific,
00:04:43.880 | but actually it turns out not to be.
00:04:46.360 | What we're going to do is we're going to look at three fields, promo, state holiday and
00:04:53.920 | school holiday.
00:04:55.200 | And so basically what we have is a table for each store, for each date, does that store
00:05:04.360 | have a promo going on at that date?
00:05:07.420 | Is there a school holiday in that region of that store at that date?
00:05:12.600 | Is there a state holiday in that region for that store at that date?
00:05:17.080 | And so this kind of thing is, there are events, and time series with events are very common.
00:05:26.840 | If you're looking at oil and gas drilling data, you're trying to say the flow through
00:05:32.200 | this pipe, here's an event representing when it set off some alarm, or here's an event
00:05:38.960 | where the drill got stuck, or whatever.
00:05:42.800 | And so most time series at some level will tend to represent some events.
00:05:49.760 | So the fact that an event happened at a time is interesting itself, but very often a time
00:05:59.600 | series will also show something happening before and after the event.
00:06:05.800 | So for example, in this case we're doing grocery sales prediction.
00:06:10.660 | If there's a holiday coming up, it's quite likely that sales will be higher before and
00:06:15.920 | after the holiday, and lower during the holiday if this is a city-based store, because you're
00:06:23.960 | going to stock up before you go away to bring things with you, and when you come back you've
00:06:29.360 | got to refill the fridge, for instance.
00:06:34.360 | So although we don't necessarily have to do this kind of feature engineering to create
00:06:41.240 | features specifically about whether this is before or after a holiday, the more
00:06:48.200 | we can give the neural net the kind of information it needs, the less it's going to have to learn
00:06:54.120 | it, the more we can do with the data we already
00:06:58.360 | have, and the more we can do with the size of architecture we already have.
00:07:03.880 | So feature engineering, even with stuff like neural nets, is still important because it
00:07:10.920 | means that we'll be able to get better results with whatever limited data we have, whatever
00:07:17.160 | limited computation we have.
00:07:21.240 | So the basic idea here, therefore, is when we have events in our time series as we want
00:07:26.480 | to create two new columns for each event, how long is it going to be until the next
00:07:32.240 | time this event happens, and how long has it been since the last time that event happened.
00:07:38.160 | So in other words, how long until the next state holiday, how long since the previous
00:07:41.880 | state holiday.
00:07:44.840 | So that's not something which I'm aware of as existing as a library or anything like
00:07:50.880 | that, so I wrote it here by hand.
00:07:55.360 | And so importantly, I need to do this by store.
00:08:03.120 | For this store, when was this store's last promo, so how long has it been since the last
00:08:08.720 | time it had a promo, how long it will be until the next time it has a promo, for instance.
00:08:17.560 | So here's what I'm going to do, I'm going to create a little function that's going to
00:08:23.720 | take a field name and I'm going to pass it each of promo and then state holiday and then
00:08:28.280 | school holiday.
00:08:29.280 | So let's do school holiday, for example.
00:08:31.200 | So we'll say field = 'SchoolHoliday', and then we'll call get_elapsed with that field and 'After'.
00:08:40.280 | So let me show you what that's going to do.
00:08:41.920 | So we've got a first of all sort by store and date.
00:08:47.000 | So now when we loop through this, we're going to be looping through within a store, so store
00:08:51.120 | number 1, January the 1st, January the 2nd, January the 3rd, and so forth.
00:08:57.120 | And as we loop through each store, we're basically going to say, is this row a school holiday
00:09:04.920 | or not?
00:09:05.920 | And if it is a school holiday, then we'll keep track of this variable called last_date, which
00:09:10.040 | says this is the last date where we saw a school holiday.
00:09:15.720 | And so then we're basically going to append to our result the number of days since the
00:09:22.300 | last school holiday.
00:09:24.400 | That's the kind of basic idea here.
00:09:26.520 | So there's a few interesting features.
00:09:29.600 | One is the use of zip.
00:09:32.600 | So I could actually write this much more simply.
00:09:36.800 | I could basically go through for row in df.iterrows() and then grab the fields we want from
00:09:53.240 | each row.
00:09:55.360 | It turns out this is 300 times slower than the version that I have.
00:10:01.280 | And basically iterating through a data frame and extracting specific fields out of a row
00:10:12.920 | has a lot of overhead.
00:10:14.800 | What's much faster is to iterate through a numpy array.
00:10:22.080 | So if you take a series like df.store and add .values after it, that grabs a numpy array
00:10:28.960 | of that series.
00:10:31.040 | So here are three numpy arrays.
00:10:32.920 | One is the store IDs, one is whatever the field is, in this case, let's say school_holiday,
00:10:40.760 | and one is the date.
00:10:43.600 | So now what I want to do is loop through the first one of each of those lists, and then
00:10:50.520 | the second one of each of those lists, and then the third one of each of those lists.
00:10:54.160 | This is a really, really common pattern.
00:10:55.920 | I need to do something like this in basically every notebook I write, and the way to do
00:11:00.540 | it is with zip.
00:11:02.640 | So zip means loop through each of these lists one at a time, and then this here is where
00:11:09.920 | we can grab that element out of the first list, the second list and the third list.
00:11:16.440 | So if you haven't played around much with zip, that's a really important function to
00:11:21.360 | practice with.
00:11:22.360 | Like I say, I use it in pretty much every notebook I write all the time.
00:11:28.660 | You have to loop through a bunch of lists at the same time.
00:11:36.080 | So we're going to loop through every store, every school_holiday, every date, yes.
00:11:45.580 | So in this case we basically want to say let's grab the first store, the first school_holiday,
00:12:02.240 | the first date.
00:12:03.240 | So for store_1, January 1st school_holiday was true or false.
00:12:10.520 | And so if it is a school_holiday, I'll keep track of that fact by saying the last time
00:12:15.000 | I saw a school_holiday was that day.
00:12:19.360 | And then append, how long has it been since the last school_holiday?
00:12:25.320 | And if the store_id is different to the last store_id I saw, then I've now got to a whole
00:12:32.040 | new store, in which case I have to basically reset everything.
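A minimal sketch of the get_elapsed idea described here, using the zip-over-.values pattern; it assumes Store, Date and the event column follow the Rossmann naming, and unlike the course notebook (which works on a module-level df) it takes the dataframe as an argument.

```python
import numpy as np

def get_elapsed(df, fld, pre):
    # Days since the last row where `fld` was true, tracked per store.
    # Sorting df by Store then Date (ascending) beforehand gives "days after";
    # sorting Date descending instead gives "days before" the next event.
    day1 = np.timedelta64(1, 'D')
    last_date = np.datetime64()   # NaT: early rows come out as NaN and may need filling later
    last_store = 0
    res = []
    # zip over numpy arrays (.values) rather than df.iterrows() -- far faster
    for s, v, d in zip(df.Store.values, df[fld].values, df.Date.values):
        if s != last_store:       # reached a new store, so reset the tracker
            last_date = np.datetime64()
            last_store = s
        if v:                     # this row is an event (e.g. a school holiday)
            last_date = d
        res.append((d - last_date).astype('timedelta64[D]') / day1)
    df[pre + fld] = res
    return df
```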
00:12:36.200 | Could you pass that to Karen?
00:12:40.280 | What will happen to the first points that we don't have the last holiday?
00:12:46.460 | I basically set this to some arbitrary starting point, it's going to end up with the largest
00:12:55.400 | or the smallest possible date.
00:13:00.280 | And you may need to replace this with a missing value afterwards, or some zero, or whatever.
00:13:15.320 | The nice thing is, thanks to values, it's very easy for a neural net to kind of cut
00:13:22.720 | off extreme values.
00:13:24.860 | So in this case I didn't do anything special with it, I ended up with a negative a billion
00:13:28.320 | day time stamps and it still worked fine.
00:13:36.300 | So we can go through, and the next thing to note is there's a whole bunch of stuff that
00:13:42.720 | I need to do to both the training set and the test set.
00:13:46.400 | So in the previous section I actually kind of added this little loop where I go for each
00:13:51.760 | of the training data frame and the test data frame do these things.
00:13:58.640 | So each cell I did for each of the data frames.
00:14:02.920 | I've now got a whole series of cells that I want to run first of all for the training
00:14:10.360 | set and then for the test set.
00:14:12.720 | So in this case the way I did that was I had two different cells here.
00:14:16.340 | One which set df to be the training set, one which set to be the test set.
00:14:20.760 | So the way I use this is I run just this cell, and then I run all the cells underneath, so
00:14:28.280 | it does it all for the training set, and then I come back and run just this cell and then
00:14:32.960 | run all the cells underneath.
00:14:35.240 | So this notebook is not designed to be just run from top to bottom, but it's designed
00:14:40.680 | to be run in this particular way.
00:14:43.280 | And I mention that because this can be a handy trick to know, you could of course put all
00:14:49.720 | the stuff underneath in a function that you pass the data frame to and call it once with
00:14:54.640 | the test set and once with the training set, but I kind of like to experiment a bit more
00:15:00.640 | interactively, look at each step as I go, so this way is an easy way to kind of run something
00:15:05.620 | on two different data frames without turning it into a function.
00:15:12.560 | So if I sort by store and by date, then this is keeping track of the last time something
00:15:20.160 | happened and so this is therefore going to end up telling me how many days was it since
00:15:24.800 | the last school holiday.
00:15:28.080 | So now if I sort date descending and call the exact same function, then it's going to
00:15:36.680 | say how long until the next school holiday.
00:15:41.600 | So that's a nice little trick for adding these kind of arbitrary event timers into your time
00:15:48.880 | series models.
00:15:49.880 | So if you're doing, for example, the Ecuadorian Groceries competition right now, maybe this
00:15:55.840 | kind of approach would be useful for various events in that as well.
00:16:01.160 | Do it for state holiday, do it for promo, here we go.
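Putting that together, the driver loop described above might look roughly like this sketch; the field names come from the lecture, and the get_elapsed signature follows the sketch earlier rather than the notebook's exact code.

```python
for fld in ['SchoolHoliday', 'StateHoliday', 'Promo']:
    # ascending by date: how long since the last event ("After")
    df = df.sort_values(['Store', 'Date'])
    df = get_elapsed(df, fld, 'After')
    # date descending within each store: how long until the next event ("Before")
    df = df.sort_values(['Store', 'Date'], ascending=[True, False])
    df = get_elapsed(df, fld, 'Before')
```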
00:16:11.760 | The next thing that we look at here is rolling functions.
00:16:19.120 | So rolling in pandas is how we create what we call windowing functions.
00:16:32.600 | Let's say I had some data, something like this, and this is like date, and I don't know
00:16:51.960 | this is like sales or whatever, what I could do is I could say let's create a window around
00:17:00.640 | this point of like 7 days, so it would be like okay, this is a 7 day window.
00:17:11.480 | And so then I could take the average sales in that 7 day window, and I could do the same
00:17:17.880 | thing like I don't know, over here, take the average sales over that 7 day window.
00:17:26.180 | And so if we do that for every point and join up those averages, you're going to end up
00:17:31.280 | with a moving average.
00:17:35.820 | So the more generic version of the moving average is a window function, i.e. something
00:17:46.020 | where you apply some function to some window of data around each point.
00:17:52.440 | Now very often the windows that I've shown here are not actually what you want, if you're
00:17:58.240 | trying to build a predictive model you can't include the future as part of a moving average.
00:18:04.800 | So quite often you actually need a window that ends here, so that would be our window
00:18:12.640 | function.
00:18:14.880 | And so Pandas lets you create arbitrary window functions using this rolling here.
00:18:25.640 | This here says how many time steps do I want to apply the function to.
00:18:32.400 | This here says if I'm at the edge, so in other words if I'm like out here, should you make
00:18:39.200 | that a missing value because I don't have 7 days to average over, or what's the minimum
00:18:46.400 | number of time periods to use?
00:18:48.480 | So here I said 1, and then optionally you can also say do you want to set the window at
00:18:55.080 | the start of a period, or the end of a period, or the middle of the period.
00:19:02.120 | And then within that you can apply whatever function you like.
00:19:05.700 | So here I've got my weekly by store sums.
00:19:13.000 | So there's a nice easy way of getting moving averages, or whatever else.
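For example, a backward-looking 7-day rolling sum per store might look like this minimal sketch; the column names match the Rossmann data, but the exact notebook code may differ slightly.

```python
# 7-row window, tolerating short windows at the start (min_periods=1); because the
# frame is sorted by date ascending, the window only ever looks backwards in time.
bwd = (df[['Store', 'Date', 'SchoolHoliday']]
       .set_index('Date')
       .sort_index()
       .groupby('Store')['SchoolHoliday']
       .rolling(7, min_periods=1)
       .sum())
```

Sorting the date index descending instead gives the forward-looking version of the same window.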
00:19:20.920 | And I should mention in Pandas, if you go to the time series page on Pandas, there's
00:19:29.400 | literally like, look at just the index here, time series functionality, all of this, this,
00:19:39.680 | there's lots.
00:19:40.880 | Because Wes McKinney, who created this, was originally in hedge fund trading, I believe,
00:19:47.520 | and his work was all about time series.
00:19:50.840 | And so I think Pandas originally was very focused on time series, and still it's perhaps
00:19:56.480 | the strongest part of Pandas.
00:19:58.640 | So if you're playing around with time series computations, you definitely owe it to yourself
00:20:04.320 | to try to learn this entire API.
00:20:08.880 | And there's a lot of conceptual pieces around time stamps, and date offsets, and resampling,
00:20:19.280 | and stuff like that to kind of get your head around, but it's totally worth it because
00:20:23.880 | otherwise you'll be writing this stuff as loops by hand, it's going to take you a lot
00:20:28.240 | longer than leveraging what Pandas already does, and of course Pandas will do it in highly
00:20:34.760 | optimized C code for you, vectorized C code, whereas your version is going to loop in Python.
00:20:41.040 | So it's definitely worth, if you're doing stuff with time series, learning the full Pandas time
00:20:47.840 | series API; it's about as strong as any time series API out there.
00:20:54.960 | Okay, so at the end of all that, you can see here's those kind of starting point values
00:21:01.280 | I mentioned, slightly on the extreme side, and so you can see here the 17th of September
00:21:10.560 | store 1 was 13 days after the last school holiday, the 16th was 12, then 11, 10, and so forth.
00:21:19.080 | We're currently in a promotion, here this is one day before the promotion, here we've
00:21:26.560 | got 9 days after the last promotion, and so forth.
00:21:32.640 | So that's how we can add kind of event counters to a time series, and probably always a good
00:21:40.840 | idea when you're doing work with time series.
00:21:46.780 | So now that we've done that, we've got lots of columns in our dataset, and so we split
00:21:53.360 | them out into categorical versus continuous columns, we'll talk more about that in a moment
00:21:59.760 | in the review section.
00:22:01.480 | So these are going to be all the things I'm going to create an embedding for.
00:22:05.480 | And these are all of the things that I'm going to feed directly into the model.
00:22:11.960 | So for example, we've got competition distance, that's distance to the nearest competitor,
00:22:18.860 | maximum temperature, and here we've got day of week.
00:22:31.320 | So here we've got maximum temperature, maybe it's like 22.1, centigrade in Germany, we've
00:22:41.160 | got distance to nearest competitor, might be 321 kilometers, 0.7, and then we've got
00:22:51.200 | day of week, which might be Saturday as a 6.
00:22:58.760 | So these numbers here are going to go straight into our vector, the vector that we're going
00:23:12.200 | to be feeding into our neural net.
00:23:16.480 | We'll see in a moment we'll normalize them, but more or less.
00:23:29.200 | But this categorical variable we're not, we need to put it through an embedding.
00:23:34.280 | So we'll have some embedding matrix of, if there are 7 days, maybe dimension 4 embedding,
00:23:45.280 | and so this will look up the 6th row to get back the 4 items.
00:23:52.920 | And so this is going to turn into length 4 vector, which we'll then add here.
00:24:09.280 | So that's how our continuous and categorical variables are going to work.
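Here is a hedged PyTorch sketch of that: a 7-row, 4-dimensional embedding for day of week, concatenated with the continuous inputs (the numbers are the illustrative ones from the lecture).

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=7, embedding_dim=4)  # 7 days, 4-dimensional embedding
day_of_week = torch.tensor([6])                         # "Saturday as a 6"
emb_vec = emb(day_of_week)                              # shape [1, 4]: the looked-up row

# continuous inputs, e.g. max temperature and competition distance
# (these would be normalized before reaching the network in practice)
cont = torch.tensor([[22.1, 321.7]])

x = torch.cat([emb_vec, cont], dim=1)  # shape [1, 6]: what the first linear layer sees
```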
00:24:22.120 | So then all of our categorical variables, we'll turn them into Panda's categorical variables
00:24:29.240 | in the same way that we've done before.
00:24:33.920 | And then we're going to apply the same mappings to the test set.
00:24:38.000 | So if Saturday is a 6 in the training set, this apply_cats makes sure that Saturday is
00:24:44.640 | also a 6 in the test set.
00:24:48.040 | For the continuous variables, make sure they're all floats because PyTorch expects everything
00:24:53.280 | to be a float.
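The course uses fastai helpers (train_cats/apply_cats) for this; a plain-pandas sketch of the same consistency requirement, with cat_vars/contin_vars and train_df/test_df as placeholder names, looks like this.

```python
import pandas as pd

for col in cat_vars:
    # categorize the training column, then reuse its categories on the test set
    train_df[col] = train_df[col].astype('category').cat.as_ordered()
    test_df[col] = pd.Categorical(test_df[col],
                                  categories=train_df[col].cat.categories,
                                  ordered=True)

for col in contin_vars:
    # PyTorch expects floats for the continuous inputs
    train_df[col] = train_df[col].astype('float32')
    test_df[col] = test_df[col].astype('float32')
```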
00:24:57.960 | So then, this is another little trick that I use.
00:25:03.080 | Both of these cells define something called joined_samp.
00:25:07.760 | One of them defines them as the whole training set.
00:25:12.020 | One of them defines them as a random subset.
00:25:16.200 | And so the idea is that I do all of my work on the sample, make sure it all works well,
00:25:21.560 | play around with different hyperparameters and architectures, and then I'm like, "Okay,
00:25:25.620 | I'm very happy with this."
00:25:26.880 | I then go back and run this line of code to say, "Okay, now make the whole data set be
00:25:33.360 | the sample," and then rerun it.
00:25:36.000 | This is a good way, again, similar to what I showed you before, it lets you use the same
00:25:40.400 | cells in your notebook to run first of all on a sample, and then go back later and run
00:25:45.800 | it on the full data set.
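A sketch of that two-cell trick; joined is assumed to be the fully prepared dataframe, and the sample size is illustrative.

```python
import numpy as np

n = len(joined)

# Cell A: work on a random subset while experimenting.
idxs = np.random.permutation(n)[:150_000]
joined_samp = joined.iloc[idxs].set_index('Date')

# Cell B: once happy, run this cell instead and re-run everything below it,
# so the same downstream cells now operate on the full data set.
joined_samp = joined.set_index('Date')
```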
00:25:52.280 | So now that we've got that joined_samp, we can then pass it to proc.df as we've done
00:25:57.200 | before to grab the dependent variable to deal with missing values, and in this case we pass
00:26:05.280 | one more thing, which is do_scale=True.
00:26:09.080 | do_scale=True will subtract the mean and divide by the standard deviation.
00:26:18.240 | And so the reason for that is that if our first layer is just a matrix multiply, so
00:26:25.720 | here's our set of weights, and our input is like, I don't know, it's got something which
00:26:32.520 | is like 0.001, and then it's got something which is like 10^6, and then our weight matrix
00:26:41.440 | has been initialized to be like random numbers between 0 and 1, so we've got 0.6, 0.1, etc.
00:26:50.720 | Then basically this thing here is going to have gradients that are 9 orders of magnitude
00:26:57.240 | bigger than this thing here, which is not going to be good for optimization.
00:27:03.720 | So by normalizing everything to be mean of 0, standard deviation of 1 to start with,
00:27:10.520 | then that means that all of the gradients are going to be on the same kind of scale.
00:27:19.320 | We didn't have to do that in random forests, because in random forests we only cared about
00:27:24.520 | the sort order.
00:27:26.220 | We didn't care about the values at all, but with linear models and things that are built
00:27:33.920 | out of layers of linear models, like neural nets, we care very much about the scale.
00:27:42.000 | So do_scale=True normalizes our data for us.
00:27:45.840 | Now since it normalizes our data for us, it returns one extra object, which is a mapper,
00:27:52.080 | which is an object that contains for each continuous variable what was the mean and
00:27:56.960 | standard deviation it was normalized with, the reason being that we're going to have
00:28:02.740 | to use the same mean and standard deviation on the test set, because we need our test
00:28:09.480 | set and our training set to be scaled in the exact same way, otherwise they're going to
00:28:13.140 | have different meanings.
00:28:16.000 | And so these details about making sure that your test and training set have the same categorical
00:28:23.240 | codings, the same missing value replacement and the same scaling normalization are really
00:28:30.560 | important to get right, because if you don't get it right then your test set is not going
00:28:36.840 | to work at all.
00:28:41.080 | But if you follow these steps, it'll work fine.
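A hedged sketch of the proc_df step from the old fastai (0.7) structured-data helpers, roughly as used in the course; the returned mapper carries each continuous column's mean and standard deviation so the test set gets scaled identically.

```python
# 'Sales' is the dependent variable; do_scale=True normalizes the continuous columns.
df, y, nas, mapper = proc_df(joined_samp, 'Sales', do_scale=True)

# Reuse the training set's missing-value dictionary and scaling mapper on the test set.
df_test, _, nas, mapper = proc_df(joined_test, 'Sales', do_scale=True,
                                  na_dict=nas, mapper=mapper)
```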
00:28:45.280 | We also take the log of the dependent variable, and that's because in this Kaggle competition
00:28:51.340 | the evaluation metric was root mean squared percent error.
00:28:55.960 | So root mean squared percent error means we're being penalized based on the ratio between
00:29:02.720 | our answer and the correct answer.
00:29:07.120 | We don't have a loss function in PyTorch called root mean squared percent error.
00:29:12.160 | We could write one, but easier is just to take the log of the dependent because the
00:29:17.040 | difference between logs is the same as the ratio.
00:29:20.960 | So by taking the log we get that for free.
00:29:24.180 | You'll notice the vast majority of regression competitions on Kaggle use either root mean
00:29:32.360 | squared percent error or root mean squared error of the log as their evaluation metric,
00:29:37.680 | and that's because in real-world problems most of the time we care more about ratios
00:29:43.560 | than about raw differences.
00:29:46.140 | So if you're designing your own project, it's quite likely that you'll want to think about
00:29:52.600 | using the log of your dependent variable.
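A short sketch of taking the log of the dependent variable; the y_range lines are an extra convention from the course notebook (an assumption here, not discussed above) that gets used later when the learner is created.

```python
import numpy as np

yl = np.log(y)                      # train on log(sales): differences of logs are ratios
max_log_y = np.max(yl)
y_range = (0, max_log_y * 1.2)      # a loose output range passed to the model later
```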
00:30:01.080 | So then we create a validation set, and as we've learned before, most of the time if you've
00:30:06.700 | got a problem involving a time component, your validation set probably wants to be the most
00:30:12.720 | recent time period rather than a random subset, so that's what I do here.
00:30:20.240 | When I finished modeling and I found an architecture and a set of hyperparameters and a number
00:30:24.880 | of epochs and all that stuff that works really well, if I want to make my model as good as
00:30:29.640 | possible I'll retrain on the whole thing, including the validation set.
00:30:36.400 | Now currently at least fastAI assumes that you do have a validation set, so my kind of
00:30:41.640 | hacky workaround is to set my validation set to just be one index, which is the first row,
00:30:48.000 | in that way all the code keeps working but there's no real validation set.
00:30:53.000 | So obviously if you do this you need to make sure that your final training is like the
00:30:59.040 | exact same hyperparameters, the exact same number of epochs, exactly the same as the thing
00:31:04.040 | that worked, because you don't actually have a proper validation set now to check against.
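Sketch of both pieces: a time-based validation set (the dates shown are the Rossmann ones, for illustration), and the one-row hack for the final retrain.

```python
import datetime
import numpy as np

# Most recent period as the validation set (df is indexed by Date here).
val_idx = np.flatnonzero((df.index <= datetime.datetime(2014, 9, 17)) &
                         (df.index >= datetime.datetime(2014, 8, 1)))

# Final retrain on everything: keep the code paths working with a trivial
# one-row "validation set" (no real validation is happening any more).
val_idx = [0]
```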
00:31:08.960 | I have a question regarding get elapsed function which we discussed before, so in get elapsed
00:31:17.680 | function we are trying to find when will the next holiday come?
00:31:26.140 | How many days away is it?
00:31:28.160 | So every year the holidays are more or less fixed, like there will be holiday on 4th of
00:31:33.520 | July, 25th of December and there's hardly any change.
00:31:37.440 | So can't we just look from previous years and just get a list of all the holidays that
00:31:42.540 | are going to occur this year?
00:31:46.480 | Maybe, in this case I guess that's not true of promo, and some holidays change, like Easter,
00:31:56.200 | so this way I get to write one piece of code that works for all of them, and it doesn't
00:32:05.520 | take very long to run.
00:32:08.400 | So there might be ways, if your dataset was so big that this took too long you could maybe
00:32:13.160 | do it on one year and then somehow copy it, but in this case there was no need to.
00:32:18.160 | And I always value my time over my computer's time, so I try to keep things as simple as
00:32:26.920 | I can.
00:32:31.320 | So now we can create our model, and so to create our model we have to create a model
00:32:37.160 | data object, as we always do with fast.ai, so a columnar model object is just a model
00:32:42.680 | data object that represents a training set, a validation set, and an optional test set
00:32:47.440 | of standard columnar structured data.
00:32:53.560 | We just have to tell it which of the variables should we treat as categorical.
00:33:00.840 | And then pass in our dataframes.
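A hedged sketch of that call with the old fastai (0.7) API as used in the course; PATH, val_idx, df, yl, cat_vars and df_test all come from earlier steps, and the batch size is illustrative.

```python
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32),
                                       cat_flds=cat_vars, bs=128, test_df=df_test)
```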
00:33:07.720 | So for each of our categorical variables, here is the number of categories it has.
00:33:17.540 | So for each of our embedding matrices, this tells us the number of rows in that embedding
00:33:23.680 | matrix.
00:33:26.560 | And so then we define what embedding dimensionality we want.
00:33:34.800 | If you're doing natural language processing, then the number of dimensions you need to
00:33:39.680 | capture all the nuance of what a word means and how it's used has been found empirically
00:33:45.240 | to be about 600.
00:33:48.560 | It turns out that when you do NLP models with embedding matrices that are smaller than 600,
00:33:59.400 | you don't get as good results as you do with size 600; beyond 600, it doesn't
00:34:04.680 | seem to improve much.
00:34:07.280 | I would say that human language is one of the most complex things that we model, so
00:34:14.040 | I wouldn't expect you to come across many if any categorical variables that need embedding
00:34:19.840 | matrices with more than 600 dimensions.
00:34:25.120 | At the other end, some things may have pretty simple kind of causality.
00:34:33.960 | So for example, state holiday, maybe if something's a holiday, then it's just a case of stores
00:34:49.860 | that are in the city, there's some behavior, there's stores that are in the country, there's
00:34:53.920 | some other behavior, and that's about it.
00:34:57.600 | Maybe it's a pretty simple relationship.
00:35:02.040 | So ideally when you decide what embedding size to use, you would kind of use your knowledge
00:35:11.280 | about the domain to decide how complex the relationship is and so how big an embedding
00:35:18.440 | do I need.
00:35:20.280 | In practice, you almost never know that.
00:35:24.920 | You would only know that because maybe somebody else has previously done that research and
00:35:28.360 | figured it out, like in NLP.
00:35:32.160 | So in practice, you probably need to use some rule of thumb, and then having tried your
00:35:38.680 | rule of thumb, you could then maybe try a little bit higher and a little bit lower and
00:35:43.000 | see what helps, so it's kind of experimental.
00:35:45.720 | So here's my rule of thumb.
00:35:46.880 | My rule of thumb is look at how many discrete values the category has, i.e. the number of
00:35:54.760 | rows in the embedding matrix, and make the dimensionality of the embedding half of that.
00:36:00.840 | So for day of week, which is the second one, 8 rows and 4 columns.
00:36:10.160 | So here it is there, the number of categories divided by 2.
00:36:14.960 | But then I say, don't go more than 50.
00:36:18.000 | So here you can see for stores, there's 1000 stores, I only have a dimensionality of 50.
00:36:22.000 | Why 50?
00:36:23.000 | I don't know, it seems to have worked okay so far.
00:36:26.100 | You may find you need something a little different.
00:36:29.400 | Actually for the Ecuadorian groceries competition, I haven't really tried playing with this, but
00:36:34.400 | I think we may need some larger embedding sizes, but it's something to fiddle with.
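The rule of thumb as code, roughly as in the course notebook; joined_samp and cat_vars are as before, and the +1 leaves a row for missing/unknown categories.

```python
# (cardinality, embedding width) per categorical variable:
cat_sz = [(c, len(joined_samp[c].cat.categories) + 1) for c in cat_vars]
emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]   # half the cardinality, capped at 50
```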
00:36:41.440 | Prince, can you pass that left?
00:36:44.840 | So as your variables, the cardinality size becomes larger and larger, you're creating
00:36:50.200 | more and more or wider embedding matrices, aren't you therefore massively risking overfitting,
00:36:57.200 | because you're just choosing so many parameters that the model can never possibly capture
00:37:00.280 | all that variation unless your data is absolutely huge?
00:37:03.780 | That's a great question.
00:37:04.800 | And so let me remind you about my kind of golden rule of the difference between modern
00:37:09.760 | machine learning and old machine learning.
00:37:13.280 | In old machine learning, we control complexity by reducing the number of parameters.
00:37:18.160 | In modern machine learning, we control complexity by regularization.
00:37:22.240 | So the answer is no, I'm not concerned about overfitting, because the way I avoid overfitting
00:37:27.640 | is not by reducing the number of parameters, but by increasing my dropout or increasing
00:37:33.880 | my weight decay.
00:37:37.960 | Having said that, there's no point using more parameters for a particular embedding than
00:37:44.240 | I need, because regularization is penalizing a model by giving it more random data or by
00:37:52.720 | actually penalizing weights, so we'd rather not use more than we have to.
00:37:59.520 | But my general rule of thumb for designing an architecture is to be generous on the side
00:38:07.080 | of the number of parameters.
00:38:08.400 | But in this case, if after doing some work we felt like, you know what, the store doesn't
00:38:16.560 | actually seem to be that important, then I might manually go and change this to make
00:38:22.360 | it smaller.
00:38:23.360 | Or if I was really finding there's not enough data here, I'm either overfitting or I'm using
00:38:28.880 | more regularization than I'm comfortable with, again, then you might go back.
00:38:32.960 | But I would always start with being generous with parameters, and in this case, this model
00:38:39.400 | turned out pretty good.
00:38:42.440 | So now we've got a list of tuples containing the number of rows and columns of each of
00:38:46.480 | our embedding matrices.
00:38:48.240 | And so when we call get-learner to create our neural net, that's the first thing we
00:38:52.480 | pass in, is how big is each of our embeddings.
00:38:58.360 | And then we tell it how many continuous variables we have.
00:39:02.760 | We tell it how many activations to create for each layer, and we tell it what dropout
00:39:07.120 | to use for each layer.
00:39:10.480 | And so then we can go ahead and call fit.
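A sketch of the learner creation and fit with the old fastai API; the hyperparameters shown are close to the course notebook's but should be treated as illustrative (lr, y_range and exp_rmspe are defined elsewhere in the notebook).

```python
m = md.get_learner(emb_szs,                         # embedding sizes from above
                   len(df.columns) - len(cat_vars), # number of continuous variables
                   0.04,                            # embedding dropout
                   1,                               # output size (one number: log sales)
                   [1000, 500],                     # activations per hidden layer
                   [0.001, 0.01],                   # dropout per hidden layer
                   y_range=y_range)
m.fit(lr, 3, metrics=[exp_rmspe])
```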
00:39:18.600 | So then we fit for a while, and we're kind of getting something around the 0.1 mark.
00:39:25.620 | So I tried running this on the test set and I submitted it to Kaggle during the week, actually
00:39:33.200 | last week, and here it is.
00:39:40.200 | Private score 107, public score 103.
00:39:46.580 | So let's have a look and see how that would go.
00:39:48.860 | So 107, private 103 public, so let's start on public, which is 103, not there, out of
00:40:04.520 | 3000, got to go back a long way.
00:40:14.480 | There it is, 103, okay, 340th.
00:40:20.780 | That's not good.
00:40:23.600 | So on the public leaderboard, 340th.
00:40:25.960 | Let's try the private leaderboard, which is 107, oh, 5th.
00:40:36.960 | So hopefully you're now thinking, oh, there are some Kaggle competitions finishing soon,
00:40:42.600 | which I entered, and I spent a lot of time trying to get good results on the public leaderboard.
00:40:47.280 | I wonder if that was a good idea.
00:40:49.400 | And the answer is, no it won't.
00:40:51.600 | The Kaggle public leaderboard is not meant to be a replacement for your carefully developed
00:40:58.880 | validation set.
00:41:01.300 | So for example, if you're doing the iceberg competition, which ones are ships, which ones
00:41:06.600 | are icebergs, then they've actually put something like 4000 synthetic images into the public
00:41:13.520 | leaderboard and none into the private leaderboard.
00:41:18.080 | So this is one of the really good things that tests you out on Kaggle, is like are you creating
00:41:27.880 | a good validation set and are you trusting it?
00:41:30.880 | Because if you're trusting your leaderboard feedback more than your validation feedback,
00:41:36.960 | then you may find yourself in 350th place when you thought you were in 5th.
00:41:43.200 | So in this case, we actually had a pretty good validation set, because as you can see,
00:41:47.920 | it's saying somewhere around 0.1, and we actually did get somewhere around 0.1.
00:41:55.840 | And so in this case, the public leaderboard in this competition was entirely useless.
00:42:04.720 | Can you use the box please?
00:42:07.760 | So in regards to that, how much does the top of the public leaderboard actually correspond
00:42:13.240 | to the top of the private leaderboard?
00:42:14.880 | Because in the churn prediction challenge, there's like four people who are just completely
00:42:22.160 | above everyone else.
00:42:23.840 | It totally depends.
00:42:26.960 | If they randomly sample the public and private leaderboard, then it should be extremely indicative.
00:42:45.360 | So in this case, the person who was second on the public leaderboard did end up winning.
00:42:54.480 | SDNT came 7th.
00:43:02.440 | So in fact you can see the little green thing here, whereas this guy jumped 96 places.
00:43:11.520 | If we had entered with the neural net we just looked at, we would have jumped 350 places.
00:43:14.980 | So it just depends.
00:43:18.060 | And so often you can figure out whether the public leaderboard -- like sometimes they'll
00:43:24.440 | tell you the public leaderboard was randomly sampled, sometimes they'll tell you it's not.
00:43:29.000 | Generally you have to figure it out by looking at the correlation between your validation
00:43:33.440 | set results and the public leaderboard results to see how well they're correlated.
00:43:40.880 | Sometimes if two or three people are way ahead of everybody else, they may have found some
00:43:43.960 | kind of leakage or something like that.
00:43:48.920 | That's often a sign that there's some trick.
00:43:57.040 | So that's Rossman, and that brings us to the end of all of our material.
00:44:06.440 | So let's come back after the break and do a quick review, and then we will talk about
00:44:13.760 | ethics and machine learning.
00:44:15.400 | So let's come back in 5 minutes.
00:44:22.540 | So we've learnt two ways to train a model.
00:44:29.480 | One is by building a tree, and one is with SGD.
00:44:36.280 | And so the SGD approach is a way we can train a model which is a linear model or a stack
00:44:46.240 | of linear layers with nonlinearities between them, whereas tree building specifically will
00:44:53.400 | give us a tree.
00:44:55.800 | And then tree building we can combine with bagging to create a random forest, or with
00:45:01.680 | boosting to create a GBM, or various other slight variations such as extremely randomized
00:45:09.200 | trees.
00:45:12.200 | So it's worth reminding ourselves of what these things do.
00:45:22.000 | So let's look at some data.
00:45:33.320 | So if we've got some data like so, actually let's look specifically at categorical data.
00:45:48.640 | So categorical data, there's a couple of possibilities of what categorical data might look like.
00:45:54.500 | It could be like, let's say we've got zip code, so we've got 94003 as our zip code,
00:46:01.440 | and then we've got sales, and it's like 50, and 94131, sales of 22, and so forth.
00:46:14.560 | So we've got some categorical variable.
00:46:18.080 | So there's a couple of ways we could represent that categorical variable.
00:46:23.340 | One would be just to use the number, and maybe it wasn't a number at all, maybe our categorical
00:46:30.960 | variable is like San Francisco, New York, Mumbai, and Sydney.
00:46:39.960 | But we can turn it into a number just by arbitrarily deciding to give them numbers.
00:46:45.880 | So it ends up being a number.
00:46:47.840 | So we could just use that kind of arbitrary number.
00:46:51.080 | So if it turns out that zip codes that are numerically next to each other have somewhat
00:46:59.660 | similar behavior, then the zip code versus sales chart might look something like this.
00:47:16.340 | Or alternatively, if the two zip codes next to each other didn't have in any way similar
00:47:28.720 | sales behavior, you would expect to see something that looked more like this, just all over
00:47:36.200 | the place.
00:47:41.000 | So they're the kind of two possibilities.
00:47:44.960 | So what a random forest would do if we had just encoded zip in this way is it's going
00:47:50.740 | to say, alright, I need to find my single best split point.
00:47:56.880 | The split point is going to make the two sides have as small a standard deviation as possible,
00:48:03.520 | or mathematically equivalently have the lowest root mean squared error.
00:48:08.040 | So in this case it might pick here as our first split point, because on this side there's
00:48:18.240 | one average, and on the other side there's the other average.
00:48:23.880 | And then for its second split point it's going to say, okay, how do I split this?
00:48:29.400 | And it's probably going to say I would split here, because now we've got this average versus
00:48:37.920 | this average.
00:48:40.200 | And then finally it's going to say, okay, how do we split here?
00:48:44.080 | And it's going to say, okay, I'll split there.
00:48:47.080 | So now I've got that average and that average.
00:48:50.200 | So you can see that it's able to kind of hone in on the set of splits it needs, even though
00:48:56.360 | it kind of does it greedily, top down one at a time.
00:48:59.760 | The only reason it wouldn't be able to do this is if it was just such bad luck that
00:49:05.080 | the two halves were kind of always exactly balanced, but even if that happens it's not
00:49:10.680 | going to be the end of the world, it will split on something else, some other variable,
00:49:15.080 | and next time around it's very unlikely that it's still going to be exactly balanced in
00:49:20.440 | both parts of the tree.
00:49:22.160 | So in practice this works just fine.
00:49:26.640 | In the second case, it can do exactly the same thing.
00:49:31.120 | It'll say, okay, which is my best first split, even though there's no relationship between
00:49:38.040 | one zip code and its neighboring zip code numerically.
00:49:41.280 | We can still see here if it splits here, there's the average on one side, and the average on
00:49:48.200 | the other side is probably about here.
00:49:52.040 | And then where would it split next?
00:49:54.800 | Probably here, because here's the average on one side, here's the average on the other
00:49:58.520 | side.
00:50:00.320 | So again, it can do the same thing, it's going to need more splits because it's going to
00:50:04.000 | end up having to kind of narrow down on each individual large zip code and each individual
00:50:08.680 | small zip code, but it's still going to be fine.
00:50:12.320 | So when we're dealing with building decision trees for random forests or GBMs or whatever,
00:50:20.600 | we tend to encode our variables just as ordinals.
00:50:27.640 | On the other hand, if we're doing a neural network, or like a simplest version, like a
00:50:35.000 | linear regression or a logistic regression, the best it could do is that, which is no
00:50:43.640 | good at all, and ditto with this one, it's going to be like that.
00:50:48.360 | So an ordinal is not going to be a useful encoding for a linear model or something that
00:50:57.160 | stacks linear and nonlinear models together.
00:51:01.640 | So instead, what we do is we create a one-hot encoding.
00:51:05.840 | So we'll say, 1, 0, 0, 0; here's 0, 1, 0, 0; here's 0, 0, 1, 0; and here's 0, 0, 0, 1.
00:51:18.360 | And so with that encoding, it can effectively create a little histogram where it's going
00:51:24.800 | to have a different coefficient for each level.
00:51:29.000 | So that way it can do exactly what it needs to do.
00:51:32.240 | At what point does that become too tedious for your system, or does it not?
00:51:42.640 | Pretty much never.
00:51:48.000 | Because remember, in real life we don't actually have to create that matrix, instead we can
00:51:55.920 | just have the 4 coefficients and just do an index lookup to grab the second one, which
00:52:02.920 | is mathematically equivalent to multiplying by the one-hot encoding.
00:52:07.360 | So that's no problem.
00:52:15.400 | One thing to mention, I know you guys have been taught quite a bit of more analytical
00:52:22.040 | solutions to things.
00:52:24.920 | And in analytical solutions to linear regression, you can't solve something with this amount
00:52:36.480 | of collinearity.
00:52:37.880 | In other words, you know something is Sydney if it's not Mumbai or New York or San Francisco.
00:52:45.980 | In other words, there's 100% collinearity between the fourth of these classes versus
00:52:51.320 | the other three.
00:52:52.560 | And so if you try to solve a linear regression analytically, that way the whole thing falls
00:52:56.520 | apart.
00:52:57.520 | Now note, with SGD we have no such problem, like SGD, why would it care?
00:53:03.480 | We're just taking one step along the derivative.
00:53:07.000 | It cares a little, because in the end the main problem with collinearity is that there's
00:53:13.760 | an infinite number of equally good solutions.
00:53:17.680 | So in other words, we could increase all of these and decrease this, or decrease all of
00:53:23.400 | these and increase this, and they're going to balance out.
00:53:28.240 | And when there's an infinitely large number of good solutions, that means there's a lot
00:53:32.600 | of kind of flat spots in the loss surface, and it can be harder to optimize.
00:53:38.960 | So it's a really easy way to get rid of all of those flat spots, which is to add a little
00:53:42.400 | bit of regularization.
00:53:43.600 | So if we added a little bit of weight decay, like 1e neg 7 even, then that basically says
00:53:50.120 | these are not all equally good anymore, the one which is the best is the one where the
00:53:54.400 | parameters are the smallest and the most similar to each other, and so that will again move
00:54:00.320 | it back to being a nice loss function.
00:54:02.280 | Could you just clarify that point you made about why one-hot encoding wouldn't be that
00:54:09.360 | tedious?
00:54:11.880 | Sure.
00:54:13.920 | If we have a one hot-encoded vector, and we are multiplying it by a set of coefficients,
00:54:27.520 | then that's exactly the same thing as simply saying let's grab the thing where the 1 is.
00:54:33.440 | So in other words, if we had stored this as a 0, and this one as a 1, and this one as
00:54:40.120 | a 2, then it's exactly the same as just saying look up that thing in the array.
00:54:47.060 | And so we call that version an embedding.
00:54:50.480 | So an embedding is a weight matrix you can multiply by a one-hot encoding, and it's just
00:54:57.120 | a computational shortcut, but it's mathematically the same.
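A tiny numpy check of that equivalence (all numbers illustrative):

```python
import numpy as np

W = np.random.randn(4, 3)         # 4 categories, 3-dimensional "embedding" weight matrix
one_hot = np.array([0, 0, 1, 0])  # category 2 as a one-hot vector

via_matmul = one_hot @ W          # multiply the one-hot encoding by the weight matrix
via_lookup = W[2]                 # just index the matrix directly

assert np.allclose(via_matmul, via_lookup)  # mathematically the same thing
```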
00:55:03.840 | So there's a key difference between solving linear-type models analytically versus with SGD.
00:55:14.200 | With SGD we don't have to worry about collinearity and stuff, or at least not nearly to the same
00:55:18.880 | degree, and then the difference between solving a linear or a single layer or multilayer model
00:55:27.760 | with SGD versus a tree, a tree is going to complain about less things.
00:55:34.040 | So in particular you can just use ordinals as your categorical variables.
00:55:39.140 | And as we learned just before, we also don't have to worry about normalizing continuous
00:55:45.200 | variables for a tree, but we do have to worry about it for these SGD-trained models.
00:55:54.120 | So then we also learned a lot about interpreting random forests in particular.
00:56:00.800 | And if you're interested, you may be interested in trying to use those same techniques to
00:56:06.840 | interpret neural nets.
00:56:11.840 | So if you want to know which of my features are important in a neural net, you could try
00:56:15.920 | the same thing.
00:56:16.920 | Try shuffling each column in turn and see how much it changes your accuracy, and that's
00:56:23.400 | going to be your feature importance for your neural net.
00:56:26.840 | And then if you really want to have fun, recognize then that shuffling that column is just a
00:56:33.300 | way of calculating how sensitive the output is to that input, which in other words is
00:56:38.760 | the derivative of the output with respect to that input.
00:56:43.640 | And so therefore maybe you could just ask PyTorch to give you the derivatives with respect
00:56:47.840 | to the input directly, and see if that gives you the same kind of answers.
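A sketch of that shuffle-a-column (permutation) importance for any fitted model; model, score, X_val and y_val are placeholders, not names from the lecture.

```python
import numpy as np

def permutation_importance(model, X_val, y_val, score):
    """Drop in validation score when each column is shuffled in turn."""
    base = score(model, X_val, y_val)
    importances = {}
    for col in X_val.columns:
        shuffled = X_val.copy()
        # shuffling breaks the column's relationship with the target
        shuffled[col] = np.random.permutation(shuffled[col].values)
        importances[col] = base - score(model, shuffled, y_val)
    return importances
```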
00:56:55.120 | You could do the same kind of thing for a partial dependence plot, you could try doing
00:56:59.200 | the exact same thing with your neural net, replace everything in a column with the same
00:57:03.560 | value, do it for 1960, 1961, 1962, plot that.
00:57:08.880 | I don't know of anybody who's done these things before, not because it's rocket science, but
00:57:13.600 | just because I don't know, maybe no one thought of it, or it's not in a library, but if somebody
00:57:20.240 | tried it, I think you should find it useful, it would make a great blog post, maybe even
00:57:24.600 | a paper if you wanted to take it a bit further.
00:57:27.680 | So there's a thought that something could do.
00:57:29.320 | So most of those interpretation techniques are not particularly specific to random forests.
00:57:34.720 | Things like the tree interpreter certainly are, because they're all about what's inside
00:57:38.520 | the tree.
00:57:39.520 | Can you pass it to Karen?
00:57:43.400 | If we are applying the tree interpreter idea to neural nets,
00:57:46.200 | how are we going to make inferences out of the activations along the path, for example?
00:57:53.200 | Because in the tree interpreter
00:57:55.240 | we're looking at the paths and the contributions of the features.
00:58:02.360 | In this case, it will be same with activations, I guess, the contributions of each activation
00:58:06.880 | on their path.
00:58:07.880 | Yeah, maybe.
00:58:08.880 | I don't know.
00:58:09.880 | I haven't thought about it.
00:58:10.880 | How can we make inference out of the activations?
00:58:14.680 | So I'd be careful saying the word inference, because people normally use the word inference
00:58:17.960 | specifically to mean the same as a test time prediction.
00:58:22.820 | You maybe mean some kind of way to interrogate the model.
00:58:25.440 | I'm not sure.
00:58:26.440 | We should think about that.
00:58:28.640 | Actually Hinton and one of his students just published a paper on how to approximate a
00:58:32.900 | neural net with a tree for this exact reason, which I haven't read the paper yet.
00:58:38.760 | Could you pass that?
00:58:44.040 | So in linear regression and traditional statistics, one of the things that we focused on was statistical
00:58:50.440 | significance of like the changes and things like that.
00:58:53.600 | And so when thinking about a tree interpreter or even like the waterfall chart, which I
00:58:58.160 | guess is just a visualization, I guess where does that fit in?
00:59:02.920 | Because we can see like, oh, yeah, this looks important in the sense that it causes large
00:59:07.440 | changes.
00:59:08.440 | But how do we know that it's like traditionally statistically significant or anything of that
00:59:12.800 | sort?
00:59:13.800 | Yeah.
00:59:14.800 | So most of the time I don't care about the traditional statistical significance, and
00:59:18.720 | the reason why is that nowadays the main driver of statistical significance is data volume,
00:59:25.880 | not kind of practical importance.
00:59:29.460 | And nowadays most of the models you build will have so much data that like every tiny
00:59:34.380 | thing will be statistically significant, but most of them won't be practically significant.
00:59:39.960 | So my main focus therefore is practical significance, which is does the size of this influence impact
00:59:46.840 | your business?
00:59:50.720 | Statistical significance, it was much more important when we had a lot less data to work
00:59:56.600 | with.
00:59:57.680 | If you do need to know statistical significance, because for example you have a very small
01:00:01.760 | dataset because it's like really expensive to label or hard to collect or whatever, or
01:00:06.080 | it's a medical dataset for a rare disease, you can always get statistical significance
01:00:11.240 | by bootstrapping, which is to say that you can randomly resample your dataset a number
01:00:17.980 | of times, train your model a number of times, and you can then see the actual variation
01:00:24.080 | in predictions.
01:00:25.840 | So with bootstrapping, you can turn any model into something that gives you confidence intervals.
01:00:31.800 | There's a paper by Michael Jordan which has a technique called the bag of little bootstraps
01:00:37.480 | which actually kind of takes this a little bit further, well worth reading if you're
01:00:42.320 | interested.
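A minimal bootstrap sketch along those lines; fit_fn and predict_fn are placeholders for whatever model you're training.

```python
import numpy as np

def bootstrap_interval(X, y, X_new, fit_fn, predict_fn, n_boot=100):
    """Refit on resampled data n_boot times; return rough 95% intervals per prediction."""
    preds = []
    n = len(X)
    for _ in range(n_boot):
        idx = np.random.randint(0, n, n)          # sample n rows with replacement
        model = fit_fn(X.iloc[idx], y[idx])
        preds.append(predict_fn(model, X_new))
    preds = np.stack(preds)
    return np.percentile(preds, [2.5, 97.5], axis=0)
```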
01:00:43.320 | Can you pass it to Prince?
01:00:46.920 | So you said we don't need one-hot encoding matrix if we are doing random forest or if
01:00:53.400 | we are doing any tree-based models.
01:00:55.640 | What will happen if we do that and how bad can a model be?
01:00:59.960 | If you do do one-hot encoding?
01:01:02.960 | We actually did do it, remember we had that maximum category size and we did create one-hot
01:01:07.920 | encodings and the reason why we did it was that then our feature importance would tell
01:01:13.920 | us the importance of the individual levels and our partial dependence plot, we could
01:01:18.320 | include the individual levels.
01:01:20.120 | So it doesn't necessarily make the model worse, it may make it better, but it probably won't
01:01:28.240 | change it much at all.
01:01:29.240 | In this case it hardly changed it.
01:01:30.840 | This is something that we have noticed on real data also that if cardinality is higher,
01:01:36.200 | let's say 50 levels, and if you do one-hot encoding, the random forest performs very
01:01:42.080 | badly.
01:01:43.080 | Yeah, that's right.
01:01:44.080 | That's why in fast.ai we have that maximum categorical size because at some point your
01:01:52.640 | one-hot encoded variables become too sparse.
01:01:55.400 | So I generally cut it off at 6 or 7.
01:01:59.360 | Also because when you get past that it becomes less useful because of the feature importance
01:02:04.160 | there's going to be too many levels to really look at.
01:02:07.960 | So can it not look at those levels which are not important and just give those significant
01:02:28.800 | features as important?
01:02:29.800 | Yeah, it'll be okay.
01:02:30.800 | Once the cardinality increases too high you're just splitting your data up too much basically.
01:02:33.800 | And so in practice your ordinal version is likely to be better.
01:02:47.160 | There's no time to kind of review everything, but I think that's the key concepts and then
01:02:50.840 | of course remembering that the embedding matrix that we can use is likely to have more than
01:02:55.680 | just one coefficient, we'll actually have a dimensionality of a few coefficients which
01:03:00.440 | isn't going to be useful for most linear models, but once you've got multi-layer models that's
01:03:05.700 | now creating a representation of your category which is quite a lot richer and you can do
01:03:10.960 | a lot more with it.
01:03:14.120 | Let's now talk about the most important bit.
01:03:17.560 | We started off early in this course talking about how actually a lot of machine learning
01:03:26.800 | is kind of misplaced.
01:03:28.880 | People focus on predictive accuracy like Amazon has a collaborative filtering algorithm for
01:03:35.280 | recommending books and they end up recommending the book which it thinks you're most likely
01:03:40.000 | to write highly.
01:03:42.800 | And so what they end up doing is probably recommending a book that you already have
01:03:47.360 | or that you already know about and would have bought anyway, which isn't very valuable.
01:03:51.800 | What they should instead have done is to figure out which book can I recommend that would
01:03:57.620 | cause you to change your behavior.
01:04:00.480 | And so that way we actually maximize our lift in sales due to recommendations.
01:04:06.640 | And so this idea of the difference between optimizing and influencing your actions versus
01:04:13.860 | just improving predictive accuracy is a really important distinction which is very rarely
01:04:23.140 | discussed in academia or industry, crazily enough.
01:04:28.700 | It's more discussed in industry, it's particularly ignored in most of academia.
01:04:33.920 | So it's a really important idea which is that in the end the idea, the goal of your model
01:04:41.040 | presumably is to influence behavior.
01:04:44.680 | And remember I actually mentioned a whole paper I have about this where I introduce
01:04:48.840 | this thing called the drivetrain approach where I talk about ways to think about how
01:04:53.160 | to incorporate machine learning into how do we actually influence behavior.
01:05:01.080 | So that's a starting point, but then the next question is like okay if we're trying to influence
01:05:06.040 | behavior, what kind of behavior should we be influencing and how and what might it mean
01:05:14.440 | when we start influencing behavior?
01:05:17.000 | Because nowadays a lot of the companies that you're going to end up working at are big
01:05:24.100 | ass companies and you'll be building stuff that can influence millions of people.
01:05:30.560 | So what does that mean?
01:05:33.640 | So I'm actually not going to tell you what it means because I don't know, all I'm going
01:05:39.080 | to try and do is make you aware of some of the issues and make you believe two things
01:05:45.560 | about them.
01:05:46.560 | First, that you should care, and second, that they're big current issues.
01:05:54.640 | The main reason I want you to care is because I want you to want to be a good person and
01:06:00.400 | show you that not thinking about these things will make you a bad person.
01:06:04.840 | But if you don't find that convincing I will tell you this, Volkswagen were found to be
01:06:12.360 | cheating on their emissions tests.
01:06:16.240 | The person who was sent to jail for it was the programmer that implemented that piece
01:06:21.040 | of code.
01:06:22.480 | They did exactly what they were told to do.
01:06:25.740 | And so if you're coming in here thinking, "Hey, I'm just a techie, I'll just do what
01:06:30.320 | I'm told, that's my job is to do what I'm told."
01:06:34.320 | I'm telling you if you do that you can be sent to jail for doing what you're told.
01:06:40.280 | So A) don't just do what you're told because you can be a bad person, and B) you can go
01:06:46.520 | to jail.
01:06:49.720 | Second thing to realize is in the heat of the moment you're in a meeting with 20 people
01:06:55.120 | at work and you're all talking about how you're going to implement this new feature and everybody's
01:07:00.140 | discussing it, and everybody's like, "We can do this, and here's a way of modeling it,
01:07:04.800 | and then we can implement it, and here's these constraints."
01:07:06.480 | And there's some part of you that's thinking, "Am I sure we should be doing this?"
01:07:12.280 | That's not the right time to be thinking about that, because it's really hard to step up
01:07:17.320 | then and say, "Excuse me, I'm not sure this is a good idea."
01:07:22.520 | You actually need to think about how you would handle that situation ahead of time.
01:07:27.280 | So I want you to think about these issues now and realize that by the time you're in
01:07:34.720 | the middle of it, you might not even realize it's happening.
01:07:40.120 | It'll just be a meeting, like every other meeting, and a bunch of people will be talking
01:07:43.800 | about how to solve this technical question.
01:07:46.960 | And you need to be able to recognize, "Oh, this is actually something with ethical implications."
01:07:53.480 | So Rachel actually wrote all of these slides, I'm sorry she can't be here to present this
01:07:59.120 | because she's studied this in depth, and she's actually been in difficult environments herself
01:08:06.520 | where she's kind of seen these things happening.
01:08:12.440 | We know how hard it is, but let me give you a sense of what happens.
01:08:17.880 | So engineers trying to solve engineering problems and causing problems is not a new thing.
01:08:28.040 | So in Nazi Germany, IBM, the group known as Hollerith, Hollerith was the original name
01:08:37.640 | of IBM, and it comes from the guy who actually invented the use of punch cards for tracking
01:08:42.440 | the US Census, the first mass, wide-scale use of punch cards for data collection in
01:08:48.120 | the world.
01:08:49.320 | And that turned into IBM.
01:08:51.200 | So at this point, this unit was still called Hollerith.
01:08:53.800 | So Hollerith sold a punch card system to Nazi Germany.
01:09:01.680 | And so each punch card would code, you know: Jew, 8; Gypsy, 12; general execution,
01:09:09.240 | 4; death by gas chamber, 6.
01:09:12.680 | And so here's one of these cards describing the right way to kill these various people.
01:09:17.900 | And so a Swiss judge ruled that IBM's technical assistance facilitated the tasks of the Nazis
01:09:25.040 | in commission of their crimes against humanity.
01:09:27.520 | This led to the death of something like 20 million civilians.
01:09:33.180 | So according to the Jewish Virtual Library, where I got these pictures and quotes from,
01:09:38.280 | their view is that the destruction of the Jewish people became even less important because
01:09:43.040 | of the invigorating nature of IBM's technical achievement, only heightened by the fantastical
01:09:48.720 | profits to be made.
01:09:51.760 | So this was a long time ago, and hopefully you won't end up working at companies that
01:09:55.960 | facilitate genocide.
01:09:59.920 | But perhaps you will, because perhaps you'll go to Facebook, who are facilitating genocide
01:10:05.720 | right now.
01:10:07.240 | And I know people at Facebook who are doing this, and they had no idea they were doing
01:10:13.680 | this.
01:10:14.840 | So right now in Myanmar, the Rohingya, a Muslim population,
01:10:20.520 | are in the middle of a genocide.
01:10:24.240 | Babies are being grabbed out of their mother's arms and thrown into fires.
01:10:28.640 | People are being killed, hundreds of thousands of refugees.
01:10:32.520 | When interviewed, the Myanmar generals doing this say, "We are so grateful to Facebook
01:10:40.280 | for letting us know about the Rohingya fake news that these people are actually not human,
01:10:49.640 | that they're actually animals."
01:10:51.760 | Now Facebook did not set out to enable the genocide of the Rohingya people in Myanmar.
01:10:58.320 | No, instead what happened is they wanted to maximize impressions and clicks.
01:11:03.840 | And so it turns out that for the data scientists at Facebook, their algorithms kind of learned
01:11:08.960 | that if you take the kinds of stuff people are interested in and feed them slightly more
01:11:13.960 | extreme versions of that, you're actually going to get a lot more impressions.
01:11:18.560 | And the project managers are saying maximize these impressions, and people are clicking
01:11:22.320 | and it creates this thing.
01:11:26.720 | And so the potential implications are extraordinary and global.
01:11:34.680 | And this is something that is literally happening, this is October 2017, it's happening now.
01:11:41.800 | Could you pass that back there?
01:11:48.640 | So I just want to clarify what was happening here.
01:11:51.120 | So it was the facilitation of fake news or inaccurate media?
01:11:55.760 | Let me go into it in more detail.
01:12:00.080 | So what happened was in mid-2016, Facebook fired its human editors.
01:12:07.720 | So it was humans that decided how to order things on your homepage.
01:12:12.880 | Those people got fired and replaced with machine learning algorithms.
01:12:16.840 | And so the machine learning algorithms written by data scientists like you, they had nice
01:12:25.000 | clear metrics and they were trying to maximize their predictive accuracy and be like okay,
01:12:30.100 | we think if we put this thing higher up than this thing, we'll get more clicks.
01:12:35.280 | And so it turned out that these algorithms for putting things on the Facebook news feed
01:12:41.280 | had a tendency to say like oh, human nature is that we tend to click on things which stimulate
01:12:48.360 | our views and therefore like more extreme versions of things we already see.
01:12:53.520 | So this is great for the Facebook revenue model of maximizing engagement.
01:12:59.400 | It looked good on all of their KPIs.
01:13:02.600 | And so at the time, there was some negative press about like I'm not sure that the stuff
01:13:10.160 | that Facebook is now putting on their trending section is actually that accurate, but from
01:13:16.280 | the point of view of the metrics that people were optimizing at Facebook, it looked terrific.
01:13:22.600 | And so way back in October 2016, people started noticing some serious problems.
01:13:29.120 | For example, it is illegal to target housing to people of certain races in America.
01:13:37.120 | That is illegal.
01:13:38.160 | And yet a news organization discovered that Facebook was doing exactly that in October
01:13:44.480 | 2016.
01:13:45.480 | Again, not because somebody in that data science team said let's make sure black people can't
01:13:50.760 | live in nice neighborhoods, but instead they found that their automatic clustering and
01:13:58.480 | segmentation algorithm found there was a cluster of people who didn't like African Americans
01:14:04.960 | and that if you targeted them with these kinds of ads then they would be more likely to select
01:14:10.760 | this kind of housing or whatever.
01:14:12.400 | But the interesting thing is that even after being told about this three times, Facebook
01:14:18.960 | still hasn't fixed it.
01:14:20.780 | And that is to say these are not just technical issues, they're also economic issues.
01:14:25.200 | When you start changing the thing that you get paid for, that is ads, so that you either
01:14:30.840 | use more people, which costs money, or you are less aggressive about having your algorithms
01:14:37.280 | target people based on minority group status
01:14:44.800 | or whatever, that can impact revenues.
01:14:48.440 | So the reason I mention this is you will at likely at some point in your career find yourself
01:14:53.880 | in a conversation where you're thinking I'm not confident that this is like morally okay,
01:15:01.000 | the person you're talking to is thinking in their head this is going to make us a lot
01:15:04.160 | of money, and you don't quite ever manage to have a successful conversation because
01:15:10.880 | you're talking about different things.
01:15:13.940 | And so when you're talking to somebody who may be more experienced and more senior than
01:15:17.800 | you and they may sound like they know what they're talking about, just realize that their
01:15:21.960 | incentives are not necessarily going to be focused on like how do I be a good person.
01:15:28.880 | They're not thinking how do I be a bad person, but in my experience, the more time you spend in
01:15:34.200 | industry, the more desensitized you kind of get to this stuff, and the easier it is to forget
01:15:40.760 | that maybe getting promotions and making money isn't the most important thing.
01:15:45.240 | So for example, I've got a lot of friends who are very good at computer vision and some
01:15:50.840 | of them have gone on to create startups that seem like they're almost handmade to help
01:15:57.000 | authoritarian governments surveil their citizens.
01:16:01.880 | And when I ask my friends like have you thought about how this could be used in that way,
01:16:08.200 | they're generally kind of offended that I ask, but I'm asking you to think about this.
01:16:17.280 | Wherever you end up working, if you end up creating a startup, tools can be used for
01:16:23.600 | good or for evil, and so I'm not saying don't create excellent object tracking and detection
01:16:31.400 | tools from computer vision, because you could go on and use that to create a much better
01:16:39.200 | surgical intervention robot toolkit, just saying be aware of it, think about it, talk
01:16:46.080 | about it.
01:16:50.320 | So here's one I find fascinating, and there's this really cool thing actually that meetup.com
01:16:55.560 | did, this is from a meetup.com talk that's online, they think about this.
01:17:00.640 | They actually thought about this, they actually thought, you know what, if we built a collaborative
01:17:05.000 | filtering system like we learned about in class to help people decide what meetup to
01:17:11.360 | go to, it might notice that on the whole in San Francisco, a few more men than women tend
01:17:19.160 | to go to techie meetups.
01:17:21.560 | And so it might then start to decide to recommend techie meetups to more men than women, as
01:17:28.320 | a result of which, more men will go to techie meetups.
01:17:32.600 | As a result of which, when women go to techie meetups, they'll be like oh, this is all men,
01:17:36.480 | I don't really want to go to techie meetups.
01:17:38.760 | As a result of which, the algorithm will get new data saying that men like techie meetups
01:17:42.960 | better, and so it continues.
01:17:46.300 | And so a little bit of that initial push from the algorithm can create this runaway feedback
01:17:54.560 | loop and you end up with almost all male techie meetups, for instance.
01:18:00.840 | And so this kind of feedback loop is a kind of subtle issue that you really want to think
01:18:07.120 | about when you're thinking about what is the behavior that I'm changing with this algorithm
01:18:13.200 | that I'm building.
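A toy simulation (entirely made-up numbers, nothing to do with meetup.com's real system) shows how a small initial skew, plus a recommender that simply follows the data it observes, can run away:

    import random

    random.seed(0)
    male_share = 0.55                     # slight initial skew in who shows up
    for step in range(10):
        # The recommender pushes the meetup to men in proportion to the share it observed
        audience = ['m' if random.random() < male_share else 'f' for _ in range(10000)]
        # Women become a little less likely to attend as the meetup looks more male-dominated
        attend_prob = {'m': 0.30, 'f': 0.30 * (1 - male_share)}
        attendees = [g for g in audience if random.random() < attend_prob[g]]
        male_share = sum(g == 'm' for g in attendees) / len(attendees)
        print(f"step {step}: observed male share = {male_share:.2f}")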
01:18:18.440 | So another example, which is kind of terrifying, is in this paper where the authors describe
01:18:28.080 | how a lot of police departments in the US are now using predictive policing algorithms.
01:18:34.800 | So where can we go to find somebody who's about to commit a crime?
01:18:40.420 | And so you know that the algorithm simply feeds back to you basically the data that
01:18:47.240 | you've given it.
01:18:49.400 | So if your police department has engaged in racial profiling at all in the past, then
01:18:57.240 | it might suggest slightly more often maybe you should go to the black neighborhoods to
01:19:01.440 | check for people committing crimes.
01:19:04.040 | As a result of which, more of your police officers go to the black neighborhoods.
01:19:07.360 | As a result of which, they arrest more black people.
01:19:10.080 | As a result of which, the data says that the black neighborhoods are less safe.
01:19:14.120 | As a result of which, the algorithm says to the policeman, maybe you should go to the
01:19:17.120 | black neighborhoods more often, and so forth.
01:19:21.080 | And this is not like vague possibilities of something that might happen in the future,
01:19:29.720 | this is like documented work from top academics who have carefully studied the data and the
01:19:35.560 | theory.
01:19:37.200 | This is like serious scholarly work, it's like no, this is happening right now.
01:19:42.560 | And so again, I'm sure the people that started creating this predictive policing algorithm
01:19:49.580 | didn't think like how do we arrest more black people, hopefully they were actually thinking
01:19:54.860 | gosh I'd like my children to be safer on the streets, how do I create a safer society?
01:20:02.720 | But they didn't think about this nasty runaway feedback loop.
01:20:09.060 | So actually this one about social network algorithms is actually an article in the New
01:20:13.720 | York Times recently about one of my friends, Renee DiResta, and she did something kind
01:20:19.920 | of amazing.
01:20:20.920 | She set up a second Facebook account, like a fake Facebook account, and she was very
01:20:27.200 | interested in the anti-vax movement at the time.
01:20:30.680 | So she started following a couple of anti-vaxxers and visited a couple of anti-vaxxer links.
01:20:39.400 | And so suddenly her news feed starts getting full of anti-vaxxer news, along with other
01:20:46.240 | stuff like chemtrails, and deep state conspiracy theories, and all this stuff.
01:20:53.440 | And so she's like, 'huh', starts clicking on those.
01:20:57.180 | And the more she clicked, the more hardcore far-out conspiracy stuff Facebook recommended.
01:21:05.280 | So now when Renee goes to that Facebook account, the whole thing is just full of angry, crazy,
01:21:14.360 | far-out conspiracy stuff, like that's all she sees.
01:21:18.180 | And so if that was your world, then as far as you're concerned, it's just like this continuous
01:21:25.080 | reminder and proof of all this stuff.
01:21:30.200 | And so again, to answer your question, this is the kind of runaway feedback loop that
01:21:37.580 | ends up telling the Myanmar generals, you know, throughout their Facebook homepage, that these
01:21:46.400 | people are animals, and fake news, and whatever else.
01:21:51.900 | So a lot of this comes also from bias.
01:21:58.720 | And so let's talk about bias specifically.
01:22:01.960 | So bias in image software comes from bias in data.
01:22:08.160 | And so most of the folks I know at Google Brain building computer vision algorithms,
01:22:16.680 | very few of them are people of color.
01:22:19.460 | And so when they're training the algorithms with photos of their families and friends,
01:22:24.440 | they are training them with very few people of color.
01:22:27.180 | And so when FaceApp then decided, we're going to try looking at lots of Instagram photos
01:22:34.300 | to see which ones are upvoted the most, without them necessarily realizing it, the answer
01:22:40.820 | was light-colored faces.
01:22:44.160 | So then they built a generative model to make you more hot.
01:22:48.440 | And so this is the actual photo, and here is the hotter version.
01:22:53.040 | So the hotter version is more white, less nostrils, more European looking.
01:23:01.240 | And so this did not go down well, to say the least.
01:23:07.720 | So again, I don't think anybody at FaceApp said, let's create something that makes people
01:23:13.880 | look more white.
01:23:15.840 | They just trained it on a bunch of images of the people that they had around them.
01:23:21.560 | And this has kind of serious commercial implications as well.
01:23:27.360 | They had to pull this feature, and they had a huge amount of negative pushback as they
01:23:32.460 | should.
01:23:33.460 | Here's another example, Google Photos created this photo classifier, airplanes, skyscrapers,
01:23:42.160 | cars, graduation, and gorillas.
01:23:45.980 | So think about how this looks to most people.
01:23:50.920 | To most people they look at this, they don't know about machine learning, they say, what
01:23:55.440 | the fuck?
01:23:56.740 | Somebody at Google wrote some code to take black people and call them gorillas.
01:24:02.560 | That's what it looks like.
01:24:04.400 | Now we know that's not what happened.
01:24:05.960 | We know what happened is the team of folks at Google Computer Vision experts who have
01:24:15.800 | none or few people of color working in the team built a classifier using all the photos
01:24:22.000 | they had available to them.
01:24:23.720 | And so when the system came across a person with dark skin, it was like, I've only mainly
01:24:32.480 | seen that before amongst gorillas, so I'll put it in that category.
01:24:36.920 | So again, the bias in the data creates a bias in the software, and again, the commercial
01:24:42.860 | implications were very significant.
01:24:44.920 | Google really got a lot of bad PR from this, as they should.
01:24:49.800 | This was a photo that somebody put in their Twitter feed.
01:24:53.080 | They said, look what Google Photos just decided to do.
01:24:59.560 | You can imagine what happened with the first international beauty contest judged by artificial
01:25:03.200 | intelligence.
01:25:05.240 | Basically it turns out all the beautiful people are white.
01:25:08.840 | So you kind of see this bias in image software, thanks to bias in the data, thanks to lack
01:25:16.680 | of diversity in the teams building it, you see the same thing in natural language processing.
01:25:24.480 | So here is Turkish, O is the pronoun in Turkish which has no gender, but of course in English
01:25:39.520 | we don't really have a widely used un-gendered singular pronoun, so Google Translate converts
01:25:46.720 | it to this.
01:25:50.400 | Now there are plenty of people who saw this online and said, literally, so what?
01:25:59.640 | It is correctly feeding back the usual usage in English.
01:26:06.000 | I know how this is trained, this is like Word2Vec vectors, I was trained on Google News corpus,
01:26:11.480 | Google Books corpus, it's just telling us how things are.
01:26:15.880 | And from a point of view, that's entirely true.
01:26:20.120 | The biased data used to create this biased algorithm is the actual data of how people have written
01:26:26.840 | books and periodicals for decades.
01:26:32.080 | But does that mean that this is the product that you want to create?
01:26:38.080 | Does this mean this is the product you have to create?
01:26:41.340 | Just because the particular way you've trained the model means it ends up doing this, is
01:26:47.320 | this actually the design you want?
01:26:49.340 | And can you think of potential negative implications and feedback loops this could create?
01:26:55.660 | And if any of these things bother you, then now, lucky you, you have a new cool engineering
01:27:01.120 | problem to work on, like how do I create unbiased NLP solutions?
01:27:06.380 | And now there are some start-ups starting to do that and starting to make some money.
01:27:11.520 | These are opportunities for you, like here's some stuff where people are creating screwed
01:27:16.560 | up societal outcomes because of their shitty models, like okay, well you can go and build
01:27:21.400 | something better.
01:27:23.520 | So like another example of the bias in word2vec word vectors is restaurant reviews rank Mexican
01:27:30.880 | restaurants worse because the Mexican words tend to be associated with criminal words
01:27:38.000 | in the US press and books more often.
01:27:40.800 | Again, this is like a real problem that is happening right now.
01:27:48.720 | So Rachel actually did some interesting analysis of just the plain word2vec word vectors where
01:27:56.680 | she basically pulled them out and looked at these analogies based on some research that
01:28:02.480 | had been done elsewhere.
01:28:03.960 | And so you can see word2vec, the vector directions show that father is to doctor, mother is to
01:28:10.480 | nurse, man is to computer programmer, as woman is to homemaker, and so forth.
01:28:16.680 | So it's really easy to see what's in these word vectors, and they're kind of fundamental
01:28:24.200 | to much of the NLP or probably just about all of the NLP software we use today.
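If you want to poke at this yourself, here is a short sketch using gensim's pretrained Google News vectors; the 'word2vec-google-news-300' download name assumes gensim's downloader, and the exact neighbours you get back depend on the corpus and library version:

    import gensim.downloader as api

    # Load pretrained word2vec vectors trained on Google News (a large download)
    vectors = api.load('word2vec-google-news-300')

    # "man is to doctor as woman is to ?", solved as doctor - man + woman
    print(vectors.most_similar(positive=['doctor', 'woman'], negative=['man'], topn=3))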
01:28:31.080 | So ProPublica has actually done a lot of good work in this area.
01:28:42.600 | Many judges now have access to Sentencing Guidelines software.
01:28:46.840 | And so Sentencing Guidelines software says to the judge, for this individual we would
01:28:52.320 | recommend this kind of sentence.
01:28:56.080 | And now of course a judge doesn't understand machine learning.
01:29:00.000 | So like they have two choices, which is either do what it says or ignore it entirely, and
01:29:05.800 | some people fall into each category.
01:29:08.920 | And so for the ones that fall into the like do what it says category, here's what happens.
01:29:13.640 | Amongst those that were labeled higher risk, the subset that actually
01:29:19.120 | turned out not to re-offend was about a quarter of whites and about a half
01:29:25.640 | of African Americans.
01:29:28.080 | So like nearly twice as often, people who didn't re-offend were marked as higher risk
01:29:36.560 | if they were African Americans, and vice versa.
01:29:39.280 | Amongst those that were labeled lower risk but actually did re-offend, it turned out
01:29:44.800 | to be about half of the whites and only 28% of the African Americans.
01:29:49.440 | So this is data which I would like to think nobody is setting out to create something
01:29:56.040 | that does this.
01:29:57.040 | But when you start with biased data, and the data says that whites and blacks smoke marijuana
01:30:09.200 | at about the same rate, but blacks are jailed something like five times more often than
01:30:16.320 | whites, the nature of the justice system in America at the moment is that it's not equal,
01:30:23.840 | it's not fair.
01:30:24.840 | And therefore the data that's fed into the machine learning model is going to basically
01:30:29.800 | support that status quo.
01:30:31.800 | And then because of the negative feedback loop, it's just going to get worse and worse.
01:30:35.760 | I'll tell you something else interesting about this one, which a researcher called Abe Gong has
01:30:40.320 | pointed out, is here are some of the questions that are being asked.
01:30:45.640 | So let's take one.
01:30:49.880 | Was your father ever arrested?
01:30:53.920 | So your answer to that question is going to decide whether you're locked up and for how
01:30:58.400 | long.
01:31:00.880 | Now as a machine learning researcher, do you think that might improve the predictive accuracy
01:31:05.400 | of your algorithm and get you a better R-squared?
01:31:08.920 | It could well, but I don't know.
01:31:11.160 | Maybe it does.
01:31:12.160 | You try it out and say oh, I've got a better R-squared.
01:31:14.960 | So does that mean you should use it?
01:31:16.800 | Well there's another question, do you think it's reasonable to lock somebody up for longer
01:31:22.720 | because of who their dad was?
01:31:25.400 | And yet these are actually the examples of questions that we are asking right now to
01:31:31.320 | offenders and then putting into a machine learning system to decide what happens to
01:31:35.880 | them.
01:31:37.360 | So again, whoever designed this, presumably they were laser focused on technical excellence,
01:31:44.360 | getting the maximum area under the ROC curve, and I found these great predictors that give
01:31:49.280 | me another .02, and I guess didn't start to think like well, is that a reasonable way
01:31:57.520 | to decide who goes to jail for longer?
01:32:03.840 | So like putting this together, you can kind of see how this can get more and more scary.
01:32:12.160 | We take a company like Taser, and Tasers are these devices that kind of give you a big
01:32:17.880 | electric shock basically.
01:32:19.920 | And Tasers managed to do a great job of creating strong relationships with some academic researchers
01:32:26.680 | who seem to say whatever they tell them to say, to the extent where now if you look at
01:32:32.400 | the data it turns out that there's a pretty high probability that if you get tased that
01:32:39.600 | you will die.
01:32:41.240 | That happens not unusually, and yet the researchers who they've paid to look into this have consistently
01:32:49.520 | come back and said oh no, it was nothing to do with the Taser, the fact that they died
01:32:53.880 | immediately afterwards was totally unrelated, it was just a random thing that happened.
01:33:01.600 | So this company now owns 80% of the market for body cameras, and they started buying
01:33:09.260 | computer vision AI companies, and they're going to try and now use these police body
01:33:14.280 | camera videos to anticipate criminal activity.
01:33:19.080 | And so what does that mean?
01:33:22.220 | So is that like okay, I now have some augmented reality display saying tase this person because
01:33:29.600 | they're about to do something bad.
01:33:31.880 | So it's kind of like a worrying direction, and so I'm sure nobody who's a data scientist
01:33:40.920 | at Taser or at the companies that they bought out is thinking like this is the world I want
01:33:46.360 | to help create, but they could find themselves, or you could find yourself in the middle of
01:33:53.000 | this kind of discussion, where it's not explicitly about that topic but there's part of you that
01:33:58.000 | says I wonder if this is how this could be used, and I don't know exactly what the right
01:34:05.760 | thing to do in that situation is, because you can ask, and of course people are going
01:34:08.840 | to be like no, no, no, no.
01:34:12.760 | So it's like what could you do?
01:34:17.200 | You could ask for some kind of written promise, you could decide to leave, you could start
01:34:25.320 | doing some research into the legality of things to say I would at least protect my own legal
01:34:31.960 | situation.
01:34:32.960 | I don't know, have a think about how you would respond to that.
01:34:39.840 | So these are some questions that Rachel created as being things to think about.
01:34:45.480 | So if you're looking at building a data product or using a model, if you're building a machine
01:34:51.320 | learning model, it's for a reason: you're trying to do something.
01:34:56.460 | So what bias may be in that data?
01:34:59.280 | Because whatever bias is in that data ends up being a bias in your predictions, potentially
01:35:03.360 | then biases the actions you're influencing, potentially then biases the data that you
01:35:07.560 | come back and you may create a feedback loop.
01:35:10.560 | If the team that built it isn't diverse, what might you be missing?
01:35:15.920 | So for example, one senior executive at Twitter raised the alarm about major Russian bot problems
01:35:28.760 | at Twitter way back well before the election.
01:35:34.340 | That was the one black person in the exec team at Twitter, the one.
01:35:42.800 | And shortly afterwards they lost their job.
01:35:47.940 | Definitely having a more diverse team means having a more diverse set of opinions and
01:35:54.000 | beliefs and ideas and things to look for and so forth.
01:35:57.080 | So non-diverse teams seem to make more of these bad mistakes.
01:36:02.640 | Can we audit the code? Is it open source? Can we check the error rates amongst
01:36:08.640 | different groups? Is there a simple rule we could use instead that's extremely interpretable
01:36:14.540 | and easy to communicate? And if something goes wrong, do we have a good way to deal with it?
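As a sketch of what checking the error rates amongst different groups might look like in practice, here is a toy pandas example; the column names and the handful of made-up rows are purely illustrative:

    import pandas as pd

    # Toy predictions table: true outcome, the model's risk label, and group membership
    df = pd.DataFrame({
        'group':      ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
        'reoffended': [0,    1,   0,   0,   0,   0,   1,   1 ],
        'high_risk':  [1,    1,   0,   0,   1,   1,   1,   0 ],
    })

    # False positive rate per group: labelled high risk amongst those who did not re-offend
    fpr = df[df.reoffended == 0].groupby('group')['high_risk'].mean()
    # False negative rate per group: labelled low risk amongst those who did re-offend
    fnr = df[df.reoffended == 1].groupby('group')['high_risk'].apply(lambda s: 1 - s.mean())
    print(fpr, fnr, sep='\n')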
01:36:22.080 | So when we've talked to people about this and a lot of people have come to Rachel and
01:36:29.400 | said I'm concerned about something my organization is doing, what do I do, or I'm just concerned
01:36:37.880 | about my toxic workplace, what do I do.
01:36:42.200 | And very often Rachel will say, have you considered leaving?
01:36:47.960 | And they will say, I don't want to lose my job.
01:36:52.520 | But actually if you can code, you're in 0.3% of the population.
01:36:57.420 | If you can code and do machine learning, you're in probably 0.01% of the population.
01:37:02.880 | You are massively, massively in demand.
01:37:09.240 | So realistically, obviously an organization does not want you to feel like you're somebody
01:37:15.480 | who could just leave and get another job, that's not in their interest, but that is
01:37:20.360 | absolutely true.
01:37:22.000 | And so one of the things I hope you'll leave this course with is enough self-confidence
01:37:28.680 | to recognize that you have the skills to get a job, and particularly once you've got your
01:37:36.040 | first job, your second job is an order of magnitude easier.
01:37:40.200 | And so this is important not just so that you feel like you actually have the ability
01:37:44.260 | to act ethically, but it's also important to realize if you find yourself in a toxic
01:37:51.320 | environment, which is pretty damn common, unfortunately, there are a lot of shitty tech cultures and
01:38:01.080 | environments, particularly in the Bay Area.
01:38:02.480 | If you find yourself in one of those environments, the best thing to do is to get the hell out.
01:38:09.560 | And if you don't have the self-confidence to think you can get another job, you can
01:38:15.440 | get trapped.
01:38:17.640 | So it's really important, it's really important to know that you are leaving this program
01:38:24.160 | with very in-demand skills, and particularly after you have that first job, you're now
01:38:28.680 | somebody with in-demand skills and a track record of being employed in that area.
01:38:36.720 | This is kind of just a broad question, but what are some things that you know of that
01:38:46.920 | people are doing to treat bias in data?
01:38:52.440 | It's kind of like a bit of a controversial subject at the moment, and some people are trying
01:38:59.320 | to use an algorithmic approach, where they're basically trying to
01:39:03.200 | say how can we identify the bias and kind of subtract it out, but the most effective
01:39:10.920 | ways I know of are ones that are trying to treat it at the data level.
01:39:15.000 | So start with a more diverse team, particularly a team involving people from the humanities,
01:39:21.560 | like sociologists, psychologists, economists, people that understand feedback loops and implications
01:39:27.560 | for human behavior, and they tend to be equipped with good tools for kind of identifying and
01:39:35.460 | tracking these kinds of problems, and then kind of trying to incorporate the solutions
01:39:40.440 | into the process itself.
01:39:43.200 | Let's say there isn't kind of like some standard process I can point you to and say here's
01:39:50.120 | how to solve it.
01:39:52.120 | If there is such a thing, we haven't found it yet, it requires a diverse team of smart
01:39:58.440 | people to be aware of the problems and work hard at them, is the short answer.
01:40:02.720 | This is just kind of a general thing I guess for the whole class.
01:40:12.640 | If you're interested in this stuff, I read a pretty cool book, Jeremy you've probably
01:40:16.480 | heard of it, Weapons of Math Destruction by Cathy O'Neill, it covers a lot of the same
01:40:22.240 | stuff, just more on the topic.
01:40:24.960 | Thanks for the recommendation, Cathy's great, she's also got a TED talk, I didn't manage
01:40:31.120 | to finish the book because it's so damn depressing, I was just like, no more.
01:40:37.720 | But yeah, it's very good.
01:40:42.560 | Well that's it, thank you everybody.
01:40:47.920 | This has been really intense for me, obviously this was meant to be something that I was
01:40:56.280 | sharing with Rachel, so I've ended up doing one of the hardest things in my life, which
01:41:01.920 | is to teach two people's worth of course on my own and also look after a sick wife and
01:41:07.880 | have a toddler and also do a deep learning course and also do all this with a new library
01:41:12.920 | that I just wrote.
01:41:14.920 | So I'm looking forward to getting some sleep, but it's been totally worth it because you've
01:41:20.760 | been amazing, like I'm thrilled with how you've reacted to the kind of opportunities I've
01:41:31.320 | given you and also to the feedback that I've given you.
01:41:37.400 | So congratulations.
01:41:39.140 | (audience applauds)