Lesson 14: Cutting Edge Deep Learning for Coders
Chapters
0:00 Introduction
0:45 Time Series Data
9:50 Neural Network
12:48 Continuous Variables
15:45 Projection Method
17:59 Importing
19:18 Basic Tables
20:01 pandas
23:49 pandas summary
25:19 data cleaning
25:57 joining tables
27:36 categorical variables
30:28 pandas indexing
37:41 join
43:07 time series feature manipulation
48:52 time until event
49:33 index
50:57 rolling
52:22 final result
00:00:00.000 |
Okay, so welcome to lesson 14, the final lesson for now. 00:00:17.920 |
As you can see from what's increasingly been happening, what comes next is very much about you 00:00:23.200 |
with us, rather than us leading you or telling you. 00:00:27.440 |
So we're a community now and we can figure this stuff out together, and obviously USF 00:00:39.440 |
So for now, this is the last of these lessons. 00:00:46.840 |
One of the things that was great to see this week was this terrific article in Forbes that 00:00:52.740 |
talked about deep learning education and it was written by one of our terrific students, 00:00:59.400 |
It focuses on the great work of some of the students that have come through this course. 00:01:06.800 |
So I wanted to say thank you very much and congratulations on this great article. 00:01:14.960 |
Very beautifully written as well and terrific stories, I found it quite inspiring. 00:01:25.860 |
So today we are going to be talking about a couple of things, but we're going to start 00:01:41.120 |
with time series. I wanted to start very briefly by talking about something which I think you 00:01:47.920 |
will find interesting. This is a fantastic paper, but because it is not by DeepMind, nobody's heard of it. 00:01:54.280 |
It actually comes from the Children's Hospital of Los Angeles. 00:01:57.880 |
Believe it or not, perhaps the epicenter of practical applied AI in medicine is 00:02:05.520 |
in Southern California, and specifically Southern California Pediatrics, the Children's Hospital 00:02:09.760 |
of Orange County, CHOC, and the Children's Hospital of Los Angeles, CHLA. 00:02:15.240 |
CHLA, which this paper comes from, actually has this thing they call V-PICU, the Virtual 00:02:22.080 |
Pediatric Intensive Care Unit, where for many, many years they've been tracking every electronic 00:02:28.320 |
signal about how every patient, every kid in the hospital was treated and what all their 00:02:40.440 |
One of the extraordinary things they do is when the doctors there do rounds, data scientists join them. 00:02:46.360 |
I don't know anywhere else in the world that this happens. 00:02:52.080 |
And so a couple of months ago they released a draft of this amazing paper where they talked 00:02:59.840 |
about how they pulled out all this data from the EMR and from the sensors and attempted to predict mortality. 00:03:08.880 |
The reason this is interesting is that when a kid goes into the ICU, if a model starts 00:03:15.720 |
saying this kid is looking like they might die, then that's the thing that sets the alarms 00:03:21.160 |
going and everybody rushes over and starts looking after them. 00:03:25.040 |
And they found that they built a model that was more accurate than any existing model. 00:03:28.840 |
Those existing models were built on many years of deep clinical input, and this team beat them using an RNN. 00:03:35.640 |
Now this kind of time series data is what I'm going to refer to as signal-type time series data. 00:03:45.040 |
So let's say you've got a series of blood pressure readings. 00:03:51.960 |
So they come in and their blood pressure is kind of low and it's kind of all over the place. 00:04:02.600 |
And then in addition to that, maybe there's other readings such as at which points they 00:04:12.520 |
receive some kind of medical intervention; there was one here and one here and then there were a few more over here. 00:04:21.960 |
So these kinds of things, generally speaking, the state of health at time t is probably 00:04:28.680 |
best predicted by all of the various sensor readings at t-minus-1 and t-minus-2 and t-minus-3. 00:04:35.100 |
So in statistical terms we would refer to that as autocorrelation. 00:04:41.760 |
Autocorrelation means correlation with previous time periods. 00:04:45.720 |
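As a tiny illustration of what that means in code (the series here is made up), pandas can compute lag-1 autocorrelation directly:

```python
import pandas as pd

# hypothetical blood-pressure readings; autocorr(lag=1) measures how strongly
# each reading correlates with the one immediately before it
bp = pd.Series([82, 80, 79, 85, 90, 88, 87, 91])
print(bp.autocorr(lag=1))
```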
For this kind of signal, I think it's very likely that an RNN is the way to go. 00:04:54.960 |
Obviously you could probably get a better result using a bidirectional RNN, but that's 00:04:58.920 |
not going to be any help in the ICU because you don't have the future time period sensors. 00:05:07.040 |
And indeed this is what this team at the V-PICU at Children's Hospital of Los Angeles did, 00:05:12.720 |
they used an RNN to get this state-of-the-art result. 00:05:17.400 |
I'm not really going to teach you more about this because basically you already know how to do it. 00:05:22.640 |
You can check out the paper, you'll see there's almost nothing special. 00:05:25.800 |
The only thing which was quite clever was that their sensor readings were not necessarily equally spaced in time. 00:05:36.680 |
For example, did they receive some particular medical intervention? 00:05:42.880 |
Clearly they're very widely spaced and they're not equally spaced. 00:05:46.880 |
So rather than having the RNN have basically a sequence of interventions that gets fed 00:05:57.440 |
to the RNN, instead they actually have two things. 00:06:00.740 |
One is the signal, and the other is the time since the last signal was read. 00:06:09.200 |
So each point fed to the RNN is basically some function f. 00:06:13.400 |
It's receiving two things: it's receiving the signal at time t and the value of t itself. 00:06:27.880 |
That doesn't require any different deep learning, that's just concatenating one extra thing onto the input. 00:06:34.840 |
They actually show mathematically that this makes a certain amount of sense as a way to 00:06:40.400 |
deal with this, and then they find empirically that it does actually seem to work pretty well. 00:06:48.240 |
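A minimal Keras-style sketch of the idea, not the paper's actual code (layer sizes, names and the GRU choice are my assumptions): at each timestep the network receives the sensor value concatenated with the time since the previous reading.

```python
from keras.layers import Input, GRU, Dense, concatenate
from keras.models import Model

timesteps = 48                                          # hypothetical readings per patient
signal = Input(shape=(timesteps, 1), name='signal')     # sensor value at each step
delta_t = Input(shape=(timesteps, 1), name='delta_t')   # time since the previous reading

x = concatenate([signal, delta_t])                      # each step sees (value, elapsed time)
x = GRU(64)(x)
out = Dense(1, activation='sigmoid')(x)                 # e.g. risk of an adverse outcome

model = Model([signal, delta_t], out)
model.compile(optimizer='adam', loss='binary_crossentropy')
```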
I can't tell you whether this is state-of-the-art for anything because I just haven't seen deep 00:06:55.000 |
comparative papers or competitions or anything that really have this kind of data, which 00:07:01.320 |
is weird because a lot of the world runs on this kind of data. 00:07:04.000 |
And this kind of data is actually super valuable: if you're an oil and 00:07:09.100 |
gas company, what's the drill head telling you, what are the signals coming out of the 00:07:18.400 |
It's not the kind of cool thing that the Google kids work on, so who knows. 00:07:28.020 |
That's how you can do time series with this kind of signal data. 00:07:33.040 |
You can also incorporate all of the other stuff we're about to talk about, which is 00:07:41.120 |
For example, there was a Kaggle competition which was looking at forecasting sales for 00:08:00.440 |
each store at this big company in Europe called Rossmann based on the date and what promotions 00:08:14.600 |
are going on and what the competitors are doing and so forth. 00:08:24.640 |
Or maybe it will have some kind of trend to it. 00:08:40.720 |
So these kinds of seasonal time series are very widely analyzed by econometricians. 00:08:52.480 |
They're everywhere, particularly in business, if you're trying to predict how many widgets 00:08:59.600 |
you have to buy next month, or whether to increase or decrease your prices, or all kinds 00:09:08.160 |
of operational type things tend to look like this. 00:09:12.520 |
How full your planes are going to be, whether you should add promotions, so on and so forth. 00:09:20.160 |
So it turns out that the state of the art for this kind of approach is not necessarily 00:09:29.120 |
I'm actually going to look at the third place result from this competition because the third 00:09:35.200 |
place result was nearly as good as places 1 and 2, but way, way, way simpler. 00:09:42.880 |
And also it turns out that there's stuff that we can build on top of for almost every model 00:09:51.240 |
And basically, surprise, surprise, it turns out that the answer is to use a neural network. 00:09:59.040 |
So I need to warn you again, what I'm going to teach you here is very, very uncool. 00:10:08.440 |
You'll never read about it from DeepMind or OpenAI. 00:10:12.320 |
It doesn't involve any robot arms, it doesn't involve thousands of GPUs, it's the kind 00:10:18.400 |
of boring stuff that normal companies use to make more money or spend less money or 00:10:33.320 |
Having said that, in the 25 or more years I've been doing machine learning work applied 00:10:41.800 |
in the real world, 98% of it has been this kind of data. 00:10:49.480 |
Whether it be when I was working in agriculture, I've worked in wool, macadamia nuts and rice, 00:10:54.760 |
and we were figuring out how full our barrels were going to be, whether we needed more, 00:11:01.760 |
we were figuring out how to set futures market prices for agricultural goods, whatever; I worked 00:11:09.360 |
in mining and brewing, which required analyzing all kinds of engineering data and sales data. 00:11:16.560 |
I've worked in banking, that required looking at transaction account pricing and risk and 00:11:24.480 |
All of these areas basically involve this kind of data. 00:11:29.380 |
So I think although no one publishes stuff about this, because anybody who comes out 00:11:37.320 |
of a Stanford PhD and goes to Google doesn't know about any of those things, it's probably 00:11:43.760 |
the most useful thing for the vast majority of people. 00:11:50.880 |
And excitingly it turns out that you don't need to learn any new techniques at all. 00:11:57.360 |
In fact, the model that they got this third-place result with, a very simple model, is basically 00:12:06.120 |
one where each different categorical variable was one-hot encoded and chucked into an embedding layer. 00:12:14.080 |
The embedding layers were concatenated and chucked through a dense layer, then a second 00:12:17.680 |
dense layer, then it went through a sigmoid function into an output layer. 00:12:28.120 |
The continuous variables they haven't drawn here, and all these pictures are going to 00:12:32.520 |
come straight from this paper, which these folks that came third kindly wrote about their approach. 00:12:40.360 |
The continuous variables basically get fed directly into the dense layer. 00:12:52.560 |
So the short answer is, compared to K-nearest neighbors, random forests and GBMs, just a 00:13:01.120 |
simple neural network beats all of those approaches, just with standard one-hot encoding, whatever. 00:13:12.760 |
So adding in this idea of using embeddings, interestingly you can take the embeddings 00:13:19.160 |
trained by a neural network and feed them into a KNN, or a random forest, or a GBM. 00:13:25.600 |
And in fact, using embeddings with every one of those things is way better than any of them without embeddings. 00:13:38.320 |
And then if you use the embeddings with a neural network, you get the best results still. 00:13:43.260 |
So this actually is kind of fascinating, because training this neural network took me some 00:13:55.380 |
hours on a Titan X, whereas training the GBM took I think less than a second. 00:14:05.760 |
It was so fast, I thought I had screwed something up. 00:14:10.260 |
And then I tried running it, and it's like holy shit, it's giving accurate predictions. 00:14:17.640 |
So in your organization, you could try taking everything that you could think of as a categorical 00:14:25.000 |
variable and once a month train a neural net with embeddings, and then store those embeddings 00:14:33.340 |
in a database table and tell all of your business users, "Hey, anytime you want to create a 00:14:39.600 |
model that incorporates day of week or a store ID or a customer ID, you can go grab the embeddings." 00:14:49.420 |
And so they're basically like word vectors, but they're customer vectors and store vectors 00:14:56.280 |
So I've never seen anybody write about this other than this paper. 00:15:02.800 |
And even in this paper they don't really get to this hugely important idea of what you 00:15:17.280 |
A and B and C, yeah, we're going to get to that in a moment. 00:15:21.880 |
Basically the different things are like: A might be the store ID, B might be the product ID, 00:15:33.320 |
One of the really nice things that they did in this paper was to then draw some projections of the embeddings. 00:15:41.080 |
They just used t-SNE; it doesn't really matter what the projection method is, but here's one. 00:15:50.720 |
They took the embedding for each state of Germany, where the stores are based, and did a projection of those embeddings. 00:16:03.720 |
And I've drawn around them different colored circles, and you might notice the different 00:16:08.640 |
colored circles exactly correspond to the different colored circles on a map of Germany. 00:16:13.960 |
Now these were just random embeddings trained with SGD, trying to predict sales in stores 00:16:22.600 |
at Rossmann, and yet somehow they've drawn a map of Germany. 00:16:28.960 |
So obviously the reason why is because things close to each other in Germany have similar 00:16:33.840 |
behaviors around how they respond to events, and who buys what kinds of products, and so forth. 00:16:48.560 |
Every one of these dots is the distance between two stores, and this shows the correlation 00:16:58.640 |
between the distance in embedding space versus the actual distance between the stores. 00:17:04.680 |
So you can basically see that there's a strong correlation between things being close to 00:17:10.040 |
each other in real life and close to each other in these SGD-trained embeddings. 00:17:15.040 |
Here's a couple more pictures; all the lines drawn on top are mine, but everything else 00:17:25.400 |
comes from the paper. And you can see the days of the week that are near each other have ended up embedded close together. 00:17:30.400 |
On the right is months of the year embedding, again, same thing. 00:17:35.320 |
And you can see that the weekend is fairly separate. 00:17:47.560 |
I'm actually going to take you through the end-to-end process, and I rebuilt the end-to-end 00:17:53.840 |
process from scratch and tried to make it in as few lines as possible because we just 00:18:01.680 |
haven't really looked at any of these structured data type problems before. 00:18:08.680 |
So it's kind of a very different process and even a different set of techniques. 00:18:25.880 |
When you try to do this stuff yourself, you'll find three or four libraries we haven't used before. 00:18:32.040 |
So when you hit something that says "module not found", you can just pip install all these missing modules. 00:18:45.940 |
So the data that comes from Kaggle comes down as a bunch of CSV files, and I wrote a quick 00:18:52.520 |
thing to combine some of those CSVs together. 00:18:57.160 |
This was one of those competitions where people were allowed to use additional external data 00:19:06.280 |
So the data I'll share with you, I'm going to combine it all into one place for you. 00:19:12.360 |
So I've commented out these steps because the stuff I'll give you will have already been run on it. 00:19:18.660 |
So the basic tables that you're going to get access to is the training set itself, a list 00:19:24.920 |
of stores, a list of which state each store is in, a list of the abbreviation and the name 00:19:34.000 |
of each state in Germany, a list of data from Google Trends. 00:19:39.840 |
So if you've used Google Trends you can basically see how particular keywords change over time. 00:19:44.480 |
I don't actually know which keywords they used, but somebody found that there were some 00:19:52.360 |
Google Trends keywords that correlated well, so we've got access to those, plus some information about the weather. 00:20:01.200 |
So I'm not sure that we've really used pandas much yet, so let's talk a bit about pandas. 00:20:08.320 |
Pandas lets us take this kind of structured data and manipulate it in similar ways to 00:20:13.480 |
the way you would manipulate it in a database. 00:20:16.400 |
So the first thing you do: just like NumPy tends to become np, pandas tends 00:20:24.080 |
to become pd. So pd.read_csv is going to return a data frame. 00:20:28.440 |
So a data frame is like a database table; if you've used R, it's called the same thing there. 00:20:35.640 |
So this read_csv is going to return a data frame containing the information from this 00:20:40.400 |
CSV file, and we're going to go through each one of those table names and read the CSV. 00:20:47.880 |
So this list comprehension is going to return a list of data frames. 00:20:55.220 |
So I can now go ahead and display the head, so the first five rows from each table. 00:21:03.280 |
And that's a good way to get a sense of what these tables are. 00:21:12.320 |
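A sketch of that pattern; the path and file names are assumptions based on the tables described:

```python
import pandas as pd

path = 'data/rossmann/'                     # hypothetical location of the CSVs
table_names = ['train', 'store', 'store_states', 'state_names',
               'googletrend', 'weather', 'test']
tables = [pd.read_csv(f'{path}{name}.csv', low_memory=False) for name in table_names]

for t in tables:
    print(t.head())                         # first five rows of each table
```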
So for some store, on some date, they had some level of sales to some number of customers. 00:21:23.120 |
And they were either open or closed, they either had a promotion on or they didn't, 00:21:26.720 |
it either was a holiday or it wasn't, for both state and school holidays, and then there's some additional information in the other tables. 00:21:40.320 |
So for example, for each store, we can join up some kind of categorical variable about the store type. 00:21:49.880 |
I have no idea what this is, it might be a different brand or something. 00:21:54.360 |
What kinds of products do they carry, again it's just a letter, I don't know what it means, 00:21:58.480 |
but maybe it's like some are electronics, some are supermarkets, some are full spectrum. 00:22:10.120 |
And what year and month did the competitor open for business? 00:22:15.880 |
Notice that sometimes the competitor opened for business quite late in the game, like 00:22:22.560 |
later than some of the data we're looking at, so that's going to be a little bit confusing. 00:22:29.000 |
And then this thing called Promo2, which as far as I understand it basically indicates whether this is 00:22:33.480 |
a store which has some kind of standard promotion timing going on. 00:22:38.400 |
So you can see here that this store has standard promotions in January, April, July and October. 00:22:49.880 |
We also know for each store what state they're in based on the abbreviation, and then we can 00:22:54.600 |
find out for each state what is the name of that state. 00:22:59.040 |
And then for each, this is slightly weird, this is the state abbreviation, the last two characters of the field. 00:23:04.120 |
In this state, during this week, this was the Google Trend data for some keyword, I'm not sure which. 00:23:15.360 |
For this state name, on this date is the temperature, dewpoint, and so forth. 00:23:26.760 |
Then there's the test set. It's identical to the training set, but we don't have the number of customers, and we don't have the sales we're trying to predict. 00:23:33.680 |
So this is a pretty standard kind of industry data set. 00:23:40.240 |
You've got a central table, various tables related to that, and some things representing 00:23:48.360 |
One of the nice things you can do in pandas is to use the pandas_summary module: you call 00:23:57.520 |
DataFrameSummary(table).summary(), and that will return a whole bunch of information about every column. 00:24:04.040 |
So I'm not going to go through all of it in detail, but you can see for example for the 00:24:08.840 |
sales, on average 5,800 sales, standard deviation of 3,800, sometimes the sales goes all the 00:24:16.200 |
way down to 0, sometimes all the way up to 41,000. 00:24:19.880 |
There are no missing values for sales, that's good to know. 00:24:25.160 |
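That summary comes from the pandas_summary package; a minimal sketch of how it is typically called, assuming the tables list from the earlier sketch:

```python
from pandas_summary import DataFrameSummary

train = tables[0]                                # the training table
print(DataFrameSummary(train).summary())         # per-column stats, counts, missing, types
```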
So this is the kind of thing that's good to scroll through and identify, okay, competition 00:24:30.160 |
open since month is missing about a third of the time, that's good to know. 00:24:39.240 |
There's 12 unique states, that might be worth checking because there's actually 16 states in Germany. 00:24:49.680 |
Google trend data is never missing, that's good. 00:25:09.600 |
This is the kind of thing that might screw up a model, it's like actually sometimes the 00:25:12.560 |
test set is missing the information about whether that store was open or not, so that's worth knowing. 00:25:19.800 |
So we can take that list of tables and just de-structure it out into a whole bunch of 00:25:23.720 |
different table names, find out how big the training set is, how big the test set is. 00:25:30.840 |
And then with this kind of problem, there's going to be a whole bunch of data cleaning and feature engineering to do. 00:25:42.000 |
So neural nets don't make any of that go away, particularly because we're using this style 00:25:50.620 |
of neural net where we're basically feeding in a whole bunch of separate continuous and categorical variables. 00:25:56.520 |
So to simplify things a bit, we turn state holidays into Booleans, and then I'm going to join all of these tables together. 00:26:10.200 |
I always use a default join type of a left outer join, so you can see here this is how we join 00:26:16.940 |
in pandas. We say table.merge(table2, ...), and then to make a left outer join, how='left', 00:26:24.640 |
and then you say what's the name of the fields that you're going to join on the 00:26:29.040 |
left hand side, what are the fields you're going to join on the right hand side, and 00:26:33.360 |
then if both tables have some fields with the same name, what are you going to suffix 00:26:41.720 |
So on the left hand side we're not going to add any suffix, on the right hand side we'll 00:26:45.640 |
put in _y. So again, I try to refactor things as much as I can, so we're going to join lots 00:26:52.000 |
of things. Let's create one function to do the joining, and then we can call it lots of times. 00:26:57.720 |
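A sketch of such a join helper, mirroring what's described (left outer join, same key name on both sides by default, '_y' suffix on clashing columns):

```python
def join_df(left, right, left_on, right_on=None, suffix='_y'):
    """Left outer join two DataFrames, suffixing clashing right-hand columns."""
    if right_on is None:
        right_on = left_on
    return left.merge(right, how='left', left_on=left_on,
                      right_on=right_on, suffixes=('', suffix))
```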
Were there any fields that referred to the same value but were named differently? 00:27:06.520 |
Not that I saw, no. It wouldn't matter too much if there were, because when we run the 00:27:17.200 |
Question - Would you liken the use of embeddings from a neural network to extraction of implicit 00:27:26.280 |
features, or can we think of it more like what a PCA would do, like dimensionality reduction? 00:27:37.360 |
Let's talk about it more when we get there. Basically, when you deal with categorical 00:27:42.760 |
variables in any kind of model, you have to decide what to do with them. One of my favorite 00:27:51.800 |
data scientists, or a pair of them actually, who are very nearly neighbors of Rachel and 00:27:57.440 |
mine, have this fantastic R package called Vtreat, which has a bunch of state-of-the-art 00:28:09.160 |
approaches to dealing with stuff like categorical variable encoding. 00:28:20.000 |
The obvious way to do categorical variable encoding is to just do a one-hot encoding, 00:28:26.680 |
and that's the way nearly everybody puts it into their gradient-boosting machines or random 00:28:31.200 |
forests or whatever. One of the things that Vtreat does is it has some much more interesting 00:28:40.040 |
techniques. For example, you could look at the univariate mean of sales for each day 00:28:55.160 |
of week, and you could encode day of week using a continuous variable which represents 00:29:01.920 |
the mean of sales. But then you have to think about, "Would I take that mean from the training 00:29:08.000 |
set or the test set or the validation set? How do I avoid overfitting?" There's all kinds 00:29:12.800 |
of complex statistical subtleties to think about, and Vtreat handles all this stuff automatically. 00:29:25.040 |
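A tiny sketch of the mean-encoding idea just described, computed on the training rows only to avoid leakage (the train/validation split and column names are placeholders):

```python
# average sales per day of week, learned from the training set only
dow_means = train_df.groupby('DayOfWeek')['Sales'].mean()

# encode both sets with the training-set means
train_df['DayOfWeek_enc'] = train_df['DayOfWeek'].map(dow_means)
valid_df['DayOfWeek_enc'] = valid_df['DayOfWeek'].map(dow_means)
```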
There's a lot of great techniques, but they're kind of complicated and in the end they tend 00:29:29.240 |
to make a whole bunch of assumptions about linearity or univariate correlations or whatever. 00:29:36.740 |
Whereas with embeddings, we're using SGD to learn how to deal with it, just like we do 00:29:45.400 |
when we build an NLP model or a collaborative filtering model. We provide some initially 00:29:53.800 |
random embeddings and the system learns how the movies vary compared to each other, or 00:30:01.400 |
users vary, or words vary or whatever. This is to me the ultimate pure technique. 00:30:10.040 |
Of course the other nice thing about embeddings is we get to pick the dimensionality of the 00:30:13.560 |
embedding so we can decide how much complexity and how much learning are we going to put 00:30:19.080 |
into each of the categorical variables. We'll see how to do that in a moment. 00:30:30.360 |
One complexity was that the weather uses the name of the state rather than the abbreviation 00:30:42.760 |
of the state, so we can just go ahead and join weather to states to get the abbreviation. 00:30:51.040 |
The Google Trend information specifies the week as a range from one date to another, so we can split that apart. 00:30:59.320 |
You can see here one of the things that happens in the Google Trend data is that one of the 00:31:06.280 |
states is called ni, whereas in the rest of the data it's called hb,ni. So this is a good 00:31:12.420 |
opportunity to learn about pandas indexing. So pandas indexing, most of the time you want 00:31:18.360 |
to use this .ix method. And the .ix method is your general indexing method. It's going 00:31:26.760 |
to take two things, a list of rows to select and a list of columns to select. You can use 00:31:32.840 |
it in pretty standard intuitive ways. This is a lot like numpy. This here is going to 00:31:37.920 |
return a list of Booleans, which things are in this state. And if you pass the list of 00:31:43.520 |
Booleans to the pandas row selector, it will just return the rows where that Boolean is 00:31:48.960 |
true. So therefore this is just going to return the rows from Google Trend, where googletrend.state 00:31:55.880 |
is ni. And then the second thing we pass in is a list of columns, in this case we just 00:32:01.640 |
got one column. And one very important thing to remember, again just like numpy, you can 00:32:06.720 |
put this kind of thing on the left-hand side of an equal sign. In computer science we call 00:32:10.880 |
this an L-value, so you can use it as an L-value. So we can take this state field, for the rows 00:32:17.120 |
which are equal to ni, and change their value to this. So this is like a very nice simple 00:32:25.360 |
technique that you'll use all the time in pandas, both for looking at things and for changing things. 00:32:38.620 |
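A sketch of that pattern using .loc, the current replacement for .ix mentioned in the question below (the exact column name, capitalization and replacement string are assumptions about the dataset):

```python
# boolean mask selecting the rows whose state abbreviation is 'NI'
mask = googletrend.State == 'NI'

# used on the left-hand side of an assignment, .loc updates just those rows
googletrend.loc[mask, 'State'] = 'HB,NI'
```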
We have a few questions. One is, in this particular example, do you think the granularity of the 00:32:44.280 |
data matter, as in per day or per week, is one better than the other? 00:32:49.600 |
Yeah, I mean I would want to have the lower granularity so that I can capture that. Ideally 00:33:00.760 |
the hour of day as well. It kind of depends on how the organization is going to use it. 00:33:08.680 |
What are they going to do with this information? It's probably for purchasing and stuff, so 00:33:11.680 |
maybe they don't care about an hourly level. Clearly the difference between Sunday sales 00:33:17.640 |
and Wednesday sales will be quite significant. This is mainly a kind of business context question. 00:33:27.280 |
Another question is, do you know if there's any work that compares for structured data, 00:33:32.120 |
supervised embeddings like these, to embeddings that come from an unsupervised paradigm such 00:33:37.000 |
as an autoencoder? It seems like you'd get embeddings that are more useful for prediction in 00:33:42.040 |
the former case, but if you wanted general purpose embeddings you might prefer the latter. 00:33:46.400 |
Yeah, I think you guys are aware of my feelings about autoencoders. It's like giving up on 00:33:51.280 |
life. You can always come up with a loss function that's more interesting than an autoencoder 00:33:57.480 |
loss function basically. I would be very surprised if embeddings that came from a sales model 00:34:03.860 |
were not more useful for just about everything than something that came from an unsupervised 00:34:07.600 |
model. These things are easily tested, and if you do find a model that they don't work 00:34:13.080 |
as well with, then you can come up with a different set of supervised embeddings for that purpose. 00:34:17.800 |
There's also just a note that .ix is deprecated and we should use .loc instead. 00:34:23.840 |
I was going to mention Pandas is changing a lot. Because I've been running this course 00:34:30.400 |
I have not been keeping track of the recent versions of Pandas, so thank you. In Pandas 00:34:42.580 |
there's a whole page called Advanced Indexing Methods. I don't find the Pandas documentation 00:34:47.080 |
terribly clear to be honest, but there is a fantastic book by the author of Pandas called 00:34:51.800 |
Python for Data Analysis. There is a new edition out, and it covers Pandas, NumPy, Matplotlib, 00:35:01.280 |
whatever. That's the best way by far to actually understand Pandas because the documentation 00:35:09.920 |
is a bit of a nightmare and it keeps changing so the new version has all the new stuff in 00:35:14.400 |
it. With these kind of indexing methods, Pandas tries really hard to be intuitive, which means 00:35:22.200 |
that quite often you'll read the documentation for these methods and it will say if you pass 00:35:28.600 |
it a Boolean it will behave in this way, if you pass it a float it will behave this way, 00:35:32.520 |
if it's an index it's this way unless this other thing happens. I don't find it intuitive 00:35:39.480 |
at all because in the end I need to know how something works in order to use it correctly 00:35:42.780 |
and so you end up having to remember this huge list of things. I think Pandas is great, 00:35:47.880 |
but this is one thing to be very careful of, is to really make sure you understand how 00:35:53.480 |
all these indexing methods actually work. I know Rachel's laughing because she's been 00:35:57.440 |
there and probably laughing in disgust at what we all have to go through. 00:36:02.960 |
Another question, when you use embeddings from a supervised model in another model, do you 00:36:16.080 |
always have to worry about data leakage? I think that's a great point. I don't think 00:36:20.960 |
I've got anything to add to that. You can figure out easily enough if there's data leakage. 00:36:30.420 |
So there's this kind of standard set of steps that I take for every single structured machine 00:36:50.160 |
learning model I do. One of those is every time I see a date, I always do this. I always 00:36:58.440 |
create four more fields, the year, the month of year, the week of year and the day of week. 00:37:08.620 |
This is something which should be automatically built into every data loader, I feel. It's 00:37:18.220 |
so important because these are the kinds of structures that you see, and once every single 00:37:23.880 |
date has got this added to it, you're doing great. So you can see that I add that into 00:37:32.260 |
all of my tables that have a date field, so we'll have that from now on. 00:37:42.420 |
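A minimal sketch of that date expansion (the function and column suffixes are mine):

```python
import pandas as pd

def add_date_parts(df, col):
    """Split a date column into year, month, week-of-year and day-of-week fields."""
    d = pd.to_datetime(df[col])
    df[col + 'Year'] = d.dt.year
    df[col + 'Month'] = d.dt.month
    df[col + 'Week'] = d.dt.isocalendar().week   # needs pandas >= 1.1; older versions used .dt.week
    df[col + 'DayOfWeek'] = d.dt.dayofweek
```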
So now I go ahead and do all of these outer joins. You'll see that the first thing I do 00:37:49.340 |
after every outer join is check whether the thing I just joined with has any nulls. Even 00:37:58.540 |
if you're sure that these things match perfectly, I would still never ever do an inner join. 00:38:06.500 |
Do the outer join and then check for nulls, and that way if anything changes ever or if 00:38:11.020 |
you ever make a mistake, one of these things will not be zero. If this was happening in 00:38:17.660 |
a production process, this would be an assert. This would be emailing Henry at 2am to say 00:38:26.140 |
something you're relying on is not working the way it was meant to, go look at it. So that's really important. 00:38:35.460 |
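A sketch of that join-then-check pattern, reusing the join_df helper sketched earlier (the column checked is an assumption):

```python
joined = join_df(train, store, 'Store')     # left outer join on the store id
# if the store table were missing any stores, its columns would come back null
assert joined.StoreType.isnull().sum() == 0, 'some training rows did not match a store'
```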
So you can see I'm basically joining my training to everything else until it's all in there 00:38:43.740 |
together in one big thing. So that table "everything joined together" is called "joined", and then 00:38:53.580 |
I do a whole bunch more thinking about -- well, I didn't do the thinking, the people that 00:38:58.220 |
won this competition, then I replicated their results from scratch -- think about what are 00:39:04.700 |
all the other things you might want to do with these dates. 00:39:07.300 |
So competition open, we noticed before, a third of the time they're empty. So we just 00:39:13.700 |
fill in the empties with some kind of sentinel value because a lot of machine learning systems 00:39:21.180 |
don't like missing values. Fill in the missing months with some sentinel value. Again, keep 00:39:28.860 |
on filling in missing data. So fillna is a really important method to be aware of. 00:39:42.580 |
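For example, the competition-open fields could be filled roughly like this; the sentinel values here are my choice, not necessarily the authors':

```python
joined.CompetitionOpenSinceYear = joined.CompetitionOpenSinceYear.fillna(1900).astype(int)
joined.CompetitionOpenSinceMonth = joined.CompetitionOpenSinceMonth.fillna(1).astype(int)
```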
I guess the answer is yes, it is a problem. In this case, I happen to know that every 00:39:59.260 |
time a year is empty, a month is also empty, and we only ever use both of them together. 00:40:04.100 |
So we don't really care when the competitor's store was opened, what we really care about 00:40:25.540 |
is how long is it between when they were opened and the particular row that we're looking 00:40:30.020 |
at. The sales on the 2nd of February 2014, how long was it between the 2nd of February 2014 and when the competitor opened? 00:40:39.100 |
So you can see here we use this very important .apply function which just runs a Python function 00:40:48.300 |
on every row of a data frame. In this case, the function is to create a new date from 00:40:56.180 |
the open-since year and the open-since month. We're just going to assume that it's the middle 00:40:59.740 |
of the month. That's our competition open-since, and then we can get our days open by just subtracting the dates. 00:41:10.580 |
In pandas, every date field has this special magical dt property, which is where all the 00:41:18.940 |
day, month, year, all that stuff sits. Sometimes, as I mentioned, 00:41:31.580 |
the competition actually opened later than the particular observation we're looking at. 00:41:36.900 |
So that would give us a negative, so we replace our negatives with zero. We're going to use 00:41:46.220 |
an embedding for this, so that's why we replace days open with months open, so we have fewer distinct levels. 00:41:58.740 |
I didn't actually try replacing this with a continuous variable. I suspect it wouldn't 00:42:03.580 |
make too much difference, but this is what they do. In order to make the embedding again 00:42:11.100 |
not too big, they replaced anything that was bigger than 2 years with 2 years. 00:42:18.700 |
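A vectorized sketch of those steps (the original used a row-wise .apply; the effect is the same, and it assumes joined.Date is already a datetime column):

```python
import pandas as pd

# assemble an approximate opening date, assuming the middle of the month
open_since = pd.to_datetime(pd.DataFrame({
    'year': joined.CompetitionOpenSinceYear,
    'month': joined.CompetitionOpenSinceMonth,
    'day': 15}))
joined['CompetitionOpenSince'] = open_since

# days between each sales row and the competitor opening; clamp negatives to zero
joined['CompetitionDaysOpen'] = (joined.Date - joined.CompetitionOpenSince).dt.days
joined.loc[joined.CompetitionDaysOpen < 0, 'CompetitionDaysOpen'] = 0

# coarser unit for the embedding, capped at two years
joined['CompetitionMonthsOpen'] = (joined.CompetitionDaysOpen // 30).clip(upper=24)
```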
So there's our unique values. Every time we do something, print something out to make 00:42:24.380 |
sure the thing you thought you did is what you actually did. It's much easier if we're 00:42:29.260 |
using Excel because you see straight away what you're doing. In Python, this is the 00:42:34.860 |
kind of stuff that you have to really be rigorous about checking your work at every step. When 00:42:42.940 |
I build stuff like this, I generally make at least one error in every cell, so check everything. 00:42:51.860 |
Okay, do the same thing for the promo days, turn those into weeks. So that's some basic 00:43:01.780 |
pre-processing, you get the idea of how pandas works hopefully. 00:43:07.820 |
So the next thing that they did in the paper was a very common kind of time series feature 00:43:18.220 |
manipulation, one to be aware of. They basically wanted to say, "Okay, every time there's a 00:43:27.820 |
promotion, every time there's a holiday, I want to create some additional fields for 00:43:34.980 |
every one of our training set rows," which is on a particular date. "On that date, how 00:43:39.660 |
long is it until the next holiday? How long is it until the previous holiday? How long 00:43:45.020 |
is it until the next promotion? How long is it since the previous promotion?" So if we 00:43:49.620 |
basically create those fields, this is the kind of thing which is super difficult for 00:43:58.620 |
any GBM or random forest or neural net to figure out how to calculate itself. There's 00:44:07.020 |
no obvious kind of mathematical function that it's going to build on its own. So this is 00:44:11.060 |
the kind of feature engineering that we have to do in order to allow us to use these kinds 00:44:16.220 |
of techniques effectively on time series data. 00:44:19.660 |
So a lot of people who work with time series data, particularly in academia outside of 00:44:27.380 |
industry, they're just not aware of the fact that the state-of-the-art approaches really 00:44:33.140 |
involve all these heuristics. Separating out your dates into their components, turning 00:44:40.220 |
everything you can into durations both forward and backwards, and also running averages. 00:44:52.100 |
When I used to do a lot of this kind of work, I had a bunch of library functions that I 00:44:59.660 |
would run on every file that came in and would automatically do these things for every combination 00:45:04.700 |
of dates. So this thing of how long until the next promotion, how long since the previous 00:45:12.580 |
promotion is not easy to do in any database system pretty much, or indeed in pandas. Because 00:45:21.940 |
generally speaking, these kind of systems are looking for relationships between tables, 00:45:26.700 |
but we're trying to look at relationships between rows. 00:45:30.060 |
So I had to create this tiny, simple little class to do this. So basically what 00:45:36.820 |
happens is, let's say I'm looking at school holiday. So I sort my data frame by store, 00:45:45.300 |
and then by date, and I call this little function called add_elapsed_school_holiday_after. What 00:45:53.420 |
does add_elapsed do? Add_elapsed is going to create an instance of this class called elapsed, 00:46:00.220 |
and in this case it's going to be called with school_holiday. 00:46:04.580 |
So what this class is going to do, we're going to be calling this apply function again. It's 00:46:08.980 |
going to run on every single row, and it's going to call my Elapsed class's get method for every 00:46:15.620 |
row. So I'm going to go through every row in order of store, in order of date, and I'm 00:46:21.780 |
trying to find how long has it been since the last school holiday. 00:46:26.580 |
So when I create this object, I just have to keep track of what field is it, school_holiday. 00:46:34.080 |
Initialize, when was the last time we saw a school holiday? The answer is we haven't, 00:46:40.860 |
so let's initialize it to not a number. And we also have to know each time we cross over 00:46:47.740 |
to a new store. When we cross over to a new store, we just have to re-initialize. So the 00:46:51.860 |
previous store is initialized to 0. So every time we call get, we basically check: have we crossed over 00:46:57.900 |
to a new store? And if so, just initialize both of those things back again. And then 00:47:05.260 |
we just say, Is this a school holiday? If so, then the last time you saw a school holiday 00:47:12.980 |
is today. And then finally return how long it is between today and the last time you saw a school holiday. 00:47:20.620 |
So it's basically this class is a way of keeping track of some memory about when did I last 00:47:28.620 |
see this observation. So then by just calling df.apply, it's going to keep track of this 00:47:35.460 |
for every single row. So then I can call that for school_holiday, after and before. The 00:47:41.380 |
only difference being that for before I just sort my dates in descending order. State_holiday 00:47:49.220 |
and promo. So that's going to add in the end 6 fields, how long until and how long since 00:47:56.700 |
the last school holiday, state_holiday and promotion. 00:47:59.700 |
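A sketch of that bookkeeping class, assuming the rows carry Store and Date fields and are already sorted by store and then date:

```python
import numpy as np

class Elapsed:
    """Track, per store, the days elapsed since the field `fld` was last true."""
    def __init__(self, fld):
        self.fld = fld
        self.last = None            # date of the last event seen (None = not yet)
        self.last_store = 0

    def get(self, row):
        if row.Store != self.last_store:    # crossed over to a new store: reset
            self.last = None
            self.last_store = row.Store
        if row[self.fld]:                   # the event happens on this row's date
            self.last = row.Date
        if self.last is None:               # no event seen yet for this store
            return np.nan                   # becomes a null, filled with zero later
        return (row.Date - self.last).days

# usage sketch: with df sorted by Store then Date,
# df['AfterSchoolHoliday'] = df.apply(Elapsed('SchoolHoliday').get, axis=1)
```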
And then there's two questions. One asking, Is this similar to a windowing function? 00:48:10.020 |
Not quite, we're about to do a windowing function. 00:48:12.420 |
And then is there a reason to think that the current approach would be problematic with 00:48:17.620 |
I don't see why, but I'm not sure I quite follow. So we don't care about absolute days. 00:48:30.700 |
We care about two things. We do care about the dates, but we care about what year is 00:48:37.740 |
it, what day of the week it is. And we also care about the elapsed time between the date I'm 00:48:46.580 |
predicting sales for and the previous and next of various events. 00:48:53.220 |
And then windowing functions, for the features that are time until an event, how do you deal 00:48:59.700 |
with that given that you might not know when the last event is in the data? 00:49:06.340 |
Well all I do is I've sorted descending, and then we initialize last with not a number. 00:49:17.780 |
So basically when we then go subtract, here we are subtract, and it tries to subtract 00:49:23.940 |
not a number, we'll end up with a null. So basically anything that's an unknown time 00:49:30.620 |
because it's at one end or the other is going to end up null, which is why we're going to 00:49:39.860 |
replace those nulls with zeros. Pandas has this slightly strange way of thinking 00:49:48.540 |
about indexes, but once you get used to it, it's fine. At any point you can call DataFrame.set_index 00:49:54.900 |
and pass in a field. You then have to just kind of remember what field you have as the 00:50:01.500 |
index, because quite a few methods in Pandas use the currently active index by default, 00:50:07.980 |
and of course things all run faster when you do stuff with the currently active index. And 00:50:13.660 |
you can pass multiple fields, in which case you end up with a multiple key index. 00:50:20.260 |
So the next thing we do is these windowing functions. So a windowing function in Pandas, 00:50:26.420 |
we can use this rolling. So this is like a rolling mean, rolling min, rolling max, whatever 00:50:31.180 |
you like. So this basically says let's take our DataFrame with the columns we're interested 00:50:40.940 |
in, school holiday, state holiday and promo, and we're going to keep track of how many 00:50:45.980 |
holidays are there in the next week and the previous week. How many promos are there in 00:50:52.220 |
the next week and the previous week? To do that we can sort, here we are, by date, group 00:51:03.060 |
by, store, and then rolling will be applied to each group. So within each group, create 00:51:09.740 |
a rolling 7-day sum. It's the kind of notation I'm never likely to remember, but you can 00:51:25.220 |
just look it up. This is how you do group by type stuff. Pandas actually has quite a 00:51:32.460 |
lot of time series functions, and this rolling function is one of the most useful ones. Wes 00:51:39.220 |
McKinney had a background as a quant, if memory serves correctly, and so the quants love their 00:51:44.740 |
time series functions, so I think that was a lot of the history of Pandas. So if you're 00:51:49.100 |
interested in time series stuff, you'll find a lot of time series stuff in Pandas. 00:52:01.980 |
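A sketch of that windowing step (column names follow the Rossmann data; the exact window arguments are my assumption):

```python
cols = ['SchoolHoliday', 'StateHoliday', 'Promo']

# backward-looking: how many of each event in the previous 7 rows, per store
bwd = (df.sort_values(['Store', 'Date'])
         .groupby('Store')[cols]
         .rolling(7, min_periods=1).sum())

# forward-looking: same thing with the dates sorted in descending order
fwd = (df.sort_values(['Store', 'Date'], ascending=[True, False])
         .groupby('Store')[cols]
         .rolling(7, min_periods=1).sum())
```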
One helpful parameter that sits inside a lot of methods is inplace=True. That means that 00:52:08.180 |
rather than returning a new data frame with this change made, it changes the data frame 00:52:13.460 |
you already have, and when your data frames are quite big this is going to save a lot 00:52:17.740 |
of time and memory. That's a good little trick to know about. 00:52:23.100 |
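For example, a trivial sketch:

```python
df2 = df.reset_index()          # returns a new DataFrame, leaving df untouched
df.reset_index(inplace=True)    # modifies df itself; avoids copying a large frame
```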
So now we merge all these together, and we can now see that we've got all these after 00:52:28.060 |
school holidays, before school holidays, and our backward and forward running means. Then 00:52:37.860 |
we join that up to our original data frame, and here we have our final result. 00:52:46.940 |
So there it is. We started out with a pretty small set of fields in the training set, but 00:52:53.980 |
we've done this feature engineering. This feature engineering is not arbitrary. Although I didn't 00:53:01.760 |
create this solution, I was just re-implementing the solution that came from the competition 00:53:08.700 |
third-place getters -- this is nearly exactly the set of feature engineering steps I would 00:53:16.100 |
have done. It's just a really standard way of thinking about a time series. So you can 00:53:21.900 |
definitely borrow these ideas pretty closely. 00:53:32.300 |
So now that we've got this table, we've done our feature engineering, we now want to feed 00:53:45.140 |
it into a neural network. To feed it into a neural network we have to do a few things. 00:53:51.900 |
The categorical variables have to be turned into one-hot encoded variables, or at least 00:53:59.360 |
into contiguous integers. And the continuous variables we probably want to normalize to 00:54:06.660 |
a zero-mean, one-standard-deviation scale. There's a very little-known package called sklearn_pandas. 00:54:16.180 |
And actually I contributed some new stuff to it for this course to make this even easier 00:54:19.940 |
to use. If you use this data frame mapper from sklearn_pandas, as you'll see, it makes 00:54:26.180 |
life very easy. Without it, life is very hard. And because very few people know about it, 00:54:32.420 |
the vast majority of code you will find on the internet makes life look very hard. So 00:54:37.460 |
use this code, not the other code. Actually I was talking to some of the students the 00:54:44.020 |
other day and they were saying for their project they were stealing lots of code from part 00:54:49.060 |
one of the course because they just couldn't find anywhere else people writing any of the 00:54:55.460 |
kinds of code that we've used. The stuff that we've learned throughout this course is on 00:55:00.740 |
the whole not code that lives elsewhere very much at all. So feel free to use a lot of 00:55:07.860 |
these functions in your own work because I've really tried to make them the best version 00:55:15.780 |
So one way to do the embeddings and the way that they did it in the paper is to basically 00:55:21.900 |
say for each categorical variable they just manually decided what embedding dimensionality 00:55:27.900 |
to use. They don't say in the paper how they pick these dimensionalities, but generally 00:55:33.020 |
speaking things with a larger number of separate levels tend to have more dimensions. So I 00:55:39.620 |
think there's like 1000 stores, so that has a big embedding dimensionality, whereas 00:55:46.020 |
obviously things like promo forward and backward, or day of week or whatever, have much smaller 00:55:52.900 |
ones. So this is this dictionary I created that basically goes from the name of the field 00:55:58.740 |
to the embedding dimensionality. Again, this is all code that you guys can use in your 00:56:05.300 |
So then all I do is I say my categorical variables is go through my dictionary, sort it in reverse 00:56:14.420 |
order of the value, and then get the first thing from that. So that's just going to give 00:56:21.180 |
me the keys from this in reverse order of dimensionality. Continuous variables is just 00:56:31.820 |
a list. Just make sure that there's no nulls, so continuous variables replace nulls with 00:56:40.460 |
zeros, categorical variables replace nulls with empties. 00:56:44.380 |
And then here's where we use the DataFrameMapper. A DataFrameMapper takes a list of tuples with 00:56:53.780 |
just two items in. The first item is the name of the variable, so in this case I'm looping 00:56:58.140 |
through each categorical variable name. The second thing in the tuple is an instance of 00:57:05.460 |
a class which is going to do your preprocessing. And there's really just two that you're going 00:57:12.180 |
to use almost all the time. The categorical variables, sklearn comes with something called 00:57:17.660 |
label encoder. It's really badly documented, in fact misleadingly documented, but this 00:57:26.620 |
is exactly what you want. It's something that takes a column, figures out what are all the 00:57:31.820 |
unique values that appear in that column, and replaces them with a set of contiguous 00:57:36.100 |
integers. So if you've got the days of the week, Monday through Sunday, it'll replace them with the integers 0 through 6. 00:57:45.300 |
And then very importantly, this is critically important, you need to make sure that the 00:57:50.500 |
training set and the test set have the same codes. There's no point in having Sunday be 00:57:54.860 |
zero in the training set and one in the test set. So because we're actually instantiating 00:58:01.460 |
this class here, this object is going to actually keep track of which codes it's using. 00:58:08.780 |
And then ditto for the continuous: we want to normalize them to zero-mean, unit-standard-deviation variables. But 00:58:15.180 |
again, we need to remember what was the mean that we subtracted, what was the standard 00:58:19.020 |
deviation we divided by, so that we can do exactly the same thing to the test set. Otherwise 00:58:23.660 |
again our models are going to be nearly totally useless. So the way the dataframe mapper works 00:58:28.260 |
is that it's using this instantiated object, it's going to keep track with this information. 00:58:33.260 |
So this is basically code you can copy and paste in every one of your models. 00:58:38.420 |
Once we've got those mappings, you just pass those to a dataframe mapper, and then you 00:58:44.540 |
call .fit passing in your dataset. And so this thing now is a special object which has 00:58:52.900 |
a .features property that's going to contain all of the pre-processed features that you 00:59:02.260 |
want. Categorical columns contains the result of doing this mapping, basically doing this 00:59:09.100 |
label encoding. In some ways the details of how this works doesn't matter too much because 00:59:17.020 |
you can just use exactly this code in every one of your models. 00:59:20.180 |
Same for continuous, it's exactly the same code, but of course continuous, it's going 00:59:26.020 |
to be using StandardScaler, which is the scikit-learn thing that turns it into a zero-mean-one standard 00:59:32.840 |
deviation variable. So we've now got continuous columns that have all been standardized. 00:59:39.020 |
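Putting those pieces together, a sketch of the mapper setup; the variable lists and DataFrame name are assumed from the steps above:

```python
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn_pandas import DataFrameMapper

# categorical columns: a bare column name hands LabelEncoder a 1-D input
cat_maps = [(c, LabelEncoder()) for c in cat_vars]
# continuous columns: wrap the name in a list so StandardScaler gets 2-D input
contin_maps = [([c], StandardScaler()) for c in contin_vars]

cat_map_fit = DataFrameMapper(cat_maps).fit(joined)       # remembers the codes
contin_map_fit = DataFrameMapper(contin_maps).fit(joined) # remembers mean and std

cat_cols = cat_map_fit.transform(joined).astype('int64')
contin_cols = contin_map_fit.transform(joined).astype('float32')
```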
Here's an example of the first five rows from the zeroth column for a categorical, and then 00:59:51.420 |
ditto for a continuous. You can see these have been turned into integers and these have been 00:59:59.100 |
turned into numbers which are going to average to zero and have a standard deviation of one. 01:00:05.400 |
One of the nice things about this dataframe mapper is that you can now take that object 01:00:11.940 |
and actually store it, pickle it. So now you can use those categorical encodings and scaling 01:00:20.260 |
parameters elsewhere. By just unpickling it, you've immediately got those same parameters. 01:00:30.180 |
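A sketch of that round trip (the file name is arbitrary):

```python
import pickle

with open('contin_maps.pkl', 'wb') as f:
    pickle.dump(contin_map_fit, f)           # save the fitted scaling parameters

with open('contin_maps.pkl', 'rb') as f:     # later, in another process or model
    contin_map_fit = pickle.load(f)
```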
For my categorical variables, you can see here the number of unique classes in every 01:00:37.340 |
one. So here's my 1,100 stores, 31 days of the month, 7 days of the week, and so forth. 01:00:48.940 |
So that's the kind of key pre-processing that has to be done. So here is their big mistake, 01:00:59.020 |
and I think if they didn't do this big mistake, they probably would have won. Their big mistake 01:01:03.820 |
is that they went joined.Sales != 0. So they've removed all of the rows with 01:01:12.940 |
no sales. Those are all of the rows where the store was closed. Why was this a big mistake? 01:01:21.420 |
Because if you go to the Rossmann Store Sales competition website and click on "Kernels" 01:01:29.300 |
and look at the kernel that got the highest rating. I'll show you a couple of pictures. 01:01:51.940 |
Here is an example of a store, Store 708, and these are all from this kernel. Here is 01:01:58.460 |
a period of time where it was closed for refurbishment. This happens a lot in Rossmann stores. You 01:02:05.340 |
get these periods of time when you get zeros for sales, lots in a row. Look what happens 01:02:11.500 |
immediately before and after. So in the data set that we're looking at, our unfortunate 01:02:19.740 |
third place winners deleted all of these. So they had no ability to build a feature 01:02:25.980 |
that could find this. So this Store 708. Look, here's another one where it was closed. So 01:02:35.100 |
this turns out to be super common. The second place winner actually built a feature. It's 01:02:44.700 |
going to be exactly the same kind of feature we've seen before: how many days since the store closed 01:02:48.820 |
and how many days until it will next close. If they had just done that, I'm pretty sure 01:02:53.220 |
they would have won. So that was their big mistake. 01:02:59.940 |
This kernel has a number of interesting analyses in it. Here's another one which I think our 01:03:08.380 |
neural net can capture, although it might have been better to be explicit. Some stores 01:03:16.980 |
opened on Sundays. Most didn't, but some did. For those stores that opened on Sundays, their 01:03:25.340 |
sales on Sundays were far higher than on any other day. I guess that's because in Germany 01:03:30.460 |
not many shops open on Sundays. So something else that they didn't explicitly 01:03:35.620 |
do was create a "is store open on Sunday" field. Having said that, I think the neural 01:03:43.340 |
net may have been able to put that in the embedding. So if you're interested during 01:03:48.100 |
the week, you could try adding this field and see if it actually improves it or not. 01:03:51.780 |
It would certainly be interesting to hear if you try adding this field. Do you find that it helps? 01:04:01.460 |
This Sunday thing, these are all from the same Kaggle kernel, here's the day of week 01:04:07.620 |
and here's the sales as a box plot. You can see normally on a Sunday, it's not that the 01:04:14.540 |
sales are much higher. So it's really explicitly just for these particular stores. 01:04:22.980 |
That's the kind of visualization stuff which is really helpful to do as you work through 01:04:30.900 |
these kinds of problems. I don't know, just draw lots of pictures. Those pictures were 01:04:35.900 |
drawn in R, and R is actually pretty good for this kind of structured data. 01:04:40.340 |
I have a question. For categorical fields, they're converted to numbers, not to ones 01:04:48.020 |
and zeros. They were just mapped so Monday is zero, Tuesday is one, whatever. 01:04:54.900 |
As is, they will be sent to a neural network just like... 01:04:59.440 |
We're going to get there. We're going to use embeddings. Just like we did with word embeddings, 01:05:05.880 |
remember, we turned every word into a word index. So our sentences, rather than being 01:05:13.220 |
like the dog ate the beans, it would be 3, 6, 12, 2, whatever. We're going to do the 01:05:21.940 |
same basic thing. We've done the same basic thing. 01:05:25.780 |
So now that we've made our terrible mistake, we've still got 824,000 rows left. As 01:05:34.460 |
per usual, I made it really easy for me to create a random sample and did most of my 01:05:40.000 |
analysis with a random sample, but can just as easily not do the random sample. So now 01:05:51.980 |
Split it into training and test. Notice here, the way I split it into training and test 01:05:58.820 |
is not randomly. The reason it's not randomly is because in the Kaggle competition, they 01:06:04.500 |
set it up the smart way. The smart way to set up a test set in a time series is to make 01:06:11.580 |
your test set the most recent period of time. If you choose random points, you've got two 01:06:18.420 |
problems. The first is you're predicting tomorrow's sales when you always have the previous 01:06:23.500 |
day's sales which is very rarely the way things really work. And then secondly, you're ignoring 01:06:32.580 |
the fact that in the real world, you're always trying to model a few days or a few weeks 01:06:38.620 |
or a few months in the future that haven't happened yet. 01:06:41.420 |
So the way you want to set up, if you were setting up the data for such a model yourself, 01:06:48.500 |
you would need to be deciding how often am I going to be rerunning this model, how long 01:06:53.540 |
is it going to take for those model results to get into the field, to be used in however 01:06:57.820 |
they're being used. In this case, I can't remember, I think it's like a month or two. 01:07:05.300 |
So in that case I should make sure there's a month or two test set, which is the last 01:07:13.900 |
bit. So you can see here, I've taken the last 10% as my validation set and it's literally 01:07:20.620 |
just here's the first bit and here's the last bit, and since it was already sorted by date, 01:07:29.340 |
this ensures that I have it done the way I want. 01:07:32.260 |
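A sketch of that non-random split, assuming the joined DataFrame is already sorted by date:

```python
n = len(joined)
train_size = int(n * 0.9)              # hold out the most recent ~10% for validation

train_df = joined.iloc[:train_size]    # earlier dates
valid_df = joined.iloc[train_size:]    # the final stretch of time, as in the competition
```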
I just wanted to point out that it's 10 to 8, so we should probably take a break. 01:07:48.660 |
This is how you take that DataFrameMapper object we created earlier: we call .fit in order 01:07:54.620 |
to learn the transformation parameters, we then call transform to actually do it. So take 01:08:05.100 |
my training set and transform it to grab the categorical variables, and then the continuous 01:08:12.060 |
preprocessing is the same thing for my continuous map. So preprocess my training set and grab 01:08:18.140 |
my continuous variables. So that's nearly done. The only final piece is in their solution, 01:08:32.580 |
they modified their target, their sales value. And the way they modified it was that they 01:08:40.260 |
found the highest amount of sales, and they took the log of that, and then they modified 01:08:49.060 |
all of their y values to take the log of sales divided by the maximum log of sales. 01:08:56.460 |
So what this means is that the y values are going to be no higher than 1. And furthermore, 01:09:04.900 |
remember how they had a long tail, the average was 5,000, the maximum was 40-something thousand. 01:09:11.100 |
This is really common, like most financial data, sales data, so forth, generally has 01:09:16.420 |
a nicer shape when it's logged than it does not. So taking a log is a really good idea. 01:09:22.280 |
The reason that as well as taking the log they also did this division is it means that 01:09:27.860 |
what we can now do is we can use an activation function in our neural net of a sigmoid, which 01:09:33.700 |
goes between 0 and 1, and then just multiply by the maximum log. So that's basically going 01:09:40.100 |
to ensure that the data is in the right scaling area. 01:09:43.820 |
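A hedged sketch of that target scaling; the column name 'Sales' and the variable names are assumptions:

```python
import numpy as np

# Scale log-sales into (0, 1] so a sigmoid output layer can cover the whole range.
max_log_y = np.max(np.log(train['Sales']))
y_train = np.log(train['Sales']) / max_log_y

# To turn a sigmoid-scaled prediction back into sales:
# sales = np.exp(pred * max_log_y)
```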
I actually tried taking this out, and this technique doesn't really seem to help. And 01:09:50.900 |
it actually reminds me of the style transfer paper where they mentioned they originally 01:09:57.500 |
had a hyperbolic tan layer at the end for exactly the same reason, to make sure everything 01:10:02.620 |
was between 0 and 255. It actually turns out that if you just use a linear activation it works 01:10:06.820 |
just as well. So interestingly this idea of using sigmoids at the end in order to get 01:10:14.060 |
the right range doesn't seem to be that helpful. 01:10:17.620 |
My guess is the reason why is because for a sigmoid it's really difficult to get the 01:10:23.020 |
maximum. And I think actually what they should have done is they probably should have, instead 01:10:28.820 |
of using maximum, they should have used maximum times 1.25 so that they never have to predict 01:10:35.780 |
1, because it's impossible to predict 1 because it's a sigmoid. 01:10:40.380 |
Someone asked, "Is there any issue in fitting the preprocessors on the full training and 01:10:46.780 |
validation data? Shouldn't they be fit only to the training set?" 01:10:51.620 |
No, it's fine. In fact, for the categorical variables, if you don't include the test set 01:11:01.460 |
then you're going to have some codes in the test set that aren't there at all; at best 01:11:05.860 |
they'd get handled randomly, which is better than failing. As for deciding what to divide by and subtract 01:11:14.380 |
in order to get a 0-1 scaled variable, it doesn't really matter. There's no leakage 01:11:19.860 |
involved, and leakage is the thing you'd be worried about. 01:11:26.120 |
Root means squared percent error is what the Kaggle competition used as the official loss 01:11:34.640 |
So before we take a break, we'll finally take a look at the definition of the model. I'll 01:11:41.860 |
kind of work backwards. Here's the basic model. Get our embeddings, combine the embeddings 01:11:53.620 |
with the continuous variables, a tiny bit of dropout, one dense layer, two dense layers, 01:11:59.420 |
more dropout, and then the final sigmoid activation function. 01:12:05.060 |
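A rough sketch of that shape in Keras; layer sizes and dropout rates here are illustrative guesses, not the notebook's exact values, and `emb_inputs`/`emb_outputs` are assumed to come from an embedding helper like the one sketched a little further down, with `contin_inp`/`contin_out` being the continuous variables fed through their own input layer:

```python
from keras.layers import Dense, Dropout, concatenate
from keras.models import Model

# Combine the embedding outputs with the continuous variables, then a couple of
# dense layers, dropout, and a final sigmoid (rescaled by max_log_y at prediction time).
x = concatenate(emb_outputs + [contin_out])
x = Dropout(0.02)(x)
x = Dense(1000, activation='relu')(x)
x = Dense(500, activation='relu')(x)
x = Dropout(0.2)(x)
out = Dense(1, activation='sigmoid')(x)

model = Model(emb_inputs + [contin_inp], out)
model.compile(optimizer='adam', loss='mean_absolute_error')
```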
You'll see that I've got commented out stuff all over the place. This is because I had 01:12:10.340 |
a lot of questions, we're going to cover this after the break, a lot of questions about 01:12:13.580 |
some of the details of why did they do things certain ways, some of the things they did 01:12:19.620 |
were so weird, I just thought they couldn't possibly be right. So I did some experimenting. 01:12:27.020 |
So the embeddings, as per usual, I create a little function to create an embedding, which 01:12:34.740 |
first of all creates my regular Keras input layer, and then it creates my embedding layer, 01:12:43.500 |
and then how many embedding dimensions I'm going to use. Sometimes I looked them up in 01:12:48.460 |
that dictionary I had earlier, and sometimes I calculated them using this simple approach 01:12:54.860 |
of saying I will use however many levels there are in the categorical variable divided by 01:13:00.340 |
2 with a maximum of 50. These were two different techniques I was playing with. 01:13:07.740 |
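A rough sketch of an embedding helper along those lines; the exact sizing rule and names are illustrative, not necessarily what the notebook uses (the Flatten step is explained just below):

```python
from keras.layers import Input, Embedding, Flatten

def emb_dim(n_levels):
    return min(50, (n_levels + 1) // 2)       # roughly "number of levels / 2, capped at 50"

def get_emb(name, n_levels):
    inp = Input(shape=(1,), name=name + '_in')
    emb = Embedding(n_levels, emb_dim(n_levels), input_length=1)(inp)
    return inp, Flatten()(emb)                # drop the redundant length-1 time axis
```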
Normally with word embeddings, you have a whole sentence, and so you've got to feed 01:13:13.740 |
it to an RNN, and so you have time steps. So normally you have an input length equal 01:13:18.540 |
to the length of your sentence. This is the time steps for an RNN. We don't have an RNN, 01:13:25.140 |
we don't have any time steps. We just have one element in one column. So therefore I 01:13:32.100 |
have to pass Flatten after this because it's going to have this redundant length-1 time axis that we don't need. 01:13:42.220 |
So this is just because people don't normally do this kind of stuff with embeddings, so 01:13:46.940 |
they're assuming that you're going to want it in a format ready to go to an RNN, so this 01:13:51.100 |
is just turning it back into a normal format. So we grab each embedding, we end up with 01:13:57.660 |
a whole list of those. We then combine all of those embeddings with all of our continuous 01:14:03.180 |
variables into a single list of variables. And so then our model is going to have all 01:14:10.340 |
of those embedding inputs and all of our continuous inputs, and then we can compile it and train 01:14:17.380 |
it. So let's take a break and see you back here at 5 past 8. 01:14:46.700 |
So we've got our neural net set up. We train it in the usual way, call .fit, and away we go. 01:15:05.220 |
So that's basically that. It trains reasonably quickly, 6 minutes in this case. 01:15:18.800 |
So we've got two questions that came in. One of them is, for the normalization, is it possible 01:15:29.700 |
to use another function other than log, such as sigmoid? 01:15:40.020 |
I don't think you'd want to use sigmoid. This kind of financial data and sales data tends 01:15:45.620 |
to be of a shape where log will make it more linear, which is generally what you want. 01:15:52.080 |
And then when we log transform our target variable, we're also transforming the squared 01:15:57.540 |
error. Is this a problem? Or is it helping the model to find a better minimum? 01:16:03.260 |
Yeah, so you've got to be careful about what loss function you want. In this case the Kaggle 01:16:07.100 |
competition is trying to minimize root mean squared percent error. So I actually 01:16:12.540 |
then said I want you to do mean absolute error because in log space that's basically doing 01:16:21.900 |
the same thing. The percent is a ratio, so this is the absolute error between two logs 01:16:28.460 |
which is basically the same as a ratio. So you need to make sure your loss function matches the metric you actually care about. 01:16:38.540 |
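A tiny illustration of why the absolute error between logs behaves like a percent error (the numbers are made up):

```python
import numpy as np

y_true, y_pred = 5000.0, 5500.0
print(abs(np.log(y_pred) - np.log(y_true)))   # ~0.0953
print(abs(np.log(y_pred / y_true)))           # identical: it's just the log of the ratio
print(abs(y_pred - y_true) / y_true)          # 0.10 -- the percent error it approximates
```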
I think this is one of the things that I didn't do in the original competition. As you can 01:16:42.900 |
see I tried changing it and I think it helped. By the way, XGBoost is fantastic. Here is the 01:16:57.660 |
same series of steps to run this model with XGBoost. As you can see, I just concatenate 01:17:05.020 |
my categorical and continuous for training and my validation set. Here is a set of parameters 01:17:12.780 |
which tends to work pretty well. XGBoost has a data type called DMatrix, which is basically 01:17:23.300 |
a normal matrix but it keeps track of the names of the features, so it prints out better 01:17:31.100 |
information. Then you go .train and this takes less than a second to run. It's not massively 01:17:39.020 |
worse than our previous result. This is a good way to get started. 01:17:47.460 |
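To make that concrete, here's a rough sketch of an XGBoost baseline under assumed names: `cat_train`, `contin_train`, `y_train`, `feature_names` and so on are placeholders, and the parameters are just ones that tend to work okay, not necessarily the ones used here:

```python
import numpy as np
import xgboost as xgb

# Concatenate categorical codes and scaled continuous columns into plain matrices.
X_train = np.concatenate([cat_train, contin_train], axis=1)
X_valid = np.concatenate([cat_valid, contin_valid], axis=1)

# DMatrix keeps track of feature names, so the output is easier to read.
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_names)
dvalid = xgb.DMatrix(X_valid, label=y_valid, feature_names=feature_names)

params = {'objective': 'reg:linear', 'eta': 0.1, 'max_depth': 6, 'subsample': 0.8}
bst = xgb.train(params, dtrain, num_boost_round=100, evals=[(dvalid, 'valid')])

xgb.plot_importance(bst)    # the variable importance plot discussed next
```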
The reason that XGBoost and random forests are particularly helpful is because they do 01:17:53.220 |
something called variable importance. This is how you get the variable importance for 01:17:58.060 |
an XGBoost model. It takes a second and suddenly here is the information you need. When I was 01:18:05.700 |
having trouble replicating the original results from the third place winners, one of the things 01:18:14.460 |
that helped me a lot was to look at this feature importance plot and say, "Competition distance, 01:18:21.140 |
holy cow, that's really really important. Let's make sure that my competition distance 01:18:26.420 |
results pre-processing really is exactly the same." On the other hand, events doesn't really 01:18:36.540 |
matter at all, so I'm not going to worry really at all about checking my events. This feature 01:18:43.980 |
importance or variable importance plot, as it's also known, you can also create with a 01:18:48.460 |
random forest. These are amazing. Because you're using a tree ensemble, it doesn't matter the 01:18:58.740 |
shape of anything, it doesn't matter if you have or don't have interactions, this is all 01:19:04.060 |
totally assumption free. In real life, this is the first thing I do. The first thing I 01:19:12.500 |
do is try to get a feature importance plot printed. Often it turns out that there's only 01:19:19.220 |
three or four variables in that. If you've got 10,000 variables, so I worked on a big 01:19:24.940 |
credit scoring problem a couple of years ago, I had 9,500 variables. It turned out that only 01:19:30.220 |
nine of them mattered. So the company I was working for literally had spent something 01:19:35.900 |
like $5 million on this big management consulting project, and this big management consulting 01:19:40.700 |
project had told them all these ways in which they can capture all this information in this 01:19:45.220 |
really clean way for their credit scoring models. Of course none of those things were 01:19:50.220 |
in these nine that mattered, so they could have saved $5 million, but they didn't because 01:19:57.740 |
management consulting companies don't use random forests. 01:20:01.980 |
I can't overstate the importance of this plot, but this is a deep learning course, so we're 01:20:09.480 |
not really going to spend time talking about it. Now I mentioned that I had a whole bunch 01:20:17.940 |
of really weird things in the way that the competition place-getters did things. For one, 01:20:31.580 |
they didn't normalize their continuous variables. Who does that? But then when people do well 01:20:37.700 |
in a competition, something's working. The ways in which they initialized their embeddings 01:20:50.540 |
were really, really weird. All of these things just seemed really odd to me. 01:20:55.420 |
So what I did was I wrote a little script, Rossman Experiments, and what I did was basically 01:21:12.340 |
I copied and pasted all the important code out of my notebook. Remember I've already 01:21:17.460 |
pickled the parameters for the label encoder and the scaler, so I didn't have to worry 01:21:26.260 |
about doing those again. Once I copied and pasted all that code in, so this is exactly 01:21:30.900 |
all the code you just saw, I then had this bunch of for loops. Pretty inelegant. But 01:21:45.820 |
these are all of the things that I wanted to basically find out. Does it matter whether 01:21:50.460 |
or not you use 0-1 scaling? Does it matter whether you use their weird approach to initializing 01:21:58.500 |
embeddings? Does it matter whether you use their particular dictionary of embedding dimensions? 01:22:08.580 |
Something else I tried is they basically took all their continuous variables and put them 01:22:12.980 |
through a separate little dense layer each. I was like, why don't we put them all together. 01:22:17.420 |
I also tried some other things like batch normalization. So I ran this and got back 01:22:24.540 |
every possible combination of these. This is where you want to be using the script. 01:22:30.140 |
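A hypothetical sketch of that kind of experiment script: loop over every combination of the design choices, train, and log the result to a CSV you can pivot later. `train_model` is an assumed helper that builds, fits and returns the validation loss; the option names are placeholders:

```python
import csv
import itertools

options = {
    'normalize': [True, False],
    'use_emb_dict': [True, False],
    'single_dense_for_contins': [True, False],
    'special_init': [True, False],
}

with open('rossmann_experiments.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(list(options) + ['val_loss'])
    # Every combination of every flag, one training run each.
    for combo in itertools.product(*options.values()):
        cfg = dict(zip(options, combo))
        writer.writerow(list(combo) + [train_model(**cfg)])
```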
I'm not going to tell you that I jumped straight to this. First of all, I spent days screwing 01:22:37.280 |
around with experiments in a notebook by hand, continually forgetting what I had just done, 01:22:44.160 |
until eventually I wrote this script, which took me like an hour. And then of course I pasted it 01:22:52.620 |
into Excel. And here it is. Chucked it into a pivot table, used conditional formatting, 01:23:02.140 |
and here's my results. You can see all my different combinations, with and without normalization, 01:23:08.380 |
with my special function versus their dictionary, using a single dense matrix versus putting 01:23:14.260 |
everything together, using their weird init versus not using a weird init. And here is 01:23:23.740 |
this dark blue here is what they did. It's full of weird choices to me. But as you can see, it's 01:23:35.540 |
actually the darkest blue. It actually is the best. 01:23:39.380 |
But then when you zoom out, you realize there's a whole corner over here that's got a couple 01:23:46.300 |
of .086s; it's nearly as good, but seems much more consistent. And also more consistent 01:23:54.020 |
with sanity. Like yes, do normalize your data. And yes, do use an appropriate initialization 01:24:01.020 |
function. And if you do those two things, it doesn't really matter what else you do. 01:24:06.180 |
So what I then did was I created a little sparkline in Excel for the actual training 01:24:12.780 |
graphs. And so here's their winning one, again, .085. But here's the variance of getting there. 01:24:24.980 |
And as you can see, their approach was pretty bumpy, up and down, up and down, up and down. 01:24:29.020 |
The second best on the other hand, .086 rather than .085, is going down very smoothly. And 01:24:38.580 |
so that made me think, given that it's in this very stable part of the world, and given 01:24:43.140 |
it's training much better, I actually think this is just random chance. It just happened 01:24:47.780 |
to be low at this point. I actually thought this is a better approach. It's more sensible. 01:24:59.060 |
So this kind of approach to running experiments, I thought I'd just show you to say when you 01:25:06.980 |
run experiments, try and do it in a rigorous way and track both the stability of the approach 01:25:14.380 |
as well as the actual result of the approach. So this one here makes so much sense. It's 01:25:19.740 |
like use my simple function rather than the weird dictionary, use normalization, use a 01:25:26.020 |
single dense matrix, and use a thoughtful initialization. And you do all of those things, 01:25:30.140 |
you end up with something that's basically as good and much more stable. 01:25:36.100 |
That's all I wanted to say about Rossman. I'm going to very briefly mention another competition, 01:25:47.500 |
which is the Kaggle Taxi Destination competition. 01:26:00.580 |
You were saying that you did a couple of experiments. One, you figured out the embeddings and then 01:26:07.460 |
put the embeddings into random forests, and then put embeddings again into neural network. 01:26:16.420 |
Yeah, so I don't understand, because you just used one neural network to do everything together? 01:26:22.700 |
Yeah, so what they did was, for this one here, this 115, they trained the neural network I 01:26:27.580 |
just showed you. They then threw away the neural network and trained a GBM model, but 01:26:36.180 |
for the categorical variables, rather than using one-hot encodings, they used the embeddings. 01:26:44.700 |
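A hedged sketch of that idea; the layer name 'store_emb', the 'Store' column and the array names are all assumptions, not the actual code:

```python
import numpy as np

# Pull the trained embedding matrix out of the Keras model and look up one row per
# example, then feed those rows to the GBM instead of one-hot encodings.
store_emb = model.get_layer('store_emb').get_weights()[0]   # shape (n_stores, emb_dim)
store_feats = store_emb[train['Store'].values]              # one embedding row per data row

X_gbm = np.concatenate([store_feats, contin_train], axis=1)
```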
So the taxi competition was won by the team with this Unicode name, which is pretty cool. 01:26:55.060 |
And it's actually turned out to be a team run by Yoshua Bengio, who's one of the people 01:27:00.140 |
that stuck it out through the AI winter and is now one of the leading lights in deep learning. 01:27:07.660 |
And interestingly, the thing I just showed you, the Rossman competition, this paper they 01:27:14.300 |
wrote in the Rossman competition claimed to have invented this idea of categorical embeddings. 01:27:20.500 |
But actually, Yoshua Bengio's team won this competition a year earlier with this same 01:27:25.980 |
technique. But again, it's so uncool, nobody noticed even though it was Yoshua Bengio. 01:27:32.900 |
So I want to quickly show you what they did. This is the paper they wrote. And their approach 01:27:42.020 |
to picking an embedding size was very simple. Use 10. So the data was which customer is 01:27:52.540 |
taking this taxi, which taxi are they in, which taxi stand did they get the taxi from, 01:28:00.380 |
and then quarter hour of the day, day of the week, week of the year. And they didn't add 01:28:07.180 |
all kinds of other stuff, this is basically it. And so then they said we're going to learn 01:28:13.980 |
embeddings inspired by NLP. So actually to my knowledge, this is the first time this 01:28:21.700 |
appears in the literature. Having said that, I'm sure a thousand people have done it before, 01:28:26.660 |
it just never made it into a paper. 01:28:31.420 |
>> As a quick sanity check, if you have day of the week, which even one-hot 01:28:39.660 |
encoded only has seven possible values, an embedding size of 10 doesn't make any sense, right? 01:28:46.900 |
>> Yeah, so I used to think that. But actually it does. Over the last few months I've quite often 01:28:55.460 |
ended up with bigger embeddings than my original cardinality. And often it does give better 01:29:02.220 |
results. And I think it's just like when you realize that it's just a dense layer on top 01:29:08.100 |
of a one-hot encoding, it's like okay, why shouldn't the dense layer have more information? 01:29:14.180 |
I found it weird too, I still find it a little weird, but it definitely seems to be something that helps. 01:29:14.180 |
>> It does, it helps. I have absolutely found plenty of times now where I need a bigger 01:29:38.460 |
embedding matrix dimensionality than the cardinality of my categorical variable. 01:29:46.620 |
Now in this competition, again it's a time series competition really, because the main 01:29:52.660 |
thing you're given other than all this metadata is a series of GPS points, which is every 01:29:58.660 |
GPS point along a route. And at some point for the test set, the route is cut off and 01:30:05.300 |
you have to figure out what the final GPS point would have been. Where are they going? 01:30:13.660 |
Here's the model that they won with. It turns out to be very simple. You take all the metadata 01:30:22.180 |
we just saw and chuck it through the embeddings. You then take the first 5 GPS points and the 01:30:30.340 |
last 5 GPS points and concatenate them together with the embeddings. Chuck them through a 01:30:36.940 |
hidden layer, then through a softmax. This is quite interesting. What they then do is 01:30:44.260 |
they take the result of this softmax and they combine it with clusters. 01:30:50.460 |
Now what are these clusters? They used mean shift clustering, and they used mean shift 01:30:56.460 |
clustering to figure out where are the places people tend to go. So with taxis, people tend 01:31:02.940 |
to go to the airport or they tend to go to the hospital or they tend to go to the shopping 01:31:07.020 |
strip. So using mean shift clustering, they found, I think, about 3,000 cluster centers, each an x, y coordinate. 01:31:20.740 |
However, people don't always go to those 3,000 places. So this is a really cool thing. By 01:31:27.700 |
using a softmax, and then they took the softmax and they multiplied it and took a weighted 01:31:34.900 |
average, using the softmax as the weights and the cluster centers as the thing being averaged. 01:31:42.280 |
So in other words, if they're going to the airport for sure, the softmax will end up 01:31:47.500 |
giving a p of very close to 1 for the airport cluster. On the other hand, if it's not really 01:31:53.300 |
that clear whether they're going to this shopping strip or this movie, then those two cluster 01:32:01.140 |
centers could both have a softmax of about 0.5, and so it's going to end up predicting somewhere in between the two. 01:32:09.100 |
So this is really interesting. They've built a different kind of architecture to anything 01:32:17.260 |
we've seen before, where the softmax is not the last thing we do. It's being used to 01:32:23.260 |
average a bunch of clusters. So this is really smart, because the softmax makes it 01:32:31.820 |
easy for it to pick a specific destination that's very common, but also makes it possible 01:32:37.900 |
for it to predict any destination anywhere by combining a weighted average of a number of cluster centers. 01:32:45.100 |
I think this is really elegant architecture engineering. 01:32:53.540 |
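A rough sketch of the idea only, not the winning code; the input size and the random cluster centres are placeholders standing in for the mean-shift centres:

```python
import numpy as np
import keras.backend as K
from keras.layers import Input, Dense, Lambda
from keras.models import Model

n_clusters = 3000
centres = K.constant(np.random.rand(n_clusters, 2))      # stand-in (x, y) cluster centres

inp = Input(shape=(500,))                                 # embeddings + first/last GPS points
h = Dense(500, activation='relu')(inp)
p = Dense(n_clusters, activation='softmax')(h)            # p_i: how much weight each centre gets
dest = Lambda(lambda w: K.dot(w, centres))(p)             # sum_i p_i * c_i, the weighted average

model = Model(inp, dest)
```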
As for the last 5 GPS points that were given: to create the training set, what they did was they took 01:33:08.220 |
all of the routes and truncated them randomly. So every time they sampled another route, think 01:33:18.540 |
of the data generator. Basically the data generator would randomly slice it off somewhere. 01:33:26.220 |
So this was the last 5 points which we have access to, and the first 5 points. The reason 01:33:33.580 |
it's not all the points is because they're using a standard multilayer perceptron here. 01:33:39.220 |
So the routes are variable length, and also you don't want the input to be too big. 01:33:45.180 |
There's a question. So the prefix is not fed into an RNN, it's just fed into a dense layer? 01:33:51.500 |
Correct. So we just get 10 points, concatenate it together into a dense layer. So surprisingly 01:33:57.940 |
simple. How good was it? Look at the results: 2.14, 2.14, 2.13, 2.13, 2.11, 2.12. Everybody 01:34:14.900 |
is clustered together. One person's a bit better at 2.08, and they're way better at 2.03. 01:34:21.180 |
And then they mentioned in the paper that they didn't actually have time to finish training, 01:34:25.380 |
so when they actually finished training, it was actually 1.87. They won so easily, it's 01:34:33.220 |
not funny. And interestingly in the paper, they actually mentioned the test set was so 01:34:37.380 |
small that they knew the only way they could be sure to win was to make sure they won easily. 01:34:45.100 |
Now because the test set was so small, the leaderboard is actually not statistically 01:34:49.980 |
that great. So they created a custom test set and tried to see if they could find something 01:34:55.820 |
that's even better still on the custom test set. And it turns out that actually an RNN 01:35:00.780 |
is better still. It still would have won the competition, but there's not enough data in 01:35:06.980 |
the Kaggle test set that this is a statistically significant result. In this case it is statistically 01:35:12.540 |
significant. A regular RNN wasn't better, but what they did instead was take an RNN where 01:35:21.420 |
we pass in 5 points at a time into the RNN basically. I think what probably would have 01:35:29.140 |
been even better would be to have had a convolutional layer first and then passed that into an RNN. 01:35:35.060 |
They didn't try it as far as I can see from the paper. Importantly, they used a bidirectional RNN, 01:35:42.660 |
which ensures that the initial points and the last points tend to have more weight, because 01:35:48.760 |
we know that an RNN's state generally reflects things it has seen more recently. So this is the model that did better. 01:35:58.220 |
So our long-suffering intern Brad has been trying to replicate this result. He had at 01:36:04.100 |
least two all-nighters in the last two weeks but hasn't quite managed to yet. So I'm not 01:36:08.220 |
going to show you the code, but hopefully once Brad starts sleeping again he'll be able 01:36:13.140 |
to finish it off and we can show you the notebook during the week on the forum that actually 01:36:21.780 |
It was an interesting process to watch Brad try to replicate this because the vast majority 01:36:30.300 |
of the time in my experience when people say they've tried a model and the model didn't 01:36:34.940 |
work out and they've given up on the model, it turns out that it's actually because they 01:36:38.220 |
screwed something up, not because of the problem with the model. And if you weren't comparing 01:36:44.500 |
to Yoshua Bengio's team's result, knowing that you haven't replicated it yet, at which 01:36:50.740 |
point do you give up and say, "Oh my model's not working" versus saying, "No, I've still 01:36:55.820 |
got bugs!" It's very difficult to debug machine learning models. 01:37:03.540 |
What Brad's actually had to end up doing is literally take the original Bengio team code, 01:37:09.300 |
run it line by line, and then try to replicate it in Keras line by line in literally np.allclose 01:37:16.100 |
every time. Because to build a model like this, it doesn't look that complex, but there's 01:37:22.260 |
just so many places that you can make little mistakes. No normal person will make like 01:37:29.260 |
zero mistakes. In fact, normal people like me will make dozens of mistakes. 01:37:34.340 |
So when you build a model like this, you need to find a way to test every single line of 01:37:39.420 |
code. Any line of code you don't test, I guarantee you'll end up with a bug and you won't know 01:37:43.720 |
you have a bug and there's no way to ever find out you had a bug. 01:37:48.500 |
So we have several questions. One is a note that the p_i * c_i weighting is very similar to what happens 01:37:58.260 |
in the memory network paper. In that case, the output embeddings are weighted by the 01:38:02.300 |
attention probability. It's a lot like a regular attentional language model. 01:38:10.940 |
Can you talk more about the idea you have about first having the convolutional layer 01:38:14.780 |
and passing that to an RNN? What do you mean by that? 01:38:19.300 |
So here is a fantastic paper. We looked at these kind of subword encodings last week 01:38:48.900 |
for language models. I don't know if any of you thought about this and wondered what if 01:38:53.220 |
we just had individual characters. There's a really fascinating paper called Fully Character- 01:39:00.180 |
Level Neural Machine Translation without Explicit Segmentation, from November of last year. They 01:39:08.100 |
actually get fantastic results on just character level, beating pretty much everything, including 01:39:17.460 |
the BPE approach we saw last time. So they looked at lots of different approaches, 01:39:28.620 |
comparing BPE to individual characters, and most of the time the character-level model got the best results. 01:39:36.180 |
Their model looks like this. They start out with every individual character. It goes through 01:39:42.420 |
a character embedding, just like we've used character embeddings lots of times. Then you 01:39:46.620 |
take those character embeddings and you pass it through a one-dimensional convolution. 01:39:53.700 |
I don't know if you guys remember, but in Part 1 of this course, Ben actually had a 01:39:59.020 |
blog post showing how you can do multiple-size convolutions and concatenate them all together. 01:40:05.380 |
So you could use that approach. Or you could just pick a single size. So you end up basically 01:40:11.460 |
scrolling your convolutional window across your sets of characters. So you end up with 01:40:18.460 |
the same number of convolution outputs as you started out with letters, but they're 01:40:23.540 |
now representing the information in a window around that letter. In this case, they then 01:40:31.800 |
did max pooling. So they basically asked, for each window size, say a size 3, a size 4, 01:40:41.340 |
and a size 5, which bits seem to have the highest activations 01:40:49.580 |
around here. Then they took those max pooled things and they put them through a second 01:40:54.580 |
set of segment embeddings. They then put that through something called a highway network 01:40:59.260 |
which the details don't matter too much. It's kind of something like a DenseNet, like we 01:41:03.540 |
learned about last week. This is a slightly older approach than the DenseNet. Then finally 01:41:09.180 |
after doing all that, stick that through an RNN. So the idea here in this model was they 01:41:15.260 |
basically did as much learnt pre-processing as possible, and then finally put that into 01:41:25.220 |
an RNN. Because we've got these max pooling layers, this RNN ends up with a lot less time 01:41:30.540 |
points, which is really important to minimize the amount of processing in the RNN. So I'm 01:41:39.660 |
not going to go into detail on this, but check out this paper because it's really interesting. 01:41:46.720 |
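A very rough sketch of that overall shape only, not the paper's actual model; all the sizes are made up:

```python
from keras.layers import Input, Embedding, Conv1D, MaxPooling1D, GRU, Dense
from keras.models import Model

vocab_size, seq_len = 200, 512

chars = Input(shape=(seq_len,))
x = Embedding(vocab_size, 64)(chars)                        # character embeddings
x = Conv1D(128, 5, padding='same', activation='relu')(x)    # a window around each character
x = MaxPooling1D(pool_size=4)(x)                            # only a quarter of the steps reach the RNN
x = GRU(256)(x)
out = Dense(vocab_size, activation='softmax')(x)

model = Model(chars, out)
```

The point of the sketch is just the ordering: as much learnt pre-processing as possible before the sequence ever reaches the RNN, so the RNN sees far fewer time steps.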
Next question is, for the destinations we would have more error for the peripheral points? 01:41:52.260 |
Are we taking a centroid of clusters? I don't understand that, sorry. All we're doing is 01:42:02.840 |
we're taking the softmax p, multiplying by the cluster centers c, and adding them up. 01:42:10.300 |
I thought the first part was asking that with destinations that are more peripheral, they 01:42:16.060 |
would have higher error because they would be harder to predict this way. 01:42:19.140 |
That's probably true, which is fine, because by definition they're not close to a cluster center, so they're inherently harder to predict. 01:42:26.340 |
Then going back, there was a question on the Rossman example. What does MAPE with neural 01:42:33.480 |
network mean? I would have expected that result to be the same, why is it lower? 01:42:37.980 |
This is just using a one-hot encoding without an embedding layer. We kind of run out of 01:42:50.580 |
time a bit quickly, but I really want to show you this. The students and I have been trying 01:42:58.500 |
to get a new approach to segmentation working, and I finally got it working in the last day 01:43:04.260 |
or two, and I really wanted to show it to you. We talked last week about DenseNet, and 01:43:09.540 |
I mentioned that DenseNet is arse-kickingly good at doing image classification with a 01:43:16.820 |
small number of data points, like crazily good. But I also mentioned that it's the basis 01:43:24.120 |
of this thing called the 100-layer tiramisu, which is an approach to segmentation. 01:43:29.100 |
So segmentation refers to taking a picture, an image, and figuring out where's the tree, 01:43:38.260 |
where's the dog, where's the bicycle and so forth. It seems like we're having a bit of 01:43:45.260 |
a Bengio fan day today, because this is one of his group's papers as well. Let me set the scene. 01:44:00.760 |
So Brendan, one of our students, who many of you have seen a lot of his blog posts, 01:44:06.820 |
he has successfully got a PyTorch version of this working, so I've shared that on our files.fast.ai. 01:44:14.120 |
And I got the Keras version of it working. So I'll show you the Keras version because 01:44:17.700 |
I actually understand it. And if anybody's interested in asking questions about the PyTorch 01:44:22.060 |
version, hopefully Brendan will be happy to answer them during the week. 01:44:27.900 |
So the data looks like this. There's an image, and then there are the labels. So that's basically 01:44:50.620 |
what it looks like. So you can see here, you've got traffic lights, you've got poles, you've 01:44:54.220 |
got trees, buildings, paths, roads. Interestingly, the dataset we're using is something called 01:45:02.940 |
CamVid. The dataset is actually frames from a video. So a lot of the frames look very 01:45:08.820 |
similar to each other. And there's only like 600 frames in total, so there's not a lot of varied 01:45:16.980 |
data in this CamVid dataset. Furthermore, we're not going to do any pre-training. So 01:45:24.340 |
we're going to try and build a state-of-the-art segmentation system on video, which is already 01:45:29.900 |
much lower information content because most of the frames are pretty similar, using just 01:45:34.380 |
600 frames without pre-training. Now if you were to ask me a month ago, I would have told 01:45:38.260 |
you it's not possible. This just seems like an incredibly difficult thing to do. But it turns out it works. 01:45:47.780 |
So I'm going to skip to the answer first. Here's an example of a particular frame we're 01:45:58.220 |
trying to match. Here is the ground truth for that frame. You can see there's a tiny 01:46:07.300 |
car here and a little car here. There are those little cars. There's a tree. Trees are 01:46:12.780 |
really difficult. They're incredibly fine, funny things. And here is my trained model. 01:46:22.060 |
And as you can see, it's done really, really well. It's interesting to look at the mistakes 01:46:28.620 |
it made. This little thing here is a person. But you can see that the person, their head 01:46:37.940 |
looks a lot like a traffic light and their jacket looks a lot like a mailbox. Whereas these tiny 01:46:43.420 |
little people here, it's done perfectly; with this one person it just got a little bit confused. 01:46:55.820 |
Another example of where it's gone wrong is this bit, which should be road; it wasn't quite 01:46:59.700 |
sure what was road and what was footpath, which makes sense because the colors do look 01:47:04.020 |
very similar. But had we pre-trained something, a pre-trained network would have understood 01:47:11.500 |
that crossroads tend to go straight across, they don't tend to look like that. So you 01:47:18.060 |
can kind of see the minor mistakes it made. It also would have learned, had it looked 01:47:24.540 |
at more than a couple of hundred examples of people, that people generally have a particular 01:47:30.420 |
shape. So there's just not enough data for it to have learned some of these things. But 01:47:34.420 |
nonetheless, it is extraordinarily effective. Look at this traffic light, it's surrounded 01:47:41.340 |
by a sign, so the ground truth actually has the traffic light and then a tiny little edge 01:47:46.860 |
of sign, and it's even got that right. So it's an incredibly accurate model. 01:47:56.220 |
So how does it work? And in particular, how does it do these amazing trees? So the answer 01:48:02.340 |
is in this picture. Basically, this is inspired by a model called UNET. Until the UNET model 01:48:15.220 |
came along, everybody was doing these kinds of segmentation models using an approach just 01:48:22.420 |
like what we did for style transfer, which is basically you have a number of convolutional 01:48:30.700 |
layers with max pooling, or with a stride of 2, which gradually make the image smaller 01:48:36.980 |
and smaller, a bigger receptive field. And then you go back up the other side using upsampling 01:48:42.660 |
or deconvolutions until you get back to the original size, and then your final layer is 01:48:49.060 |
the same size as your starting layer and has a bunch of different classes that you're trying 01:48:58.940 |
The problem with that is that you end up with, in fact I'll show you an example. There's 01:49:05.420 |
a really nice paper called ENet. ENet is not only an incredibly accurate model for segmentation, 01:49:14.300 |
but it's also incredibly fast. It actually can run in real time. You can actually run it live. 01:49:20.740 |
But the mistakes it makes, look at this chair. This chair has a big gap here and here and 01:49:26.220 |
here, but ENet gets it totally wrong. And the reason why is because they use a very 01:49:32.460 |
traditional downsampling-upsampling approach. And by the time they get to the bottom, they've 01:49:38.860 |
just lost track of the fine detail. So the trick are these connections here. What we 01:49:47.900 |
do is we start with our input, we do a standard initial convolution, just like we did with 01:49:52.900 |
style transfer. We then have a DenseNet block, which we learned about last week. And then 01:49:59.660 |
that block, we keep going down, we do a MaxPooling type thing, another DenseNet block, MaxPooling 01:50:05.260 |
type thing, keep going down. And then as we go up the other side, so we do a DenseBlock, 01:50:16.260 |
we take the output from the DenseBlock on the way down and we actually copy it over 01:50:23.540 |
to here and concatenate the two together. So actually Brendan a few days ago actually 01:50:32.380 |
drew this on our whiteboard when we were explaining it to Melissa, and so he's shown the sizes at every step. 01:50:37.520 |
We start out with a 224x224 input, it goes through the convolutions with 48 filters, goes 01:50:48.660 |
through our DenseBlock, adds another 80 filters. It then goes through our, they call it a transition 01:50:54.940 |
down, so basically a MaxPooling. So it's now size 112. We keep doing that. DenseBlock, transition 01:51:01.700 |
down, so it's now 56x56, 28x28, 14x14, 7x7. And then on the way up again, we go transition 01:51:10.540 |
up, it's now 14x14. We copy across the results of the 14x14 from the transition down and 01:51:18.380 |
concatenate together. Then we do a DenseBlock, transition up, it's now 28x28, so we copy 01:51:24.580 |
across our 28x28 from the transition down and so forth. 01:51:29.180 |
So by the time we get all the way back up here, we're actually copying across something 01:51:35.740 |
that was originally of size 224x224. It hadn't had much done to it, it had only been through 01:51:44.580 |
one convolutional layer and one DenseBlock, so it hadn't really got much rich computation 01:51:49.380 |
being done. But the thing is, by the time it gets back up all the way up here, the model 01:51:56.300 |
knows pretty much this is a tree and this is a person and this is a house, and it just 01:52:01.900 |
needs to get the fine little details. Where exactly does this leaf finish? Where exactly 01:52:06.500 |
does the person's hat finish? So it's basically copying across something which is very high 01:52:13.260 |
resolution but doesn't have that much rich information, but that's fine because it's really only needed for the fine details. 01:52:21.260 |
So these things here, they're called skip connections. They were really inspired by this paper called 01:52:26.100 |
UNet, which has won many Kaggle competitions. But this is using dense blocks rather than normal convolutional blocks. 01:52:38.500 |
So let me show you. We're not going to have time to go into this in detail, but I've done 01:52:45.940 |
all this coding Keras from scratch. This is actually a fantastic fit for Keras. I didn't 01:52:51.260 |
have to create any custom layers, I didn't really have to do anything weird at all, except for the data augmentation. 01:53:00.980 |
So the data augmentation was we start with 480x360 images, we randomly crop some 224x224 01:53:10.820 |
part, and also randomly we may flip it horizontally. That's all perfectly fine. Keras doesn't really 01:53:19.380 |
have the random crops, unfortunately. But more importantly, whatever we do to the input 01:53:23.940 |
image, we also have to do to the target image. We need to get the same 224x224 crop, and 01:53:32.660 |
So I had to write a data generator, which you guys may actually find useful anyway. 01:53:44.700 |
So this is my data generator. Basically I called it a segment generator. It's just a standard 01:53:52.780 |
generator so it's got a next function. Each time you call next, it grabs some random bunch 01:53:58.220 |
of indexes, it goes through each one of those indexes and grabs the necessary item, grabbing 01:54:06.220 |
a random slice, sometimes randomly flipping it horizontally, and then it's doing this 01:54:11.900 |
to both the x's and the y's, returning them back. 01:54:19.220 |
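A minimal sketch of the key idea: whatever random crop and flip we pick for the image, apply exactly the same one to the label mask. The function name and shapes are assumptions:

```python
import numpy as np

def random_crop_flip(x, y, size=224):
    # Pick one random crop and use it for both the image and its labels.
    r = np.random.randint(0, x.shape[0] - size)
    c = np.random.randint(0, x.shape[1] - size)
    x, y = x[r:r+size, c:c+size], y[r:r+size, c:c+size]
    if np.random.rand() > 0.5:                 # random horizontal flip, applied to both
        x, y = x[:, ::-1], y[:, ::-1]
    return x, y
```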
Along with this segment generator, in order to randomly grab a batch of random indices 01:54:26.220 |
each time, I created this little class called batch indices, which can basically do that. 01:54:33.700 |
And it can have either shuffle true or shuffle false. 01:54:37.660 |
So this pair of classes you guys might find really helpful for creating your own data 01:54:42.980 |
generators. This batch indices class in particular, now that I've written it, you can see how 01:54:48.900 |
it works, right? If I say batch indices from a data set of size 10, I want to grab three 01:54:55.740 |
indices at a time. So then let's grab five batches. Now in this case I've got by default 01:55:02.940 |
shuffle = false, so it just returns 0, 1, 2, then 3, 4, 5, then 6, 7, 8, then 9, I'm finished, go back 01:55:09.100 |
to the start, 0, 1, 2. On the other hand, if I say shuffle = true, it returns them in random 01:55:15.380 |
order but it still makes sure it captures all of them. And then when we're done it starts 01:55:19.940 |
a new random order. So this makes it really easy to create random generators. 01:55:26.580 |
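A sketch of a helper along the lines of the one described: hand back `bs` indices at a time, and reshuffle (or restart) once the dataset has been used up. The exact implementation details are assumptions:

```python
import numpy as np

class BatchIndices:
    def __init__(self, n, bs, shuffle=False):
        self.n, self.bs, self.shuffle = n, bs, shuffle
        self.reset()

    def reset(self):
        self.idxs = np.random.permutation(self.n) if self.shuffle else np.arange(self.n)
        self.curr = 0

    def __next__(self):
        if self.curr >= self.n:
            self.reset()                        # start a new (possibly reshuffled) pass
        res = self.idxs[self.curr:self.curr + self.bs]
        self.curr += self.bs
        return res
```

For example, `next(BatchIndices(10, 3))` gives `array([0, 1, 2])`, the next call gives `array([3, 4, 5])`, and so on.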
So that was the only thing I had to add to Keras to get this all to work. Other than that, 01:55:35.360 |
we wrote the tiramisu. And the tiramisu looks very very similar to the DenseNet that we 01:55:40.620 |
saw last week. We've got all our pieces, the relu, the dropout, the batchnorm, the relu 01:55:50.260 |
on top of batchnorm, the concat layer, so this is something I had to add, my convolution2d 01:55:58.100 |
followed by dropout, and then finally my batchnorm followed by relu followed by convolution2d. 01:56:04.940 |
So this is just the dense block that we saw last week. So a dense block is something where 01:56:10.420 |
we just keep grabbing 12 or 16 filters at a time, concatenating them to the last set 01:56:16.820 |
and doing that a few times. That's what a dense block is. 01:56:23.100 |
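A hedged sketch of a dense block in Keras; the growth rate, dropout and layer count are illustrative, not the notebook's exact values:

```python
from keras.layers import BatchNormalization, Activation, Conv2D, Dropout, concatenate

def bn_relu_conv(x, nf, ks=3):
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = Conv2D(nf, (ks, ks), padding='same')(x)
    return Dropout(0.2)(x)

def dense_block(x, nb_layers=4, growth=16):
    for _ in range(nb_layers):
        new = bn_relu_conv(x, growth)
        x = concatenate([x, new])      # keep stacking `growth` more filters onto everything so far
    return x
```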
So here's something interesting. The original paper for its down sampling, they call it 01:56:31.420 |
transition down, did a 1x1 convolution followed by a max pooling. I actually discovered that 01:56:43.340 |
doing a stride2 convolution gives better results. So you'll see I actually have not followed 01:56:50.140 |
the paper. The one that's commented out here is what the paper did, but this version works better. 01:57:01.060 |
Interestingly though, on the transition up side, do you remember that checkerboard artifacts 01:57:08.140 |
blog post we saw that showed that upsampling2d followed by a convolutional layer works better? 01:57:15.340 |
It does not work better for this. In fact, a deconvolution works better for this. So 01:57:21.780 |
that's why you can see I've got this deconvolution layer. So I thought that was interesting. 01:57:29.660 |
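A sketch of those two resolution changes; kernel sizes and the dropout rate are assumptions: a stride-2 convolution on the way down instead of conv plus max pooling, and a deconvolution (Conv2DTranspose) on the way up rather than upsampling plus conv:

```python
from keras.layers import BatchNormalization, Activation, Conv2D, Conv2DTranspose, Dropout

def transition_down(x, nf):
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = Conv2D(nf, (1, 1), strides=(2, 2), padding='same')(x)   # stride-2 conv does the downsampling
    return Dropout(0.2)(x)

def transition_up(x, nf):
    return Conv2DTranspose(nf, (3, 3), strides=(2, 2), padding='same')(x)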
So basically you can see when I go downsampling a bunch of times, it's basically do a dense 01:57:35.940 |
block and then I have to keep track of my skip connections. So basically keep a list 01:57:41.900 |
of all of those skip connections. So I've got to hang on to all of these. So every one 01:57:47.300 |
of these skip connections I just stick in this little array, appending to it after 01:57:52.300 |
every dense block. So then I keep them all and then I pass them to my upward path. So 01:58:00.620 |
I basically do my transition up and then I concatenate that with that skip connection. 01:58:09.020 |
So that's the basic approach. So then the actual Tiramisu model itself with those pieces 01:58:16.980 |
is less than a screen of code. It's basically just do a 3x3 conv, do my down path, do my 01:58:24.100 |
up path using those skip connections, then a 1x1 conv at the end, and a softmax. 01:58:39.580 |
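A hedged sketch of the down and up paths with the skip connections, reusing the helpers sketched above; block counts and filter numbers are placeholders, not the notebook's exact values:

```python
from keras.layers import concatenate

def down_path(x, n_blocks=5):
    skips = []
    for i in range(n_blocks):
        x = dense_block(x)
        skips.append(x)                          # remember this resolution for the way back up
        x = transition_down(x, nf=64 * (i + 1))
    return x, skips

def up_path(x, skips):
    for skip in reversed(skips):
        x = transition_up(x, nf=128)
        x = concatenate([x, skip])               # the skip connection: concatenate, don't add
        x = dense_block(x)
    return x
```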
So these dense nets, and indeed this fully convolutional dense net or this Tiramisu model, 01:58:47.460 |
they actually take quite a long time to train. They don't have very many parameters, which 01:58:51.460 |
is why I think they work so well with these tiny datasets. But they do still take a long 01:58:55.860 |
time to train. Each epoch took a couple of minutes, and in the end I had to do many hundreds 01:59:02.740 |
of epochs. And I was also doing a bunch of learning rate annealing. So in the end this 01:59:09.580 |
kind of really had to train overnight, even though I had only about 500-600 frames. But 01:59:28.620 |
in the end I got a really good result. I was a bit nervous at first. I was getting this 01:59:40.620 |
87.6% accuracy. In the paper they were getting 90% plus. It turns out that 3% of the pixels 01:59:48.060 |
are marked as void. I don't know why they're marked as void, but in the paper they actually 01:59:52.780 |
remove them. So you'll see when you get to my results section, I've got this bit where 01:59:56.980 |
I remove those void ones. And I ended up with 89.5%. None of us in class managed to replicate 02:00:08.940 |
the paper. The paper got 91.5% or 91.2%. We tried the lasagna code they provided. We tried 02:00:18.740 |
Brendan's PyTorch. We tried my Keras. Even though we couldn't replicate their result, 02:00:25.700 |
this is still better than any other result I've found. So this is still super accurate. 02:00:33.020 |
A couple of quick notes about this. First is they tried training also on something called 02:00:40.980 |
the GATech dataset, which is another video dataset. The degree to which this is an amazing 02:00:47.160 |
model is really clear here. This 76% is from a model which is specifically built for video, 02:00:55.580 |
so it actually includes the time component, which is absolutely critical, and it uses 02:01:00.100 |
a pre-trained network, so it's used like a million images to pre-train, and it's still 02:01:05.780 |
not as good as this model. So that is an extraordinary comparison. 02:01:12.620 |
This is the CamVid comparison. Here's the model we were just looking at. Again, I actually 02:01:18.780 |
looked into this. I thought 91.5%, whereas this one here 88%, wow, it actually looks 02:01:26.860 |
like it's not that much better. I'm really surprised. Like even tree, I really thought 02:01:31.940 |
it should win easily on tree, but it doesn't win by very much. So I actually went back 02:01:37.000 |
and looked at this paper, and it turns out that the authors of the paper (this 02:01:42.180 |
is the paper, by the way, the model that they're comparing to) 02:01:49.060 |
actually trained on crops of 852x852, so they actually used a way higher resolution image 02:01:57.500 |
to start with. You've got to be really careful when you read these comparisons. Sometimes 02:02:04.500 |
people actually shoot themselves in the foot, so these guys were comparing their result 02:02:08.660 |
to another model that was using like twice as big a picture. So again, this is actually 02:02:16.100 |
way better than they actually made it look like. 02:02:19.580 |
Another one, like this one here, this 88 also looks impressive. But then I looked across 02:02:24.300 |
here and I noticed that the Dilation8 model is like way better than this model on every 02:02:32.940 |
single category, way better. And yet somehow the average is only 0.3 better, and I realized 02:02:39.060 |
this actually has to be an error. So this model is actually a lot better than this table suggests. 02:02:54.260 |
So I briefly mentioned that there's a model which doesn't have any skip connections called 02:03:00.860 |
ENet, which is actually better than the tiramisu on everything except for tree. But on the 02:03:08.460 |
tree it's terrible. It's 77.8 versus, oh hang on, 77.3. That's not right. I take that back. 02:03:24.020 |
I'm sure it was less good than this model, but now I can't find that data. Anyway, the 02:03:32.540 |
reason I wanted to mention this is that Eugenio is about to release a new model which combines 02:03:43.040 |
these approaches with skip connections. It's called LinkNet. So keep an eye on the forum for that. 02:04:00.940 |
We've got a few more questions; let's answer them on the forum. I actually wanted to talk about this briefly. A lot of 02:04:10.380 |
you have come up to me and been like, "We're finishing! What do we do now?" The answer is 02:04:20.300 |
we have now created a community of all these people who have spent well over 100 hours 02:04:26.020 |
working on deep learning for many, many months and have built their own boxes and written 02:04:32.380 |
blog posts and done all kinds of stuff, set up social impact talks, written articles in 02:04:44.460 |
Forbes. Okay, this community is happening. It doesn't make any sense in my opinion for 02:04:51.540 |
Rachel and I to now be saying here's what happens next. So just like Elena has decided, 02:04:59.220 |
"Okay, I want a book club." So she talked to Mindy and we now have a book club and a 02:05:03.660 |
couple of months time, you can all come to the book club. 02:05:08.260 |
So what's next? The forums will continue forever. We all know each other. Let's do good shit. 02:05:20.980 |
Most importantly, write code. Please write code. Build apps. Take your work projects and 02:05:33.300 |
try doing them with deep learning. Build libraries to make things easier. Maybe go back to stuff 02:05:39.780 |
from part 1 of the course and look back and think, "Why didn't we do it this other way? 02:05:44.340 |
I can make this simpler." Write papers. I showed you that amazing result of the new style transfer 02:05:53.060 |
approach from Vincent last week. Hopefully that might turn into a paper. Write blog 02:05:58.700 |
posts. In a few weeks' time, all the MOOC guys are going to be coming through and doing part 02:06:06.740 |
2 of the course. So help them out on the forum. Teaching is the best way to learn yourself. 02:06:13.820 |
I really want to hear the success stories. People don't believe that what you've done 02:06:22.260 |
is possible. I know that because as recently as yesterday, the highest ranked 02:06:31.140 |
Hacker News comment on a story about deep learning was about how it's pointless trying 02:06:35.860 |
to do deep learning unless you have years of mathematical background and you know C++ 02:06:41.020 |
and you're an expert in machine learning techniques across the board and otherwise there's no 02:06:45.620 |
way that you're going to be able to do anything useful on a real-world project. That, today, 02:06:50.620 |
is what everybody believes. We now know that's not true. Rachel and I are going to start 02:07:00.420 |
up a podcast where we're going to try to help deep learning learners. But one of the 02:07:08.820 |
key things you want to do is tell your stories. So if you've done something interesting at 02:07:14.660 |
work or you've got an interesting new result or you're just in the middle of a project 02:07:19.460 |
and it's kind of fun, please tell us. Either on the forum or private message or whatever. 02:07:26.100 |
Please tell us because we really want to share your story. And if it's not a story yet, tell 02:07:33.460 |
us enough that we can help you and that the community can help you. Get together, at the 02:07:43.620 |
book club; or if you're watching this on the MOOC, organize other people in your geography 02:07:49.260 |
to get together and meet up, or at your workplace. In this group here I know we've got people 02:07:55.900 |
from Apple and Uber and Airbnb who started doing this in kind of lunchtime MOOC chats 02:08:03.660 |
and now they're here at this course. Yes, Rachel? 02:08:06.740 |
I also wanted to recommend it would be great to start meetups to help lead other people 02:08:12.220 |
through, say, part one of the course, kind of assist them going through it. 02:08:17.940 |
So Rachel and I really just want to spend the next 6 to 12 months focused on supporting 02:08:30.140 |
your projects. So I'm very interested in working on this lung cancer stuff, but I'm also interested 02:08:41.220 |
in every project that you guys are working on. I want to help with that. I also want 02:08:46.220 |
to help people who want to teach this. So Yannette is going to go from being a student 02:08:52.900 |
to a teacher hopefully soon. We'll be teaching USF students about deep learning and hopefully 02:08:57.880 |
the next batch of people about deep learning. Anybody who's interested in teaching, let 02:09:02.340 |
us know. The best high-leverage activity is to teach the teachers. 02:09:10.820 |
So yeah, I don't know where this is going to end up, but my hope is really that basically 02:09:17.300 |
I would say the experiment has worked. You guys are all here. You're reading papers, 02:09:23.300 |
you're writing code, you're understanding the most cutting edge research level deep 02:09:29.040 |
learning that exists today. We've gone beyond some of the cutting edge research in many 02:09:34.780 |
situations. Some of you have gone beyond the cutting edge research. So yeah, let's build 02:09:42.180 |
from here as a community and anything that Rachel and I can do to help, please tell us, 02:09:48.700 |
because we just want you to be successful and the community to be successful. 02:09:55.260 |
So will you still be active in the forums? Very active. My job is to make you guys successful. 02:10:05.660 |
So thank you all so much for coming, and congratulations to all of you.