Lesson 14: Cutting Edge Deep Learning for Coders
Chapters
0:00 Introduction
0:45 Time Series Data
9:50 Neural Network
12:48 Continuous Variables
15:45 Projection Method
17:59 Importing
19:18 Basic Tables
20:01 pandas
23:49 pandas summary
25:19 data cleaning
25:57 joining tables
27:36 categorical variables
30:28 pandas indexing
37:41 join
43:07 time series feature manipulation
48:52 time until event
49:33 index
50:57 rolling
52:22 final result
00:00:00.000 |
Okay, so welcome to lesson 14, the final lesson for now. 00:00:17.920 |
As you can see from what's increasingly been happening, what comes next is very much about you 00:00:23.200 |
with us, rather than us leading you or telling you. 00:00:27.440 |
So we're a community now and we can figure this stuff out together, and obviously USF 00:00:39.440 |
So for now, this is the last of these lessons. 00:00:46.840 |
One of the things that was great to see this week was this terrific article in Forbes that 00:00:52.740 |
talked about deep learning education and it was written by one of our terrific students, 00:00:59.400 |
It focuses on the great work of some of the students that have come through this course. 00:01:06.800 |
So I wanted to say thank you very much and congratulations on this great article. 00:01:14.960 |
Very beautifully written as well and terrific stories, I found it quite inspiring. 00:01:25.860 |
So today we are going to be talking about a couple of things, but we're going to start 00:01:41.120 |
with time series. I wanted to start very briefly by talking about something which I think you 00:01:47.920 |
will find interesting. This is a fantastic paper, but because it is not by DeepMind, nobody's heard of it. 00:01:54.280 |
It actually comes from the Children's Hospital of Los Angeles. 00:01:57.880 |
Believe it or not, perhaps the epicenter of practical applied AI in medicine is 00:02:05.520 |
in Southern California, and specifically Southern California Pediatrics, the Children's Hospital 00:02:09.760 |
of Orange County, CHOC, and the Children's Hospital of Los Angeles, CHLA. 00:02:15.240 |
CHLA, which this paper comes from, actually has this thing they call V-PICU, the Virtual 00:02:22.080 |
Pediatric Intensive Care Unit, where for many, many years they've been tracking every electronic 00:02:28.320 |
signal about how every patient, every kid in the hospital was treated and what all their 00:02:40.440 |
One of the extraordinary things they do is when the doctors there do rounds, data scientists join them. 00:02:46.360 |
I don't know anywhere else in the world that this happens. 00:02:52.080 |
And so a couple of months ago they released a draft of this amazing paper where they talked 00:02:59.840 |
about how they pulled out all this data from the EMR and from the sensors and attempted to predict mortality. 00:03:08.880 |
The reason this is interesting is that when a kid goes into the ICU, if a model starts 00:03:15.720 |
saying this kid is looking like they might die, then that's the thing that sets the alarms 00:03:21.160 |
going and everybody rushes over and starts looking after them. 00:03:25.040 |
And they found that they built a model that was more accurate than any existing model. 00:03:28.840 |
Those existing models were built on many years of deep clinical input, and this team beat them using an RNN. 00:03:35.640 |
Now this kind of time series data is what I'm going to refer to as signal-type time series data. 00:03:45.040 |
So let's say you've got a series of blood pressure readings. 00:03:51.960 |
So they come in and their blood pressure is kind of low and it's kind of all over the place. 00:04:02.600 |
And then in addition to that, maybe there's other readings such as at which points they 00:04:12.520 |
receive some kind of medical intervention; there was one here and one here and then there were a few more over here. 00:04:21.960 |
So these kinds of things, generally speaking, the state of health at time t is probably 00:04:28.680 |
best predicted by all of the various sensor readings at t-minus-1 and t-minus-2 and t-minus-3. 00:04:35.100 |
So in statistical terms we would refer to that as autocorrelation. 00:04:41.760 |
Autocorrelation means correlation with previous time periods. 00:04:45.720 |
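As a tiny illustration of what that means in code (the series here is made up), pandas can compute lag-1 autocorrelation directly:

```python
import pandas as pd

# hypothetical blood-pressure readings; autocorr(lag=1) measures how strongly
# each reading correlates with the one immediately before it
bp = pd.Series([82, 80, 79, 85, 90, 88, 87, 91])
print(bp.autocorr(lag=1))
```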
For this kind of signal, I think it's very likely that an RNN is the way to go. 00:04:54.960 |
Obviously you could probably get a better result using a bidirectional RNN, but that's 00:04:58.920 |
not going to be any help in the ICU because you don't have the future time period sensors. 00:05:07.040 |
And indeed this is what this team at the V-PICU at Children's Hospital of Los Angeles did, 00:05:12.720 |
they used an RNN to get this state-of-the-art result. 00:05:17.400 |
I'm not really going to teach you more about this because basically you already know how to do it. 00:05:22.640 |
You can check out the paper, you'll see there's almost nothing special. 00:05:25.800 |
The only thing which was quite clever was that their sensor readings were not necessarily equally spaced in time. 00:05:36.680 |
For example, did they receive some particular medical intervention? 00:05:42.880 |
Clearly they're very widely spaced and they're not equally spaced. 00:05:46.880 |
So rather than having the RNN have basically a sequence of interventions that gets fed 00:05:57.440 |
to the RNN, instead they actually have two things. 00:06:00.740 |
One is the signal, and the other is the time since the last signal was read. 00:06:09.200 |
So each point fed to the RNN is basically some function f. 00:06:13.400 |
It's receiving two things: it's receiving the signal at time t and the value of t itself. 00:06:27.880 |
That doesn't require any different deep learning, that's just concatenating one extra thing onto the input. 00:06:34.840 |
They actually show mathematically that this makes a certain amount of sense as a way to 00:06:40.400 |
deal with this, and then they find empirically that it does actually seem to work pretty well. 00:06:48.240 |
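A minimal Keras-style sketch of the idea, not the paper's actual code (layer sizes, names and the GRU choice are my assumptions): at each timestep the network receives the sensor value concatenated with the time since the previous reading.

```python
from keras.layers import Input, GRU, Dense, concatenate
from keras.models import Model

timesteps = 48                                          # hypothetical readings per patient
signal = Input(shape=(timesteps, 1), name='signal')     # sensor value at each step
delta_t = Input(shape=(timesteps, 1), name='delta_t')   # time since the previous reading

x = concatenate([signal, delta_t])                      # each step sees (value, elapsed time)
x = GRU(64)(x)
out = Dense(1, activation='sigmoid')(x)                 # e.g. risk of an adverse outcome

model = Model([signal, delta_t], out)
model.compile(optimizer='adam', loss='binary_crossentropy')
```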
I can't tell you whether this is state-of-the-art for anything because I just haven't seen deep 00:06:55.000 |
comparative papers or competitions or anything that really have this kind of data, which 00:07:01.320 |
is weird because a lot of the world runs on this kind of data. 00:07:04.000 |
And this kind of data is actually super valuable: if you're an oil and 00:07:09.100 |
gas company, what's the drill head telling you, what are the signals coming out of the 00:07:18.400 |
It's not the kind of cool thing that the Google kids work on, so who knows. 00:07:28.020 |
That's how you can do time series with this kind of signal data. 00:07:33.040 |
You can also incorporate all of the other stuff we're about to talk about, which is 00:07:41.120 |
For example, there was a Kaggle competition which was looking at forecasting sales for 00:08:00.440 |
each store at this big company in Europe called Rossmann based on the date and what promotions 00:08:14.600 |
are going on and what the competitors are doing and so forth. 00:08:24.640 |
Or maybe it will have some kind of trend to it. 00:08:40.720 |
So these kinds of seasonal time series are very widely analyzed by econometricians. 00:08:52.480 |
They're everywhere, particularly in business, if you're trying to predict how many widgets 00:08:59.600 |
you have to buy next month, or whether to increase or decrease your prices, or all kinds 00:09:08.160 |
of operational type things tend to look like this. 00:09:12.520 |
How full your planes are going to be, whether you should add promotions, so on and so forth. 00:09:20.160 |
So it turns out that the state of the art for this kind of approach is not necessarily 00:09:29.120 |
I'm actually going to look at the third place result from this competition because the third 00:09:35.200 |
place result was nearly as good as places 1 and 2, but way, way, way simpler. 00:09:42.880 |
And also it turns out that there's stuff that we can build on top of for almost every model 00:09:51.240 |
And basically, surprise, surprise, it turns out that the answer is to use a neural network. 00:09:59.040 |
So I need to warn you again, what I'm going to teach you here is very, very uncool. 00:10:08.440 |
You'll never read about it from DeepMind or OpenAI. 00:10:12.320 |
It doesn't involve any robot arms, it doesn't involve thousands of GPUs, it's the kind 00:10:18.400 |
of boring stuff that normal companies use to make more money or spend less money or 00:10:33.320 |
Having said that, in the 25 or more years I've been doing machine learning work applied 00:10:41.800 |
in the real world, 98% of it has been this kind of data. 00:10:49.480 |
Whether it be when I was working in agriculture, I've worked in wool, macadamia nuts and rice, 00:10:54.760 |
and we were figuring out how full our barrels were going to be, whether we needed more, 00:11:01.760 |
we were figuring out how to set futures market prices for agricultural goods, whatever; I worked 00:11:09.360 |
in mining and brewing, which required analyzing all kinds of engineering data and sales data. 00:11:16.560 |
I've worked in banking, that required looking at transaction account pricing and risk and 00:11:24.480 |
All of these areas basically involve this kind of data. 00:11:29.380 |
So I think although no one publishes stuff about this, because anybody who comes out 00:11:37.320 |
of a Stanford PhD and goes to Google doesn't know about any of those things, it's probably 00:11:43.760 |
the most useful thing for the vast majority of people. 00:11:50.880 |
And excitingly it turns out that you don't need to learn any new techniques at all. 00:11:57.360 |
In fact, the model that they got this third-place result with, a very simple model, is basically 00:12:06.120 |
one where each different categorical variable was one-hot encoded and chucked into an embedding layer. 00:12:14.080 |
The embedding layers were concatenated and chucked through a dense layer, then a second 00:12:17.680 |
dense layer, then it went through a sigmoid function into an output layer. 00:12:28.120 |
The continuous variables they haven't drawn here, and all these pictures are going to 00:12:32.520 |
come straight from this paper, which these folks that came third kindly wrote about their approach. 00:12:40.360 |
The continuous variables basically get fed directly into the dense layer. 00:12:52.560 |
So the short answer is, compared to K-nearest neighbors, random forests and GBMs, just a 00:13:01.120 |
simple neural network beats all of those approaches, just with standard one-hot encoding, whatever. 00:13:12.760 |
So adding in this idea of using embeddings, interestingly you can take the embeddings 00:13:19.160 |
trained by a neural network and feed them into a KNN, or a random forest, or a GBM. 00:13:25.600 |
And in fact, using embeddings with every one of those things is way better than any of them without embeddings. 00:13:38.320 |
And then if you use the embeddings with a neural network, you get the best results still. 00:13:43.260 |
So this actually is kind of fascinating, because training this neural network took me some 00:13:55.380 |
hours on a Titan X, whereas training the GBM took I think less than a second. 00:14:05.760 |
It was so fast, I thought I had screwed something up. 00:14:10.260 |
And then I tried running it, and it's like holy shit, it's giving accurate predictions. 00:14:17.640 |
So in your organization, you could try taking everything that you could think of as a categorical 00:14:25.000 |
variable and once a month train a neural net with embeddings, and then store those embeddings 00:14:33.340 |
in a database table and tell all of your business users, "Hey, anytime you want to create a 00:14:39.600 |
model that incorporates day of week or a store ID or a customer ID, you can go grab the embeddings." 00:14:49.420 |
And so they're basically like word vectors, but they're customer vectors and store vectors 00:14:56.280 |
So I've never seen anybody write about this other than this paper. 00:15:02.800 |
And even in this paper they don't really get to this hugely important idea of what you 00:15:17.280 |
A and B and C, yeah, we're going to get to that in a moment. 00:15:21.880 |
Basically the different things are like: A might be the store ID, B might be the product ID, 00:15:33.320 |
One of the really nice things that they did in this paper was to then draw some projections of the embeddings. 00:15:41.080 |
They just used t-SNE; it doesn't really matter what the projection method is, but here's one. 00:15:50.720 |
They took the embedding for each state of Germany, where the stores are based, and did a projection of those embeddings. 00:16:03.720 |
And I've drawn around them different colored circles, and you might notice the different 00:16:08.640 |
colored circles exactly correspond to the different colored circles on a map of Germany. 00:16:13.960 |
Now these were just random embeddings trained with SGD, trying to predict sales in stores 00:16:22.600 |
at Rossmann, and yet somehow they've drawn a map of Germany. 00:16:28.960 |
So obviously the reason why is because things close to each other in Germany have similar 00:16:33.840 |
behaviors around how they respond to events, and who buys what kinds of products, and so forth. 00:16:48.560 |
Every one of these dots is the distance between two stores, and this shows the correlation 00:16:58.640 |
between the distance in embedding space versus the actual distance between the stores. 00:17:04.680 |
So you can basically see that there's a strong correlation between things being close to 00:17:10.040 |
each other in real life and close to each other in these SGD-trained embeddings. 00:17:15.040 |
Here's a couple more pictures; all the lines drawn on top are mine, but everything else 00:17:25.400 |
comes from the paper. And you can see the days of the week that are near each other have ended up embedded close together. 00:17:30.400 |
On the right is months of the year embedding, again, same thing. 00:17:35.320 |
And you can see that the weekend is fairly separate. 00:17:47.560 |
I'm actually going to take you through the end-to-end process, and I rebuilt the end-to-end 00:17:53.840 |
process from scratch and tried to make it in as few lines as possible because we just 00:18:01.680 |
haven't really looked at any of these structured data type problems before. 00:18:08.680 |
So it's kind of a very different process and even a different set of techniques. 00:18:25.880 |
When you try to do this stuff yourself, you'll find three or four libraries we haven't used before. 00:18:32.040 |
So when you hit something that says "module not found", you can just pip install all these missing modules. 00:18:45.940 |
So the data that comes from Kaggle comes down as a bunch of CSV files, and I wrote a quick 00:18:52.520 |
thing to combine some of those CSVs together. 00:18:57.160 |
This was one of those competitions where people were allowed to use additional external data 00:19:06.280 |
So the data I'll share with you, I'm going to combine it all into one place for you. 00:19:12.360 |
So I've commented out these steps because the stuff I'll give you will have already been run on it. 00:19:18.660 |
So the basic tables that you're going to get access to is the training set itself, a list 00:19:24.920 |
of stores, a list of which state each store is in, a list of the abbreviation and the name 00:19:34.000 |
of each state in Germany, a list of data from Google Trends. 00:19:39.840 |
So if you've used Google Trends you can basically see how particular keywords change over time. 00:19:44.480 |
I don't actually know which keywords they used, but somebody found that there were some 00:19:52.360 |
Google Trends keywords that correlated well, so we've got access to those, plus some information about the weather. 00:20:01.200 |
So I'm not sure that we've really used pandas much yet, so let's talk a bit about pandas. 00:20:08.320 |
Pandas lets us take this kind of structured data and manipulate it in similar ways to 00:20:13.480 |
the way you would manipulate it in a database. 00:20:16.400 |
So the first thing you do: just like NumPy tends to become np, pandas tends 00:20:24.080 |
to become pd. So pd.read_csv is going to return a data frame. 00:20:28.440 |
So a data frame is like a database table; if you've used R, it's called the same thing there. 00:20:35.640 |
So this read_csv is going to return a data frame containing the information from this 00:20:40.400 |
CSV file, and we're going to go through each one of those table names and read the CSV. 00:20:47.880 |
So this list comprehension is going to return a list of data frames. 00:20:55.220 |
So I can now go ahead and display the head, so the first five rows from each table. 00:21:03.280 |
And that's a good way to get a sense of what these tables are. 00:21:12.320 |
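A sketch of that pattern; the path and file names are assumptions based on the tables described:

```python
import pandas as pd

path = 'data/rossmann/'                     # hypothetical location of the CSVs
table_names = ['train', 'store', 'store_states', 'state_names',
               'googletrend', 'weather', 'test']
tables = [pd.read_csv(f'{path}{name}.csv', low_memory=False) for name in table_names]

for t in tables:
    print(t.head())                         # first five rows of each table
```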
So for some store, on some date, they had some level of sales to some number of customers. 00:21:23.120 |
And they were either open or closed, they either had a promotion on or they didn't, 00:21:26.720 |
it either was a holiday or it wasn't, for both state and school holidays, and then there's some additional information in the other tables. 00:21:40.320 |
So for example, for each store, we can join up some kind of categorical variable about the store type. 00:21:49.880 |
I have no idea what this is, it might be a different brand or something. 00:21:54.360 |
What kinds of products do they carry, again it's just a letter, I don't know what it means, 00:21:58.480 |
but maybe it's like some are electronics, some are supermarkets, some are full spectrum. 00:22:10.120 |
And what year and month did the competitor open for business? 00:22:15.880 |
Notice that sometimes the competitor opened for business quite late in the game, like 00:22:22.560 |
later than some of the data we're looking at, so that's going to be a little bit confusing. 00:22:29.000 |
And then this thing called Promo2, which as far as I understand it basically indicates whether this is 00:22:33.480 |
a store which has some kind of standard promotion timing going on. 00:22:38.400 |
So you can see here that this store has standard promotions in January, April, July and October. 00:22:49.880 |
We also know for each store what state they're in based on the abbreviation, and then we can 00:22:54.600 |
find out for each state what is the name of that state. 00:22:59.040 |
And then for each, this is slightly weird, this is the state abbreviation, the last two characters of the field. 00:23:04.120 |
In this state, during this week, this was the Google Trend data for some keyword, I'm not sure which. 00:23:15.360 |
For this state name, on this date is the temperature, dewpoint, and so forth. 00:23:26.760 |
Then there's the test set. It's identical to the training set, but we don't have the number of customers, and we don't have the sales we're trying to predict. 00:23:33.680 |
So this is a pretty standard kind of industry data set. 00:23:40.240 |
You've got a central table, various tables related to that, and some things representing 00:23:48.360 |
One of the nice things you can do in pandas is to use the pandas_summary module: you call 00:23:57.520 |
DataFrameSummary(table).summary(), and that will return a whole bunch of information about every column. 00:24:04.040 |
So I'm not going to go through all of it in detail, but you can see for example for the 00:24:08.840 |
sales, on average 5,800 sales, standard deviation of 3,800, sometimes the sales goes all the 00:24:16.200 |
way down to 0, sometimes all the way up to 41,000. 00:24:19.880 |
There are no missing values for sales, that's good to know. 00:24:25.160 |
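That summary comes from the pandas_summary package; a minimal sketch of how it is typically called, assuming the tables list from the earlier sketch:

```python
from pandas_summary import DataFrameSummary

train = tables[0]                                # the training table
print(DataFrameSummary(train).summary())         # per-column stats, counts, missing, types
```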
So this is the kind of thing that's good to scroll through and identify, okay, competition 00:24:30.160 |
open since month is missing about a third of the time, that's good to know. 00:24:39.240 |
There's 12 unique states, that might be worth checking because there's actually 16 states in Germany. 00:24:49.680 |
Google trend data is never missing, that's good. 00:25:09.600 |
This is the kind of thing that might screw up a model, it's like actually sometimes the 00:25:12.560 |
test set is missing the information about whether that store was open or not, so that's worth knowing. 00:25:19.800 |
So we can take that list of tables and just de-structure it out into a whole bunch of 00:25:23.720 |
different table names, find out how big the training set is, how big the test set is. 00:25:30.840 |
And then with this kind of problem, there's going to be a whole bunch of data cleaning and feature engineering to do. 00:25:42.000 |
So neural nets don't make any of that go away, particularly because we're using this style 00:25:50.620 |
of neural net where we're basically feeding in a whole bunch of separate continuous and categorical variables. 00:25:56.520 |
So to simplify things a bit, we turn state holidays into Booleans, and then I'm going to join all of these tables together. 00:26:10.200 |
I always use a default join type of a left outer join, so you can see here this is how we join 00:26:16.940 |
in pandas. We say table.merge(table2, ...), and then to make a left outer join, how='left', 00:26:24.640 |
and then you say what's the name of the fields that you're going to join on the 00:26:29.040 |
left hand side, what are the fields you're going to join on the right hand side, and 00:26:33.360 |
then if both tables have some fields with the same name, what are you going to suffix 00:26:41.720 |
So on the left hand side we're not going to add any suffix, on the right hand side we'll 00:26:45.640 |
put in _y. So again, I try to refactor things as much as I can, so we're going to join lots 00:26:52.000 |
of things. Let's create one function to do the joining, and then we can call it lots of times. 00:26:57.720 |
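A sketch of such a join helper, mirroring what's described (left outer join, same key name on both sides by default, '_y' suffix on clashing columns):

```python
def join_df(left, right, left_on, right_on=None, suffix='_y'):
    """Left outer join two DataFrames, suffixing clashing right-hand columns."""
    if right_on is None:
        right_on = left_on
    return left.merge(right, how='left', left_on=left_on,
                      right_on=right_on, suffixes=('', suffix))
```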
Were there any fields that referred to the same value but were named differently? 00:27:06.520 |
Not that I saw, no. It wouldn't matter too much if there were, because when we run the 00:27:17.200 |
Question - Would you liken the use of embeddings from a neural network to extraction of implicit 00:27:26.280 |
features, or can we think of it more like what a PCA would do, like dimensionality reduction? 00:27:37.360 |
Let's talk about it more when we get there. Basically, when you deal with categorical 00:27:42.760 |
variables in any kind of model, you have to decide what to do with them. One of my favorite 00:27:51.800 |
data scientists, or a pair of them actually, who are very nearly neighbors of Rachel and 00:27:57.440 |
mine, have this fantastic R package called Vtreat, which has a bunch of state-of-the-art 00:28:09.160 |
approaches to dealing with stuff like categorical variable encoding. 00:28:20.000 |
The obvious way to do categorical variable encoding is to just do a one-hot encoding, 00:28:26.680 |
and that's the way nearly everybody puts it into their gradient-boosting machines or random 00:28:31.200 |
forests or whatever. One of the things that Vtreat does is it has some much more interesting 00:28:40.040 |
techniques. For example, you could look at the univariate mean of sales for each day 00:28:55.160 |
of week, and you could encode day of week using a continuous variable which represents 00:29:01.920 |
the mean of sales. But then you have to think about, "Would I take that mean from the training 00:29:08.000 |
set or the test set or the validation set? How do I avoid overfitting?" There's all kinds 00:29:12.800 |
of complex statistical subtleties to think about, and Vtreat handles all this stuff automatically. 00:29:25.040 |
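A tiny sketch of the mean-encoding idea just described, computed on the training rows only to avoid leakage (the train/validation split and column names are placeholders):

```python
# average sales per day of week, learned from the training set only
dow_means = train_df.groupby('DayOfWeek')['Sales'].mean()

# encode both sets with the training-set means
train_df['DayOfWeek_enc'] = train_df['DayOfWeek'].map(dow_means)
valid_df['DayOfWeek_enc'] = valid_df['DayOfWeek'].map(dow_means)
```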
There's a lot of great techniques, but they're kind of complicated and in the end they tend 00:29:29.240 |
to make a whole bunch of assumptions about linearity or univariate correlations or whatever. 00:29:36.740 |
Whereas with embeddings, we're using SGD to learn how to deal with it, just like we do 00:29:45.400 |
when we build an NLP model or a collaborative filtering model. We provide some initially 00:29:53.800 |
random embeddings and the system learns how the movies vary compared to each other, or 00:30:01.400 |
users vary, or words vary or whatever. This is to me the ultimate pure technique. 00:30:10.040 |
Of course the other nice thing about embeddings is we get to pick the dimensionality of the 00:30:13.560 |
embedding so we can decide how much complexity and how much learning are we going to put 00:30:19.080 |
into each of the categorical variables. We'll see how to do that in a moment. 00:30:30.360 |
One complexity was that the weather uses the name of the state rather than the abbreviation 00:30:42.760 |
of the state, so we can just go ahead and join weather to states to get the abbreviation. 00:30:51.040 |
The Google Trend information specifies the week as a range from one date to another, so we can split that apart. 00:30:59.320 |
You can see here one of the things that happens in the Google Trend data is that one of the 00:31:06.280 |
states is called ni, whereas in the rest of the data it's called hb,ni. So this is a good 00:31:12.420 |
opportunity to learn about pandas indexing. So pandas indexing, most of the time you want 00:31:18.360 |
to use this .ix method. And the .ix method is your general indexing method. It's going 00:31:26.760 |
to take two things, a list of rows to select and a list of columns to select. You can use 00:31:32.840 |
it in pretty standard intuitive ways. This is a lot like numpy. This here is going to 00:31:37.920 |
return a list of Booleans, which things are in this state. And if you pass the list of 00:31:43.520 |
Booleans to the pandas row selector, it will just return the rows where that Boolean is 00:31:48.960 |
true. So therefore this is just going to return the rows from Google Trend, where googletrend.state 00:31:55.880 |
is ni. And then the second thing we pass in is a list of columns, in this case we just 00:32:01.640 |
got one column. And one very important thing to remember, again just like numpy, you can 00:32:06.720 |
put this kind of thing on the left-hand side of an equal sign. In computer science we call 00:32:10.880 |
this an L-value, so you can use it as an L-value. So we can take this state field, for the rows 00:32:17.120 |
which are equal to ni, and change their value to this. So this is like a very nice simple 00:32:25.360 |
technique that you'll use all the time in pandas, both for looking at things and for changing things. 00:32:38.620 |
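A sketch of that pattern using .loc, the current replacement for .ix mentioned in the question below (the exact column name, capitalization and replacement string are assumptions about the dataset):

```python
# boolean mask selecting the rows whose state abbreviation is 'NI'
mask = googletrend.State == 'NI'

# used on the left-hand side of an assignment, .loc updates just those rows
googletrend.loc[mask, 'State'] = 'HB,NI'
```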
We have a few questions. One is, in this particular example, do you think the granularity of the 00:32:44.280 |
data matter, as in per day or per week, is one better than the other? 00:32:49.600 |
Yeah, I mean I would want to have the lower granularity so that I can capture that. Ideally 00:33:00.760 |
the hour of day as well. It kind of depends on how the organization is going to use it. 00:33:08.680 |
What are they going to do with this information? It's probably for purchasing and stuff, so 00:33:11.680 |
maybe they don't care about an hourly level. Clearly the difference between Sunday sales 00:33:17.640 |
and Wednesday sales will be quite significant. This is mainly a kind of business context question. 00:33:27.280 |
Another question is, do you know if there's any work that compares for structured data, 00:33:32.120 |
supervised embeddings like these, to embeddings that come from an unsupervised paradigm such 00:33:37.000 |
as an autoencoder? It seems like you'd get embeddings that are more useful for prediction in 00:33:42.040 |
the former case, but if you wanted general purpose embeddings you might prefer the latter. 00:33:46.400 |
Yeah, I think you guys are aware of my feelings about autoencoders. It's like giving up on 00:33:51.280 |
life. You can always come up with a loss function that's more interesting than an autoencoder 00:33:57.480 |
loss function basically. I would be very surprised if embeddings that came from a sales model 00:34:03.860 |
were not more useful for just about everything than something that came from an unsupervised 00:34:07.600 |
model. These things are easily tested, and if you do find a model that they don't work 00:34:13.080 |
as well with, then you can come up with a different set of supervised embeddings for that purpose. 00:34:17.800 |
There's also just a note that .ix is deprecated and we should use .loc instead. 00:34:23.840 |
I was going to mention Pandas is changing a lot. Because I've been running this course 00:34:30.400 |
I have not been keeping track of the recent versions of Pandas, so thank you. In Pandas 00:34:42.580 |
there's a whole page called Advanced Indexing Methods. I don't find the Pandas documentation 00:34:47.080 |
terribly clear to be honest, but there is a fantastic book by the author of Pandas called 00:34:51.800 |
Python for Data Analysis. There is a new edition out, and it covers Pandas, NumPy, Matplotlib, 00:35:01.280 |
whatever. That's the best way by far to actually understand Pandas because the documentation 00:35:09.920 |
is a bit of a nightmare and it keeps changing so the new version has all the new stuff in 00:35:14.400 |
it. With these kind of indexing methods, Pandas tries really hard to be intuitive, which means 00:35:22.200 |
that quite often you'll read the documentation for these methods and it will say if you pass 00:35:28.600 |
it a Boolean it will behave in this way, if you pass it a float it will behave this way, 00:35:32.520 |
if it's an index it's this way unless this other thing happens. I don't find it intuitive 00:35:39.480 |
at all because in the end I need to know how something works in order to use it correctly 00:35:42.780 |
and so you end up having to remember this huge list of things. I think Pandas is great, 00:35:47.880 |
but this is one thing to be very careful of, is to really make sure you understand how 00:35:53.480 |
all these indexing methods actually work. I know Rachel's laughing because she's been 00:35:57.440 |
there and probably laughing in disgust at what we all have to go through. 00:36:02.960 |
Another question, when you use embeddings from a supervised model in another model, do you 00:36:16.080 |
always have to worry about data leakage? I think that's a great point. I don't think 00:36:20.960 |
I've got anything to add to that. You can figure out easily enough if there's data leakage. 00:36:30.420 |
So there's this kind of standard set of steps that I take for every single structured machine 00:36:50.160 |
learning model I do. One of those is every time I see a date, I always do this. I always 00:36:58.440 |
create four more fields, the year, the month of year, the week of year and the day of week. 00:37:08.620 |
This is something which should be automatically built into every data loader, I feel. It's 00:37:18.220 |
so important because these are the kinds of structures that you see, and once every single 00:37:23.880 |
date has got this added to it, you're doing great. So you can see that I add that into 00:37:32.260 |
all of my tables that have a date field, so we'll have that from now on. 00:37:42.420 |
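A minimal sketch of that date expansion (the function and column suffixes are mine):

```python
import pandas as pd

def add_date_parts(df, col):
    """Split a date column into year, month, week-of-year and day-of-week fields."""
    d = pd.to_datetime(df[col])
    df[col + 'Year'] = d.dt.year
    df[col + 'Month'] = d.dt.month
    df[col + 'Week'] = d.dt.isocalendar().week   # needs pandas >= 1.1; older versions used .dt.week
    df[col + 'DayOfWeek'] = d.dt.dayofweek
```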
So now I go ahead and do all of these outer joins. You'll see that the first thing I do 00:37:49.340 |
after every outer join is check whether the thing I just joined with has any nulls. Even 00:37:58.540 |
if you're sure that these things match perfectly, I would still never ever do an inner join. 00:38:06.500 |
Do the outer join and then check for nulls, and that way if anything changes ever or if 00:38:11.020 |
you ever make a mistake, one of these things will not be zero. If this was happening in 00:38:17.660 |
a production process, this would be an assert. This would be emailing Henry at 2am to say 00:38:26.140 |
something you're relying on is not working the way it was meant to, go look at it. So that's really important. 00:38:35.460 |
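A sketch of that join-then-check pattern, reusing the join_df helper sketched earlier (the column checked is an assumption):

```python
joined = join_df(train, store, 'Store')     # left outer join on the store id
# if the store table were missing any stores, its columns would come back null
assert joined.StoreType.isnull().sum() == 0, 'some training rows did not match a store'
```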
So you can see I'm basically joining my training to everything else until it's all in there 00:38:43.740 |
together in one big thing. So that table "everything joined together" is called "joined", and then 00:38:53.580 |
I do a whole bunch more thinking about -- well, I didn't do the thinking, the people that 00:38:58.220 |
won this competition, then I replicated their results from scratch -- think about what are 00:39:04.700 |
all the other things you might want to do with these dates. 00:39:07.300 |
So competition open, we noticed before, a third of the time they're empty. So we just 00:39:13.700 |
fill in the empties with some kind of sentinel value because a lot of machine learning systems 00:39:21.180 |
don't like missing values. Fill in the missing months with some sentinel value. Again, keep 00:39:28.860 |
on filling in missing data. So fillna is a really important method to be aware of. 00:39:42.580 |
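For example, the competition-open fields could be filled roughly like this; the sentinel values here are my choice, not necessarily the authors':

```python
joined.CompetitionOpenSinceYear = joined.CompetitionOpenSinceYear.fillna(1900).astype(int)
joined.CompetitionOpenSinceMonth = joined.CompetitionOpenSinceMonth.fillna(1).astype(int)
```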
I guess the answer is yes, it is a problem. In this case, I happen to know that every 00:39:59.260 |
time a year is empty, a month is also empty, and we only ever use both of them together. 00:40:04.100 |
So we don't really care when the competitor's store was opened, what we really care about 00:40:25.540 |
is how long is it between when they were opened and the particular row that we're looking 00:40:30.020 |
at. The sales on the 2nd of February 2014, how long was it between the 2nd of February 2014 and when the competitor opened? 00:40:39.100 |
So you can see here we use this very important .apply function which just runs a Python function 00:40:48.300 |
on every row of a data frame. In this case, the function is to create a new date from 00:40:56.180 |
the open-since year and the open-since month. We're just going to assume that it's the middle 00:40:59.740 |
of the month. That's our competition open-since, and then we can get our days open by just subtracting the dates. 00:41:10.580 |
In pandas, every date field has this special magical dt property, which is where all the 00:41:18.940 |
day, month, year, all that stuff sits. Sometimes, as I mentioned, 00:41:31.580 |
the competition actually opened later than the particular observation we're looking at. 00:41:36.900 |
So that would give us a negative, so we replace our negatives with zero. We're going to use 00:41:46.220 |
an embedding for this, so that's why we replace days open with months open, so we have fewer distinct levels. 00:41:58.740 |
I didn't actually try replacing this with a continuous variable. I suspect it wouldn't 00:42:03.580 |
make too much difference, but this is what they do. In order to make the embedding again 00:42:11.100 |
not too big, they replaced anything that was bigger than 2 years with 2 years. 00:42:18.700 |
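A vectorized sketch of those steps (the original used a row-wise .apply; the effect is the same, and it assumes joined.Date is already a datetime column):

```python
import pandas as pd

# assemble an approximate opening date, assuming the middle of the month
open_since = pd.to_datetime(pd.DataFrame({
    'year': joined.CompetitionOpenSinceYear,
    'month': joined.CompetitionOpenSinceMonth,
    'day': 15}))
joined['CompetitionOpenSince'] = open_since

# days between each sales row and the competitor opening; clamp negatives to zero
joined['CompetitionDaysOpen'] = (joined.Date - joined.CompetitionOpenSince).dt.days
joined.loc[joined.CompetitionDaysOpen < 0, 'CompetitionDaysOpen'] = 0

# coarser unit for the embedding, capped at two years
joined['CompetitionMonthsOpen'] = (joined.CompetitionDaysOpen // 30).clip(upper=24)
```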
So there's our unique values. Every time we do something, print something out to make 00:42:24.380 |
sure the thing you thought you did is what you actually did. It's much easier if we're 00:42:29.260 |
using Excel because you see straight away what you're doing. In Python, this is the 00:42:34.860 |
kind of stuff that you have to really be rigorous about checking your work at every step. When 00:42:42.940 |
I build stuff like this, I generally make at least one error in every cell, so check everything. 00:42:51.860 |
Okay, do the same thing for the promo days, turn those into weeks. So that's some basic 00:43:01.780 |
pre-processing, you get the idea of how pandas works hopefully. 00:43:07.820 |
So the next thing that they did in the paper was a very common kind of time series feature 00:43:18.220 |
manipulation, one to be aware of. They basically wanted to say, "Okay, every time there's a 00:43:27.820 |
promotion, every time there's a holiday, I want to create some additional fields for 00:43:34.980 |
every one of our training set rows," which is on a particular date. "On that date, how 00:43:39.660 |
long is it until the next holiday? How long is it until the previous holiday? How long 00:43:45.020 |
is it until the next promotion? How long is it since the previous promotion?" So if we 00:43:49.620 |
basically create those fields, this is the kind of thing which is super difficult for 00:43:58.620 |
any GBM or random forest or neural net to figure out how to calculate itself. There's 00:44:07.020 |
no obvious kind of mathematical function that it's going to build on its own. So this is 00:44:11.060 |
the kind of feature engineering that we have to do in order to allow us to use these kinds 00:44:16.220 |
of techniques effectively on time series data. 00:44:19.660 |
So a lot of people who work with time series data, particularly in academia outside of 00:44:27.380 |
industry, they're just not aware of the fact that the state-of-the-art approaches really 00:44:33.140 |
involve all these heuristics. Separating out your dates into their components, turning 00:44:40.220 |
everything you can into durations both forward and backwards, and also running averages. 00:44:52.100 |
When I used to do a lot of this kind of work, I had a bunch of library functions that I 00:44:59.660 |
would run on every file that came in and would automatically do these things for every combination 00:45:04.700 |
of dates. So this thing of how long until the next promotion, how long since the previous 00:45:12.580 |
promotion is not easy to do in any database system pretty much, or indeed in pandas. Because 00:45:21.940 |
generally speaking, these kind of systems are looking for relationships between tables, 00:45:26.700 |
but we're trying to look at relationships between rows. 00:45:30.060 |
So I had to create this tiny, simple little class to do this. So basically what 00:45:36.820 |
happens is, let's say I'm looking at school holiday. So I sort my data frame by store, 00:45:45.300 |
and then by date, and I call this little function called add_elapsed_school_holiday_after. What 00:45:53.420 |
does add_elapsed do? Add_elapsed is going to create an instance of this class called elapsed, 00:46:00.220 |
and in this case it's going to be called with school_holiday. 00:46:04.580 |
So what this class is going to do, we're going to be calling this apply function again. It's 00:46:08.980 |
going to run on every single row, and it's going to call my Elapsed class's get method for every 00:46:15.620 |
row. So I'm going to go through every row in order of store, in order of date, and I'm 00:46:21.780 |
trying to find how long has it been since the last school holiday. 00:46:26.580 |
So when I create this object, I just have to keep track of what field is it, school_holiday. 00:46:34.080 |
Initialize, when was the last time we saw a school holiday? The answer is we haven't, 00:46:40.860 |
so let's initialize it to not a number. And we also have to know each time we cross over 00:46:47.740 |
to a new store. When we cross over to a new store, we just have to re-initialize. So the 00:46:51.860 |
previous store is initialized to 0. So every time we call get, we basically check: have we crossed over 00:46:57.900 |
to a new store? And if so, just initialize both of those things back again. And then 00:47:05.260 |
we just say, Is this a school holiday? If so, then the last time you saw a school holiday 00:47:12.980 |
is today. And then finally return how long it is between today and the last time you saw a school holiday. 00:47:20.620 |
So it's basically this class is a way of keeping track of some memory about when did I last 00:47:28.620 |
see this observation. So then by just calling df.apply, it's going to keep track of this 00:47:35.460 |
for every single row. So then I can call that for school_holiday, after and before. The 00:47:41.380 |
only difference being that for before I just sort my dates in descending order. State_holiday 00:47:49.220 |
and promo. So that's going to add in the end 6 fields, how long until and how long since 00:47:56.700 |
the last school holiday, state_holiday and promotion. 00:47:59.700 |
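A sketch of that bookkeeping class, assuming the rows carry Store and Date fields and are already sorted by store and then date:

```python
import numpy as np

class Elapsed:
    """Track, per store, the days elapsed since the field `fld` was last true."""
    def __init__(self, fld):
        self.fld = fld
        self.last = None            # date of the last event seen (None = not yet)
        self.last_store = 0

    def get(self, row):
        if row.Store != self.last_store:    # crossed over to a new store: reset
            self.last = None
            self.last_store = row.Store
        if row[self.fld]:                   # the event happens on this row's date
            self.last = row.Date
        if self.last is None:               # no event seen yet for this store
            return np.nan                   # becomes a null, filled with zero later
        return (row.Date - self.last).days

# usage sketch: with df sorted by Store then Date,
# df['AfterSchoolHoliday'] = df.apply(Elapsed('SchoolHoliday').get, axis=1)
```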
And then there's two questions. One asking, Is this similar to a windowing function? 00:48:10.020 |
Not quite, we're about to do a windowing function. 00:48:12.420 |
And then is there a reason to think that the current approach would be problematic with 00:48:17.620 |
I don't see why, but I'm not sure I quite follow. So we don't care about absolute days. 00:48:30.700 |
We care about two things. We do care about the dates, but we care about what year is 00:48:37.740 |
it, what day of the week it is. And we also care about the elapsed time between the date I'm 00:48:46.580 |
predicting sales for and the previous and next of various events. 00:48:53.220 |
And then windowing functions, for the features that are time until an event, how do you deal 00:48:59.700 |
with that given that you might not know when the last event is in the data? 00:49:06.340 |
Well all I do is I've sorted descending, and then we initialize last with not a number. 00:49:17.780 |
So basically when we then go subtract, here we are subtract, and it tries to subtract 00:49:23.940 |
not a number, we'll end up with a null. So basically anything that's an unknown time 00:49:30.620 |
because it's at one end or the other is going to end up null, which is why we're going to 00:49:39.860 |
replace those nulls with zeros. Pandas has this slightly strange way of thinking 00:49:48.540 |
about indexes, but once you get used to it, it's fine. At any point you can call DataFrame.set_index 00:49:54.900 |
and pass in a field. You then have to just kind of remember what field you have as the 00:50:01.500 |
index, because quite a few methods in Pandas use the currently active index by default, 00:50:07.980 |
and of course things all run faster when you do stuff with the currently active index. And 00:50:13.660 |
you can pass multiple fields, in which case you end up with a multiple key index. 00:50:20.260 |
So the next thing we do is these windowing functions. So a windowing function in Pandas, 00:50:26.420 |
we can use this rolling. So this is like a rolling mean, rolling min, rolling max, whatever 00:50:31.180 |
you like. So this basically says let's take our DataFrame with the columns we're interested 00:50:40.940 |
in, school holiday, state holiday and promo, and we're going to keep track of how many 00:50:45.980 |
holidays are there in the next week and the previous week. How many promos are there in 00:50:52.220 |
the next week and the previous week? To do that we can sort, here we are, by date, group 00:51:03.060 |
by, store, and then rolling will be applied to each group. So within each group, create 00:51:09.740 |
a rolling 7-day sum. It's the kind of notation I'm never likely to remember, but you can 00:51:25.220 |
just look it up. This is how you do group by type stuff. Pandas actually has quite a 00:51:32.460 |
lot of time series functions, and this rolling function is one of the most useful ones. Wes 00:51:39.220 |
McKinney had a background as a quant, if memory serves correctly, and so the quants love their 00:51:44.740 |
time series functions, so I think that was a lot of the history of Pandas. So if you're 00:51:49.100 |
interested in time series stuff, you'll find a lot of time series stuff in Pandas. 00:52:01.980 |
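A sketch of that windowing step (column names follow the Rossmann data; the exact window arguments are my assumption):

```python
cols = ['SchoolHoliday', 'StateHoliday', 'Promo']

# backward-looking: how many of each event in the previous 7 rows, per store
bwd = (df.sort_values(['Store', 'Date'])
         .groupby('Store')[cols]
         .rolling(7, min_periods=1).sum())

# forward-looking: same thing with the dates sorted in descending order
fwd = (df.sort_values(['Store', 'Date'], ascending=[True, False])
         .groupby('Store')[cols]
         .rolling(7, min_periods=1).sum())
```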
One helpful parameter that sits inside a lot of methods is inplace=True. That means that 00:52:08.180 |
rather than returning a new data frame with this change made, it changes the data frame 00:52:13.460 |
you already have, and when your data frames are quite big this is going to save a lot 00:52:17.740 |
of time and memory. That's a good little trick to know about. 00:52:23.100 |
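For example, a trivial sketch:

```python
df2 = df.reset_index()          # returns a new DataFrame, leaving df untouched
df.reset_index(inplace=True)    # modifies df itself; avoids copying a large frame
```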
So now we merge all these together, and we can now see that we've got all these after 00:52:28.060 |
school holidays, before school holidays, and our backward and forward running means. Then 00:52:37.860 |
we join that up to our original data frame, and here we have our final result. 00:52:46.940 |
So there it is. We started out with a pretty small set of fields in the training set, but 00:52:53.980 |
we've done this feature engineering. This feature engineering is not arbitrary. Although I didn't 00:53:01.760 |
create this solution, I was just re-implementing the solution that came from the competition 00:53:08.700 |
third-place getters -- this is nearly exactly the set of feature engineering steps I would 00:53:16.100 |
have done. It's just a really standard way of thinking about a time series. So you can 00:53:21.900 |
definitely borrow these ideas pretty closely. 00:53:32.300 |
So now that we've got this table, we've done our feature engineering, we now want to feed 00:53:45.140 |
it into a neural network. To feed it into a neural network we have to do a few things. 00:53:51.900 |
The categorical variables have to be turned into one-hot encoded variables, or at least 00:53:59.360 |
into contiguous integers. And the continuous variables we probably want to normalize to 00:54:06.660 |
a zero-mean, one-standard-deviation scale. There's a very little-known package called sklearn_pandas. 00:54:16.180 |
And actually I contributed some new stuff to it for this course to make this even easier 00:54:19.940 |
to use. If you use this data frame mapper from sklearn_pandas, as you'll see, it makes 00:54:26.180 |
life very easy. Without it, life is very hard. And because very few people know about it, 00:54:32.420 |
the vast majority of code you will find on the internet makes life look very hard. So 00:54:37.460 |
use this code, not the other code. Actually I was talking to some of the students the 00:54:44.020 |
other day and they were saying for their project they were stealing lots of code from part 00:54:49.060 |
one of the course because they just couldn't find anywhere else people writing any of the 00:54:55.460 |
kinds of code that we've used. The stuff that we've learned throughout this course is on 00:55:00.740 |
the whole not code that lives elsewhere very much at all. So feel free to use a lot of 00:55:07.860 |
these functions in your own work because I've really tried to make them the best version 00:55:15.780 |
So one way to do the embeddings and the way that they did it in the paper is to basically 00:55:21.900 |
say for each categorical variable they just manually decided what embedding dimensionality 00:55:27.900 |
to use. They don't say in the paper how they pick these dimensionalities, but generally 00:55:33.020 |
speaking things with a larger number of separate levels tend to have more dimensions. So I 00:55:39.620 |
think there's like 1000 stores, so that has a big embedding dimensionality, whereas 00:55:46.020 |
obviously things like promo forward and backward, or day of week or whatever, have much smaller 00:55:52.900 |
ones. So this is this dictionary I created that basically goes from the name of the field 00:55:58.740 |
to the embedding dimensionality. Again, this is all code that you guys can use in your 00:56:05.300 |
So then all I do is I say my categorical variables is go through my dictionary, sort it in reverse 00:56:14.420 |
order of the value, and then get the first thing from that. So that's just going to give 00:56:21.180 |
me the keys from this in reverse order of dimensionality. Continuous variables is just 00:56:31.820 |
a list. Just make sure that there's no nulls, so continuous variables replace nulls with 00:56:40.460 |
zeros, categorical variables replace nulls with empties. 00:56:44.380 |
And then here's where we use the DataFrameMapper. A DataFrameMapper takes a list of tuples with 00:56:53.780 |
just two items in. The first item is the name of the variable, so in this case I'm looping 00:56:58.140 |
through each categorical variable name. The second thing in the tuple is an instance of 00:57:05.460 |
a class which is going to do your preprocessing. And there's really just two that you're going 00:57:12.180 |
to use almost all the time. The categorical variables, sklearn comes with something called 00:57:17.660 |
label encoder. It's really badly documented, in fact misleadingly documented, but this 00:57:26.620 |
is exactly what you want. It's something that takes a column, figures out what are all the 00:57:31.820 |
unique values that appear in that column, and replaces them with a set of contiguous 00:57:36.100 |
integers. So if you've got the days of the week, Monday through Sunday, it'll replace them with the integers 0 through 6. 00:57:45.300 |
And then very importantly, this is critically important, you need to make sure that the 00:57:50.500 |
training set and the test set have the same codes. There's no point in having Sunday be 00:57:54.860 |
zero in the training set and one in the test set. So because we're actually instantiating 00:58:01.460 |
this class here, this object is going to actually keep track of which codes it's using. 00:58:08.780 |
And then ditto for the continuous: we want to normalize them to zero-mean, unit-standard-deviation variables. But 00:58:15.180 |
again, we need to remember what was the mean that we subtracted, what was the standard 00:58:19.020 |
deviation we divided by, so that we can do exactly the same thing to the test set. Otherwise 00:58:23.660 |
again our models are going to be nearly totally useless. So the way the dataframe mapper works 00:58:28.260 |
is that it's using this instantiated object, it's going to keep track with this information. 00:58:33.260 |
So this is basically code you can copy and paste in every one of your models. 00:58:38.420 |
Once we've got those mappings, you just pass those to a dataframe mapper, and then you 00:58:44.540 |
call .fit passing in your dataset. And so this thing now is a special object which has 00:58:52.900 |
a .features property that's going to contain all of the pre-processed features that you 00:59:02.260 |
want. Categorical columns contains the result of doing this mapping, basically doing this 00:59:09.100 |
label encoding. In some ways the details of how this works doesn't matter too much because 00:59:17.020 |
you can just use exactly this code in every one of your models. 00:59:20.180 |
Same for continuous, it's exactly the same code, but of course continuous, it's going 00:59:26.020 |
to be using StandardScaler, which is the scikit-learn thing that turns it into a zero-mean-one standard 00:59:32.840 |
deviation variable. So we've now got continuous columns that have all been standardized. 00:59:39.020 |
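Putting those pieces together, a sketch of the mapper setup; the variable lists and DataFrame name are assumed from the steps above:

```python
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn_pandas import DataFrameMapper

# categorical columns: a bare column name hands LabelEncoder a 1-D input
cat_maps = [(c, LabelEncoder()) for c in cat_vars]
# continuous columns: wrap the name in a list so StandardScaler gets 2-D input
contin_maps = [([c], StandardScaler()) for c in contin_vars]

cat_map_fit = DataFrameMapper(cat_maps).fit(joined)       # remembers the codes
contin_map_fit = DataFrameMapper(contin_maps).fit(joined) # remembers mean and std

cat_cols = cat_map_fit.transform(joined).astype('int64')
contin_cols = contin_map_fit.transform(joined).astype('float32')
```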
Here's an example of the first five rows from the zeroth column for a categorical, and then 00:59:51.420 |
ditto for a continuous. You can see these have been turned into integers and these have been 00:59:59.100 |
turned into numbers which are going to average to zero and have a standard deviation of one. 01:00:05.400 |
One of the nice things about this dataframe mapper is that you can now take that object 01:00:11.940 |
and actually store it, pickle it. So now you can use those categorical encodings and scaling 01:00:20.260 |
parameters elsewhere. By just unpickling it, you've immediately got those same parameters. 01:00:30.180 |
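A sketch of that round trip (the file name is arbitrary):

```python
import pickle

with open('contin_maps.pkl', 'wb') as f:
    pickle.dump(contin_map_fit, f)           # save the fitted scaling parameters

with open('contin_maps.pkl', 'rb') as f:     # later, in another process or model
    contin_map_fit = pickle.load(f)
```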
For my categorical variables, you can see here the number of unique classes in every 01:00:37.340 |
one. So here's my 1,100 stores, 31 days of the month, 7 days of the week, and so forth. 01:00:48.940 |
So that's the kind of key pre-processing that has to be done. So here is their big mistake, 01:00:59.020 |
and I think if they didn't do this big mistake, they probably would have won. Their big mistake 01:01:03.820 |
is that they went joined.Sales != 0. So they've removed all of the rows with 01:01:12.940 |
no sales. Those are all of the rows where the store was closed. Why was this a big mistake? 01:01:21.420 |
Because if you go to the Rossmann Store Sales competition website and click on "Kernels" 01:01:29.300 |
and look at the kernel that got the highest rating. I'll show you a couple of pictures. 01:01:51.940 |
Here is an example of a store, Store 708, and these are all from this kernel. Here is 01:01:58.460 |
a period of time where it was closed for refurbishment. This happens a lot in Rossmann stores. You 01:02:05.340 |
get these periods of time when you get zeros for sales, lots in a row. Look what happens 01:02:11.500 |
immediately before and after. So in the data set that we're looking at, our unfortunate 01:02:19.740 |
third place winners deleted all of these. So they had no ability to build a feature 01:02:25.980 |
that could find this. So this Store 708. Look, here's another one where it was closed. So 01:02:35.100 |
this turns out to be super common. The second place winner actually built a feature. It's 01:02:44.700 |
going to be exactly the same kind of feature we've seen before: how many days since the store closed 01:02:48.820 |
and how many days until it will next close. If they had just done that, I'm pretty sure 01:02:53.220 |
they would have won. So that was their big mistake. 01:02:59.940 |
This kernel has a number of interesting analyses in it. Here's another one which I think our 01:03:08.380 |
neural net can capture, although it might have been better to be explicit. Some stores 01:03:16.980 |
opened on Sundays. Most didn't, but some did. For those stores that opened on Sundays, their 01:03:25.340 |
sales on Sundays were far higher than on any other day. I guess that's because in Germany 01:03:30.460 |
not many shops open on Sundays. So something else that they didn't explicitly 01:03:35.620 |
do was create a "is store open on Sunday" field. Having said that, I think the neural 01:03:43.340 |
net may have been able to put that in the embedding. So if you're interested during 01:03:48.100 |
the week, you could try adding this field and see if it actually improves it or not. 01:03:51.780 |
It would certainly be interesting to hear if you try adding this field. Do you find that it helps? 01:04:01.460 |
This Sunday thing, these are all from the same Kaggle kernel, here's the day of week 01:04:07.620 |
and here's the sales as a box plot. You can see normally on a Sunday, it's not that the 01:04:14.540 |
sales are much higher. So it's really explicitly just for these particular stores. 01:04:22.980 |
That's the kind of visualization stuff which is really helpful to do as you work through 01:04:30.900 |
these kinds of problems. I don't know, just draw lots of pictures. Those pictures were 01:04:35.900 |
drawn in R, and R is actually pretty good for this kind of structured data. 01:04:40.340 |
I have a question. For categorical fields, they're converted to numbers, not to ones 01:04:48.020 |
and zeros. They were just mapped so Monday is zero, Tuesday is one, whatever. 01:04:54.900 |
As is, they will be sent to a neural network just like... 01:04:59.440 |
We're going to get there. We're going to use embeddings. Just like we did with word embeddings, 01:05:05.880 |
remember, we turned every word into a word index. So our sentences, rather than being 01:05:13.220 |
like the dog ate the beans, it would be 3, 6, 12, 2, whatever. We're going to do the 01:05:21.940 |
same basic thing. We've done the same basic thing. 01:05:25.780 |
So now that we've made our terrible mistake, we've still got 824,000 rows left. As 01:05:34.460 |
per usual, I made it really easy for me to create a random sample and did most of my 01:05:40.000 |
analysis with a random sample, but can just as easily not do the random sample. So now 01:05:51.980 |
Split it into training and test. Notice here, the way I split it into training and test 01:05:58.820 |
is not randomly. The reason it's not randomly is because in the Kaggle competition, they 01:06:04.500 |
set it up the smart way. The smart way to set up a test set in a time series is to make 01:06:11.580 |
your test set the most recent period of time. If you choose random points, you've got two 01:06:18.420 |
problems. The first is you're predicting tomorrow's sales when you always have the previous 01:06:23.500 |
day's sales which is very rarely the way things really work. And then secondly, you're ignoring 01:06:32.580 |
the fact that in the real world, you're always trying to model a few days or a few weeks 01:06:38.620 |
or a few months in the future that haven't happened yet. 01:06:41.420 |
So the way you want to set up, if you were setting up the data for such a model yourself, 01:06:48.500 |
you would need to be deciding how often am I going to be rerunning this model, how long 01:06:53.540 |
is it going to take for those model results to get into the field, to be used in however 01:06:57.820 |
they're being used. In this case, I can't remember, I think it's like a month or two. 01:07:05.300 |
So in that case I should make sure there's a month or two test set, which is the last 01:07:13.900 |
bit. So you can see here, I've taken the last 10% as my validation set and it's literally 01:07:20.620 |
just here's the first bit and here's the last bit, and since it was already sorted by date, 01:07:29.340 |
this ensures that I have it done the way I want. 01:07:32.260 |
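A sketch of that non-random split, assuming the joined DataFrame is already sorted by date:

```python
n = len(joined)
train_size = int(n * 0.9)              # hold out the most recent ~10% for validation

train_df = joined.iloc[:train_size]    # earlier dates
valid_df = joined.iloc[train_size:]    # the final stretch of time, as in the competition
```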
I just wanted to point out that it's 10 to 8, so we should probably take a break. 01:07:48.660 |
This is how you take that DataFrameMapper object we created earlier: we call .fit in order 01:07:54.620 |
to learn the transformation parameters, we then call transform to actually do it. So take 01:08:05.100 |
my training set and transform it to grab the categorical variables, and then the continuous 01:08:12.060 |
preprocessing is the same thing for my continuous map. So preprocess my training set and grab 01:08:18.140 |
my continuous variables. So that's nearly done. The only final piece is in their solution, 01:08:32.580 |
they modified their target, their sales value. And the way they modified it was that they 01:08:40.260 |
found the highest amount of sales, and they took the log of that, and then they modified 01:08:49.060 |
all of their y values to take the log of sales divided by the maximum log of sales. 01:08:56.460 |
So what this means is that the y values are going to be no higher than 1. And furthermore, 01:09:04.900 |
remember how they had a long tail, the average was 5,000, the maximum was 40-something thousand. 01:09:11.100 |
This is really common, like most financial data, sales data, so forth, generally has 01:09:16.420 |
a nicer shape when it's logged than it does not. So taking a log is a really good idea. 01:09:22.280 |
The reason that as well as taking the log they also did this division is it means that 01:09:27.860 |
what we can now do is we can use an activation function in our neural net of a sigmoid, which 01:09:33.700 |
goes between 0 and 1, and then just multiply by the maximum log. So that's basically going 01:09:40.100 |
to ensure that the data is in the right scaling area. 01:09:43.820 |
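A hedged sketch of that target scaling; the column name 'Sales' and the variable names are assumptions:

```python
import numpy as np

# Scale log-sales into (0, 1] so a sigmoid output layer can cover the whole range.
max_log_y = np.max(np.log(train['Sales']))
y_train = np.log(train['Sales']) / max_log_y

# To turn a sigmoid-scaled prediction back into sales:
# sales = np.exp(pred * max_log_y)
```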
I actually tried taking this out, and this technique doesn't really seem to help. And 01:09:50.900 |
it actually reminds me of the style transfer paper where they mentioned they originally 01:09:57.500 |
had a hyperbolic tan layer at the end for exactly the same reason, to make sure everything 01:10:02.620 |
was between 0 and 255. It actually turns out that if you just use a linear activation it works 01:10:06.820 |
just as well. So interestingly this idea of using sigmoids at the end in order to get 01:10:14.060 |
the right range doesn't seem to be that helpful. 01:10:17.620 |
My guess is the reason why is because for a sigmoid it's really difficult to get the 01:10:23.020 |
maximum. And I think actually what they should have done is they probably should have, instead 01:10:28.820 |
of using maximum, they should have used maximum times 1.25 so that they never have to predict 01:10:35.780 |
1, because it's impossible to predict 1 because it's a sigmoid. 01:10:40.380 |
Someone asked, "Is there any issue in fitting the preprocessors on the full training and 01:10:46.780 |
validation data? Shouldn't they be fit only to the training set?" 01:10:51.620 |
No, it's fine. In fact, for the categorical variables, if you don't include the test set 01:11:01.460 |
then you're going to have some codes in the test set that aren't there at all; at best 01:11:05.860 |
they'd get handled randomly, which is better than failing. As for deciding what to divide by and subtract 01:11:14.380 |
in order to get a 0-1 scaled variable, it doesn't really matter. There's no leakage 01:11:19.860 |
involved, and leakage is the thing you'd be worried about. 01:11:26.120 |
Root means squared percent error is what the Kaggle competition used as the official loss 01:11:34.640 |
So before we take a break, we'll finally take a look at the definition of the model. I'll 01:11:41.860 |
kind of work backwards. Here's the basic model. Get our embeddings, combine the embeddings 01:11:53.620 |
with the continuous variables, a tiny bit of dropout, one dense layer, two dense layers, 01:11:59.420 |
more dropout, and then the final sigmoid activation function. 01:12:05.060 |
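A rough sketch of that shape in Keras; layer sizes and dropout rates here are illustrative guesses, not the notebook's exact values, and `emb_inputs`/`emb_outputs` are assumed to come from an embedding helper like the one sketched a little further down, with `contin_inp`/`contin_out` being the continuous variables fed through their own input layer:

```python
from keras.layers import Dense, Dropout, concatenate
from keras.models import Model

# Combine the embedding outputs with the continuous variables, then a couple of
# dense layers, dropout, and a final sigmoid (rescaled by max_log_y at prediction time).
x = concatenate(emb_outputs + [contin_out])
x = Dropout(0.02)(x)
x = Dense(1000, activation='relu')(x)
x = Dense(500, activation='relu')(x)
x = Dropout(0.2)(x)
out = Dense(1, activation='sigmoid')(x)

model = Model(emb_inputs + [contin_inp], out)
model.compile(optimizer='adam', loss='mean_absolute_error')
```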
You'll see that I've got commented out stuff all over the place. This is because I had 01:12:10.340 |
a lot of questions, we're going to cover this after the break, a lot of questions about 01:12:13.580 |
some of the details of why did they do things certain ways, some of the things they did 01:12:19.620 |
were so weird, I just thought they couldn't possibly be right. So I did some experimenting. 01:12:27.020 |
So the embeddings, as per usual, I create a little function to create an embedding, which 01:12:34.740 |
first of all creates my regular Keras input layer, and then it creates my embedding layer, 01:12:43.500 |
and then how many embedding dimensions I'm going to use. Sometimes I looked them up in 01:12:48.460 |
that dictionary I had earlier, and sometimes I calculated them using this simple approach 01:12:54.860 |
of saying I will use however many levels there are in the categorical variable divided by 01:13:00.340 |
2 with a maximum of 50. These were two different techniques I was playing with. 01:13:07.740 |
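A rough sketch of an embedding helper along those lines; the exact sizing rule and names are illustrative, not necessarily what the notebook uses (the Flatten step is explained just below):

```python
from keras.layers import Input, Embedding, Flatten

def emb_dim(n_levels):
    return min(50, (n_levels + 1) // 2)       # roughly "number of levels / 2, capped at 50"

def get_emb(name, n_levels):
    inp = Input(shape=(1,), name=name + '_in')
    emb = Embedding(n_levels, emb_dim(n_levels), input_length=1)(inp)
    return inp, Flatten()(emb)                # drop the redundant length-1 time axis
```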
Normally with word embeddings, you have a whole sentence, and so you've got to feed 01:13:13.740 |
it to an RNN, and so you have time steps. So normally you have an input length equal 01:13:18.540 |
to the length of your sentence. This is the time steps for an RNN. We don't have an RNN, 01:13:25.140 |
we don't have any time steps. We just have one element in one column. So therefore I 01:13:32.100 |
have to pass Flatten after this because it's going to have this redundant length-1 time axis that we don't need. 01:13:42.220 |
So this is just because people don't normally do this kind of stuff with embeddings, so 01:13:46.940 |
they're assuming that you're going to want it in a format ready to go to an RNN, so this 01:13:51.100 |
is just turning it back into a normal format. So we grab each embedding, we end up with 01:13:57.660 |
a whole list of those. We then combine all of those embeddings with all of our continuous 01:14:03.180 |
variables into a single list of variables. And so then our model is going to have all 01:14:10.340 |
of those embedding inputs and all of our continuous inputs, and then we can compile it and train 01:14:17.380 |
it. So let's take a break and see you back here at 5 past 8. 01:14:46.700 |
So we've got our neural net set up. We train it in the usual way, call .fit, and away we go. 01:15:05.220 |
So that's basically that. It trains reasonably quickly, 6 minutes in this case. 01:15:18.800 |
So we've got two questions that came in. One of them is, for the normalization, is it possible 01:15:29.700 |
to use another function other than log, such as sigmoid? 01:15:40.020 |
I don't think you'd want to use sigmoid. This kind of financial data and sales data tends 01:15:45.620 |
to be of a shape where log will make it more linear, which is generally what you want. 01:15:52.080 |
And then when we log transform our target variable, we're also transforming the squared 01:15:57.540 |
error. Is this a problem? Or is it helping the model to find a better minimum? 01:16:03.260 |
Yeah, so you've got to be careful about what loss function you want. In this case the Kaggle 01:16:07.100 |
competition is trying to minimize root mean squared percent error. So I actually 01:16:12.540 |
then said I want you to do mean absolute error because in log space that's basically doing 01:16:21.900 |
the same thing. The percent is a ratio, so this is the absolute error between two logs 01:16:28.460 |
which is basically the same as a ratio. So you need to make sure your loss function matches the metric you actually care about. 01:16:38.540 |
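A tiny illustration of why the absolute error between logs behaves like a percent error (the numbers are made up):

```python
import numpy as np

y_true, y_pred = 5000.0, 5500.0
print(abs(np.log(y_pred) - np.log(y_true)))   # ~0.0953
print(abs(np.log(y_pred / y_true)))           # identical: it's just the log of the ratio
print(abs(y_pred - y_true) / y_true)          # 0.10 -- the percent error it approximates
```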
I think this is one of the things that I didn't do in the original competition. As you can 01:16:42.900 |
see I tried changing it and I think it helped. By the way, XGBoost is fantastic. Here is the 01:16:57.660 |
same series of steps to run this model with XGBoost. As you can see, I just concatenate 01:17:05.020 |
my categorical and continuous for training and my validation set. Here is a set of parameters 01:17:12.780 |
which tends to work pretty well. XGBoost has a data type called DMatrix, which is basically 01:17:23.300 |
a normal matrix but it keeps track of the names of the features, so it prints out better 01:17:31.100 |
information. Then you go .train and this takes less than a second to run. It's not massively 01:17:39.020 |
worse than our previous result. This is a good way to get started. 01:17:47.460 |
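To make that concrete, here's a rough sketch of an XGBoost baseline under assumed names: `cat_train`, `contin_train`, `y_train`, `feature_names` and so on are placeholders, and the parameters are just ones that tend to work okay, not necessarily the ones used here:

```python
import numpy as np
import xgboost as xgb

# Concatenate categorical codes and scaled continuous columns into plain matrices.
X_train = np.concatenate([cat_train, contin_train], axis=1)
X_valid = np.concatenate([cat_valid, contin_valid], axis=1)

# DMatrix keeps track of feature names, so the output is easier to read.
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_names)
dvalid = xgb.DMatrix(X_valid, label=y_valid, feature_names=feature_names)

params = {'objective': 'reg:linear', 'eta': 0.1, 'max_depth': 6, 'subsample': 0.8}
bst = xgb.train(params, dtrain, num_boost_round=100, evals=[(dvalid, 'valid')])

xgb.plot_importance(bst)    # the variable importance plot discussed next
```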
The reason that XGBoost and random forests are particularly helpful is because they do 01:17:53.220 |
something called variable importance. This is how you get the variable importance for 01:17:58.060 |
an XGBoost model. It takes a second and suddenly here is the information you need. When I was 01:18:05.700 |
having trouble replicating the original results from the third place winners, one of the things 01:18:14.460 |
that helped me a lot was to look at this feature importance plot and say, "Competition distance, 01:18:21.140 |
holy cow, that's really really important. Let's make sure that my competition distance 01:18:26.420 |
results pre-processing really is exactly the same." On the other hand, events doesn't really 01:18:36.540 |
matter at all, so I'm not going to worry really at all about checking my events. This feature 01:18:43.980 |
importance or variable importance plot, as it's also known, you can also create with a 01:18:48.460 |
random forest. These are amazing. Because you're using a tree ensemble, it doesn't matter the 01:18:58.740 |
shape of anything, it doesn't matter if you have or don't have interactions, this is all 01:19:04.060 |
totally assumption free. In real life, this is the first thing I do. The first thing I 01:19:12.500 |
do is try to get a feature importance plot printed. Often it turns out that there's only 01:19:19.220 |
three or four variables in that. If you've got 10,000 variables, so I worked on a big 01:19:24.940 |
credit scoring problem a couple of years ago, I had 9,500 variables. It turned out that only 01:19:30.220 |
nine of them mattered. So the company I was working for literally had spent something 01:19:35.900 |
like $5 million on this big management consulting project, and this big management consulting 01:19:40.700 |
project had told them all these ways in which they can capture all this information in this 01:19:45.220 |
really clean way for their credit scoring models. Of course none of those things were 01:19:50.220 |
in these nine that mattered, so they could have saved $5 million, but they didn't because 01:19:57.740 |
management consulting companies don't use random forests. 01:20:01.980 |
I can't overstate the importance of this plot, but this is a deep learning course, so we're 01:20:09.480 |
not really going to spend time talking about it. Now I mentioned that I had a whole bunch 01:20:17.940 |
of really weird things in the way that the competition place-getters did things. For one, 01:20:31.580 |
they didn't normalize their continuous variables. Who does that? But then when people do well 01:20:37.700 |
in a competition, something's working. The ways in which they initialized their embeddings 01:20:50.540 |
were really, really weird. All of these things just seemed really odd to me. 01:20:55.420 |
So what I did was I wrote a little script, Rossman Experiments, and what I did was basically 01:21:12.340 |
I copied and pasted all the important code out of my notebook. Remember I've already 01:21:17.460 |
pickled the parameters for the label encoder and the scaler, so I didn't have to worry 01:21:26.260 |
about doing those again. Once I copied and pasted all that code in, so this is exactly 01:21:30.900 |
all the code you just saw, I then had this bunch of for loops. Pretty inelegant. But 01:21:45.820 |
these are all of the things that I wanted to basically find out. Does it matter whether 01:21:50.460 |
or not you use 0-1 scaling? Does it matter whether you use their weird approach to initializing 01:21:58.500 |
embeddings? Does it matter whether you use their particular dictionary of embedding dimensions? 01:22:08.580 |
Something else I tried is they basically took all their continuous variables and put them 01:22:12.980 |
through a separate little dense layer each. I was like, why don't we put them all together. 01:22:17.420 |
I also tried some other things like batch normalization. So I ran this and got back 01:22:24.540 |
every possible combination of these. This is where you want to be using the script. 01:22:30.140 |
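A hypothetical sketch of that kind of experiment script: loop over every combination of the design choices, train, and log the result to a CSV you can pivot later. `train_model` is an assumed helper that builds, fits and returns the validation loss; the option names are placeholders:

```python
import csv
import itertools

options = {
    'normalize': [True, False],
    'use_emb_dict': [True, False],
    'single_dense_for_contins': [True, False],
    'special_init': [True, False],
}

with open('rossmann_experiments.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(list(options) + ['val_loss'])
    # Every combination of every flag, one training run each.
    for combo in itertools.product(*options.values()):
        cfg = dict(zip(options, combo))
        writer.writerow(list(combo) + [train_model(**cfg)])
```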
I'm not going to tell you that I jumped straight to this. First of all, I spent days screwing 01:22:37.280 |
around with experiments in a notebook by hand, continually forgetting what I had just done, 01:22:44.160 |
until eventually I wrote this script, which took me like an hour. And then of course I pasted it 01:22:52.620 |
into Excel. And here it is. Chucked it into a pivot table, used conditional formatting, 01:23:02.140 |
and here's my results. You can see all my different combinations, with and without normalization, 01:23:08.380 |
with my special function versus their dictionary, using a single dense matrix versus putting 01:23:14.260 |
everything together, using their weird init versus not using a weird init. And here is 01:23:23.740 |
this dark blue here is what they did. It's full of weird choices to me. But as you can see, it's 01:23:35.540 |
actually the darkest blue. It actually is the best. 01:23:39.380 |
But then when you zoom out, you realize there's a whole corner over here that's got a couple 01:23:46.300 |
of .086s; it's nearly as good, but seems much more consistent. And also more consistent 01:23:54.020 |
with sanity. Like yes, do normalize your data. And yes, do use an appropriate initialization 01:24:01.020 |
function. And if you do those two things, it doesn't really matter what else you do. 01:24:06.180 |
So what I then did was I created a little sparkline in Excel for the actual training 01:24:12.780 |
graphs. And so here's their winning one, again, .085. But here's the variance of getting there. 01:24:24.980 |
And as you can see, their approach was pretty bumpy, up and down, up and down, up and down. 01:24:29.020 |
The second best on the other hand, .086 rather than .085, is going down very smoothly. And 01:24:38.580 |
so that made me think, given that it's in this very stable part of the world, and given 01:24:43.140 |
it's training much better, I actually think this is just random chance. It just happened 01:24:47.780 |
to be low at this point. I actually thought this is a better approach. It's more sensible. 01:24:59.060 |
So this kind of approach to running experiments, I thought I'd just show you to say when you 01:25:06.980 |
run experiments, try and do it in a rigorous way and track both the stability of the approach 01:25:14.380 |
as well as the actual result of the approach. So this one here makes so much sense. It's 01:25:19.740 |
like use my simple function rather than the weird dictionary, use normalization, use a 01:25:26.020 |
single dense matrix, and use a thoughtful initialization. And you do all of those things, 01:25:30.140 |
you end up with something that's basically as good and much more stable. 01:25:36.100 |
That's all I wanted to say about Rossman. I'm going to very briefly mention another competition, 01:25:47.500 |
which is the Kaggle Taxi Destination competition. 01:26:00.580 |
You were saying that you did a couple of experiments. One, you figured out the embeddings and then 01:26:07.460 |
put the embeddings into random forests, and then put embeddings again into neural network. 01:26:16.420 |
Yeah, so I don't understand, because you just used one neural network to do everything together? 01:26:22.700 |
Yeah, so what they did was, for this one here, this 115, they trained the neural network I 01:26:27.580 |
just showed you. They then threw away the neural network and trained a GBM model, but 01:26:36.180 |
for the categorical variables, rather than using one-hot encodings, they used the embeddings. 01:26:44.700 |
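A hedged sketch of that idea; the layer name 'store_emb', the 'Store' column and the array names are all assumptions, not the actual code:

```python
import numpy as np

# Pull the trained embedding matrix out of the Keras model and look up one row per
# example, then feed those rows to the GBM instead of one-hot encodings.
store_emb = model.get_layer('store_emb').get_weights()[0]   # shape (n_stores, emb_dim)
store_feats = store_emb[train['Store'].values]              # one embedding row per data row

X_gbm = np.concatenate([store_feats, contin_train], axis=1)
```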
So the taxi competition was won by the team with this Unicode name, which is pretty cool. 01:26:55.060 |
And it's actually turned out to be a team run by Yoshua Bengio, who's one of the people 01:27:00.140 |
that stuck it out through the AI winter and is now one of the leading lights in deep learning. 01:27:07.660 |
And interestingly, the thing I just showed you, the Rossman competition, this paper they 01:27:14.300 |
wrote in the Rossman competition claimed to have invented this idea of categorical embeddings. 01:27:20.500 |
But actually, Yoshua Bengio's team won this competition a year earlier with this same 01:27:25.980 |
technique. But again, it's so uncool, nobody noticed even though it was Yoshua Bengio. 01:27:32.900 |
So I want to quickly show you what they did. This is the paper they wrote. And their approach 01:27:42.020 |
to picking an embedding size was very simple. Use 10. So the data was which customer is 01:27:52.540 |
taking this taxi, which taxi are they in, which taxi stand did they get the taxi from, 01:28:00.380 |
and then quarter hour of the day, day of the week, week of the year. And they didn't add 01:28:07.180 |
all kinds of other stuff, this is basically it. And so then they said we're going to learn 01:28:13.980 |
embeddings inspired by NLP. So actually to my knowledge, this is the first time this 01:28:21.700 |
appears in the literature. Having said that, I'm sure a thousand people have done it before, 01:28:26.660 |
it just never made it into a paper. 01:28:31.420 |
>> As a quick sanity check, if you have day of the week, which even one-hot 01:28:39.660 |
encoded only has seven possible values, an embedding size of 10 doesn't make any sense, right? 01:28:46.900 |
>> Yeah, so I used to think that. But actually it does. Over the last few months I've quite often 01:28:55.460 |
ended up with bigger embeddings than my original cardinality. And often it does give better 01:29:02.220 |
results. And I think it's just like when you realize that it's just a dense layer on top 01:29:08.100 |
of a one-hot encoding, it's like okay, why shouldn't the dense layer have more information? 01:29:14.180 |
I found it weird too, I still find it a little weird, but it definitely seems to be something that helps. 01:29:14.180 |
>> It does, it helps. I have absolutely found plenty of times now where I need a bigger 01:29:38.460 |
embedding matrix dimensionality than the cardinality of my categorical variable. 01:29:46.620 |
Now in this competition, again it's a time series competition really, because the main 01:29:52.660 |
thing you're given other than all this metadata is a series of GPS points, which is every 01:29:58.660 |
GPS point along a route. And at some point for the test set, the route is cut off and 01:30:05.300 |
you have to figure out what the final GPS point would have been. Where are they going? 01:30:13.660 |
Here's the model that they won with. It turns out to be very simple. You take all the metadata 01:30:22.180 |
we just saw and chuck it through the embeddings. You then take the first 5 GPS points and the 01:30:30.340 |
last 5 GPS points and concatenate them together with the embeddings. Chuck them through a 01:30:36.940 |
hidden layer, then through a softmax. This is quite interesting. What they then do is 01:30:44.260 |
they take the result of this softmax and they combine it with clusters. 01:30:50.460 |
Now what are these clusters? They used mean shift clustering, and they used mean shift 01:30:56.460 |
clustering to figure out where are the places people tend to go. So with taxis, people tend 01:31:02.940 |
to go to the airport or they tend to go to the hospital or they tend to go to the shopping 01:31:07.020 |
strip. So using mean shift clustering, they found, I think, about 3,000 cluster centers, each an x, y coordinate. 01:31:20.740 |
However, people don't always go to those 3,000 places. So this is a really cool thing. By 01:31:27.700 |
using a softmax, and then they took the softmax and they multiplied it and took a weighted 01:31:34.900 |
average, using the softmax as the weights and the cluster centers as the thing being averaged. 01:31:42.280 |
So in other words, if they're going to the airport for sure, the softmax will end up 01:31:47.500 |
giving a p of very close to 1 for the airport cluster. On the other hand, if it's not really 01:31:53.300 |
that clear whether they're going to this shopping strip or this movie, then those two cluster 01:32:01.140 |
centers could both have a softmax of about 0.5, and so it's going to end up predicting somewhere in between the two. 01:32:09.100 |
So this is really interesting. They've built a different kind of architecture to anything 01:32:17.260 |
we've seen before, where the softmax is not the last thing we do. It's being used to 01:32:23.260 |
average a bunch of clusters. So this is really smart, because the softmax makes it 01:32:31.820 |
easy for it to pick a specific destination that's very common, but also makes it possible 01:32:37.900 |
for it to predict any destination anywhere by combining a weighted average of a number of cluster centers. 01:32:45.100 |
I think this is really elegant architecture engineering. 01:32:53.540 |
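A rough sketch of the idea only, not the winning code; the input size and the random cluster centres are placeholders standing in for the mean-shift centres:

```python
import numpy as np
import keras.backend as K
from keras.layers import Input, Dense, Lambda
from keras.models import Model

n_clusters = 3000
centres = K.constant(np.random.rand(n_clusters, 2))      # stand-in (x, y) cluster centres

inp = Input(shape=(500,))                                 # embeddings + first/last GPS points
h = Dense(500, activation='relu')(inp)
p = Dense(n_clusters, activation='softmax')(h)            # p_i: how much weight each centre gets
dest = Lambda(lambda w: K.dot(w, centres))(p)             # sum_i p_i * c_i, the weighted average

model = Model(inp, dest)
```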
As for the last 5 GPS points that were given: to create the training set, what they did was they took 01:33:08.220 |
all of the routes and truncated them randomly. So every time they sampled another route, think 01:33:18.540 |
of the data generator. Basically the data generator would randomly slice it off somewhere. 01:33:26.220 |
So this was the last 5 points which we have access to, and the first 5 points. The reason 01:33:33.580 |
it's not all the points is because they're using a standard multilayer perceptron here. 01:33:39.220 |
So the routes are variable length, and also you don't want the input to be too big. 01:33:45.180 |
There's a question. So the prefix is not fed into an RNN, it's just fed into a dense layer? 01:33:51.500 |
Correct. So we just get 10 points, concatenate it together into a dense layer. So surprisingly 01:33:57.940 |
simple. How good was it? Look at the results: 2.14, 2.14, 2.13, 2.13, 2.11, 2.12. Everybody 01:34:14.900 |
is clustered together. One person's a bit better at 2.08, and they're way better at 2.03. 01:34:21.180 |
And then they mentioned in the paper that they didn't actually have time to finish training, 01:34:25.380 |
so when they actually finished training, it was actually 1.87. They won so easily, it's 01:34:33.220 |
not funny. And interestingly in the paper, they actually mentioned the test set was so 01:34:37.380 |
small that they knew the only way they could be sure to win was to make sure they won easily. 01:34:45.100 |
Now because the test set was so small, the leaderboard is actually not statistically 01:34:49.980 |
that great. So they created a custom test set and tried to see if they could find something 01:34:55.820 |
that's even better still on the custom test set. And it turns out that actually an RNN 01:35:00.780 |
is better still. It still would have won the competition, but there's not enough data in 01:35:06.980 |
the Kaggle test set that this is a statistically significant result. In this case it is statistically 01:35:12.540 |
significant. A regular RNN wasn't better, but what they did instead was take an RNN where 01:35:21.420 |
we pass in 5 points at a time into the RNN basically. I think what probably would have 01:35:29.140 |
been even better would be to have had a convolutional layer first and then passed that into an RNN. 01:35:35.060 |
They didn't try it as far as I can see from the paper. Importantly, they used a bidirectional RNN, 01:35:42.660 |
which ensures that the initial points and the last points tend to have more weight, because 01:35:48.760 |
we know that an RNN's state generally reflects things it has seen more recently. So this is the model that did better. 01:35:58.220 |
So our long-suffering intern Brad has been trying to replicate this result. He had at 01:36:04.100 |
least two all-nighters in the last two weeks but hasn't quite managed to yet. So I'm not 01:36:08.220 |
going to show you the code, but hopefully once Brad starts sleeping again he'll be able 01:36:13.140 |
to finish it off and we can show you the notebook during the week on the forum that actually 01:36:21.780 |
It was an interesting process to watch Brad try to replicate this because the vast majority 01:36:30.300 |
of the time in my experience when people say they've tried a model and the model didn't 01:36:34.940 |
work out and they've given up on the model, it turns out that it's actually because they 01:36:38.220 |
screwed something up, not because of the problem with the model. And if you weren't comparing 01:36:44.500 |
to Yoshua Bengio's team's result, knowing that you haven't replicated it yet, at which 01:36:50.740 |
point do you give up and say, "Oh my model's not working" versus saying, "No, I've still 01:36:55.820 |
got bugs!" It's very difficult to debug machine learning models. 01:37:03.540 |
What Brad's actually had to end up doing is literally take the original Bengio team code, 01:37:09.300 |
run it line by line, and then try to replicate it in Keras line by line in literally np.allclose 01:37:16.100 |
every time. Because to build a model like this, it doesn't look that complex, but there's 01:37:22.260 |
just so many places that you can make little mistakes. No normal person will make like 01:37:29.260 |
zero mistakes. In fact, normal people like me will make dozens of mistakes. 01:37:34.340 |
So when you build a model like this, you need to find a way to test every single line of 01:37:39.420 |
code. Any line of code you don't test, I guarantee you'll end up with a bug and you won't know 01:37:43.720 |
you have a bug and there's no way to ever find out you had a bug. 01:37:48.500 |
So we have several questions. One is a note that the p_i * c_i weighting is very similar to what happens 01:37:58.260 |
in the memory network paper. In that case, the output embeddings are weighted by the 01:38:02.300 |
attention probability. It's a lot like a regular attentional language model. 01:38:10.940 |
Can you talk more about the idea you have about first having the convolutional layer 01:38:14.780 |
and passing that to an RNN? What do you mean by that? 01:38:19.300 |
So here is a fantastic paper. We looked at these kind of subword encodings last week 01:38:48.900 |
for language models. I don't know if any of you thought about this and wondered what if 01:38:53.220 |
we just had individual characters. There's a really fascinating paper called Fully Character- 01:39:00.180 |
Level Neural Machine Translation without Explicit Segmentation, from November of last year. They 01:39:08.100 |
actually get fantastic results on just character level, beating pretty much everything, including 01:39:17.460 |
the BPE approach we saw last time. So they looked at lots of different approaches, 01:39:28.620 |
comparing BPE to individual characters, and most of the time the character-level model got the best results. 01:39:36.180 |
Their model looks like this. They start out with every individual character. It goes through 01:39:42.420 |
a character embedding, just like we've used character embeddings lots of times. Then you 01:39:46.620 |
take those character embeddings and you pass it through a one-dimensional convolution. 01:39:53.700 |
I don't know if you guys remember, but in Part 1 of this course, Ben actually had a 01:39:59.020 |
blog post showing how you can do multiple-size convolutions and concatenate them all together. 01:40:05.380 |
So you could use that approach. Or you could just pick a single size. So you end up basically 01:40:11.460 |
scrolling your convolutional window across your sets of characters. So you end up with 01:40:18.460 |
the same number of convolution outputs as you started out with letters, but they're 01:40:23.540 |
now representing the information in a window around that letter. In this case, they then 01:40:31.800 |
did max pooling. So they basically asked, for each window size, say a size 3, a size 4, 01:40:41.340 |
and a size 5, which bits seem to have the highest activations 01:40:49.580 |
around here. Then they took those max pooled things and they put them through a second 01:40:54.580 |
set of segment embeddings. They then put that through something called a highway network 01:40:59.260 |
which the details don't matter too much. It's kind of something like a DenseNet, like we 01:41:03.540 |
learned about last week. This is a slightly older approach than the DenseNet. Then finally 01:41:09.180 |
after doing all that, stick that through an RNN. So the idea here in this model was they 01:41:15.260 |
basically did as much learnt pre-processing as possible, and then finally put that into 01:41:25.220 |
an RNN. Because we've got these max pooling layers, this RNN ends up with a lot less time 01:41:30.540 |
points, which is really important to minimize the amount of processing in the RNN. So I'm 01:41:39.660 |
not going to go into detail on this, but check out this paper because it's really interesting. 01:41:46.720 |
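A very rough sketch of that overall shape only, not the paper's actual model; all the sizes are made up:

```python
from keras.layers import Input, Embedding, Conv1D, MaxPooling1D, GRU, Dense
from keras.models import Model

vocab_size, seq_len = 200, 512

chars = Input(shape=(seq_len,))
x = Embedding(vocab_size, 64)(chars)                        # character embeddings
x = Conv1D(128, 5, padding='same', activation='relu')(x)    # a window around each character
x = MaxPooling1D(pool_size=4)(x)                            # only a quarter of the steps reach the RNN
x = GRU(256)(x)
out = Dense(vocab_size, activation='softmax')(x)

model = Model(chars, out)
```

The point of the sketch is just the ordering: as much learnt pre-processing as possible before the sequence ever reaches the RNN, so the RNN sees far fewer time steps.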
Next question is, for the destinations we would have more error for the peripheral points? 01:41:52.260 |
Are we taking a centroid of clusters? I don't understand that, sorry. All we're doing is 01:42:02.840 |
we're taking the softmax p, multiplying by the cluster centers c, and adding them up. 01:42:10.300 |
I thought the first part was asking that with destinations that are more peripheral, they 01:42:16.060 |
would have higher error because they would be harder to predict this way. 01:42:19.140 |
That's probably true, which is fine, because by definition they're not close to a cluster center, so they're inherently harder to predict. 01:42:26.340 |
Then going back, there was a question on the Rossman example. What does MAPE with neural 01:42:33.480 |
network mean? I would have expected that result to be the same, why is it lower? 01:42:37.980 |
This is just using a one-hot encoding without an embedding layer. We kind of run out of 01:42:50.580 |
time a bit quickly, but I really want to show you this. The students and I have been trying 01:42:58.500 |
to get a new approach to segmentation working, and I finally got it working in the last day 01:43:04.260 |
or two, and I really wanted to show it to you. We talked last week about DenseNet, and 01:43:09.540 |
I mentioned that DenseNet is arse-kickingly good at doing image classification with a 01:43:16.820 |
small number of data points, like crazily good. But I also mentioned that it's the basis 01:43:24.120 |
of this thing called the 100-layer tiramisu, which is an approach to segmentation. 01:43:29.100 |
So segmentation refers to taking a picture, an image, and figuring out where's the tree, 01:43:38.260 |
where's the dog, where's the bicycle and so forth. It seems like we're having a bit of 01:43:45.260 |
a Bengio fan day today, because this is one of his group's papers as well. Let me set the scene. 01:44:00.760 |
So Brendan, one of our students, who many of you have seen a lot of his blog posts, 01:44:06.820 |
he has successfully got a PyTorch version of this working, so I've shared that on our files.fast.ai. 01:44:14.120 |
And I got the Keras version of it working. So I'll show you the Keras version because 01:44:17.700 |
I actually understand it. And if anybody's interested in asking questions about the PyTorch 01:44:22.060 |
version, hopefully Brendan will be happy to answer them during the week. 01:44:27.900 |
So the data looks like this. There's an image, and then there are the labels. So that's basically 01:44:50.620 |
what it looks like. So you can see here, you've got traffic lights, you've got poles, you've 01:44:54.220 |
got trees, buildings, paths, roads. Interestingly, the dataset we're using is something called 01:45:02.940 |
CamVid. The dataset is actually frames from a video. So a lot of the frames look very 01:45:08.820 |
similar to each other. And there's only like 600 frames in total, so there's not a lot of varied 01:45:16.980 |
data in this CamVid dataset. Furthermore, we're not going to do any pre-training. So 01:45:24.340 |
we're going to try and build a state-of-the-art segmentation system on video, which is already 01:45:29.900 |
much lower information content because most of the frames are pretty similar, using just 01:45:34.380 |
600 frames without pre-training. Now if you were to ask me a month ago, I would have told 01:45:38.260 |
you it's not possible. This just seems like an incredibly difficult thing to do. But it turns out it works. 01:45:47.780 |
So I'm going to skip to the answer first. Here's an example of a particular frame we're 01:45:58.220 |
trying to match. Here is the ground truth for that frame. You can see there's a tiny 01:46:07.300 |
car here and a little car here. There are those little cars. There's a tree. Trees are 01:46:12.780 |
really difficult. They're incredibly fine, funny things. And here is my trained model. 01:46:22.060 |
And as you can see, it's done really, really well. It's interesting to look at the mistakes 01:46:28.620 |
it made. This little thing here is a person. But you can see that the person, their head 01:46:37.940 |
looks a lot like a traffic light and their jacket looks a lot like a mailbox. Whereas these tiny 01:46:43.420 |
little people here, it's done perfectly; with this one person it just got a little bit confused. 01:46:55.820 |
Another example of where it's gone wrong is this bit, which should be road; it wasn't quite 01:46:59.700 |
sure what was road and what was footpath, which makes sense because the colors do look 01:47:04.020 |
very similar. But had we pre-trained something, a pre-trained network would have understood 01:47:11.500 |
that crossroads tend to go straight across, they don't tend to look like that. So you 01:47:18.060 |
can kind of see the minor mistakes it made. It also would have learned, had it looked 01:47:24.540 |
at more than a couple of hundred examples of people, that people generally have a particular 01:47:30.420 |
shape. So there's just not enough data for it to have learned some of these things. But 01:47:34.420 |
nonetheless, it is extraordinarily effective. Look at this traffic light, it's surrounded 01:47:41.340 |
by a sign, so the ground truth actually has the traffic light and then a tiny little edge 01:47:46.860 |
of sign, and it's even got that right. So it's an incredibly accurate model. 01:47:56.220 |
So how does it work? And in particular, how does it do these amazing trees? So the answer 01:48:02.340 |
is in this picture. Basically, this is inspired by a model called UNET. Until the UNET model 01:48:15.220 |
came along, everybody was doing these kinds of segmentation models using an approach just 01:48:22.420 |
like what we did for style transfer, which is basically you have a number of convolutional 01:48:30.700 |
layers with max pooling, or with a stride of 2, which gradually make the image smaller 01:48:36.980 |
and smaller, a bigger receptive field. And then you go back up the other side using upsampling 01:48:42.660 |
or deconvolutions until you get back to the original size, and then your final layer is 01:48:49.060 |
the same size as your starting layer and has a bunch of different classes that you're trying 01:48:58.940 |
The problem with that is that you end up with, in fact I'll show you an example. There's 01:49:05.420 |
a really nice paper called ENet. ENet is not only an incredibly accurate model for segmentation, 01:49:14.300 |
but it's also incredibly fast. It actually can run in real time. You can actually run it live. 01:49:20.740 |
But the mistakes it makes, look at this chair. This chair has a big gap here and here and 01:49:26.220 |
here, but ENet gets it totally wrong. And the reason why is because they use a very 01:49:32.460 |
traditional downsampling-upsampling approach. And by the time they get to the bottom, they've 01:49:38.860 |
just lost track of the fine detail. So the trick are these connections here. What we 01:49:47.900 |
do is we start with our input, we do a standard initial convolution, just like we did with 01:49:52.900 |
style transfer. We then have a DenseNet block, which we learned about last week. And then 01:49:59.660 |
that block, we keep going down, we do a MaxPooling type thing, another DenseNet block, MaxPooling 01:50:05.260 |
type thing, keep going down. And then as we go up the other side, so we do a DenseBlock, 01:50:16.260 |
we take the output from the DenseBlock on the way down and we actually copy it over 01:50:23.540 |
to here and concatenate the two together. So actually Brendan a few days ago actually 01:50:32.380 |
drew this on our whiteboard when we were explaining it to Melissa, and so he's shown the sizes at every step. 01:50:37.520 |
We start out with a 224x224 input, it goes through the convolutions with 48 filters, goes 01:50:48.660 |
through our DenseBlock, adds another 80 filters. It then goes through our, they call it a transition 01:50:54.940 |
down, so basically a MaxPooling. So it's now size 112. We keep doing that. DenseBlock, transition 01:51:01.700 |
down, so it's now 56x56, 28x28, 14x14, 7x7. And then on the way up again, we go transition 01:51:10.540 |
up, it's now 14x14. We copy across the results of the 14x14 from the transition down and 01:51:18.380 |
concatenate together. Then we do a DenseBlock, transition up, it's now 28x28, so we copy 01:51:24.580 |
across our 28x28 from the transition down and so forth. 01:51:29.180 |
So by the time we get all the way back up here, we're actually copying across something 01:51:35.740 |
that was originally of size 224x224. It hadn't had much done to it, it had only been through 01:51:44.580 |
one convolutional layer and one DenseBlock, so it hadn't really got much rich computation 01:51:49.380 |
being done. But the thing is, by the time it gets back up all the way up here, the model 01:51:56.300 |
knows pretty much this is a tree and this is a person and this is a house, and it just 01:52:01.900 |
needs to get the fine little details. Where exactly does this leaf finish? Where exactly 01:52:06.500 |
does the person's hat finish? So it's basically copying across something which is very high 01:52:13.260 |
resolution but doesn't have that much rich information, but that's fine because it's really only needed for the fine details. 01:52:21.260 |
So these things here, they're called skip connections. They were really inspired by this paper called 01:52:26.100 |
UNet, which has won many Kaggle competitions. But this is using dense blocks rather than normal convolutional blocks. 01:52:38.500 |
So let me show you. We're not going to have time to go into this in detail, but I've done 01:52:45.940 |
all this coding Keras from scratch. This is actually a fantastic fit for Keras. I didn't 01:52:51.260 |
have to create any custom layers, I didn't really have to do anything weird at all, except for the data augmentation. 01:53:00.980 |
So the data augmentation was we start with 480x360 images, we randomly crop some 224x224 01:53:10.820 |
part, and also randomly we may flip it horizontally. That's all perfectly fine. Keras doesn't really 01:53:19.380 |
have the random crops, unfortunately. But more importantly, whatever we do to the input 01:53:23.940 |
image, we also have to do to the target image. We need to get the same 224x224 crop, and 01:53:32.660 |
So I had to write a data generator, which you guys may actually find useful anyway. 01:53:44.700 |
So this is my data generator. Basically I called it a segment generator. It's just a standard 01:53:52.780 |
generator so it's got a next function. Each time you call next, it grabs some random bunch 01:53:58.220 |
of indexes, it goes through each one of those indexes and grabs the necessary item, grabbing 01:54:06.220 |
a random slice, sometimes randomly flipping it horizontally, and then it's doing this 01:54:11.900 |
to both the x's and the y's, returning them back. 01:54:19.220 |
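A minimal sketch of the key idea: whatever random crop and flip we pick for the image, apply exactly the same one to the label mask. The function name and shapes are assumptions:

```python
import numpy as np

def random_crop_flip(x, y, size=224):
    # Pick one random crop and use it for both the image and its labels.
    r = np.random.randint(0, x.shape[0] - size)
    c = np.random.randint(0, x.shape[1] - size)
    x, y = x[r:r+size, c:c+size], y[r:r+size, c:c+size]
    if np.random.rand() > 0.5:                 # random horizontal flip, applied to both
        x, y = x[:, ::-1], y[:, ::-1]
    return x, y
```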
Along with this segment generator, in order to randomly grab a batch of random indices 01:54:26.220 |
each time, I created this little class called batch indices, which can basically do that. 01:54:33.700 |
And it can have either shuffle true or shuffle false. 01:54:37.660 |
So this pair of classes you guys might find really helpful for creating your own data 01:54:42.980 |
generators. This batch indices class in particular, now that I've written it, you can see how 01:54:48.900 |
it works, right? If I say batch indices from a data set of size 10, I want to grab three 01:54:55.740 |
indices at a time. So then let's grab five batches. Now in this case I've got by default 01:55:02.940 |
shuffle = false, so it just returns 0, 1, 2, then 3, 4, 5, then 6, 7, 8, then 9, I'm finished, go back 01:55:09.100 |
to the start, 0, 1, 2. On the other hand, if I say shuffle = true, it returns them in random 01:55:15.380 |
order but it still makes sure it captures all of them. And then when we're done it starts 01:55:19.940 |
a new random order. So this makes it really easy to create random generators. 01:55:26.580 |
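A sketch of a helper along the lines of the one described: hand back `bs` indices at a time, and reshuffle (or restart) once the dataset has been used up. The exact implementation details are assumptions:

```python
import numpy as np

class BatchIndices:
    def __init__(self, n, bs, shuffle=False):
        self.n, self.bs, self.shuffle = n, bs, shuffle
        self.reset()

    def reset(self):
        self.idxs = np.random.permutation(self.n) if self.shuffle else np.arange(self.n)
        self.curr = 0

    def __next__(self):
        if self.curr >= self.n:
            self.reset()                        # start a new (possibly reshuffled) pass
        res = self.idxs[self.curr:self.curr + self.bs]
        self.curr += self.bs
        return res
```

For example, `next(BatchIndices(10, 3))` gives `array([0, 1, 2])`, the next call gives `array([3, 4, 5])`, and so on.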
So that was the only thing I had to add to Keras to get this all to work. Other than that, 01:55:35.360 |
we wrote the tiramisu. And the tiramisu looks very very similar to the DenseNet that we 01:55:40.620 |
saw last week. We've got all our pieces, the relu, the dropout, the batchnorm, the relu 01:55:50.260 |
on top of batchnorm, the concat layer, so this is something I had to add, my convolution2d 01:55:58.100 |
followed by dropout, and then finally my batchnorm followed by relu followed by convolution2d. 01:56:04.940 |
So this is just the dense block that we saw last week. So a dense block is something where 01:56:10.420 |
we just keep grabbing 12 or 16 filters at a time, concatenating them to the last set 01:56:16.820 |
and doing that a few times. That's what a dense block is. 01:56:23.100 |
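A hedged sketch of a dense block in Keras; the growth rate, dropout and layer count are illustrative, not the notebook's exact values:

```python
from keras.layers import BatchNormalization, Activation, Conv2D, Dropout, concatenate

def bn_relu_conv(x, nf, ks=3):
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = Conv2D(nf, (ks, ks), padding='same')(x)
    return Dropout(0.2)(x)

def dense_block(x, nb_layers=4, growth=16):
    for _ in range(nb_layers):
        new = bn_relu_conv(x, growth)
        x = concatenate([x, new])      # keep stacking `growth` more filters onto everything so far
    return x
```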
So here's something interesting. The original paper for its down sampling, they call it 01:56:31.420 |
transition down, did a 1x1 convolution followed by a max pooling. I actually discovered that 01:56:43.340 |
doing a stride2 convolution gives better results. So you'll see I actually have not followed 01:56:50.140 |
the paper. The one that's commented out here is what the paper did, but this version works better. 01:57:01.060 |
Interestingly though, on the transition up side, do you remember that checkerboard artifacts 01:57:08.140 |
blog post we saw that showed that upsampling2d followed by a convolutional layer works better? 01:57:15.340 |
It does not work better for this. In fact, a deconvolution works better for this. So 01:57:21.780 |
that's why you can see I've got this deconvolution layer. So I thought that was interesting. 01:57:29.660 |
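A sketch of those two resolution changes; kernel sizes and the dropout rate are assumptions: a stride-2 convolution on the way down instead of conv plus max pooling, and a deconvolution (Conv2DTranspose) on the way up rather than upsampling plus conv:

```python
from keras.layers import BatchNormalization, Activation, Conv2D, Conv2DTranspose, Dropout

def transition_down(x, nf):
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = Conv2D(nf, (1, 1), strides=(2, 2), padding='same')(x)   # stride-2 conv does the downsampling
    return Dropout(0.2)(x)

def transition_up(x, nf):
    return Conv2DTranspose(nf, (3, 3), strides=(2, 2), padding='same')(x)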
So basically you can see when I go downsampling a bunch of times, it's basically do a dense 01:57:35.940 |
block and then I have to keep track of my skip connections. So basically keep a list 01:57:41.900 |
of all of those skip connections. So I've got to hang on to all of these. So every one 01:57:47.300 |
of these skip connections I just stick in this little array, appending to it after 01:57:52.300 |
every dense block. So then I keep them all and then I pass them to my upward path. So 01:58:00.620 |
I basically do my transition up and then I concatenate that with that skip connection. 01:58:09.020 |
So that's the basic approach. So then the actual Tiramisu model itself with those pieces 01:58:16.980 |
is less than a screen of code. It's basically just do a 3x3 conv, do my down path, do my 01:58:24.100 |
up path using those skip connections, then a 1x1 conv at the end, and a softmax. 01:58:39.580 |
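A hedged sketch of the down and up paths with the skip connections, reusing the helpers sketched above; block counts and filter numbers are placeholders, not the notebook's exact values:

```python
from keras.layers import concatenate

def down_path(x, n_blocks=5):
    skips = []
    for i in range(n_blocks):
        x = dense_block(x)
        skips.append(x)                          # remember this resolution for the way back up
        x = transition_down(x, nf=64 * (i + 1))
    return x, skips

def up_path(x, skips):
    for skip in reversed(skips):
        x = transition_up(x, nf=128)
        x = concatenate([x, skip])               # the skip connection: concatenate, don't add
        x = dense_block(x)
    return x
```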
So these dense nets, and indeed this fully convolutional dense net or this Tiramisu model, 01:58:47.460 |
they actually take quite a long time to train. They don't have very many parameters, which 01:58:51.460 |
is why I think they work so well with these tiny datasets. But they do still take a long 01:58:55.860 |
time to train. Each epoch took a couple of minutes, and in the end I had to do many hundreds 01:59:02.740 |
of epochs. And I was also doing a bunch of learning rate annealing. So in the end this 01:59:09.580 |
kind of really had to train overnight, even though I had only about 500-600 frames. But 01:59:28.620 |
in the end I got a really good result. I was a bit nervous at first. I was getting this 01:59:40.620 |
87.6% accuracy. In the paper they were getting 90% plus. It turns out that 3% of the pixels 01:59:48.060 |
are marked as void. I don't know why they're marked as void, but in the paper they actually 01:59:52.780 |
remove them. So you'll see when you get to my results section, I've got this bit where 01:59:56.980 |
I remove those void ones. And I ended up with 89.5%. None of us in class managed to replicate 02:00:08.940 |
the paper. The paper got 91.5% or 91.2%. We tried the lasagna code they provided. We tried 02:00:18.740 |
Brendan's PyTorch. We tried my Keras. Even though we couldn't replicate their result, 02:00:25.700 |
this is still better than any other result I've found. So this is still super accurate. 02:00:33.020 |
A couple of quick notes about this. First is they tried training also on something called 02:00:40.980 |
the GATech dataset, which is another video dataset. The degree to which this is an amazing 02:00:47.160 |
model is really clear here. This 76% is from a model which is specifically built for video, 02:00:55.580 |
so it actually includes the time component, which is absolutely critical, and it uses 02:01:00.100 |
a pre-trained network, so it's used like a million images to pre-train, and it's still 02:01:05.780 |
not as good as this model. So that is an extraordinary comparison. 02:01:12.620 |
This is the CamVid comparison. Here's the model we were just looking at. Again, I actually 02:01:18.780 |
looked into this. I thought 91.5%, whereas this one here 88%, wow, it actually looks 02:01:26.860 |
like it's not that much better. I'm really surprised. Like even tree, I really thought 02:01:31.940 |
it should win easily on tree, but it doesn't win by very much. So I actually went back 02:01:37.000 |
and looked at this paper, and it turns out that the authors of the paper (this 02:01:42.180 |
is the paper, by the way, the model that they're comparing to) 02:01:49.060 |
actually trained on crops of 852x852, so they actually used a way higher resolution image 02:01:57.500 |
to start with. You've got to be really careful when you read these comparisons. Sometimes 02:02:04.500 |
people actually shoot themselves in the foot, so these guys were comparing their result 02:02:08.660 |
to another model that was using like twice as big a picture. So again, this is actually 02:02:16.100 |
way better than they actually made it look like. 02:02:19.580 |
Another one, like this one here, this 88 also looks impressive. But then I looked across 02:02:24.300 |
here and I noticed that the Dilation8 model is like way better than this model on every 02:02:32.940 |
single category, way better. And yet somehow the average is only 0.3 better, and I realized 02:02:39.060 |
this actually has to be an error. So this model is actually a lot better than this table suggests. 02:02:54.260 |
So I briefly mentioned that there's a model which doesn't have any skip connections called 02:03:00.860 |
ENet, which is actually better than the tiramisu on everything except for tree. But on the 02:03:08.460 |
tree it's terrible. It's 77.8 versus, oh hang on, 77.3. That's not right. I take that back. 02:03:24.020 |
I'm sure it was less good than this model, but now I can't find that data. Anyway, the 02:03:32.540 |
reason I wanted to mention this is that Eugenio is about to release a new model which combines 02:03:43.040 |
these approaches with skip connections. It's called LinkNet. So keep an eye on the forum for that. 02:04:00.940 |
We've got a few more questions; let's answer them on the forum. I actually wanted to talk about this briefly. A lot of 02:04:10.380 |
you have come up to me and been like, "We're finishing! What do we do now?" The answer is 02:04:20.300 |
we have now created a community of all these people who have spent well over 100 hours 02:04:26.020 |
working on deep learning for many, many months and have built their own boxes and written 02:04:32.380 |
blog posts and done all kinds of stuff, set up social impact talks, written articles in 02:04:44.460 |
Forbes. Okay, this community is happening. It doesn't make any sense in my opinion for 02:04:51.540 |
Rachel and I to now be saying here's what happens next. So just like Elena has decided, 02:04:59.220 |
"Okay, I want a book club." So she talked to Mindy and we now have a book club and a 02:05:03.660 |
couple of months time, you can all come to the book club. 02:05:08.260 |
So what's next? The forums will continue forever. We all know each other. Let's do good shit. 02:05:20.980 |
Most importantly, write code. Please write code. Build apps. Take your work projects and 02:05:33.300 |
try doing them with deep learning. Build libraries to make things easier. Maybe go back to stuff 02:05:39.780 |
from part 1 of the course and look back and think, "Why didn't we do it this other way? 02:05:44.340 |
I can make this simpler." Write papers. I showed you that amazing result of the new style transfer 02:05:53.060 |
approach from Vincent last week. Hopefully that might turn into a paper. Write blog 02:05:58.700 |
posts. In a few weeks' time, all the MOOC guys are going to be coming through and doing part 02:06:06.740 |
2 of the course. So help them out on the forum. Teaching is the best way to learn yourself. 02:06:13.820 |
I really want to hear the success stories. People don't believe that what you've done 02:06:22.260 |
is possible. I know that because as recently as yesterday, the highest ranked 02:06:31.140 |
Hacker News comment on a story about deep learning was about how it's pointless trying 02:06:35.860 |
to do deep learning unless you have years of mathematical background and you know C++ 02:06:41.020 |
and you're an expert in machine learning techniques across the board and otherwise there's no 02:06:45.620 |
way that you're going to be able to do anything useful on a real-world project. That, today, 02:06:50.620 |
is what everybody believes. We now know that's not true. Rachel and I are going to start 02:07:00.420 |
up a podcast where we're going to try to help deep learning learners. But one of the 02:07:08.820 |
key things you want to do is tell your stories. So if you've done something interesting at 02:07:14.660 |
work or you've got an interesting new result or you're just in the middle of a project 02:07:19.460 |
and it's kind of fun, please tell us. Either on the forum or private message or whatever. 02:07:26.100 |
Please tell us because we really want to share your story. And if it's not a story yet, tell 02:07:33.460 |
us enough that we can help you and that the community can help you. Get together, at the 02:07:43.620 |
book club; or if you're watching this on the MOOC, organize other people in your geography 02:07:49.260 |
to get together and meet up, or at your workplace. In this group here I know we've got people 02:07:55.900 |
from Apple and Uber and Airbnb who started doing this in kind of lunchtime MOOC chats 02:08:03.660 |
and now they're here at this course. Yes, Rachel? 02:08:06.740 |
I also wanted to recommend it would be great to start meetups to help lead other people 02:08:12.220 |
through, say, part one of the course, kind of assist them going through it. 02:08:17.940 |
So Rachel and I really just want to spend the next 6 to 12 months focused on supporting 02:08:30.140 |
your projects. So I'm very interested in working on this lung cancer stuff, but I'm also interested 02:08:41.220 |
in every project that you guys are working on. I want to help with that. I also want 02:08:46.220 |
to help people who want to teach this. So Yannette is going to go from being a student 02:08:52.900 |
to a teacher hopefully soon. We'll be teaching USF students about deep learning and hopefully 02:08:57.880 |
the next batch of people about deep learning. Anybody who's interested in teaching, let 02:09:02.340 |
us know. The best high-leverage activity is to teach the teachers. 02:09:10.820 |
So yeah, I don't know where this is going to end up, but my hope is really that basically 02:09:17.300 |
I would say the experiment has worked. You guys are all here. You're reading papers, 02:09:23.300 |
you're writing code, you're understanding the most cutting edge research level deep 02:09:29.040 |
learning that exists today. We've gone beyond some of the cutting edge research in many 02:09:34.780 |
situations. Some of you have gone beyond the cutting edge research. So yeah, let's build 02:09:42.180 |
from here as a community and anything that Rachel and I can do to help, please tell us, 02:09:48.700 |
because we just want you to be successful and the community to be successful. 02:09:55.260 |
So will you still be active in the forums? Very active. My job is to make you guys successful. 02:10:05.660 |
So thank you all so much for coming, and congratulations to all of you.