Back to Index

Lesson 14: Cutting Edge Deep Learning for Coders


Chapters

0:00 Introduction
0:45 Time Series Data
9:50 Neural Network
12:48 Continuous Variables
15:45 Projection Method
17:59 Importing
19:18 Basic Tables
20:01 pandas
23:49 pandas summary
25:19 data cleaning
25:57 joining tables
27:36 categorical variables
30:28 pandas indexing
37:41 join
43:07 time series feature manipulation
48:52 time until event
49:33 index
50:57 rolling
52:22 final result

Transcript

Okay, so welcome to lesson 14, the final lesson for now. And we'll talk at the end about what's next. As you can see from what's increasingly been happening, what comes next is very much about you with us rather than us leading you or telling you. So we're a community now and we can figure this stuff out together, and obviously USF is a wonderful ally to have.

So for now, this is the last of these lessons. One of the things that was great to see this week was this terrific article in Forbes that talked about deep learning education and it was written by one of our terrific students, Maria. It focuses on the great work of some of the students that have come through this course.

So I wanted to say thank you very much and congratulations on this great article. I hope everybody would check it out. Very beautifully written as well and terrific stories, I found it quite inspiring. So today we are going to be talking about a couple of things, but we're going to start with time series and structured data.

Time series, I wanted to start very briefly by talking about something which I think you basically know how to do. This is a fantastic paper, but because it is not by DeepMind, nobody's heard of it. It actually comes from the Children's Hospital of Los Angeles. Believe it or not, perhaps the epicenter of practical applied AI in medicine is in Southern California, and specifically Southern California pediatrics: the Children's Hospital of Orange County, CHOC, and the Children's Hospital of Los Angeles, CHLA.

CHLA, which this paper comes from, actually has this thing they call V-PICU, the Virtual Pediatric Intensive Care Unit, where for many, many years they've been tracking every electronic signal about how every patient, every kid in the hospital was treated and what all their ongoing sensor readings are. One of the extraordinary things they do is when the doctors there do rounds, data scientists come with them.

I don't know anywhere else in the world that this happens. And so a couple of months ago they released a draft of this amazing paper where they talked about how they pulled out all this data from the EMR and from the sensors and attempted to predict patient mortality. The reason this is interesting is that when a kid goes into the ICU, if a model starts saying this kid is looking like they might die, then that's the thing that sets the alarms going and everybody rushes over and starts looking after them.

And they found that they built a model that was more accurate than any existing model; those existing models were built on many years of deep clinical input. They used an RNN. Now this kind of time series data is what I'm going to refer to as signal-type time series data.

So let's say you've got a series of blood pressure readings. So they come in and their blood pressure is kind of low and it's kind of all over the place and then suddenly it shoots up. And then in addition to that, maybe there's other readings such as at which points they receive some kind of medical intervention, there was one here and one here and then there was like six here and so forth.

So these kinds of things, generally speaking, the state of health at time t is probably best predicted by all of the various sensor readings at t-minus-1 and t-minus-2 and t-minus-3. So in statistical terms we would refer to that as autocorrelation. Autocorrelation means correlation with previous time periods. For this kind of signal, I think it's very likely that an RNN is the way to go.

Obviously you could probably get a better result using a bidirectional RNN, but that's not going to be any help in the ICU because you don't have the future time period sensors. So be careful of this data leakage issue. And indeed this is what this team at the V-PICU at Children's Hospital of Los Angeles did: they used an RNN to get this state-of-the-art result.

I'm not really going to teach you more about this because basically you already know how to do it. You can check out the paper, you'll see there's almost nothing special. The only thing which was quite clever was that their sensor readings were not necessarily equally spaced. For example, did they receive some particular medical intervention?

Clearly they're very widely spaced and they're not equally spaced. So rather than just feeding the RNN a sequence of interventions, they actually feed it two things. One is the signal, and the other is the time since the last signal was read. So each point in the RNN is basically some function f.

It's receiving two things: the signal at time t and the value of t itself. What is the time? Or the difference in time, how long was it since the last one? That doesn't require any different deep learning, that's just concatenating one extra thing onto your vector. They actually show mathematically that this makes a certain amount of sense as a way to deal with this, and then they find empirically that it does actually seem to work pretty well.
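
To make that concrete, here's a minimal sketch of the idea, not code from the paper: each timestep's input is just the signal value with the time-since-last-reading concatenated on (the function name and example numbers are mine).

```python
import numpy as np

def to_rnn_input(readings):
    """readings: list of (timestamp, value) pairs, irregularly spaced in time."""
    times, values = zip(*readings)
    times = np.array(times, dtype=float)
    deltas = np.diff(np.r_[times[0], times])      # time since the previous reading (0 for the first)
    # each timestep becomes [signal value, time since last reading]
    return np.stack([np.array(values, dtype=float), deltas], axis=-1)   # shape (timesteps, 2)

x = to_rnn_input([(0.0, 120.0), (1.5, 118.0), (6.0, 135.0)])
# x can now be fed to an RNN as a sequence with two features per timestep
```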

I can't tell you whether this is state-of-the-art for anything because I just haven't seen deep comparative papers or competitions or anything that really have this kind of data, which is weird because a lot of the world runs on this kind of data. Everybody with this kind of data thinks it's super valuable: if you're an oil and gas company, what's the drill head telling you, what are the signals coming out of the pipe telling you, and so on and so forth.

There we go. It's not the kind of cool thing that the Google kids work on, so who knows. So I'm not going to talk more about that. That's how you can do time series with this kind of signal data. You can also incorporate all of the other stuff we're about to talk about, which is the other kind of time series data.

For example, there was a Kaggle competition which was looking at forecasting sales for each store at this big company in Europe called Rossman, based on the date and what promotions are going on and what the competitors are doing and so forth. Data like that tends to be seasonal, or maybe it will have some kind of trend to it.

So these kinds of seasonal time series are very widely analyzed by econometricians. They're everywhere, particularly in business, if you're trying to predict how many widgets you have to buy next month, or whether to increase or decrease your prices, or all kinds of operational type things tend to look like this.

How full your planes are going to be, whether you should add promotions, so on and so forth. So it turns out that the state of the art for this kind of problem is not necessarily to use an RNN. I'm actually going to look at the third place result from this competition because the third place result was nearly as good as places 1 and 2, but way, way, way simpler.

And also it turns out that there's stuff that we can build on top of for almost every model of this kind. And basically, surprise, surprise, it turns out that the answer is to use a neural network. So I need to warn you again, what I'm going to teach you here is very, very uncool.

You'll never read about it from DeepMind or OpenAI. It doesn't involve any robot arms, it doesn't involve thousands of GPUs, it's the kind of boring stuff that normal companies use to make more money or spend less money or satisfy their customers. So I apologize deeply for that. Having said that, in the 25 or more years I've been doing machine learning work applied in the real world, 98% of it has been this kind of data.

Whether it be when I was working in agriculture -- I've worked in wool, macadamia nuts and rice -- we were figuring out how full our barrels were going to be and whether we needed more, or figuring out how to set futures market prices for agricultural goods, whatever. I've worked in mining and brewing, which required analyzing all kinds of engineering data and sales data.

I've worked in banking, that required looking at transaction account pricing and risk and fraud. All of these areas basically involve this kind of data. So I think although no one publishes stuff about this, because anybody who comes out of a Stanford PhD and goes to Google doesn't know about any of those things, it's probably the most useful thing for the vast majority of people.

And excitingly it turns out that you don't need to learn any new techniques at all. In fact, the model that they got this third-place result with, a very simple model, is basically one where each different categorical variable was one hot encoded and chucked into an embedding layer. The embedding layers were concatenated and chucked through a dense layer, then a second dense layer, then it went through a sigmoid function into an output layer.

Very simple. The continuous variables they haven't drawn here; all these pictures come straight from the paper which these folks that came third kindly wrote about their approach. The continuous variables basically get fed directly into the dense layer. So that's the structure of the model.

How well does it work? So the short answer is, compared to K-nearest neighbors, random forests and GBMs, just a simple neural network beats all of those approaches, just with standard one-hot encoding, whatever. But then EE stands for Entity Embeddings. So adding in this idea of using embeddings, interestingly you can take the embeddings trained by a neural network and feed them into a KNN, or a random forest, or a GBM.

And in fact, using embeddings with every one of those things is way better than anything other than neural networks. So that's pretty interesting. And then if you use the embeddings with a neural network, you get the best results still. So this actually is kind of fascinating, because training this neural network took me some hours on a Titan X, whereas training the GBM took I think less than a second.

It was so fast, I thought I had screwed something up. And then I tried running it, and it's like holy shit, it's giving accurate predictions. So GBMs and random forests are so fast. So in your organization, you could try taking everything that you could think of as a categorical variable and once a month train a neural net with embeddings, and then store those embeddings in a database table and tell all of your business users, "Hey, anytime you want to create a model that incorporates day of week or a store ID or a customer ID, you can go grab the embeddings." And so they're basically like word vectors, but they're customer vectors and store vectors and product vectors.

So I've never seen anybody write about this other than this paper. And even in this paper they don't really get to this hugely important idea of what you could do with these embeddings. What's the difference between A and B and C? Is it like different data types flowing in?

A and B and C, yeah, we're going to get to that in a moment. Basically the difference is that A might be the store ID, B might be the product ID, and C might be the day of week. One of the really nice things that they did in this paper was to then draw some projections of some of these embeddings.

They just used t-SNE; it doesn't really matter what the projection method is, but here are some interesting ideas. They took each state of Germany that the stores are based in, and did a projection of the embeddings from the state field. And here are those projections. And I've drawn around them different colored circles, and you might notice the different colored circles exactly correspond to the different colored circles on a map of Germany.

Now these were just randomly initialized embeddings trained with SGD, trying to predict sales in stores at Rossman, and yet somehow they've drawn a map of Germany. So obviously the reason why is because things close to each other in Germany have similar behaviors around how they respond to events, and who buys what kinds of products, and so on and so forth.

So that's crazy fascinating. Here's the kind of bigger picture. Every one of these dots is the distance between two stores, and this shows the correlation between the distance in embedding space versus the actual distance between the stores. So you can basically see that there's a strong correlation between things being close to each other in real life and close to each other in these SGD-trained embeddings.

Here's a couple more pictures; the lines drawn on top are mine, but everything else is straight from the paper. On the left is the days of the week embedding. And you can see the days of the week that are near each other have ended up embedded close together.

On the right is months of the year embedding, again, same thing. And you can see that the weekend is fairly separate. So that's where we're going to get to. I'm actually going to take you through the end-to-end process, and I rebuilt the end-to-end process from scratch and tried to make it in as few lines as possible because we just haven't really looked at any of these structured data type problems before.

So it's kind of a very different process and even a different set of techniques. We import the usual stuff. When you try to do this stuff yourself, you'll find three or four libraries we haven't used before. So when you hit something that says "module not found", you can just pip install all these things if you're using Python.

We'll talk about them as we get to them. So the data that comes from Kaggle comes down as a bunch of CSV files, and I wrote a quick thing to combine some of those CSVs together. This was one of those competitions where people were allowed to use additional external data as long as they were shared on the forum.

So the data I'll share with you, I'm going to combine it all into one place for you. So I've commented these lines out, because the stuff I'll give you will have already run this concatenation process. So the basic tables that you're going to get access to are: the training set itself, a list of stores, a list of which state each store is in, a list of the abbreviation and the name of each state in Germany, and a list of data from Google Trends.

So if you've used Google Trends you can basically see how particular keywords change over time. I don't actually know which keywords they used, but somebody found that there were some Google Trends keywords that correlated well, so we've got access to those, some information about the weather, and then a test set.

So I'm not sure that we've really used pandas much yet, so let's talk a bit about pandas. Pandas lets us take this kind of structured data and manipulate it in similar ways to the way you would manipulate it in a database. So the first thing you do: just as NumPy tends to become np, pandas tends to become pd.

So pd.read_csv is going to return a data frame. So a data frame is like a database table if you've used R, it's called the same thing. So this read_csv is going to return a data frame containing the information from this CSV file, and we're going to go through each one of those table names and read the CSV.
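
In code, that step looks roughly like this; the file names here are illustrative rather than the exact ones from the Kaggle download.

```python
import pandas as pd

# hypothetical file names -- the real ones come from the Kaggle download
table_names = ['train', 'store', 'store_states', 'state_names',
               'googletrend', 'weather', 'test']
tables = [pd.read_csv(f'{name}.csv', low_memory=False) for name in table_names]

for t in tables:
    print(t.head())   # first five rows of each table
```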

So this list comprehension is going to return a list of data frames. So I can now go ahead and display the head, so the first five rows from each table. And that's a good way to get a sense of what these tables are. So the first one is the training set.

So for some store, on some date, they had some level of sales to some number of customers. And they were either open or closed, they either had a promotion on or they didn't, it either was a holiday or it wasn't for state and school, and then some additional information about the date.

So that's the basic information we have. And then everything else, we join onto that. So for example, for each store, we can join up some kind of categorical variable about what kind of store it is. I have no idea what this is, it might be a different brand or something.

What kinds of products do they carry, again it's just a letter, I don't know what it means, but maybe it's like some are electronics, some are supermarkets, some are full spectrum. How far away is the nearest competitor? And what year and month did the competitor open for business? Notice that sometimes the competitor opened for business quite late in the game, like later than some of the data we're looking at, so that's going to be a little bit confusing.

And then this thing called Promo2, which as far as I understand it is basically is this a store which has some kind of standard promotion timing going on. So you can see here that this store has standard promotions in January, April, July and October. So that's the stores. We also know for each store what state they're in based on the abbreviation, and then we can find out for each state what is the name of that state.

And then for each, this is slightly weird, this is the state abbreviation, the last two letters. In this state, during this week, this was the Google Trend data for some keyword, I'm not sure what keyword it was. For this state name, on this date is the temperature, dewpoint, and so forth.

And then finally here's the test set. It's identical to the training set, but we don't have the number of customers, and we don't have the number of sales. So this is a pretty standard kind of industry data set. You've got a central table, various tables related to that, and some things representing time periods or time points.

One of the nice things you can do in Pandas is to use this pandas-summary module: you call DataFrameSummary on a table and ask for its summary, and that will return a whole bunch of information about every field. So I'm not going to go through all of it in detail, but you can see for example for the sales, on average 5,800 sales, standard deviation of 3,800; sometimes the sales go all the way down to 0, sometimes all the way up to 41,000.

There are no missing values for sales, that's good to know. So this is the kind of thing that's good to scroll through and identify, okay, competition open since month is missing about a third of the time, that's good to know. There's 12 unique states, that might be worth checking because there's actually 16 things in our state table for some reason.

Google trend data is never missing, that's good. The year goes from 2012 through 2015. The weather data is never missing. And then here's our test set. This is the kind of thing that might screw up a model, it's like actually sometimes the test set is missing the information about whether that store was open or not, so that's something to be careful of.
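
As a rough sketch, that summary call looks something like this, assuming the pandas-summary package and the tables list from above:

```python
from pandas_summary import DataFrameSummary

# per-column stats: mean, std, min, max, missing counts, number of uniques
for t in tables:
    print(DataFrameSummary(t).summary())
```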

So we can take that list of tables and just de-structure it out into a whole bunch of different table names, find out how big the training set is, how big the test set is. And then with this kind of problem, there's going to be a whole bunch of data cleaning and a whole bunch of feature engineering.

So neural nets don't make any of that go away, particularly because we're using this style of neural net where we're basically feeding in a whole bunch of separate continuous and categorical variables. To simplify things a bit, I turn state holidays into Booleans, and then I'm going to join all of these tables together.

I always use a default join type of an outer join, so you can see here this is how we join in pandas. We say table.merge, table2, and then to make a left outer join, how equals left, and then you say what's the name of the fields that you're going to join on the left hand side, what are the fields you're going to join on the right hand side, and then if both tables have some fields with the same name, what are you going to suffix those fields with?

So on the left hand side we're not going to add any suffix, on the right hand side we'll put in _y. So again, I try to refactor things as much as I can, so we're going to join lots of things. Let's create one function to do the joining, and then we can call it lots of times.
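
A sketch of such a join helper, along the lines just described (the name join_df is mine):

```python
def join_df(left, right, left_on, right_on=None, suffix='_y'):
    """Left outer join that keeps the left-hand column names and suffixes clashes on the right."""
    if right_on is None:
        right_on = left_on
    return left.merge(right, how='left',
                      left_on=left_on, right_on=right_on,
                      suffixes=("", suffix))
```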

Question - Were there any fields that referred to the same value but were named differently? Not that I saw, no. It wouldn't matter too much if there were; when we run the model it wouldn't be a problem. Question - Would you liken the use of embeddings from a neural network to extraction of implicit features, or can we think of it more like what a PCA would do, like dimensionality reduction?

Let's talk about it more when we get there. Basically, when you deal with categorical variables in any kind of model, you have to decide what to do with them. One of my favorite data scientists, or a pair of them actually, who are very nearly neighbors of Rachel's and mine, have this fantastic R package called Vtreat, which has a bunch of state-of-the-art approaches to dealing with stuff like categorical variable encoding.

The obvious way to do categorical variable encoding is to just do a one-hot encoding, and that's the way nearly everybody puts it into their gradient-boosting machines or random forests or whatever. One of the things that Vtreat does is it has some much more interesting techniques. For example, you could look at the univariate mean of sales for each day of week, and you could encode day of week using a continuous variable which represents the mean of sales.

But then you have to think about, "Would I take that mean from the training set or the test set or the validation set? How do I avoid overfitting?" There are all kinds of complex statistical subtleties to think about, and Vtreat handles all this stuff automatically. There's a lot of great techniques, but they're kind of complicated and in the end they tend to make a whole bunch of assumptions about linearity or univariate correlations or whatever.

Whereas with embeddings, we're using SGD to learn how to deal with it, just like we do when we build an NLP model or a collaborative filtering model. We provide some initially random embeddings and the system learns how the movies vary compared to each other, or users vary, or words vary, or whatever.

This is to me the ultimate pure technique. Of course the other nice thing about embeddings is we get to pick the dimensionality of the embedding so we can decide how much complexity and how much learning are we going to put into each of the categorical variables. We'll see how to do that in a moment.

One complexity was that the weather uses the name of the state rather than the abbreviation of the state, so we can just go ahead and join weather to states to get the abbreviation. The Google Trend information about the week, week from A to B, we can split that apart.

You can see here one of the things that happens in the Google Trend data is that one of the states is called NI, whereas in the rest of the data it's called HB,NI. So this is a good opportunity to learn about pandas indexing. Most of the time you want to use this .ix method.

And the .ix method is your general indexing method. It's going to take two things, a list of rows to select and a list of columns to select. You can use it in pretty standard intuitive ways. This is a lot like numpy. This here is going to return a list of Booleans, which things are in this state.

And if you pass the list of Booleans to the pandas row selector, it will just return the rows where that Boolean is true. So therefore this is just going to return the rows from Google Trend, where googletrend.state is ni. And then the second thing we pass in is a list of columns, in this case we just got one column.

And one very important thing to remember, again just like numpy, you can put this kind of thing on the left-hand side of an equal sign. In computer science we call this an L-value, so you can use it as an L-value. So we can take this state field, for the rows which are equal to NI, and change their value to the other name.
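
Using .loc rather than the deprecated .ix (as noted a bit further on), that assignment looks something like this; the column name and values follow the description above:

```python
# rows of googletrend whose State is 'NI' get that value overwritten with 'HB,NI'
googletrend.loc[googletrend.State == 'NI', 'State'] = 'HB,NI'
```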

So this is like a very nice simple technique that you'll use all the time in pandas, both for looking at things and for changing things. We have a few questions. One is, in this particular example, do you think the granularity of the data matter, as in per day or per week, is one better than the other?

Yeah, I mean I would want to have the lower granularity so that I can capture that, ideally both at one time as well. It kind of depends on how the organization is going to use it. What are they going to do with this information? It's probably for purchasing and stuff, so maybe they don't care about an hourly level.

Clearly the difference between Sunday sales and Wednesday sales will be quite significant. This is mainly a kind of business context or domain understanding question. Another question is, do you know if there's any work that compares for structured data, supervised embeddings like these, to embeddings that come from an unsupervised paradigm such as an autoencoder?

It seems like you'd get more useful for prediction embeddings with the former case, but if you wanted general purpose embeddings you might prefer the latter. Yeah, I think you guys are aware of my feelings about autoencoders. It's like giving up on life. You can always come up with a loss function that's more interesting than an autoencoder loss function basically.

I would be very surprised if embeddings that came from a sales model were not more useful for just about everything than something that came from an unsupervised model. These things are easily tested, and if you do find a model that they don't work as well with, then you can come up with a different set of supervised embeddings for that model.

There's also just a note that .ix is deprecated and we should use .loc instead. I was going to mention Pandas is changing a lot. Because I've been running this course I have not been keeping track of the recent versions of Pandas, so thank you. In Pandas there's a whole page called Advanced Indexing Methods.

I don't find the Pandas documentation terribly clear to be honest, but there is a fantastic book by the author of Pandas called Python for Data Analysis. There is a new edition out, and it covers Pandas, NumPy, Matplotlib, whatever. That's the best way by far to actually understand Pandas because the documentation is a bit of a nightmare and it keeps changing so the new version has all the new stuff in it.

With these kind of indexing methods, Pandas tries really hard to be intuitive, which means that quite often you'll read the documentation for these methods and it will say if you pass it a Boolean it will behave in this way, if you pass it a float it will behave this way, if it's an index it's this way unless this other thing happens.

I don't find it intuitive at all because in the end I need to know how something works in order to use it correctly and so you end up having to remember this huge list of things. I think Pandas is great, but this is one thing to be very careful of, is to really make sure you understand how all these indexing methods actually work.

I know Rachel's laughing because she's been there and probably laughing in disgust at what we all have to go through. Another question, when you use embeddings from a supervised model in another model, do you always have to worry about data leakage? I think that's a great point. I don't think I've got anything to add to that.

You can figure out easily enough if there's data leakage. So there's this kind of standard set of steps that I take for every single structured machine learning model I do. One of those is every time I see a date, I always do this. I always create four more fields, the year, the month of year, the week of year and the day of week.

This is something which should be automatically built into every data loader, I feel. It's so important because these are the kinds of structures that you see, and once every single date has got this added to it, you're doing great. So you can see that I add that into all of my tables that have a date field, so we'll have that from now on.
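
A minimal version of that date-expansion step might look like this; it assumes the column parses as a pandas datetime, and newer pandas where week-of-year comes from isocalendar():

```python
def add_date_features(df, fldname='Date'):
    """Expand a date column into year, month, week-of-year and day-of-week columns."""
    fld = pd.to_datetime(df[fldname])
    df['Year'] = fld.dt.year
    df['Month'] = fld.dt.month
    df['Week'] = fld.dt.isocalendar().week.astype(int)   # fld.dt.week on older pandas
    df['DayOfWeek'] = fld.dt.dayofweek
    return df
```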

So now I go ahead and do all of these outer joins. You'll see that the first thing I do after every outer join is check whether the thing I just joined with has any nulls. Even if you're sure that these things match perfectly, I would still never ever do an inner join.

Do the outer join and then check for nulls, and that way if anything changes ever or if you ever make a mistake, one of these things will not be zero. If this was happening in a production process, this would be an assert. This would be emailing Henry at 2am to say something you're relying on is not working the way it was meant to.
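
So each join gets followed by a check along these lines (the column names here are just an example):

```python
joined = join_df(train, store, 'Store')
# if the right-hand table failed to match anything, this count will not be zero
assert len(joined[joined.StoreType.isnull()]) == 0
```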

So that's why I always do it this way. So you can see I'm basically joining my training set to everything else until it's all in there together in one big thing. So that table "everything joined together" is called "joined", and then I do a whole bunch more thinking -- well, I didn't do the thinking, the people that placed in this competition did, and I replicated their results from scratch -- about what are all the other things you might want to do with these dates.

So competition open, we noticed before, a third of the time they're empty. So we just fill in the empties with some kind of sentinel value because a lot of machine learning systems don't like missing values. Fill in the missing months with some sentinel value. Again, keep on filling in missing data.

So fill_na is a really important thing to be aware of. I guess the answer is yes, it is a problem. In this case, I happen to know that every time a year is empty, a month is also empty, and we only ever use both of them together. So we don't really care when the competition store was opened, what we really care about is how long is it between when they were opened and the particular row that we're looking at.

The sales on the 2nd of February 2014, how long was it between 2nd of February 2014 and when the competition opened. So you can see here we use this very important .apply function which just runs a Python function on every row of a data frame. In this case, the function is to create a new date from the open-since year and the open-since month.

We're just going to assume that it's the middle of the month. That's our competition open-since, and then we can get our days opened by just doing a subtract. In pandas, every date field has this special magical dt property, which is where the day, month, year, all that stuff sits.

Sometimes, as I mentioned, the competition actually opened later than the particular observation we're looking at. So that would give us a negative, so we replace our negatives with zero. We're going to use an embedding for this, so that's why we replace days open with months open so we have less values.

I didn't actually try replacing this with a continuous variable. I suspect it wouldn't make too much difference, but this is what they do. In order to make the embedding again not too big, they replaced anything that was bigger than 2 years with 2 years. So there's our unique values.
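
Roughly, that sequence of steps could look like the following sketch; the column names follow the Kaggle data, but treat the details as assumptions rather than the competitors' exact code:

```python
import datetime

# approximate "competition open since" date, assuming the middle of the month
# (the sentinel-filled year/month values need to be valid integers for this to run)
joined['CompetitionOpenSince'] = joined.apply(
    lambda row: datetime.datetime(int(row.CompetitionOpenSinceYear),
                                  int(row.CompetitionOpenSinceMonth), 15), axis=1)
joined['CompetitionDaysOpen'] = (joined.Date - joined.CompetitionOpenSince).dt.days

# a competitor that opened after this row's date would give a negative, so floor at zero
joined.loc[joined.CompetitionDaysOpen < 0, 'CompetitionDaysOpen'] = 0

# coarser buckets for the embedding: months open, capped at two years
joined['CompetitionMonthsOpen'] = (joined.CompetitionDaysOpen // 30).clip(upper=24)
print(joined.CompetitionMonthsOpen.unique())
```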

Every time we do something, print something out to make sure the thing you thought you did is what you actually did. It's much easier if we're using Excel because you see straight away what you're doing. In Python, this is the kind of stuff that you have to really be rigorous about checking your work at every step.

When I build stuff like this, I generally make at least one error in every cell, so check carefully. Okay, do the same thing for the promo days, turn those into weeks. So that's some basic pre-processing, you get the idea of how pandas works hopefully. So the next thing that they did in the paper was a very common kind of time series feature manipulation, one to be aware of.

They basically wanted to say, "Okay, every time there's a promotion, every time there's a holiday, I want to create some additional fields for every one of our training set rows," which is on a particular date. "On that date, how long is it until the next holiday? How long is it until the previous holiday?

How long is it until the next promotion? How long is it since the previous promotion?" So if we basically create those fields, this is the kind of thing which is super difficult for any GBM or random forest or neural net to figure out how to calculate itself. There's no obvious kind of mathematical function that it's going to build on its own.

So this is the kind of feature engineering that we have to do in order to allow us to use these kinds of techniques effectively on time series data. So a lot of people who work with time series data, particularly in academia outside of industry, they're just not aware of the fact that the state-of-the-art approaches really involve all these heuristics.

Separating out your dates into their components, turning everything you can into durations both forward and backwards, and also running averages. When I used to do a lot of this kind of work, I had a bunch of library functions that I would run on every file that came in and would automatically do these things for every combination of dates.

So this thing of how long until the next promotion, how long since the previous promotion is not easy to do in any database system pretty much, or indeed in pandas. Because generally speaking, these kind of systems are looking for relationships between tables, but we're trying to look at relationships between rows.

So I had to create this tiny little simple little class to do this. So basically what happens is, let's say I'm looking at school holiday. So I sort my data frame by store, and then by date, and I call this little function called add_elapsed_school_holiday_after. What does add_elapsed do? Add_elapsed is going to create an instance of this class called elapsed, and in this case it's going to be called with school_holiday.

So what this class is going to do, we're going to be calling this apply function again. It's going to run on every single row, and it's going to call my elapsed class's get method for every row. So I'm going to go through every row in order of store, in order of date, and I'm trying to find how long has it been since the last school holiday.

So when I create this object, I just have to keep track of what field is it, school_holiday. Initialize, when was the last time we saw a school holiday? The answer is we haven't, so let's initialize it to not a number. And we also have to know each time we cross over to a new store.

When we cross over to a new store, we just have to re-initialize. So the previous store was 0. So every time we call get, we basically check. Have we crossed over to a new store? And if so, just initialize both of those things back again. And then we just say, Is this a school holiday?

If so, then the last time you saw a school holiday is today. And then finally return, how long is it between today and the last time you saw a school holiday? So it's basically this class is a way of keeping track of some memory about when did I last see this observation.

So then by just calling df.apply, it's going to keep track of this for every single row. So then I can call that for school_holiday, after and before. The only difference being that for before I just sort my dates in ascending order. State_holiday and promo. So that's going to add in the end 6 fields, how long until and how long since the last school holiday, state_holiday and promotion.
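
Here is a compact version of that class and the apply call, close in spirit to what was just described (the names Elapsed and add_elapsed are mine):

```python
class Elapsed:
    """Days since the last time `fld` was true, resetting whenever we reach a new store."""
    def __init__(self, fld):
        self.fld = fld
        self.last = pd.NaT       # haven't seen the event yet
        self.last_store = 0

    def get(self, row):
        if row.Store != self.last_store:   # crossed over to a new store: re-initialize
            self.last = pd.NaT
            self.last_store = row.Store
        if row[self.fld]:                  # the event happens on this row: remember its date
            self.last = row.Date
        return row.Date - self.last        # NaT until the first event has been seen

def add_elapsed(df, fld, prefix):
    df[prefix + fld] = df.apply(Elapsed(fld).get, axis=1)

# after sorting by store then date ascending:
# add_elapsed(df, 'SchoolHoliday', 'After')
# and the same call with dates sorted descending gives the 'Before' version
```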

And then there's two questions. One asking, Is this similar to a windowing function? Not quite, we're about to do a windowing function. And then is there a reason to think that the current approach would be problematic with sparse data? I don't see why, but I'm not sure I quite follow.

So we don't care about absolute days; we care about time deltas between events. We care about two things. We do care about the dates, but in the sense of what year it is and what week of the year it is. And we also care about the elapsed time between the date I'm predicting sales for and the previous and next of various events.

And then windowing functions, for the features that are time until an event, how do you deal with that given that you might not know when the last event is in the data? Well all I do is I've sorted descending, and then we initialize last with not a number. So basically when we then go subtract, here we are subtract, and it tries to subtract not a number, we'll end up with a null.

So basically anything that's an unknown time because it's at one end or the other is going to end up null, which is why we're going to replace those nulls with zeros. Pandas has this slightly strange way of thinking about indexes, but once you get used to it, it's fine.

At any point you can call DataFrame.set_index and pass in a field. You then have to just kind of remember what field you have as the index, because quite a few methods in Pandas use the currently active index by default, and of course things all run faster when you do stuff with the currently active index.

And you can pass multiple fields, in which case you end up with a multiple key index. So the next thing we do is these windowing functions. So a windowing function in Pandas, we can use this rolling. So this is like a rolling mean, rolling min, rolling max, whatever you like.

So this basically says let's take our DataFrame with the columns we're interested in, school holiday, state holiday and promo, and we're going to keep track of how many holidays are there in the next week and the previous week. How many promos are there in the next week and the previous week?

To do that we can sort, here we are, by date, group by, store, and then rolling will be applied to each group. So within each group, create a rolling 7-day sum. It's the kind of notation I'm never likely to remember, but you can just look it up. This is how you do group by type stuff.
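
A sketch of that windowing step, with the column names assumed from the description:

```python
columns = ['SchoolHoliday', 'StateHoliday', 'Promo']

# backward-looking running sum over a 7-row (roughly 7-day) window, computed per store
bwd = (df.set_index('Date').sort_index()
         .groupby('Store')[columns]
         .rolling(7, min_periods=1).sum())

# the forward-looking version is the same thing with the dates sorted descending
```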

Pandas actually has quite a lot of time series functions, and this rolling function is one of the most useful ones. Wes McKinney had a background as a quant, if memory serves correctly, and quants love their time series functions, so I think that was a lot of the history of Pandas.

So if you're interested in time series stuff, you'll find a lot of time series stuff in Pandas. One helpful parameter that sits inside a lot of methods is inplace=True. That means that rather than returning a new data frame with this change made, it changes the data frame you already have, and when your data frames are quite big this is going to save a lot of time and memory.

That's a good little trick to know about. So now we merge all these together, and we can now see that we've got all these after school holidays, before school holidays, and our backward and forward running means. Then we join that up to our original data frame, and here we have our final result.

So there it is. We started out with a pretty small set of fields in the training set, but we've done this feature engineering. This feature engineering is not arbitrary. Although I didn't create this solution, I was just re-implementing the solution from the competition's place-getters, this is nearly exactly the set of feature engineering steps I would have done.

It's just a really standard way of thinking about a time series. So you can definitely borrow these ideas pretty closely. So now that we've got this table, we've done our feature engineering, we now want to feed it into a neural network. To feed it into a neural network we have to do a few things.

The categorical variables have to be turned into one-hot encoded variables, or at least into contiguous integers. And the continuous variables we probably want to normalize to a zero-mean one-standard deviation. There's a very little-known package called sklearn_pandas. And actually I contributed some new stuff to it for this course to make this even easier to use.

If you use this data frame mapper from sklearn_pandas, as you'll see, it makes life very easy. Without it, life is very hard. And because very few people know about it, the vast majority of code you will find on the internet makes life look very hard. So use this code, not the other code.

Actually I was talking to some of the students the other day and they were saying for their project they were stealing lots of code from part one of the course because they just couldn't find anywhere else people writing any of the kinds of code that we've used. The stuff that we've learned throughout this course is on the whole not code that lives elsewhere very much at all.

So feel free to use a lot of these functions in your own work because I've really tried to make them the best version of that function. So one way to do the embeddings and the way that they did it in the paper is to basically say for each categorical variable they just manually decided what embedding dimensionality to use.

They don't say in the paper how they pick these dimensionalities, but generally speaking things with a larger number of separate levels tend to have more dimensions. So I think there's like 1000 stores, so that has a big embedding dimensionality, whereas obviously things like promo forward and backward, or day of week or whatever, have much smaller ones.

So this is this dictionary I created that basically goes from the name of the field to the embedding dimensionality. Again, this is all code that you guys can use in your models. So then all I do is I say my categorical variables is go through my dictionary, sort it in reverse order of the value, and then get the first thing from that.

So that's just going to give me the keys from this in reverse order of dimensionality. Continuous variables is just a list. Just make sure that there's no nulls, so continuous variables replace nulls with zeros, categorical variables replace nulls with empties. And then here's where we use the DataFrameMapper. A DataFrameMapper takes a list of tuples with just two items in.

The first item is the name of the variable, so in this case I'm looping through each categorical variable name. The second thing in the tuple is an instance of a class which is going to do your preprocessing. And there's really just two that you're going to use almost all the time.

The categorical variables, sklearn comes with something called label encoder. It's really badly documented, in fact misleadingly documented, but this is exactly what you want. It's something that takes a column, figures out what are all the unique values that appear in that column, and replaces them with a set of contiguous integers.

So if you've got the days of the week, Monday through Sunday, it'll replace them with zero through six. And then very importantly, this is critically important, you need to make sure that the training set and the test set have the same codes. There's no point in having Sunday be zero in the training set and one in the test set.

So because we're actually instantiating this class here, this object is going to actually keep track of which codes it's using. And then ditto for the continuous, we want to normalize them to zero-mean, one-standard-deviation variables. But again, we need to remember what was the mean that we subtracted, what was the standard deviation we divided by, so that we can do exactly the same thing to the test set.

Otherwise again our models are going to be nearly totally useless. So the way the dataframe mapper works is that it's using this instantiated object, it's going to keep track with this information. So this is basically code you can copy and paste in every one of your models. Once we've got those mappings, you just pass those to a dataframe mapper, and then you call .fit passing in your dataset.
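
In code, building those mappers looks something like this; note that StandardScaler gets its column wrapped in a list so it receives a 2-D array, while LabelEncoder takes the bare column name:

```python
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelEncoder, StandardScaler

# one (column, transformer) pair per variable; the instantiated objects remember
# their codes / means / stds, so the test set gets identical treatment later
cat_maps    = [(c, LabelEncoder()) for c in cat_vars]
contin_maps = [([c], StandardScaler()) for c in contin_vars]

cat_mapper    = DataFrameMapper(cat_maps).fit(joined)
contin_mapper = DataFrameMapper(contin_maps).fit(joined)
```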

And so this thing now is a special object which has a .features property that's going to contain all of the pre-processed features that you want. Categorical columns contains the result of doing this mapping, basically doing this label encoding. In some ways the details of how this works doesn't matter too much because you can just use exactly this code in every one of your models.

Same for continuous, it's exactly the same code, but of course for continuous it's going to be using StandardScaler, which is the scikit-learn thing that turns it into a zero-mean, one-standard-deviation variable. So we've now got continuous columns that have all been standardized. Here's an example of the first five rows from the zeroth column for a categorical, and then ditto for a continuous.

You can see these have been turned into integers and these have been turned into numbers which are going to average to zero and have a standard deviation of one. One of the nice things about this dataframe mapper is that you can now take that object and actually store it, pickle it.

So now you can use those categorical encodings and scaling parameters elsewhere. By just unpickling it, you've immediately got those same parameters. For my categorical variables, you can see here the number of unique classes in every one. So here's my 1,100 stores, 31 days of the month, 7 days of the week, and so forth.

So that's the kind of key pre-processing that has to be done. So here is their big mistake, and I think if they hadn't made this big mistake, they probably would have won. Their big mistake is that they kept only the rows where joined.Sales was not equal to zero. So they've removed all of the rows with no sales.

Those are all of the rows where the store was closed. Why was this a big mistake? Because if you go to the Rossman Store Sales Competition website and click on "Kernels" and look at the kernel that got the highest rating. I'll show you a couple of pictures. Here is an example of a store, Store 708, and these are all from this kernel.

Here is a period of time where it was closed for refurbishment. This happens a lot in Rossman stores. You get these periods of time when you get zeros for sales, lots in a row. Look what happens immediately before and after. So in the data set that we're looking at, our unfortunate third place winners deleted all of these.

So they had no ability to build a feature that could find this. So that was Store 708. Look, here's another one where it was closed. So this turns out to be super common. The second place winner actually built a feature, exactly the same kind of feature we've seen before.

How many days since the store last closed, and how many days until it next closes. If they had just done that, I'm pretty sure they would have won. So that was their big mistake. This kernel has a number of interesting analyses in it. Here's another one which I think our neural net can capture, although it might have been better to be explicit.

Some stores opened on Sundays. Most didn't, but some did. For those stores that opened on Sundays, their sales on Sundays were far higher than on any other day. I guess that's because in Germany not many shops open on Sundays. So something else that they didn't explicitly do was create an "is store open on Sunday" field.

Having said that, I think the neural net may have been able to put that in the embedding. So if you're interested during the week, you could try adding this field and see if it actually improves it or not. It would certainly be interesting to hear if you try adding this field.

Would you find that you actually would have won the competition? This Sunday thing -- these are all from the same Kaggle kernel -- here's the day of week and here's the sales as a box plot. You can see that overall, on a Sunday it's not that the sales are much higher; so it's really just those particular stores.

That's the kind of visualization stuff which is really helpful to do as you work through these kinds of problems. I don't know, just draw lots of pictures. Those pictures were drawn in R, and R is actually pretty good for this kind of structured data. I have a question. For categorical fields, they're converted to numbers, so Monday is zero, Tuesday is one, whatever. As is, will they be sent to the neural network just like that?

We're going to get there. We're going to use embeddings. Just like we did with word embeddings, remember, we turned every word into a word index. So our sentences, rather than being like "the dog ate the beans", would be 3, 6, 12, 2, whatever.

We're going to do the same basic thing. We've done the same basic thing. So now that we've done our terrible mistake, we've now still got 824,000 rows left. As per usual, I made it really easy for me to create a random sample and did most of my analysis with a random sample, but can just as easily not do the random sample.

So now I've got a separate sample version of it. Split it into training and test. Notice here, the way I split it into training and test is not randomly. The reason it's not randomly is because in the Kaggle competition, they set it up the smart way. The smart way to set up a test set in a time series is to make your test set the most recent period of time.

If you choose random points, you've got two problems. The first is you're predicting tomorrow's sales where you always have the previous day's sales which is very rarely the way things really work. And then secondly, you're ignoring the fact that in the real world, you're always trying to model a few days or a few weeks or a few months in the future that haven't happened yet.

So the way you want to set up, if you were setting up the data for such a model yourself, you would need to be deciding how often am I going to be rerunning this model, how long is it going to take for those model results to get into the field, to be used in however they're being used.

In this case, I can't remember, I think it's like a month or two. So in that case I should make sure there's a month or two test set, which is the last bit. So you can see here, I've taken the last 10% of my validation set and it's literally just here's the first bit and here's the last bit, and since it was already sorted by date, this ensures that I have it done the way I want.

I just wanted to point out that it's 10 to 8, so we should probably take a break. This is how you take that data frame map object we created earlier, we call .fit, in order to learn the transformation parameters, we then call transform to actually do it. So take my training set and transform it to grab the categorical variables, and then the continuous preprocessing is the same thing for my continuous map.
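
With the mapper names from the earlier sketch (and hypothetical train_df/val_df frames), the fit/transform step is roughly:

```python
import numpy as np

cat_train    = cat_mapper.transform(train_df).astype(np.int64)
cat_val      = cat_mapper.transform(val_df).astype(np.int64)
contin_train = contin_mapper.transform(train_df).astype(np.float32)
contin_val   = contin_mapper.transform(val_df).astype(np.float32)
```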

So preprocess my training set and grab my continuous variables. So that's nearly done. The only final piece is in their solution, they modified their target, their sales value. And the way they modified it was that they found the highest amount of sales, and they took the log of that, and then they modified all of their y values to take the log of sales divided by the maximum log of sales.

So what this means is that the y values are going to be no higher than 1. And furthermore, remember how they had a long tail, the average was 5,000, the maximum was 40-something thousand. This is really common; most financial data, sales data and so forth generally has a nicer shape when it's logged than when it's not.

So taking a log is a really good idea. The reason that as well as taking the log they also did this division is it means that what we can now do is we can use an activation function in our neural net of a sigmoid, which goes between 0 and 1, and then just multiply by the maximum log.
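
That target transformation is just a couple of lines, assuming a Sales column and the frames from above:

```python
max_log_y = np.max(np.log(train_df.Sales))
y_train = np.log(train_df.Sales) / max_log_y     # targets now lie in (0, 1]
y_val   = np.log(val_df.Sales) / max_log_y
# to get back to sales at prediction time: np.exp(pred * max_log_y)
```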

So that's basically going to ensure that the data is in the right scaling area. I actually tried taking this out, and this technique doesn't really seem to help. And it actually reminds me of the style transfer paper where they mentioned they originally had a hyperbolic tan layer at the end for exactly the same reason, to make sure everything was between 0 and 255.

It actually turns out if you just use a linear activation it worked just as well. So interestingly this idea of using sigmoids at the end in order to get the right range doesn't seem to be that helpful. My guess is the reason why is because for a sigmoid it's really difficult to get the maximum.

And I think actually what they should have done is they probably should have, instead of using maximum, they should have used maximum times 1.25 so that they never have to predict 1, because it's impossible to predict 1 because it's a sigmoid. Someone asked, "Is there any issue in fitting the preprocessors on the full training and validation data?

Shouldn't they be fit only to the training set?" No, it's fine. In fact, for the categorical variables, if you don't include the test set then you're going to have some codes that aren't there at all, whereas this way they at least get a consistent code, which is better than failing.

As for deciding what to divide and subtract in order to get a zero-mean, unit-variance variable, it doesn't really matter; there's no leakage involved, if that's what you're worried about. Root mean squared percent error is what the Kaggle competition used as the official loss function, so this is just calculating that.
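
For reference, here is a sketch of that metric computed on unlogged sales:

```python
def rmspe(y_pred, targ):
    """Root mean squared percentage error, as used by the Kaggle competition."""
    pct_var = (targ - y_pred) / targ
    return np.sqrt(np.square(pct_var).mean())
```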

So before we take a break, we'll finally take a look at the definition of the model. I'll kind of work backwards. Here's the basic model. Get our embeddings, combine the embeddings with the continuous variables, a tiny bit of dropout, one dense layer, two dense layers, more dropout, and then the final sigmoid activation function.

You'll see that I've got commented out stuff all over the place. This is because I had a lot of questions, we're going to cover this after the break, a lot of questions about some of the details of why did they do things certain ways, some of the things they did were so weird, I just thought they couldn't possibly be right.

So I did some experimenting, we'll learn more about that in a moment. So the embeddings, as per usual, I create a little function to create an embedding, which first of all creates my regular Keras input layer, and then it creates my embedding layer, and then how many embedding dimensions I'm going to use.

Sometimes I looked them up in that dictionary I had earlier, and sometimes I calculated them using this simple approach of saying I will use however many levels there are in the categorical variable divided by 2 with a maximum of 50. These were two different techniques I was playing with.

Normally with word embeddings, you have a whole sentence, and so you've got to feed it to an RNN, and so you have time steps. So normally you have an input length equal to the length of your sentence. This is the time steps for an RNN. We don't have an RNN, we don't have any time steps.

We just have one element in one column. So therefore I have to pass flatten after this because it's going to have this redundant unit 1 time axis that I don't want. So this is just because people don't normally do this kind of stuff with embeddings, so they're assuming that you're going to want it in a format ready to go to an RNN, so this is just turning it back into a normal format.

So we grab each embedding, we end up with a whole list of those. We then combine all of those embeddings with all of our continuous variables into a single list of variables. And so then our model is going to have all of those embedding inputs and all of our continuous inputs, and then we can compile it and train it.
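
Putting those pieces together, a sketch of the model might look like the following. The layer sizes, dropout rates and the cat_var_levels list of (name, number-of-levels) pairs are illustrative assumptions rather than the competitors' exact settings, and it uses the modern Keras Concatenate layer rather than the older merge API shown in the lesson.

```python
from keras.layers import Input, Embedding, Flatten, Dense, Dropout, Concatenate
from keras.models import Model

def emb_input(name, n_cat):
    # one integer input column per categorical variable; embedding width comes from the
    # dictionary if present, otherwise min(50, (n_cat + 1) // 2)
    inp = Input(shape=(1,), dtype='int64', name=name + '_in')
    dim = cat_var_dict.get(name, min(50, (n_cat + 1) // 2))
    emb = Embedding(n_cat, dim, input_length=1)(inp)
    return inp, Flatten()(emb)          # drop the redundant length-1 time-step axis

embs        = [emb_input(name, n) for name, n in cat_var_levels]
emb_inputs  = [inp for inp, _ in embs]
emb_outputs = [out for _, out in embs]
contin_inp  = Input(shape=(len(contin_vars),), name='contin')

x = Concatenate()(emb_outputs + [contin_inp])
x = Dropout(0.02)(x)
x = Dense(1000, activation='relu')(x)
x = Dense(500, activation='relu')(x)
x = Dropout(0.2)(x)
out = Dense(1, activation='sigmoid')(x)     # multiplied back up by max_log_y outside the net

model = Model(emb_inputs + [contin_inp], out)
model.compile(optimizer='adam', loss='mean_absolute_error')
# model.fit([cat_train[:, i] for i in range(cat_train.shape[1])] + [contin_train], y_train, ...)
```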

So let's take a break and see you back here at 5 past 8. So we've got our neural net set up. We train it in the usual way, call .fit, and away we go. So that's basically that. It trains reasonably quickly, 6 minutes in this case. So we've got two questions that came in.

One of them is, for the normalization, is it possible to use another function other than log, such as sigmoid? I don't think you'd want to use sigmoid. A kind of financial data and sales data tends to be of a shape where log will make it more linear, which is generally what you want.

And then when we log transform our target variable, we're also transforming the squared error. Is this a problem? Or is it helping the model to find a better minimum error in the untransformed space? Yeah, so you've got to be careful about what loss function you want. In this case the Kaggle competition is trying to minimize root mean squared percent error.

So I actually then said I want you to do mean absolute error because in log space that's basically doing the same thing. The percent is a ratio, so this is the absolute error between two logs which is basically the same as a ratio. So you need to make sure your loss function is appropriate in that space.

I think this is one of the things that I didn't do in the original competition. As you can see I tried changing it and I think it helped. By the way, XGBoost is fantastic. Here is the same series of steps to run this model with XGBoost. As you can see, I just concatenate my categorical and continuous for training and my validation set.

Here is a set of parameters which tends to work pretty well. XGBoost has a data type called DMatrix, which is basically a normal matrix but it keeps track of the names of the features, so it prints out better information. Then you go .train and this takes less than a second to run.

It's not massively worse than our previous result. This is a good way to get started. The reason that XGBoost and random forests are particularly helpful is that they give you something called variable importance. This is how you get the variable importance for an XGBoost model. It takes a second and suddenly here is the information you need.
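Roughly, those XGBoost steps look like this (a hedged sketch: the cat_train / contin_train / y_train arrays and the feature-name lists are hypothetical, and the parameter values are just plausible defaults, not the notebook's exact ones):

```python
import numpy as np
import xgboost as xgb

# hypothetical pre-processed arrays: categorical codes and continuous columns
X_train = np.concatenate([cat_train, contin_train], axis=1)
X_valid = np.concatenate([cat_valid, contin_valid], axis=1)
feature_names = cat_cols + contin_cols          # hypothetical lists of column names

# DMatrix keeps track of the feature names, so the importance plot is readable
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_names)
dvalid = xgb.DMatrix(X_valid, label=y_valid, feature_names=feature_names)

params = {'max_depth': 7, 'eta': 0.1, 'subsample': 0.6,
          'colsample_bytree': 0.6,
          'objective': 'reg:linear'}            # older alias of reg:squarederror
m = xgb.train(params, dtrain, num_boost_round=400, evals=[(dvalid, 'valid')])

xgb.plot_importance(m)   # the variable importance plot discussed above
```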

When I was having trouble replicating the original results from the third place winners, one of the things that helped me a lot was to look at this feature importance plot and say, "Competition distance, holy cow, that's really really important. Let's make sure that my competition distance results pre-processing really is exactly the same." On the other hand, events doesn't really matter at all, so I'm not going to worry really at all about checking my events.

This feature importance plot, or variable importance plot as it's also known, is something you can also create with a random forest. These are amazing. Because you're using a tree ensemble, it doesn't matter the shape of anything, it doesn't matter if you have or don't have interactions, this is all totally assumption free.

In real life, this is the first thing I do. The first thing I do is try to get a feature importance plot printed. Often it turns out that there are only three or four variables that matter, even if you've got 10,000 variables. I worked on a big credit scoring problem a couple of years ago where I had 9,500 variables.

It turned out that only nine of them mattered. So the company I was working for literally had spent something like $5 million on this big management consulting project, and this big management consulting project had told them all these ways in which they can capture all this information in this really clean way for their credit scoring models.

Of course none of those things were in these nine that mattered, so they could have saved $5 million, but they didn't because management consulting companies don't use random forests. I can't overstate the importance of this plot, but this is a deep learning course, so we're not really going to spend time talking about it.

Now I mentioned that there were a whole bunch of really weird things in the way the competition placegetters did things. For one, they didn't normalize their continuous variables. Who does that? But then when people do well in a competition, something's working. The way they initialized their embeddings was really, really weird.

But all these things were really, really weird. So what I did was I wrote a little script, Rossman experiments, and what I did was basically I copied and pasted all the important code out of my notebook. Remember, I'd already pickled the parameters for the label encoder and the scaler, so I didn't have to worry about doing those again.

Once I copied and pasted all that code in, so this is exactly all the code you just saw, I then had this bunch of for loops. Pretty inelegant. But these are all of the things that I wanted to basically find out. Does it matter whether or not you use scaling?

Does it matter whether you use their weird approach to initializing embeddings? Does it matter whether you use their particular dictionary of embedding dimensions or use my simple little formula? Something else I tried is they basically took all their continuous variables and put them through a separate little dense layer each.

I was like, why don't we put them all together. I also tried some other things like batch normalization. So I ran this and got back every possible combination of these. This is where you want to be using the script. I'm not going to tell you that I jumped straight to this.
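The loops were essentially a grid over boolean flags, something like this minimal sketch (the flag names and the run_model helper are hypothetical stand-ins for the copied-and-pasted training code):

```python
import csv
import itertools

grid = {
    'use_scaling':        [True, False],  # normalize continuous variables or not
    'use_their_init':     [True, False],  # their weird embedding init vs a standard one
    'use_their_emb_dims': [True, False],  # their dictionary vs the simple formula
    'separate_contin':    [True, False],  # one dense layer per continuous variable vs all together
    'use_batchnorm':      [True, False],
}

with open('experiments.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(list(grid) + ['val_error'])
    for combo in itertools.product(*grid.values()):
        cfg = dict(zip(grid, combo))
        err = run_model(**cfg)             # hypothetical: trains once, returns validation error
        writer.writerow(list(combo) + [err])
```

The resulting CSV then goes straight into Excel for the pivot table and conditional formatting described next.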

First of all, I spent days screwing around with experiments in a notebook by hand, continually forgetting what I had just done, until eventually I wrote this script, which took me like an hour. And then of course I pasted the output into Excel. And here it is. Chucked it into a pivot table, used conditional formatting, and here are my results.

You can see all my different combinations: with and without normalization, with my simple function versus their dictionary, using a single dense matrix versus putting everything together, using their weird init versus not using a weird init. And this dark blue here is what they did. It looks full of weirdness to me.

But as you can see, it's actually the darkest blue. It actually is the best. But then when you zoom out, you realize there's a whole corner over here that's got a couple of .086s, which is nearly as good but seems much more consistent. And also more consistent with sanity. Like yes, do normalize your data.

And yes, do use an appropriate initialization function. And if you do those two things, it doesn't really matter what else you do, it's all going to work fine. So what I then did was I created a little sparkline in Excel for the actual training graphs. And so here's their winning one, again, .085.

But here's the variance of getting there. And as you can see, their approach was pretty bumpy, up and down, up and down, up and down. The second best on the other hand, .086 rather than .085, is going down very smoothly. And so that made me think, given that it's in this very stable part of the world, and given it's training much better, I actually think this is just random chance.

It just happened to be low in this point. I actually thought this is a better approach. It's more sensible and it's more consistent. So this kind of approach to running experiments, I thought I'd just show you to say when you run experiments, try and do it in a rigorous way and track both the stability of the approach as well as the actual result of the approach.

So this one here makes so much sense. It's like use my simple function rather than the weird dictionary, use normalization, use a single dense matrix, and use a thoughtful initialization. And you do all of those things, you end up with something that's basically as good and much more stable.

That's all I wanted to say about Rossman. I'm going to very briefly mention another competition, which is the Kaggle Taxi Destination competition. >> You were saying that you did a couple of experiments. One, you figured out the embeddings and then put the embeddings into random forests, and then put the embeddings again into a neural network.

>> I didn't do that, that was from the paper. >> Yeah, so I don't understand, because you just use one neural network to do everything together, no? >> Yeah, so what they did was, for this one here, this 115: they trained the neural network I just showed you. They then threw away the neural network and trained a GBM model, but for the categorical variables, rather than using one-hot encodings, they used the embeddings. That's all.
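As a hedged sketch of that trick (not their actual code): after training the Keras model, pull out each embedding matrix and use the looked-up vectors as GBM features in place of one-hot encodings. The layer names and the cat_train / contin_train / y_train arrays here are hypothetical.

```python
import numpy as np
import xgboost as xgb

def embed_features(keras_model, cat_data, cat_cols):
    """Replace each categorical code with its learned embedding vector."""
    feats = []
    for i, col in enumerate(cat_cols):
        W = keras_model.get_layer('emb_' + col).get_weights()[0]  # (n_levels, n_dims), hypothetical layer name
        feats.append(W[cat_data[:, i]])                           # look up each row's vector
    return np.concatenate(feats, axis=1)

X_train = np.hstack([embed_features(model, cat_train, cat_cols), contin_train])
gbm = xgb.train({'max_depth': 7, 'eta': 0.1},
                xgb.DMatrix(X_train, label=y_train), num_boost_round=400)
```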

So the taxi competition was won by the team with this Unicode name, which is pretty cool. And it actually turned out to be a team run by Yoshua Bengio, who's one of the people that stuck it out through the AI winter and is now one of the leading lights in deep learning.

And interestingly, the paper they wrote about the Rossman competition, the thing I just showed you, claimed to have invented this idea of categorical embeddings. But actually, Yoshua Bengio's team won this competition a year earlier with the same technique. But again, it's so uncool that nobody noticed, even though it was Yoshua Bengio.

So I want to quickly show you what they did. This is the paper they wrote. And their approach to picking an embedding size was very simple. Use 10. So the data was which customer is taking this taxi, which taxi are they in, which taxi stand did they get the taxi from, and then quarter hour of the day, day of the week, week of the year.

And they didn't add all kinds of other stuff, this is basically it. And so then they said we're going to learn embeddings inspired by NLP. So actually to my knowledge, this is the first time this appears in the literature. Having said that, I'm sure a thousand people have done it before, it's just not obvious to make it into a paper.

>> As a quick sanity check, if you have day of the week, with only seven possible one-hot values, and an embedding size of 10, that doesn't make any sense, right? >> Yeah, so I used to think that. But actually it does. In the last few months I've quite often ended up with bigger embeddings than my original cardinality.

And often it does give better results. And I think it's just like when you realize that it's just a dense layer on top of a one-hot encoding, it's like okay, why shouldn't the dense layer have more information? I found it weird too, I still find it a little weird, but it definitely seems to be something that's quite useful.

>> It does, it helps. I have absolutely found plenty of times now where I need a bigger embedding matrix dimensionality than the cardinality of my categorical variable. Now in this competition, again it's a time series competition really, because the main thing you're given, other than all this metadata, is a series of GPS points, which is every GPS point along a route.

And at some point for the test set, the route is cut off and you have to figure out what the final GPS point would have been. Where are they going? Here's the model that they won with. It turns out to be very simple. You take all the metadata we just saw and chuck it through the embeddings.

You then take the first 5 GPS points and the last 5 GPS points and concatenate them together with the embeddings. Chuck them through a hidden layer, then through a softmax. This is quite interesting. What they then do is they take the result of this softmax and they combine it with clusters.

Now what are these clusters? They used mean shift clustering, and they used mean shift clustering to figure out where are the places people tend to go. So with taxis, people tend to go to the airport or they tend to go to the hospital or they tend to go to the shopping strip.

So using mean shift clustering they found, I think it was about 3,000 clusters: the x, y coordinates of places that people tend to go. However, people don't always go to exactly those 3,000 places. So this is the really cool thing: they took the softmax output and used it to take a weighted average, with the softmax probabilities as the weights and the cluster centers as the things being averaged.

So in other words, if they're going to the airport for sure, the softmax will end up giving a p of very close to 1 for the airport cluster. On the other hand, if it's not really that clear whether they're going to this shopping strip or this movie theater, then those two cluster centers could both have a softmax of about 0.5, and so it's going to end up predicting somewhere halfway between the two.

So this is really interesting. They've built a different kind of architecture to anything we've seen before, where the softmax is not the last thing we do. It's being used to average a bunch of clusters. So this is really smart because the softmax forces it to be easier for it to pick a specific destination that's very common, but also makes it possible for it to predict any destination anywhere by combining the average of a number of clusters together.
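Here's a hedged sketch of that output head in Keras 2 style (my reading of the description above, not the winners' code; the feature size, hidden size and loss are assumptions, and the cluster centres would really come from mean shift rather than random numbers):

```python
import numpy as np
import keras.backend as K
from keras.layers import Input, Dense, Lambda
from keras.models import Model

n_clusters = 3000
# stand-in for the mean-shift cluster centres: (n_clusters, 2) of (lat, lon)
cluster_centres = K.constant(np.random.randn(n_clusters, 2).astype('float32'))

feats = Input(shape=(500,))                      # concatenated embeddings + first/last 5 GPS points
h = Dense(500, activation='relu')(feats)         # the hidden layer
p = Dense(n_clusters, activation='softmax')(h)   # probability of each candidate destination

# weighted average of the (fixed) centres, using the softmax as weights -> (lat, lon)
dest = Lambda(lambda probs: K.dot(probs, cluster_centres))(p)

model = Model(feats, dest)
model.compile(optimizer='adam', loss='mean_squared_error')  # the real competition metric differed
```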

I think this is really elegant architecture engineering. The "last 5 GPS points" are the last 5 points that were given. To create the training set, what they did was they took all of the routes and truncated them randomly. So every time they sampled another route, think of the data generator: basically the data generator would randomly slice it off somewhere.

So this was the last 5 points which we have access to, and the first 5 points. The reason it's not all the points is because they're using a standard multilayer perceptron here, so it can't handle a variable length, and also you don't want it to be too big. There's a question.

>> So the prefix is not fed into an RNN, it's just fed into a dense layer? >> Correct. So we just get 10 points, concatenate them together into a dense layer. So surprisingly simple. How good was it? Look at the results: 2.14, 2.14, 2.13, 2.13, 2.11, 2.12. Everybody is clustered together.

One team is a bit better at 2.08, and they're way better at 2.03. And then they mentioned in the paper that they didn't actually have time to finish training, so when they actually finished training, it was actually 1.87. They won so easily, it's not funny. And interestingly in the paper, they actually mentioned the test set was so small that they knew the only way they could be sure to win was to make sure they won easily.

Now because the test set was so small, the leaderboard is actually not statistically that great. So they created a custom test set and tried to see if they could find something that's even better still on the custom test set. And it turns out that actually an RNN is better still.

It still would have won the competition, but there's not enough data in the Kaggle test set for that to be a statistically significant result; on their custom test set it is statistically significant. A regular RNN wasn't better, but what they did instead was take an RNN where they pass in 5 points at a time into the RNN, basically.

I think what probably would have been even better would be to have had a convolutional layer first and then passed that into an RNN; they didn't try it as far as I can see from the paper. Importantly, they used a bidirectional RNN, which ensures that both the initial points and the last points tend to have more weight, because we know that an RNN's state generally reflects the things it has seen most recently.
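A rough sketch of that variant as I understand it (not the paper's code; the window count and sizes are assumptions): cut the trajectory into windows of 5 GPS points and feed them to a bidirectional RNN, whose output then drives the same cluster-softmax head as before.

```python
from keras.layers import Input, GRU, Bidirectional, Dense
from keras.models import Model

n_windows = 20          # assumed maximum number of 5-point windows per trajectory
window_features = 10    # 5 points x (latitude, longitude)

seq = Input(shape=(n_windows, window_features))
h = Bidirectional(GRU(128))(seq)             # reads the window sequence in both directions
p = Dense(3000, activation='softmax')(h)     # same cluster softmax as before
model = Model(seq, p)
```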

So this result comes from this model. Our long-suffering intern Brad has been trying to replicate this result. He's had at least two all-nighters in the last two weeks but hasn't quite managed it yet. So I'm not going to show you the code, but hopefully once Brad starts sleeping again he'll be able to finish it off, and we can show you the notebook that actually re-implements this on the forum during the week.

It was an interesting process to watch Brad try to replicate this because the vast majority of the time in my experience when people say they've tried a model and the model didn't work out and they've given up on the model, it turns out that it's actually because they screwed something up, not because of the problem with the model.

And if you weren't comparing to Yoshua Bengio's team's result, knowing that you haven't replicated it yet, at which point do you give up and say, "Oh my model's not working" versus saying, "No, I've still got bugs!"? It's very difficult to debug machine learning models. What Brad's actually had to end up doing is literally take the original Bengio team code, run it line by line, and then try to replicate it in Keras line by line, literally checking with np.allclose every time.

Because to build a model like this, it doesn't look that complex, but there's just so many places that you can make little mistakes. No normal person will make like zero mistakes. In fact, normal people like me will make dozens of mistakes. So when you build a model like this, you need to find a way to test every single line of code.

Any line of code you don't test, I guarantee you'll end up with a bug, you won't know you have a bug, and there's no way to ever find out you had a bug. So we have several questions. One is a note that the p_i * c_i weighted sum is very similar to what happens in the memory network paper.

In that case, the output embeddings are weighted by the attention probability. It's a lot like a regular attentional language model. Can you talk more about the idea you have about first having the convolutional layer and passing that to an RNN? What do you mean by that? So here is a fantastic paper.

We looked at these kinds of subword encodings last week for language models. I don't know if any of you thought about this and wondered, what if we just had individual characters? There's a really fascinating paper called Fully Character-Level Neural Machine Translation without Explicit Segmentation, from November of last year.

They actually get fantastic results at just the character level, beating pretty much everything, including the BPE approach we saw last time. So they looked at lots of different approaches, comparing BPE to individual characters, and most of the time the character-level approach got the best results. Their model looks like this. They start out with every individual character.

It goes through a character embedding, just like we've used character embeddings lots of times. Then you take those character embeddings and you pass them through a one-dimensional convolution. I don't know if you guys remember, but in Part 1 of this course, Ben actually had a blog post showing how you can do multiple-size convolutions and concatenate them all together.

So you could use that approach. Or you could just pick a single size. So you end up basically scrolling your convolutional window across your sets of characters. So you end up with the same number of convolution outputs as you started out with letters, but they're now representing the information in a window around that letter.

In this case, they then did max pooling. So they basically said, assuming that we had different widths, say a size 3, a size 4, and a size 5: which bits seem to have the highest activations around here? Then they took those max-pooled things and put them through a second set of segment embeddings.

They then put that through something called a highway network which the details don't matter too much. It's kind of something like a DenseNet, like we learned about last week. This is a slightly older approach than the DenseNet. Then finally after doing all that, stick that through an RNN. So the idea here in this model was they basically did as much learnt pre-processing as possible, and then finally put that into an RNN.
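A simplified sketch of that encoder in Keras 2 style (not the paper's code: the highway layers are replaced by a plain time-distributed dense layer for brevity, and all the sizes are assumptions):

```python
from keras.layers import (Input, Embedding, Conv1D, MaxPooling1D, Dense, GRU,
                          TimeDistributed, concatenate)
from keras.models import Model

vocab_size, seq_len = 300, 400
chars = Input(shape=(seq_len,), dtype='int64')
emb = Embedding(vocab_size, 64)(chars)                        # one vector per character

# several convolution widths, each producing one output per character position
convs = [Conv1D(64, w, padding='same', activation='relu')(emb) for w in (3, 4, 5)]
x = concatenate(convs)                                        # (seq_len, 192)

x = MaxPooling1D(pool_size=4)(x)                              # 4x fewer time steps for the RNN
x = TimeDistributed(Dense(256, activation='relu'))(x)         # stand-in for the highway block
encoded = GRU(256)(x)                                         # the RNN finally runs on the short sequence

model = Model(chars, encoded)
```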

Because we've got these max pooling layers, this RNN ends up with a lot fewer time steps, which is really important to minimize the amount of processing in the RNN. So I'm not going to go into detail on this, but check out this paper because it's really interesting. The next question is, for the destinations, would we have more error for the peripheral points?

Are we taking a centroid of clusters? I don't understand that, sorry. All we're doing is taking the softmax p, multiplying it by the cluster centers c, and adding them up. I thought the first part was asking whether destinations that are more peripheral would have higher error, because they would be harder to predict this way.

Probably, which is fine, because by definition they're not close to a cluster center, so they're not common. Then going back, there was a question on the Rossman example. What does MAPE with neural network mean? I would have expected that result to be the same, why is it lower?

This is just using a one-hot encoding without an embedding layer. We're kind of running out of time a bit quickly, but I really want to show you this. The students and I have been trying to get a new approach to segmentation working, and I finally got it working in the last day or two, and I really wanted to show it to you.

We talked last week about DenseNet, and I mentioned that DenseNet is arse-kickingly good at doing image classification with a small number of data points, like crazily good. But I also mentioned that it's the basis of this thing called the 100-layer tiramisu, which is an approach to segmentation. So segmentation refers to taking a picture, an image, and figuring out where's the tree, where's the dog, where's the bicycle and so forth.

So it seems like we're Bengio fans today, because this is one of his group's papers as well. Let me set the scene. So Brendan, one of our students, whose blog posts many of you have seen, has successfully got a PyTorch version of this working, so I've shared that on files.fast.ai.

And I got the Keras version of it working. So I'll show you the Keras version because I actually understand it. And if anybody's interested in asking questions about the PyTorch version, hopefully Brendan will be happy to answer them during the week. So the data looks like this. There's an image, and then there's a labels.

So that's basically what it looks like. So you can see here, you've got traffic lights, you've got poles, you've got trees, buildings, paths, roads. Interestingly, the dataset we're using is something called CamVid. The dataset is actually frames from a video, so a lot of the frames look very similar to each other.

And there are only about 600 frames in total, so there's not much varied data in this CamVid dataset. Furthermore, we're not going to do any pre-training. So we're going to try to build a state-of-the-art segmentation system on video, which already has much lower information content because most of the frames are pretty similar, using just 600 frames without pre-training.

Now if you were to ask me a month ago, I would have told you it's not possible. This just seems like an incredibly difficult thing to do. But just watch. So I'm going to skip to the answer first. Here's an example of a particular frame we're trying to match.

Here is the ground truth for that frame. You can see there's a tiny car here and a little car here. There are those little cars. There's a tree. Trees are really difficult. They're incredibly fine, funny things. And here is my trained model. And as you can see, it's done really, really well.

It's interesting to look at the mistakes it made. This little thing here is a person. But you can see that the person's head looks a lot like a traffic light and their jacket looks a lot like a mailbox. Whereas with these tiny little people here it's done perfectly; with this person it got a little bit confused.

Another example of where it's gone wrong: this should be a road, but it wasn't quite sure what was road and what was footpath, which makes sense because the colors do look very similar. But had we pre-trained something, a pre-trained network would have understood that crossroads tend to go straight across; they don't tend to look like that.

So you can kind of see the minor mistakes it made. It also would have learned, had it looked at more than a couple of hundred examples of people, that people generally have a particular shape. So there's just not enough data for it to have learned some of these things.

But nonetheless, it is extraordinarily effective. Look at this traffic light, it's surrounded by a sign, so the ground truth actually has the traffic light and then a tiny little edge of sign, and it's even got that right. So it's an incredibly accurate model. So how does it work? And in particular, how does it do these amazing trees?

So the answer is in this picture. Basically, this is inspired by a model called UNet. Until the UNet model came along, everybody was doing these kinds of segmentation models using an approach just like what we did for style transfer, which is basically that you have a number of convolutional layers with max pooling, or with a stride of 2, which gradually make the image smaller and smaller, with a bigger and bigger receptive field.

And then you go back up the other side using upsampling or deconvolutions until you get back to the original size, and then your final layer is the same size as your starting layer and has a softmax over the bunch of different classes you're trying to predict. The problem with that is, in fact, I'll show you an example.

There's a really nice paper called ENet. ENet is not only an incredibly accurate model for segmentation, but it's also incredibly fast. It actually can run in real time. You can actually run it on a video. But the mistakes it makes, look at this chair. This chair has a big gap here and here and here, but ENet gets it totally wrong.

And the reason why is because they use a very traditional downsampling-upsampling approach, and by the time they get to the bottom, they've just lost track of the fine detail. So the trick is these connections here. What we do is we start with our input, we do a standard initial convolution, just like we did with style transfer.

We then have a DenseNet block, which we learned about last week. And then that block, we keep going down, we do a MaxPooling type thing, another DenseNet block, MaxPooling type thing, keep going down. And then as we go up the other side, so we do a DenseBlock, we take the output from the DenseBlock on the way down and we actually copy it over to here and concatenate the two together.

So actually Brendan a few days ago actually drew this on our whiteboard when we were explaining it to Melissa, and so he's shown us every stage here. We start out with a 224x224 input, it goes through the convolutions with 48 filters, goes through our DenseBlock, adds another 80 filters.

It then goes through our, they call it a transition down, so basically a MaxPooling. So it's now size 112. We keep doing that. DenseBlock, transition down, so it's now 56x56, 28x28, 14x14, 7x7. And then on the way up again, we go transition up, it's now 14x14. We copy across the results of the 14x14 from the transition down and concatenate together.

Then we do a DenseBlock, transition up, it's now 28x28, so we copy across our 28x28 from the transition down and so forth. So by the time we get all the way back up here, we're actually copying across something that was originally of size 224x224. It hadn't had much done to it, it had only been through one convolutional layer and one DenseBlock, so it hadn't really got much rich computation being done.

But the thing is, by the time it gets back up all the way up here, the model knows pretty much this is a tree and this is a person and this is a house, and it just needs to get the fine little details. Where exactly does this leaf finish?

Where exactly does the person's hat finish? So it's basically copying across something which is very high resolution but doesn't have that much rich information, but that's fine because it really only needs to fill in the details. So these things here, they're called skip connections. They were really inspired by this paper called UNet, which has been used to win many Kaggle competitions.

But it's using dense blocks rather than normal convolutional blocks. So let me show you. We're not going to have time to go into this in detail, but I've done all this coding in Keras from scratch. This is actually a fantastic fit for Keras. I didn't have to create any custom layers, I didn't really have to do anything weird at all, except for one thing, data augmentation.

So the data augmentation was we start with 480x360 images, we randomly crop some 224x224 part, and also randomly we may flip it horizontally. That's all perfectly fine. Keras doesn't really have the random crops, unfortunately. But more importantly, whatever we do to the input image, we also have to do to the target image.

We need to get the same 224x224 crop, and we need to do the same horizontal flip. So I had to write a data generator, which you guys may actually find useful anyway. So this is my data generator. Basically I called it a segment generator. It's just a standard generator so it's got a next function.

Each time you call next, it grabs some random bunch of indexes, it goes through each one of those indexes and grabs the necessary item, grabbing a random slice, sometimes randomly flipping it horizontally, and then it's doing this to both the x's and the y's, returning them back. Along with this segment generator, in order to randomly grab a batch of random indices each time, I created this little class called batch indices, which can basically do that.

And it can have either shuffle true or shuffle false. So this pair of classes you guys might find really helpful for creating your own data generators. This batch indices class in particular, now that I've written it, you can see how it works, right? If I say batch indices from a data set of size 10, I want to grab three indices at a time.

So then let's grab five batches. In this case I've got the default shuffle = false, so it just returns 0, 1, 2, then 3, 4, 5, then 6, 7, 8, then 9, and then it's finished, so it goes back to the start: 0, 1, 2. On the other hand, if I say shuffle = true, it returns them in random order, but it still makes sure it captures all of them.
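Here's a sketch of what those two helpers might look like (my reconstruction from the description above, not the exact notebook code; the class and argument names are assumptions). The key point is that the random crop and the random flip get applied identically to the image and to its label mask.

```python
import numpy as np

class BatchIndices:
    """Hands out batches of indices, optionally shuffled, cycling forever."""
    def __init__(self, n, batch_size, shuffle=False):
        self.n, self.batch_size, self.shuffle = n, batch_size, shuffle
        self.reset()
    def reset(self):
        self.idxs = (np.random.permutation(self.n) if self.shuffle
                     else np.arange(self.n))
        self.curr = 0
    def __next__(self):
        if self.curr >= self.n:
            self.reset()                       # start a new pass (new random order if shuffling)
        batch = self.idxs[self.curr:self.curr + self.batch_size]
        self.curr += len(batch)
        return batch

class SegmentGenerator:
    """Yields (image crop, label crop) batches with matching augmentation."""
    def __init__(self, x, y, batch_size=3, out_sz=(224, 224)):
        self.x, self.y, self.out_sz = x, y, out_sz
        self.bi = BatchIndices(len(x), batch_size, shuffle=True)
    def _augment(self, img, lbl):
        h, w = self.out_sz
        r = np.random.randint(0, img.shape[0] - h + 1)   # same random crop for both
        c = np.random.randint(0, img.shape[1] - w + 1)
        img, lbl = img[r:r+h, c:c+w], lbl[r:r+h, c:c+w]
        if np.random.rand() > 0.5:                       # same random flip for both
            img, lbl = img[:, ::-1], lbl[:, ::-1]
        return img, lbl
    def __next__(self):
        idxs = next(self.bi)
        pairs = [self._augment(self.x[i], self.y[i]) for i in idxs]
        return (np.stack([p[0] for p in pairs]),
                np.stack([p[1] for p in pairs]))
```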

And then when we're done it starts a new random order. So this makes it really easy to create random generators. So that was the only thing I had to add to Keras to get this all to work. Other than that, we wrote the tiramisu. And the tiramisu looks very, very similar to the DenseNet that we saw last week.

We've got all our pieces, the relu, the dropout, the batchnorm, the relu on top of batchnorm, the concat layer, so this is something I had to add, my convolution2d followed by dropout, and then finally my batchnorm followed by relu followed by convolution2d. So this is just the dense block that we saw last week.

So a dense block is something where we just keep grabbing 12 or 16 filters at a time, concatenating them to the last set, and doing that a few times. That's what a dense block is. So here's something interesting. The original paper, for its downsampling (they call it transition down), did a 1x1 convolution followed by a max pooling.

I actually discovered that doing a stride-2 convolution gives better results. So you'll see I actually haven't followed the paper. The one that's commented out here is what the paper did, but actually this works better. So that was interesting. Interestingly though, on the transition up side, do you remember that checkerboard artifacts blog post we saw that showed that UpSampling2D followed by a convolutional layer works better?

It does not work better for this. In fact, a deconvolution works better for this. So that's why you can see I've got this deconvolution layer. So I thought that was interesting. So basically you can see when I go downsampling a bunch of times, it's basically do a dense block and then I have to keep track of my skip connections.

So basically keep a list of all of those skip connections. So I've got to hang on to all of these. So every one of these skip connections I just stick in this little array, appending to it after every dense block. So then I keep them all and then I pass them to my upward path.

So I basically do my transition up and then I concatenate that with that skip connection. So that's the basic approach. So then the actual Tiramisu model itself, with those pieces, is less than a screen of code. It's basically just do a 3x3 conv, do my down path, do my up path using those skip connections, then a 1x1 conv at the end, and a softmax.
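Putting those pieces together, here's a hedged, simplified sketch in Keras 2 style (not the lesson's exact code: growth rate, block depth, filter counts and the number of down/up steps are all assumptions). It shows the dense block, the stride-2 "transition down" variant described above, the deconvolution "transition up", the down path that stores skip connections, and the up path that concatenates them back in:

```python
from keras.layers import (Input, Conv2D, Conv2DTranspose, BatchNormalization,
                          Activation, Dropout, Reshape, concatenate)
from keras.models import Model
import keras.backend as K

def bn_relu_conv(x, n_filters, kernel=3, stride=1):
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = Conv2D(n_filters, kernel, strides=stride, padding='same')(x)
    return Dropout(0.2)(x)

def dense_block(x, n_layers=4, growth=16):
    # keep grabbing `growth` new filters at a time and concatenating them on
    for _ in range(n_layers):
        x = concatenate([x, bn_relu_conv(x, growth)])
    return x

def transition_down(x):
    # stride-2 conv instead of the paper's 1x1 conv + max-pool (kernel size assumed)
    return bn_relu_conv(x, K.int_shape(x)[-1], stride=2)

def transition_up(x, n_filters=128):
    # deconvolution; upsampling + conv gave worse results here
    return Conv2DTranspose(n_filters, 3, strides=2, padding='same')(x)

def tiramisu(input_shape=(224, 224, 3), n_classes=12, n_down=3):
    inp = Input(shape=input_shape)
    x = Conv2D(48, 3, padding='same')(inp)         # initial 3x3 conv

    skips = []
    for _ in range(n_down):                        # down path
        x = dense_block(x)
        skips.append(x)                            # remember it for the up path
        x = transition_down(x)                     # halve the spatial size

    x = dense_block(x)                             # bottleneck

    for skip in reversed(skips):                   # up path
        x = transition_up(x)
        x = concatenate([x, skip])                 # the skip connection
        x = dense_block(x)

    x = Conv2D(n_classes, 1)(x)                    # 1x1 conv -> per-pixel class scores
    x = Reshape((input_shape[0] * input_shape[1], n_classes))(x)
    out = Activation('softmax')(x)                 # per-pixel softmax
    return Model(inp, out)
```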

So these DenseNets, and indeed this fully convolutional DenseNet or Tiramisu model, actually take quite a long time to train. They don't have very many parameters, which is why I think they work so well with these tiny datasets. But they do still take a long time to train.

Each epoch took a couple of minutes, and in the end I had to do many hundreds of epochs. And I was also doing a bunch of learning rate annealing. So in the end this kind of really had to train overnight, even though I had only about 500-600 frames. But in the end I got a really good result.

I was a bit nervous at first. I was getting this 87.6% accuracy. In the paper they were getting 90% plus. It turns out that 3% of the pixels are marked as void. I don't know why they're marked as void, but in the paper they actually remove them. So you'll see when you get to my results section, I've got this bit where I remove those void ones.

And I ended up with 89.5%. None of us in class managed to replicate the paper. The paper got 91.5% or 91.2%. We tried the Lasagne code they provided. We tried Brendan's PyTorch. We tried my Keras. Even though we couldn't replicate their result, this is still better than any other result I've found.

So this is still super accurate. A couple of quick notes about this. First is they tried training also on something called the GATech dataset, which is another video dataset. The degree to which this is an amazing model is really clear here. This 76% is from a model which is specifically built for video, so it actually includes the time component, which is absolutely critical, and it uses a pre-trained network, so it's used like a million images to pre-train, and it's still not as good as this model.

So that is an extraordinary comparison. This is the CamVid comparison. Here's the model we were just looking at. Again, I actually looked into this. I thought, 91.5% versus 88% for this one here, wow, it actually looks like it's not that much better. I'm really surprised. Like even tree, I really thought it should win easily on tree, but it doesn't win by very much.

So I actually went back and looked at this paper, and it turns out that the authors of the model they're comparing to (this is that paper, by the way) actually trained on crops of 852x852, so they actually used a way higher resolution image to start with.

You've got to be really careful when you read these comparisons. Sometimes people actually shoot themselves in the foot, so these guys were comparing their result to another model that was using like twice as big a picture. So again, this is actually way better than they actually made it look like.

Another one, like this one here, this 88 also looks impressive. But then I looked across here and I noticed that the Dilation8 model is like way better than this model on every single category, way better. And yet somehow the average is only 0.3 better, and I realized this actually has to be an error.

So this model is actually a lot better than this table gives the impression of. So I briefly mentioned that there's a model which doesn't have any skip connections called ENet, which is actually better than the tiramisu on everything except for tree. But on tree it's terrible. It's 77.8 versus, oh hang on, 77.3.

That's not right. I take that back. I'm sure it was less good than this model, but now I can't find that data. Anyway, the reason I wanted to mention this is that Eugenio is about to release a new model which combines these approaches with skip connections. It's called LinkNet.

So keep an eye on the forum because I'll be looking into that shortly. Let's answer them on the forum. I actually wanted to talk about this briefly. A lot of you have come up to me and been like, "We're finishing! What do we do now?" The answer is we have now created a community of all these people who have spent well over 100 hours working on deep learning for many, many months and have built their own boxes and written blog posts and done all kinds of stuff, set up social impact talks, written articles in Forbes.

Okay, this community is happening. It doesn't make any sense, in my opinion, for Rachel and me to now be saying here's what happens next. So just like Elena decided, "Okay, I want a book club": she talked to Mindy and we now have a book club, and in a couple of months' time you can all come to the book club.

So what's next? The forums will continue forever. We all know each other. Let's do good shit. Most importantly, write code. Please write code. Build apps. Take your work projects and try doing them with deep learning. Build libraries to make things easier. Maybe go back to stuff from part 1 of the course and look back and think, "Why didn't we do it this other way?

I can make this simpler." Write papers. I showed you that amazing result of the new style transfer approach from Vincent last week. Hopefully that might take you into a paper. Write blog posts. In a few weeks' time, all the MOOC guys are going to be coming through and doing part 2 of the course.

So help them out on the forum. Teaching is the best way to learn yourself. I really want to hear the success stories. People don't believe that what you've done is possible. I know that because as recently as yesterday, there was the highest ranked Hacker News comment on a story about deep learning was about how it's pointless trying to do deep learning unless you have years of mathematical background and you know C++ and you're an expert in machine learning techniques across the board and otherwise there's no way that you're going to be able to do anything useful in the real world project.

That today is what everybody believes. We now know that's not true. Rachel and I are going to start up a podcast where we're going to try to help deep learning learners. But one of the key things we want to do is tell your stories. So if you've done something interesting at work, or you've got an interesting new result, or you're just in the middle of a project and it's kind of fun, please tell us.

Either on the forum or private message or whatever. Please tell us, because we really want to share your story. And if it's not a story yet, tell us enough that we can help you and that the community can help you. Get together: the book club, or if you're watching this on the MOOC, organize other people in your geography or your workplace to get together and meet up.

In this group here I know we've got people from Apple and Uber and Airbnb who started doing this in kind of lunchtime MOOC chats and now they're here at this course. Yes, Rachel? I also wanted to recommend it would be great to start meetups to help lead other people through, say, part one of the course, kind of assist them going through it.

So Rachel and I really just want to spend the next 6 to 12 months focused on supporting your projects. So I'm very interested in working on this lung cancer stuff, but I'm also interested in every project that you guys are working on. I want to help with that. I also want to help people who want to teach this.

So Yannette is going to go from being a student to a teacher, hopefully soon. We'll be teaching USF students about deep learning, and hopefully the next batch of people about deep learning. Anybody who's interested in teaching, let us know. The best high-leverage activity is to teach the teachers.

So yeah, I don't know where this is going to end up, but my hope is really that basically I would say the experiment has worked. You guys are all here. You're reading papers, you're writing code, you're understanding the most cutting edge research level deep learning that exists today. We've gone beyond some of the cutting edge research in many situations.

Some of you have gone beyond the cutting edge research. So yeah, let's build from here as a community and anything that Rachel and I can do to help, please tell us, because we just want you to be successful and the community to be successful. So will you be still active in the forums?

>> Very active. My job is to make you guys successful. So thank you all so much for coming, and congratulations to all of you. (audience applauds)