My Journey to Deep Learning
Chapters
0:00 Intro
2:40 Classic Machine Learning
8:03 Random Forests
12:45 Dogs vs Cats
15:32 Deep Learning Libraries
17:42 Neural Networks
28:02 Why Deep Learning
30:07 First Deep Learning Company
32:27 Making Deep Learning Accessible
34:12 Deep Learning Courses
35:45 Deep Learning Forums
36:40 Cool Models
37:23 State-of-the-art Results
39:00 Building New Courses
39:42 DAWNBench
41:43 Nvidia
42:26 fastai
43:13 fastai Documentation
45:28 fastai Code
46:19 Mid-tier API
47:12 Mixup
47:43 Optimizers
48:54 Summary
49:13 How to Incorporate Knowledge
So I'm going to talk today about something I don't think I've really discussed before. Nowadays, I am all deep learning, all the time. And a lot of people seem to assume that anybody who's doing deep learning jumped straight into it without looking at anything else. But actually, at least for me, it was a many-decade journey to this point, and it's because I've done so many other things, and seen how much easier my life is with deep learning, that I've become such an evangelist for this technology.
So I actually started out my career at McKinsey & Company, which is a management consulting firm. And quite unusually, I started there when I was 18. And that was a challenge, because in strategy consulting, people are generally really leveraging their experience and their seniority. And of course, I didn't have either of those things. So I really had to rely on data analysis right from the start. What happened was, from the very start of my career, I was relying very heavily on applied data analysis to answer real-world questions. And in a consulting context, you don't have that much time, and you're talking to people who really have a lot of domain expertise, so you have to be able to communicate in a way that actually matters to them.
And I used a variety of techniques in my data analysis. This was just before pivot tables appeared; when they did appear, they were something I used a lot, along with various kinds of database tools. But I did actually use machine learning quite a bit, and had a lot of success with that. Most of the machine learning I was doing was based on logistic or linear regression and related techniques.
Rather than show you something I did back then (I can't, because it was all proprietary), let me give you an example from the computational pathologist paper from Andy Beck and Daphne Koller's group. This was, I'm trying to think, maybe 2011 or 2012. They developed a five-year survival model for breast cancer, I believe it was. The inputs to their five-year survival model were histopathology slides, stained slides. And they built a five-year survival predictive model that was very significantly better than what had come before. And what they described in their paper is that the way they went about it was what I would nowadays call classic machine learning.
They used a logistic regression, if I remember correctly, and fed into it thousands of features. And those features were built by a large expert team of pathologists, computer scientists, mathematicians, and so forth, who worked together to think about what kinds of features might be interesting and how to encode them: things like relationships of contiguous epithelial regions with underlying nuclear objects, or characteristics of epithelial nuclei and epithelial cytoplasm, characteristics of stromal nuclei and stromal matrix, and so on. So it took many people many years to come up with these features and implement them.
And then the actual modeling process was fairly straightforward. They took images from patients who stayed alive for five years, and they took images from those who didn't, and then used a fairly standard regularized logistic regression, basically, to fit parameters for these different features. To be clear, this approach worked well for this particular case, and it worked well for me for many years. It's a perfectly reasonable bread-and-butter technique that you can certainly still use today.
I spent a lot of time studying how to get the most out of this approach. One nice trick, which a lot of people are not as familiar with as they should be, is what you do with continuous inputs in these cases: how do you transform them so that you can model them well? And actually, polynomials are generally a terrible choice. Nearly always the best choice, it turns out, is something called natural cubic splines. Natural cubic splines are basically where you split the domain of your data into sections, and you connect each section up with a cubic (each of these bits between the dots is a cubic), and you create the bases such that the cubics connect up with each other smoothly. And one of the interesting things that makes them natural splines is that the endpoints are actually linear rather than cubic, which makes them extrapolate much more sensibly outside the range of the data. You can see what happens as you add more and more knots: with just two knots, you start out with a line, and as you add more knots, you get more and more opportunities for curves. One of the cool things about natural splines (they're also called restricted cubic splines) is that you don't have to think at all about where to put the knot points. It turns out that there's basically a standard set of quantiles where you can put the knot points pretty reliably, depending on how many knots you want, and that's independent of the particulars of the data.
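Here's a minimal sketch of that, assuming patsy is available; its cr() transform builds a natural cubic spline basis with the knots placed at quantiles automatically, as just described:

```python
# A minimal sketch, assuming patsy: build a natural cubic spline basis for
# one continuous input. cr() places the knots at quantiles of the data.
import numpy as np
from patsy import dmatrix

rng = np.random.default_rng(0)
x = rng.uniform(20, 80, 500)   # a made-up continuous input, e.g. an age

# df=5 gives a small set of basis columns; the endpoints are constrained to
# be linear, so extrapolation beyond the data stays sensible.
basis = dmatrix("cr(x, df=5) - 1", {"x": x}, return_type="dataframe")
print(basis.shape)             # (500, 5): one column per basis function
```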
And then another nice trick: if you do use regularized regression, and I particularly like L1-regularized regression, you don't even have to be that careful about the number of parameters you add. So you can often include quite a lot of transformations, including interactions of variables, and let the regularization pick out what matters.
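A hedged sketch of what that can look like in practice (the columns and data here are made up; the point is the shape of the pipeline):

```python
# Sketch: spline-transform the continuous variables, include an interaction,
# and let an L1 penalty zero out the terms that don't matter.
import numpy as np
import pandas as pd
from patsy import dmatrix
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age":    rng.uniform(20, 80, 1000),
    "income": rng.lognormal(10, 1, 1000),
    "y":      rng.integers(0, 2, 1000),
})

# "*" in the formula gives main effects plus their interaction.
X = dmatrix("cr(age, df=5) * cr(income, df=5) - 1", df, return_type="dataframe")

model = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000)
model.fit(X, df["y"])
print((model.coef_ != 0).sum(), "terms survived the L1 penalty")
```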
So this is an approach that I used a lot and had a lot of success with.
But then, I think it was '99 that the first paper appeared, and in the early 2000s the technique started to get traction: Random Forests. (This picture, by the way, is from Terence Parr's excellent dtreeviz package.) Random Forests are ensembles of decision trees, as I'm sure most of you know.
So, as an example of a decision tree: this is data from the Kaggle competition which is trying to predict the auction price of heavy industrial equipment. And you can see here that the decision tree has done a split on a binary variable, whether the machine has a coupler system. And then for those which, I guess, don't have a coupler system, it did a binary split on year made; and for those made in earlier years, we can see immediately the average sale price, which is the thing we're trying to predict. And so in this case, we can see that in just four splits it has successfully found some groupings (this is actually the log of sale price) that do a really good job of separating out the prices.
I actually used these single decision trees a little bit in the early and mid 90s, but they were a nightmare: it was hard to find something that fit adequately but didn't overfit. And Random Forests then came along thanks to Breiman, a very interesting guy: he started out in academia, then went out into industry and was basically a consultant, I think, for years, and then came back to Berkeley to do statistics. And he was incredibly effective at creating really practical algorithms.
And the Random Forest is one that's really been world-changing, and it's incredibly simple: you grab a random subset of your data, and you train a model on it, just create a decision tree with that subset; you save it; and then you repeat steps one, two, and three again and again, creating lots and lots of decision trees on different random subsets of the data. And it turns out that if you average the results of all these models, you get predictions that are far more accurate, and far less overfit, than any single tree. So basically, as soon as this came out, I added it to my arsenal. One of the really nice things about this is how quickly you can implement it.
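To show how little there is to it, here's a sketch of that exact recipe (in practice you'd reach for sklearn's RandomForestRegressor, which also subsamples features at each split):

```python
# The recipe above, literally: bootstrap a subset, fit a tree, save it,
# repeat, then average the trees' predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def simple_random_forest(X, y, n_trees=100, sample_frac=0.75, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    n = len(X)
    for _ in range(n_trees):
        idx = rng.choice(n, int(n * sample_frac), replace=True)  # 1. random subset
        tree = DecisionTreeRegressor(min_samples_leaf=5)
        tree.fit(X[idx], y[idx])                                 # 2. fit a tree on it
        trees.append(tree)                                       # 3. save it, repeat
    return trees

def forest_predict(trees, X):
    return np.mean([t.predict(X) for t in trees], axis=0)       # averaging is the magic
```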
So this came out while I was running a company called Optimal Decisions, which I had built to help insurers come up with better prices, which is the most important thing for insurers to do. One of the interesting things about this, for me, is that we never actually deployed the random forests themselves. What we did was use random forests to understand the data, and then we used that understanding to go back and build more traditional regression models, with the particular terms and transformations and interactions that the random forest had found to be important.
So this is one of the cool things that you get out of a random forest: a feature importance plot. This is, again, the same data set, the auction price data set from Kaggle, and it shows you which are the most important features. And the nice thing about this is that you don't have to do any transformations or think about the functional form at all. Because it's using decision trees behind the scenes, it all just works.
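As a rough sketch of what produces a plot like that (the DataFrame and column names here are just illustrative stand-ins for the auction data):

```python
# Sketch: fit a random forest on raw columns, then rank the features.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

X = df[["YearMade", "Coupler_System", "ProductSize"]]   # hypothetical columns
y = df["log_sale_price"]

rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=5, n_jobs=-1)
rf.fit(X, y)

fi = pd.Series(rf.feature_importances_, index=X.columns).sort_values()
fi.plot.barh()   # no transformations needed first: the trees handle raw inputs
```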
And so I developed this pretty simple approach where I would first create a random forest, and then find which features and interactions were useful. I'd then use partial dependence plots to look at the shapes of them. And then I would go back and, for the continuous variables that mattered, create the cubic splines, create the interactions, and then do a regression. And so this basic trick was incredibly powerful.
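Continuing the earlier sketch (same hypothetical rf and X), the middle step might look like this: partial dependence plots from the forest show the shape of each important variable, which tells you where splines and interactions are worth adding back in the regression.

```python
# Sketch: partial dependence of the target on one (hypothetical) feature.
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(rf, X, features=["YearMade"])
# Two-feature tuples, e.g. features=[("YearMade", "ProductSize")], sketch
# out interactions in the same way.
```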
And I used it, or a variant of it, in the early days of Kaggle, among other things, and got to number one in the world and won a number of competitions. And funnily enough, back in 2011, I described my approach to Kaggle competitions, and it's actually still pretty much just as relevant today as it was at that time.
So this is 2011, and I became the chief scientist and president at Kaggle. And we took it over to the US, got venture capital, and built it into quite a successful company.
But something interesting happened while I was chief scientist at Kaggle: I was getting to see, up close, what the world's best machine learning practitioners were doing. And seven years ago, there was a competition, Dogs vs Cats. You can still see it on the Dogs vs Cats Kaggle page: it describes the state-of-the-art approach for recognizing dogs versus cats as being around about 80% accuracy. And that was based on the academic papers that had tackled this problem at the time. And then in this competition, which just ran for three months, eight teams reached 98% accuracy or more. So think about that: the old result is basically a 20% error rate, and this is basically a 1% error rate. So this competition brought the state-of-the-art error rate down by about 20 times in three months, which is really unheard of. It's unheard of to see an academic state-of-the-art result that has been carefully studied get slashed by 20x by somebody working on the problem for just three months. That's normally something that might take decades, if it's possible at all.
And of course, what happened was deep learning. The winner, Pierre Sermanet, had actually developed one of the early deep learning libraries. And actually, even this signal on Kaggle was in some ways a little late. If you look at Pierre's Google Scholar, you'll see that back in 2011, he and Yann LeCun had already produced a system that was better than human performance at recognizing traffic signs. And so this was actually the first time that I noticed this really extraordinary thing: deep learning being better than humans at very human tasks, like looking at pictures.
And so in 2011, I thought, wow, that's super interesting. But it was hard to do anything with that information, because there wasn't any open source software, or even any commercial software, available to actually do it. Jürgen Schmidhuber's lab had a kind of DLL or library you could buy from them, although they didn't even have a demo. And nobody had published anywhere the actual recipe book of how you build these things. So it's exciting to see that this is possible, but then it's like, well, what do I do about it? But one of the cool things is that at this exact moment, this dogs-and-cats moment, is when two really accessible open source libraries appeared, allowing people to actually create their own deep learning models for the first time. And critically, they were built on top of CUDA, which was a dramatically more convenient way of programming GPUs than had previously existed.
So things started to come together, really, about seven years ago. I had been interested in neural networks since the very start of my career. In fact, in consulting, I worked with one of the big Australian banks on implementing a neural network in the early-to-mid 90s, to help with targeted marketing. Not a very exciting application, I'll grant you. But it really struck me at the time that this was a technology which, I felt, would at some point probably take over just about everything else in my areas of interest. And we actually had quite a bit of success with it, even then. So that's nearly 30 years ago now. But there were some issues back then. For one thing, we had to buy custom hardware that cost millions of dollars. We really needed a lot of data, millions of data points. And even then, there were things that just weren't quite working as well as we'd hoped.
And as it turned out, the key problem was that back then, everybody was relying on a math result called the universal approximation theorem, which said that a neural network can solve any given computable problem to any arbitrary level of accuracy. And this is one of the many, many times in deep learning history where theory has been misleading. The problem with this theory was that, although it was theoretically true, in practice a neural network with one hidden layer requires far too many nodes to be useful most of the time. What we actually need is lots of hidden layers. And that turns out to be much more efficient.
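As a toy illustration of that point (the layer sizes here are arbitrary, chosen only so the two networks have roughly the same number of parameters): the theorem covers the first shape, but in practice the second shape is what tends to work.

```python
# Two networks with roughly ~100k parameters each: one wide hidden layer
# versus several narrower ones stacked deep.
import torch.nn as nn

wide_shallow = nn.Sequential(             # the shape the universal
    nn.Linear(100, 940), nn.ReLU(),       # approximation theorem talks about
    nn.Linear(940, 10),
)

deep_narrow = nn.Sequential(              # similar parameter budget,
    nn.Linear(100, 200), nn.ReLU(),       # spread across three hidden layers
    nn.Linear(200, 200), nn.ReLU(),
    nn.Linear(200, 200), nn.ReLU(),
    nn.Linear(200, 10),
)

for m in (wide_shallow, deep_narrow):
    print(sum(p.numel() for p in m.parameters()))   # both around 100k
```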
So anyway, I did feel, for those 20 years, that at some point neural networks were going to reappear in my life, because of this infinitely flexible function, the fact that they can approximate anything. And along with this infinitely flexible function, we combine it with gradient descent, which is this all-purpose parameter-fitting algorithm. And again, there was a problem with theory here. I spent many, many years focused on operations research, and operations research generally focused on, again, kind of theoretical questions: what is provably able to find the definite maximum or minimum of a function? And gradient descent doesn't do that, particularly stochastic gradient descent.
And so a lot of people were ignoring it. But the thing is, the question we should be asking is not what can we prove, but what actually works. And the very small number of people who were working on neural networks and gradient descent throughout the 90s and early 2000s, despite all the theory that said it was a terrible idea, were finding that it actually worked really well.
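For anyone who hasn't seen it spelled out, here's roughly all there is to stochastic gradient descent, sketched on a made-up regression problem: no optimality guarantees anywhere, just repeated small downhill steps on random mini-batches.

```python
# Minimal SGD: nudge the parameters downhill a little, on a random
# mini-batch, many times.
import torch

X = torch.randn(1000, 10)
true_w = torch.randn(10, 1)
y = X @ true_w + 0.1 * torch.randn(1000, 1)   # synthetic data

w = torch.zeros(10, 1, requires_grad=True)
lr = 0.1
for step in range(200):
    idx = torch.randint(0, 1000, (32,))           # random mini-batch
    loss = ((X[idx] @ w - y[idx]) ** 2).mean()    # how wrong are we here?
    loss.backward()                               # d(loss)/d(w)
    with torch.no_grad():
        w -= lr * w.grad                          # step downhill
        w.grad.zero_()
```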
Unfortunately, academia around machine learning has tended to be much more driven by theory than results, or at least for a long time it was; I still think it is, too much. And so people like Hinton and LeCun, who were saying, look, here's a model that's better than anything else in the world at solving this problem, we can't exactly prove why based on theory, but it really works, were often not getting published, unfortunately.
And one of the big things that changed was that finally, around 2014 or 2015, we started to see software appearing that allowed us to conveniently train these things on GPUs, which allowed us to use relatively inexpensive computers and actually get pretty good results. So although the theory didn't really change at this point, what did change is that far more people could try things out and find: oh, OK, this is actually practically really helpful.
To people outside the world of neural networks, this all seemed very sudden. It seemed like there was this sudden fad around deep learning, where people were suddenly piling in. And so people who had seen other fads quite recently thought, well, this one will pass too. But the difference with this fad is that it had actually been under development for many, many, many decades. This was the first neural network to be built. And continually, for all those decades, there were people working on making neural nets work better. So what was happening in 2015 was not a sudden "here's this new thing that came out of nowhere". It was actually "here's this old thing, which we've finally got to the point where it's working really well". And so it's not a new fad at all; it's really the result of decades of hard work, of solving lots of problems, and finally getting to the point where things are making sense.
But what has happened since 2015 is that the ability of these infinitely flexible functions has suddenly started to become clear, even to a layperson, because you can just look at what they produce. So for example, if you look at OpenAI's DALL-E: this is a model that's been trained on pairs of pictures and captions, such that you can now write any arbitrary sentence. So if you write "an illustration of a baby daikon radish in a tutu walking a dog", DALL-E will draw pictures of what you described for you, and here are some actual non-cherry-picked examples. And to be clear, this is all out of domain, right? DALL-E has never seen illustrations of baby daikon radishes in tutus, let alone this whole combination of things. By the same token, it's never seen an avocado-shaped chair before, as best as I know, but if you type in "an armchair in the shape of an avocado", it creates these pictures for you from scratch. And so it's really cool that we can now actually say: look what computers can do if you use deep learning. And to anybody who's grown up in the pre-deep-learning era, this just looks like magic. It's like, these are not things that I believed computers could do, but here we are. This is this theoretically universally capable model actually doing things that we've trained it to do.
So in the last few years, we're now seeing, many times every year, examples of computers doing things which we were told computers wouldn't be able to do in our lifetimes. So for example, I was repeatedly told by experts that in my lifetime we would never see a computer beat the world's best players at Go. And of course, we're now at the point where AlphaGo Zero got there in three days of training. And it's so far ahead of the best experts now that it makes the world's best experts look like beginners. And one of the really interesting things about AlphaGo Zero is that if you actually look at its source code, the key piece, the thing that figures out whether a Go board is a good position or not, fits on one slide.
And furthermore, if you've done any deep learning, you'll recognize it as looking almost exactly like any other deep learning model. And so one of the things which people who are not themselves deep learning practitioners don't quite realize is that deep learning, on the whole, is not a huge collection of somewhat disconnected tricks. Actually, every deep learning model I build looks almost exactly like every other model I build, with fairly minor differences, and I train them in nearly exactly the same way, with fairly minor differences. And so deep learning has become this incredibly flexible skill: if you have it, you can turn your attention to lots of different domain areas and rapidly get incredibly good results. So at this point, deep learning is now the best approach in the world for all kinds of problems. I'm not going to read them all, and this is by no means a complete list. But these are some examples of the kinds of things that deep learning is better at than any previous approach.
So why am I spending so much time in my life now on deep learning? Because it really feels to me like a very dramatic step change in human capability, of a kind we rarely see. And I would like to think that when I see a very dramatic step change in human capability, I'm going to spend my time working on figuring out how best to take advantage of that capability, because there are going to be so many world-changing opportunities. And particularly as somebody who's built a few companies, as an entrepreneur: the number one thing for an entrepreneur to find, and that investors look for, is, is there something you can build now, as a company, that people couldn't build before? And with deep learning, the answer to that is yes, across tens of thousands or hundreds of thousands of areas, because suddenly there are things which we couldn't automate before and now we can, or which we can make hundreds of times more productive. So to me, it's a very obvious thing: this is what I want to spend all my time on. And when I talk to students, or interested entrepreneurs, I always say: this is the thing which is making lots and lots of people extremely rich, and is solving lots and lots of important societal problems. And we're just seeing the tip of the iceberg.
So as soon as I got to the point that I realized this, I decided to start a company to actually put it to work. I got very excited about the opportunities in medicine, and I created the first deep learning and medicine company. I didn't know anything about medicine, and I didn't know any people in medicine. So I got together a group of three other people and me, and we decided to hack together a quick deep learning model that would see if we could predict the malignancy of lung nodules in CT scans. And in fact, the algorithm that we built at this company, which I ended up calling Enlitic, had a better false positive rate and a better false negative rate than a panel of expert radiologists. And this was at a time when deep learning in medicine, and deep learning in radiology, was basically unheard of. And this was really important to me, because my biggest goal with Enlitic was to get deep learning in medicine on the map, because I felt it could save a lot of lives. So I wanted to get as much attention around this as possible. And yeah, very quickly, lots and lots of people were writing about this new company. And as a result, very quickly, deep learning, particularly in radiology, took off, and within two years the main radiology conference had a huge stream around AI; they created a whole new journal for it, and so forth. And so that was really exciting for me, to see how we could help put a technology on the map.
In some ways, this was great, but in some ways it was kind of disappointing, because there were so many other areas where deep learning should have been on the map and it wasn't, and there's no way that I could create companies around every possible area. So instead I thought, well, what I want to do is make it easy for other people to create companies and products and solutions using deep learning, particularly because, at that time, nearly everybody I knew in the deep learning world was a young white man from one of a small number of exclusive universities. And the problem with that is that there are a lot of societally important problems to solve which that group of people just wasn't familiar with. And even if they were both familiar with them and interested in them, they didn't know how to find the data for them, or what the constraints on implementation were, and so forth. So Dr. Rachel Thomas and I decided to create a new organization that would focus on one thing: making deep learning accessible. And so basically the idea was to say, OK, if this really is a step change in human capability, which happens from time to time in technology history, what can we do to help people use this technology regardless of their background? And there were a lot of constraints that we had to help remove.
And so there was a lot of constraints that we had to help remove. 00:34:13.360 |
So the first thing we did was we thought, okay, let's at least make it so that what 00:34:19.720 |
people already know about how to build deep learning models is as available as possible. 00:34:25.400 |
So at this time, there weren't any courses or any kind of easy ways in to get going with 00:34:33.720 |
And we had a theory which was we thought you don't need a Stanford PhD to be an effective 00:34:43.640 |
You don't need years and years of graduate level math training. 00:34:48.940 |
We thought that we might be able to build a course that would allow people with just 00:34:53.520 |
a year of coding background to become effective deep learning practitioners. 00:34:59.600 |
Now at this time, so what is this, about 2014 or 2015, can't quite remember, maybe 2015. 00:35:07.720 |
Nothing like this existed and this was a really controversial hypothesis. 00:35:12.560 |
And to be clear, we weren't sure we were right, but this was a feeling we had. 00:35:20.880 |
So the first thing we created was the fast.ai Practical Deep Learning course. And certainly one thing we immediately saw, which was thrilling (we certainly didn't know what would happen), was that it was popular. We made it freely available online, with no ads, to make it as accessible as possible. And I said to that first class: if you create something interesting with deep learning, tell us about it. So we created a forum so that students could communicate with each other. And I remember one of the first projects we got, I think this was one of the first, was somebody who tried to distinguish cricket pictures from baseball pictures. And they had, I think it was, 100% accuracy, or maybe 99% accuracy. And one of the really interesting things was that they only used 30 training images. And so this was exciting for us: to see somebody building a model, and not only that, building it with far less data than people used to think was necessary.
And then suddenly we were being flooded with all these cool models people were building. A Trinidad and Tobago masquerader model; a zucchini-versus-cucumber model. This person actually managed to predict which part of the world a satellite photo was from, across over 110 classes, with 85% accuracy, which is extraordinary. A Panama bus recognizer; a batik cloth recognizer. Some of the things were clearly going to be genuinely useful. This one was something useful for disaster resilience: recognizing the state of buildings in photos. We saw people, even right at the start of the course, breaking state-of-the-art results. So this is on Devanagari character recognition; this person said, wow, I just got a new state of the art. We saw people getting similar state-of-the-art results on audio classification. And then we even started to hear from some of these students, in that first year, that they were taking their ideas back to their companies. In this case, a software engineer who was working at Splunk went back to his company and built a new model which basically took mouse movements and mouse clicks, turned them into pictures, and then classified them, and used this to help with fraud detection. And we know about this because it was so successful that it ended up being a patented product, and Splunk created a blog post about this core new technology, built by a software engineer with no previous background in this area. We saw startups appearing. For example, this startup called Envision appeared from one of our students, and it's still going strong; I just looked it up before this talk. And so, yeah, it was really cool to see how people from all walks of life were actually putting deep learning to work.
And so, yeah, it was really cool to see how people from all walks of life were actually 00:39:01.080 |
And these courses got really popular, and so we started redoing them every year. 00:39:07.000 |
They would build a new course from scratch because things were moving so fast that within 00:39:11.120 |
a year, there was so much new stuff to cover that we would build a completely new course. 00:39:16.800 |
And so, there's many millions of views at this point, and people are loving them based 00:39:34.040 |
We ended up turning the course into a book as well, along with my friend, Sylvain Goucher, 00:39:38.880 |
and people are really liking the book as well. 00:39:44.040 |
So the next step, after getting people started with what we already knew by putting that into a course, was that we wanted to push the boundaries beyond what we already knew. And one of the things that came up was that a lot of students, or potential students, were saying, "I don't think it's worth me getting involved in deep learning, because it takes too much compute and too much data, and unless you're Google or Facebook, you can't do it." And this became a particular issue when Google released their TPUs and put out a big PR exercise saying, essentially, "TPUs are so great that nobody else can really do anything useful without them." And so we decided to enter an international competition called DAWNBench, which Google and Intel had entered, to see if we could beat them, be faster than them at training models. And so that was 2018, and we won on many of the axes of the competition: the cheapest, the fastest on a single machine. And then we followed this up with additional results that were actually 40% faster than Google's best TPU results.
And so this was exciting. Here's a picture of us and our students working on it. It was a really cool time, because we really wanted to push back against this narrative that only the biggest companies could do deep learning. And so we got a lot of media attention, which was great. And really, the finding here was that using common sense is more important than using vast amounts of money and compute. It was really cool to see, also, that a lot of the big companies noticed what we were doing. So Nvidia, when they started promoting how great their GPUs were, started talking about how good they were when combined with the additional ideas that we had developed with our students. So academic research became a critical component of fast.ai's work, and we did similar research to drive breakthroughs in natural language processing, in tabular modeling, and lots of other areas.
So then the question is, OK, now that we've actually pushed the boundaries beyond what was already known, to say, OK, we can actually get better results with less data and less compute, more quickly, how do we put that into the hands of everybody, so that everybody can benefit? And that's why we decided to build a software library called fastai. Version 1 came out in 2018, and it immediately got a lot of attention. It got supported by all the big cloud services. And we were able to show that, compared to Keras, for example, it was much more accurate, much faster, and needed far fewer lines of code. And we really tried to make it as accessible as possible.
So this is some of the documentation from fastai. You can see that not only do you get the normal kind of API documentation, it's actually full of examples. It's got links to the papers that it's implementing. And also, all of the code for all of the pictures is directly there as well. And one of the really nice things is that every single page of the documentation has a link to open that page as an interactive notebook, because the entire library, including its documentation, is written in notebooks. So you can get the exact same thing, but now you can experiment with it. So we really took the approaches that we found worked well in our course, having students do lots of experiments and lots of coding, and made that part of our documentation as well: letting people really play with everything themselves and experiment.
So incorporating all of this research into the software was super successful. We started hearing from people saying, OK, well, I've just started with fast.ai, and I've started porting across some of my TensorFlow models, and I don't understand: why is everything so much better? So people were really noticing that they were getting dramatically better results. This person said the same thing: yep, we used to try to use TensorFlow; we switched to fast.ai, and within a couple of days we were getting better results. So by combining the research with the software, we were able to provide a library that let people get started more quickly.
And then version 2, which has been around for a bit over a year now, was a very dramatic rewrite. There's a whole academic paper that you can read describing all of its deep design approaches. One of the really nice things about it is that, basically, regardless of what you're trying to do with fast.ai, you can use almost exactly the same code. So for example, here's the code necessary to recognize dogs versus cats. Here's the code necessary to build a segmentation model. Here's the code to classify movie reviews: almost the same lines of code. Here's the code necessary to do collaborative filtering: almost the same lines of code.
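To give a feel for it, here's a sketch of two of those examples, along the lines of fastai's own published snippets (dataset choices and hyperparameters here are illustrative); note how little changes between an image task and a text task:

```python
# Image classification: cats vs dogs.
from fastai.vision.all import *

path = untar_data(URLs.PETS) / "images"
def is_cat(f): return f[0].isupper()     # in this dataset, cat files are capitalized
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)

# Text classification: movie review sentiment. Same shape of code.
from fastai.text.all import *

dls = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid="test")
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(4, 1e-2)
```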
So I said earlier that, under the covers, different models look more similar than different. And so with fast.ai, we really tried to surface that, so that you learn one API and you can use it across applications. Now, that's not enough for researchers, or for people who really need to dig deeper. So one of the really nice things is that underneath this applications layer is a tiered, layered API, where you can go in and change anything.
I'm not going to describe it in too much detail, but for example, part of this mid-tier API is a new two-way callback system, which basically allows you, at any point when you're training a model, to see exactly what it's doing and to change literally anything that it's doing. You can skip parts of the training process, you can change the gradients, you can change the data, and so forth.
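A small sketch of what that looks like, assuming fastai v2's Callback API (event names such as before_batch and after_backward, and the CancelBatchException mechanism, are from that API; the conditions here are purely illustrative):

```python
# Two-way callbacks: inspect training state at named points, mutate it,
# or cancel work entirely.
from fastai.vision.all import *

class GradClipAndSkip(Callback):
    def before_batch(self):
        # Skip any batch we decide we don't want to train on.
        if self.yb[0].float().mean() < 0:        # illustrative condition only
            raise CancelBatchException()

    def after_backward(self):
        # "Two-way": we can reach in and modify the gradients themselves.
        nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)

# learn = cnn_learner(dls, resnet34, cbs=GradClipAndSkip())
```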
And so with this new approach, we're able to implement, for example (this is from a paper called mixup), mixup data augmentation in just a few lines of code. And if you compare that to the original Facebook implementation: not only was it far more lines of code (this is what it looks like from the research paper, without using callbacks), it's also far less flexible, because everything is hard-coded; whereas with this approach, you can mix and match it with anything else.
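For reference, the core of the mixup idea itself fits in a few lines; this is a generic sketch of the technique, not fastai's actual implementation:

```python
# Mixup in miniature: blend pairs of inputs, and blend the loss in the
# same proportions.
import torch
import torch.nn.functional as F

def mixup_loss(model, x, y, alpha=0.4):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1 - lam) * x[perm]       # blended inputs
    pred = model(x_mixed)
    # Blend the loss: lam parts one target, (1 - lam) parts the other.
    return lam * F.cross_entropy(pred, y) + (1 - lam) * F.cross_entropy(pred, y[perm])
```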
Another example of this layered API is that we built a new approach to creating optimizers out of a small number of composable pieces. I won't go into the details, but in short: this is what a particular optimizer called AdamW looks like in PyTorch, and it took about two years between the paper being released and Facebook releasing the AdamW implementation. Our implementation was released within a day of the paper, and it consists of just these few lines of code. Because we're leveraging this layered API for optimizers, it's really easy to build new optimizers up from the pieces. Here's another example of an optimizer; this one's called LAMB. It came from a Google paper, and one of the really cool things you might notice is that there's a very close match between the lines of code in our implementation and the lines of the algorithm in the paper.
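This is not fastai's literal optimizer API, but a sketch of the layered idea it describes: if an optimizer is just a pipeline of small stepper functions, then a new paper usually means writing one or two new steppers rather than a whole new optimizer class.

```python
# An optimizer as a pipeline of composable "stepper" functions.
import torch

def weight_decay(p, lr, wd, **kw):      # decoupled decay: the "W" in AdamW
    p.data.mul_(1 - lr * wd)

def sgd_step(p, lr, **kw):              # the basic gradient step
    p.data.add_(p.grad, alpha=-lr)

class ComposedOptimizer:
    def __init__(self, params, steppers, **hypers):
        self.params, self.steppers, self.hypers = list(params), steppers, hypers

    def step(self):
        for p in self.params:
            if p.grad is not None:
                for stepper in self.steppers:
                    stepper(p, **self.hypers)

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()

# opt = ComposedOptimizer(model.parameters(), [weight_decay, sgd_step],
#                         lr=0.1, wd=0.01)
```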
So that's a little summary of both what I'm doing now with fast.ai and how I got there.
You mentioned that in deep learning there are obviously very similar structures in the code for solving problems, but how do you incorporate things like knowledge about the problem? Obviously the type of architecture that would have to go in there would come from the context of the problem.
There are a number of really interesting ways of incorporating knowledge about the problem, and it's a really important thing to do, because this is how you get a whole lot of extra performance: the more of that knowledge you can incorporate, the better. One way is certainly to implement it directly in the architecture. For example, a very popular architecture for computer vision is the convolutional architecture. A convolution, a 2D convolution, is taking advantage of our domain knowledge that there's generally autocorrelation across pixels in both the X and Y dimensions: we're basically mapping one set of weights across groups of pixels that are all next to each other.
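That weight sharing is easy to see in code; a quick sketch:

```python
# A 2D convolution is exactly this: one small set of weights reused at
# every spatial location.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
img = torch.randn(1, 3, 224, 224)    # a batch of one RGB image
out = conv(img)                      # shape (1, 16, 224, 224)

# One 3x3x3 weight patch per output channel, slid across all ~50k pixel
# positions: far fewer parameters than a dense layer over every pixel.
print(sum(p.numel() for p in conv.parameters()))   # 448
```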
There's a really wide range of interesting ways of incorporating all kinds of domain knowledge like this. In vision, there are lots of geometry-based approaches; within natural language processing, there are lots of autoregressive approaches. An area I am extremely fond of is data augmentation. In particular, there's been a huge improvement in the last 12 months or so in how much we can do with a tiny amount of data, by using something called self-supervised learning, and in particular by using something called contrastive loss.
What this is doing is, you basically try to come up with really thoughtful data augmentation approaches. For example, in NLP, one of the approaches is to translate each sentence with a translation model into a different language and then translate it back again. You now get a different version of the same sentence, but it shouldn't mean anything different. With contrastive loss, you then basically add a part to the loss function that says those two different sentences should have the same result in our model. With something called UDA, which is basically adding this kind of loss and self-supervised learning to NLP, they were able to get results on movie review classification, with just 20 labeled examples, that were better than the previous state of the art trained on 25,000 labeled examples.
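Here's a rough sketch of that idea as a loss function (loosely in the spirit of UDA's consistency objective; the back_translate function is hypothetical and stands in for whatever augmentation you use):

```python
# "The two versions of a sentence should give the same answer": supervised
# loss on the few labeled examples, plus an agreement term on unlabeled ones.
import torch
import torch.nn.functional as F

def semi_supervised_loss(model, x_labeled, y, x_unlabeled, back_translate, w=1.0):
    sup = F.cross_entropy(model(x_labeled), y)          # e.g. 20 labeled reviews
    log_p_orig = F.log_softmax(model(x_unlabeled), dim=-1)
    with torch.no_grad():                               # target side is held fixed
        p_aug = F.softmax(model(back_translate(x_unlabeled)), dim=-1)
    agree = F.kl_div(log_p_orig, p_aug, reduction="batchmean")
    return sup + w * agree
```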
Anyway, there are lots of ways we can incorporate domain knowledge into models, but those are a few examples.
Yes, I guess there are a couple of questions about interpretability. One of the questions that came up is that it's hard to explain to stakeholders, so how can you convince them that deep learning is worth adopting? Obviously, you can show predictive performance, but are there any other ways that you can do that?
My view is that deep learning models are much more interpretable and explainable than most traditional models. Generally speaking, the traditional approach is that people think the right way to understand, for example, regression models is to look at their coefficients, and I've always told people: don't do that. Because in almost any real-world regression problem, you've got coefficients representing interactions, you've got coefficients on things that are collinear, you've got coefficients on various different bases of a transformed nonlinear variable. None of the coefficients can be understood independently, because they can only be understood in terms of how they combine with all the other things that are related. I genuinely really dislike it when people try to explain a regression by looking at its coefficients.
To me, the right way to understand a model is to do the same thing we would do to understand any model. Whether it's a regression, a random forest, or a deep learning model, you can generally easily ask questions like, "What would happen if I made this variable on this row a little bit bigger or a little bit smaller?" And things like that are actually much easier to do in deep learning, because in deep learning, those are just questions about the derivative with respect to the input, and so you can actually get the answers much more quickly and easily.
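As a sketch of that last point (model, row, and predicted_class here are hypothetical placeholders):

```python
# "What if this input were a little bigger?" is literally a derivative,
# and autograd gives it to us directly.
import torch

x = row.clone().requires_grad_(True)   # one input row (hypothetical tensor)
output = model(x.unsqueeze(0))         # hypothetical trained model
output[0, predicted_class].backward()

sensitivity = x.grad   # x.grad[i]: how the prediction moves as feature i nudges up
```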
You can also do really interesting things with deep learning around showing which things are similar to each other in the deep learning feature space. So you can build really cool applications for domain experts which can give them a lot of insight. You can say, "Yes, it's accurate, but I can also show you which parts of the input were particularly important in this case, and which other inputs are similar to this one." And often we find, for example in the medical space, doctors will go, "Wow, that's really clever, the way it recognized that this patient and this patient were similar."
I guess we're right at 11 o'clock, but maybe one last question that somebody brought up: are there any future research opportunities in the crossover of machine learning and quantum computing? No, probably not one I've got any expertise on. That would be an interesting question, but I'm not the right person to ask.
One thing I do want to mention is that I have just moved back to Australia after 10 years in San Francisco. I am extremely keen to see Australia become an absolute knowledge hub around deep learning, and I would particularly love to see our fastai software get the kind of ecosystem that TensorFlow has, with Google and startups and companies all around it. I would love to see Australia take fastai, the homegrown library, to heart and help us make it brilliant. It's all open source, and we've got a Discord channel where we all chat about it, and any organizations that are interested in taking advantage of this free, open source library, I would love to support them, and I'd love to see academic institutions get involved too. I'd love to see this become a really successful ecosystem here in Australia. It seems like it's going to be quite useful for solving lots of problems, so I think it would be a great opportunity.
We'll have the chat transcript, and if there are any questions in there that might be worth addressing, we can think about posting responses to those. Thank you, everybody, for coming, and thank you, Jeremy, for joining us today.