My Journey to Deep Learning


Chapters

0:00 Intro
2:40 Classic Machine Learning
8:03 Random Forests
12:45 Dogs vs Cats
15:32 Deep Learning Libraries
17:42 Neural Networks
28:02 Why Deep Learning
30:07 First Deep Learning Company
32:27 Making Deep Learning Accessible
34:12 Deep Learning Courses
35:45 Deep Learning Forums
36:40 Cool Models
37:23 State-of-the-art results
39:00 Building new courses
39:42 DAWNBench
41:43 Nvidia
42:26 fast.ai
43:13 fast.ai Documentation
45:28 fast.ai Code
46:19 Mid-tier API
47:12 Mixup
47:43 Optimizers
48:54 Summary
49:13 How to incorporate knowledge

00:00:00.000 | So I'm going to talk today about something I don't think I've really discussed before,
00:00:06.000 | which is my journey to deep learning.
00:00:10.560 | So nowadays, I am all deep learning all the time.
00:00:15.360 | And a lot of people seem to kind of assume that anybody who's doing deep learning kind
00:00:22.800 | of jumps straight into it without looking at anything else.
00:00:25.760 | But actually, at least for me, it was a many decade journey to this point, and it's because
00:00:33.920 | I've done so many other things and seen how much easier my life is with deep learning
00:00:42.520 | that I've become such an evangelist with this technology.
00:00:48.320 | So I actually started out my career at McKinsey & Company, which is a management consulting
00:00:55.920 | firm.
00:00:57.280 | And quite unusually, I started there when I was 18.
00:01:04.400 | And that was a challenge, because in strategy consulting, people are generally really leveraging
00:01:13.040 | their expertise and experience.
00:01:14.920 | And of course, I didn't have either of those things.
00:01:18.440 | So I really had to rely on data analysis right from the start.
00:01:27.920 | So what happened was, from the very start of my career, I was really relying very heavily
00:01:35.720 | on applied data analysis to answer real world questions.
00:01:41.160 | And so in a consulting context, you don't have that much time.
00:01:44.640 | And you're talking to people who really have a lot of domain expertise, and you have to
00:01:49.280 | be able to communicate in a way that actually matters to them.
00:01:55.120 | And I used a variety of techniques in my data analysis.
00:02:01.760 | One of my favorite things to use was this was just before kind of pivot tables appeared.
00:02:07.720 | And when they appeared, that was like something I used a lot, used various kind of database
00:02:14.360 | tools and so forth.
00:02:15.520 | But I did actually use machine learning quite a bit and had a lot of success with that.
00:02:25.320 | Most of the machine learning that I was doing was based on kind of logistic or linear regression
00:02:32.480 | or something like that.
00:02:37.400 | Rather than show you something I did back then, because I can't because it was all proprietary,
00:02:46.080 | let me give you an example from the computational pathologist paper from Andy Beck and Daphne
00:02:55.960 | Koller and others at Stanford.
00:02:58.640 | This was I'm trying to think maybe 2011 or something like that, 2012.
00:03:05.720 | And they developed a five-year survival model for breast cancer, I believe it was.
00:03:13.440 | And the inputs to their five-year survival model were histopathology slices, stained slides.
00:03:23.880 | And they built a five-year survival predictive model that was very significantly better than
00:03:30.680 | anything that had come before.
00:03:33.040 | And what they described in their paper is the way they went about it was what I would
00:03:40.520 | call nowadays kind of classic machine learning.
00:03:44.520 | They used a regularized logistic regression.
00:03:49.720 | And they fed into that logistic regression, if I remember correctly, thousands I think
00:03:55.280 | of features.
00:03:58.840 | And the features they built were built by a large expert team of pathologists, computer
00:04:07.200 | scientists, mathematicians, and so forth that worked together to think about what kinds
00:04:12.720 | of features might be interesting and how do we encode them so that things like relationships
00:04:17.280 | of contiguous epithelial regions with underlying nuclear objects or characteristics of epithelial
00:04:23.240 | nuclei and epithelial cytoplasm, characteristics of stromal nuclei and stromal matrix, and
00:04:29.080 | so on and so forth.
00:04:30.560 | So it took many people many years to create these features and come up with them and implement
00:04:39.840 | them and test them.
00:04:41.680 | And then the actual modeling process was fairly straightforward.
00:04:46.440 | They took images from patients that stayed alive for five years.
00:04:51.360 | And they took images from those that didn't and then used a fairly standard regularized
00:04:55.760 | logistic regression to build a classifier.
00:05:00.000 | So basically to create parameters around these different features.
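
As a rough illustration of the modeling step described above (not the code from the paper), a regularized logistic regression can be sketched in a few lines of plain Python. The two features, the toy data, the L2 penalty strength `lam`, and all function names are assumptions made for this example; the talk does not say which penalty or settings the authors used.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic_l2(X, y, lam=0.1, lr=0.1, steps=2000):
    """Gradient descent on mean log loss + (lam / 2) * ||w||^2 (bias unpenalized)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        # The penalty term lam * w shrinks each weight toward zero.
        w = [wj - lr * (gj + lam * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict_proba(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Made-up data: two hand-built features per patient, label 1 = alive at five years.
X = [[0.2, 1.0], [0.4, 0.8], [0.5, 0.9], [1.5, 0.1], [1.7, 0.3], [1.9, 0.2]]
y = [1, 1, 1, 0, 0, 0]
w, b = fit_logistic_l2(X, y)
```

Calling `predict_proba(w, b, features)` on a new row then gives an estimated survival probability. The fitted values in `w` are the "parameters around these different features" mentioned in the talk.
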
00:05:05.360 | To be clear, this approach worked well for this particular case and worked well for me
00:05:14.400 | for years for many, many projects.
00:05:20.000 | And it's a perfectly reasonable bread and butter technique that you can certainly still
00:05:27.360 | use today in a very similar way.
00:05:32.520 | I spend a lot of time studying how to get the most out of this.
00:05:39.680 | One nice trick that a lot of people are not as familiar with as they should be is what
00:05:46.480 | you do with continuous inputs in these cases and how do you transform them so that you
00:05:54.040 | can handle non-linearities.
00:05:55.040 | A lot of people use polynomials for that.
00:05:59.400 | And actually polynomials are generally a terrible choice.
00:06:02.640 | Nearly always the best choice, it turns out, is actually to use something called natural
00:06:06.840 | cubic splines.
00:06:08.920 | And natural cubic splines are basically where you split your data set into sections of the
00:06:18.560 | domain and you connect each section up with a cubic, so each of these bits between dots
00:06:25.480 | are cubics, and you create the bases such that these cubics connect up with each other and
00:06:32.960 | their gradients connect up.
00:06:33.960 | And one of the interesting things that makes them natural splines is that the endpoints
00:06:38.880 | are actually linear rather than cubic, which actually makes these extrapolate outside the
00:06:46.400 | input domain really nicely.
00:06:51.160 | You can see as you add more and more knots with just two knots, you start out with a
00:06:54.920 | line and as you add more knots, you start to get more and more opportunities for curves.
00:07:02.560 | One of the cool things about natural splines, they're also called restricted cubic splines,
00:07:07.080 | is that actually you don't have to think at all about where to put the knot points.
00:07:12.720 | It turns out that there's basically a set of quantiles where you can put the knot points
00:07:20.400 | pretty reliably depending on how many knot points you want, which is independent of the
00:07:24.080 | data, and nearly always works.
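
To make the spline idea concrete, here is a minimal sketch of a natural (restricted) cubic spline basis using the truncated-power construction, which is one common way to build it. The function names, the toy input, and the knot quantile table are assumptions for illustration; the quantiles are commonly cited defaults, and the talk does not list the ones the speaker used.

```python
def pos_cubed(u):
    # Truncated cubic: u^3 when u is positive, otherwise 0.
    return u ** 3 if u > 0 else 0.0

def rcs_basis(x, knots):
    """Natural cubic spline basis for scalar x.

    With k knots this returns k - 1 values: x itself plus k - 2 nonlinear
    terms that are zero below the first knot and linear above the last.
    """
    t = sorted(knots)
    k = len(t)
    basis = [float(x)]
    if k < 3:
        return basis  # two knots give just a straight line
    gap = t[-1] - t[-2]
    for j in range(k - 2):
        term = (pos_cubed(x - t[j])
                - pos_cubed(x - t[-2]) * (t[-1] - t[j]) / gap
                + pos_cubed(x - t[-1]) * (t[-2] - t[j]) / gap)
        basis.append(term)
    return basis

def quantile(sorted_vals, q):
    # Linear interpolation between neighbouring order statistics.
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

# Assumed default knot positions, expressed as quantiles of the input.
KNOT_QUANTILES = {
    3: [0.10, 0.50, 0.90],
    4: [0.05, 0.35, 0.65, 0.95],
    5: [0.05, 0.275, 0.50, 0.725, 0.95],
}

values = sorted(float(v) for v in range(101))  # made-up continuous input
knots = [quantile(values, q) for q in KNOT_QUANTILES[4]]
columns = rcs_basis(50.0, knots)  # three regression inputs for x = 50
```

Each value of the continuous input is replaced by the columns from `rcs_basis`, and those columns go into an ordinary linear or logistic regression. Because the tails are linear, predictions outside the observed range grow gently instead of exploding the way high-degree polynomials do.
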
00:07:26.280 | So this was a nice trick.
00:07:28.880 | And then another nice trick is if you do use regularized regression, particularly L1 regularized
00:07:35.040 | regression I really like, you don't even have to be that careful about the number of parameters
00:07:40.240 | you include a lot of the time.
00:07:42.240 | So you can often include quite a lot of transformations, including interactions of
00:07:54.040 | cubic spline terms.
00:07:56.320 | So this is an approach that I used a lot and had a lot of success with.
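
The reason L1 regularization tolerates a generous set of candidate terms is that it can set unhelpful coefficients to exactly zero. A small sketch using coordinate descent with soft-thresholding, one standard way to fit an L1-penalized least-squares model, shows the effect. The data, the penalty `lam`, and the function names are invented for this example, and no intercept is fitted.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; anything within [-lam, lam] becomes 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, sweeps=100):
    """Minimize (1 / (2n)) * sum((y - Xw)^2) + lam * sum(|w|), one weight at a time."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            rho = 0.0
            scale = 0.0
            for xi, yi in zip(X, y):
                # Prediction from every feature except j.
                partial = sum(w[m] * xi[m] for m in range(d) if m != j)
                rho += xi[j] * (yi - partial)
                scale += xi[j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (scale / n)
    return w

# Column 0 drives the target; column 1 is a correlated but irrelevant term.
X = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, -1.0], [2.0, 2.0]]
y = [2.0 * row[0] for row in X]
w = lasso_coordinate_descent(X, y, lam=0.5)
```

Here the irrelevant column ends up with a coefficient of exactly zero, while the useful one is shrunk somewhat below its true value of 2. In practice the columns would be things like spline terms and their interactions.
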
00:08:04.040 | But then, I think it was '99 that the first paper appeared, and in the early 2000s it started
00:08:11.960 | getting popular: Random Forests.
00:08:16.280 | And Random Forests, this is a picture from Terence Parr's excellent dtreeviz package.
00:08:23.600 | Random Forests are ensembles of decision trees, as I'm sure most of you know.
00:08:29.720 | And so for an example of a decision tree, this is data from the Kaggle competition which
00:08:40.720 | is trying to predict the auction price of heavy industrial equipment.
00:08:45.080 | And you can see here that a decision tree has done a split on this binary variable of
00:08:51.880 | coupler system.
00:08:52.880 | And then for those which I guess don't have a coupler system, it did a binary split on
00:08:56.960 | year made, and those which then were made in early years, then we can see immediately
00:09:06.240 | the sale price.
00:09:07.240 | So this is the thing we're trying to predict the sale price.
00:09:09.520 | And so in this case, we can see that it's in just four splits, it successfully found
00:09:17.920 | some things which, this is actually the log of sale price, has done a really good job
00:09:21.480 | of splitting out the log of sale price.
00:09:23.640 | I actually used these single decision trees a little bit in the kind of early and mid
00:09:32.120 | 90s, but they were a nightmare to find something that fit adequately but didn't overfit.
00:09:41.880 | And Random Forests then came along thanks to Breiman, who, a very interesting guy, he
00:09:48.840 | was originally a math professor at Berkeley.
00:09:51.560 | And then he went out into industry and was basically a consultant, I think, for years
00:09:56.080 | and then came back to Berkeley to do statistics.
00:09:59.960 | And he was incredibly effective in creating like really practical algorithms.
00:10:05.640 | And the Random Forest is one that's really been world changing, incredibly simple, you
00:10:10.360 | just randomly pick a subset of your data.
00:10:14.360 | And you then train a model, train it, just create a decision tree with a subset, you
00:10:20.920 | save it, and then you repeat steps one, two, and three again and again and again, creating
00:10:25.600 | lots and lots of decision trees on different random subsets of the data.
00:10:29.760 | And it turns out that if you average the results of all these models, you get predictions that
00:10:37.080 | are unbiased, accurate, and don't overfit.
00:10:45.520 | And it's a really, really cool approach.
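
The three steps described above (pick a random subset, fit a tree, save it, repeat, then average) can be sketched in plain Python. This follows the simplified description in the talk; Breiman's published method draws bootstrap samples and also randomizes which features are considered at each split, which this sketch leaves out. The toy data, tree depth, and all names are assumptions.

```python
import random

def sse(vals):
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)

def build_tree(rows, depth=0, max_depth=3, min_rows=2):
    """rows is a list of (features, target); returns a nested dict or a leaf mean."""
    mean = sum(t for _, t in rows) / len(rows)
    if depth >= max_depth or len(rows) < 2 * min_rows:
        return mean
    best = None
    for j in range(len(rows[0][0])):
        for thr in sorted({f[j] for f, _ in rows}):
            left = [t for f, t in rows if f[j] <= thr]
            right = [t for f, t in rows if f[j] > thr]
            if len(left) < min_rows or len(right) < min_rows:
                continue
            score = sse(left) + sse(right)  # squared error after the split
            if best is None or score < best[0]:
                best = (score, j, thr)
    if best is None:
        return mean
    _, j, thr = best
    return {
        "feature": j,
        "threshold": thr,
        "left": build_tree([r for r in rows if r[0][j] <= thr], depth + 1, max_depth, min_rows),
        "right": build_tree([r for r in rows if r[0][j] > thr], depth + 1, max_depth, min_rows),
    }

def predict_tree(tree, x):
    while isinstance(tree, dict):
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree

def fit_forest(rows, n_trees=50, sample_frac=0.7, seed=0):
    rng = random.Random(seed)
    size = max(1, int(sample_frac * len(rows)))
    forest = []
    for _ in range(n_trees):
        subset = rng.sample(rows, size)    # step 1: random subset of the data
        forest.append(build_tree(subset))  # steps 2 and 3: fit a tree and save it
    return forest

def predict_forest(forest, x):
    # Average the predictions of all saved trees.
    return sum(predict_tree(t, x) for t in forest) / len(forest)

# Made-up data: a noisy step from about 2 to about 10 at x = 50.
noise = random.Random(1)
rows = [([x], (10.0 if x >= 50 else 2.0) + noise.uniform(-1, 1)) for x in range(100)]
forest = fit_forest(rows)
```

Each individual tree sees a different subset and so makes slightly different errors; `predict_forest` averages them out.
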
00:10:49.880 | So basically as soon as this came out, I added it to my arsenal.
00:10:52.880 | One of the really nice things about this is how quickly you can implement it.
00:10:56.160 | We implemented it in like a day, basically.
00:10:58.920 | So this came out when I was running a company called Optimal Decisions, which I built to
00:11:07.040 | help insurers come up with better prices, which is the most important thing for insurers to do.
00:11:15.400 | One of the interesting things about this, for me, is that we never actually deployed
00:11:22.680 | a random forest.
00:11:24.640 | What we did was we used random forests to understand the data, and then we used that
00:11:30.200 | understanding of the data to then go back and basically build more traditional regression
00:11:35.720 | models with the particular terms and transformations and interactions that the random forest found
00:11:42.080 | were important.
00:11:43.240 | So basically, this is one of the cool things that you get out of a random forest.
00:11:47.800 | It's a feature importance plot, and it shows you-- so this is, again, the same data set,
00:11:54.760 | the auction price data set from Kaggle-- it shows you which are the most important features.
00:12:01.040 | And the nice thing about this is you don't have to do any transformations or think about
00:12:06.840 | interactions or non-linearities.
00:12:09.600 | Because they're using decision trees behind the scenes, it all just works.
00:12:14.300 | And so I kind of developed this pretty simple approach where I would first create a random
00:12:20.800 | forest, and I would then find which features and so forth are useful.
00:12:24.720 | I'd then use partial dependence plots to kind of look at the shapes of them.
00:12:29.360 | And then I would go back and kind of, for the continuous variables that matter, create
00:12:34.080 | the cubic splines, and create the interactions, and then do a regression.
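
A partial dependence plot, the step used above to look at the shape of each important feature, boils down to a simple calculation: force one feature to a fixed value for every row, average the model's predictions, and repeat over a grid of values. The sketch below assumes a hypothetical `toy_model` standing in for a fitted random forest; the function and variable names are illustrative.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over the data with one feature forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature] = value  # override only the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

# Hypothetical stand-in for a fitted model such as a random forest.
def toy_model(x):
    return 3.0 * x[0] + x[1] ** 2

rows = [[0.0, 1.0], [5.0, 2.0], [9.0, 3.0]]
curve = partial_dependence(toy_model, rows, feature=0, grid=[0.0, 1.0, 2.0])
```

Plotting `curve` against the grid shows how the prediction responds to that one feature with everything else held at its observed values. A curved shape suggests the feature deserves spline terms in the follow-up regression; a straight line suggests a plain linear term is enough.
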
00:12:41.520 | And so this basic kind of trick was incredibly powerful.
00:12:46.400 | And I used it, and variants of it, in the early days of Kaggle amongst other things, and got
00:12:55.600 | to number one in the world, and won a number of competitions.
00:12:59.960 | And funnily enough, actually, back in 2011, I described my approaches to Kaggle competitions
00:13:07.120 | in Melbourne at the Melbourne R meetup.
00:13:11.320 | And you can still find that talk on YouTube.
00:13:14.760 | And it's actually still pretty much just as relevant today as it was at that time.
00:13:23.360 | So this is 2011, and I became the chief scientist and president at Kaggle.
00:13:33.680 | And we took it over to the US, and got venture capital, and built it into quite a successful
00:13:41.560 | business.
00:13:44.880 | But something interesting that happened as chief scientist of Kaggle, I was getting to
00:13:49.880 | see all the competitions up close.
00:13:53.520 | And seven years ago, there was a competition, Dogs vs Cats, which you can still see, actually,
00:14:04.240 | on the Dogs vs Cats Kaggle page, it describes the state-of-the-art approach for recognizing
00:14:08.440 | Dogs vs Cats as being around about 80% accuracy.
00:14:14.560 | And so that was based on the academic papers that had tackled this problem at the time.
00:14:19.700 | And then in this competition that just ran for three months, eight teams reached 98%
00:14:28.420 | accuracy.
00:14:29.420 | And one nearly got to 99% accuracy.
00:14:32.060 | So if you think about this as a 20% error rate, and this is basically a 1% error rate.
00:14:38.120 | So this competition brought the state-of-the-art down by about 20 times in three months, which
00:14:44.200 | is really extraordinary.
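
The "20 times" figure comes from comparing error rates rather than accuracies: 80% accuracy is a 20% error rate, and roughly 99% accuracy is a 1% error rate. A tiny sketch of that arithmetic, with a function name made up for this example:

```python
def error_reduction(accuracy_before_pct, accuracy_after_pct):
    """Return how many times smaller the error rate became."""
    error_before = 100 - accuracy_before_pct
    error_after = 100 - accuracy_after_pct
    return error_before / error_after

ratio_best = error_reduction(80, 99)  # the entry that nearly reached 99%
ratio_98 = error_reduction(80, 98)    # the teams that reached 98%
```

Even the teams at 98% cut the error rate by a factor of ten.
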
00:14:46.720 | It's really unheard of to see an academic state-of-the-art result that has been carefully
00:14:52.920 | studied slashed by 20x by somebody working for just three months on the problem.
00:15:02.120 | That's normally something that might take decades or hundreds of years, if it's possible
00:15:06.120 | at all.
00:15:07.120 | So something clearly happened here.
00:15:11.880 | And of course what happened was deep learning.
00:15:17.320 | And Pierre actually had developed one of the early deep learning libraries.
00:15:23.320 | And actually, even this kind of signal on Kaggle was in some ways a little late.
00:15:28.680 | If you actually look at Pierre's Google Scholar, you'll see that it was actually back in 2011
00:15:35.800 | that he and Yann LeCun had already produced a system that was better than human performance
00:15:45.120 | at recognizing traffic signs.
00:15:48.280 | And so this was actually the first time that I noticed this really extraordinary thing,
00:15:56.080 | which was deep learning being better than humans at very human tasks, like looking at pictures.
00:16:11.920 | And so in 2011, I thought, wow, that's super interesting.
00:16:17.800 | But it's hard to do anything with that information, because there wasn't any open source software
00:16:24.120 | or even any commercial software available to actually do it.
00:16:27.520 | There was-- Jürgen Schmidhuber's lab had a kind of like a DLL or something or a library
00:16:33.840 | you could buy from them to do it, although they didn't even have a demo.
00:16:39.520 | You know, there wasn't any online services.
00:16:43.680 | And there wasn't any-- nobody had published anywhere like the actual recipe book of like,
00:16:50.600 | how the hell do you do these things?
00:16:55.440 | And so that was a huge challenge.
00:17:00.280 | It's exciting to see that this is possible, but then it's like, well, what do I do about it?
00:17:05.360 | But one of the cool things is that at this exact moment, this dogs and cats moment, is
00:17:09.600 | when two really accessible open source libraries appeared, allowing people to actually create
00:17:20.200 | their own deep learning models for the first time.
00:17:23.920 | And critically, they were built on top of CUDA, which was a dramatically more convenient
00:17:30.600 | way of programming GPUs than had previously existed.
00:17:34.720 | So kind of things started to come together really seven years ago, a little bit.
00:17:44.240 | I had been interested in neural networks since the very start of my career.
00:17:52.360 | And in fact, in consulting, I worked with one of the big Australian banks on implementing
00:18:00.200 | a neural network in the early to mid 90s to help with targeted marketing.
00:18:08.040 | Not a very exciting application, I'll give you.
00:18:11.720 | But it really struck me at the time that this was a technology which I felt like at some
00:18:20.900 | point would probably take over just about everything else in terms of my area of interest
00:18:25.480 | around predictive modelling.
00:18:27.720 | And we actually had quite a bit of success with it even then.
00:18:31.000 | So that's like 30 years ago now, nearly, but there were some issues back then. For one thing,
00:18:39.160 | we had to buy custom hardware that cost millions of dollars.
00:18:42.760 | We really needed a lot of data, millions of data points.
00:18:46.120 | So on a retail bank, we could do that.
00:18:50.160 | And yeah, it was even then there were things that just weren't quite working as well as
00:18:57.280 | we would expect.
00:19:00.360 | And so as it turned out, the key problem was that back then everybody was relying on this
00:19:05.240 | math result called the universal approximation theorem, which said that a neural network could
00:19:11.120 | solve any given problem, computable problem to any arbitrary level of accuracy.
00:19:17.960 | And it only needs one hidden layer.
00:19:21.960 | And this is one of the many, many times in deep learning history where theory has been
00:19:29.480 | used in totally inappropriate ways.
00:19:31.800 | And the problem with this theory was that although this was theoretically true, in practice,
00:19:38.240 | a neural network with one hidden layer requires far too many nodes to be useful most of the
00:19:45.480 | time.
00:19:46.480 | And what we actually need is lots of hidden layers.
00:19:49.120 | And that turns out to be much more efficient.
00:19:54.640 | So anyway, I did feel like for those 20 years, at some point, neural networks are going to
00:20:03.120 | reappear in my life because of this infinitely flexible function, the fact that they can
00:20:10.600 | solve any given problem in theory.
00:20:15.120 | And then along with this infinitely flexible function, we combine it with gradient descent,
00:20:19.600 | which is this all-purpose parameter fitting algorithm.
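
Gradient descent is "all-purpose" in the sense that the same loop works for any set of parameters, as long as you can compute the gradient of the loss with respect to them. A minimal sketch fitting a straight line; the data, learning rate, step count, and names are assumptions chosen for illustration.

```python
def gradient_descent(grad, params, lr=0.05, steps=2000):
    """Repeatedly step each parameter against its gradient."""
    for _ in range(steps):
        g = grad(params)
        params = [p - lr * gi for p, gi in zip(params, g)]
    return params

# Made-up data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def mse_grad(params):
    # Gradient of mean squared error for the model y = w * x + b.
    w, b = params
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return [dw, db]

w, b = gradient_descent(mse_grad, [0.0, 0.0])
```

Swapping in a neural network changes only `mse_grad` and the number of parameters; the update loop stays the same.
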
00:20:24.880 | And again, there was a problem with theory here, which is I spent many, many years focused
00:20:30.200 | on operations research and optimization.
00:20:34.400 | And operations research generally focused on, again, kind of theoretical questions of
00:20:40.080 | what is provably able to find the definite maximum or minimum of a function.
00:20:49.400 | And gradient descent doesn't do that, particularly stochastic gradient descent.
00:20:57.640 | And so a lot of people were kind of ignoring it.
00:21:01.320 | But the thing is, the question we should be asking is not what can we prove, but what
00:21:06.960 | actually works in practice.
00:21:09.840 | And the people who, the very small number of people who were working on neural networks
00:21:14.760 | and gradient descent throughout the '90s and early 2000s, despite all the theory that said
00:21:23.160 | it's a terrible idea, actually were finding it was working really well.
00:21:30.080 | Unfortunately, academia around machine learning has tended to be much more driven by theory
00:21:40.280 | than results, or at least for a long time it was, I still think it is too much.
00:21:44.880 | And so the fact that there were people like Hinton and LeCun saying, look, here's a model
00:21:51.520 | that's better than anything in the world at solving this problem, but based on theory,
00:22:01.520 | we can't exactly prove why, but it really works.
00:22:05.920 | They were not getting published, unfortunately.
00:22:10.640 | Anyway, so things gradually began to change.
00:22:12.720 | And one of the big things that changed was that finally, in around 2014, 2015, we started
00:22:19.840 | to see some software appearing that allowed us to conveniently train these things on GPUs,
00:22:25.960 | which allowed us to use relatively inexpensive computers to actually get pretty good results.
00:22:33.960 | So although the theory didn't really change at this point, what did change is just more
00:22:40.200 | people could try things out and be like, oh, OK, this is actually practically really helpful.
00:22:47.260 | To people outside of the world of neural networks, this all seemed very sudden.
00:22:51.980 | It seemed like there was this sudden fad around deep learning, where people were suddenly
00:22:57.680 | going, well, this is amazing.
00:23:00.080 | And so people who had seen other fads quite recently thought, well, this one will pass too.
00:23:08.320 | But the difference with this fad is it's actually been under development for many, many, many
00:23:14.080 | decades.
00:23:15.240 | So this was the first neural network to be built.
00:23:18.440 | And it was back in 1957 that it was built.
00:23:22.000 | And continually, for all those decades, there were people working on making neural nets
00:23:30.360 | really work in practice.
00:23:32.480 | So what was happening in 2015 was not a sudden, here's this new thing [inaudible].
00:23:40.040 | But it was actually, here's this old thing, which we finally got to the point where it's
00:23:45.960 | actually really working.
00:23:49.320 | And so it's not a new fad at all, but it's really the result of decades of hard work
00:23:55.200 | of solving lots of problems and finally getting to a point where things are making sense.
00:24:02.120 | But what has happened since 2015 is the ability of these infinitely flexible functions has
00:24:13.200 | suddenly started to become clear, even to a layperson, because you can just look at
00:24:17.720 | what they're doing and it's mind blowing.
00:24:21.000 | So for example, if you look at OpenAI's DALL-E, this is a model that's been trained on pairs
00:24:27.640 | of pictures and captions such that you can now write any arbitrary sentence.
00:24:34.560 | So if you write an illustration of a baby daikon radish in a tutu walking a dog, DALL-E
00:24:41.960 | will draw pictures of what you described for you, and here are some actual non-cherry-picked
00:24:48.600 | pictures of that.
00:24:50.760 | And so to be clear, this is all out of domain, right?
00:24:53.960 | So DALL-E has never seen illustrations of baby daikon radishes, or radishes in tutus,
00:25:00.880 | let alone any of this combination of things.
00:25:04.760 | It's creating these entirely from scratch.
00:25:08.360 | By the same token, it's never seen an avocado-shaped chair before, as best as I know, but if you
00:25:13.440 | type in an armchair in the shape of an avocado, it creates these pictures for you from scratch.
00:25:22.840 | And so it's really cool now that we can actually show, we can actually say, look what computers
00:25:32.280 | can do, and look what computers can do if you use deep learning.
00:25:36.860 | And to anybody who's grown up in the kind of pre-deep learning era, this just looks
00:25:42.160 | like magic.
00:25:43.160 | It's like this is not things that I believe computers can do, but here we are.
00:25:49.240 | This is this theoretically universally capable model actually doing things that we've trained
00:26:00.660 | it to do.
00:26:03.160 | So in the last few years, we're now starting to see many times every year examples of computers
00:26:11.800 | doing things which we were told computers wouldn't be able to do in our lifetime.
00:26:15.960 | So for example, I was repeatedly told by experts that in my lifetime, we would never see a computer
00:26:23.480 | win a game of Go against an expert.
00:26:27.440 | And of course, we're now at the point where AlphaGo Zero got to that point in three days.
00:26:33.640 | And it's so far ahead of the best expert now that it's kind of makes the world's best experts
00:26:40.440 | look like total beginners.
00:26:44.200 | And one of the really interesting things about AlphaGo Zero is that if you actually look
00:26:49.200 | at the source code for it, here it is.
00:26:53.920 | And the source code for the key thing, which is like the thing that figures out whether
00:26:58.080 | a Go board is a good position or not fits on one slide.
00:27:04.840 | And furthermore, if you've done any deep learning, you'll recognize it as looking almost exactly
00:27:10.440 | like a standard computer vision model.
00:27:15.500 | And so one of the things which people who are not themselves deep learning practitioners
00:27:20.720 | don't quite realize is that deep learning on the whole is not a huge collection of somewhat
00:27:28.480 | disconnected but slightly connected kind of tricks.
00:27:32.500 | It's actually, you know, every deep learning model I build looks almost exactly like every
00:27:38.480 | other model I build with fairly minor differences.
00:27:42.240 | And I train them in nearly exactly the same way with fairly minor differences.
00:27:47.840 | And so deep learning has become this incredibly flexible skill that if you have it, you can
00:27:53.560 | turn your attention to lots of different domain areas and rapidly get incredibly good results.
00:28:03.660 | So at this point, deep learning is now the best approach in the world for all kinds of
00:28:11.320 | applications.
00:28:12.320 | I'm not going to read them all, and this is by no means a complete list.
00:28:16.960 | It's far longer than this.
00:28:18.200 | But these are some examples of the kinds of things that deep learning is better at than
00:28:23.420 | any other known approach.
00:28:28.160 | So why am I spending so much time in my life now on deep learning?
00:28:34.760 | Because it really feels to me like a very dramatic step change in human capability like
00:28:43.840 | the development of electricity, for example.
00:28:47.080 | And you know, I would like to think that when I see a very dramatic step change in human
00:28:51.480 | capability, I'm going to spend my time working on figuring out how best to take advantage
00:28:58.880 | of that capability because that's, you know, there's going to be so many world-changing
00:29:04.400 | breakthroughs that come out of that.
00:29:07.180 | And particularly as somebody who's built a few companies, as an entrepreneur, that the
00:29:12.840 | number one thing for an entrepreneur to find and that investors look for is, is there something
00:29:19.120 | you can build now that people couldn't build before in terms of as a company?
00:29:24.500 | And with deep learning, the answer to that is yes, across tens of thousands or hundreds
00:29:29.860 | of thousands of areas, because it's like, OK, suddenly there's tools which we couldn't
00:29:36.220 | automate before and now we can, or we can make hundreds of times more productive or
00:29:41.440 | so forth.
00:29:42.560 | So to me, it's a very obvious thing that like this is what I want to spend all my time on.
00:29:48.440 | And when I talk to students, you know, or interested entrepreneurs, I always say, you
00:29:55.960 | know, this is the thing which is making lots and lots of people extremely rich and is solving
00:30:02.320 | lots and lots of important societal problems.
00:30:05.560 | And we're just seeing the tip of the iceberg.
00:30:08.560 | So as soon as I got to the point that I realized this, I decided to start a company to actually,
00:30:19.200 | you know, do something important.
00:30:21.640 | And so I got very excited about the opportunities in medicine and I created the first deep learning
00:30:26.800 | in medicine company, called Enlitic.
00:30:30.320 | And I didn't know anything about medicine and I didn't know any people in medicine.
00:30:36.220 | So I kind of like got together a group of three other people and me and we decided to
00:30:42.760 | kind of hack together a quick deep learning model that would see if we can predict the
00:30:49.840 | malignancy of nodules in lung CT scans.
00:30:56.080 | And it turned out that we could.
00:31:00.160 | And in fact, the algorithm that we built at this company, which I ended up calling Enlitic,
00:31:04.400 | had a better false positive rate and a better false negative rate than actually a panel
00:31:08.960 | of four trained radiologists.
00:31:12.920 | And so this was at a time when deep learning in medicine and deep learning and radiology
00:31:18.480 | was unheard of.
00:31:19.960 | There were basically no papers about it.
00:31:22.280 | There was certainly no startups about it.
00:31:25.360 | No one was talking about it.
00:31:27.320 | And so this finding got some attention.
00:31:31.000 | And this was really important to me because my biggest goal with Enlitic was to kind
00:31:35.580 | of get deep learning in medicine on the map because I felt like it could save a lot of
00:31:40.440 | lives.
00:31:41.440 | So I wanted to get as much attention around this as possible.
00:31:45.040 | And yeah, very quickly, lots and lots of people were writing about this new company.
00:31:53.080 | But as a result, very quickly, deep learning, particularly in radiology took off and within
00:32:02.200 | two years, the main radiology conference had a huge stream around AI.
00:32:08.960 | It was lines out the door.
00:32:11.640 | They created a whole new journal for it and so forth.
00:32:17.360 | And so that was really exciting for me to see how we could help kind of put a technology
00:32:24.880 | on the map.
00:32:29.380 | In some ways, this was great, but in some ways it was kind of disappointing because there
00:32:34.160 | were so many other areas where deep learning should have been on the map and it wasn't.
00:32:38.760 | And there's no way that I could create companies around every possible area.
00:32:44.880 | So instead I thought, well, what I want to do is make it easy for other people to create
00:32:51.720 | companies and products and solutions using deep learning, particularly because at that
00:32:57.440 | time, nearly everybody I knew in the deep learning world were young white men from one
00:33:08.040 | of a small number of exclusive universities.
00:33:12.560 | And the problem with that is that there's a lot of societally important problems to
00:33:18.200 | solve which that group of people just weren't familiar with.
00:33:22.200 | And even if they were both familiar with them and interested in them, they didn't know how
00:33:28.400 | to find the data for those or what the kind of constraints and implementation are and
00:33:33.200 | so forth.
00:33:34.440 | So Dr. Rachel Thomas and I decided to create a new organization that would focus on one
00:33:42.440 | thing, which was making deep learning accessible.
00:33:46.080 | And so basically the idea was to say, okay, if this really is a step change in human capability,
00:33:53.640 | which happens from time to time in technology history, what can we do to help people use
00:34:01.720 | this technology regardless of their background?
00:34:06.200 | And so there were a lot of constraints that we had to help remove.
00:34:13.360 | So the first thing we did was we thought, okay, let's at least make it so that what
00:34:19.720 | people already know about how to build deep learning models is as available as possible.
00:34:25.400 | So at this time, there weren't any courses or any kind of easy ways in to get going with
00:34:32.160 | deep learning.
00:34:33.720 | And we had a theory which was we thought you don't need a Stanford PhD to be an effective
00:34:41.760 | deep learning practitioner.
00:34:43.640 | You don't need years and years of graduate level math training.
00:34:48.940 | We thought that we might be able to build a course that would allow people with just
00:34:53.520 | a year of coding background to become effective deep learning practitioners.
00:34:59.600 | Now at this time, so what is this, about 2014 or 2015, can't quite remember, maybe 2015.
00:35:07.720 | Nothing like this existed and this was a really controversial hypothesis.
00:35:12.560 | And to be clear, we weren't sure we were right, but this was a feeling we had.
00:35:19.040 | So we thought let's give it a go.
00:35:20.880 | So the first thing we created was the fast.ai Practical Deep Learning course.
00:35:27.400 | And certainly one thing we immediately saw, which was thrilling, and we certainly didn't
00:35:34.200 | know what would happen, was that it was popular.
00:35:36.520 | A lot of people took the course.
00:35:39.520 | We made it freely available online with no ads to make it as accessible as possible since
00:35:44.160 | that's our mission.
00:35:46.080 | And I said to that first class, if you create something interesting with deep learning,
00:35:53.640 | please tell us about it on our forums.
00:35:55.360 | So we created a forum so that students could communicate with each other.
00:36:00.240 | And we got thousands of replies.
00:36:03.960 | And I remember one of the first ones we got, I think this was one of the first, was somebody
00:36:10.960 | who tried to recognize cricket pictures from baseball pictures.
00:36:17.320 | And they had, I think it was like 100% accuracy or maybe 99% accuracy.
00:36:23.520 | And one of the really interesting things was that they only used 30 training images.
00:36:29.680 | And so this is exciting to us to see somebody building a model and not only that, building
00:36:34.800 | it with far less data than people used to think was necessary.
00:36:39.800 | And then suddenly we were being flooded with all these cool models people were building.
00:36:45.520 | So a Trinidad and Tobago different-types-of-people model, a zucchini and cucumber model.
00:36:55.240 | This is a really interesting one.
00:36:56.680 | This person actually managed to predict what part of the world a satellite photo was from
00:37:01.880 | with over 110 classes with 85% accuracy, which is extraordinary.
00:37:06.680 | A Panama bus recognizer, a batik cloth recognizer. Some of the things were clearly actually going
00:37:14.000 | to be very useful in practice.
00:37:15.920 | This was something useful for disaster resilience, which was recognizing the state of buildings
00:37:21.440 | in this place in Tanzania.
00:37:24.960 | We saw people even right at the start of the course breaking state of the art results.
00:37:32.720 | So this is on Devanagari character recognition. This person said, "Wow, I just got a new state
00:37:40.040 | of the art result," which is really exciting.
00:37:42.920 | We saw people doing similar things, getting state of the art results on audio classification.
00:37:49.880 | And then even we started to hear from some of these students in the first year that they
00:37:55.800 | were taking their ideas back to their companies.
00:37:59.640 | And in this case, a software engineer who was working at Splunk went back to his company
00:38:06.760 | and built a new model, which basically took mouse movements and mouse clicks and turned
00:38:11.560 | them into pictures, and then classified them and used this to help with fraud.
00:38:19.320 | And we know about this because it was so successful that it ended up being a patented product
00:38:24.200 | and Splunk created a blog about this cool new technology that was built by a software
00:38:32.840 | engineer with no previous background in this area.
00:38:35.800 | We saw startups appearing, so for example, this startup called Envision appeared from
00:38:41.160 | one of our students, and it's still going strong, I just looked it up before this.
00:38:48.120 | And so, yeah, it was really cool to see how people from all walks of life were actually
00:38:54.720 | able to get started with deep learning.
00:39:01.080 | And these courses got really popular, and so we started redoing them every year.
00:39:07.000 | We would build a new course from scratch because things were moving so fast that within
00:39:11.120 | a year, there was so much new stuff to cover that we would build a completely new course.
00:39:16.800 | And so, there's many millions of views at this point, and people are loving them based
00:39:25.800 | on what they're telling YouTube anyway.
00:39:29.240 | So this has been really a pleasure to see.
00:39:34.040 | We ended up turning the course into a book as well, along with my friend, Sylvain Gugger,
00:39:38.880 | and people are really liking the book as well.
00:39:44.040 | So the next step after getting people started with what we already know by putting that
00:39:51.240 | into a course is we wanted to push the boundaries beyond what we already know.
00:39:56.160 | And so one of the things that came up was a lot of students or potential students were
00:40:01.640 | saying, "I don't think it's worth me getting involved in deep learning because it takes
00:40:06.360 | too much compute and too much data, and unless you're Google or Facebook, you can't do it."
00:40:14.080 | And this became particularly an issue when Google released their TPUs and put out a big
00:40:18.960 | PR exercise saying, "Okay, TPUs are so great that nobody else can pretty much do anything
00:40:25.960 | useful in deep learning now."
00:40:29.280 | And so we decided to enter this international competition called DAWNBench that Google
00:40:34.920 | and Intel had entered to see if we could beat them, like be faster than them at training
00:40:41.320 | deep learning models.
00:40:44.320 | And so that was in 2018, and we won on many of the axes of the competition.
00:40:57.640 | The cheapest, the fastest GPU, the fastest single machine, and then we followed this up with
00:41:02.640 | additional results that were actually 40% faster than Google's best TPU results.
00:41:07.420 | And so this was exciting because here's a picture of us and our students working on
00:41:14.520 | this project together.
00:41:15.520 | It was a really cool time because we really wanted to push back against this narrative
00:41:22.760 | that you have to be Google.
00:41:25.600 | And so we got a lot of media attention, which was great.
00:41:31.560 | And really, the finding here was using common sense is more important than using vast amounts
00:41:40.440 | of money and compute.
00:41:43.080 | It was really cool to see also that a lot of the big companies noticed what we were
00:41:49.160 | doing and brought in our ideas.
00:41:52.760 | So Nvidia, when they started promoting how great their GPUs were, they started talking
00:41:58.160 | about how good they were with the additional ideas that we had developed with our students.
00:42:05.000 | So academic research became a critical component of fast.ai's work, and we did similar research
00:42:17.000 | to drive breakthroughs in natural language processing, in tabular modeling, and lots
00:42:24.160 | of other areas.
00:42:26.800 | So then the question is, OK, well, now that we've actually pushed the boundaries beyond
00:42:32.520 | what's already known, to say, OK, we can actually get better results with less data and less
00:42:38.680 | compute more quickly, how do we put that into the hands of everybody so that everybody can
00:42:44.280 | use these insights?
00:42:46.560 | So that's why we decided to build a software library called fastai.
00:42:52.940 | So that was just in 2018 that version 1 came out, but it immediately got a lot of attention.
00:42:57.960 | It got supported by all the big cloud services.
00:43:03.400 | And we were able to show that compared to Keras, for example, it was much more accurate,
00:43:09.160 | much faster, far fewer lines of code, and we really tried to make it as accessible as possible.
00:43:20.940 | So this is some of the documentation from fastai.
00:43:25.480 | You can see that not only do you get the normal kind of API documentation, but it's actually
00:43:31.760 | got pictures of exactly what's going on.
00:43:33.920 | It's got links to the papers that it's implementing.
00:43:36.760 | And also, all of the code for all of the pictures is all directly there as well.
00:43:41.640 | And one of the really nice things is that every single page of the documentation has
00:43:46.280 | a link to actually open that page of documentation as an interactive notebook, because the entire
00:43:52.240 | thing is built with interactive notebooks.
00:43:55.200 | So you can then get the exact same thing, but now you can experiment with it.
00:43:58.760 | And you can see all the source code there.
00:44:02.040 | So we really took the kind of approaches that we found worked well in our course, having
00:44:07.200 | students do lots of experiments and lots of coding, and made that a kind of part of
00:44:13.480 | our documentation as well, to let people really play with everything themselves, experiment,
00:44:19.680 | and see how it all works.
00:44:24.320 | So incorporating all of this research into the software was super successful.
00:44:29.400 | We started hearing from people saying, OK, well, I've just started with fast.ai, and
00:44:33.560 | I've started pulling across some of my TensorFlow models.
00:44:37.380 | And I don't understand why is everything so much better?
00:44:42.160 | What's going on here?
00:44:43.840 | So people were really noticing that they were getting dramatically better results.
00:44:50.480 | So this person said the same thing, yep, we used to try to use TensorFlow.
00:44:55.600 | We spent months tweaking our model.
00:44:57.580 | We switched to fast.ai, and within a couple of days, we were getting better results.
00:45:01.960 | So by kind of combining the research with the software, we were able to provide a software
00:45:09.960 | library that let people get started more quickly.
00:45:13.720 | And then version 2, which has been around for a bit over a year now, was a very dramatic
00:45:19.680 | advance further still.
00:45:20.920 | There's a whole academic paper that you can read describing all the deep design approaches
00:45:26.600 | which we've used.
00:45:29.980 | One of the really nice things about it is that basically, regardless of what you're trying
00:45:35.280 | to do with fast.ai, you can use almost exactly the same code.
00:45:40.840 | So for example, here's the code necessary to recognize dogs from cats.
00:45:46.760 | Here's the code necessary to build a segmentation model.
00:45:50.960 | It's basically the same lines of code.
00:45:53.600 | Here's the code to classify text movie reviews, almost the same lines of code.
00:45:59.920 | Here's the code necessary to do collaborative filtering, almost the same lines of code.
00:46:04.120 | So I said earlier that kind of under the covers, different models look more similar than different
00:46:11.200 | with deep learning.
00:46:12.520 | And so with fast.ai, we really tried to surface that so that you learn one API and you can
00:46:17.680 | use it anywhere.
00:46:22.360 | That's not enough for researchers or people that really need to dig deeper.
00:46:27.400 | So one of the really nice things about that is that underneath this applications layer
00:46:32.040 | is a tiered or a layered API where you can go in and change anything.
00:46:38.320 | I'm not going to describe it in too much detail, but for example, part of this mid-tier API
00:46:46.520 | is a new two-way callback system, which basically allows you at any point when you're training
00:46:53.720 | a model to see exactly what it's doing and to change literally anything that it's doing.
00:47:00.600 | You can skip parts of the training process, you can change the gradients, you can change
00:47:04.720 | the data and so forth.
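As a rough illustration of the two-way callback idea described here, the sketch below lets callbacks read and rewrite the training state at fixed points in the loop, and skip a step entirely. It is a minimal sketch in plain Python, not the fast.ai API: the names `Callback`, `SkipStep`, `fit`, `ClipGrad`, and `SkipLargeInputs` are all hypothetical, and the "model" is a single weight fitted to `y = w * x`.

```python
class Callback:
    """Base class: every hook is a no-op unless a subclass overrides it."""
    def before_batch(self, state): pass
    def after_grad(self, state): pass


class SkipStep(Exception):
    """Raised by a callback to skip the weight update for the current batch."""


def fit(w, data, lr, callbacks):
    """Fit y = w * x by gradient descent on squared error, one sample per batch."""
    for x, y in data:
        state = {"w": w, "x": x, "y": y, "grad": None}
        try:
            for cb in callbacks:
                cb.before_batch(state)      # callbacks may rewrite the data
            pred = state["w"] * state["x"]
            state["grad"] = 2 * (pred - state["y"]) * state["x"]
            for cb in callbacks:
                cb.after_grad(state)        # callbacks may rewrite the gradient
            w = state["w"] - lr * state["grad"]
        except SkipStep:
            pass                            # leave w unchanged for this batch
    return w


class ClipGrad(Callback):
    """Changes the gradient: clamps it to [-limit, limit]."""
    def __init__(self, limit):
        self.limit = limit

    def after_grad(self, state):
        state["grad"] = max(-self.limit, min(self.limit, state["grad"]))


class SkipLargeInputs(Callback):
    """Skips part of training: ignores batches whose input is too large."""
    def __init__(self, limit):
        self.limit = limit

    def before_batch(self, state):
        if abs(state["x"]) > self.limit:
            raise SkipStep
```

The design point is that the loop itself never changes; behaviour is added by passing a different list of callbacks, for example `fit(0.0, data, 0.1, [ClipGrad(1.0)])`.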
00:47:08.820 | And so with this new approach, we're able to implement, for example, this is from a paper
00:47:17.000 | called Mixup, we're able to implement Mixup data augmentation in just a few lines of code.
00:47:22.640 | And if you compare that to the actual original Facebook paper, not only was it far more lines
00:47:28.880 | of code, this is what it looks like from the research paper without using callbacks, but
00:47:34.740 | it's also far less flexible because everything's hard-coded, whereas with this approach, you
00:47:41.000 | can mix and match really easily.
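For readers who have not seen Mixup, the core of the technique is small: each training example is blended with another example from the same batch, and the labels are blended with the same weight. The following is a minimal sketch in plain Python, not the code shown on the slide; the function name `mixup_batch`, the default `alpha=0.4`, and the list-based data layout are assumptions made for illustration.

```python
import random


def mixup_batch(xs, ys, alpha=0.4, rng=random):
    """Blend each example with a randomly chosen partner from the same batch.

    xs: list of feature vectors (lists of floats)
    ys: list of one-hot label vectors (lists of floats)
    Returns the mixed inputs, the mixed labels, and the mixing weight used.
    """
    lam = rng.betavariate(alpha, alpha)     # mixing weight in [0, 1]
    partner = list(range(len(xs)))
    rng.shuffle(partner)                    # who each example is mixed with

    def blend(a, b):
        return [lam * ai + (1 - lam) * bi for ai, bi in zip(a, b)]

    mixed_x = [blend(xs[i], xs[j]) for i, j in enumerate(partner)]
    mixed_y = [blend(ys[i], ys[j]) for i, j in enumerate(partner)]
    return mixed_x, mixed_y, lam
```

Because the labels are blended with the same weight as the inputs, each mixed label still sums to one, so it can be used directly as a soft target.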
00:47:44.960 | Another example of this layered API is we built a new approach to creating new optimizers
00:47:51.640 | using just two concepts, stats and steppers.
00:47:53.960 | I won't go into the details, but in short, this is what a particular optimizer called
00:47:59.360 | AdamW looks like in PyTorch and this took about two years between the paper being released
00:48:07.760 | and Facebook releasing the AdamW implementation.
00:48:12.120 | Our implementation was released within a day of the paper and it consists of these one,
00:48:17.520 | two, three, four, five words.
00:48:21.680 | As we're leveraging this layered API for optimizers, it's basically really easy to utilize the
00:48:30.400 | components to quickly implement new papers.
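To make the "stats and steppers" idea concrete, here is a hedged sketch of how an optimizer could be composed from those two kinds of small functions. It is not the fast.ai source: every name below (`average_grad`, `adam_step`, `make_optimizer`, and so on) is hypothetical, the parameter is a single float rather than a tensor, and the default hyperparameters are illustrative. Stats update running state from the gradient; steppers use that state to update the parameter.

```python
import math


# Stats: update running statistics kept in `state`.
def average_grad(state, grad, beta1=0.9, **_):
    state["avg"] = beta1 * state.get("avg", 0.0) + (1 - beta1) * grad


def average_sqr_grad(state, grad, beta2=0.99, **_):
    state["sqr"] = beta2 * state.get("sqr", 0.0) + (1 - beta2) * grad * grad


def count_step(state, grad, **_):
    state["t"] = state.get("t", 0) + 1


# Steppers: take a parameter value and return the updated value.
def weight_decay(p, state, lr, wd=0.0, **_):
    return p * (1 - lr * wd)                # decoupled weight decay


def sgd_step(p, state, lr, **_):
    return p - lr * state["grad"]


def adam_step(p, state, lr, beta1=0.9, beta2=0.99, eps=1e-8, **_):
    avg = state["avg"] / (1 - beta1 ** state["t"])   # bias-corrected mean
    sqr = state["sqr"] / (1 - beta2 ** state["t"])   # bias-corrected mean square
    return p - lr * avg / (math.sqrt(sqr) + eps)


def make_optimizer(stats, steppers, **hypers):
    """Compose stats and steppers into a single step(p, grad, state) function."""
    def step(p, grad, state):
        state["grad"] = grad
        for stat in stats:
            stat(state, grad, **hypers)
        for stepper in steppers:
            p = stepper(p, state, **hypers)
        return p
    return step


sgd = make_optimizer([], [weight_decay, sgd_step], lr=0.1, wd=0.0)
adamw = make_optimizer([average_grad, average_sqr_grad, count_step],
                       [weight_decay, adam_step], lr=0.1, wd=0.01)
```

In this sketch, AdamW differs from plain Adam only by having the `weight_decay` stepper in its list, which is the kind of reuse the layered approach is aiming for.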
00:48:34.600 | Here's another example of an optimizer, this one's called LAMB.
00:48:37.600 | This came from a Google paper and one of the really cool things you might notice is that
00:48:44.080 | there's a very close match between the lines of code in our implementation and the lines
00:48:50.440 | of math in the algorithm in the paper.
00:48:56.960 | So that's a little summary of both what I'm doing now with fast.ai and how I got there
00:49:06.440 | and why.
00:49:08.880 | I'm happy to take any questions.
00:49:12.080 | Thanks, Jeremy.
00:49:21.560 | I'll start with a quick thing.
00:49:27.160 | You mentioned in deep learning that obviously there's very similar structures in code and
00:49:35.880 | solving problems, but how do you incorporate things like knowledge about the problem?
00:49:43.720 | Obviously the type of architecture that would have to go in there would come from the context
00:49:50.360 | of the problem.
00:49:51.360 | Yeah, that's a great question.
00:49:53.880 | There's a number of really interesting ways of incorporating knowledge about the problem
00:50:00.440 | and it's a really important thing to do because this is how you get a whole lot of extra performance
00:50:09.960 | and need less data and less time,
00:50:13.560 | the more of that knowledge you can incorporate.
00:50:18.360 | One way is certainly to directly implement it in the architecture.
00:50:22.520 | For example, a very popular architecture for computer vision is the convolutional architecture.
00:50:31.760 | The convolution, a 2D convolution, is taking advantage of our domain knowledge, which is
00:50:37.200 | that there's generally autocorrelation across pixels in both the X and Y dimensions.
00:50:44.640 | We're basically mapping a set of weights across groups of pixels that are all next to each
00:50:52.480 | other.
00:50:53.480 | There's a really wide range of interesting ways of incorporating all kinds of domain
00:50:58.400 | knowledge into architectures.
00:51:03.040 | There's lots of geometry-based approaches of doing that. Within natural language processing,
00:51:09.600 | there's lots of autoregressive approaches there.
00:51:13.600 | That's one area.
00:51:14.600 | An area I am extremely fond of is data augmentation.
00:51:20.480 | In particular, there's been a huge improvement in the last 12 months or so in how much we
00:51:32.320 | can do with a tiny amount of data by using something called self-supervised learning
00:51:38.640 | and in particular using something called contrastive loss.
00:51:41.640 | What this is doing is you basically try to come up with really thoughtful data augmentation
00:51:48.400 | approaches where you can say, "Okay, for example, in NLP, one of the approaches is to translate
00:51:55.680 | each sentence with a translation model into a different language and then translate it
00:52:01.040 | back again."
00:52:02.040 | You're now going to get a different version of the same sentence, but it should mean
00:52:04.960 | the same thing.
00:52:06.800 | With contrastive loss, it basically says you add a part to the loss function that says
00:52:11.040 | those two different sentences should have the same result in our model.
00:52:19.280 | With something called UDA, which is basically adding contrastive loss and self-supervised
00:52:24.640 | learning to NLP, they were able to get results for movie review classification with just
00:52:30.400 | 20 labeled examples that were better than the previous state of the art using 25,000
00:52:36.480 | labeled examples.
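The "extra part of the loss function" described here can be sketched as a supervised term on the few labeled examples plus a term that penalises the model when its predictions for a sentence and its back-translated version disagree. This is a minimal sketch under stated assumptions, not the method from the UDA paper as published: the function names are hypothetical, KL divergence is assumed as the measure of disagreement, and the logits would in practice come from a real model.

```python
import math


def softmax(logits):
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    return -math.log(probs[label])


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def combined_loss(labeled_logits, label, orig_logits, aug_logits, weight=1.0):
    """Supervised loss on a labeled example plus a consistency penalty.

    orig_logits: model output for an unlabeled sentence
    aug_logits:  model output for its augmented (e.g. back-translated) version
    """
    supervised = cross_entropy(softmax(labeled_logits), label)
    consistency = kl_divergence(softmax(orig_logits), softmax(aug_logits))
    return supervised + weight * consistency
```

When the model gives identical predictions for the original and augmented sentence, the second term is zero and only the supervised loss remains; the unlabeled data contributes a training signal only where the two predictions differ.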
00:52:37.480 | Anyway, there's lots of ways we can incorporate domain knowledge into models, but there's
00:52:43.120 | but those are a couple of ones that I like.
00:52:45.720 | Yes, I guess there are a couple of questions about interpretability.
00:52:53.000 | One of the questions that came up is it's hard to explain to stakeholders, and so how
00:52:58.840 | can you convince them that deep learning is worth adopting?
00:53:03.560 | Obviously, you can show predictive performance, but is there any other ways that you can do
00:53:10.640 | that?
00:53:11.640 | Sure.
00:53:12.640 | My view is that deep learning models are much more interpretable and explainable than most
00:53:18.440 | regression models, for example.
00:53:21.640 | Generally speaking, the traditional approach, where people think the right way to understand,
00:53:27.880 | for example, regression models is to look at their coefficients, and I've always told people
00:53:32.280 | don't do that, because in almost any real world regression problem, you've got coefficients
00:53:37.960 | representing interactions, you've got coefficients on things that are collinear, you've got coefficients
00:53:42.760 | on various different bases of a transformed nonlinear variable.
00:53:47.160 | None of the coefficients can be understood independently, because they can only be understood
00:53:52.800 | as how they combine with all the other things that are related.
00:53:58.720 | I genuinely really dislike it when people try and explain a regression by looking at
00:54:05.040 | coefficients.
00:54:06.040 | To me, the right way to understand a model is to do the same thing we would do to understand
00:54:12.800 | a person, which is to ask the questions.
00:54:16.560 | Whether it's a regression or a random forest or a deep learning model, you can generally
00:54:20.920 | easily ask questions like, "What would happen if I made this variable on this row a little
00:54:26.560 | bit bigger or a little bit smaller?", or things like that, which actually are much easier
00:54:32.680 | to do in deep learning, because in deep learning, those are just questions about the derivative
00:54:36.520 | with respect to the input, and so you can actually get them much more quickly and easily.
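The "what would happen if I made this variable a little bit bigger or smaller" question can be asked of any model by nudging one input and watching the prediction. The sketch below does this with a central finite difference; in a deep learning framework the same quantity would normally come from automatic differentiation instead. The names `sensitivity` and `toy_model`, and the feature names, are hypothetical stand-ins for illustration.

```python
def sensitivity(model, row, feature, eps=1e-4):
    """Estimate d(prediction) / d(row[feature]) with a central finite difference."""
    up = dict(row)
    up[feature] += eps
    down = dict(row)
    down[feature] -= eps
    return (model(up) - model(down)) / (2 * eps)


def toy_model(row):
    # Hypothetical stand-in for any fitted model: regression, forest, or network.
    # Note the interaction term: no single coefficient describes "age" on its own.
    return 3.0 * row["age"] + row["age"] * row["income"] - 0.5 * row["income"] ** 2


row = {"age": 2.0, "income": 4.0}
age_effect = sensitivity(toy_model, row, "age")        # close to 3 + income = 7
income_effect = sensitivity(toy_model, row, "income")  # close to age - income = -2
```

The answer is specific to this row, which is the point: with interactions in the model, the effect of `age` depends on `income`, so a per-row question is more informative than a single coefficient.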
00:54:41.860 | You can also do really interesting things with deep learning around showing which things
00:54:46.720 | are similar to each other, kind of in the deep learning feature space.
00:54:53.720 | You can build really cool applications for domain experts, which can give them a lot
00:55:00.400 | of comfort.
00:55:01.400 | You can say, "Yes, it's accurate, but I can also show you which parts of the input
00:55:05.760 | are being particularly important in this case, which other inputs are similar to this one,"
00:55:13.760 | and often we find, for example, in the medical space, doctors will go, "Wow, that's really
00:55:17.960 | clever the way it recognized that this patient and this patient were similar.
00:55:22.040 | A lot of doctors wouldn't have noticed that.
00:55:24.040 | It's actually this subtle thing going on."
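Showing "which other inputs are similar to this one" usually comes down to comparing feature vectors taken from inside the model. The sketch below assumes those vectors already exist and ranks items by cosine similarity; the patient names and numbers are invented for illustration, and cosine similarity is one reasonable choice of measure rather than the one the speaker necessarily used.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def most_similar(query_id, features, k=2):
    """Return the ids of the k items whose feature vectors are closest to the query."""
    query = features[query_id]
    scored = [(cosine_similarity(query, vec), item_id)
              for item_id, vec in features.items() if item_id != query_id]
    return [item_id for _, item_id in sorted(scored, reverse=True)[:k]]


# Hypothetical feature vectors, e.g. activations from a late layer of a model.
features = {
    "patient_a": [0.9, 0.1, 0.0],
    "patient_b": [0.8, 0.2, 0.1],
    "patient_c": [0.0, 0.1, 0.9],
}

neighbours = most_similar("patient_a", features, k=2)
```

A domain expert can then look at the retrieved neighbours and judge whether the grouping makes sense, which is often more persuasive than an accuracy figure alone.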
00:55:29.920 | I guess we're right at 11 o'clock, but maybe one last question that somebody brought up
00:55:35.320 | is, "Are there any future research opportunities in the cross between machine learning and quantum
00:55:42.120 | computing
00:55:43.120 | that you can think about?"
00:55:44.120 | That's an interesting question.
00:55:45.120 | I don't know if you've thought about that.
00:55:46.760 | No, probably not one I've got any expertise on.
00:55:49.480 | Right.
00:55:50.480 | That will be an interesting question, but I'm not the right person to ask.
00:55:54.080 | One thing I do want to mention is I have just moved back to Australia after 10 years in
00:56:01.520 | San Francisco.
00:56:04.280 | I am extremely keen to see Australia become an absolute knowledge hub around deep learning,
00:56:13.440 | and I would particularly love to see our fast.ai software like this, just like when you
00:56:19.640 | think about TensorFlow, you have this whole ecosystem around it, around Google and startups
00:56:25.160 | and all this.
00:56:26.160 | I would love to see fast.ai become Australia's homegrown library, and that people
00:56:33.720 | here will really take it to heart and help us make it brilliant.
00:56:39.160 | It's all open source, and we've got a Discord channel where we all chat about it, and any
00:56:46.040 | organizations that are interested in taking advantage of this free open source library,
00:56:52.400 | I would love to support them and see academic institutions.
00:56:56.440 | I'd love to see this become a really successful ecosystem here in Australia.
00:57:01.040 | Right.
00:57:02.040 | It seems like it's going to be quite useful to solve lots of problems, so I think it would
00:57:08.160 | be good to do.
00:57:10.360 | There are still some questions in the chat.
00:57:13.120 | We'll have the chat transcript, and if there's any questions that might be worth Jeremy addressing
00:57:20.400 | from there, we can think about posting responses to those if there's anything in there, but
00:57:26.760 | we can do that after the fact.
00:57:29.540 | Thank you everybody for coming, and thank Jeremy for joining us today.
00:57:35.160 | Thanks so much.
00:57:36.160 | It's been great.