
My Journey to Deep Learning


Chapters

0:00 Intro
2:40 Classic Machine Learning
8:03 Random Forests
12:45 Dogs vs Cats
15:32 Deep Learning Libraries
17:42 Neural Networks
28:02 Why Deep Learning
30:07 First Deep Learning Company
32:27 Making Deep Learning Accessible
34:12 Deep Learning Courses
35:45 Deep Learning Forums
36:40 Cool Models
37:23 State-of-the-art Results
39:00 Building New Courses
39:42 DAWNBench
41:43 Nvidia
42:26 fastai
43:13 fastai Documentation
45:28 fastai Code
46:19 Mid-tier API
47:12 Mixup
47:43 Optimizers
48:54 Summary
49:13 How to Incorporate Knowledge

Transcript

So I'm going to talk today about something I don't think I've really discussed before, which is my journey to deep learning. So nowadays, I am all deep learning all the time. And a lot of people seem to kind of assume that anybody who's doing deep learning kind of jumps straight into it without looking at anything else.

But actually, at least for me, it was a many decade journey to this point, and it's because I've done so many other things and seen how much easier my life is with deep learning that I've become such an evangelist with this technology. So I actually started out my career at McKinsey & Company, which is a management consulting firm.

And quite unusually, I started there when I was 18. And that was a challenge, because in strategy consulting, people are generally really leveraging their expertise and experience. And of course, I didn't have either of those things. So I really had to rely on data analysis right from the start.

So what happened was, from the very start of my career, I was really relying very heavily on applied data analysis to answer real world questions. And so in a consulting context, you don't have that much time. And you're talking to people who really have a lot of domain expertise, and you have to be able to communicate in a way that actually matters to them.

And I used a variety of techniques in my data analysis. This was just before pivot tables appeared, and when they did appear, they became something I used a lot, along with various kinds of database tools and so forth. But I did actually use machine learning quite a bit and had a lot of success with that.

Most of the machine learning that I was doing was based on kind of logistic or linear regression or something like that. Rather than show you something I did back then, which I can't because it was all proprietary, let me give you an example from the computational pathologist paper from Andy Beck, Daphne Koller, and others at Stanford.

This was, I'm trying to think, maybe 2011 or 2012. And they developed a five-year survival model for breast cancer, I believe it was. And the inputs to their five-year survival model were histopathology slides, stained slides. And they built a five-year survival predictive model that was very significantly better than anything that had come before.

And what they described in their paper is that the way they went about it was what I would nowadays call classic machine learning. They used a regularized logistic regression. And they fed into that logistic regression, if I remember correctly, thousands of features. And the features were built by a large expert team of pathologists, computer scientists, mathematicians, and so forth, who worked together to think about what kinds of features might be interesting and how to encode them: things like relationships of contiguous epithelial regions with underlying nuclear objects, characteristics of epithelial nuclei and epithelial cytoplasm, characteristics of stromal nuclei and stromal matrix, and so on and so forth.

So it took many people many years to create these features and come up with them and implement them and test them. And then the actual modeling process was fairly straightforward. They took images from patients that stayed alive for five years. And they took images from those that didn't and then used a fairly standard regularized logistic regression to build a classifier.

So basically to create parameters around these different features. To be clear, this approach worked well for this particular case and worked well for me for years for many, many projects. And it's a perfectly reasonable bread and butter technique that you can certainly still use today in a very similar way.

I spent a lot of time studying how to get the most out of this. One nice trick that a lot of people are not as familiar with as they should be is what you do with continuous inputs in these cases, and how you transform them so that you can handle non-linearities.

A lot of people use polynomials for that. And actually polynomials are generally a terrible choice. Nearly always the best choice, it turns out, is actually to use something called natural cubic splines. And natural cubic splines are basically where you split your data set into sections of the domain and you connect each section up with a cubic, so each of these bits between dots are cubics, and you create the bases such that these cubics connect up with each other and their gradients connect up.

And one of the interesting things that makes them natural splines is that the endpoints are actually linear rather than cubic, which actually makes these extrapolate outside the input domain really nicely. You can see as you add more and more knots with just two knots, you start out with a line and as you add more knots, you start to get more and more opportunities for curves.

One of the cool things about natural splines, they're also called restricted cubic splines, is that actually you don't have to think at all about where to put the knot points. It turns out that there's basically a set of quantiles where you can put the knot points pretty reliably depending on how many knot points you want, which is independent of the data, and nearly always works.
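To make that concrete, here is a minimal sketch (my own illustration in NumPy, not anything shown in the talk) of building a restricted cubic spline basis with knots placed at fixed quantiles of the data:

```python
import numpy as np

def natural_spline_basis(x, knots):
    """Restricted (natural) cubic spline basis, Harrell-style:
    cubic between the knots, linear beyond the boundary knots."""
    x, t = np.asarray(x, float), np.asarray(knots, float)
    pos3 = lambda v: np.clip(v, 0, None) ** 3        # truncated cubic
    denom = t[-1] - t[-2]
    cols = [x]                                       # plain linear term
    for j in range(len(t) - 2):                      # with only 2 knots this loop is empty: just a line
        cols.append(pos3(x - t[j])
                    - pos3(x - t[-2]) * (t[-1] - t[j]) / denom
                    + pos3(x - t[-1]) * (t[-2] - t[j]) / denom)
    return np.column_stack(cols)

# Knots go at fixed quantiles of the data, so there's nothing to hand-tune.
x = np.random.randn(1000)
knots = np.quantile(x, [0.05, 0.275, 0.5, 0.725, 0.95])  # a common choice for 5 knots
basis = natural_spline_basis(x, knots)                    # feed these columns into a regression
```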

So this was a nice trick. And then another nice trick is that if you do use regularized regression, particularly L1 regularized regression, which I really like, you don't even have to be that careful about the number of parameters you include a lot of the time. So you can often include quite a lot of transformations, including interactions of cubic spline terms.
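As a rough sketch of that combination, reusing the natural_spline_basis helper above with hypothetical arrays X and y, an L1-regularized fit over spline terms and their interactions might look like this:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Spline-expand two continuous features, then add all pairwise interactions of the bases.
q = [0.05, 0.5, 0.95]
b1 = natural_spline_basis(X[:, 0], np.quantile(X[:, 0], q))
b2 = natural_spline_basis(X[:, 1], np.quantile(X[:, 1], q))
interactions = np.hstack([b1 * b2[:, [j]] for j in range(b2.shape[1])])
features = np.hstack([b1, b2, interactions])

# L1 regularization shrinks the terms that don't matter towards (exactly) zero.
model = make_pipeline(StandardScaler(), LassoCV(cv=5))
model.fit(features, y)
```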

So this is an approach that I used a lot and had a lot of success with. But then Random Forests came along; I think the first paper appeared in '99, and it started getting popular in the early 2000s. And Random Forests, this is a picture from Terence Parr's excellent dtreeviz package.

Random Forests are ensembles of decision trees, as I'm sure most of you know. And so for an example of a decision tree, this is data from the Kaggle competition which is trying to predict the auction price of heavy industrial equipment. And you can see here that a decision tree has done a split on this binary variable of coupler system.

And then for those which don't have a coupler system, it did a binary split on year made, and for those which were made in earlier years, we can then see the sale price immediately. So this is the thing we're trying to predict, the sale price. And so in this case, we can see that in just four splits it has done a really good job of splitting out the sale price, which here is actually the log of sale price.

I actually used these single decision trees a little bit in the early and mid 90s, but it was a nightmare to find something that fit adequately but didn't overfit. And Random Forests then came along thanks to Breiman, who was a very interesting guy; he was originally a math professor at Berkeley.

And then he went out into industry and was basically a consultant, I think, for years and then came back to Berkeley to do statistics. And he was incredibly effective in creating like really practical algorithms. And the Random Forest is one that's really been world changing, incredibly simple, you just randomly pick a subset of your data.

And you then train a model, train it, just create a decision tree with a subset, you save it, and then you repeat steps one, two, and three again and again and again, creating lots and lots of decision trees on different random subsets of the data. And it turns out that if you average the results of all these models, you get predictions that are unbiased, accurate, and don't overfit.
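As a sketch of that recipe (my own illustration with hypothetical NumPy arrays X_train, y_train, X_valid; in practice scikit-learn's RandomForestRegressor packages exactly this up for you):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
trees = []
for _ in range(100):
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)  # 1. random subset of the data
    tree = DecisionTreeRegressor(max_features="sqrt")                # 2. fit a decision tree to it
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)                                               # 3. save it, then repeat

preds = np.mean([t.predict(X_valid) for t in trees], axis=0)         # average all the trees' predictions
```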

And it's a really, really cool approach. So basically as soon as this came out, I added it to my arsenal. One of the really nice things about this is how quickly you can implement it; we implemented it in about a day, basically. So this came out when I was running a company called Optimal Decisions, which I built to help insurers come up with better prices, which is the most important thing insurers do.

One of the interesting things about this, for me, is that we never actually deployed a random forest. What we did was we used random forests to understand the data, and then we used that understanding of the data to then go back and basically build more traditional regression models with the particular terms and transformations and interactions that the random forest found were important.

So basically, this is one of the cool things that you get out of a random forest. It's a feature importance plot, and it shows you-- this is, again, the same data set, the auction price data set from Kaggle-- which are the most important features. And the nice thing about this is you don't have to do any transformations or think about interactions or non-linearities.

Because they're using decision trees behind the scenes, it all just works. And so I kind of developed this pretty simple approach where I would first create a random forest, and I would then find which features and so forth are useful. I'd then use partial dependence plots to kind of look at the shapes of them.
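A rough sketch of that workflow with scikit-learn (the column name "YearMade" is just a placeholder for one of the auction data's features, and X_train here is assumed to be a DataFrame):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
rf.fit(X_train, y_train)

# Which features matter? No transformations or interaction terms needed first.
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))

# What shape does an important feature's effect have?
PartialDependenceDisplay.from_estimator(rf, X_train, features=["YearMade"])
```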

And then I would go back and, for the continuous variables that mattered, create the cubic splines, and create the interactions, and then do a regression. And so this basic kind of trick was incredibly powerful. And I used it, or a variant of it, in the early days of Kaggle amongst other things, and got to number one in the world, and won a number of competitions.

And funnily enough, actually, back in 2011, I described my approaches to Kaggle competitions in Melbourne at the Melbourne R meetup. And you can still find that talk on YouTube. And it's actually still pretty much just as relevant today as it was at that time. So this is 2011, and I became the chief scientist and president at Kaggle.

And we took it over to the US, and got venture capital, and built it into quite a successful business. But something interesting that happened as chief scientist of Kaggle, I was getting to see all the competitions up close. And seven years ago, there was a competition, Dogs vs Cats, which you can still see, actually, on the Dogs vs Cats Kaggle page, it describes the state-of-the-art approach for recognizing Dogs vs Cats as being around about 80% accuracy.

And so that was based on the academic papers that had tackled this problem at the time. And then in this competition that just ran for three months, eight teams reached 98% accuracy. And one nearly got to 99% accuracy. So if you think about this as a 20% error rate, and this is basically a 1% error rate.

So this competition brought the state-of-the-art down by about 20 times in three months, which is really extraordinary. It's really unheard of to see an academic state-of-the-art result that has been carefully studied slashed by 20x by somebody working for just three months on the problem. That's normally something that might take decades or hundreds of years, if it's possible at all.

So something clearly happened here. And of course what happened was deep learning. And Pierre actually had developed one of the early deep learning libraries. And actually, even this kind of signal on Kaggle was in some ways a little late. If you look at Pierre's Google Scholar, you'll see that it was actually back in 2011 that he and Yann LeCun had already produced a system that was better than human performance at recognizing traffic signs.

And so this was actually the first time that I noticed this really extraordinary thing, which was deep learning being better than humans at very human tasks, like looking at pictures. And so in 2011, I thought, wow, that's super interesting. But it's hard to do anything with that information, because there wasn't any open source software or even any commercial software available to actually do it.

There was Jürgen Schmidhuber's lab, which had kind of a DLL or a library you could buy from them to do it, although they didn't even have a demo. You know, there weren't any online services. And nobody had published anything like the actual recipe book of, how the hell do you do these things?

And so that was a huge challenge. It's exciting to see that this is possible, but then it's like, well, what do I do about it? But one of the cool things is that at this exact moment, this dogs and cats moment, is when two really accessible open source libraries appeared, allowing people to actually create their own deep learning models for the first time.

And critically, they were built on top of CUDA, which was a dramatically more convenient way of programming GPUs than had previously existed. So kind of things started to come together really seven years ago, a little bit. I had been interested in neural networks since the very start of my career.

And in fact, in consulting, I worked with one of the big Australian banks on implementing a neural network in the early to mid 90s to help with targeted marketing. Not a very exciting application, I'll grant you. But it really struck me at the time that this was a technology which I felt would at some point probably take over just about everything else in my area of interest around predictive modelling.

And we actually had quite a bit of success with it even then. So that's like 30 years ago now, nearly, but there were some issues back then. For one thing, we had to buy custom hardware that cost millions of dollars. And we really needed a lot of data, millions of data points.

So at a retail bank, we could do that. And yet, even then, there were things that just weren't quite working as well as we would expect. And as it turned out, the key problem was that back then everybody was relying on this math result called the universal approximation theorem, which said that a neural network could solve any given computable problem to any arbitrary level of accuracy.

And it only needs one hidden layer. And this is one of the many, many times in deep learning history where theory has been used in totally inappropriate ways. And the problem with this theory was that although this was theoretically true, in practice, a neural network with one hidden layer requires far too many nodes to be useful most of the time.

And what we actually need is lots of hidden layers. And that turns out to be much more efficient. So anyway, I did feel like for those 20 years, at some point, neural networks are going to reappear in my life because of this infinitely flexible function, the fact that they can solve any given problem in theory.

And then along with this infinitely flexible function, we combine it with gradient descent, which is this all-purpose parameter fitting algorithm. And again, there was a problem with theory here, which is that I spent many, many years focused on operations research and optimization. And operations research generally focused, again, on kind of theoretical questions of what is provably able to find the definite maximum or minimum of a function.

And gradient descent doesn't do that, particularly stochastic gradient descent. And so a lot of people were kind of ignoring it. But the thing is, the question we should be asking is not what can we prove, but what actually works in practice. And the people who, the very small number of people who were working on neural networks and gradient descent throughout the '90s and early 2000s, despite all the theory that said it's a terrible idea, actually were finding it was working really well.

Unfortunately, academia around machine learning has tended to be much more driven by theory than results, or at least for a long time it was; I still think it is too much. And so you had people like Hinton and LeCun saying, look, here's a model that's better than anything in the world at solving this problem, and based on theory we can't exactly prove why, but it really works.

And they weren't getting published, unfortunately. Anyway, so things gradually began to change. And one of the big things that changed was that finally, in around 2014, 2015, we started to see some software appearing that allowed us to conveniently train these things on GPUs, which allowed us to use relatively inexpensive computers to actually get pretty good results.

So although the theory didn't really change at this point, what did change is just more people could try things out and be like, oh, OK, this is actually practically really helpful. To people outside of the world of neural networks, this all seemed very sudden. It seemed like there was this sudden fad around deep learning, where people were suddenly going, well, this is amazing.

And so people who had seen other fads quite recently thought, well, this one will pass too. But the difference with this fad is it's actually been under development for many, many, many decades. So this was the first neural network to be built. And it was back in 1957 that it was built.

And continually, for all those decades, there were people working on making neural nets really work in practice. So what was happening in 2015 was not a sudden "here's this new thing that came out of nowhere". It was actually "here's this old thing, which we've finally got to the point where it's actually really working".

And so it's not a new fad at all, but it's really the result of decades of hard work of solving lots of problems and finally getting to a point where things are making sense. But what has happened since 2015 is the ability of these infinitely flexible functions has suddenly started to become clear, even to a layperson, because you can just look at what they're doing and it's mind blowing.

So for example, if you look at OpenAI's DALL-E, this is a model that's been trained on pairs of pictures and captions such that you can now write any arbitrary sentence. So if you write "an illustration of a baby daikon radish in a tutu walking a dog", DALL-E will draw pictures of what you described for you, and here are some actual non-cherry-picked pictures of that.

And so to be clear, this is all out of domain, right? So DALL-E has never seen illustrations of baby daikon radishes, or radishes in tutus, let alone this combination of things. It's creating these entirely from scratch. By the same token, it's never seen an avocado-shaped chair before, as best as I know, but if you type in "an armchair in the shape of an avocado", it creates these pictures for you from scratch.

And so it's really cool now that we can actually show, we can actually say, look what computers can do, and look what computers can do if you use deep learning. And to anybody who's grown up in the kind of pre-deep learning era, this just looks like magic. It's like this is not things that I believe computers can do, but here we are.

This is this theoretically universally capable model actually doing things that we've trained it to do. So in the last few years, we're now starting to see many times every year examples of computers doing things, which we're being told computers won't be able to do in our lifetime. So for example, I was repeatedly told by experts that in my lifetime, we would never see a computer win a game of Go against an expert.

And of course, we're now at the point where AlphaGo Zero got to that point in three days. And it's so far ahead of the best experts now that it kind of makes the world's best experts look like total beginners. And one of the really interesting things about AlphaGo Zero is that if you actually look at the source code for it, here it is.

And the source code for the key thing, which is like the thing that figures out whether a Go board is a good position or not fits on one slide. And furthermore, if you've done any deep learning, you'll recognize it as looking almost exactly like a standard computer vision model.

And so one of the things which people who are not themselves deep learning practitioners don't quite realize is that deep learning on the whole is not a huge collection of loosely connected tricks. It's actually, you know, every deep learning model I build looks almost exactly like every other model I build, with fairly minor differences.

And I train them in nearly exactly the same way with fairly minor differences. And so deep learning has become this incredibly flexible skill that if you have it, you can turn your attention to lots of different domain areas and rapidly get incredibly good results. So at this point, deep learning is now the best approach in the world for all kinds of applications.

I'm not going to read them all, and this is by no means a complete list. It's far longer than this. But these are some examples of the kinds of things that deep learning is better at than any other known approach. So why am I spending so much time in my life now on deep learning?

Because it really feels to me like a very dramatic step change in human capability like the development of electricity, for example. And you know, I would like to think that when I see a very dramatic step change in human capability, I'm going to spend my time working on figuring out how best to take advantage of that capability because that's, you know, there's going to be so many world-changing breakthroughs that come out of that.

And particularly as somebody who's built a few companies, as an entrepreneur, the number one thing for an entrepreneur to find, and that investors look for, is: is there something you can build now, as a company, that people couldn't build before? And with deep learning, the answer to that is yes, across tens of thousands or hundreds of thousands of areas, because suddenly there are things which we couldn't automate before and now we can, or which we can make hundreds of times more productive, and so forth.

So to me, it's a very obvious thing that like this is what I want to spend all my time on. And when I talk to students, you know, or interested entrepreneurs, I always say, you know, this is the thing which is making lots and lots of people extremely rich and is solving lots and lots of important societal problems.

And we're just seeing the tip of the iceberg. So as soon as I got to the point that I realized this, I decided to start a company to actually, you know, do something important. And so I got very excited about the opportunities in medicine and I created the first deep learning in medicine company, called Enlitic.

And I didn't know anything about medicine and I didn't know any people in medicine. So I kind of like got together a group of three other people and me and we decided to kind of hack together a quick deep learning model that would see if we can predict the malignancy of nodules in lung CT scans.

And it turned out that we could. And in fact, the algorithm that we built at this company, which I ended up calling Enlitic, had a better false positive rate and a better false negative rate than a panel of four trained radiologists. And so this was at a time when deep learning in medicine and deep learning in radiology were unheard of.

There were basically no papers about it. There were certainly no startups about it. No one was talking about it. And so this finding got some attention. And this was really important to me because my biggest goal with Enlitic was to get deep learning in medicine on the map, because I felt like it could save a lot of lives.

So I wanted to get as much attention around this as possible. And yeah, very quickly, lots and lots of people were writing about this new company. And as a result, deep learning, particularly in radiology, took off very quickly, and within two years the main radiology conference had a huge stream around AI.

It was lines out the door. They created a whole new journal for it and so forth. And so that was really exciting for me to see how we could help put a technology on the map. In some ways this was great, but in some ways it was kind of disappointing, because there were so many other areas where deep learning should have been on the map and it wasn't.

And there's no way that I could create companies around every possible area. So instead I thought, well, what I want to do is make it easy for other people to create companies and products and solutions using deep learning, particularly because at that time, nearly everybody I knew in the deep learning world were young white men from one of a small number of exclusive universities.

And the problem with that is that there's a lot of societally important problems to solve which that group of people just weren't familiar with. And even if they were both familiar with them and interested in them, they didn't know how to find the data for those or what the kind of constraints and implementation are and so forth.

So Dr. Rachel Thomas and I decided to create a new organization that would focus on one thing, which was making deep learning accessible. And so basically the idea was to say, okay, if this really is a step change in human capability, which happens from time to time in technology history, what can we do to help people use this technology regardless of their background?

And so there was a lot of constraints that we had to help remove. So the first thing we did was we thought, okay, let's at least make it so that what people already know about how to build deep learning models is as available as possible. So at this time, there weren't any courses or any kind of easy ways in to get going with deep learning.

And we had a theory which was we thought you don't need a Stanford PhD to be an effective deep learning practitioner. You don't need years and years of graduate level math training. We thought that we might be able to build a course that would allow people with just a year of coding background to become effective deep learning practitioners.

Now at this time, so what is this, about 2014 or 2015, can't quite remember, maybe 2015. Nothing like this existed and this was a really controversial hypothesis. And to be clear, we weren't sure we were right, but this was a feeling we had. So we thought let's give it a go.

So the first thing we created was the fast.ai practical deep learning course. And certainly one thing we immediately saw, which was thrilling, and we certainly didn't know would happen, was that it was popular. A lot of people took the course. We made it freely available online with no ads to make it as accessible as possible, since that's our mission.

And I said to that first class, if you create something interesting with deep learning, please tell us about it on our forums. So we created a forum so that students could communicate with each other. And we got thousands of replies. And I remember one of the first ones we got, I think this was one of the first, was somebody who tried to recognize cricket pictures from baseball pictures.

And they had, I think it was like 100% accuracy or maybe 99% accuracy. And one of the really interesting things was that they only used 30 training images. And so this is exciting to us to see somebody building a model and not only that, building it with far less data than people used to think was necessary.

And then suddenly we were being flooded with all these cool models people were building. So a Trinidad and Tobago model, a different-types-of-people model, a zucchini and cucumber model. This is a really interesting one: this person actually managed to predict what part of the world a satellite photo was from, across over 110 classes, with 85% accuracy, which is extraordinary.

A Panama bus recognizer, a batik cloth recognizer; some of the things were clearly actually going to be very useful in practice. This was something useful for disaster resilience, which was recognizing the state of buildings in this place in Tanzania. We saw people even right at the start of the course breaking state-of-the-art results.

So this is on Devanagari character recognition; this person said, wow, I just got a new state-of-the-art result, which is really exciting. We saw people doing similar things, getting state-of-the-art results on audio classification. And then we even started to hear from some of these students in the first year that they were taking their ideas back to their companies.

And in this case, a software engineer who was working at Splunk went back to his company and built a new model, which basically took mouse movements and mouse clicks and turned them into pictures, and then classified them and used this to help with fraud. And we know about this because it was so successful that it ended up being a patented product, and Splunk created a blog about this core new technology that was built by a software engineer with no previous background in this area.

We saw startups appearing, so for example, this startup called Envision appeared from one of our students, and it's still going strong, I just looked it up before this. And so, yeah, it was really cool to see how people from all walks of life were actually able to get started with deep learning.

And these courses got really popular, and so we started redoing them every year. We would build a new course from scratch because things were moving so fast that within a year there was so much new stuff to cover that we would build a completely new course. And so there are many millions of views at this point, and people are loving them, based on what they're telling YouTube anyway.

So this has been really a pleasure to see. We ended up turning the course into a book as well, along with my friend Sylvain Gugger, and people are really liking the book as well. So the next step, after getting people started with what we already know by putting that into a course, is that we wanted to push the boundaries beyond what we already know.

And so one of the things that came up was a lot of students or potential students were saying, "I don't think it's worth me getting involved in deep learning because it takes too much compute and too much data, and unless you're Google or Facebook, you can't do it." And this became particularly an issue when Google released their TPUs and put out a big PR exercise saying, "Okay, TPUs are so great that nobody else can pretty much do anything useful in deep learning now." And so we decided to enter this international competition called DAWNBench, which Google and Intel had entered, to see if we could beat them, like be faster than them at training deep learning models.

And so that was in 2018, and we won on many of the axes of the competition, including the cheapest training and the fastest on a single machine. And then we followed this up with additional results that were actually 40% faster than Google's best TPU results. And so this was exciting; here's a picture of us and our students working on this project together.

It was a really cool time because we really wanted to push back against this narrative that you have to be Google. And so we got a lot of media attention, which was great. And really, the finding here was using common sense is more important than using vast amounts of money and compute.

It was really cool to see also that a lot of the big companies noticed what we were doing and brought in our ideas. So Nvidia, when they started promoting how great their GPUs were, they started talking about how good they were with the additional ideas that we had developed with our students.

So academic research became a critical component of fast AI's work, and we did similar research to drive breakthroughs in natural language processing, in tabular modeling, and lots of other areas. So then the question is, OK, well, now that we've actually pushed the boundaries beyond what's already known, to say, OK, we can actually get better results with less data and less compute more quickly, how do we put that into the hands of everybody so that everybody can use these insights?

So that's why we decided to build a software library called FastAI. So that was just in 2018 that version 1 came out, but it immediately got a lot of attention. It got supported by all the big cloud services. And we were able to show that compared to Keras, for example, it was much more accurate, much faster, far less lines of code, and we really tried to make it as accessible as possible.

So this is some of the documentation from FastAI. You can see that not only do you get the normal kind of API documentation, but it's actually got pictures of exactly what's going on. It's got links to the papers that it's implementing. And also, all of the code for all of the pictures is all directly there as well.

And one of the really nice things is that every single page of the documentation has a link to actually open that page of documentation as an interactive notebook, because the entire thing is built with interactive notebooks. So you can then get the exact same thing, but now you can experiment with it.

And you can see all the source code there. So we really took the kind of approaches that we found worked well in our course, having students do lots of experiments and lots of coding, and making that a kind of part of our documentation as well, is to let people really play with everything themselves, experiment, and see how it all works.

So incorporating all of this research into the software was super successful. We started hearing from people saying, OK, well, I've just started with fast.ai, and I've started pulling across some of my TensorFlow models. And I don't understand why is everything so much better? What's going on here? So people were really noticing that they were getting dramatically better results.

So this person said the same thing, yep, we used to try to use TensorFlow. We spent months tweaking our model. We switched to fast.ai, and within a couple of days, we were getting better results. So by kind of combining the research with the software, we were able to provide a software library that let people get started more quickly.

And then version 2, which has been around for a bit over a year now, was a very dramatic advance further still. There's a whole academic paper that you can read describing all the deep design approaches which we've used. One of the really nice things about it is that basically, regardless of what you're trying to do with fast.ai, you can use almost exactly the same code.

So for example, here's the code necessary to recognize dogs from cats. Here's the code necessary to build a segmentation model. It's basically the same lines of code. Here's a code to classify text movie reviews, almost the same lines of code. Here's the code necessary to do collaborative filtering, almost the same lines of code.
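For a flavour of what those "same few lines" look like, here is roughly the standard pets example from the fastai documentation (a sketch, not the exact slides from the talk); recent fastai versions call the constructor vision_learner, while older ones use cnn_learner:

```python
from fastai.vision.all import *

path = untar_data(URLs.PETS) / 'images'

def is_cat(f): return f[0].isupper()          # in this dataset, cat breeds have capitalised filenames

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))

learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)                            # swap the DataLoaders and the same pattern gives segmentation, text, etc.
```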

So I said earlier that kind of under the covers, different models look more similar than different with deep learning. And so with fast.ai, we really tried to surface that so that you learn one API and you can use it anywhere. That's not enough for researchers or people that really need to dig deeper.

So one of the really nice things about that is that underneath this applications layer is a tiered or a layered API where you can go in and change anything. I'm not going to describe it in too much detail, but for example, part of this mid-tier API is a new two-way callback system, which basically allows you at any point when you're training a model to see exactly what it's doing and to change literally anything that it's doing.

You can skip parts of the training process, you can change the gradients, you can change the data and so forth. And so with this new approach, we're able to implement, for example, this is from a paper called Mixup, we're able to implement Mixup data augmentation in just a few lines of code.

And if you compare that to the actual original Facebook paper, not only was it far more lines of code, and this is what it looks like from the research paper without using callbacks, but it's also far less flexible because everything's hard-coded, whereas with this approach you can mix and match really easily.
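For reference, the underlying mixup idea itself really is only a few lines. This is a generic PyTorch sketch of the math, not fastai's actual callback, which hooks logic like this into the training loop through events such as before_batch and also adjusts the loss accordingly:

```python
import torch

def mixup_batch(x, y_onehot, alpha=0.4):
    """Blend each example (and its one-hot label) with a randomly chosen partner from the batch."""
    lam = torch.distributions.Beta(alpha, alpha).sample((x.size(0),)).to(x.device)
    lam = torch.maximum(lam, 1 - lam)                    # keep the original example dominant
    idx = torch.randperm(x.size(0), device=x.device)     # pick a partner for each example
    x_mix = lam.view(-1, 1, 1, 1) * x + (1 - lam.view(-1, 1, 1, 1)) * x[idx]
    y_mix = lam.view(-1, 1) * y_onehot + (1 - lam.view(-1, 1)) * y_onehot[idx]
    return x_mix, y_mix
```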

Another example of this layered API is we built a new approach to creating new optimizers using just two concepts, stats and steppers. I won't go into the details, but in short, this is what a particular optimizer called AdamW looks like in PyTorch and this took about two years between the paper being released and Facebook releasing the AdamW implementation.
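To give a flavour of that stats-and-steppers idea, here is a toy sketch of composing an optimizer from small "stat" functions (which track state) and "stepper" functions (which update the weights). This is my own illustration of the concept, not fastai's actual code, and it omits details like Adam's bias correction:

```python
import torch

def average_grad(state, p, mom=0.9):            # stat: running average of the gradients
    state["avg"] = state.get("avg", torch.zeros_like(p)) * mom + p.grad * (1 - mom)

def average_sqr_grad(state, p, sqr_mom=0.99):   # stat: running average of the squared gradients
    state["sqr"] = state.get("sqr", torch.zeros_like(p)) * sqr_mom + p.grad ** 2 * (1 - sqr_mom)

def weight_decay(state, p, lr=1e-3, wd=1e-2):   # stepper: decoupled weight decay (the "W" in AdamW)
    p.data.mul_(1 - lr * wd)

def adam_step(state, p, lr=1e-3, eps=1e-8):     # stepper: Adam-style update from the stats above
    p.data.addcdiv_(state["avg"], state["sqr"].sqrt() + eps, value=-lr)

def opt_step(params, state, callbacks):         # an "optimizer" is just a list of these small pieces
    for p in params:
        if p.grad is not None:
            for cb in callbacks:
                cb(state.setdefault(p, {}), p)

# after loss.backward():
# opt_step(model.parameters(), state, [average_grad, average_sqr_grad, weight_decay, adam_step])
```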

Our implementation was released within a day of the paper, and it consists of these one, two, three, four, five words. Because we're leveraging this layered API for optimizers, it's basically really easy to use the components to quickly implement new papers. Here's another example of an optimizer; this one's called LAMB.

This came from a Google paper, and one of the really cool things you might notice is that there's a very close match between the lines of code in our implementation and the lines of math in the algorithm in the paper. So that's a little summary of both what I'm doing now with fast.ai, and how I got there and why.

I'm happy to take any questions. Thanks, Jeremy. I'll start with a quick thing. You mentioned in deep learning that obviously there's very similar structures in code and solving problems, but how do you incorporate things like knowledge about the problem? Obviously the type of architecture that would have to go in there would come from the context of the problem.

Yeah, that's a great question. There's a number of really interesting ways of incorporating knowledge about the problem and it's a really important thing to do because this is how you get a whole lot of extra performance and need less data and less time. The more of that knowledge you can incorporate.

One way is certainly to directly implement it in the architecture. For example, a very popular architecture for computer vision is convolutional architecture. The convolution, a 2D convolution, is taking advantage of our domain knowledge, which is that there's generally autocorrelation across pixels in both the X and Y dimensions. We're basically mapping a set of weights across groups of pixels that are all next to each other.
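For instance, in PyTorch terms, a 2D convolution slides one small set of weights over every spatial position, which is exactly that prior about nearby pixels being related (the sizes here are arbitrary, just for illustration):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
img = torch.randn(1, 3, 224, 224)                    # one RGB image
feature_maps = conv(img)                             # shape: (1, 16, 224, 224)
print(sum(p.numel() for p in conv.parameters()))     # 3*3*3*16 weights + 16 biases = 448, shared across the whole image
```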

There's a really wide range of interesting ways of incorporating all kinds of domain knowledge into architectures. There's lots of geometry-based approaches of doing that within natural language processing. There's lots of autoregressive approaches there. That's one area. An area I am extremely fond of is data augmentation. In particular, there's been a huge improvement in the last 12 months or so in how much we can do with a tiny amount of data by using something called self-supervised learning and in particular using something called contrastive loss.

What this is doing is you basically try to come up with really thoughtful data augmentation approaches where you can say, "Okay, for example, in NLP, one of the approaches is to translate each sentence with a translation model into a different language and then translate it back again." You're now going to get a different version of the same sentence, but it should mean the same thing.

With contrastive loss, it basically says you add a part to the loss function that says those two different sentences should have the same result in our model. With something called UDA, which is basically adding contrastive loss and self-supervised learning to NLP, they were able to get results for movie review classification with just 20 labeled examples that were better than the previous state of the art using 25,000 labeled examples.
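As a rough sketch of what a contrastive loss looks like (a generic InfoNCE-style version, not UDA's exact formulation): given embeddings z1 and z2 of two augmented views of the same batch of examples, matching pairs should be more similar to each other than to anything else in the batch:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    """z1[i] and z2[i] are embeddings of two augmented views of example i."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                      # pairwise cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)    # the matching pair sits on the diagonal
    return F.cross_entropy(logits, targets)
```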

Anyway, there's lots of ways we can incorporate domain knowledge into models, but there's a couple of ones that I like. Yes, I guess there are a couple of questions about interpretability. One of the questions that came up is it's hard to explain to stakeholders, and so how can you convince them that deep learning is worth adopting?

Obviously, you can show predictive performance, but are there any other ways that you can do that? Sure. My view is that deep learning models are much more interpretable and explainable than most regression models, for example. Generally speaking, people have traditionally thought that the right way to understand, for example, regression models is to look at their coefficients, and I've always told people not to do that, because in almost any real world regression problem, you've got coefficients representing interactions, you've got coefficients on things that are collinear, you've got coefficients on various different bases of a transformed nonlinear variable.

None of the coefficients can be understood independently, because they can only be understood as how they combine with all the other things that are related. I genuinely really dislike it when people try and explain a regression by looking at coefficients. To me, the right way to understand a model is to do the same thing we would do to understand a person, which is to ask the questions.

Whether it's a regression or a random forest or a deep learning model, you can generally easily ask questions like, "What would happen if I made this variable on this row a little bit bigger or a little bit smaller?", or things like that, which actually are much easier to do in deep learning, because in deep learning, those are just questions about the derivative of the input, and so you can actually get them much more quickly and easily.
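A minimal sketch of that kind of question in PyTorch, assuming some trained single-output model and a single input row x_row (both hypothetical):

```python
import torch

x = x_row.clone().detach().requires_grad_(True)   # one input row as a float tensor
pred = model(x.unsqueeze(0)).squeeze()            # the model's prediction for that row
pred.backward()
print(x.grad)   # d(prediction)/d(each feature): what happens if this input gets a little bigger or smaller
```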

You can also do really interesting things with deep learning around showing which things are similar to each other, kind of in the deep learning feature space. You can build really cool applications for domain experts, which can give them a lot of comfort. You can say, "Yes, it's accurate," but I can also show you which parts of the input are being particularly important in this case, which other inputs are similar to this one, and often we find, for example, in the medical space, doctors will go, "Wow, that's really clever the way it recognized that this patient and this patient were similar.

A lot of doctors wouldn't have noticed that. It's actually this subtle thing going on." I guess we're right at 11 o'clock, but maybe one last question that somebody brought up is, "Are there any future research opportunities at the intersection of machine learning and quantum computing that you can think of?" That's an interesting question.

I don't know if you've thought about that. No, probably not one I've got any expertise on. Right. That will be an interesting question, but I'm not the right person to ask. One thing I do want to mention is I have just moved back to Australia after 10 years in San Francisco.

I am extremely keen to see Australia become an absolute knowledge hub around deep learning, and I would particularly love to see our fastai software be part of that. Just like when you think about TensorFlow, you have this whole ecosystem around it, around Google and startups and all this, I would love to see Australia treat fastai as the homegrown library, and for people here to really take it to heart and help us make it brilliant.

It's all open source, and we've got a Discord channel where we all chat about it, and any organizations that are interested in taking advantage of this free open source library, I would love to support them and see academic institutions. I'd love to see this become a really successful ecosystem here in Australia.

Right. It seems like it's going to be quite useful for solving lots of problems, so I think it would be good to do. There are still some questions in the chat. We'll have the chat transcript, and if there are any questions in there that might be worth Jeremy addressing, we can think about posting responses to those after the fact.

Thank you everybody for coming, and thank Jeremy for joining us today. Thanks so much. It's been great.