
Jeremy Howard: fast.ai Deep Learning Courses and Research | Lex Fridman Podcast #35


Chapters

0:01 Jeremy Howard
1:17 What's the First Program You've Ever Written
3:09 Programming Languages
4:36 The Connection between Excel and Access
9:24 Array-Oriented Languages
23:36 The Origin Story of fast.ai
40:57 The Difference between Theory and Practice of Deep Learning
41:51 Transfer Learning
59:28 Super Convergence
62:08 The Future of Learning Rate Magic
66:16 Different Cloud Options for Training
69:13 Deep Learning Frameworks
92:52 What Is Spaced Repetition
93:56 Spaced Repetition Learning
97:59 Advice for People Learning New Things
100:06 Next Big Breakthrough in Artificial Intelligence

Whisper Transcript

00:00:00.000 | The following is a conversation with Jeremy Howard.
00:00:03.160 | He's the founder of fast.ai, a research institute
00:00:06.480 | dedicated to making deep learning more accessible.
00:00:09.760 | He's also a distinguished research scientist
00:00:12.600 | at the University of San Francisco,
00:00:14.640 | a former president of Kaggle,
00:00:16.680 | as well as a top-ranking competitor there.
00:00:18.800 | And in general, he's a successful entrepreneur,
00:00:21.720 | educator, researcher, and an inspiring personality
00:00:25.240 | in the AI community.
00:00:27.040 | When someone asks me, how do I get started with deep learning?
00:00:30.240 | fast.ai is one of the top places I point them to.
00:00:33.360 | It's free, it's easy to get started,
00:00:35.560 | it's insightful and accessible.
00:00:37.640 | And if I may say so, it has very little BS,
00:00:41.000 | which can sometimes dilute the value of educational content
00:00:44.160 | on popular topics like deep learning.
00:00:46.760 | fast.ai has a focus on practical application of deep learning
00:00:50.320 | and hands-on exploration of the cutting edge
00:00:52.840 | that is both incredibly accessible to beginners
00:00:56.040 | and useful to experts.
00:00:58.000 | This is the Artificial Intelligence Podcast.
00:01:01.400 | If you enjoy it, subscribe on YouTube,
00:01:03.840 | give it five stars on iTunes, support it on Patreon,
00:01:07.000 | or simply connect with me on Twitter,
00:01:09.080 | @lexfridman, spelled F-R-I-D-M-A-N.
00:01:13.360 | And now, here's my conversation with Jeremy Howard.
00:01:17.600 | What's the first program you've ever written?
00:01:20.720 | - First program I wrote that I remember
00:01:24.840 | would be at high school.
00:01:26.720 | I did an assignment where I decided to try to find out
00:01:33.600 | if there were some better musical scales
00:01:36.280 | than the normal 12-tone, 12-interval scale.
00:01:40.640 | So I wrote a program on my Commodore 64 in BASIC
00:01:43.680 | that searched through other scale sizes
00:01:46.080 | to see if it could find one where
00:01:48.280 | there were more accurate harmonies.
00:01:51.920 | - Like meantone?
00:01:53.560 | - Like you want an actual exactly three to two ratio,
00:01:56.560 | whereas with a 12-interval scale,
00:01:59.440 | it's not exactly three to two, for example.
00:02:01.520 | So that's well-tempered, as they say in the--
00:02:05.080 | - In BASIC on a Commodore 64.
00:02:07.160 | - Yeah.
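
A minimal sketch (in Python rather than Commodore 64 BASIC) of the kind of search Jeremy describes: for each equal-tempered scale with n intervals per octave, measure how closely its best step approximates a pure 3:2 fifth. In 12-tone equal temperament the closest step is 2^(7/12) ≈ 1.4983, slightly flat of 3:2.

```python
def best_fifth_error(n):
    """Smallest relative error to a pure 3:2 ratio among the steps
    of an n-interval equal-tempered scale."""
    target = 3 / 2
    return min(abs(2 ** (k / n) / target - 1) for k in range(1, n + 1))

# Scan candidate scale sizes; 12 does well, but e.g. 53 does far better.
for n in range(5, 54):
    print(f"{n:2d} intervals: best fifth off by {best_fifth_error(n):.4%}")
```
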
00:02:08.000 | - Where was the interest in music from?
00:02:09.480 | Or is it just--
00:02:10.480 | - I did music all my life, so I played saxophone
00:02:14.080 | and clarinet and piano and guitar and drums and whatever, so.
00:02:18.120 | - How does that thread go through your life?
00:02:22.160 | Where's music today?
00:02:24.200 | - It's not where I wish it was.
00:02:26.160 | For various reasons, couldn't really keep it going,
00:02:30.200 | particularly 'cause I had a lot of problems with RSI,
00:02:32.600 | with my fingers, and so I had to kind of like
00:02:34.760 | cut back anything that used hands and fingers.
00:02:38.320 | I hope one day I'll be able to get back to it health-wise.
00:02:43.920 | - So there's a love for music underlying it all?
00:02:46.080 | - For sure, yeah.
00:02:46.920 | - What's your favorite instrument?
00:02:49.520 | - Saxophone.
00:02:50.360 | - Sax.
00:02:51.200 | - It's a baritone saxophone.
00:02:52.880 | Well, probably bass saxophone, but they're awkward.
00:02:55.640 | - Well, I always love it when music
00:03:00.040 | is coupled with programming.
00:03:01.720 | There's something about a brain that utilizes those
00:03:04.680 | that emerges with creative ideas.
00:03:07.560 | So you've used and studied quite a few programming languages.
00:03:11.240 | Can you give an overview of what you've used?
00:03:15.160 | What are the pros and cons of each?
00:03:17.880 | - Well, my favorite programming environment,
00:03:21.120 | almost certainly, was Microsoft Access
00:03:24.600 | back in the earliest days.
00:03:26.480 | So that was Visual Basic for Applications,
00:03:28.920 | which is not a good programming language,
00:03:30.720 | but the programming environment was fantastic.
00:03:33.080 | It's like the ability to create user interfaces
00:03:38.080 | and tie data and actions to them and create reports
00:03:42.520 | and all that, I've never seen anything as good.
00:03:46.800 | There's things nowadays like Airtable,
00:03:48.600 | which are like small subsets of that,
00:03:53.600 | which people love for good reason,
00:03:56.160 | but unfortunately nobody's ever achieved anything like that.
00:04:01.120 | - What is that?
00:04:01.960 | If you could pause on that for a second.
00:04:03.280 | - Oh, Access?
00:04:04.120 | - Access is a database.
00:04:06.280 | - It was a database program that Microsoft produced,
00:04:09.640 | part of Office, and it kind of withered, you know,
00:04:13.440 | but basically it lets you in a totally graphical way
00:04:16.280 | create tables and relationships and queries
00:04:18.480 | and tie them to forms and set up, you know,
00:04:21.800 | event handlers and calculations.
00:04:24.720 | And it was a very complete, powerful system
00:04:28.160 | designed for not massive scalable things,
00:04:31.480 | but for like useful little applications that I loved.
00:04:36.360 | - So what's the connection between Excel and Access?
00:04:40.240 | - So very close.
00:04:42.120 | So Access kind of was the relational database equivalent,
00:04:47.680 | if you like.
00:04:48.520 | So people still do a lot of that stuff
00:04:51.080 | that should be in Access in Excel,
00:04:52.880 | because they know it.
00:04:54.120 | Excel's great as well.
00:04:55.360 | So, but it's just not as rich a programming model
00:05:00.200 | as VBA combined with a relational database.
00:05:04.640 | And so I've always loved relational databases,
00:05:07.320 | but today programming on top of a relational database
00:05:11.000 | is just a lot more of a headache.
00:05:13.520 | You know, you generally either need to kind of,
00:05:16.200 | you know, you need something that connects,
00:05:17.920 | that runs some kind of database server,
00:05:19.920 | unless you use SQLite, which has its own issues.
00:05:23.920 | Then you kind of often,
00:05:25.920 | if you want to get a nice programming model,
00:05:27.600 | you'll need to like create an, add an ORM on top.
00:05:30.400 | And then, I don't know,
00:05:31.960 | there's all these pieces to tie together,
00:05:34.360 | and it's just a lot more awkward than it should be.
00:05:36.960 | There are people that are trying to make it easier.
00:05:39.200 | So in particular, I think of F#, you know, Don Syme,
00:05:42.400 | who, with his team, has done a great job
00:05:45.760 | of making something like a database appear
00:05:50.480 | in the type system.
00:05:51.600 | So you actually get like tab completion for fields
00:05:54.200 | and tables and stuff like that.
00:05:56.240 | Anyway, so that was kind of, anyway,
00:05:59.280 | so like that whole VBA office thing,
00:06:01.480 | I guess was a starting point, which I still miss.
00:06:04.600 | And I got into standard Visual Basic, which-
00:06:07.800 | - That's interesting just to pause on that for a second.
00:06:09.880 | It's interesting that you're connecting programming languages
00:06:13.480 | to the ease of management of data.
00:06:17.400 | - Yeah.
00:06:18.240 | - So in your use of programming languages,
00:06:20.560 | you always had a love and a connection with data.
00:06:24.840 | - I've always been interested in doing useful things
00:06:27.960 | for myself and for others,
00:06:29.440 | which generally means getting some data
00:06:31.840 | and doing something with it and putting it out there again.
00:06:34.520 | So that's been my interest throughout.
00:06:38.360 | So I also did a lot of stuff with AppleScript
00:06:41.520 | back in the early days.
00:06:42.960 | So it's kind of nice being able to get the computer
00:06:47.920 | and computers to talk to each other
00:06:50.080 | and to do things for you.
00:06:51.680 | And then I think that one,
00:06:54.560 | the programming language I most loved
00:06:57.840 | then would have been Delphi, which was Object Pascal,
00:07:01.760 | created by Anders Hejlsberg,
00:07:04.800 | who previously did Turbo Pascal
00:07:07.400 | and then went on to create .NET
00:07:08.800 | and then went on to create TypeScript.
00:07:11.040 | Delphi was amazing 'cause it was like a compiled,
00:07:14.840 | fast language that was as easy to use as Visual Basic.
00:07:19.840 | - Delphi, what is it similar to in more modern languages?
00:07:25.160 | - Visual Basic.
00:07:28.840 | - Visual Basic.
00:07:29.680 | - Yeah, but a compiled fast version.
00:07:32.280 | So I'm not sure there's anything quite like it anymore.
00:07:37.040 | If you took like C# or Java
00:07:40.600 | and got rid of the virtual machine
00:07:42.440 | and replaced it with something,
00:07:43.400 | you could compile a small, tight binary.
00:07:46.520 | I feel like it's where Swift could get to
00:07:50.680 | with the new Swift UI
00:07:52.600 | and the cross-platform development going on.
00:07:56.440 | Like that's one of my dreams
00:07:59.320 | is that we'll hopefully get back to where Delphi was.
00:08:02.800 | There is actually a free Pascal project nowadays
00:08:07.800 | called Lazarus,
00:08:09.320 | which is also attempting to kind of recreate Delphi.
00:08:13.360 | So they're making good progress.
00:08:16.040 | - So, okay, Delphi,
00:08:18.520 | that's one of your favorite programming languages.
00:08:20.920 | - Or at least programming environments.
00:08:22.320 | Again, I'd say Pascal's not a nice language.
00:08:26.240 | If you wanted to know specifically
00:08:27.840 | about what languages I like,
00:08:29.600 | I would definitely pick J
00:08:31.640 | as being an amazingly wonderful language.
00:08:34.480 | - What's J?
00:08:37.040 | - J, are you aware of APL?
00:08:39.600 | - I am not.
00:08:40.440 | - Okay, so. - Except from doing
00:08:41.440 | a little research on the work you've done.
00:08:44.040 | - Okay, so not at all surprising
00:08:47.120 | you're not familiar with it
00:08:47.960 | 'cause it's not well known,
00:08:49.000 | but it's actually one of the main
00:08:51.600 | families of programming languages
00:08:55.920 | going back to the late '50s, early '60s.
00:08:57.880 | So there was a couple of major directions.
00:09:01.640 | One was the kind of Lambda calculus,
00:09:04.400 | Alonzo Church direction,
00:09:06.120 | which I guess kind of Lisp and Scheme and whatever,
00:09:09.920 | which has a history going back
00:09:12.240 | to the early days of computing.
00:09:13.360 | The second was the kind of imperative slash OO,
00:09:18.360 | algo, similar, going under C, C++, so forth.
00:09:23.120 | There was a third,
00:09:23.960 | which are called array-oriented languages,
00:09:26.880 | which started with a paper by a guy called Ken Iverson,
00:09:31.480 | which was actually a math theory paper,
00:09:35.160 | not a programming paper.
00:09:37.480 | It was called "Notation as a Tool of Thought."
00:09:41.480 | And it was the development of a new type of math notation.
00:09:45.280 | And the idea is that this math notation
00:09:47.520 | was much more flexible, expressive,
00:09:51.320 | and also well-defined than traditional math notation,
00:09:55.240 | which is none of those things.
00:09:56.400 | Math notation is awful.
00:09:57.680 | And so he actually turned that into a programming language.
00:10:02.800 | 'Cause this was the early '50s,
00:10:04.120 | or the, sorry, late '50s, all the names were available.
00:10:06.720 | So he called his language a programming language, or APL.
00:10:10.520 | - APL, wow.
00:10:11.360 | - So APL is an implementation of notation
00:10:15.320 | as a tool of thought, by which he means math notation.
00:10:18.280 | And Ken and his son went on to do many things,
00:10:22.840 | but eventually they actually produced
00:10:25.760 | a new language that was built
00:10:27.040 | on top of all the learnings of APL,
00:10:28.440 | and that was called J.
00:10:29.600 | And J is the most expressive, composable language of,
00:10:35.560 | beautifully designed language I've ever seen.
00:10:42.400 | - Does it have object-oriented components?
00:10:44.520 | Does it have that kind of thing, or is it more like--
00:10:45.360 | - Not really, it's an array-oriented language.
00:10:47.680 | It's a new, it's the third path.
00:10:51.400 | - Are you saying array?
00:10:52.760 | - Array-oriented, yeah.
00:10:53.920 | - What does it mean to be array-oriented?
00:10:55.520 | - So array-oriented means that you generally
00:10:57.560 | don't use any loops, but the whole thing is done
00:11:01.000 | with kind of an extreme version of broadcasting,
00:11:06.000 | if you're familiar with that NumPy/Python concept.
00:11:09.960 | So you do a lot with one line of code.
00:11:14.320 | It looks a lot like math notation.
00:11:18.160 | - So it's basically--
00:11:19.000 | - Highly compact.
00:11:20.400 | And the idea is that you can kind of,
00:11:22.920 | because you can do so much with one line of code,
00:11:24.800 | a single screen of code is usually enough;
00:11:27.760 | you very rarely need more than that to express your program.
00:11:31.120 | And so you can kind of keep it all in your head,
00:11:33.320 | and you can kind of clearly communicate it.
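
A small illustration of the broadcasting idea in the NumPy sense Jeremy references (illustrative data): the loop disappears and the whole operation reads as one expression. The J idiom `(+/ % #)`, sum divided by count, expresses a mean in the same loop-free spirit.

```python
import numpy as np

x = np.random.rand(1000, 10)            # illustrative data

# Loop version: center each row on its mean, one row at a time.
out = np.empty_like(x)
for i in range(x.shape[0]):
    out[i] = x[i] - x[i].mean()

# Array-oriented version: one line, no explicit loop. The (1000, 1)
# column of row means is broadcast across all 10 columns.
out2 = x - x.mean(axis=1, keepdims=True)

assert np.allclose(out, out2)
```
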
00:11:36.080 | It's interesting, APL created two main branches,
00:11:39.960 | K and J.
00:11:41.640 | J is this kind of like open-source niche community
00:11:46.000 | of crazy enthusiasts like me.
00:11:49.440 | And then the other path, K, is fascinating.
00:11:52.160 | It's an astonishingly expensive programming language,
00:11:56.640 | which many of the world's most
00:11:59.720 | ludicrously rich hedge funds use.
00:12:02.920 | So the entire K machine is so small,
00:12:06.680 | it sits inside level three cache on your CPU,
00:12:09.360 | and it easily wins every benchmark I've ever seen
00:12:14.120 | in terms of data processing speed.
00:12:16.760 | But you don't come across it very much,
00:12:17.920 | it's like $100,000 per CPU to run it.
00:12:22.720 | But it's like this path of programming languages
00:12:26.280 | is just so much, I don't know,
00:12:28.920 | so much more powerful in every way
00:12:30.360 | than the ones that almost anybody uses every day.
00:12:33.920 | - So it's all about computation.
00:12:37.520 | It's really focusing on computation.
00:12:38.360 | - It's pretty heavily focused on computation.
00:12:40.600 | I mean, so much of programming
00:12:43.200 | is data processing by definition.
00:12:45.640 | So there's a lot of things you can do with it.
00:12:48.920 | But yeah, there's not much work being done
00:12:51.400 | on making like user interface toolkits or whatever.
00:12:56.400 | I mean, there's some, but they're not great.
00:12:59.280 | - At the same time, you've done a lot of stuff
00:13:00.840 | with Perl and Python.
00:13:02.440 | - Yeah.
00:13:03.280 | - So where does that fit into the picture
00:13:04.720 | of J and K and APL and--
00:13:08.760 | - Well, it's just much more pragmatic.
00:13:10.960 | Like in the end, you kind of have to end up
00:13:13.840 | where the libraries are,
00:13:17.880 | 'cause to me, my focus is on productivity.
00:13:21.200 | I just wanna get stuff done and solve problems.
00:13:23.640 | So Perl was great.
00:13:27.240 | I created an email company called Fastmail
00:13:29.640 | and Perl was great 'cause back in the late '90s,
00:13:32.800 | early 2000s, it just had a lot of stuff it could do.
00:13:37.800 | I still had to write my own monitoring system
00:13:41.720 | and my own web framework, my own whatever,
00:13:43.800 | 'cause like none of that stuff existed,
00:13:45.720 | but it was a super flexible language to do that in.
00:13:50.240 | - And you used Perl for Fastmail, you used it as a backend?
00:13:54.240 | Like, so everything was written in Perl?
00:13:55.760 | - Yeah, yeah, everything was Perl.
00:13:58.720 | - Why do you think Perl hasn't succeeded
00:14:02.920 | or hasn't dominated the market
00:14:04.840 | where Python really takes over a lot of the same tasks?
00:14:07.560 | - Well, I mean, Perl did dominate.
00:14:09.600 | It was-- - For a time.
00:14:10.760 | - Everything, everywhere, but then the guy
00:14:14.920 | that ran Perl, Larry Wall,
00:14:17.240 | kind of just didn't put the time in anymore.
00:14:22.240 | And no project can be successful if there isn't,
00:14:27.320 | you know, particularly one that started
00:14:30.560 | with a strong leader that loses that strong leadership.
00:14:35.080 | So then Python has kind of replaced it.
00:14:37.880 | You know, Python is a lot less elegant language
00:14:42.880 | in nearly every way, but it has the data science libraries
00:14:48.440 | and a lot of them are pretty great.
00:14:51.320 | So I kind of use it 'cause it's the best we have,
00:14:56.320 | but it's definitely not good enough.
00:15:01.840 | - But what do you think the future of programming looks like?
00:15:04.080 | What do you hope the future of programming looks like
00:15:06.600 | if we zoom in on the computational fields,
00:15:08.800 | on data science, on machine learning?
00:15:11.880 | - I hope Swift is successful because the goal of Swift,
00:15:16.880 | the way Chris Lattner describes it,
00:15:21.040 | is to be infinitely hackable, and that's what I want.
00:15:23.360 | I want something where me and the people I do research with
00:15:26.960 | and my students can look at and change everything
00:15:30.400 | from top to bottom.
00:15:32.040 | There's nothing mysterious and magical and inaccessible.
00:15:36.240 | Unfortunately with Python, it's the opposite of that
00:15:38.600 | because Python's so slow, it's extremely unhackable.
00:15:42.680 | You get to a point where it's like,
00:15:43.840 | okay, from here on down, it's C.
00:15:45.360 | So your debugger doesn't work in the same way,
00:15:47.320 | your profiler doesn't work in the same way,
00:15:48.960 | your build system doesn't work in the same way.
00:15:50.800 | It's really not very hackable at all.
00:15:53.760 | - What's the part you like to be hackable?
00:15:55.640 | Is it for the objective of optimizing training
00:16:00.160 | of neural networks, inference of neural networks?
00:16:02.600 | Is it performance of the system
00:16:04.360 | or is there some non-performance related just--
00:16:07.880 | - It's everything.
00:16:09.040 | I mean, in the end, I wanna be productive as a practitioner.
00:16:13.880 | So that means that, so like at the moment,
00:16:16.320 | our understanding of deep learning is incredibly primitive.
00:16:20.040 | There's very little we understand.
00:16:21.480 | Most things don't work very well,
00:16:23.240 | even though it works better than anything else out there.
00:16:26.160 | There's so many opportunities to make it better.
00:16:28.640 | So you look at any domain area, like, I don't know,
00:16:32.800 | speech recognition with deep learning
00:16:35.680 | or natural language processing classification
00:16:38.360 | with deep learning or whatever.
00:16:39.400 | Every time I look at an area with deep learning,
00:16:41.880 | I always see like, oh, it's terrible.
00:16:44.440 | There's lots and lots of obviously stupid ways to do things
00:16:48.240 | that need to be fixed.
00:16:50.160 | So then I wanna be able to jump in there
00:16:51.600 | and quickly experiment and make them better.
00:16:54.840 | - You think the programming language has a role in that?
00:16:59.240 | - Huge role, yeah.
00:17:00.280 | So currently Python has a big gap
00:17:05.280 | in terms of our ability to innovate,
00:17:09.280 | particularly around recurrent neural networks
00:17:11.840 | and natural language processing,
00:17:14.920 | because it's so slow.
00:17:16.840 | The actual loop where we actually loop through words,
00:17:20.200 | we have to do that whole thing in CUDA C.
00:17:23.760 | So we actually can't innovate with the kernel,
00:17:27.120 | the heart of that most important algorithm.
00:17:30.200 | And it's just a huge problem.
00:17:33.640 | And this happens all over the place.
00:17:36.440 | So we hit research limitations.
00:17:40.080 | Another example, convolutional neural networks,
00:17:42.640 | which are actually the most popular architecture
00:17:44.720 | for lots of things, maybe most things in deep learning.
00:17:48.920 | We almost certainly should be using
00:17:50.320 | sparse convolutional neural networks,
00:17:52.920 | but only like two people are,
00:17:55.400 | because to do it, you have to rewrite
00:17:57.840 | all of that CUDA C level stuff.
00:17:59.920 | And yeah, just researchers and practitioners don't.
00:18:04.520 | So like there's just big gaps
00:18:06.040 | in like what people actually research on,
00:18:09.240 | what people actually implement
00:18:10.520 | because of the programming language problem.
00:18:13.240 | - So you think it's just too difficult to write in CUDA C
00:18:18.240 | that a programming, like a higher level programming language
00:18:23.440 | like Swift should enable the easier,
00:18:28.440 | fooling around creative stuff with RNNs
00:18:33.120 | or with sparse convolutional neural networks?
00:18:34.920 | - Kind of.
00:18:35.760 | - Who's at fault?
00:18:37.760 | Who's at charge of making it easy
00:18:41.040 | for a researcher to play around?
00:18:42.320 | - I mean, no one's at fault.
00:18:43.520 | It's just nobody's got around to it yet.
00:18:45.080 | Or it's just, it's hard, right?
00:18:47.040 | And I mean, part of the fault
00:18:48.440 | is that we ignored that whole APL kind of direction,
00:18:52.640 | almost nearly everybody did for 60 years, 50 years.
00:18:56.360 | But recently people have been starting
00:18:59.880 | to reinvent pieces of that
00:19:03.560 | and kind of create some interesting new directions
00:19:05.440 | in the compiler technology.
00:19:07.280 | So the place where that's particularly happening right now
00:19:11.720 | is something called MLIR,
00:19:13.520 | which is something that again,
00:19:14.920 | Chris Lattner, the Swift guy, is leading.
00:19:18.040 | And yeah, 'cause it's actually not gonna be Swift
00:19:20.600 | on its own that solves this problem
00:19:22.120 | because the problem is that currently writing
00:19:24.960 | an acceptably fast GPU program
00:19:29.960 | is too complicated regardless of what language you use.
00:19:33.800 | And that's just because if you have to deal with the fact
00:19:38.640 | that I've got 10,000 threads
00:19:41.680 | and I have to synchronize between them all
00:19:43.440 | and I have to put my thing into grid blocks
00:19:45.320 | and think about warps and all this stuff,
00:19:47.000 | it's just so much boilerplate that to do that well,
00:19:50.680 | you have to be a specialist at that
00:19:52.160 | and it's gonna be a year's work to optimize
00:19:56.960 | that algorithm in that way.
00:19:59.640 | But with things like tensor comprehensions
00:20:03.520 | and tile and MLIR and TVM,
00:20:07.120 | there's all these various projects
00:20:08.640 | which are all about saying,
00:20:10.000 | let's let people create like domain specific languages
00:20:14.000 | for tensor computations.
00:20:16.840 | These are the kinds of things we do generally on the GPU
00:20:20.080 | for deep learning and then have a compiler
00:20:22.800 | which can optimize that tensor computation.
00:20:27.800 | A lot of this work is actually sitting on top
00:20:30.120 | of a project called Halide,
00:20:32.600 | which is a mind blowing project
00:20:35.960 | where they came up with such a domain specific language.
00:20:38.800 | In fact, two, one domain specific language for expressing
00:20:41.160 | this is what my tensor computation is.
00:20:43.760 | And another domain specific language for expressing
00:20:46.280 | this is the kind of the way I want you to structure
00:20:50.280 | the compilation of that and like do it block by block
00:20:53.040 | and do these bits in parallel.
00:20:54.920 | And they were able to show how you can compress
00:20:57.720 | the amount of code by 10X compared to optimized GPU code
00:21:02.720 | and get the same performance.
00:21:05.520 | So that's like, so these are the things
00:21:07.560 | that kind of sitting on top of that kind of research
00:21:10.520 | and MLIR is pulling a lot of those best practices together.
00:21:15.120 | And now we're starting to see work done on making
00:21:18.040 | all of that directly accessible through Swift
00:21:21.360 | so that I could use Swift to kind of write
00:21:23.480 | those domain specific languages.
00:21:25.880 | And hopefully we'll get then Swift CUDA kernels
00:21:29.480 | written in a very expressive and concise way
00:21:31.520 | that looks a bit like J and APL,
00:21:34.160 | and then Swift layers on top of that
00:21:36.680 | and then a Swift UI on top of that.
00:21:38.360 | And, you know, that'll be so nice
00:21:41.320 | if we can get to that point.
00:21:42.600 | - Now, does it all eventually boil down
00:21:45.000 | to CUDA and NVIDIA GPUs?
00:21:48.560 | - Unfortunately at the moment it does,
00:21:50.160 | but one of the nice things about MLIR
00:21:52.640 | if AMD ever gets their act together,
00:21:55.400 | which they probably won't,
00:21:56.760 | is that they or others could write MLIR backends
00:22:01.760 | for other GPUs or other tensor computation devices
00:22:07.120 | of which today there are an increasing number,
00:22:11.640 | like Graphcore or Vertex AI or whatever.
00:22:16.640 | So yeah, being able to target lots of backends
00:22:22.600 | would be another benefit of this.
00:22:23.960 | And the market really needs competition
00:22:26.720 | 'cause at the moment NVIDIA is massively overcharging
00:22:29.520 | for their kind of enterprise class cards
00:22:33.680 | because there is no serious competition
00:22:36.760 | 'cause nobody else is doing the software properly.
00:22:39.320 | - In the cloud there is some competition, right?
00:22:42.920 | - Not really, other than TPUs perhaps.
00:22:45.080 | But TPUs are almost unprogrammable at the moment.
00:22:48.240 | - So you can't, the TPU has the same problem that you can't-
00:22:51.200 | - It's even worse.
00:22:52.040 | So TPUs, Google actually made an explicit decision
00:22:54.840 | to make them almost entirely unprogrammable
00:22:57.240 | because they felt that there was too much IP in there.
00:23:00.000 | And if they gave people direct access to program them,
00:23:02.680 | people would learn their secrets.
00:23:04.360 | So you can't actually directly program the memory in a TPU.
00:23:09.960 | You can't even directly create code that runs on,
00:23:13.960 | and that you can look at, on the machine that has the TPU.
00:23:16.600 | It all goes through a virtual machine.
00:23:18.520 | So all you can really do is this kind of cookie cutter thing
00:23:21.680 | of like plug-in high-level stuff together,
00:23:25.320 | which is just super tedious and annoying
00:23:29.280 | and totally unnecessary.
00:23:31.520 | - So what was the, tell me if you could,
00:23:34.480 | the origin story of fast.ai?
00:23:36.520 | - fast.ai?
00:23:37.360 | - The origin story of fast.ai.
00:23:39.080 | What is the motivation, its mission, its dream?
00:23:44.400 | - So I guess the founding story is heavily tied
00:23:50.240 | to my previous startup, which is a company called Enlitic,
00:23:53.560 | which was the first company to focus on deep learning
00:23:56.920 | for medicine.
00:23:58.240 | And I created that because I saw there was a huge
00:24:02.240 | opportunity to, there's about a 10X shortage
00:24:06.880 | of the number of doctors in the world,
00:24:08.520 | in the developing world that we need.
00:24:10.320 | It was expected it would take about 300 years
00:24:13.800 | to train enough doctors to meet that gap.
00:24:16.080 | But I guessed that maybe if we used deep learning
00:24:21.080 | for some of the analytics, we could maybe make it
00:24:24.960 | so you don't need as highly trained doctors.
00:24:27.400 | - For diagnosis?
00:24:28.320 | - For diagnosis and treatment planning.
00:24:29.800 | - Where's the biggest benefit, just before we get
00:24:32.520 | to fast AI, where's the biggest benefit of AI in medicine
00:24:36.640 | that you see today?
00:24:37.960 | - Not much happening today in terms of like stuff
00:24:41.480 | that's actually out there, it's very early,
00:24:43.080 | but in terms of the opportunity, it's to take markets
00:24:47.760 | like India and China and Indonesia,
00:24:50.840 | which have big populations, Africa,
00:24:54.160 | small numbers of doctors, and provide diagnostic,
00:24:59.160 | particularly treatment planning and triage kind of on device
00:25:05.120 | so that if you do a test for malaria or tuberculosis
00:25:10.120 | or whatever, you immediately get something
00:25:12.960 | that even a healthcare worker that's had a month
00:25:15.240 | of training can get a very high quality assessment
00:25:20.240 | of whether the patient might be at risk and tell,
00:25:24.280 | okay, we'll send them off to a hospital.
00:25:27.400 | So for example, in Africa, outside of South Africa,
00:25:31.640 | there's only five pediatric radiologists
00:25:34.000 | for the entire continent, so most countries don't have any.
00:25:37.120 | So if your kid is sick and they need something diagnosed
00:25:39.720 | through medical imaging, the person, even if you're able
00:25:42.880 | to get medical imaging done, the person that looks at it
00:25:45.040 | will be a nurse at best, but actually in India, for example,
00:25:50.040 | and China, almost no x-rays are read by anybody,
00:25:54.760 | by any trained professional because they don't have enough.
00:25:59.240 | So if instead we had an algorithm that could take
00:26:03.920 | the most likely high risk 5% and say, triage basically,
00:26:08.920 | say, okay, someone needs to look at this,
00:26:13.240 | it would massively change
00:26:17.120 | what's possible with medicine in the developing world.
00:26:20.680 | And remember, increasingly, they have money.
00:26:23.720 | They're the developing world, they're not the poor world,
00:26:25.560 | they're the developing world, so they have the money,
00:26:26.800 | so they're building the hospitals,
00:26:28.440 | they're getting the diagnostic equipment,
00:26:32.000 | but there's no way for a very long time
00:26:34.880 | will they be able to have the expertise.
00:26:38.520 | - Shortage of expertise, okay, and that's where
00:26:41.080 | the deep learning systems can step in
00:26:43.360 | and magnify the expertise they do have, essentially.
00:26:46.800 | - Yeah.
00:26:47.800 | - So you do see, just to linger it a little bit longer,
00:26:52.800 | the interaction, do you still see the human experts
00:26:58.240 | still at the core of these systems?
00:26:59.880 | - Yeah, absolutely.
00:27:00.720 | - Or is there something in medicine that could be automated
00:27:02.760 | almost completely?
00:27:03.760 | - I don't see the point of even thinking about that,
00:27:06.400 | because we have such a shortage of people,
00:27:08.480 | why would we want to find a way not to use them?
00:27:12.160 | Like, we have people, so the idea of,
00:27:15.560 | even from an economic point of view,
00:27:17.160 | if you can make them 10x more productive,
00:27:19.760 | getting rid of the person doesn't impact
00:27:21.920 | your unit economics at all, and it totally ignores the fact
00:27:25.520 | that there are things people do better than machines.
00:27:28.720 | So it's just, to me, that's not a useful way
00:27:33.120 | of framing the problem.
00:27:34.080 | - I guess, just to clarify, I guess I meant
00:27:36.640 | there may be some problems where you can avoid
00:27:40.280 | even going to the expert ever, sort of maybe preventative
00:27:43.880 | care or some basic stuff, allowing the expert to focus
00:27:48.320 | on the things that are really that, you know.
00:27:51.360 | - Well, that's what the triage would do, right?
00:27:53.000 | So the triage would say, okay, this 99% triage,
00:27:58.680 | sure, there's nothing here.
00:28:00.800 | So, you know, that can be done on device,
00:28:04.040 | and they can just say, okay, go home.
00:28:05.920 | So the experts are being used to look at the stuff
00:28:09.440 | which has some chance it's worth looking at,
00:28:12.280 | which most things is not, you know, it's fine.
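
A toy sketch of that triage step, with random numbers standing in for a real model's per-case risk scores: only the highest-risk slice is routed to the scarce expert.

```python
import numpy as np

risk = np.random.rand(10_000)          # hypothetical per-case risk scores

threshold = np.quantile(risk, 0.95)    # cut-off for the top 5%
needs_expert = risk >= threshold       # these cases go to a human expert
print(f"{needs_expert.sum()} of {risk.size} cases routed to an expert")
```
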
00:28:16.360 | - Why do you think we haven't quite made progress
00:28:19.360 | on that yet, in terms of the scale of how much AI
00:28:24.360 | is applied in medicine?
00:28:27.520 | - There's a lot of reasons.
00:28:28.400 | I mean, one is it's pretty new.
00:28:29.680 | I only started Enlitic in like 2014, and before that,
00:28:33.160 | like, it's hard to express to what degree
00:28:36.720 | the medical world was not aware of the opportunities here.
00:28:40.680 | So I went to RSNA, which is the world's largest
00:28:44.920 | radiology conference, and I told everybody I could,
00:28:49.240 | you know, like, I'm doing this thing with deep learning,
00:28:51.760 | please come and check it out.
00:28:53.360 | And no one had any idea what I was talking about,
00:28:56.800 | and no one had any interest in it.
00:28:58.560 | So like, we've come from absolute zero, which is hard,
00:29:04.680 | and then the whole regulatory framework, education system,
00:29:09.920 | everything is just set up to think of doctoring
00:29:13.400 | in a very different way.
00:29:14.960 | So today, there is a small number of people
00:29:17.120 | who are deep learning practitioners and doctors
00:29:22.080 | at the same time, and we're starting to see
00:29:24.000 | the first ones come out of their PhD programs,
00:29:26.600 | so Zak Kohane over in Boston, Cambridge,
00:29:31.600 | has a number of students now who are data science experts,
00:29:38.960 | deep learning experts, and actual medical doctors.
00:29:46.120 | Quite a few doctors have completed our fast AI course now
00:29:50.040 | and are publishing papers and creating journal reading
00:29:54.960 | groups in the American Council of Radiology,
00:29:58.080 | and like, it's just starting to happen.
00:30:00.360 | But it's gonna be a long process.
00:30:02.920 | The regulators have to learn how to regulate this,
00:30:04.920 | they have to build, you know, guidelines,
00:30:08.760 | and then the lawyers at hospitals have to develop
00:30:13.320 | a new way of understanding that sometimes it makes sense
00:30:18.240 | for data to be, you know, looked at in raw form
00:30:23.520 | in large quantities in order to create
00:30:25.840 | world-changing results.
00:30:26.960 | - Yeah, so regulation around data, all that,
00:30:30.080 | it sounds, well, it's probably the hardest problem,
00:30:33.840 | but sounds reminiscent of autonomous vehicles as well.
00:30:36.720 | Many of the same regulatory challenges,
00:30:38.720 | many of the same data challenges.
00:30:40.600 | - Yeah, I mean, funnily enough,
00:30:41.520 | the problem is less the regulation
00:30:43.640 | and more the interpretation of that regulation
00:30:45.840 | by lawyers in hospitals.
00:30:48.200 | So HIPAA is actually, was designed to,
00:30:52.560 | the P in HIPAA is not standing,
00:30:55.000 | does not stand for privacy, it stands for portability.
00:30:57.640 | It's actually meant to be a way that data can be used.
00:31:00.800 | And it was created with lots of gray areas
00:31:04.360 | because the idea is that would be more practical
00:31:06.520 | and it would help people to use this legislation
00:31:10.440 | to actually share data in a more thoughtful way.
00:31:13.680 | Unfortunately, it's done the opposite
00:31:15.280 | because when a lawyer sees a gray area,
00:31:17.760 | they say, oh, if we don't know, we won't get sued,
00:31:20.720 | then we can't do it.
00:31:22.400 | So HIPAA is not exactly the problem.
00:31:26.320 | The problem is more that there's,
00:31:29.160 | hospital lawyers are not incented to make bold decisions
00:31:34.160 | about data portability.
00:31:36.480 | - Or even to embrace technology that saves lives.
00:31:40.400 | They more wanna not get in trouble
00:31:42.400 | for embracing that technology.
00:31:44.160 | - Also, it is also, saves lives in a very abstract way,
00:31:47.800 | which is like, oh, we've been able to release
00:31:49.800 | these 100,000 anonymized records.
00:31:52.280 | I can't point at the specific person
00:31:54.120 | whose life that saved.
00:31:55.280 | I can say like, oh, we ended up with this paper,
00:31:57.720 | which found this result, which diagnosed a thousand
00:32:01.640 | more people than we would have otherwise,
00:32:03.080 | but it's like, which ones were helped?
00:32:05.480 | It's very abstract.
00:32:07.280 | - Yeah, and on the counter side of that,
00:32:09.360 | you may be able to point to a life that was taken
00:32:13.040 | because of something that was--
00:32:14.280 | - Yeah, or a person whose privacy was violated.
00:32:18.200 | It's like, oh, this specific person,
00:32:20.160 | you know, was de-identified.
00:32:24.200 | - So-- - Identified.
00:32:26.000 | - Just a fascinating topic.
00:32:27.280 | We're jumping around.
00:32:28.280 | We'll get back to fast AI, but on the question of privacy,
00:32:32.520 | data is the fuel for so much innovation in deep learning.
00:32:37.520 | What's your sense on privacy,
00:32:39.760 | whether we're talking about Twitter, Facebook, YouTube,
00:32:44.000 | just the technologies like in the medical field
00:32:48.640 | that rely on people's data in order to create impact.
00:32:53.360 | How do we get that right, respecting people's privacy
00:32:58.360 | and yet creating technology that is learned from data?
00:33:03.320 | - One of my areas of focus is on doing more with less data,
00:33:08.320 | which, so most vendors, unfortunately,
00:33:14.400 | are strongly incented to find ways
00:33:17.600 | to require more data and more computation.
00:33:20.040 | So Google and IBM being the most obvious--
00:33:23.440 | - IBM.
00:33:25.920 | - Yeah, so Watson. - Watson.
00:33:27.720 | - So Google and IBM both strongly push the idea
00:33:31.160 | that you have to be, you know,
00:33:33.080 | that they have more data and more computation
00:33:35.440 | and more intelligent people than anybody else.
00:33:37.840 | And so you have to trust them to do things
00:33:39.880 | 'cause nobody else can do it.
00:33:41.340 | And Google's very upfront about this.
00:33:45.400 | Like Jeff Dean has gone out there and given talks
00:33:48.440 | and said, "Our goal is to require
00:33:50.520 | "a thousand times more computation, but less people."
00:33:55.160 | Our goal is to use the people that you have better
00:34:00.160 | and the data you have better
00:34:01.680 | and the computation you have better.
00:34:03.000 | So one of the things that we've discovered is,
00:34:06.040 | or at least highlighted, is that you very, very,
00:34:10.600 | very often don't need much data at all.
00:34:13.360 | And so the data you already have in your organization
00:34:16.160 | will be enough to get state-of-the-art results.
00:34:19.240 | So like my starting point would be to kind of say
00:34:21.320 | around privacy is a lot of people are looking for ways
00:34:25.760 | to share data and aggregate data,
00:34:28.160 | but I think often that's unnecessary.
00:34:29.960 | They assume that they need more data than they do
00:34:32.200 | 'cause they're not familiar with the basics
00:34:34.160 | of transfer learning, which is this critical technique
00:34:38.480 | for needing orders of magnitude less data.
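
To make the transfer learning point concrete, here is a minimal sketch in PyTorch/torchvision (an illustration, not fast.ai's own API, and assuming a recent torchvision): reuse an ImageNet-pretrained backbone and train only a small new head, so far fewer labeled examples are needed.

```python
import torch.nn as nn
from torchvision import models

# Start from ImageNet-pretrained weights instead of random initialization.
model = models.resnet34(weights=models.ResNet34_Weights.DEFAULT)

# Freeze the pretrained feature extractor...
for p in model.parameters():
    p.requires_grad = False

# ...and train only a new final layer for the target task (2 classes here).
model.fc = nn.Linear(model.fc.in_features, 2)

# Only model.fc's parameters will update during training.
trainable = [p for p in model.parameters() if p.requires_grad]
```
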
00:34:42.000 | - Is your sense, one reason you might wanna collect data
00:34:44.680 | from everyone is like in the recommender system context,
00:34:49.680 | where your individual, Jeremy Howard's individual data
00:34:54.520 | is the most useful for providing a product
00:34:58.440 | that's impactful for you.
00:34:59.880 | So for giving you advertisements,
00:35:02.280 | for recommending to you movies,
00:35:04.160 | for doing medical diagnosis.
00:35:06.360 | Is your sense we can build with a small amount of data,
00:35:11.680 | general models that will have a huge impact for most people
00:35:16.000 | that we don't need to have data from each individual?
00:35:19.160 | - On the whole, I'd say yes.
00:35:20.520 | I mean, there are things like,
00:35:23.400 | you know, recommender systems have this cold start problem
00:35:28.320 | where, you know, Jeremy is a new customer.
00:35:30.920 | We haven't seen him before.
00:35:31.960 | So we can't recommend him things based on what else
00:35:33.920 | he's bought and liked with us.
00:35:36.000 | And there's various workarounds to that.
00:35:38.800 | Like in a lot of music programs,
00:35:40.640 | we'll start out by saying,
00:35:42.440 | which of these artists do you like?
00:35:44.880 | Which of these albums do you like?
00:35:46.720 | Which of these songs do you like?
00:35:48.360 | Netflix used to do that.
00:35:50.960 | Nowadays, they tend not to.
00:35:53.480 | People kind of don't like that
00:35:54.760 | 'cause they think, oh, we don't wanna bother the user.
00:35:57.320 | So you could work around that
00:35:58.680 | by having some kind of data sharing
00:36:00.960 | where you get my marketing record from Acxiom or whatever
00:36:04.880 | and try to guess from that.
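
A toy sketch of that cold-start workaround (hypothetical artists, made-up two-dimensional taste vectors): elicit a few preferences up front, then recommend by similarity, with no third-party data about the user at all.

```python
import numpy as np

# Made-up item vectors; in practice these would be embeddings
# learned from other users' listening histories.
items = {"artist_a": np.array([0.9, 0.1]), "artist_b": np.array([0.2, 0.8]),
         "artist_c": np.array([0.85, 0.2]), "artist_d": np.array([0.1, 0.9])}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

liked = ["artist_a"]                     # the new user's stated preferences
profile = np.mean([items[n] for n in liked], axis=0)

# Rank everything else by similarity to the stated taste.
recs = sorted((n for n in items if n not in liked),
              key=lambda n: -cosine(profile, items[n]))
print(recs)  # 'artist_c' first: closest to what the user said they like
```
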
00:36:06.600 | To me, the benefit to me and to society
00:36:11.600 | of saving me five minutes on answering some questions
00:36:16.480 | versus the negative externalities
00:36:18.920 | of the privacy issue doesn't add up.
00:36:23.920 | So I think like a lot of the time,
00:36:26.160 | the places where people are invading our privacy
00:36:30.160 | in order to provide convenience
00:36:32.800 | is really about just trying to make them more money
00:36:36.840 | and they move these negative externalities
00:36:41.080 | to places that they don't have to pay for them.
00:36:44.240 | So when you actually see regulations appear
00:36:48.440 | that actually cause the companies
00:36:50.400 | that create these negative externalities
00:36:52.080 | to have to pay for it themselves,
00:36:53.520 | they say, well, we can't do it anymore.
00:36:56.080 | So the cost is actually too high.
00:36:58.200 | But for something like medicine,
00:37:00.360 | yeah, I mean, the hospital has my medical imaging,
00:37:05.240 | my pathology studies, my medical records.
00:37:07.920 | And also I own my medical data.
00:37:11.880 | So I help a startup called DocAI.
00:37:16.920 | One of the things DocAI does is that it has an app
00:37:19.720 | you can connect to Sutter Health and LabCorp and Walgreens
00:37:24.720 | and download your medical data to your phone
00:37:29.840 | and then upload it again at your discretion
00:37:33.600 | to share it as you wish.
00:37:35.160 | So with that kind of approach,
00:37:38.080 | we can share our medical information
00:37:41.200 | with the people we want to.
00:37:44.840 | - Yeah, so control.
00:37:45.720 | I mean, really being able to control
00:37:47.520 | who you share it with and so on.
00:37:49.760 | So that has a beautiful, interesting tangent,
00:37:53.120 | but to return back to the origin story of Fast.ai.
00:37:59.400 | All right, so before I started Fast.ai,
00:38:02.520 | I spent a year researching
00:38:06.360 | where are the biggest opportunities for deep learning?
00:38:10.400 | 'Cause I knew from my time at Kaggle in particular
00:38:14.080 | that deep learning had kind of hit this threshold point
00:38:16.920 | where it was rapidly becoming the state-of-the-art approach
00:38:19.880 | in every area that it was applied to.
00:38:21.600 | And I'd been working with neural nets for over 20 years.
00:38:25.400 | I knew that from a theoretical point of view,
00:38:27.440 | once it hit that point, it would do that
00:38:29.240 | in kind of just about every domain.
00:38:31.600 | And so I kind of spent a year researching
00:38:34.520 | what are the domains that's gonna have
00:38:36.280 | the biggest low-hanging fruit in the shortest time period.
00:38:39.440 | I picked medicine, but there were so many I could have picked
00:38:43.960 | and so there was a kind of level of frustration for me
00:38:46.280 | of like, okay, I'm really glad we've opened up
00:38:50.000 | the medical deep learning world
00:38:51.160 | and today it's huge, as you know,
00:38:53.960 | but we can't do, I can't do everything.
00:38:58.320 | I don't even know, like in medicine,
00:39:00.440 | it took me a really long time to even get a sense
00:39:02.320 | of like what kind of problems do medical practitioners solve?
00:39:05.120 | What kind of data do they have?
00:39:06.440 | Who has that data?
00:39:07.480 | So I kind of felt like I need to approach this differently
00:39:12.520 | if I wanna maximize the positive impact of deep learning.
00:39:15.360 | Rather than me picking an area
00:39:19.280 | and trying to become good at it and building something,
00:39:21.800 | I should let people who are already domain experts
00:39:24.480 | in those areas and who already have the data
00:39:26.720 | do it themselves.
00:39:29.280 | So that was the reason for Fast.ai
00:39:33.120 | is to basically try and figure out
00:39:36.800 | how to get deep learning into the hands of people
00:39:40.160 | who could benefit from it and help them to do so
00:39:43.280 | in as quick and easy and effective a way as possible.
00:39:47.120 | - Got it, so sort of empower the domain experts.
00:39:50.280 | - Yeah, and like partly it's 'cause like,
00:39:53.120 | unlike most people in this field,
00:39:56.360 | my background is very applied and industrial.
00:40:00.000 | Like my first job was at McKinsey and Company.
00:40:02.520 | I spent 10 years in management consulting.
00:40:04.840 | I spend a lot of time with domain experts,
00:40:10.560 | so I kind of respect them and appreciate them
00:40:12.840 | and I know that's where the value generation in society is.
00:40:16.560 | And so I also know how most of them can't code
00:40:21.560 | and most of them don't have the time to invest,
00:40:26.080 | you know, three years in a graduate degree or whatever.
00:40:29.440 | So it's like, how do I upskill those domain experts?
00:40:33.640 | I think that would be a super powerful thing,
00:40:36.200 | you know, biggest societal impact I could have.
00:40:39.000 | So yeah, that was the thinking.
00:40:41.800 | - So, so much of Fast.ai students and researchers
00:40:45.800 | and the things you teach are pragmatically minded,
00:40:50.200 | practically minded, figuring out
00:40:52.920 | how to solve real problems, and fast.
00:40:55.880 | So from your experience, what's the difference
00:40:58.200 | between theory and practice of deep learning?
00:41:01.260 | - Well, most of the research in the deep learning world
00:41:07.600 | is a total waste of time.
00:41:09.920 | - Right, that's what I was getting at.
00:41:11.080 | - Yeah, it's a problem in science in general.
00:41:16.080 | Scientists need to be published,
00:41:19.640 | which means they need to work on things
00:41:21.520 | that their peers are extremely familiar with
00:41:24.080 | and can recognize and advance in that area.
00:41:26.240 | So that means that they all need to work on the same thing.
00:41:29.080 | And so really, with the things they work on,
00:41:33.040 | there's nothing to encourage them to work on things
00:41:35.640 | that are practically useful.
00:41:38.840 | So you get just a whole lot of research,
00:41:41.160 | which is minor advances in stuff
00:41:43.240 | that's been very highly studied
00:41:44.660 | and has no significant practical impact.
00:41:49.340 | Whereas the things that really make a difference,
00:41:50.920 | like I mentioned transfer learning,
00:41:52.800 | like if we can do better at transfer learning,
00:41:55.640 | then it's this like world-changing thing
00:41:58.200 | where suddenly like lots more people
00:41:59.800 | can do world-class work with less resources and less data.
00:42:04.800 | But almost nobody works on that.
00:42:08.540 | Or another example, active learning,
00:42:10.800 | which is the study of like,
00:42:11.920 | how do we get more out of the human beings in the loop?
00:42:15.960 | - That's my favorite topic.
00:42:17.160 | - Yeah, so active learning is great,
00:42:18.580 | but it's almost nobody working on it
00:42:21.220 | because it's just not a trendy thing right now.
00:42:23.840 | - You know what, somebody started to interrupt.
00:42:27.080 | He was saying that nobody is publishing on active learning,
00:42:31.560 | but there's people inside companies,
00:42:33.480 | anybody who actually has to solve a problem,
00:42:36.840 | they're going to innovate on active learning.
00:42:39.680 | - Yeah, everybody kind of reinvents active learning
00:42:42.120 | when they actually have to work in practice
00:42:43.800 | because they start labeling things and they think,
00:42:46.420 | gosh, this is taking a long time and it's very expensive.
00:42:49.340 | And then they start thinking,
00:42:51.280 | well, why am I labeling everything?
00:42:52.680 | I'm only, the machine's only making mistakes
00:42:54.880 | on those two classes, they're the hard ones.
00:42:56.920 | Maybe I'll just start labeling those two classes.
00:42:58.920 | And then you start thinking,
00:43:00.420 | well, why did I do that manually?
00:43:01.620 | Why can't I just get the system to tell me
00:43:03.040 | which things are gonna be hardest?
00:43:04.800 | It's an obvious thing to do,
00:43:06.260 | but yeah, it's just like transfer learning,
00:43:11.260 | it's understudied and the academic world
00:43:14.160 | just has no reason to care about practical results.
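
The loop Jeremy describes, label a bit and then let the model point at what it finds hardest, is uncertainty sampling; a minimal sketch with made-up probabilities:

```python
import numpy as np

# Hypothetical predicted class probabilities over an unlabeled pool.
probs = np.random.dirichlet(np.ones(5), size=10_000)

# Least-confidence sampling: the examples where the model's top
# prediction is weakest are the ones worth labeling next.
uncertainty = 1 - probs.max(axis=1)
to_label = np.argsort(-uncertainty)[:100]   # the 100 hardest examples
```
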
00:43:17.500 | The funny thing is, I've only really ever written one paper.
00:43:20.000 | I hate writing papers and I didn't even write it.
00:43:22.800 | It was my colleague, Sebastian Ruder, who actually wrote it.
00:43:25.520 | I just did the research for it,
00:43:27.960 | but it was basically introducing transfer learning,
00:43:30.640 | successful transfer learning to NLP for the first time.
00:43:34.320 | The algorithm is called ULMFiT.
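
In outline, ULMFiT has three stages: pretrain a language model on a large general corpus, fine-tune that language model on the target corpus, then reuse its encoder in a classifier. A sketch assuming fastai v2's text API (fastai ships an AWD-LSTM pretrained on Wikitext-103, so stage one is already done):

```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Stage 2: fine-tune the pretrained language model on the target corpus.
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid='test')
lm = language_model_learner(dls_lm, AWD_LSTM)
lm.fine_tune(1)
lm.save_encoder('ft_enc')

# Stage 3: build a classifier on top of the fine-tuned encoder.
dls_clas = TextDataLoaders.from_folder(path, valid='test')
clf = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
clf.load_encoder('ft_enc')
clf.fine_tune(1)
```
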
00:43:36.060 | And I actually wrote it for the course,
00:43:41.980 | for the fast.ai course.
00:43:43.700 | I wanted to teach people NLP
00:43:45.340 | and I thought I only wanna teach people practical stuff.
00:43:47.500 | And I think the only practical stuff is transfer learning.
00:43:50.540 | And I couldn't find any examples of transfer learning in NLP,
00:43:53.340 | so I just did it.
00:43:54.540 | And I was shocked to find that as soon as I did it,
00:43:57.300 | which the basic prototype took a couple of days,
00:44:01.060 | smashed the state of the art
00:44:02.500 | on one of the most important data sets
00:44:04.280 | in a field that I knew nothing about.
00:44:06.720 | And I just thought, well, this is ridiculous.
00:44:10.400 | And so I spoke to Sebastian about it
00:44:13.800 | and he kindly offered to write it up, the results.
00:44:17.680 | And so it ended up being published in ACL,
00:44:21.360 | which is the top computational linguistics conference.
00:44:25.560 | So like people do actually care once you do it,
00:44:28.880 | but I guess it's difficult for maybe like junior researchers
00:44:32.780 | or like, I don't care whether I get citations
00:44:36.600 | or papers or whatever.
00:44:37.740 | There's nothing in my life that makes that important,
00:44:39.620 | which is why I've never actually bothered
00:44:41.500 | to write a paper myself.
00:44:43.040 | But for people who do,
00:44:43.980 | I guess they have to pick the kind of safe option,
00:44:48.980 | which is like, yeah, make a slight improvement
00:44:52.280 | on something that everybody's already working on.
00:44:54.960 | - Yeah, nobody does anything interesting
00:44:58.300 | or succeeds in life with the safe option.
00:45:01.180 | - Although, I mean, the nice thing is nowadays,
00:45:02.940 | everybody is now working on NLP transfer learning
00:45:05.300 | because since that time we've had GPT and GPT-2 and BERT
00:45:09.780 | and it's like, it's so, yeah,
00:45:12.660 | once you show that something's possible,
00:45:15.380 | everybody jumps in, I guess.
00:45:17.660 | - I hope to be a part of,
00:45:19.220 | and I hope to see more innovation
00:45:20.660 | in active learning in the same way.
00:45:22.140 | I think transfer learning and active learning
00:45:24.500 | are fascinating public open work.
00:45:27.360 | - I actually helped start a startup called Platform AI,
00:45:29.960 | which is really all about active learning.
00:45:31.760 | And yeah, it's been interesting trying to kind of
00:45:34.640 | see what research is out there and make the most of it.
00:45:37.800 | And there's basically none.
00:45:39.200 | So we've had to do all our own research.
00:45:41.000 | - Once again, and just as you described.
00:45:43.000 | Can you tell the story of the Stanford competition,
00:45:47.640 | DAWNBench, and fast.ai's achievement on it?
00:45:51.500 | - Sure, so something which I really enjoy
00:45:54.280 | is that I basically teach two courses a year,
00:45:57.400 | the practical deep learning for coders,
00:45:59.640 | which is kind of the introductory course
00:46:02.080 | and then cutting edge deep learning for coders,
00:46:04.000 | which is the kind of research level course.
00:46:06.880 | And while I teach those courses,
00:46:10.400 | I basically have a big office
00:46:15.400 | at the University of San Francisco,
00:46:18.400 | it'd be enough for like 30 people.
00:46:19.760 | And I invite anybody, any student who wants to come
00:46:22.120 | and hang out with me while I build the course.
00:46:25.320 | And so generally it's full.
00:46:26.600 | And so we have 20 or 30 people in a big office
00:46:30.860 | with nothing to do, but study deep learning.
00:46:33.880 | So it was during one of these times
00:46:35.880 | that somebody in the group said,
00:46:37.320 | "Oh, there's a thing called DawnBench,
00:46:40.600 | it looks interesting."
00:46:41.440 | And I was like, "What the hell is that?"
00:46:42.840 | And they set out some competition
00:46:44.100 | to see how quickly you can train a model.
00:46:46.400 | Seems kind of not exactly relevant to what we're doing,
00:46:50.320 | but it sounds like the kind of thing
00:46:51.400 | which you might be interested in.
00:46:52.480 | I checked it out and I was like,
00:46:53.320 | "Oh crap, there's only 10 days till it's over.
00:46:55.920 | It's pretty much too late."
00:46:58.080 | And we're kind of busy trying to teach this course.
00:47:00.960 | But we're like, "Oh, it would make an interesting
00:47:03.480 | case study for the course.
00:47:06.400 | Like it's all the stuff we're already doing.
00:47:08.180 | Why don't we just put together
00:47:09.480 | our current best practices and ideas?"
00:47:12.460 | So me and I guess about four students
00:47:16.040 | just decided to give it a go.
00:47:17.560 | And we focused on this small one called CIFAR-10,
00:47:20.840 | which is little 32 by 32 pixel images.
00:47:24.640 | - Can you say what DAWNBench is?
00:47:26.120 | - Yeah, so it's a competition
00:47:27.640 | to train a model as fast as possible.
00:47:29.520 | It was run by Stanford.
00:47:30.960 | - And as cheap as possible too.
00:47:32.480 | - That's also another one for as cheap as possible.
00:47:34.280 | And there's a couple of categories, ImageNet and CIFAR-10.
00:47:38.120 | So ImageNet is this big 1.3 million image thing
00:47:42.040 | that took a couple of days to train.
00:47:44.520 | Remember a friend of mine, Pete Warden,
00:47:47.840 | who's now at Google.
00:47:50.180 | I remember he told me how he trained ImageNet
00:47:53.240 | a few years ago, and he basically like had this
00:47:55.640 | little granny flat out the back
00:47:59.760 | that he turned into his ImageNet training center.
00:48:01.880 | And he figured, you know, after like a year of work,
00:48:03.760 | he figured out how to train it in like 10 days or something.
00:48:07.040 | It's like, that was a big job.
00:48:08.480 | Well, CIFAR-10 at that time, you could train in a few hours.
00:48:12.880 | You know, it was much smaller and easier.
00:48:14.520 | So we thought we'd try CIFAR-10.
00:48:17.280 | And yeah, I'd really never done that before.
00:48:22.280 | Like I'd never really, like things like using more
00:48:25.800 | than one GPU at a time was something I tried to avoid.
00:48:29.760 | 'Cause to me, it's like very against the whole idea
00:48:32.080 | of accessibility is you should be able to do things
00:48:34.120 | with one GPU.
00:48:34.960 | - I mean, have you asked in the past before,
00:48:37.960 | after having accomplished something,
00:48:39.600 | how do I do this faster, much faster?
00:48:42.440 | - Oh, always, but it's always, for me, it's always,
00:48:44.480 | how do I make it much faster on a single GPU
00:48:47.640 | that a normal person could afford in their day-to-day life?
00:48:50.360 | It's not, how could I do it faster by, you know,
00:48:53.840 | having a huge data center?
00:48:55.240 | 'Cause to me, it's all about like,
00:48:57.200 | as many people should be able to use something as possible
00:48:59.480 | without fussing around with infrastructure.
00:49:03.160 | So anyways, in this case, it's like, well,
00:49:06.000 | we can use eight GPUs just by renting an AWS machine.
00:49:10.200 | So we thought we'd try that.
00:49:11.840 | And yeah, basically using the stuff we were already doing,
00:49:16.520 | we were able to get, you know, the speed,
00:49:20.120 | you know, within a few days, we had the speed down to,
00:49:22.880 | I don't know, a very small number of minutes.
00:49:26.000 | I can't remember exactly how many minutes it was,
00:49:28.760 | but it might've been like 10 minutes or something.
00:49:31.360 | And so, yeah, we found ourselves at the top
00:49:33.200 | of the leaderboard easily for both time and money,
00:49:38.200 | which really shocked me
00:49:39.040 | 'cause the other people competing in this
00:49:40.160 | were like Google and Intel and stuff
00:49:41.920 | who, like, know a lot more about this stuff
00:49:43.920 | than I think we do.
00:49:45.400 | So then we were emboldened.
00:49:46.840 | We thought, let's try the ImageNet one too.
00:49:50.680 | I mean, it seemed way out of our league,
00:49:53.360 | but our goal was to get under 12 hours.
00:49:55.960 | And we did, which was really exciting.
00:49:59.280 | And, but we didn't put anything up on the leaderboard,
00:50:01.480 | but we were down to like 10 hours,
00:50:03.160 | but then Google put in some,
00:50:07.760 | like five hours or something,
00:50:10.040 | we're just like, oh, we're so screwed.
00:50:13.400 | But we kind of thought we'll keep trying,
00:50:16.920 | if Google can do it in five,
00:50:17.880 | I mean, Google did it in five hours on some,
00:50:19.520 | on like a TPU pod or something,
00:50:21.520 | like a lot of hardware.
00:50:23.240 | But we kind of like had a bunch of ideas to try,
00:50:26.360 | like a really simple thing was,
00:50:28.760 | why are we using these big images?
00:50:30.520 | They're like 224 or 256 by 256 pixels.
00:50:34.640 | Why don't we try smaller ones?
00:50:37.760 | - And just to elaborate,
00:50:39.040 | there's a constraint on the accuracy
00:50:41.360 | that your trained model is supposed to achieve.
00:50:43.040 | - Yeah, you gotta achieve 93%,
00:50:45.760 | I think it was for ImageNet, exactly.
00:50:49.240 | - Which is very tough, so you have to-
00:50:51.120 | - Yeah, 93%, like they picked a good threshold.
00:50:54.680 | It was a little bit higher
00:50:56.920 | than what the most commonly used ResNet-50 model
00:51:00.840 | could achieve at that time.
00:51:03.360 | So yeah, so it's quite a difficult problem to solve.
00:51:08.160 | But yeah, we realized if we actually
00:51:09.680 | just use 64 by 64 images,
00:51:12.280 | it trained a pretty good model.
00:51:16.160 | And then we could take that same model
00:51:18.000 | and just give it a couple of epochs
00:51:19.560 | to learn 224 by 224 images.
00:51:21.880 | And it was basically already trained,
00:51:24.480 | which makes a lot of sense.
00:51:25.440 | Like if you teach somebody,
00:51:26.600 | like here's what a dog looks like
00:51:28.080 | and you show them low res versions,
00:51:30.160 | and then you say, here's a really clear picture of a dog,
00:51:33.360 | they already know what a dog looks like.
00:51:35.920 | So that like, just, we jumped to the front
00:51:39.840 | and we ended up winning parts of that competition.
00:51:44.840 | We actually ended up doing a distributed version
00:51:49.600 | over multiple machines a couple of months later
00:51:51.920 | and ended up at the top of the leaderboard.
00:51:53.480 | We had 18 minutes.
00:51:54.960 | - (laughs) ImageNet.
00:51:56.200 | - Yeah, and it was,
00:51:57.960 | and people have just kept on blasting through
00:52:00.320 | again and again since then, so.
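To make that concrete: a minimal sketch of the progressive-resizing trick in plain PyTorch (the dataset path and the helper names are illustrative, not from the conversation):

    import torch
    import torchvision
    from torch import nn, optim
    from torchvision import datasets, transforms

    def make_loader(root, size, batch_size=256):
        # A standard ImageNet-style input pipeline at a given image size.
        tfms = transforms.Compose([
            transforms.RandomResizedCrop(size),
            transforms.ToTensor(),
        ])
        return torch.utils.data.DataLoader(
            datasets.ImageFolder(root, tfms),
            batch_size=batch_size, shuffle=True)

    model = torchvision.models.resnet50(num_classes=1000)
    opt = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    def fit(loader, epochs):
        for _ in range(epochs):
            for x, y in loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()

    # Do most of the training on small, cheap 64x64 crops...
    fit(make_loader("imagenet/train", size=64), epochs=20)
    # ...then give the *same* weights a couple of epochs at full size.
    fit(make_loader("imagenet/train", size=224), epochs=2)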
00:52:02.320 | - So what's your view on multi GPU
00:52:05.640 | or multiple machine training in general
00:52:08.480 | as a way to speed code up?
00:52:11.960 | - I think it's largely a waste of time.
00:52:13.680 | - Both multi GPU on a single machine and?
00:52:15.880 | - Yeah, particularly multi machines
00:52:17.680 | 'cause it's just clunky.
00:52:19.440 | Multi GPUs is less clunky than it used to be.
00:52:25.320 | But to me, anything that slows down your iteration speed
00:52:28.520 | is a waste of time.
00:52:30.320 | So you could maybe do your very last,
00:52:33.840 | you know, perfecting of the model on multi GPUs
00:52:37.960 | if you need to.
00:52:38.960 | But, so for example,
00:52:41.040 | I think doing stuff on ImageNet is generally a waste of time.
00:52:46.000 | Why test things on 1.3 million images?
00:52:48.200 | Most of us don't use 1.3 million images.
00:52:51.080 | And we've also done research that shows that
00:52:53.840 | doing things on a smaller subset of images
00:52:56.480 | gives you the same relative answers anyway.
00:52:59.160 | So from a research point of view, why waste that time?
00:53:02.080 | So actually I released a couple of new datasets recently.
00:53:06.120 | One is called Imagenette,
00:53:07.720 | the French ImageNet, which is a small subset of ImageNet,
00:53:12.880 | which is designed to be easy to classify.
00:53:15.040 | - What's, how do you spell Imagenette?
00:53:17.280 | - It's got an extra T and E at the end
00:53:19.200 | 'cause it's very French.
00:53:20.480 | - Imagenette, okay.
00:53:21.320 | - Yeah, and then another one called Imagewoof,
00:53:24.720 | which is a subset of ImageNet that only contains dog breeds.
00:53:29.720 | - And that's a hard one, right?
00:53:30.800 | - That's a hard one.
00:53:32.000 | And I've discovered that if you just look
00:53:33.800 | at these two subsets,
00:53:34.920 | you can train things on a single GPU in 10 minutes
00:53:39.120 | and the results you get directly transferable
00:53:42.080 | to ImageNet nearly all the time.
00:53:44.320 | And so now I'm starting to see some researchers
00:53:46.360 | start to use these much smaller datasets.
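Both subsets are registered in fastai's dataset URLs, so loading them is one line (a sketch with fastai v1-era names; later versions may differ):

    from fastai.vision import *   # fastai v1

    path = untar_data(URLs.IMAGENETTE)    # the easy-to-classify 10-class subset
    # path = untar_data(URLs.IMAGEWOOF)   # the harder, dog-breeds-only subset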
00:53:49.000 | - So I deeply love the way you think
00:53:51.160 | because I think you might've written a blog post saying
00:53:55.760 | that sort of going to these big datasets
00:54:00.160 | is encouraging people to not think creatively.
00:54:03.880 | - Absolutely.
00:54:04.720 | - So it sort of constrains you
00:54:08.280 | to train on large resources.
00:54:09.840 | And because you have these resources,
00:54:11.280 | you think more resources will be better.
00:54:14.000 | And then you start, so like somehow you kill the creativity.
00:54:17.720 | - Yeah, and even worse than that, Lex,
00:54:19.280 | I keep hearing from people who say,
00:54:21.120 | "I decided not to get into deep learning
00:54:23.400 | because I don't believe it's accessible
00:54:25.440 | to people outside of Google to do useful work."
00:54:28.520 | So like I see a lot of people make an explicit decision
00:54:31.640 | to not learn this incredibly valuable tool
00:54:35.960 | because they've drunk the Google Kool-Aid,
00:54:39.040 | which is that only Google's big enough
00:54:40.720 | and smart enough to do it.
00:54:42.440 | And I just find that so disappointing and it's so wrong.
00:54:45.360 | - And I think all of the major breakthroughs in AI
00:54:49.160 | in the next 20 years will be doable on a single GPU.
00:54:53.240 | Like I would say my sense is all the big sort of-
00:54:56.240 | - Well, let's put it this way.
00:54:58.240 | None of the big breakthroughs of the last 20 years
00:55:00.160 | have required multiple GPUs.
00:55:01.680 | So like batch norm, ReLU, dropout.
00:55:05.960 | - To demonstrate that there's something to that.
00:55:08.080 | - Every one of them, none of them has required multiple GPUs.
00:55:11.960 | - GANs, the original GANs didn't require multiple GPUs.
00:55:15.760 | - Well, and we've actually recently shown
00:55:18.040 | that you don't even need GANs.
00:55:19.640 | So we've developed GAN level outcomes without needing GANs.
00:55:24.640 | And we can now do it with, again,
00:55:26.880 | by using transfer learning,
00:55:27.960 | we can do it in a couple of hours on a single GPU.
00:55:30.160 | - Just using a generative model,
00:55:31.400 | like without the adversarial part?
00:55:32.960 | - Yeah, so we've found loss functions
00:55:35.680 | that work super well without the adversarial part.
00:55:38.640 | And then one of our students, a guy called Jason Antic,
00:55:41.800 | has created a system called DeOldify,
00:55:44.600 | which uses this technique to colorize
00:55:47.240 | old black and white movies.
00:55:48.800 | You can do it on a single GPU,
00:55:50.440 | colorize a whole movie in a couple of hours.
00:55:52.840 | And one of the things that Jason and I did together
00:55:56.040 | was we figured out how to add a little bit of GAN
00:56:00.440 | at the very end, which it turns out for colorization
00:56:02.960 | makes it just a bit brighter and nicer.
00:56:05.960 | And then Jason did masses of experiments
00:56:07.880 | to figure out exactly how much to do,
00:56:09.960 | but it's still all done on his home machine
00:56:12.800 | on a single GPU in his lounge room.
00:56:15.320 | And like, if you think about like
00:56:17.520 | colorizing Hollywood movies,
00:56:19.160 | that sounds like something a huge studio would have to do,
00:56:21.680 | but he has the world's best results on this.
00:56:25.160 | - There's this problem of microphones.
00:56:27.000 | We're just talking to microphones now.
00:56:29.040 | It's such a pain in the ass to have these microphones
00:56:32.480 | to get good quality audio.
00:56:34.360 | And I tried to see if it's possible to plop down
00:56:36.680 | a bunch of cheap sensors and reconstruct
00:56:39.160 | higher quality audio from multiple sources.
00:56:41.800 | 'Cause right now I haven't seen work where,
00:56:45.160 | okay, we can take inexpensive mics,
00:56:47.440 | automatically combining audio from multiple sources
00:56:50.040 | to improve the combined audio.
00:56:52.280 | People haven't done that.
00:56:53.120 | And that feels like a learning problem.
00:56:55.080 | So hopefully somebody can.
00:56:56.840 | - Well, I mean, it's eminently doable
00:56:58.800 | and it should have been done by now.
00:57:01.000 | I felt the same way about computational photography
00:57:03.600 | four years ago.
00:57:05.240 | Why are we investing in big lenses
00:57:07.120 | when three cheap lenses,
00:57:09.800 | plus actually a little bit of intentional movement?
00:57:13.760 | So like take a few frames,
00:57:16.640 | gives you enough information to get excellent sub-pixel
00:57:19.800 | resolution, which particularly with deep learning,
00:57:22.440 | you would know exactly what you're meant to be looking at.
00:57:25.800 | We can totally do the same thing with audio.
00:57:28.160 | I think it's madness that it hasn't been done yet.
00:57:30.680 | - Has there been progress on the computational photography side?
00:57:33.240 | - Yeah, computational photography is basically standard now.
00:57:36.720 | So the Google Pixel Night Light,
00:57:40.800 | I don't know if you've ever tried it,
00:57:42.080 | but it's astonishing.
00:57:43.200 | You take a picture in almost pitch black
00:57:45.440 | and you get back a very high quality image.
00:57:49.160 | And it's not because of the lens.
00:57:51.440 | Same stuff with like adding the bokeh
00:57:53.400 | to the background blurring done computationally.
00:57:58.560 | - This is the Pixel right here.
00:57:58.560 | - Yeah, basically everybody now is doing most
00:58:03.560 | of the fanciest stuff on their phones
00:58:05.680 | with computational photography.
00:58:07.080 | And also increasingly people are putting more than one lens
00:58:10.560 | on the back of the camera.
00:58:11.760 | So the same will happen for audio for sure.
00:58:14.280 | - And there's applications in the audio side.
00:58:16.440 | If you look at an Alexa type device,
00:58:18.400 | most people I've seen, especially I worked at Google before,
00:58:22.280 | when you look at noise background removal,
00:58:25.880 | you don't think of multiple sources of audio.
00:58:28.760 | You don't play with that as much as I would hope people would.
00:58:31.840 | - But I mean, you can still do it even with one.
00:58:33.560 | Like again, it's not much work's been done in this area.
00:58:36.040 | So we're actually gonna be releasing an audio library soon,
00:58:38.960 | which hopefully will encourage development of this
00:58:41.000 | 'cause it's so underused.
00:58:43.120 | The basic approach we used for our super resolution
00:58:46.440 | which Jason uses for DeOldify
00:58:48.600 | of generating high quality images,
00:58:50.920 | the exact same approach would work for audio.
00:58:53.400 | No one's done it yet,
00:58:54.400 | but it would be a couple of months work.
00:58:57.080 | - Okay, also learning rate in terms of DawnBench.
00:59:00.400 | There's some magic on learning rate
00:59:03.480 | that you played around with.
00:59:04.480 | That's kind of interesting.
00:59:05.680 | - Yeah, so this is all work that came
00:59:06.960 | from a guy called Leslie Smith.
00:59:09.280 | Leslie's a researcher who like us cares a lot
00:59:13.200 | about just the practicalities of training neural networks
00:59:18.200 | quickly and accurately,
00:59:20.280 | which you would think is what everybody should care about,
00:59:22.040 | but almost nobody does.
00:59:23.680 | And he discovered something very interesting,
00:59:28.000 | which he calls super convergence,
00:59:29.680 | which is there are certain networks
00:59:31.160 | that with certain settings of high parameters
00:59:33.240 | could suddenly be trained 10 times faster
00:59:37.000 | by using a 10 times higher learning rate.
00:59:39.400 | Now, no one would publish that paper
00:59:43.560 | because it's not an area of kind of active research
00:59:49.440 | in the academic world.
00:59:50.360 | No academics recognize this is important.
00:59:52.760 | And also deep learning in academia
00:59:56.040 | is not considered an experimental science.
00:59:59.800 | So unlike in physics where you could say like,
01:00:02.360 | I just saw a subatomic particle do something
01:00:05.320 | which the theory doesn't explain,
01:00:07.200 | you could publish that without an explanation.
01:00:10.400 | And then in the next 60 years,
01:00:11.840 | people can try to work out how to explain it.
01:00:14.080 | We don't allow this in the deep learning world.
01:00:16.120 | So it's literally impossible for Leslie
01:00:19.520 | to publish a paper that says,
01:00:21.600 | I've just seen something amazing happen.
01:00:23.520 | This thing trained 10 times faster than it should have.
01:00:25.640 | I don't know why.
01:00:27.360 | And so the reviewers were like,
01:00:28.480 | well, you can't publish that 'cause you don't know why.
01:00:30.240 | So anyway.
01:00:31.080 | - That's important to pause on
01:00:32.160 | because there's so many discoveries
01:00:34.280 | that would need to start like that.
01:00:36.120 | - Every other scientific field I know of works that way.
01:00:39.200 | I don't know why ours is uniquely disinterested
01:00:43.480 | in publishing unexplained experimental results,
01:00:47.680 | but there it is.
01:00:48.640 | So it wasn't published.
01:00:49.880 | Having said that,
01:00:52.480 | I read a lot more unpublished papers than published papers
01:00:56.800 | 'cause that's where you find the interesting insights.
01:01:00.000 | So I absolutely read this paper.
01:01:02.600 | And I was just like,
01:01:04.440 | this is astonishingly mind-blowing and weird and awesome.
01:01:09.440 | And like, why isn't everybody only talking about this?
01:01:12.320 | Because like, if you can train these things 10 times faster,
01:01:15.400 | they also generalize better
01:01:16.640 | because you're doing less epochs,
01:01:18.720 | which means you look at the data less,
01:01:20.000 | you get better accuracy.
01:01:21.360 | So I've been kind of studying that ever since.
01:01:24.560 | And eventually Leslie kind of figured out
01:01:28.440 | a lot of how to get this done.
01:01:30.040 | And we added minor tweaks
01:01:32.160 | and a big part of the trick
01:01:33.560 | is starting at a very low learning rate,
01:01:36.400 | very gradually increasing it.
01:01:37.840 | So as you're training your model,
01:01:39.760 | you would take very small steps at the start
01:01:42.040 | and you gradually make them bigger and bigger
01:01:44.000 | until eventually you're taking much bigger steps
01:01:46.360 | than anybody thought was possible.
01:01:48.120 | There's a few other little tricks to make it work,
01:01:51.040 | but basically we can reliably get super convergence.
01:01:55.160 | And so for the DawnBench thing,
01:01:56.560 | we were using just much higher learning rates
01:01:59.280 | than people expected to work.
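The schedule being described, ramping up from a small learning rate and then annealing back down, is Leslie Smith's 1cycle policy, and PyTorch ships a version of it as OneCycleLR. A minimal sketch, with the peak learning rate purely illustrative:

    import torch

    model = torch.nn.Linear(10, 2)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    sched = torch.optim.lr_scheduler.OneCycleLR(
        opt,
        max_lr=1.0,        # a far higher peak than you'd normally dare
        total_steps=1000,
        pct_start=0.3,     # spend the first 30% of steps warming up
    )
    for step in range(1000):
        # forward/backward pass would go here
        opt.step()
        sched.step()       # the LR rises, peaks, then anneals back down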
01:02:02.160 | - What do you think the future of,
01:02:03.800 | I mean, it makes so much sense for that
01:02:05.160 | for learning rate to be a critical hyperparameter that you vary.
01:02:08.600 | What do you think the future
01:02:09.480 | of learning rate magic looks like?
01:02:13.440 | - Well, there's been a lot of great work
01:02:14.880 | in the last 12 months in this area.
01:02:17.360 | And people are increasingly realizing that,
01:02:20.160 | like we just have no idea really how optimizers work.
01:02:23.080 | And the combination of weight decay,
01:02:25.800 | which is how we regularize optimizers
01:02:27.440 | and the learning rate,
01:02:29.160 | and then other things like the epsilon we use
01:02:31.480 | in the Adam optimizer,
01:02:32.760 | they all work together in weird ways.
01:02:36.520 | And different parts of the model,
01:02:38.520 | this is another thing we've done a lot of work on
01:02:40.440 | is research into how different parts of the model
01:02:43.480 | should be trained at different rates in different ways.
01:02:46.600 | So we do something we call discriminative learning rates,
01:02:49.000 | which is really important,
01:02:50.120 | particularly for transfer learning.
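Discriminative learning rates can be sketched with plain PyTorch parameter groups; the layer split below is illustrative, and fastai v1 expresses roughly the same idea as learn.fit_one_cycle(1, max_lr=slice(1e-5, 1e-3)):

    import torch
    import torchvision

    model = torchvision.models.resnet34(pretrained=True)
    # Early, general-purpose layers take tiny steps; the head takes big ones.
    opt = torch.optim.SGD([
        {"params": model.layer1.parameters(), "lr": 1e-5},
        {"params": model.layer4.parameters(), "lr": 1e-4},
        {"params": model.fc.parameters(),     "lr": 1e-3},
    ], momentum=0.9)   # other layers omitted for brevity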
01:02:51.880 | So really I think in the last 12 months,
01:02:54.840 | a lot of people have realized
01:02:55.840 | that all this stuff is important,
01:02:57.360 | there's been a lot of great work coming out,
01:02:59.960 | and we're starting to see algorithms appear,
01:03:03.640 | which have very, very few dials,
01:03:06.440 | if any, that you have to touch.
01:03:07.880 | So I think what's gonna happen
01:03:09.240 | is the idea of a learning rate,
01:03:10.800 | it almost already has disappeared in the latest research.
01:03:14.320 | And instead it's just like,
01:03:15.720 | we know enough about how to interpret the gradients
01:03:21.800 | and the change of gradients we see
01:03:23.800 | to know how to set every parameter.
01:03:25.320 | - That you can automate it.
01:03:26.280 | So you see the future of deep learning,
01:03:30.800 | where really, where's the input of a human expert needed?
01:03:34.520 | - Well, hopefully the input of a human expert
01:03:36.480 | will be almost entirely unneeded
01:03:38.720 | from the deep learning point of view.
01:03:40.400 | So again, like Google's approach to this
01:03:43.440 | is to try and use thousands of times more compute
01:03:45.960 | to run lots and lots of models at the same time
01:03:49.360 | and hope that one of them is good.
01:03:51.000 | - AutoML kind of?
01:03:51.840 | - Yeah, AutoML kind of stuff, which I think is insane.
01:03:54.680 | (laughing)
01:03:56.720 | When you better understand the mechanics
01:03:59.560 | of how models learn,
01:04:01.640 | you don't have to try a thousand different models
01:04:03.760 | to find which one happens to work the best.
01:04:05.600 | You can just jump straight to the best one,
01:04:08.080 | which means that it's more accessible
01:04:09.680 | in terms of compute, cheaper,
01:04:12.680 | and also with less hyperparameters to set,
01:04:14.880 | it means you don't need deep learning experts
01:04:16.760 | to train your deep learning model for you,
01:04:19.320 | which means that domain experts can do more of the work,
01:04:22.240 | which means that now you can focus the human time
01:04:24.960 | on the kind of interpretation, the data gathering,
01:04:28.280 | identifying model errors and stuff like that.
01:04:31.360 | - Yeah, the data side.
01:04:32.800 | How often do you work with data these days
01:04:34.720 | in terms of the cleaning, looking at it?
01:04:37.800 | Like Darwin looked at different species
01:04:41.120 | while traveling about.
01:04:42.880 | Do you look at data?
01:04:44.960 | Have you in your roots in Kaggle?
01:04:48.040 | - Always, yeah. - Just look at data?
01:04:49.360 | - Yeah, I mean, it's a key part of our course
01:04:51.320 | is like before we train a model in the course,
01:04:53.440 | we see how to look at the data.
01:04:55.160 | And then after, the first thing we do
01:04:56.520 | after we train our first model,
01:04:57.920 | which is, we fine-tune an ImageNet model for five minutes.
01:05:00.520 | And then the thing we immediately do after that
01:05:02.200 | is we learn how to analyze the results of the model
01:05:05.800 | by looking at examples of misclassified images
01:05:08.920 | and looking at a confusion matrix
01:05:10.880 | and then doing like research on Google
01:05:15.080 | to learn about the kinds of things that it's misclassifying.
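That analysis step looks roughly like this with fastai v1's interpretation tools, assuming learn is the fine-tuned Learner from the course:

    from fastai.vision import *   # fastai v1

    interp = ClassificationInterpretation.from_learner(learn)
    interp.plot_top_losses(9)        # the most confidently wrong images
    interp.plot_confusion_matrix()   # which classes get mixed up with which
    interp.most_confused(min_val=2)  # (actual, predicted, count) triples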
01:05:18.120 | So to me, one of the really cool things
01:05:19.480 | about machine learning models in general
01:05:21.800 | is that when you interpret them,
01:05:24.280 | they tell you about things like,
01:05:25.400 | what are the most important features?
01:05:27.320 | Which groups you're misclassifying?
01:05:29.360 | And they help you become a domain expert more quickly
01:05:32.440 | because you can focus your time on the bits
01:05:34.840 | that the model is telling you is important.
01:05:38.680 | So it lets you deal with things like data leakage,
01:05:40.720 | for example, if it says,
01:05:41.560 | "Oh, the main feature I'm looking at is customer ID."
01:05:45.400 | You know, when you're like,
01:05:46.240 | "Oh, customer ID shouldn't be predictive."
01:05:47.600 | And then you can talk to the people
01:05:50.640 | that manage customer IDs and they'll tell you like,
01:05:53.160 | "Oh yes, as soon as a customer's application is accepted,
01:05:57.480 | we add a one on the end of their customer ID or something."
01:06:01.160 | So yeah, looking at data,
01:06:03.720 | particularly through the lens of which parts of the data
01:06:06.000 | the model says are important, is super important.
01:06:09.360 | - Yeah, and using the model to almost debug the data
01:06:12.880 | to learn more about the data.
01:06:14.240 | - Exactly.
01:06:16.800 | - What are the different cloud options
01:06:18.600 | for training your networks?
01:06:20.160 | Last question related to DawnBench.
01:06:21.960 | Well, it's part of a lot of the work you do,
01:06:24.240 | but from a perspective of performance,
01:06:27.280 | I think you've written this in a blog post.
01:06:29.480 | There's AWS, there's a TPU from Google.
01:06:32.720 | What's your sense, what the future holds?
01:06:34.480 | What would you recommend now in terms of-
01:06:37.360 | - So from a hardware point of view,
01:06:39.440 | Google's TPUs and the best Nvidia GPUs are similar.
01:06:45.320 | I mean, maybe the TPUs are like 30% faster,
01:06:47.920 | but they're also much harder to program.
01:06:49.920 | There isn't a clear leader in terms of hardware right now,
01:06:54.720 | although much more importantly,
01:06:56.280 | the Nvidia GPUs are much more programmable.
01:06:59.560 | They've got much more software written for them.
01:07:00.960 | So like that's the clear leader for me
01:07:03.160 | and where I would spend my time
01:07:04.440 | as a researcher and practitioner.
01:07:06.880 | But then in terms of the platform,
01:07:10.320 | I mean, we're super lucky now
01:07:13.800 | with stuff like Google GCP, Google Cloud,
01:07:17.040 | and AWS that you can access a GPU pretty quickly and easily.
01:07:22.040 | But I mean, for AWS, it's still too hard.
01:07:28.080 | Like you have to find an AMI and get the instance running
01:07:33.080 | and then install the software you want and blah, blah, blah.
01:07:37.080 | GCP is still, is currently the best way to get started
01:07:40.760 | on a full server environment
01:07:42.320 | because they have a fantastic fast AI
01:07:44.880 | and PyTorch ready to go instance,
01:07:47.680 | which has all the courses pre-installed.
01:07:51.080 | It has Jupyter Notebook pre-running.
01:07:53.040 | Jupyter Notebook is this wonderful
01:07:55.920 | interactive computing system,
01:07:57.600 | which everybody basically should be using
01:08:00.360 | for any kind of data-driven research.
01:08:02.880 | But then even better than that,
01:08:04.440 | there are platforms like Salamander,
01:08:08.400 | which we own and Paperspace,
01:08:11.240 | where literally you click a single button
01:08:13.560 | and it pops up a Jupyter Notebook straight away
01:08:17.200 | without any kind of installation or anything.
01:08:22.200 | And all the course notebooks are all pre-installed.
01:08:25.760 | So like for me, this is one of the things
01:08:28.560 | we spent a lot of time kind of curating and working on.
01:08:32.920 | 'Cause when we first started our courses,
01:08:35.960 | the biggest problem was people dropped out of lesson one
01:08:39.600 | 'cause they couldn't get an AWS instance running.
01:08:42.680 | So things are so much better now.
01:08:44.880 | And like we actually have, if you go to course.fast.ai,
01:08:47.760 | the first thing it says is,
01:08:48.720 | "Here's how to get started with your GPU."
01:08:50.480 | And there's like, you just click on the link
01:08:52.120 | and you click start and you're going.
01:08:55.160 | - So you would go GCP.
01:08:56.280 | I have to confess, I've never used the Google GCP.
01:08:58.800 | - Yeah, GCP gives you $300 of compute for free,
01:09:01.640 | which is really nice.
01:09:03.920 | But as I say, Salamander and Paperspace
01:09:07.320 | are even easier still.
01:09:09.440 | - Okay.
01:09:10.960 | So from the perspective of deep learning frameworks,
01:09:15.120 | you work with Fast.ai, this framework,
01:09:18.440 | and PyTorch and TensorFlow.
01:09:21.240 | What are the strengths of each platform?
01:09:24.320 | - Sure. - Your perspective.
01:09:25.800 | - So in terms of what we've done our research on
01:09:28.760 | and taught in our course,
01:09:30.240 | we started with Theano and Keras.
01:09:34.360 | And then we switched to TensorFlow and Keras.
01:09:38.080 | And then we switched to PyTorch
01:09:40.360 | and then we switched to PyTorch and Fast.ai.
01:09:42.960 | And that kind of reflects a growth and development
01:09:47.560 | of the ecosystem of deep learning libraries.
01:09:50.960 | Theano and TensorFlow were great,
01:09:57.080 | but were much harder to teach
01:09:59.720 | and to do research and development on
01:10:01.680 | because they define what's called a computational graph
01:10:04.560 | up front, a static graph,
01:10:06.040 | where you basically have to say,
01:10:07.400 | here are all the things that I'm going to eventually do
01:10:10.840 | in my model.
01:10:12.000 | And then later on you say,
01:10:13.160 | okay, do those things with this data.
01:10:15.040 | And you can't like debug them,
01:10:17.080 | you can't do them step-by-step,
01:10:18.480 | you can't program them interactively
01:10:20.080 | in a Jupyter notebook and so forth.
01:10:22.240 | PyTorch was not the first,
01:10:23.680 | but PyTorch was certainly the strongest entrant
01:10:26.800 | to come along and say,
01:10:27.640 | let's not do it that way,
01:10:28.640 | let's just use normal Python.
01:10:30.240 | And everything you know about in Python
01:10:32.840 | is just gonna work.
01:10:34.080 | And we'll figure out how to make that run on the GPU
01:10:37.920 | as and when necessary.
01:10:39.320 | That turned out to be a huge leap
01:10:44.640 | in terms of what we could do with our research
01:10:46.800 | and what we could do with our teaching.
01:10:48.760 | - 'Cause it wasn't limiting.
01:10:51.240 | - Yeah, I mean, it was critical for us
01:10:52.760 | for something like DawnBench
01:10:53.880 | to be able to rapidly try things.
01:10:55.960 | It's just so much harder to be a researcher
01:10:57.840 | and practitioner when you have to do everything up front
01:11:00.520 | and you can't inspect it.
01:11:03.400 | The problem with PyTorch is
01:11:05.120 | it's not at all accessible to newcomers
01:11:08.880 | because you have to write your own training loop
01:11:11.600 | and manage the gradients and all this stuff.
01:11:14.120 | And it's also not great for researchers
01:11:17.880 | because you're spending your time
01:11:19.360 | dealing with all this boilerplate and overhead
01:11:21.640 | rather than thinking about your algorithm.
01:11:23.880 | So we ended up writing this very multi-layered API
01:11:27.760 | that at the top level,
01:11:29.040 | you can train a state-of-the-art neural network
01:11:31.400 | in three lines of code.
01:11:33.560 | And which kind of talks to an API,
01:11:35.040 | which talks to an API, which talks to an API,
01:11:36.640 | which like you can dive into at any level
01:11:38.800 | and get progressively closer to the machine
01:11:42.640 | kind of levels of control.
01:11:44.120 | And this is the Fast.ai library.
01:11:47.400 | That's been critical for us and for our students
01:11:51.800 | and for lots of people that have won
01:11:53.680 | big machine learning competitions with it
01:11:55.200 | and written academic papers with it.
01:11:57.400 | It's made a big difference.
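A sketch of what that layered API looks like in practice, using fastai v1-style names (which vary across versions); path is assumed to hold one folder of images per class:

    from fastai.vision import *   # fastai v1

    data = ImageDataBunch.from_folder(path, train='.', valid_pct=0.2, size=224)
    learn = cnn_learner(data, models.resnet34, metrics=accuracy)
    learn.fit_one_cycle(4)

    # And when you need control, the layers underneath are right there:
    learn.model           # a plain torch.nn.Module
    learn.data.train_dl   # the underlying data loader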
01:12:00.640 | We're still limited though by Python
01:12:02.920 | and particularly this problem with things like
01:12:06.400 | recurrent neural nets say where you just can't change things
01:12:11.400 | unless you accept it going so slowly that it's impractical.
01:12:15.640 | So in the latest incarnation of the course
01:12:18.320 | and with some of the research we're now starting to do,
01:12:20.880 | we're starting to do some stuff in Swift.
01:12:24.520 | I think we're three years away
01:12:27.400 | from that being super practical,
01:12:29.800 | but I'm in no hurry.
01:12:31.040 | I'm very happy to invest the time to get there.
01:12:34.240 | But with that, we actually already have a nascent version
01:12:39.000 | of the Fast.ai library for vision
01:12:41.800 | running on Swift for TensorFlow.
01:12:44.720 | 'Cause Python for TensorFlow is not gonna cut it.
01:12:48.000 | It's just a disaster.
01:12:49.920 | What they did was they tried to replicate
01:12:52.960 | the bits that people were saying they like about PyTorch,
01:12:57.080 | this kind of interactive computation,
01:12:59.160 | but they didn't actually change
01:13:00.600 | their foundational runtime components.
01:13:03.880 | So they kind of added this like syntax sugar
01:13:06.600 | they call TF eager, TensorFlow eager,
01:13:08.360 | which makes it look a lot like PyTorch,
01:13:10.880 | but it's 10 times slower than PyTorch to actually do a step.
01:13:15.880 | So because they didn't invest the time
01:13:19.040 | in like retooling the foundations
01:13:21.080 | 'cause their code base is so horribly complex.
01:13:23.440 | - Yeah, I think it's probably very difficult
01:13:25.240 | to do that kind of retooling.
01:13:26.360 | - Yeah, well, particularly the way TensorFlow was written,
01:13:28.600 | it was written by a lot of people very quickly
01:13:31.440 | in a very disorganized way.
01:13:33.320 | So like when you actually look in the code, as I do often,
01:13:35.960 | I'm always just like, oh God, what were they thinking?
01:13:38.800 | It's just, it's pretty awful.
01:13:41.360 | So I'm really extremely negative
01:13:45.200 | about the potential future for Python.
01:13:47.800 | - TensorFlow, Python for TensorFlow.
01:13:50.040 | - But Swift for TensorFlow
01:13:52.080 | can be a different beast altogether.
01:13:53.720 | It can be like, it can basically be a layer on top of MLIR
01:13:57.520 | that takes advantage of all the great compiler stuff
01:14:02.520 | that Swift builds on with LLVM.
01:14:04.720 | And yeah, it could be,
01:14:07.000 | I think it will be absolutely fantastic.
01:14:09.280 | - Well, you're inspiring me to try.
01:14:11.840 | I haven't truly felt the pain of TensorFlow 2.0 Python.
01:14:16.840 | It's fine by me, but-
01:14:19.520 | - Yeah, I mean, it does the job
01:14:22.080 | if you're using like predefined things
01:14:25.080 | that somebody's already written.
01:14:27.680 | But if you actually compare, you know,
01:14:29.520 | like I've had to do,
01:14:31.320 | 'cause I've been having to do a lot of stuff
01:14:32.600 | with TensorFlow recently,
01:14:33.640 | you actually compare like,
01:14:34.720 | okay, I want to write something from scratch.
01:14:37.320 | And you're like, I just keep finding it's like,
01:14:38.840 | oh, it's running 10 times slower than PyTorch.
01:14:41.480 | - So is the biggest cost,
01:14:43.760 | let's throw running time out the window,
01:14:47.280 | how long it takes you to program?
01:14:49.560 | - That's not too different now.
01:14:50.920 | Thanks to TensorFlow Eager, that's not too different.
01:14:54.000 | But because so many things take so long to run,
01:14:58.560 | you wouldn't run it at 10 times slower.
01:15:00.240 | Like you just go like, oh, this is taking too long.
01:15:03.200 | And also there's a lot of things
01:15:04.200 | which are just less programmable,
01:15:05.760 | like tf.data, which is the way data processing works
01:15:08.920 | in TensorFlow is just this big mess.
01:15:11.320 | It's incredibly inefficient.
01:15:13.160 | And they kind of had to write it that way
01:15:14.720 | because of the TPU problems I described earlier.
01:15:19.120 | So I just, you know,
01:15:22.120 | I just feel like they've got this huge technical debt,
01:15:24.680 | which they're not going to solve
01:15:26.160 | without starting from scratch.
01:15:27.920 | - So here's an interesting question then.
01:15:29.400 | If there's a new student starting today,
01:15:33.560 | what would you recommend they use?
01:15:37.440 | - Well, I mean, we obviously recommend Fast.ai and PyTorch
01:15:40.400 | because we teach new students
01:15:42.680 | and that's what we teach with.
01:15:43.840 | So we would very strongly recommend that
01:15:46.040 | because it will let you get on top of the concepts
01:15:49.960 | much more quickly.
01:15:51.880 | So then you'll become an actual,
01:15:53.080 | and you'll also learn the actual state of the art techniques,
01:15:56.120 | you know, so you actually get world-class results.
01:15:59.160 | Honestly, it doesn't much matter what library you learn
01:16:03.880 | because switching from Chainer to MXNet
01:16:08.280 | to TensorFlow to PyTorch is gonna be a couple of days work
01:16:11.960 | as long as you understand the foundations well.
01:16:15.200 | - But do you think Swift will creep in there
01:16:19.360 | as a thing that people start using?
01:16:22.880 | - Not for a few years,
01:16:24.320 | particularly because like Swift has no data science community,
01:16:29.320 | libraries, tooling. - So code bases are out there.
01:16:33.360 | - And the Swift community has a total lack of appreciation
01:16:38.360 | and understanding of numeric computing.
01:16:40.840 | So like they keep on making stupid decisions,
01:16:43.280 | you know, for years they've just done dumb things
01:16:45.400 | around performance and prioritization.
01:16:49.400 | That's clearly changing now
01:16:53.440 | because the developer of Swift, Chris Lattner,
01:16:58.000 | is working at Google on Swift for TensorFlow.
01:17:00.720 | So like that's a priority.
01:17:04.160 | It'll be interesting to see what happens with Apple
01:17:05.800 | because like Apple hasn't shown any sign of caring
01:17:10.800 | about numeric programming in Swift.
01:17:13.800 | So I mean, hopefully they'll get off their ass
01:17:17.360 | and start appreciating this
01:17:18.800 | 'cause currently all of their low level libraries
01:17:22.240 | are not written in Swift.
01:17:25.120 | They're not particularly Swifty at all,
01:17:27.360 | stuff like Core ML, they're really pretty rubbish.
01:17:30.760 | So yeah, so there's a long way to go,
01:17:33.680 | but at least one nice thing is that Swift for TensorFlow
01:17:36.080 | can actually directly use Python code and Python libraries
01:17:40.760 | and literally the entire lesson one notebook of fast.ai
01:17:45.000 | runs in Swift right now in Python mode.
01:17:48.560 | So that's a nice intermediate thing.
01:17:51.640 | - How long does it take,
01:17:53.400 | if you look at the two fast AI courses,
01:17:57.560 | how long does it take to get from point zero
01:18:00.480 | to completing both courses?
01:18:02.040 | - It varies a lot.
01:18:04.320 | Somewhere between two months and two years generally.
01:18:13.160 | - So for two months, how many hours a day?
01:18:15.320 | - So like somebody who is a very competent coder
01:18:20.320 | can do 70 hours per course and-
01:18:26.480 | - 70, seven zero, that's it?
01:18:30.040 | Okay.
01:18:30.880 | - But a lot of people I know take a year off
01:18:35.680 | to study fast AI full time and say at the end of the year,
01:18:40.480 | they feel pretty competent.
01:18:43.440 | 'Cause generally there's a lot of other things you do.
01:18:45.560 | Like generally they'll be entering Kaggle competitions.
01:18:48.680 | They might be reading Ian Goodfellow's book.
01:18:51.440 | They might, you know, they'll be doing a bunch of stuff.
01:18:54.560 | And often, you know, particularly if they
01:18:56.720 | are a domain expert, their coding skills
01:18:59.040 | might be a little on the pedestrian side.
01:19:01.760 | So part of it's just like doing a lot more writing.
01:19:04.760 | - What do you find is the bottleneck for people usually,
01:19:08.000 | except getting started and setting stuff up?
01:19:11.720 | - I would say coding.
01:19:13.160 | - Just-
01:19:14.000 | - Yeah, I would say the best,
01:19:14.840 | the people who are strong coders pick it up the best.
01:19:17.880 | Although another bottleneck is people who have a lot
01:19:21.640 | of experience of classic statistics can really struggle
01:19:26.640 | because the intuition is so the opposite
01:19:30.000 | of what they're used to.
01:19:30.840 | They're very used to like trying to reduce the number
01:19:33.040 | of parameters in their model and looking
01:19:36.920 | at individual coefficients and stuff like that.
01:19:39.400 | So I find people who have a lot of coding background
01:19:42.920 | and know nothing about statistics are generally
01:19:45.680 | gonna be the best off.
01:19:47.440 | - So you taught several courses on deep learning
01:19:51.360 | and as Feynman says,
01:19:52.920 | "The best way to understand something is to teach it."
01:19:55.600 | What have you learned about deep learning from teaching it?
01:19:59.120 | - A lot.
01:20:00.600 | It's a key reason for me to teach the courses.
01:20:03.560 | I mean, obviously it's gonna be necessary
01:20:04.920 | to achieve our goal of getting domain experts
01:20:07.640 | to be familiar with deep learning,
01:20:09.320 | but it was also necessary for me to achieve my goal
01:20:12.040 | of being really familiar with deep learning.
01:20:14.240 | I mean, to see so many domain experts
01:20:23.200 | from so many different backgrounds,
01:20:25.640 | it's definitely, I wouldn't say taught me,
01:20:28.800 | but convinced me something that I liked to believe
01:20:31.520 | was true, which was anyone can do it.
01:20:34.880 | So there's a lot of kind of snobbishness out there
01:20:37.400 | about only certain people can learn to code,
01:20:40.200 | only certain people are gonna be smart enough to do AI.
01:20:43.120 | That's definitely bullshit.
01:20:45.320 | I've seen so many people
01:20:47.240 | from so many different backgrounds get state-of-the-art
01:20:50.320 | results in their domain areas now.
01:20:52.480 | It's definitely taught me that the key differentiator
01:20:57.120 | between people that succeed and people that fail
01:20:59.600 | is tenacity.
01:21:00.680 | That seems to be basically the only thing that matters.
01:21:03.920 | The people, a lot of people give up.
01:21:06.800 | And, but of the ones who don't give up,
01:21:11.360 | pretty much everybody succeeds.
01:21:15.000 | Even if at first I'm just kind of like thinking like,
01:21:17.840 | wow, they really aren't quite getting it yet, are they?
01:21:20.520 | But eventually people get it and they succeed.
01:21:24.720 | So I think that's been,
01:21:26.400 | I think they're both things I've liked to believe was true,
01:21:28.720 | but I don't feel like I really had strong evidence
01:21:30.880 | for them to be true,
01:21:31.760 | but now I can say I've seen it again and again.
01:21:34.760 | - So what advice do you have for someone
01:21:38.600 | who wants to get started in deep learning?
01:21:42.160 | - Train lots of models.
01:21:44.360 | That's how you learn it.
01:21:47.040 | So like, so I would, you know, I think, it's not just me.
01:21:51.560 | I think our course is very good,
01:21:53.320 | but also lots of people independently have said
01:21:54.960 | it's very good.
01:21:55.800 | It recently won the CogX award for AI courses
01:21:58.600 | as being the best in the world.
01:22:00.160 | I'd say come to our course, course.fast.ai.
01:22:02.960 | And the thing I keep on harping on in my lessons
01:22:05.240 | is train models, print out the inputs to the models,
01:22:09.120 | print out the outputs of the models,
01:22:11.000 | like study, you know, change the inputs a bit,
01:22:15.320 | look at how the outputs vary,
01:22:17.320 | just run lots of experiments to get a, you know,
01:22:20.360 | an intuitive understanding of what's going on.
01:22:24.480 | - To get hooked, do you think, you mentioned training,
01:22:29.080 | do you think just running the models inference?
01:22:32.640 | Like if we talk about getting started.
01:22:35.360 | - No, you've got to fine tune the models.
01:22:37.480 | So that's the critical thing,
01:22:39.480 | 'cause at that point you now have a model
01:22:41.240 | that's in your domain area.
01:22:43.240 | So there's no point running somebody else's model
01:22:46.840 | 'cause it's not your model.
01:22:47.880 | Like, so it only takes five minutes to fine tune a model
01:22:50.480 | for the data you care about.
01:22:52.040 | And in lesson two of the course,
01:22:53.520 | we teach you how to create your own dataset from scratch
01:22:56.360 | by scripting Google image search.
01:22:58.560 | So, and we show you how to actually create
01:23:01.160 | a web application running online.
01:23:02.840 | So I create one in the course that differentiates
01:23:05.280 | between a teddy bear, a grizzly bear, and a brown bear.
01:23:08.320 | And it does it with basically a hundred percent accuracy.
01:23:11.040 | Took me about four minutes to scrape the images
01:23:13.120 | from Google search in the script.
01:23:15.080 | There's a little graphical widgets we have in the notebook
01:23:18.760 | that help you clean up the dataset.
01:23:21.400 | There's other widgets that help you study the results
01:23:24.040 | to see where the errors are happening.
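The workflow just described, sketched with fastai v1 helpers; the URL files are ones you would export yourself from an image search, and the paths are illustrative:

    from fastai.vision import *   # fastai v1

    for bear in ['teddy', 'grizzly', 'brown']:
        # each urls_*.csv holds image URLs scraped from Google image search
        download_images(f'urls_{bear}.csv', f'bears/{bear}', max_pics=200)
        verify_images(f'bears/{bear}', delete=True)   # drop broken downloads

    data = ImageDataBunch.from_folder('bears', train='.', valid_pct=0.2, size=224)
    # fine-tuning then proceeds exactly as in the earlier three-line sketch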
01:23:26.360 | And so now we've got over a thousand replies
01:23:29.280 | in our share your work here thread of students saying,
01:23:32.800 | here's the thing I built.
01:23:34.280 | And so there's people who like,
01:23:35.880 | and a lot of them are state of the art.
01:23:37.600 | Like somebody said, oh, I tried looking
01:23:39.000 | at Devanagari characters and I couldn't believe it.
01:23:41.160 | The thing that came out was more accurate
01:23:43.320 | than the best academic paper after lesson one.
01:23:46.640 | And then there's others which are just more kind of fun.
01:23:48.560 | Like somebody who's doing Trinidad and Tobago hummingbirds.
01:23:53.080 | She said, that's kind of their national bird.
01:23:54.880 | And she's got something that can now classify a Trinidad
01:23:57.400 | and Tobago hummingbirds.
01:23:58.800 | So yeah, train models, fine tune models with your dataset
01:24:02.440 | and then study their inputs and outputs.
01:24:05.200 | - How much are the Fast.ai courses?
01:24:07.160 | - Free.
01:24:08.000 | Everything we do is free.
01:24:10.480 | We have no revenue sources of any kind.
01:24:12.720 | It's just a service to the community.
01:24:15.400 | - You're a saint.
01:24:16.600 | Okay.
01:24:17.440 | Once a person understands the basics,
01:24:20.080 | trains a bunch of models.
01:24:22.720 | If we look at the scale of years,
01:24:25.880 | what advice do you have for someone wanting
01:24:27.640 | to eventually become an expert?
01:24:29.280 | - Train lots of models.
01:24:31.920 | (laughing)
01:24:33.120 | Specifically train lots of models in your domain area.
01:24:35.360 | So an expert at what, right?
01:24:37.080 | We don't need more experts,
01:24:39.160 | like, creating slightly evolutionary research
01:24:44.160 | in areas that everybody's studying.
01:24:46.680 | We need experts at using deep learning
01:24:50.440 | to diagnose malaria.
01:24:52.640 | Or we need experts at using deep learning
01:24:55.520 | to analyze language to study media bias.
01:25:00.520 | So we need experts in analyzing fisheries
01:25:04.080 | to identify problem areas in the ocean.
01:25:11.960 | That's what we need.
01:25:13.240 | So like become the expert in your passion area.
01:25:17.760 | And this is a tool which you can use
01:25:20.160 | for just about anything.
01:25:21.240 | And you'll be able to do that thing better
01:25:22.920 | than other people, particularly by combining it
01:25:25.760 | with your passion and domain expertise.
01:25:27.440 | - So that's really interesting.
01:25:28.400 | Even if you do wanna innovate on transfer learning
01:25:30.880 | or active learning, your thought is,
01:25:34.040 | I mean, it's one I certainly share,
01:25:36.200 | is you also need to find a domain or a dataset
01:25:40.160 | that you actually really care for.
01:25:41.680 | - Right.
01:25:42.520 | If you're not working on a real problem that you understand,
01:25:45.360 | how do you know if you're doing it any good?
01:25:48.040 | How do you know if your results are good?
01:25:49.320 | How do you know if you're getting bad results?
01:25:50.800 | Why are you getting bad results?
01:25:52.040 | Is it a problem with the data?
01:25:53.600 | How do you know you're doing anything useful?
01:25:57.400 | Yeah, to me, the only really interesting research
01:26:00.160 | is not the only, but the vast majority
01:26:02.400 | of interesting research is like try
01:26:04.720 | and solve an actual problem and solve it really well.
01:26:06.880 | - So both understanding sufficient tools
01:26:09.440 | on the deep learning side and becoming a domain expert
01:26:13.720 | in a particular domain are really things within reach
01:26:17.360 | for anybody.
01:26:18.280 | - Yeah, I mean, to me, I would compare it
01:26:20.560 | to like studying self-driving cars,
01:26:23.480 | having never looked at a car or been in a car
01:26:26.560 | or turned a car on, which is like the way it is
01:26:29.360 | for a lot of people.
01:26:30.640 | They'll study some academic dataset
01:26:32.880 | where they literally have no idea about that.
01:26:36.160 | - By the way, I'm not sure how familiar
01:26:37.680 | with autonomous vehicles, but that is literally,
01:26:40.880 | you've described a large percentage of robotics folks
01:26:43.440 | working on self-driving cars,
01:26:45.000 | it's that they actually haven't considered driving.
01:26:48.680 | They haven't actually looked at what driving looks like.
01:26:50.600 | They haven't driven.
01:26:51.440 | - Right, and it's a problem because you know,
01:26:53.320 | when you've actually driven, you know,
01:26:54.400 | like these are the things that happened to me
01:26:56.240 | when I was driving.
01:26:57.080 | - There's nothing that beats the real world examples
01:26:59.680 | of just experiencing them.
01:27:01.120 | You've created many successful startups.
01:27:04.880 | What does it take to create a successful startup?
01:27:07.400 | - Same thing as becoming a successful
01:27:11.520 | deep learning practitioner, which is not giving up.
01:27:15.000 | So you can run out of money or run out of time
01:27:20.000 | or run out of something, you know,
01:27:24.720 | but if you keep costs super low
01:27:28.000 | and try and save up some money beforehand
01:27:29.960 | so you can afford to have some time,
01:27:34.000 | then just sticking with it is one important thing.
01:27:38.080 | Doing something you understand and care about is important.
01:27:42.680 | By something, I don't mean,
01:27:44.040 | the biggest problem I see with deep learning people
01:27:46.720 | is they do a PhD in deep learning
01:27:50.160 | and then they try and commercialize their PhD,
01:27:52.440 | which is a waste of time
01:27:53.320 | 'cause that doesn't solve an actual problem.
01:27:55.880 | You picked your PhD topic 'cause it was an interesting
01:27:59.280 | kind of engineering or math or research exercise.
01:28:02.520 | But yeah, if you've actually spent time as a recruiter
01:28:06.680 | and you know that most of your time
01:28:08.240 | was spent sifting through resumes
01:28:10.680 | and you know that most of the time
01:28:12.880 | you're just looking for certain kinds of things
01:28:14.720 | and you can try doing that with a model for a few minutes
01:28:19.720 | and see whether that's something which the model
01:28:21.040 | seems to be able to do as well as you could,
01:28:23.760 | then you're on the right track to creating a startup.
01:28:27.640 | And then I think just, yeah, being,
01:28:29.400 | just be pragmatic and try and stay away
01:28:35.720 | from venture capital money as long as possible,
01:28:37.920 | preferably forever.
01:28:39.200 | - So yeah, on that point, do you,
01:28:41.320 | venture capital, so did you,
01:28:44.600 | were you able to successfully run startups
01:28:46.880 | with self-funded for quite a while?
01:28:48.240 | - Yeah, so my first two were self-funded
01:28:50.200 | and that was the right way to do it.
01:28:52.360 | - Is that scary?
01:28:53.200 | - No, VC startups are much more scary
01:28:57.840 | because you have these people on your back
01:29:00.680 | who do this all the time and who have done it for years
01:29:03.360 | telling you, "Grow, grow, grow, grow."
01:29:05.480 | And they don't care if you fail,
01:29:07.200 | they only care if you don't grow fast enough.
01:29:09.480 | So that's scary, whereas doing the ones myself,
01:29:13.280 | well, with partners who were friends,
01:29:17.760 | it's nice 'cause we just went along at a pace
01:29:21.120 | that made sense and we were able to build it to something
01:29:23.760 | which was big enough that we never had to work again,
01:29:27.280 | but it was not big enough that any VC
01:29:29.280 | would think it was impressive.
01:29:31.480 | And that was enough for us to be excited.
01:29:35.440 | So I thought that's a much better way
01:29:38.840 | to do things than most people.
01:29:40.280 | - In generally speaking, not for yourself,
01:29:41.920 | but how do you make money during that process?
01:29:44.520 | Do you cut into savings?
01:29:47.440 | - So yeah, so I started Fastmail and Optimal Decisions
01:29:50.640 | at the same time in 1999 with two different friends.
01:29:54.560 | And for Fastmail, I guess I spent $70 a month on the server.
01:30:04.000 | And when the server ran out of space,
01:30:06.240 | I put a payments button on the front page
01:30:09.400 | and said, "If you want more than 10 megs of space,
01:30:11.880 | you have to pay $10 a year."
01:30:15.640 | And- - So you ran lean,
01:30:17.320 | like kept your costs down.
01:30:18.480 | - Yeah, so I kept my cost down.
01:30:19.480 | And once I needed to spend more money,
01:30:22.960 | I asked people to spend the money for me.
01:30:25.560 | And that was that basically from then on,
01:30:29.440 | we were making money and I was profitable from then.
01:30:34.440 | For Optimal Decisions, it was a bit harder
01:30:37.640 | 'cause we were trying to sell something
01:30:40.040 | that was more like a $1 million sale.
01:30:42.160 | But what we did was we would sell scoping projects.
01:30:46.400 | So kind of like prototype-y projects,
01:30:50.560 | but rather than doing it for free,
01:30:51.720 | we would sell them for $50,000 to $100,000.
01:30:54.200 | So again, we were covering our costs
01:30:56.920 | and also making the client feel
01:30:58.320 | like we were doing something valuable.
01:31:00.200 | So in both cases, we were profitable
01:31:01.920 | from six months in.
01:31:04.800 | - Ah, nevertheless, it's scary.
01:31:08.160 | - I mean, yeah, sure.
01:31:10.000 | I mean, it's scary before you jump in.
01:31:13.280 | And I guess I was comparing it to the scarediness of VC.
01:31:18.120 | I felt like with VC stuff, it was more scary,
01:31:20.480 | kind of much more in somebody else's hands,
01:31:24.320 | will they fund you or not?
01:31:26.160 | And what do they think of what you're doing?
01:31:27.880 | I also found it very difficult with VC-backed startups
01:31:30.560 | to actually do the thing which I thought was important
01:31:34.240 | for the company rather than doing the thing
01:31:35.960 | which I thought would make the VC happy.
01:31:38.880 | Now, VCs always tell you not to do the thing
01:31:40.920 | that makes them happy.
01:31:42.400 | But then if you don't do the thing that makes them happy,
01:31:44.080 | they get sad, so.
01:31:45.360 | - And do you think optimizing for the,
01:31:48.120 | whatever they call it, the exit,
01:31:50.160 | is a good thing to optimize for?
01:31:53.080 | - I mean, it can be, but not at the VC level,
01:31:54.920 | 'cause the VC exit needs to be, you know, a thousand X.
01:31:59.560 | So, where else the lifestyle exit,
01:32:03.120 | if you can sell something for $10 million,
01:32:05.360 | you've made it, right?
01:32:06.400 | So, I don't, it depends.
01:32:09.160 | If you wanna build something that's gonna,
01:32:11.200 | you're kind of happy to do forever, then fine.
01:32:13.560 | If you wanna build something you wanna sell
01:32:15.720 | in three years time, that's fine too.
01:32:18.440 | I mean, they're both perfectly good outcomes.
01:32:21.280 | - So, you're learning Swift now, in a way.
01:32:24.880 | I mean, you already-- - Trying to.
01:32:26.760 | - And I read that you use, at least in some cases,
01:32:31.160 | spaced repetition as a mechanism for learning new things.
01:32:34.440 | I use Anki quite a lot myself.
01:32:36.400 | - Yeah, me too.
01:32:37.240 | - I actually never talked to anybody about it.
01:32:41.440 | Don't know how many people do it,
01:32:44.160 | but it works incredibly well for me.
01:32:46.760 | Can you talk through your experience?
01:32:47.960 | Like, how did you, what do you,
01:32:51.120 | first of all, okay, let's back it up.
01:32:53.120 | What is spaced repetition?
01:32:55.120 | - So, spaced repetition is an idea created
01:33:00.120 | by a psychologist named Ebbinghaus.
01:33:03.440 | I don't know, must be a couple of hundred years ago
01:33:06.120 | or something, 150 years ago.
01:33:08.040 | He did something which sounds pretty damn tedious.
01:33:10.720 | He wrote down random sequences of letters on cards
01:33:15.600 | and tested how well he would remember
01:33:18.840 | those random sequences a day later,
01:33:21.320 | a week later, whatever.
01:33:23.000 | He discovered that there was this kind of a curve
01:33:26.120 | where his probability of remembering one of them
01:33:28.800 | would be dramatically smaller the next day
01:33:30.640 | and then a little bit smaller the next day
01:33:31.960 | and a little bit smaller the next day.
01:33:33.520 | What he discovered is that if he revised those cards
01:33:36.880 | after a day, the probabilities would decrease
01:33:41.560 | at a smaller rate.
01:33:42.880 | And then if he revised them again a week later,
01:33:44.960 | they would decrease at a smaller rate again.
01:33:47.040 | And so he basically figured out a roughly optimal equation
01:33:51.800 | for when you should revise something you wanna remember.
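(A common way to formalize the curve described here: retention $R$ after time $t$ since the last review is modeled as exponential decay with a stability parameter $S$ that grows with each successful review. This exact functional form is a standard simplification associated with Ebbinghaus, not necessarily his original fit:

$$R(t) = e^{-t/S}, \qquad S_{n+1} = \alpha\, S_n \quad (\alpha > 1 \text{ after a successful review})$$

For example, with $S = 1$ day, a one-day gap leaves $R = e^{-1} \approx 0.37$; once a review grows $S$ to 3 days, the same one-day gap leaves $e^{-1/3} \approx 0.72$, which is the "decrease at a smaller rate" described here.)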
01:33:54.600 | So spaced repetition learning is using
01:33:58.640 | this simple algorithm, just something like
01:34:02.080 | revise something after a day and then three days
01:34:04.480 | and then a week and then three weeks and so forth.
01:34:07.680 | And so if you use a program like Anki, as you know,
01:34:10.640 | it will just do that for you.
01:34:12.080 | And it will say, did you remember this?
01:34:14.520 | And if you say no, it will reschedule it back
01:34:17.640 | to appear again like 10 times faster
01:34:20.280 | than it otherwise would have.
01:34:21.960 | It's a kind of a way of being guaranteed to learn something
01:34:27.880 | because by definition, if you're not learning it,
01:34:30.200 | it will be rescheduled to be revised more quickly.
01:34:32.680 | Unfortunately though, it also
01:34:36.080 | doesn't let you fool yourself.
01:34:37.440 | If you're not learning something,
01:34:39.480 | you know, your revisions will just pile up more and more.
01:34:44.040 | So you have to find ways to learn things productively
01:34:48.240 | and effectively like treat your brain well.
01:34:50.520 | So using like mnemonics and stories
01:34:52.920 | and context and stuff like that.
01:34:56.320 | So yeah, it's a super great technique.
01:34:59.720 | It's like learning how to learn is something which
01:35:02.560 | everybody should learn before they actually learn anything,
01:35:05.640 | but almost nobody does.
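(The schedule described above, revise after a day, then three days, then a week, then three weeks, and bring a forgotten card back roughly ten times sooner, can be sketched in a few lines of Python. This is a minimal illustration, not Anki's actual scheduler; the `Card`/`review` names and the `GROWTH`/`FAIL_SHRINK` factors are assumptions chosen to match the description:

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional

@dataclass
class Card:
    """One flashcard and its current review interval in days."""
    front: str
    back: str
    interval: float = 1.0  # first revision after about a day
    due: date = field(default_factory=date.today)

GROWTH = 3.0       # assumed: each success gives 1d -> 3d -> 9d -> 27d,
                   # roughly the "day, three days, week, three weeks" cadence
FAIL_SHRINK = 0.1  # assumed: a forgotten card comes back ~10x sooner

def review(card: Card, remembered: bool, today: Optional[date] = None) -> None:
    """Reschedule a card after a review, growing or shrinking its interval."""
    today = today or date.today()
    if remembered:
        card.interval *= GROWTH
    else:
        card.interval = max(1.0, card.interval * FAIL_SHRINK)
    card.due = today + timedelta(days=round(card.interval))

# usage: work through today's due cards and reschedule each one
deck = [Card("你好", "hello"), Card("谢谢", "thank you")]
for card in deck:
    if card.due <= date.today():
        review(card, remembered=True)
        print(card.front, "-> next due", card.due)
```

Anki itself uses a descendant of SuperMemo's SM-2 algorithm, which also tracks a per-card ease factor, but the interval-growth idea is the same.)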
01:35:07.920 | - So it certainly works well
01:35:10.120 | for learning new languages, for, I mean,
01:35:13.720 | for learning like small projects almost,
01:35:16.400 | but do you, you know, I started using it for,
01:35:19.800 | I forget who wrote the blog post about this that inspired me.
01:35:22.400 | It might've been you, I'm not sure.
01:35:25.520 | I started, when I read papers,
01:35:28.480 | taking concepts and ideas and putting them in.
01:35:31.880 | - Was it Michael Nielsen?
01:35:32.800 | - It was Michael Nielsen.
01:35:33.640 | - Yeah, so Michael started doing this recently
01:35:36.400 | and has been writing about it.
01:35:37.920 | So the kind of today's Ebbinghaus
01:35:43.200 | is a guy called Piotr Wozniak
01:35:45.040 | who developed a system called SuperMemo.
01:35:47.720 | And he's been basically trying to become like
01:35:50.040 | the world's greatest Renaissance man
01:35:54.040 | over the last few decades.
01:35:55.920 | He's basically lived his life with spaced repetition
01:36:00.040 | learning for everything.
01:36:02.080 | And sort of like,
01:36:05.800 | Michael's only very recently got into this,
01:36:07.440 | but he started really getting excited about doing it
01:36:09.520 | for a lot of different things.
01:36:11.160 | For me personally, I actually don't use it
01:36:14.600 | for anything except Chinese.
01:36:16.960 | And the reason for that is that Chinese
01:36:20.680 | is specifically a thing I made a conscious decision
01:36:23.080 | that I want to continue to remember,
01:36:26.680 | even if I don't get much of a chance to exercise it,
01:36:30.120 | 'cause like I'm not often in China, so I don't.
01:36:33.040 | Whereas for something like programming languages or papers,
01:36:38.320 | I have a very different approach,
01:36:39.640 | which is I try not to learn anything from them,
01:36:43.040 | but instead I try to identify the important concepts
01:36:47.080 | and like actually ingest them.
01:36:49.000 | So like really understand that concept deeply
01:36:53.640 | and study it carefully.
01:36:54.760 | I will decide if it really is important,
01:36:56.600 | and if it is, like, incorporate it into our library,
01:37:01.600 | incorporate it into how I do things,
01:37:04.200 | or decide it's not worth it.
01:37:06.760 | So I find I then remember the things that I care about
01:37:12.600 | because I'm using it all the time.
01:37:15.720 | So for the last 25 years,
01:37:20.160 | I've committed to spending at least half of every day
01:37:23.440 | learning or practicing something new,
01:37:28.760 | which all my colleagues have always hated
01:37:28.760 | because it always looks like I'm not working on
01:37:31.000 | what I'm meant to be working on,
01:37:32.000 | but it always means I do everything faster
01:37:34.560 | because I've been practicing a lot of stuff.
01:37:36.920 | So I kind of give myself a lot of opportunity
01:37:39.400 | to practice new things.
01:37:41.720 | And so I find now,
01:37:43.280 | yeah, I don't often kind of find myself
01:37:47.880 | wishing I could remember something
01:37:50.320 | 'cause if it's something that's useful,
01:37:51.440 | then I've been using it a lot.
01:37:53.880 | It's easy enough to look it up on Google,
01:37:56.160 | but speaking Chinese, you can't look it up on Google.
01:37:59.720 | - Do you have advice for people learning new things?
01:38:01.560 | So, what have you learned as a process?
01:38:04.840 | I mean, it all starts with just making the hours
01:38:07.640 | in the day available.
01:38:08.960 | - Yeah, you gotta stick with it,
01:38:10.160 | which is, again, the number one thing
01:38:12.040 | that 99% of people don't do.
01:38:13.680 | So the people I started learning Chinese with,
01:38:15.880 | none of them were still doing it 12 months later.
01:38:18.360 | I'm still doing it 10 years later.
01:38:20.400 | I tried to stay in touch with them,
01:38:21.920 | but they just, no one did it.
01:38:23.600 | For something like Chinese,
01:38:26.240 | like study how human learning works.
01:38:28.520 | So every one of my Chinese flashcards
01:38:31.240 | is associated with a story,
01:38:33.760 | and that story is specifically designed to be memorable.
01:38:36.720 | And we find things memorable,
01:38:37.840 | which are like funny or disgusting or sexy
01:38:41.360 | or related to people that we know or care about.
01:38:44.240 | So I try to make sure all the stories that are in my head
01:38:47.320 | have those characteristics.
01:38:49.120 | Yeah, so you have to, you know,
01:38:52.160 | you won't remember things well
01:38:53.240 | if they don't have some context.
01:38:56.040 | And yeah, you won't remember them well
01:38:57.280 | if you don't regularly practice them,
01:39:00.640 | whether it be just part of your day-to-day life
01:39:02.480 | or, for Chinese, the flashcards.
01:39:06.080 | I mean, the other thing is,
01:39:07.800 | let yourself fail sometimes.
01:39:09.520 | So like I've had various medical problems
01:39:11.840 | over the last few years,
01:39:13.040 | and basically my flashcards just stopped
01:39:17.040 | for about three years.
01:39:18.640 | And then there've been other times
01:39:21.480 | I've stopped for a few months,
01:39:22.600 | and it's so hard because you get back to it,
01:39:24.200 | and it's like, you have 18,000 cards due.
01:39:27.400 | And so you just have to go,
01:39:30.480 | all right, well, I can either stop and give up everything
01:39:34.120 | or just decide to do this every day
01:39:36.560 | for the next two years until I get back to it.
01:39:39.000 | The amazing thing has been that even after three years,
01:39:41.720 | you know, the Chinese was still in there.
01:39:45.880 | Like it was so much faster to relearn
01:39:48.440 | than it was to learn the first time.
01:39:50.080 | - Yeah, absolutely.
01:39:52.280 | It's in there.
01:39:53.120 | I have the same with guitar, with music and so on.
01:39:56.520 | It's sad because the work sometimes takes away,
01:39:59.120 | and then you won't play for a year.
01:40:01.160 | But really, if you then just get back to it every day,
01:40:03.520 | you're right there again.
01:40:06.000 | What do you think is the next big breakthrough
01:40:08.400 | in artificial intelligence?
01:40:09.400 | What are your hopes in deep learning or beyond
01:40:12.720 | that people should be working on,
01:40:14.120 | or you hope there'll be breakthroughs?
01:40:16.280 | - I don't think it's possible to predict.
01:40:17.960 | I think what we already have
01:40:20.600 | is an incredibly powerful platform
01:40:23.680 | to solve lots of societally important problems
01:40:26.520 | that are currently unsolved.
01:40:27.600 | So I just hope that lots of people
01:40:30.440 | will learn this toolkit and try to use it.
01:40:33.360 | I don't think we need a lot of new technological breakthroughs
01:40:36.800 | to do a lot of great work right now.
01:40:38.600 | - And when do you think we're going to create
01:40:42.760 | a human level intelligence system?
01:40:45.160 | Do you think- - Don't know.
01:40:46.480 | - How hard is it?
01:40:47.440 | How far away are we?
01:40:48.720 | - Don't know.
01:40:49.560 | - Don't know. - I have no way to know.
01:40:50.760 | I don't know.
01:40:51.760 | I don't know why people make predictions about this
01:40:53.840 | 'cause there's no data and nothing to go on.
01:40:57.480 | And it's just like,
01:41:00.360 | there's so many societally important problems
01:41:03.520 | to solve right now.
01:41:04.440 | I just don't find it a really interesting question
01:41:08.720 | to even answer.
01:41:10.280 | - So in terms of societally important problems,
01:41:13.000 | what's the problem that is within reach?
01:41:16.400 | - Well, I mean, for example,
01:41:17.480 | there are problems that AI creates, right?
01:41:19.800 | So more specifically,
01:41:21.320 | labor force displacement is going to be huge
01:41:26.840 | and people keep making this frivolous econometric argument
01:41:30.920 | of being like, oh, there's been other things that aren't AI
01:41:33.960 | that have come along before
01:41:34.960 | and haven't created massive labor force displacement,
01:41:37.800 | therefore AI won't.
01:41:39.920 | - So that's a serious concern for you?
01:41:41.600 | - Oh, yeah. - Andrew Yang is running on it.
01:41:43.680 | - Yeah, I'm desperately concerned.
01:41:47.360 | And you see already that the changing workplace
01:41:52.360 | has led to a hollowing out of the middle class.
01:41:55.760 | You're seeing that students coming out of school today
01:41:59.040 | have a less rosy financial future ahead of them
01:42:03.200 | than their parents did,
01:42:04.040 | which has never happened in the last few hundred years.
01:42:09.040 | We've always had progress before.
01:42:10.960 | And you see this turning into anxiety and despair
01:42:16.320 | and even violence.
01:42:19.480 | So I very much worry about that.
01:42:23.440 | - You've written quite a bit about ethics too.
01:42:25.760 | - I do think that every data scientist
01:42:29.640 | working with deep learning needs to recognize
01:42:33.960 | they have an incredibly high leverage tool
01:42:35.640 | that they're using that can influence society
01:42:38.000 | in lots of ways.
01:42:39.040 | And if they're doing research,
01:42:40.320 | that that research is gonna be used by people
01:42:42.760 | doing this kind of work.
01:42:44.440 | And they have a responsibility to consider the consequences
01:42:48.400 | and to think about things like
01:42:51.800 | how will humans be in the loop here?
01:42:53.920 | How do we avoid runaway feedback loops?
01:42:56.520 | How do we ensure an appeals process for humans
01:42:59.200 | that are impacted by my algorithm?
01:43:01.720 | How do I ensure that the constraints of my algorithm
01:43:04.960 | are adequately explained to the people
01:43:06.720 | that end up using them?
01:43:09.160 | There's all kinds of human issues
01:43:11.880 | which only data scientists are actually in the right place
01:43:16.280 | to educate people about.
01:43:17.960 | But data scientists tend to think of themselves
01:43:20.280 | as just engineers and that they don't need
01:43:23.400 | to be part of that process.
01:43:24.520 | - For now.
01:43:25.360 | - Yeah, which is wrong.
01:43:26.680 | - Well, you're in a perfect position to educate them better,
01:43:30.280 | to read literature, to read history, to learn from history.
01:43:33.760 | Well, Jeremy, thank you so much for everything you do,
01:43:39.080 | for inspiring a huge amount of people,
01:43:41.320 | getting them into deep learning
01:43:42.480 | and having the ripple effects,
01:43:45.080 | the flap of a butterfly's wings
01:43:47.440 | that will probably change the world.
01:43:48.640 | So thank you very much.
01:43:50.080 | - Cheers.
01:43:50.920 | (upbeat music)