
Jeremy Howard: fast.ai Deep Learning Courses and Research | Lex Fridman Podcast #35


Chapters

0:01 Jeremy Howard
1:17 What's the First Program You've Ever Written
3:09 Programming Languages
4:36 The Connection between Excel and Access
9:24 Array-Oriented Languages
23:36 The Origin Story of fast.ai
40:57 The Difference between Theory and Practice of Deep Learning
41:51 Transfer Learning
59:28 Super Convergence
62:08 The Future of Learning Rate Magic
66:16 Different Cloud Options for Training
69:13 Deep Learning Frameworks
92:52 What Is Spaced Repetition
93:56 Spaced Repetition Learning
97:59 Advice for People Learning New Things
100:06 Next Big Breakthrough in Artificial Intelligence

Whisper Transcript

00:00:00.000 | The following is a conversation with Jeremy Howard.
00:00:03.160 | He's the founder of fast.ai, a research institute
00:00:06.480 | dedicated to making deep learning more accessible.
00:00:09.760 | He's also a distinguished research scientist
00:00:12.600 | at the University of San Francisco,
00:00:14.640 | a former president of Kaggle,
00:00:16.680 | as well as a top-ranking competitor there.
00:00:18.800 | And in general, he's a successful entrepreneur,
00:00:21.720 | educator, researcher, and an inspiring personality
00:00:25.240 | in the AI community.
00:00:27.040 | When someone asks me, how do I get started with deep learning?
00:00:30.240 | fast.ai is one of the top places I point them to.
00:00:33.360 | It's free, it's easy to get started,
00:00:35.560 | it's insightful and accessible.
00:00:37.640 | And if I may say so, it has very little BS,
00:00:41.000 | which can sometimes dilute the value of educational content
00:00:44.160 | on popular topics like deep learning.
00:00:46.760 | fast.ai has a focus on practical application of deep learning
00:00:50.320 | and hands-on exploration of the cutting edge
00:00:52.840 | that is both incredibly accessible to beginners
00:00:56.040 | and useful to experts.
00:00:58.000 | This is the Artificial Intelligence Podcast.
00:01:01.400 | If you enjoy it, subscribe on YouTube,
00:01:03.840 | give it five stars on iTunes, support it on Patreon,
00:01:07.000 | or simply connect with me on Twitter,
00:01:09.080 | @lexfridman, spelled F-R-I-D-M-A-N.
00:01:13.360 | And now, here's my conversation with Jeremy Howard.
00:01:17.600 | What's the first program you've ever written?
00:01:20.720 | - First program I wrote that I remember
00:01:24.840 | would be at high school.
00:01:26.720 | I did an assignment where I decided to try to find out
00:01:33.600 | if there were some better musical scales
00:01:36.280 | than the normal 12-tone, 12-interval scale.
00:01:40.640 | So I wrote a program on my Commodore 64 in BASIC
00:01:43.680 | that searched through other scale sizes
00:01:46.080 | to see if it could find one where
00:01:48.280 | there were more accurate harmonies.
00:01:51.920 | - Like meantone?
00:01:53.560 | - Like you want an actual exactly three to two ratio,
00:01:56.560 | whereas with a 12-interval scale,
00:01:59.440 | it's not exactly three to two, for example.
00:02:01.520 | So that's well-tempered, as they say in the--
00:02:05.080 | - In BASIC on a Commodore 64.
00:02:07.160 | - Yeah.
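
A minimal sketch (in Python rather than Commodore 64 BASIC) of the kind of search Jeremy describes: for each equal-tempered scale with n intervals per octave, measure how closely its best step approximates a pure 3:2 fifth. In 12-tone equal temperament the closest step is 2^(7/12) ≈ 1.4983, slightly flat of 3:2.

```python
def best_fifth_error(n):
    """Smallest relative error to a pure 3:2 ratio among the steps
    of an n-interval equal-tempered scale."""
    target = 3 / 2
    return min(abs(2 ** (k / n) / target - 1) for k in range(1, n + 1))

# Scan candidate scale sizes; 12 does well, but e.g. 53 does far better.
for n in range(5, 54):
    print(f"{n:2d} intervals: best fifth off by {best_fifth_error(n):.4%}")
```
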
00:02:08.000 | - Where was the interest in music from?
00:02:09.480 | Or is it just--
00:02:10.480 | - I did music all my life, so I played saxophone
00:02:14.080 | and clarinet and piano and guitar and drums and whatever, so.
00:02:18.120 | - How does that thread go through your life?
00:02:22.160 | Where's music today?
00:02:24.200 | - It's not where I wish it was.
00:02:26.160 | For various reasons, couldn't really keep it going,
00:02:30.200 | particularly 'cause I had a lot of problems with RSI,
00:02:32.600 | with my fingers, and so I had to kind of like
00:02:34.760 | cut back anything that used hands and fingers.
00:02:38.320 | I hope one day I'll be able to get back to it health-wise.
00:02:43.920 | - So there's a love for music underlying it all?
00:02:46.080 | - For sure, yeah.
00:02:46.920 | - What's your favorite instrument?
00:02:49.520 | - Saxophone.
00:02:50.360 | - Sax.
00:02:51.200 | - It's a baritone saxophone.
00:02:52.880 | Well, probably bass saxophone, but they're awkward.
00:02:55.640 | - Well, I always love it when music
00:03:00.040 | is coupled with programming.
00:03:01.720 | There's something about a brain that utilizes those
00:03:04.680 | that emerges with creative ideas.
00:03:07.560 | So you've used and studied quite a few programming languages.
00:03:11.240 | Can you give an overview of what you've used?
00:03:15.160 | What are the pros and cons of each?
00:03:17.880 | - Well, my favorite programming environment,
00:03:21.120 | almost certainly, was Microsoft Access
00:03:24.600 | back in the earliest days.
00:03:26.480 | So that was Visual Basic for Applications,
00:03:28.920 | which is not a good programming language,
00:03:30.720 | but the programming environment was fantastic.
00:03:33.080 | It's like the ability to create user interfaces
00:03:38.080 | and tie data and actions to them and create reports
00:03:42.520 | and all that, I've never seen anything as good.
00:03:46.800 | There's things nowadays like Airtable,
00:03:48.600 | which are like small subsets of that,
00:03:53.600 | which people love for good reason,
00:03:56.160 | but unfortunately nobody's ever achieved anything like that.
00:04:01.120 | - What is that?
00:04:01.960 | If you could pause on that for a second.
00:04:03.280 | - Oh, Access?
00:04:04.120 | - Access is a database.
00:04:06.280 | - It was a database program that Microsoft produced,
00:04:09.640 | part of Office, and it kind of withered, you know,
00:04:13.440 | but basically it lets you in a totally graphical way
00:04:16.280 | create tables and relationships and queries
00:04:18.480 | and tie them to forms and set up, you know,
00:04:21.800 | event handlers and calculations.
00:04:24.720 | And it was a very complete, powerful system
00:04:28.160 | designed for not massive scalable things,
00:04:31.480 | but for like useful little applications that I loved.
00:04:36.360 | - So what's the connection between Excel and Access?
00:04:40.240 | - So very close.
00:04:42.120 | So Access kind of was the relational database equivalent,
00:04:47.680 | if you like.
00:04:48.520 | So people still do a lot of that stuff
00:04:51.080 | that should be in Access in Excel,
00:04:52.880 | because they know it.
00:04:54.120 | Excel's great as well.
00:04:55.360 | So, but it's just not as rich a programming model
00:05:00.200 | as VBA combined with a relational database.
00:05:04.640 | And so I've always loved relational databases,
00:05:07.320 | but today programming on top of a relational database
00:05:11.000 | is just a lot more of a headache.
00:05:13.520 | You know, you generally either need to kind of,
00:05:16.200 | you know, you need something that connects,
00:05:17.920 | that runs some kind of database server,
00:05:19.920 | unless you use SQLite, which has its own issues.
00:05:23.920 | Then you kind of often,
00:05:25.920 | if you want to get a nice programming model,
00:05:27.600 | you'll need to like create an, add an ORM on top.
00:05:30.400 | And then, I don't know,
00:05:31.960 | there's all these pieces to tie together,
00:05:34.360 | and it's just a lot more awkward than it should be.
00:05:36.960 | There are people that are trying to make it easier.
00:05:39.200 | So in particular, I think of F#, you know, Don Syme,
00:05:42.400 | who, with his team, has done a great job
00:05:45.760 | of making something like a database appear
00:05:50.480 | in the type system.
00:05:51.600 | So you actually get like tab completion for fields
00:05:54.200 | and tables and stuff like that.
00:05:56.240 | Anyway, so that was kind of, anyway,
00:05:59.280 | so like that whole VBA office thing,
00:06:01.480 | I guess was a starting point, which I still miss.
00:06:04.600 | And I got into standard Visual Basic, which-
00:06:07.800 | - That's interesting just to pause on that for a second.
00:06:09.880 | It's interesting that you're connecting programming languages
00:06:13.480 | to the ease of management of data.
00:06:17.400 | - Yeah.
00:06:18.240 | - So in your use of programming languages,
00:06:20.560 | you always had a love and a connection with data.
00:06:24.840 | - I've always been interested in doing useful things
00:06:27.960 | for myself and for others,
00:06:29.440 | which generally means getting some data
00:06:31.840 | and doing something with it and putting it out there again.
00:06:34.520 | So that's been my interest throughout.
00:06:38.360 | So I also did a lot of stuff with AppleScript
00:06:41.520 | back in the early days.
00:06:42.960 | So it's kind of nice being able to get the computer
00:06:47.920 | and computers to talk to each other
00:06:50.080 | and to do things for you.
00:06:51.680 | And then I think that one,
00:06:54.560 | the programming language I most loved
00:06:57.840 | then would have been Delphi, which was Object Pascal,
00:07:01.760 | created by Anders Hejlsberg,
00:07:04.800 | who previously did Turbo Pascal
00:07:07.400 | and then went on to create .NET
00:07:08.800 | and then went on to create TypeScript.
00:07:11.040 | Delphi was amazing 'cause it was like a compiled,
00:07:14.840 | fast language that was as easy to use as Visual Basic.
00:07:19.840 | - Delphi, what is it similar to in more modern languages?
00:07:25.160 | - Visual Basic.
00:07:28.840 | - Visual Basic.
00:07:29.680 | - Yeah, but a compiled fast version.
00:07:32.280 | So I'm not sure there's anything quite like it anymore.
00:07:37.040 | If you took like C# or Java
00:07:40.600 | and got rid of the virtual machine
00:07:42.440 | and replaced it with something,
00:07:43.400 | you could compile a small, tight binary.
00:07:46.520 | I feel like it's where Swift could get to
00:07:50.680 | with the new Swift UI
00:07:52.600 | and the cross-platform development going on.
00:07:56.440 | Like that's one of my dreams
00:07:59.320 | is that we'll hopefully get back to where Delphi was.
00:08:02.800 | There is actually a free Pascal project nowadays
00:08:07.800 | called Lazarus,
00:08:09.320 | which is also attempting to kind of recreate Delphi.
00:08:13.360 | So they're making good progress.
00:08:16.040 | - So, okay, Delphi,
00:08:18.520 | that's one of your favorite programming languages.
00:08:20.920 | - Or at least programming environments.
00:08:22.320 | Again, I'd say Pascal's not a nice language.
00:08:26.240 | If you wanted to know specifically
00:08:27.840 | about what languages I like,
00:08:29.600 | I would definitely pick J
00:08:31.640 | as being an amazingly wonderful language.
00:08:34.480 | - What's J?
00:08:37.040 | - J, are you aware of APL?
00:08:39.600 | - I am not.
00:08:40.440 | - Okay, so. - Except from doing
00:08:41.440 | a little research on the work you've done.
00:08:44.040 | - Okay, so not at all surprising
00:08:47.120 | you're not familiar with it
00:08:47.960 | 'cause it's not well known,
00:08:49.000 | but it's actually one of the main
00:08:51.600 | families of programming languages
00:08:55.920 | going back to the late '50s, early '60s.
00:08:57.880 | So there was a couple of major directions.
00:09:01.640 | One was the kind of Lambda calculus,
00:09:04.400 | Alonzo Church direction,
00:09:06.120 | which I guess kind of Lisp and Scheme and whatever,
00:09:09.920 | which has a history going back
00:09:12.240 | to the early days of computing.
00:09:13.360 | The second was the kind of imperative slash OO,
00:09:18.360 | algo, similar, going under C, C++, so forth.
00:09:23.120 | There was a third,
00:09:23.960 | which are called array-oriented languages,
00:09:26.880 | which started with a paper by a guy called Ken Iverson,
00:09:31.480 | which was actually a math theory paper,
00:09:35.160 | not a programming paper.
00:09:37.480 | It was called "Notation as a Tool of Thought."
00:09:41.480 | And it was the development of a new type of math notation.
00:09:45.280 | And the idea is that this math notation
00:09:47.520 | was much more flexible, expressive,
00:09:51.320 | and also well-defined than traditional math notation,
00:09:55.240 | which is none of those things.
00:09:56.400 | Math notation is awful.
00:09:57.680 | And so he actually turned that into a programming language.
00:10:02.800 | 'Cause this was the early '50s,
00:10:04.120 | or the, sorry, late '50s, all the names were available.
00:10:06.720 | So he called his language a programming language, or APL.
00:10:10.520 | - APL, wow.
00:10:11.360 | - So APL is an implementation of notation
00:10:15.320 | as a tool of thought, by which he means math notation.
00:10:18.280 | And Ken and his son went on to do many things,
00:10:22.840 | but eventually they actually produced
00:10:25.760 | a new language that was built
00:10:27.040 | on top of all the learnings of APL,
00:10:28.440 | and that was called J.
00:10:29.600 | And J is the most expressive, composable language of,
00:10:35.560 | beautifully designed language I've ever seen.
00:10:42.400 | - Does it have object-oriented components?
00:10:44.520 | Does it have that kind of thing, or is it more like--
00:10:45.360 | - Not really, it's an array-oriented language.
00:10:47.680 | It's a new, it's the third path.
00:10:51.400 | - Are you saying array?
00:10:52.760 | - Array-oriented, yeah.
00:10:53.920 | - What does it mean to be array-oriented?
00:10:55.520 | - So array-oriented means that you generally
00:10:57.560 | don't use any loops, but the whole thing is done
00:11:01.000 | with kind of an extreme version of broadcasting,
00:11:06.000 | if you're familiar with that NumPy/Python concept.
00:11:09.960 | So you do a lot with one line of code.
00:11:14.320 | It looks a lot like math notation.
00:11:18.160 | - So it's basically--
00:11:19.000 | - Highly compact.
00:11:20.400 | And the idea is that you can kind of,
00:11:22.920 | because you can do so much with one line of code,
00:11:24.800 | a single screen of code is usually enough;
00:11:27.760 | you very rarely need more than that to express your program.
00:11:31.120 | And so you can kind of keep it all in your head,
00:11:33.320 | and you can kind of clearly communicate it.
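
A small illustration of the broadcasting idea in the NumPy sense Jeremy references (illustrative data): the loop disappears and the whole operation reads as one expression. The J idiom `(+/ % #)`, sum divided by count, expresses a mean in the same loop-free spirit.

```python
import numpy as np

x = np.random.rand(1000, 10)            # illustrative data

# Loop version: center each row on its mean, one row at a time.
out = np.empty_like(x)
for i in range(x.shape[0]):
    out[i] = x[i] - x[i].mean()

# Array-oriented version: one line, no explicit loop. The (1000, 1)
# column of row means is broadcast across all 10 columns.
out2 = x - x.mean(axis=1, keepdims=True)

assert np.allclose(out, out2)
```
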
00:11:36.080 | It's interesting, APL created two main branches,
00:11:39.960 | K and J.
00:11:41.640 | J is this kind of like open-source niche community
00:11:46.000 | of crazy enthusiasts like me.
00:11:49.440 | And then the other path, K, is fascinating.
00:11:52.160 | It's an astonishingly expensive programming language,
00:11:56.640 | which many of the world's most
00:11:59.720 | ludicrously rich hedge funds use.
00:12:02.920 | So the entire K machine is so small,
00:12:06.680 | it sits inside level three cache on your CPU,
00:12:09.360 | and it easily wins every benchmark I've ever seen
00:12:14.120 | in terms of data processing speed.
00:12:16.760 | But you don't come across it very much,
00:12:17.920 | it's like $100,000 per CPU to run it.
00:12:22.720 | But it's like this path of programming languages
00:12:26.280 | is just so much, I don't know,
00:12:28.920 | so much more powerful in every way
00:12:30.360 | than the ones that almost anybody uses every day.
00:12:33.920 | - So it's all about computation.
00:12:37.520 | It's really focusing on computation.
00:12:38.360 | - It's pretty heavily focused on computation.
00:12:40.600 | I mean, so much of programming
00:12:43.200 | is data processing by definition.
00:12:45.640 | So there's a lot of things you can do with it.
00:12:48.920 | But yeah, there's not much work being done
00:12:51.400 | on making like user interface toolkits or whatever.
00:12:56.400 | I mean, there's some, but they're not great.
00:12:59.280 | - At the same time, you've done a lot of stuff
00:13:00.840 | with Perl and Python.
00:13:02.440 | - Yeah.
00:13:03.280 | - So where does that fit into the picture
00:13:04.720 | of J and K and APL and--
00:13:08.760 | - Well, it's just much more pragmatic.
00:13:10.960 | Like in the end, you kind of have to end up
00:13:13.840 | where the libraries are,
00:13:17.880 | 'cause to me, my focus is on productivity.
00:13:21.200 | I just wanna get stuff done and solve problems.
00:13:23.640 | So Perl was great.
00:13:27.240 | I created an email company called Fastmail
00:13:29.640 | and Perl was great 'cause back in the late '90s,
00:13:32.800 | early 2000s, it just had a lot of stuff it could do.
00:13:37.800 | I still had to write my own monitoring system
00:13:41.720 | and my own web framework, my own whatever,
00:13:43.800 | 'cause like none of that stuff existed,
00:13:45.720 | but it was a super flexible language to do that in.
00:13:50.240 | - And you used Perl for Fastmail, you used it as a backend?
00:13:54.240 | Like, so everything was written in Perl?
00:13:55.760 | - Yeah, yeah, everything was Perl.
00:13:58.720 | - Why do you think Perl hasn't succeeded
00:14:02.920 | or hasn't dominated the market
00:14:04.840 | where Python really takes over a lot of the same tasks?
00:14:07.560 | - Well, I mean, Perl did dominate.
00:14:09.600 | It was-- - For a time.
00:14:10.760 | - Everything, everywhere, but then the guy
00:14:14.920 | that ran Perl, Larry Wall,
00:14:17.240 | kind of just didn't put the time in anymore.
00:14:22.240 | And no project can be successful if there isn't,
00:14:27.320 | you know, particularly one that started
00:14:30.560 | with a strong leader that loses that strong leadership.
00:14:35.080 | So then Python has kind of replaced it.
00:14:37.880 | You know, Python is a lot less elegant language
00:14:42.880 | in nearly every way, but it has the data science libraries
00:14:48.440 | and a lot of them are pretty great.
00:14:51.320 | So I kind of use it 'cause it's the best we have,
00:14:56.320 | but it's definitely not good enough.
00:15:01.840 | - But what do you think the future of programming looks like?
00:15:04.080 | What do you hope the future of programming looks like
00:15:06.600 | if we zoom in on the computational fields,
00:15:08.800 | on data science, on machine learning?
00:15:11.880 | - I hope Swift is successful because the goal of Swift,
00:15:16.880 | the way Chris Lattner describes it,
00:15:21.040 | is to be infinitely hackable, and that's what I want.
00:15:23.360 | I want something where me and the people I do research with
00:15:26.960 | and my students can look at and change everything
00:15:30.400 | from top to bottom.
00:15:32.040 | There's nothing mysterious and magical and inaccessible.
00:15:36.240 | Unfortunately with Python, it's the opposite of that
00:15:38.600 | because Python's so slow, it's extremely unhackable.
00:15:42.680 | You get to a point where it's like,
00:15:43.840 | okay, from here on down, it's C.
00:15:45.360 | So your debugger doesn't work in the same way,
00:15:47.320 | your profiler doesn't work in the same way,
00:15:48.960 | your build system doesn't work in the same way.
00:15:50.800 | It's really not very hackable at all.
00:15:53.760 | - What's the part you like to be hackable?
00:15:55.640 | Is it for the objective of optimizing training
00:16:00.160 | of neural networks, inference of neural networks?
00:16:02.600 | Is it performance of the system
00:16:04.360 | or is there some non-performance related just--
00:16:07.880 | - It's everything.
00:16:09.040 | I mean, in the end, I wanna be productive as a practitioner.
00:16:13.880 | So that means that, so like at the moment,
00:16:16.320 | our understanding of deep learning is incredibly primitive.
00:16:20.040 | There's very little we understand.
00:16:21.480 | Most things don't work very well,
00:16:23.240 | even though it works better than anything else out there.
00:16:26.160 | There's so many opportunities to make it better.
00:16:28.640 | So you look at any domain area, like, I don't know,
00:16:32.800 | speech recognition with deep learning
00:16:35.680 | or natural language processing classification
00:16:38.360 | with deep learning or whatever.
00:16:39.400 | Every time I look at an area with deep learning,
00:16:41.880 | I always see like, oh, it's terrible.
00:16:44.440 | There's lots and lots of obviously stupid ways to do things
00:16:48.240 | that need to be fixed.
00:16:50.160 | So then I wanna be able to jump in there
00:16:51.600 | and quickly experiment and make them better.
00:16:54.840 | - You think the programming language has a role in that?
00:16:59.240 | - Huge role, yeah.
00:17:00.280 | So currently Python has a big gap
00:17:05.280 | in terms of our ability to innovate,
00:17:09.280 | particularly around recurrent neural networks
00:17:11.840 | and natural language processing,
00:17:14.920 | because it's so slow.
00:17:16.840 | The actual loop where we actually loop through words,
00:17:20.200 | we have to do that whole thing in CUDA C.
00:17:23.760 | So we actually can't innovate with the kernel,
00:17:27.120 | the heart of that most important algorithm.
00:17:30.200 | And it's just a huge problem.
00:17:33.640 | And this happens all over the place.
00:17:36.440 | So we hit research limitations.
00:17:40.080 | Another example, convolutional neural networks,
00:17:42.640 | which are actually the most popular architecture
00:17:44.720 | for lots of things, maybe most things in deep learning.
00:17:48.920 | We almost certainly should be using
00:17:50.320 | sparse convolutional neural networks,
00:17:52.920 | but only like two people are,
00:17:55.400 | because to do it, you have to rewrite
00:17:57.840 | all of that CUDA C level stuff.
00:17:59.920 | And yeah, just researchers and practitioners don't.
00:18:04.520 | So like there's just big gaps
00:18:06.040 | in like what people actually research on,
00:18:09.240 | what people actually implement
00:18:10.520 | because of the programming language problem.
00:18:13.240 | - So you think it's just too difficult to write in CUDA C
00:18:18.240 | that a programming, like a higher level programming language
00:18:23.440 | like Swift should enable the easier,
00:18:28.440 | fooling around creative stuff with RNNs
00:18:33.120 | or with sparse convolutional neural networks?
00:18:34.920 | - Kind of.
00:18:35.760 | - Who's at fault?
00:18:37.760 | Who's at charge of making it easy
00:18:41.040 | for a researcher to play around?
00:18:42.320 | - I mean, no one's at fault.
00:18:43.520 | It's just nobody's got around to it yet.
00:18:45.080 | Or it's just, it's hard, right?
00:18:47.040 | And I mean, part of the fault
00:18:48.440 | is that we ignored that whole APL kind of direction,
00:18:52.640 | almost nearly everybody did for 60 years, 50 years.
00:18:56.360 | But recently people have been starting
00:18:59.880 | to reinvent pieces of that
00:19:03.560 | and kind of create some interesting new directions
00:19:05.440 | in the compiler technology.
00:19:07.280 | So the place where that's particularly happening right now
00:19:11.720 | is something called MLIR,
00:19:13.520 | which is something that again,
00:19:14.920 | Chris Lattner, the Swift guy, is leading.
00:19:18.040 | And yeah, 'cause it's actually not gonna be Swift
00:19:20.600 | on its own that solves this problem
00:19:22.120 | because the problem is that currently writing
00:19:24.960 | an acceptably fast GPU program
00:19:29.960 | is too complicated regardless of what language you use.
00:19:33.800 | And that's just because if you have to deal with the fact
00:19:38.640 | that I've got 10,000 threads
00:19:41.680 | and I have to synchronize between them all
00:19:43.440 | and I have to put my thing into grid blocks
00:19:45.320 | and think about warps and all this stuff,
00:19:47.000 | it's just so much boilerplate that to do that well,
00:19:50.680 | you have to be a specialist at that
00:19:52.160 | and it's gonna be a year's work to optimize
00:19:56.960 | that algorithm in that way.
00:19:59.640 | But with things like tensor comprehensions
00:20:03.520 | and tile and MLIR and TVM,
00:20:07.120 | there's all these various projects
00:20:08.640 | which are all about saying,
00:20:10.000 | let's let people create like domain specific languages
00:20:14.000 | for tensor computations.
00:20:16.840 | These are the kinds of things we do generally on the GPU
00:20:20.080 | for deep learning and then have a compiler
00:20:22.800 | which can optimize that tensor computation.
00:20:27.800 | A lot of this work is actually sitting on top
00:20:30.120 | of a project called Halide,
00:20:32.600 | which is a mind blowing project
00:20:35.960 | where they came up with such a domain specific language.
00:20:38.800 | In fact, two, one domain specific language for expressing
00:20:41.160 | this is what my tensor computation is.
00:20:43.760 | And another domain specific language for expressing
00:20:46.280 | this is the kind of the way I want you to structure
00:20:50.280 | the compilation of that and like do it block by block
00:20:53.040 | and do these bits in parallel.
00:20:54.920 | And they were able to show how you can compress
00:20:57.720 | the amount of code by 10X compared to optimized GPU code
00:21:02.720 | and get the same performance.
00:21:05.520 | So that's like, so these are the things
00:21:07.560 | that kind of sitting on top of that kind of research
00:21:10.520 | and MLIR is pulling a lot of those best practices together.
00:21:15.120 | And now we're starting to see work done on making
00:21:18.040 | all of that directly accessible through Swift
00:21:21.360 | so that I could use Swift to kind of write
00:21:23.480 | those domain specific languages.
00:21:25.880 | And hopefully we'll get then Swift CUDA kernels
00:21:29.480 | written in a very expressive and concise way
00:21:31.520 | that looks a bit like J and APL,
00:21:34.160 | and then Swift layers on top of that
00:21:36.680 | and then a Swift UI on top of that.
00:21:38.360 | And, you know, that'll be so nice
00:21:41.320 | if we can get to that point.
00:21:42.600 | - Now, does it all eventually boil down
00:21:45.000 | to CUDA and NVIDIA GPUs?
00:21:48.560 | - Unfortunately at the moment it does,
00:21:50.160 | but one of the nice things about MLIR
00:21:52.640 | if AMD ever gets their act together,
00:21:55.400 | which they probably won't,
00:21:56.760 | is that they or others could write MLIR backends
00:22:01.760 | for other GPUs or other tensor computation devices
00:22:07.120 | of which today there are an increasing number,
00:22:11.640 | like Graphcore or Vertex AI or whatever.
00:22:16.640 | So yeah, being able to target lots of backends
00:22:22.600 | would be another benefit of this.
00:22:23.960 | And the market really needs competition
00:22:26.720 | 'cause at the moment NVIDIA is massively overcharging
00:22:29.520 | for their kind of enterprise class cards
00:22:33.680 | because there is no serious competition
00:22:36.760 | 'cause nobody else is doing the software properly.
00:22:39.320 | - In the cloud there is some competition, right?
00:22:42.920 | - Not really, other than TPUs perhaps.
00:22:45.080 | But TPUs are almost unprogrammable at the moment.
00:22:48.240 | - So you can't, the TPU has the same problem that you can't-
00:22:51.200 | - It's even worse.
00:22:52.040 | So TPUs, Google actually made an explicit decision
00:22:54.840 | to make them almost entirely unprogrammable
00:22:57.240 | because they felt that there was too much IP in there.
00:23:00.000 | And if they gave people direct access to program them,
00:23:02.680 | people would learn their secrets.
00:23:04.360 | So you can't actually directly program the memory in a TPU.
00:23:09.960 | You can't even directly create code that runs on,
00:23:13.960 | and that you can look at, on the machine that has the TPU.
00:23:16.600 | It all goes through a virtual machine.
00:23:18.520 | So all you can really do is this kind of cookie cutter thing
00:23:21.680 | of like plug-in high-level stuff together,
00:23:25.320 | which is just super tedious and annoying
00:23:29.280 | and totally unnecessary.
00:23:31.520 | - So what was the, tell me if you could,
00:23:34.480 | the origin story of fast.ai?
00:23:36.520 | - fast.ai?
00:23:37.360 | - The origin story of fast.ai.
00:23:39.080 | What is the motivation, its mission, its dream?
00:23:44.400 | - So I guess the founding story is heavily tied
00:23:50.240 | to my previous startup, which is a company called Enlitic,
00:23:53.560 | which was the first company to focus on deep learning
00:23:56.920 | for medicine.
00:23:58.240 | And I created that because I saw there was a huge
00:24:02.240 | opportunity to, there's about a 10X shortage
00:24:06.880 | of the number of doctors in the world,
00:24:08.520 | in the developing world that we need.
00:24:10.320 | It was expected it would take about 300 years
00:24:13.800 | to train enough doctors to meet that gap.
00:24:16.080 | But I guessed that maybe if we used deep learning
00:24:21.080 | for some of the analytics, we could maybe make it
00:24:24.960 | so you don't need as highly trained doctors.
00:24:27.400 | - For diagnosis?
00:24:28.320 | - For diagnosis and treatment planning.
00:24:29.800 | - Where's the biggest benefit, just before we get
00:24:32.520 | to fast AI, where's the biggest benefit of AI in medicine
00:24:36.640 | that you see today?
00:24:37.960 | - Not much happening today in terms of like stuff
00:24:41.480 | that's actually out there, it's very early,
00:24:43.080 | but in terms of the opportunity, it's to take markets
00:24:47.760 | like India and China and Indonesia,
00:24:50.840 | which have big populations, Africa,
00:24:54.160 | small numbers of doctors, and provide diagnostic,
00:24:59.160 | particularly treatment planning and triage kind of on device
00:25:05.120 | so that if you do a test for malaria or tuberculosis
00:25:10.120 | or whatever, you immediately get something
00:25:12.960 | that even a healthcare worker that's had a month
00:25:15.240 | of training can get a very high quality assessment
00:25:20.240 | of whether the patient might be at risk and tell,
00:25:24.280 | okay, we'll send them off to a hospital.
00:25:27.400 | So for example, in Africa, outside of South Africa,
00:25:31.640 | there's only five pediatric radiologists
00:25:34.000 | for the entire continent, so most countries don't have any.
00:25:37.120 | So if your kid is sick and they need something diagnosed
00:25:39.720 | through medical imaging, the person, even if you're able
00:25:42.880 | to get medical imaging done, the person that looks at it
00:25:45.040 | will be a nurse at best, but actually in India, for example,
00:25:50.040 | and China, almost no x-rays are read by anybody,
00:25:54.760 | by any trained professional because they don't have enough.
00:25:59.240 | So if instead we had an algorithm that could take
00:26:03.920 | the most likely high risk 5% and say, triage basically,
00:26:08.920 | say, okay, someone needs to look at this,
00:26:13.240 | it would massively change
00:26:17.120 | what's possible with medicine in the developing world.
00:26:20.680 | And remember, increasingly, they have money.
00:26:23.720 | They're the developing world, they're not the poor world,
00:26:25.560 | they're the developing world, so they have the money,
00:26:26.800 | so they're building the hospitals,
00:26:28.440 | they're getting the diagnostic equipment,
00:26:32.000 | but there's no way for a very long time
00:26:34.880 | will they be able to have the expertise.
00:26:38.520 | - Shortage of expertise, okay, and that's where
00:26:41.080 | the deep learning systems can step in
00:26:43.360 | and magnify the expertise they do have, essentially.
00:26:46.800 | - Yeah.
00:26:47.800 | - So you do see, just to linger it a little bit longer,
00:26:52.800 | the interaction, do you still see the human experts
00:26:58.240 | still at the core of these systems?
00:26:59.880 | - Yeah, absolutely.
00:27:00.720 | - Or is there something in medicine that could be automated
00:27:02.760 | almost completely?
00:27:03.760 | - I don't see the point of even thinking about that,
00:27:06.400 | because we have such a shortage of people,
00:27:08.480 | why would we want to find a way not to use them?
00:27:12.160 | Like, we have people, so the idea of,
00:27:15.560 | even from an economic point of view,
00:27:17.160 | if you can make them 10x more productive,
00:27:19.760 | getting rid of the person doesn't impact
00:27:21.920 | your unit economics at all, and it totally ignores the fact
00:27:25.520 | that there are things people do better than machines.
00:27:28.720 | So it's just, to me, that's not a useful way
00:27:33.120 | of framing the problem.
00:27:34.080 | - I guess, just to clarify, I guess I meant
00:27:36.640 | there may be some problems where you can avoid
00:27:40.280 | even going to the expert ever, sort of maybe preventative
00:27:43.880 | care or some basic stuff, allowing the expert to focus
00:27:48.320 | on the things that are really that, you know.
00:27:51.360 | - Well, that's what the triage would do, right?
00:27:53.000 | So the triage would say, okay, this 99% triage,
00:27:58.680 | sure, there's nothing here.
00:28:00.800 | So, you know, that can be done on device,
00:28:04.040 | and they can just say, okay, go home.
00:28:05.920 | So the experts are being used to look at the stuff
00:28:09.440 | which has some chance it's worth looking at,
00:28:12.280 | which most things is not, you know, it's fine.
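
A toy sketch of that triage step, with random numbers standing in for a real model's per-case risk scores: only the highest-risk slice is routed to the scarce expert.

```python
import numpy as np

risk = np.random.rand(10_000)          # hypothetical per-case risk scores

threshold = np.quantile(risk, 0.95)    # cut-off for the top 5%
needs_expert = risk >= threshold       # these cases go to a human expert
print(f"{needs_expert.sum()} of {risk.size} cases routed to an expert")
```
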
00:28:16.360 | - Why do you think we haven't quite made progress
00:28:19.360 | on that yet, in terms of the scale of how much AI
00:28:24.360 | is applied in medicine?
00:28:27.520 | - There's a lot of reasons.
00:28:28.400 | I mean, one is it's pretty new.
00:28:29.680 | I only started Enlitic in like 2014, and before that,
00:28:33.160 | like, it's hard to express to what degree
00:28:36.720 | the medical world was not aware of the opportunities here.
00:28:40.680 | So I went to RSNA, which is the world's largest
00:28:44.920 | radiology conference, and I told everybody I could,
00:28:49.240 | you know, like, I'm doing this thing with deep learning,
00:28:51.760 | please come and check it out.
00:28:53.360 | And no one had any idea what I was talking about,
00:28:56.800 | and no one had any interest in it.
00:28:58.560 | So like, we've come from absolute zero, which is hard,
00:29:04.680 | and then the whole regulatory framework, education system,
00:29:09.920 | everything is just set up to think of doctoring
00:29:13.400 | in a very different way.
00:29:14.960 | So today, there is a small number of people
00:29:17.120 | who are deep learning practitioners and doctors
00:29:22.080 | at the same time, and we're starting to see
00:29:24.000 | the first ones come out of their PhD programs,
00:29:26.600 | so Zak Kohane over in Boston, Cambridge,
00:29:31.600 | has a number of students now who are data science experts,
00:29:38.960 | deep learning experts, and actual medical doctors.
00:29:46.120 | Quite a few doctors have completed our fast AI course now
00:29:50.040 | and are publishing papers and creating journal reading
00:29:54.960 | groups in the American Council of Radiology,
00:29:58.080 | and like, it's just starting to happen.
00:30:00.360 | But it's gonna be a long process.
00:30:02.920 | The regulators have to learn how to regulate this,
00:30:04.920 | they have to build, you know, guidelines,
00:30:08.760 | and then the lawyers at hospitals have to develop
00:30:13.320 | a new way of understanding that sometimes it makes sense
00:30:18.240 | for data to be, you know, looked at in raw form
00:30:23.520 | in large quantities in order to create
00:30:25.840 | world-changing results.
00:30:26.960 | - Yeah, so regulation around data, all that,
00:30:30.080 | it sounds, well, it's probably the hardest problem,
00:30:33.840 | but sounds reminiscent of autonomous vehicles as well.
00:30:36.720 | Many of the same regulatory challenges,
00:30:38.720 | many of the same data challenges.
00:30:40.600 | - Yeah, I mean, funnily enough,
00:30:41.520 | the problem is less the regulation
00:30:43.640 | and more the interpretation of that regulation
00:30:45.840 | by lawyers in hospitals.
00:30:48.200 | So HIPAA is actually, was designed to,
00:30:52.560 | the P in HIPAA is not standing,
00:30:55.000 | does not stand for privacy, it stands for portability.
00:30:57.640 | It's actually meant to be a way that data can be used.
00:31:00.800 | And it was created with lots of gray areas
00:31:04.360 | because the idea is that would be more practical
00:31:06.520 | and it would help people to use this legislation
00:31:10.440 | to actually share data in a more thoughtful way.
00:31:13.680 | Unfortunately, it's done the opposite
00:31:15.280 | because when a lawyer sees a gray area,
00:31:17.760 | they say, oh, if we don't know, we won't get sued,
00:31:20.720 | then we can't do it.
00:31:22.400 | So HIPAA is not exactly the problem.
00:31:26.320 | The problem is more that there's,
00:31:29.160 | hospital lawyers are not incented to make bold decisions
00:31:34.160 | about data portability.
00:31:36.480 | - Or even to embrace technology that saves lives.
00:31:40.400 | They more wanna not get in trouble
00:31:42.400 | for embracing that technology.
00:31:44.160 | - Also, it is also, saves lives in a very abstract way,
00:31:47.800 | which is like, oh, we've been able to release
00:31:49.800 | these 100,000 anonymized records.
00:31:52.280 | I can't point at the specific person
00:31:54.120 | whose life that saved.
00:31:55.280 | I can say like, oh, we ended up with this paper,
00:31:57.720 | which found this result, which diagnosed a thousand
00:32:01.640 | more people than we would have otherwise,
00:32:03.080 | but it's like, which ones were helped?
00:32:05.480 | It's very abstract.
00:32:07.280 | - Yeah, and on the counter side of that,
00:32:09.360 | you may be able to point to a life that was taken
00:32:13.040 | because of something that was--
00:32:14.280 | - Yeah, or a person whose privacy was violated.
00:32:18.200 | It's like, oh, this specific person,
00:32:20.160 | you know, was de-identified.
00:32:24.200 | - So-- - Identified.
00:32:26.000 | - Just a fascinating topic.
00:32:27.280 | We're jumping around.
00:32:28.280 | We'll get back to fast AI, but on the question of privacy,
00:32:32.520 | data is the fuel for so much innovation in deep learning.
00:32:37.520 | What's your sense on privacy,
00:32:39.760 | whether we're talking about Twitter, Facebook, YouTube,
00:32:44.000 | just the technologies like in the medical field
00:32:48.640 | that rely on people's data in order to create impact.
00:32:53.360 | How do we get that right, respecting people's privacy
00:32:58.360 | and yet creating technology that is learned from data?
00:33:03.320 | - One of my areas of focus is on doing more with less data,
00:33:08.320 | which, so most vendors, unfortunately,
00:33:14.400 | are strongly incented to find ways
00:33:17.600 | to require more data and more computation.
00:33:20.040 | So Google and IBM being the most obvious--
00:33:23.440 | - IBM.
00:33:25.920 | - Yeah, so Watson. - Watson.
00:33:27.720 | - So Google and IBM both strongly push the idea
00:33:31.160 | that you have to be, you know,
00:33:33.080 | that they have more data and more computation
00:33:35.440 | and more intelligent people than anybody else.
00:33:37.840 | And so you have to trust them to do things
00:33:39.880 | 'cause nobody else can do it.
00:33:41.340 | And Google's very upfront about this.
00:33:45.400 | Like Jeff Dean has gone out there and given talks
00:33:48.440 | and said, "Our goal is to require
00:33:50.520 | "a thousand times more computation, but less people."
00:33:55.160 | Our goal is to use the people that you have better
00:34:00.160 | and the data you have better
00:34:01.680 | and the computation you have better.
00:34:03.000 | So one of the things that we've discovered is,
00:34:06.040 | or at least highlighted, is that you very, very,
00:34:10.600 | very often don't need much data at all.
00:34:13.360 | And so the data you already have in your organization
00:34:16.160 | will be enough to get state-of-the-art results.
00:34:19.240 | So like my starting point would be to kind of say
00:34:21.320 | around privacy is a lot of people are looking for ways
00:34:25.760 | to share data and aggregate data,
00:34:28.160 | but I think often that's unnecessary.
00:34:29.960 | They assume that they need more data than they do
00:34:32.200 | 'cause they're not familiar with the basics
00:34:34.160 | of transfer learning, which is this critical technique
00:34:38.480 | for needing orders of magnitude less data.
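
To make the transfer learning point concrete, here is a minimal sketch in PyTorch/torchvision (an illustration, not fast.ai's own API, and assuming a recent torchvision): reuse an ImageNet-pretrained backbone and train only a small new head, so far fewer labeled examples are needed.

```python
import torch.nn as nn
from torchvision import models

# Start from ImageNet-pretrained weights instead of random initialization.
model = models.resnet34(weights=models.ResNet34_Weights.DEFAULT)

# Freeze the pretrained feature extractor...
for p in model.parameters():
    p.requires_grad = False

# ...and train only a new final layer for the target task (2 classes here).
model.fc = nn.Linear(model.fc.in_features, 2)

# Only model.fc's parameters will update during training.
trainable = [p for p in model.parameters() if p.requires_grad]
```
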
00:34:42.000 | - Is your sense, one reason you might wanna collect data
00:34:44.680 | from everyone is like in the recommender system context,
00:34:49.680 | where your individual, Jeremy Howard's individual data
00:34:54.520 | is the most useful for providing a product
00:34:58.440 | that's impactful for you.
00:34:59.880 | So for giving you advertisements,
00:35:02.280 | for recommending to you movies,
00:35:04.160 | for doing medical diagnosis.
00:35:06.360 | Is your sense we can build with a small amount of data,
00:35:11.680 | general models that will have a huge impact for most people
00:35:16.000 | that we don't need to have data from each individual?
00:35:19.160 | - On the whole, I'd say yes.
00:35:20.520 | I mean, there are things like,
00:35:23.400 | you know, recommender systems have this cold start problem
00:35:28.320 | where, you know, Jeremy is a new customer.
00:35:30.920 | We haven't seen him before.
00:35:31.960 | So we can't recommend him things based on what else
00:35:33.920 | he's bought and liked with us.
00:35:36.000 | And there's various workarounds to that.
00:35:38.800 | Like in a lot of music programs,
00:35:40.640 | we'll start out by saying,
00:35:42.440 | which of these artists do you like?
00:35:44.880 | Which of these albums do you like?
00:35:46.720 | Which of these songs do you like?
00:35:48.360 | Netflix used to do that.
00:35:50.960 | Nowadays, they tend not to.
00:35:53.480 | People kind of don't like that
00:35:54.760 | 'cause they think, oh, we don't wanna bother the user.
00:35:57.320 | So you could work around that
00:35:58.680 | by having some kind of data sharing
00:36:00.960 | where you get my marketing record from Acxiom or whatever
00:36:04.880 | and try to guess from that.
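
A toy sketch of that cold-start workaround (hypothetical artists, made-up two-dimensional taste vectors): elicit a few preferences up front, then recommend by similarity, with no third-party data about the user at all.

```python
import numpy as np

# Made-up item vectors; in practice these would be embeddings
# learned from other users' listening histories.
items = {"artist_a": np.array([0.9, 0.1]), "artist_b": np.array([0.2, 0.8]),
         "artist_c": np.array([0.85, 0.2]), "artist_d": np.array([0.1, 0.9])}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

liked = ["artist_a"]                     # the new user's stated preferences
profile = np.mean([items[n] for n in liked], axis=0)

# Rank everything else by similarity to the stated taste.
recs = sorted((n for n in items if n not in liked),
              key=lambda n: -cosine(profile, items[n]))
print(recs)  # 'artist_c' first: closest to what the user said they like
```
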
00:36:06.600 | To me, the benefit to me and to society
00:36:11.600 | of saving me five minutes on answering some questions
00:36:16.480 | versus the negative externalities
00:36:18.920 | of the privacy issue doesn't add up.
00:36:23.920 | So I think like a lot of the time,
00:36:26.160 | the places where people are invading our privacy
00:36:30.160 | in order to provide convenience
00:36:32.800 | is really about just trying to make them more money
00:36:36.840 | and they move these negative externalities
00:36:41.080 | to places that they don't have to pay for them.
00:36:44.240 | So when you actually see regulations appear
00:36:48.440 | that actually cause the companies
00:36:50.400 | that create these negative externalities
00:36:52.080 | to have to pay for it themselves,
00:36:53.520 | they say, well, we can't do it anymore.
00:36:56.080 | So the cost is actually too high.
00:36:58.200 | But for something like medicine,
00:37:00.360 | yeah, I mean, the hospital has my medical imaging,
00:37:05.240 | my pathology studies, my medical records.
00:37:07.920 | And also I own my medical data.
00:37:11.880 | So I help a startup called DocAI.
00:37:16.920 | One of the things DocAI does is that it has an app
00:37:19.720 | you can connect to Sutter Health and LabCorp and Walgreens
00:37:24.720 | and download your medical data to your phone
00:37:29.840 | and then upload it again at your discretion
00:37:33.600 | to share it as you wish.
00:37:35.160 | So with that kind of approach,
00:37:38.080 | we can share our medical information
00:37:41.200 | with the people we want to.
00:37:44.840 | - Yeah, so control.
00:37:45.720 | I mean, really being able to control
00:37:47.520 | who you share it with and so on.
00:37:49.760 | So that has a beautiful, interesting tangent,
00:37:53.120 | but to return back to the origin story of Fast.ai.
00:37:59.400 | All right, so before I started Fast.ai,
00:38:02.520 | I spent a year researching
00:38:06.360 | where are the biggest opportunities for deep learning?
00:38:10.400 | 'Cause I knew from my time at Kaggle in particular
00:38:14.080 | that deep learning had kind of hit this threshold point
00:38:16.920 | where it was rapidly becoming the state-of-the-art approach
00:38:19.880 | in every area that it was applied to.
00:38:21.600 | And I'd been working with neural nets for over 20 years.
00:38:25.400 | I knew that from a theoretical point of view,
00:38:27.440 | once it hit that point, it would do that
00:38:29.240 | in kind of just about every domain.
00:38:31.600 | And so I kind of spent a year researching
00:38:34.520 | what are the domains that's gonna have
00:38:36.280 | the biggest low-hanging fruit in the shortest time period.
00:38:39.440 | I picked medicine, but there were so many I could have picked
00:38:43.960 | and so there was a kind of level of frustration for me
00:38:46.280 | of like, okay, I'm really glad we've opened up
00:38:50.000 | the medical deep learning world
00:38:51.160 | and today it's huge, as you know,
00:38:53.960 | but we can't do, I can't do everything.
00:38:58.320 | I don't even know, like in medicine,
00:39:00.440 | it took me a really long time to even get a sense
00:39:02.320 | of like what kind of problems do medical practitioners solve?
00:39:05.120 | What kind of data do they have?
00:39:06.440 | Who has that data?
00:39:07.480 | So I kind of felt like I need to approach this differently
00:39:12.520 | if I wanna maximize the positive impact of deep learning.
00:39:15.360 | Rather than me picking an area
00:39:19.280 | and trying to become good at it and building something,
00:39:21.800 | I should let people who are already domain experts
00:39:24.480 | in those areas and who already have the data
00:39:26.720 | do it themselves.
00:39:29.280 | So that was the reason for Fast.ai
00:39:33.120 | is to basically try and figure out
00:39:36.800 | how to get deep learning into the hands of people
00:39:40.160 | who could benefit from it and help them to do so
00:39:43.280 | in as quick and easy and effective a way as possible.
00:39:47.120 | - Got it, so sort of empower the domain experts.
00:39:50.280 | - Yeah, and like partly it's 'cause like,
00:39:53.120 | unlike most people in this field,
00:39:56.360 | my background is very applied and industrial.
00:40:00.000 | Like my first job was at McKinsey and Company.
00:40:02.520 | I spent 10 years in management consulting.
00:40:04.840 | I spend a lot of time with domain experts,
00:40:10.560 | so I kind of respect them and appreciate them
00:40:12.840 | and I know that's where the value generation in society is.
00:40:16.560 | And so I also know how most of them can't code
00:40:21.560 | and most of them don't have the time to invest,
00:40:26.080 | you know, three years in a graduate degree or whatever.
00:40:29.440 | So it's like, how do I upskill those domain experts?
00:40:33.640 | I think that would be a super powerful thing,
00:40:36.200 | you know, biggest societal impact I could have.
00:40:39.000 | So yeah, that was the thinking.
00:40:41.800 | - So, so much of Fast.ai students and researchers
00:40:45.800 | and the things you teach are pragmatically minded,
00:40:50.200 | practically minded, figuring out
00:40:52.920 | how to solve real problems, and fast.
00:40:55.880 | So from your experience, what's the difference
00:40:58.200 | between theory and practice of deep learning?
00:41:01.260 | - Well, most of the research in the deep learning world
00:41:07.600 | is a total waste of time.
00:41:09.920 | - Right, that's what I was getting at.
00:41:11.080 | - Yeah, it's a problem in science in general.
00:41:16.080 | Scientists need to be published,
00:41:19.640 | which means they need to work on things
00:41:21.520 | that their peers are extremely familiar with
00:41:24.080 | and can recognize and advance in that area.
00:41:26.240 | So that means that they all need to work on the same thing.
00:41:29.080 | And so really, with the things they work on,
00:41:33.040 | there's nothing to encourage them to work on things
00:41:35.640 | that are practically useful.
00:41:38.840 | So you get just a whole lot of research,
00:41:41.160 | which is minor advances in stuff
00:41:43.240 | that's been very highly studied
00:41:44.660 | and has no significant practical impact.
00:41:49.340 | Whereas the things that really make a difference,
00:41:50.920 | like I mentioned transfer learning,
00:41:52.800 | like if we can do better at transfer learning,
00:41:55.640 | then it's this like world-changing thing
00:41:58.200 | where suddenly like lots more people
00:41:59.800 | can do world-class work with less resources and less data.
00:42:04.800 | But almost nobody works on that.
00:42:08.540 | Or another example, active learning,
00:42:10.800 | which is the study of like,
00:42:11.920 | how do we get more out of the human beings in the loop?
00:42:15.960 | - That's my favorite topic.
00:42:17.160 | - Yeah, so active learning is great,
00:42:18.580 | but it's almost nobody working on it
00:42:21.220 | because it's just not a trendy thing right now.
00:42:23.840 | - You know what, somebody started to interrupt.
00:42:27.080 | He was saying that nobody is publishing on active learning,
00:42:31.560 | but there's people inside companies,
00:42:33.480 | anybody who actually has to solve a problem,
00:42:36.840 | they're going to innovate on active learning.
00:42:39.680 | - Yeah, everybody kind of reinvents active learning
00:42:42.120 | when they actually have to work in practice
00:42:43.800 | because they start labeling things and they think,
00:42:46.420 | gosh, this is taking a long time and it's very expensive.
00:42:49.340 | And then they start thinking,
00:42:51.280 | well, why am I labeling everything?
00:42:52.680 | I'm only, the machine's only making mistakes
00:42:54.880 | on those two classes, they're the hard ones.
00:42:56.920 | Maybe I'll just start labeling those two classes.
00:42:58.920 | And then you start thinking,
00:43:00.420 | well, why did I do that manually?
00:43:01.620 | Why can't I just get the system to tell me
00:43:03.040 | which things are gonna be hardest?
00:43:04.800 | It's an obvious thing to do,
00:43:06.260 | but yeah, it's just like transfer learning,
00:43:11.260 | it's understudied and the academic world
00:43:14.160 | just has no reason to care about practical results.
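
The loop Jeremy describes, label a bit and then let the model point at what it finds hardest, is uncertainty sampling; a minimal sketch with made-up probabilities:

```python
import numpy as np

# Hypothetical predicted class probabilities over an unlabeled pool.
probs = np.random.dirichlet(np.ones(5), size=10_000)

# Least-confidence sampling: the examples where the model's top
# prediction is weakest are the ones worth labeling next.
uncertainty = 1 - probs.max(axis=1)
to_label = np.argsort(-uncertainty)[:100]   # the 100 hardest examples
```
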
00:43:17.500 | The funny thing is, I've only really ever written one paper.
00:43:20.000 | I hate writing papers and I didn't even write it.
00:43:22.800 | It was my colleague, Sebastian Ruder, who actually wrote it.
00:43:25.520 | I just did the research for it,
00:43:27.960 | but it was basically introducing transfer learning,
00:43:30.640 | successful transfer learning to NLP for the first time.
00:43:34.320 | The algorithm is called ULMFiT.
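
In outline, ULMFiT has three stages: pretrain a language model on a large general corpus, fine-tune that language model on the target corpus, then reuse its encoder in a classifier. A sketch assuming fastai v2's text API (fastai ships an AWD-LSTM pretrained on Wikitext-103, so stage one is already done):

```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Stage 2: fine-tune the pretrained language model on the target corpus.
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid='test')
lm = language_model_learner(dls_lm, AWD_LSTM)
lm.fine_tune(1)
lm.save_encoder('ft_enc')

# Stage 3: build a classifier on top of the fine-tuned encoder.
dls_clas = TextDataLoaders.from_folder(path, valid='test')
clf = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
clf.load_encoder('ft_enc')
clf.fine_tune(1)
```
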
00:43:36.060 | And I actually wrote it for the course,
00:43:41.980 | for the fast.ai course.
00:43:43.700 | I wanted to teach people NLP
00:43:45.340 | and I thought I only wanna teach people practical stuff.
00:43:47.500 | And I think the only practical stuff is transfer learning.
00:43:50.540 | And I couldn't find any examples of transfer learning in NLP,
00:43:53.340 | so I just did it.
00:43:54.540 | And I was shocked to find that as soon as I did it,
00:43:57.300 | which the basic prototype took a couple of days,
00:44:01.060 | smashed the state of the art
00:44:02.500 | on one of the most important data sets
00:44:04.280 | in a field that I knew nothing about.
00:44:06.720 | And I just thought, well, this is ridiculous.
00:44:10.400 | And so I spoke to Sebastian about it
00:44:13.800 | and he kindly offered to write it up, the results.
00:44:17.680 | And so it ended up being published in ACL,
00:44:21.360 | which is the top computational linguistics conference.
00:44:25.560 | So like people do actually care once you do it,
00:44:28.880 | but I guess it's difficult for maybe like junior researchers
00:44:32.780 | or like, I don't care whether I get citations
00:44:36.600 | or papers or whatever.
00:44:37.740 | There's nothing in my life that makes that important,
00:44:39.620 | which is why I've never actually bothered
00:44:41.500 | to write a paper myself.
00:44:43.040 | But for people who do,
00:44:43.980 | I guess they have to pick the kind of safe option,
00:44:48.980 | which is like, yeah, make a slight improvement
00:44:52.280 | on something that everybody's already working on.
00:44:54.960 | - Yeah, nobody does anything interesting
00:44:58.300 | or succeeds in life with the safe option.
00:45:01.180 | - Although, I mean, the nice thing is nowadays,
00:45:02.940 | everybody is now working on NLP transfer learning
00:45:05.300 | because since that time we've had GPT and GPT-2 and BERT
00:45:09.780 | and it's like, it's so, yeah,
00:45:12.660 | once you show that something's possible,
00:45:15.380 | everybody jumps in, I guess.
00:45:17.660 | - I hope to be a part of,
00:45:19.220 | and I hope to see more innovation
00:45:20.660 | in active learning in the same way.
00:45:22.140 | I think transfer learning and active learning
00:45:24.500 | are fascinating public open work.
00:45:27.360 | - I actually helped start a startup called Platform AI,
00:45:29.960 | which is really all about active learning.
00:45:31.760 | And yeah, it's been interesting trying to kind of
00:45:34.640 | see what research is out there and make the most of it.
00:45:37.800 | And there's basically none.
00:45:39.200 | So we've had to do all our own research.
00:45:41.000 | - Once again, and just as you described.
00:45:43.000 | Can you tell the story of the Stanford competition,
00:45:47.640 | DAWNBench, and fast.ai's achievement on it?
00:45:51.500 | - Sure, so something which I really enjoy
00:45:54.280 | is that I basically teach two courses a year,
00:45:57.400 | the practical deep learning for coders,
00:45:59.640 | which is kind of the introductory course
00:46:02.080 | and then cutting edge deep learning for coders,
00:46:04.000 | which is the kind of research level course.
00:46:06.880 | And while I teach those courses,
00:46:10.400 | I basically have a big office
00:46:15.400 | at the University of San Francisco,
00:46:18.400 | it'd be enough for like 30 people.
00:46:19.760 | And I invite anybody, any student who wants to come
00:46:22.120 | and hang out with me while I build the course.
00:46:25.320 | And so generally it's full.
00:46:26.600 | And so we have 20 or 30 people in a big office
00:46:30.860 | with nothing to do, but study deep learning.
00:46:33.880 | So it was during one of these times
00:46:35.880 | that somebody in the group said,
00:46:37.320 | "Oh, there's a thing called DawnBench,
00:46:40.600 | it looks interesting."
00:46:41.440 | And I was like, "What the hell is that?"
00:46:42.840 | And they set out some competition
00:46:44.100 | to see how quickly you can train a model.
00:46:46.400 | Seems kind of not exactly relevant to what we're doing,
00:46:50.320 | but it sounds like the kind of thing
00:46:51.400 | which you might be interested in.
00:46:52.480 | I checked it out and I was like,
00:46:53.320 | "Oh crap, there's only 10 days till it's over.
00:46:55.920 | It's pretty much too late."
00:46:58.080 | And we're kind of busy trying to teach this course.
00:47:00.960 | But we're like, "Oh, it would make an interesting
00:47:03.480 | case study for the course.
00:47:06.400 | Like it's all the stuff we're already doing.
00:47:08.180 | Why don't we just put together
00:47:09.480 | our current best practices and ideas?"
00:47:12.460 | So me and I guess about four students
00:47:16.040 | just decided to give it a go.
00:47:17.560 | And we focused on this small one called CIFAR-10,
00:47:20.840 | which is little 32 by 32 pixel images.
00:47:24.640 | - Can you say what DAWNBench is?
00:47:26.120 | - Yeah, so it's a competition
00:47:27.640 | to train a model as fast as possible.
00:47:29.520 | It was run by Stanford.
00:47:30.960 | - And as cheap as possible too.
00:47:32.480 | - That's also another one for as cheap as possible.
00:47:34.280 | And there's a couple of categories, ImageNet and CIFAR-10.
00:47:38.120 | So ImageNet is this big 1.3 million image thing
00:47:42.040 | that took a couple of days to train.
00:47:44.520 | Remember a friend of mine, Pete Warden,
00:47:47.840 | who's now at Google.
00:47:50.180 | I remember he told me how he trained ImageNet
00:47:53.240 | a few years ago, and he basically like had this
00:47:55.640 | little granny flat out the back
00:47:59.760 | that he turned into his ImageNet training center.
00:48:01.880 | And he figured, you know, after like a year of work,
00:48:03.760 | he figured out how to train it in like 10 days or something.
00:48:07.040 | It's like, that was a big job.
00:48:08.480 | Well, CIFAR-10 at that time, you could train in a few hours.
00:48:12.880 | You know, it was much smaller and easier.
00:48:14.520 | So we thought we'd try CIFAR-10.
00:48:17.280 | And yeah, I'd really never done that before.
00:48:22.280 | Like I'd never really, like things like using more
00:48:25.800 | than one GPU at a time was something I tried to avoid.
00:48:29.760 | 'Cause to me, it's like very against the whole idea
00:48:32.080 | of accessibility is you should be able to do things
00:48:34.120 | with one GPU.
00:48:34.960 | - I mean, have you asked in the past before,
00:48:37.960 | after having accomplished something,
00:48:39.600 | how do I do this faster, much faster?
00:48:42.440 | - Oh, always, but it's always, for me, it's always,
00:48:44.480 | how do I make it much faster on a single GPU
00:48:47.640 | that a normal person could afford in their day-to-day life?
00:48:50.360 | It's not, how could I do it faster by, you know,
00:48:53.840 | having a huge data center?
00:48:55.240 | 'Cause to me, it's all about like,
00:48:57.200 | as many people should be able to use something as possible
00:48:59.480 | without fussing around with infrastructure.
00:49:03.160 | So anyways, in this case, it's like, well,
00:49:06.000 | we can use eight GPUs just by renting an AWS machine.
00:49:10.200 | So we thought we'd try that.
00:49:11.840 | And yeah, basically using the stuff we were already doing,
00:49:16.520 | we were able to get, you know, the speed,
00:49:20.120 | you know, within a few days, we had the speed down to,
00:49:22.880 | I don't know, a very small number of minutes.
00:49:26.000 | I can't remember exactly how many minutes it was,
00:49:28.760 | but it might've been like 10 minutes or something.
00:49:31.360 | And so, yeah, we found ourselves at the top
00:49:33.200 | of the leaderboard easily for both time and money,
00:49:38.200 | which really shocked me
00:49:39.040 | 'cause the other people competing in this
00:49:40.160 | were like Google and Intel and stuff
00:49:41.920 | who, like, know a lot more about this stuff
00:49:43.920 | than I think we do.
00:49:45.400 | So then we were emboldened.
00:49:46.840 | We thought, let's try the ImageNet one too.
00:49:50.680 | I mean, it seemed way out of our league,
00:49:53.360 | but our goal was to get under 12 hours.
00:49:55.960 | And we did, which was really exciting.
00:49:59.280 | And, but we didn't put anything up on the leaderboard,
00:50:01.480 | but we were down to like 10 hours,
00:50:03.160 | but then Google put in some,
00:50:07.760 | like five hours or something,
00:50:10.040 | we're just like, oh, we're so screwed.
00:50:13.400 | But we kind of thought we'll keep trying,
00:50:16.920 | if Google can do it in five,
00:50:17.880 | I mean, Google did it in five hours on some,
00:50:19.520 | on like a TPU pod or something,
00:50:21.520 | like a lot of hardware.
00:50:23.240 | But we kind of like had a bunch of ideas to try,
00:50:26.360 | like a really simple thing was,
00:50:28.760 | why are we using these big images?
00:50:30.520 | They're like 224 or 256 by 256 pixels.
00:50:34.640 | Why don't we try smaller ones?
00:50:37.760 | - And just to elaborate,
00:50:39.040 | there's a constraint on the accuracy
00:50:41.360 | that your trained model is supposed to achieve.
00:50:43.040 | - Yeah, you gotta achieve 93%,
00:50:45.760 | I think it was for ImageNet, exactly.
00:50:49.240 | - Which is very tough, so you have to-
00:50:51.120 | - Yeah, 93%, like they picked a good threshold.
00:50:54.680 | It was a little bit higher
00:50:56.920 | than what the most commonly used ResNet-50 model
00:51:00.840 | could achieve at that time.
00:51:03.360 | So yeah, so it's quite a difficult problem to solve.
00:51:08.160 | But yeah, we realized if we actually
00:51:09.680 | just use 64 by 64 images,
00:51:12.280 | it trained a pretty good model.
00:51:16.160 | And then we could take that same model
00:51:18.000 | and just give it a couple of epochs
00:51:19.560 | to learn 224 by 224 images.
00:51:21.880 | And it was basically already trained,
00:51:24.480 | which makes a lot of sense.
00:51:25.440 | Like if you teach somebody,
00:51:26.600 | like here's what a dog looks like
00:51:28.080 | and you show them low res versions,
00:51:30.160 | and then you say, here's a really clear picture of a dog,
00:51:33.360 | they already know what a dog looks like.
00:51:35.920 | So that like, just, we jumped to the front
00:51:39.840 | and we ended up winning parts of that competition.
00:51:44.840 | We actually ended up doing a distributed version
00:51:49.600 | over multiple machines a couple of months later
00:51:51.920 | and ended up at the top of the leaderboard.
00:51:53.480 | We had 18 minutes.
00:51:54.960 | - (laughs) ImageNet.
00:51:56.200 | - Yeah, and it was,
00:51:57.960 | and people have just kept on blasting through
00:52:00.320 | again and again since then, so.
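To make that concrete: a minimal sketch of the progressive-resizing trick in plain PyTorch (the dataset path and the helper names are illustrative, not from the conversation):

    import torch
    import torchvision
    from torch import nn, optim
    from torchvision import datasets, transforms

    def make_loader(root, size, batch_size=256):
        # A standard ImageNet-style input pipeline at a given image size.
        tfms = transforms.Compose([
            transforms.RandomResizedCrop(size),
            transforms.ToTensor(),
        ])
        return torch.utils.data.DataLoader(
            datasets.ImageFolder(root, tfms),
            batch_size=batch_size, shuffle=True)

    model = torchvision.models.resnet50(num_classes=1000)
    opt = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    def fit(loader, epochs):
        for _ in range(epochs):
            for x, y in loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()

    # Do most of the training on small, cheap 64x64 crops...
    fit(make_loader("imagenet/train", size=64), epochs=20)
    # ...then give the *same* weights a couple of epochs at full size.
    fit(make_loader("imagenet/train", size=224), epochs=2)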
00:52:02.320 | - So what's your view on multi GPU
00:52:05.640 | or multiple machine training in general
00:52:08.480 | as a way to speed code up?
00:52:11.960 | - I think it's largely a waste of time.
00:52:13.680 | - Both multi GPU on a single machine and?
00:52:15.880 | - Yeah, particularly multi machines
00:52:17.680 | 'cause it's just clunky.
00:52:19.440 | Multi GPUs is less clunky than it used to be.
00:52:25.320 | But to me, anything that slows down your iteration speed
00:52:28.520 | is a waste of time.
00:52:30.320 | So you could maybe do your very last,
00:52:33.840 | you know, perfecting of the model on multi GPUs
00:52:37.960 | if you need to.
00:52:38.960 | But, so for example,
00:52:41.040 | I think doing stuff on ImageNet is generally a waste of time.
00:52:46.000 | Why test things on 1.3 million images?
00:52:48.200 | Most of us don't use 1.3 million images.
00:52:51.080 | And we've also done research that shows that
00:52:53.840 | doing things on a smaller subset of images
00:52:56.480 | gives you the same relative answers anyway.
00:52:59.160 | So from a research point of view, why waste that time?
00:53:02.080 | So actually I released a couple of new datasets recently.
00:53:06.120 | One is called Imagenette,
00:53:07.720 | the French ImageNet, which is a small subset of ImageNet,
00:53:12.880 | which is designed to be easy to classify.
00:53:15.040 | - What's, how do you spell Imagenette?
00:53:17.280 | - It's got an extra T and E at the end
00:53:19.200 | 'cause it's very French.
00:53:20.480 | - Imagenette, okay.
00:53:21.320 | - Yeah, and then another one called Imagewoof,
00:53:24.720 | which is a subset of ImageNet that only contains dog breeds.
00:53:29.720 | - And that's a hard one, right?
00:53:30.800 | - That's a hard one.
00:53:32.000 | And I've discovered that if you just look
00:53:33.800 | at these two subsets,
00:53:34.920 | you can train things on a single GPU in 10 minutes
00:53:39.120 | and the results you get directly transferable
00:53:42.080 | to ImageNet nearly all the time.
00:53:44.320 | And so now I'm starting to see some researchers
00:53:46.360 | start to use these much smaller datasets.
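Both subsets are registered in fastai's dataset URLs, so loading them is one line (a sketch with fastai v1-era names; later versions may differ):

    from fastai.vision import *   # fastai v1

    path = untar_data(URLs.IMAGENETTE)    # the easy-to-classify 10-class subset
    # path = untar_data(URLs.IMAGEWOOF)   # the harder, dog-breeds-only subset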
00:53:49.000 | - So I deeply love the way you think
00:53:51.160 | because I think you might've written a blog post saying
00:53:55.760 | that sort of going to these big datasets
00:54:00.160 | is encouraging people to not think creatively.
00:54:03.880 | - Absolutely.
00:54:04.720 | - So it sort of constrains you
00:54:08.280 | to train on large resources.
00:54:09.840 | And because you have these resources,
00:54:11.280 | you think more resources will be better.
00:54:14.000 | And then you start, so like somehow you kill the creativity.
00:54:17.720 | - Yeah, and even worse than that, Lex,
00:54:19.280 | I keep hearing from people who say,
00:54:21.120 | "I decided not to get into deep learning
00:54:23.400 | because I don't believe it's accessible
00:54:25.440 | to people outside of Google to do useful work."
00:54:28.520 | So like I see a lot of people make an explicit decision
00:54:31.640 | to not learn this incredibly valuable tool
00:54:35.960 | because they've drunk the Google Kool-Aid,
00:54:39.040 | which is that only Google's big enough
00:54:40.720 | and smart enough to do it.
00:54:42.440 | And I just find that so disappointing and it's so wrong.
00:54:45.360 | - And I think all of the major breakthroughs in AI
00:54:49.160 | in the next 20 years will be doable on a single GPU.
00:54:53.240 | Like I would say my sense is all the big sort of-
00:54:56.240 | - Well, let's put it this way.
00:54:58.240 | None of the big breakthroughs of the last 20 years
00:55:00.160 | have required multiple GPUs.
00:55:01.680 | So like batch norm, ReLU, dropout.
00:55:05.960 | - To demonstrate that there's something to that.
00:55:08.080 | - Every one of them, none of them has required multiple GPUs.
00:55:11.960 | - GANs, the original GANs didn't require multiple GPUs.
00:55:15.760 | - Well, and we've actually recently shown
00:55:18.040 | that you don't even need GANs.
00:55:19.640 | So we've developed GAN level outcomes without needing GANs.
00:55:24.640 | And we can now do it with, again,
00:55:26.880 | by using transfer learning,
00:55:27.960 | we can do it in a couple of hours on a single GPU.
00:55:30.160 | - Just using a generative model,
00:55:31.400 | like without the adversarial part?
00:55:32.960 | - Yeah, so we've found loss functions
00:55:35.680 | that work super well without the adversarial part.
00:55:38.640 | And then one of our students, a guy called Jason Antic,
00:55:41.800 | has created a system called DeOldify,
00:55:44.600 | which uses this technique to colorize
00:55:47.240 | old black and white movies.
00:55:48.800 | You can do it on a single GPU,
00:55:50.440 | colorize a whole movie in a couple of hours.
00:55:52.840 | And one of the things that Jason and I did together
00:55:56.040 | was we figured out how to add a little bit of GAN
00:56:00.440 | at the very end, which it turns out for colorization
00:56:02.960 | makes it just a bit brighter and nicer.
00:56:05.960 | And then Jason did masses of experiments
00:56:07.880 | to figure out exactly how much to do,
00:56:09.960 | but it's still all done on his home machine
00:56:12.800 | on a single GPU in his lounge room.
00:56:15.320 | And like, if you think about like
00:56:17.520 | colorizing Hollywood movies,
00:56:19.160 | that sounds like something a huge studio would have to do,
00:56:21.680 | but he has the world's best results on this.
00:56:25.160 | - There's this problem of microphones.
00:56:27.000 | We're just talking to microphones now.
00:56:29.040 | It's such a pain in the ass to have these microphones
00:56:32.480 | to get good quality audio.
00:56:34.360 | And I tried to see if it's possible to plop down
00:56:36.680 | a bunch of cheap sensors and reconstruct
00:56:39.160 | higher quality audio from multiple sources.
00:56:41.800 | 'Cause right now I haven't seen work where,
00:56:45.160 | okay, we can take inexpensive mics,
00:56:47.440 | automatically combining audio from multiple sources
00:56:50.040 | to improve the combined audio.
00:56:52.280 | People haven't done that.
00:56:53.120 | And that feels like a learning problem.
00:56:55.080 | So hopefully somebody can.
00:56:56.840 | - Well, I mean, it's eminently doable
00:56:58.800 | and it should have been done by now.
00:57:01.000 | I felt the same way about computational photography
00:57:03.600 | four years ago.
00:57:05.240 | Why are we investing in big lenses
00:57:07.120 | when three cheap lenses,
00:57:09.800 | plus actually a little bit of intentional movement?
00:57:13.760 | So like take a few frames,
00:57:16.640 | gives you enough information to get excellent sub-pixel
00:57:19.800 | resolution, which particularly with deep learning,
00:57:22.440 | you would know exactly what you're meant to be looking at.
00:57:25.800 | We can totally do the same thing with audio.
00:57:28.160 | I think it's madness that it hasn't been done yet.
00:57:30.680 | - Has there been progress on the computational photography side?
00:57:33.240 | - Yeah, computational photography is basically standard now.
00:57:36.720 | So the Google Pixel Night Light,
00:57:40.800 | I don't know if you've ever tried it,
00:57:42.080 | but it's astonishing.
00:57:43.200 | You take a picture in almost pitch black
00:57:45.440 | and you get back a very high quality image.
00:57:49.160 | And it's not because of the lens.
00:57:51.440 | Same stuff with like adding the bokeh
00:57:53.400 | to the background blurring done computationally.
00:57:58.560 | - This is the Pixel right here.
00:57:58.560 | - Yeah, basically everybody now is doing most
00:58:03.560 | of the fanciest stuff on their phones
00:58:05.680 | with computational photography.
00:58:07.080 | And also increasingly people are putting more than one lens
00:58:10.560 | on the back of the camera.
00:58:11.760 | So the same will happen for audio for sure.
00:58:14.280 | - And there's applications in the audio side.
00:58:16.440 | If you look at an Alexa type device,
00:58:18.400 | most people I've seen, especially I worked at Google before,
00:58:22.280 | when you look at noise background removal,
00:58:25.880 | you don't think of multiple sources of audio.
00:58:28.760 | You don't play with that as much as I would hope people would.
00:58:31.840 | - But I mean, you can still do it even with one.
00:58:33.560 | Like again, it's not much work's been done in this area.
00:58:36.040 | So we're actually gonna be releasing an audio library soon,
00:58:38.960 | which hopefully will encourage development of this
00:58:41.000 | 'cause it's so underused.
00:58:43.120 | The basic approach we used for our super resolution
00:58:46.440 | which Jason uses for DeOldify
00:58:48.600 | of generating high quality images,
00:58:50.920 | the exact same approach would work for audio.
00:58:53.400 | No one's done it yet,
00:58:54.400 | but it would be a couple of months work.
00:58:57.080 | - Okay, also learning rate in terms of DawnBench.
00:59:00.400 | There's some magic on learning rate
00:59:03.480 | that you played around with.
00:59:04.480 | That's kind of interesting.
00:59:05.680 | - Yeah, so this is all work that came
00:59:06.960 | from a guy called Leslie Smith.
00:59:09.280 | Leslie's a researcher who like us cares a lot
00:59:13.200 | about just the practicalities of training neural networks
00:59:18.200 | quickly and accurately,
00:59:20.280 | which you would think is what everybody should care about,
00:59:22.040 | but almost nobody does.
00:59:23.680 | And he discovered something very interesting,
00:59:28.000 | which he calls super convergence,
00:59:29.680 | which is there are certain networks
00:59:31.160 | that with certain settings of high parameters
00:59:33.240 | could suddenly be trained 10 times faster
00:59:37.000 | by using a 10 times higher learning rate.
00:59:39.400 | Now, no one would publish that paper
00:59:43.560 | because it's not an area of kind of active research
00:59:49.440 | in the academic world.
00:59:50.360 | No academics recognize this is important.
00:59:52.760 | And also deep learning in academia
00:59:56.040 | is not considered an experimental science.
00:59:59.800 | So unlike in physics where you could say like,
01:00:02.360 | I just saw a subatomic particle do something
01:00:05.320 | which the theory doesn't explain,
01:00:07.200 | you could publish that without an explanation.
01:00:10.400 | And then in the next 60 years,
01:00:11.840 | people can try to work out how to explain it.
01:00:14.080 | We don't allow this in the deep learning world.
01:00:16.120 | So it's literally impossible for Leslie
01:00:19.520 | to publish a paper that says,
01:00:21.600 | I've just seen something amazing happen.
01:00:23.520 | This thing trained 10 times faster than it should have.
01:00:25.640 | I don't know why.
01:00:27.360 | And so the reviewers were like,
01:00:28.480 | well, you can't publish that 'cause you don't know why.
01:00:30.240 | So anyway.
01:00:31.080 | - That's important to pause on
01:00:32.160 | because there's so many discoveries
01:00:34.280 | that would need to start like that.
01:00:36.120 | - Every other scientific field I know of works that way.
01:00:39.200 | I don't know why ours is uniquely disinterested
01:00:43.480 | in publishing unexplained experimental results,
01:00:47.680 | but there it is.
01:00:48.640 | So it wasn't published.
01:00:49.880 | Having said that,
01:00:52.480 | I read a lot more unpublished papers than published papers
01:00:56.800 | 'cause that's where you find the interesting insights.
01:01:00.000 | So I absolutely read this paper.
01:01:02.600 | And I was just like,
01:01:04.440 | this is astonishingly mind-blowing and weird and awesome.
01:01:09.440 | And like, why isn't everybody only talking about this?
01:01:12.320 | Because like, if you can train these things 10 times faster,
01:01:15.400 | they also generalize better
01:01:16.640 | because you're doing less epochs,
01:01:18.720 | which means you look at the data less,
01:01:20.000 | you get better accuracy.
01:01:21.360 | So I've been kind of studying that ever since.
01:01:24.560 | And eventually Leslie kind of figured out
01:01:28.440 | a lot of how to get this done.
01:01:30.040 | And we added minor tweaks
01:01:32.160 | and a big part of the trick
01:01:33.560 | is starting at a very low learning rate,
01:01:36.400 | very gradually increasing it.
01:01:37.840 | So as you're training your model,
01:01:39.760 | you would take very small steps at the start
01:01:42.040 | and you gradually make them bigger and bigger
01:01:44.000 | until eventually you're taking much bigger steps
01:01:46.360 | than anybody thought was possible.
01:01:48.120 | There's a few other little tricks to make it work,
01:01:51.040 | but basically we can reliably get super convergence.
01:01:55.160 | And so for the DawnBench thing,
01:01:56.560 | we were using just much higher learning rates
01:01:59.280 | than people expected to work.
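The schedule being described, ramping up from a small learning rate and then annealing back down, is Leslie Smith's 1cycle policy, and PyTorch ships a version of it as OneCycleLR. A minimal sketch, with the peak learning rate purely illustrative:

    import torch

    model = torch.nn.Linear(10, 2)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    sched = torch.optim.lr_scheduler.OneCycleLR(
        opt,
        max_lr=1.0,        # a far higher peak than you'd normally dare
        total_steps=1000,
        pct_start=0.3,     # spend the first 30% of steps warming up
    )
    for step in range(1000):
        # forward/backward pass would go here
        opt.step()
        sched.step()       # the LR rises, peaks, then anneals back down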
01:02:02.160 | - What do you think the future of,
01:02:03.800 | I mean, it makes so much sense for that
01:02:05.160 | for learning rate to be a critical hyperparameter that you vary.
01:02:08.600 | What do you think the future
01:02:09.480 | of learning rate magic looks like?
01:02:13.440 | - Well, there's been a lot of great work
01:02:14.880 | in the last 12 months in this area.
01:02:17.360 | And people are increasingly realizing that,
01:02:20.160 | like we just have no idea really how optimizers work.
01:02:23.080 | And the combination of weight decay,
01:02:25.800 | which is how we regularize optimizers
01:02:27.440 | and the learning rate,
01:02:29.160 | and then other things like the epsilon we use
01:02:31.480 | in the Adam optimizer,
01:02:32.760 | they all work together in weird ways.
01:02:36.520 | And different parts of the model,
01:02:38.520 | this is another thing we've done a lot of work on
01:02:40.440 | is research into how different parts of the model
01:02:43.480 | should be trained at different rates in different ways.
01:02:46.600 | So we do something we call discriminative learning rates,
01:02:49.000 | which is really important,
01:02:50.120 | particularly for transfer learning.
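Discriminative learning rates can be sketched with plain PyTorch parameter groups; the layer split below is illustrative, and fastai v1 expresses roughly the same idea as learn.fit_one_cycle(1, max_lr=slice(1e-5, 1e-3)):

    import torch
    import torchvision

    model = torchvision.models.resnet34(pretrained=True)
    # Early, general-purpose layers take tiny steps; the head takes big ones.
    opt = torch.optim.SGD([
        {"params": model.layer1.parameters(), "lr": 1e-5},
        {"params": model.layer4.parameters(), "lr": 1e-4},
        {"params": model.fc.parameters(),     "lr": 1e-3},
    ], momentum=0.9)   # other layers omitted for brevity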
01:02:51.880 | So really I think in the last 12 months,
01:02:54.840 | a lot of people have realized
01:02:55.840 | that all this stuff is important,
01:02:57.360 | there's been a lot of great work coming out,
01:02:59.960 | and we're starting to see algorithms appear,
01:03:03.640 | which have very, very few dials,
01:03:06.440 | if any, that you have to touch.
01:03:07.880 | So I think what's gonna happen
01:03:09.240 | is the idea of a learning rate,
01:03:10.800 | it almost already has disappeared in the latest research.
01:03:14.320 | And instead it's just like,
01:03:15.720 | we know enough about how to interpret the gradients
01:03:21.800 | and the change of gradients we see
01:03:23.800 | to know how to set every parameter.
01:03:25.320 | - That you can automate it.
01:03:26.280 | So you see the future of deep learning,
01:03:30.800 | where really, where's the input of a human expert needed?
01:03:34.520 | - Well, hopefully the input of a human expert
01:03:36.480 | will be almost entirely unneeded
01:03:38.720 | from the deep learning point of view.
01:03:40.400 | So again, like Google's approach to this
01:03:43.440 | is to try and use thousands of times more compute
01:03:45.960 | to run lots and lots of models at the same time
01:03:49.360 | and hope that one of them is good.
01:03:51.000 | - AutoML kind of?
01:03:51.840 | - Yeah, AutoML kind of stuff, which I think is insane.
01:03:54.680 | (laughing)
01:03:56.720 | When you better understand the mechanics
01:03:59.560 | of how models learn,
01:04:01.640 | you don't have to try a thousand different models
01:04:03.760 | to find which one happens to work the best.
01:04:05.600 | You can just jump straight to the best one,
01:04:08.080 | which means that it's more accessible
01:04:09.680 | in terms of compute, cheaper,
01:04:12.680 | and also with less hyperparameters to set,
01:04:14.880 | it means you don't need deep learning experts
01:04:16.760 | to train your deep learning model for you,
01:04:19.320 | which means that domain experts can do more of the work,
01:04:22.240 | which means that now you can focus the human time
01:04:24.960 | on the kind of interpretation, the data gathering,
01:04:28.280 | identifying model errors and stuff like that.
01:04:31.360 | - Yeah, the data side.
01:04:32.800 | How often do you work with data these days
01:04:34.720 | in terms of the cleaning, looking at it?
01:04:37.800 | Like Darwin looked at different species
01:04:41.120 | while traveling about.
01:04:42.880 | Do you look at data?
01:04:44.960 | Have you in your roots in Kaggle?
01:04:48.040 | - Always, yeah. - Just look at data?
01:04:49.360 | - Yeah, I mean, it's a key part of our course
01:04:51.320 | is like before we train a model in the course,
01:04:53.440 | we see how to look at the data.
01:04:55.160 | And then after, the first thing we do
01:04:56.520 | after we train our first model,
01:04:57.920 | which is, we fine-tune an ImageNet model for five minutes.
01:05:00.520 | And then the thing we immediately do after that
01:05:02.200 | is we learn how to analyze the results of the model
01:05:05.800 | by looking at examples of misclassified images
01:05:08.920 | and looking at a confusion matrix
01:05:10.880 | and then doing like research on Google
01:05:15.080 | to learn about the kinds of things that it's misclassifying.
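That analysis step looks roughly like this with fastai v1's interpretation tools, assuming learn is the fine-tuned Learner from the course:

    from fastai.vision import *   # fastai v1

    interp = ClassificationInterpretation.from_learner(learn)
    interp.plot_top_losses(9)        # the most confidently wrong images
    interp.plot_confusion_matrix()   # which classes get mixed up with which
    interp.most_confused(min_val=2)  # (actual, predicted, count) triples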
01:05:18.120 | So to me, one of the really cool things
01:05:19.480 | about machine learning models in general
01:05:21.800 | is that when you interpret them,
01:05:24.280 | they tell you about things like,
01:05:25.400 | what are the most important features?
01:05:27.320 | Which groups you're misclassifying?
01:05:29.360 | And they help you become a domain expert more quickly
01:05:32.440 | because you can focus your time on the bits
01:05:34.840 | that the model is telling you is important.
01:05:38.680 | So it lets you deal with things like data leakage,
01:05:40.720 | for example, if it says,
01:05:41.560 | "Oh, the main feature I'm looking at is customer ID."
01:05:45.400 | You know, when you're like,
01:05:46.240 | "Oh, customer ID shouldn't be predictive."
01:05:47.600 | And then you can talk to the people
01:05:50.640 | that manage customer IDs and they'll tell you like,
01:05:53.160 | "Oh yes, as soon as a customer's application is accepted,
01:05:57.480 | we add a one on the end of their customer ID or something."
01:06:01.160 | So yeah, looking at data,
01:06:03.720 | particularly through the lens of which parts of the data
01:06:06.000 | the model says are important, is super important.
01:06:09.360 | - Yeah, and using the model to almost debug the data
01:06:12.880 | to learn more about the data.
01:06:14.240 | - Exactly.
01:06:16.800 | - What are the different cloud options
01:06:18.600 | for training your networks?
01:06:20.160 | Last question related to DawnBench.
01:06:21.960 | Well, it's part of a lot of the work you do,
01:06:24.240 | but from a perspective of performance,
01:06:27.280 | I think you've written this in a blog post.
01:06:29.480 | There's AWS, there's a TPU from Google.
01:06:32.720 | What's your sense, what the future holds?
01:06:34.480 | What would you recommend now in terms of-
01:06:37.360 | - So from a hardware point of view,
01:06:39.440 | Google's TPUs and the best Nvidia GPUs are similar.
01:06:45.320 | I mean, maybe the TPUs are like 30% faster,
01:06:47.920 | but they're also much harder to program.
01:06:49.920 | There isn't a clear leader in terms of hardware right now,
01:06:54.720 | although much more importantly,
01:06:56.280 | the Nvidia GPUs are much more programmable.
01:06:59.560 | They've got much more software written for them.
01:07:00.960 | So like that's the clear leader for me
01:07:03.160 | and where I would spend my time
01:07:04.440 | as a researcher and practitioner.
01:07:06.880 | But then in terms of the platform,
01:07:10.320 | I mean, we're super lucky now
01:07:13.800 | with stuff like Google GCP, Google Cloud,
01:07:17.040 | and AWS that you can access a GPU pretty quickly and easily.
01:07:22.040 | But I mean, for AWS, it's still too hard.
01:07:28.080 | Like you have to find an AMI and get the instance running
01:07:33.080 | and then install the software you want and blah, blah, blah.
01:07:37.080 | GCP is still, is currently the best way to get started
01:07:40.760 | on a full server environment
01:07:42.320 | because they have a fantastic fast AI
01:07:44.880 | and PyTorch ready to go instance,
01:07:47.680 | which has all the courses pre-installed.
01:07:51.080 | It has Jupyter Notebook pre-running.
01:07:53.040 | Jupyter Notebook is this wonderful
01:07:55.920 | interactive computing system,
01:07:57.600 | which everybody basically should be using
01:08:00.360 | for any kind of data-driven research.
01:08:02.880 | But then even better than that,
01:08:04.440 | there are platforms like Salamander,
01:08:08.400 | which we own and Paperspace,
01:08:11.240 | where literally you click a single button
01:08:13.560 | and it pops up a Jupyter Notebook straight away
01:08:17.200 | without any kind of installation or anything.
01:08:22.200 | And all the course notebooks are all pre-installed.
01:08:25.760 | So like for me, this is one of the things
01:08:28.560 | we spent a lot of time kind of curating and working on.
01:08:32.920 | 'Cause when we first started our courses,
01:08:35.960 | the biggest problem was people dropped out of lesson one
01:08:39.600 | 'cause they couldn't get an AWS instance running.
01:08:42.680 | So things are so much better now.
01:08:44.880 | And like we actually have, if you go to course.fast.ai,
01:08:47.760 | the first thing it says is,
01:08:48.720 | "Here's how to get started with your GPU."
01:08:50.480 | And there's like, you just click on the link
01:08:52.120 | and you click start and you're going.
01:08:55.160 | - So you would go GCP.
01:08:56.280 | I have to confess, I've never used the Google GCP.
01:08:58.800 | - Yeah, GCP gives you $300 of compute for free,
01:09:01.640 | which is really nice.
01:09:03.920 | But as I say, Salamander and Paperspace
01:09:07.320 | are even easier still.
01:09:09.440 | - Okay.
01:09:10.960 | So from the perspective of deep learning frameworks,
01:09:15.120 | you work with Fast.ai, this framework,
01:09:18.440 | and PyTorch and TensorFlow.
01:09:21.240 | What are the strengths of each platform?
01:09:24.320 | - Sure. - Your perspective.
01:09:25.800 | - So in terms of what we've done our research on
01:09:28.760 | and taught in our course,
01:09:30.240 | we started with Theano and Keras.
01:09:34.360 | And then we switched to TensorFlow and Keras.
01:09:38.080 | And then we switched to PyTorch
01:09:40.360 | and then we switched to PyTorch and Fast.ai.
01:09:42.960 | And that kind of reflects a growth and development
01:09:47.560 | of the ecosystem of deep learning libraries.
01:09:50.960 | Theano and TensorFlow were great,
01:09:57.080 | but were much harder to teach
01:09:59.720 | and to do research and development on
01:10:01.680 | because they define what's called a computational graph
01:10:04.560 | up front, a static graph,
01:10:06.040 | where you basically have to say,
01:10:07.400 | here are all the things that I'm going to eventually do
01:10:10.840 | in my model.
01:10:12.000 | And then later on you say,
01:10:13.160 | okay, do those things with this data.
01:10:15.040 | And you can't like debug them,
01:10:17.080 | you can't do them step-by-step,
01:10:18.480 | you can't program them interactively
01:10:20.080 | in a Jupyter notebook and so forth.
01:10:22.240 | PyTorch was not the first,
01:10:23.680 | but PyTorch was certainly the strongest entrant
01:10:26.800 | to come along and say,
01:10:27.640 | let's not do it that way,
01:10:28.640 | let's just use normal Python.
01:10:30.240 | And everything you know about in Python
01:10:32.840 | is just gonna work.
01:10:34.080 | And we'll figure out how to make that run on the GPU
01:10:37.920 | as and when necessary.
01:10:39.320 | That turned out to be a huge leap
01:10:44.640 | in terms of what we could do with our research
01:10:46.800 | and what we could do with our teaching.
01:10:48.760 | - 'Cause it wasn't limiting.
01:10:51.240 | - Yeah, I mean, it was critical for us
01:10:52.760 | for something like DawnBench
01:10:53.880 | to be able to rapidly try things.
01:10:55.960 | It's just so much harder to be a researcher
01:10:57.840 | and practitioner when you have to do everything up front
01:11:00.520 | and you can't inspect it.
01:11:03.400 | The problem with PyTorch is
01:11:05.120 | it's not at all accessible to newcomers
01:11:08.880 | because you have to write your own training loop
01:11:11.600 | and manage the gradients and all this stuff.
01:11:14.120 | And it's also not great for researchers
01:11:17.880 | because you're spending your time
01:11:19.360 | dealing with all this boilerplate and overhead
01:11:21.640 | rather than thinking about your algorithm.
01:11:23.880 | So we ended up writing this very multi-layered API
01:11:27.760 | that at the top level,
01:11:29.040 | you can train a state-of-the-art neural network
01:11:31.400 | in three lines of code.
01:11:33.560 | And which kind of talks to an API,
01:11:35.040 | which talks to an API, which talks to an API,
01:11:36.640 | which like you can dive into at any level
01:11:38.800 | and get progressively closer to the machine
01:11:42.640 | kind of levels of control.
01:11:44.120 | And this is the Fast.ai library.
01:11:47.400 | That's been critical for us and for our students
01:11:51.800 | and for lots of people that have won
01:11:53.680 | big machine learning competitions with it
01:11:55.200 | and written academic papers with it.
01:11:57.400 | It's made a big difference.
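A sketch of what that layered API looks like in practice, using fastai v1-style names (which vary across versions); path is assumed to hold one folder of images per class:

    from fastai.vision import *   # fastai v1

    data = ImageDataBunch.from_folder(path, train='.', valid_pct=0.2, size=224)
    learn = cnn_learner(data, models.resnet34, metrics=accuracy)
    learn.fit_one_cycle(4)

    # And when you need control, the layers underneath are right there:
    learn.model           # a plain torch.nn.Module
    learn.data.train_dl   # the underlying data loader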
01:12:00.640 | We're still limited though by Python
01:12:02.920 | and particularly this problem with things like
01:12:06.400 | recurrent neural nets say where you just can't change things
01:12:11.400 | unless you accept it going so slowly that it's impractical.
01:12:15.640 | So in the latest incarnation of the course
01:12:18.320 | and with some of the research we're now starting to do,
01:12:20.880 | we're starting to do some stuff in Swift.
01:12:24.520 | I think we're three years away
01:12:27.400 | from that being super practical,
01:12:29.800 | but I'm in no hurry.
01:12:31.040 | I'm very happy to invest the time to get there.
01:12:34.240 | But with that, we actually already have a nascent version
01:12:39.000 | of the Fast.ai library for vision
01:12:41.800 | running on Swift for TensorFlow.
01:12:44.720 | 'Cause Python for TensorFlow is not gonna cut it.
01:12:48.000 | It's just a disaster.
01:12:49.920 | What they did was they tried to replicate
01:12:52.960 | the bits that people were saying they like about PyTorch,
01:12:57.080 | this kind of interactive computation,
01:12:59.160 | but they didn't actually change
01:13:00.600 | their foundational runtime components.
01:13:03.880 | So they kind of added this like syntax sugar
01:13:06.600 | they call TF eager, TensorFlow eager,
01:13:08.360 | which makes it look a lot like PyTorch,
01:13:10.880 | but it's 10 times slower than PyTorch to actually do a step.
01:13:15.880 | So because they didn't invest the time
01:13:19.040 | in like retooling the foundations
01:13:21.080 | 'cause their code base is so horribly complex.
01:13:23.440 | - Yeah, I think it's probably very difficult
01:13:25.240 | to do that kind of retooling.
01:13:26.360 | - Yeah, well, particularly the way TensorFlow was written,
01:13:28.600 | it was written by a lot of people very quickly
01:13:31.440 | in a very disorganized way.
01:13:33.320 | So like when you actually look in the code, as I do often,
01:13:35.960 | I'm always just like, oh God, what were they thinking?
01:13:38.800 | It's just, it's pretty awful.
01:13:41.360 | So I'm really extremely negative
01:13:45.200 | about the potential future for Python.
01:13:47.800 | - TensorFlow, Python for TensorFlow.
01:13:50.040 | - But Swift for TensorFlow
01:13:52.080 | can be a different beast altogether.
01:13:53.720 | It can be like, it can basically be a layer on top of MLIR
01:13:57.520 | that takes advantage of all the great compiler stuff
01:14:02.520 | that Swift builds on with LLVM.
01:14:04.720 | And yeah, it could be,
01:14:07.000 | I think it will be absolutely fantastic.
01:14:09.280 | - Well, you're inspiring me to try.
01:14:11.840 | I haven't truly felt the pain of TensorFlow 2.0 Python.
01:14:16.840 | It's fine by me, but-
01:14:19.520 | - Yeah, I mean, it does the job
01:14:22.080 | if you're using like predefined things
01:14:25.080 | that somebody's already written.
01:14:27.680 | But if you actually compare, you know,
01:14:29.520 | like I've had to do,
01:14:31.320 | 'cause I've been having to do a lot of stuff
01:14:32.600 | with TensorFlow recently,
01:14:33.640 | you actually compare like,
01:14:34.720 | okay, I want to write something from scratch.
01:14:37.320 | And you're like, I just keep finding it's like,
01:14:38.840 | oh, it's running 10 times slower than PyTorch.
01:14:41.480 | - So is the biggest cost,
01:14:43.760 | let's throw running time out the window,
01:14:47.280 | how long it takes you to program?
01:14:49.560 | - That's not too different now.
01:14:50.920 | Thanks to TensorFlow Eager, that's not too different.
01:14:54.000 | But because so many things take so long to run,
01:14:58.560 | you wouldn't run it at 10 times slower.
01:15:00.240 | Like you just go like, oh, this is taking too long.
01:15:03.200 | And also there's a lot of things
01:15:04.200 | which are just less programmable,
01:15:05.760 | like tf.data, which is the way data processing works
01:15:08.920 | in TensorFlow is just this big mess.
01:15:11.320 | It's incredibly inefficient.
01:15:13.160 | And they kind of had to write it that way
01:15:14.720 | because of the TPU problems I described earlier.
01:15:19.120 | So I just, you know,
01:15:22.120 | I just feel like they've got this huge technical debt,
01:15:24.680 | which they're not going to solve
01:15:26.160 | without starting from scratch.
01:15:27.920 | - So here's an interesting question then.
01:15:29.400 | If there's a new student starting today,
01:15:33.560 | what would you recommend they use?
01:15:37.440 | - Well, I mean, we obviously recommend Fast.ai and PyTorch
01:15:40.400 | because we teach new students
01:15:42.680 | and that's what we teach with.
01:15:43.840 | So we would very strongly recommend that
01:15:46.040 | because it will let you get on top of the concepts
01:15:49.960 | much more quickly.
01:15:51.880 | So then you'll become an actual,
01:15:53.080 | and you'll also learn the actual state of the art techniques,
01:15:56.120 | you know, so you actually get world-class results.
01:15:59.160 | Honestly, it doesn't much matter what library you learn
01:16:03.880 | because switching from Chainer to MXNet
01:16:08.280 | to TensorFlow to PyTorch is gonna be a couple of days work
01:16:11.960 | as long as you understand the foundations well.
01:16:15.200 | - But do you think Swift will creep in there
01:16:19.360 | as a thing that people start using?
01:16:22.880 | - Not for a few years,
01:16:24.320 | particularly because like Swift has no data science community,
01:16:29.320 | libraries, tooling. - So code bases are out there.
01:16:33.360 | - And the Swift community has a total lack of appreciation
01:16:38.360 | and understanding of numeric computing.
01:16:40.840 | So like they keep on making stupid decisions,
01:16:43.280 | you know, for years they've just done dumb things
01:16:45.400 | around performance and prioritization.
01:16:49.400 | That's clearly changing now
01:16:53.440 | because the developer of Swift, Chris Lattner,
01:16:58.000 | is working at Google on Swift for TensorFlow.
01:17:00.720 | So like that's a priority.
01:17:04.160 | It'll be interesting to see what happens with Apple
01:17:05.800 | because like Apple hasn't shown any sign of caring
01:17:10.800 | about numeric programming in Swift.
01:17:13.800 | So I mean, hopefully they'll get off their ass
01:17:17.360 | and start appreciating this
01:17:18.800 | 'cause currently all of their low level libraries
01:17:22.240 | are not written in Swift.
01:17:25.120 | They're not particularly Swifty at all,
01:17:27.360 | stuff like Core ML, they're really pretty rubbish.
01:17:30.760 | So yeah, so there's a long way to go,
01:17:33.680 | but at least one nice thing is that Swift for TensorFlow
01:17:36.080 | can actually directly use Python code and Python libraries
01:17:40.760 | and literally the entire lesson one notebook of fast.ai
01:17:45.000 | runs in Swift right now in Python mode.
01:17:48.560 | So that's a nice intermediate thing.
01:17:51.640 | - How long does it take,
01:17:53.400 | if you look at the two fast AI courses,
01:17:57.560 | how long does it take to get from point zero
01:18:00.480 | to completing both courses?
01:18:02.040 | - It varies a lot.
01:18:04.320 | Somewhere between two months and two years generally.
01:18:13.160 | - So for two months, how many hours a day?
01:18:15.320 | - So like somebody who is a very competent coder
01:18:20.320 | can do 70 hours per course and-
01:18:26.480 | - 70, seven zero, that's it?
01:18:30.040 | Okay.
01:18:30.880 | - But a lot of people I know take a year off
01:18:35.680 | to study fast AI full time and say at the end of the year,
01:18:40.480 | they feel pretty competent.
01:18:43.440 | 'Cause generally there's a lot of other things you do.
01:18:45.560 | Like generally they'll be entering Kaggle competitions.
01:18:48.680 | They might be reading Ian Goodfellow's book.
01:18:51.440 | They might, you know, they'll be doing a bunch of stuff.
01:18:54.560 | And often, you know, particularly if they
01:18:56.720 | are a domain expert, their coding skills
01:18:59.040 | might be a little on the pedestrian side.
01:19:01.760 | So part of it's just like doing a lot more writing.
01:19:04.760 | - What do you find is the bottleneck for people usually,
01:19:08.000 | except getting started and setting stuff up?
01:19:11.720 | - I would say coding.
01:19:13.160 | - Just-
01:19:14.000 | - Yeah, I would say the best,
01:19:14.840 | the people who are strong coders pick it up the best.
01:19:17.880 | Although another bottleneck is people who have a lot
01:19:21.640 | of experience of classic statistics can really struggle
01:19:26.640 | because the intuition is so the opposite
01:19:30.000 | of what they're used to.
01:19:30.840 | They're very used to like trying to reduce the number
01:19:33.040 | of parameters in their model and looking
01:19:36.920 | at individual coefficients and stuff like that.
01:19:39.400 | So I find people who have a lot of coding background
01:19:42.920 | and know nothing about statistics are generally
01:19:45.680 | gonna be the best off.
01:19:47.440 | - So you taught several courses on deep learning
01:19:51.360 | and as Feynman says,
01:19:52.920 | "The best way to understand something is to teach it."
01:19:55.600 | What have you learned about deep learning from teaching it?
01:19:59.120 | - A lot.
01:20:00.600 | It's a key reason for me to teach the courses.
01:20:03.560 | I mean, obviously it's gonna be necessary
01:20:04.920 | to achieve our goal of getting domain experts
01:20:07.640 | to be familiar with deep learning,
01:20:09.320 | but it was also necessary for me to achieve my goal
01:20:12.040 | of being really familiar with deep learning.
01:20:14.240 | I mean, to see so many domain experts
01:20:23.200 | from so many different backgrounds,
01:20:25.640 | it's definitely, I wouldn't say taught me,
01:20:28.800 | but convinced me something that I liked to believe
01:20:31.520 | was true, which was anyone can do it.
01:20:34.880 | So there's a lot of kind of snobbishness out there
01:20:37.400 | about only certain people can learn to code,
01:20:40.200 | only certain people are gonna be smart enough to do AI.
01:20:43.120 | That's definitely bullshit.
01:20:45.320 | I've seen so many people
01:20:47.240 | from so many different backgrounds get state-of-the-art
01:20:50.320 | results in their domain areas now.
01:20:52.480 | It's definitely taught me that the key differentiator
01:20:57.120 | between people that succeed and people that fail
01:20:59.600 | is tenacity.
01:21:00.680 | That seems to be basically the only thing that matters.
01:21:03.920 | The people, a lot of people give up.
01:21:06.800 | And, but of the ones who don't give up,
01:21:11.360 | pretty much everybody succeeds.
01:21:15.000 | Even if at first I'm just kind of like thinking like,
01:21:17.840 | wow, they really aren't quite getting it yet, are they?
01:21:20.520 | But eventually people get it and they succeed.
01:21:24.720 | So I think that's been,
01:21:26.400 | I think they're both things I've liked to believe was true,
01:21:28.720 | but I don't feel like I really had strong evidence
01:21:30.880 | for them to be true,
01:21:31.760 | but now I can say I've seen it again and again.
01:21:34.760 | - So what advice do you have for someone
01:21:38.600 | who wants to get started in deep learning?
01:21:42.160 | - Train lots of models.
01:21:44.360 | That's how you learn it.
01:21:47.040 | So like, so I would, you know, I think, it's not just me.
01:21:51.560 | I think our course is very good,
01:21:53.320 | but also lots of people independently have said
01:21:54.960 | it's very good.
01:21:55.800 | It recently won the CogX award for AI courses
01:21:58.600 | as being the best in the world.
01:22:00.160 | I'd say come to our course, course.fast.ai.
01:22:02.960 | And the thing I keep on harping on in my lessons
01:22:05.240 | is train models, print out the inputs to the models,
01:22:09.120 | print out the outputs of the models,
01:22:11.000 | like study, you know, change the inputs a bit,
01:22:15.320 | look at how the outputs vary,
01:22:17.320 | just run lots of experiments to get a, you know,
01:22:20.360 | an intuitive understanding of what's going on.
01:22:24.480 | - To get hooked, do you think, you mentioned training,
01:22:29.080 | do you think just running the models inference?
01:22:32.640 | Like if we talk about getting started.
01:22:35.360 | - No, you've got to fine tune the models.
01:22:37.480 | So that's the critical thing,
01:22:39.480 | 'cause at that point you now have a model
01:22:41.240 | that's in your domain area.
01:22:43.240 | So there's no point running somebody else's model
01:22:46.840 | 'cause it's not your model.
01:22:47.880 | Like, so it only takes five minutes to fine tune a model
01:22:50.480 | for the data you care about.
01:22:52.040 | And in lesson two of the course,
01:22:53.520 | we teach you how to create your own dataset from scratch
01:22:56.360 | by scripting Google image search.
01:22:58.560 | So, and we show you how to actually create
01:23:01.160 | a web application running online.
01:23:02.840 | So I create one in the course that differentiates
01:23:05.280 | between a teddy bear, a grizzly bear, and a brown bear.
01:23:08.320 | And it does it with basically a hundred percent accuracy.
01:23:11.040 | Took me about four minutes to scrape the images
01:23:13.120 | from Google search in the script.
01:23:15.080 | There's a little graphical widgets we have in the notebook
01:23:18.760 | that help you clean up the dataset.
01:23:21.400 | There's other widgets that help you study the results
01:23:24.040 | to see where the errors are happening.
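The workflow just described, sketched with fastai v1 helpers; the URL files are ones you would export yourself from an image search, and the paths are illustrative:

    from fastai.vision import *   # fastai v1

    for bear in ['teddy', 'grizzly', 'brown']:
        # each urls_*.csv holds image URLs scraped from Google image search
        download_images(f'urls_{bear}.csv', f'bears/{bear}', max_pics=200)
        verify_images(f'bears/{bear}', delete=True)   # drop broken downloads

    data = ImageDataBunch.from_folder('bears', train='.', valid_pct=0.2, size=224)
    # fine-tuning then proceeds exactly as in the earlier three-line sketch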
01:23:26.360 | And so now we've got over a thousand replies
01:23:29.280 | in our share your work here thread of students saying,
01:23:32.800 | here's the thing I built.
01:23:34.280 | And so there's people who like,
01:23:35.880 | and a lot of them are state of the art.
01:23:37.600 | Like somebody said, oh, I tried looking
01:23:39.000 | at Devanagari characters and I couldn't believe it.
01:23:41.160 | The thing that came out was more accurate
01:23:43.320 | than the best academic paper after lesson one.
01:23:46.640 | And then there's others which are just more kind of fun.
01:23:48.560 | Like somebody who's doing Trinidad and Tobago hummingbirds.
01:23:53.080 | She said, that's kind of their national bird.
01:23:54.880 | And she's got something that can now classify a Trinidad
01:23:57.400 | and Tobago hummingbirds.
01:23:58.800 | So yeah, train models, fine tune models with your dataset
01:24:02.440 | and then study their inputs and outputs.
01:24:05.200 | - How much are the Fast.ai courses?
01:24:07.160 | - Free.
01:24:08.000 | Everything we do is free.
01:24:10.480 | We have no revenue sources of any kind.
01:24:12.720 | It's just a service to the community.
01:24:15.400 | - You're a saint.
01:24:16.600 | Okay.
01:24:17.440 | Once a person understands the basics,
01:24:20.080 | trains a bunch of models.
01:24:22.720 | If we look at the scale of years,
01:24:25.880 | what advice do you have for someone wanting
01:24:27.640 | to eventually become an expert?
01:24:29.280 | - Train lots of models.
01:24:31.920 | (laughing)
01:24:33.120 | Specifically train lots of models in your domain area.
01:24:35.360 | So an expert at what, right?
01:24:37.080 | We don't need more experts,
01:24:39.160 | like, creating slightly evolutionary research
01:24:44.160 | in areas that everybody's studying.
01:24:46.680 | We need experts at using deep learning
01:24:50.440 | to diagnose malaria.
01:24:52.640 | Or we need experts at using deep learning
01:24:55.520 | to analyze language to study media bias.
01:25:00.520 | So we need experts in analyzing fisheries
01:25:04.080 | to identify problem areas in the ocean.
01:25:11.960 | That's what we need.
01:25:13.240 | So like become the expert in your passion area.
01:25:17.760 | And this is a tool which you can use
01:25:20.160 | for just about anything.
01:25:21.240 | And you'll be able to do that thing better
01:25:22.920 | than other people, particularly by combining it
01:25:25.760 | with your passion and domain expertise.
01:25:27.440 | - So that's really interesting.
01:25:28.400 | Even if you do wanna innovate on transfer learning
01:25:30.880 | or active learning, your thought is,
01:25:34.040 | I mean, it's one I certainly share,
01:25:36.200 | is you also need to find a domain or a dataset
01:25:40.160 | that you actually really care for.
01:25:41.680 | - Right.
01:25:42.520 | If you're not working on a real problem that you understand,
01:25:45.360 | how do you know if you're doing it any good?
01:25:48.040 | How do you know if your results are good?
01:25:49.320 | How do you know if you're getting bad results?
01:25:50.800 | Why are you getting bad results?
01:25:52.040 | Is it a problem with the data?
01:25:53.600 | How do you know you're doing anything useful?
01:25:57.400 | Yeah, to me, the only really interesting research
01:26:00.160 | is not the only, but the vast majority
01:26:02.400 | of interesting research is like try
01:26:04.720 | and solve an actual problem and solve it really well.
01:26:06.880 | - So both understanding sufficient tools
01:26:09.440 | on the deep learning side and becoming a domain expert
01:26:13.720 | in a particular domain are really things within reach
01:26:17.360 | for anybody.
01:26:18.280 | - Yeah, I mean, to me, I would compare it
01:26:20.560 | to like studying self-driving cars,
01:26:23.480 | having never looked at a car or been in a car
01:26:26.560 | or turned a car on, which is like the way it is
01:26:29.360 | for a lot of people.
01:26:30.640 | They'll study some academic dataset
01:26:32.880 | where they literally have no idea about that.
01:26:36.160 | - By the way, I'm not sure how familiar
01:26:37.680 | with autonomous vehicles, but that is literally,
01:26:40.880 | you've described a large percentage of robotics folks
01:26:43.440 | working on self-driving cars,
01:26:45.000 | it's that they actually haven't considered driving.
01:26:48.680 | They haven't actually looked at what driving looks like.
01:26:50.600 | They haven't driven.
01:26:51.440 | - Right, and it's a problem because you know,
01:26:53.320 | when you've actually driven, you know,
01:26:54.400 | like these are the things that happened to me
01:26:56.240 | when I was driving.
01:26:57.080 | - There's nothing that beats the real world examples
01:26:59.680 | of just experiencing them.
01:27:01.120 | You've created many successful startups.
01:27:04.880 | What does it take to create a successful startup?
01:27:07.400 | - Same thing as becoming a successful
01:27:11.520 | deep learning practitioner, which is not giving up.
01:27:15.000 | So you can run out of money or run out of time
01:27:20.000 | or run out of something, you know,
01:27:24.720 | but if you keep costs super low
01:27:28.000 | and try and save up some money beforehand
01:27:29.960 | so you can afford to have some time,
01:27:34.000 | then just sticking with it is one important thing.
01:27:38.080 | Doing something you understand and care about is important.
01:27:42.680 | By something, I don't mean,
01:27:44.040 | the biggest problem I see with deep learning people
01:27:46.720 | is they do a PhD in deep learning
01:27:50.160 | and then they try and commercialize their PhD,
01:27:52.440 | which is a waste of time
01:27:53.320 | 'cause that doesn't solve an actual problem.
01:27:55.880 | You picked your PhD topic 'cause it was an interesting
01:27:59.280 | kind of engineering or math or research exercise.
01:28:02.520 | But yeah, if you've actually spent time as a recruiter
01:28:06.680 | and you know that most of your time
01:28:08.240 | was spent sifting through resumes
01:28:10.680 | and you know that most of the time
01:28:12.880 | you're just looking for certain kinds of things
01:28:14.720 | and you can try doing that with a model for a few minutes
01:28:19.720 | and see whether that's something which the model
01:28:21.040 | seems to be able to do as well as you could,
01:28:23.760 | then you're on the right track to creating a startup.
01:28:27.640 | And then I think just, yeah, being,
01:28:29.400 | just be pragmatic and try and stay away
01:28:35.720 | from venture capital money as long as possible,
01:28:37.920 | preferably forever.
01:28:39.200 | - So yeah, on that point, do you,
01:28:41.320 | venture capital, so did you,
01:28:44.600 | were you able to successfully run startups
01:28:46.880 | with self-funded for quite a while?
01:28:48.240 | - Yeah, so my first two were self-funded
01:28:50.200 | and that was the right way to do it.
01:28:52.360 | - Is that scary?
01:28:53.200 | - No, VC startups are much more scary
01:28:57.840 | because you have these people on your back
01:29:00.680 | who do this all the time and who have done it for years
01:29:03.360 | telling you, "Grow, grow, grow, grow."
01:29:05.480 | And they don't care if you fail,
01:29:07.200 | they only care if you don't grow fast enough.
01:29:09.480 | So that's scary, whereas doing the ones myself,
01:29:13.280 | well, with partners who were friends,
01:29:17.760 | it's nice 'cause we just went along at a pace
01:29:21.120 | that made sense and we were able to build it to something
01:29:23.760 | which was big enough that we never had to work again,
01:29:27.280 | but it was not big enough that any VC
01:29:29.280 | would think it was impressive.
01:29:31.480 | And that was enough for us to be excited.
01:29:35.440 | So I thought that's a much better way
01:29:38.840 | to do things than most people.
01:29:40.280 | - In generally speaking, not for yourself,
01:29:41.920 | but how do you make money during that process?
01:29:44.520 | Do you cut into savings?
01:29:47.440 | - So yeah, so I started Fastmail and Optimal Decisions
01:29:50.640 | at the same time in 1999 with two different friends.
01:29:54.560 | And for Fastmail, I guess I spent $70 a month on the server.
01:30:04.000 | And when the server ran out of space,
01:30:06.240 | I put a payments button on the front page
01:30:09.400 | and said, "If you want more than 10 megs of space,
01:30:11.880 | you have to pay $10 a year."
01:30:15.640 | And- - So you ran lean,
01:30:17.320 | like kept your costs down.
01:30:18.480 | - Yeah, so I kept my cost down.
01:30:19.480 | And once I needed to spend more money,
01:30:22.960 | I asked people to spend the money for me.
01:30:25.560 | And that was that basically from then on,
01:30:29.440 | we were making money and I was profitable from then.
01:30:34.440 | For Optimal Decisions, it was a bit harder
01:30:37.640 | 'cause we were trying to sell something
01:30:40.040 | that was more like a $1 million sale.
01:30:42.160 | But what we did was we would sell scoping projects.
01:30:46.400 | So kind of like prototype-y projects,
01:30:50.560 | but rather than doing it for free,
01:30:51.720 | we would sell them for $50,000 to $100,000.
01:30:54.200 | So again, we were covering our costs
01:30:56.920 | and also making the client feel
01:30:58.320 | like we were doing something valuable.
01:31:00.200 | So in both cases, we were profitable
01:31:01.920 | from six months in.
01:31:04.800 | - Ah, nevertheless, it's scary.
01:31:08.160 | - I mean, yeah, sure.
01:31:10.000 | I mean, it's scary before you jump in.
01:31:13.280 | And I guess I was comparing it to the scarediness of VC.
01:31:18.120 | I felt like with VC stuff, it was more scary,
01:31:20.480 | kind of much more in somebody else's hands,
01:31:24.320 | will they fund you or not?
01:31:26.160 | And what do they think of what you're doing?
01:31:27.880 | I also found it very difficult with VC-backed startups
01:31:30.560 | to actually do the thing which I thought was important
01:31:34.240 | for the company rather than doing the thing
01:31:35.960 | which I thought would make the VC happy.
01:31:38.880 | Now, VCs always tell you not to do the thing
01:31:40.920 | that makes them happy.
01:31:42.400 | But then if you don't do the thing that makes them happy,
01:31:44.080 | they get sad, so.
01:31:45.360 | - And do you think optimizing for the,
01:31:48.120 | whatever they call it, the exit,
01:31:50.160 | is a good thing to optimize for?
01:31:53.080 | - I mean, it can be, but not at the VC level,
01:31:54.920 | 'cause the VC exit needs to be, you know, a thousand X.
01:31:59.560 | So, where else the lifestyle exit,
01:32:03.120 | if you can sell something for $10 million,
01:32:05.360 | you've made it, right?
01:32:06.400 | So, I don't, it depends.
01:32:09.160 | If you wanna build something that's gonna,
01:32:11.200 | you're kind of happy to do forever, then fine.
01:32:13.560 | If you wanna build something you wanna sell
01:32:15.720 | in three years time, that's fine too.
01:32:18.440 | I mean, they're both perfectly good outcomes.
01:32:21.280 | - So, you're learning Swift now, in a way.
01:32:24.880 | I mean, you already-- - Trying to.
01:32:26.760 | - And I read that you use, at least in some cases,
01:32:31.160 | spaced repetition as a mechanism for learning new things.
01:32:34.440 | I use Anki quite a lot myself.
01:32:36.400 | - Yeah, me too.
01:32:37.240 | - I actually never talked to anybody about it.
01:32:41.440 | Don't know how many people do it,
01:32:44.160 | but it works incredibly well for me.
01:32:46.760 | Can you talk through your experience?
01:32:47.960 | Like, how did you, what do you,
01:32:51.120 | first of all, okay, let's back it up.
01:32:53.120 | What is spaced repetition?
01:32:55.120 | - So, spaced repetition is an idea created
01:33:00.120 | by a psychologist named Ebbinghaus.
01:33:03.440 | I don't know, must be a couple of hundred years ago
01:33:06.120 | or something, 150 years ago.
01:33:08.040 | He did something which sounds pretty damn tedious.
01:33:10.720 | He wrote down random sequences of letters on cards
01:33:15.600 | and tested how well he would remember
01:33:18.840 | those random sequences a day later,
01:33:21.320 | a week later, whatever.
01:33:23.000 | He discovered that there was this kind of a curve
01:33:26.120 | where his probability of remembering one of them
01:33:28.800 | would be dramatically smaller the next day
01:33:30.640 | and then a little bit smaller the next day
01:33:31.960 | and a little bit smaller the next day.
01:33:33.520 | What he discovered is that if he revised those cards
01:33:36.880 | after a day, the probabilities would decrease
01:33:41.560 | at a smaller rate.
01:33:42.880 | And then if he revised them again a week later,
01:33:44.960 | they would decrease at a smaller rate again.
01:33:47.040 | And so he basically figured out a roughly optimal equation
01:33:51.800 | for when you should revise something you wanna remember.
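(A common way to formalize the curve described here: retention $R$ after time $t$ since the last review is modeled as exponential decay with a stability parameter $S$ that grows with each successful review. This exact functional form is a standard simplification associated with Ebbinghaus, not necessarily his original fit:

$$R(t) = e^{-t/S}, \qquad S_{n+1} = \alpha\, S_n \quad (\alpha > 1 \text{ after a successful review})$$

For example, with $S = 1$ day, a one-day gap leaves $R = e^{-1} \approx 0.37$; once a review grows $S$ to 3 days, the same one-day gap leaves $e^{-1/3} \approx 0.72$, which is the "decrease at a smaller rate" described here.)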
01:33:54.600 | So spaced repetition learning is using
01:33:58.640 | this simple algorithm, just something like
01:34:02.080 | revise something after a day and then three days
01:34:04.480 | and then a week and then three weeks and so forth.
01:34:07.680 | And so if you use a program like Anki, as you know,
01:34:10.640 | it will just do that for you.
01:34:12.080 | And it will say, did you remember this?
01:34:14.520 | And if you say no, it will reschedule it back
01:34:17.640 | to appear again like 10 times faster
01:34:20.280 | than it otherwise would have.
01:34:21.960 | It's a kind of a way of being guaranteed to learn something
01:34:27.880 | because by definition, if you're not learning it,
01:34:30.200 | it will be rescheduled to be revised more quickly.
01:34:32.680 | Unfortunately though, it also
01:34:36.080 | doesn't let you fool yourself.
01:34:37.440 | If you're not learning something,
01:34:39.480 | you know, your revisions will just pile up more and more.
01:34:44.040 | So you have to find ways to learn things productively
01:34:48.240 | and effectively like treat your brain well.
01:34:50.520 | So using like mnemonics and stories
01:34:52.920 | and context and stuff like that.
01:34:56.320 | So yeah, it's a super great technique.
01:34:59.720 | It's like learning how to learn is something which
01:35:02.560 | everybody should learn before they actually learn anything,
01:35:05.640 | but almost nobody does.
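(The schedule described above, revise after a day, then three days, then a week, then three weeks, and bring a forgotten card back roughly ten times sooner, can be sketched in a few lines of Python. This is a minimal illustration, not Anki's actual scheduler; the `Card`/`review` names and the `GROWTH`/`FAIL_SHRINK` factors are assumptions chosen to match the description:

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional

@dataclass
class Card:
    """One flashcard and its current review interval in days."""
    front: str
    back: str
    interval: float = 1.0  # first revision after about a day
    due: date = field(default_factory=date.today)

GROWTH = 3.0       # assumed: each success gives 1d -> 3d -> 9d -> 27d,
                   # roughly the "day, three days, week, three weeks" cadence
FAIL_SHRINK = 0.1  # assumed: a forgotten card comes back ~10x sooner

def review(card: Card, remembered: bool, today: Optional[date] = None) -> None:
    """Reschedule a card after a review, growing or shrinking its interval."""
    today = today or date.today()
    if remembered:
        card.interval *= GROWTH
    else:
        card.interval = max(1.0, card.interval * FAIL_SHRINK)
    card.due = today + timedelta(days=round(card.interval))

# usage: work through today's due cards and reschedule each one
deck = [Card("你好", "hello"), Card("谢谢", "thank you")]
for card in deck:
    if card.due <= date.today():
        review(card, remembered=True)
        print(card.front, "-> next due", card.due)
```

Anki itself uses a descendant of SuperMemo's SM-2 algorithm, which also tracks a per-card ease factor, but the interval-growth idea is the same.)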
01:35:07.920 | - So it certainly works well
01:35:10.120 | for learning new languages, for, I mean,
01:35:13.720 | for learning like small projects almost,
01:35:16.400 | but do you, you know, I started using it for,
01:35:19.800 | I forget who wrote the blog post about this that inspired me.
01:35:22.400 | It might've been you, I'm not sure.
01:35:25.520 | I started, when I read papers,
01:35:28.480 | taking concepts and ideas and putting them in.
01:35:31.880 | - Was it Michael Nielsen?
01:35:32.800 | - It was Michael Nielsen.
01:35:33.640 | - Yeah, so Michael started doing this recently
01:35:36.400 | and has been writing about it.
01:35:37.920 | So the kind of today's Ebbinghaus
01:35:43.200 | is a guy called Piotr Wozniak
01:35:45.040 | who developed a system called SuperMemo.
01:35:47.720 | And he's been basically trying to become like
01:35:50.040 | the world's greatest Renaissance man
01:35:54.040 | over the last few decades.
01:35:55.920 | He's basically lived his life with spaced repetition
01:36:00.040 | learning for everything.
01:36:02.080 | And sort of like,
01:36:05.800 | Michael's only very recently got into this,
01:36:07.440 | but he started really getting excited about doing it
01:36:09.520 | for a lot of different things.
01:36:11.160 | For me personally, I actually don't use it
01:36:14.600 | for anything except Chinese.
01:36:16.960 | And the reason for that is that Chinese
01:36:20.680 | is specifically a thing I made a conscious decision
01:36:23.080 | that I want to continue to remember,
01:36:26.680 | even if I don't get much of a chance to exercise it,
01:36:30.120 | 'cause like I'm not often in China, so I don't.
01:36:33.040 | Whereas for something like programming languages or papers,
01:36:38.320 | I have a very different approach,
01:36:39.640 | which is I try not to learn anything from them,
01:36:43.040 | but instead I try to identify the important concepts
01:36:47.080 | and like actually ingest them.
01:36:49.000 | So like really understand that concept deeply
01:36:53.640 | and study it carefully.
01:36:54.760 | I will decide if it really is important,
01:36:56.600 | and if it is, like, incorporate it into our library,
01:37:01.600 | incorporate it into how I do things,
01:37:04.200 | or decide it's not worth it.
01:37:06.760 | So I find I then remember the things that I care about
01:37:12.600 | because I'm using it all the time.
01:37:15.720 | So for the last 25 years,
01:37:20.160 | I've committed to spending at least half of every day
01:37:23.440 | learning or practicing something new,
01:37:28.760 | which all my colleagues have always hated
01:37:28.760 | because it always looks like I'm not working on
01:37:31.000 | what I'm meant to be working on,
01:37:32.000 | but it always means I do everything faster
01:37:34.560 | because I've been practicing a lot of stuff.
01:37:36.920 | So I kind of give myself a lot of opportunity
01:37:39.400 | to practice new things.
01:37:41.720 | And so I find now,
01:37:43.280 | yeah, I don't often kind of find myself
01:37:47.880 | wishing I could remember something
01:37:50.320 | 'cause if it's something that's useful,
01:37:51.440 | then I've been using it a lot.
01:37:53.880 | It's easy enough to look it up on Google,
01:37:56.160 | but speaking Chinese, you can't look it up on Google.
01:37:59.720 | - Do you have advice for people learning new things?
01:38:01.560 | So, what have you learned as a process?
01:38:04.840 | I mean, it all starts with just making the hours
01:38:07.640 | in the day available.
01:38:08.960 | - Yeah, you gotta stick with it,
01:38:10.160 | which is, again, the number one thing
01:38:12.040 | that 99% of people don't do.
01:38:13.680 | So the people I started learning Chinese with,
01:38:15.880 | none of them were still doing it 12 months later.
01:38:18.360 | I'm still doing it 10 years later.
01:38:20.400 | I tried to stay in touch with them,
01:38:21.920 | but they just, no one did it.
01:38:23.600 | For something like Chinese,
01:38:26.240 | like study how human learning works.
01:38:28.520 | So every one of my Chinese flashcards
01:38:31.240 | is associated with a story,
01:38:33.760 | and that story is specifically designed to be memorable.
01:38:36.720 | And we find things memorable,
01:38:37.840 | which are like funny or disgusting or sexy
01:38:41.360 | or related to people that we know or care about.
01:38:44.240 | So I try to make sure all the stories that are in my head
01:38:47.320 | have those characteristics.
01:38:49.120 | Yeah, so you have to, you know,
01:38:52.160 | you won't remember things well
01:38:53.240 | if they don't have some context.
01:38:56.040 | And yeah, you won't remember them well
01:38:57.280 | if you don't regularly practice them,
01:39:00.640 | whether it be just part of your day-to-day life
01:39:02.480 | or, for Chinese, the flashcards.
01:39:06.080 | I mean, the other thing is,
01:39:07.800 | let yourself fail sometimes.
01:39:09.520 | So like I've had various medical problems
01:39:11.840 | over the last few years,
01:39:13.040 | and basically my flashcards just stopped
01:39:17.040 | for about three years.
01:39:18.640 | And then there've been other times
01:39:21.480 | I've stopped for a few months,
01:39:22.600 | and it's so hard because you get back to it,
01:39:24.200 | and it's like, you have 18,000 cards due.
01:39:27.400 | And so you just have to go,
01:39:30.480 | all right, well, I can either stop and give up everything
01:39:34.120 | or just decide to do this every day
01:39:36.560 | for the next two years until I get back to it.
01:39:39.000 | The amazing thing has been that even after three years,
01:39:41.720 | you know, the Chinese was still in there.
01:39:45.880 | Like it was so much faster to relearn
01:39:48.440 | than it was to learn the first time.
01:39:50.080 | - Yeah, absolutely.
01:39:52.280 | It's in there.
01:39:53.120 | I have the same with guitar, with music and so on.
01:39:56.520 | It's sad because the work sometimes takes away,
01:39:59.120 | and then you won't play for a year.
01:40:01.160 | But really, if you then just get back to it every day,
01:40:03.520 | you're right there again.
01:40:06.000 | What do you think is the next big breakthrough
01:40:08.400 | in artificial intelligence?
01:40:09.400 | What are your hopes in deep learning or beyond
01:40:12.720 | that people should be working on,
01:40:14.120 | or you hope there'll be breakthroughs?
01:40:16.280 | - I don't think it's possible to predict.
01:40:17.960 | I think what we already have
01:40:20.600 | is an incredibly powerful platform
01:40:23.680 | to solve lots of societally important problems
01:40:26.520 | that are currently unsolved.
01:40:27.600 | So I just hope that lots of people
01:40:30.440 | will learn this toolkit and try to use it.
01:40:33.360 | I don't think we need a lot of new technological breakthroughs
01:40:36.800 | to do a lot of great work right now.
01:40:38.600 | - And when do you think we're going to create
01:40:42.760 | a human level intelligence system?
01:40:45.160 | Do you think- - Don't know.
01:40:46.480 | - How hard is it?
01:40:47.440 | How far away are we?
01:40:48.720 | - Don't know.
01:40:49.560 | - Don't know. - I have no way to know.
01:40:50.760 | I don't know.
01:40:51.760 | I don't know why people make predictions about this
01:40:53.840 | 'cause there's no data and nothing to go on.
01:40:57.480 | And it's just like,
01:41:00.360 | there's so many societally important problems
01:41:03.520 | to solve right now.
01:41:04.440 | I just don't find it a really interesting question
01:41:08.720 | to even answer.
01:41:10.280 | - So in terms of societally important problems,
01:41:13.000 | what's the problem that is within reach?
01:41:16.400 | - Well, I mean, for example,
01:41:17.480 | there are problems that AI creates, right?
01:41:19.800 | So more specifically,
01:41:21.320 | labor force displacement is going to be huge
01:41:26.840 | and people keep making this frivolous econometric argument
01:41:30.920 | of being like, oh, there's been other things that aren't AI
01:41:33.960 | that have come along before
01:41:34.960 | and haven't created massive labor force displacement,
01:41:37.800 | therefore AI won't.
01:41:39.920 | - So that's a serious concern for you?
01:41:41.600 | - Oh, yeah. - Andrew Yang is running on it.
01:41:43.680 | - Yeah, I'm desperately concerned.
01:41:47.360 | And you see already that the changing workplace
01:41:52.360 | has led to a hollowing out of the middle class.
01:41:55.760 | You're seeing that students coming out of school today
01:41:59.040 | have a less rosy financial future ahead of them
01:42:03.200 | than their parents did,
01:42:04.040 | which has never happened in the last few hundred years.
01:42:09.040 | We've always had progress before.
01:42:10.960 | And you see this turning into anxiety and despair
01:42:16.320 | and even violence.
01:42:19.480 | So I very much worry about that.
01:42:23.440 | - You've written quite a bit about ethics too.
01:42:25.760 | - I do think that every data scientist
01:42:29.640 | working with deep learning needs to recognize
01:42:33.960 | they have an incredibly high leverage tool
01:42:35.640 | that they're using that can influence society
01:42:38.000 | in lots of ways.
01:42:39.040 | And if they're doing research,
01:42:40.320 | that that research is gonna be used by people
01:42:42.760 | doing this kind of work.
01:42:44.440 | And they have a responsibility to consider the consequences
01:42:48.400 | and to think about things like
01:42:51.800 | how will humans be in the loop here?
01:42:53.920 | How do we avoid runaway feedback loops?
01:42:56.520 | How do we ensure an appeals process for humans
01:42:59.200 | that are impacted by my algorithm?
01:43:01.720 | How do I ensure that the constraints of my algorithm
01:43:04.960 | are adequately explained to the people
01:43:06.720 | that end up using them?
01:43:09.160 | There's all kinds of human issues
01:43:11.880 | which only data scientists are actually in the right place
01:43:16.280 | to educate people about.
01:43:17.960 | But data scientists tend to think of themselves
01:43:20.280 | as just engineers and that they don't need
01:43:23.400 | to be part of that process.
01:43:24.520 | - For now.
01:43:25.360 | - Yeah, which is wrong.
01:43:26.680 | - Well, you're in a perfect position to educate them better,
01:43:30.280 | to read literature, to read history, to learn from history.
01:43:33.760 | Well, Jeremy, thank you so much for everything you do,
01:43:39.080 | for inspiring a huge amount of people,
01:43:41.320 | getting them into deep learning
01:43:42.480 | and having the ripple effects,
01:43:45.080 | the flap of a butterfly's wings
01:43:47.440 | that will probably change the world.
01:43:48.640 | So thank you very much.
01:43:50.080 | - Cheers.
01:43:50.920 | (upbeat music)