The End of Finetuning — with Jeremy Howard of Fast.ai
Chapters
0:00 Introduction
1:14 Jeremy’s background
2:53 Founding FastMail and Optimal Decisions
4:05 Starting Fast.ai with Rachel Thomas
5:28 Developing the ULMFiT natural language processing model
10:11 Jeremy’s goal of making AI more accessible
14:30 Fine-tuning language models - issues with memorization and catastrophic forgetting
18:09 The development of GPT and other language models around the same time as ULMFiT
20:00 Issues with validation loss metrics when fine-tuning language models
22:16 Jeremy’s motivation to do valuable work with AI that helps society
26:39 Starting fast.ai to spread AI capabilities more widely
29:27 Overview of fast.ai - courses, library, research
34:20 Using progressive resizing and other techniques to win the DAWNBench competition
38:42 Discovering the single-shot memorization phenomenon in language model fine-tuning
43:13 Why fine tuning is simply continued pre-training
46:47 Chris Lattner and Modular AI
48:38 Issues with incentives and citations limiting innovation in research
52:49 Joining AI research communities through Discord servers
55:23 Mojo
63:08 Most exciting areas - continued focus on transfer learning and small models
66:56 Pushing capabilities of small models through transfer learning
70:58 Opening up coding through AI to more people
73:51 Current state of AI capabilities compared to computer vision in 2013 - lots of basic research needed
77:08 Lightning Round
00:00:13.880 |
This is Alessio, partner and CTO in Residence at Decibel Partners. 00:00:17.680 |
And I'm joined by my co-host, Swyx, founder of Smol.ai. 00:00:21.280 |
Hey, and today we have in the remote studio, Jeremy Howard, 00:00:34.760 |
I'm actually very used to seeing you in your mask 00:00:38.000 |
as a message to people, but today we're mostly audio. 00:00:41.760 |
But thank you for doing the very important public service 00:00:47.680 |
It was all very annoying, and frustrating, and tedious. 00:01:06.000 |
Something I did not know about you was that you graduated 00:01:08.840 |
with a BA in philosophy from the University of Melbourne. 00:01:18.600 |
because I was working 80 to 100 hour weeks at McKinsey 00:01:37.120 |
or you're very sort of self-driven and self-motivated. 00:01:39.760 |
I just-- I took two weeks off before each exam period 00:01:52.880 |
oh, I was meant to be in your class this semester, 00:01:56.280 |
Were there any assignments I was meant to have done, whatever? 00:01:59.040 |
And I can't believe all of them let me basically have-- 00:02:03.720 |
they basically always would say, like, OK, well, 00:02:05.800 |
if you can have this written by tomorrow, I'll accept it. 00:02:08.400 |
So yeah, stressful way to get through university, but-- 00:02:13.120 |
Well, it shows that, I guess, you min-maxed the opportunities. 00:02:19.760 |
Finally, like, in as much as I, you know, in philosophy, 00:02:24.200 |
the things I found interesting and focused on 00:02:26.960 |
in the little bit of time I did spend on it was ethics 00:02:31.320 |
And it's kind of really amazing that it's now come back around, 00:02:34.520 |
and those are actually genuinely useful things 00:02:37.120 |
to know about, which I never thought would happen. 00:02:39.240 |
A lot of-- yeah, a lot of relevant conversations there. 00:02:48.000 |
you founded both Optimal Decisions and Fastmail, 00:02:50.920 |
which I also briefly used, so thank you for that. 00:02:52.920 |
Good for you, yeah, 'cause I had read the statistics, 00:02:55.680 |
which is that, like, 90% or something of small businesses 00:02:58.600 |
fail, so I thought if I start two businesses, 00:03:04.160 |
some kind of stochastic thing I didn't have control over, 00:03:10.640 |
And then you were president and chief scientist 00:03:12.760 |
at Kaggle, which obviously is the competition platform 00:03:23.240 |
where you were working on using deep learning 00:03:25.520 |
to improve medical diagnostics and clinical decisions. 00:03:33.200 |
And even now, that's still, like, a pretty early phase. 00:03:36.680 |
And I actually heard you on your new podcast with Tanishq, 00:03:40.560 |
where you went very, very deep into the stuff, 00:03:47.320 |
Maybe he's too old to be called a prodigy now, ex-prodigy. 00:03:55.720 |
you have a lot more other credentials, obviously, 00:04:01.080 |
which is still, I guess, your primary identity 00:04:11.480 |
And I can't imagine a better way to describe it than Fast.ai. 00:04:19.440 |
seven weeks or something, and that's amazing. 00:04:24.480 |
when we started that, what was that, like, 2016 or something, 00:04:58.280 |
was considered kind of ridiculous when we started it. 00:05:03.040 |
And we weren't sure if it was possible either, 00:05:04.600 |
but we kind of felt like we had to give it a go 00:05:06.560 |
'cause the alternative was we were pretty sure 00:05:08.720 |
that deep learning was on its way to becoming, 00:05:12.320 |
you know, the most or one of the most, you know, 00:05:29.160 |
And, you know, well, I just wanted to know one thing 00:05:33.760 |
you were also the top rank participant in both 2010 and 2011. 00:05:37.800 |
So sometimes you see a lot of founders running companies 00:05:40.960 |
that are not really in touch with the problem, 00:05:53.480 |
which was kind of the predecessor to multitask learning 00:06:15.960 |
What were some of the kind of like contrarian takes 00:06:25.960 |
on what was kind of like the state of the art, 00:06:32.360 |
- Yeah, the whole thing was a contrarian take. 00:06:56.720 |
And then when we'd ask those few number of people, 00:07:02.400 |
you know, a box of tricks that aren't published. 00:07:04.240 |
So you have to join one of the labs and learn the tricks. 00:07:13.480 |
but thankfully there was Theano and wrappers, 00:07:27.720 |
very hard to get started in terms of the compute, 00:07:32.280 |
So, you know, everything was kind of inaccessible. 00:07:36.680 |
And, you know, as we started looking into it, 00:07:41.560 |
we had a key insight, which was like, you know what? 00:07:45.680 |
Most of the compute and data for image recognition, 00:07:53.280 |
You know, there's this thing which nobody knows about, 00:08:00.560 |
where they already figured out like how to detect edges 00:08:04.840 |
and gradients and corners and text and whatever else. 00:08:07.800 |
And then you can fine tune it to do the thing you wanna do. 00:08:21.360 |
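A minimal sketch of the kind of ImageNet transfer learning being described here, in fastai-style Python; the dataset path and layout are placeholders rather than anything from the episode:

```python
from fastai.vision.all import *

# Assumed layout: path/train and path/valid, one subfolder per class (placeholder).
path = Path("data/my_images")
dls = ImageDataLoaders.from_folder(path, valid="valid", item_tfms=Resize(224))

# Start from an ImageNet-pretrained backbone that already "knows" edges,
# gradients, corners and textures, then fine-tune it on the new task.
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(3)  # a few epochs is often enough when transferring
```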
we focused from day one on transfer learning, 00:08:27.400 |
It was something not normally even mentioned in, 00:08:30.280 |
I mean, there wasn't much in the way of courses. 00:08:32.680 |
You know, really the courses out there were PhD programs 00:08:39.720 |
that had happened to have recorded their lessons. 00:08:46.840 |
that seemed really useful, you know, work with vision, 00:08:54.880 |
and collaborative filtering and work with text. 00:08:56.760 |
'Cause we felt like those four kind of modalities 00:09:04.480 |
And no one was doing anything much useful with text. 00:09:06.600 |
Everybody was talking about word2vec, you know, 00:09:08.840 |
like king plus queen minus woman and blah, blah, blah. 00:09:18.600 |
but nobody's doing anything like useful with it. 00:09:20.720 |
NLP was all like lemmatization and stop words 00:09:29.640 |
And it was really academic and not practical. 00:09:42.320 |
since I had done cognitive science at university, 00:09:51.000 |
what if there was somebody that could kind of like, 00:09:53.760 |
knew all of the symbolic manipulations required 00:10:03.680 |
with no other way to talk to the outside world 00:10:11.000 |
and then they pass back a piece of paper with Chinese back. 00:10:16.760 |
is actually fantastically good at answering any question 00:10:26.520 |
something that's intelligently working with Chinese? 00:10:29.880 |
Ever since that time, I'd say the most thought, 00:10:34.400 |
and compelling philosophical response is yes. 00:11:18.240 |
You know, whether that means it is intelligent or not 00:11:23.960 |
Yeah, and then when I came across neural nets 00:11:28.520 |
about the universal approximation theorem and stuff. 00:11:35.360 |
take in enough data to be a Chinese room experiment. 00:11:41.880 |
and this kind of like interest in transfer learning, 00:11:59.280 |
how would something learn to kind of answer questions 00:12:05.680 |
And I thought, well, what if we used a language model? 00:12:08.160 |
So language models are already a thing, you know, 00:12:14.240 |
that you could train a model to fill in the gaps, 00:12:17.240 |
or actually in those days it wasn't fill in the gaps, 00:12:20.760 |
And in fact, Andrej Karpathy did his fantastic RNN 00:12:27.520 |
where he showed like you can have it ingest Shakespeare 00:12:35.600 |
I thought, okay, so if I do this at a much bigger scale, 00:12:45.120 |
to finish a sentence in Wikipedia effectively, 00:12:52.560 |
I thought, geez, it would actually have to know 00:12:55.480 |
You know, it'd have to know that there is a world 00:12:58.160 |
and that objects relate to each other through time 00:13:04.560 |
and that when there are animals and there are people 00:13:13.680 |
And then you could, you know, all that together, 00:13:17.800 |
this was signed into law in 2016 by US President X 00:13:27.120 |
what in those days was considered a big language model, 00:13:32.000 |
which is, that was, you know, a bit unheard of. 00:13:40.480 |
what latent capabilities would such a system have 00:13:45.480 |
that would allow it to finish those kinds of sentences? 00:13:53.920 |
based on our work with Transfer Learning and Vision, 00:13:56.040 |
that I could then suck out those latent capabilities 00:14:01.560 |
by fine-tuning it on a task data set or whatever. 00:14:06.400 |
So step one was train a language model on a big corpus, 00:14:14.400 |
and step three was further fine-tune that model on a task. 00:14:18.280 |
And of course that's what everybody still does today, right? 00:14:34.120 |
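As a rough sketch, the three steps in fastai-flavoured Python; the IMDB corpus here is just a stand-in for "a big corpus, a domain corpus, and a task", and the original ULMFiT work used an AWD-LSTM:

```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)  # stand-in corpus for the sketch

# Step 1: start from a language model pretrained on a big general corpus
# (fastai ships AWD_LSTM weights pretrained on Wikipedia).
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
lm = language_model_learner(dls_lm, AWD_LSTM, metrics=accuracy)

# Step 2: fine-tune that language model on the target domain's text.
lm.fine_tune(1)
lm.save_encoder("domain_encoder")

# Step 3: fine-tune again on the downstream task (here, classification).
dls_clas = TextDataLoaders.from_folder(path, valid="test", text_vocab=dls_lm.vocab)
clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
clas.load_encoder("domain_encoder")
clas.fine_tune(1)
```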
And so you asked, to what degree was this kind of like 00:14:37.840 |
pushing against the, you know, established wisdom? 00:14:52.080 |
Everybody was like, "It definitely won't work. 00:14:57.960 |
Language is a much more vastly complicated domain." 00:15:03.000 |
We know from like philosophy and theory of mind 00:15:05.080 |
that it's actually impossible for it to work. 00:15:14.880 |
to actually get the data and like set up the training? 00:15:17.160 |
Or like, were people just lazy and kind of like, 00:15:22.080 |
So like, so the person I thought at that time who, 00:15:24.760 |
there were two people I thought at that time actually 00:15:37.040 |
after I'd released ULMFIT and he had released GPT, 00:15:49.520 |
and Kate was like, "So how did, you know, GPT come about?" 00:15:55.920 |
"that pre-training on a general large corpus wouldn't work, 00:16:01.240 |
"And then I read ULMFIT and turns out it did work. 00:16:25.000 |
I did a very popular talk at KDD, the conference, 00:16:59.560 |
I mean, even like, so we were scooped a little bit 00:17:08.480 |
They had already, I didn't even realize this, 00:17:17.560 |
But again, they didn't create a general purpose 00:17:21.720 |
large language model on a general purpose corpus. 00:17:23.600 |
They only ever tested a domain specific corpus. 00:17:28.280 |
And I haven't spoken to Quoc actually about that, 00:17:28.280 |
of mulling over the Searle Chinese room experiment 00:17:43.720 |
that had convinced me that it probably would work. 00:17:49.120 |
I just dug up Alec's announcement tweet from 2018. 00:17:54.120 |
He said, "Inspired by Kobe, Elmo, and Yola, I'm fit. 00:17:57.640 |
"We showed a single transformer language model 00:18:06.160 |
kind of like the research lab pushing forward the field. 00:18:11.000 |
You know, like kind of like going back five years, 00:18:12.960 |
people think of OpenAI as an overnight success, 00:18:22.960 |
that was kind of diametrically opposed to ULMFiT. 00:18:38.400 |
I think at Salesforce had come out with this neat model 00:19:00.000 |
But yeah, there was a bit of this stuff going on. 00:19:14.640 |
And like, I literally did tours trying to get people 00:19:24.600 |
particularly after GPT showed such good results 00:19:29.240 |
And so I actually feel like we kind of went backwards 00:19:34.640 |
but I kind of got so disappointed and dissuaded by like, 00:19:41.480 |
it felt like these bigger lab, much bigger labs, 00:19:44.800 |
you know, like Fast.ai had only ever been just me 00:19:47.040 |
and Rachel were getting all of this attention 00:19:51.560 |
for an approach I thought was the wrong way to do it. 00:19:54.440 |
You know, I was convinced was the wrong way to do it. 00:19:56.400 |
And so, yeah, for years people were really focused 00:20:00.960 |
And it wasn't until, you know, this key idea of like, 00:20:15.600 |
And then in step three, rather than fine-tuning 00:20:18.080 |
on a reasonably specific task classification, 00:20:20.520 |
let's fine-tune on a RLHF task classification. 00:20:25.040 |
And so that was really, that was really key, you know? 00:20:41.160 |
which I was convinced was not the right direction, 00:20:47.440 |
not at a university, or at least I wasn't then. 00:21:02.360 |
who does not wanna build stuff on lots of big computers 00:21:07.360 |
because most people don't have lots of big computers. 00:21:11.040 |
And I hate creating stuff that most people can't use, 00:21:14.560 |
And also stuff that's created on lots of big computers 00:21:17.600 |
has always been like much more media-friendly. 00:21:23.520 |
but actually throughout my 30 years in data science, 00:21:42.280 |
and they have like terabytes of data available, 00:21:46.480 |
And yeah, that's always what people wanna talk about, 00:21:59.680 |
And to me, it's a huge distraction, you know, 00:22:11.720 |
not the small subset of the most well-off people. 00:22:22.640 |
that a lot of times you're not telling your own story, 00:22:28.200 |
And the other thing before we jump into Fast.ai, 00:22:30.720 |
actually, you know, a lot of people that I know, 00:22:33.720 |
they run across a new architecture and whatnot, 00:22:37.480 |
and raise a bunch of money and do all of this stuff. 00:22:45.120 |
Was it because you already had like a successful, 00:23:09.920 |
And I didn't really know what any of those things were 00:23:12.000 |
really until after we started Kaggle, to be honest. 00:23:14.240 |
Even though I had started to what we now call startups, 00:23:16.520 |
I just thought they were just small businesses. 00:23:20.840 |
So yeah, so those two companies were FastMail 00:23:24.800 |
FastMail was the first kind of synchronized email provider 00:23:30.880 |
So something you can get your same email at home 00:23:34.000 |
on your laptop, at work, on your phone, whatever. 00:23:39.520 |
invented a new approach to insurance pricing, 00:23:43.120 |
something called profit-optimized insurance pricing. 00:23:46.280 |
So I saw both of those companies, you know, after 10 years. 00:23:56.280 |
that as a teenager, I had wanted to do, you know. 00:24:01.760 |
'cause I spent way longer in management consulting 00:24:16.880 |
And I kind of reflected and I was like, "I'm not. 00:24:25.240 |
You know, it's quite nice to have synchronized email. 00:25:06.240 |
I wasn't sure if I'd ever work again, actually. 00:25:16.960 |
Like I wasn't super rich, but I had enough money. 00:25:20.360 |
And I certainly recognize that amongst the other people 00:25:32.480 |
And I thought, I don't want to be one of those idiots 00:25:37.440 |
buying a bigger plane than the next guy or whatever. 00:26:01.320 |
well, how can we be the most helpful to society as a whole 00:26:30.040 |
You know, sadly, it looks like it still is likely to happen, 00:26:45.800 |
your courses, your research that you publish, 00:26:49.240 |
you know, just the other day you published a finding 00:26:52.600 |
on, you know, learning that I think is still something 00:26:56.880 |
that people are still talking about quite a lot. 00:27:02.760 |
of a lot of people who are gonna be, you know, 00:27:05.000 |
little Jeremy Howards furthering your mission with, 00:27:07.280 |
you know, you don't have to do everything by yourself 00:27:10.800 |
You know, that was a big takeaway from like "Analytic" 00:27:14.680 |
was that in "Analytic" it definitely felt like 00:27:25.360 |
And there's a lot of other things I'd like to solve 00:27:27.840 |
So that was definitely the other piece was like, 00:27:42.680 |
Like I find nowadays, at least half the time, 00:27:46.640 |
probably quite a bit more that I get in contact 00:27:50.640 |
with somebody who's done really interesting work 00:28:00.320 |
And I also know from talking to folks at places 00:28:06.320 |
which, you know, there's lots of alumni there 00:28:08.720 |
I got here and like half of the people are Fast.ai alumni. 00:28:22.320 |
and they need to know more about deep learning, 00:28:26.400 |
And the OpenAI Scholars Program was doing the same thing. 00:28:29.640 |
So it's kind of like, yeah, it's had a surprising impact. 00:28:35.360 |
You know, that's just one of like three things we do 00:28:45.320 |
either me and Rachel or me and Sylvain nowadays, 00:28:49.200 |
So yeah, I think it shows you don't necessarily need 00:28:51.400 |
a huge amount of money and a huge team of people 00:29:00.840 |
for people who may not have dived into it much, 00:29:07.520 |
There is the library that is very well loved. 00:29:15.440 |
on top of PyTorch that people should start with PyTorch 00:29:18.600 |
and use it as the basis for a lot of your courses. 00:29:27.200 |
- Oh, so the three areas were research, software, 00:29:32.560 |
- Oh, sorry, I was going by, in terms of software. 00:29:34.760 |
- Software, you know, Fast.ai is the main thing, 00:29:46.120 |
GHAPI, I mean, dozens of open source projects 00:29:55.320 |
and some of them are still a little bit hidden, actually. 00:29:57.640 |
I should, some of them I should try to do a better job 00:30:05.040 |
Like, for example, for working with EC2 and AWS, 00:30:11.920 |
and nice to use than anything else out there. 00:30:16.280 |
dynamic autocomplete that works both on the command line 00:30:25.840 |
I try to make like, when I work with some domain, 00:30:32.080 |
I wanna make it as enjoyable as possible for me to do that. 00:30:38.960 |
I think that GitHub API is incredibly powerful, 00:30:45.600 |
'cause I didn't particularly like the libraries 00:30:50.040 |
it like autocompletes both at the command line 00:31:01.680 |
I think it's like less than a hundred K of code 00:31:09.440 |
from the official OpenAPI spec that GitHub produces. 00:31:14.120 |
And like if you're in GitHub and you just type an API, 00:31:18.960 |
you know, autocomplete API method and hit enter, 00:31:32.080 |
You know, GitHub Actions I can write now in Python, 00:31:46.440 |
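For flavour, a small hedged example of ghapi; the repo is just an illustration, and the method names come from the operation ids in GitHub's OpenAPI spec, which is how ghapi generates them:

```python
from ghapi.all import GhApi

# Token can be passed explicitly or picked up from the GITHUB_TOKEN env var.
api = GhApi(owner="fastai", repo="fastai")

# "issues/list-for-repo" in the spec becomes api.issues.list_for_repo here.
for issue in api.issues.list_for_repo(state="open", per_page=5):
    print(issue.number, issue.title)
```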
You described the third arm of FastAI as research. 00:32:11.720 |
So to me, the main artifact shouldn't be papers 00:32:18.240 |
You know, to me, the main artifacts should be like 00:32:24.480 |
and here's software you can use that builds it in. 00:32:30.440 |
three first person papers in my life, you know? 00:32:33.120 |
And they were, and none of those are ones I wanted to do. 00:32:53.800 |
And it's like, "Okay, well, I want to help you 00:33:00.960 |
which just had to exist and nobody else was writing it. 00:33:04.720 |
And then the third was the Fast.ai library paper, 00:33:16.360 |
We will waive the fee for the journal and everything 00:33:19.200 |
and actually help you get it through publishing and stuff." 00:33:27.120 |
So the research is like, well, so for example, 00:33:39.840 |
of like, who can train neural nets the fastest 00:33:45.840 |
And specifically it was who can train ImageNet the fastest. 00:34:04.840 |
to smash DAWNBench so that they could prove to people 00:34:08.960 |
that they had to use Google Cloud and use their TPUs 00:34:14.040 |
And we kind of thought, "Oh shit, this would be a disaster 00:34:16.400 |
if they do that, because then everybody's going to be like, 00:34:22.040 |
you have to be Google and you have to use special silicon. 00:34:24.320 |
And so, you know, we only found out about this 10 days 00:34:32.160 |
an emergency bunch of our students and Rachel and I 00:34:36.120 |
and sat for the next 10 days and just tried to crunch through 00:34:52.560 |
train on non-square things, you know, stuff like that. 00:34:57.640 |
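A rough sketch of the progressive-resizing trick mentioned here, using fastai and the small Imagenette subset as a stand-in dataset; sizes and epoch counts are illustrative, not the actual DAWNBench settings:

```python
from fastai.vision.all import *

path = untar_data(URLs.IMAGENETTE_160)  # small ImageNet subset, just for the sketch

def get_dls(size):
    # Same data, different image size - the cheap part of progressive resizing.
    return ImageDataLoaders.from_folder(path, valid="val", item_tfms=Resize(size))

# Do most of the training on small, cheap images...
learn = vision_learner(get_dls(96), resnet50, metrics=accuracy)
learn.fine_tune(4)

# ...then swap in larger images for a few final epochs; the weights carry over,
# only the input resolution changes.
learn.dls = get_dls(160)
learn.fine_tune(2)
```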
And so, yeah, we ended up winning, thank God. 00:35:02.080 |
And so, you know, we turned it around from being like, 00:35:05.160 |
like, "Oh shit, you know, this is going to show 00:35:24.320 |
So how do we get better results with less data, 00:35:30.480 |
with less education, you know, stuff like that. 00:35:34.440 |
So ULMFiT is obviously a good example of that. 00:35:42.920 |
Maybe, could you tell the story a little bit behind that? 00:35:48.160 |
into the learning of very low resource literature. 00:36:18.440 |
is that your model has to run on Kaggle within nine hours. 00:36:26.040 |
So you've only got 14 gig RAM, only two CPUs, 00:36:31.800 |
So this is cool, you know, if you can do well at this, 00:36:38.520 |
So yeah, Jono and I were playing around with fine tuning, 00:36:44.800 |
of course, transfer learning, pre-trained language models. 00:36:52.640 |
so we always, you know, plot our losses as we go. 00:36:57.600 |
when he worked with us, created a library called fastprogress, 00:36:59.720 |
which is kind of like tqdm, but we think a lot better. 00:37:05.880 |
and they kind of go down, down, down, down, down, down, 00:37:11.960 |
and then down, down, down, down, down a little bit, 00:37:16.520 |
These clunks are occurring at the end of each epoch. 00:37:23.600 |
this would be, you know, I've seen this before, 00:37:29.880 |
oh, we accidentally forgot to turn on eval mode 00:37:39.080 |
moving average statistics throughout the epoch, 00:37:41.440 |
so, you know, if it's recently moving average or whatever, 00:37:47.240 |
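For reference, the suspected bug here is the classic one below: if you forget `model.eval()`, BatchNorm keeps updating its running statistics on validation batches, which can produce exactly this kind of end-of-epoch jump. A minimal PyTorch illustration:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

def validate(model, batches):
    model.eval()               # freeze BatchNorm running stats, disable dropout
    with torch.no_grad():      # no gradients needed for validation
        for xb in batches:
            _ = model(xb)
    model.train()              # switch back before the next training epoch

# If model.eval() were omitted, each validation pass would keep nudging the
# BatchNorm moving averages, silently changing the model between epochs.
validate(model, [torch.randn(4, 3, 32, 32) for _ in range(2)])
```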
So, you know, I did not give my friends at HuggingFace 00:37:51.160 |
I thought, oh, they've fucked up HuggingFaceTrainer, 00:37:59.800 |
We still saw the clunks, and, you know, that's, 00:38:12.600 |
like nothing happens, or nothing's meant to happen 00:38:22.800 |
So I kind of asked around on the open source discords, 00:38:29.200 |
And everybody was just like, oh, that's just what, 00:38:30.880 |
that's just what these training curves look like. 00:38:35.040 |
And I was like, oh, are you all using Trainer? 00:38:37.480 |
Yes, oh, well, there must be some bug with Trainer. 00:38:40.440 |
And I was like, well, we also saw it in Learner, 00:38:42.160 |
and somebody else was like, no, we've got our own Trainer. 00:38:50.040 |
I can't just be like, here's something that's like, 00:38:55.480 |
nobody ever saw it, and now suddenly we see it. 00:39:03.480 |
This is, was everyone that you're talking to, 00:39:05.960 |
were they all seeing it for the same data set 00:39:08.880 |
- Different data sets, different trainers. 00:39:11.920 |
They're just like, no, this is just what it looks like 00:39:18.720 |
- I hadn't seen it before, but I'd been kind of like, 00:39:32.160 |
I mean, Llama has only been out for a few months, right? 00:39:53.040 |
So yeah, they're just like, no, this is all what we see. 00:39:58.200 |
So yeah, I've got a very kind of like, I don't know, 00:40:01.480 |
I've got this brain where I have to know why things are. 00:40:15.920 |
Like, look at this, the loss has dropped by 0.3. 00:40:19.480 |
0.3, which is like, basically it knows the answer. 00:40:30.040 |
So yeah, so look, Jono and I did not discover this. 00:40:34.160 |
And Jono and I did not come up with a hypothesis. 00:40:39.640 |
to recognize that like, this isn't how it's meant to work. 00:40:42.920 |
And so we, you know, and so we went back and like, 00:40:46.120 |
okay, let's just run some experiments, you know, 00:40:48.920 |
'cause nobody seems to have actually published 00:40:55.040 |
but nobody ever actually stepped back and said like, 00:41:01.880 |
And so, yeah, we created a bunch of experiments 00:41:06.080 |
It's like, okay, if this hypothesis is correct, 00:41:09.520 |
then we ought to see blah under conditions blah, 00:41:25.240 |
which in hindsight, it's not totally surprising 00:41:32.120 |
because the theory, remember, of the ULMFiT theory 00:41:32.120 |
all these latent capabilities to make it easier 00:41:42.000 |
So if it's got all this kind of latent capability, 00:41:45.320 |
it ought to also be really good at compressing new tokens 00:41:48.640 |
because it can immediately recognize it as like, 00:41:58.520 |
but it is, it requires us to rethink everything 00:42:11.760 |
Like maybe it's fine that it's memorized the data set 00:42:22.840 |
Don't, you know, don't, I keep telling people, 00:42:24.520 |
don't track validation loss, track validation accuracy, 00:42:30.920 |
There's another thing that's got lost since ULMFiT, 00:42:33.120 |
nobody tracks accuracy of language models anymore. 00:42:35.840 |
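A minimal sketch of what tracking accuracy rather than loss means for a causal language model; it assumes the usual (batch, seq_len, vocab) logits with the label shift done inside the function:

```python
import torch

def next_token_accuracy(logits, labels, ignore_index=-100):
    """Fraction of next-token predictions that are exactly right."""
    preds = logits[:, :-1].argmax(dim=-1)   # prediction for position t+1
    targets = labels[:, 1:]                 # the actual token at position t+1
    mask = targets != ignore_index          # skip padding / masked labels
    return (preds[mask] == targets[mask]).float().mean().item()

# Toy usage: batch of 2 sequences, length 5, vocabulary of 10 tokens.
logits = torch.randn(2, 5, 10)
labels = torch.randint(0, 10, (2, 5))
print(next_token_accuracy(logits, labels))
```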
But you know, it'll still keep learning and it does, 00:42:44.280 |
You know, like, is it like, now that it's kind of 00:42:47.000 |
memorized it, it's probably getting a less strong signal, 00:42:55.640 |
language models properly and I haven't found anybody 00:42:57.920 |
who feels like they do, like nobody really knows 00:43:05.760 |
it's probably some things that you can do usefully with it. 00:43:14.440 |
- It doesn't come at the cost of catastrophic forgetting 00:43:19.560 |
- It does to some extent, like we know it does, 00:43:26.320 |
So Code Llama was a, I think it was like a 500 billion 00:43:48.560 |
before they released it, me and lots of people 00:43:55.080 |
I hope they kept at least like 50% non-code data, 00:43:58.240 |
'cause otherwise it's gonna forget everything else. 00:44:00.440 |
And they didn't, only like 0.3% of their epochs 00:44:08.920 |
So now it's good at code and it's bad at everything else. 00:44:12.840 |
So we definitely have catastrophic forgetting. 00:44:14.640 |
It's fixable, just somebody has to do, you know, 00:44:17.920 |
somebody has to spend their time training a model 00:44:26.160 |
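A hedged sketch of what keeping a healthy share of non-code data might look like with the Hugging Face `datasets` library; the dataset names are placeholders and the 50/50 split is just the figure mentioned above, not a recommendation baked into the library:

```python
from datasets import load_dataset, interleave_datasets

# Placeholder dataset names - substitute whatever corpora you actually use.
code = load_dataset("my-org/code-corpus", split="train", streaming=True)
general = load_dataset("my-org/general-corpus", split="train", streaming=True)

# Sample roughly half code, half general text during continued pre-training,
# so the model keeps seeing the kind of data it was originally trained on.
mixed = interleave_datasets([code, general], probabilities=[0.5, 0.5], seed=42)

for example in mixed.take(3):   # streaming datasets support .take()
    print(example.keys())
```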
Even though I originally created the three-step approach 00:44:34.040 |
my view is it's actually wrong and we shouldn't use it. 00:44:36.840 |
And that's because people are using it in a way different 00:44:46.000 |
You know, I created it thinking that the task-specific 00:44:51.280 |
You know, it's like, oh, this is like a sentiment classifier. 00:44:57.280 |
but the tasks now are like a, you know, RLHF, 00:45:03.360 |
that make people feel happy about your answer. 00:45:09.440 |
And so we see, for example, RLHF also breaks models, 00:45:18.160 |
we know from kind of the work that Microsoft did, 00:45:21.680 |
you know, the earlier less-aligned version was better. 00:45:36.600 |
is to actually throw away the idea of fine-tuning. 00:45:42.600 |
And pre-training is something where, from the very start, 00:45:46.280 |
you try to include all the kinds of data that you care about, 00:45:49.880 |
all the kinds of problems that you care about, 00:45:55.840 |
general purpose document completion, whatever. 00:45:59.280 |
And then as you train, you gradually curate that, 00:46:05.400 |
you know, you gradually make that higher and higher quality 00:46:36.040 |
And that's why we're seeing a lot of these, you know, 00:46:40.160 |
so-called alignment tax and this view of like, 00:46:42.960 |
"Oh, a model can't both code and do other things." 00:46:49.240 |
- Well, I think you have a clear anti-laziness approach. 00:46:53.800 |
I think other people are not as good-hearted, you know? 00:46:57.440 |
They're like, "Hey, they told me this thing works. 00:46:59.800 |
"And if I release a model this way, people will appreciate it. 00:47:03.160 |
"I'll get promoted and I'll kind of make more money." 00:47:09.440 |
It's like, this is how citations work most badly, you know? 00:47:12.680 |
So if you wanna get cited, you need to write a paper 00:47:15.600 |
that people in your field recognize as an advancement 00:47:22.120 |
And so we've seen this happen again and again. 00:47:24.360 |
So like I say, like zero-shot and few-shot learning, 00:47:32.240 |
everybody just was writing about GANs, you know? 00:47:37.840 |
You know, and I showed again through research 00:47:59.720 |
So it's, yeah, it's not set up for real innovation. 00:48:03.760 |
It's, you know, again, it's really helpful for me, 00:48:18.320 |
So I just write what I think actually matters. 00:48:34.000 |
in which people can focus on like genuine innovation. 00:48:44.320 |
I wanted to follow up on one thing that you mentioned, 00:48:47.280 |
which is that you checked around the open source discords. 00:48:54.760 |
like what discords are lively or useful right now. 00:49:02.120 |
like I missed out on was the early days of EleutherAI, 00:49:09.480 |
And you actually shouted out the alignment lab AI discord 00:49:30.120 |
nearly all of the conversation happens in private channels. 00:49:38.560 |
'Cause it's obviously very, very instructive, right? 00:49:42.880 |
- You could just come to the fast.ai Discord, 00:50:01.920 |
- It's just the nature of quality discussion, right? 00:50:10.680 |
but there was a lot of people who came in with like, 00:50:25.480 |
maybe you don't want to be dismissive or whatever. 00:50:27.520 |
And it's like, oh, well, that's an interesting comment, 00:50:29.280 |
but maybe you should like try training some models first 00:50:41.560 |
I know the people who always have good answers there. 00:50:43.960 |
And so I created a private channel and put them all in it. 00:50:46.720 |
And I got to admit, that's where I post more often 00:50:56.120 |
about how we could solve AGI, blah, blah, blah. 00:51:12.760 |
And then you'll see at the top who the admins or moderators 00:51:27.640 |
You know, I'm interested in talking about this. 00:51:34.960 |
I will say, you know, Eleuther's all pretty open. 00:51:43.400 |
You know, one problem with the Eleuther Discord 00:52:09.520 |
Nous Research that does like the Hermes models 00:52:25.520 |
If you know me, ask me 'cause I've got admin on that one. 00:52:28.960 |
There's also, yeah, OS Skunkworks, OS Skunkworks AI. 00:52:33.880 |
There's a good Discord, which I think it's open. 00:52:49.640 |
We just want people who like wanna build stuff. 00:52:56.080 |
and like, it's fine to not know anything as well, 00:53:05.520 |
If you don't know anything and wanna be told, 00:53:12.760 |
it's gonna take you a really long time to do, 00:53:20.800 |
maybe 5% of people who come in with great enthusiasm 00:53:23.440 |
saying that they wanna learn and they'll do anything. 00:53:29.760 |
So if you're somebody who actually does the work 00:53:32.280 |
and follows up, you will massively stand out. 00:53:38.400 |
And everybody will then want to help you do more work. 00:53:47.880 |
- Our Discord used to be referral only for a long time. 00:53:53.000 |
and then we opened it in the kind of like channel gating. 00:53:58.360 |
I remember it used to be like, you know, a forum moderator. 00:54:00.840 |
It's like, people just wanna do like drive-by posting, 00:54:03.040 |
you know, and like, they don't wanna help the community. 00:54:07.720 |
- I mean, the funny thing is our forum community 00:54:20.800 |
And yeah, we're all somehow in a forum thread 00:54:29.320 |
but then the forums are less active than they used to be 00:54:34.320 |
because Discord has got more popular, you know? 00:54:46.520 |
- All right, we got so many more things we wanna dive in, 00:54:50.760 |
This is not the Lex Fridman podcast we always like to say. 00:54:54.160 |
One topic I would love to maybe chat a bit about 00:55:04.200 |
You recently did a Hacker's Guide to Language Models, 00:55:07.240 |
and you ran through everything from quantized model 00:55:10.360 |
to like smaller models, larger models, and all of that. 00:55:14.160 |
But obviously, Modular is taking its own approach. 00:55:18.600 |
I know you and Chris have been talking about this 00:55:20.320 |
for like years and a lot of the ideas you had, so. 00:55:48.200 |
And so I saw him walk into the courtyard at Google. 00:55:53.200 |
It's just like, "Oh shit, man, it's Chris Latner. 00:55:56.640 |
I wonder if he would lower his standards enough 00:56:05.960 |
He looked a bit lost and I wandered over and was like, 00:56:13.640 |
It's like, "Oh, do you do some of this AI stuff?" 00:56:15.880 |
And I was like, "Yeah, yeah, I like this AI stuff." 00:56:19.760 |
He's like, "Well, I'm thinking about starting 00:56:40.560 |
then I kind of like, I guess I re-caught up with him 00:56:43.440 |
"I've been thinking about everything you said 00:56:46.240 |
And he like narrated back his response to every part of it, 00:56:51.520 |
And it was just like, "Oh, this dude follows up. 00:56:58.240 |
And he was like, "Yeah, so we're gonna create 00:57:02.920 |
And it's gonna be like, it's gonna be a compiler 00:57:05.120 |
with auto-differentiation built in and blah, blah, blah." 00:57:08.320 |
And I was like, "Oh, wait, why would that help?" 00:57:10.200 |
You know, he was like, "Okay, with a compiler 00:57:12.560 |
during the forward pass, you don't have to worry 00:57:16.520 |
'cause it'll all be optimized in the backward." 00:57:19.840 |
'Cause I didn't really know much about compilers, 00:57:28.680 |
basically solves a lot of the problems we have as end users. 00:57:33.880 |
Okay, you do know, right, that nobody's gonna use this 00:57:39.680 |
So I was thinking you should create like a fast AI for this." 00:57:42.360 |
I was like, "Okay, but I don't even know Swift." 00:57:46.440 |
And he was like, "Well, why don't you start learning it? 00:57:53.840 |
Like, not only has Chris Lattner lowered his standards enough 00:57:57.680 |
to talk to me, but he's offering me personal tutoring 00:58:02.800 |
So I was just like, "I'm not gonna let him down." 00:58:10.080 |
And it was just before Christmas that I kind of like 00:58:40.000 |
And I was also like, "I hope he doesn't dislike the fact 00:58:47.360 |
And yeah, he was like, "Oh, thanks for sending me that. 00:58:53.760 |
And we spoke and he was like, "This is amazing. 00:58:59.520 |
And he was like, "And so like somebody set up 00:59:01.280 |
like a new Swift, I can't remember what they call them, 00:59:06.280 |
the equivalent of a PEP, kind of RFC thing of like, 00:59:09.080 |
oh, you know, let's look at how we can implement 00:59:16.920 |
So, you know, and then we ended up like literally teaching 00:59:22.200 |
some lessons together about Swift for TensorFlow 00:59:33.320 |
Then in the end, you know, Google didn't follow through, 00:59:39.760 |
to learn a new programming language is gonna be tough. 00:59:42.880 |
But like, it was very obvious, very, very obvious 00:59:45.200 |
at that time that TensorFlow 2 is gonna be a failure, 00:59:47.640 |
you know, and so this felt like, okay, I, you know, 01:00:00.320 |
'cause it's not gonna, like it's not working. 01:00:03.400 |
You know, nobody at Google's using it internally. 01:00:06.760 |
So, you know, in the end, Chris left, you know, 01:00:16.120 |
So it kind of felt like Google was kind of screwed, 01:00:19.920 |
you know, and Chris went and did something else. 01:00:22.320 |
But we kept talking and I was like, "Look, Chris, you know, 01:00:27.600 |
'Cause like, you know, you've got the ideas, you know, 01:00:36.640 |
'cause like Python's the best of a whole bunch of shit, 01:00:41.640 |
you know, like I would, it's amazing, but it's awful, 01:01:00.480 |
It's gonna build, it's gonna have all the stuff 01:01:14.000 |
building on all the stuff that Chris has figured out over, 01:01:18.760 |
I mean, really from when he did his PhD thesis, 01:01:27.080 |
the TensorFlow runtime engine, which is very good. 01:01:31.160 |
You know, that was something that he built and has lasted. 01:01:43.160 |
and he's created a whole C++ compiler amongst other things, 01:01:49.760 |
I mean, I hope it works because, you know, I mean- 01:01:55.640 |
- But I mean, in the meantime, I will say, you know, 01:02:00.760 |
Google now does have a backup plan, you know, 01:02:09.080 |
and they just decided to build something else. 01:02:11.920 |
And for years, my friends in that team were like, 01:02:14.960 |
'cause we don't want it to be anything but a research project." 01:02:21.040 |
suddenly they're the great white hope for Google's future. 01:02:29.520 |
Like, it would be cool if we had all the benefits of JAX, 01:02:52.200 |
So that's more the kind of language framework level 01:02:58.320 |
some of these other like quantization-focused 01:03:09.560 |
- Well, you won't be surprised to hear me say this, 01:03:18.400 |
So today's zero-shot, few-shot learning equivalent 01:03:24.960 |
is retrieval-augmented generation, you know, RAG, 01:03:28.320 |
which is like, just like few-shot learning is a thing. 01:03:34.200 |
It's not a thing anybody would want to ignore. 01:03:36.320 |
Why are people not spending at least as much effort 01:03:40.400 |
'Cause, you know, RAG is like such an inefficient hack, 01:03:53.440 |
embed it, ask questions about that, you know, 01:03:56.520 |
hope that my embedding model embeds questions 01:04:01.560 |
in the same embedding space as the paragraphs, 01:04:04.120 |
which obviously is not going to, if your question is like, 01:04:06.240 |
if I've got a whole bunch of archive papers embeddings, 01:04:12.040 |
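For concreteness, this is roughly the retrieve-then-prompt loop being called a hack; `embed()` and `generate()` are stand-ins for whatever embedding model and LLM you use, not any particular library's API:

```python
import numpy as np

def embed(texts):       # stand-in: any sentence-embedding model
    raise NotImplementedError

def generate(prompt):   # stand-in: any LLM completion call
    raise NotImplementedError

def rag_answer(question, paragraphs, k=3):
    # Embed the corpus and the question into the same vector space...
    doc_vecs = np.asarray(embed(paragraphs))
    q_vec = np.asarray(embed([question])[0])
    # ...and hope nearest-neighbour search surfaces the right paragraphs.
    sims = doc_vecs @ q_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9)
    context = "\n\n".join(paragraphs[i] for i in np.argsort(-sims)[:k])
    return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```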
in which we can make inference more efficient? 01:04:28.800 |
- No, it's not going to be like, oh, here's one way, 01:04:30.920 |
here's one way, here's a different way in different papers, 01:04:37.520 |
then all of that information is getting directly 01:05:01.280 |
- Something that I think a lot of people are uncertain about, 01:05:06.160 |
is that whether or not you can fine-tune new information in, 01:05:20.360 |
because there's no such thing as fine-tuning, 01:05:29.960 |
So the knowledge got in there in the first place 01:05:49.800 |
You know, I think like in my "Hacker's Guide to LLMs," 01:05:59.320 |
that's a simple example, 'cause it doesn't sound it, 01:06:05.320 |
And it took 15 minutes to train on a single GPU. 01:06:09.560 |
You know, I think that might surprise people, 01:06:11.600 |
so that that capability is at your fingertips, 01:06:44.920 |
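As an illustration of how small that kind of job can be, a hedged sketch of a single-GPU LoRA fine-tune with Hugging Face transformers and peft; the model name, dataset, and hyperparameters are placeholders rather than the actual Kaggle setup from the talk:

```python
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from peft import LoraConfig, get_peft_model

model_name = "EleutherAI/pythia-1b"              # placeholder small model
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the base model with small LoRA adapters - only these weights are trained.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

ds = load_dataset("imdb", split="train[:2000]")  # placeholder task data
ds = ds.map(lambda x: tok(x["text"], truncation=True, max_length=512),
            remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments("out", per_device_train_batch_size=4,
                           num_train_epochs=1, fp16=torch.cuda.is_available()),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```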
There's not much at the 1 to 2B range, sadly. 01:06:48.600 |
but like the fact that there are some really good code ones 01:06:53.200 |
that that's a great size for doing complex tasks well. 01:07:00.160 |
which has been the subject of a little bit of discussion 01:07:13.720 |
so Phi-1 in particular is good at doing a very specific thing, 01:07:16.520 |
which is creating very small Python snippets. 01:07:26.920 |
So it doesn't know who Tom Cruise is, you know. 01:07:36.080 |
It doesn't really know anything about anything. 01:07:39.000 |
Like, 'cause it was never, it's never read anything. 01:07:50.280 |
And so it was a research project and a really good one. 01:07:53.120 |
And it definitely shows us a powerful direction 01:07:55.440 |
in terms of what can you do with synthetic data. 01:08:02.400 |
pretty good math skills, pretty good coding skills. 01:08:11.880 |
Some people have tried to do some fine tunes of it. 01:08:15.120 |
And again, they're like surprisingly good in some ways 01:08:20.640 |
but not sure you'd find it useful for anything. 01:08:24.520 |
- I think that's the struggle of pitching small models 01:08:31.640 |
you don't need a lot of resources to run them, 01:08:33.520 |
but the performance evaluation is always so iffy. 01:08:36.640 |
It's always just like, yeah, it works on some things 01:08:41.840 |
- Yeah, so that's why we're back to fine tuning. 01:08:44.840 |
I would say, so Microsoft did create a Phi-1.5-web, 01:08:51.040 |
I would say a Phi-1.5-web with fine-tuning for your task, 01:09:02.920 |
that people have in their kind of day-to-day lives. 01:09:05.960 |
You know, particularly in kind of an enterprise setting, 01:09:08.880 |
I think there's a lot of like repetitive kind of processing 01:09:16.720 |
'cause I think quite often you can like replace 01:09:18.880 |
some thousands and thousands of lines of complex buggy code, 01:09:31.000 |
I think one question on top of a lot of people's minds. 01:09:34.000 |
So you've done practical deep learning for coders 01:09:44.840 |
If you're somebody who's interested in deep learning today 01:09:53.840 |
Should they focus on, yeah, small model development? 01:09:56.280 |
Should they focus on fine tuning math and all of that? 01:09:59.520 |
Should they just like focus on making rag not a hack 01:10:06.120 |
Yeah, what's a practical deep learning for coders 2024 01:10:12.600 |
I'm trying to figure that out for myself, you know, 01:10:16.360 |
'Cause I definitely feel like things have changed a bit, 01:10:21.280 |
you know, one of the ways in which things have changed 01:10:31.080 |
they're folks who really hadn't coded before a year ago 01:10:34.760 |
and they're using these models to help them build stuff 01:10:44.600 |
well, we need a lot more material to help these people 01:10:49.280 |
'cause they don't really know what they're doing, 01:10:55.760 |
So like, are there things we could do to help people, 01:11:12.120 |
thanks to the help of Codex and Copilot and whatever. 01:11:26.760 |
to being like, let's make coding more accessible, 01:11:30.520 |
you know, kind of AI-oriented coding more accessible. 01:11:34.960 |
If so, our course should probably look very different, 01:11:39.200 |
you know, and we'd have to throw away that like, 01:11:42.480 |
of full-time programming, you know, as a prerequisite. 01:11:46.800 |
Yeah, what would happen if we got rid of that? 01:11:50.520 |
So that's kind of one thought that's in my head. 01:11:56.680 |
honestly, I don't think anybody has any idea, 01:12:05.800 |
like we don't really know how to do anything very well. 01:12:12.320 |
like they seem to be quite good at some things 01:12:19.280 |
Even there, it's clear there's a lot of stuff 01:12:27.080 |
So yeah, we don't really know how to train these models well, 01:12:38.480 |
we don't know what kind of problems they can't do, 01:12:40.080 |
we don't know what good prompting strategies are 01:12:44.200 |
Like somebody sent me a message the other day saying 01:12:47.920 |
they've written something that is a prompting strategy 01:12:55.160 |
They've written like 6,000 lines of Python code 01:13:20.400 |
people were saying like GPT-4 can't play chess. 01:13:34.360 |
it might be just about the best chess engine in the world. 01:13:41.600 |
So I feel like it's all blue sky at this point. 01:13:45.200 |
- It feels like computer vision in 2013 to me, 01:13:59.760 |
So we hadn't yet had the Zeiler and Fergus like, 01:14:01.400 |
oh, this is actually what's going on inside the layers. 01:14:08.440 |
We don't know how to create good training dynamics. 01:14:24.240 |
And so the kind of economically rational thing to do, 01:14:31.160 |
The economic rational thing to do is to like, okay, 01:14:33.160 |
like build that as fast as possible, you know, 01:14:39.560 |
And that's what, you know, open AI in particular did, 01:14:44.840 |
But there's a whole lot of technical debt everywhere. 01:14:50.840 |
You know, nobody's really figured this stuff out 01:14:55.160 |
building what we know works as quickly as possible. 01:14:59.880 |
So yeah, I think there's a huge amount of opportunity to, 01:15:04.800 |
can be made to work a lot faster, a lot less memory. 01:15:11.520 |
I got a whole bunch of ideas I want to try, you know, 01:15:42.240 |
and he was like, yeah, people just didn't think of it, 01:15:45.520 |
didn't try, they didn't come from like a systems background. 01:15:48.440 |
- Yeah, I mean, the thing about flash attention is, 01:15:51.240 |
I mean, lots of people absolutely had thought of that 01:15:56.720 |
But I mean, the honest truth is, particularly before Triton, 01:16:08.400 |
fused attention wasn't tiled, that was stupid. 01:16:16.800 |
be like, oh, well, I'm confident enough in CUDA 01:16:20.960 |
and or Triton to use that insight to write something better. 01:16:27.800 |
And I always talk to Chris about flash attention 01:16:31.640 |
there is a thousand flash attentions out there 01:16:37.760 |
You just gotta make it easy for us to build them. 01:16:46.840 |
You know, it still requires kind of really understanding 01:16:52.480 |
writing it in that kind of very CUDA-ish way. 01:16:57.360 |
I think, you know, if Mojo or something equivalent 01:17:02.520 |
we're gonna see a lot more flash attentions popping up. 01:17:18.800 |
What's something that it's already here today 01:17:21.240 |
in AI that you thought would take much longer? 01:17:39.960 |
And I said, oh, and I put a dot saying we are here. 01:17:45.160 |
And I looked back at the transcript the other day 01:17:53.680 |
in which computers will be better at most human tasks 01:18:09.080 |
took that twice as long as I thought it might. 01:18:11.960 |
Yeah, no, I wouldn't say anything surprised me too much. 01:18:18.800 |
It's still like, definitely like, I gotta admit, 01:18:35.280 |
would exist by about now, maybe a bit earlier. 01:18:38.560 |
But actually using it definitely is different 01:18:41.960 |
to just feeling like it's probably on its way, you know? 01:18:49.280 |
I'm sure, I imagine I'll have the same visceral reaction, 01:19:24.920 |
plot a kind of projected three-dimensional loss surface 01:19:28.840 |
for a ConvNet with and without skip connections. 01:19:38.920 |
and with the skip connections, it was super smooth. 01:19:45.440 |
Like, so there was actually an interesting blog post 01:19:47.480 |
that came out just today from the PyTorch team, 01:19:58.240 |
- Yeah, and they actually showed some nice examples 01:20:07.000 |
if you look at this, we can actually see a bit 01:20:10.080 |
You know, so again, it reminds me of this Zeiler and Fergus, 01:20:13.880 |
you know, ConvNet paper that was the first one 01:20:44.880 |
at what level, and when do they need it, and how often. 01:20:48.640 |
So that kind of like, data set mixing, curation, so forth, 01:20:57.080 |
Yeah, fine tune, what kind of mix do you need 01:21:04.920 |
And what are the kind of underlying capabilities 01:21:07.760 |
And if it loses those, it would lose all these other ones. 01:21:15.320 |
to help it to not forget to do things, stuff like that. 01:21:25.360 |
you want everyone to remember and think about? 01:21:27.880 |
- Yeah, I guess the main thing I want everybody to remember 01:21:30.320 |
is that, you know, there's a lot of people in the world, 01:21:57.280 |
What would happen if all of these people in the world 01:22:05.680 |
Or one might be, wow, of all those people in the world, 01:22:13.080 |
the lives of a lot of humanity if they had this tool. 01:22:29.360 |
between people who think that distributed power is unsafe, 01:22:36.560 |
and people who think that humanity on net, you know, 01:22:46.800 |
particularly when part of a society and a civilization, 01:22:59.680 |
And, you know, I want to see more and more people 01:23:16.640 |
regular people are going to do a lot of really valuable work 01:23:32.320 |
and providing a future for our children to flourish in, 01:23:49.000 |
the elites that we think can be trusted to run it for us. 01:23:54.080 |
about where that leaves us as a society, you know. 01:24:08.280 |
a lot of open source developers, open source communities,