
The End of Finetuning — with Jeremy Howard of Fast.ai


Chapters

0:00 Introduction
1:14 Jeremy’s background
2:53 Founding FastMail and Optimal Decisions
4:05 Starting Fast.ai with Rachel Thomas
5:28 Developing the ULMFiT natural language processing model
10:11 Jeremy’s goal of making AI more accessible
14:30 Fine-tuning language models - issues with memorization and catastrophic forgetting
18:09 The development of GPT and other language models around the same time as ULMFiT
20:00 Issues with validation loss metrics when fine-tuning language models
22:16 Jeremy’s motivation to do valuable work with AI that helps society
26:39 Starting fast.ai to spread AI capabilities more widely
29:27 Overview of fast.ai - courses, library, research
34:20 Using progressive resizing and other techniques to win the DAWNBench competition
38:42 Discovering the single-shot memorization phenomenon in language model fine-tuning
43:13 Why fine-tuning is simply continued pre-training
46:47 Chris Lattner and Modular AI
48:38 Issues with incentives and citations limiting innovation in research
52:49 Joining AI research communities through Discord servers
55:23 Mojo
63:08 Most exciting areas - continued focus on transfer learning and small models
66:56 Pushing capabilities of small models through transfer learning
70:58 Opening up coding through AI to more people
73:51 Current state of AI capabilities compared to computer vision in 2013 - lots of basic research needed
77:08 Lightning Round

Transcript

00:00:00.000 | [MUSIC PLAYING]
00:00:03.400 | Hey, everyone.
00:00:11.960 | Welcome to the Latent Space Podcast.
00:00:13.880 | This is Alessio, partner and CTO-in-Residence
00:00:16.520 | at Decibel Partners.
00:00:17.680 | And I'm joined by my co-host, Swyx, founder of Smol.ai.
00:00:21.280 | Hey, and today we have in the remote studio, Jeremy Howard,
00:00:25.200 | from-- all the way from Australia.
00:00:26.680 | Good morning.
00:00:27.800 | The remote studio, also known as my house.
00:00:30.000 | Good morning.
00:00:31.360 | Nice to see you, Swyx.
00:00:32.800 | Nice to see you, too.
00:00:34.760 | I'm actually very used to seeing you in your mask
00:00:38.000 | as a message to people, but today we're mostly audio.
00:00:41.760 | But thank you for doing the very important public service
00:00:44.840 | of COVID awareness.
00:00:46.540 | I wouldn't say it was a pleasure.
00:00:47.680 | It was all very annoying, and frustrating, and tedious.
00:00:50.720 | But somebody had to do it, so I just did it.
00:00:52.520 | Somebody had to do it, especially somebody
00:00:54.280 | with your profile, I think.
00:00:56.360 | It really drives home the message.
00:00:58.440 | So we tend to really--
00:01:00.560 | we tend to introduce people for them,
00:01:02.120 | and then ask people to fill in the blanks
00:01:03.840 | on the personal side.
00:01:06.000 | Something I did not know about you was that you graduated
00:01:08.840 | with a BA in philosophy from the University of Melbourne.
00:01:12.640 | I assumed you had a PhD.
00:01:15.040 | No, I mean, I barely got through my BA,
00:01:18.600 | because I was working 80 to 100 hour weeks at McKinsey
00:01:23.040 | and Company from 19 years old onwards.
00:01:27.720 | So I actually didn't attend any lectures
00:01:33.160 | in second and third year university.
00:01:35.640 | Well, I guess you didn't need it,
00:01:37.120 | or you're very sort of self-driven and self-motivated.
00:01:39.760 | I just-- I took two weeks off before each exam period
00:01:46.200 | when I was working at McKinsey.
00:01:47.640 | And then, I mean, I can't believe I got away
00:01:49.640 | with this in hindsight.
00:01:50.640 | I would go to all my professors and say,
00:01:52.880 | oh, I was meant to be in your class this semester,
00:01:55.040 | and I didn't quite turn up.
00:01:56.280 | Were there any assignments I was meant to have done, whatever?
00:01:59.040 | And I can't believe all of them let me basically have--
00:02:03.720 | they basically always would say, like, OK, well,
00:02:05.800 | if you can have this written by tomorrow, I'll accept it.
00:02:08.400 | So yeah, stressful way to get through university, but--
00:02:13.120 | Well, it shows that, I guess, you min-maxed the opportunities.
00:02:17.720 | That definitely was a precursor.
00:02:19.760 | Funnily enough, like, in as much as, you know, in philosophy,
00:02:24.200 | the things I found interesting and focused on
00:02:26.960 | in the little bit of time I did spend on it was ethics
00:02:29.960 | and cognitive science.
00:02:31.320 | And it's kind of really amazing that it's now come back around,
00:02:34.520 | and those are actually genuinely useful things
00:02:37.120 | to know about, which I never thought would happen.
00:02:39.240 | A lot of-- yeah, a lot of relevant conversations there.
00:02:43.120 | So you were a consultant for a while,
00:02:45.440 | and then in the magical month of June 1999,
00:02:48.000 | you founded both Optimal Decisions and Fastmail,
00:02:50.920 | which I also briefly used, so thank you for that.
00:02:52.920 | Good for you, yeah, 'cause I had read the statistics,
00:02:55.680 | which is that, like, 90% or something of small businesses
00:02:58.600 | fail, so I thought if I start two businesses,
00:03:01.440 | I have a higher chance.
00:03:02.760 | In hindsight, I was thinking it was
00:03:04.160 | some kind of stochastic thing I didn't have control over,
00:03:06.600 | but it's a bit odd, but anyway.
00:03:10.640 | And then you were president and chief scientist
00:03:12.760 | at Kaggle, which obviously is the competition platform
00:03:19.440 | for machine learning, and then Enlitic,
00:03:23.240 | where you were working on using deep learning
00:03:25.520 | to improve medical diagnostics and clinical decisions.
00:03:28.000 | Yeah, I was actually the first company
00:03:29.280 | to use deep learning in medicine,
00:03:30.800 | so I kind of founded the field.
00:03:33.200 | And even now, that's still, like, a pretty early phase.
00:03:36.680 | And I actually heard you on your new podcast with Tanishq,
00:03:40.560 | where you went very, very deep into the stuff,
00:03:43.480 | the kind of work that he's doing,
00:03:44.720 | such a young prodigy at his age.
00:03:47.320 | Maybe he's too old to be called a prodigy now, ex-prodigy.
00:03:51.080 | No, I think he still counts.
00:03:53.480 | And anyway, just to round out the bio,
00:03:55.720 | you have a lot more other credentials, obviously,
00:03:58.200 | but most recently, you started Fast.ai,
00:04:01.080 | which is still, I guess, your primary identity
00:04:03.720 | with Rachel Thomas.
00:04:05.280 | So welcome. Yeah, she's my wife.
00:04:06.120 | Thanks. Thank you.
00:04:06.960 | Yeah, doing a lot of public service there
00:04:09.720 | with, like, getting people involved in AI.
00:04:11.480 | And I can't imagine a better way to describe it than Fast.ai.
00:04:15.520 | Fast.ai is, you teach people from nothing
00:04:18.160 | to stable diffusion in, you know,
00:04:19.440 | seven weeks or something, and that's amazing.
00:04:22.200 | Yeah, yeah, I mean, it's funny, you know,
00:04:24.480 | when we started that, what was that, like, 2016 or something,
00:04:27.320 | the idea that deep learning was something
00:04:29.160 | that you could make more accessible
00:04:30.840 | was generally considered stupid.
00:04:32.960 | Like, everybody knew that deep learning
00:04:36.320 | was a thing where you got a math
00:04:38.040 | or a computer science PhD, you know,
00:04:40.080 | at one of the five labs
00:04:41.520 | that could give you the appropriate skills.
00:04:44.160 | And then, yeah, basically, coming out
00:04:48.320 | of one of those labs,
00:04:49.560 | you might be able to write some papers.
00:04:52.800 | So yeah, the idea that normal people
00:04:54.400 | could use that technology to do good work
00:04:58.280 | was considered kind of ridiculous when we started it.
00:05:03.040 | And we weren't sure if it was possible either,
00:05:04.600 | but we kind of felt like we had to give it a go
00:05:06.560 | 'cause the alternative was we were pretty sure
00:05:08.720 | that deep learning was on its way to becoming,
00:05:12.320 | you know, the most or one of the most, you know,
00:05:15.480 | important technologies in human history.
00:05:18.600 | And if the only people that could use it
00:05:20.480 | were a handful of computer science PhDs,
00:05:23.080 | that seemed like, A, a big waste,
00:05:25.560 | and B, kind of dangerous.
00:05:28.200 | - Yep.
00:05:29.160 | And, you know, well, I just wanted to know one thing
00:05:32.320 | on your bio that at Kaggle,
00:05:33.760 | you were also the top rank participant in both 2010 and 2011.
00:05:37.800 | So sometimes you see a lot of founders running companies
00:05:40.960 | that are not really in touch with the problem,
00:05:42.640 | but you were clearly building something
00:05:45.040 | that you knew a lot about, which is awesome.
00:05:48.120 | And even, yeah, talking about deep learning,
00:05:50.400 | you created, published a paper on ULMFiT,
00:05:53.480 | which was kind of the predecessor to multitask learning
00:05:56.760 | and a lot of the groundwork
00:05:58.280 | that then went into Transformers.
00:06:00.320 | I read back on the paper
00:06:01.880 | and you trained this AWD-LSTM model,
00:06:04.960 | which, I mean, I did the math
00:06:06.880 | and it was like 24 to 33 million parameters,
00:06:10.080 | depending on what training data set you use.
00:06:12.520 | Today, that's kind of like not even small,
00:06:14.800 | it's like super small.
00:06:15.960 | What were some of the kind of like contrarian takes
00:06:20.560 | that you had at the time,
00:06:21.840 | and maybe set the stage a little bit
00:06:23.960 | for the rest of the audience
00:06:25.960 | on what was kind of like the state of the art,
00:06:29.080 | so to speak, at the time,
00:06:30.480 | and what people were working towards?
00:06:32.360 | - Yeah, the whole thing was a contrarian take.
00:06:34.760 | Okay, so we started Fast.ai, my wife and I,
00:06:39.720 | and we, yeah, so we're trying to think,
00:06:41.760 | okay, how do we make it more accessible?
00:06:43.160 | So when we started thinking about it,
00:06:46.480 | it was probably 2015, and then 2016,
00:06:48.240 | we started doing something about it.
00:06:49.520 | Why is it inaccessible?
00:06:50.760 | Okay, well, A, no one knows how to do it
00:06:54.480 | other than a few number of people.
00:06:56.720 | And then when we'd ask those few number of people,
00:06:58.400 | well, how do you actually get good results?
00:07:00.200 | They would say like, oh, it's like,
00:07:02.400 | you know, a box of tricks that aren't published.
00:07:04.240 | So you have to join one of the labs and learn the tricks.
00:07:08.160 | So a bunch of unpublished tricks,
00:07:10.200 | not much software around,
00:07:13.480 | but thankfully there was Theano and wrappers,
00:07:17.320 | and particularly Lasagne, the wrapper.
00:07:19.240 | But yeah, not much software around,
00:07:23.080 | not much in the way of data sets,
00:07:27.720 | very hard to get started in terms of the compute,
00:07:30.480 | like how do you get that set up?
00:07:32.280 | So, you know, everything was kind of inaccessible.
00:07:36.680 | And, you know, as we started looking into it,
00:07:41.560 | we had a key insight, which was like, you know what?
00:07:45.680 | Most of the compute and data for image recognition,
00:07:50.000 | for example, we don't need to do it.
00:07:53.280 | You know, there's this thing which nobody knows about,
00:07:55.600 | nobody talks about called transfer learning,
00:07:58.920 | where you take somebody else's model,
00:08:00.560 | where they already figured out like how to detect edges
00:08:04.840 | and gradients and corners and text and whatever else.
00:08:07.800 | And then you can fine tune it to do the thing you wanna do.
00:08:11.120 | And we thought that's the key,
00:08:12.600 | that's the key to becoming more accessible
00:08:16.440 | in terms of compute and data requirements.
00:08:19.080 | So when we started Fast.ai,
00:08:21.360 | we focused from day one on transfer learning,
00:08:23.640 | lesson one, in fact, was transfer learning,
00:08:26.080 | literally lesson one.
00:08:27.400 | It was something not normally even mentioned in,
00:08:30.280 | I mean, there wasn't much in the way of courses.
00:08:32.680 | You know, really the courses out there were PhD programs
00:08:39.720 | that had happened to have recorded their lessons.
00:08:41.720 | They would rarely mention it at all.
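A rough sketch of what that transfer-learning recipe looks like in the fastai style; the dataset, architecture, and hyperparameters below are illustrative assumptions, not the actual lesson-one notebook:

```python
from fastai.vision.all import *

# Minimal transfer-learning sketch (illustrative dataset and hyperparameters).
path = untar_data(URLs.PETS)/'images'

def is_cat(fname): return fname[0].isupper()  # in this dataset, cat breeds have capitalized filenames

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))

# Start from an ImageNet-pretrained ResNet, which already knows edges, gradients,
# corners and textures, then fine-tune it briefly on the new task.
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
```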
00:08:43.600 | We wanted to show how to do four things
00:08:46.840 | that seemed really useful, you know, work with vision,
00:08:49.920 | work with tables of data,
00:08:52.720 | work with kind of recommendation systems
00:08:54.880 | and collaborative filtering and work with text.
00:08:56.760 | 'Cause we felt like those four kind of modalities
00:08:59.240 | covered a lot of the stuff that, you know,
00:09:01.840 | are useful in real life.
00:09:04.480 | And no one was doing anything much useful with text.
00:09:06.600 | Everybody was talking about word2vec, you know,
00:09:08.840 | like king plus queen minus woman and blah, blah, blah.
00:09:13.840 | It was like cool experiments,
00:09:18.600 | but nobody's doing anything like useful with it.
00:09:20.720 | NLP was all like lemmatization and stop words
00:09:25.280 | and topic models and bigrams and SVMs.
00:09:29.640 | And it was really academic and not practical.
00:09:33.440 | But yeah, I mean, to be honest,
00:09:36.880 | I've been thinking about this crazy idea
00:09:39.360 | for nearly 30 years,
00:09:42.320 | since I had done cognitive science at university,
00:09:45.200 | where we talked a lot
00:09:46.240 | about Searle's Chinese Room experiment.
00:09:49.920 | This idea of like,
00:09:51.000 | what if there was somebody that could kind of like,
00:09:53.760 | knew all of the symbolic manipulations required
00:09:56.920 | to answer questions in Chinese,
00:10:00.240 | but they didn't speak Chinese.
00:10:01.920 | And they were kind of inside a room
00:10:03.680 | with no other way to talk to the outside world
00:10:06.960 | other than taking in slips of paper
00:10:08.480 | with Chinese written on them.
00:10:09.760 | And then they do all their rules
00:10:11.000 | and then they pass back a piece of paper with Chinese back.
00:10:13.760 | And this room with a person in
00:10:16.760 | is actually fantastically good at answering any question
00:10:19.120 | you give them written in Chinese.
00:10:21.280 | You know, do they understand Chinese?
00:10:24.720 | And is this, you know,
00:10:26.520 | something that's intelligently working with Chinese?
00:10:29.880 | Ever since that time, I'd say the most thought,
00:10:32.840 | to me, the most thoughtful
00:10:34.400 | and compelling philosophical response is yes.
00:10:37.560 | You know, intuitively it feels like no,
00:10:41.720 | because that's just because we can't imagine
00:10:43.760 | such a large kind of system.
00:10:45.720 | But, you know, if it looks like a duck
00:10:49.320 | and acts like a duck, it's a duck, you know,
00:10:51.760 | or to all intents and purposes.
00:10:54.040 | And so I always kind of thought, you know,
00:10:55.200 | so this is basically a kind of analysis
00:10:58.120 | of the limits of text.
00:11:00.600 | And I kind of felt like,
00:11:01.440 | yeah, if something could ingest enough text
00:11:04.520 | and could use the patterns it saw
00:11:09.240 | to then generate text in response to text,
00:11:13.040 | it could appear to be intelligent.
00:11:18.240 | You know, whether that means it is intelligent or not
00:11:21.120 | is a different discussion
00:11:22.280 | and not one I find very interesting.
00:11:23.960 | Yeah, and then when I came across neural nets
00:11:25.880 | when I was about 20,
00:11:27.680 | you know, what I learned
00:11:28.520 | about the universal approximation theorem and stuff.
00:11:30.680 | And I started thinking like,
00:11:31.720 | oh, I wonder if like a neural net
00:11:33.400 | could ever get big enough,
00:11:35.360 | take in enough data to be a Chinese room experiment.
00:11:40.360 | You know, with that background
00:11:41.880 | and this kind of like interest in transfer learning,
00:11:44.880 | you know, I'd been thinking about this thing
00:11:47.400 | for kind of 30 years and I thought like,
00:11:48.840 | oh, I wonder if we're there yet, you know,
00:11:51.160 | 'cause we have a lot of text.
00:11:53.600 | Like I can literally download Wikipedia,
00:11:56.320 | which is a lot of text.
00:11:58.320 | And I thought, you know,
00:11:59.280 | how would something learn to kind of answer questions
00:12:03.840 | or, you know, respond to text?
00:12:05.680 | And I thought, well, what if we used a language model?
00:12:08.160 | So language models are already a thing, you know,
00:12:10.280 | they were not a popular or well-known thing,
00:12:11.880 | but they were a thing.
00:12:12.720 | But language models exist to this idea
00:12:14.240 | that you could train a model to fill in the gaps,
00:12:17.240 | or actually in those days it wasn't fill in the gaps,
00:12:18.960 | it was finish a string.
00:12:20.760 | And in fact, Andrej Karpathy did his fantastic RNN
00:12:24.640 | demonstration from this at a similar time
00:12:27.520 | where he showed like you can have it ingest Shakespeare
00:12:32.160 | and it will generate something that
00:12:34.120 | looks a bit like Shakespeare.
00:12:35.600 | I thought, okay, so if I do this at a much bigger scale,
00:12:39.840 | using all of Wikipedia,
00:12:41.680 | what would it need to be able to do
00:12:45.120 | to finish a sentence in Wikipedia effectively,
00:12:49.560 | to do it quite accurately quite often?
00:12:52.560 | I thought, geez, it would actually have to know
00:12:54.120 | a lot about the world.
00:12:55.480 | You know, it'd have to know that there is a world
00:12:57.000 | and that there are objects
00:12:58.160 | and that objects relate to each other through time
00:13:00.520 | and cause each other to react in ways
00:13:02.560 | and that causes precede effects
00:13:04.560 | and that when there are animals and there are people
00:13:09.000 | and that people can be in certain positions
00:13:12.320 | during certain timeframes.
00:13:13.680 | And then you could, you know, all that together,
00:13:15.440 | you can then finish a sentence like,
00:13:17.800 | this was signed into law in 2016 by US President X
00:13:22.080 | and it would fill in the name, you know?
00:13:24.480 | So that's why I tried to create a,
00:13:27.120 | what in those days was considered a big language model,
00:13:30.320 | trained on the entirety of Wikipedia,
00:13:32.000 | which is, that was, you know, a bit unheard of.
00:13:33.920 | And my interest was not in,
00:13:35.440 | you know, just having a language model,
00:13:38.640 | my interest was in like,
00:13:40.480 | what latent capabilities would such a system have
00:13:45.480 | that would allow it to finish those kinds of sentences?
00:13:50.760 | Because I was pretty sure,
00:13:53.920 | based on our work with Transfer Learning and Vision,
00:13:56.040 | that I could then suck out those latent capabilities
00:13:59.640 | by transfer learning, you know,
00:14:01.560 | by fine-tuning it on a task data set or whatever.
00:14:04.200 | So we generated this three-step system.
00:14:06.400 | So step one was train a language model on a big corpus,
00:14:09.560 | step two was fine-tune a language model
00:14:12.760 | on a more curated corpus,
00:14:14.400 | and step three was further fine-tune that model on a task.
00:14:18.280 | And of course that's what everybody still does today, right?
00:14:21.000 | That's what ChatGPT is.
00:14:22.840 | And so the first time I tried it,
00:14:26.880 | within hours, I had a new state-of-the-art
00:14:29.360 | academic result on IMDb.
00:14:31.480 | And I was like, "Holy shit, it does work."
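The three-step recipe maps onto fastai roughly as in the sketch below, with IMDb standing in as the task corpus. Step one, the general Wikipedia language model, is baked into the pretrained AWD-LSTM weights that fastai downloads; the schedule here is an illustrative assumption that skips the paper's gradual unfreezing and discriminative learning rates:

```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Step 2: fine-tune the Wikipedia-pretrained AWD-LSTM language model on IMDb text.
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
lm = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()])
lm.fine_tune(1, 2e-2)            # illustrative schedule, not the paper's exact recipe
lm.save_encoder('imdb_encoder')

# Step 3: reuse that fine-tuned encoder for the downstream classification task.
dls_clas = TextDataLoaders.from_folder(path, valid='test', text_vocab=dls_lm.vocab)
clf = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
clf.load_encoder('imdb_encoder')
clf.fine_tune(3, 1e-2)
```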
00:14:34.120 | And so you asked, to what degree was this kind of like
00:14:37.840 | pushing against the, you know, established wisdom?
00:14:40.480 | You know, every way.
00:14:41.400 | Like the reason it took me so long to try it
00:14:43.680 | was 'cause I asked all my friends in NLP
00:14:47.280 | if this could work, and everybody said, "No,
00:14:49.760 | it definitely won't work."
00:14:51.040 | It wasn't like, "Oh, maybe."
00:14:52.080 | Everybody was like, "It definitely won't work.
00:14:55.280 | NLP is much more complicated than vision.
00:14:57.960 | Language is a much more vastly complicated domain."
00:15:00.800 | You know, and you've got problems
00:15:01.760 | like the grounding problem.
00:15:03.000 | We know from like philosophy and theory of mind
00:15:05.080 | that it's actually impossible for it to work.
00:15:07.320 | So yeah, so don't waste your time.
00:15:10.680 | - Jeremy, had people not tried
00:15:12.400 | because it was like too complicated
00:15:14.880 | to actually get the data and like set up the training?
00:15:17.160 | Or like, were people just lazy and kind of like,
00:15:19.720 | "Hey, this is just not gonna work."
00:15:20.560 | - No, I mean, it wasn't lazy.
00:15:22.080 | So like, so the person I thought at that time who,
00:15:24.760 | there were two people I thought at that time actually
00:15:26.440 | who were the strongest at language models
00:15:28.120 | were Stephen Merity and Alec Radford.
00:15:31.760 | And at the time I didn't know Alec,
00:15:34.640 | but I, after we had both,
00:15:37.040 | after I'd released ULMFIT and he had released GPT,
00:15:39.960 | I organized a chat for both of us
00:15:43.960 | with Cade Metz of the New York Times,
00:15:46.160 | and Cade Metz asked, sorry,
00:15:47.880 | and Alec answered this question for Cade,
00:15:49.520 | and Cade was like, "So how did, you know, GPT come about?"
00:15:53.720 | And he said, "Well, I was pretty sure
00:15:55.920 | "that pre-training on a general large corpus wouldn't work,
00:15:59.440 | "so I hadn't tried it.
00:16:01.240 | "And then I read ULMFIT and turns out it did work.
00:16:05.520 | "And so I did it, you know, bigger
00:16:08.360 | "and it worked even better."
00:16:09.520 | And similar with Stephen, you know,
00:16:11.000 | I asked Stephen Merity, like,
00:16:12.200 | "Why don't we just, you know,
00:16:15.320 | "take your AWD-LSTM and like train it
00:16:17.640 | "on all of Wikipedia and fine-tune it?"
00:16:19.280 | And he was kind of like,
00:16:20.680 | "I don't think that's gonna really fly."
00:16:23.600 | Like two years before,
00:16:25.000 | I did a very popular talk at KDD, the conference,
00:16:29.840 | where everybody in NLP was in the audience.
00:16:33.760 | I recognized half the faces, you know,
00:16:36.600 | and I told them all this,
00:16:37.800 | I'm sure transfer learning is the key.
00:16:40.400 | I'm sure ImageNet, you know,
00:16:44.080 | is gonna be an NLP thing as well.
00:16:47.040 | And, you know, everybody was interested
00:16:50.320 | and people asked me questions afterwards.
00:16:53.760 | But just, yeah, nobody followed up
00:16:55.720 | because everybody knew that it didn't work.
00:16:59.560 | I mean, even like, so we were scooped a little bit
00:17:05.440 | by Dai and Le, Quoc Le, at Google.
00:17:08.480 | They had already, I didn't even realize this,
00:17:11.880 | which is a bit embarrassing.
00:17:12.720 | They had already done a large language model
00:17:15.840 | and fine tuned it.
00:17:17.560 | But again, they didn't create a general purpose
00:17:21.720 | large language model on a general purpose corpus.
00:17:23.600 | They only ever tested a domain specific corpus.
00:17:28.280 | And I haven't spoken to Quoc actually about that,
00:17:30.760 | but I assume that the reason was the same.
00:17:33.120 | It probably just didn't occur to them
00:17:36.080 | that the general approach could work.
00:17:38.680 | So maybe it was that kind of 30 years
00:17:40.440 | of mulling over Searle's Chinese Room experiment
00:17:43.720 | that had convinced me that it probably would work.
00:17:46.920 | I don't know.
00:17:48.080 | - Yeah, interesting.
00:17:49.120 | I just dug up Alec's announcement tweet from the OpenAI team.
00:17:54.120 | He said, "Inspired by CoVe, ELMo, and ULMFiT,
00:17:57.640 | "we show that a single transformer language model
00:17:59.520 | "can be fine-tuned to a wide variety."
00:18:02.160 | It's interesting because, you know,
00:18:03.400 | today people think of OpenAI as the leader,
00:18:06.160 | kind of like the research lab pushing forward the field.
00:18:09.800 | What was that at the time?
00:18:11.000 | You know, like kind of like going back five years,
00:18:12.960 | people think of OpenAI as an overnight success,
00:18:15.000 | but obviously it took a while.
00:18:16.800 | - Yeah, yeah, no, I mean, absolutely.
00:18:18.440 | And I'll say like, it's interesting
00:18:20.320 | that it mentioned ELMo because in some ways
00:18:22.960 | that was kind of diametrically opposed to ULMFiT.
00:18:26.840 | You know, there was these kind of like,
00:18:29.040 | so there was a lot of activity
00:18:31.840 | at the same time as ULMFiT's release.
00:18:34.000 | So there was, so before it, Bryan McCann,
00:18:38.400 | I think at Salesforce had come out with this neat model
00:18:43.000 | that did a kind of multitask learning,
00:18:46.200 | but again, they didn't create a general
00:18:49.240 | fine-tune language model first.
00:18:50.760 | There was ELMo, which I think was, you know,
00:18:53.320 | actually quite a few months
00:18:55.560 | after the first ULMFiT example, I think.
00:19:00.000 | But yeah, there was a bit of this stuff going on.
00:19:01.360 | And the problem was everybody was doing,
00:19:06.200 | and particularly after GPT came out then,
00:19:08.360 | everybody wanted to focus on zero-shot
00:19:10.360 | and few-shot learning.
00:19:11.400 | You know, everybody hated fine-tuning.
00:19:13.240 | Everybody hated transfer learning.
00:19:14.640 | And like, I literally did tours trying to get people
00:19:18.320 | to start doing transfer learning.
00:19:20.200 | And, you know, nobody was interested,
00:19:24.600 | particularly after GPT showed such good results
00:19:27.040 | with zero-shot and few-shot learning.
00:19:29.240 | And so I actually feel like we kind of went backwards
00:19:31.520 | for years and not to be honest,
00:19:33.400 | I mean, I'm a bit sad about this now,
00:19:34.640 | but I kind of got so disappointed and dissuaded by like,
00:19:41.480 | it felt like these bigger lab, much bigger labs,
00:19:44.800 | you know, like Fast.ai had only ever been just me
00:19:47.040 | and Rachel were getting all of this attention
00:19:51.560 | for an approach I thought was the wrong way to do it.
00:19:54.440 | You know, I was convinced was the wrong way to do it.
00:19:56.400 | And so, yeah, for years people were really focused
00:19:59.200 | on getting better at zero-shot and few-shot.
00:20:00.960 | And it wasn't until, you know, this key idea of like,
00:20:04.520 | well, let's take the ULMFiT approach,
00:20:06.720 | but for step two, rather than fine-tuning
00:20:10.440 | on a kind of a domain corpus,
00:20:12.600 | let's fine-tune on an instruction corpus.
00:20:15.600 | And then in step three, rather than fine-tuning
00:20:18.080 | on a reasonably specific task classification,
00:20:20.520 | let's fine-tune on a RLHF task classification.
00:20:25.040 | And so that was really, that was really key, you know?
00:20:27.640 | So I was kind of like out of the NLP field
00:20:30.720 | for a few years there because yeah,
00:20:33.560 | it just felt like, I don't know,
00:20:36.560 | pushing uphill against this vast tide,
00:20:41.160 | which I was convinced was not the right direction,
00:20:43.360 | but who's gonna listen to me, you know?
00:20:44.800 | 'Cause as you said, I don't have a PhD,
00:20:47.440 | not at a university, or at least I wasn't then.
00:20:50.120 | I don't have a big set of computers
00:20:52.640 | to fine-tune huge transformer models.
00:20:56.120 | So yeah, it was definitely difficult.
00:20:58.320 | It's always been hard.
00:20:59.360 | You know, it's always been hard.
00:21:01.200 | Like I've always been somebody
00:21:02.360 | who does not wanna build stuff on lots of big computers
00:21:07.360 | because most people don't have lots of big computers.
00:21:11.040 | And I hate creating stuff that most people can't use,
00:21:13.640 | you know?
00:21:14.560 | And also stuff that's created on lots of big computers
00:21:17.600 | has always been like much more media-friendly.
00:21:20.920 | So like, it might seem like a recent thing,
00:21:23.520 | but actually throughout my 30 years in data science,
00:21:26.080 | the attention's always been on, you know,
00:21:29.400 | the big iron results.
00:21:31.880 | So when I first started,
00:21:32.880 | everybody was talking about data warehouses
00:21:35.880 | and it was all about Teradata.
00:21:37.680 | And it'd be like, oh, this big bank
00:21:39.240 | has this huge room full of computers
00:21:42.280 | and they have like terabytes of data available,
00:21:45.240 | you know, the press of a button.
00:21:46.480 | And yeah, that's always what people wanna talk about,
00:21:50.720 | what people wanna write about.
00:21:52.480 | And then of course,
00:21:54.120 | students coming out of their PhDs and stuff,
00:21:56.160 | that's where they wanna go work
00:21:57.440 | 'cause that's where they read about.
00:21:59.680 | And to me, it's a huge distraction, you know,
00:22:03.520 | because like I say,
00:22:05.640 | most people don't have unlimited compute.
00:22:10.000 | And I wanna help most people,
00:22:11.720 | not the small subset of the most well-off people.
00:22:16.720 | - Yeah, that's awesome.
00:22:18.440 | And it's great to hear, you know,
00:22:20.320 | you do such a great job educating
00:22:22.640 | that a lot of times you're not telling your own story,
00:22:25.240 | you know?
00:22:26.080 | So I love this conversation.
00:22:28.200 | And the other thing before we jump into Fast.ai,
00:22:30.720 | actually, you know, a lot of people that I know,
00:22:33.720 | they run across a new architecture and whatnot,
00:22:35.920 | they're like, I gotta start a company
00:22:37.480 | and raise a bunch of money and do all of this stuff.
00:22:39.360 | And say, you were like,
00:22:40.600 | I want everybody to have access to this.
00:22:42.600 | Why was that the case for you?
00:22:45.120 | Was it because you already had like a successful,
00:22:47.320 | you know, venture in like FastMail
00:22:49.400 | and you were more interested in that?
00:22:50.760 | What was the reasoning?
00:22:52.520 | - That's a really good question.
00:22:54.080 | So I guess the answer is yes.
00:22:56.960 | It is, that's the reason why.
00:22:58.560 | So when I was a teenager,
00:23:00.800 | I thought it would be really cool to like,
00:23:03.280 | have my own company.
00:23:05.160 | You know, I didn't know the word startup.
00:23:06.600 | I didn't know the word entrepreneur.
00:23:08.160 | I didn't know the word VC.
00:23:09.920 | And I didn't really know what any of those things were
00:23:12.000 | really until after we started Kaggle, to be honest.
00:23:14.240 | Even though I had started to what we now call startups,
00:23:16.520 | I just thought they were just small businesses.
00:23:19.120 | You know, they were just companies.
00:23:20.840 | So yeah, so those two companies were FastMail
00:23:23.960 | and Optimal Decisions.
00:23:24.800 | FastMail was the first kind of synchronized email provider
00:23:29.440 | for non-businesses.
00:23:30.880 | So something you can get your same email at home
00:23:34.000 | on your laptop, at work, on your phone, whatever.
00:23:37.600 | And then Optimal Decisions
00:23:39.520 | invented a new approach to insurance pricing,
00:23:43.120 | something called profit-optimized insurance pricing.
00:23:46.280 | So I sold both of those companies, you know, after 10 years.
00:23:52.040 | And at that point, I had achieved the thing
00:23:56.280 | that as a teenager, I had wanted to do, you know.
00:24:00.600 | It took a lot longer than it should have
00:24:01.760 | 'cause I spent way longer in management consulting
00:24:03.560 | than I should have 'cause I got caught up
00:24:04.880 | in that stupid rat race.
00:24:06.280 | But you know, eventually I got there
00:24:08.040 | and I remember my mom saying to me,
00:24:10.680 | "Oh, you must be so proud."
00:24:12.200 | You know, 'cause she remembered my dream.
00:24:14.640 | She was like, "You've done it."
00:24:16.880 | And I kind of reflected and I was like, "I'm not.
00:24:21.000 | "I'm not proud at all."
00:24:22.640 | You know, like people quite liked FastMail.
00:24:25.240 | You know, it's quite nice to have synchronized email.
00:24:27.400 | It probably would have happened anyway.
00:24:29.400 | Yeah, I'm certainly not proud
00:24:32.120 | that I've helped some insurance companies
00:24:34.560 | suck more money out of their customers.
00:24:36.680 | Yeah, no, I'm not proud.
00:24:39.000 | You know, it's actually,
00:24:41.480 | I haven't really helped the world very much.
00:24:44.040 | You know, maybe in the insurance case
00:24:45.440 | I've made it a little bit worse.
00:24:47.280 | I don't know.
00:24:48.680 | So yeah, I was determined
00:24:51.920 | to not waste more years of my life
00:24:55.960 | doing things, working hard to do things
00:24:58.440 | which I could not be reasonably sure
00:25:00.480 | would have a lot of value.
00:25:02.200 | So, you know, I took some time off.
00:25:06.240 | I wasn't sure if I'd ever work again, actually.
00:25:08.440 | I didn't particularly want to
00:25:09.760 | 'cause it felt like, yeah,
00:25:10.800 | it felt like such a disappointment.
00:25:12.560 | But you know, and I didn't need to.
00:25:15.840 | I had enough money.
00:25:16.960 | Like I wasn't super rich, but I had enough money.
00:25:18.640 | I didn't need to work.
00:25:20.360 | And I certainly recognize that amongst the other people
00:25:23.040 | I knew who had enough money
00:25:25.480 | that they didn't need to work,
00:25:26.640 | they all worked ridiculously hard.
00:25:28.840 | You know, and constantly put themselves
00:25:30.520 | in extremely stressful situations.
00:25:32.480 | And I thought, I don't want to be one of those idiots
00:25:34.280 | who's tied to, you know,
00:25:37.440 | buying a bigger plane than the next guy or whatever.
00:25:42.400 | You know, Kaggle came along
00:25:43.720 | and I mainly kind of did that
00:25:44.720 | just 'cause it was fun and interesting
00:25:46.960 | to hang out with interesting people.
00:25:49.360 | But, you know, with Fast.ai in particular,
00:25:53.440 | you know, Rachel and I had a very explicit,
00:25:57.000 | you know, long series of conversations
00:25:59.800 | over a long period of time about like,
00:26:01.320 | well, how can we be the most helpful to society as a whole
00:26:06.320 | and particularly to those people
00:26:08.880 | who maybe need more help, you know?
00:26:11.200 | And so we definitely saw the world going
00:26:13.280 | in a potentially pretty dystopian direction
00:26:17.720 | if the world's most powerful technology
00:26:19.680 | was controlled by a small group of elites.
00:26:23.560 | So we thought, yeah, we should focus
00:26:26.920 | on trying to help that not happen.
00:26:30.040 | You know, sadly, it looks like it still is likely to happen,
00:26:33.320 | but I mean, I feel like we've helped make it
00:26:37.240 | a little bit less likely.
00:26:38.320 | So we've done our-
00:26:39.640 | - You've shown that it's possible.
00:26:41.640 | And I think your constant advocacy,
00:26:45.800 | your courses, your research that you publish,
00:26:49.240 | you know, just the other day you published a finding
00:26:52.600 | on, you know, learning that I think is still something
00:26:56.880 | that people are still talking about quite a lot.
00:26:59.000 | I think that that is the origin story
00:27:02.760 | of a lot of people who are gonna be, you know,
00:27:05.000 | little Jeremy Howards furthering your mission with,
00:27:07.280 | you know, you don't have to do everything by yourself
00:27:09.120 | is what I'm saying.
00:27:09.960 | - No, definitely, definitely.
00:27:10.800 | You know, that was a big takeaway from like Enlitic
00:27:14.680 | was that in Enlitic it definitely felt like
00:27:16.240 | we had to do everything ourselves.
00:27:17.880 | And I kind of, I wanted to solve medicine.
00:27:20.120 | I was like, yeah, okay,
00:27:20.960 | solving medicine is actually quite difficult
00:27:22.720 | and I can't do it on my own.
00:27:25.360 | And there's a lot of other things I'd like to solve
00:27:26.760 | and I can't do those either.
00:27:27.840 | So that was definitely the other piece was like,
00:27:30.440 | yeah, you know, can we create an army
00:27:35.800 | of passionate domain experts who can change,
00:27:40.240 | they're a little part of the world.
00:27:41.720 | And that's definitely happened.
00:27:42.680 | Like I find nowadays, at least half the time,
00:27:46.640 | probably quite a bit more that I get in contact
00:27:50.640 | with somebody who's done really interesting work
00:27:52.960 | in some domain.
00:27:54.120 | Most of the time I'd say they say,
00:27:55.640 | yeah, I got my start with Fast.ai.
00:27:57.400 | So it's definitely, I can see that.
00:28:00.320 | And I also know from talking to folks at places
00:28:04.080 | like Amazon and Adobe and stuff,
00:28:06.320 | which, you know, there's lots of alumni there
00:28:07.880 | and they say, oh my God,
00:28:08.720 | I got here and like half of the people are Fast.ai alumni.
00:28:12.200 | So it's fantastic.
00:28:14.600 | - Yeah, actually Andrej Karpathy grabbed me
00:28:16.640 | when I saw him at NeurIPS a few years ago.
00:28:18.560 | And he was like, I have to tell you,
00:28:19.720 | thanks for the Fast.ai courses.
00:28:21.280 | When people come to Tesla
00:28:22.320 | and they need to know more about deep learning,
00:28:24.280 | we always send them to your course.
00:28:26.400 | And the OpenAI Scholars Program was doing the same thing.
00:28:29.640 | So it's kind of like, yeah, it's had a surprising impact.
00:28:35.360 | You know, that's just one of like three things we do
00:28:39.600 | is the course, you know.
00:28:41.040 | And it's only ever been at most two people,
00:28:45.320 | either me and Rachel or me and Sylvain nowadays,
00:28:47.640 | it's just me.
00:28:49.200 | So yeah, I think it shows you don't necessarily need
00:28:51.400 | a huge amount of money and a huge team of people
00:28:54.560 | to make an impact.
00:28:57.800 | - Yeah, so just to reintroduce Fast.ai
00:29:00.840 | for people who may not have dived into it much,
00:29:05.000 | there is the courses that you do.
00:29:07.520 | There is the library that is very well loved.
00:29:12.240 | And I kind of think of it as a nicer layer
00:29:15.440 | on top of PyTorch that people should start with PyTorch
00:29:18.600 | and use it as the basis for a lot of your courses.
00:29:21.280 | And then you have like NBDev,
00:29:24.960 | which I don't know, is that the third one?
00:29:27.200 | - Oh, so the three areas were research, software,
00:29:31.120 | and courses.
00:29:32.560 | - Oh, sorry, I was going by, in terms of software.
00:29:34.760 | - Software, you know, Fast.ai is the main thing,
00:29:39.760 | but NBDev is not far behind.
00:29:42.800 | But then there's also things like fastcore,
00:29:46.120 | ghapi, I mean, dozens of open source projects
00:29:50.160 | that I've created.
00:29:51.000 | And some of them have been pretty popular
00:29:55.320 | and some of them are still a little bit hidden, actually.
00:29:57.640 | I should, some of them I should try to do a better job
00:30:00.360 | of telling people about.
00:30:01.280 | - What are you thinking about?
00:30:02.600 | Yeah, what's on the--
00:30:03.440 | - Oh, no, no, just like little things.
00:30:05.040 | Like, for example, for working with EC2 and AWS,
00:30:07.840 | I created a FastEC2 library,
00:30:09.520 | which I think is like way more convenient
00:30:11.920 | and nice to use than anything else out there.
00:30:14.400 | And it's literally got a whole autocomplete,
00:30:16.280 | dynamic autocomplete that works both on the command line
00:30:19.080 | and in notebooks.
00:30:20.400 | It'll like autocomplete your instance names
00:30:22.400 | and everything like that.
00:30:24.080 | You know, just little things like that.
00:30:25.840 | I try to make like, when I work with some domain,
00:30:30.280 | I try to make it like,
00:30:32.080 | I wanna make it as enjoyable as possible for me to do that.
00:30:35.600 | So I always try to kind of like,
00:30:37.040 | like with ghapi, for example,
00:30:38.960 | I think the GitHub API is incredibly powerful,
00:30:43.160 | but I didn't find it good to work with
00:30:45.600 | 'cause I didn't particularly like the libraries
00:30:47.040 | that were out there.
00:30:47.880 | So like ghapi, like FastEC2,
00:30:50.040 | it like autocompletes both at the command line
00:30:53.440 | or in a notebook or whatever,
00:30:55.040 | like literally the entire GitHub API.
00:30:59.640 | The entire thing is like,
00:31:01.680 | I think it's like less than a hundred K of code
00:31:03.640 | because it's actually, as far as I know,
00:31:06.960 | the only one that grabs it directly
00:31:09.440 | from the official OpenAPI spec that GitHub produces.
00:31:14.120 | And like if you're in GitHub and you just type an API,
00:31:18.960 | you know, autocomplete API method and hit enter,
00:31:25.440 | it prints out the docs, or the brief docs,
00:31:28.760 | and then gives you a link
00:31:29.600 | to the actual documentation page.
00:31:32.080 | You know, GitHub Actions I can write now in Python,
00:31:34.680 | which is just so much easier
00:31:36.000 | than writing them in TypeScript and stuff.
00:31:38.760 | So, you know, just little things like that.
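A small taste of what that feels like in practice, as a hedged sketch; the repository names and token below are illustrative assumptions, and the method names follow GitHub's own OpenAPI operation groups:

```python
from ghapi.all import GhApi

# Hedged ghapi sketch: repo names and token are illustrative assumptions.
# Every operation in GitHub's OpenAPI spec shows up as a tab-completable method.
api = GhApi(token='ghp_your_token_here')

repo = api.repos.get(owner='fastai', repo='ghapi')   # metadata for one repository
print(repo.stargazers_count, repo.description)

issues = api.issues.list_for_repo(owner='fastai', repo='ghapi', state='open')
for issue in issues:
    print(issue.number, issue.title)
```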
00:31:41.120 | - I think that's an approach
00:31:42.240 | that I wish more developers took
00:31:44.240 | to publish some of their work along the way.
00:31:46.440 | You described the third arm of FastAI as research.
00:31:51.120 | It's not something I see often.
00:31:53.000 | Obviously you do do some research
00:31:54.760 | and how do you run your research?
00:31:58.240 | What are your research interests?
00:31:59.920 | - Yeah, so research is what I spend
00:32:01.840 | the vast majority of my time on.
00:32:04.160 | And the artifacts that come out of that
00:32:08.240 | are largely software and courses, you know?
00:32:11.720 | So to me, the main artifact shouldn't be papers
00:32:15.120 | 'cause papers are things read
00:32:16.200 | by a small exclusive group of people.
00:32:18.240 | You know, to me, the main artifacts should be like
00:32:20.800 | something teaching you people,
00:32:23.160 | here's how to use this insight
00:32:24.480 | and here's software you can use that builds it in.
00:32:28.200 | So I think I've only ever done
00:32:30.440 | three first person papers in my life, you know?
00:32:33.120 | And they were, and none of those are ones I wanted to do.
00:32:36.640 | You know, they were all ones that like,
00:32:37.920 | so one was ULMFiT,
00:32:39.480 | where Sebastian Ruder reached out to me
00:32:41.200 | after seeing the course and said like,
00:32:43.000 | "You have to publish this as a paper."
00:32:44.840 | You know?
00:32:45.680 | And he said, "I'll write it."
00:32:48.480 | (laughs)
00:32:49.320 | I was like, "Oh."
00:32:50.160 | And he said, "I want to write it
00:32:51.280 | 'cause if I do, I can put it on my PhD
00:32:52.960 | and that would be great."
00:32:53.800 | And it's like, "Okay, well, I want to help you
00:32:54.720 | with your PhD and that's great."
00:32:57.240 | So like, you know, one was the masks paper,
00:33:00.960 | which just had to exist and nobody else was writing it.
00:33:04.720 | And then the third was the Fast.ai library paper,
00:33:09.560 | which again, somebody reached out and said,
00:33:14.560 | "Please, please write this.
00:33:16.360 | We will waive the fee for the journal and everything
00:33:19.200 | and actually help you get it through publishing and stuff."
00:33:22.400 | So yeah, so I don't, other than that,
00:33:24.360 | I've never written a first author paper.
00:33:27.120 | So the research is like, well, so for example,
00:33:30.320 | you know, DAWNBench was a competition
00:33:33.680 | which Stanford ran a few years ago.
00:33:36.320 | It was kind of the first big competition
00:33:39.840 | of like, who can train neural nets the fastest
00:33:43.680 | rather than the most accurate.
00:33:45.840 | And specifically it was who can train ImageNet the fastest.
00:33:52.640 | And again, this was like one of these things
00:33:54.920 | where it was created by necessity.
00:33:57.280 | So Google had just released their TPUs.
00:34:00.120 | And so I heard from my friends at Google
00:34:02.280 | that they had put together this big team
00:34:04.840 | to smash DAWNBench so that they could prove to people
00:34:08.960 | that they had to use Google Cloud and use their TPUs
00:34:11.760 | and show how good their TPUs were.
00:34:14.040 | And we kind of thought, "Oh shit, this would be a disaster
00:34:16.400 | if they do that, because then everybody's going to be like,
00:34:18.440 | "Oh, deep learning is not accessible."
00:34:20.520 | You know, to actually be good at it,
00:34:22.040 | you have to be Google and you have to use special silicon.
00:34:24.320 | And so, you know, we only found out about this 10 days
00:34:27.880 | before the competition finished.
00:34:30.160 | But, you know, we basically got together
00:34:32.160 | an emergency bunch of our students and Rachel and I
00:34:36.120 | and sat for the next 10 days and just tried to crunch through
00:34:41.120 | and try to use all of our best ideas
00:34:46.000 | that had come from our research.
00:34:48.360 | And so particularly progressive resizing,
00:34:50.280 | just basically train mainly on small things,
00:34:52.560 | train on non-square things, you know, stuff like that.
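Progressive resizing looks roughly like the sketch below in the fastai style; Imagenette stands in for ImageNet, the backbone is pretrained for brevity, and the schedule is an illustrative assumption rather than the actual DAWNBench training script:

```python
from fastai.vision.all import *

# Progressive resizing sketch (Imagenette stands in for ImageNet; schedule is illustrative).
path = untar_data(URLs.IMAGENETTE)

def dls_at(size, bs):
    # Build DataLoaders that resize every image to `size` pixels.
    return ImageDataLoaders.from_folder(path, valid='val',
                                        item_tfms=Resize(size), bs=bs)

learn = vision_learner(dls_at(128, 128), resnet50, metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)   # do most of the training cheaply at low resolution

learn.dls = dls_at(224, 64)    # swap in larger images for the final stretch
learn.fine_tune(2, 1e-3)       # a little extra training at the bigger size
```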
00:34:57.640 | And so, yeah, we ended up winning, thank God.
00:35:02.080 | And so, you know, we turned it around from being like,
00:35:05.160 | like, "Oh shit, you know, this is going to show
00:35:06.800 | "that you have to be Google and have TPUs,"
00:35:08.520 | to being like, "Oh my God,
00:35:09.360 | "even the little guy can do deep learning."
00:35:11.920 | So that's an example of the kind of like
00:35:16.480 | research artifacts we do.
00:35:18.840 | And yeah, so all of my research is always,
00:35:22.160 | how do we do more with less, you know?
00:35:24.320 | So how do we get better results with less data,
00:35:26.640 | with less compute, with less complexity,
00:35:30.480 | with less education, you know, stuff like that.
00:35:34.440 | So ULMFiT's obviously a good example of that.
00:35:37.720 | - And most recently you published,
00:35:40.720 | "Can LLMs learn from a single example?"
00:35:42.920 | Maybe, could you tell the story a little bit behind that?
00:35:46.080 | And maybe that goes a little bit too far
00:35:48.160 | into the learning of very low resource literature.
00:35:53.160 | - Yeah, yeah.
00:35:54.840 | So me and my friend, Jono Whitaker,
00:35:57.880 | basically had been playing around
00:36:01.160 | with this fun Kaggle competition,
00:36:03.160 | which is actually still running as we speak,
00:36:05.200 | which is, can you create a model
00:36:09.880 | which can answer multiple choice questions
00:36:13.200 | about anything that's in Wikipedia?
00:36:15.880 | And the thing that makes it interesting
00:36:18.440 | is that your model has to run on Kaggle within nine hours.
00:36:23.440 | And Kaggle's very, very limited.
00:36:26.040 | So you've only got 14 gig RAM, only two CPUs,
00:36:29.600 | and a small, very old GPU.
00:36:31.800 | So this is cool, you know, if you can do well at this,
00:36:35.520 | and this is a good example of like,
00:36:37.000 | oh, you can do more with less.
00:36:38.520 | So yeah, Jono and I were playing around with fine tuning,
00:36:44.800 | of course, transfer learning, pre-trained language models.
00:36:48.240 | And we saw this like,
00:36:52.640 | so we always, you know, plot our losses as we go.
00:36:55.040 | So here's another thing we created.
00:36:56.160 | Well, actually, Sylvain Gouger,
00:36:57.600 | when he worked with us, created a code Fast Progress,
00:36:59.720 | which is kind of like TQEDM, but we think a lot better.
00:37:03.560 | So we look at our fast progress curves,
00:37:05.880 | and they kind of go down, down, down, down, down, down,
00:37:07.920 | down a little bit, little bit, little bit,
00:37:09.160 | and then suddenly go clunk, and they drop,
00:37:11.960 | and then down, down, down, down, down a little bit,
00:37:13.400 | and then suddenly clunk, they drop.
00:37:15.000 | We're like, what the hell?
00:37:16.520 | These clunks are occurring at the end of each epoch.
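The pattern is easy to reproduce if you log every batch's training loss and mark where epochs end, as in this hedged sketch; `model`, `opt`, `loss_fn`, and `train_dl` are assumed to exist already:

```python
import matplotlib.pyplot as plt

# Hedged sketch: plot per-batch training loss with epoch boundaries marked, so a
# sudden drop right after each boundary (the "clunk") stands out.
# Assumes `model`, `opt`, `loss_fn`, and `train_dl` are defined elsewhere.
losses, epoch_ends = [], []
for epoch in range(3):
    for xb, yb in train_dl:
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()
        opt.zero_grad()
        losses.append(loss.item())
    epoch_ends.append(len(losses))   # remember where this epoch finished

plt.plot(losses, label='training loss')
for end in epoch_ends[:-1]:
    plt.axvline(end, linestyle='--', color='red')  # epoch boundaries
plt.xlabel('batch')
plt.ylabel('loss')
plt.legend()
plt.show()
```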
00:37:20.560 | So normally in deep learning,
00:37:23.600 | this would be, you know, I've seen this before,
00:37:27.000 | and it's always been a bug.
00:37:28.680 | It's always turned out that like,
00:37:29.880 | oh, we accidentally forgot to turn on eval mode
00:37:32.520 | during the validation set,
00:37:33.640 | so I was actually learning then,
00:37:35.560 | or, oh, we accidentally were calculating
00:37:39.080 | moving average statistics throughout the epoch,
00:37:41.440 | so, you know, if it's recently moving average or whatever,
00:37:44.320 | and so we were using the Hugging Face Trainer.
00:37:47.240 | So, you know, I did not give my friends at HuggingFace
00:37:50.200 | the benefit of the doubt.
00:37:51.160 | I thought, oh, they've fucked up the Hugging Face Trainer,
00:37:53.400 | you know, idiots.
00:37:56.120 | Well, we'll use the fastai Learner instead.
00:37:58.520 | So we switched over to Learner.
00:37:59.800 | We still saw the clunks, and, you know, that's,
00:38:04.600 | yeah, it shouldn't really happen,
00:38:06.120 | because semantically speaking, in the epoch,
00:38:09.600 | isn't like, it's not a thing, you know,
00:38:12.600 | like nothing happens, or nothing's meant to happen
00:38:15.480 | when you go from ending one epoch
00:38:17.120 | to starting the next one.
00:38:18.360 | So there shouldn't be a clunk, you know.
00:38:22.800 | So I kind of asked around on the open source discords,
00:38:25.560 | and I was like, what's going on here?
00:38:29.200 | And everybody was just like, oh, that's just what,
00:38:30.880 | that's just what these training curves look like.
00:38:32.720 | Ours all look like that.
00:38:33.880 | Don't worry about it.
00:38:35.040 | And I was like, oh, are you all using Trainer?
00:38:37.480 | Yes, oh, well, there must be some bug with Trainer.
00:38:40.440 | And I was like, well, we also saw it in Learner,
00:38:42.160 | and somebody else was like, no, we've got our own Trainer.
00:38:44.080 | We get it as well.
00:38:45.280 | They're just like, don't worry about it.
00:38:46.240 | It's just something we see.
00:38:47.880 | It's just normal.
00:38:49.000 | I can't do that.
00:38:50.040 | I can't just be like, here's something that's like,
00:38:53.360 | in the previous 30 years of neural networks,
00:38:55.480 | nobody ever saw it, and now suddenly we see it.
00:38:58.080 | So don't worry about it.
00:39:00.360 | Like, I just, I have to know why.
00:39:02.520 | - Can I clarify?
00:39:03.480 | This is, was everyone that you're talking to,
00:39:05.960 | were they all seeing it for the same data set
00:39:07.600 | or in different data sets?
00:39:08.880 | - Data, David, different data sets, different trainers.
00:39:11.920 | They're just like, no, this is just what it looks like
00:39:14.200 | when you fine-tune language models.
00:39:15.600 | Don't worry about it.
00:39:17.240 | - You've never seen this before?
00:39:18.720 | - I hadn't seen it before, but I'd been kind of like,
00:39:21.080 | as I say, I kept working on them
00:39:23.160 | for a couple of years after ULM fit,
00:39:24.840 | and then I kind of moved on to other things,
00:39:27.240 | partly out of frustration.
00:39:28.640 | So I hadn't been fine-tuning, you know,
00:39:32.160 | I mean, Llama has only been out for a few months, right?
00:39:35.720 | But I wasn't one of those people
00:39:37.960 | who jumped straight into it, you know?
00:39:39.600 | So I was relatively new
00:39:41.800 | to the kind of Llama fine-tuning world,
00:39:44.480 | whereas these guys had been, you know,
00:39:48.000 | doing it since day one.
00:39:49.520 | It was only a few months ago,
00:39:51.920 | but it's still quite a bit of time.
00:39:53.040 | So yeah, they're just like, no, this is all what we see.
00:39:56.280 | Don't worry about it.
00:39:58.200 | So yeah, I've got a very kind of like, I don't know,
00:40:01.480 | I've got this brain where I have to know why things are.
00:40:04.400 | And so I kind of, I ask people like,
00:40:06.160 | well, why do you think it's happening?
00:40:07.360 | And they'd be like, oh, pretty obviously,
00:40:09.400 | 'cause it's like, memorize the data set.
00:40:12.160 | It's just like, it can't be right.
00:40:14.720 | It's only seen it once.
00:40:15.920 | Like, look at this, the loss has dropped by 0.3.
00:40:19.480 | 0.3, which is like, basically it knows the answer.
00:40:24.360 | They're like, no, no, it's just, it is,
00:40:28.760 | it's just memorize the data set.
00:40:30.040 | So yeah, so look, Jono and I did not discover this.
00:40:34.160 | And Jono and I did not come up with a hypothesis.
00:40:37.000 | You know, I guess we were just the ones,
00:40:38.240 | I guess, who had been around for long enough
00:40:39.640 | to recognize that like, this isn't how it's meant to work.
00:40:42.920 | And so we, you know, and so we went back and like,
00:40:46.120 | okay, let's just run some experiments, you know,
00:40:48.920 | 'cause nobody seems to have actually published
00:40:50.240 | anything about this.
00:40:51.400 | Well, it's not quite true.
00:40:53.960 | Some people have published things,
00:40:55.040 | but nobody ever actually stepped back and said like,
00:40:56.840 | what the hell?
00:40:58.400 | You know, how can this be possible?
00:40:59.880 | Is it possible?
00:41:00.720 | Is it what's happening?
00:41:01.880 | And so, yeah, we created a bunch of experiments
00:41:03.680 | where we basically predicted ahead of time.
00:41:06.080 | It's like, okay, if this hypothesis is correct,
00:41:08.240 | that it's memorized in the training set,
00:41:09.520 | then we ought to see blah under conditions blah,
00:41:12.360 | but not under these conditions.
00:41:14.280 | And so we ran a bunch of experiments,
00:41:15.480 | all of them supported the hypothesis
00:41:17.400 | that it was memorizing the data set
00:41:19.840 | in a single thing at once.
00:41:21.800 | And it's a pretty big data set, you know,
00:41:25.240 | which in hindsight, it's not totally surprising
00:41:32.120 | because the theory, remember, the ULMFiT theory
00:41:34.920 | was that it's kind of creating
00:41:37.600 | all these latent capabilities to make it easier
00:41:39.720 | for it to predict the next token.
00:41:42.000 | So if it's got all this kind of latent capability,
00:41:45.320 | it ought to also be really good at compressing new tokens
00:41:48.640 | because it can immediately recognize it as like,
00:41:51.320 | oh, if that's just a version of this.
00:41:53.960 | So it's not so crazy, you know,
00:41:58.520 | but it is, it requires us to rethink everything
00:42:03.040 | because like, and nobody knows like, okay,
00:42:05.680 | so how do we fine tune these things?
00:42:07.360 | Because like, it doesn't even matter.
00:42:10.320 | Like maybe it's fine.
00:42:11.760 | Like maybe it's fine that it's memorized the data set
00:42:13.760 | after one go and you do a second go.
00:42:15.720 | And okay, the validation loss is terrible
00:42:19.480 | because it's now really overconfident.
00:42:22.000 | That's fine.
00:42:22.840 | Don't, you know, don't, I keep telling people,
00:42:24.520 | don't track validation loss, track validation accuracy,
00:42:27.960 | 'cause at least that will still be useful.
00:42:30.920 | There's another thing that's got lost since ULMFiT,
00:42:33.120 | nobody tracks accuracy of language models anymore.
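Token-level accuracy is cheap to compute alongside the loss; here is a hedged, framework-agnostic sketch in PyTorch, where the tensor shapes follow the usual causal-LM convention and are an assumption rather than anything from the conversation:

```python
import torch.nn.functional as F

def next_token_metrics(logits, labels, ignore_index=-100):
    """Hedged sketch: cross-entropy loss plus next-token accuracy for a causal LM.
    Assumes `logits` is (batch, seq_len, vocab) and `labels` is (batch, seq_len)."""
    preds = logits[:, :-1].argmax(dim=-1)          # the model's guess for each next token
    targets = labels[:, 1:]                        # the tokens that actually came next
    mask = targets != ignore_index                 # skip padding / masked positions
    accuracy = (preds[mask] == targets[mask]).float().mean()
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=ignore_index)
    return loss.item(), accuracy.item()
```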
00:42:35.840 | But you know, it'll still keep learning and it does,
00:42:40.000 | it does keep improving, but is it worse?
00:42:44.280 | You know, like, is it like, now that it's kind of
00:42:47.000 | memorized it, it's probably getting a less strong signal,
00:42:50.360 | you know, I don't know.
00:42:54.240 | So I still don't know how to fine tune
00:42:55.640 | language models properly and I haven't found anybody
00:42:57.920 | who feels like they do, like nobody really knows
00:43:00.760 | whether this memorization thing is,
00:43:04.280 | it's probably a feature in some ways,
00:43:05.760 | it's probably some things that you can do usefully with it.
00:43:07.920 | It's probably, yeah, I have a feeling
00:43:11.520 | it's messing up training dynamics as well.
00:43:14.440 | - It doesn't come at the cost of catastrophic forgetting
00:43:16.960 | as well, right?
00:43:17.800 | Like, which is the other side of the coin.
00:43:19.560 | - It does to some extent, like we know it does,
00:43:24.560 | like look at Code Llama, for example.
00:43:26.320 | So Code Llama was a, I think it was like a 500 billion
00:43:30.240 | token fine tuning of Llama 2 using code.
00:43:33.520 | And also prose about code, that Meta did.
00:43:37.440 | And honestly, they kind of blew it,
00:43:41.200 | because Code Llama is good at coding,
00:43:43.040 | but it's bad at everything else.
00:43:44.680 | You know, and it used to be good.
00:43:46.120 | Yeah, I was pretty sure it was like,
00:43:48.560 | before they released it, me and lots of people
00:43:50.320 | in the open source discords were all like,
00:43:51.800 | oh my God, you know, we know this is coming,
00:43:53.600 | Yann LeCun is saying it's coming,
00:43:55.080 | I hope they kept at least like 50% non-code data,
00:43:58.240 | 'cause otherwise it's gonna forget everything else.
00:44:00.440 | And they didn't, only like 0.3% of their epochs
00:44:05.440 | were non-code data.
00:44:07.400 | So it did, it forgot everything else.
00:44:08.920 | So now it's good at code and it's bad at everything else.
00:44:12.840 | So we definitely have catastrophic forgetting.
00:44:14.640 | It's fixable, just somebody has to do, you know,
00:44:17.920 | somebody has to spend their time training a model
00:44:21.760 | on a good mix of data.
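A minimal sketch of what "a good mix of data" can look like in practice with the Hugging Face datasets library: keep every data type in the stream and just reweight it, rather than letting non-code data drop to a fraction of a percent. The dataset names and probabilities below are placeholders, not the Code Llama recipe.

```python
from datasets import load_dataset, interleave_datasets

# Placeholder dataset names; probabilities are illustrative only.
code = load_dataset("my-org/code-corpus", split="train", streaming=True)
prose = load_dataset("my-org/general-text", split="train", streaming=True)
instructions = load_dataset("my-org/instructions", split="train", streaming=True)

mixed = interleave_datasets(
    [code, prose, instructions],
    probabilities=[0.5, 0.35, 0.15],   # code-heavy, but nothing drops to ~0%
    seed=42,
)
```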
00:44:24.320 | Like, so, okay, so here's the thing.
00:44:26.160 | Even though I originally created the three-step approach
00:44:32.320 | that everybody now does,
00:44:34.040 | my view is it's actually wrong and we shouldn't use it.
00:44:36.840 | And that's because people are using it in a way different
00:44:44.720 | to why I created it.
00:44:46.000 | You know, I created it thinking that the task-specific
00:44:48.440 | models would be more specific.
00:44:51.280 | You know, it's like, oh, this is like a sentiment classifier.
00:44:54.840 | That's an example of a task, you know,
00:44:57.280 | but the tasks now are like a, you know, RLHF,
00:45:01.720 | which is basically like answer questions
00:45:03.360 | that make people feel happy about your answer.
00:45:05.280 | So that's a much more general task
00:45:07.560 | and it's a really cool approach.
00:45:09.440 | And so we see, for example, RLHF also breaks models,
00:45:14.440 | like, you know, like GPT-4, RLHF'd it,
00:45:18.160 | we know from kind of the work that Microsoft did,
00:45:21.680 | you know, the earlier less-aligned version was better.
00:45:25.840 | And these are all kind of examples
00:45:28.640 | of catastrophic forgetting.
00:45:30.160 | And so, to me, the right way
00:45:33.480 | to fine-tune language models
00:45:36.600 | is to actually throw away the idea of fine-tuning.
00:45:38.840 | There's no such thing.
00:45:40.280 | There's only continued pre-training.
00:45:42.600 | And pre-training is something where, from the very start,
00:45:46.280 | you try to include all the kinds of data that you care about,
00:45:49.880 | all the kinds of problems that you care about,
00:45:51.800 | instructions, exercises, code,
00:45:55.840 | general purpose document completion, whatever.
00:45:59.280 | And then as you train, you gradually curate that,
00:46:05.400 | you know, you gradually make that higher and higher quality
00:46:08.360 | and more and more specific
00:46:09.400 | to the kinds of tasks you want it to do.
00:46:12.200 | But you never throw away any data.
00:46:14.840 | You always keep all of the data types there
00:46:16.960 | in reasonably high quantities.
00:46:20.280 | You know, maybe the quality filter,
00:46:23.480 | you stop training on low-quality data,
00:46:26.240 | 'cause that's probably fine to forget
00:46:27.440 | how to write badly, maybe.
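One way to picture "gradually curate, but never throw away any data type" is a mixture schedule that drifts toward the tasks you care about while clamping every source above a small floor. This is just an illustrative sketch with made-up numbers, not a recommendation.

```python
def mixture_weights(step, total_steps, start, end, floor=0.02):
    """Linearly move from a broad starting mixture to a more task-specific one,
    but clamp every source to a floor so nothing is ever fully dropped.
    `start`/`end` map source name -> weight; values are illustrative."""
    t = step / total_steps
    raw = {k: (1 - t) * start[k] + t * end[k] for k in start}
    clamped = {k: max(v, floor) for k, v in raw.items()}
    total = sum(clamped.values())
    return {k: v / total for k, v in clamped.items()}

# Early on: lots of general web text; late: mostly instructions and code,
# but web text never goes below the floor.
start = {"web": 0.6, "code": 0.2, "instructions": 0.2}
end   = {"web": 0.1, "code": 0.4, "instructions": 0.5}
print(mixture_weights(0, 10_000, start, end))
print(mixture_weights(10_000, 10_000, start, end))
```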
00:46:29.240 | So yeah, that's now my view,
00:46:32.560 | is I think ULMFit is the wrong approach.
00:46:36.040 | And that's why we're seeing a lot of these, you know,
00:46:40.160 | so-called alignment tax and this view of like,
00:46:42.960 | "Oh, a model can't both code and do other things."
00:46:45.800 | You know, I think it's actually
00:46:46.800 | 'cause people are training them wrong.
00:46:49.240 | - Well, I think you have a clear anti-laziness approach.
00:46:53.800 | I think other people are not as good-hearted, you know?
00:46:57.440 | They're like, "Hey, they told me this thing works.
00:46:59.800 | "And if I release a model this way, people will appreciate it.
00:47:03.160 | "I'll get promoted and I'll kind of make more money."
00:47:06.920 | - Oh, absolutely.
00:47:08.240 | Yeah, and it's not just money.
00:47:09.440 | It's like, this is how citations work most badly, you know?
00:47:12.680 | So if you wanna get cited, you need to write a paper
00:47:15.600 | that people in your field recognize as an advancement
00:47:19.640 | on things that we know are good.
00:47:22.120 | And so we've seen this happen again and again.
00:47:24.360 | So like I say, like zero-shot and few-shot learning,
00:47:28.080 | everybody was writing about that.
00:47:29.640 | Or, you know, with image generation,
00:47:32.240 | everybody just was writing about GANs, you know?
00:47:35.040 | And I was trying to say like,
00:47:35.880 | "No, GANs are not the right approach."
00:47:37.840 | You know, and I showed again through research
00:47:40.040 | that we demonstrated in our videos
00:47:42.200 | that you can do better than GANs much faster
00:47:46.280 | and with much less data.
00:47:48.280 | And nobody cared because again,
00:47:49.880 | like if you wanna get published,
00:47:51.800 | you write a GAN paper that slightly improves
00:47:54.680 | this part of GANs and this tiny field,
00:47:57.240 | you'll get published, you know?
00:47:59.720 | So it's, yeah, it's not set up for real innovation.
00:48:03.760 | It's, you know, again, it's really helpful for me,
00:48:10.680 | you know, I have my own research lab
00:48:12.760 | with nobody telling me what to do
00:48:14.120 | and I don't even publish,
00:48:16.160 | so it doesn't matter if I get citations.
00:48:18.320 | So I just write what I think actually matters.
00:48:21.440 | I wish there was, and you know,
00:48:24.120 | and actually places like OpenAI, you know,
00:48:26.560 | the researchers there can do that as well.
00:48:29.040 | It's a shame, you know,
00:48:29.880 | I wish there was more academic open venues
00:48:34.000 | in which people can focus on like genuine innovation.
00:48:38.880 | - Twitter, which unironically
00:48:42.160 | has become a little bit of that forum.
00:48:44.320 | I wanted to follow up on one thing that you mentioned,
00:48:47.280 | which is that you checked around the open source discords.
00:48:50.680 | I don't know if it's too,
00:48:53.080 | I don't know if it's kosher to ask
00:48:54.760 | like what discords are lively or useful right now.
00:48:59.200 | I think that something I definitely felt
00:49:02.120 | like I missed out on was the early days of EleutherAI,
00:49:04.640 | which was a real hotbed.
00:49:07.160 | And you know, like what is the new Eleuther?
00:49:09.480 | And you actually shouted out the alignment lab AI discord
00:49:12.800 | in your blog post.
00:49:14.000 | And that was the first time I even knew,
00:49:15.320 | like I saw them on Twitter
00:49:16.480 | and never knew they had a discord,
00:49:17.760 | never knew that there was actually
00:49:18.800 | substantive discussions going on in there
00:49:20.720 | and that you were an active member of it.
00:49:22.720 | - Okay, yeah, and then even then,
00:49:23.960 | if you do know about that and you go there,
00:49:25.560 | it'll look like it's totally dead.
00:49:27.400 | And that's because unfortunately,
00:49:29.040 | nearly all the discords,
00:49:30.120 | nearly all of the conversation happens in private channels.
00:49:35.160 | - So how does someone get into that world?
00:49:38.560 | 'Cause it's obviously very, very instructive, right?
00:49:42.880 | - You could just come to the fast.ai Discord,
00:49:44.800 | which I'll be honest with you,
00:49:46.240 | it's less bustling than some of the others,
00:49:50.160 | but it's not terrible.
00:49:52.320 | And so like, at least to be fair,
00:49:55.880 | one of our most bustling channels is private.
00:49:58.080 | So I'm just thinking.
00:50:01.920 | - It's just the nature of quality discussion, right?
00:50:03.960 | - Yeah, I guess when I think about it,
00:50:06.560 | I didn't have any private discussions
00:50:08.440 | on our discord for years,
00:50:10.680 | but there was a lot of people who came in with like,
00:50:14.080 | oh, I just had this amazing idea for AGI.
00:50:17.560 | If you just thought about like,
00:50:18.800 | if you imagine that AI is a brain and we,
00:50:21.840 | this just, I don't want to talk about it.
00:50:24.160 | I don't want to like,
00:50:25.480 | maybe you don't want to be dismissive or whatever.
00:50:27.520 | And it's like, oh, well, that's an interesting comment,
00:50:29.280 | but maybe you should like try training some models first
00:50:31.440 | to see if that aligns with your intuition.
00:50:33.040 | Like, oh, but how can I possibly learn?
00:50:34.480 | It's like, well, we have a course,
00:50:35.880 | just actually spend time learning.
00:50:38.320 | Like, you know, anyway.
00:50:40.080 | And it's like, okay,
00:50:41.560 | I know the people who always have good answers there.
00:50:43.960 | And so I created a private channel and put them all in it.
00:50:46.720 | And I got to admit, that's where I post more often
00:50:50.000 | 'cause there's much less,
00:50:52.800 | you know, flight of fancy views
00:50:56.120 | about how we could solve AGI, blah, blah, blah.
00:50:58.440 | So there is a bit of that,
00:51:00.120 | but having said that, like,
00:51:02.040 | I think the bar's pretty low.
00:51:03.720 | Like if you join a Discord
00:51:05.600 | and you can hit the like participants
00:51:10.600 | or community or whatever button,
00:51:11.920 | you can see who's in it.
00:51:12.760 | And then you'll see at the top who the admins or moderators
00:51:15.440 | or people in the dev role are.
00:51:17.120 | And just DM one of them and say like,
00:51:22.960 | oh, here's my GitHub.
00:51:25.640 | Well, here's some blog posts I wrote.
00:51:27.640 | You know, I'm interested in talking about this.
00:51:29.680 | You know, can I join the private channels?
00:51:31.840 | And I've never heard of anybody saying no.
00:51:34.960 | I will say, you know, Eleuther's all pretty open.
00:51:39.960 | So you can do the Eleuther Discord still.
00:51:43.400 | You know, one problem with the Eleuther Discord
00:51:45.000 | is it's been going on for so long
00:51:47.680 | that it's like, it's very inside baseball.
00:51:50.280 | - It's hard to join as a newcomer.
00:51:51.240 | - It's quite hard to get started.
00:51:52.840 | Carper AI, I think it's all open.
00:51:59.840 | - They just left Stability.
00:52:02.240 | - That's more accessible.
00:52:03.800 | - Yeah.
00:52:04.640 | - There's also, just recently,
00:52:09.520 | Nous Research, that does like the Hermes models
00:52:12.720 | and data sets, just opened.
00:52:15.640 | They've got some private channels,
00:52:16.880 | but it's pretty open, I think.
00:52:19.160 | You mentioned Alignment Lab,
00:52:20.480 | that one it's all the interesting stuff
00:52:21.920 | is on private channels.
00:52:22.800 | So just ask.
00:52:25.520 | If you know me, ask me 'cause I've got admin on that one.
00:52:28.960 | There's also, yeah, OS Skunkworks, OS Skunkworks AI.
00:52:33.880 | There's a good Discord, which I think it's open.
00:52:37.880 | So yeah, they're all pretty good.
00:52:40.440 | - I don't want you to leak any, you know,
00:52:42.800 | Discords that don't want any publicity,
00:52:45.280 | but this is all helpful.
00:52:47.040 | - We all want people.
00:52:48.520 | Like we all want people.
00:52:49.640 | We just want people who like wanna build stuff.
00:52:53.080 | - Exactly, yeah.
00:52:54.640 | - Rather than people who,
00:52:56.080 | and like, it's fine to not know anything as well,
00:52:59.920 | but if you don't know anything,
00:53:02.200 | but you wanna tell everybody else what to do
00:53:04.040 | and how to do it, that's annoying.
00:53:05.520 | If you don't know anything and wanna be told,
00:53:08.120 | like here's a really small kind of task
00:53:10.840 | that as somebody who doesn't know anything,
00:53:12.760 | it's gonna take you a really long time to do,
00:53:14.400 | but it would still be helpful.
00:53:15.880 | Then, and then you go and do it.
00:53:17.080 | That would be great.
00:53:18.240 | The truth is, yeah, like, I don't know,
00:53:20.800 | maybe 5% of people who come in with great enthusiasm
00:53:23.440 | saying that they wanna learn and they'll do anything.
00:53:25.240 | And then somebody says like,
00:53:26.080 | okay, here's some work you can do.
00:53:27.880 | Almost nobody does that work.
00:53:29.760 | So if you're somebody who actually does the work
00:53:32.280 | and follows up, you will massively stand out.
00:53:36.920 | That's an extreme rarity.
00:53:38.400 | And everybody will then want to help you do more work.
00:53:41.080 | So yeah, so just, yeah, just do work
00:53:44.600 | and people will want to support you.
00:53:47.880 | - Our Discord used to be referral only for a long time.
00:53:51.240 | We then have a public invite
00:53:53.000 | and then we opened it in the kind of like channel gating.
00:53:55.960 | Yeah, a lot of people just wanna do,
00:53:58.360 | I remember it used to be like, you know, a forum moderator.
00:54:00.840 | It's like, people just wanna do like drive-by posting,
00:54:03.040 | you know, and like, they don't wanna help the community.
00:54:05.840 | They just wanna get their question answered.
00:54:07.720 | - I mean, the funny thing is our forum community
00:54:11.800 | does not have any of that garbage.
00:54:13.760 | You know, there's something specific
00:54:15.240 | about the low latency thing
00:54:17.040 | where people like expect an instant answer.
00:54:20.800 | Whereas somehow in a forum thread,
00:54:24.920 | they know it's like there forever.
00:54:28.080 | People are a bit more thoughtful,
00:54:29.320 | but then the forums are less active than they used to be
00:54:34.320 | because Discord has got more popular, you know?
00:54:39.520 | So it's all a bit of a compromise.
00:54:41.120 | You know, running a healthy community is,
00:54:44.240 | yeah, it's always a bit of a challenge.
00:54:46.520 | - All right, we got so many more things we wanna dive in,
00:54:48.560 | but I don't wanna keep you here for hours.
00:54:50.760 | This is not the Lex Fridman podcast we always like to say.
00:54:54.160 | One topic I would love to maybe chat a bit about
00:54:57.400 | is Mojo, Modular, you know, Chris Lattner,
00:55:00.360 | not many of you on the podcast,
00:55:01.920 | so we wanna spend a little time there.
00:55:04.200 | You recently did a Hacker's Guide to Language Models,
00:55:07.240 | and you ran through everything from quantized model
00:55:10.360 | to like smaller models, larger models, and all of that.
00:55:14.160 | But obviously, Modular is taking its own approach.
00:55:17.560 | Yeah, what got you excited?
00:55:18.600 | I know you and Chris have been talking about this
00:55:20.320 | for like years and a lot of the ideas you had, so.
00:55:23.720 | - Yeah, yeah, yeah, absolutely.
00:55:25.280 | So I met Chris, I think it was at the first
00:55:29.520 | TensorFlow Dev Summit.
00:55:31.520 | And I don't think he had even like,
00:55:33.920 | I'm not sure if he'd even officially started
00:55:35.680 | his employment with Google at that point.
00:55:37.600 | So I don't know, you know,
00:55:40.000 | certainly nothing had been mentioned.
00:55:42.640 | So I, you know, I admired him from afar
00:55:45.040 | with LLVM and Swift and whatever.
00:55:48.200 | And so I saw him walk into the courtyard at Google.
00:55:53.200 | It's just like, "Oh shit, man, it's Chris Lattner.
00:55:56.640 | I wonder if he would lower his standards enough
00:55:59.640 | to talk to me.
00:56:01.360 | Well, it's worth a try."
00:56:02.920 | So I caught up my courage because like,
00:56:04.680 | nobody was talking to him.
00:56:05.960 | He looked a bit lost and I wandered over and was like,
00:56:07.880 | "Oh, you're Chris Lattner, right?"
00:56:09.120 | And he was like, "Yeah, yeah, I am.
00:56:11.000 | What are you doing here?"
00:56:12.440 | And I was like, "Oh, I'm Jeremy Howard."
00:56:13.640 | And he was like, "Oh, do you do some of this AI stuff?"
00:56:15.880 | And I was like, "Yeah, yeah, I like this AI stuff."
00:56:18.480 | "Are you doing AI stuff?"
00:56:19.760 | He's like, "Well, I'm thinking about starting
00:56:21.520 | to do some AI stuff.
00:56:22.560 | Yeah, I think it's gonna be cool."
00:56:23.640 | And I was like, "Oh."
00:56:25.320 | So like, I spent the next half hour
00:56:27.360 | just basically brain dumping all the ways
00:56:30.480 | in which AI was stupid to him.
00:56:32.800 | And he listened patiently.
00:56:34.240 | I thought he probably wouldn't even remember
00:56:36.760 | or care or whatever, but yeah,
00:56:40.560 | then I kind of like, I guess I re-caught up with him
00:56:42.360 | a few months later and he was like,
00:56:43.440 | "I've been thinking about everything you said
00:56:45.400 | in that conversation."
00:56:46.240 | And he like narrated back his response to every part of it,
00:56:50.240 | the projects he was planning to do.
00:56:51.520 | And it was just like, "Oh, this dude follows up.
00:56:54.080 | Holy shit."
00:56:55.000 | And I was like, "Wow, okay."
00:56:58.240 | And he was like, "Yeah, so we're gonna create
00:56:59.840 | this new thing called Swift for TensorFlow.
00:57:02.920 | And it's gonna be like, it's gonna be a compiler
00:57:05.120 | with auto-differentiation built in and blah, blah, blah."
00:57:08.320 | And I was like, "Oh, wait, why would that help?"
00:57:10.200 | You know, he was like, "Okay, with a compiler
00:57:12.560 | during the forward pass, you don't have to worry
00:57:15.080 | about saving context, you know,
00:57:16.520 | 'cause it'll all be optimized in the backward."
00:57:18.040 | But I was like, "Oh my God."
00:57:19.840 | 'Cause I didn't really know much about compilers,
00:57:21.560 | it's just that, you know, I spent enough
00:57:23.200 | to kind of like understand the ideas,
00:57:25.840 | but it hadn't occurred to me that a compiler
00:57:28.680 | basically solves a lot of the problems we have as end users.
00:57:32.640 | I was like, "Wow, that's amazing.
00:57:33.880 | Okay, you do know, right, that nobody's gonna use this
00:57:36.360 | unless it's like usable."
00:57:37.880 | He was like, "Yeah, I know, right?
00:57:39.680 | So I was thinking you should create like a fast AI for this."
00:57:42.360 | I was like, "Okay, but I don't even know Swift."
00:57:46.440 | And he was like, "Well, why don't you start learning it?
00:57:50.040 | And if you have any questions, ask me."
00:57:52.640 | It's just like, "Holy shit."
00:57:53.840 | Like, not only has Chris Lattner lowered his standards enough
00:57:57.680 | to talk to me, but he's offering me personal tutoring
00:58:00.600 | in the programming language that he made.
00:58:02.800 | So I was just like, "I'm not gonna let him down."
00:58:05.680 | So I spent like the next two months
00:58:07.440 | like just nerding out on Swift.
00:58:10.080 | And it was just before Christmas that I kind of like
00:58:13.400 | started writing down what I'd learned.
00:58:15.800 | So I wrote a couple of blog posts on like,
00:58:19.160 | "Okay, this is like my attempt
00:58:21.600 | to do numeric programming in Swift.
00:58:24.320 | And these are all the challenges I had.
00:58:26.000 | And these are some of the issues I had
00:58:27.560 | with like making things properly performant.
00:58:32.400 | And here are some libraries I wrote."
00:58:34.240 | And I sent it to Chris and I was like,
00:58:35.520 | "I hope he's not too disappointed with me."
00:58:37.680 | You know, 'cause that would be the worst.
00:58:40.000 | And I was also like, "I hope he doesn't dislike the fact
00:58:44.280 | that I didn't love everything."
00:58:47.360 | And yeah, he was like, "Oh, thanks for sending me that.
00:58:52.240 | Let's get on a call and talk about it."
00:58:53.760 | And we spoke and he was like, "This is amazing.
00:58:56.120 | I can't believe that you made this.
00:58:57.840 | This is exactly what Swift needs."
00:58:59.520 | And he was like, "And so like somebody set up
00:59:01.280 | like a new Swift, I can't remember what they call them,
00:59:06.280 | the equivalent of a PEP, kind of RFC thing of like,
00:59:09.080 | oh, you know, let's look at how we can implement
00:59:10.880 | Jeremy's ideas and the language."
00:59:12.520 | And so I was like, "Oh, wow."
00:59:15.000 | And so, yeah, you know.
00:59:16.920 | So, you know, and then we ended up like literally teaching
00:59:22.200 | some lessons together about Swift for TensorFlow
00:59:24.840 | and we built a fast AI kind of equivalent
00:59:29.480 | with him and his team.
00:59:32.200 | It was so much fun.
00:59:33.320 | Then in the end, you know, Google didn't follow through,
00:59:36.640 | which is fair enough, like asking everybody
00:59:39.760 | to learn a new programming language is gonna be tough.
00:59:42.880 | But like, it was very obvious, very, very obvious
00:59:45.200 | at that time that TensorFlow 2 is gonna be a failure,
00:59:47.640 | you know, and so this felt like, okay, I, you know,
00:59:52.640 | well, you know, what are you gonna do?
00:59:55.320 | Like, you can't focus on TensorFlow 2
01:00:00.320 | 'cause it's not gonna, like it's not working.
01:00:02.400 | It's never gonna work.
01:00:03.400 | You know, nobody at Google's using it internally.
01:00:06.760 | So, you know, in the end, Chris left, you know,
01:00:11.760 | Swift for TensorFlow got archived.
01:00:13.680 | There was no backup plan.
01:00:16.120 | So it kind of felt like Google was kind of screwed,
01:00:19.920 | you know, and Chris went and did something else.
01:00:22.320 | But we kept talking and I was like, "Look, Chris, you know,
01:00:25.160 | you've gotta be your own boss, man.
01:00:27.600 | 'Cause like, you know, you've got the ideas, you know,
01:00:30.320 | like only you've got the ideas, you know,
01:00:32.400 | and if your ideas are implemented,
01:00:35.280 | we'd all be so much better off
01:00:36.640 | 'cause like Python's the best of a whole bunch of shit,
01:00:41.640 | you know, like I would, it's amazing, but it's awful,
01:00:45.720 | you know, compared to what it could be.
01:00:46.960 | And anyway, so eventually a few years later,
01:00:50.360 | he called me up and he was like,
01:00:51.680 | "Jeremy, I've taken your advice.
01:00:53.880 | I've started a company."
01:00:55.240 | So I was like, "Oh my God."
01:00:57.320 | So we got to create a new language.
01:00:58.720 | We're gonna create a new infrastructure.
01:01:00.480 | It's gonna build, it's gonna have all the stuff
01:01:02.000 | we've talked about.
01:01:02.960 | And it's like, "Oh, wow."
01:01:05.280 | So that's what Modular is.
01:01:09.360 | And so Mojo is like, you know,
01:01:14.000 | building on all the stuff that Chris has figured out over,
01:01:18.760 | I mean, really from when he did his PhD thesis,
01:01:21.600 | which developed LLVM onwards, you know,
01:01:24.320 | and Swift and MLIR, you know,
01:01:27.080 | the TensorFlow runtime engine, which is very good.
01:01:31.160 | You know, that was something that he built and has lasted.
01:01:34.000 | So, yeah, I'm pumped about that.
01:01:38.440 | I mean, it's very speculative.
01:01:40.000 | Creating a whole new language is tough.
01:01:41.960 | I mean, Chris has done it before
01:01:43.160 | and he's created a whole C++ compiler amongst other things,
01:01:46.680 | looking pretty hopeful.
01:01:49.760 | I mean, I hope it works because, you know, I mean-
01:01:53.840 | - You told them to quit his job, so.
01:01:55.640 | - But I mean, in the meantime, I will say, you know,
01:02:00.760 | Google now does have a backup plan, you know,
01:02:03.400 | they have JAX, which was never a strategy.
01:02:06.120 | It was just a bunch of people
01:02:07.160 | who also recognized TensorFlow 2 as shit
01:02:09.080 | and they just decided to build something else.
01:02:11.920 | And for years, my friends in that team were like,
01:02:13.560 | "Don't tell anybody about us
01:02:14.960 | 'cause we don't want it to be anything but a research project."
01:02:19.200 | So now these poor guys,
01:02:21.040 | suddenly they're the great white hope for Google's future.
01:02:24.080 | And so JAX is, you know, also not terrible,
01:02:27.720 | but it's still written in Python.
01:02:29.520 | Like, it would be cool if we had all the benefits of JAX,
01:02:32.920 | but in a language that was designed
01:02:36.080 | for those kinds of purposes.
01:02:38.440 | So, you know, fingers crossed that, yeah,
01:02:43.360 | that Mojo turns out great.
01:02:46.680 | - Yeah.
01:02:48.080 | Any other thoughts on when,
01:02:50.600 | where people should be spending their time?
01:02:52.200 | So that's more the kind of language framework level
01:02:55.360 | than you have the, you know, GGML,
01:02:58.320 | some of these other like quantization-focused
01:03:01.200 | kind of model level things.
01:03:02.320 | Then you got the hardware people.
01:03:03.880 | It's like a whole other bucket.
01:03:06.040 | Yeah, what are some of the exciting stuff
01:03:07.720 | that you're excited about?
01:03:09.560 | - Well, you won't be surprised to hear me say this,
01:03:11.440 | but I think fine-tuning transfer learning
01:03:14.080 | is still a hugely underappreciated area.
01:03:18.400 | So today's zero-shot, few-shot learning equivalent
01:03:24.960 | is retrieval-augmented generation, you know, RAG,
01:03:28.320 | which is like, just like few-shot learning is a thing.
01:03:32.000 | Like, it's a real thing.
01:03:32.840 | It's a useful thing.
01:03:34.200 | It's not a thing anybody would want to ignore.
01:03:36.320 | Why are people not spending at least as much effort
01:03:38.840 | on fine-tuning, you know?
01:03:40.400 | 'Cause, you know, RAG is like such an inefficient hack,
01:03:45.280 | really, isn't it?
01:03:46.120 | It's like, you know, segment up my data
01:03:50.760 | in some somewhat arbitrary way,
01:03:53.440 | embed it, ask questions about that, you know,
01:03:56.520 | hope that my embedding model embeds questions
01:04:01.560 | in the same embedding space as the paragraphs,
01:04:04.120 | which obviously is not going to, if your question is like,
01:04:06.240 | if I've got a whole bunch of arXiv paper embeddings,
01:04:08.880 | and I asked, like, what are all the ways
01:04:12.040 | in which we can make inference more efficient?
01:04:17.040 | Like, the only paragraphs it'll find
01:04:20.560 | is like if there's a review paper
01:04:22.240 | that says here's a list of ways to make,
01:04:24.680 | you know, inference more efficient.
01:04:26.720 | - Doesn't have any of the specifics.
01:04:28.800 | - No, it's not going to be like, oh, here's one way,
01:04:30.920 | here's one way, here's a different way in different papers,
01:04:33.240 | you know?
01:04:34.080 | Yeah, if you fine-tune a model,
01:04:37.520 | then all of that information is getting directly
01:04:40.680 | incorporated into the weights of your model
01:04:43.880 | in a much more efficient and nuanced way.
01:04:46.600 | And then you can use RAG on top of that.
01:04:50.480 | So I think that that's one area
01:04:51.840 | that's definitely like underappreciated.
01:04:55.280 | And also the kind of like the confluence,
01:04:57.680 | or like, okay, how do you combine RAG
01:04:59.440 | and fine-tuning, for example?
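For reference, the "hack" in its most basic form looks something like this: chop documents into fixed-size chunks, embed them, embed the question, and hope cosine similarity surfaces the right paragraphs. The model name and chunking choices are just common defaults used for illustration, not a recommendation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # a common default embedder

def build_index(documents, chunk_size=500):
    # Somewhat arbitrary segmentation: fixed-size character chunks.
    chunks = [doc[i:i + chunk_size] for doc in documents
              for i in range(0, len(doc), chunk_size)]
    embeddings = model.encode(chunks, normalize_embeddings=True)
    return chunks, embeddings

def retrieve(question, chunks, embeddings, k=5):
    # Hope the question lands near the right paragraphs in embedding space.
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = embeddings @ q                     # cosine similarity (normalized)
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]
```

Combining this with fine-tuning would just mean retrieving against, and generating with, a model whose weights already carry the domain.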
01:05:01.280 | - Something that I think a lot of people are uncertain about,
01:05:04.160 | and I don't expect you to know either,
01:05:06.160 | is that whether or not you can fine-tune new information in,
01:05:11.160 | and I think that that is the focus
01:05:13.880 | of some of your open questions and research.
01:05:16.720 | - Of course you can, right?
01:05:17.960 | - Because it's additional pre-training.
01:05:18.800 | - Obviously you can,
01:05:20.360 | because there's no such thing as fine-tuning,
01:05:23.200 | there's only continued pre-training.
01:05:25.200 | So fine-tuning is pre-training,
01:05:28.000 | like they're literally the same thing.
01:05:29.960 | So the knowledge got in there in the first place
01:05:33.360 | through pre-training,
01:05:34.200 | so how could like continuing to pre-train
01:05:36.640 | not put more knowledge in?
01:05:37.960 | Like it's the same thing.
01:05:39.800 | The problem is just we're really bad at it,
01:05:43.120 | 'cause everybody's doing it dumb ways.
01:05:45.040 | So, you know, it's a good question,
01:05:46.920 | and it's not just new knowledge,
01:05:48.080 | but like new capabilities.
01:05:49.800 | You know, I think like in my
01:05:54.120 | "Hacker's Guide to LLMs" talk,
01:05:56.640 | I show simple, I mean, it's a funny,
01:05:59.320 | that's a simple example, 'cause it doesn't sound it,
01:06:01.080 | but like taking a pre-trained base model
01:06:03.640 | and getting it to generate SQL.
01:06:05.320 | And it took 15 minutes to train on a single GPU.
01:06:09.560 | You know, I think that might surprise people,
01:06:11.600 | so that that capability is at your fingertips,
01:06:15.600 | and, you know, 'cause it was already there,
01:06:17.240 | it was just latent in the base model.
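A rough sketch of that kind of short, single-GPU run: treat (question, SQL) pairs as plain text and continue training the base model on them with a small LoRA adapter. The model name, dataset name, and hyperparameters here are placeholders, not the recipe from the talk.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"                       # placeholder base model
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(base, device_map="auto"),
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)

ds = load_dataset("my-org/text-to-sql", split="train")  # hypothetical dataset
ds = ds.map(lambda ex: tok(f"Question: {ex['question']}\nSQL: {ex['query']}",
                           truncation=True, max_length=512),
            remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments("sql-lora", per_device_train_batch_size=8,
                           num_train_epochs=1, learning_rate=2e-4, fp16=True),
    train_dataset=ds,
    # mlm=False gives standard next-token (causal LM) labels
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```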
01:06:20.880 | Really pushing the boundaries
01:06:22.520 | of what you can do with small models,
01:06:24.360 | I think is a really interesting question.
01:06:27.760 | Like what can you do with a,
01:06:30.280 | like, I mean, there isn't much
01:06:31.280 | in the way of good small models.
01:06:33.480 | A really underappreciated one is a BTLM 3B,
01:06:38.080 | which is a like kind of 7B quality 3B model.
01:06:44.920 | There's not much at the 1 to 2B range, sadly.
01:06:47.280 | There are some code ones,
01:06:48.600 | but like the fact that there are some really good code ones
01:06:51.320 | in that 1 to 2B range shows you
01:06:53.200 | that that's a great size for doing complex tasks well.
01:06:56.920 | - There was Phi-1 recently,
01:07:00.160 | which has been the subject of a little bit of discussion
01:07:03.040 | about whether it was trained on benchmarks.
01:07:04.720 | - Yeah, and Phi-1.5 as well.
01:07:06.440 | So that's not a good model yet.
01:07:09.960 | - Why not?
01:07:12.880 | - It's good at doing,
01:07:13.720 | so Phi-1 in particular is good at doing a very specific thing,
01:07:16.520 | which is creating very small Python snippets.
01:07:19.600 | The thing, okay, so like Phi-1.5
01:07:23.120 | has never read Wikipedia, for example.
01:07:26.920 | So it doesn't know who Tom Cruise is, you know.
01:07:30.080 | It doesn't know who anybody is.
01:07:34.320 | It doesn't know about any movies.
01:07:36.080 | It doesn't really know anything about anything.
01:07:39.000 | Like, 'cause it was never, it's never read anything.
01:07:42.200 | You know, it was trained
01:07:43.080 | on a nearly entirely synthetic data set,
01:07:46.920 | which is designed for it to learn reasoning.
01:07:50.280 | And so it was a research project and a really good one.
01:07:53.120 | And it definitely shows us a powerful direction
01:07:55.440 | in terms of what can you do with synthetic data.
01:07:58.280 | And wow, gosh, even these tiny models
01:08:00.480 | can get pretty good reasoning skills,
01:08:02.400 | pretty good math skills, pretty good coding skills.
01:08:04.960 | But I don't know if it's a model
01:08:08.800 | you could necessarily build on.
01:08:11.880 | Some people have tried to do some fine tunes of it.
01:08:15.120 | And again, they're like surprisingly good in some ways
01:08:19.040 | for a 1.5B model,
01:08:20.640 | but not sure you'd find it useful for anything.
01:08:24.520 | - I think that's the struggle of pitching small models
01:08:28.040 | because small is great.
01:08:30.080 | You know, you don't have a lot,
01:08:31.640 | you don't need a lot of resources to run them,
01:08:33.520 | but the performance evaluation is always so iffy.
01:08:36.640 | It's always just like, yeah, it works on some things
01:08:39.200 | and we don't trust it for others.
01:08:41.840 | - Yeah, so that's why we're back to fine tuning.
01:08:44.840 | I would say, so Microsoft did create a Phi-1.5-web,
01:08:48.960 | but they didn't release it, unfortunately.
01:08:51.040 | I would say a Phi-1.5-web with fine tuning for your task,
01:08:57.280 | you know, might solve a lot of tasks
01:09:02.920 | that people have in their kind of day-to-day lives.
01:09:05.960 | You know, particularly in kind of an enterprise setting,
01:09:08.880 | I think there's a lot of like repetitive kind of processing
01:09:13.120 | that has to be done.
01:09:14.520 | It's a useful thing for coders to know about
01:09:16.720 | 'cause I think quite often you can like replace
01:09:18.880 | some thousands and thousands of lines of complex buggy code,
01:09:22.360 | maybe with a fine tune, you know.
01:09:25.000 | - Good, yeah.
01:09:27.920 | And Jeremy, before we let you go,
01:09:31.000 | I think one question on top of a lot of people's minds.
01:09:34.000 | So you've done practical deep learning for coders
01:09:36.880 | in 2018, '19, '21, '22.
01:09:40.080 | I feel like the more time goes by,
01:09:41.840 | the more the GPUs get concentrated.
01:09:44.840 | If you're somebody who's interested in deep learning today
01:09:47.680 | and you don't wanna go join OpenAI,
01:09:49.480 | you don't wanna join Anthropic,
01:09:51.400 | what's like the best use of their time?
01:09:53.840 | Should they focus on, yeah, small model development?
01:09:56.280 | Should they focus on fine tuning math and all of that?
01:09:59.520 | Should they just like focus on making RAG not a hack
01:10:03.920 | and coming up with a better solution?
01:10:06.120 | Yeah, what's a practical deep learning for coders 2024
01:10:09.560 | kind of look like?
01:10:10.560 | - Yeah, I mean, good question.
01:10:12.600 | I'm trying to figure that out for myself, you know,
01:10:14.840 | like what should I teach?
01:10:16.360 | 'Cause I definitely feel like things have changed a bit,
01:10:21.280 | you know, one of the ways in which things have changed
01:10:23.880 | is that coding is much more accessible now.
01:10:27.480 | So if you look at a lot of the folks
01:10:28.760 | in the kind of open source LLM community,
01:10:31.080 | they're folks who really hadn't coded before a year ago
01:10:34.760 | and they're using these models to help them build stuff
01:10:37.520 | they couldn't build before,
01:10:38.840 | which is just fantastic, you know?
01:10:40.920 | So one thing I kind of think is like, okay,
01:10:44.600 | well, we need a lot more material to help these people
01:10:47.560 | use this newfound skill they have
01:10:49.280 | 'cause they don't really know what they're doing,
01:10:51.880 | you know, and they don't claim to,
01:10:53.520 | but they're doing it anyway
01:10:54.360 | and I think that's fantastic, you know?
01:10:55.760 | So like, are there things we could do to help people,
01:10:59.720 | you know, bridge this gap?
01:11:00.960 | 'Cause previously, you know,
01:11:03.280 | I know folks who were, you know,
01:11:05.760 | doing menial jobs a year ago
01:11:09.760 | and now they're training language models
01:11:12.120 | thanks to the help of Codex and Copilot and whatever.
01:11:17.080 | So, you know, yeah, what does it look like
01:11:18.680 | to like really grab this opportunity?
01:11:22.160 | You know, maybe Fast.ai's goals
01:11:24.480 | can be dramatically expanded now
01:11:26.760 | to being like, let's make coding more accessible,
01:11:30.520 | you know, kind of AI-oriented coding more accessible.
01:11:34.960 | If so, our course should probably look very different,
01:11:39.200 | you know, and we'd have to throw away that like,
01:11:41.120 | oh, you have to have at least a year
01:11:42.480 | of full-time programming, you know, as a prerequisite.
01:11:46.800 | Yeah, what would happen if we got rid of that?
01:11:50.520 | So that's kind of one thought that's in my head.
01:11:53.520 | You know, as to what should other people do,
01:11:56.680 | honestly, I don't think anybody has any idea,
01:12:01.720 | like the more I look at it, what's going on.
01:12:04.400 | I know I don't, you know,
01:12:05.800 | like we don't really know how to do anything very well.
01:12:08.600 | Clearly OpenAI do,
01:12:12.320 | like they seem to be quite good at some things
01:12:14.920 | Although, talking to folks at,
01:12:16.400 | or who have recently left, OpenAI.
01:12:19.280 | Even there, it's clear there's a lot of stuff
01:12:21.040 | they haven't really figured out
01:12:22.160 | and they're just kind of like using recipes
01:12:24.600 | that they've noticed have been okay.
01:12:27.080 | So yeah, we don't really know how to train these models well,
01:12:30.000 | we don't know how to fine tune them well,
01:12:31.240 | we don't know how to do RAG well,
01:12:33.200 | we don't know what they can do,
01:12:34.320 | we don't know what they can't do,
01:12:35.360 | we don't know how big a model you need
01:12:36.680 | to solve different kinds of problems,
01:12:38.480 | we don't know what kind of problems they can't do,
01:12:40.080 | we don't know what good prompting strategies are
01:12:42.080 | for particular problems, you know.
01:12:44.200 | Like somebody sent me a message the other day saying
01:12:47.920 | they've written something that is a prompting strategy
01:12:52.920 | for GPT-4.
01:12:55.160 | They've written like 6,000 lines of Python code
01:12:58.520 | and it's to help it play chess.
01:13:01.800 | And then they said they've had it play
01:13:04.160 | against other chess engines,
01:13:05.720 | including the best Stockfish engines.
01:13:07.800 | And it's got an ELO of 3,400.
01:13:11.920 | - Oh my God.
01:13:12.760 | - Which would make it close to
01:13:13.800 | the best chess engine in existence.
01:13:17.720 | And I think this is a good example of like,
01:13:20.400 | people were saying like GPT-4 can't play chess.
01:13:23.400 | I mean, I was sure that was wrong.
01:13:25.440 | I mean, obviously it can play chess,
01:13:27.360 | but the difference between like,
01:13:29.120 | with no prompting strategy,
01:13:31.560 | it can't even make legal moves,
01:13:33.080 | with good prompting strategies,
01:13:34.360 | it might be just about the best chess engine in the world.
01:13:37.240 | Far better than any human player.
01:13:39.000 | So yeah, I mean, we don't really know
01:13:40.400 | what the capabilities are yet.
01:13:41.600 | So I feel like it's all blue sky at this point.
01:13:45.200 | - It feels like computer vision in 2013 to me,
01:13:49.280 | which was like in 2013 computer vision.
01:13:51.240 | - We just had the AlexNet moment.
01:13:52.720 | - We've had AlexNet, we've had VGGNet.
01:13:56.000 | It's around the time Zeiler and Fergus, like,
01:13:58.720 | no, it's probably before that.
01:13:59.760 | So we hadn't yet had the Zeiler and Fergus, like,
01:14:01.400 | oh, this is actually what's going on inside the layers.
01:14:03.240 | So, you know, we don't actually know
01:14:06.400 | what's happening inside these transformers.
01:14:08.440 | We don't know how to create good training dynamics.
01:14:11.560 | We don't really know anything much.
01:14:14.800 | And there's a reason for that, right?
01:14:18.000 | And the reason for that is language models
01:14:21.520 | suddenly got really useful.
01:14:24.240 | And so the kind of economically rational thing to do,
01:14:28.760 | like this is not criticism, this is true.
01:14:31.160 | The economic rational thing to do is to like, okay,
01:14:33.160 | like build that as fast as possible, you know,
01:14:36.760 | make something work, get it out there.
01:14:39.560 | And that's what, you know, open AI in particular did,
01:14:43.440 | Anthropic kind of did.
01:14:44.840 | But there's a whole lot of technical debt everywhere.
01:14:50.840 | You know, nobody's really figured this stuff out
01:14:53.360 | because everybody's been so busy
01:14:55.160 | building what we know works as quickly as possible.
01:14:59.880 | So yeah, I think there's a huge amount of opportunity to,
01:15:03.040 | you know, I think we'll find things
01:15:04.800 | can be made to work a lot faster, a lot less memory.
01:15:11.520 | I got a whole bunch of ideas I want to try, you know,
01:15:14.360 | every time I look at something closely,
01:15:17.960 | like really closely, I'm always like, oh,
01:15:20.600 | turns out this person actually had no idea
01:15:22.240 | what they're doing, you know, which is fine.
01:15:25.840 | Like none of us know what we're doing.
01:15:27.640 | We should experiment with that.
01:15:30.680 | - We had Tri Dao on the podcast,
01:15:34.080 | who created FlashAttention.
01:15:35.960 | And I asked him,
01:15:37.520 | did nobody think of using SRAM before you?
01:15:40.240 | Like were people just like,
01:15:42.240 | and he was like, yeah, people just didn't think of it,
01:15:45.520 | didn't try, they didn't come from like a systems background.
01:15:48.440 | - Yeah, I mean, the thing about flash attention is,
01:15:51.240 | I mean, lots of people absolutely had thought of that
01:15:55.520 | and so had I, right?
01:15:56.720 | But I mean, the honest truth is, particularly before Triton,
01:16:00.040 | like everybody knew that tiling
01:16:04.480 | is the right way to solve anything.
01:16:05.920 | And everybody knew that attention,
01:16:08.400 | fused attention wasn't tiled, that was stupid.
01:16:11.440 | But not everybody's got his ability to like,
01:16:16.800 | be like, oh, well, I'm confident enough in CUDA
01:16:20.960 | and or Triton to use that insight to write something better.
01:16:25.200 | You know, and this is where like,
01:16:26.160 | I'm super excited about Mojo, right?
01:16:27.800 | And I always talk to Chris about flash attention
01:16:30.400 | 'cause I'm like, you know,
01:16:31.640 | there is a thousand flash attentions out there
01:16:34.200 | for us to build.
01:16:37.760 | You just gotta make it easy for us to build them.
01:16:40.400 | So like Triton definitely helps,
01:16:42.240 | but it's still not easy.
01:16:46.840 | You know, it still requires kind of really understanding
01:16:49.200 | the GPU architecture,
01:16:52.480 | writing it in that kind of very CUDA-ish way.
01:16:54.960 | So yeah, I think,
01:16:57.360 | I think, you know, if Mojo or something equivalent
01:17:00.440 | can really work well,
01:17:02.520 | we're gonna see a lot more flash attentions popping up.
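For anyone who hasn't seen why tiling matters, here is a plain NumPy sketch of the core idea behind FlashAttention-style kernels: process keys and values one block at a time with an online softmax, so the full N-by-N score matrix never has to exist. This is only the algorithmic skeleton, with none of the SRAM/CUDA machinery that makes the real kernels fast.

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Single-head attention computed one K/V tile at a time with an online
    softmax; the (N x N) score matrix is never materialized."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    row_max = np.full(N, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(N)           # running softmax denominator per row
    for start in range(0, N, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                    # scores against this tile only
        new_max = np.maximum(row_max, S.max(axis=1))
        # rescale previously accumulated numerator/denominator to the new max
        correction = np.exp(row_max - new_max)
        P = np.exp(S - new_max[:, None])          # unnormalized tile probabilities
        row_sum = row_sum * correction + P.sum(axis=1)
        out = out * correction[:, None] + P @ Vb
        row_max = new_max
    return out / row_sum[:, None]

# Sanity check against the naive implementation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
S_ref = (Q @ K.T) / np.sqrt(64)
P_ref = np.exp(S_ref - S_ref.max(axis=1, keepdims=True))
ref = (P_ref / P_ref.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref)
```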
01:17:07.680 | - Great, Jeremy, and before we wrap,
01:17:09.640 | we usually do a quick lightning round.
01:17:11.640 | We're gonna have three simple questions.
01:17:13.880 | So the first one is around acceleration.
01:17:16.240 | And you've been in this field a long time.
01:17:18.800 | What's something that it's already here today
01:17:21.240 | in AI that you thought would take much longer?
01:17:23.880 | - I don't think anything.
01:17:25.000 | So I've actually been slightly too bullish.
01:17:27.360 | So in my 2014 TED Talk,
01:17:30.640 | I had a graph and I said like,
01:17:34.680 | this is like the slope of human capabilities
01:17:37.240 | and this is the slope of AI capabilities.
01:17:39.960 | And I said, oh, and I put a dot saying we are here.
01:17:42.800 | And it was just before they passed.
01:17:45.160 | And I looked back at the transcript the other day
01:17:48.240 | and I said, in five years,
01:17:50.160 | I think we'll, you know,
01:17:52.040 | we might've crossed that threshold
01:17:53.680 | in which computers will be better at most human tasks
01:17:56.720 | than most humans, most average humans.
01:17:59.200 | And so that might be almost true now
01:18:03.560 | for non-physical tasks.
01:18:06.360 | So it took, you know,
01:18:09.080 | about twice as long as I thought it might.
01:18:11.960 | Yeah, no, I wouldn't say anything surprised me too much.
01:18:18.800 | It's still like, definitely like, I gotta admit,
01:18:22.120 | you know, I had a very visceral reaction
01:18:24.520 | using GPT-4 for the first time.
01:18:27.400 | Not because I found it surprising,
01:18:29.240 | but actually like, actually doing it,
01:18:33.080 | like it's something I was pretty sure
01:18:35.280 | would exist by about now, maybe a bit earlier.
01:18:38.560 | But actually using it definitely is different
01:18:41.960 | to just feeling like it's probably on its way, you know?
01:18:45.480 | And yeah, whatever GPT-5 looks like,
01:18:49.280 | I'm sure, I imagine I'll have the same visceral reaction,
01:18:55.600 | you know?
01:18:57.000 | - It's really amazing to watch develop.
01:19:01.080 | We also have an exploration question.
01:19:03.200 | So what do you think is the most interesting
01:19:05.200 | unsolved question in AI?
01:19:07.080 | - How do language models learn?
01:19:10.880 | You know, what are the training dynamics?
01:19:12.520 | Like, I wanna see, there was a great paper
01:19:16.800 | about ResNets a few years ago
01:19:21.320 | that showed how, that was able to like,
01:19:24.920 | plot a kind of projected three-dimensional loss surface
01:19:28.840 | for a ConvNet with and without skip connections.
01:19:33.840 | And you know, you could very clearly see
01:19:36.520 | without the skip connections, it was bumpy,
01:19:38.920 | and with the skip connections, it was super smooth.
01:19:41.520 | That's the kind of work we need.
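In the spirit of that ResNet visualization, here is a bare-bones sketch of the random-direction loss-surface idea (without the filter normalization the original paper used): perturb the trained weights along two random directions and evaluate the loss on a grid. The model, loss_fn, and batch layout are assumed inputs, purely for illustration.

```python
import torch

def loss_surface(model, loss_fn, batch, steps=21, span=1.0):
    """Evaluate loss on a 2D grid of perturbations around the trained weights,
    along two random directions in parameter space."""
    params = [p for p in model.parameters() if p.requires_grad]
    originals = [p.detach().clone() for p in params]
    d1 = [torch.randn_like(p) for p in params]
    d2 = [torch.randn_like(p) for p in params]
    grid = torch.linspace(-span, span, steps)
    surface = torch.zeros(steps, steps)
    with torch.no_grad():
        for i, a in enumerate(grid):
            for j, b in enumerate(grid):
                for p, o, u, v in zip(params, originals, d1, d2):
                    p.copy_(o + a * u + b * v)          # move to grid point
                surface[i, j] = loss_fn(model(batch["x"]), batch["y"]).item()
        for p, o in zip(params, originals):             # restore trained weights
            p.copy_(o)
    return surface
```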
01:19:45.440 | Like, so there was actually an interesting blog post
01:19:47.480 | that came out just today from the PyTorch team,
01:19:50.840 | where some of them have created this like,
01:19:53.120 | 3D matrix product visualization thing.
01:19:56.880 | - The MatMul Visualizer, yeah.
01:19:58.240 | - Yeah, and they actually showed some nice examples
01:20:01.040 | of like, a GPT-2 attention layer,
01:20:03.960 | and like, showed an animation and said like,
01:20:07.000 | if you look at this, we can actually see a bit
01:20:08.640 | about what it's doing.
01:20:10.080 | You know, so again, it reminds me of this Zeiler and Fergus,
01:20:13.880 | you know, ConvNet paper that was the first one
01:20:16.400 | to do these reverse convolutions,
01:20:18.680 | to show you what's actually being learned
01:20:20.600 | in each layer in a ConvNet.
01:20:22.280 | Yeah, we need a lot more of this, like,
01:20:24.240 | what is going on inside these models?
01:20:27.760 | How do they actually learn?
01:20:29.160 | And then how can we use those insights
01:20:31.680 | to help them to learn better?
01:20:35.520 | So I think that'd be one.
01:20:36.360 | The other exploration I'd really like to see
01:20:37.920 | is a much more rigorous analysis
01:20:41.320 | of what kind of data do they need,
01:20:44.880 | at what level, and when do they need it, and how often.
01:20:48.640 | So that kind of like, data set mixing, curation, so forth,
01:20:52.520 | in order to get the best capabilities.
01:20:55.240 | Yeah, how much is Wikipedia?
01:20:57.080 | Yeah, fine tune, what kind of mix do you need
01:21:02.560 | for it to keep its capabilities?
01:21:04.920 | And what are the kind of underlying capabilities
01:21:06.640 | that it most needs to keep?
01:21:07.760 | And if it loses those, it would lose all these other ones.
01:21:09.960 | And what data do you need to keep those?
01:21:11.640 | And, you know, other things we can do
01:21:13.560 | to change the loss function,
01:21:15.320 | to help it to not forget to do things, stuff like that.
01:21:20.320 | - Awesome, and yeah, before wrapping,
01:21:22.880 | what's one message, one idea
01:21:25.360 | you want everyone to remember and think about?
01:21:27.880 | - Yeah, I guess the main thing I want everybody to remember
01:21:30.320 | is that, you know, there's a lot of people in the world,
01:21:33.600 | and they have a lot of, you know,
01:21:36.280 | diverse experiences and capabilities.
01:21:39.280 | And, you know, they all matter.
01:21:43.080 | And now that we have a, you know,
01:21:47.000 | newly powerful technology in our lives,
01:21:49.840 | we could think of that one of two ways.
01:21:52.280 | One would be, gee, that's really scary.
01:21:57.280 | What would happen if all of these people in the world
01:21:59.960 | had access to this technology?
01:22:01.280 | Some of them might be bad people.
01:22:03.560 | Let's make sure they can't have it.
01:22:05.680 | Or one might be, wow, of all those people in the world,
01:22:10.000 | I bet a lot of them could really improve
01:22:13.080 | the lives of a lot of humanity if they had this tool.
01:22:15.760 | This has always been the case, you know,
01:22:19.680 | from the invention of writing
01:22:21.720 | to the invention of the printing press
01:22:23.640 | to the, you know, development of education.
01:22:26.280 | And it's been a constant battle
01:22:29.360 | between people who think that distributed power is unsafe,
01:22:33.720 | and it should be held onto by an elite few,
01:22:36.560 | and people who think that humanity on net, you know,
01:22:41.560 | is a marvelous species,
01:22:46.800 | particularly when part of a society and a civilization,
01:22:49.440 | and we should do everything we can
01:22:51.320 | to enable more of them to contribute.
01:22:55.320 | This is a really big conversation right now.
01:22:59.680 | And, you know, I want to see more and more people
01:23:03.800 | showing up and showing what, you know,
01:23:09.240 | what the great unwashed masses out there
01:23:13.120 | can actually achieve, you know,
01:23:14.520 | that actually, you know,
01:23:16.640 | regular people are going to do a lot of really valuable work
01:23:21.200 | and actually help us be, you know,
01:23:26.160 | more safe and also flourishing in our lives
01:23:32.320 | and providing a future for our children to flourish in,
01:23:36.920 | you know, if we lock things down
01:23:45.160 | to the people that we think, you know,
01:23:49.000 | the elites that we think can be trusted to run it for us.
01:23:52.800 | Yeah, I think all bets are off
01:23:54.080 | about where that leaves us as a society, you know.
01:23:59.080 | - Yep, yeah, that's an important message.
01:24:06.320 | And yeah, that's why we've been promoting
01:24:08.280 | a lot of open source developers, open source communities,
01:24:11.800 | I think, letting the builders build.
01:24:15.000 | - And explore, that's always a good idea.
01:24:17.080 | - Yeah.
01:24:18.200 | - Thank you so much for coming on, Jeremy.
01:24:19.760 | This was great.
01:24:20.960 | - Thank you for having me.
01:24:22.160 | (upbeat music)