The End of Finetuning — with Jeremy Howard of Fast.ai
Chapters
0:00 Introduction
1:14 Jeremy’s background
2:53 Founding FastMail and Optimal Decisions
4:05 Starting Fast.ai with Rachel Thomas
5:28 Developing the ULMFiT natural language processing model
10:11 Jeremy’s goal of making AI more accessible
14:30 Fine-tuning language models - issues with memorization and catastrophic forgetting
18:09 The development of GPT and other language models around the same time as ULMFiT
20:00 Issues with validation loss metrics when fine-tuning language models
22:16 Jeremy’s motivation to do valuable work with AI that helps society
26:39 Starting fast.ai to spread AI capabilities more widely
29:27 Overview of fast.ai - courses, library, research
34:20 Using progressive resizing and other techniques to win the DAWNBench competition
38:42 Discovering the single-shot memorization phenomenon in language model fine-tuning
43:13 Why fine tuning is simply continued pre-training
46:47 Chris Lattner and Modular AI
48:38 Issues with incentives and citations limiting innovation in research
52:49 Joining AI research communities through Discord servers
55:23 Mojo
63:08 Most exciting areas - continued focus on transfer learning and small models
66:56 Pushing capabilities of small models through transfer learning
70:58 Opening up coding through AI to more people
73:51 Current state of AI capabilities compared to computer vision in 2013 - lots of basic research needed
77:08 Lightning Round
00:00:13.880 |
This is Alessio, partner and CTO in Residence at Decibel Partners. 00:00:17.680 |
And I'm joined by my co-host, Swyx, founder of Smol.ai. 00:00:21.280 |
Hey, and today we have in the remote studio, Jeremy Howard, 00:00:34.760 |
I'm actually very used to seeing you in your mask 00:00:38.000 |
as a message to people, but today we're mostly audio. 00:00:41.760 |
But thank you for doing the very important public service 00:00:47.680 |
It was all very annoying, and frustrating, and tedious. 00:01:06.000 |
Something I did not know about you was that you graduated 00:01:08.840 |
with a BA in philosophy from the University of Melbourne. 00:01:18.600 |
because I was working 80 to 100 hour weeks at McKinsey 00:01:37.120 |
or you're very sort of self-driven and self-motivated. 00:01:39.760 |
I just-- I took two weeks off before each exam period 00:01:52.880 |
oh, I was meant to be in your class this semester, 00:01:56.280 |
Were there any assignments I was meant to have done, whatever? 00:01:59.040 |
And I can't believe all of them let me basically have-- 00:02:03.720 |
they basically always would say, like, OK, well, 00:02:05.800 |
if you can have this written by tomorrow, I'll accept it. 00:02:08.400 |
So yeah, stressful way to get through university, but-- 00:02:13.120 |
Well, it shows that, I guess, you min-maxed the opportunities. 00:02:19.760 |
Finally, like, in as much as I, you know, in philosophy, 00:02:24.200 |
the things I found interesting and focused on 00:02:26.960 |
in the little bit of time I did spend on it was ethics 00:02:31.320 |
And it's kind of really amazing that it's now come back around, 00:02:34.520 |
and those are actually genuinely useful things 00:02:37.120 |
to know about, which I never thought would happen. 00:02:39.240 |
A lot of-- yeah, a lot of relevant conversations there. 00:02:48.000 |
you founded both Optimal Decisions and Fastmail, 00:02:50.920 |
which I also briefly used, so thank you for that. 00:02:52.920 |
Good for you, yeah, 'cause I had read the statistics, 00:02:55.680 |
which is that, like, 90% or something of small businesses 00:02:58.600 |
fail, so I thought if I start two businesses, 00:03:04.160 |
some kind of stochastic thing I didn't have control over, 00:03:10.640 |
And then you were president and chief scientist 00:03:12.760 |
at Kaggle, which obviously is the competition platform 00:03:23.240 |
where you were working on using deep learning 00:03:25.520 |
to improve medical diagnostics and clinical decisions. 00:03:33.200 |
And even now, that's still, like, a pretty early phase. 00:03:36.680 |
And I actually heard you on your new podcast with Tanishq, 00:03:40.560 |
where you went very, very deep into the stuff, 00:03:47.320 |
Maybe he's too old to be called a prodigy now, ex-prodigy. 00:03:55.720 |
you have a lot more other credentials, obviously, 00:04:01.080 |
which is still, I guess, your primary identity 00:04:11.480 |
And I can't imagine a better way to describe it than Fast.ai. 00:04:19.440 |
seven weeks or something, and that's amazing. 00:04:24.480 |
when we started that, what was that, like, 2016 or something, 00:04:58.280 |
was considered kind of ridiculous when we started it. 00:05:03.040 |
And we weren't sure if it was possible either, 00:05:04.600 |
but we kind of felt like we had to give it a go 00:05:06.560 |
'cause the alternative was we were pretty sure 00:05:08.720 |
that deep learning was on its way to becoming, 00:05:12.320 |
you know, the most or one of the most, you know, 00:05:29.160 |
And, you know, well, I just wanted to know one thing 00:05:33.760 |
you were also the top rank participant in both 2010 and 2011. 00:05:37.800 |
So sometimes you see a lot of founders running companies 00:05:40.960 |
that are not really in touch with the problem, 00:05:53.480 |
which was kind of the predecessor to multitask learning 00:06:15.960 |
What were some of the kind of like contrarian takes 00:06:25.960 |
on what was kind of like the state of the art, 00:06:32.360 |
- Yeah, the whole thing was a contrarian take. 00:06:56.720 |
And then when we'd ask those few number of people, 00:07:02.400 |
you know, a box of tricks that aren't published. 00:07:04.240 |
So you have to join one of the labs and learn the tricks. 00:07:13.480 |
but thankfully there was Theano and wrappers, 00:07:27.720 |
very hard to get started in terms of the compute, 00:07:32.280 |
So, you know, everything was kind of inaccessible. 00:07:36.680 |
And, you know, as we started looking into it, 00:07:41.560 |
we had a key insight, which was like, you know what? 00:07:45.680 |
Most of the compute and data for image recognition, 00:07:53.280 |
You know, there's this thing which nobody knows about, 00:08:00.560 |
where they already figured out like how to detect edges 00:08:04.840 |
and gradients and corners and text and whatever else. 00:08:07.800 |
And then you can fine tune it to do the thing you wanna do. 00:08:21.360 |
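A minimal sketch of the kind of ImageNet transfer learning being described here, in fastai-style Python; the dataset path and layout are placeholders rather than anything from the episode:

```python
from fastai.vision.all import *

# Assumed layout: path/train and path/valid, one subfolder per class (placeholder).
path = Path("data/my_images")
dls = ImageDataLoaders.from_folder(path, valid="valid", item_tfms=Resize(224))

# Start from an ImageNet-pretrained backbone that already "knows" edges,
# gradients, corners and textures, then fine-tune it on the new task.
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(3)  # a few epochs is often enough when transferring
```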
we focused from day one on transfer learning, 00:08:27.400 |
It was something not normally even mentioned in, 00:08:30.280 |
I mean, there wasn't much in the way of courses. 00:08:32.680 |
You know, really the courses out there were PhD programs 00:08:39.720 |
that had happened to have recorded their lessons. 00:08:46.840 |
that seemed really useful, you know, work with vision, 00:08:54.880 |
and collaborative filtering and work with text. 00:08:56.760 |
'Cause we felt like those four kind of modalities 00:09:04.480 |
And no one was doing anything much useful with text. 00:09:06.600 |
Everybody was talking about word2vec, you know, 00:09:08.840 |
like king plus queen minus woman and blah, blah, blah. 00:09:18.600 |
but nobody's doing anything like useful with it. 00:09:20.720 |
NLP was all like lemmatization and stop words 00:09:29.640 |
And it was really academic and not practical. 00:09:42.320 |
since I had done cognitive science at university, 00:09:51.000 |
what if there was somebody that could kind of like, 00:09:53.760 |
knew all of the symbolic manipulations required 00:10:03.680 |
with no other way to talk to the outside world 00:10:11.000 |
and then they pass back a piece of paper with Chinese back. 00:10:16.760 |
is actually fantastically good at answering any question 00:10:26.520 |
something that's intelligently working with Chinese? 00:10:29.880 |
Ever since that time, I'd say the most thought, 00:10:34.400 |
and compelling philosophical response is yes. 00:11:18.240 |
You know, whether that means it is intelligent or not 00:11:23.960 |
Yeah, and then when I came across neural nets 00:11:28.520 |
about the universal approximation theorem and stuff. 00:11:35.360 |
take in enough data to be a Chinese room experiment. 00:11:41.880 |
and this kind of like interest in transfer learning, 00:11:59.280 |
how would something learn to kind of answer questions 00:12:05.680 |
And I thought, well, what if we used a language model? 00:12:08.160 |
So language models are already a thing, you know, 00:12:14.240 |
that you could train a model to fill in the gaps, 00:12:17.240 |
or actually in those days it wasn't fill in the gaps, 00:12:20.760 |
And in fact, Andrej Karpathy did his fantastic RNN 00:12:27.520 |
where he showed like you can have it ingest Shakespeare 00:12:35.600 |
I thought, okay, so if I do this at a much bigger scale, 00:12:45.120 |
to finish a sentence in Wikipedia effectively, 00:12:52.560 |
I thought, geez, it would actually have to know 00:12:55.480 |
You know, it'd have to know that there is a world 00:12:58.160 |
and that objects relate to each other through time 00:13:04.560 |
and that when there are animals and there are people 00:13:13.680 |
And then you could, you know, all that together, 00:13:17.800 |
this was signed into law in 2016 by US President X 00:13:27.120 |
what in those days was considered a big language model, 00:13:32.000 |
which is, that was, you know, a bit unheard of. 00:13:40.480 |
what latent capabilities would such a system have 00:13:45.480 |
that would allow it to finish those kinds of sentences? 00:13:53.920 |
based on our work with Transfer Learning and Vision, 00:13:56.040 |
that I could then suck out those latent capabilities 00:14:01.560 |
by fine-tuning it on a task data set or whatever. 00:14:06.400 |
So step one was train a language model on a big corpus, 00:14:14.400 |
and step three was further fine-tune that model on a task. 00:14:18.280 |
And of course that's what everybody still does today, right? 00:14:34.120 |
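As a rough sketch, the three steps in fastai-flavoured Python; the IMDB corpus here is just a stand-in for "a big corpus, a domain corpus, and a task", and the original ULMFiT work used an AWD-LSTM:

```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)  # stand-in corpus for the sketch

# Step 1: start from a language model pretrained on a big general corpus
# (fastai ships AWD_LSTM weights pretrained on Wikipedia).
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
lm = language_model_learner(dls_lm, AWD_LSTM, metrics=accuracy)

# Step 2: fine-tune that language model on the target domain's text.
lm.fine_tune(1)
lm.save_encoder("domain_encoder")

# Step 3: fine-tune again on the downstream task (here, classification).
dls_clas = TextDataLoaders.from_folder(path, valid="test", text_vocab=dls_lm.vocab)
clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
clas.load_encoder("domain_encoder")
clas.fine_tune(1)
```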
And so you asked, to what degree was this kind of like 00:14:37.840 |
pushing against the, you know, established wisdom? 00:14:52.080 |
Everybody was like, "It definitely won't work. 00:14:57.960 |
Language is a much more vastly complicated domain." 00:15:03.000 |
We know from like philosophy and theory of mind 00:15:05.080 |
that it's actually impossible for it to work. 00:15:14.880 |
to actually get the data and like set up the training? 00:15:17.160 |
Or like, were people just lazy and kind of like, 00:15:22.080 |
So like, so the person I thought at that time who, 00:15:24.760 |
there were two people I thought at that time actually 00:15:37.040 |
after I'd released ULMFIT and he had released GPT, 00:15:49.520 |
and Kate was like, "So how did, you know, GPT come about?" 00:15:55.920 |
"that pre-training on a general large corpus wouldn't work, 00:16:01.240 |
"And then I read ULMFIT and turns out it did work. 00:16:25.000 |
I did a very popular talk at KDD, the conference, 00:16:59.560 |
I mean, even like, so we were scooped a little bit 00:17:08.480 |
They had already, I didn't even realize this, 00:17:17.560 |
But again, they didn't create a general purpose 00:17:21.720 |
large language model on a general purpose corpus. 00:17:23.600 |
They only ever tested a domain specific corpus. 00:17:28.280 |
And I haven't spoken to Quoc actually about that, 00:17:28.280 |
of mulling over the Searle Chinese room experiment 00:17:43.720 |
that had convinced me that it probably would work. 00:17:49.120 |
I just dug up Alec's announcement tweet from 2018. 00:17:54.120 |
He said, "Inspired by Kobe, Elmo, and Yola, I'm fit. 00:17:57.640 |
"We showed a single transformer language model 00:18:06.160 |
kind of like the research lab pushing forward the field. 00:18:11.000 |
You know, like kind of like going back five years, 00:18:12.960 |
people think of OpenAI as an overnight success, 00:18:22.960 |
that was kind of diametrically opposed to ULMFiT. 00:18:38.400 |
I think at Salesforce had come out with this neat model 00:19:00.000 |
But yeah, there was a bit of this stuff going on. 00:19:14.640 |
And like, I literally did tours trying to get people 00:19:24.600 |
particularly after GPT showed such good results 00:19:29.240 |
And so I actually feel like we kind of went backwards 00:19:34.640 |
but I kind of got so disappointed and dissuaded by like, 00:19:41.480 |
it felt like these bigger lab, much bigger labs, 00:19:44.800 |
you know, like Fast.ai had only ever been just me 00:19:47.040 |
and Rachel were getting all of this attention 00:19:51.560 |
for an approach I thought was the wrong way to do it. 00:19:54.440 |
You know, I was convinced was the wrong way to do it. 00:19:56.400 |
And so, yeah, for years people were really focused 00:20:00.960 |
And it wasn't until, you know, this key idea of like, 00:20:15.600 |
And then in step three, rather than fine-tuning 00:20:18.080 |
on a reasonably specific task classification, 00:20:20.520 |
let's fine-tune on a RLHF task classification. 00:20:25.040 |
And so that was really, that was really key, you know? 00:20:41.160 |
which I was convinced was not the right direction, 00:20:47.440 |
not at a university, or at least I wasn't then. 00:21:02.360 |
who does not wanna build stuff on lots of big computers 00:21:07.360 |
because most people don't have lots of big computers. 00:21:11.040 |
And I hate creating stuff that most people can't use, 00:21:14.560 |
And also stuff that's created on lots of big computers 00:21:17.600 |
has always been like much more media-friendly. 00:21:23.520 |
but actually throughout my 30 years in data science, 00:21:42.280 |
and they have like terabytes of data available, 00:21:46.480 |
And yeah, that's always what people wanna talk about, 00:21:59.680 |
And to me, it's a huge distraction, you know, 00:22:11.720 |
not the small subset of the most well-off people. 00:22:22.640 |
that a lot of times you're not telling your own story, 00:22:28.200 |
And the other thing before we jump into Fast.ai, 00:22:30.720 |
actually, you know, a lot of people that I know, 00:22:33.720 |
they run across a new architecture and whatnot, 00:22:37.480 |
and raise a bunch of money and do all of this stuff. 00:22:45.120 |
Was it because you already had like a successful, 00:23:09.920 |
And I didn't really know what any of those things were 00:23:12.000 |
really until after we started Kaggle, to be honest. 00:23:14.240 |
Even though I had started to what we now call startups, 00:23:16.520 |
I just thought they were just small businesses. 00:23:20.840 |
So yeah, so those two companies were FastMail 00:23:24.800 |
FastMail was the first kind of synchronized email provider 00:23:30.880 |
So something you can get your same email at home 00:23:34.000 |
on your laptop, at work, on your phone, whatever. 00:23:39.520 |
invented a new approach to insurance pricing, 00:23:43.120 |
something called profit-optimized insurance pricing. 00:23:46.280 |
So I saw both of those companies, you know, after 10 years. 00:23:56.280 |
that as a teenager, I had wanted to do, you know. 00:24:01.760 |
'cause I spent way longer in management consulting 00:24:16.880 |
And I kind of reflected and I was like, "I'm not. 00:24:25.240 |
You know, it's quite nice to have synchronized email. 00:25:06.240 |
I wasn't sure if I'd ever work again, actually. 00:25:16.960 |
Like I wasn't super rich, but I had enough money. 00:25:20.360 |
And I certainly recognize that amongst the other people 00:25:32.480 |
And I thought, I don't want to be one of those idiots 00:25:37.440 |
buying a bigger plane than the next guy or whatever. 00:26:01.320 |
well, how can we be the most helpful to society as a whole 00:26:30.040 |
You know, sadly, it looks like it still is likely to happen, 00:26:45.800 |
your courses, your research that you publish, 00:26:49.240 |
you know, just the other day you published a finding 00:26:52.600 |
on, you know, learning that I think is still something 00:26:56.880 |
that people are still talking about quite a lot. 00:27:02.760 |
of a lot of people who are gonna be, you know, 00:27:05.000 |
little Jeremy Howards furthering your mission with, 00:27:07.280 |
you know, you don't have to do everything by yourself 00:27:10.800 |
You know, that was a big takeaway from like "Analytic" 00:27:14.680 |
was that in "Analytic" it definitely felt like 00:27:25.360 |
And there's a lot of other things I'd like to solve 00:27:27.840 |
So that was definitely the other piece was like, 00:27:42.680 |
Like I find nowadays, at least half the time, 00:27:46.640 |
probably quite a bit more that I get in contact 00:27:50.640 |
with somebody who's done really interesting work 00:28:00.320 |
And I also know from talking to folks at places 00:28:06.320 |
which, you know, there's lots of alumni there 00:28:08.720 |
I got here and like half of the people are Fast.ai alumni. 00:28:22.320 |
and they need to know more about deep learning, 00:28:26.400 |
And the OpenAI Scholars Program was doing the same thing. 00:28:29.640 |
So it's kind of like, yeah, it's had a surprising impact. 00:28:35.360 |
You know, that's just one of like three things we do 00:28:45.320 |
either me and Rachel or me and Sylvain nowadays, 00:28:49.200 |
So yeah, I think it shows you don't necessarily need 00:28:51.400 |
a huge amount of money and a huge team of people 00:29:00.840 |
for people who may not have dived into it much, 00:29:07.520 |
There is the library that is very well loved. 00:29:15.440 |
on top of PyTorch that people should start with PyTorch 00:29:18.600 |
and use it as the basis for a lot of your courses. 00:29:27.200 |
- Oh, so the three areas were research, software, 00:29:32.560 |
- Oh, sorry, I was going by, in terms of software. 00:29:34.760 |
- Software, you know, Fast.ai is the main thing, 00:29:46.120 |
GHAPI, I mean, dozens of open source projects 00:29:55.320 |
and some of them are still a little bit hidden, actually. 00:29:57.640 |
I should, some of them I should try to do a better job 00:30:05.040 |
Like, for example, for working with EC2 and AWS, 00:30:11.920 |
and nice to use than anything else out there. 00:30:16.280 |
dynamic autocomplete that works both on the command line 00:30:25.840 |
I try to make like, when I work with some domain, 00:30:32.080 |
I wanna make it as enjoyable as possible for me to do that. 00:30:38.960 |
I think that GitHub API is incredibly powerful, 00:30:45.600 |
'cause I didn't particularly like the libraries 00:30:50.040 |
it like autocompletes both at the command line 00:31:01.680 |
I think it's like less than a hundred K of code 00:31:09.440 |
from the official OpenAPI spec that GitHub produces. 00:31:14.120 |
And like if you're in GitHub and you just type an API, 00:31:18.960 |
you know, autocomplete API method and hit enter, 00:31:32.080 |
You know, GitHub Actions I can write now in Python, 00:31:46.440 |
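For flavour, a small hedged example of ghapi; the repo is just an illustration, and the method names come from the operation ids in GitHub's OpenAPI spec, which is how ghapi generates them:

```python
from ghapi.all import GhApi

# Token can be passed explicitly or picked up from the GITHUB_TOKEN env var.
api = GhApi(owner="fastai", repo="fastai")

# "issues/list-for-repo" in the spec becomes api.issues.list_for_repo here.
for issue in api.issues.list_for_repo(state="open", per_page=5):
    print(issue.number, issue.title)
```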
You described the third arm of FastAI as research. 00:32:11.720 |
So to me, the main artifact shouldn't be papers 00:32:18.240 |
You know, to me, the main artifacts should be like 00:32:24.480 |
and here's software you can use that builds it in. 00:32:30.440 |
three first person papers in my life, you know? 00:32:33.120 |
And they were, and none of those are ones I wanted to do. 00:32:53.800 |
And it's like, "Okay, well, I want to help you 00:33:00.960 |
which just had to exist and nobody else was writing it. 00:33:04.720 |
And then the third was the Fast.ai library paper, 00:33:16.360 |
We will waive the fee for the journal and everything 00:33:19.200 |
and actually help you get it through publishing and stuff." 00:33:27.120 |
So the research is like, well, so for example, 00:33:39.840 |
of like, who can train neural nets the fastest 00:33:45.840 |
And specifically it was who can train ImageNet the fastest. 00:34:04.840 |
to smash DAWNBench so that they could prove to people 00:34:08.960 |
that they had to use Google Cloud and use their TPUs 00:34:14.040 |
And we kind of thought, "Oh shit, this would be a disaster 00:34:16.400 |
if they do that, because then everybody's going to be like, 00:34:22.040 |
you have to be Google and you have to use special silicon. 00:34:24.320 |
And so, you know, we only found out about this 10 days 00:34:32.160 |
an emergency bunch of our students and Rachel and I 00:34:36.120 |
and sat for the next 10 days and just tried to crunch through 00:34:52.560 |
train on non-square things, you know, stuff like that. 00:34:57.640 |
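A rough sketch of the progressive-resizing trick mentioned here, using fastai and the small Imagenette subset as a stand-in dataset; sizes and epoch counts are illustrative, not the actual DAWNBench settings:

```python
from fastai.vision.all import *

path = untar_data(URLs.IMAGENETTE_160)  # small ImageNet subset, just for the sketch

def get_dls(size):
    # Same data, different image size - the cheap part of progressive resizing.
    return ImageDataLoaders.from_folder(path, valid="val", item_tfms=Resize(size))

# Do most of the training on small, cheap images...
learn = vision_learner(get_dls(96), resnet50, metrics=accuracy)
learn.fine_tune(4)

# ...then swap in larger images for a few final epochs; the weights carry over,
# only the input resolution changes.
learn.dls = get_dls(160)
learn.fine_tune(2)
```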
And so, yeah, we ended up winning, thank God. 00:35:02.080 |
And so, you know, we turned it around from being like, 00:35:05.160 |
like, "Oh shit, you know, this is going to show 00:35:24.320 |
So how do we get better results with less data, 00:35:30.480 |
with less education, you know, stuff like that. 00:35:34.440 |
So ULMFiT is obviously a good example of that. 00:35:42.920 |
Maybe, could you tell the story a little bit behind that? 00:35:48.160 |
into the learning of very low resource literature. 00:36:18.440 |
is that your model has to run on Kaggle within nine hours. 00:36:26.040 |
So you've only got 14 gig RAM, only two CPUs, 00:36:31.800 |
So this is cool, you know, if you can do well at this, 00:36:38.520 |
So yeah, Jono and I were playing around with fine tuning, 00:36:44.800 |
of course, transfer learning, pre-trained language models. 00:36:52.640 |
so we always, you know, plot our losses as we go. 00:36:57.600 |
when he worked with us, created a library called fastprogress, 00:36:59.720 |
which is kind of like tqdm, but we think a lot better. 00:37:05.880 |
and they kind of go down, down, down, down, down, down, 00:37:11.960 |
and then down, down, down, down, down a little bit, 00:37:16.520 |
These clunks are occurring at the end of each epoch. 00:37:23.600 |
this would be, you know, I've seen this before, 00:37:29.880 |
oh, we accidentally forgot to turn on eval mode 00:37:39.080 |
moving average statistics throughout the epoch, 00:37:41.440 |
so, you know, if it's recently moving average or whatever, 00:37:47.240 |
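For reference, the suspected bug here is the classic one below: if you forget `model.eval()`, BatchNorm keeps updating its running statistics on validation batches, which can produce exactly this kind of end-of-epoch jump. A minimal PyTorch illustration:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

def validate(model, batches):
    model.eval()               # freeze BatchNorm running stats, disable dropout
    with torch.no_grad():      # no gradients needed for validation
        for xb in batches:
            _ = model(xb)
    model.train()              # switch back before the next training epoch

# If model.eval() were omitted, each validation pass would keep nudging the
# BatchNorm moving averages, silently changing the model between epochs.
validate(model, [torch.randn(4, 3, 32, 32) for _ in range(2)])
```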
So, you know, I did not give my friends at HuggingFace 00:37:51.160 |
I thought, oh, they've fucked up HuggingFaceTrainer, 00:37:59.800 |
We still saw the clunks, and, you know, that's, 00:38:12.600 |
like nothing happens, or nothing's meant to happen 00:38:22.800 |
So I kind of asked around on the open source discords, 00:38:29.200 |
And everybody was just like, oh, that's just what, 00:38:30.880 |
that's just what these training curves look like. 00:38:35.040 |
And I was like, oh, are you all using Trainer? 00:38:37.480 |
Yes, oh, well, there must be some bug with Trainer. 00:38:40.440 |
And I was like, well, we also saw it in Learner, 00:38:42.160 |
and somebody else was like, no, we've got our own Trainer. 00:38:50.040 |
I can't just be like, here's something that's like, 00:38:55.480 |
nobody ever saw it, and now suddenly we see it. 00:39:03.480 |
This is, was everyone that you're talking to, 00:39:05.960 |
were they all seeing it for the same data set 00:39:08.880 |
- Different data sets, different trainers. 00:39:11.920 |
They're just like, no, this is just what it looks like 00:39:18.720 |
- I hadn't seen it before, but I'd been kind of like, 00:39:32.160 |
I mean, Llama has only been out for a few months, right? 00:39:53.040 |
So yeah, they're just like, no, this is all what we see. 00:39:58.200 |
So yeah, I've got a very kind of like, I don't know, 00:40:01.480 |
I've got this brain where I have to know why things are. 00:40:15.920 |
Like, look at this, the loss has dropped by 0.3. 00:40:19.480 |
0.3, which is like, basically it knows the answer. 00:40:30.040 |
So yeah, so look, Jono and I did not discover this. 00:40:34.160 |
And Jono and I did not come up with a hypothesis. 00:40:39.640 |
to recognize that like, this isn't how it's meant to work. 00:40:42.920 |
And so we, you know, and so we went back and like, 00:40:46.120 |
okay, let's just run some experiments, you know, 00:40:48.920 |
'cause nobody seems to have actually published 00:40:55.040 |
but nobody ever actually stepped back and said like, 00:41:01.880 |
And so, yeah, we created a bunch of experiments 00:41:06.080 |
It's like, okay, if this hypothesis is correct, 00:41:09.520 |
then we ought to see blah under conditions blah, 00:41:25.240 |
which in hindsight, it's not totally surprising 00:41:32.120 |
because the theory, remember, of the ULMFiT theory 00:41:32.120 |
all these latent capabilities to make it easier 00:41:42.000 |
So if it's got all this kind of latent capability, 00:41:45.320 |
it ought to also be really good at compressing new tokens 00:41:48.640 |
because it can immediately recognize it as like, 00:41:58.520 |
but it is, it requires us to rethink everything 00:42:11.760 |
Like maybe it's fine that it's memorized the data set 00:42:22.840 |
Don't, you know, don't, I keep telling people, 00:42:24.520 |
don't track validation loss, track validation accuracy, 00:42:30.920 |
There's another thing that's got lost since ULMFiT, 00:42:33.120 |
nobody tracks accuracy of language models anymore. 00:42:35.840 |
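A minimal sketch of what tracking accuracy rather than loss means for a causal language model; it assumes the usual (batch, seq_len, vocab) logits with the label shift done inside the function:

```python
import torch

def next_token_accuracy(logits, labels, ignore_index=-100):
    """Fraction of next-token predictions that are exactly right."""
    preds = logits[:, :-1].argmax(dim=-1)   # prediction for position t+1
    targets = labels[:, 1:]                 # the actual token at position t+1
    mask = targets != ignore_index          # skip padding / masked labels
    return (preds[mask] == targets[mask]).float().mean().item()

# Toy usage: batch of 2 sequences, length 5, vocabulary of 10 tokens.
logits = torch.randn(2, 5, 10)
labels = torch.randint(0, 10, (2, 5))
print(next_token_accuracy(logits, labels))
```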
But you know, it'll still keep learning and it does, 00:42:44.280 |
You know, like, is it like, now that it's kind of 00:42:47.000 |
memorized it, it's probably getting a less strong signal, 00:42:55.640 |
language models properly and I haven't found anybody 00:42:57.920 |
who feels like they do, like nobody really knows 00:43:05.760 |
it's probably some things that you can do usefully with it. 00:43:14.440 |
- It doesn't come at the cost of catastrophic forgetting 00:43:19.560 |
- It does to some extent, like we know it does, 00:43:26.320 |
So Code Llama was a, I think it was like a 500 billion 00:43:48.560 |
before they released it, me and lots of people 00:43:55.080 |
I hope they kept at least like 50% non-code data, 00:43:58.240 |
'cause otherwise it's gonna forget everything else. 00:44:00.440 |
And they didn't, only like 0.3% of their epochs 00:44:08.920 |
So now it's good at code and it's bad at everything else. 00:44:12.840 |
So we definitely have catastrophic forgetting. 00:44:14.640 |
It's fixable, just somebody has to do, you know, 00:44:17.920 |
somebody has to spend their time training a model 00:44:26.160 |
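A hedged sketch of what keeping a healthy share of non-code data might look like with the Hugging Face `datasets` library; the dataset names are placeholders and the 50/50 split is just the figure mentioned above, not a recommendation baked into the library:

```python
from datasets import load_dataset, interleave_datasets

# Placeholder dataset names - substitute whatever corpora you actually use.
code = load_dataset("my-org/code-corpus", split="train", streaming=True)
general = load_dataset("my-org/general-corpus", split="train", streaming=True)

# Sample roughly half code, half general text during continued pre-training,
# so the model keeps seeing the kind of data it was originally trained on.
mixed = interleave_datasets([code, general], probabilities=[0.5, 0.5], seed=42)

for example in mixed.take(3):   # streaming datasets support .take()
    print(example.keys())
```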
Even though I originally created the three-step approach 00:44:34.040 |
my view is it's actually wrong and we shouldn't use it. 00:44:36.840 |
And that's because people are using it in a way different 00:44:46.000 |
You know, I created it thinking that the task-specific 00:44:51.280 |
You know, it's like, oh, this is like a sentiment classifier. 00:44:57.280 |
but the tasks now are like a, you know, RLHF, 00:45:03.360 |
that make people feel happy about your answer. 00:45:09.440 |
And so we see, for example, RLHF also breaks models, 00:45:18.160 |
we know from kind of the work that Microsoft did, 00:45:21.680 |
you know, the earlier less-aligned version was better. 00:45:36.600 |
is to actually throw away the idea of fine-tuning. 00:45:42.600 |
And pre-training is something where, from the very start, 00:45:46.280 |
you try to include all the kinds of data that you care about, 00:45:49.880 |
all the kinds of problems that you care about, 00:45:55.840 |
general purpose document completion, whatever. 00:45:59.280 |
And then as you train, you gradually curate that, 00:46:05.400 |
you know, you gradually make that higher and higher quality 00:46:36.040 |
And that's why we're seeing a lot of these, you know, 00:46:40.160 |
so-called alignment tax and this view of like, 00:46:42.960 |
"Oh, a model can't both code and do other things." 00:46:49.240 |
- Well, I think you have a clear anti-laziness approach. 00:46:53.800 |
I think other people are not as good-hearted, you know? 00:46:57.440 |
They're like, "Hey, they told me this thing works. 00:46:59.800 |
"And if I release a model this way, people will appreciate it. 00:47:03.160 |
"I'll get promoted and I'll kind of make more money." 00:47:09.440 |
It's like, this is how citations work most badly, you know? 00:47:12.680 |
So if you wanna get cited, you need to write a paper 00:47:15.600 |
that people in your field recognize as an advancement 00:47:22.120 |
And so we've seen this happen again and again. 00:47:24.360 |
So like I say, like zero-shot and few-shot learning, 00:47:32.240 |
everybody just was writing about GANs, you know? 00:47:37.840 |
You know, and I showed again through research 00:47:59.720 |
So it's, yeah, it's not set up for real innovation. 00:48:03.760 |
It's, you know, again, it's really helpful for me, 00:48:18.320 |
So I just write what I think actually matters. 00:48:34.000 |
in which people can focus on like genuine innovation. 00:48:44.320 |
I wanted to follow up on one thing that you mentioned, 00:48:47.280 |
which is that you checked around the open source discords. 00:48:54.760 |
like what discords are lively or useful right now. 00:49:02.120 |
like I missed out on was the early days of EleutherAI, 00:49:09.480 |
And you actually shouted out the alignment lab AI discord 00:49:30.120 |
nearly all of the conversation happens in private channels. 00:49:38.560 |
'Cause it's obviously very, very instructive, right? 00:49:42.880 |
- You could just come to the fast.ai Discord, 00:50:01.920 |
- It's just the nature of quality discussion, right? 00:50:10.680 |
but there was a lot of people who came in with like, 00:50:25.480 |
maybe you don't want to be dismissive or whatever. 00:50:27.520 |
And it's like, oh, well, that's an interesting comment, 00:50:29.280 |
but maybe you should like try training some models first 00:50:41.560 |
I know the people who always have good answers there. 00:50:43.960 |
And so I created a private channel and put them all in it. 00:50:46.720 |
And I got to admit, that's where I post more often 00:50:56.120 |
about how we could solve AGI, blah, blah, blah. 00:51:12.760 |
And then you'll see at the top who the admins or moderators 00:51:27.640 |
You know, I'm interested in talking about this. 00:51:34.960 |
I will say, you know, Eleuther's all pretty open. 00:51:43.400 |
You know, one problem with the Eleuther Discord 00:52:09.520 |
Nous Research that does like the Hermes models 00:52:25.520 |
If you know me, ask me 'cause I've got admin on that one. 00:52:28.960 |
There's also, yeah, OS Skunkworks, OS Skunkworks AI. 00:52:33.880 |
There's a good Discord, which I think it's open. 00:52:49.640 |
We just want people who like wanna build stuff. 00:52:56.080 |
and like, it's fine to not know anything as well, 00:53:05.520 |
If you don't know anything and wanna be told, 00:53:12.760 |
it's gonna take you a really long time to do, 00:53:20.800 |
maybe 5% of people who come in with great enthusiasm 00:53:23.440 |
saying that they wanna learn and they'll do anything. 00:53:29.760 |
So if you're somebody who actually does the work 00:53:32.280 |
and follows up, you will massively stand out. 00:53:38.400 |
And everybody will then want to help you do more work. 00:53:47.880 |
- Our Discord used to be referral only for a long time. 00:53:53.000 |
and then we opened it in the kind of like channel gating. 00:53:58.360 |
I remember it used to be like, you know, a forum moderator. 00:54:00.840 |
It's like, people just wanna do like drive-by posting, 00:54:03.040 |
you know, and like, they don't wanna help the community. 00:54:07.720 |
- I mean, the funny thing is our forum community 00:54:20.800 |
And yeah, we're all somehow in a forum thread 00:54:29.320 |
but then the forums are less active than they used to be 00:54:34.320 |
because Discord has got more popular, you know? 00:54:46.520 |
- All right, we got so many more things we wanna dive in, 00:54:50.760 |
This is not the Lex Fridman podcast we always like to say. 00:54:54.160 |
One topic I would love to maybe chat a bit about 00:55:04.200 |
You recently did a Hacker's Guide to Language Models, 00:55:07.240 |
and you ran through everything from quantized model 00:55:10.360 |
to like smaller models, larger models, and all of that. 00:55:14.160 |
But obviously, Modular is taking its own approach. 00:55:18.600 |
I know you and Chris have been talking about this 00:55:20.320 |
for like years and a lot of the ideas you had, so. 00:55:48.200 |
And so I saw him walk into the courtyard at Google. 00:55:53.200 |
It's just like, "Oh shit, man, it's Chris Latner. 00:55:56.640 |
I wonder if he would lower his standards enough 00:56:05.960 |
He looked a bit lost and I wandered over and was like, 00:56:13.640 |
It's like, "Oh, do you do some of this AI stuff?" 00:56:15.880 |
And I was like, "Yeah, yeah, I like this AI stuff." 00:56:19.760 |
He's like, "Well, I'm thinking about starting 00:56:40.560 |
then I kind of like, I guess I re-caught up with him 00:56:43.440 |
"I've been thinking about everything you said 00:56:46.240 |
And he like narrated back his response to every part of it, 00:56:51.520 |
And it was just like, "Oh, this dude follows up. 00:56:58.240 |
And he was like, "Yeah, so we're gonna create 00:57:02.920 |
And it's gonna be like, it's gonna be a compiler 00:57:05.120 |
with auto-differentiation built in and blah, blah, blah." 00:57:08.320 |
And I was like, "Oh, wait, why would that help?" 00:57:10.200 |
You know, he was like, "Okay, with a compiler 00:57:12.560 |
during the forward pass, you don't have to worry 00:57:16.520 |
'cause it'll all be optimized in the backward." 00:57:19.840 |
'Cause I didn't really know much about compilers, 00:57:28.680 |
basically solves a lot of the problems we have as end users. 00:57:33.880 |
Okay, you do know, right, that nobody's gonna use this 00:57:39.680 |
So I was thinking you should create like a fast AI for this." 00:57:42.360 |
I was like, "Okay, but I don't even know Swift." 00:57:46.440 |
And he was like, "Well, why don't you start learning it? 00:57:53.840 |
Like, not only has Chris Lattner lowered his standards enough 00:57:57.680 |
to talk to me, but he's offering me personal tutoring 00:58:02.800 |
So I was just like, "I'm not gonna let him down." 00:58:10.080 |
And it was just before Christmas that I kind of like 00:58:40.000 |
And I was also like, "I hope he doesn't dislike the fact 00:58:47.360 |
And yeah, he was like, "Oh, thanks for sending me that. 00:58:53.760 |
And we spoke and he was like, "This is amazing. 00:58:59.520 |
And he was like, "And so like somebody set up 00:59:01.280 |
like a new Swift, I can't remember what they call them, 00:59:06.280 |
the equivalent of a PEP, kind of RFC thing of like, 00:59:09.080 |
oh, you know, let's look at how we can implement 00:59:16.920 |
So, you know, and then we ended up like literally teaching 00:59:22.200 |
some lessons together about Swift for TensorFlow 00:59:33.320 |
Then in the end, you know, Google didn't follow through, 00:59:39.760 |
to learn a new programming language is gonna be tough. 00:59:42.880 |
But like, it was very obvious, very, very obvious 00:59:45.200 |
at that time that TensorFlow 2 is gonna be a failure, 00:59:47.640 |
you know, and so this felt like, okay, I, you know, 01:00:00.320 |
'cause it's not gonna, like it's not working. 01:00:03.400 |
You know, nobody at Google's using it internally. 01:00:06.760 |
So, you know, in the end, Chris left, you know, 01:00:16.120 |
So it kind of felt like Google was kind of screwed, 01:00:19.920 |
you know, and Chris went and did something else. 01:00:22.320 |
But we kept talking and I was like, "Look, Chris, you know, 01:00:27.600 |
'Cause like, you know, you've got the ideas, you know, 01:00:36.640 |
'cause like Python's the best of a whole bunch of shit, 01:00:41.640 |
you know, like I would, it's amazing, but it's awful, 01:01:00.480 |
It's gonna build, it's gonna have all the stuff 01:01:14.000 |
building on all the stuff that Chris has figured out over, 01:01:18.760 |
I mean, really from when he did his PhD thesis, 01:01:27.080 |
the TensorFlow runtime engine, which is very good. 01:01:31.160 |
You know, that was something that he built and has lasted. 01:01:43.160 |
and he's created a whole C++ compiler amongst other things, 01:01:49.760 |
I mean, I hope it works because, you know, I mean- 01:01:55.640 |
- But I mean, in the meantime, I will say, you know, 01:02:00.760 |
Google now does have a backup plan, you know, 01:02:09.080 |
and they just decided to build something else. 01:02:11.920 |
And for years, my friends in that team were like, 01:02:14.960 |
'cause we don't want it to be anything but a research project." 01:02:21.040 |
suddenly they're the great white hope for Google's future. 01:02:29.520 |
Like, it would be cool if we had all the benefits of JAX, 01:02:52.200 |
So that's more the kind of language framework level 01:02:58.320 |
some of these other like quantization-focused 01:03:09.560 |
- Well, you won't be surprised to hear me say this, 01:03:18.400 |
So today's zero-shot, few-shot learning equivalent 01:03:24.960 |
is retrieval-augmented generation, you know, RAG, 01:03:28.320 |
which is like, just like few-shot learning is a thing. 01:03:34.200 |
It's not a thing anybody would want to ignore. 01:03:36.320 |
Why are people not spending at least as much effort 01:03:40.400 |
'Cause, you know, RAG is like such an inefficient hack, 01:03:53.440 |
embed it, ask questions about that, you know, 01:03:56.520 |
hope that my embedding model embeds questions 01:04:01.560 |
in the same embedding space as the paragraphs, 01:04:04.120 |
which obviously is not going to, if your question is like, 01:04:06.240 |
if I've got a whole bunch of archive papers embeddings, 01:04:12.040 |
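For concreteness, this is roughly the retrieve-then-prompt loop being called a hack; `embed()` and `generate()` are stand-ins for whatever embedding model and LLM you use, not any particular library's API:

```python
import numpy as np

def embed(texts):       # stand-in: any sentence-embedding model
    raise NotImplementedError

def generate(prompt):   # stand-in: any LLM completion call
    raise NotImplementedError

def rag_answer(question, paragraphs, k=3):
    # Embed the corpus and the question into the same vector space...
    doc_vecs = np.asarray(embed(paragraphs))
    q_vec = np.asarray(embed([question])[0])
    # ...and hope nearest-neighbour search surfaces the right paragraphs.
    sims = doc_vecs @ q_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9)
    context = "\n\n".join(paragraphs[i] for i in np.argsort(-sims)[:k])
    return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```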
in which we can make inference more efficient? 01:04:28.800 |
- No, it's not going to be like, oh, here's one way, 01:04:30.920 |
here's one way, here's a different way in different papers, 01:04:37.520 |
then all of that information is getting directly 01:05:01.280 |
- Something that I think a lot of people are uncertain about, 01:05:06.160 |
is that whether or not you can fine-tune new information in, 01:05:20.360 |
because there's no such thing as fine-tuning, 01:05:29.960 |
So the knowledge got in there in the first place 01:05:49.800 |
You know, I think like in my "Hacker's Guide to LLMs," 01:05:59.320 |
that's a simple example, 'cause it doesn't sound it, 01:06:05.320 |
And it took 15 minutes to train on a single GPU. 01:06:09.560 |
You know, I think that might surprise people, 01:06:11.600 |
so that that capability is at your fingertips, 01:06:44.920 |
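As an illustration of how small that kind of job can be, a hedged sketch of a single-GPU LoRA fine-tune with Hugging Face transformers and peft; the model name, dataset, and hyperparameters are placeholders rather than the actual Kaggle setup from the talk:

```python
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from peft import LoraConfig, get_peft_model

model_name = "EleutherAI/pythia-1b"              # placeholder small model
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the base model with small LoRA adapters - only these weights are trained.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

ds = load_dataset("imdb", split="train[:2000]")  # placeholder task data
ds = ds.map(lambda x: tok(x["text"], truncation=True, max_length=512),
            remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments("out", per_device_train_batch_size=4,
                           num_train_epochs=1, fp16=torch.cuda.is_available()),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```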
There's not much at the 1 to 2B range, sadly. 01:06:48.600 |
but like the fact that there are some really good code ones 01:06:53.200 |
that that's a great size for doing complex tasks well. 01:07:00.160 |
which has been the subject of a little bit of discussion 01:07:13.720 |
so Phi-1 in particular is good at doing a very specific thing, 01:07:16.520 |
which is creating very small Python snippets. 01:07:26.920 |
So it doesn't know who Tom Cruise is, you know. 01:07:36.080 |
It doesn't really know anything about anything. 01:07:39.000 |
Like, 'cause it was never, it's never read anything. 01:07:50.280 |
And so it was a research project and a really good one. 01:07:53.120 |
And it definitely shows us a powerful direction 01:07:55.440 |
in terms of what can you do with synthetic data. 01:08:02.400 |
pretty good math skills, pretty good coding skills. 01:08:11.880 |
Some people have tried to do some fine tunes of it. 01:08:15.120 |
And again, they're like surprisingly good in some ways 01:08:20.640 |
but not sure you'd find it useful for anything. 01:08:24.520 |
- I think that's the struggle of pitching small models 01:08:31.640 |
you don't need a lot of resources to run them, 01:08:33.520 |
but the performance evaluation is always so iffy. 01:08:36.640 |
It's always just like, yeah, it works on some things 01:08:41.840 |
- Yeah, so that's why we're back to fine tuning. 01:08:44.840 |
I would say, so Microsoft did create a Phi-1.5-web, 01:08:51.040 |
I would say a Phi-1.5-web with fine-tuning for your task, 01:09:02.920 |
that people have in their kind of day-to-day lives. 01:09:05.960 |
You know, particularly in kind of an enterprise setting, 01:09:08.880 |
I think there's a lot of like repetitive kind of processing 01:09:16.720 |
'cause I think quite often you can like replace 01:09:18.880 |
some thousands and thousands of lines of complex buggy code, 01:09:31.000 |
I think one question on top of a lot of people's minds. 01:09:34.000 |
So you've done practical deep learning for coders 01:09:44.840 |
If you're somebody who's interested in deep learning today 01:09:53.840 |
Should they focus on, yeah, small model development? 01:09:56.280 |
Should they focus on fine tuning math and all of that? 01:09:59.520 |
Should they just like focus on making rag not a hack 01:10:06.120 |
Yeah, what's a practical deep learning for coders 2024 01:10:12.600 |
I'm trying to figure that out for myself, you know, 01:10:16.360 |
'Cause I definitely feel like things have changed a bit, 01:10:21.280 |
you know, one of the ways in which things have changed 01:10:31.080 |
they're folks who really hadn't coded before a year ago 01:10:34.760 |
and they're using these models to help them build stuff 01:10:44.600 |
well, we need a lot more material to help these people 01:10:49.280 |
'cause they don't really know what they're doing, 01:10:55.760 |
So like, are there things we could do to help people, 01:11:12.120 |
thanks to the help of Codex and Copilot and whatever. 01:11:26.760 |
to being like, let's make coding more accessible, 01:11:30.520 |
you know, kind of AI-oriented coding more accessible. 01:11:34.960 |
If so, our course should probably look very different, 01:11:39.200 |
you know, and we'd have to throw away that like, 01:11:42.480 |
of full-time programming, you know, as a prerequisite. 01:11:46.800 |
Yeah, what would happen if we got rid of that? 01:11:50.520 |
So that's kind of one thought that's in my head. 01:11:56.680 |
honestly, I don't think anybody has any idea, 01:12:05.800 |
like we don't really know how to do anything very well. 01:12:12.320 |
like they seem to be quite good at some things 01:12:19.280 |
Even there, it's clear there's a lot of stuff 01:12:27.080 |
So yeah, we don't really know how to train these models well, 01:12:38.480 |
we don't know what kind of problems they can't do, 01:12:40.080 |
we don't know what good prompting strategies are 01:12:44.200 |
Like somebody sent me a message the other day saying 01:12:47.920 |
they've written something that is a prompting strategy 01:12:55.160 |
They've written like 6,000 lines of Python code 01:13:20.400 |
people were saying like GPT-4 can't play chess. 01:13:34.360 |
it might be just about the best chess engine in the world. 01:13:41.600 |
So I feel like it's all blue sky at this point. 01:13:45.200 |
- It feels like computer vision in 2013 to me, 01:13:59.760 |
So we hadn't yet had the Zeiler and Fergus like, 01:14:01.400 |
oh, this is actually what's going on inside the layers. 01:14:08.440 |
We don't know how to create good training dynamics. 01:14:24.240 |
And so the kind of economically rational thing to do, 01:14:31.160 |
The economic rational thing to do is to like, okay, 01:14:33.160 |
like build that as fast as possible, you know, 01:14:39.560 |
And that's what, you know, open AI in particular did, 01:14:44.840 |
But there's a whole lot of technical debt everywhere. 01:14:50.840 |
You know, nobody's really figured this stuff out 01:14:55.160 |
building what we know works as quickly as possible. 01:14:59.880 |
So yeah, I think there's a huge amount of opportunity to, 01:15:04.800 |
can be made to work a lot faster, a lot less memory. 01:15:11.520 |
I got a whole bunch of ideas I want to try, you know, 01:15:42.240 |
and he was like, yeah, people just didn't think of it, 01:15:45.520 |
didn't try, they didn't come from like a systems background. 01:15:48.440 |
- Yeah, I mean, the thing about flash attention is, 01:15:51.240 |
I mean, lots of people absolutely had thought of that 01:15:56.720 |
But I mean, the honest truth is, particularly before Triton, 01:16:08.400 |
fused attention wasn't tiled, that was stupid. 01:16:16.800 |
be like, oh, well, I'm confident enough in CUDA 01:16:20.960 |
and or Triton to use that insight to write something better. 01:16:27.800 |
And I always talk to Chris about flash attention 01:16:31.640 |
there is a thousand flash attentions out there 01:16:37.760 |
You just gotta make it easy for us to build them. 01:16:46.840 |
You know, it still requires kind of really understanding 01:16:52.480 |
writing it in that kind of very CUDA-ish way. 01:16:57.360 |
I think, you know, if Mojo or something equivalent 01:17:02.520 |
we're gonna see a lot more flash attentions popping up. 01:17:18.800 |
What's something that it's already here today 01:17:21.240 |
in AI that you thought would take much longer? 01:17:39.960 |
And I said, oh, and I put a dot saying we are here. 01:17:45.160 |
And I looked back at the transcript the other day 01:17:53.680 |
in which computers will be better at most human tasks 01:18:09.080 |
took that twice as long as I thought it might. 01:18:11.960 |
Yeah, no, I wouldn't say anything surprised me too much. 01:18:18.800 |
It's still like, definitely like, I gotta admit, 01:18:35.280 |
would exist by about now, maybe a bit earlier. 01:18:38.560 |
But actually using it definitely is different 01:18:41.960 |
to just feeling like it's probably on its way, you know? 01:18:49.280 |
I'm sure, I imagine I'll have the same visceral reaction, 01:19:24.920 |
plot a kind of projected three-dimensional loss surface 01:19:28.840 |
for a ConvNet with and without skip connections. 01:19:38.920 |
and with the skip connections, it was super smooth. 01:19:45.440 |
Like, so there was actually an interesting blog post 01:19:47.480 |
that came out just today from the PyTorch team, 01:19:58.240 |
- Yeah, and they actually showed some nice examples 01:20:07.000 |
if you look at this, we can actually see a bit 01:20:10.080 |
You know, so again, it reminds me of this Zeiler and Fergus, 01:20:13.880 |
you know, ConvNet paper that was the first one 01:20:44.880 |
at what level, and when do they need it, and how often. 01:20:48.640 |
So that kind of like, data set mixing, curation, so forth, 01:20:57.080 |
Yeah, fine tune, what kind of mix do you need 01:21:04.920 |
And what are the kind of underlying capabilities 01:21:07.760 |
And if it loses those, it would lose all these other ones. 01:21:15.320 |
to help it to not forget to do things, stuff like that. 01:21:25.360 |
you want everyone to remember and think about? 01:21:27.880 |
- Yeah, I guess the main thing I want everybody to remember 01:21:30.320 |
is that, you know, there's a lot of people in the world, 01:21:57.280 |
What would happen if all of these people in the world 01:22:05.680 |
Or one might be, wow, of all those people in the world, 01:22:13.080 |
the lives of a lot of humanity if they had this tool. 01:22:29.360 |
between people who think that distributed power is unsafe, 01:22:36.560 |
and people who think that humanity on net, you know, 01:22:46.800 |
particularly when part of a society and a civilization, 01:22:59.680 |
And, you know, I want to see more and more people 01:23:16.640 |
regular people are going to do a lot of really valuable work 01:23:32.320 |
and providing a future for our children to flourish in, 01:23:49.000 |
the elites that we think can be trusted to run it for us. 01:23:54.080 |
about where that leaves us as a society, you know. 01:24:08.280 |
a lot of open source developers, open source communities,