Answer.ai & AI Magic with Jeremy Howard
Chapters
00:00 Introduction
01:07 Continuous Pre-Training is Here
04:48 Schedule-Free Optimizers and Learning Rate Schedules
06:08 Governance and Structural Issues within OpenAI and Other AI Labs
13:32 How Answer.ai Works
27:04 How to Recruit Productive Researchers
32:34 Building a New BERT
37:10 FSDP, QLoRA, and QDoRA: Innovations in Fine-Tuning Large Models
43:42 Research and Development on Model Inference Optimization
47:48 FastHTML for Web Application Development
61:16 AI Magic & Dialogue Engineering
64:11 AI Wishlist & Predictions
00:00:00.560 |
Hey, everyone. Welcome to the Latent Space Podcast. 00:00:03.100 |
This is Alessio, partner and CTO-in-Residence at Decibel Partners, 00:00:06.400 |
and I'm joined by my co-host, Swyx, founder of Smol.ai. 00:00:12.600 |
I think your third appearance on Latent Space. Welcome. 00:00:21.000 |
- Very fun, standing outside in the streets. - I never heard that, by the way. 00:00:25.200 |
You've got to send me a link. I've got to hear what it sounded like. 00:00:27.000 |
- Yeah, yeah. - I think the two episodes are six hours, 00:00:30.800 |
so there's plenty to listen to. We'll make sure to send it over. 00:00:35.000 |
Yeah, we're trying this thing where, at the major ML conferences, 00:00:38.000 |
we, you know, do a little audio tour of the conference. 00:00:44.200 |
But the last time you were on, you declared the end of fine-tuning. 00:00:50.000 |
I sort of editorialized the title a little bit, 00:00:52.800 |
and I know you were slightly uncomfortable with it, 00:00:59.400 |
And we were just discussing in our pre-show that things have... 00:01:02.200 |
It's really happening, that the continued pre-training is really happening. 00:01:07.600 |
Yeah, absolutely. I think people are starting to understand that 00:01:13.600 |
treating the three ULMFiT steps of, like, pre-training, you know, 00:01:18.800 |
and then the kind of, like, what people would now call "instruction tuning," 00:01:21.400 |
and then, I don't know if we've got a general term for this, 00:01:24.200 |
the DPO, RLHF step, you know, but, you know, the task training, 00:01:29.600 |
they're not actually as separate as we originally suggested they were in our paper. 00:01:38.800 |
and that you make sure that you have, you know, 00:01:42.600 |
more of kind of the original data set incorporated into the later stages, 00:01:49.000 |
and that, you know, we've also seen with, like, Llama 3, 00:01:53.600 |
this idea that those later stages can be done for a lot longer. 00:01:57.200 |
These are all of the things I was kind of trying to describe there. 00:02:00.600 |
It wasn't, like, yeah, it wasn't the end of fine-tuning, 00:02:05.200 |
but more that we should treat it as a continuum 00:02:11.200 |
of how much you can do with an already trained model. 00:02:16.600 |
You can really add a lot of behavior to it. You can change its behavior. 00:02:22.600 |
So a lot of our research has been around trying to figure out 00:02:31.200 |
because I get very offended at the idea of starting from random weights. 00:02:39.400 |
There was an outstanding paper about starting transformers from data-driven priors. 00:02:46.600 |
They called it, sort of, "never train from scratch," 00:02:48.600 |
and I think it was kind of rebelling against, like, 00:02:54.200 |
Yeah, I've, you know, that's been our kind of continuous message 00:02:57.000 |
since we started Fast.ai, is if you're training from random weights, 00:03:00.800 |
you better have a really good reason, you know, 00:03:03.000 |
because it seems so unlikely to me that nobody has ever trained on data 00:03:08.200 |
that has any similarity whatsoever to the general class of data you're working with, 00:03:13.000 |
and that's the only situation in which I think starting from random weights makes sense. 00:03:19.400 |
Yeah, the other trends since our last pod that I would point people to 00:03:24.600 |
is I'm seeing a rise in multi-phase pre-training. 00:03:29.200 |
So Snowflake released a large model called Snowflake Arctic, 00:03:34.200 |
where they detailed three phases of training, 00:03:37.000 |
where they had, like, a different mixture of, like, 00:03:39.600 |
there was, like, 75% web in the first instance, 00:03:43.000 |
and then they reduced the percentage of the web text by 10% each time 00:03:47.200 |
and increased the amount of code in each phase. 00:03:51.200 |
And I feel like multi-phase is being called out in papers more. 00:04:02.400 |
and I wonder if there's something that you're seeing on your end. 00:04:08.200 |
So the point at which they're doing proper continued pre-training 00:04:10.800 |
is the point at which that becomes a continuum rather than a phase. 00:04:14.000 |
So the only difference with what I was describing last time is to say, like, 00:04:17.400 |
oh, there should, you know, there's a function or whatever which is happening every batch. 00:04:24.400 |
And it doesn't, like, it's not a huge difference, 00:04:28.600 |
but it's like back, you know, I always used to get offended 00:04:31.200 |
when people had learning rates that, like, jumped. 00:04:34.600 |
And so one of the things I started doing early on in Fast.ai 00:04:37.000 |
was to say to people, like, no, you should actually have, 00:04:39.400 |
your learning rate schedule should be a function, not a list of numbers. 00:04:43.000 |
So now I'm trying to give the same idea about training mix. 00:04:48.400 |
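As a rough sketch of that idea (the exact shapes here are illustrative assumptions, not anything prescribed in the conversation): both the learning rate and the data mix can be smooth functions of training progress, queried every batch rather than switched between discrete phases.

```python
import math

total_steps = 10_000

def lr_at(t: float, lr_max: float = 3e-4, lr_min: float = 3e-5) -> float:
    """Cosine learning-rate schedule as a smooth function of progress t in [0, 1]."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

def data_mix_at(t: float) -> dict:
    """Training-mix schedule: web text fades out gradually as code ramps up."""
    web = 0.75 - 0.20 * t
    code = 0.10 + 0.20 * t
    return {"web": web, "code": code, "other": 1.0 - web - code}

for step in range(total_steps):
    t = step / total_steps
    lr, mix = lr_at(t), data_mix_at(t)
    # ...sample the next batch according to `mix` and set the optimizer LR to `lr`...
```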
There's been pretty public work from Meta on schedule-free optimizers. 00:04:52.200 |
I don't know if you've been following Aaron Defazio and what he's doing. 00:04:56.200 |
Just because you mentioned learning rate schedules, 00:04:58.600 |
you know, what if you didn't have a schedule? 00:05:03.200 |
Like, I don't think the schedule-free optimizer is that exciting. 00:05:08.800 |
We've had non-scheduled optimizers for ages, like, 00:05:14.800 |
Les Wright, who's now at Meta, who was part of the Fast.ai community there, 00:05:17.800 |
created something called the Ranger optimizer. 00:05:22.200 |
You know, I actually like having more hyperparameters, you know, 00:05:26.000 |
as soon as you say schedule-free, then, like, well, now I don't get to choose. 00:05:31.800 |
And there isn't really a mathematically correct way of, like, 00:05:36.600 |
I actually try to schedule more parameters rather than less. 00:05:39.000 |
So, like, I like scheduling my epsilon in my Adam, for example. 00:05:45.600 |
So, but then the other thing we always did with the Fast.ai library 00:05:49.800 |
was make it so you don't have to set any schedules. 00:05:52.600 |
So Fast.ai always supported, like, not, you didn't even have to pass a learning rate. 00:05:57.400 |
Like, it would always just try to have good defaults and do the right thing. 00:06:01.600 |
But to me, I like to have more parameters I can play with if I want to, 00:06:09.000 |
And then the, I guess, less technical side of your issue 00:06:13.800 |
with the market was some of the large research labs 00:06:18.400 |
taking all this innovation kind of behind closed doors 00:06:20.400 |
and whether or not that's good, which it isn't. 00:06:24.000 |
And now we could maybe make it more available to people. 00:06:26.600 |
And then after a month, a month after we released the episode, 00:06:30.200 |
there was the whole Sam Altman drama and, like, all the OpenAI governance issues. 00:06:37.400 |
okay, what happens if some of these kind of labs, you know, 00:06:41.200 |
start to break from within, so to speak, and the alignment of the humans 00:06:45.600 |
is probably going to fall before the alignment of the models. 00:06:48.600 |
So I'm curious, like, if you have any new thoughts, 00:06:51.000 |
and maybe we can also tie in some of the way that we've been building Answer 00:06:54.800 |
as, like, a public benefit corp and some of those aspects. 00:06:58.000 |
Sure. So, yeah, I mean, it was kind of uncomfortable 00:07:13.800 |
that OpenAI's current governance structure can't continue 00:07:18.200 |
and that it was definitely going to fall apart. 00:07:22.200 |
And a bunch of people were like, "What did you know, Jeremy?" 00:07:25.800 |
- What did Jeremy see? - I didn't see anything. 00:07:33.800 |
And so, yeah, so my friend Eric Ries and I spoke a lot about this before that. 00:07:39.800 |
You know, Eric is, I think, probably most people would agree, 00:07:44.800 |
the top expert in the world on, kind of, start-up and AI governance. 00:07:52.200 |
And, you know, we could both clearly see that this didn't make sense 00:08:00.200 |
where then there are people working at a commercial company 00:08:02.600 |
that's owned by or controlled nominally by the non-profit 00:08:06.000 |
where the people in the company are being given the equivalent of stock options, 00:08:13.400 |
you know, expecting to make money largely from their equity. 00:08:18.000 |
So the idea that then a board could exercise control 00:08:22.600 |
by saying, like, "Oh, we're worried about safety issues 00:08:26.000 |
and so we're going to do something that decreases the profit of the company," 00:08:31.400 |
when their remuneration is pretty much tied to that profit, you know. 00:08:37.600 |
So, I mean, that was a huge oversight there by someone. 00:08:42.800 |
And I guess it's, like, I guess part of the problem is that the kind of people who 00:08:47.600 |
work at non-profits, you know, and in this case the board, you know, 00:08:52.400 |
who are kind of academics and, you know, people who 00:08:57.200 |
are kind of true believers, I think it's hard for them to realize that 99.999% of the world is 00:09:02.800 |
driven very heavily by money, especially huge amounts of money. 00:09:07.200 |
So, yeah, Eric and I had been talking for a long time before that 00:09:12.200 |
about, like, well, what could be done differently? 00:09:15.800 |
Because also companies are sociopathic, like, by design. 00:09:20.400 |
And so the alignment problem, as it relates to companies, 00:09:24.800 |
has not been solved. Like, companies become huge, 00:09:28.400 |
they devour their founders, they devour their communities, 00:09:33.400 |
and they do things where even the CEOs, you know, often of big companies tell me, like, 00:09:41.800 |
But, you know, I know that if I didn't do it, 00:09:46.600 |
then I would just get fired, and the board would put in somebody else. 00:09:49.400 |
And the board knows if they don't do it, then their shareholders can sue them, 00:09:53.000 |
because they're not maximizing profitability or whatever. 00:09:56.000 |
So, what Eric's spent a lot of time doing is trying to think about, like, 00:10:03.400 |
how do we make companies less sociopathic, you know? 00:10:08.200 |
Or maybe a better way to think of it is, like, how do we make it so that the founders of companies 00:10:15.400 |
can ensure that their companies continue to actually do the things they want them to do? 00:10:30.200 |
you know, like, well, A, we very explicitly decided we're going to start a company, 00:10:34.400 |
not an academic lab, not a non-profit, you know. 00:10:39.200 |
We created a Delaware C Corp, you know, the most company kind of company. 00:10:46.400 |
But when we did so, we told everybody, you know, including our first investors, 00:10:59.400 |
We are going to run this company on the basis of maximizing long-term value. 00:11:07.600 |
So, you know, in fact, so when we did our second round, which is an angel round, 00:11:15.600 |
we had everybody invest through a long-term SPV, which we set up, 00:11:21.800 |
where everybody had to agree to vote in line with long-term value principles. 00:11:29.400 |
So, like, it's not just, it's never enough just to say to people, like, 00:11:36.000 |
okay, we're trying to create long-term value here for society as well as for ourselves, 00:11:40.400 |
and everybody's like, oh, yeah, yeah, I totally agree with that. 00:11:43.200 |
But when it comes to like, okay, well, here's a specific decision we have to make, 00:11:47.000 |
which will not maximize short-term value, people suddenly change their mind. 00:11:52.400 |
So, you know, it has to be written into the legal documents of everybody, 00:11:56.800 |
so that there's no question that that's the way the company has to be managed. 00:12:03.800 |
So, then you mentioned the PBC aspect, Public Benefit Corporation, 00:12:13.200 |
Like, it took, you know, like one paragraph added to our corporate documents to become a PBC. 00:12:19.800 |
It was cheap, it was easy, but it's got this huge benefit, 00:12:23.000 |
which is, if you're not a Public Benefit Corporation, 00:12:26.800 |
then somebody can come along and offer to buy you, 00:12:31.600 |
with a stated description of, like, turning your company into the thing you most hate, right? 00:12:37.200 |
And if they offer you more than the market value of your company and you don't accept it, 00:12:41.800 |
then you are not necessarily meeting, kind of, your fiduciary responsibilities. 00:12:49.200 |
So, the way, like, Eric always described it to me, you know, is like, 00:12:54.200 |
if Philip Morris came along and said that you've got great technology for marketing cigarettes to children, 00:12:59.200 |
so we're going to pivot your company to do that entirely, 00:13:01.600 |
and we're going to pay you 50% more than the market value, you pretty much have to consider it. 00:13:07.200 |
If you have a PBC, then you are more than welcome to say no, 00:13:12.600 |
if that offer is not in line with your stated public benefit. 00:13:17.000 |
So, our stated public benefit is to maximize, you know, the benefit to society through using AI. 00:13:24.400 |
So, given that more children smoking doesn't do that, 00:13:28.200 |
then we can say, like, no, we're not selling to you. 00:13:33.000 |
Yeah, and I was looking back at some of our emails. 00:13:37.800 |
You sent me an email on November 13th about talking, 00:13:41.200 |
and then on the 14th, I sent you an email working together to free AI, was the subject line. 00:13:47.200 |
And then that was, kind of, the start of the seed round. 00:13:52.400 |
So, this was, like, not even, you know, you were having these thoughts even before 00:13:57.400 |
we had, like, a public example of, like, why some of the current structures didn't work. 00:14:01.800 |
So, yeah, you were very ahead of the curve, so to speak. 00:14:07.200 |
I would love just to, you know, people can read your awesome introduction blog post on Answer.ai, 00:14:12.400 |
and the idea of having an R&D lab versus an R lab here and a D lab somewhere else. 00:14:19.000 |
I think, to me, the most interesting thing has been hiring, 00:14:22.200 |
and some of the awesome people that you've been bringing on that 00:14:24.800 |
maybe don't fit the central casting of Silicon Valley, so to speak. 00:14:29.000 |
Like, sometimes hiring is, like, trading baseball cards, you know. 00:14:31.400 |
People are like, oh, what teams was this person on? 00:14:34.000 |
Where did they work? Versus focusing on ability. 00:14:36.000 |
So, I would love for you to give a shout out to some of the awesome folks on the team. 00:14:41.200 |
So, you know, there's, like, a graphic going around describing, like, the people at xAI, 00:14:46.600 |
you know, the Elon Musk thing, and, like, they're all connected to, like, you know, 00:14:53.200 |
multiple of Stanford, Meta, DeepMind, OpenAI, Berkeley, Oxford. 00:15:03.200 |
It's just, look, these are all great institutions, and they have good people, 00:15:07.800 |
and I'm definitely not at all against that, but, damn, there's so many other people. 00:15:13.400 |
And one of the things I found really interesting is, kind of, anytime I, 00:15:20.800 |
almost anytime I see something which I think, like, this is really high quality work, 00:15:24.600 |
and it's, like, something I don't think would have been built 00:15:27.600 |
if that person hadn't built the thing right now, 00:15:30.000 |
I nearly always reach out to them and ask to chat. 00:15:34.000 |
And I tend to dig in to find out, like, okay, you know, why did you do that thing? 00:15:39.800 |
Your thing's much better, but it's not what other people are working on. 00:15:42.600 |
And, like, 80% of the time, I find out the person has a really unusual background. 00:15:50.600 |
So, like, often they'll have, like, either they, like, came from poverty, 00:15:55.000 |
and, like, didn't get an opportunity to go to good school, 00:15:57.600 |
or they, like, you know, had dyslexia and, you know, got kicked out of school in year 11, 00:16:02.800 |
or, you know, or they had a health issue that meant they couldn't go to university, 00:16:08.800 |
or something happened in their past, and they ended up out of the mainstream, 00:16:20.600 |
And those are the people that, throughout my career, 00:16:24.200 |
I've tended to, kind of, accidentally hire more of. 00:16:29.400 |
It's, like, when I see, say, two people who have done extremely well. 00:16:35.200 |
One of them did extremely well in exactly the normal way, 00:16:38.200 |
from the background, entirely pointing in that direction, 00:16:41.400 |
and they achieved all the hurdles to get there. 00:16:43.800 |
And, like, okay, that's quite impressive, you know. 00:16:48.200 |
But another person who did just as well, despite lots of constraints, 00:16:53.000 |
and doing things in really unusual ways, and came up with different approaches, 00:16:56.800 |
like, that's normally the person I'm likely to find useful to work with, 00:17:01.400 |
because they're often, like, risk-takers, they're often creative, 00:17:04.400 |
they're often extremely tenacious, they're often very open-minded. 00:17:09.600 |
So, that's the kind of folks, you know, I tend to find myself hiring. 00:17:23.400 |
it's a group of people that are strong enough that nearly every one of them 00:17:27.600 |
has independently come to me in the past few weeks 00:17:31.200 |
and told me that they have imposter syndrome, 00:17:33.400 |
and they're not convinced that they're good enough to be here, you know. 00:17:37.600 |
And I've kind of heard it enough at this point that I was like, okay, 00:17:44.600 |
it can't be that all of you are so far behind your peers that you shouldn't get to be here. 00:17:47.400 |
But I think part of the problem is, like, as an R&D lab, 00:17:53.400 |
the great developers look at the great researchers and they're like, 00:17:56.800 |
wow, these big-brained, crazy research people with all their math and shit, 00:18:04.600 |
And then the researchers look at the developers and they're like, 00:18:06.400 |
oh, they're killing it, making all this stuff with all these people using it, 00:18:10.000 |
and talking on Twitter about how great it is. 00:18:12.200 |
And I think they're both a bit intimidated by each other, you know. 00:18:15.600 |
And so I have to kind of remind them, like, okay, 00:18:19.200 |
there are lots of things in this world where you suck 00:18:21.800 |
compared to lots of other people in this company, 00:18:24.000 |
but also vice versa, you know, for all things. 00:18:27.200 |
And the reason you came here is because you wanted to 00:18:31.400 |
learn about those other things from those other people 00:18:33.800 |
and have an opportunity to, like, bring them all together into a single unit. 00:18:40.000 |
So, you know, it's not reasonable to expect you're going to be better at everything 00:18:48.000 |
Even though, like, I guess the other part of it is for nearly all of the people in the company, 00:18:52.200 |
to be honest, they have nearly always been better than everybody else 00:18:55.600 |
at nearly everything they're doing, nearly everywhere they've been. 00:18:58.200 |
So it's kind of weird to be in this situation now where it's like, 00:19:01.000 |
gee, I can clearly see that I suck at this thing 00:19:05.600 |
that I'm meant to be able to do compared to these other people, 00:19:08.400 |
where, for some things, I'm, like, the worst in the company at this thing. 00:19:11.400 |
So I think that's a healthy place to be, you know, 00:19:15.600 |
as long as you keep reminding each other about that's actually why we're here. 00:19:24.000 |
And it's been really nice to see, like, it's all a bit of an experiment, like, 00:19:32.000 |
We don't have any hierarchy from that point of view. 00:19:36.000 |
which means I don't get to tell people what to do or how to do it or when to do it. 00:19:43.000 |
And it's been a bit of an experiment to see how that would work out. 00:19:46.000 |
And it's been great, like, so, for instance, Ben Clavié, 00:19:53.800 |
who you might have come across, he's the author of RAGatouille. 00:19:56.200 |
He's the author of rerankers, super strong information retrieval guy. 00:20:01.200 |
And a few weeks ago, he, you know, this additional channel appeared on Discord, 00:20:12.800 |
Like, these people started appearing in our collab section. 00:20:16.400 |
We have a collab section for, like, collaborating with outsiders. 00:20:21.000 |
There are all these names that I recognize in this channel called BERT24. 00:20:24.200 |
And they're all talking about, like, the next generation of BERT. 00:20:28.600 |
It's like, okay, Ben decided, I think quite rightly, that we need a new BERT. 00:20:35.400 |
Because everybody, like, so many people are still using BERT. 00:20:40.000 |
But it actually doesn't take advantage of lots of best practices. 00:20:43.200 |
And so, he just went out and found basically everybody who's created better BERTs 00:20:47.800 |
in the last four or five years, brought them all together. 00:20:53.200 |
Suddenly, there's this huge collaboration going on. 00:21:05.400 |
And he's like, oh, I created a whole Transformers from scratch implementation 00:21:14.400 |
He originally did it largely as a teaching exercise to show other people. 00:21:18.400 |
But he was like, I could, you know, use that to create a really hackable BERT implementation. 00:21:36.400 |
I can now implement all these other BERT things, you know. 00:21:43.600 |
There, you know, there's lots of folks, you know, who have, like, contributed new data 00:21:50.200 |
So, I mean, I can help in the same way that other people can help. 00:21:55.800 |
So, like, then Ben Clavié reached out to me at one point and said, like, okay, can 00:22:00.600 |
you help me, like, what have you learned over time about how to manage, you know, intimidatingly 00:22:08.400 |
capable and large groups of people who you're nominally meant to be leading? 00:22:15.800 |
And so, you know, like, I try to help, but I don't direct. 00:22:21.400 |
Another great example was Kerem, who, after our FSDP QLoRA work, decided quite correctly 00:22:33.880 |
that it didn't really make sense to use LoRA in today's world. 00:22:36.840 |
You want to use the normalized version, which is called DoRA. 00:22:41.000 |
And like, two or three weeks after we did FSDP QLoRA, he just popped up and said, okay, 00:22:47.800 |
I've just converted the whole thing to DoRA, and I've also created these vLLM extensions, 00:22:52.680 |
and I've got all these benchmarks, and, you know, now I've got training of quantized models 00:23:03.860 |
with adapters that are as fast as LoRA and, actually, weirdly, better than fine-tuning. 00:23:09.200 |
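A rough sketch of the LoRA/DoRA idea being described, in plain PyTorch rather than anyone's actual implementation: LoRA adds a low-rank update B@A to a frozen weight, and DoRA additionally re-normalises the combined weight and learns a separate per-column magnitude (the rank, init, and layer choice here are illustrative assumptions).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoRALinear(nn.Module):
    """Frozen base weight + low-rank LoRA update, with DoRA's learned magnitude."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.weight = base.weight.detach()                    # frozen (in QLoRA this would be a quantized weight)
        out_f, in_f = self.weight.shape
        self.A = nn.Parameter(torch.randn(r, in_f) * 0.01)    # LoRA down-projection
        self.B = nn.Parameter(torch.zeros(out_f, r))          # LoRA up-projection, starts at zero
        self.scale = alpha / r
        # DoRA: learnable per-column magnitude, initialised from the base weight's column norms
        self.m = nn.Parameter(self.weight.norm(dim=0, keepdim=True))

    def forward(self, x):
        merged = self.weight + self.scale * (self.B @ self.A)   # the LoRA-style update
        direction = merged / merged.norm(dim=0, keepdim=True)   # DoRA keeps only the direction...
        return F.linear(x, self.m * direction)                  # ...and rescales by the learned magnitude

layer = DoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(2, 512))
```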
I was just like, okay, that's great, you know? 00:23:15.040 |
And yeah, so, the things we've done to try to help make these things happen as well is 00:23:20.920 |
like, we have, so we don't have any required meetings, you know, but we do have a meeting 00:23:26.720 |
for each pair of major time zones that everybody's invited to, and, you know, people see their 00:23:38.280 |
colleagues doing stuff that looks really cool, and say like, oh, how can I help, you know, 00:23:47.000 |
So another example is Austin, who, you know, amazing background, he ran AI at Fidelity, 00:23:55.320 |
he ran AI at Pfizer, he ran browsing and retrieval for Google's DeepMind stuff, created Gemma.cpp, 00:24:03.920 |
and he's been working on a new system to make it easier to do WebGPU programming, because 00:24:10.560 |
again, he quite correctly identified, like, you know, this is a way that not everybody 00:24:16.280 |
has to use CUDA, not everybody has to use NVIDIA, you can do stuff on your own computer, 00:24:22.120 |
optionally through the browser, we need to make this easier to do. 00:24:25.480 |
And so I, yeah, so I said to him, like, okay, I want to learn about that, not an area that 00:24:32.440 |
I have much expertise in, so, you know, he's going to show me what he's working on and 00:24:37.540 |
teach me a bit about it, and hopefully I can help contribute. 00:24:40.160 |
I think one of the key things that's happened in all of these is everybody understands what 00:24:47.720 |
Eric Gilliam, who wrote the second blog post in our series, the R&D historian, describes 00:24:54.440 |
as everybody has total flexibility to do what they want, but we all understand, like, kind 00:25:04.520 |
of roughly why we're here, you know, we all have the same, you know, we agree with the 00:25:08.120 |
premises around, like, you know, everything's too expensive, everything's too complicated, 00:25:15.000 |
you know, people are building too many vanity foundation models rather than taking better 00:25:20.740 |
advantage of fine-tuning, like, there's this kind of general, like, sense of, like, we're 00:25:26.040 |
all on the same wavelength about, you know, all the ways in which current research is 00:25:33.240 |
fucked up and, you know, all the ways in which, you know, we're kind of, you know, worried 00:25:39.220 |
about centralization and we, you know, we all care a lot about not just research for 00:25:47.840 |
the point of citations, but research that actually wouldn't have happened otherwise 00:25:51.280 |
and actually is going to lead to real-world outcomes and so, yeah, with this kind of like 00:25:55.160 |
shared vision, people understand, like, you know, so when I say, like, oh, well, you know, 00:26:04.400 |
tell me, Ben, about BERT 24, what's that about, and he's like, you know, like, oh, well, you 00:26:08.400 |
know, you can see from an accessibility point of view or you can see from a kind of a actual 00:26:13.240 |
practical impact point of view, there's far too much focus on decoder-only models and, 00:26:19.360 |
you know, like, BERT's used in all of these different places and industry and so I can 00:26:23.120 |
see, like, in terms of our basic principles, what we're trying to achieve, this seems like 00:26:26.440 |
something important, and so I think it's, like, really helpful that we have that kind of shared vision. 00:26:35.920 |
Yeah, and before we maybe talk about some of the specific research, when you're, like, 00:26:41.000 |
reaching out to people, interviewing them, what are some of the traits, like, how do 00:26:48.000 |
Is it working on side projects that, you know, you're already familiar with? 00:26:51.360 |
Is there anything, like, in the interview process that, like, helps you screen for people 00:26:54.520 |
that are, like, less pragmatic and more research-driven, versus some of these folks that, like, 00:27:00.380 |
are just going to do it, you know, they're not waiting for, like, the perfect process? 00:27:05.360 |
Anybody who comes through the recruiting is interviewed by everybody in the company. 00:27:15.560 |
You know, our goal is 12 people, so it's not an unreasonable amount, and, like, the other 00:27:23.160 |
thing to say is everybody so far who's come into the recruiting pipeline, 00:27:29.380 |
everybody bar one, has been hired, so, which is to say our original curation has been good. 00:27:39.920 |
And that's actually pretty easy because nearly everybody who's come in through the recruiting 00:27:42.420 |
pipeline are people I know pretty well, so, you know, Jono Whitaker and I, you know, 00:27:51.440 |
he worked on the stable diffusion course we did, he's outrageously creative and talented 00:28:01.140 |
and he's a super, like, enthusiastic tinkerer, just likes making things, and, you know, Benjamin 00:28:11.840 |
was one of the strongest parts of the fast.ai community, which is now the alumni community, it's like 00:28:16.900 |
hundreds of thousands of people and, you know, again, like, they're not people who a normal recruiting process would have picked up. 00:28:25.860 |
So Benjamin doesn't have any qualifications in math or computer science, Jono was living 00:28:36.340 |
in Zimbabwe, he was not, you know, he was working on, like, helping some African startups, 00:28:41.900 |
you know, but not FAANG kind of credentials, but yeah, I mean, when you actually see people 00:28:49.060 |
doing real work and they stand out above, you know, we've got lots of Stanford graduates 00:28:56.620 |
and OpenAI people and whatever in our alumni community as well, you know, when you stand 00:29:00.660 |
out above all of those people, anyway, obviously you've got something going for you, you know, 00:29:07.540 |
Or Austin, say: him and I worked together on the masks study we did in the Proceedings of the National Academy of Sciences. 00:29:16.460 |
So, you know, we had worked together and, again, that was a group of, like, basically 00:29:20.300 |
the 18 or 19 top experts in the world on public health and epidemiology and research design 00:29:29.780 |
and so forth, and Austin was, you know, one of the strongest people in that collaboration. 00:29:38.740 |
So yeah, you know, like, I've been lucky enough to have had opportunities to work with some 00:29:46.040 |
people who are great and, you know, I'm a very open-minded person, so I kind of am always 00:29:49.960 |
happy to try working with pretty much anybody and some people stand out. 00:29:54.100 |
You know, there have been some exceptions, people I haven't previously known, like Ben 00:29:57.340 |
Clavier actually I didn't know before, but, you know, with him, like, I just read his 00:30:06.740 |
code and I'm like, oh, that's really well-written code, and like it's not written exactly 00:30:15.780 |
the same way as everybody else's code, and it's not written to do exactly the same thing 00:30:20.900 |
So yeah, and then when I chatted to him, it's just like, I don't know, I felt like we'd 00:30:27.300 |
known each other for years, like we just were on the same wavelength, and, but I could pretty 00:30:31.540 |
much tell that was going to happen just by reading his code. 00:30:34.700 |
I think you express a lot in the code you choose to write and how you choose to write 00:30:39.740 |
it, I guess, you know, or another example, this guy named Vik, who was previously the 00:30:49.620 |
CEO of Dataquest, and like, in that case, like, he's, you know, he's created a really 00:30:57.780 |
successful startup, he's like, he won the first, basically, Kaggle NLP competition, 00:31:08.460 |
He's got the current state-of-the-art OCR system, Surya, again, he's just a guy who 00:31:17.540 |
obviously just builds stuff, you know, he doesn't ask for permission, he doesn't need 00:31:24.700 |
Actually, Kerem's another great example of this, I mean, I already knew Kerem very well 00:31:28.660 |
because he was my best ever master's student, but it wasn't a surprise to me, then, when 00:31:34.180 |
he then went off to create the world's state-of-the-art language model in Turkish on his own, in his 00:31:40.380 |
spare time, with no budget, you know, from scratch, this is not fine-tuning or whatever, 00:31:46.660 |
he like, went back to Common Crawl and did everything, so, yeah, it's kind of, I don't 00:31:53.460 |
know what I'd describe that process as, but it's not at all based on credentials. 00:32:03.300 |
We wanted to dive in a little bit more on, you know, turning from the people side of 00:32:07.840 |
things into the technical bets that you're making. 00:32:11.660 |
Also a little bit more on BERT, I was actually, we just did an interview with Yi Tay from Reka, 00:32:16.780 |
I don't know if you're familiar with his work, but also another encoder-decoder bet, and 00:32:24.740 |
one of his arguments was actually people kind of over-index on the decoder-only GPT-3 type 00:32:28.860 |
paradigm, I wonder if you have thoughts there that is maybe non-consensus as well. 00:32:34.100 |
Yeah, no, absolutely, so I think it's a great example, so one of the people we're collaborating 00:32:38.100 |
with a little bit with BERT24 is Colin Raffel, who is the guy behind, yeah, most of that T5 work. 00:32:46.600 |
You know, between that and UL2, there's a lot of really interesting work, and so one 00:32:54.980 |
of the things I've been encouraging the Bert group to do, and Colin has as well, is to 00:33:01.740 |
consider using a T5 pre-trained encoder backbone as a thing you fine-tune, which I think would be really interesting. 00:33:13.220 |
But he was saying, you know, Colin was also saying actually just use encoder-decoder as 00:33:19.780 |
your BERT, you know, why don't you use that as a baseline, which I also think is a good idea. 00:33:25.740 |
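To make that suggestion concrete, here is a hedged sketch of fine-tuning a pre-trained T5 encoder (no decoder) as a BERT-style classification backbone; the checkpoint name, mean pooling, and classification head are illustrative assumptions rather than anything from the BERT24 project.

```python
import torch.nn as nn
from transformers import AutoTokenizer, T5EncoderModel

class T5Classifier(nn.Module):
    def __init__(self, name: str = "t5-base", n_classes: int = 2):
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(name)   # T5's encoder only, no decoder
        self.head = nn.Linear(self.encoder.config.d_model, n_classes)

    def forward(self, **batch):
        hidden = self.encoder(**batch).last_hidden_state      # (batch, seq, d_model)
        pooled = hidden.mean(dim=1)                           # simple mean pooling (an assumption)
        return self.head(pooled)

tok = AutoTokenizer.from_pretrained("t5-base")
model = T5Classifier()
batch = tok(["BERT-style tasks don't need a decoder."], return_tensors="pt")
logits = model(**batch)   # fine-tune with a standard classification loss
```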
Yeah, look, you know, what technical arguments are people underweighting? 00:33:29.740 |
I mean, Colin would be able to describe this much better than I can, but I'll give my slightly 00:33:34.880 |
Look, I mean, think about like diffusion models, right, like in stable diffusion, like we use 00:33:39.720 |
things like UNet, we, you know, you have this kind of downward path and then in the upward 00:33:45.760 |
path you have the cross connections, which, you know, it's not attention, but it's like a similar idea. 00:33:52.680 |
You're inputting the original encoding path into your decoding path. 00:34:00.000 |
It's critical to make it work, right, because otherwise in the decoding part, the model 00:34:05.720 |
has to like do so much kind of from scratch, right? 00:34:09.920 |
So like if you're doing translation, like that's a classic kind of encoder-decoder example. 00:34:16.880 |
If it's decoder only, you never get the opportunity to find the right, you know, feature engineering, 00:34:26.480 |
that feature encoding for the original sentence. 00:34:32.440 |
And it kind of means then on every token that you generate, you have to recreate the whole thing. 00:34:39.120 |
So if you have an encoder, it's basically saying like, okay, this is your opportunity, 00:34:44.540 |
model, to create a really useful feature representation for your input information. 00:34:55.320 |
So I think there's really strong arguments for encoder-decoder models anywhere that there 00:34:59.920 |
is this kind of like context or source thing, you know. 00:35:08.920 |
And then why encoder only, well because like so much of the time what we actually care 00:35:18.300 |
It's like we're not generating an arbitrary length sequence of tokens. 00:35:22.840 |
So anytime you're not generating an arbitrary length sequence of tokens, decoder-only models probably aren't the right tool. 00:35:32.860 |
Now the interesting thing is, you see on like Kaggle competitions, that decoder models 00:35:36.260 |
still are at least competitive with things like DeBERTa v3. 00:35:44.980 |
But they have to be way bigger to be competitive with things like DeBERTa v3, and the only 00:35:51.340 |
reason they are competitive is because people have put a lot more time and money and effort 00:35:54.700 |
into training the decoder-only ones, you know. 00:35:57.900 |
There isn't a recent DeBERTa, there isn't a recent BERT. 00:36:02.520 |
So yeah, it's a whole part of the world that people have slept on a little bit, and this 00:36:10.060 |
This is how trends happen, rather than like, to me everybody should be like, oh let's look 00:36:15.480 |
at the thing that has shown signs of being useful in the past but nobody really followed 00:36:22.620 |
That's the more interesting path, you know, but people tend to be like, oh I need to get 00:36:28.540 |
Can I make it 0.1% better, you know, or 0.1% faster? 00:36:34.780 |
Yeah, so I think it's like, Yi Tay's work commercially now is interesting because here's like a whole, 00:36:41.780 |
here's a whole model that's been trained in a different way, so there's probably a whole 00:36:44.280 |
lot of tasks it's probably better at than, you know, GPT and Gemini and Claude. 00:36:54.940 |
So that should be a good commercial opportunity for them if they can figure out what those 00:36:59.120 |
Well, if rumors are to be believed, and he didn't comment on this, but, you know, Snowflake 00:37:03.620 |
may figure out the commercialization for them, so we'll see. 00:37:10.640 |
Let's talk about FSDP, QLoRA, QDoRA and all of that awesome stuff. 00:37:15.900 |
One of the things we talked about last time, some of these models are meant to run on systems 00:37:20.600 |
that nobody can really own, no single person. 00:37:24.700 |
And then you were like, well, what if you could fine tune a 70B model on like a 4090? 00:37:30.740 |
And I was like, no, that sounds great, Jeremy, but like, can we actually do it? 00:37:38.320 |
Can you maybe tell us some of the war stories behind that, like the idea behind FSDP, which 00:37:43.320 |
is kind of taking, you know, sharded data parallel computation, then QLoRA, which is 00:37:50.860 |
do not touch all the weights, just go quantize the model, and then within the quantized 00:37:57.320 |
model only train adapters on certain layers, instead of doing everything. 00:38:06.880 |
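For readers who want that recipe in code form, here is a hedged sketch using Hugging Face transformers, bitsandbytes, and peft; the model name and target modules are illustrative assumptions, and it deliberately leaves out the FSDP sharding that the Answer.AI work had to add on top.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the frozen base model to 4-bit NF4 so it fits in consumer GPU memory.
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",          # illustrative; any causal LM checkpoint
    quantization_config=bnb_cfg,
    device_map="auto",
)

# Attach small trainable LoRA adapters to selected projection layers only.
lora_cfg = LoraConfig(r=16, lora_alpha=32,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()        # only a tiny fraction of the weights train
```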
I think before you published it, nobody thought this was like a short term thing that we're 00:38:12.960 |
And now it's like, oh, obviously you can do it, but it's not that easy. 00:38:17.040 |
I mean, to be honest, it was extremely unpleasant work to do. 00:38:28.620 |
So I kind of did version 0.1 of it myself before we had launched the company, or at 00:38:36.420 |
least the kind of like the pieces, which is, they're all pieces that are difficult to work with. 00:38:42.580 |
So for the quantization, you know, I chatted to Tim Dettmers quite a bit, and, you know, 00:38:47.960 |
he very much encouraged me by saying like, yeah, it's possible. 00:38:51.360 |
He actually thought it'd be easy, it probably would be easy for him, but I'm not Tim Dettmers. 00:38:55.960 |
You know, so he wrote bitsandbytes, which is his quantization library, and, you know, 00:39:03.720 |
He didn't write that to be production like code. 00:39:10.400 |
So, you know, like, it's not particularly well structured. 00:39:15.660 |
There's lots of code paths that never get used. 00:39:18.180 |
There's lots of, you know, multiple versions of the same thing. 00:39:22.340 |
So trying to get my head around that was hard, and, you know, because it's like, the interesting 00:39:26.820 |
bits are all written in CUDA, it's hard to like to step through it and see what's happening. 00:39:31.820 |
And then, you know, FSDP is this very complicated library in PyTorch, which is not particularly well documented. 00:39:39.940 |
So the only really way to understand it properly is, again, just read the code and step through 00:39:45.640 |
And then, like, bitsandbytes doesn't really work in practice unless it's used with PEFT, 00:39:54.900 |
the Hugging Face library, and PEFT doesn't really work in practice unless you use it with Transformers. 00:39:58.900 |
And there's a lot of coupling in the Hugging Face ecosystem where, like, none of it works 00:40:09.940 |
So yeah, trying to just get a minimal example that I can play with was really hard. 00:40:15.060 |
And so I ended up having to rewrite a lot of it myself, to kind of create this minimal 00:40:23.140 |
One thing that helped a lot was Meta had this llama-recipes repo that came out just 00:40:27.700 |
a little bit before I started working on that. 00:40:29.460 |
And like, they had a kind of role-model example of, like, here's how to train FSDP LoRA. 00:40:43.780 |
Actually, a lot of that had been put together, like, a lot of the stuff I discovered, the 00:40:47.460 |
interesting stuff, had been put together by Les Wright, who was actually the guy 00:40:51.260 |
in the Fast.ai community I mentioned who created the Ranger optimizer. 00:40:55.020 |
So he's doing a lot of great stuff at Meta now. 00:41:00.620 |
So yeah, I kind of, that helped get some minimum stuff going, and then it was great once Benjamin got involved. 00:41:11.580 |
And so we basically hacked at that together, and then Kerem joined, like, a month later. 00:41:16.620 |
But gee, it was just a lot of, like, fiddly detailed engineering on, like, barely documented internals. 00:41:29.660 |
So my focus was to see if it kind of could work, and I kind of got a bit of a proof of 00:41:32.540 |
concept working, and then the rest of the guys actually did all the work to make it 00:41:41.020 |
And you know, every time we thought we had something, you know, we needed to have good benchmarks. 00:41:47.020 |
So we'd, like, it's very easy to convince yourself you've done the work when you haven't, 00:41:51.820 |
you know, so then we'd actually try lots of things and be like, oh, in these, like, really 00:41:55.260 |
important cases, the memory use is higher, you know, or it's actually slower. 00:42:00.220 |
And we'd go in and we'd just find, like, all these things that were nothing to do with our work. 00:42:07.540 |
And nobody had noticed they hadn't worked properly because nobody had really benchmarked them. 00:42:11.380 |
So we ended up, you know, trying to fix a whole lot of different things. 00:42:17.020 |
And even as we did so, new regressions were appearing in, like, Transformers and stuff 00:42:21.820 |
that Benjamin then had to go away and figure out, like, oh, how come FlashAttention doesn't 00:42:26.460 |
work in this version of Transformers anymore with this set of models, and, like, oh, it 00:42:31.820 |
turns out they accidentally changed this thing so it doesn't work. 00:42:35.420 |
You know, there's just, there's not a lot of really good performance-type evals going on. 00:42:43.500 |
So there's an extraordinary amount of, like, things where people say, like, oh, we built 00:42:46.860 |
this thing and it has this result, and when you actually check it, it doesn't. 00:42:51.180 |
So yeah, there's a shitload of war stories from getting that thing to work. 00:42:56.780 |
And it did require a particularly, like, tenacious group of people and a group of people who 00:43:01.660 |
don't mind doing a whole lot of, kind of, like, really janitorial work, to be honest, 00:43:10.620 |
Yeah, we had Tri Dao on the podcast, and we talked about how a lot of it is, like, 00:43:16.100 |
systems work to make some of these things work. 00:43:18.140 |
It's not just, like, beautiful pure math that you do on a blackboard. 00:43:21.660 |
It's, like, how do you get into the nitty-gritty of it. 00:43:24.620 |
I mean, FlashAttention is a great example of that. 00:43:27.100 |
Like, it's, it basically is just, like, oh, let's just take the attention and just do 00:43:31.300 |
the tiled version of it, which sounds simple enough, you know. 00:43:36.340 |
But then implementing that is challenging at lots of levels. 00:43:43.580 |
You know, obviously, you've done all this amazing work on fine-tuning. 00:43:46.460 |
Do you have any research you've been doing on the inference side, how to make local inference faster? 00:43:53.220 |
We're doing quite a bit on that at the moment. 00:43:55.080 |
We haven't released too much there yet, but one of the things I've been trying to do is 00:44:04.340 |
And one of the nice things that's happened is that a couple of folks at Meta, including 00:44:11.940 |
Mark Saroufim, have done a nice job of creating this CUDA mode community of people working 00:44:17.420 |
on, like, CUDA kernels or learning about that, and I tried to help get that going well as 00:44:21.660 |
well and did some lessons to help people get into it. 00:44:27.980 |
So there's a lot going on in both inference and fine-tuning performance and a lot of it's 00:44:36.900 |
Also the PyTorch team have created this torchao project on quantization. 00:44:44.580 |
And so there's a big overlap now between kind of the FastAI and AnswerAI and CUDA mode communities 00:44:50.900 |
of people working on stuff about inference and fine-tuning, but we're getting close now. 00:45:00.060 |
You know, our goal is that nobody should be merging models, nobody should be downloading 00:45:07.020 |
merged models, everybody should be using basically quantized plus adapters for almost everything, 00:45:15.980 |
and just downloading the adapters, and that should be much faster. 00:45:21.380 |
So that's kind of the place we're trying to get to. 00:45:25.180 |
It's difficult, you know, because, like, Kerem's been doing a lot of work with vLLM, for example. 00:45:31.020 |
These inference engines are pretty complex bits of code. 00:45:35.940 |
They have a whole lot of custom kernel stuff going on as well, as do the quantization libraries. 00:45:41.780 |
So we've been working on that with also quite a bit of collaborating with the folks who 00:45:44.580 |
do HQQ, which is a really great quantization library and works super well. 00:45:54.140 |
So yeah, there's a lot of other people outside AnswerAI that we're working with a lot who 00:45:58.100 |
are really helping on all this performance optimization stuff, open source. 00:46:03.100 |
Just to follow up on merging models, I picked up there that you said nobody should be merging 00:46:09.220 |
I think that's interesting because, you know, obviously a lot of people are experimenting 00:46:14.980 |
I would say, in defense of merging models, you can do it without data. 00:46:20.540 |
That's probably the only thing that's going for it. 00:46:27.020 |
To explain, it's not that you shouldn't merge models, it's that you shouldn't be distributing merged models. 00:46:34.340 |
You should distribute a merged adapter, 99% of the time, and actually often one of the 00:46:41.940 |
best things happening in the model merging world is actually merging adapters rather than full models. 00:46:47.140 |
The point is, Sean, that once you've got your new model, if you distribute it as an adapter 00:46:54.180 |
that sits on top of a quantized model that somebody's already downloaded, then it's a 00:46:59.380 |
much smaller download for them, and also the inference should be much faster, because you're 00:47:05.300 |
not having to transfer FP16 weights from HBM memory at all, or ever load them 00:47:12.740 |
off disk, you know, all the main weights are quantized, and the only floating point weights 00:47:18.180 |
are in the adapters, so that should make both inference and fine-tuning faster. 00:47:27.580 |
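A hedged sketch of that distribution pattern: the quantized base model is something users already have cached, and the only new download is the small adapter (repo names here are placeholders, not real artifacts).

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

bnb_cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                             bnb_4bit_compute_dtype=torch.bfloat16)

# The big quantized base: downloaded once, shared by every adapter you ever use.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                            quantization_config=bnb_cfg,
                                            device_map="auto")

# The adapter: a small download instead of a full merged FP16 model.
model = PeftModel.from_pretrained(base, "your-org/your-lora-adapter")
# Inference now runs against the 4-bit base weights; only the adapter weights are floating point.
```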
We're moving on a little bit to the rest of the Fast universe. 00:47:31.420 |
I would have thought that, you know, once you started Answer.ai, that the sort of Fast 00:47:36.580 |
universe would be kind of on hold, and then today you just dropped fastlite, and it looks 00:47:42.540 |
like, you know, there's more activity going on in sort of FastLand. 00:47:47.500 |
Yeah, so FastLand and AnswerLand are not really distinct things, AnswerLand is kind of like 00:47:56.940 |
the FastLand grown up and funded, they both have the same mission, which is to maximize 00:48:07.460 |
We want to create thousands of commercially successful products at Answer.ai, and we want 00:48:16.060 |
to do that with like 12 people, so that means we need a pretty efficient stack, you know, 00:48:26.340 |
like quite a few orders of magnitude more efficient, not just for creation, but for 00:48:31.220 |
deployment and maintenance than anything that currently exists. 00:48:37.900 |
People often forget about the 'D' part of our R&D firm, so we've got to be extremely 00:48:43.420 |
good at, you know, creating, deploying, and maintaining applications, not just models. 00:48:50.020 |
Much to my, you know, horror, the story around creating web applications is much worse now 00:49:01.500 |
than it was 10 or 15 years ago, in terms of like, if I say to a data scientist, here's 00:49:09.460 |
how to create and deploy a web application, you know, either you have to learn JavaScript 00:49:17.340 |
or TypeScript, and about all the complex, like, libraries like React and stuff, and 00:49:22.900 |
all the complex, like, details around security and web protocol stuff, around how you then 00:49:27.620 |
talk to a back-end, and then all the details about creating the back-end. 00:49:32.020 |
You know, if that's your job, you know, and you're, you know, you have specialists who 00:49:37.380 |
work in just one of those areas, it is possible to, for that to all work, but compared to 00:49:45.940 |
like, oh, write a PHP script and put it in the home directory that you get when you sign 00:49:50.820 |
up to this shell provider, which is what it was like in the 90s, you know, here are those 00:49:55.820 |
25 lines of code, you're done, and now you can pass that URL around to all your friends, 00:50:01.820 |
you know, or put this, you know, .pl file inside the cgi-bin directory that you got 00:50:11.460 |
So yeah, the thing I've been mainly working on the last few weeks is fixing all that, 00:50:24.460 |
I don't know if this is an announcement, but I can tell you guys. 00:50:28.180 |
So yeah, there's this thing called FastHTML, which basically lets you create a complete web application in pure Python. 00:50:41.140 |
Unlike excellent projects like Streamlit and Gradio, you're not working on top of a 00:50:46.900 |
highly abstracted thing that's got nothing to do with web foundations, you're working 00:50:51.860 |
with web foundations directly, but you're able to do it by using pure Python. 00:50:59.380 |
There's no templates, there's no Jinja, there's no separate, like, CSS and JavaScript files. 00:51:06.740 |
It looks and behaves like a modern SPA web application. 00:51:16.980 |
And you can create components for, like, DaisyUI, or Bootstrap, or Shoelace, or whatever 00:51:27.780 |
fancy JavaScript and/or CSS, Tailwind, etc. library you like, but you can write it all in Python. 00:51:36.660 |
You can pip install somebody else's set of components and use them entirely from Python. 00:51:41.900 |
You can develop and prototype it all in a Jupyter Notebook if you want to. 00:51:46.300 |
It all displays correctly, so you can like interactively do that. 00:51:52.020 |
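As a flavour of what that looks like, here is a minimal sketch, assuming the fast_app/rt/serve names from the released FastHTML library; note the HTMX attribute is just a Python keyword argument.

```python
from fasthtml.common import *

app, rt = fast_app()

@rt('/')
def get():
    return Titled("Hello FastHTML",
                  P("Pure Python: no templates, no separate JS or CSS files."),
                  # hx_get turns this into an HTMX request that swaps the button in place
                  Button("Click me", hx_get="/clicked", hx_swap="outerHTML"))

@rt('/clicked')
def get():
    return P("Replaced by a partial HTMX response.")

serve()
```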
And then you mentioned fastlite, so specifically now if you're using SQLite in particular, 00:51:59.660 |
it's like ridiculously easy to have that persistence, you know, and you can basically, all of your 00:52:08.700 |
handlers will be passed database-ready objects automatically that you can just call .delete or .insert on. 00:52:18.780 |
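And a hedged sketch of the fastlite side, assuming the database()/db.t interface from the released library; the schema here is purely illustrative.

```python
from fastlite import database

db = database('todos.db')
todos = db.t.todos
if todos not in db.t:                                   # create the table on first run
    todos.create(id=int, title=str, done=bool, pk='id')

todos.insert(title='ship the FastHTML demo', done=False)
print(todos())    # all rows, ready to hand straight to your handlers and components
```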
Yeah, you get session, you get security, you get all that. 00:52:24.540 |
So it's, again, like with most of everything I do, it's very little code. 00:52:30.420 |
It's mainly tying together really cool stuff that other people have written, so. 00:52:37.420 |
You don't have to use it, but a lot of the best stuff comes from its incorporation of 00:52:41.180 |
HTMX, which to me is basically the thing that changes your browser to make it work the way it always should have. 00:52:50.500 |
So it's a, it just does four small things, but those four small things remove 00:52:56.260 |
basically unnecessary constraints that HTML should never have had. 00:53:06.180 |
It sits on top of Starlette, which is a very nice, you know, kind of lower-level platform. 00:53:15.860 |
The actual interface matches as closely as possible to FastAPI, which is a really nice 00:53:22.340 |
system for creating these kind of classic JSON API type applications. 00:53:28.940 |
And Sebastian, who wrote FastAPI, has been kind enough to help me think through some of these design decisions. 00:53:37.020 |
I mean, everybody involved has been super helpful. 00:53:40.020 |
Actually, I chatted to Carson, who created HTMX, you know, also about it, chatted to 00:53:46.820 |
Like, everybody in the community I've spoken to definitely realizes there's a big gap to 00:53:54.380 |
be filled around, like, highly scalable web foundation based, you know, pure Python framework 00:54:11.780 |
So yeah, I'm getting a lot of support and trying to make sure that FastHTML works well 00:54:19.300 |
Yeah, I would say, when I heard about this, I just texted Alessio, I think this is going to be a big deal. 00:54:26.700 |
You know, like, people consider Streamlit and Gradio to be the state of the art, but 00:54:30.780 |
I think there's so much to improve in, you know, having sort of, what do you say, what 00:54:35.380 |
do you call it, web foundations and web fundamentals at the core of it, I think would be really powerful. 00:54:40.740 |
Yeah, it's based on 25 years of thinking and work for me. 00:54:46.140 |
So, like, FastMail was built on a system much like this one, but that was, sort of, hell. 00:54:54.500 |
And so I spent, you know, 10 years working on that. 00:54:58.100 |
We had millions of people using that every day, really pushing it hard. 00:55:06.460 |
So you know, and obviously lots of other people have done, like, great stuff and particularly 00:55:10.460 |
So I've been thinking about like, yeah, how do I pull together the best of the web framework 00:55:18.660 |
There's also things like Pico CSS, which is the CSS system, which by default FastHTML comes with. 00:55:29.100 |
Although as I say, you can pip install anything you want to, but it makes it like, super easy 00:55:33.380 |
to, you know, so we're trying to make it so that just out of the box, you don't have any 00:55:37.460 |
choices to make, you know, if you don't want to. 00:55:39.940 |
You can make choices, but for most people, you just, you know, it's like the PHP in your home directory. 00:55:45.660 |
You just start typing and just by default, you'll get something which looks and feels, 00:55:54.060 |
And if you want to then write a version of Gradio or Streamlit on top of that, you totally can. 00:56:02.020 |
And then the nice thing is if you then write it in kind of the Gradio equivalent, which 00:56:06.900 |
will be, you know, I mentioned we'll create some kind of pip installable thing for that. 00:56:11.860 |
Once you've outgrown, or if you outgrow that, it's not like, okay, throw that all away and 00:56:17.220 |
start again in this like whole separate language, but it's like this kind of smooth, gentle 00:56:23.780 |
path that you can take step-by-step because it's all just standard web foundations all 00:56:34.700 |
So, you know, just to wrap up the sort of open source work that you're doing, you know, 00:56:41.340 |
you're aiming to create thousands of projects with a very, very small team. 00:56:45.420 |
And I haven't heard you mention once AI agents or AI developer tooling or AI code maintenance, 00:56:53.300 |
you know, I know you're very productive, but you know, what is the role of AI in your own workflow? 00:57:02.340 |
I'm not sure how much I want to say just yet. 00:57:22.660 |
And I'm creating a system for doing dialogue engineering. 00:57:33.860 |
I'm doing most of my work in this system and it's making me much more productive than I 00:57:40.020 |
So I always just build stuff for myself and hope that it'll be useful for somebody else. 00:57:49.460 |
Think about ChatGPT with Code Interpreter, right? 00:57:56.380 |
The basic UX is the same as a 1970s teletype, right? 00:58:01.220 |
So if you wrote APL on a teletype in the 1970s, you typed onto a thing, your words appeared 00:58:07.940 |
at the bottom of a sheet of paper and you'd like hit enter and it would scroll up. 00:58:12.580 |
And then the answer from APL would be printed out, scroll up, and then you would type the 00:58:16.620 |
next thing, which is also the way, for example, a shell works, like bash or ZSH or whatever. 00:58:28.360 |
It's not terrible, you know, like we all get a lot done in these like very, very basic 00:58:33.620 |
teletype style REPL environments, but I've never felt like it's optimal, you know, and 00:58:40.020 |
to me, you know, everybody else has just copied ChatGPT. 00:58:55.300 |
And then you add Code Interpreter and the most you can do is to, like, plead with ChatGPT to write the code you want. 00:59:04.980 |
It's pretty good for very, very, very beginner users who like can't code at all, like by 00:59:10.300 |
default now the code's even hidden away, so you never even have to see it ever happened. 00:59:15.260 |
But for somebody who's like wanting to learn to code or who already knows a bit of code 00:59:18.560 |
or whatever, it's, it seems really not ideal. 00:59:25.300 |
The other end of the spectrum, which is where Sean's work comes in, is, oh, you want to 00:59:38.260 |
There's an empty screen with a flashing cursor. 00:59:44.140 |
And it's like, okay, you can use systems like Sean's or like Cursor or whatever to be like, 00:59:52.620 |
okay, Cmd-K in Cursor, like, create a form that blah, blah, blah, but it's, in the end, 01:00:00.180 |
it's like a convenience over the top of this incredibly complicated system that full-time 01:00:06.220 |
sophisticated software engineers have designed over the past few decades in a totally different 01:00:11.160 |
environment as a way to build software, you know. 01:00:14.460 |
And so we're trying to like shoehorn in AI into that. 01:00:20.520 |
And it's not easy to do, and I think there are, like, much better ways of thinking 01:00:28.840 |
about the craft of software development in a language model world, ways that are much more interactive. 01:00:38.100 |
So the thing that I'm building is neither of those things. 01:00:43.020 |
And it's built around this idea of crafting a dialogue, you know, where the outcome of 01:00:49.860 |
the dialogue is, you know, the artifacts that you want, whether it be a piece of analysis 01:00:57.100 |
or whether it be a Python library or whether it be a technical blog post or whatever. 01:01:03.860 |
So as part of building that, I've created something called Claudette, which is a library for Claude. 01:01:09.180 |
I've created something called Cosette, which is a library for OpenAI. 01:01:16.660 |
They're libraries which are designed to make those APIs much more usable, much easier to use. 01:01:26.740 |
And then I've written AI Magic on top of those. 01:01:32.220 |
And that's been an interesting exercise because I did Claudette first, and rather than try 01:01:39.780 |
to, like... well, I was looking at what Simon Willison did with his fantastic LLM library, and his 01:01:45.740 |
library is designed around, like, let's make something that supports all the LLM inference APIs out there. 01:01:53.340 |
I thought, okay, what if I did something different, which is, like, make something that's as 01:01:56.620 |
Claude-friendly as possible and forget everything else. 01:02:00.980 |
So for example, one of the really nice things in Claude is pre-fill. 01:02:05.100 |
So by telling the assistant, this is what your response starts with, there are a lot 01:02:09.700 |
of powerful things you can take advantage of. 01:02:12.640 |
So yeah, I created Claudette to be as Claude friendly as possible. 01:02:16.680 |
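For anyone who hasn't used pre-fill: with Anthropic's Messages API you simply end the `messages` list with a partial assistant turn, and Claude continues from there. Below is a minimal sketch using the raw Anthropic Python SDK rather than Claudette itself (whose interface isn't shown here); the model name is illustrative and `ANTHROPIC_API_KEY` must be set.

```python
# Minimal pre-fill sketch with the Anthropic SDK (Claudette wraps this kind of thing).
import anthropic

client = anthropic.Anthropic()

resp = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=200,
    messages=[
        {"role": "user",
         "content": "List three Python web frameworks as a JSON array of strings."},
        # Pre-fill: the start of the assistant's answer, so Claude continues
        # from "[" instead of writing "Sure! Here are three frameworks..."
        {"role": "assistant", "content": "["},
    ],
)

# Stitch the pre-fill back onto the continuation to get the full answer.
print("[" + resp.content[0].text)
```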
And then after I did that, particularly with GPT-4o coming out, I kind 01:02:23.900 |
of thought, okay, now let's create something that's as OpenAI-friendly as possible. 01:02:29.460 |
And then I tried to look to see, well, where are the similarities and where are the differences? 01:02:33.980 |
And now, can I make them compatible in places where it makes sense for them to be compatible, 01:02:38.580 |
without losing out on the things that make each one special for what they are? 01:02:43.540 |
So yeah, those are some of the things I've been working on in that space. 01:02:49.380 |
And I'm thinking we might launch AI Magic via a course called How to Solve It With Code. 01:03:01.100 |
The name is based on the classic Pólya book, How to Solve It, if you know it, which is, you 01:03:06.540 |
know, one of the classic math books of all time, where we're basically going to try to 01:03:13.660 |
show people how to solve challenging problems that they didn't think they could solve without 01:03:19.940 |
doing a full computer science course, by taking advantage of a bit of AI and a bit of, like, coding fundamentals. 01:03:30.420 |
And it's particularly for this, like, whole generation of people who are learning to code 01:03:37.500 |
Like, I know a lot of people who didn't really know how to code, but they've created things 01:03:42.540 |
because they use ChatGPT, but they don't really know how to maintain them or fix them or add 01:03:46.260 |
things to them that ChatGPT can't do, because they don't really know how to code. 01:03:50.780 |
So this course will be designed to show you how you can, like, you know, either become 01:03:57.140 |
a developer who can, like, supercharge their capabilities by using language models, or 01:04:01.700 |
become a language-model-first developer who can supercharge their capabilities by understanding 01:04:06.460 |
a bit about process and fundamentals, so, yeah. 01:04:15.580 |
I guess the fourth time you're going to be on Latent Space, we're going to talk about all of that. 01:04:21.140 |
Jeremy, before we wrap, this was just a great run through everything. 01:04:27.660 |
What are the things that when you next come on the podcast in nine, 12 months, we're going 01:04:31.420 |
to be like, "Man, Jeremy was, like, really ahead of it." 01:04:34.060 |
Like, is there anything that you see in this space that maybe people are not talking about enough? 01:04:38.700 |
You know, what's the next company that's going to fall into, like, internal drama? 01:04:43.820 |
You know, hopefully we'll be talking a lot about FastHTML and hopefully about the international 01:04:47.080 |
community that at that point has come up around it, and also about AI Magic and about dialogue engineering. 01:04:54.300 |
Hopefully dialogue engineering catches on, because I think it's the right way to think about it. 01:04:59.260 |
I'm just trying to think about more on the research side. 01:05:01.620 |
Yeah, I think, you know, I mean, we've talked about a lot of it. 01:05:03.860 |
Like, I think encoder-decoder architectures, encoder-only architectures, hopefully we'll 01:05:08.740 |
be talking about, like, the whole renewed interest in BERT that BERT24 stimulated. 01:05:15.460 |
There's a state-space model that came out today that might be interesting for just general use. 01:05:21.380 |
One thing that stood out to me with Cartesia's blog post was that they were talking about 01:05:25.820 |
real-time ingestion of billions and trillions of tokens, and keeping all of that context, obviously. 01:05:34.940 |
I'm wondering what your thoughts are, because you've been entirely transformers the whole time. 01:05:39.860 |
Yeah, no, so obviously my background is RNNs and LSTMs, and I'm still a believer in the 01:05:48.180 |
idea that state is something you can update, you know. 01:05:53.260 |
So obviously Sepp Hochreiter came out with xLSTM recently. 01:06:01.380 |
Oh my god, okay, another whole thing we haven't talked about, just somewhat related. 01:06:09.700 |
I've been going crazy for, like, a long time about, like, why can I not pay anybody to 01:06:16.340 |
save my KV cache, you know, for, like, I just ingested the Great Gatsby or the documentation 01:06:24.300 |
for Starlette or whatever, you know, I'm sending it as my prompt context. 01:06:34.700 |
So Gemini is about to finally come out with KV caching, and this is something that Austin 01:06:41.180 |
actually had on his roadmap in Gemma.cpp for, well, not years, months, a long time. 01:06:48.340 |
The idea is that the KV cache is, like, a third thing, right? 01:06:58.060 |
So there's RAG, you know, there's in-context learning, you know, and prompt engineering, and then there's the KV cache. 01:07:11.380 |
I think it creates, like, a whole new class, almost, of applications or of techniques where, 01:07:19.820 |
you know, for me, for example, I very often work with, like, really 01:07:23.700 |
new libraries, or I've created my own library that I'm now writing with, rather than on. 01:07:31.140 |
So I want all the docs in my new library to be there all the time. 01:07:35.340 |
So yeah, I want to upload them once, and then have a whole discussion about building 01:07:41.740 |
this application using FastHTML, well, nobody's got FastHTML in their language model training data. 01:07:48.980 |
I don't want to send all the FastHTML docs across every time. 01:07:51.420 |
So one of the things I'm looking at doing in AI Magic, actually, is taking advantage 01:07:54.380 |
of some of these ideas, so that you can have the documentation of the libraries you're using available all the time. 01:08:05.060 |
So there'll be ways to do that. You know, something people will be spending 01:08:10.300 |
time thinking about over the next 12 months is where to use RAG, where to use fine-tuning, where 01:08:14.500 |
to use KV cache storage, you know, and how to use state, because in state-space models 01:08:24.020 |
and xLSTM, again, state is something you update. 01:08:30.400 |
So how do we combine the best of all of these worlds? 01:08:34.140 |
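As a rough local illustration of the "save my KV cache" idea (not the hosted caching Gemini and others are rolling out, whose APIs differ): with Hugging Face transformers you can run the shared context through the model once, keep the returned `past_key_values`, and then answer follow-ups by feeding only the new tokens. The model below is just a tiny stand-in to show the mechanics.

```python
# Sketch of KV-cache reuse with Hugging Face transformers: encode the shared
# "docs" once, keep the cache, then decode answers feeding only new tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # tiny stand-in model, purely to show the mechanics
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

docs = "Imagine the full FastHTML docs pasted here as shared context."
question = "\nQ: How do I define a route?\nA:"

with torch.no_grad():
    # 1) Pay the cost of the long context once and keep its KV cache around.
    doc_ids = tok(docs, return_tensors="pt").input_ids
    cache = model(doc_ids, use_cache=True).past_key_values

    # 2) For each new question, feed only the new tokens; attention still
    #    "sees" the docs through the cached keys and values.
    ids = tok(question, return_tensors="pt").input_ids
    new_tokens = []
    for _ in range(30):  # simple greedy decoding
        out = model(ids, past_key_values=cache, use_cache=True)
        cache = out.past_key_values
        next_id = out.logits[:, -1:].argmax(dim=-1)
        new_tokens.append(next_id)
        ids = next_id

print(tok.decode(torch.cat(new_tokens, dim=-1)[0], skip_special_tokens=True))
```

The hosted version of this is the same trick offered as a service: the provider stores the cache for you, so you stop paying to re-ingest the same documentation on every request.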
>> And Jeremy, I know before you talked about how some of the autoregressive models are not necessarily the way forward. 01:08:40.820 |
Any other thoughts on like JEPA, diffusion for text, any interesting thing that you've seen? 01:08:45.900 |
>> In the same way that, like, we probably ought to have state that you can update, i.e. 01:08:50.900 |
xLSTM and state-space models, and in the same way a lot of things probably should have an encoder, 01:08:58.140 |
JEPA and diffusion both seem like the right conceptual mapping for a lot of things we're trying to do. 01:09:08.100 |
So the idea is that, like, there should be a piece of the generative pipeline, 01:09:19.100 |
which is like thinking about the answer and coming up with a sketch of what the answer 01:09:24.940 |
looks like before you start outputting tokens. 01:09:29.600 |
That's where it kind of feels like diffusion ought to fit, you know, and diffusion is, 01:09:34.780 |
because it's not autoregressive, it's like, let's try to, like, gradually de-blur the whole answer at once. 01:09:43.540 |
So this is also where dialogue engineering fits in, by the way. 01:09:47.260 |
So with dialogue engineering, one of the reasons it's working so well for me is I use it to 01:09:52.260 |
kind of, like, craft the thought process before I generate the code, you know. 01:10:03.340 |
So yeah, there's a lot of different pieces here, and I don't know how they'll all kind of fit together. 01:10:10.220 |
I don't know if JEPA is going to actually end up working in the text world, I don't 01:10:13.100 |
know if diffusion will end up working in the text world, but they seem to be, like, trying 01:10:16.900 |
to solve a class of problem which is currently unsolved.