
Answer.ai & AI Magic with Jeremy Howard


Chapters

0:00 Introduction
1:07 Continuous Pre-Training is Here
4:48 Schedule-Free Optimizers and Learning Rate Schedules
6:08 Governance and Structural Issues within OpenAI and Other AI Labs
13:32 How Answer.ai Works
27:04 How to Recruit Productive Researchers
32:34 Building a New BERT
37:10 FSDP, QLoRA, and QDoRA: Innovations in Fine-Tuning Large Models
43:42 Research and Development on Model Inference Optimization
47:48 FastHTML for Web Application Development
61:16 AI Magic & Dialogue Engineering
64:11 AI Wishlist & Predictions

Whisper Transcript

00:00:00.560 | Hey, everyone. Welcome to the Latent Space Podcast.
00:00:03.100 | This is Alessio, partner and CTO-in-Residence at Decibel Partners,
00:00:06.400 | and I'm joined by my co-host, Swyx, founder of Smol.ai.
00:00:10.000 | And today we're back with Jeremy Howard,
00:00:12.600 | I think your third appearance on Latent Space. Welcome.
00:00:16.200 | Wait, third? Second?
00:00:17.600 | - Well, I grabbed you in Europe. - I see.
00:00:21.000 | - Very fun standing outside streets. - I never heard that, by the way.
00:00:25.200 | You've got to send me a link. I've got to hear what it sounded like.
00:00:27.000 | - Yeah, yeah. - I think the two episodes are six hours,
00:00:30.800 | so there's plenty to listen to. We'll make sure to send it over.
00:00:35.000 | Yeah, we're trying this thing where the major ML conferences,
00:00:38.000 | we, you know, do a little audio tour of the conference
00:00:40.800 | and give people a sense of what it's like.
00:00:44.200 | But the last time you were on, you declared the end of fine-tuning.
00:00:47.000 | I hope that... I know that, you know,
00:00:50.000 | I sort of editorialized the title a little bit,
00:00:52.800 | and I know you were slightly uncomfortable with it,
00:00:55.000 | but you just own it anyway.
00:00:57.000 | I think you're very good at the hot takes.
00:00:59.400 | And we were just discussing in our pre-show that things have...
00:01:02.200 | It's really happening, that the continued pre-training is really happening.
00:01:07.600 | Yeah, absolutely. I think people are starting to understand that
00:01:13.600 | treating the three ULMFiT steps of, like, pre-training, you know,
00:01:18.800 | and then the kind of, like, what people would now call "instruction tuning,"
00:01:21.400 | and then, I don't know if we've got a general term for this,
00:01:24.200 | the DPO/RLHF step, you know, but, you know, the task training,
00:01:29.600 | they're not actually as separate as we originally suggested they were in our paper.
00:01:35.200 | And when you treat it more as a continuum,
00:01:38.800 | and that you make sure that you have, you know,
00:01:42.600 | more of kind of the original data set incorporated into the later stages,
00:01:49.000 | and that, you know, we've also seen with, like, Llama 3,
00:01:53.600 | this idea that those later stages can be done for a lot longer.
00:01:57.200 | These are all of the things I was kind of trying to describe there.
00:02:00.600 | It wasn't, like, yeah, wasn't the end of pre-training.
00:02:03.400 | Sorry, it wasn't the end of fine-tuning,
00:02:05.200 | but more that we should treat it as a continuum,
00:02:08.600 | and we should have much higher expectations
00:02:11.200 | of how much you can do with an already trained model.
00:02:16.600 | You can really add a lot of behavior to it. You can change its behavior.
00:02:20.400 | You can, you know, you can do a lot.
00:02:22.600 | So a lot of our research has been around trying to figure out
00:02:25.200 | how to modify the model by a larger amount
00:02:29.400 | rather than starting from random weights,
00:02:31.200 | because I get very offended at the idea of starting from random weights.
00:02:35.400 | Yeah, I saw that at ICLR in Vienna,
00:02:39.400 | there was an outstanding paper about starting transformers from data-driven priors.
00:02:44.400 | I don't know if you saw that one.
00:02:46.600 | They called it, sort of, "Never Train from Scratch,"
00:02:48.600 | and I think it was kind of rebelling against, like,
00:02:50.600 | the sort of random initialization of it.
00:02:54.200 | Yeah, I've, you know, that's been our kind of continuous message
00:02:57.000 | since we started Fast.ai, is if you're training from random weights,
00:03:00.800 | you better have a really good reason, you know,
00:03:03.000 | because it seems so unlikely to me that nobody has ever trained on data
00:03:08.200 | that has any similarity whatsoever to the general class of data you're working with,
00:03:13.000 | and that's the only situation in which I think starting from random weights makes sense.
00:03:19.400 | Yeah, the other trends since our last pod that I would point people to
00:03:24.600 | is I'm seeing a rise in multi-phase pre-training.
00:03:29.200 | So Snowflake released a large model called Snowflake Arctic,
00:03:34.200 | where they detailed three phases of training,
00:03:37.000 | where they had, like, a different mixture of, like,
00:03:39.600 | there was, like, 75% web in the first instance,
00:03:43.000 | and then they reduced the percentage of the web text by 10% each time
00:03:47.200 | and increased the amount of code in each phase.
00:03:51.200 | And I feel like multi-phase is being called out in papers more.
00:03:55.600 | I feel like it's always been a thing, like,
00:03:57.800 | changing data mix is not something new,
00:04:00.000 | but calling it a distinct phase is new,
00:04:02.400 | and I wonder if there's something that you're seeing on your end.
00:04:06.200 | Well, so they're getting there, right?
00:04:08.200 | So the point at which they're doing proper continued pre-training
00:04:10.800 | is the point at which that becomes a continuum rather than a phase.
00:04:14.000 | So the only difference with what I was describing last time is to say, like,
00:04:17.400 | oh, there should, you know, there's a function or whatever which is happening every batch.
00:04:24.400 | And it doesn't, like, it's not a huge difference,
00:04:28.600 | but it's like back, you know, I always used to get offended
00:04:31.200 | when people had learning rates that, like, jumped.
00:04:34.600 | And so one of the things I started doing early on in Fast.ai
00:04:37.000 | was to say to people, like, no, you should actually have,
00:04:39.400 | your learning rate schedule should be a function, not a list of numbers.
00:04:43.000 | So now I'm trying to give the same idea about training mix.
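
As a rough illustration of the "schedule should be a function, not a list of numbers" idea, applied to both the learning rate and the data mix, here is a minimal Python sketch; the warmup-plus-cosine shape and the 75%-to-55% web fraction are placeholder choices for illustration, not anyone's actual recipe.

```python
import math

def lr_at(step: int, total_steps: int, max_lr: float = 3e-4, warmup: int = 1000) -> float:
    """Learning rate as a smooth function of the step (linear warmup, cosine decay),
    rather than a hand-written list of piecewise values."""
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * max_lr * (1 + math.cos(math.pi * progress))

def web_fraction(step: int, total_steps: int, start: float = 0.75, end: float = 0.55) -> float:
    """Same idea applied to the training mix: the share of web text changes
    continuously over training instead of jumping between discrete phases."""
    t = step / max(1, total_steps)
    return start + (end - start) * t
```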
00:04:48.400 | There's been pretty public work from Meta on schedule-free optimizers.
00:04:52.200 | I don't know if you've been following Aaron DeFazio and what he's doing.
00:04:56.200 | Just because you mentioned learning rate schedules,
00:04:58.600 | you know, what if you didn't have a schedule?
00:05:01.000 | I mean, I don't care very much, honestly.
00:05:03.200 | Like, I don't think that schedule-free optimizer's that exciting.
00:05:06.000 | It's fine.
00:05:08.800 | We've had non-scheduled optimizers for ages, like,
00:05:14.800 | Les Wright, who's now at Meta and was part of the Fast.ai community,
00:05:17.800 | created something called the Ranger optimizer.
00:05:22.200 | You know, I actually like having more hyperparameters, you know,
00:05:26.000 | as soon as you say schedule-free, then, like, well, now I don't get to choose.
00:05:31.800 | And there isn't really a mathematically correct way of, like,
00:05:36.600 | I actually try to schedule more parameters rather than less.
00:05:39.000 | So, like, I like scheduling my epsilon in my Adam, for example.
00:05:43.000 | I schedule all the things.
00:05:45.600 | So, but then the other thing we always did with the Fast.ai library
00:05:49.800 | was make it so you don't have to set any schedules.
00:05:52.600 | So Fast.ai always supported, like, you didn't even have to pass a learning rate.
00:05:57.400 | Like, it would always just try to have good defaults and do the right thing.
00:06:01.600 | But to me, I like to have more parameters I can play with if I want to,
00:06:05.600 | but that you don't have to.
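
A minimal sketch of scheduling an optimizer hyperparameter other than the learning rate, here Adam's epsilon, by updating the PyTorch param groups each step; the log-interpolated schedule is a made-up placeholder, only the mechanism is standard PyTorch.

```python
import math
import torch

model = torch.nn.Linear(10, 2)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, eps=1e-8)

def eps_at(step: int, total_steps: int, start: float = 1e-8, end: float = 1e-6) -> float:
    # Placeholder schedule: interpolate epsilon on a log scale over training.
    t = step / max(1, total_steps)
    return math.exp(math.log(start) + t * (math.log(end) - math.log(start)))

total_steps = 1000
for step in range(total_steps):
    for group in opt.param_groups:
        group["eps"] = eps_at(step, total_steps)  # AdamW reads eps from the param group each step
    # ... forward pass, loss.backward(), opt.step(), opt.zero_grad() go here ...
```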
00:06:09.000 | And then the less technical side, I guess, of your issue
00:06:13.800 | with the market was some of the large research labs
00:06:18.400 | taking all this innovation kind of behind closed doors
00:06:20.400 | and whether or not that's good, which it isn't.
00:06:24.000 | And now we could maybe make it more available to people.
00:06:26.600 | And then after a month, a month after we released the episode,
00:06:30.200 | there was the whole Sam Altman drama and, like, all the OpenAI governance issues.
00:06:35.000 | And maybe people started to think more,
00:06:37.400 | okay, what happens if some of these kind of labs, you know,
00:06:41.200 | start to break from within, so to speak, and the alignment of the humans
00:06:45.600 | is probably going to fall before the alignment of the models.
00:06:48.600 | So I'm curious, like, if you have any new thoughts,
00:06:51.000 | and maybe we can also tie in some of the way that we've been building Answer
00:06:54.800 | as, like, a public benefit corp and some of those aspects.
00:06:58.000 | Sure. So, yeah, I mean, it was kind of uncomfortable
00:07:00.800 | because two days before Altman got fired,
00:07:04.600 | I did a small public video interview
00:07:09.800 | in which I said I'm quite sure
00:07:13.800 | that OpenAI's current governance structure can't continue
00:07:18.200 | and that it was definitely going to fall apart.
00:07:20.600 | And it fell apart two days later.
00:07:22.200 | And a bunch of people were like, "What did you know, Jeremy?"
00:07:25.800 | - What did Jeremy see? - I didn't see anything.
00:07:29.800 | It's just obviously true.
00:07:33.800 | And so, yeah, so my friend Eric Ries and I spoke a lot before that
00:07:39.800 | about, you know, Eric's, I think, probably most people would agree,
00:07:44.800 | the top expert in the world on, kind of, start-up and AI governance.
00:07:52.200 | And, you know, we could both clearly see that this didn't make sense
00:07:57.800 | to have, like, a so-called non-profit
00:08:00.200 | where then there are people working at a commercial company
00:08:02.600 | that's owned by or controlled nominally by the non-profit
00:08:06.000 | where the people in the company are being given the equivalent of stock options.
00:08:10.000 | Like, everybody there was working there
00:08:13.400 | with expecting to make money largely from their equity.
00:08:18.000 | So the idea that then a board could exercise control
00:08:22.600 | by saying, like, "Oh, we're worried about safety issues
00:08:26.000 | and so we're going to do something that decreases the profit of the company,"
00:08:29.200 | when every stakeholder in the company,
00:08:31.400 | their remuneration pretty much is tied to their profit,
00:08:34.800 | it obviously couldn't work.
00:08:37.600 | So, I mean, that was a huge oversight there by someone.
00:08:42.800 | And I guess it's, like, I guess part of the problem is that the kind of people who
00:08:47.600 | work at non-profits, you know, and in this case the board, you know,
00:08:52.400 | who are kind of academics and, you know, people who
00:08:57.200 | are kind of true believers, I think it's hard for them to realize that 99.999% of the world is
00:09:02.800 | driven very heavily by money, especially huge amounts of money.
00:09:07.200 | So, yeah, Eric and I had been talking for a long time before that
00:09:12.200 | about, like, well, what could be done differently?
00:09:15.800 | Because also companies are sociopathic, like, by design.
00:09:20.400 | And so the alignment problem, as it relates to companies,
00:09:24.800 | has not been solved. Like, companies become huge,
00:09:28.400 | they devour their founders, they devour their communities,
00:09:33.400 | and they do things where even the CEOs, you know, often of big companies tell me, like,
00:09:38.600 | "I wish our company didn't do that thing."
00:09:41.800 | But, you know, I know that if I didn't do it,
00:09:46.600 | then I would just get fired, and the board would put in somebody else.
00:09:49.400 | And the board knows if they don't do it, then their shareholders can sue them,
00:09:53.000 | because they're not maximizing profitability or whatever.
00:09:56.000 | So, what Eric's spent a lot of time doing is trying to think about, like,
00:10:03.400 | how do we make companies less sociopathic, you know?
00:10:08.200 | Or maybe a better way to think of it is, like, how do we make it so that the founders of companies
00:10:15.400 | can ensure that their companies continue to actually do the things they want them to do?
00:10:24.200 | So, you know, when we started a company,
00:10:30.200 | you know, like, well, A, we very explicitly decided we're going to start a company,
00:10:34.400 | not a academic lab, not a non-profit, you know.
00:10:39.200 | We created a Delaware C Corp, you know, the most company kind of company.
00:10:46.400 | But when we did so, we told everybody, you know, including our first investors,
00:10:55.400 | which was you, Alessio.
00:10:57.000 | They sound great.
00:10:59.400 | We are going to run this company on the basis of maximizing long-term value.
00:11:04.200 | You know?
00:11:07.600 | So, you know, in fact, so when we did our second round, which is an angel round,
00:11:15.600 | we had everybody invest through a long-term SPV, which we set up,
00:11:21.800 | where everybody had to agree to vote in line with long-term value principles.
00:11:29.400 | So, like, it's not just, it's never enough just to say to people, like,
00:11:36.000 | okay, we're trying to create long-term value here for society as well as for ourselves,
00:11:40.400 | and everybody's like, oh, yeah, yeah, I totally agree with that.
00:11:43.200 | But when it comes to like, okay, well, here's a specific decision we have to make,
00:11:47.000 | which will not maximize short-term value, people suddenly change their mind.
00:11:52.400 | So, you know, it has to be written into the legal documents of everybody,
00:11:56.800 | so that there's no question that that's the way the company has to be managed.
00:12:03.800 | So, then you mentioned the PBC aspect, Public Benefit Corporation,
00:12:07.200 | which I never quite understood previously.
00:12:10.400 | And it turns out it's incredibly simple.
00:12:13.200 | Like, it took, you know, like one paragraph added to our corporate documents to become a PBC.
00:12:19.800 | It was cheap, it was easy, but it's got this huge benefit,
00:12:23.000 | which is, if you're not a Public Benefit Corporation,
00:12:26.800 | then somebody can come along and offer to buy you,
00:12:31.600 | with a stated description of, like, turning your company into the thing you most hate, right?
00:12:37.200 | And if they offer you more than the market value of your company and you don't accept it,
00:12:41.800 | then you are not necessarily meeting the, kind of, your fiduciary responsibilities.
00:12:49.200 | So, the way, like, Eric always described it to me, you know, is like,
00:12:54.200 | if Philip Morris came along and said that you've got great technology for marketing cigarettes to children,
00:12:59.200 | so we're going to pivot your company to do that entirely,
00:13:01.600 | and we're going to pay you 50% more than the market value,
00:13:05.000 | you're going to have to say yes.
00:13:07.200 | If you have a PBC, then you are more than welcome to say no,
00:13:12.600 | if that offer is not in line with your stated public benefit.
00:13:17.000 | So, our stated public benefit is to maximize, you know, the benefit to society through using AI.
00:13:24.400 | So, given that more children smoking doesn't do that,
00:13:28.200 | then we can say, like, no, we're not selling to you.
00:13:33.000 | Yeah, and I was looking back at some of our emails.
00:13:37.800 | You sent me an email on November 13th about talking,
00:13:41.200 | and then on the 14th, I sent you an email; "working together to free AI" was the subject line.
00:13:47.200 | And then that was, kind of, the start of the seed round.
00:13:50.200 | And then two days later, someone got fired.
00:13:52.400 | So, this was, like, not even, you know, you were having these thoughts even before.
00:13:57.400 | We had, like, a public example of, like, why some of the current structures didn't work.
00:14:01.800 | So, yeah, you were very ahead of the curve, so to speak.
00:14:07.200 | I would love just to, you know, people can read your awesome introduction, blog, and answer,
00:14:12.400 | and the idea of having an R&D lab versus an R lab here and then a D lab somewhere else.
00:14:19.000 | I think, to me, the most interesting thing has been hiring,
00:14:22.200 | and some of the awesome people that you've been bringing on that
00:14:24.800 | maybe don't fit the central casting of Silicon Valley, so to speak.
00:14:29.000 | Like, sometimes out there it's like playing baseball cards, you know.
00:14:31.400 | People are like, oh, what teams was this person on?
00:14:34.000 | Where did they work? Versus focusing on ability.
00:14:36.000 | So, I would love for you to give a shout out to some of the awesome folks on the team.
00:14:41.200 | So, you know, there's, like, a graphic going around describing, like, the people at XAI,
00:14:46.600 | you know, the Elon Musk thing, and, like, they're all connected to, like, you know,
00:14:53.200 | multiple of Stanford, Meta, DeepMind, OpenAI, Berkeley, Oxford.
00:15:03.200 | It's just, look, these are all great institutions, and they have good people,
00:15:07.800 | and I'm definitely not at all against that, but, damn, there's so many other people.
00:15:13.400 | And one of the things I found really interesting is, kind of, anytime I,
00:15:20.800 | almost anytime I see something which I think, like, this is really high quality work,
00:15:24.600 | and it's, like, something I don't think would have been built
00:15:27.600 | if that person hadn't built the thing right now,
00:15:30.000 | I nearly always reach out to them and ask to chat.
00:15:34.000 | And I tend to dig in to find out, like, okay, you know, why did you do that thing?
00:15:38.400 | Everybody else has done this other thing.
00:15:39.800 | Your thing's much better, but it's not what other people are working on.
00:15:42.600 | And, like, 80% of the time, I find out the person has a really unusual background.
00:15:50.600 | So, like, often they'll have, like, either they, like, came from poverty,
00:15:55.000 | and, like, didn't get an opportunity to go to good school,
00:15:57.600 | or they, like, you know, had dyslexia and, you know, got kicked out of school in year 11,
00:16:02.800 | or, you know, or they had a health issue that meant they couldn't go to university,
00:16:08.800 | or something happened in their past, and they ended up out of the mainstream,
00:16:16.000 | and then they, kind of, succeeded anyway.
00:16:20.600 | And those are the people that, throughout my career,
00:16:24.200 | I've tended to, kind of, accidentally hire more of.
00:16:28.000 | But, like, it's not exactly accidentally.
00:16:29.400 | It's, like, when I see somebody who's done, two people who have done extremely well.
00:16:35.200 | One of them did extremely well in exactly the normal way,
00:16:38.200 | from the background, entirely pointing in that direction,
00:16:41.400 | and they achieved all the hurdles to get there.
00:16:43.800 | And, like, okay, that's quite impressive, you know.
00:16:48.200 | But another person who did just as well, despite lots of constraints,
00:16:53.000 | and doing things in really unusual ways, and came up with different approaches,
00:16:56.800 | like, that's normally the person I'm likely to find useful to work with,
00:17:01.400 | because they're often, like, risk-takers, they're often creative,
00:17:04.400 | they're often extremely tenacious, they're often very open-minded.
00:17:09.600 | So, that's the kind of folks we, you know, I tend to find myself hiring.
00:17:16.200 | And I think, like, so now at Answer.ai,
00:17:23.400 | it's a group of people that are strong enough that nearly every one of them
00:17:27.600 | has independently come to me in the past few weeks and said,
00:17:31.200 | and told me that they have imposter syndrome,
00:17:33.400 | and they're not convinced that they're good enough to be here, you know.
00:17:37.600 | And I kind of heard it at the point where I was like, okay,
00:17:41.200 | I don't think it's possible that all of you
00:17:44.600 | are so far behind your peers that you shouldn't get to be here.
00:17:47.400 | But I think part of the problem is, like, as an R&D lab,
00:17:53.400 | the great developers look at the great researchers and they're like,
00:17:56.800 | wow, these big-brained, crazy research people with all their math and shit,
00:18:02.400 | they're too cool for me, oh my god.
00:18:04.600 | And then the researchers look at the developers and they're like,
00:18:06.400 | oh, they're killing it, making all this stuff with all these people using it,
00:18:10.000 | and talking on Twitter about how great it is.
00:18:12.200 | And I think they're both a bit intimidated by each other, you know.
00:18:15.600 | And so I have to kind of remind them, like, okay,
00:18:19.200 | there are lots of things in this world where you suck
00:18:21.800 | compared to lots of other people in this company,
00:18:24.000 | but also vice versa, you know, for all things.
00:18:27.200 | And the reason you came here is because you wanted to
00:18:31.400 | learn about those other things from those other people
00:18:33.800 | and have an opportunity to, like, bring them all together into a single unit.
00:18:40.000 | So, you know, it's not reasonable to expect you're going to be better at everything
00:18:45.600 | than everybody else.
00:18:48.000 | Even though, like, I guess the other part of it is for nearly all of the people in the company,
00:18:52.200 | to be honest, they have nearly always been better than everybody else
00:18:55.600 | at nearly everything they're doing, nearly everywhere they've been.
00:18:58.200 | So it's kind of weird to be in this situation now where it's like,
00:19:01.000 | gee, I can clearly see that I suck at this thing
00:19:05.600 | that I'm meant to be able to do compared to these other people,
00:19:08.400 | where I'm like the worst in the company at this thing for some things.
00:19:11.400 | So I think that's a healthy place to be, you know,
00:19:15.600 | as long as you keep reminding each other about that's actually why we're here.
00:19:24.000 | And it's been really nice to see, like, it's all a bit of an experiment, like,
00:19:30.200 | we don't have any managers.
00:19:32.000 | We don't have any hierarchy from that point of view.
00:19:34.200 | So, for example, I'm not a manager,
00:19:36.000 | which means I don't get to tell people what to do or how to do it or when to do it.
00:19:43.000 | And it's been a bit of an experiment to see how that would work out.
00:19:46.000 | And it's been great, like, so, for instance, Ben Clavié,
00:19:53.800 | who you might have come across, he's the author of RAGatouille.
00:19:56.200 | He's the author of rerankers, super strong information retrieval guy.
00:20:01.200 | And a few weeks ago, he, you know, this additional channel appeared on Discord,
00:20:09.000 | on our private Discord called Bert24.
00:20:12.800 | Like, these people started appearing, as in our collab sections.
00:20:16.400 | We have a collab section for, like, collaborating with outsiders.
00:20:19.800 | And these people started appearing.
00:20:21.000 | There are all these names that I recognize, like Bert24.
00:20:24.200 | And they're all talking about, like, the next generation of Bert.
00:20:26.800 | And I start following along.
00:20:28.600 | It's like, okay, Ben decided, I think quite rightly, that we need a new Bert.
00:20:35.400 | Because everybody, like, so many people are still using Bert.
00:20:38.200 | And it's still the best at so many things.
00:20:40.000 | But it actually doesn't take advantage of lots of best practices.
00:20:43.200 | And so, he just went out and found basically everybody who's created better Berts
00:20:47.800 | in the last four or five years, brought them all together.
00:20:53.200 | Suddenly, there's this huge collaboration going on.
00:20:56.600 | So, yeah, I didn't tell him to do that.
00:20:58.200 | He didn't ask my permission to do that.
00:21:01.000 | And then, like, Benjamin Warner dived in.
00:21:05.400 | And he's like, oh, I created a whole Transformers from scratch implementation
00:21:12.000 | designed to be maximally hackable.
00:21:14.400 | He originally did it largely as a teaching exercise to show other people.
00:21:18.400 | But he was like, I could, you know, use that to create a really hackable Bert implementation.
00:21:25.400 | In fact, he didn't say that.
00:21:26.400 | He said, I just did do that.
00:21:28.400 | You know, and I created a repo.
00:21:32.200 | And then everybody's like, starts using it.
00:21:34.000 | They're like, oh, my God, this is amazing.
00:21:36.400 | I can now implement all these other Bert things, you know.
00:21:40.800 | And it's not just answer AI guys.
00:21:43.600 | There, you know, there's lots of folks, you know, who have, like, contributed new data
00:21:47.600 | set mixes and blah, blah, blah.
00:21:50.200 | So, I mean, I can help in the same way that other people can help.
00:21:55.800 | So, like, then Ben Clavié reached out to me at one point and said, like, okay, can
00:22:00.600 | you help me, like, what have you learned over time about how to manage, you know, intimidatingly
00:22:08.400 | capable and large groups of people who you're nominally meant to be leading?
00:22:15.800 | And so, you know, like, I try to help, but I don't direct.
00:22:21.400 | Another great example was Kerem, who, after our FSDP/QLoRA work, decided quite correctly
00:22:33.880 | that it didn't really make sense to use LoRA in today's world.
00:22:36.840 | You want to use the normalized version, which is called DoRA.
00:22:41.000 | And like, two or three weeks after we did FSDP/QLoRA, he just popped up and said, okay,
00:22:47.800 | I've just converted the whole thing to DoRA, and I've also created these vLLM extensions,
00:22:52.680 | and I've got all these benchmarks, and, you know, now I've got training of quantized models
00:23:03.860 | with adapters that are as fast as LoRA and, actually, weirdly, better than fine-tuning.
00:23:09.200 | I was just like, okay, that's great, you know?
00:23:15.040 | And yeah, so, the things we've done to try to help make these things happen as well is
00:23:20.920 | like, we have, so we don't have any required meetings, you know, but we do have a meeting
00:23:26.720 | for each pair of major time zones that everybody's invited to, and, you know, people see their
00:23:38.280 | colleagues doing stuff that looks really cool, and say like, oh, how can I help, you know,
00:23:45.200 | or how can I learn, or whatever.
00:23:47.000 | So another example is Austin, who, you know, amazing background, he ran AI at Fidelity,
00:23:55.320 | he ran AI at Pfizer, he ran browsing and retrieval for Google's DeepMind stuff, created Gemma.cpp,
00:24:03.920 | and he's been working on a new system to make it easier to do WebGPU programming, because
00:24:10.560 | again, he quite correctly identified, like, you know, this is a way that not everybody
00:24:16.280 | has to use CUDA, not everybody has to use NVIDIA, you can do stuff on your own computer,
00:24:22.120 | optionally through the browser, we need to make this easier to do.
00:24:25.480 | And so I, yeah, so I said to him, like, okay, I want to learn about that, not an area that
00:24:32.440 | I have much expertise in, so, you know, he's going to show me what he's working on and
00:24:37.540 | teach me a bit about it, and hopefully I can help contribute.
00:24:40.160 | I think one of the key things that's happened in all of these is everybody understands what
00:24:47.720 | Eric Gilliam, who wrote the second blog post in our series, the R&D historian, describes
00:24:54.440 | as everybody has total flexibility to do what they want, but we all understand, like, kind
00:25:04.520 | of roughly why we're here, you know, we all have the same, you know, we agree with the
00:25:08.120 | premises around, like, you know, everything's too expensive, everything's too complicated,
00:25:15.000 | you know, people are building too many vanity foundation models rather than taking better
00:25:20.740 | advantage of fine-tuning, like, there's this kind of general, like, sense of, like, we're
00:25:26.040 | all on the same wavelength about, you know, all the ways in which current research is
00:25:33.240 | fucked up and, you know, all the ways in which we're kind of, you know, worried
00:25:39.220 | about centralization and we, you know, we all care a lot about not just research for
00:25:47.840 | the point of citations, but research that actually wouldn't have happened otherwise
00:25:51.280 | and actually is going to lead to real-world outcomes and so, yeah, with this kind of like
00:25:55.160 | shared vision, people understand, like, you know, so when I say, like, oh, well, you know,
00:26:04.400 | tell me, Ben, about BERT 24, what's that about, and he's like, you know, like, oh, well, you
00:26:08.400 | know, you can see from an accessibility point of view or you can see from a kind of a actual
00:26:13.240 | practical impact point of view, there's far too much focus on decoder-only models and,
00:26:19.360 | you know, like, BERT's used in all of these different places and industry and so I can
00:26:23.120 | see, like, in terms of our basic principles, what we're trying to achieve, this seems like
00:26:26.440 | something important, and so I think it's, like, really helpful that we have that kind
00:26:32.400 | of shared perspective, you know.
00:26:35.920 | Yeah, and before we maybe talk about some of the specific research, when you're, like,
00:26:41.000 | reaching out to people, interviewing them, what are some of the traits, like, how do
00:26:45.760 | these things come out, you know, usually?
00:26:48.000 | Is it working on side projects that, you know, you're already familiar with?
00:26:51.360 | Is there anything, like, in the interview process that, like, helps you screen for people
00:26:54.520 | that are less pragmatic and more research-driven versus some of these folks that are, like,
00:27:00.380 | just going to do it, you know, they're not waiting for, like, the perfect process?
00:27:05.360 | Anybody who comes through the recruiting is interviewed by everybody in the company.
00:27:15.560 | You know, our goal is 12 people, so it's not an unreasonable amount and, like, the way
00:27:23.160 | I, so the other thing to say is everybody so far who's come into the recruiting pipeline,
00:27:29.380 | everybody bar one, has been hired, so, which is to say our original curation has been good.
00:27:39.920 | And that's actually pretty easy because nearly everybody who's come in through the recruiting
00:27:42.420 | pipeline are people I know pretty well, so, you know, Jono Whitaker and I, you know,
00:27:51.440 | he worked on the stable diffusion course we did, he's outrageously creative and talented
00:28:01.140 | and he's super, like, enthusiastic tinkerer, just likes making things and, you know, Benjamin
00:28:11.840 | was one of the strongest parts of the fast.ai community, which is now the alumni, it's like
00:28:16.900 | hundreds of thousands of people and, you know, again, like, they're not people who a normal
00:28:23.180 | interview process would pick up, right?
00:28:25.860 | So Benjamin doesn't have any qualifications in math or computer science, Jono was living
00:28:36.340 | in Zimbabwe, he was not, you know, he was working on, like, helping some African startups,
00:28:41.900 | you know, but not FANG kind of credentials, but yeah, I mean, when you actually see people
00:28:49.060 | doing real work and they stand out above, you know, we've got lots of Stanford graduates
00:28:56.620 | and OpenAI people and whatever in our alumni community as well, you know, when you stand
00:29:00.660 | out above all of those people, anyway, obviously you've got something going for you, you know,
00:29:07.540 | him and I worked together on the masks study we did in the Proceedings of the National Academy
00:29:15.460 | of Sciences.
00:29:16.460 | So, you know, we had worked together and, again, that was a group of, like, basically
00:29:20.300 | the 18 or 19 top experts in the world on public health and epidemiology and research design
00:29:29.780 | and so forth, and Austin was, you know, one of the strongest people in that collaboration.
00:29:38.740 | So yeah, you know, like, I've been lucky enough to have had opportunities to work with some
00:29:46.040 | people who are great and, you know, I'm a very open-minded person, so I kind of am always
00:29:49.960 | happy to try working with pretty much anybody and some people stand out.
00:29:54.100 | You know, there have been some exceptions, people I haven't previously known, like Ben
00:29:57.340 | Clavié, actually, I didn't know before, but, you know, with him, like, I just read his
00:30:06.740 | code and I'm like, oh, that's really well-written code, like I, and like it's not written exactly
00:30:15.780 | the same way as everybody else's code, and it's not written to do exactly the same thing
00:30:19.060 | as everybody else's code.
00:30:20.900 | So yeah, and then when I chatted to him, it's just like, I don't know, I felt like we'd
00:30:27.300 | known each other for years, like we just were on the same wavelength, and, but I could pretty
00:30:31.540 | much tell that was going to happen just by reading his code.
00:30:34.700 | I think you express a lot in the code you choose to write and how you choose to write
00:30:39.740 | it, I guess, you know, or another example, this guy named Vik, who was previously the
00:30:49.620 | CEO of Dataquest, and like, in that case, like, he's, you know, he's created a really
00:30:57.780 | successful startup, he's like, he won the first, basically, Kaggle NLP competition,
00:31:04.860 | which was automatic essay grading.
00:31:08.460 | He's got the current state-of-the-art OCR system, Surya, again, he's just a guy who
00:31:17.540 | obviously just builds stuff, you know, he doesn't ask for permission, he doesn't need
00:31:22.340 | any, like, external resources.
00:31:24.700 | Actually, Kerem's another great example of this, I mean, I already knew Kerem very well
00:31:28.660 | because he was my best ever master's student, but it wasn't a surprise to me, then, when
00:31:34.180 | he then went off to create the world's state-of-the-art language model in Turkish on his own, in his
00:31:40.380 | spare time, with no budget, you know, from scratch, this is not fine-tuning or whatever,
00:31:46.660 | he like, went back to Common Crawl and did everything, so, yeah, it's kind of, I don't
00:31:53.460 | know what I'd describe that process as, but it's not at all based on credentials.
00:32:01.420 | Assemble based on talent, yeah.
00:32:03.300 | We wanted to dive in a little bit more on, you know, turning from the people side of
00:32:07.840 | things into the technical bets that you're making.
00:32:11.660 | Also a little bit more on Bert, I was actually, we just did an interview with Yi Tay from Reka,
00:32:16.780 | I don't know if you're familiar with his work, but also another encoder-decoder bet, and
00:32:24.740 | one of his arguments was actually people kind of over-index on the decoder-only GPT-3 type
00:32:28.860 | paradigm, I wonder if you have thoughts there that is maybe non-consensus as well.
00:32:34.100 | Yeah, no, absolutely, so I think it's a great example, so one of the people we're collaborating
00:32:38.100 | with a little bit with Bert24 is Colin Raffel, who is the guy behind, yeah, most of that
00:32:45.600 | stuff.
00:32:46.600 | You know, between that and UL2, there's a lot of really interesting work, and so one
00:32:54.980 | of the things I've been encouraging the Bert group to do, and Colin has as well, is to
00:33:01.740 | consider using a T5 pre-trained encoder backbone as a thing you fine-tune, which I think would
00:33:12.220 | be really cool.
00:33:13.220 | But he was saying, you know, Colin was also saying actually just use encoder-decoder as
00:33:19.780 | your Bert, you know, why don't you use that as a baseline, which I also think is a good
00:33:24.740 | idea.
00:33:25.740 | Yeah, look, you know, what technical arguments are, you know, are people underweighting?
00:33:29.740 | I mean, Colin would be able to describe this much better than I can, but I'll give my slightly
00:33:33.880 | non-expert attempt.
00:33:34.880 | Look, I mean, think about like diffusion models, right, like in stable diffusion, like we use
00:33:39.720 | things like UNet, we, you know, you have this kind of downward path and then in the upward
00:33:45.760 | path you have the cross connections, which, it's not attention, but it's like a similar
00:33:49.880 | idea, right?
00:33:52.680 | You're inputting the original encoding path into your decoding path.
00:34:00.000 | It's critical to make it work, right, because otherwise in the decoding part, the model
00:34:05.720 | has to like do so much kind of from scratch, right?
00:34:09.920 | So like if you're doing translation, like that's a classic kind of encoder-decoder example.
00:34:16.880 | If it's decoder only, you never get the opportunity to find the right, you know, feature engineering,
00:34:26.480 | that feature encoding for the original sentence.
00:34:32.440 | And it kind of means then on every token that you generate, you have to recreate the whole,
00:34:37.880 | the whole thing, you know.
00:34:39.120 | So if you have an encoder, it's basically saying like, okay, this is your opportunity
00:34:44.540 | model to create a really useful feature representation for your, for your input information.
00:34:55.320 | So I think there's really strong arguments for encoder-decoder models anywhere that there
00:34:59.920 | is this kind of like context or source thing, you know.
00:35:08.920 | And then why encoder only, well because like so much of the time what we actually care
00:35:14.560 | about is like, you know, a classification.
00:35:17.300 | You know, it's like an output.
00:35:18.300 | It's like we're not generating an arbitrary length sequence of tokens.
00:35:22.840 | So anytime you're not generating an arbitrary length sequence of tokens, decoder models
00:35:30.480 | don't seem to make much sense to me.
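
A minimal sketch of the encoder-only pattern being described, using the Hugging Face transformers API; the checkpoint and the two-label setup are arbitrary choices for illustration.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# An encoder-only model with a classification head: the output is a fixed-size set of
# class scores, not an arbitrary-length sequence of generated tokens.
name = "microsoft/deberta-v3-base"  # any BERT-style encoder checkpoint works here
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

inputs = tok("This library is easy to use.", return_tensors="pt")
logits = model(**inputs).logits  # shape [1, 2]: one score per class
```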
00:35:32.860 | Now the interesting thing is, you see on like Kaggle competitions, that decoder models
00:35:36.260 | still are at least competitive with things like DeBERTa v3.
00:35:44.980 | But they have to be way bigger to be competitive with things like DeBERTa v3, and the only
00:35:51.340 | reason they are competitive is because people have put a lot more time and money and effort
00:35:54.700 | into training the decoder-only ones, you know.
00:35:57.900 | There isn't a recent DeBERTa, there isn't a recent BERT.
00:36:02.520 | So yeah, it's a whole part of the world that people have slept on a little bit, and this
00:36:08.940 | is just what happens.
00:36:10.060 | This is how trends happen, rather than like, to me everybody should be like, oh let's look
00:36:15.480 | at the thing that has shown signs of being useful in the past but nobody really followed
00:36:20.180 | up with properly.
00:36:22.620 | That's the more interesting path, you know, but people tend to be like, oh I need to get
00:36:26.540 | citations.
00:36:27.540 | So what's everybody else doing?
00:36:28.540 | Can I make it 0.1% better, you know, or 0.1% faster?
00:36:33.380 | That's what everybody tends to do.
00:36:34.780 | Yeah, so I think, like, Yi Tay's work commercially now is interesting, because here's
00:36:41.780 | a whole model that's been trained in a different way, so there's probably a whole
00:36:44.280 | lot of tasks it's probably better at than, you know, GPT and Gemini and Claude.
00:36:54.940 | So that should be a good commercial opportunity for them if they can figure out what those
00:36:58.120 | tasks are.
00:36:59.120 | Well, if rumors are to be believed, and he didn't comment on this, but, you know, Snowflake
00:37:03.620 | may figure out the commercialization for them, so we'll see.
00:37:10.640 | Let's talk about FSDP, QLoRA, QDoRA and all of that awesome stuff.
00:37:15.900 | One of the things we talked about last time, some of these models are meant to run on systems
00:37:20.600 | that nobody can really own, no single person.
00:37:24.700 | And then you were like, well, what if you could fine tune a 70B model on like a 4090?
00:37:30.740 | And I was like, no, that sounds great, Jeremy, but like, can we actually do it?
00:37:35.080 | And then obviously, you all figured it out.
00:37:38.320 | Can you maybe tell us some of the war stories behind that, like the idea behind FSDP, which
00:37:43.320 | is kind of taking, you know, sharded data parallel computation, then QLoRA, which is
00:37:50.860 | do not touch all the weights, just go quantize some of the model, and then within the quantized
00:37:57.320 | model only do certain layers, instead of doing everything.
00:38:00.120 | Well, to the adapters.
00:38:01.120 | Yeah, yeah.
00:38:02.120 | To the adapters.
00:38:03.120 | Yeah, I will leave the floor to you.
00:38:06.880 | I think before you published it, nobody thought this was like a short term thing that we're
00:38:11.960 | just going to have.
00:38:12.960 | And now it's like, oh, obviously you can do it, but it's not that easy.
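
For reference, the generic Hugging Face version of that quantize-the-base, train-the-adapters pattern looks roughly like the sketch below. This is not Answer.ai's actual training script, and the FSDP sharding across GPUs is a separate layer (for example, launched via accelerate or torchrun) that isn't shown; the model name and LoRA settings are arbitrary illustrative choices.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model with 4-bit (NF4) quantized weights via bitsandbytes...
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",  # illustrative choice of 70B base model
    quantization_config=bnb,
)

# ...then freeze it and train only small floating-point LoRA adapters on top.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable
```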
00:38:16.040 | Yeah.
00:38:17.040 | I mean, to be honest, it was extremely unpleasant work to do.
00:38:23.620 | This is like, not at all enjoyable.
00:38:28.620 | So I kind of did version 0.1 of it myself before we had launched the company, or at
00:38:36.420 | least the kind of like the pieces, which is, they're all pieces that are difficult to work
00:38:41.580 | with, right?
00:38:42.580 | So for the quantization, you know, I chatted to Tim Dettmers quite a bit, and, you know,
00:38:47.960 | he very much encouraged me by saying like, yeah, it's possible.
00:38:51.360 | He actually thought it'd be easy, it probably would be easy for him, but I'm not Tim Dettmers.
00:38:55.960 | You know, so he wrote Bits and Bytes, which is his quantization library, and, you know,
00:39:01.300 | he wrote that for a paper.
00:39:03.720 | He didn't write that to be production like code.
00:39:06.400 | It's now like he's using it.
00:39:07.400 | He wrote it in one night, apparently.
00:39:08.400 | Yeah.
00:39:09.400 | Yeah.
00:39:10.400 | So, you know, like, it's not particularly well structured.
00:39:15.660 | There's lots of code paths that never get used.
00:39:18.180 | There's lots of, you know, multiple versions of the same thing.
00:39:21.340 | You have to try to figure it out.
00:39:22.340 | So trying to get my head around that was hard, and, you know, because it's like, the interesting
00:39:26.820 | bits are all written in CUDA, it's hard to like to step through it and see what's happening.
00:39:31.820 | And then, you know, FSDP is this very complicated library in PyTorch, which is not particularly
00:39:38.940 | well documented.
00:39:39.940 | So the only really way to understand it properly is, again, just read the code and step through
00:39:44.140 | the code.
00:39:45.640 | And then, like, Bits and Bytes doesn't really work in practice unless it's used with PEFT,
00:39:54.900 | the Hugging Face library, and PEFT doesn't really work in practice unless you use it
00:39:57.900 | with other things.
00:39:58.900 | And there's a lot of coupling in the Hugging Face ecosystem where, like, none of it works
00:40:04.620 | separately.
00:40:05.620 | They all work together, which I don't love.
00:40:09.940 | So yeah, trying to just get a minimal example that I can play with was really hard.
00:40:15.060 | And so I ended up having to rewrite a lot of it myself, to kind of create this minimal
00:40:22.140 | script.
00:40:23.140 | One thing that helped a lot was Meta had this llama-recipes repo that came out just
00:40:27.700 | a little bit before I started working on that.
00:40:29.460 | And like, they had a kind of role model example of, like, here's how to train FSDP LoRA.
00:40:41.220 | It didn't work with QLoRA on Llama.
00:40:43.780 | Actually, a lot of that had been put together, like, a lot of the stuff I discovered, the
00:40:47.460 | interesting stuff, had been put together by Les Wright, who's, he was actually the guy
00:40:51.260 | in the Fast.ai community I mentioned who created the Ranger Optimizer.
00:40:55.020 | So he's doing a lot of great stuff at Meta now.
00:41:00.620 | So yeah, I kind of, that helped get some minimum stuff going, and then it was great once Benjamin
00:41:07.740 | and Jono joined full-time.
00:41:11.580 | And so we basically hacked at that together, and then Kerem joined, like, a month later
00:41:15.620 | or something.
00:41:16.620 | But gee, it was just a lot of, like, fiddly detailed engineering on, like, barely documented
00:41:26.500 | bits of obscure internals.
00:41:29.660 | So my focus was to see if it kind of could work, and I kind of got a bit of a proof of
00:41:32.540 | concept working, and then the rest of the guys actually did all the work to make it
00:41:37.620 | work properly.
00:41:41.020 | And you know, every time we thought we had something, you know, we needed to have good
00:41:45.940 | benchmarks, right?
00:41:47.020 | So we'd, like, it's very easy to convince yourself you've done the work when you haven't,
00:41:51.820 | you know, so then we'd actually try lots of things and be like, oh, in these, like, really
00:41:55.260 | important cases, the memory use is higher, you know, or it's actually slower.
00:42:00.220 | And we'd go in and we'd just find, like, all these things that were nothing to do with
00:42:04.220 | our library that just didn't work properly.
00:42:07.540 | And nobody had noticed they hadn't worked properly because nobody had really benchmarked
00:42:10.380 | it properly.
00:42:11.380 | So we ended up, you know, trying to fix a whole lot of different things.
00:42:17.020 | And even as we did so, new regressions were appearing in, like, Transformers and stuff
00:42:21.820 | that Benjamin then had to go away and figure out, like, oh, how come FlashAttention doesn't
00:42:26.460 | work in this version of Transformers anymore with this set of models, and, like, oh, it
00:42:31.820 | turns out they accidentally changed this thing so it doesn't work.
00:42:35.420 | You know, there's just, there's not a lot of really good performance type evals going
00:42:41.700 | on in the Open Source ecosystem.
00:42:43.500 | So there's an extraordinary amount of, like, things where people say, like, oh, we built
00:42:46.860 | this thing and it has this result, and when you actually check it, it doesn't.
00:42:51.180 | So yeah, there's a shitload of war stories from getting that thing to work.
00:42:56.780 | And it did require a particularly, like, tenacious group of people and a group of people who
00:43:01.660 | don't mind doing a whole lot of, kind of, like, really janitorial work, to be honest,
00:43:07.500 | to get the details right, to check them.
00:43:09.620 | Yeah.
00:43:10.620 | Yeah, we had Tri Dao on the podcast, and we talked about how a lot of it is, like,
00:43:16.100 | systems work to make some of these things work.
00:43:18.140 | It's not just, like, beautiful pure math that you do on a blackboard.
00:43:21.660 | It's, like, how do you get into the nitty-gritty of it.
00:43:24.620 | I mean, FlashAttention is a great example of that.
00:43:27.100 | Like, it's, it basically is just, like, oh, let's just take the attention and just do
00:43:31.300 | the tiled version of it, which sounds simple enough, you know.
00:43:36.340 | But then implementing that is challenging at lots of levels.
00:43:41.460 | Yeah.
00:43:42.580 | What about inference?
00:43:43.580 | You know, obviously, you've done all this amazing work on fine-tuning.
00:43:46.460 | Do you have any research you've been doing on the inference side, how to make local inference
00:43:51.180 | really fast on these models, too?
00:43:53.220 | We're doing quite a bit on that at the moment.
00:43:55.080 | We haven't released too much there yet, but one of the things I've been trying to do is
00:44:01.220 | also just to help other people.
00:44:04.340 | And one of the nice things that's happened is that a couple of folks at Meta, including
00:44:11.940 | Mark Saroufim, have done a nice job of creating this CUDA MODE community of people working
00:44:17.420 | on, like, CUDA kernels or learning about that, and I tried to help get that going well as
00:44:21.660 | well and did some lessons to help people get into it.
00:44:27.980 | So there's a lot going on in both inference and fine-tuning performance and a lot of it's
00:44:34.100 | actually happening kind of related to that.
00:44:36.900 | Also the PyTorch team have created this Torch AO project on quantization.
00:44:44.580 | And so there's a big overlap now between kind of the FastAI and AnswerAI and CUDA mode communities
00:44:50.900 | of people working on stuff about inference and fine-tuning, but we're getting close now.
00:45:00.060 | You know, our goal is that nobody should be merging models, nobody should be downloading
00:45:07.020 | merged models, everybody should be using basically quantized plus adapters for almost everything,
00:45:15.980 | and just downloading the adapters, and that should be much faster.
00:45:21.380 | So that's kind of the place we're trying to get to.
00:45:25.180 | It's difficult, you know, because, like, Kerem's been doing a lot of work with vLLM, for example.
00:45:31.020 | These inference engines are pretty complex bits of code.
00:45:35.940 | They have a whole lot of custom kernel stuff going on as well, as do the quantization libraries.
00:45:41.780 | So we've been working on that with also quite a bit of collaborating with the folks who
00:45:44.580 | do HQQ, which is a really great quantization library and works super well.
00:45:54.140 | So yeah, there's a lot of other people outside AnswerAI that we're working with a lot who
00:45:58.100 | are really helping on all this performance optimization stuff, open source.
00:46:03.100 | Just to follow up on merging models, I picked up there that you said nobody should be merging
00:46:08.220 | models.
00:46:09.220 | I think that's interesting because, you know, obviously a lot of people are experimenting
00:46:12.540 | with this and finding interesting results.
00:46:14.980 | I would say, in defense of merging models, you can do it without data.
00:46:20.540 | That's probably the only thing that's going for it.
00:46:27.020 | To explain, it's not that you shouldn't merge models, it's that you shouldn't be distributing
00:46:32.780 | a merged model.
00:46:34.340 | You should distribute a merged adapter, 99% of the time, and actually often one of the
00:46:41.940 | best things happening in the model merging world is actually that often merging adapters
00:46:45.700 | works better.
00:46:47.140 | The point is, Sean, that once you've got your new model, if you distribute it as an adapter
00:46:54.180 | that sits on top of a quantized model that somebody's already downloaded, then it's a
00:46:59.380 | much smaller download for them, and also the inference should be much faster, because you're
00:47:05.300 | not having to transfer FP16 weights from HBM memory at all, or ever load them
00:47:12.740 | off disk, you know, all the main weights are quantized, and the only floating point weights
00:47:18.180 | are in the adapters, so that should make both inference and fine-tuning faster.
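
The loading side of that pattern, sketched with the same Hugging Face stack: the big base model is downloaded once in quantized form, and the fine-tune ships as a small adapter that sits on top of it. The adapter repo name below is a hypothetical placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Quantized base model: downloaded once and shared across many fine-tunes.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf", quantization_config=bnb)

# The fine-tune is distributed as a small adapter instead of a fully merged FP16 copy.
model = PeftModel.from_pretrained(base, "your-org/your-lora-adapter")  # hypothetical adapter repo
```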
00:47:24.100 | Got it, got it, okay, perfect.
00:47:27.580 | We're moving on a little bit to the rest of the Fast universe.
00:47:31.420 | I would have thought that, you know, once you started Answer.ai, that the sort of Fast
00:47:36.580 | universe would be kind of on hold, and then today you just dropped fastlite, and it looks
00:47:42.540 | like, you know, there's more activity going on in sort of FastLand.
00:47:47.500 | Yeah, so FastLand and AnswerLand are not really distinct things, AnswerLand is kind of like
00:47:56.940 | the FastLand grown up and funded, they both have the same mission, which is to maximize
00:48:04.340 | the societal benefit of AI broadly.
00:48:07.460 | We want to create thousands of commercially successful products at Answer.ai, and we want
00:48:16.060 | to do that with like 12 people, so that means we need a pretty efficient stack, you know,
00:48:26.340 | like quite a few orders of magnitude more efficient, not just for creation, but for
00:48:31.220 | deployment and maintenance than anything that currently exists.
00:48:37.900 | People often forget about the 'D' part of our R&D firm, so we've got to be extremely
00:48:43.420 | good at, you know, creating, deploying, and maintaining applications, not just models.
00:48:50.020 | Much to my, you know, horror, the story around creating web applications is much worse now
00:49:01.500 | than it was 10 or 15 years ago, in terms of like, if I say to a data scientist, here's
00:49:09.460 | how to create and deploy a web application, you know, either you have to learn JavaScript
00:49:17.340 | or TypeScript, and about all the complex, like, libraries like React and stuff, and
00:49:22.900 | all the complex, like, details around security and web protocol stuff, around how you then
00:49:27.620 | talk to a back-end, and then all the details about creating the back-end.
00:49:32.020 | You know, if that's your job, you know, and you're, you know, you have specialists who
00:49:37.380 | work in just one of those areas, it is possible to, for that to all work, but compared to
00:49:45.940 | like, oh, write a PHP script and put it in the home directory that you get when you sign
00:49:50.820 | up to this shell provider, which is what it was like in the 90s, you know, here are those
00:49:55.820 | 25 lines of code, you're done, and now you can pass that URL around to all your friends,
00:50:01.820 | you know, or put this, you know, .pl file inside the CGI bin directory that you got
00:50:06.540 | when you signed up to this web host.
00:50:11.460 | So yeah, the thing I've been mainly working on the last few weeks is fixing all that,
00:50:19.660 | and I think I fixed it.
00:50:24.460 | It's this thing called fastHTML.
00:50:24.460 | I don't know if this is an announcement, but I can tell you guys.
00:50:28.180 | So yeah, there's this thing called fastHTML, which basically lets you create a complete
00:50:36.460 | web application in a single Python file.
00:50:41.140 | Unlike excellent projects like Streamlit and Gradio, you're not working on top of a
00:50:46.900 | highly abstracted thing that's got nothing to do with web foundations, you're working
00:50:51.860 | with web foundations directly, but you're able to do it by using pure Python.
00:50:59.380 | There are no templates, there's no Jinja, there are no separate, like, CSS and JavaScript files.
00:51:06.740 | It looks and behaves like a modern SPA web application.
00:51:16.980 | And you can create components for like Daisy UI, or Bootstrap, or Shoelace, or whatever
00:51:27.780 | fancy JavaScript and/or CSS, Tailwind, etc. library you like, but you can write it all
00:51:35.180 | in Python.
00:51:36.660 | You can pip install somebody else's set of components and use them entirely from Python.
00:51:41.900 | You can develop and prototype it all in a Jupyter Notebook if you want to.
00:51:46.300 | It all displays correctly, so you can like interactively do that.
00:51:52.020 | And then you mentioned fastlite, so specifically now if you're using SQLite in particular,
00:51:59.660 | it's like ridiculously easy to have that persistence, you know, and you can basically, all of your
00:52:08.700 | handlers will be passed database-ready objects automatically that you can just call .delete,
00:52:16.340 | .update, .insert on.
00:52:18.780 | Yeah, you get session, you get security, you get all that.
00:52:24.540 | So it's, again, like with most of everything I do, it's very little code.
00:52:30.420 | It's mainly tying together really cool stuff that other people have written, so.
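
A minimal sketch of what a complete fastHTML app can look like, based on the publicly released fastHTML examples: the fast_app/rt/serve entry points, the Python component functions, and the hx_get/hx_target keyword arguments (which map straight onto the HTMX attributes discussed next) are taken from those examples, and details may differ from the pre-release version being described here.

```python
from fasthtml.common import *

app, rt = fast_app()  # app plus route decorator, per the released fastHTML examples

@rt("/")
def home():
    # A whole page built from Python components; hx_get/hx_target are plain HTMX attributes,
    # so clicking the button swaps the response from /change into the #out div.
    return Titled("Hello",
                  P("Hello, world!"),
                  Button("Change me", hx_get="/change", hx_target="#out"),
                  Div(id="out"))

@rt("/change")
def change():
    return P("Changed!")

serve()
```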
00:52:37.420 | You don't have to use it, but a lot of the best stuff comes from its incorporation of
00:52:41.180 | HTMX, which to me is basically the thing that changes your browser to make it work the way
00:52:48.660 | it always should have.
00:52:50.500 | So it just does four small things, but those four small things address what are
00:52:56.260 | basically unnecessary constraints that HTML should never have had.
00:53:01.980 | So it removes the constraints.
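Roughly, those four generalisations are: any element can issue an HTTP request, any event can trigger it, any HTTP verb can be used, and any element on the page can be the target of the returned HTML. In FastHTML the HTMX attributes become ordinary keyword arguments; the hx_* names mirror HTMX's own attributes, while the surrounding calls use the same assumed API as the sketches above:

    from fasthtml.common import *   # assumed import path

    app, rt = fast_app()

    @rt('/')
    def get():
        return Titled("HTMX demo",
            # hx_get: clicking the button issues a GET (any element, any event).
            # hx_target / hx_swap: the returned fragment replaces #result only.
            Button("Load it", hx_get="/fragment", hx_target="#result", hx_swap="innerHTML"),
            Div(id="result"))

    @rt('/fragment')
    def get():
        # Returning a component sends back just an HTML fragment to swap in.
        return P("Swapped in without a full page reload.")

    serve()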
00:53:06.180 | It sits on top of Starlette, which is a very nice, you know, kind of lower-level platform
00:53:11.700 | for building these kind of web applications.
00:53:15.860 | The actual interface matches as closely as possible to FastAPI, which is a really nice
00:53:22.340 | system for creating the kind of classic JavaScript type applications.
00:53:28.940 | And Sebastian, who wrote FastAPI, has been kind enough to help me think through some
00:53:33.580 | of these design decisions and so forth.
00:53:37.020 | I mean, everybody involved has been super helpful.
00:53:40.020 | Actually, I chatted to Carson, who created HTMX, you know, also about it, chatted to
00:53:44.820 | some of the folks involved in Django.
00:53:46.820 | Like, everybody in the community I've spoken to definitely realizes there's a big gap to
00:53:54.380 | be filled around, like, highly scalable web foundation based, you know, pure Python framework
00:54:07.100 | with a minimum of fuss.
00:54:11.780 | So yeah, I'm getting a lot of support and trying to make sure that FastHTML works well
00:54:18.300 | for people.
00:54:19.300 | Yeah, I would say, when I heard about this, I just texted Alessio, I think this is going
00:54:25.700 | to be pretty huge.
00:54:26.700 | You know, like, people consider Streamlit and Gradio to be the state of the art, but
00:54:30.780 | I think there's so much to improve in, you know, having sort of, what do you say, what
00:54:35.380 | do you call web foundations and web fundamentals at the core of it, I think would be really
00:54:39.740 | helpful.
00:54:40.740 | Yeah, it's based on 25 years of thinking and work for me.
00:54:46.140 | So like, FastMail was built on a system much like this one, but that was in Perl.
00:54:54.500 | And so I spent, you know, 10 years working on that.
00:54:58.100 | We had millions of people using that every day, really pushing it hard.
00:55:02.860 | And I really always enjoyed working in that.
00:55:06.460 | So you know, and obviously lots of other people have done, like, great stuff and particularly
00:55:09.220 | HTMX, you know.
00:55:10.460 | So I've been thinking about like, yeah, how do I pull together the best of the web framework
00:55:15.420 | I created for FastMail with HTMX?
00:55:18.660 | There's also things like Pico CSS, which is the CSS system, which by default, FastHTML
00:55:28.100 | comes with.
00:55:29.100 | Although as I say, you can pip install anything you want to, but it makes it like, super easy
00:55:33.380 | to, you know, so we're trying to make it so that just out of the box, you don't have any
00:55:37.460 | choices to make, you know, if you don't want to.
00:55:39.940 | You can make choices, but for most people, you just, you know, it's like the PHP in your
00:55:44.660 | home directory thing.
00:55:45.660 | You just start typing and just by default, you'll get something which looks and feels,
00:55:52.300 | you know, pretty okay.
00:55:54.060 | And if you want to then write a version of Gradio or Streamlit on top of that, you totally can.
00:56:02.020 | And then the nice thing is if you then write it in kind of the Gradio equivalent, which
00:56:06.900 | will be, you know, I mentioned we'll create some kind of pip installable thing for that.
00:56:11.860 | Once you've outgrown, or if you outgrow that, it's not like, okay, throw that all away and
00:56:17.220 | start again in this like whole separate language, but it's like this kind of smooth, gentle
00:56:23.780 | path that you can take step-by-step because it's all just standard web foundations all
00:56:31.260 | the way, you know.
00:56:32.700 | Yeah.
00:56:33.700 | Got it.
00:56:34.700 | So, you know, just to wrap up the sort of open source work that you're doing, you know,
00:56:41.340 | you're aiming to create thousands of projects with a very, very small team.
00:56:45.420 | And I haven't heard you mention once AI agents or AI developer tooling or AI code maintenance,
00:56:53.300 | you know, I know you're very productive, but you know, what is the role of AI in your own
00:56:57.780 | work?
00:57:00.100 | So I'm making something.
00:57:02.340 | I'm not sure how much I want to say just yet.
00:57:04.340 | Okay.
00:57:05.340 | Give us a nibble.
00:57:06.340 | All right, I'll give you the key thing.
00:57:09.700 | So I've created a new approach.
00:57:15.420 | It's not called prompt engineering.
00:57:17.500 | It's called dialogue engineering.
00:57:22.660 | And I'm creating a system for doing dialogue engineering.
00:57:30.140 | It's currently called AI magic.
00:57:33.860 | I'm doing most of my work in this system and it's making me much more productive than I
00:57:38.060 | was before I used it.
00:57:40.020 | So I always just build stuff for myself and hope that it'll be useful for somebody else.
00:57:49.460 | Think about chatGPT with Code Interpreter, right?
00:57:56.380 | The basic UX is the same as a 1970s teletype, right?
00:58:01.220 | So if you wrote APL on a teletype in the 1970s, you typed onto a thing, your words appeared
00:58:07.940 | at the bottom of a sheet of paper and you'd like hit enter and it would scroll up.
00:58:12.580 | And then the answer from APL would be printed out, scroll up, and then you would type the
00:58:16.620 | next thing, which is also the way, for example, a shell works, like bash or ZSH or whatever.
00:58:28.360 | It's not terrible, you know, like we all get a lot done in these like very, very basic
00:58:33.620 | teletype style REPL environments, but I've never felt like it's optimal, you know, and
00:58:40.020 | to me, you know, so, and everybody else has just copied chatGPT.
00:58:47.780 | So it's also the way Bard and Gemini work.
00:58:51.540 | It's also the way the Claude web app works.
00:58:55.300 | And then you add Code Interpreter and the most you can do is to like plead with chatGPT
00:59:01.180 | to write the kind of code I want.
00:59:04.980 | It's pretty good for very, very, very beginner users who like can't code at all, like by
00:59:10.300 | default now the code's even hidden away, so you never even have to see that it ever happened.
00:59:15.260 | But for somebody who's like wanting to learn to code or who already knows a bit of code
00:59:18.560 | or whatever, it's, it seems really not ideal.
00:59:23.620 | So okay, that's one end of the spectrum.
00:59:25.300 | The other end of the spectrum, which is where Sean's work comes in, is, oh, you want to
00:59:32.900 | do more than chatGPT, no worries.
00:59:35.260 | Here is Visual Studio Code.
00:59:36.780 | I run it.
00:59:38.260 | There's an empty screen with a flashing cursor.
00:59:40.620 | Okay, start coding, you know.
00:59:44.140 | And it's like, okay, you can use systems like Sean's or like Cursor or whatever to be like,
00:59:52.620 | okay, Cmd-K in Cursor, like, create a form that blah, blah, blah, but it's, in the end,
01:00:00.180 | it's like a convenience over the top of this incredibly complicated system that full-time
01:00:06.220 | sophisticated software engineers have designed over the past few decades in a totally different
01:00:11.160 | environment as a way to build software, you know.
01:00:14.460 | And so we're trying to like shoehorn in AI into that.
01:00:20.520 | And it's, it's not easy to do, and I think there are like much better ways of thinking
01:00:28.840 | about the craft of software development in a language model world to be much more interactive,
01:00:37.100 | you know.
01:00:38.100 | So the thing that I'm building is, is neither of those things.
01:00:40.980 | It's something between the two.
01:00:43.020 | And it's built around this idea of crafting a dialogue, you know, where the outcome of
01:00:49.860 | the dialogue is, you know, the artifacts that you want, whether it be a piece of analysis
01:00:57.100 | or whether it be a Python library or whether it be a technical blog post or whatever.
01:01:03.860 | So as part of building that, I've created something called Claudette, which is a library
01:01:08.180 | for Claude.
01:01:09.180 | I've created something called Cosette, which is a library for OpenAI.
01:01:16.660 | There are libraries which are designed to make those APIs much more usable, much easier
01:01:23.740 | to use, much more concise.
01:01:26.740 | And then I've written AI magic on top of those.
01:01:32.220 | And that's been an interesting exercise because I did Claudette first, and rather than try
01:01:39.780 | to like, I was looking at what Simon Willison did with his fantastic LLM library, and his
01:01:45.740 | library is designed around like, let's make something that supports all the LLM inference
01:01:50.980 | engines and commercial providers.
01:01:53.340 | I thought, okay, what if I did something different, which is like make something that's as
01:01:56.620 | Claude-friendly as possible and forget everything else.
01:01:59.580 | So that's what Claudette was.
01:02:00.980 | So for example, one of the really nice things in Claude is pre-fill.
01:02:05.100 | So by telling the assistant that this is what your response started with, there's a lot
01:02:09.700 | of powerful things you can take advantage of.
01:02:12.640 | So yeah, I created Claudette to be as Claude friendly as possible.
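For reference, pre-fill at the level of the raw Anthropic API looks roughly like this; Claudette wraps the same idea, but since its exact interface isn't spelled out here, the sketch below uses the underlying SDK, and the model name is only an example:

    import anthropic  # official SDK: pip install anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    resp = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # example model name
        max_tokens=256,
        messages=[
            {"role": "user", "content": "List three uses of HTMX as a JSON array of strings."},
            # Pre-fill: the final assistant turn is the start of the reply, so the
            # model continues from "[" and emits bare JSON with no preamble.
            {"role": "assistant", "content": "["},
        ],
    )
    # The response continues after the pre-fill, so prepend it when reassembling.
    print("[" + resp.content[0].text)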
01:02:16.680 | And then after I did that with Claude, particularly with GPT-4o coming out, I kind
01:02:23.900 | of thought, okay, now let's create something that's as OpenAI-friendly as possible.
01:02:29.460 | And then I tried to look to see, well, where are the similarities and where are the differences?
01:02:33.980 | And now can I make them compatible in places where it makes sense for them to be compatible
01:02:38.580 | without losing out on the things that make each one special for what they are.
01:02:43.540 | So yeah, those are some of the things I've been working on in that space.
01:02:49.380 | And I'm thinking we might launch AI magic via a course called How To Solve It With Code.
01:03:01.100 | The name is based on the classic Pólya book, How to Solve It, if you know it, which is, you
01:03:06.540 | know, one of the classic math books of all time, where we're basically going to try to
01:03:13.660 | show people how to solve challenging problems that they didn't think they could solve without
01:03:19.940 | doing a full computer science course, by taking advantage of a bit of AI and a bit of, like,
01:03:29.420 | practical skills.
01:03:30.420 | And it's particularly for this, like, whole generation of people who are learning to code
01:03:35.460 | with and because of ChatGPT.
01:03:37.500 | Like, I know a lot of people who didn't really know how to code, but they've created things
01:03:42.540 | because they use ChatGPT, but they don't really know how to maintain them or fix them or add
01:03:46.260 | things to them that ChatGPT can't do, because they don't really know how to code.
01:03:50.780 | So this course will be designed to show you how you can, like, you know, either become
01:03:57.140 | a developer who can, like, supercharge their capabilities by using language models, or
01:04:01.700 | become a language model first developer who can supercharge their capabilities by understanding
01:04:06.460 | a bit about process and fundamentals, so, yeah.
01:04:12.780 | Nice.
01:04:14.220 | That's a great spoiler, you know.
01:04:15.580 | I guess the fourth time you're going to be on Latent Space, we're going to talk about
01:04:19.140 | AI magic.
01:04:21.140 | Jeremy, before we wrap, this was just a great run through everything.
01:04:27.660 | What are the things that when you next come on the podcast in nine, 12 months, we're going
01:04:31.420 | to be like, "Man, Jeremy was, like, really ahead of it."
01:04:34.060 | Like, is there anything that you see in this space that maybe people are not talking enough?
01:04:38.700 | You know, what's the next company that's going to fall, like, in drama internally?
01:04:42.820 | Anything?
01:04:43.820 | You know, hopefully we'll be talking a lot about FastHTML and hopefully the international
01:04:47.080 | community that at that point has come up around that, and also about AI magic and about dialogue
01:04:51.700 | engineering.
01:04:54.300 | Hopefully dialogue engineering catches on, because I think it's the right way to think
01:04:56.700 | about a lot of this stuff.
01:04:58.260 | What else?
01:04:59.260 | I'm just trying to think about more on the research side.
01:05:01.620 | Yeah, I think, you know, I mean, we've talked about a lot of it.
01:05:03.860 | Like, I think encoder-decoder architectures, encoder-only architectures, hopefully we'll
01:05:08.740 | be talking about, like, the whole re-interest in BERT that BERT 24 stimulated.
01:05:15.460 | There's a state-space model that came out today that might be interesting for just general
01:05:20.140 | discussion.
01:05:21.380 | One thing that stood out to me with Cartesia's blog post was that they were talking about
01:05:25.820 | real-time ingestion of billions and trillions of tokens, and keeping that context, obviously,
01:05:33.020 | in the state-space that they have.
01:05:34.940 | I'm wondering what your thoughts are, because you've been entirely transformers the whole
01:05:38.860 | time.
01:05:39.860 | Yeah, no, so obviously my background is RNNs and LSTMs, and I'm still a believer in the
01:05:48.180 | idea that state is something you can update, you know.
01:05:53.260 | So obviously Sepp Hochreiter came up, came out with XLSTM recently.
01:06:01.380 | Oh my god, okay, another whole thing we haven't talked about, just somewhat related.
01:06:09.700 | I've been going crazy for, like, a long time about, like, why can I not pay anybody to
01:06:16.340 | save my KV cache, you know, for, like, I just ingested the Great Gatsby or the documentation
01:06:24.300 | for Starlette or whatever, you know, I'm sending it as my prompt context.
01:06:32.240 | Why are you redoing it every time, you know?
01:06:34.700 | So Gemini is about to finally come out with KV caching, and this is something that Austin
01:06:41.180 | actually in Gemma.cpp had had on his roadmap for years, well, not years, months, long time,
01:06:48.340 | and the idea is that the KV cache is, like, its own thing, it's a third thing, right?
01:06:58.060 | So there's RAG, you know, there's in-context learning, you know, and prompt engineering,
01:07:07.020 | and there's KV cache creation.
01:07:11.380 | I think it creates, like, a whole new class, almost, of applications or of techniques where,
01:07:19.820 | you know, for me, for example, I very often, like, I very often work with, like, really
01:07:23.700 | new libraries, or I've created my own library that I'm now writing with, rather than on.
01:07:31.140 | So I want all the docs in my new library to be there all the time.
01:07:35.340 | So yeah, I want to upload them once, and then have a whole discussion about building
01:07:41.740 | this application using FastHTML. Well, nobody's got FastHTML in their language model yet,
01:07:48.980 | so I don't want to send all the FastHTML docs across every time.
01:07:51.420 | So one of the things I'm looking at doing in AI Magic, actually, is taking advantage
01:07:54.380 | of some of these ideas, so that you can have the documentation of the libraries you're
01:08:02.780 | working on be kind of always available.
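The underlying mechanics are easy to sketch locally with Hugging Face transformers: run the shared prefix once, keep the key/value cache it produces, and reuse it for every new question. A hosted "store my KV cache" service would do the equivalent server-side; gpt2 is just a small stand-in model so the sketch actually runs:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "gpt2"  # small stand-in model
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    docs = "Pretend this is the library documentation you always want in context. "
    prefix_ids = tok(docs, return_tensors="pt").input_ids

    with torch.no_grad():
        # One expensive pass over the shared prefix; keep the KV cache it produces.
        cache = model(prefix_ids, use_cache=True).past_key_values

        # Later questions only pay for their own tokens; attention still sees the
        # cached prefix through past_key_values.
        q_ids = tok("Q: What does the install step do? A:", return_tensors="pt").input_ids
        out = model(q_ids, past_key_values=cache, use_cache=True)

    next_token = out.logits[:, -1].argmax(-1)
    print(tok.decode(next_token))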
01:08:05.060 | So there'll be ways to do that. You know, something people will be spending time thinking
01:08:10.300 | about over the next 12 months is, like, where to use RAG, where to use fine-tuning, where
01:08:14.500 | to use KV cache storage, you know, and how to use state, because in state models
01:08:24.020 | and XLSTM, again, state is something you, you update.
01:08:30.400 | So how do we combine the best of all of these worlds?
01:08:34.140 | >> And Jeremy, I know before you talked about how some of the autoregressive models are
01:08:38.980 | not maybe a great fit for agents.
01:08:40.820 | Any other thoughts on like JEPA, diffusion for text, any interesting thing that you've
01:08:44.900 | seen pop up?
01:08:45.900 | >> In the same way that, like, we probably ought to have state that you can update, i.e.
01:08:50.900 | XLSTM and state models, in the same way, a lot of things probably should have an encoder,
01:08:58.140 | JEPA and diffusion both seem like the right conceptual mapping for a lot of things we
01:09:05.940 | probably want to do.
01:09:08.100 | So the idea of, like, there, there should be a, a piece of the generative pipeline,
01:09:19.100 | which is like thinking about the answer and coming up with a sketch of what the answer
01:09:24.940 | looks like before you start outputting tokens.
01:09:29.600 | That's where it kind of feels like diffusion ought to fit, you know, and diffusion is,
01:09:34.780 | because it's not autoregressive, it's like, let's try to, like, gradually de-blur the
01:09:40.900 | picture of how to solve this.
01:09:43.540 | So this is also where dialogue engineering fits in, by the way.
01:09:47.260 | So with dialogue engineering, one of the reasons it's working so well for me is I use it to
01:09:52.260 | kind of, like, craft the thought process before I generate the code, you know.
01:10:03.340 | So yeah, there's a lot of different pieces here, and I don't know how they'll all kind
01:10:09.060 | of exactly fit together.
01:10:10.220 | I don't know if JEPA is going to actually end up working in the text world, I don't
01:10:13.100 | know if diffusion will end up working in the text world, but they seem to be, like, trying
01:10:16.900 | to solve a class of problem which is currently unsolved.
01:10:19.900 | Awesome, Jeremy, this was great, as usual.
01:10:26.180 | Thanks again for coming back on the pod.
01:10:27.540 | Thank you, Alessio.
01:10:28.540 | Thank you, Sean.
01:10:29.540 | And thank you all for listening.
01:10:30.540 | Yeah, that was fantastic.
01:10:31.660 | [Music]
01:10:51.660 | [End of Audio]