
Cursor Team: Future of Programming with AI | Lex Fridman Podcast #447


Chapters

0:00 Introduction
0:59 Code editor basics
3:09 GitHub Copilot
10:27 Cursor
16:54 Cursor Tab
23:08 Code diff
31:20 ML details
36:54 GPT vs Claude
43:28 Prompt engineering
50:54 AI agents
1:04:51 Running code in background
1:09:31 Debugging
1:14:58 Dangerous code
1:26:09 Branching file systems
1:29:20 Scaling challenges
1:43:32 Context
1:48:39 OpenAI o1
2:00:01 Synthetic data
2:03:48 RLHF vs RLAIF
2:05:34 Fields Medal for AI
2:08:17 Scaling laws
2:17:06 The future of programming

Transcript

00:00:00.000 | The following is a conversation
00:00:01.760 | with the founding members of the Cursor team,
00:00:04.520 | Michael Truell, Sualeh Asif, Arvid Lundmark, and Aman Sanger.
00:00:09.520 | Cursor is a code editor based on VS Code
00:00:14.260 | that has a lot of powerful features for AI-assisted coding.
00:00:17.940 | It has captivated the attention and excitement
00:00:21.000 | of the programming and AI communities.
00:00:23.900 | So I thought this is an excellent opportunity
00:00:26.760 | to dive deep into the role of AI in programming.
00:00:30.420 | This is a super technical conversation
00:00:33.040 | that is bigger than just about one code editor.
00:00:36.720 | It's about the future of programming,
00:00:38.440 | and in general, the future of human-AI collaboration
00:00:42.080 | in designing and engineering
00:00:44.080 | complicated and powerful systems.
00:00:47.060 | This is the Lex Fridman Podcast.
00:00:48.720 | To support it, please check out our sponsors
00:00:50.920 | in the description.
00:00:52.160 | And now, dear friends, here's Michael, Sualeh, Arvid,
00:00:56.560 | and Aman.
00:00:58.040 | All right, this is awesome.
00:01:00.000 | We have Michael, Aman, Sualeh, Arvid here
00:01:03.120 | from the Cursor team.
00:01:04.880 | First up, big ridiculous question.
00:01:07.400 | What's the point of a code editor?
00:01:10.200 | - So the code editor is largely the place
00:01:12.400 | where you build software.
00:01:14.280 | And today, or for a long time, that's meant
00:01:17.320 | the place where you text edit a formal programming language.
00:01:21.080 | And for people who aren't programmers,
00:01:22.720 | the way to think of a code editor
00:01:23.640 | is like a really souped-up word processor for programmers,
00:01:27.280 | where the reason it's souped up
00:01:28.960 | is code has a lot of structure.
00:01:31.440 | And so the quote-unquote word processor, the code editor,
00:01:35.680 | can actually do a lot for you
00:01:37.280 | that word processors sort of in the writing space
00:01:39.680 | haven't been able to do for people editing text there.
00:01:42.200 | And so that's everything
00:01:43.840 | from giving you visual differentiation
00:01:45.600 | of the actual tokens in the code
00:01:47.320 | so you can scan it quickly,
00:01:49.120 | to letting you navigate around the code base,
00:01:51.000 | sort of like you're navigating around the internet
00:01:52.440 | with hyperlinks.
00:01:53.280 | Going to sort of definitions of things you're using,
00:01:55.680 | to error checking, to catch rudimentary bugs.
00:02:00.280 | And so traditionally, that's what a code editor has meant.
00:02:06.560 | And I think that what a code editor is
00:02:10.040 | is going to change a lot over the next 10 years
00:02:12.160 | as what it means to build software
00:02:14.120 | maybe starts to look a bit different.
00:02:16.800 | - I think also a code editor should just be fun.
00:02:19.640 | - Yes, that is very important.
00:02:21.400 | That is very important.
00:02:22.240 | And it's actually sort of an underrated aspect
00:02:25.040 | of how we decide what to build.
00:02:27.440 | Like a lot of the things that we build
00:02:30.280 | and then we try them out, we do an experiment
00:02:32.960 | and then we actually throw them out
00:02:35.320 | because they're not fun.
00:02:37.040 | And so a big part of being fun
00:02:38.480 | is like being fast a lot of the time.
00:02:41.600 | Fast is fun.
00:02:42.800 | - Yeah, fast is, yeah.
00:02:44.360 | Yeah, that should be a t-shirt.
00:02:47.080 | - Like fundamentally, I think one of the things
00:02:50.920 | that draws a lot of people to building stuff on computers
00:02:53.680 | is this like insane iteration speed
00:02:55.560 | where in other disciplines,
00:02:57.400 | you might be sort of gated by resources
00:02:59.800 | or the ability, even the ability
00:03:02.040 | to get a large group together
00:03:02.920 | and coding is this like amazing thing
00:03:04.400 | where it's you and the computer
00:03:05.480 | and that alone, you can build
00:03:08.160 | really cool stuff really quickly.
00:03:09.840 | - So for people who don't know,
00:03:10.920 | Cursor is this super cool new editor
00:03:14.760 | that's a fork of VS Code.
00:03:16.280 | It'd be interesting to get your kind of explanation
00:03:20.240 | of your own journey of editors.
00:03:22.600 | How did you, I think all of you
00:03:24.280 | were big fans of VS Code with Copilot.
00:03:28.200 | How did you arrive to VS Code
00:03:29.960 | and how did that lead to your journey with Cursor?
00:03:33.120 | - Yeah, so I think a lot of us,
00:03:37.640 | well, all of us were originally Vim users.
00:03:39.960 | - Pure Vim. - Pure Vim, yeah.
00:03:41.560 | No NeoVim, just pure Vim and a terminal.
00:03:44.160 | And at least for myself,
00:03:47.360 | it was around the time that Copilot came out,
00:03:50.240 | so 2021, that I really wanted to try it.
00:03:55.240 | So I went into VS Code, the only platform,
00:03:57.920 | the only code editor in which it was available.
00:04:00.080 | And even though I really enjoyed using Vim,
00:04:04.680 | just the experience of Copilot with VS Code
00:04:07.440 | was more than good enough to convince me to switch.
00:04:10.840 | And so that kind of was the default
00:04:12.280 | until we started working on Cursor.
00:04:14.680 | - And maybe we should explain what Copilot does.
00:04:17.400 | It's like a really nice auto-complete.
00:04:20.000 | It suggests, as you start writing a thing,
00:04:21.880 | it suggests one or two or three lines
00:04:24.280 | how to complete the thing.
00:04:26.000 | And there's a fun experience in that,
00:04:29.360 | you know, like when you have a close friendship
00:04:31.200 | and your friend completes your sentences?
00:04:34.040 | Like when it's done well, there's an intimate feeling.
00:04:37.320 | There's probably a better word than intimate,
00:04:38.680 | but there's a cool feeling of like,
00:04:40.760 | holy shit, it gets me.
00:04:44.400 | And then there's an unpleasant feeling
00:04:46.280 | when it doesn't get you.
00:04:48.280 | And so there's that kind of friction,
00:04:50.600 | but I would say for a lot of people,
00:04:52.240 | the feeling that it gets me overpowers that it doesn't.
00:04:55.160 | - And I think actually one of the underrated aspects
00:04:57.080 | of GitHub Copilot is that even when it's wrong,
00:04:59.320 | it's like a little bit annoying, but it's not that bad
00:05:01.680 | because you just type another character
00:05:04.200 | and then maybe then it gets you,
00:05:05.800 | or you type another character and then it gets you.
00:05:08.040 | So even when it's wrong, it's not that bad.
00:05:09.480 | - Yeah, you can sort of iterate and fix it.
00:05:11.840 | I mean, the other underrated part of Copilot for me
00:05:14.680 | sort of was just the first real AI product.
00:05:18.000 | So the first language model consumer product.
00:05:21.440 | - So Copilot was kind of like the first killer app for LLMs.
00:05:26.440 | - Yeah, and like the beta was out in 2021.
00:05:29.040 | - Right, okay.
00:05:30.280 | So what's the origin story of Cursor?
00:05:34.160 | - So around 2020, the scaling laws papers came out
00:05:37.280 | from OpenAI.
00:05:39.080 | And that was a moment where this looked like
00:05:42.000 | clear predictable progress for the field,
00:05:43.360 | where even if we didn't have any more ideas,
00:05:46.040 | it looks like you can make these models a lot better
00:05:47.400 | if you had more compute and more data.
00:05:49.720 | - By the way, we'll probably talk for three to four hours
00:05:53.520 | on the topic of scaling laws.
00:05:55.160 | - Yes.
00:05:56.000 | - But just to summarize, it's a paper and a set of papers
00:05:59.520 | and a set of ideas that say bigger might be better
00:06:02.120 | for model size and data size
00:06:04.080 | in the realm of machine learning.
00:06:05.720 | - It's bigger and better, but predictably better.
00:06:08.800 | Okay, there's another topic of conversation.
00:06:10.640 | - Yeah, so around that time, for some of us,
00:06:13.080 | there were like a lot of conceptual conversations
00:06:14.640 | about what's this gonna look like,
00:06:17.280 | what's the story gonna be
00:06:18.520 | for all these different knowledge worker fields
00:06:20.240 | about how they're gonna be made better
00:06:23.200 | by this technology getting better.
00:06:25.160 | And then I think there were a couple of moments
00:06:27.840 | where like the theoretical gains predicted in that paper
00:06:31.440 | started to feel really concrete
00:06:32.800 | and it started to feel like a moment
00:06:33.760 | where you could actually go and not do a PhD
00:06:37.040 | if you wanted to work on, do useful work in AI,
00:06:40.320 | actually felt like now there was this whole set of systems
00:06:42.640 | one could build that were really useful.
00:06:44.760 | And I think that the first moment
00:06:45.960 | we already talked about a little bit,
00:06:47.160 | which was playing with the early bit of Copilot,
00:06:48.840 | like that was awesome and magical.
00:06:50.540 | I think that the next big moment
00:06:53.120 | where everything kind of clicked together
00:06:55.320 | was actually getting early access to GPT-4.
00:06:57.600 | So it was sort of end of 2022
00:06:59.440 | was when we were tinkering with that model
00:07:02.400 | and the step-up in capabilities felt enormous.
00:07:05.640 | And previous to that,
00:07:06.960 | we had been working on a couple of different projects.
00:07:08.780 | We had been, because of Copilot, because of scaling laws,
00:07:13.040 | because of our prior interest in the technology,
00:07:15.060 | we had been tinkering around with tools for programmers,
00:07:18.880 | but things that are like very specific.
00:07:20.500 | So, we were building tools for financial professionals
00:07:24.560 | who have to work within a Jupyter Notebook
00:07:25.780 | or like playing around with,
00:07:27.120 | can you do static analysis with these models?
00:07:29.260 | And then the step-up in GPT-4 felt like,
00:07:31.240 | look, that really made concrete the theoretical gains
00:07:35.160 | that we had predicted before.
00:07:37.300 | Felt like you could build a lot more
00:07:39.120 | just immediately at that point in time.
00:07:40.920 | And also, if we were being consistent,
00:07:44.960 | it really felt like
00:07:46.520 | this wasn't just gonna be a point solution thing.
00:07:48.120 | This was gonna be all of programming
00:07:49.500 | was gonna flow through these models.
00:07:50.880 | It felt like that demanded
00:07:52.200 | a different type of programming environment,
00:07:54.000 | a different type of programming.
00:07:55.680 | And so we set off to build that,
00:07:57.440 | that sort of larger vision around that.
00:07:59.920 | - There's one that I distinctly remember.
00:08:01.800 | So my roommate is an IMO Gold winner
00:08:05.140 | and there's a competition in the U.S. called the Putnam,
00:08:07.920 | which is sort of the IMO for college people.
00:08:10.040 | And it's this math competition.
00:08:12.280 | It's exceptionally good.
00:08:14.160 | So Sheng Tong and Aman,
00:08:16.600 | I remember it's sort of June of 2022,
00:08:21.600 | had this bet on whether, by
00:08:24.360 | like June or July of 2024,
00:08:27.140 | models were going to win a gold medal
00:08:28.800 | in the IMO.
00:08:31.240 | - IMO is International Math Olympiad.
00:08:33.520 | - Yeah, IMO is International Math Olympiad.
00:08:35.600 | And so Arvid and I are both there,
00:08:37.660 | you know, also competed in it.
00:08:38.820 | So it was sort of personal.
00:08:41.580 | And I remember thinking,
00:08:44.780 | man, this is just, this is not gonna happen.
00:08:47.180 | This was like,
00:08:48.020 | it was like, even though I sort of believed in progress,
00:08:51.660 | I thought, you know, IMO Gold just,
00:08:54.260 | like Aman is just delusional.
00:08:55.780 | - Yeah.
00:08:56.620 | - That was the, and to be honest,
00:08:58.260 | I mean, I was, to be clear, very wrong,
00:09:01.100 | but that was maybe the most prescient bet in the group.
00:09:05.360 | - So the new results from DeepMind,
00:09:08.160 | it turned out that you were correct.
00:09:10.160 | That's what the-
00:09:11.000 | - Well, it was technically not.
00:09:12.680 | - Technically incorrect, but one point away.
00:09:15.040 | Aman was very enthusiastic about this stuff.
00:09:16.960 | - Yeah.
00:09:17.800 | - And before, Aman had this like scaling laws T-shirt
00:09:21.160 | that he would walk around with,
00:09:22.360 | where it had the like charts and like the formulas on it.
00:09:25.240 | - So you like felt the AGI or you felt the scaling laws?
00:09:28.640 | - Yeah, I distinctly remember
00:09:30.100 | there was this one conversation I had with Michael,
00:09:33.600 | where before I hadn't thought super deeply
00:09:36.220 | and critically about scaling laws.
00:09:38.520 | And he kind of posed the question,
00:09:40.560 | why isn't scaling all you need,
00:09:42.640 | or why isn't scaling gonna result
00:09:44.200 | in massive gains in progress?
00:09:46.360 | And I think I went through like the stages of grief.
00:09:49.440 | There is anger, denial, and then finally at the end,
00:09:51.880 | just thinking about it, acceptance.
00:09:55.900 | And I think I've been quite hopeful
00:10:00.020 | and optimistic about progress since.
00:10:03.220 | I think one thing I'll caveat is,
00:10:05.700 | I think it also depends on like which domains
00:10:07.340 | you're gonna see progress.
00:10:08.160 | Like math is a great domain,
00:10:09.840 | because especially like formal theorem proving,
00:10:12.740 | because you get this fantastic signal
00:10:15.340 | of actually verifying if the thing was correct.
00:10:18.180 | And so this means something like RL
00:10:19.660 | can work really, really well.
00:10:21.220 | And I think like you could have systems
00:10:22.820 | that are perhaps very superhuman at math
00:10:25.380 | and still not technically have AGI.
00:10:27.680 | - Okay, so can we take it all the way to Cursor?
00:10:30.840 | And what is Cursor?
00:10:32.280 | It's a fork of VS Code.
00:10:34.560 | And VS Code is one of the most popular editors
00:10:37.560 | for a long time.
00:10:38.400 | Like everybody fell in love with it.
00:10:39.560 | Everybody loved Vim.
00:10:41.240 | I left Emacs for it.
00:10:43.000 | Sorry.
00:10:43.840 | So it unified, in some fundamental way,
00:10:49.960 | the developer community.
00:10:52.960 | And then you look at the space of things,
00:10:54.840 | you look at the scaling laws,
00:10:55.960 | AI is becoming amazing.
00:10:58.280 | And you decided, okay,
00:10:59.880 | it's not enough to just write an extension for your VS Code,
00:11:02.880 | because there's a lot of limitations to that.
00:11:06.200 | Where we need,
00:11:07.340 | if AI is gonna keep getting better, better, better,
00:11:09.240 | we need to really like rethink
00:11:11.080 | how the AI is gonna be part of the editing process.
00:11:14.260 | And so you decided to fork VS Code
00:11:16.700 | and start to build a lot of the amazing features
00:11:19.460 | we'll be able to talk about.
00:11:22.160 | But what was that decision like?
00:11:23.300 | Because there's a lot of extensions,
00:11:25.920 | including Copilot of VS Code
00:11:28.600 | that are doing sort of AI type stuff.
00:11:30.320 | What was the decision like to just fork VS Code?
00:11:33.300 | - So the decision to do an editor
00:11:35.320 | seemed kind of self-evident to us
00:11:37.880 | for at least what we wanted to do and achieve.
00:11:40.440 | Because when we started working on the editor,
00:11:42.380 | the idea was these models are gonna get much better,
00:11:44.400 | their capabilities are gonna improve,
00:11:45.520 | and it's gonna entirely change how you build software.
00:11:47.680 | Both in a, you will have big productivity gains,
00:11:49.960 | but also radical in how like the act of building software
00:11:52.200 | is going to change a lot.
00:11:53.860 | And so you're very limited
00:11:55.740 | in the control you have over a code editor,
00:11:58.180 | if you're a plugin to an existing coding environment.
00:12:01.580 | And we didn't wanna get locked in by those limitations.
00:12:04.940 | We wanted to be able to just build the most useful stuff.
00:12:08.100 | - Okay, well then the natural question is,
00:12:10.340 | you know, VS Code with Copilot is kind of a competitor.
00:12:15.460 | So how do you win?
00:12:17.320 | Is it basically just the speed
00:12:18.780 | and the quality of the features?
00:12:20.200 | - Yeah, I mean, I think this is a space
00:12:23.000 | that is quite interesting, perhaps quite unique,
00:12:26.280 | where if you look at previous tech waves,
00:12:29.760 | maybe there's kind of one major thing that happened
00:12:31.780 | and it unlocked a new wave of companies.
00:12:34.200 | But every single year, every single model capability
00:12:37.720 | or jump you get in model capabilities,
00:12:39.920 | you now unlock this new wave of features,
00:12:43.560 | things that are possible, especially in programming.
00:12:46.880 | And so I think in AI programming,
00:12:49.760 | being even just a few months ahead,
00:12:51.700 | let alone a year ahead,
00:12:53.380 | makes your product much, much, much more useful.
00:12:55.780 | I think the Cursor a year from now
00:12:57.780 | will need to make the Cursor of today look obsolete.
00:13:00.880 | And I think, you know, Microsoft has done a number
00:13:04.980 | of like fantastic things,
00:13:06.480 | but I don't think they're in a great place
00:13:08.340 | to really keep innovating and pushing on this
00:13:10.980 | in the way that a startup can.
00:13:13.120 | - Just rapidly implementing features.
00:13:15.820 | - And push, yeah, like, and kind of doing
00:13:18.980 | the research experimentation necessary
00:13:21.260 | to really push the ceiling.
00:13:24.060 | - I don't know if I think of it in terms of features
00:13:26.060 | as I think of it in terms of like capabilities
00:13:28.300 | for programmers.
00:13:29.660 | It's that like, you know, as, you know,
00:13:33.020 | the new o1 model came out
00:13:34.900 | and I'm sure there are going to be more models
00:13:37.280 | of different types, like longer context and maybe faster.
00:13:40.700 | Like there's all these crazy ideas that you can try
00:13:44.700 | and hopefully 10% of the crazy ideas
00:13:47.740 | will make it into something kind of cool and useful.
00:13:50.700 | And we want people to have that sooner.
00:13:55.700 | To rephrase, it's like an underrated fact
00:13:57.580 | is we're making it for ourselves.
00:13:59.300 | When we started Cursor,
00:14:00.820 | you really felt this frustration that, you know, models,
00:14:04.160 | you could see models getting better,
00:14:06.700 | but the Copilot experience had not changed.
00:14:08.780 | It was like, man, these guys,
00:14:11.740 | like the ceiling is getting higher.
00:14:13.000 | Like, why are they not making new things?
00:14:14.700 | Like they should be making new things.
00:14:16.100 | They should be like,
00:14:16.940 | like where's all the alpha features?
00:14:19.340 | There were no alpha features.
00:14:21.060 | It was like, I'm sure it was selling well.
00:14:24.700 | I'm sure it was a great business,
00:14:26.140 | but it didn't feel,
00:14:27.660 | I'm one of these people that really want to try
00:14:30.740 | and use new things.
00:14:31.780 | And it was just, there's no new thing
00:14:33.660 | for like a very long while.
00:14:35.380 | - Yeah, it's interesting.
00:14:37.300 | I don't know how you put that into words,
00:14:38.740 | but when you compare Cursor with Copilot,
00:14:41.460 | Copilot pretty quickly became,
00:14:43.640 | started to feel stale for some reason.
00:14:45.760 | - Yeah, I think one thing that I think helps us
00:14:49.560 | is that we're sort of doing it all in one
00:14:52.760 | where we're developing the UX
00:14:55.400 | and the way you interact with the model.
00:14:57.480 | At the same time as we're developing,
00:14:59.680 | like how we actually make the model give better answers.
00:15:02.440 | So we're like, how you build up the prompter
00:15:05.400 | or like, how do you find the context?
00:15:06.960 | And for a Cursor tab, like how do you train the model?
00:15:10.320 | So I think that helps us to have all of it,
00:15:12.520 | like sort of like the same people working
00:15:15.020 | on the entire experience end-to-end.
00:15:17.380 | - Yeah, it's like the person making the UI
00:15:19.380 | and the person training the model,
00:15:20.660 | like sit to like 18 feet away.
00:15:23.620 | - Often the same person even.
00:15:25.740 | - Yeah, often even the same person.
00:15:27.340 | So you can create things that are sort of not possible
00:15:30.980 | if you're not talking, you're not experimenting.
00:15:34.340 | - And you're using, like you said, Cursor to write Cursor.
00:15:37.180 | - Of course, oh yeah.
00:15:38.780 | - Well, let's talk about some of these features.
00:15:40.760 | Let's talk about the all-knowing,
00:15:43.120 | the all-powerful, praise be to the tab.
00:15:46.000 | You know, auto-complete on steroids, basically.
00:15:50.760 | So how does tab work?
00:15:51.980 | What is tab?
00:15:53.180 | - To highlight and summarize at a high level,
00:15:54.800 | I'd say that there are two things
00:15:57.000 | that Cursor is pretty good at right now.
00:15:58.920 | There are other things that it does,
00:16:01.400 | but two things that it helps programmers with.
00:16:04.800 | One is this idea of looking over your shoulder
00:16:08.320 | and being like a really fast colleague
00:16:10.440 | who can kind of jump ahead of you and type
00:16:12.720 | and figure out what you're gonna do next.
00:16:15.080 | And that was the original idea behind,
00:16:18.720 | that was kind of the kernel of the idea
00:16:20.000 | behind a good auto-complete
00:16:21.280 | was predicting what you're gonna do next.
00:16:23.240 | But you can make that concept even more ambitious
00:16:26.120 | by not just predicting the characters after your Cursor,
00:16:29.640 | but actually predicting the next entire change
00:16:31.160 | you're gonna make, the next diff,
00:16:32.120 | next place you're gonna jump to.
00:16:35.200 | And the second thing Cursor is pretty good at right now too
00:16:40.200 | is helping you sometimes jump ahead of the AI
00:16:42.680 | and tell it what to do and go from instructions to code.
00:16:47.120 | And on both of those, we've done a lot of work
00:16:48.560 | on making the editing experience for those things ergonomic
00:16:51.240 | and also making those things smart and fast.
00:16:54.520 | - One of the things we really wanted
00:16:56.240 | was we wanted the model to be able to edit code for us.
00:16:59.060 | That was kind of a wish.
00:17:00.200 | And we had multiple attempts at it
00:17:02.640 | before we had a sort of a good model
00:17:04.920 | that could edit code for you.
00:17:06.360 | Then after we had a good model,
00:17:09.760 | I think there's been a lot of effort
00:17:11.680 | to make the inference fast for having a good experience.
00:17:16.680 | And we've been starting to incorporate,
00:17:22.560 | I mean, Michael sort of mentioned this like ability
00:17:24.480 | to jump to different places.
00:17:26.280 | And that jump to different places,
00:17:27.720 | I think came from a feeling of,
00:17:30.400 | once you accept an edit,
00:17:32.340 | it's like, man, it should be just really obvious
00:17:36.720 | where to go next.
00:17:37.800 | It's like, I'd made this change,
00:17:39.960 | the model should just know that like the next place to go to
00:17:42.600 | is like 18 lines down.
00:17:44.680 | Like if you're a Vim user,
00:17:46.400 | you could press 18JJ or whatever.
00:17:48.920 | But like, why am I doing this?
00:17:52.080 | Like the model should just know it.
00:17:54.040 | And then so the idea was you just pressed tab,
00:17:56.800 | it would go 18 lines down
00:17:58.080 | and then show you the next edit and you would press tab.
00:18:01.720 | So it was just you, as long as you could keep pressing tab.
00:18:04.680 | And so the internal competition was
00:18:06.280 | how many tabs can we make someone press?
00:18:08.480 | Once you have like the idea,
00:18:10.560 | more sort of abstractly the thing to think about
00:18:14.920 | is sort of like, how are the edits sort of zero entropy?
00:18:18.960 | So once you've sort of expressed your intent
00:18:20.920 | and the edit is,
00:18:22.440 | there's no like new bits of information
00:18:25.120 | to finish your thought,
00:18:27.440 | but you still have to type some characters
00:18:29.640 | to like make the computer understand
00:18:31.360 | what you're actually thinking.
00:18:33.080 | Then maybe the model should just sort of read your mind
00:18:35.800 | and all the zero entropy bits should just be like
00:18:39.240 | tabbed away.
00:18:40.600 | - Yeah.
00:18:41.440 | - That was sort of the abstract.
00:18:42.600 | - There's this interesting thing where
00:18:43.800 | if you look at language model loss on different domains,
00:18:46.960 | I believe the bits per byte,
00:18:49.360 | which is kind of character normalized loss for code
00:18:53.400 | is lower than language, which means in general,
00:18:56.040 | there are a lot of tokens in code
00:18:57.200 | that are super predictable.
00:18:58.840 | A lot of characters that are super predictable.
00:19:00.960 | And this is, I think, even magnified
00:19:03.040 | when you're not just trying to auto-complete code,
00:19:05.560 | but predicting what the user is going to do next
00:19:08.560 | in their editing of existing code.
00:19:10.880 | And so, you know, the goal of cursor tabs,
00:19:12.520 | let's eliminate all the low entropy actions
00:19:15.320 | you take inside of the editor.
00:19:16.760 | When the intent is effectively determined,
00:19:19.680 | let's just jump you forward in time, skip you forward.
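
As a brief aside for readers: "bits per byte" is just token-level cross-entropy converted from nats to bits and normalized by how many bytes each token covers, which takes the tokenizer out of the comparison. Below is a minimal sketch of that conversion; the numbers in the example are made up purely to illustrate the direction of the comparison described above.

```python
import math

def bits_per_byte(mean_loss_nats_per_token: float, avg_bytes_per_token: float) -> float:
    """Convert token-level cross-entropy (in nats) to bits per byte.

    Dividing by ln(2) converts nats to bits; dividing by the average number of
    UTF-8 bytes per token normalizes away the tokenizer, which is what makes
    losses on code and on natural language comparable.
    """
    return mean_loss_nats_per_token / (math.log(2) * avg_bytes_per_token)

# Made-up numbers purely to illustrate the comparison: code tends to be more
# predictable per byte than prose.
print(f"prose-ish: {bits_per_byte(1.7, 3.2):.3f} bits/byte")
print(f"code-ish:  {bits_per_byte(1.1, 3.8):.3f} bits/byte")
```
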
00:19:22.400 | - Well, what's the intuition
00:19:23.960 | and what's the technical details
00:19:25.080 | of how to do next cursor prediction?
00:19:27.520 | That jump, that's not so intuitive, I think, to people.
00:19:31.440 | - Yeah.
00:19:32.520 | I think I can speak to a few of the details
00:19:35.520 | on how to make these things work.
00:19:37.280 | They're incredibly low latency.
00:19:38.480 | So you need to train small models on this task.
00:19:43.160 | In particular, they're incredibly pre-fill token hungry.
00:19:48.160 | What that means is they have these really,
00:19:49.840 | really long prompts where they see a lot of your code
00:19:52.600 | and they're not actually generating that many tokens.
00:19:54.880 | And so the perfect fit for that is using a sparse model,
00:19:58.680 | meaning an MOE model.
00:19:59.840 | So that was kind of one breakthrough we made
00:20:03.360 | that substantially improved its performance
00:20:05.000 | at longer context.
00:20:06.280 | The other being a variant of speculative decoding
00:20:10.000 | that we kind of built out called speculative edits.
00:20:13.280 | These are two, I think, important pieces
00:20:15.320 | of what make it quite high quality and very fast.
00:20:20.360 | - Okay, so MoE, mixture of experts.
00:20:22.800 | The input is huge, the output is small.
00:20:24.960 | - Yeah.
00:20:25.800 | - Okay, so what else can you say about how to make,
00:20:28.560 | does caching play a role in this particular--
00:20:31.200 | - Caching plays a huge role.
00:20:32.840 | Because you're dealing with this many input tokens,
00:20:36.480 | if every single keystroke that you're typing
00:20:39.080 | in a given line, you had to rerun the model
00:20:41.720 | on all of those tokens passed in,
00:20:44.120 | you're just going to, one, significantly degrade latency,
00:20:47.400 | two, you're gonna kill your GPUs with load.
00:20:49.880 | So you need to design the actual prompts you use
00:20:53.800 | for the model such that they're caching aware.
00:20:57.040 | And then, yeah, you need to reuse the KV cache
00:20:59.840 | across requests just so that you're spending less work,
00:21:03.000 | less compute.
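
To make the caching point concrete, here is a minimal sketch, not Cursor's actual prompt format, of what a "caching aware" prompt layout can mean: the large, slowly-changing context sits in a stable prefix that a prefix-caching inference server can reuse across keystrokes, and the rapidly-changing cursor-local text goes last so only a few tokens need fresh prefill work per request.

```python
# Sketch only: a prompt split into a stable prefix (reusable KV cache) and a
# volatile suffix (recomputed every keystroke). Section headers are made up.

def build_prompt(file_context: str, cursor_window: str) -> str:
    stable_prefix = (      # identical across many consecutive keystrokes
        "### File context\n"
        f"{file_context}\n"
    )
    volatile_suffix = (    # changes on every keystroke
        "### Around the cursor\n"
        f"{cursor_window}\n"
        "### Predict the next edit:\n"
    )
    return stable_prefix + volatile_suffix

# Two consecutive keystrokes: the prefix (and hence its KV cache) is unchanged.
ctx = "def greet(name):\n    return f'hello {name}'\n"
print(build_prompt(ctx, "def gre"))
print(build_prompt(ctx, "def gree"))
```
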
00:21:04.440 | - Again, what are the things that TAB is supposed
00:21:07.320 | to be able to do kind of in the near term,
00:21:11.040 | just to like sort of linger on that?
00:21:13.520 | Generate code, like fill empty space,
00:21:18.440 | also edit code across multiple lines,
00:21:21.600 | and then jump to different locations inside the same file?
00:21:24.320 | - Yeah. - And then like--
00:21:25.160 | - Hopefully jump to different files also.
00:21:26.920 | So if you make an edit in one file,
00:21:28.680 | and maybe you have to go to another file
00:21:32.480 | to finish your thought,
00:21:33.400 | it should go to the second file also, yeah.
00:21:36.200 | - And then the full generalization
00:21:38.000 | is like next action prediction.
00:21:40.680 | Like sometimes you need to run a command in the terminal,
00:21:44.000 | and it should be able to suggest the command
00:21:46.880 | based on the code that you wrote too.
00:21:48.920 | Or sometimes you actually need to,
00:21:53.200 | like it suggests something,
00:21:54.080 | but it's hard for you to know if it's correct,
00:21:57.120 | because you actually need some more information to learn.
00:21:59.680 | Like you need to know the type
00:22:01.200 | to be able to verify that it's correct.
00:22:02.720 | And so maybe it should actually take you to a place
00:22:05.560 | that's like the definition of something,
00:22:07.520 | and then take you back
00:22:09.160 | so that you have all the requisite knowledge
00:22:11.160 | to be able to accept the next completion.
00:22:13.200 | - So providing the human the knowledge.
00:22:15.640 | - Yes.
00:22:17.200 | - Right.
00:22:18.240 | Can you integrate, like,
00:22:19.400 | I just got to know a guy named Primeagen,
00:22:22.600 | who I believe has a thing where
00:22:24.920 | you can order coffee via SSH.
00:22:27.760 | - Oh yeah.
00:22:29.520 | - Oh, we did that.
00:22:30.360 | - We did that.
00:22:31.200 | - So can that also the model do that?
00:22:32.920 | Like feed you and provide you with caffeine?
00:22:37.360 | Okay, so that's the general framework.
00:22:39.280 | - Yeah, yeah.
00:22:40.120 | And the magic moment would be if it is,
00:22:44.680 | programming is this weird discipline
00:22:46.120 | where sometimes the next five minutes,
00:22:50.000 | not always, but sometimes the next five minutes,
00:22:51.680 | what you're gonna do is actually predictable
00:22:52.920 | from the stuff you've done recently.
00:22:54.360 | And so can you get to a world
00:22:55.480 | where that next five minutes either happens
00:22:57.160 | by you disengaging and it taking you through,
00:22:59.440 | or maybe a little bit more of just you seeing next step,
00:23:02.920 | what it's gonna do, and you're like,
00:23:03.760 | okay, that's good, that's good, that's good, that's good.
00:23:05.320 | And you can just sort of tap, tap, tap
00:23:07.120 | through these big changes.
00:23:09.000 | - As we're talking about this,
00:23:10.120 | I should mention that one of the really cool
00:23:12.680 | and noticeable things about Cursor is that
00:23:14.880 | there's this whole diff interface situation going on.
00:23:17.760 | So like the model suggests with the red and the green
00:23:22.560 | of like, here's how we're gonna modify the code.
00:23:24.440 | And in the chat window, you can apply
00:23:27.280 | and it shows you the diff and you can accept the diff.
00:23:29.880 | So maybe can you speak to whatever direction of that?
00:23:32.680 | - We'll probably have like four
00:23:34.040 | or five different kinds of diffs.
00:23:37.480 | So we have optimized the diff for the autocomplete.
00:23:40.880 | So that has a different diff interface
00:23:42.800 | than when you're reviewing larger blocks of code.
00:23:47.680 | And then we're trying to optimize another diff thing
00:23:50.720 | for when you're doing multiple different files
00:23:53.240 | and sort of at a high level, the difference is
00:23:57.480 | for when you're doing autocomplete,
00:24:00.480 | it should be really, really fast to read.
00:24:02.520 | Actually, it should be really fast to read in all situations
00:24:06.680 | but in autocomplete, it's sort of,
00:24:08.560 | you're really like your eyes focused in one area.
00:24:11.640 | You can't be in too many,
00:24:13.560 | the humans can't look in too many different places.
00:24:15.400 | - So you're talking about on the interface side?
00:24:17.200 | - On the interface side.
00:24:18.040 | So it currently has this box on the side.
00:24:20.440 | So we have the current box.
00:24:22.200 | And if it tries to delete code in some place
00:24:25.360 | and tries to add other code,
00:24:27.120 | it tries to show you a box on the side.
00:24:28.680 | - You can maybe show it if we pull it up on cursor.com.
00:24:31.600 | This is what we're talking about.
00:24:33.480 | - So that box, it was like three or four different attempts
00:24:38.400 | at trying to make this thing work.
00:24:40.760 | Where first attempt was like this blue crossed out line.
00:24:45.600 | So before it was a box on the side.
00:24:48.080 | It used to show you the code to delete
00:24:50.400 | by showing you like Google Docs style,
00:24:53.320 | you would see like a line through it.
00:24:55.240 | Then you would see the new code.
00:24:57.840 | And that was super distracting.
00:24:59.960 | And then we tried many different,
00:25:02.240 | there was sort of deletions,
00:25:03.800 | there was trying to read highlight.
00:25:06.320 | Then the next iteration of it, which is sort of funny,
00:25:09.040 | you would hold the Option button on Mac.
00:25:14.040 | So it would sort of highlight a region of code
00:25:17.080 | to show you that there might be something coming.
00:25:19.600 | So maybe in this example,
00:25:21.720 | like the input and the value would all get blue.
00:25:26.120 | And the blue would to highlight
00:25:28.360 | that the AI had a suggestion for you.
00:25:30.200 | So instead of directly showing you the thing,
00:25:32.960 | it would show you that the AI,
00:25:34.600 | it would just hint that the AI had a suggestion.
00:25:36.440 | And if you really wanted to see it,
00:25:38.160 | you would hold the option button,
00:25:40.360 | and then you would see the new suggestion.
00:25:42.520 | Then if you release the option button,
00:25:45.120 | you would then see your original code.
00:25:47.560 | - So that's, by the way, that's pretty nice,
00:25:49.480 | but you have to know to hold the option button.
00:25:51.120 | - Yeah.
00:25:51.960 | - So by the way, I'm not a Mac user, but I got it.
00:25:54.600 | (laughs)
00:25:55.440 | - It was-
00:25:56.280 | - It's a button I guess, you people have.
00:25:59.080 | - It's again, it's just non-intuitive.
00:26:01.840 | I think that's the key thing.
00:26:03.680 | - And there's a chance this is also
00:26:05.280 | not the final version of it.
00:26:06.920 | - I am personally very excited
00:26:08.200 | for making a lot of improvements in this area.
00:26:13.200 | Like we often talk about it as the verification problem,
00:26:17.920 | where these diffs are great for small edits.
00:26:21.680 | For large edits, or like when it's multiple files
00:26:24.920 | or something, it's actually a little bit prohibitive
00:26:29.920 | to review these diffs.
00:26:32.520 | And so there are like a couple of different ideas here.
00:26:36.400 | Like one idea that we have is, okay, you know,
00:26:38.800 | like parts of the diffs are important.
00:26:41.040 | They have a lot of information.
00:26:42.400 | And then parts of the diff are just very low entropy.
00:26:46.480 | They're like the same thing over and over again.
00:26:49.520 | And so maybe you can highlight the important pieces
00:26:52.640 | and then gray out the not so important pieces.
00:26:55.200 | Or maybe you can have a model that looks at the diff
00:26:58.520 | and sees, oh, there's a likely bug here.
00:27:00.960 | I will like mark this with a little red squiggly
00:27:03.920 | and say like, you should probably like review
00:27:05.760 | this part of the diff.
00:27:07.680 | And ideas in that vein, I think are exciting.
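
As a toy illustration of that triage idea (not a real product feature), the sketch below scores each added line of a diff with a crude surprisal proxy and tags the low-information lines for skimming; a real system would presumably use per-token log-probabilities from a language model rather than simple token frequencies, and the median cutoff here is arbitrary.

```python
import math
from collections import Counter

def surprisal(line: str, background: Counter, total: int) -> float:
    """Crude proxy: average negative log-frequency of the line's tokens."""
    toks = line.split() or [""]
    return sum(-math.log((background[t] + 1) / (total + 1)) for t in toks) / len(toks)

def triage_diff(added_lines: list[str]) -> list[tuple[str, str]]:
    background = Counter(t for l in added_lines for t in l.split())
    total = sum(background.values())
    scores = [(l, surprisal(l, background, total)) for l in added_lines]
    cutoff = sorted(s for _, s in scores)[len(scores) // 2]  # median split
    return [("REVIEW" if s > cutoff else "skim", l) for l, s in scores]

diff = [
    "import logging",
    "import os",
    "retries = MAX_RETRIES - attempt_count",
    "import sys",
    "if retries < 0: raise RuntimeError('out of retries')",
]
for tag, line in triage_diff(diff):
    print(f"[{tag}] {line}")
```
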
00:27:11.400 | - Yeah, that's a really fascinating space
00:27:13.720 | of like UX design engineering.
00:27:15.960 | So you're basically trying to guide the human programmer
00:27:20.840 | through all the things they need to read and nothing more.
00:27:23.800 | - Yeah. - Like optimally.
00:27:25.280 | - Yeah, and you want an intelligent model to do it.
00:27:27.880 | Like currently diff algorithms are, they're like,
00:27:31.600 | they're just like normal algorithms.
00:27:36.080 | There is no intelligence.
00:27:38.000 | There's like intelligence that went into designing
00:27:39.800 | the algorithm, but then there's no, like,
00:27:42.320 | you don't care if it's about this thing or this thing,
00:27:45.080 | as you want a model to do this.
00:27:47.280 | - So I think the general question is like,
00:27:50.320 | Matt, these models are going to get much smarter.
00:27:53.640 | As the models get much smarter,
00:27:55.920 | the changes they will be able to propose are much bigger.
00:27:58.880 | So as the changes gets bigger and bigger and bigger,
00:28:02.040 | the humans have to do more and more
00:28:03.560 | and more verification work.
00:28:04.960 | It gets more and more and more hard.
00:28:06.640 | Like it's just, you need to help them out.
00:28:08.880 | It's sort of, I don't want to spend all my time
00:28:11.720 | reviewing code.
00:28:12.560 | - Can you say a little more about diffs across multiple files?
00:28:20.040 | - Yeah, I mean, so GitHub tries to solve this, right?
00:28:23.560 | With code review.
00:28:25.120 | When you're doing code review,
00:28:26.080 | you're reviewing multiple diffs across multiple files.
00:28:29.640 | But like Arvid said earlier,
00:28:32.080 | I think you can do much better than code review.
00:28:34.960 | You know, code review kind of sucks.
00:28:36.960 | Like you spend a lot of time trying to grok this code
00:28:39.680 | that's often quite unfamiliar to you.
00:28:42.200 | And it often like doesn't even actually catch that many bugs.
00:28:47.200 | And I think you can significantly improve
00:28:50.240 | that review experience using language models,
00:28:52.120 | for example, using the kinds of tricks
00:28:54.000 | that Arvid had described of maybe pointing you
00:28:56.720 | towards the regions that actually matter.
00:28:58.920 | I think also, if the code is produced
00:29:03.960 | by these language models,
00:29:05.440 | and it's not produced by someone else,
00:29:07.560 | like the code review experience is designed
00:29:12.160 | for both the reviewer and the person that produced the code.
00:29:16.240 | In the case where the person that produced the code
00:29:18.360 | is a language model,
00:29:20.040 | you don't have to care that much about their experience.
00:29:22.160 | And you can design the entire thing around the reviewer
00:29:24.920 | such that the reviewer's job is as fun,
00:29:29.000 | as easy, as productive as possible.
00:29:31.000 | And I think that feels like the issue
00:29:34.360 | with just kind of naively trying to make
00:29:36.960 | these things look like code review.
00:29:39.440 | I think you can be a lot more creative
00:29:41.040 | and push the boundary on what's possible.
00:29:43.120 | - Just one idea there is I think ordering matters.
00:29:46.680 | Generally, when you review a PR,
00:29:48.320 | you have this list of files
00:29:50.120 | and you're reviewing them from top to bottom,
00:29:52.320 | but actually you actually want to understand this part first
00:29:55.840 | because that came logically first.
00:29:57.560 | And then you want to understand the next part.
00:29:59.040 | And you don't want to have to figure out that yourself.
00:30:02.680 | You want a model to guide you through the thing.
00:30:05.360 | - And is the step of creation going to be more
00:30:08.000 | and more natural language is the goal
00:30:10.480 | versus with actual writing?
00:30:12.120 | - I think sometimes.
00:30:14.120 | I don't think it's going to be the case
00:30:15.560 | that all of programming will be natural language.
00:30:18.360 | And the reason for that is if I'm pair programming
00:30:21.960 | with Sualeh and Sualeh is at the computer and the keyboard,
00:30:24.440 | and sometimes if I'm driving, I want to say to Sualeh,
00:30:29.440 | "Hey, implement this function."
00:30:31.640 | And that works.
00:30:33.040 | And then sometimes it's just so annoying
00:30:35.800 | to explain to Sualeh what I want him to do.
00:30:37.800 | And so I actually take over the keyboard
00:30:40.360 | and I show him, I write part of the example,
00:30:43.520 | and then it makes sense.
00:30:45.480 | And that's the easiest way to communicate.
00:30:47.400 | And so I think that's also the case for AI.
00:30:49.600 | Sometimes the easiest way to communicate with AI
00:30:51.680 | will be to show an example,
00:30:52.680 | and then it goes and does the thing everywhere else.
00:30:54.920 | Or sometimes if you're making a website, for example,
00:30:57.760 | the easiest way to show to the AI what you want
00:31:00.920 | is not to tell it what to do,
00:31:02.360 | but drag things around or draw things.
00:31:05.000 | And yeah, and maybe eventually we will get
00:31:09.160 | to brain machine interfaces or whatever,
00:31:11.120 | and it can understand what you're thinking.
00:31:12.760 | And so I think natural language will have a place.
00:31:14.720 | I think it will definitely not be the way
00:31:17.760 | most people program most of the time.
00:31:20.560 | - I'm really feeling the AGI with this editor.
00:31:23.000 | (laughing)
00:31:24.040 | It feels like there's a lot of machine learning
00:31:25.680 | going on underneath.
00:31:27.760 | Tell me about some of the ML stuff that makes it all work.
00:31:31.200 | - Well, Cursor really works via this ensemble
00:31:34.600 | of custom models that we've trained alongside
00:31:37.640 | the frontier models that are fantastic
00:31:39.160 | at the reasoning intense things.
00:31:40.800 | And so Cursor tab, for example, is a great example
00:31:43.840 | of where you can specialize this model to be even better
00:31:46.480 | than even frontier models.
00:31:47.520 | If you look at evals on the task we set it at.
00:31:50.360 | The other domain, which it's kind of surprising
00:31:53.080 | that it requires custom models,
00:31:54.200 | but it's kind of necessary and works quite well is in apply.
00:31:58.080 | So I think these models are like the frontier models
00:32:03.000 | are quite good at sketching out plans for code
00:32:05.200 | and generating like rough sketches of like the change,
00:32:07.720 | but actually creating diffs is quite hard
00:32:13.080 | for frontier models, for your training models.
00:32:15.440 | Like you try to do this with Sonnet, with O1,
00:32:21.200 | any frontier model, and it really messes up stupid things
00:32:24.200 | like counting line numbers,
00:32:26.280 | especially in super, super large files.
00:32:28.400 | And so what we've done to alleviate this
00:32:31.840 | is we let the model kind of sketch out this rough code block
00:32:35.200 | that indicates what the change will be.
00:32:37.920 | And we train a model to then apply that change to the file.
00:32:42.440 | - And we should say that apply is,
00:32:45.120 | the model looks at your code.
00:32:47.440 | It gives you a really damn good suggestion
00:32:49.480 | of what new things to do.
00:32:52.400 | And the seemingly for humans trivial step
00:32:55.080 | of combining the two, you're saying is not so trivial.
00:32:59.400 | - Contrary to popular perception,
00:33:01.160 | it is not a deterministic algorithm.
00:33:03.080 | - Yeah.
00:33:04.160 | I think like you see shallow copies of apply elsewhere,
00:33:09.160 | and it just breaks like most of the time
00:33:11.480 | because you think you can kind of try
00:33:12.720 | to do some deterministic matching,
00:33:14.000 | and then it fails, at least 40% of the time.
00:33:18.160 | And that just results in a terrible product experience.
00:33:21.480 | I think in general, this regime of,
00:33:26.160 | you are going to get smarter and smarter models.
00:33:29.080 | And like, so one other thing that apply lets you do
00:33:31.800 | is it lets you use fewer tokens
00:33:35.240 | with the most intelligent models.
00:33:37.280 | This is both expensive in terms of latency
00:33:39.960 | for generating all these tokens and cost.
00:33:44.200 | So you can give this very, very rough sketch
00:33:47.160 | and then have your small models go and implement it
00:33:49.880 | because it's a much easier task to implement
00:33:52.000 | this very, very sketched out code.
00:33:54.360 | And I think that this regime will continue
00:33:56.200 | where you can use smarter and smarter models
00:33:58.280 | to do the planning.
00:33:59.120 | And then maybe the implementation details
00:34:01.600 | can be handled by the less intelligent ones.
00:34:03.560 | Perhaps you'll have, you know, maybe o1,
00:34:05.880 | maybe it'll be even more capable models
00:34:08.360 | given an even higher level plan
00:34:10.800 | that is kind of recursively applied by Sonnet
00:34:15.640 | and then the apply model.
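
Here is a hedged sketch of the two-stage flow described above, under the assumption that you have two prompt-to-completion callables, one backed by a frontier model and one by a small apply model. The function names and prompt wording are hypothetical; this is not Cursor's implementation.

```python
from typing import Callable

def propose_and_apply(frontier_llm: Callable[[str], str],
                      apply_llm: Callable[[str], str],
                      file_text: str,
                      instruction: str) -> str:
    # 1) The expensive model only sketches the change. It can elide unchanged
    #    code and does not need to get line numbers right, which keeps its
    #    output short (fewer generated tokens = lower latency and cost).
    sketch = frontier_llm(
        f"File:\n{file_text}\n\nInstruction: {instruction}\n"
        "Sketch only the code that should change; you may elide the rest."
    )
    # 2) The small model does the mechanical merge, which, per the discussion
    #    above, is surprisingly hard to do with a deterministic patch algorithm.
    return apply_llm(
        f"Original file:\n{file_text}\n\nProposed change:\n{sketch}\n"
        "Rewrite the complete file with the change applied."
    )
```
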
00:34:16.640 | - Maybe we should talk about how to make it fast.
00:34:18.960 | - Yeah.
00:34:19.800 | - I feel like fast is always an interesting detail.
00:34:21.720 | Fast is good.
00:34:22.800 | - Yeah. How do you make it fast?
00:34:25.120 | - Yeah, so one big component of making it fast
00:34:28.280 | is speculative edits.
00:34:29.960 | So speculative edits are a variant of speculative decoding.
00:34:33.080 | And maybe it'd be helpful to briefly describe
00:34:35.720 | speculative decoding.
00:34:37.800 | With speculative decoding,
00:34:39.240 | what you do is you can kind of take advantage
00:34:41.960 | of the fact that, you know, most of the time,
00:34:45.240 | and I'll add the caveat that it would be
00:34:47.640 | when you're memory bound in language model generation.
00:34:50.600 | If you process multiple tokens at once,
00:34:55.920 | it is faster than generating one token at a time.
00:34:58.760 | So this is like the same reason why
00:35:00.480 | if you look at tokens per second
00:35:02.640 | with prompt tokens versus generated tokens,
00:35:05.200 | it's much, much faster for prompt tokens.
00:35:07.520 | So what we do is instead of using
00:35:12.200 | what speculative decoding normally does,
00:35:13.800 | which is using a really small model
00:35:15.760 | to predict these draft tokens
00:35:17.160 | that your larger model will then go in and verify.
00:35:20.600 | With code edits, we have a very strong prior
00:35:24.000 | of what the existing code will look like.
00:35:25.920 | And that prior is literally the same exact code.
00:35:29.600 | So what you can do is you could just feed chunks
00:35:31.560 | of the original code back into the model.
00:35:35.000 | And then the model will just pretty much agree
00:35:37.920 | most of the time that, okay,
00:35:39.200 | I'm just gonna spit this code back out.
00:35:40.760 | And so you can process all of those lines in parallel.
00:35:43.480 | And you just do this with sufficiently many chunks.
00:35:45.320 | And then eventually you'll reach a point of disagreement
00:35:47.720 | where the model will now predict text that is different
00:35:51.200 | from the ground truth original code.
00:35:53.400 | It'll generate those tokens.
00:35:54.640 | And then we kind of will decide after enough tokens
00:35:57.160 | match the original code to restart speculating
00:36:01.160 | in chunks of code.
00:36:02.240 | What this actually ends up looking like
00:36:04.960 | is just a much faster version of normal editing code.
00:36:08.960 | So it looks like a much faster version
00:36:12.120 | of the model rewriting all the code.
00:36:13.680 | So we can use the same exact interface
00:36:16.560 | that we use for diffs,
00:36:19.120 | but it will just stream down a lot faster.
00:36:21.920 | - And then the advantage is that while it's streaming,
00:36:24.600 | you can just also start reviewing the code before it's done.
00:36:28.880 | So there's no big loading screen.
00:36:32.040 | So maybe that is part of the advantage.
00:36:36.440 | - So the human can start reading before the thing is done.
00:36:39.520 | - I think the interesting riff here is something like,
00:36:42.120 | like speculation is a fairly common idea nowadays.
00:36:45.680 | It's like not only in language models.
00:36:47.120 | I mean, there's obviously speculation in CPUs
00:36:49.240 | and there's like speculation for databases
00:36:51.560 | and speculation all over the place.
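
To make the speculative-edits control flow concrete, here is a toy, runnable sketch using a deterministic stand-in model: draft chunks are taken from the original file and accepted in bulk wherever the model agrees, real generation only happens around the edited region, and speculation resumes once the output re-aligns with the original. ToyModel and the realignment heuristic are illustrative assumptions, not Cursor's implementation.

```python
class ToyModel:
    """Stands in for a real LLM: it deterministically 'wants' to emit `target`."""

    def __init__(self, target):
        self.target = list(target)

    def verify(self, pos, draft):
        """How many draft tokens match what the model would emit starting at `pos`?"""
        n = 0
        while (n < len(draft) and pos + n < len(self.target)
               and draft[n] == self.target[pos + n]):
            n += 1
        return n

    def generate_one(self, pos):
        return self.target[pos]


def speculative_edit(model, original, chunk_size=8, realign_window=3):
    out, i = [], 0  # i indexes the original file, which doubles as the draft
    while len(out) < len(model.target):  # toy stop condition; real decoders stop on EOS
        draft = original[i:i + chunk_size]
        accepted = model.verify(len(out), draft) if draft else 0
        out.extend(draft[:accepted])     # bulk-accept everything the model agrees with
        i += accepted
        if len(out) >= len(model.target):
            break
        if draft and accepted == len(draft):
            continue                     # whole chunk matched; keep speculating
        # Disagreement (the edit starts here): fall back to normal decoding.
        out.append(model.generate_one(len(out)))
        # Once the last few emitted tokens reappear verbatim in the original,
        # assume we are past the edited region and resume speculating there.
        tail = out[-realign_window:]
        for j in range(i, len(original) - realign_window + 1):
            if original[j:j + realign_window] == tail:
                i = j + realign_window
                break
    return out


original = list("def add(a,b): return a+b")
edited = speculative_edit(ToyModel("def add(a, b):\n    return a + b"), original)
print("".join(edited))
```
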
00:36:54.680 | - Well, let me ask this sort of the ridiculous question
00:36:56.920 | of which LLM is better at coding.
00:36:59.960 | GPT, Claude, who wins in the context of programming?
00:37:04.160 | And I'm sure the answer is much more nuanced
00:37:05.920 | because it sounds like every single part of this
00:37:08.880 | involves a different model.
00:37:12.080 | - Yeah, I think there's no model
00:37:15.200 | that Pareto dominates others,
00:37:18.960 | meaning it is better in all categories
00:37:21.440 | that we think matter.
00:37:22.760 | The categories being speed,
00:37:27.600 | ability to edit code,
00:37:29.080 | ability to process lots of code, long context,
00:37:32.000 | you know, a couple of other things
00:37:33.080 | and kind of coding capabilities.
00:37:35.560 | The one that I'd say right now is just kind of net best
00:37:38.840 | is Sonnet.
00:37:39.680 | I think this is a consensus opinion.
00:37:41.240 | O1 is really interesting
00:37:42.480 | and it's really good at reasoning.
00:37:44.960 | So if you give it really hard programming interview
00:37:48.920 | style problems or lead code problems,
00:37:51.480 | it can do quite, quite well on them.
00:37:53.280 | But it doesn't feel like it kind of understands
00:37:57.400 | your rough intent as well as Sonnet does.
00:38:00.400 | Like if you look at a lot of the other frontier models,
00:38:05.600 | one qualm I have is it feels like
00:38:08.080 | they're not necessarily over,
00:38:09.560 | I'm not saying they train on benchmarks,
00:38:12.040 | but they perform really well on benchmarks
00:38:14.440 | relative to kind of everything that's kind of in the middle.
00:38:17.800 | So if you try it in all of these benchmarks
00:38:19.240 | and things that are in the distribution of the benchmarks
00:38:21.320 | they're evaluated on, you know, they'll do really well,
00:38:23.360 | but when you push them a little bit outside of that,
00:38:25.680 | Sonnet's I think the one that kind of does best
00:38:28.480 | at kind of maintaining that same capability.
00:38:31.480 | Like you kind of have the same capability in the benchmark
00:38:33.480 | as when you try to instruct it to do anything with coding.
00:38:37.320 | - What, another ridiculous question,
00:38:39.400 | is the difference between the normal programming experience
00:38:42.440 | versus what benchmarks represent.
00:38:44.920 | Like where do benchmarks fall short, do you think,
00:38:47.400 | when we're evaluating these models?
00:38:49.280 | - By the way, that's like a really, really hard,
00:38:51.560 | it's like critically important detail,
00:38:54.400 | like how different like benchmarks are
00:38:56.480 | versus like real coding.
00:38:58.640 | Where real coding, it's not interview style coding,
00:39:03.640 | it's you're doing these, you know,
00:39:06.720 | humans are saying like half broken English sometimes,
00:39:10.400 | and sometimes you're saying like, oh, do what I did before.
00:39:14.640 | Sometimes you're saying, you know, go add this thing
00:39:19.120 | and then do this other thing for me
00:39:20.480 | and then make this UI element.
00:39:21.880 | And then, you know, it's just like a lot of things
00:39:25.760 | are sort of context dependent.
00:39:27.880 | You really want to like understand the human
00:39:30.040 | and then do what the human wants
00:39:31.720 | as opposed to sort of this,
00:39:33.240 | maybe the way to put it is sort of abstractly
00:39:35.560 | is the interview problems are very well specified.
00:39:40.560 | They lean a lot on specification
00:39:45.120 | while the human stuff is less specified.
00:39:49.760 | - Yeah, I think that this benchmark question
00:39:51.960 | is both complicated by what Sualeh just mentioned.
00:39:55.040 | And then also to what Aman was getting into
00:39:59.240 | is that even if you like, you know,
00:40:00.640 | there's this problem of like the skew
00:40:01.800 | between what can you actually model in a benchmark
00:40:03.400 | versus real programming.
00:40:05.680 | And that can be sometimes hard to encapsulate
00:40:07.360 | because it's like real programming is like very messy
00:40:10.240 | and sometimes things aren't super well specified,
00:40:12.760 | what's correct or what isn't.
00:40:14.040 | But then it's also doubly hard
00:40:16.560 | because of this public benchmark problem.
00:40:18.240 | And that's both because public benchmarks
00:40:20.320 | are sometimes kind of hill climbed on,
00:40:21.800 | but then it's like really, really hard
00:40:23.240 | to also get the data from the public benchmarks
00:40:26.280 | out of the models.
00:40:28.200 | And so for instance, like one of the most popular
00:40:31.560 | like agent benchmarks, SWE-bench,
00:40:33.480 | is really, really contaminated
00:40:36.560 | in the training data of these foundation models.
00:40:39.280 | And so if you ask these foundation models
00:40:40.840 | to do a SWE-bench problem,
00:40:42.360 | you actually don't give them the context of a code base.
00:40:44.120 | They can like hallucinate the right file paths.
00:40:45.840 | They can hallucinate the right function names.
00:40:47.800 | And so it's also just the public aspect
00:40:52.040 | of these things is tricky.
00:40:53.360 | - Yeah, like in that case,
00:40:54.520 | it could be trained on the literal issues
00:40:56.920 | or pull requests themselves.
00:40:58.640 | And maybe the labs will start to do a better job
00:41:02.240 | or they've already done a good job
00:41:03.760 | at decontaminating those things,
00:41:05.360 | but they're not going to emit the actual training data
00:41:08.200 | of the repository itself.
00:41:09.760 | Like these are all like some of the most popular
00:41:11.600 | Python repositories, like SymPy is one example.
00:41:14.720 | I don't think they're going to handicap their models
00:41:17.680 | on SymPy and all these popular Python repositories
00:41:20.240 | in order to get true evaluation scores in these benchmarks.
00:41:24.120 | - I think that given the dearth in benchmarks,
00:41:27.280 | there have been like a few interesting crutches
00:41:30.320 | that places that build systems with these models
00:41:32.520 | or build these models actually use
00:41:35.000 | to get a sense of are they going the right direction or not?
00:41:37.120 | And in a lot of places,
00:41:39.400 | people will actually just have humans play with the things
00:41:41.920 | and give qualitative feedback on these.
00:41:44.160 | Like one or two of the foundation model companies,
00:41:45.920 | they have people who, that's a big part of their role.
00:41:49.000 | And internally we also qualitatively assess these models
00:41:53.080 | and actually lean on that a lot
00:41:54.040 | in addition to like private evals that we have.
00:41:56.560 | - It's like the vibe.
00:41:57.880 | - The vibe, yeah.
00:41:58.720 | - It's like the vibe.
00:41:59.560 | - The vibe benchmark, human benchmark.
00:42:02.120 | - Yeah.
00:42:02.960 | - You pull in the humans to do a vibe check.
00:42:05.600 | - Yeah.
00:42:06.440 | - Okay.
00:42:07.280 | I mean, that's kind of what I do,
00:42:08.120 | like just like reading online forums and Reddit and X,
00:42:12.520 | just like, well, I don't know how to properly load
00:42:17.520 | in people's opinions 'cause they'll say things like,
00:42:20.640 | I feel like Claude or GPT's gotten dumber or something.
00:42:25.200 | They'll say, I feel like,
00:42:27.680 | and then I sometimes feel like that too,
00:42:29.860 | but I wonder if it's the model's problem or mine.
00:42:34.000 | - Yeah, with Claude, there's an interesting take I heard
00:42:36.400 | where I think AWS has different chips
00:42:41.560 | and I suspect they have slightly different numerics
00:42:44.520 | than NVIDIA GPUs.
00:42:47.000 | And someone speculated that Claude's degraded performance
00:42:51.320 | had to do with maybe using the quantized version
00:42:54.360 | that existed on AWS Bedrock
00:42:56.040 | versus whatever was running on Anthropic's GPUs.
00:43:00.780 | - I interview a bunch of people that have conspiracy theories,
00:43:03.000 | so I'm glad you spoke to this conspiracy theory.
00:43:05.680 | - Well, it's not like a conspiracy theory as much.
00:43:09.420 | They're just, they're like, they're, you know,
00:43:12.000 | humans are humans and there's these details
00:43:14.520 | and, you know, you're doing like this crazy amount of flops
00:43:19.200 | and that chips are messy and man, you can just have bugs.
00:43:22.600 | Like bugs are, it's hard to overstate
00:43:26.400 | how hard bugs are to avoid.
00:43:27.880 | - What's the role of a good prompt in all of this?
00:43:32.840 | Suali, you mentioned that benchmarks
00:43:34.600 | have really structured, well-formulated prompts.
00:43:39.400 | What should a human be doing to maximize success?
00:43:44.400 | And what's the importance of what the human,
00:43:46.580 | you wrote a blog post on, you called it prompt design.
00:43:50.780 | - Yeah, I think it depends on which model you're using
00:43:55.780 | and all of them are slightly different
00:43:57.380 | and they respond differently to different prompts.
00:44:00.000 | But I think the original GPT-4
00:44:05.000 | and the original sort of pre-developed models last year,
00:44:08.580 | they were quite sensitive to the prompts
00:44:10.560 | and they also had a very small context window.
00:44:13.600 | And so we have all of these pieces of information
00:44:16.600 | around the code base that would maybe be relevant
00:44:19.880 | in the prompt, like you have the docs,
00:44:21.480 | you have the files that you add,
00:44:23.120 | you have the conversation history.
00:44:24.600 | And then there's a problem like,
00:44:26.640 | how do you decide what you actually put in the prompt
00:44:28.800 | and when you have a limited space.
00:44:30.720 | And even for today's models,
00:44:31.880 | even when you have long context,
00:44:33.880 | filling out the entire context window
00:44:35.800 | means that it's slower.
00:44:37.920 | It means that sometimes the model actually gets confused
00:44:40.540 | and some models get more confused than others.
00:44:43.180 | And we have this one system internally
00:44:45.460 | that we call preempt,
00:44:46.700 | which helps us with that a little bit.
00:44:50.040 | And I think it was built for the era before,
00:44:55.040 | when we only had 8,000-token context windows.
00:45:00.620 | And it's a little bit similar
00:45:03.420 | to when you're making a website,
00:45:05.820 | you wanted to work on mobile,
00:45:09.840 | you wanted to work on a desktop screen
00:45:12.200 | and you have this dynamic information,
00:45:16.720 | which you don't have, for example,
00:45:18.100 | if you're designing a print magazine,
00:45:19.920 | you know exactly where you can put stuff.
00:45:22.160 | But when you have a website or when you have a prompt,
00:45:24.240 | you have these inputs
00:45:25.800 | and then you need to format them to always work.
00:45:27.840 | Even if the input is really big,
00:45:29.020 | then you might have to cut something down.
00:45:31.280 | And so the idea was, okay, let's take some inspiration.
00:45:34.460 | What's the best way to design websites?
00:45:37.440 | Well, the thing that we really like is React
00:45:40.960 | and the declarative approach
00:45:42.080 | where you use JSX in JavaScript
00:45:47.080 | and then you declare, this is what I want.
00:45:50.360 | And I think this has higher priority
00:45:53.120 | or like this has higher Z index than something else.
00:45:56.480 | And then you have this rendering engine in web design.
00:46:00.120 | It's like Chrome.
00:46:01.120 | And in our case, it's a preempt renderer,
00:46:04.000 | which then fits everything onto the page.
00:46:07.140 | And so you just declare what you want,
00:46:09.000 | and then it figures out how to render it for you.
00:46:11.760 | And so we have found that to be quite helpful.
00:46:14.540 | And I think the role of it has sort of shifted over time,
00:46:19.080 | where initially it was to fit
00:46:20.180 | to these small context windows.
00:46:21.800 | Now it's really useful because, you know,
00:46:24.180 | it helps us with splitting up the data
00:46:27.760 | that goes into the prompt and the actual rendering of it.
00:46:30.240 | And so it's easier to debug
00:46:33.200 | because you can change the rendering of the prompt
00:46:35.480 | and then try it on old prompts
00:46:37.840 | because you have the raw data that went into the prompt.
00:46:40.320 | And then you can see,
00:46:41.160 | did my change actually improve it
00:46:42.480 | for like this entire eval set?
00:46:45.160 | - So do you literally prompt with JSX?
00:46:48.060 | - Yes. - Yeah.
00:46:49.240 | - So it kind of looks like React.
00:46:50.520 | There are components, like we have one component
00:46:52.320 | that's a file component and it takes in like the cursor,
00:46:57.320 | like usually there's like one line
00:46:59.140 | where the cursor is in your file.
00:47:00.840 | And that's like probably the most important line
00:47:02.280 | because that's the one you're looking at.
00:47:03.560 | And so then you can give priorities.
00:47:04.960 | So like that line has the highest priority
00:47:07.120 | and then you subtract one for every line
00:47:09.640 | that is farther away.
00:47:11.480 | And then eventually when it's rendered,
00:47:13.160 | it'd figure out how many lines can actually fit
00:47:14.960 | and it centers around that thing.
00:47:17.040 | - That's amazing. - Yeah.
00:47:17.880 | - And you can do like other fancy things
00:47:19.780 | where if you have lots of code blocks
00:47:21.780 | from the entire code base,
00:47:22.920 | you could use retrieval and things like embedding
00:47:26.320 | and re-ranking scores to add priorities
00:47:28.960 | for each of these components.
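
To make the declarative-prompt idea concrete, here is a minimal Python sketch (not Cursor's actual preempt system, which is described as using JSX-style components): each chunk declares a priority, lines lose priority the farther they sit from the cursor, and a tiny renderer keeps the highest-priority chunks that fit in a token budget. The names and the whitespace-based token estimate are illustrative assumptions.

```python
# A minimal sketch (not Cursor's preempt) of priority-based prompt rendering:
# components declare priorities, and a renderer greedily keeps the highest-priority
# pieces within a fixed token budget. Token counting is crudely approximated.

from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    priority: float

def file_component(lines, cursor_line, base_priority=100.0):
    """Each line's priority decays by 1 for every line it sits away from the cursor."""
    return [Chunk(text=l, priority=base_priority - abs(i - cursor_line))
            for i, l in enumerate(lines)]

def render(chunks, token_budget):
    """Keep the highest-priority chunks that fit, then restore original order."""
    chosen, used = [], 0
    for idx, chunk in sorted(enumerate(chunks), key=lambda p: -p[1].priority):
        cost = len(chunk.text.split()) + 1  # crude token estimate
        if used + cost <= token_budget:
            chosen.append((idx, chunk))
            used += cost
    return "\n".join(c.text for _, c in sorted(chosen))

# Example: a file plus a retrieved snippet with its own (e.g. reranker-derived) priority.
lines = [f"line {i}: some code" for i in range(200)]
chunks = file_component(lines, cursor_line=120)
chunks.append(Chunk(text="# retrieved doc: API usage example", priority=90.0))
print(render(chunks, token_budget=120))
```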
00:47:30.880 | - So should humans, when they ask questions,
00:47:33.400 | also try to use something like that?
00:47:35.720 | Like would it be beneficial to write JSX in the problem
00:47:39.960 | or the whole idea is you should be loose and messy?
00:47:43.320 | - I think our goal is kind of that
00:47:45.840 | you should just do whatever is the most natural thing
00:47:48.760 | for you.
00:47:49.680 | And then we, our job is to figure out
00:47:52.880 | how do we actually like retrieve the relevant thing
00:47:55.360 | so that your thing actually makes sense.
00:47:56.920 | - Well, this is sort of the discussion I had
00:47:58.840 | with Aravind of Perplexity.
00:48:01.520 | It's like, his whole idea is like,
00:48:03.080 | you should let the person be as lazy as he wants.
00:48:06.120 | - Yeah.
00:48:06.960 | - But like, yeah, that's a beautiful thing.
00:48:10.360 | But I feel like you're allowed to ask more of programmers.
00:48:14.000 | Right? - Yes.
00:48:14.840 | - So like if you say, just do what you want,
00:48:16.760 | I mean, humans are lazy.
00:48:19.080 | There's a kind of tension between just being lazy
00:48:21.520 | versus like provide more as,
00:48:23.360 | be prompted, almost like the system pressuring you
00:48:28.720 | or inspiring you to be articulate.
00:48:32.200 | - Yeah.
00:48:33.040 | - Not in terms of the grammar of the sentences,
00:48:34.400 | but in terms of the depth of thoughts
00:48:36.280 | that you convey inside the prompts.
00:48:38.960 | - I think even as a system gets closer
00:48:40.760 | to some level of perfection,
00:48:43.320 | often when you ask the model for something,
00:48:47.120 | you just are not, not enough intent is conveyed
00:48:50.520 | to know what to do.
00:48:51.880 | And there are like a few ways to resolve that intent.
00:48:54.400 | One is the simple thing of having the model just ask you,
00:48:58.040 | I'm not sure how to do these parts based on your query.
00:49:01.440 | Could you clarify that?
00:49:02.800 | I think the other could be,
00:49:06.200 | maybe if you, there are five or six possible generations,
00:49:11.200 | given the uncertainty present in your query so far,
00:49:14.680 | why don't we just actually show you all of those
00:49:16.480 | and let you pick them?
00:49:17.600 | - How hard is it to, for the model to choose to speak,
00:49:22.520 | talk back?
00:49:23.360 | Sort of versus, that's hard.
00:49:27.160 | It's sort of like how to deal with the uncertainty.
00:49:30.080 | Do I choose to ask for more information
00:49:34.200 | to reduce the ambiguity?
00:49:36.080 | - So, I mean, one of the things we do is,
00:49:39.440 | it's like a recent addition is,
00:49:41.520 | try to suggest files that you can add.
00:49:44.360 | So, and while you're typing,
00:49:46.720 | one can guess what the uncertainty is
00:49:50.720 | and maybe suggest that like,
00:49:53.840 | maybe you're writing your API
00:49:56.200 | and we can guess using the commits
00:50:02.200 | that you've made previously in the same file
00:50:05.640 | that the client and the server are super useful.
00:50:09.280 | And there's like a hard technical problem
00:50:12.720 | of how do you resolve it across all commits?
00:50:15.360 | Which files are the most important
00:50:16.920 | given your current prompt?
00:50:19.160 | And we're still sort of, initial version is rolled out
00:50:23.000 | and I'm sure we can make it much more accurate.
00:50:25.280 | It's very experimental,
00:50:28.000 | but then the idea is we show you like,
00:50:29.880 | do you just want to add this file, this file, this file also
00:50:33.000 | to tell the model to edit those files for you?
00:50:36.280 | Because if you're, maybe you're making the API,
00:50:38.960 | like you should also edit the client that is using the API
00:50:41.160 | and the server that is resolving the API.
00:50:44.200 | And so that'll be kind of cool as,
00:50:46.240 | both there's the phase where you're writing a prompt
00:50:49.320 | and there's before you even click enter,
00:50:52.000 | maybe we can help resolve some of the uncertainty.
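
One simple way to approximate the "which files matter for this change" signal described above is co-change counts from git history. This is only a sketch of the general idea, not Cursor's actual system, and `co_changed_files` is a hypothetical helper name.

```python
# A simple co-change heuristic (a sketch, not Cursor's system): rank files by how
# often they appeared in the same commits as the file currently being edited.

import subprocess
from collections import Counter

def co_changed_files(repo_path, current_file, max_commits=500, top_k=5):
    # Commits that touched the current file.
    commits = subprocess.run(
        ["git", "-C", repo_path, "log", f"-n{max_commits}",
         "--pretty=format:%H", "--", current_file],
        capture_output=True, text=True, check=True,
    ).stdout.split()

    counts = Counter()
    for sha in commits:
        # Files touched by each of those commits.
        files = subprocess.run(
            ["git", "-C", repo_path, "show", "--name-only", "--pretty=format:", sha],
            capture_output=True, text=True, check=True,
        ).stdout.split("\n")
        counts.update(f for f in files if f and f != current_file)
    return counts.most_common(top_k)

# e.g. co_changed_files(".", "server/api.py") might surface a client/api_client.ts
```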
00:50:54.360 | - To what degree do you use agentic approaches?
00:50:57.080 | How do you use agents?
00:50:59.080 | - We think agents are really, really cool.
00:51:02.480 | Like, I think agents is like,
00:51:05.040 | it's like, it resembles sort of like a human,
00:51:07.960 | it's sort of like the things,
00:51:09.280 | like you can kind of feel that it,
00:51:11.080 | like you're getting closer to AGI
00:51:12.600 | because you see a demo where it acts as a human would.
00:51:17.600 | And it's really, really cool.
00:51:19.720 | I think agents are not yet super useful for many things.
00:51:24.720 | They, I think we're getting close
00:51:28.720 | to where they will actually be useful.
00:51:30.520 | And so I think there are certain types of tasks
00:51:35.520 | where having an agent would be really nice.
00:51:39.600 | Like I would love to have an agent.
00:51:40.720 | For example, if like we have a bug
00:51:42.120 | where you sometimes can't Command-C
00:51:44.760 | and Command-V inside our chat input box.
00:51:48.720 | And that's a task that's super well specified.
00:51:50.840 | I just want to say like in two sentences,
00:51:52.640 | this does not work, please fix it.
00:51:54.360 | And then I would love to have an agent
00:51:56.240 | that just goes off, does it.
00:51:58.320 | And then a day later I come back and I reviewed the thing.
00:52:02.800 | - You mean it goes, finds the right file.
00:52:05.480 | - Yeah, it finds the right files.
00:52:06.920 | It like tries to reproduce the bug.
00:52:08.800 | It like fixes the bug
00:52:10.160 | and then it verifies that it's correct.
00:52:11.720 | And this could be a process that takes a long time.
00:52:14.800 | And so I think I would love to have that.
00:52:17.560 | And then I think a lot of programming,
00:52:20.560 | like there is often this belief
00:52:22.000 | that agents will take over all of programming.
00:52:24.760 | I don't think we think that that's the case
00:52:28.360 | because a lot of programming,
00:52:29.800 | a lot of the value is in iterating
00:52:32.480 | or you don't actually want to specify something upfront
00:52:35.440 | because you don't really know what you want
00:52:37.320 | until you've seen an initial version
00:52:39.080 | and then you want to iterate on that.
00:52:41.120 | And then you provide more information.
00:52:43.040 | And so for a lot of programming,
00:52:44.800 | I think you actually want a system that's instant
00:52:47.280 | that gives you an initial version instantly back
00:52:49.120 | and then you can iterate super, super quickly.
00:52:52.160 | - What about something like
00:52:53.920 | that recently came out, Replit Agent,
00:52:56.320 | that does also like setting up the development environment,
00:52:59.800 | installing software packages, configuring everything,
00:53:02.320 | configuring the databases and actually deploying the app?
00:53:05.680 | - Yeah.
00:53:06.520 | - Is that also in the set of things you dream about?
00:53:09.880 | - I think so.
00:53:10.720 | I think that would be really cool
00:53:11.640 | for certain types of programming.
00:53:14.160 | It would be really cool.
00:53:15.200 | - Is that within scope of Cursor?
00:53:17.760 | - Yeah.
00:53:18.600 | We aren't actively working on it right now,
00:53:20.800 | but it's definitely like,
00:53:22.840 | we want to make the programmer's life easier and more fun.
00:53:27.680 | And some things are just really tedious
00:53:30.040 | and you need to go through a bunch of steps
00:53:31.480 | and you want to delegate that to an agent.
00:53:34.120 | And then some things,
00:53:35.040 | you can actually have an agent in the background
00:53:36.840 | while you're working.
00:53:37.840 | Like, let's say you have a PR
00:53:39.520 | that's both backend and frontend
00:53:41.520 | and you're working in the frontend
00:53:42.480 | and then you can have a background agent
00:53:44.080 | that does some work and figure out
00:53:46.000 | kind of what you're doing.
00:53:47.040 | And then when you get to the backend part of your PR,
00:53:50.120 | then you have some like initial piece of code
00:53:52.840 | that you can iterate on.
00:53:54.120 | And so that would also be really cool.
00:53:58.520 | - One of the things we already talked about is speed,
00:54:01.400 | but I wonder if we can just linger on that some more
00:54:04.600 | in the various places that the technical details involved
00:54:09.600 | in making this thing really fast.
00:54:11.720 | So every single aspect of Cursor,
00:54:14.200 | most aspects of Cursor, feel really fast.
00:54:16.360 | Like I mentioned, the apply is probably the slowest thing.
00:54:18.480 | And for me, I'm sorry, the pain on Arvid's face.
00:54:21.800 | - I know, it's the pain.
00:54:23.000 | It's the pain that we're feeling
00:54:24.120 | and we're working on fixing it.
00:54:26.000 | - Yeah, I mean, it says something that feels,
00:54:30.200 | I don't know what it is,
00:54:31.040 | like one second or two seconds, that feels slow.
00:54:33.960 | That actually shows
00:54:36.800 | that everything else is just really, really fast.
00:54:39.640 | So is there some technical details
00:54:40.960 | about how to make some of these models,
00:54:42.840 | how to make the chat fast, how to make the diffs fast?
00:54:47.320 | Is there something that just jumps to mind?
00:54:49.120 | - Yeah, I mean, so we can go over
00:54:50.480 | a lot of the strategies that we use.
00:54:51.800 | One interesting thing is cache warming.
00:54:53.880 | And so what you can do is if, as the user's typing,
00:54:59.640 | you can have, you're probably going to use
00:55:03.600 | some piece of context and you can know that
00:55:05.880 | before the user's done typing.
00:55:07.440 | So, you know, as we discussed before,
00:55:10.600 | reusing the KVCache results in lower latency,
00:55:13.440 | lower costs across requests.
00:55:15.840 | So as the user starts typing,
00:55:17.000 | you can immediately warm the cache with like,
00:55:19.200 | let's say the current file contents.
00:55:21.160 | And then when they've pressed enter,
00:55:23.040 | there's very few tokens it actually has to pre-fill
00:55:27.040 | and compute before starting the generation.
00:55:28.680 | This will significantly lower TTFT, the time to first token.
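
A minimal sketch of the cache-warming idea, assuming nothing about Cursor's real serving stack: `prefill` is a stand-in for whatever call actually populates the KV cache, and the point is just that the known prefix (the file contents) can be prefetched while the user is still typing.

```python
# Sketch of cache warming (illustrative only; `prefill` is a stand-in for whatever
# the inference server exposes): start prefilling the likely prompt prefix while the
# user is still typing, so on Enter only the last few tokens need to be computed.

prefix_cache = {}  # prefix text -> opaque KV-cache handle

def prefill(text):
    return f"kv-cache for {len(text)} chars"  # stand-in for a real prefill call

def on_user_typing(current_file_contents):
    # Warm the cache with context we already know will be in the prompt.
    prefix_cache.setdefault(current_file_contents, prefill(current_file_contents))

def on_submit(current_file_contents, user_message):
    cache = prefix_cache.get(current_file_contents) or prefill(current_file_contents)
    # Only the short user message still needs prefill before generation starts.
    return cache, user_message
```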
00:55:30.640 | - Can you explain how the KV cache works?
00:55:33.000 | - Yeah, so the way transformers work,
00:55:35.880 | (laughing)
00:55:37.720 | - I like it.
00:55:38.560 | - I mean, like one of the mechanisms
00:55:41.920 | that allow transformers to not just independently,
00:55:45.240 | like the mechanism that allows transformers
00:55:46.840 | to not just independently look at each token,
00:55:48.480 | but see previous tokens, are the keys and values in attention.
00:55:52.520 | And generally the way attention works
00:55:54.480 | is you have at your current token, some query,
00:55:58.480 | and then you've all the keys and values
00:56:00.440 | of all your previous tokens,
00:56:01.520 | which are some kind of representation
00:56:04.000 | that the model stores internally of all the previous tokens
00:56:06.720 | in the prompt.
00:56:08.000 | And like by default, when you're doing a chat,
00:56:12.880 | the model has to, for every single token,
00:56:15.800 | do this forward pass through the entire model.
00:56:19.320 | That's a lot of matrix multiplies that happen.
00:56:21.440 | And that is really, really slow.
00:56:23.600 | Instead, if you have already done that
00:56:26.120 | and you stored the keys and values
00:56:28.080 | and you keep that in the GPU,
00:56:30.400 | then, let's say I have that stored
00:56:32.440 | for the last N tokens, if I now want to compute
00:56:35.480 | the output token for the N plus one token,
00:56:38.640 | I don't need to pass those first N tokens
00:56:42.040 | through the entire model because I already have
00:56:44.600 | all those keys and values.
00:56:46.400 | And so you just need to do the forward pass
00:56:48.280 | through that last token.
00:56:49.960 | And then when you're doing attention,
00:56:52.120 | you're reusing those keys and values that have been computed,
00:56:54.840 | which is the only kind of sequential part
00:56:57.680 | or sequentially dependent part of the transformer.
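
Here is a toy, single-head version of that mechanism in Python with random weights: once the keys and values for earlier tokens are cached, decoding the next token only requires projecting that one token and attending over the cache.

```python
# Toy single-head attention with a KV cache (illustrative, random weights): decoding
# the (N+1)-th token only projects that one token and attends over cached keys/values.

import numpy as np

d = 16
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def decode_step(self, x_new):
        # Only the newest token goes through the projections...
        q, k, v = x_new @ Wq, x_new @ Wk, x_new @ Wv
        self.keys.append(k)
        self.values.append(v)
        K, V = np.stack(self.keys), np.stack(self.values)
        # ...and attention reuses every previously cached key and value.
        attn = softmax(q @ K.T / np.sqrt(d))
        return attn @ V

cache = KVCache()
for t in range(5):                 # process a 5-token "prompt" one token at a time
    out = cache.decode_step(rng.standard_normal(d))
print(out.shape, len(cache.keys))  # (16,) 5
```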
00:56:59.920 | - Is there like higher level caching
00:57:02.040 | or like caching of the prompts
00:57:04.280 | or that kind of stuff that could help?
00:57:06.560 | - Yeah, there's other types of caching you can kind of do.
00:57:10.560 | One interesting thing that you can do for Cursor Tab
00:57:15.920 | is you can basically predict ahead
00:57:20.600 | as if the user would have accepted the suggestion
00:57:23.400 | and then trigger another request.
00:57:26.680 | And so then you've cached, you've done a speculative,
00:57:29.160 | it's a mix of speculation and caching, right?
00:57:31.080 | 'Cause you're speculating what would happen
00:57:32.840 | if they accepted it.
00:57:34.040 | And then you have this value that is cached,
00:57:36.880 | this suggestion.
00:57:38.560 | And then when they press tab,
00:57:39.600 | the next one would be waiting for them immediately.
00:57:41.880 | It's a kind of clever heuristic slash trick
00:57:44.600 | that uses a higher level caching
00:57:47.280 | and can make it feel fast
00:57:51.280 | despite there not actually being any changes
00:57:53.360 | in the model.
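
A sketch of that speculate-on-accept trick (illustrative only; `suggest` is a stand-in for the real model call): as soon as a suggestion is shown, the follow-up suggestion is precomputed as if the user had already accepted it, keyed by the document state that acceptance would produce.

```python
# Sketch of speculative caching for tab suggestions: once a suggestion is shown,
# precompute the next suggestion as if the user had accepted it, so pressing Tab
# again feels instant. `suggest` stands in for the real model request.

speculation_cache = {}  # document text -> precomputed next suggestion

def suggest(document):
    return " + next_edit"  # stand-in for a model call

def show_suggestion(document):
    suggestion = speculation_cache.pop(document, None) or suggest(document)
    # Speculate: if the user accepts, the document becomes document + suggestion,
    # so compute and cache the next suggestion for that state right away.
    accepted_state = document + suggestion
    speculation_cache[accepted_state] = suggest(accepted_state)
    return suggestion

doc = "x = load()"
s1 = show_suggestion(doc)          # model call
s2 = show_suggestion(doc + s1)     # served from the speculation cache
```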
00:57:54.200 | - And if you can make the KV cache smaller,
00:57:56.320 | one of the advantages you get is like,
00:57:58.200 | maybe you can speculate even more.
00:57:59.920 | Maybe you can guess here's the 10 things
00:58:01.960 | that could be useful.
00:58:03.640 | I don't know, like you predict the next 10
00:58:06.720 | and then like it's possible the user hits
00:58:09.280 | the one of the 10.
00:58:10.840 | It's like much higher chance than the user hits
00:58:12.600 | like the exact one that you showed them.
00:58:14.800 | Maybe they type in other character
00:58:16.840 | and we sort of hit something else in the cache.
00:58:19.080 | So there's all these tricks where,
00:58:20.880 | the general phenomena here is,
00:58:25.320 | I think it's also super useful for RL is,
00:58:28.560 | maybe a single sample from the model isn't very good.
00:58:33.240 | But if you predict like 10 different things,
00:58:37.920 | it turns out that one of the 10 that's right
00:58:40.720 | is the probability is much higher.
00:58:42.520 | So there's these pass@k curves
00:58:44.560 | and part of RL, like what RL does is,
00:58:47.760 | you can exploit this pass@k phenomenon
00:58:51.200 | to make many different predictions
00:58:53.400 | and one way to think about this,
00:58:57.240 | the model sort of knows internally has like,
00:58:59.560 | has some uncertainty over like,
00:59:01.280 | which of the k things is correct
00:59:03.040 | or like which of the k things does the human want.
00:59:05.440 | When we RL our cursor tab model,
00:59:09.480 | one of the things we're doing is,
00:59:11.920 | we're predicting which of the hundred different suggestions
00:59:16.920 | the model produces is more amenable for humans.
00:59:20.480 | Like which of them do humans more like than other things.
00:59:23.560 | Maybe like there's something where the model
00:59:26.400 | can predict very far ahead versus like a little bit
00:59:28.600 | and maybe somewhere in the middle
00:59:30.080 | and then you can give a reward to the things
00:59:34.280 | that humans would like more
00:59:35.400 | and sort of punish the things that humans wouldn't like
00:59:37.840 | and sort of then train the model
00:59:39.040 | to output the suggestions that humans would like more.
00:59:40.920 | You have these like RL loops that are very useful
00:59:43.200 | that exploit these pass@k curves.
00:59:45.840 | Aman maybe can go into even more detail.
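
The curve being referenced is usually computed with the standard unbiased pass@k estimator (popularized by the Codex paper): given n samples of which c pass, it is the probability that at least one of k randomly drawn samples is correct.

```python
# The standard unbiased pass@k estimator: with n samples per problem of which c
# passed, the chance that at least one of k randomly drawn samples is correct.
# This is why pass@10 can be much higher than pass@1 even when single samples
# are mediocre.

from math import comb

def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 100 samples, 20 correct: a single sample passes 20% of the time,
# but at least one of 10 samples passes roughly 90% of the time.
print(pass_at_k(100, 20, 1))   # 0.2
print(pass_at_k(100, 20, 10))  # ~0.905
```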
00:59:48.640 | - Yeah, it is a little different than speed.
00:59:50.840 | But I mean, like technically you tie it back in
00:59:55.880 | because you can get away with the smaller model
00:59:57.280 | if you RL your smaller model
00:59:58.920 | and it gets the same performances as the bigger one.
01:00:01.480 | That's like, and Suali was mentioning stuff about the KV cache,
01:00:07.200 | about reducing the size of your KV cache.
01:00:08.800 | There are other techniques there as well
01:00:10.120 | that are really helpful for speed.
01:00:11.800 | So kind of back in the day,
01:00:15.080 | like all the way two years ago,
01:00:17.640 | people mainly use multi-head attention.
01:00:20.120 | And I think there's been a migration
01:00:21.480 | towards more efficient attention schemes
01:00:24.320 | like group query or multi-query attention.
01:00:28.720 | And this is really helpful for then
01:00:31.640 | with larger batch sizes,
01:00:33.600 | being able to generate the tokens much faster.
01:00:36.640 | The interesting thing here is
01:00:38.200 | this now has no effect on that
01:00:42.120 | time to first token pre-fill speed.
01:00:45.280 | The thing this matters for is now generating tokens.
01:00:48.880 | And why is that?
01:00:49.880 | 'Cause when you're generating tokens,
01:00:51.760 | instead of being bottlenecked
01:00:54.560 | by doing these super parallelizable matrix multiplies
01:00:57.680 | across all your tokens,
01:00:59.040 | you're bottlenecked, for long context
01:01:02.400 | with large batch sizes,
01:01:04.200 | by how quickly you can read those cached keys and values.
01:01:07.200 | And so then that's memory bandwidth
01:01:10.760 | and how can we make this faster?
01:01:12.560 | We can try to compress the size of these keys and values.
01:01:15.280 | So multi-query attention is the most aggressive of these,
01:01:18.280 | where normally with multi-head attention,
01:01:20.880 | you have some number of key-value heads
01:01:24.320 | and some number of kind of query heads.
01:01:29.320 | Multi-query just preserves the query heads,
01:01:32.200 | gets rid of all the key value heads.
01:01:34.480 | So there's only one kind of key value head
01:01:37.920 | and there's all the remaining query heads.
01:01:41.320 | With group query, you instead preserve all the query heads
01:01:46.320 | and then your keys and values are kind of...
01:01:51.960 | There are fewer heads for the keys and values,
01:01:53.680 | but you're not reducing it to just one.
01:01:56.040 | But anyways, the whole point here
01:01:57.280 | is you're just reducing the size of your KV cache.
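
A toy comparison of what gets cached per token under the three schemes; all the dimensions here are made up, and only the number of key/value heads changes.

```python
# Per-token KV-cache entries under the three attention schemes (a toy comparison;
# the dimensions are made up). Queries always keep n_q heads; only cached K/V shrink.

n_q_heads, head_dim, n_layers = 32, 128, 48

def cached_floats_per_token(n_kv_heads):
    # 2 tensors (K and V) x layers x kv-heads x head dimension
    return 2 * n_layers * n_kv_heads * head_dim

schemes = {
    "multi-head (MHA)":    n_q_heads,  # one K/V head per query head
    "grouped-query (GQA)": 8,          # a few K/V heads shared by groups of query heads
    "multi-query (MQA)":   1,          # a single K/V head shared by all query heads
}
for name, n_kv in schemes.items():
    print(f"{name:22s} {cached_floats_per_token(n_kv):>10,d} floats/token")
```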
01:02:00.480 | - And then there is MLA.
01:02:02.480 | - Yeah, multi-head latent attention.
01:02:04.200 | That's a little more complicated.
01:02:06.040 | And the way that this works
01:02:07.800 | is it kind of turns the entirety of your keys and values
01:02:12.280 | across all your heads into this kind of one latent vector
01:02:16.760 | that is then kind of expanded inference time.
01:02:19.760 | - But MLA is from this company called DeepSeek.
01:02:23.960 | It's quite an interesting algorithm.
01:02:26.280 | Maybe the key idea is sort of,
01:02:27.920 | in both MQA and in other places,
01:02:32.480 | what you're doing is you're sort of reducing
01:02:35.920 | like the number of KV heads.
01:02:38.720 | The advantage you get from that is there's less of them,
01:02:43.480 | but maybe the theory is that you actually want
01:02:47.680 | a lot of different,
01:02:48.840 | like you want each of the keys and values
01:02:51.840 | to actually be different.
01:02:52.840 | So one way to reduce the size is you keep
01:02:55.880 | one big shared vector for all the keys and values,
01:03:01.360 | and then you have smaller vectors for every single token
01:03:03.880 | so that you can store only the smaller thing
01:03:07.560 | as some sort of like low-rank reduction.
01:03:10.080 | And with the low-rank reduction,
01:03:11.480 | at the end,
01:03:12.800 | when you eventually want to compute the final thing,
01:03:16.080 | remember that you're memory bound,
01:03:17.600 | which means that you still have some compute left
01:03:20.320 | that you can use for these things.
01:03:21.400 | And so if you can expand the latent vector back out,
01:03:26.400 | and somehow this is far more efficient
01:03:29.920 | because you're reducing, for example,
01:03:33.120 | maybe like reducing it by 32 or something,
01:03:36.240 | like the size of the vector that you're keeping.
01:03:37.960 | - Yeah, there's perhaps some richness
01:03:39.920 | in having a separate set of keys and values
01:03:43.880 | and query that kind of pairwise match up
01:03:45.960 | versus compressing that all into one,
01:03:47.760 | and that interaction at least.
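
A conceptual toy of the latent-compression trick (it ignores real MLA details such as DeepSeek's handling of rotary embeddings): only a small per-token latent is cached, and it is expanded back into full keys and values at attention time, trading spare compute for memory bandwidth.

```python
# Conceptual toy of multi-head latent attention's cache trick (illustrative shapes):
# cache only a small per-token latent and expand it back into full keys and values
# when attention is computed.

import numpy as np

d_model, d_latent, n_heads, head_dim = 1024, 64, 16, 64
rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.02    # compress hidden -> latent
W_up_k = rng.standard_normal((d_latent, n_heads * head_dim)) * 0.02
W_up_v = rng.standard_normal((d_latent, n_heads * head_dim)) * 0.02

hidden = rng.standard_normal(d_model)        # one token's hidden state
latent = hidden @ W_down                     # this 64-dim vector is all we cache
k = (latent @ W_up_k).reshape(n_heads, head_dim)   # expanded at attention time
v = (latent @ W_up_v).reshape(n_heads, head_dim)

full_cache = 2 * n_heads * head_dim          # floats/token if we cached full K and V
print(d_latent, "cached floats/token vs", full_cache, "for full K/V")  # 64 vs 2048
```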
01:03:51.320 | - Okay, and all of that is dealing with being memory bound.
01:03:55.400 | - Yeah.
01:03:56.240 | - And what, I mean, ultimately,
01:03:59.000 | how does that map to the user experience?
01:04:01.400 | Trying to get the-
01:04:02.240 | - Yeah, the two things that it maps to
01:04:03.880 | is you can now make your cache a lot larger
01:04:06.640 | because you have less space allocated for the KV cache.
01:04:09.600 | You can maybe cache a lot more aggressively
01:04:11.240 | in a lot more things.
01:04:12.400 | So you get more cache hits,
01:04:14.120 | which are helpful for reducing the time to first token
01:04:17.280 | for the reasons that were kind of described earlier.
01:04:19.520 | And then the second being,
01:04:20.680 | when you start doing inference with more and more requests
01:04:24.160 | and larger and larger batch sizes,
01:04:25.920 | you don't see much of a slowdown
01:04:28.400 | in as it's generating the tokens, the speed of that.
01:04:31.720 | - Would it also allow you to make your prompt bigger
01:04:33.960 | for certain-
01:04:34.800 | - Yeah, yeah.
01:04:35.640 | So like the basic, the size of your KV cache
01:04:38.280 | is both the size of all your prompts
01:04:41.280 | multiplied by the number of prompts
01:04:42.480 | being processed in parallel.
01:04:43.600 | So you could increase either of those dimensions, right?
01:04:45.800 | The batch size or the size of your prompts
01:04:48.120 | without degrading the latency of generating tokens.
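
A back-of-the-envelope illustration with made-up model dimensions: because the cache scales with prompt length times batch size, shrinking the per-token cache can be spent on either longer prompts or larger batches at the same memory budget.

```python
# Back-of-the-envelope (made-up numbers): the KV cache scales with prompt length x
# batch size, so a smaller per-token cache can buy longer prompts or bigger batches
# at the same memory budget.

def kv_cache_gb(bytes_per_token, prompt_tokens, batch_size):
    return bytes_per_token * prompt_tokens * batch_size / 1e9

mha_bytes = 2 * 48 * 32 * 128 * 2     # K+V, 48 layers, 32 kv-heads, head dim 128, fp16
gqa_bytes = 2 * 48 * 8 * 128 * 2      # the same model with 8 kv-heads

print(kv_cache_gb(mha_bytes, 8_000, 32))   # ~200 GB
print(kv_cache_gb(gqa_bytes, 8_000, 32))   # ~50 GB
print(kv_cache_gb(gqa_bytes, 32_000, 32))  # 4x the context for the same ~200 GB
```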
01:04:51.920 | - Arvid, you wrote a blog post, Shadow Workspace.
01:04:54.360 | - Yeah.
01:04:55.200 | - Iterating on code in the background.
01:04:56.520 | - Yeah.
01:04:57.360 | - So what's going on?
01:04:58.520 | - So to be clear,
01:04:59.680 | we want there to be a lot of stuff happening
01:05:02.840 | in the background
01:05:03.680 | and we're experimenting with a lot of things.
01:05:05.760 | Right now, we don't have much of that happening
01:05:09.120 | other than like the cache warming
01:05:10.960 | or like figuring out the right context
01:05:13.920 | that goes into your Command-K prompts, for example.
01:05:16.520 | But the idea is,
01:05:17.800 | if you can actually spend computation in the background,
01:05:20.320 | then you can help the user
01:05:24.760 | maybe like at a slightly longer time horizon
01:05:27.840 | than just predicting the next few lines
01:05:30.120 | that you're gonna make.
01:05:30.960 | But actually like in the next 10 minutes,
01:05:32.880 | what are you going to make?
01:05:34.040 | And by doing it in the background,
01:05:35.680 | you can spend more computation doing that.
01:05:38.760 | And so the idea of the Shadow Workspace
01:05:41.560 | that we implemented
01:05:42.680 | and we use it internally for like experiments
01:05:45.760 | is that to actually get advantage
01:05:49.280 | of doing stuff in the background,
01:05:50.880 | you want some kind of feedback signal
01:05:53.440 | to give back to the model.
01:05:54.840 | Because otherwise, like you can get higher performance
01:05:57.480 | by just letting the model think for longer.
01:06:00.760 | And so like O1 is a good example of that.
01:06:03.000 | But another way you can improve performance
01:06:04.800 | is by letting the model iterate and get feedback.
01:06:08.040 | And so one very important piece of feedback
01:06:11.200 | when you're a programmer is the language server,
01:06:15.000 | which is this thing,
01:06:16.960 | it exists for most different languages
01:06:20.080 | and there's like a separate language server per language.
01:06:22.640 | And it can tell you, you know,
01:06:24.680 | you're using the wrong type here
01:06:26.120 | and then it gives you an error
01:06:27.320 | or it can allow you to go to definition
01:06:29.400 | and sort of understands the structure of your code.
01:06:31.880 | So language servers are extensions developed by,
01:06:34.920 | like there's a TypeScript language server
01:06:36.280 | that was developed by the TypeScript people,
01:06:38.160 | a Rust language server that was developed by the Rust people,
01:06:40.120 | and then they all interface
01:06:41.480 | over the language server protocol to VS Code.
01:06:43.480 | So that VS Code doesn't need to have
01:06:45.120 | all of the different languages built into VS Code,
01:06:47.480 | but rather you can use
01:06:49.000 | the existing compiler infrastructure.
01:06:50.880 | - For linting purposes, what?
01:06:52.880 | - It's for linting, it's for going to definition
01:06:56.000 | and for like seeing the right types that you're using.
01:06:59.600 | - So it's doing like type checking also?
01:07:01.400 | - Yes, type checking and going to references.
01:07:03.960 | And that's like, when you're working in a big project,
01:07:07.040 | you kind of need that.
01:07:08.440 | If you don't have that,
01:07:09.600 | it's like really hard to code in a big project.
01:07:12.720 | - Can you say again how that's being used inside Cursor,
01:07:16.320 | the language server protocol communication thing?
01:07:20.440 | - So it's being used in Cursor to show to the programmer,
01:07:22.520 | just like in VS Code.
01:07:23.680 | But then the idea is you want to show that same information
01:07:26.840 | to the models, the AI models.
01:07:30.040 | And you want to do that in a way
01:07:31.840 | that doesn't affect the user,
01:07:33.400 | because you want to do it in background.
01:07:34.800 | And so the idea behind the shadow workspace was,
01:07:38.000 | okay, like one way we can do this
01:07:40.040 | is we spawn a separate window of Cursor that's hidden.
01:07:45.040 | And so you can set this flag in Electron that it's hidden.
01:07:48.920 | There is a window, but you don't actually see it.
01:07:50.720 | And inside of this window,
01:07:52.680 | the AI agents can modify code however they want,
01:07:56.000 | as long as they don't save it,
01:07:57.080 | because it's still the same folder,
01:07:59.280 | and then can get feedback from the linters
01:08:01.720 | and go to definition and iterate on their code.
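
A toy version of that iterate-with-feedback loop, not the shadow workspace itself: Python's built-in compile() stands in for a real language server's diagnostics, and `propose_fix` is a hypothetical stand-in for a model call.

```python
# Toy version of the iterate-with-feedback idea: an "agent" proposes code, gets
# diagnostics without ever writing to disk, and retries. compile() stands in for
# a real language server; propose_fix stands in for a model call.

def diagnostics(source, filename="<shadow>"):
    try:
        compile(source, filename, "exec")
        return []
    except SyntaxError as e:
        return [f"{filename}:{e.lineno}: {e.msg}"]

def propose_fix(source, errors):
    # Stand-in for asking the model to fix the reported errors.
    return source.replace("retrun", "return")

draft = "def add(a, b):\n    retrun a + b\n"
for _ in range(3):
    errs = diagnostics(draft)
    if not errs:
        break
    draft = propose_fix(draft, errs)
print(draft)
```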
01:08:04.080 | - So like literally run everything in the background,
01:08:06.800 | like as if, right?
01:08:08.560 | - Yeah.
01:08:09.400 | - Maybe even run the code?
01:08:10.840 | - So that's the eventual version.
01:08:13.280 | That's what you want.
01:08:14.120 | And a lot of the blog post is actually about
01:08:16.280 | how do you make that happen?
01:08:19.120 | Because it's a little bit tricky.
01:08:20.760 | You want it to be on the user's machine
01:08:22.280 | so that it exactly mirrors the user's environment.
01:08:25.880 | And then on Linux, you can do this cool thing
01:08:29.080 | where you can actually mirror the file system
01:08:31.400 | and have the AI make changes to the files.
01:08:35.360 | And it thinks that it's operating on the file level,
01:08:38.680 | but actually that's stored in memory
01:08:42.840 | and you can create this kernel extension to make it work.
01:08:47.200 | Whereas on Mac and Windows,
01:08:49.840 | it's a little bit more difficult,
01:08:51.920 | but it's a fun technical problem, so that's why.
01:08:57.400 | - One maybe hacky, but interesting idea that I like
01:08:59.640 | is holding a lock on saving.
01:09:02.200 | And so basically you can then have the language model
01:09:04.720 | kind of hold the lock on saving to disk.
01:09:07.360 | And then instead of you operating in the ground truth
01:09:09.960 | version of the files that are saved to disk,
01:09:12.240 | you actually are operating
01:09:13.320 | what was the shadow workspace before
01:09:14.800 | and these unsaved things that only exist in memory
01:09:16.600 | that you still get linter errors for and you can code in.
01:09:19.120 | And then when you try to maybe run code,
01:09:21.320 | it's just like, there's a small warning that there's a lock
01:09:23.960 | and then you kind of will take back the lock
01:09:25.800 | from the language server
01:09:28.560 | or from the shadow workspace
01:09:29.800 | if you're trying to do things concurrently.
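
A minimal sketch of the save-lock idea, under the assumption that edits live in in-memory buffers: the agent only edits while it holds the lock, and a user-initiated save reclaims it so the on-disk ground truth is never touched concurrently.

```python
# Toy sketch of the "hold a lock on saving" idea: the agent edits in-memory buffers
# while holding a save lock; a user-initiated save reclaims the lock so the on-disk
# ground truth is never written to concurrently.

import threading

save_lock = threading.Lock()
in_memory_buffers = {}   # path -> unsaved contents the agent is iterating on

def agent_edit(path, new_contents):
    with save_lock:                        # agent works only while it holds the lock
        in_memory_buffers[path] = new_contents

def user_save(path, contents):
    with save_lock:                        # a user save takes the lock back
        in_memory_buffers.pop(path, None)  # the agent's unsaved view is now stale
        with open(path, "w") as f:
            f.write(contents)
```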
01:09:31.720 | - That's such an exciting future, by the way.
01:09:33.880 | It's a bit of a tangent,
01:09:34.840 | but like to allow a model to change files,
01:09:38.400 | it's scary for people, but like, it's really cool
01:09:42.120 | to be able to just like let the agent do a set of tasks
01:09:46.000 | and you come back the next day and kind of observe
01:09:49.680 | like it's a colleague or something like that.
01:09:51.920 | - Yeah, and I think there may be different versions
01:09:53.960 | of like runnability where for the simple things
01:09:57.560 | where you're doing things in the span of a few minutes
01:09:59.960 | on behalf of the user as they're programming,
01:10:02.040 | it makes sense to make something work locally
01:10:04.840 | in their machine.
01:10:05.800 | I think for the more aggressive things
01:10:07.200 | where you're making larger changes
01:10:08.640 | that take longer periods of time,
01:10:10.360 | you'll probably want to do this
01:10:11.600 | in some sandbox remote environment.
01:10:13.480 | And that's another incredibly tricky problem
01:10:15.800 | of how do you exactly reproduce or mostly reproduce
01:10:20.120 | to the point of it being effectively equivalent
01:10:22.480 | for running code, the user's environment
01:10:24.960 | with this remote sandbox.
01:10:27.240 | - I'm curious what kind of agents you want for coding.
01:10:30.680 | Do you want them to find bugs?
01:10:32.920 | Do you want them to like implement new features?
01:10:35.040 | Like what agents do you want?
01:10:36.720 | - So by the way, when I think about agents,
01:10:38.560 | I don't think just about coding.
01:10:41.400 | I think, so for the practicalities of this particular podcast,
01:10:45.120 | there's video editing, and a lot of, if you look in Adobe,
01:10:47.920 | there's a lot of code behind it.
01:10:50.320 | It's very poorly documented code,
01:10:52.080 | but you can interact with Premiere, for example,
01:10:55.240 | using code and basically all the uploading,
01:10:58.640 | everything I do on YouTube,
01:10:59.640 | everything as you could probably imagine,
01:11:01.480 | I do all of that through code and including translation
01:11:04.920 | and overdubbing all of this.
01:11:06.640 | So I envision all of those kinds of tasks.
01:11:10.160 | So automating many of the tasks
01:11:11.840 | that don't have to do directly with the editing.
01:11:14.120 | So that, okay.
01:11:16.080 | That's what I was thinking about.
01:11:17.000 | But in terms of coding,
01:11:19.520 | I would be fundamentally thinking about bug finding,
01:11:23.960 | like many levels of kind of bug finding
01:11:27.480 | and also bug finding like logical bugs,
01:11:30.200 | not logical, like spiritual bugs or something.
01:11:32.520 | (all laughing)
01:11:34.280 | Ones like sort of big directions of implementation,
01:11:37.400 | that kind of stuff.
01:11:38.680 | - That's a bind on bug finding.
01:11:40.000 | - Yeah, I mean, it's really interesting
01:11:41.960 | that these models are so bad at bug finding
01:11:44.840 | when just naively prompted to find a bug.
01:11:48.720 | They're incredibly poorly calibrated.
01:11:51.320 | - Even the smartest models.
01:11:52.720 | - Exactly, even O1.
01:11:54.800 | - How do you explain that?
01:11:56.480 | Is there a good intuition?
01:11:57.840 | - I think these models are really strong reflection
01:12:02.080 | of the pre-training distribution.
01:12:04.560 | And I do think they generalize
01:12:06.880 | as the loss gets lower and lower,
01:12:08.520 | but I don't think the loss and the scale are quite there,
01:12:11.360 | or the loss isn't low enough,
01:12:13.000 | such that they're like really fully generalizing in code.
01:12:16.360 | Like the things that we use these things for,
01:12:18.680 | the frontier models that they're quite good at
01:12:21.360 | are really code generation and question answering.
01:12:25.120 | And these things exist in massive quantities
01:12:28.440 | in pre-training with all of the code on GitHub
01:12:30.880 | on the scale of many, many trillions of tokens
01:12:33.160 | and questions and answers on things like Stack Overflow
01:12:37.400 | and maybe GitHub issues.
01:12:39.160 | And so when you try to push one of these things
01:12:41.880 | that really don't exist very much online,
01:12:46.160 | like for example, the cursor tab objective
01:12:48.680 | of predicting the next edit,
01:12:49.960 | given the edits done so far,
01:12:52.040 | the brittleness kind of shows.
01:12:53.720 | And then bug detection is another great example
01:12:55.880 | where there aren't really that many examples
01:12:58.080 | of like actually detecting real bugs
01:12:59.720 | and then proposing fixes.
01:13:01.080 | And the models just kind of like really struggle at it.
01:13:05.920 | But I think it's a question of transferring the model,
01:13:08.520 | like in the same way that you get this fantastic transfer
01:13:11.920 | from pre-trained models just on code in general
01:13:14.680 | to the cursor tab objective,
01:13:17.000 | you'll see a very, very similar thing
01:13:19.080 | with generalized models that are really good at code
01:13:21.560 | to bug detection.
01:13:22.400 | It just takes like a little bit of kind of nudging
01:13:24.280 | in that direction.
01:13:25.240 | - Like to be clear,
01:13:26.080 | I think they sort of understand code really well.
01:13:28.200 | Like while they're being pre-trained,
01:13:30.280 | like the representation that's being built up,
01:13:33.400 | like almost certainly like somewhere in the stream,
01:13:36.960 | there's the model knows
01:13:38.880 | that maybe there's something sketchy going on, right?
01:13:42.000 | It sort of has some sketchiness,
01:13:43.560 | but actually eliciting the sketchiness to,
01:13:46.920 | like actually like part of it
01:13:51.320 | is that humans are really calibrated
01:13:52.920 | on which bugs are really important.
01:13:55.000 | It's not just actually saying
01:13:57.080 | like there's something sketchy.
01:13:58.040 | It's like, is this sketchy trivial?
01:14:00.480 | Is this sketchy like you're gonna take the server down?
01:14:03.040 | It's like part of it is maybe the cultural knowledge
01:14:05.720 | of like, why is a staff engineer a staff engineer?
01:14:09.240 | A staff engineer is good
01:14:10.720 | because they know that three years ago,
01:14:12.800 | like someone wrote a really, you know,
01:14:15.000 | a sketchy piece of code that took the server down.
01:14:17.960 | And as opposed to like,
01:14:20.160 | as opposed to maybe you're just like, you know,
01:14:21.920 | you just, this thing is like an experiment.
01:14:25.920 | So like a few bugs are fine.
01:14:27.440 | Like you're just trying to experiment
01:14:28.720 | and get the feel of the thing.
01:14:30.320 | And so if the model gets really annoying
01:14:31.960 | when you're writing an experiment, that's really bad.
01:14:34.560 | But if you're writing something for super production,
01:14:36.920 | you're like writing a database, right?
01:14:38.320 | You're writing code in Postgres or Linux or whatever,
01:14:40.880 | like you're Linus Torvalds.
01:14:42.680 | It's sort of unacceptable to have even an edge case
01:14:45.400 | and just having the calibration of like,
01:14:47.640 | how paranoid is the user?
01:14:51.600 | - But even then, like,
01:14:52.720 | if you're putting it on maximum paranoia,
01:14:55.120 | it's still just like, doesn't quite get it.
01:14:57.680 | - Yeah, yeah, yeah.
01:14:58.800 | - I mean, but this is hard for humans too,
01:15:01.000 | to understand which line of code is important
01:15:04.120 | and which is not.
01:15:05.320 | Like you, I think one of your principles on a website says,
01:15:08.400 | if a code can do a lot of damage,
01:15:11.520 | one should add a comment that say,
01:15:13.640 | this line of code is dangerous.
01:15:17.000 | - And all caps, repeat it 10 times.
01:15:20.720 | - No, you say like, for every single line of code
01:15:24.640 | inside the function, you have to add,
01:15:26.400 | and that's quite profound.
01:15:28.120 | That says something about human beings
01:15:30.160 | because the engineers move on,
01:15:33.400 | even the same person might just forget
01:15:36.360 | how it can sink the Titanic, a single function.
01:15:38.520 | Like you don't, you might not intuit that quite clearly
01:15:41.040 | by looking at the single piece of code.
01:15:42.920 | - Yeah, and I think that one is also,
01:15:45.440 | partially also for today's AI models,
01:15:48.120 | where if you actually write dangerous, dangerous, dangerous
01:15:51.840 | in every single line,
01:15:52.800 | like the models will pay more attention to that
01:15:57.520 | and will be more likely to find bugs in that region.
01:16:00.480 | - That's actually just straight up a really good practice
01:16:03.600 | of labeling code, of how much damage this can do.
01:16:08.280 | - Yeah, I mean, it's controversial.
01:16:10.160 | Some people think it's ugly.
01:16:11.720 | - Well, I actually think it's like,
01:16:14.640 | in fact, I actually think this is one of the things
01:16:18.240 | I learned from Arvid is, you know,
01:16:18.240 | like I sort of aesthetically, I don't like it,
01:16:22.080 | but I think there's certainly something
01:16:24.240 | where like it's useful for the models
01:16:26.520 | and humans just forget a lot.
01:16:28.200 | And it's really easy to make a small mistake
01:16:30.480 | and cause like, bring down, you know,
01:16:33.920 | like just bring down the server and like,
01:16:36.080 | like, of course we like test a lot and whatever,
01:16:39.360 | but there's always these things
01:16:41.360 | that you have to be very careful.
01:16:42.480 | - Yeah, like with just normal doc strings,
01:16:44.400 | I think people will often just skim it
01:16:46.320 | when making a change and think,
01:16:47.440 | oh, I know how to do this.
01:16:49.520 | And you kind of really need to point it out to them
01:16:53.360 | so that that doesn't slip through.
01:16:55.800 | - Yeah, you have to be reminded
01:16:57.000 | that you can do a lot of damage.
01:16:58.600 | That's like, we don't really think about that.
01:17:02.080 | You think about, okay, how do I figure out how this works
01:17:04.960 | so I can improve it?
01:17:05.800 | You don't think about the other direction.
01:17:08.720 | - Until we have formal verification for everything,
01:17:12.760 | then you can do whatever you want
01:17:14.200 | and you know for certain
01:17:16.560 | that you have not introduced a bug if the proofs pass.
01:17:18.920 | - But concretely, what do you think
01:17:20.000 | that future would look like?
01:17:22.000 | - I think people will just not write tests anymore
01:17:26.000 | and the model will suggest,
01:17:29.960 | like you write a function,
01:17:31.280 | the model will suggest a spec
01:17:32.960 | and you review the spec.
01:17:34.200 | And in the meantime,
01:17:36.920 | smart reasoning model computes a proof
01:17:39.440 | that the implementation follows the spec.
01:17:42.120 | And I think that happens for most functions.
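
A deliberately tiny Lean 4 example of that workflow, just to show the shape of it: an implementation, a spec, and a machine-checked proof that the implementation meets the spec (real specs would of course be far richer than this).

```lean
-- A deliberately tiny Lean 4 example of the workflow being described: an
-- implementation, a spec, and a machine-checked proof that the implementation
-- meets the spec.

def double (n : Nat) : Nat :=
  n + n

-- The "spec": double n is n added to itself.
theorem double_meets_spec (n : Nat) : double n = n + n := by
  rfl
```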
01:17:44.280 | - Don't you think this gets at a little bit,
01:17:46.360 | some of the stuff you were talking about earlier
01:17:47.680 | with the difficulty of specifying intent
01:17:49.400 | for what you want with software,
01:17:51.680 | where sometimes it might be,
01:17:53.160 | because the intent is really hard to specify,
01:17:54.800 | it's also then going to be really hard to prove
01:17:56.680 | that it's actually matching whatever your intent is.
01:17:58.440 | - Like you think that spec is hard to generate?
01:18:00.880 | - Yeah, or just like for a given spec,
01:18:06.720 | maybe you can, I think there is a question of like,
01:18:08.920 | can you actually do the formal verification?
01:18:10.960 | Like that's like, is that possible?
01:18:12.880 | I think that there's like more to dig into there.
01:18:15.000 | But then also-
01:18:15.880 | - Even if you have the spec?
01:18:17.480 | - If you have the spec.
01:18:18.320 | - But how do you-
01:18:19.160 | - Even if you have the spec.
01:18:20.000 | Is the spec written in natural language?
01:18:20.960 | Or is it more formal?
01:18:21.800 | - No, the spec would be formal.
01:18:24.840 | - But how easy would that be to draw?
01:18:25.680 | - So then I think that you care about things
01:18:28.080 | that are not going to be easily well-specified
01:18:29.640 | in the spec language.
01:18:30.840 | - I see, I see.
01:18:31.680 | Yeah, yeah.
01:18:32.760 | - Maybe an argument against formal verification
01:18:35.160 | is all you need.
01:18:36.000 | - Yeah.
01:18:36.840 | - The worry is there's this massive document.
01:18:38.360 | - Replacing something like unit tests.
01:18:40.760 | Sure.
01:18:41.600 | - Yeah, yeah.
01:18:42.440 | I think you can probably also evolve the spec languages
01:18:47.040 | to capture some of the things
01:18:48.560 | that they don't really capture right now.
01:18:51.160 | But I don't know.
01:18:53.640 | I think it's very exciting.
01:18:55.000 | - And you're speaking not just about like single functions.
01:18:57.920 | You're speaking about entire code bases.
01:19:00.120 | - I think entire code bases is harder,
01:19:01.600 | but that is what I would love to have.
01:19:03.920 | And I think it should be possible.
01:19:05.920 | And because you can even,
01:19:07.440 | there's like a lot of work recently
01:19:09.160 | where you can prove, formally verify down to the hardware.
01:19:13.640 | So like through the, you formally verify the C code
01:19:16.680 | and then you formally verify through the GCC compiler
01:19:19.600 | and then through the Verilog down to the hardware.
01:19:22.280 | And that's like incredibly big system,
01:19:25.600 | but it actually works.
01:19:26.720 | And I think big code bases are sort of similar
01:19:28.960 | in that they're like multi-layered system.
01:19:31.120 | And if you can decompose it and formally verify each part,
01:19:35.040 | then I think it should be possible.
01:19:36.560 | I think the specification problem is a real problem, but.
01:19:39.080 | - How do you handle side effects?
01:19:40.880 | Or how do you handle, I guess, external dependencies
01:19:44.200 | like calling the Stripe API?
01:19:46.320 | - Maybe Stripe would write a spec for the API.
01:19:48.600 | - But like, you can't do this for everything.
01:19:49.920 | Like, can you do this for everything you use?
01:19:52.200 | Like, how do you do it for, if there's a language model,
01:19:55.160 | like maybe like people will use language models
01:19:57.320 | as primitives in the programs they write.
01:19:59.440 | And there's like a dependence on it.
01:20:00.680 | And like, how do you now include that?
01:20:02.680 | - I think you might be able to prove that still.
01:20:05.920 | - Prove what about language models?
01:20:07.600 | - I think it feels possible that you could actually prove
01:20:10.800 | that a language model is aligned, for example.
01:20:14.920 | Or like you can prove that it actually gives
01:20:17.200 | the right answer.
01:20:18.920 | - That's the dream.
01:20:21.360 | - Yeah, that is.
01:20:22.200 | I mean, if it's possible, that's your, I have a dream speech.
01:20:26.040 | If it's possible, that will certainly help with, you know,
01:20:29.680 | making sure your code doesn't have bugs
01:20:31.880 | and making sure AI doesn't destroy all of human civilization.
01:20:35.040 | So the full spectrum of AI safety to just bug finding.
01:20:39.300 | So you said the models struggle with bug finding.
01:20:42.720 | What's the hope?
01:20:43.880 | - You know, my hope initially is,
01:20:46.040 | and I can let Michael chime in too,
01:20:48.600 | but it was like, it should, you know,
01:20:52.800 | first help with the stupid bugs.
01:20:54.560 | Like it should very quickly catch the stupid bugs,
01:20:56.960 | like off by one error is like,
01:20:58.880 | sometimes you write something in a comment
01:21:00.360 | and do it the other way.
01:21:01.960 | It's like very common.
01:21:02.800 | Like I do this, I write like less than in a comment
01:21:04.960 | and like I maybe write a greater than sign
01:21:06.560 | or something like that.
01:21:07.920 | And the model is like, yeah, it looks sketchy.
01:21:10.240 | Like, are you sure you want to do that?
01:21:13.000 | But eventually it should be able to catch harder bugs too.
01:21:16.160 | - Yeah, and I think that it's also important to note
01:21:19.040 | that this is, having good bug finding models
01:21:21.800 | feels necessary to get to the highest reaches
01:21:24.640 | of having AI do more and more programming for you,
01:21:27.840 | where you're going to, you know,
01:21:29.200 | if AI is building more and more of the system for you,
01:21:31.160 | you need to not just generate, but also verify.
01:21:33.800 | And without that, some of the problems
01:21:35.880 | that we've talked about before
01:21:37.520 | with programming with these models
01:21:39.880 | will just become untenable.
01:21:42.400 | So it's not just for humans, like you write a bug,
01:21:45.680 | I write a bug, find the bug for me,
01:21:47.160 | but it's also being able to verify the AI's code
01:21:50.200 | and check it is really important.
01:21:52.560 | - Yeah, and then how do you actually do this?
01:21:54.120 | Like we have had a lot of contentious dinner discussions
01:21:56.320 | of how do you actually train a bug model?
01:21:57.720 | But one very popular idea is, you know,
01:22:00.720 | it's potentially easier to introduce a bug
01:22:04.160 | than to actually find the bug.
01:22:05.360 | And so you can train a model to introduce bugs
01:22:08.200 | in existing code.
01:22:09.560 | And then you can train a reverse bug model
01:22:13.320 | then that can find bugs using this synthetic data.
01:22:17.360 | So that's like one example.
01:22:18.720 | But yeah, there are lots of ideas for how to-
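
A toy sketch of that synthetic-data idea: mutate known-good code with simple rule-based bug introducers and keep the (buggy code, bug description) pairs; a learned bug-introducing model would of course produce far subtler mutations than these string swaps.

```python
# Toy synthetic-data generator for bug detection (a sketch of the idea, not the real
# pipeline): mutate known-good code with simple rule-based "bug introducers" and keep
# (buggy_code, description_of_bug) pairs for training a bug-finding model.

import random

MUTATIONS = [
    ("<=", "<",  "off-by-one: <= became <"),
    ("==", "!=", "flipped equality check"),
    ("+ 1", "- 1", "sign flip on an offset"),
]

def introduce_bug(good_code, rng):
    candidates = [(a, b, desc) for a, b, desc in MUTATIONS if a in good_code]
    if not candidates:
        return None
    a, b, desc = rng.choice(candidates)
    return good_code.replace(a, b, 1), desc

good = "for i in range(0, n + 1):\n    if xs[i] == target:\n        return i"
rng = random.Random(0)
buggy, label = introduce_bug(good, rng)
print(label)
print(buggy)
```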
01:22:21.920 | - You can also do a bunch of work,
01:22:24.600 | not even at the model level, of taking the biggest models
01:22:27.400 | and then maybe giving them access to a lot of information
01:22:30.760 | that's not just the code.
01:22:32.320 | Like it's kind of a hard problem to like stare at a file
01:22:34.360 | and be like, where's the bug?
01:22:35.680 | And you know, that's hard for humans often, right?
01:22:38.120 | And so often you have to run the code
01:22:39.720 | and being able to see things like traces
01:22:41.320 | and step through a debugger.
01:22:43.280 | There's a whole nother direction
01:22:44.560 | where it like kind of tends toward that.
01:22:46.160 | And it could also be that there are kind of
01:22:47.360 | two different product form factors here.
01:22:48.680 | It could be that you have a really specialty model
01:22:50.640 | that's quite fast, that's kind of running in the background
01:22:52.440 | and trying to spot bugs.
01:22:53.880 | And it might be that sometimes,
01:22:55.520 | sort of to Arvid's earlier example about, you know,
01:22:57.960 | some nefarious input box bug,
01:22:59.680 | it might be that sometimes you wanna like,
01:23:01.280 | you know there's a bug,
01:23:02.200 | you're not just like checking hypothesis-free,
01:23:04.080 | you're like, this is a problem, I really wanna solve it.
01:23:06.560 | And you zap that with tons and tons and tons of compute
01:23:08.840 | and you're willing to put in like $50 to solve that bug
01:23:11.160 | or something even more.
01:23:12.760 | - Have you thought about integrating money
01:23:14.720 | into this whole thing?
01:23:15.560 | Like I would pay probably a large amount of money
01:23:17.200 | for if you found a bug
01:23:19.240 | or even generated a code that I really appreciated.
01:23:21.320 | Like I had a moment a few days ago
01:23:23.680 | when I started using Cursor, it generated
01:23:25.800 | perfect,
01:23:29.160 | like three perfect functions
01:23:32.720 | for interacting with the YouTube API
01:23:36.080 | to update captions
01:23:38.960 | and for localization like in different languages.
01:23:42.400 | The API documentation is not very good.
01:23:45.160 | And the code across, like if I Googled it for a while,
01:23:48.280 | I couldn't find exactly,
01:23:49.520 | there's a lot of confusing information
01:23:51.320 | and cursor generated perfectly.
01:23:53.240 | And I was like, I just sat back,
01:23:54.840 | I read the code and I was like, this is correct.
01:23:56.520 | I tested it, it's correct.
01:23:58.160 | I was like, I wanna tip, like a button that goes,
01:24:01.840 | - Yeah.
01:24:02.760 | - Here's $5.
01:24:03.920 | One that's really good just to support the company
01:24:05.920 | and support what the interface is.
01:24:08.080 | And the other is that probably sends a strong signal,
01:24:10.800 | like good job, right?
01:24:14.080 | So there's this much stronger signal
01:24:15.560 | than just accepting the code, right?
01:24:17.000 | You just actually send like a strong, good job.
01:24:20.200 | That and for bug finding, obviously,
01:24:22.480 | like there's a lot of people,
01:24:24.920 | that would pay a huge amount of money for a bug,
01:24:28.800 | like a bug bounty thing, right?
01:24:32.400 | Do you guys think about that?
01:24:33.720 | - Yeah, it's a controversial idea inside the company.
01:24:37.000 | I think it sort of depends on how much
01:24:38.960 | you believe in humanity almost, you know?
01:24:42.440 | Like, I think it would be really cool
01:24:45.480 | if like you spend nothing to try to find a bug.
01:24:49.080 | And if it doesn't find a bug, you spend $0.
01:24:51.560 | And then if it does find a bug and you click accept,
01:24:54.560 | then it also shows like in parentheses, like $1.
01:24:58.080 | And so you spend $1 to accept the bug.
01:25:00.480 | And then of course there's a worry like,
01:25:02.160 | okay, we spent a lot of computation,
01:25:03.600 | like maybe people will just copy paste.
01:25:05.560 | I think that's a worry.
01:25:08.480 | And then there's also the worry that like introducing money
01:25:10.960 | into the product makes it like kind of,
01:25:14.280 | you know, like it doesn't feel as fun anymore.
01:25:16.160 | Like you have to like think about money
01:25:18.080 | and all you want to think about is like the code.
01:25:21.080 | And so maybe it actually makes more sense to separate it out
01:25:23.520 | and like you pay some fee like every month
01:25:26.760 | and then you get all of these things for free.
01:25:29.320 | - But there could be a tipping component,
01:25:30.800 | which is not like it costs this.
01:25:32.360 | - Yes, but it still has that like dollar symbol.
01:25:34.600 | I think it's fine.
01:25:35.440 | But I also see the point where like,
01:25:38.560 | maybe you don't want to introduce it.
01:25:40.160 | - Yeah, I was gonna say the moment
01:25:41.000 | that feels like people do this is when they share it,
01:25:43.240 | when they have this fantastic example,
01:25:45.120 | they just kind of share it with their friends.
01:25:46.880 | There is also a potential world
01:25:48.040 | where there's a technical solution to this,
01:25:49.880 | like an honor-system problem too,
01:25:51.560 | where if we can get to a place where we understand
01:25:54.000 | the output of the system more,
01:25:55.960 | I mean, to the stuff we were talking about with like,
01:25:57.880 | you know, error checking with the LSP
01:25:59.360 | and then also running the code.
01:26:00.720 | But if you could get to a place
01:26:01.560 | where you could actually somehow verify,
01:26:03.560 | oh, I have fixed the bug.
01:26:05.040 | Maybe then the bounty system
01:26:07.200 | doesn't need to rely on the honor system too.
01:26:09.120 | - How much interaction is there
01:26:10.400 | between the terminal and the code?
01:26:12.240 | Like how much information is gained
01:26:14.120 | from if you run the code in the terminal?
01:26:16.360 | Can you use, can you do like a loop where it runs the code
01:26:22.280 | and suggests how to change the code
01:26:24.720 | if the code at runtime gives an error?
01:26:27.760 | Or right now, are they separate worlds completely?
01:26:30.720 | Like I know you can like do control K inside the terminal
01:26:34.080 | to help you write the code.
01:26:35.040 | - You can use terminal context as well
01:26:38.080 | inside of Cmd+K, kind of, everything.
01:26:40.960 | We don't have the looping part yet,
01:26:44.640 | though we suspect something like this
01:26:46.080 | could make a lot of sense.
01:26:47.360 | There's a question of whether it happens
01:26:48.560 | in the foreground too,
01:26:49.640 | or if it happens in the background,
01:26:51.400 | like what we've been discussing.
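A minimal sketch of what such a run-and-fix loop could look like, assuming a hypothetical ask_model call standing in for whatever LLM API is used (this is not Cursor's actual implementation):

```python
import subprocess

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to an LLM API."""
    raise NotImplementedError

def run_and_fix(path: str, max_iters: int = 3) -> bool:
    """Run a Python script; on failure, feed the traceback back to the model and retry."""
    for _ in range(max_iters):
        result = subprocess.run(["python", path], capture_output=True, text=True)
        if result.returncode == 0:
            return True  # script ran cleanly
        with open(path) as f:
            code = f.read()
        prompt = (
            "This script failed. Return a corrected version of the whole file.\n\n"
            f"--- code ---\n{code}\n\n--- stderr ---\n{result.stderr}"
        )
        with open(path, "w") as f:
            f.write(ask_model(prompt))
    return False
```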
01:26:52.680 | - Sure, the background is pretty cool.
01:26:54.200 | Like it could be running the code in different ways.
01:26:56.640 | Plus there's a database side to this,
01:26:58.280 | which is, how do you protect it from modifying the database?
01:27:01.120 | But okay.
01:27:01.960 | - I mean, there's certainly cool solutions there.
01:27:06.080 | There's this new API that is being developed.
01:27:10.360 | It's not in AWS, but I think
01:27:15.280 | it's in PlanetScale.
01:27:16.480 | I don't know if PlanetScale was the first one to add it.
01:27:18.760 | It's this ability to sort of add branches to a database,
01:27:22.360 | which is like if you're working on a feature
01:27:25.520 | and you want to test against the prod database,
01:27:27.680 | but you don't actually want to test
01:27:28.920 | against the prod database,
01:27:29.880 | you could sort of add a branch to the database.
01:27:31.960 | And the way to do that is to add a branch
01:27:33.440 | to the write-ahead log.
01:27:35.200 | And there's obviously a lot of technical complexity
01:27:37.400 | in doing it correctly.
01:27:38.480 | I guess database companies need new things to do.
01:27:41.640 | They have good databases now.
01:27:47.520 | And I think like Turbopuffer,
01:27:50.160 | which is one of the databases we use,
01:27:52.080 | is going to add maybe branching to the write-ahead log.
01:27:57.080 | And so maybe the AI agents will use branching.
01:28:03.040 | They'll like test against some branch
01:28:05.400 | and it's sort of gonna be a requirement for the database
01:28:08.760 | to like support branching or something.
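To make the branching idea concrete, here is a toy copy-on-write branch over a key-value store; it only illustrates the concept, not PlanetScale's or Turbopuffer's actual write-ahead-log mechanism:

```python
class Branch:
    """Toy copy-on-write branch: writes land in an overlay, reads fall through
    to the parent, so a test branch never touches the production data."""
    def __init__(self, parent=None):
        self.parent = parent
        self.overlay = {}       # keys written on this branch
        self.deleted = set()    # keys deleted on this branch

    def get(self, key):
        if key in self.deleted:
            return None
        if key in self.overlay:
            return self.overlay[key]
        return self.parent.get(key) if self.parent else None

    def put(self, key, value):
        self.deleted.discard(key)
        self.overlay[key] = value

    def delete(self, key):
        self.overlay.pop(key, None)
        self.deleted.add(key)

    def branch(self):
        return Branch(parent=self)

# prod = Branch(); prod.put("user:1", {"name": "Ada"})
# test = prod.branch(); test.put("user:1", {"name": "Eve"})
# prod.get("user:1")  # still {"name": "Ada"}
```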
01:28:10.600 | - It'd be really interesting
01:28:11.440 | if you could branch a file system, right?
01:28:13.680 | - Yeah, I feel like everything needs branching.
01:28:15.720 | It's like that.
01:28:17.040 | - Yeah, it's the problem with the multiverse, right?
01:28:22.000 | Like if you branch on everything, that's like a lot.
01:28:24.360 | - I mean, there's obviously these like super clever
01:28:26.320 | algorithms to make sure that you don't actually
01:28:28.520 | sort of use a lot of space or CPU or whatever.
01:28:32.280 | - Okay, this is a good place to ask about infrastructure.
01:28:34.880 | So you guys mostly use AWS.
01:28:36.880 | What are some interesting details?
01:28:38.240 | What are some interesting challenges?
01:28:39.640 | Why'd you choose AWS?
01:28:41.400 | Why is AWS still winning?
01:28:44.080 | Hashtag.
01:28:45.000 | - AWS is just really, really good.
01:28:48.280 | It's really good.
01:28:49.120 | Like whenever you use an AWS product,
01:28:54.120 | you just know that it's going to work.
01:28:56.840 | Like it might be absolute hell to go through the steps
01:29:00.480 | to set it up.
01:29:02.120 | - Why is the interface so horrible?
01:29:04.200 | - Because it's just so good.
01:29:06.200 | It doesn't need to-
01:29:07.040 | - It's the nature of winning.
01:29:08.920 | - I think it's exactly, it's just nature of winning.
01:29:11.240 | Yeah, yeah.
01:29:12.440 | But AWS, you can always trust, like it will always work.
01:29:15.240 | And if there is a problem, it's probably your problem.
01:29:18.600 | Yeah.
01:29:20.920 | - Okay.
01:29:21.760 | Is there some interesting like challenges to,
01:29:23.640 | you guys have a pretty new startup to get scaling
01:29:26.840 | to like, to so many people and-
01:29:29.320 | - Yeah, I think that there,
01:29:30.680 | it has been an interesting journey adding, you know,
01:29:35.440 | each extra zero to the requests per second.
01:29:37.920 | You run into all of these issues where, you know,
01:29:39.520 | the general components you're using for caching
01:29:41.520 | and databases run into issues as you make things
01:29:43.720 | bigger and bigger.
01:29:44.560 | At the scale where we get, like, you know,
01:29:45.760 | int overflows on our tables and things like that.
01:29:48.720 | And then also there have been some custom systems
01:29:51.800 | that we've built, like for instance,
01:29:53.200 | our retrieval system for computing a semantic index
01:29:57.040 | of your code base and answering questions about a code base
01:30:00.120 | that have continually, I feel like been,
01:30:02.280 | well, one of the trickier things to scale.
01:30:04.360 | - I have a few friends who are super, super senior engineers
01:30:07.520 | and one of their sort of lines is like,
01:30:09.040 | it's very hard to predict where systems will break
01:30:11.840 | when you scale them.
01:30:13.360 | You can sort of try to predict in advance,
01:30:17.000 | but like, there's always something weird
01:30:18.960 | that's going to happen when you add this extra zero.
01:30:22.040 | You thought you thought through everything,
01:30:23.720 | but you didn't actually think through everything.
01:30:26.320 | But I think for that particular system,
01:30:30.520 | so for concrete details,
01:30:34.640 | the thing we do is, obviously,
01:30:36.880 | we chunk up all of your code
01:30:41.120 | and then we send up sort of the code for embedding
01:30:44.720 | and we embed the code.
01:30:46.280 | And then we store the embeddings in a database,
01:30:49.280 | but we don't actually store any of the code.
01:30:51.800 | And then there's reasons around making sure
01:30:53.560 | that we don't introduce client bugs
01:30:56.320 | because we're very, very paranoid about client bugs.
01:30:59.080 | We store much of the details on the server,
01:31:03.520 | like everything is sort of encrypted.
01:31:06.680 | So one of the technical challenges
01:31:09.840 | is always making sure that the local index,
01:31:12.720 | the local code base state is the same as the state
01:31:16.160 | that is on the server.
01:31:17.920 | And the way sort of technically we ended up doing that is,
01:31:21.840 | so for every single file, you can sort of keep this hash.
01:31:25.800 | And then for every folder, you can sort of keep a hash,
01:31:28.640 | which is the hash of all of its children.
01:31:31.160 | And you can sort of recursively do that until the top.
01:31:33.880 | And why do something complicated?
01:31:37.640 | One thing you could do is you could keep a hash
01:31:39.720 | for every file.
01:31:40.920 | Then every minute you could try to download
01:31:43.440 | the hashes that are on the server,
01:31:44.720 | figure out what are the files that don't exist on the server.
01:31:47.360 | Maybe you just created a new file.
01:31:48.880 | Maybe you just deleted a file.
01:31:50.320 | Maybe you checked out a new branch
01:31:52.400 | and try to reconcile the state
01:31:54.480 | between the client and the server.
01:31:56.160 | But that introduces like absolutely ginormous
01:31:59.960 | network overhead, both on the client side.
01:32:03.680 | I mean, nobody really wants us to hammer their wifi
01:32:06.880 | all the time if you're using cursor.
01:32:09.360 | But also like, I mean, it would introduce
01:32:11.000 | like ginormous overhead in the database.
01:32:13.680 | It would sort of be reading this tens-of-terabytes database,
01:32:18.680 | sort of approaching like 20 terabytes
01:32:23.200 | or something, like every second.
01:32:25.680 | That's just sort of kind of crazy.
01:32:28.120 | You definitely don't want to do that.
01:32:30.560 | So what do you do?
01:32:31.400 | You sort of, you just try to reconcile the single hash,
01:32:34.320 | which is at the root of the project.
01:32:35.760 | And then if something mismatches, then you go,
01:32:37.800 | you find where all the things disagree.
01:32:39.600 | Maybe you look at the children and see if the hashes match.
01:32:42.000 | And if the hashes don't match,
01:32:43.000 | go look at their children and so on.
01:32:44.640 | But you only do that in this scenario
01:32:46.280 | where things don't match.
01:32:47.280 | And for most people, most of the time the hashes match.
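A rough sketch of that hierarchical hashing and top-down reconciliation (the hash choices and the one-sided descent are simplifications; a real sync also has to handle files that exist only on the server):

```python
import hashlib
import os

def file_hash(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def tree_hash(root: str) -> dict:
    """Return {path: hash}, where a folder's hash is the hash of its children's hashes."""
    hashes = {}
    def walk(path: str) -> str:
        if os.path.isfile(path):
            h = file_hash(path)
        else:
            children = sorted(os.listdir(path))
            combined = "".join(walk(os.path.join(path, c)) for c in children)
            h = hashlib.sha256(combined.encode()).hexdigest()
        hashes[path] = h
        return h
    walk(root)
    return hashes

def find_mismatches(local: dict, remote: dict, root: str) -> list:
    """Descend only into subtrees whose hashes disagree; most of the time the root matches
    and nothing else is touched. (One-sided sketch: deletions on the server need a
    symmetric pass.)"""
    if local.get(root) == remote.get(root):
        return []
    if os.path.isfile(root):
        return [root]
    out = []
    for child in sorted(os.listdir(root)):
        out.extend(find_mismatches(local, remote, os.path.join(root, child)))
    return out
```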
01:32:50.000 | - So it's a kind of like hierarchical reconciliation.
01:32:53.240 | - Yeah, something like that.
01:32:54.800 | Yeah, it's called the Merkle tree.
01:32:56.360 | - Yeah, Merkle, yeah.
01:32:58.120 | I mean, so yeah, this is cool to see that
01:33:00.080 | you kind of have to think through all these problems.
01:33:01.800 | - And I mean, the point of, like the reason it's gotten hard
01:33:04.480 | is just because, like the number of people using it
01:33:07.080 | and if some of your customers
01:33:09.280 | have really, really large code bases,
01:33:13.040 | to the point where, you know,
01:33:15.880 | we originally re-indexed our code base, which is big,
01:33:18.680 | but I mean, it's just not the size of some company
01:33:21.360 | that's been there for 20 years
01:33:22.600 | and sort of has a ginormous number of files.
01:33:25.080 | And you sort of want to scale that across programmers.
01:33:28.200 | There's all these details where like
01:33:30.000 | building a simple thing is easy,
01:33:31.360 | but scaling it to a lot of people, like a lot of companies
01:33:34.400 | is obviously a difficult problem.
01:33:36.240 | And sort of, you know, independent of that,
01:33:38.360 | there's part of this that's scaling
01:33:39.720 | our current solution, and part is also, you know,
01:33:41.800 | coming up with new ideas that obviously we're working on,
01:33:45.520 | but then scaling all of that in the last few weeks, months.
01:33:48.440 | - Yeah, and there are a lot of clever things,
01:33:50.640 | like additional things that go into this indexing system.
01:33:53.640 | For example, the bottleneck in terms of costs
01:33:57.160 | is not storing things in the vector database or the database,
01:33:58.960 | it's actually embedding the code.
01:34:01.040 | And you don't want to re-embed the code base
01:34:02.680 | for every single person in a company
01:34:04.720 | that is using the same exact code,
01:34:07.400 | except for maybe they're in a different branch
01:34:08.960 | with a few different files,
01:34:09.880 | or they've made a few local changes.
01:34:12.320 | And so, because again, embeddings are the bottleneck,
01:34:14.600 | you can do just one clever trick
01:34:16.160 | and not have to worry about like the complexity
01:34:18.320 | of like dealing with branches and the other databases,
01:34:20.600 | where you just have some cache on the actual vectors
01:34:25.600 | computed from the hash of a given chunk.
01:34:29.560 | And so this means that when the nth person at a company
01:34:33.720 | goes and embeds their code base, it's really, really fast.
01:34:36.680 | And you do all this without actually storing any code
01:34:39.240 | on our servers at all.
01:34:40.120 | No code data is stored.
01:34:41.680 | We just store the vectors in the vector database
01:34:43.400 | and the vector cache.
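A sketch of that cache-on-chunk-hash trick; embed_fn and the dict cache here are placeholders for whatever embedding model and shared cache are actually used:

```python
import hashlib

def chunk_key(chunk: str) -> str:
    """Cache key derived only from the chunk contents, so identical code across
    users and branches maps to the same vector and is embedded only once."""
    return hashlib.sha256(chunk.encode()).hexdigest()

def embed_chunks(chunks, cache, embed_fn):
    """Embed only the cache misses; everything else is a lookup."""
    missing = [c for c in chunks if chunk_key(c) not in cache]
    if missing:
        for chunk, vector in zip(missing, embed_fn(missing)):
            cache[chunk_key(chunk)] = vector
    return [cache[chunk_key(c)] for c in chunks]

# In practice the cache would be a shared server-side store; a dict works for the sketch:
# vectors = embed_chunks(code_chunks, cache={}, embed_fn=my_embedding_model)
```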
01:34:45.480 | - What's the biggest gains at this time
01:34:49.120 | you get from indexing the code base?
01:34:51.680 | Just out of curiosity, like what benefit do users have?
01:34:56.000 | It seems like longer term,
01:34:57.400 | there'll be more and more benefit,
01:34:58.680 | but in the short term,
01:34:59.600 | just asking questions of the code base,
01:35:02.880 | what's the usefulness of that?
01:35:06.080 | - I think the most obvious one is just,
01:35:10.640 | you want to find out where something is happening
01:35:13.120 | in your large code base.
01:35:14.600 | And you sort of have a fuzzy memory of,
01:35:16.960 | okay, I want to find the place where we do X,
01:35:19.320 | but you don't exactly know what to search for
01:35:22.240 | in a normal text search.
01:35:23.440 | And so you ask a chat,
01:35:25.240 | you hit command enter to ask with the code base chat,
01:35:27.920 | and then very often it finds the right place
01:35:31.120 | that you were thinking of.
01:35:32.160 | I think, like you mentioned,
01:35:34.760 | in the future, I think this is only going to get more
01:35:37.080 | and more powerful,
01:35:38.320 | where we're working a lot
01:35:39.440 | on improving the quality of our retrieval.
01:35:42.120 | And I think the ceiling for that is really, really much
01:35:44.080 | higher than people give it credit for.
01:35:45.960 | - One question that's good to ask here,
01:35:47.920 | have you considered and why haven't you much done
01:35:50.320 | sort of local stuff to where you can do the,
01:35:53.680 | I mean, it seems like everything we just discussed
01:35:55.640 | is exceptionally difficult to do.
01:35:57.240 | To go to the cloud,
01:35:58.520 | you have to think about all these things
01:35:59.840 | with the caching and the large code base
01:36:04.840 | with a large number of programmers
01:36:06.520 | are using the same code base.
01:36:07.520 | You have to figure out the puzzle of that.
01:36:09.400 | A lot of it, most software just does stuff,
01:36:13.160 | this heavy computational stuff locally.
01:36:16.360 | Have you considered doing sort of embeddings locally?
01:36:18.800 | - Yeah, we thought about it
01:36:19.840 | and I think it would be cool to do it locally.
01:36:22.640 | I think it's just really hard.
01:36:24.600 | And one thing to keep in mind is that,
01:36:28.120 | some of our users use the latest MacBook Pro
01:36:30.800 | but most of our users,
01:36:33.240 | like more than 80% of our users are on Windows machines,
01:36:36.240 | and many of them are not very powerful.
01:36:39.760 | And so local models really only works
01:36:42.600 | on the latest computers.
01:36:44.240 | And it's also a big overhead to build that in.
01:36:48.400 | And so even if we would like to do that,
01:36:50.440 | it's currently not something that we are able to focus on.
01:36:54.360 | And I think there are some people that do that.
01:36:57.600 | And I think that's great.
01:36:58.880 | But especially as models get bigger and bigger
01:37:02.640 | and you want to do fancier things with like bigger models,
01:37:05.920 | it becomes even harder to do it locally.
01:37:07.920 | - Yeah, and it's not a problem of like weaker computers.
01:37:11.640 | It's just that, for example, if you're some big company,
01:37:15.680 | you have big company code base,
01:37:17.680 | it's just really hard to process big company code base
01:37:20.240 | even on the beefiest MacBook Pros.
01:37:22.280 | So it's not even a matter of, like,
01:37:24.560 | whether you're just a student or something.
01:37:28.040 | I think if you're like the best programmer at a big company,
01:37:31.760 | you're still going to have a horrible experience
01:37:34.440 | if you do everything locally.
01:37:35.760 | I mean, you could do edge and sort of scrape by,
01:37:38.680 | but like, again, it wouldn't be fun anymore.
01:37:40.840 | - Yeah, like, approximate nearest neighbors
01:37:42.440 | on this massive code base is going to just eat up
01:37:44.480 | your memory and your CPU.
01:37:46.280 | And that's just that.
01:37:50.080 | Like, let's talk about like also the modeling side
01:37:52.800 | where, as Arvid said, there are these massive headwinds
01:37:55.800 | against local models where one,
01:37:59.560 | things seem to be moving towards MOEs,
01:38:01.680 | which like one benefit is maybe
01:38:03.320 | they're more memory bandwidth bound,
01:38:05.320 | which plays in favor of local versus using GPUs
01:38:09.600 | or using NVIDIA GPUs.
01:38:12.320 | But the downside is these models are just bigger in total.
01:38:16.520 | And they're going to need to fit often,
01:38:18.960 | not even on a single node, but multiple nodes.
01:38:22.160 | There's no way that's going to fit inside
01:38:24.240 | of even really good MacBooks.
01:38:26.520 | And I think, especially for coding,
01:38:28.880 | it's not a question as much of like,
01:38:31.480 | does it clear some bar of like the models good enough
01:38:34.840 | to do these things and then like we're satisfied,
01:38:37.320 | which may be the case for other problems
01:38:39.680 | and maybe where local models shine,
01:38:41.640 | but people are always going to want the best,
01:38:43.480 | the most intelligent, the most capable things.
01:38:46.200 | And that's going to be really, really hard to run
01:38:48.480 | for almost all people locally.
01:38:51.320 | - Don't you want the most capable model?
01:38:53.800 | Like you want Sonnet?
01:38:56.160 | - And also with O1.
01:38:58.160 | - I like how you're pitching me.
01:39:00.520 | Would you be satisfied with an inferior model?
01:39:03.220 | Listen, I'm yes, I'm one of those,
01:39:05.520 | but there's some people that like to do stuff locally,
01:39:07.960 | especially like really, there's a whole,
01:39:11.080 | obviously open source movement that kind of resists.
01:39:13.640 | And it's good that they exist actually,
01:39:15.500 | because you want to resist the power centers
01:39:18.880 | that are growing.
01:39:20.080 | - There's actually an alternative to local models
01:39:23.000 | that I am particularly fond of.
01:39:25.200 | I think it's still very much in the research stage,
01:39:28.520 | but you could imagine doing homomorphic encryption
01:39:32.560 | for language model inference.
01:39:34.360 | So you encrypt your input on your local machine,
01:39:36.920 | then you send that up,
01:39:37.920 | and then the server can use lots of computation.
01:39:42.920 | They can run models that you cannot run locally
01:39:45.040 | on this encrypted data,
01:39:46.920 | but they cannot see what the data is.
01:39:48.520 | And then they send back the answer
01:39:49.760 | and you decrypt the answer and only you can see the answer.
01:39:52.480 | So I think that's still very much research
01:39:55.880 | and all of it is about trying to make the overhead lower
01:40:00.720 | because right now the overhead is really big.
01:40:02.800 | But if you can make that happen,
01:40:04.480 | I think that would be really, really cool.
01:40:07.240 | And I think it would be really, really impactful
01:40:10.080 | because I think one thing that's actually kind of worrisome
01:40:12.160 | is that as these models get better and better,
01:40:14.840 | they're going to become more and more economically useful.
01:40:17.880 | And so more and more of the world's information and data
01:40:21.040 | will flow through, you know, one or two centralized actors.
01:40:26.040 | And then there are worries about, you know,
01:40:29.480 | there can be traditional hacker attempts,
01:40:31.400 | but it also creates this kind of scary part
01:40:35.040 | where if all of the world's information
01:40:37.480 | is flowing through one node in plain text,
01:40:39.800 | you can have surveillance in very bad ways.
01:40:43.960 | And sometimes that will happen for, you know,
01:40:47.680 | initially it will be for like good reasons,
01:40:49.800 | like people will want to try to protect against
01:40:52.720 | like bad actors using AI models in bad ways.
01:40:55.680 | And then you will add in some surveillance code
01:40:57.480 | and then someone else will come in and, you know,
01:40:59.640 | you're in a slippery slope and then you start
01:41:01.840 | doing bad things with a lot of the world's data.
01:41:06.880 | And so I'm very hopeful that we can solve
01:41:10.480 | homomorphic encryption for language model inference.
01:41:12.640 | - Doing privacy-preserving machine learning.
01:41:14.320 | But I would say like that's the challenge we have
01:41:16.240 | with all software these days.
01:41:18.680 | It's like there's so many features that can be provided
01:41:22.240 | from the cloud and all of us increasingly rely on it
01:41:25.160 | and make our life awesome, but there's downsides.
01:41:27.720 | And that's why you rely on really good security
01:41:29.600 | to protect from basic attacks.
01:41:31.600 | But there's also only a small set of companies
01:41:35.320 | that are controlling that data, you know,
01:41:37.800 | and they obviously have leverage
01:41:40.040 | and they can be infiltrated in all kinds of ways.
01:41:42.000 | That's the world we live in.
01:41:43.600 | - Yeah, I mean, the thing I'm just actually quite worried
01:41:46.640 | about is sort of the world where, I mean,
01:41:48.560 | so Anthropic has this responsible scaling policy
01:41:51.440 | and so we're on like the low ASLs,
01:41:55.120 | which is the Anthropic security level or whatever
01:41:57.200 | of the models.
01:41:58.920 | But as we get to like, quote-unquote, ASL-3, ASL-4,
01:42:02.320 | whatever models, which are sort of very powerful.
01:42:06.440 | But for mostly reasonable security reasons,
01:42:11.120 | you would want to monitor all the prompts.
01:42:13.600 | But I think that's sort of reasonable
01:42:16.280 | and understandable where everyone is coming from.
01:42:18.560 | But, man, it'd be really horrible
01:42:20.960 | if sort of like all the world's information
01:42:23.080 | is sort of monitored that heavily.
01:42:24.800 | It's way too centralized.
01:42:27.000 | It's like sort of this really fine line you're walking
01:42:30.600 | where on the one side, like,
01:42:33.360 | you don't want the models to go rogue.
01:42:35.160 | On the other side, like, man, it's humans, like,
01:42:38.160 | I don't know if I trust like all the world's information
01:42:41.040 | to pass through like three model providers.
01:42:43.400 | - Yeah.
01:42:44.640 | - Why do you think it's different than cloud providers?
01:42:47.600 | - Because I think this is,
01:42:51.440 | a lot of this data would never have gone
01:42:54.080 | to the cloud providers in the first place.
01:42:56.520 | Where this is often like,
01:43:00.560 | you want to give more data to the AI models.
01:43:02.440 | You want to give personal data
01:43:04.480 | that you would never have put online in the first place
01:43:07.400 | to these companies or to these models.
01:43:10.960 | And it also centralizes control
01:43:15.080 | where right now for cloud,
01:43:19.040 | you can often use your own encryption keys
01:43:21.080 | and like AWS can't really do much.
01:43:24.400 | But here it's just centralized actors
01:43:29.240 | that see the exact plain text of everything.
01:43:31.640 | - On the topic of context,
01:43:34.160 | that's actually been a friction for me.
01:43:36.080 | When I'm writing code, you know, in Python,
01:43:38.040 | there's a bunch of stuff imported.
01:43:40.120 | There's a, you could probably intuit
01:43:42.680 | the kind of stuff I would like to include in the context.
01:43:45.520 | Is there, like how hard is it
01:43:48.040 | to auto figure out the context?
01:43:51.040 | - It's tricky.
01:43:52.800 | I think we can do a lot better
01:43:54.640 | at computing the context automatically in the future.
01:43:58.680 | One thing that's important to notice,
01:44:00.120 | there are trade-offs with including automatic context.
01:44:03.600 | So the more context you include for these models,
01:44:06.720 | first of all, the slower they are.
01:44:09.640 | And the more expensive those requests are,
01:44:12.200 | which means you can then do fewer model calls
01:44:13.880 | and do less fancy stuff in the background.
01:44:16.040 | Also for a lot of these models,
01:44:17.480 | they get confused if you have a lot of information
01:44:19.200 | in the prompt.
01:44:20.160 | So the bar for accuracy
01:44:23.080 | and for relevance of the context you include
01:44:25.080 | should be quite high.
01:44:26.120 | But we already do some automatic context
01:44:31.640 | in some places within the product.
01:44:33.040 | It's definitely something we wanna get a lot better at.
01:44:35.360 | And I think that there are a lot of cool ideas
01:44:39.440 | to try there,
01:44:40.280 | both on the learning better retrieval systems,
01:44:45.680 | like better embedding models, better re-rankers.
01:44:48.400 | I think that there are also cool academic ideas,
01:44:52.120 | stuff we've tried out internally,
01:44:53.280 | but also stuff the field is grappling with writ large,
01:44:55.880 | about whether you can get language models to a place
01:44:58.200 | where you can actually just have the model itself,
01:45:00.280 | like understand a new corpus of information.
01:45:02.640 | And the most popular talked about version of this is,
01:45:05.640 | can you make the context windows infinite?
01:45:07.520 | Then if you make the context windows infinite,
01:45:08.880 | can you make the model actually pay attention
01:45:10.480 | to the infinite context?
01:45:11.680 | And then after you can make it pay attention
01:45:13.120 | to the infinite context,
01:45:14.320 | to make it somewhat feasible to actually do it,
01:45:16.680 | can you then do caching for that infinite context?
01:45:18.760 | You don't have to recompute that all the time.
01:45:20.920 | But there are other cool ideas that are being tried
01:45:23.440 | that are a little bit more analogous to fine tuning
01:45:25.760 | of actually learning this information
01:45:27.120 | in the weights of the model.
01:45:28.640 | And it might be that you actually get
01:45:30.760 | sort of a qualitatively different type of understanding
01:45:34.760 | if you do it more at the weight level
01:45:36.000 | than if you do it at the in-context learning level.
01:45:37.720 | I think the jury's still a little bit out
01:45:40.640 | on how this is all gonna work in the end.
01:45:43.040 | But in the interim, us as a company,
01:45:44.640 | we are really excited about better retrieval systems
01:45:47.200 | and picking the parts of the code base
01:45:49.200 | that are most relevant to what you're doing.
01:45:51.120 | We could do that a lot better.
01:45:52.520 | - Like one interesting proof of concept
01:45:54.440 | for the learning this knowledge directly in the weights
01:45:58.280 | is with VS Code.
01:46:00.400 | So we're in a VS Code fork and VS Code,
01:46:03.400 | the code is all public.
01:46:04.920 | So these models in pre-training have seen all the code.
01:46:08.680 | They've probably also seen questions and answers about it.
01:46:11.080 | And then they've been fine-tuned and RLHF'd
01:46:13.360 | to be able to answer questions about code in general.
01:46:16.040 | So when you ask it a question about VS Code,
01:46:18.880 | sometimes it'll hallucinate,
01:46:20.080 | but sometimes it actually does a pretty good job
01:46:22.960 | at answering the question.
01:46:24.760 | And I think like this is just by,
01:46:27.480 | it happens to be okay.
01:46:29.560 | But what if you could actually like specifically
01:46:31.840 | train or post-train a model
01:46:33.040 | such that it really was built to understand this code base?
01:46:37.520 | It's an open research question,
01:46:40.040 | one that we're quite interested in.
01:46:41.400 | And then there's also uncertainty of like,
01:46:43.000 | do you want the model to be the thing
01:46:44.640 | that end-to-end is doing everything?
01:46:46.800 | I.e. it's doing the retrieval in its internals
01:46:49.640 | and then kind of answering the question, creating the code,
01:46:51.840 | or do you want to separate the retrieval
01:46:55.200 | from the frontier model where maybe, you know,
01:46:58.080 | you'll get some really capable models
01:46:59.520 | that are much better than like the best open source ones
01:47:01.960 | in a handful of months.
01:47:03.280 | And then you'll want to separately train
01:47:07.120 | a really good open source model to be the retriever,
01:47:09.400 | to be the thing that feeds in the context
01:47:12.320 | to these larger models.
01:47:14.280 | - Can you speak a little more to the post-training a model
01:47:16.880 | to understand the code base?
01:47:18.800 | Like, what do you mean by that with,
01:47:20.320 | is this a synthetic data direction?
01:47:22.360 | Is this-
01:47:23.200 | - Yeah, I mean, there are many possible ways
01:47:25.560 | you could try doing it.
01:47:26.800 | There's certainly no shortage of ideas.
01:47:30.200 | It's just a question of going in
01:47:31.280 | and like trying all of them and being empirical
01:47:33.240 | about which one works best.
01:47:34.600 | You know, one very naive thing is to try to replicate
01:47:38.840 | what's done with VS Code and these frontier models.
01:47:43.080 | So let's like continue pre-training,
01:47:45.840 | some kind of continued pre-training
01:47:46.880 | that includes general code data,
01:47:48.040 | but also throws in a lot of the data
01:47:50.440 | of some particular repository that you care about.
01:47:53.120 | And then in post-training, meaning in,
01:47:56.400 | let's just start with instruction fine-tuning,
01:47:58.360 | you have like a normal instruction fine-tuning data set
01:48:00.440 | about code, but you throw in a lot of questions
01:48:03.480 | about code in that repository.
01:48:07.040 | So you could either get ground truth ones,
01:48:09.680 | which might be difficult,
01:48:10.520 | or you could do what you kind of hinted at
01:48:12.200 | or suggested using synthetic data,
01:48:14.800 | i.e. kind of having the model ask questions
01:48:19.800 | about various pieces of the code.
01:48:22.880 | So you kind of take the pieces of the code,
01:48:24.440 | then prompt the model or have a model propose a question
01:48:27.800 | for that piece of code,
01:48:28.960 | and then add those as instruction fine-tuning data points.
01:48:32.560 | And then in theory, this might unlock the model's ability
01:48:36.200 | to answer questions about that code base.
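A minimal sketch of that synthetic question-generation idea, with ask_model as a hypothetical chat call and the prompts purely illustrative:

```python
def make_repo_qa_dataset(code_chunks, ask_model):
    """For each chunk of the repository, have a model propose a question that the
    chunk answers, then pair them up as (instruction, response) fine-tuning examples."""
    examples = []
    for chunk in code_chunks:
        question = ask_model(
            "Write one question a developer might ask that the following code answers:\n\n"
            + chunk
        )
        answer = ask_model(
            f"Answer the question using only this code.\n\nQuestion: {question}\n\nCode:\n{chunk}"
        )
        examples.append({"instruction": question, "response": answer})
    return examples
```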
01:48:39.400 | - Let me ask you about OpenAI o1.
01:48:42.440 | What do you think is the role of that kind of
01:48:44.480 | test-time compute system in programming?
01:48:47.280 | - I think test-time compute is really, really interesting.
01:48:50.800 | So there's been the pre-training regime,
01:48:52.600 | which will kind of, as you scale up the amount of data
01:48:57.040 | and the size of your model,
01:48:58.040 | get you better and better performance,
01:48:59.480 | both on loss and then on downstream benchmarks,
01:49:02.600 | and just general performance when we use it for coding
01:49:05.200 | or other tasks.
01:49:07.000 | We're starting to hit a bit of a data wall,
01:49:12.600 | meaning it's going to be hard
01:49:13.960 | to continue scaling up this regime.
01:49:16.040 | And so scaling up test-time compute
01:49:18.360 | is an interesting way of now, you know,
01:49:20.120 | increasing the number of inference time flops that we use,
01:49:24.600 | but still getting, like, yeah,
01:49:27.240 | as you increase the number of flops you use inference time,
01:49:29.280 | getting corresponding improvements
01:49:31.760 | in the performance of these models.
01:49:33.400 | Traditionally, we just had to literally train a bigger model
01:49:35.560 | that always used that many more flops,
01:49:38.840 | but now we could perhaps use the same size model
01:49:41.480 | and run it for longer to be able to get an answer
01:49:45.400 | at the quality of a much larger model.
01:49:46.760 | And so the really interesting thing I like about this
01:49:49.480 | is there are some problems that perhaps require
01:49:53.200 | 100 trillion parameter model intelligence
01:49:55.200 | trained on 100 trillion tokens,
01:49:56.760 | but that's, like, maybe 1%,
01:50:00.200 | maybe, like, 0.1% of all queries.
01:50:02.920 | So are you going to spend all of this effort,
01:50:05.560 | all this compute training a model that costs that much
01:50:09.560 | and then run it so infrequently?
01:50:12.080 | It feels completely wasteful
01:50:13.840 | when instead you could
01:50:16.040 | train the model that's capable of doing
01:50:18.160 | the 99.9% of queries,
01:50:20.240 | then you have a way of inference time running it longer
01:50:23.560 | for those few people that really,
01:50:25.120 | really want max intelligence.
01:50:26.960 | - How do you figure out which problem
01:50:30.560 | requires what level of intelligence?
01:50:33.320 | Is that possible to dynamically figure out
01:50:35.120 | when to use GPT-4, when to use,
01:50:37.480 | like, when to use a small model
01:50:39.000 | and when you need the O-1?
01:50:41.680 | - I mean, yeah, that's an open research problem, certainly.
01:50:47.240 | I don't think anyone's actually cracked
01:50:48.760 | this model routing problem quite well.
01:50:51.040 | We'd like to.
01:50:51.880 | We have, like, initial implementations of this
01:50:55.600 | for something like Cursor Tab,
01:50:57.040 | but at the level of, like,
01:50:59.520 | going between 4o, Sonnet, and o1,
01:51:02.600 | it's a bit trickier.
01:51:04.880 | There's also a question of, like,
01:51:05.800 | what level of intelligence do you need
01:51:07.720 | to determine if the thing is too hard
01:51:12.200 | for the 4o-level model?
01:51:13.800 | Maybe you need the O-1 level model.
01:51:17.680 | It's really unclear.
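For illustration only, two simple shapes a router could take; the callables here (classify_difficulty, verify, and the two models) are assumptions, and as noted above nobody has cracked this problem properly yet:

```python
def route(query, classify_difficulty, small_model, large_model, threshold=0.7):
    """Score-based routing: send easy queries to the cheap model, escalate hard ones.
    classify_difficulty could itself be a tiny model returning a score in [0, 1]."""
    return large_model(query) if classify_difficulty(query) >= threshold else small_model(query)

def cascade(query, small_model, large_model, verify):
    """Cascade routing: try the cheap model first and escalate only if a verifier
    (tests, a checker model, etc.) rejects its answer."""
    answer = small_model(query)
    return answer if verify(query, answer) else large_model(query)
```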
01:51:19.320 | - But you mentioned, so there's a pre-training process,
01:51:23.560 | then there's post-training,
01:51:25.160 | and then there's, like, test-time compute.
01:51:27.080 | Is that fair to sort of separate?
01:51:28.680 | Where's the biggest gains?
01:51:30.080 | - Well, it's weird, because, like, test-time compute,
01:51:33.600 | there's, like, a whole training strategy needed
01:51:36.120 | to get test-time compute to work,
01:51:38.040 | and the other really weird thing about this is no one,
01:51:42.680 | like, outside of the big labs,
01:51:44.520 | and maybe even just OpenAI,
01:51:45.960 | no one really knows how it works.
01:51:47.680 | Like, there have been some really interesting papers
01:51:49.840 | that show hints of what they might be doing.
01:51:53.680 | And so perhaps they're doing something
01:51:56.680 | with tree search using process reward models.
01:52:00.080 | But yeah, I just, I think the issue is
01:52:02.520 | we don't quite know exactly what it looks like,
01:52:04.840 | so it would be hard to kind of comment
01:52:06.320 | on, like, where it fits in.
01:52:07.960 | I would put it in post-training,
01:52:09.400 | but maybe, like, the compute spent for this kind of,
01:52:12.160 | for getting test-time compute to work for a model
01:52:14.680 | is going to dwarf pre-training eventually.
01:52:17.520 | - So we don't even know if O1 is using
01:52:21.600 | just, like, chain-of-thought RL.
01:52:23.800 | We don't know how they're using any of these.
01:52:26.000 | We don't know anything.
01:52:27.240 | - It's fun to speculate.
01:52:28.320 | (all laughing)
01:52:30.520 | - Like, if you were to build a competing model,
01:52:33.360 | what would you do?
01:52:35.000 | - Yeah, so one thing to do would be,
01:52:38.240 | I think you probably need to train a process reward model,
01:52:41.040 | which is, so maybe we can get into reward models
01:52:43.720 | and outcome reward models versus process reward models.
01:52:46.320 | Outcome reward models are the kind of
01:52:48.000 | traditional reward models that people train
01:52:50.560 | for language modeling,
01:52:53.880 | and it's just looking at the final thing.
01:52:55.360 | So if you're doing some math problem,
01:52:56.520 | let's look at that final thing you've done, everything,
01:52:59.120 | and let's assign a grade to it,
01:53:02.080 | how likely we think, like,
01:53:03.640 | what's the reward for this outcome.
01:53:05.760 | Process reward models, instead,
01:53:07.120 | try to grade the chain of thought.
01:53:09.240 | And so OpenAI had some preliminary paper on this,
01:53:11.600 | I think last summer,
01:53:13.800 | where they used human labelers
01:53:17.120 | to get this pretty large, several hundred thousand dataset
01:53:20.280 | of grading chains of thought.
01:53:21.960 | Ultimately, it feels like,
01:53:24.840 | I haven't seen anything interesting
01:53:26.720 | in the ways that people use process reward models
01:53:29.280 | outside of just using it as a means of
01:53:33.160 | affecting how we choose between a bunch of samples.
01:53:36.400 | So like what people do in all these papers
01:53:39.000 | is they sample a bunch of outputs from the language model,
01:53:42.000 | and then use the process reward models
01:53:44.440 | to grade all those generations
01:53:47.200 | alongside maybe some other heuristics,
01:53:49.040 | and then use that to choose the best answer.
01:53:51.640 | The really interesting thing that people think might work
01:53:55.000 | and people want to work
01:53:56.320 | is tree search with these process reward models,
01:53:58.760 | because if you really can grade every single step
01:54:02.280 | of the chain of thought,
01:54:03.640 | then you can kind of branch out
01:54:05.720 | and explore multiple paths of this chain of thought,
01:54:08.880 | and then use these process reward models
01:54:10.400 | to evaluate how good is this branch that you're taking.
01:54:14.000 | - Yeah, when the quality of the branch
01:54:16.600 | is somehow strongly correlated
01:54:18.240 | with the quality of the outcome at the very end.
01:54:20.480 | So like you have a good model
01:54:21.760 | of knowing which branch to take.
01:54:23.440 | So not just in the short term,
01:54:24.960 | and like in the long term.
01:54:25.920 | - Yeah, and like the interesting work
01:54:27.400 | that I think has been done
01:54:28.240 | is figuring out how to properly train the process,
01:54:30.880 | or the interesting work that has been open sourced
01:54:33.600 | and people I think talk about
01:54:35.520 | is how to train the process reward models,
01:54:38.880 | maybe in a more automated way.
01:54:41.000 | I could be wrong here,
01:54:42.200 | could not be mentioning something,
01:54:43.400 | because I haven't seen anything super,
01:54:46.000 | that seems to work really well
01:54:47.440 | for using the process reward models creatively
01:54:50.800 | to do tree searching code.
01:54:52.720 | - This is kind of an AI safety,
01:54:54.160 | maybe a bit of a philosophy question.
01:54:55.840 | So OpenAI says that they're hiding the chain of thought
01:54:58.560 | from the user.
01:54:59.960 | And they've said that that was a difficult decision to make.
01:55:03.120 | They, instead of showing the chain of thought,
01:55:06.080 | they're asking the model to summarize the chain of thought.
01:55:09.280 | They're also in the background saying
01:55:10.560 | they're going to monitor the chain of thought
01:55:13.000 | to make sure the model is not trying to manipulate the user,
01:55:15.840 | which is a fascinating possibility.
01:55:17.760 | But anyway,
01:55:18.600 | what do you think about hiding the chain of thought?
01:55:21.160 | - One consideration for OpenAI,
01:55:22.720 | and this is completely speculative,
01:55:24.560 | could be that they wanna make it hard for people
01:55:26.920 | to distill these capabilities out of their model.
01:55:29.720 | It might actually be easier
01:55:31.120 | if you had access to that hidden chain of thought
01:55:33.600 | to replicate the technology,
01:55:36.040 | 'cause that's pretty important data,
01:55:37.120 | like seeing the steps that the model took
01:55:38.840 | to get to the final result.
01:55:39.920 | - So you could probably train on that also.
01:55:42.360 | - And there was sort of a mirror situation with this,
01:55:45.240 | with some of the large language model providers,
01:55:47.040 | and also this is speculation,
01:55:48.760 | but some of these APIs
01:55:52.120 | used to offer easy access to log probabilities
01:55:55.360 | for all the tokens that they're generating,
01:55:57.640 | and also log probabilities for the prompt tokens.
01:55:59.960 | And then some of these APIs took those away.
01:56:02.640 | And again, complete speculation,
01:56:03.880 | but one of the thoughts is that
01:56:07.360 | the reason those were taken away is
01:56:08.840 | if you have access to log probabilities,
01:56:11.080 | similar to this hidden chain of thought,
01:56:12.520 | that can give you even more information
01:56:13.840 | to try and distill these capabilities out of the APIs,
01:56:16.960 | out of these biggest models,
01:56:18.680 | into models you control.
01:56:20.040 | As an asterisk on also the previous discussion
01:56:23.200 | about us integrating O1,
01:56:26.120 | I think that we're still learning how to use this model.
01:56:29.320 | So we made O1 available in Cursor
01:56:31.120 | because when we got the model,
01:56:33.880 | we were really interested in trying it out.
01:56:35.840 | I think a lot of programmers
01:56:37.280 | are gonna be interested in trying it out,
01:56:38.960 | but O1 is not part of the default Cursor experience
01:56:43.520 | in any way.
01:56:44.360 | And we still haven't found a way
01:56:47.480 | to get it integrated into the editor
01:56:51.240 | in a way that we reach for sort of every hour,
01:56:54.880 | maybe even every day.
01:56:56.200 | And so I think the jury's still out
01:56:58.560 | on how to use the model.
01:57:00.080 | And we haven't seen examples yet
01:57:04.120 | of people releasing things where it seems really clear,
01:57:07.360 | like, oh, that's like now the use case.
01:57:09.760 | The obvious one to return to
01:57:11.240 | is maybe this can make it easier
01:57:12.880 | for you to have these background things running, right?
01:57:15.240 | To have these models in loops,
01:57:16.120 | to have these models be agentic.
01:57:17.800 | But we're still discovering.
01:57:22.560 | - To be clear, we have ideas.
01:57:24.040 | We just need to try
01:57:25.760 | and get something incredibly useful
01:57:28.160 | before we put it out there.
01:57:29.600 | - But it has these significant limitations.
01:57:31.720 | Like, even like barring capabilities,
01:57:35.640 | it does not stream.
01:57:37.600 | And that means it's really, really painful to use
01:57:40.560 | for things where you want to supervise the output.
01:57:43.280 | And instead, you're just waiting
01:57:45.240 | for the wall of text to show up.
01:57:47.320 | Also, it does feel like the early innings
01:57:49.480 | of test-time compute and search,
01:57:50.840 | where it's just like very, very much a V0.
01:57:54.640 | And there's so many things that like don't feel quite right.
01:57:58.760 | And I suspect in parallel
01:58:01.760 | to people increasing the amount of pre-training data
01:58:05.800 | and the size of the models in pre-training
01:58:07.080 | and finding tricks there,
01:58:08.240 | you'll now have this other thread
01:58:09.920 | of getting search to work better and better.
01:58:12.640 | - So let me ask you about Strawberry Tomorrow Eyes.
01:58:19.840 | So it looks like GitHub Copilot
01:58:24.440 | might be integrating O1 in some kind of way.
01:58:28.280 | And I think some of the comments are saying,
01:58:29.920 | does this mean Cursor is done?
01:58:31.520 | I think I saw one comment saying that.
01:58:35.000 | - I saw, time to shut down Cursor.
01:58:37.120 | - Time to shut down Cursor, thank you.
01:58:39.440 | So is it time to shut down Cursor?
01:58:41.400 | - I think this space is a little bit different
01:58:43.080 | from past software spaces over the 2010s,
01:58:46.840 | where I think that the ceiling here
01:58:49.160 | is really, really, really incredibly high.
01:58:51.280 | And so I think that the best product in three to four years
01:58:54.360 | will just be so much more useful
01:58:55.680 | than the best product today.
01:58:57.320 | And you can like wax poetic about moats this
01:59:01.640 | and brand that, and this is our advantage.
01:59:05.000 | But I think in the end, just if you don't have,
01:59:07.560 | like if you stop innovating on the product, you will lose.
01:59:10.800 | And that's also great for startups.
01:59:13.360 | That's great for people trying to enter this market
01:59:16.040 | because it means you have an opportunity
01:59:17.960 | to win against people who have, you know,
01:59:19.800 | lots of users already by just building something better.
01:59:23.000 | And so I think, yeah, over the next few years,
01:59:26.120 | it's just about building the best product,
01:59:28.800 | building the best system, and that both comes down
01:59:31.240 | to the modeling engine side of things.
01:59:34.480 | And it also comes down to the editing experience.
01:59:37.440 | - Yeah, I think most of the additional value
01:59:40.120 | from Cursor versus everything else out there
01:59:42.520 | is not just integrating the new model fast, like o1.
01:59:46.160 | And it comes from all of the kind of depth
01:59:49.480 | that goes into these custom models
01:59:51.560 | that you don't realize are working for you
01:59:53.480 | in kind of every facet of the product,
01:59:55.480 | as well as like the really thoughtful UX
01:59:59.400 | with every single feature.
02:00:00.720 | - All right, from that profound answer,
02:00:03.800 | let's descend back down to the technical.
02:00:05.560 | You mentioned you have a taxonomy of synthetic data.
02:00:08.480 | - Oh, yeah.
02:00:09.720 | - Can you please explain?
02:00:10.600 | - Yeah, I think there are three main kinds of synthetic data.
02:00:15.240 | The first is, so what is synthetic data first?
02:00:18.200 | So there's normal data, like non-synthetic data,
02:00:20.400 | which is just data that's naturally created,
02:00:23.800 | i.e. usually it'll be from humans having done things.
02:00:27.120 | So from some human process, you get this data.
02:00:30.480 | Synthetic data, the first one would be distillation.
02:00:34.720 | So having a language model, kind of output tokens
02:00:38.080 | or probability distributions over tokens.
02:00:41.760 | And then you can train some less capable model on this.
02:00:45.640 | This approach is not gonna get you a net,
02:00:47.960 | like more capable model than the original one
02:00:49.880 | that has produced the tokens.
02:00:51.320 | But it's really useful for if there's some capability
02:00:55.360 | you wanna elicit from some really expensive
02:00:58.040 | high latency model, you can then distill that down
02:01:00.880 | into some smaller task specific model.
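As an aside, a generic sketch of what distillation on teacher token distributions can look like (a standard soft-label KL loss in PyTorch; not any particular lab's recipe):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Push the student's token distribution toward the teacher's.
    Logits have shape (batch, seq, vocab)."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 as is conventional
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)
```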
02:01:03.400 | The second kind is when like one direction of the problem
02:01:09.280 | is easier than the reverse.
02:01:11.840 | And so a great example of this is bug detection,
02:01:16.120 | like we mentioned earlier, where it's a lot easier
02:01:19.840 | to introduce reasonable looking bugs
02:01:22.600 | than it is to actually detect them.
02:01:24.960 | And this is probably the case for humans too.
02:01:27.200 | And so what you can do is you can get a model
02:01:31.440 | that's not training that much data, that's not that smart
02:01:34.320 | to introduce a bunch of bugs in code.
02:01:35.840 | And then you can use that to then train,
02:01:38.320 | use this synthetic data to train a model
02:01:39.800 | that can be really good at detecting bugs.
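A toy version of that forward direction, using hand-written mutations where a real pipeline would use a model to propose realistic-looking bugs; the buggy/clean pairs then become labels for training a detector:

```python
import random

def introduce_bug(code: str) -> tuple[str, str]:
    """Apply one cheap mutation to a snippet and record what was changed."""
    mutations = [
        ("==", "!="),
        ("<", "<="),
        ("+ 1", "- 1"),
        (" and ", " or "),
    ]
    random.shuffle(mutations)
    for old, new in mutations:
        if old in code:
            return code.replace(old, new, 1), f"replaced '{old}' with '{new}'"
    return code, "no mutation applied"

# dataset = [(buggy, "buggy"), (clean, "clean"), ...] -> train a bug detector on it
```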
02:01:42.240 | The last category, I think is, I guess the main one
02:01:45.240 | that it feels like the big labs are doing
02:01:48.360 | for synthetic data, which is producing texts
02:01:52.800 | with language models that can then be verified easily.
02:01:57.360 | So like, extreme example of this
02:01:59.920 | is if you have a verification system that can detect
02:02:02.840 | if language is Shakespeare level
02:02:05.760 | and then you have a bunch of monkeys typing on typewriters.
02:02:08.160 | Like, you can eventually get enough training data
02:02:10.640 | to train a Shakespeare level language model.
02:02:12.640 | And I mean, this is the case, like very much the case
02:02:14.760 | for math where verification is actually really, really easy
02:02:19.160 | for formal languages.
02:02:22.680 | And then what you can do is you can have an okay model,
02:02:26.200 | generate a ton of rollouts and then choose the ones
02:02:29.600 | that you know have actually proved
02:02:31.840 | the ground truth theorems and train that further.
02:02:34.680 | There's similar things you can do for code
02:02:36.360 | with LeetCode like problems where if you have some set
02:02:40.400 | of tests that you know correspond to,
02:02:42.440 | if something passes these tests,
02:02:43.760 | it is actually solved the problem.
02:02:45.600 | You could do the same thing where you verify
02:02:46.880 | that it's passed the tests and then train the model
02:02:48.760 | on the outputs that have passed the tests.
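A sketch of that generate-then-verify loop for LeetCode-style problems, with generate as a hypothetical sampling call and a subprocess run standing in for the test harness:

```python
import subprocess
import tempfile

def passes_tests(solution_code: str, test_code: str, timeout: int = 10) -> bool:
    """Run a candidate solution against a known test suite in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def filter_rollouts(prompt: str, tests: str, generate, n: int = 16) -> list:
    """Sample many solutions from an okay model and keep only the verified ones
    as synthetic training data."""
    keepers = []
    for _ in range(n):
        candidate = generate(prompt)
        if passes_tests(candidate, tests):
            keepers.append({"prompt": prompt, "solution": candidate})
    return keepers
```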
02:02:51.680 | I think it's gonna be a little tricky getting this to work
02:02:54.280 | in all domains or just in general.
02:02:57.720 | Like having the perfect verifier feels really, really hard
02:03:00.440 | to do with just like open-ended miscellaneous tasks
02:03:04.760 | you give the model, or more like long-horizon tasks,
02:03:07.800 | even in coding.
02:03:09.040 | - That's 'cause you're not as optimistic as Arvid, but yeah.
02:03:12.280 | So yeah, so that third category requires having a verifier.
02:03:16.560 | - Yeah, verification is, it feels like it's best
02:03:18.880 | when you know for a fact that it's correct.
02:03:20.520 | And like, then it wouldn't be like using a language model
02:03:23.720 | to verify, it would be using tests or formal systems.
02:03:28.440 | - Or running the thing too.
02:03:30.640 | Doing like the human form of verification
02:03:32.440 | where you just do manual quality control.
02:03:34.280 | - Yeah, yeah.
02:03:35.360 | - But like the language model version of that
02:03:37.000 | where it's like running the thing
02:03:37.840 | and it actually understands the output.
02:03:39.760 | - Yeah, no, that's true.
02:03:40.600 | - Sort of somewhere between.
02:03:41.880 | - Yeah, I think that's the category that is most likely
02:03:45.680 | to result in like massive gains.
02:03:48.280 | - What about the RL-with-feedback side, RLHF versus RLAIF?
02:03:52.520 | What's the role of that in getting better performance
02:03:57.920 | on the models?
02:04:00.080 | - Yeah, so RLHF is when the reward model you use
02:04:05.080 | is trained from some labels you've collected
02:04:09.880 | from humans giving feedback.
02:04:11.400 | I think this works if you have the ability
02:04:15.360 | to get a ton of human feedback
02:04:18.280 | for this kind of task that you care about.
02:04:20.840 | RLAIF is interesting because you're kind of depending on,
02:04:26.840 | like this is actually kind of going to,
02:04:30.000 | it's depending on the constraint that verification
02:04:33.200 | is actually a decent bit easier than generation.
02:04:36.880 | Because it feels like, okay, like, what are you doing?
02:04:38.920 | Are you using this language model
02:04:40.080 | to look at the language model outputs
02:04:41.320 | and then improve the language model?
02:04:42.680 | But no, it actually may work if the language model
02:04:46.720 | has a much easier time verifying some solution
02:04:49.960 | than it does generating it.
02:04:50.880 | Then you actually could perhaps get this kind of recursive.
02:04:54.240 | I don't think it's going to look exactly like that.
02:04:56.840 | The other thing you could do is,
02:04:59.040 | that we kind of do is like a little bit of a mix
02:05:03.200 | of RLAIF and RLHF,
02:05:05.440 | where usually the model is actually quite correct.
02:05:07.640 | And this is in the case of Cursor Tab,
02:05:09.840 | picking between like two possible generations
02:05:13.360 | of what is the better one.
02:05:15.040 | And then it just needs like a hand,
02:05:16.720 | a little bit of human nudging
02:05:18.880 | with only like on the order of 50, 100 examples
02:05:24.080 | to like kind of align that prior the model has
02:05:27.240 | exactly with what you want.
02:05:29.200 | It looks different than I think normal RLHF
02:05:31.240 | where you're usually training these reward models
02:05:33.120 | on tons of examples.
02:05:34.520 | - What's your intuition when you compare generation
02:05:39.320 | and verification, or generation and ranking?
02:05:42.360 | Is ranking way easier than generation?
02:05:45.840 | - My intuition would just say, yeah, it should be.
02:05:49.160 | Like this is kind of going back to,
02:05:53.800 | like if you believe P does not equal NP,
02:05:56.600 | then there's this massive class of problems
02:05:59.520 | that are much, much easier to verify given a proof
02:06:02.240 | than actually proving it.
02:06:03.920 | - I wonder if the same thing will prove P not equal to NP
02:06:07.240 | or P equal to NP.
02:06:08.480 | - That would be, that would be really cool.
02:06:11.640 | - That'd be, whatever, a Fields Medal by AI.
02:06:16.200 | Who gets the credit?
02:06:17.800 | Another open philosophical question.
02:06:19.640 | - I'm actually--
02:06:22.040 | - Whoever prompted it.
02:06:22.880 | (laughs)
02:06:24.240 | - I'm actually surprisingly curious what like a good bet
02:06:27.760 | for when AI will get the Fields Medal will be.
02:06:31.280 | I actually don't have--
02:06:32.120 | - Isn't this Aman's specialty?
02:06:33.120 | - I don't know what Aman's bet here is.
02:06:35.400 | - Oh, sorry, Nobel Prize or Fields Medal first?
02:06:37.760 | - Fields Medal.
02:06:38.600 | - Well, Fields Medal level.
02:06:39.720 | - Fields Medal comes first, I think.
02:06:41.280 | - Fields Medal comes first.
02:06:42.520 | Well, you would say that, of course.
02:06:44.840 | - But it's also this like isolated system
02:06:46.600 | you can verify and--
02:06:47.880 | - Sure.
02:06:48.920 | - Like, I don't even know if I would--
02:06:49.760 | - You don't need to do--
02:06:50.600 | - I feel like I have much more to do there.
02:06:51.720 | I felt like the path to get to IMO
02:06:53.520 | was a little bit more clear
02:06:55.160 | because it already could get a few IMO problems.
02:06:57.720 | And there were a bunch of like,
02:06:59.040 | there was a bunch of low hanging fruit
02:07:00.360 | given the literature at the time
02:07:01.600 | of like what tactics people could take.
02:07:04.000 | I think I am, one, much less versed in the space
02:07:06.520 | of theorem proving now.
02:07:07.760 | And two, yeah, less intuition about how close we are
02:07:11.720 | to solving these really, really hard open problems.
02:07:15.600 | - So you think it'll be Fields Medal first?
02:07:17.280 | It won't be like in physics or in--
02:07:20.400 | - Oh, 100%.
02:07:21.240 | I think that's probably more likely.
02:07:23.840 | Like, it's probably much more likely that it'll get in.
02:07:26.800 | Yeah, yeah, yeah, yeah.
02:07:27.640 | Well, I think it goes to like, I don't know,
02:07:29.080 | like BSD, which is the Birch and Swinnerton-Dyer conjecture,
02:07:32.240 | or like Riemann hypothesis,
02:07:33.680 | or any one of these like hard math problems,
02:07:36.720 | which is actually really hard.
02:07:38.560 | It's sort of unclear what the path to get
02:07:41.400 | even a solution looks like.
02:07:42.920 | Like, we don't even know what a path looks like,
02:07:44.640 | let alone--
02:07:45.480 | - And you don't buy the idea
02:07:47.480 | that this is like an isolated system
02:07:49.120 | and you can actually have a good reward system,
02:07:51.280 | and it feels like it's easier to train for that.
02:07:56.000 | - I think we might get Fields Medal before AGI.
02:07:59.520 | - I think--
02:08:00.360 | - I mean, I'd be very happy.
02:08:02.840 | (laughs)
02:08:03.680 | I'd be very happy.
02:08:04.500 | But I don't know. I think 2028, 2030.
02:08:08.520 | (laughs)
02:08:09.720 | - For Fields Medal?
02:08:10.880 | - Fields Medal.
02:08:11.720 | - All right.
02:08:12.960 | It feels like forever from now,
02:08:15.040 | given how fast things have been going.
02:08:17.560 | - Speaking of how fast things have been going,
02:08:19.160 | let's talk about scaling laws.
02:08:21.440 | So for people who don't know,
02:08:23.000 | maybe it's good to talk about this whole idea
02:08:28.920 | of scaling laws.
02:08:30.040 | What are they?
02:08:31.000 | Where do you think we stand?
02:08:32.200 | And where do you think things are going?
02:08:34.360 | - I think it was interesting,
02:08:35.200 | the original scaling laws paper by OpenAI
02:08:37.160 | was slightly wrong,
02:08:38.000 | 'cause, I think, of some issues they had
02:08:40.480 | with learning rate schedules.
02:08:43.160 | And then Chinchilla showed a more correct version.
02:08:46.520 | And then from then people have, again,
02:08:48.400 | kind of deviated from doing the compute optimal thing,
02:08:50.360 | 'cause people start now optimizing more so
02:08:53.360 | for making the thing work really well,
02:08:56.400 | given an inference budget.
02:08:58.920 | And I think there are a lot more dimensions to these curves
02:09:03.280 | than what we originally used of just compute,
02:09:06.680 | number of parameters and data.
02:09:09.640 | Like inference compute is the obvious one.
02:09:12.600 | I think context length is another obvious one.
02:09:14.720 | So if you care,
02:09:15.560 | like let's say you care about the two things
02:09:16.800 | of inference compute and then context window,
02:09:21.240 | maybe the thing you wanna train is some kind of SSM
02:09:24.680 | because they're much, much cheaper and faster
02:09:27.480 | at super, super long context.
02:09:28.920 | And even if it maybe has 10X worse scaling properties
02:09:31.680 | during training,
02:09:32.520 | meaning you've spent 10X more compute
02:09:34.520 | to train the thing to get the same level of capabilities,
02:09:37.840 | it's worth it because you care most
02:09:40.080 | about that inference budget for really long context windows.
02:09:43.400 | So it'll be interesting to see how people kind of play
02:09:46.000 | with all these dimensions.
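As a rough back-of-the-envelope illustration of the trade-off being described, the sketch below splits a fixed training-FLOP budget using the common C ≈ 6·N·D approximation and a tokens-per-parameter knob; the 20-tokens-per-parameter figure is the usual Chinchilla rule of thumb, and the larger values stand in for "overtrained," inference-friendly models. The numbers and helper names are assumptions for illustration, not figures from any particular paper.

```python
# Back-of-the-envelope Chinchilla-style allocation (illustrative approximations only).
# C ~= 6 * N * D (training FLOPs), with the rough heuristic D ~= 20 * N at the
# compute-optimal point; real fits differ, and labs now often "overtrain"
# (more tokens per parameter) to buy cheaper inference.

def chinchilla_split(train_flops: float, tokens_per_param: float = 20.0):
    """Return (params N, tokens D) for a given training-FLOP budget."""
    n = (train_flops / (6.0 * tokens_per_param)) ** 0.5
    d = tokens_per_param * n
    return n, d

def inference_cost_per_token(n_params: float) -> float:
    """Very rough: ~2 FLOPs per parameter per generated token."""
    return 2.0 * n_params

budget = 1e24  # hypothetical training budget in FLOPs
for tpp in (20.0, 100.0, 500.0):  # compute-optimal vs. increasingly overtrained
    n, d = chinchilla_split(budget, tokens_per_param=tpp)
    print(f"tokens/param={tpp:>5.0f}  params={n:.2e}  tokens={d:.2e}  "
          f"inference FLOPs/token={inference_cost_per_token(n):.2e}")
```

Running it shows the basic shape of the trade-off: at a fixed training budget, pushing tokens-per-parameter up gives a smaller model trained on more data, which is cheaper per token at inference time.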
02:09:47.520 | - So, yeah.
02:09:48.360 | I mean, you speak to the multiple dimensions, obviously.
02:09:49.880 | The original conception was just looking at the variables
02:09:52.400 | of the size of the model as measured by parameters
02:09:55.480 | and the size of the data as measured
02:09:56.920 | by the number of tokens and looking at the ratio of the two.
02:09:59.760 | - Yeah.
02:10:00.600 | - And it's kind of a compelling notion
02:10:02.520 | that there is a number or at least a minimum.
02:10:06.360 | And it seems like one was emerging.
02:10:10.440 | Do you still believe that there is a kind of,
02:10:13.200 | bigger is better?
02:10:14.240 | - I mean, I think bigger is certainly better
02:10:19.080 | for just raw performance.
02:10:21.520 | - And raw intelligence.
02:10:22.480 | - And raw intelligence.
02:10:23.560 | I think that the path that people might take is,
02:10:25.640 | I'm particularly bullish on distillation.
02:10:28.200 | And like, yeah.
02:10:29.040 | Like, how many knobs can you turn so that,
02:10:31.160 | if we spend like a ton, ton of money on training,
02:10:34.840 | we get the most capable, cheap model, right?
02:10:38.360 | Like really, really caring as much as you can.
02:10:40.360 | 'Cause like the naive version of caring as much as you can
02:10:42.960 | about inference time compute
02:10:43.920 | is what people have already done with like the Lama models
02:10:46.000 | or just overtraining the shit out of 7B models
02:10:50.160 | on way, way, way more tokens than is Chinchilla optimal.
02:10:54.160 | Right, but if you really care about it,
02:10:55.160 | maybe the thing to do is what Gemma did,
02:10:56.360 | which is let's not just train on tokens.
02:10:59.040 | Let's literally train on minimizing the KL divergence
02:11:04.040 | with the distribution of Gemma 27B, right?
02:11:08.480 | So knowledge distillation there.
02:11:11.000 | And you're spending the compute
02:11:12.720 | of literally training this 27-billion-parameter
02:11:15.760 | model on all these tokens
02:11:17.640 | just to get out this, I don't know, smaller model.
02:11:20.320 | - And the distillation gives you just a faster model.
02:11:22.480 | Smaller means faster.
02:11:23.840 | - Yeah, distillation in theory is,
02:11:25.680 | I think getting out more signal
02:11:29.080 | from the data that you're training on.
02:11:30.600 | And it's like another,
02:11:31.440 | it's perhaps another way of getting over,
02:11:33.800 | not like completely over,
02:11:35.000 | but like partially helping with the data wall.
02:11:37.640 | Where like you only have so much data to train on,
02:11:39.400 | let's like train this really, really big model
02:11:41.120 | on all these tokens and we'll distill it into a smaller one.
02:11:43.760 | And maybe we can get more signal per token
02:11:48.280 | for this much smaller model
02:11:50.040 | than we would have originally if we trained it.
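A minimal sketch of what "train on minimizing the KL divergence with the teacher's distribution" looks like in code, assuming the student and teacher share a vocabulary. This is generic knowledge distillation for illustration, not Gemma's actual recipe; the shapes and the temperature are made up.

```python
# Generic knowledge-distillation loss sketch (not Gemma's actual training recipe).
# The student is trained to match the teacher's full next-token distribution
# (soft targets) instead of only the observed one-hot token.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student), summed over positions and vocab, averaged over the batch.

    Both logit tensors have shape (batch, seq_len, vocab_size).
    """
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    # kl_div with log_target=True expects log-probabilities for both arguments.
    return F.kl_div(s_logprobs, t_logprobs, log_target=True,
                    reduction="batchmean") * (temperature ** 2)

# Toy shapes standing in for a big frozen teacher and a small student sharing a vocab.
batch, seq_len, vocab = 2, 16, 32000
teacher_logits = torch.randn(batch, seq_len, vocab)  # would come from the frozen teacher
student_logits = torch.randn(batch, seq_len, vocab, requires_grad=True)

loss = distillation_loss(student_logits, teacher_logits, temperature=2.0)
loss.backward()  # gradients flow only into the student
print(loss.item())
```

The design point this illustrates is the one made above: every token carries the teacher's whole distribution rather than a single hard label, which is one way of squeezing more signal per token out of a fixed dataset.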
02:11:51.600 | - So if I gave you $10 trillion, how would you spend it?
02:11:55.560 | I mean, you can't buy an island or whatever.
02:11:58.640 | How would you allocate it
02:12:00.360 | in terms of improving the big model
02:12:03.600 | versus maybe paying for the HF in RLHF?
02:12:08.600 | - Yeah, I think there's a lot of these secrets
02:12:14.000 | and details about training these large models
02:12:16.720 | that I just don't know
02:12:18.400 | and are only privy to the large labs.
02:12:19.960 | And the issue is I would waste a lot of that money
02:12:22.600 | if I even attempted this,
02:12:24.040 | because I wouldn't know those things.
02:12:26.360 | Suspending a lot of disbelief
02:12:28.200 | and assuming, like, you had the know-how,
02:12:32.960 | or are you saying, like, you have to operate
02:12:35.200 | with, like, the limited information you have now?
02:12:37.800 | - No, no, no.
02:12:38.640 | Actually, I would say you swoop in
02:12:40.800 | and you get all the information,
02:12:42.040 | all the little heuristics, all the little parameters,
02:12:44.560 | all the parameters that define how the thing is trained.
02:12:49.280 | If we look at how to invest money for the next five years
02:12:54.320 | in terms of maximizing what you called raw intelligence.
02:12:57.480 | - I mean, isn't the answer like really simple?
02:12:59.280 | You just try to get as much compute as possible?
02:13:02.200 | Like at the end of the day, all you need to buy is the GPUs
02:13:05.000 | and then sort of the researchers can find all the,
02:13:08.840 | like they can sort of, you can tune whether you want
02:13:12.320 | to pre-train a big model or a small model.
02:13:15.200 | - Well, this gets into the question of like,
02:13:16.560 | are you really limited by compute and money
02:13:18.920 | or are you limited by these other things?
02:13:21.040 | - I'm more partial to Arvid's belief
02:13:24.360 | that we're sort of idea-limited, but there's always-
02:13:27.760 | - But if you have a lot of compute,
02:13:30.640 | you can run a lot of experiments.
02:13:32.760 | - So you would run a lot of experiments
02:13:34.920 | versus like use that compute to train a gigantic model.
02:13:38.560 | - I would, but I do believe that we are limited
02:13:42.560 | in terms of ideas that we have.
02:13:44.600 | - I think, yeah, 'cause even with all this compute
02:13:47.960 | and like, you know, all the data you could collect
02:13:49.920 | in the world, I think you really are ultimately limited
02:13:52.680 | by not even ideas, but just like really good engineering.
02:13:58.920 | Like even with all the capital in the world,
02:14:00.880 | would you really be able to assemble,
02:14:03.560 | like there aren't that many people in the world
02:14:05.520 | who really can like make the difference here.
02:14:08.000 | And there's so much work that goes into research
02:14:11.640 | that is just like pure, really, really hard engineering work.
02:14:15.760 | As like a very kind of hand-wavy example,
02:14:18.680 | if you look at the original "Transformer" paper,
02:14:20.680 | how much of the work was kind of joining together
02:14:22.800 | a lot of these really interesting concepts
02:14:25.160 | embedded in the literature versus then going in
02:14:28.720 | and writing all the code,
02:14:30.160 | like maybe the CUDA kernels, maybe whatever else,
02:14:31.880 | I don't know if it ran on GPUs or TPUs originally,
02:14:34.000 | such that it actually saturated the GPU performance, right?
02:14:38.360 | Getting Noam to go in and do all of this code, right?
02:14:41.200 | And Noam is like probably one of the best engineers
02:14:42.920 | in the world, or maybe going a step further,
02:14:45.160 | like the next generation of models, having these things,
02:14:47.720 | like getting model parallelism to work
02:14:49.480 | and scaling it on like, you know, thousands of,
02:14:51.680 | or maybe tens of thousands of like V100s,
02:14:54.280 | which I think GPT-3 may have been.
02:14:57.160 | There's just so much engineering effort
02:14:58.720 | that has to go into all of these things to make it work.
02:15:01.760 | If you really brought that cost down to like, you know,
02:15:07.680 | maybe not zero, but just made it 10X easier,
02:15:10.280 | made it super easy for someone with really fantastic ideas
02:15:13.560 | to immediately get to the version of like
02:15:16.000 | the new architecture they dreamed up
02:15:17.480 | that is like getting 50, 40% utilization on the GPUs.
02:15:22.840 | I think that would just speed up research by a ton.
02:15:27.640 | - I mean, I think if you see a clear path to improvement,
02:15:30.360 | you should always sort of take
02:15:31.720 | the low-hanging fruit first, right?
02:15:33.040 | And I think probably OpenAI and all the other labs
02:15:36.720 | did the right thing to pick off the low-hanging fruit,
02:15:39.280 | where the low-hanging fruit is like sort of,
02:15:41.920 | you could scale up to a GPT 4.25 scale,
02:15:47.680 | and you just keep scaling,
02:15:50.960 | and like things keep getting better.
02:15:53.200 | And as long as, like, you know,
02:15:55.440 | there's no point of experimenting with new ideas
02:15:57.440 | when like everything is working.
02:15:59.480 | And you should sort of bang on it
02:16:01.560 | and try to get as much juice out of it as possible.
02:16:04.120 | And then maybe when you really need new ideas,
02:16:07.040 | I think if you're spending 10 trillion dollars,
02:16:08.960 | you probably want to spend some of it,
02:16:10.720 | you know, to then actually, like, re-evaluate your ideas.
02:16:13.320 | Like, probably you're idea-limited at that point.
02:16:15.480 | - I think all of us believe new ideas
02:16:18.120 | are probably needed to get, you know,
02:16:20.520 | all the way there to AGI.
02:16:22.760 | And all of us also probably believe
02:16:27.160 | there exist ways of testing out those ideas
02:16:30.160 | at smaller scales and being fairly confident
02:16:34.040 | that they'll play out.
02:16:35.680 | It's just quite difficult for the labs
02:16:39.080 | in their current position to dedicate
02:16:41.400 | their very limited research and engineering talent
02:16:45.200 | to exploring all these other ideas
02:16:47.240 | when there's like this core thing
02:16:48.600 | that will probably like improve performance
02:16:52.640 | for some like decent amount of time.
02:16:54.560 | - Yeah, but also these big labs like winning.
02:16:59.040 | So they're just going wild.
02:17:02.400 | Okay, so how, big question looking out into the future.
02:17:07.400 | You're now at the center of the programming world.
02:17:12.000 | How do you think programming,
02:17:13.240 | the nature of programming changes
02:17:14.840 | in the next few months, in the next year,
02:17:17.600 | in the next two years, the next five years, 10 years?
02:17:20.840 | - I think we're really excited about a future
02:17:23.800 | where the programmer's in the driver's seat for a long time.
02:17:27.960 | And you've heard us talk about this a little bit,
02:17:30.320 | but one that emphasizes speed and agency
02:17:34.240 | for the programmer and control,
02:17:36.200 | the ability to modify anything you want to modify,
02:17:38.640 | the ability to iterate really fast
02:17:40.200 | in what you're building.
02:17:41.720 | And this is a little different, I think,
02:17:45.280 | than where some people are jumping to in the space,
02:17:50.280 | where I think one idea that's captivated people
02:17:54.200 | is can you talk to your computer?
02:17:58.280 | Can you have it build software for you
02:17:59.520 | as if you're talking to like an engineering department
02:18:01.400 | or an engineer over Slack?
02:18:02.680 | And can it just be this sort of isolated text box?
02:18:05.640 | And part of the reason we're not excited about that
02:18:10.720 | is some of the stuff we've talked about with latency.
02:18:12.760 | But then a big piece, a reason we're not excited about that
02:18:16.040 | is because that comes with giving up a lot of control.
02:18:19.080 | It's much harder to be really specific
02:18:20.640 | when you're talking in the text box.
02:18:22.360 | And if you're necessarily just going to communicate
02:18:25.760 | with a thing, like you would be communicating
02:18:27.200 | with an engineering department,
02:18:28.040 | you're actually abdicating tons and tons
02:18:29.520 | of really important decisions to this bot.
02:18:32.480 | And this kind of gets at fundamentally what engineering is.
02:18:38.600 | I think that some people
02:18:40.440 | who are a little bit more removed from engineering
02:18:41.840 | might think of it as the spec is completely written out
02:18:44.920 | and then the engineers just come and they just implement.
02:18:47.880 | And it's just about making the thing happen in code
02:18:49.960 | and making the thing exist.
02:18:52.040 | But I think a lot of the best engineering,
02:18:55.160 | the engineering we enjoy,
02:18:56.400 | involves tons of tiny micro decisions
02:18:59.520 | about what exactly you're building
02:19:01.240 | and about really hard trade-offs between speed and cost
02:19:05.080 | and just all the other things involved in a system.
02:19:08.320 | And we want, as long as humans
02:19:12.600 | are actually the ones designing the software
02:19:15.440 | and the ones specifying what they want to be built,
02:19:18.400 | and it's not just like company run by all AIs,
02:19:20.760 | we think you'll really want the human in a driver's seat
02:19:23.560 | dictating these decisions.
02:19:26.240 | And so the jury's still out on kind of what that looks like.
02:19:30.640 | I think that one weird idea for what that could look like
02:19:34.200 | is it could look like you can control
02:19:37.200 | the level of abstraction you view a code base at.
02:19:39.760 | And you can point at specific parts of a code base,
02:19:43.200 | and maybe you digest a code base
02:19:46.720 | by looking at it in the form of pseudocode.
02:19:49.120 | And you can actually edit that pseudocode too,
02:19:52.560 | and then have changes get made down
02:19:54.320 | at the sort of formal programming level.
02:19:56.520 | And you can gesture at any piece of logic
02:20:01.520 | in your software.
02:20:04.120 | It keeps the in-flow, text-editing component of programming,
02:20:07.120 | it keeps that control:
02:20:08.560 | you can even go down into the code,
02:20:10.040 | you can go at higher levels of abstraction,
02:20:12.320 | while also giving you these big productivity gains.
02:20:14.640 | - It'd be nice if you can go up and down
02:20:16.320 | the abstraction stack.
02:20:18.280 | - Yeah, and there are a lot of details to figure out there
02:20:20.200 | that's sort of like a fuzzy idea,
02:20:21.800 | time will tell if it actually works,
02:20:23.200 | but these principles of control and speed
02:20:25.760 | in the human in the driver's seat
02:20:26.640 | we think are really important.
02:20:28.680 | We think for some things, like Arvid mentioned before,
02:20:31.080 | for some styles of programming,
02:20:32.360 | you can kind of hand it off chatbot style,
02:20:34.800 | if you have a bug that's really well-specified,
02:20:36.800 | but that's not most of programming,
02:20:39.240 | and that's also not most of the programming
02:20:41.800 | we think a lot of people value.
02:20:43.440 | - What about like the fundamental skill of programming?
02:20:46.080 | There's a lot of people, like young people right now,
02:20:49.840 | kind of scared, like thinking,
02:20:53.800 | 'cause they like love programming,
02:20:55.240 | but they're scared about like,
02:20:56.280 | will I be able to have a future
02:20:57.640 | if I pursue this career path?
02:20:59.800 | Do you think the very skill of programming
02:21:01.840 | will change fundamentally?
02:21:04.040 | - I actually think this is a really, really exciting time
02:21:06.600 | to be building software.
02:21:08.040 | Like we remember what programming was like in 2013, 2012,
02:21:13.040 | whatever it was,
02:21:14.760 | and there was just so much more cruft and boilerplate
02:21:20.360 | and looking up something really gnarly,
02:21:25.320 | and that stuff still exists, it's definitely not at zero,
02:21:28.640 | but programming today is way more fun than back then.
02:21:32.520 | It's like, we're really getting down
02:21:34.200 | to the delight concentration,
02:21:36.720 | and all the things that really draw people to programming,
02:21:39.520 | like for instance, this element of being able
02:21:41.160 | to build things really fast and speed,
02:21:44.120 | and also individual control,
02:21:45.840 | like all those are just being turned up a ton.
02:21:48.320 | And so I think it's just gonna be,
02:21:50.280 | I think it's gonna be a really, really fun time
02:21:51.720 | for people who build software.
02:21:53.720 | I think that the skills will probably change too.
02:21:56.120 | I think that people's taste in creative ideas
02:21:58.600 | will be magnified, and it will be less about,
02:22:02.160 | maybe less a little bit about boilerplate text editing,
02:22:05.160 | maybe even a little bit less about carefulness,
02:22:07.840 | which I think is really important today.
02:22:10.760 | If you're a programmer, I think it'll be a lot more fun.
02:22:13.440 | - What do you guys think?
02:22:15.200 | - I agree.
02:22:16.120 | I'm very excited to be able to change,
02:22:18.320 | like just, one thing that happened recently
02:22:22.800 | was like we wanted to do a relatively big migration
02:22:25.800 | to our code base.
02:22:26.640 | We were using async local storage in Node.js,
02:22:30.120 | which is known to be not very performant,
02:22:31.960 | and we wanted to migrate to our context object.
02:22:33.760 | And this is a big migration
02:22:35.440 | and affects the entire code base.
02:22:37.640 | And Sualeh and I spent, I don't know, five days
02:22:41.360 | working through this, even with today's AI tools.
02:22:43.640 | And I am really excited for a future
02:22:47.040 | where I can just show a couple of examples,
02:22:50.520 | and then the AI applies that to all of the locations.
02:22:54.120 | And then it highlights, oh, this is a new example,
02:22:56.960 | like what should I do?
02:22:57.800 | And then I show exactly what to do there.
02:22:59.440 | And then that can be done in like 10 minutes.
02:23:02.520 | And then you can iterate much, much faster.
02:23:04.920 | Then you don't have to think as much upfront
02:23:08.280 | and stand at the blackboard and like,
02:23:10.520 | think exactly like, how are we going to do this?
02:23:12.360 | Because the cost is so high,
02:23:13.800 | but you can just try something first and you realize,
02:23:16.480 | oh, this is not actually exactly what I want.
02:23:18.400 | And then you can change it instantly again after.
02:23:20.800 | And so, yeah, I think being a programmer in the future
02:23:24.960 | is going to be a lot of fun.
02:23:26.560 | - Yeah, I really liked that point about,
02:23:29.840 | it feels like a lot of the time with programming,
02:23:31.280 | there are two ways you can go about it.
02:23:33.560 | One is like, you think really hard, carefully upfront
02:23:37.240 | about the best possible way to do it.
02:23:39.760 | And then you spend your limited time of engineering
02:23:42.160 | to actually implement it.
02:23:43.480 | But I much prefer just getting in the code
02:23:46.080 | and like, taking a crack at it,
02:23:47.720 | seeing how it kind of lays out,
02:23:49.920 | and then iterating really quickly on that.
02:23:52.680 | That feels more fun.
02:23:55.880 | - Yeah, like you're speaking to,
02:23:57.240 | generating the boilerplate is great.
02:23:59.320 | So you just focus on the difficult design,
02:24:01.840 | nuanced, difficult design decisions.
02:24:04.320 | Migration, I feel like this is a cool one.
02:24:07.960 | Like, it seems like large language models
02:24:09.520 | are able to basically translate
02:24:11.200 | from one programming language to another,
02:24:12.560 | or like translate, like migrate,
02:24:15.400 | in the general sense of what migrate is.
02:24:17.400 | But that's in the current moment.
02:24:20.720 | So I mean, the fear has to do with like,
02:24:22.640 | okay, as these models get better and better,
02:24:24.920 | then you're doing less and less creative decisions.
02:24:27.120 | And is it going to kind of move to a place
02:24:28.880 | where it's, you're operating in the design space
02:24:33.040 | of natural language,
02:24:34.000 | where natural language is the main programming language.
02:24:37.320 | And I guess I could ask that by way of advice.
02:24:39.280 | Like, if somebody is interested in programming now,
02:24:41.520 | what do you think they should learn?
02:24:43.240 | Like, do they, you guys started in some Java,
02:24:47.320 | and I forget the, oh, some PHP.
02:24:53.000 | - Objective C.
02:24:54.120 | - Objective C.
02:24:54.960 | There you go.
02:24:56.320 | I mean, in the end,
02:24:57.160 | we all know JavaScript is going to win.
02:24:58.960 | (laughs)
02:25:01.040 | And not TypeScript.
02:25:02.440 | It's just, it's going to be like vanilla JavaScript.
02:25:04.680 | It's just going to eat the world,
02:25:06.840 | and maybe a little bit of PHP.
02:25:08.360 | And I mean, it also brings up the question of like,
02:25:10.800 | I think Don Knuth has this idea
02:25:14.040 | that some percent of the population is geeks.
02:25:16.680 | And like, there's a particular kind of psychology in mind
02:25:20.280 | required for programming.
02:25:22.200 | And it feels like, more and more,
02:25:25.760 | the kind of person that can do
02:25:27.680 | great programming might expand.
02:25:30.920 | - I think different people do programming
02:25:34.920 | for different reasons.
02:25:36.600 | But I think the true, maybe like the best programmers
02:25:39.760 | are the ones that really love,
02:25:43.400 | just like absolutely love programming.
02:25:46.560 | For example, there are folks in our team
02:25:48.400 | who literally when they get back from work,
02:25:53.400 | they go and then they boot up Cursor,
02:25:58.440 | and then they start coding on their side projects
02:26:00.600 | for the entire night.
02:26:01.440 | And they stay up till 3 a.m. doing that.
02:26:03.360 | And when they're sad, they say,
02:26:06.680 | "I just really need to code."
02:26:09.600 | (laughs)
02:26:11.160 | And I think like, you know,
02:26:14.000 | there's that level of programmer
02:26:15.480 | where like this obsession and love of programming,
02:26:18.040 | I think makes really the best programmers.
02:26:22.920 | And I think these types of people
02:26:24.400 | will really get into the details of how things work.
02:26:29.400 | - I guess the question I'm asking,
02:26:30.720 | that exact problem, let's think about that person.
02:26:33.560 | When the super tab, the super awesome,
02:26:37.640 | praise be the tab, succeeds,
02:26:40.400 | and you keep pressing tab.
02:26:42.400 | - That person in the team loves Cursor tab
02:26:44.560 | more than anybody else.
02:26:45.800 | - Yeah, and it's also not just, like,
02:26:48.240 | pressing tab; "just pressing tab"
02:26:50.600 | is, like, the easy way to say it,
02:26:51.840 | the catchphrase, you know?
02:26:54.440 | But what you're actually doing when you're pressing tab
02:26:56.600 | is that you're injecting intent
02:26:59.880 | all the time while you're doing it.
02:27:02.360 | Sometimes you're rejecting it,
02:27:03.440 | sometimes you're typing a few more characters.
02:27:05.960 | And that's the way that you're sort of shaping
02:27:10.800 | the things that's being created.
02:27:12.200 | And I think programming will change a lot
02:27:14.920 | to just what is it that you want to make?
02:27:17.680 | - It's sort of higher bandwidth.
02:27:18.880 | The communication to the computer
02:27:20.200 | just becomes higher and higher bandwidth
02:27:21.760 | whereas just typing is much lower bandwidth
02:27:25.800 | than communicating intent.
02:27:27.760 | - I mean, this goes to your manifesto
02:27:31.280 | titled Engineering Genius.
02:27:33.840 | We are an applied research lab
02:27:35.840 | building extraordinarily productive human-AI systems.
02:27:38.480 | So speaking to this like hybrid element.
02:27:41.640 | To start, we're building the engineer of the future,
02:27:44.880 | a human AI programmer.
02:27:47.080 | That's an order of magnitude more effective
02:27:48.800 | than any one engineer.
02:27:50.720 | This hybrid engineer will have effortless control
02:27:53.240 | over their code base and no low entropy keystrokes.
02:27:56.920 | They will iterate at the speed of their judgment,
02:27:59.880 | even in the most complex systems.
02:28:02.160 | Using a combination of AI and human ingenuity,
02:28:05.280 | they will outsmart and out-engineer
02:28:07.440 | the best pure AI systems.
02:28:09.640 | We are a group of researchers and engineers.
02:28:12.080 | We build software and models to invent
02:28:14.480 | at the edge of what's useful and what's possible.
02:28:16.240 | Our work has already improved the lives
02:28:18.560 | of hundreds of thousands of programmers.
02:28:21.240 | And on the way to that,
02:28:22.880 | we'll at least make programming more fun.
02:28:24.880 | So thank you for talking today.
02:28:26.720 | - Thank you.
02:28:27.540 | - Thanks for having us.
02:28:28.380 | - Thank you. - Thank you.
02:28:29.960 | - Thanks for listening to this conversation
02:28:31.360 | with Michael, Sualeh, Arvid, and Aman.
02:28:34.640 | To support this podcast,
02:28:35.640 | please check out our sponsors in the description.
02:28:38.240 | And now let me leave you with a random funny
02:28:42.600 | and perhaps profound programming quote I saw on Reddit.
02:28:45.560 | Nothing is as permanent as a temporary solution that works.
02:28:51.120 | Thank you for listening and hope to see you next time.
02:28:55.340 | (upbeat music)
02:28:57.920 | (upbeat music)