
Cursor Team: Future of Programming with AI | Lex Fridman Podcast #447


Chapters

0:00 Introduction
0:59 Code editor basics
3:09 GitHub Copilot
10:27 Cursor
16:54 Cursor Tab
23:08 Code diff
31:20 ML details
36:54 GPT vs Claude
43:28 Prompt engineering
50:54 AI agents
1:04:51 Running code in background
1:09:31 Debugging
1:14:58 Dangerous code
1:26:09 Branching file systems
1:29:20 Scaling challenges
1:43:32 Context
1:48:39 OpenAI o1
2:00:01 Synthetic data
2:03:48 RLHF vs RLAIF
2:05:34 Fields Medal for AI
2:08:17 Scaling laws
2:17:06 The future of programming

Transcript

00:00:00.000 | The following is a conversation
00:00:01.760 | with the founding members of the Cursor team,
00:00:04.520 | Michael Truell, Sualeh Asif, Arvid Lundmark, and Aman Sanger.
00:00:09.520 | Cursor is a code editor based on VS Code
00:00:14.260 | that has a lot of powerful features for AI-assisted coding.
00:00:17.940 | It has captivated the attention and excitement
00:00:21.000 | of the programming and AI communities.
00:00:23.900 | So I thought this is an excellent opportunity
00:00:26.760 | to dive deep into the role of AI in programming.
00:00:30.420 | This is a super technical conversation
00:00:33.040 | that is bigger than just about one code editor.
00:00:36.720 | It's about the future of programming,
00:00:38.440 | and in general, the future of human-AI collaboration
00:00:42.080 | in designing and engineering
00:00:44.080 | complicated and powerful systems.
00:00:47.060 | This is the Lex Fridman Podcast.
00:00:48.720 | To support it, please check out our sponsors
00:00:50.920 | in the description.
00:00:52.160 | And now, dear friends, here's Michael, Sualeh, Arvid,
00:00:56.560 | and Aman.
00:00:58.040 | All right, this is awesome.
00:01:00.000 | We have Michael, Aman, Sualeh, Arvid here
00:01:03.120 | from the Cursor team.
00:01:04.880 | First up, big ridiculous question.
00:01:07.400 | What's the point of a code editor?
00:01:10.200 | - So the code editor is largely the place
00:01:12.400 | where you build software.
00:01:14.280 | And today, or for a long time, that's meant
00:01:17.320 | the place where you text edit a formal programming language.
00:01:21.080 | And for people who aren't programmers,
00:01:22.720 | the way to think of a code editor
00:01:23.640 | is like a really souped-up word processor for programmers,
00:01:27.280 | where the reason it's souped up
00:01:28.960 | is code has a lot of structure.
00:01:31.440 | And so the quote-unquote word processor, the code editor,
00:01:35.680 | can actually do a lot for you
00:01:37.280 | that word processors sort of in the writing space
00:01:39.680 | haven't been able to do for people editing text there.
00:01:42.200 | And so that's everything
00:01:43.840 | from giving you visual differentiation
00:01:45.600 | of the actual tokens in the code
00:01:47.320 | so you can scan it quickly,
00:01:49.120 | to letting you navigate around the code base,
00:01:51.000 | sort of like you're navigating around the internet
00:01:52.440 | with hyperlinks.
00:01:53.280 | Going to sort of definitions of things you're using,
00:01:55.680 | to error checking, to catch rudimentary bugs.
00:02:00.280 | And so traditionally, that's what a code editor has meant.
00:02:06.560 | And I think that what a code editor is
00:02:10.040 | is going to change a lot over the next 10 years
00:02:12.160 | as what it means to build software
00:02:14.120 | maybe starts to look a bit different.
00:02:16.800 | - I think also a code editor should just be fun.
00:02:19.640 | - Yes, that is very important.
00:02:21.400 | That is very important.
00:02:22.240 | And it's actually sort of an underrated aspect
00:02:25.040 | of how we decide what to build.
00:02:27.440 | Like a lot of the things that we build
00:02:30.280 | and then we try them out, we do an experiment
00:02:32.960 | and then we actually throw them out
00:02:35.320 | because they're not fun.
00:02:37.040 | And so a big part of being fun
00:02:38.480 | is like being fast a lot of the time.
00:02:41.600 | Fast is fun.
00:02:42.800 | - Yeah, fast is, yeah.
00:02:44.360 | Yeah, that should be a t-shirt.
00:02:47.080 | - Like fundamentally, I think one of the things
00:02:50.920 | that draws a lot of people to building stuff on computers
00:02:53.680 | is this like insane iteration speed
00:02:55.560 | where in other disciplines,
00:02:57.400 | you might be sort of gated by resources
00:02:59.800 | or the ability, even the ability
00:03:02.040 | to get a large group together
00:03:02.920 | and coding is this like amazing thing
00:03:04.400 | where it's you and the computer
00:03:05.480 | and that alone, you can build
00:03:08.160 | really cool stuff really quickly.
00:03:09.840 | - So for people who don't know,
00:03:10.920 | Cursor is this super cool new editor
00:03:14.760 | that's a fork of VS Code.
00:03:16.280 | It'd be interesting to get your kind of explanation
00:03:20.240 | of your own journey of editors.
00:03:22.600 | How did you, I think all of you
00:03:24.280 | were big fans of VS Code with Copilot.
00:03:28.200 | How did you arrive to VS Code
00:03:29.960 | and how did that lead to your journey with Cursor?
00:03:33.120 | - Yeah, so I think a lot of us,
00:03:37.640 | well, all of us were originally Vim users.
00:03:39.960 | - Pure Vim. - Pure Vim, yeah.
00:03:41.560 | No NeoVim, just pure Vim and a terminal.
00:03:44.160 | And at least for myself,
00:03:47.360 | it was around the time that Copilot came out,
00:03:50.240 | so 2021, that I really wanted to try it.
00:03:55.240 | So I went into VS Code, the only platform,
00:03:57.920 | the only code editor in which it was available.
00:04:00.080 | And even though I really enjoyed using Vim,
00:04:04.680 | just the experience of Copilot with VS Code
00:04:07.440 | was more than good enough to convince me to switch.
00:04:10.840 | And so that kind of was the default
00:04:12.280 | until we started working on Cursor.
00:04:14.680 | - And maybe we should explain what Copilot does.
00:04:17.400 | It's like a really nice auto-complete.
00:04:20.000 | It suggests, as you start writing a thing,
00:04:21.880 | it suggests one or two or three lines
00:04:24.280 | how to complete the thing.
00:04:26.000 | And there's a fun experience in that,
00:04:29.360 | you know, like when you have a close friendship
00:04:31.200 | and your friend completes your sentences?
00:04:34.040 | Like when it's done well, there's an intimate feeling.
00:04:37.320 | There's probably a better word than intimate,
00:04:38.680 | but there's a cool feeling of like,
00:04:40.760 | holy shit, it gets me.
00:04:44.400 | And then there's an unpleasant feeling
00:04:46.280 | when it doesn't get you.
00:04:48.280 | And so there's that kind of friction,
00:04:50.600 | but I would say for a lot of people,
00:04:52.240 | the feeling that it gets me overpowers that it doesn't.
00:04:55.160 | - And I think actually one of the underrated aspects
00:04:57.080 | of GitHub Copilot is that even when it's wrong,
00:04:59.320 | it's like a little bit annoying, but it's not that bad
00:05:01.680 | because you just type another character
00:05:04.200 | and then maybe then it gets you,
00:05:05.800 | or you type another character and then it gets you.
00:05:08.040 | So even when it's wrong, it's not that bad.
00:05:09.480 | - Yeah, you can sort of iterate and fix it.
00:05:11.840 | I mean, the other underrated part of Copilot for me
00:05:14.680 | sort of was just the first real AI product.
00:05:18.000 | So the first language model consumer product.
00:05:21.440 | - So Copilot was kind of like the first killer app for LLMs.
00:05:26.440 | - Yeah, and like the beta was out in 2021.
00:05:29.040 | - Right, okay.
00:05:30.280 | So what's the origin story of Cursor?
00:05:34.160 | - So around 2020, the scaling laws papers came out
00:05:37.280 | from OpenAI.
00:05:39.080 | And that was a moment where this looked like
00:05:42.000 | clear predictable progress for the field,
00:05:43.360 | where even if we didn't have any more ideas,
00:05:46.040 | it looks like you can make these models a lot better
00:05:47.400 | if you had more compute and more data.
00:05:49.720 | - By the way, we'll probably talk for three to four hours
00:05:53.520 | on the topic of scaling laws.
00:05:55.160 | - Yes.
00:05:56.000 | - But just to summarize, it's a paper and a set of papers
00:05:59.520 | and a set of ideas that say bigger might be better
00:06:02.120 | for model size and data size
00:06:04.080 | in the realm of machine learning.
00:06:05.720 | - It's bigger and better, but predictably better.
00:06:08.800 | Okay, there's another topic of conversation.
00:06:10.640 | - Yeah, so around that time, for some of us,
00:06:13.080 | there were like a lot of conceptual conversations
00:06:14.640 | about what's this gonna look like,
00:06:17.280 | what's the story gonna be
00:06:18.520 | for all these different knowledge worker fields
00:06:20.240 | about how they're gonna be made better
00:06:23.200 | by this technology getting better.
00:06:25.160 | And then I think there were a couple of moments
00:06:27.840 | where like the theoretical gains predicted in that paper
00:06:31.440 | started to feel really concrete
00:06:32.800 | and it started to feel like a moment
00:06:33.760 | where you could actually go and not do a PhD
00:06:37.040 | if you wanted to work on, do useful work in AI,
00:06:40.320 | actually felt like now there was this whole set of systems
00:06:42.640 | one could build that were really useful.
00:06:44.760 | And I think that the first moment
00:06:45.960 | we already talked about a little bit,
00:06:47.160 | which was playing with the early bit of Copilot,
00:06:48.840 | like that was awesome and magical.
00:06:50.540 | I think that the next big moment
00:06:53.120 | where everything kind of clicked together
00:06:55.320 | was actually getting early access to GPT-4.
00:06:57.600 | So it was sort of end of 2022
00:06:59.440 | was when we were tinkering with that model
00:07:02.400 | and the step-up in capabilities felt enormous.
00:07:05.640 | And previous to that,
00:07:06.960 | we had been working on a couple of different projects.
00:07:08.780 | We had been, because of Copilot, because of scaling laws,
00:07:13.040 | because of our prior interest in the technology,
00:07:15.060 | we had been tinkering around with tools for programmers,
00:07:18.880 | but things that are like very specific.
00:07:20.500 | So, we were building tools for financial professionals
00:07:24.560 | who have to work within a Jupyter Notebook
00:07:25.780 | or like playing around with,
00:07:27.120 | can you do static analysis with these models?
00:07:29.260 | And then the step-up in GPT-4 felt like,
00:07:31.240 | look, that really made concrete the theoretical gains
00:07:35.160 | that we had predicted before.
00:07:37.300 | Felt like you could build a lot more
00:07:39.120 | just immediately at that point in time.
00:07:40.920 | And also, if we were being consistent,
00:07:44.960 | it really felt like
00:07:46.520 | this wasn't just gonna be a point solution thing.
00:07:48.120 | This was gonna be all of programming
00:07:49.500 | was gonna flow through these models.
00:07:50.880 | It felt like that demanded
00:07:52.200 | a different type of programming environment,
00:07:54.000 | a different type of programming.
00:07:55.680 | And so we set off to build that,
00:07:57.440 | that sort of larger vision around that.
00:07:59.920 | - There's one that I distinctly remember.
00:08:01.800 | So my roommate is an IMO Gold winner
00:08:05.140 | and there's a competition in the U.S. called the Putnam,
00:08:07.920 | which is sort of the IMO for college people.
00:08:10.040 | And it's this math competition.
00:08:12.280 | It's exceptionally good.
00:08:14.160 | So Sheng Tong and Aman,
00:08:16.600 | I remember it's sort of June of 2022,
00:08:21.600 | had this bet on whether, by
00:08:24.360 | like June or July of 2024,
00:08:27.140 | models were going to win a gold medal
00:08:28.800 | in the IMO.
00:08:31.240 | - IMO is International Math Olympiad.
00:08:33.520 | - Yeah, IMO is International Math Olympiad.
00:08:35.600 | And so Arvid and I are both there,
00:08:37.660 | you know, also competed in it.
00:08:38.820 | So it was sort of personal.
00:08:41.580 | And I remember thinking,
00:08:44.780 | man, this is just, this is not gonna happen.
00:08:47.180 | This was like,
00:08:48.020 | it was like, even though I sort of believed in progress,
00:08:51.660 | I thought, you know, IMO Gold just,
00:08:54.260 | like Aman is just delusional.
00:08:55.780 | - Yeah.
00:08:56.620 | - That was the, and to be honest,
00:08:58.260 | I mean, I was, to be clear, very wrong,
00:09:01.100 | but that was maybe the most prescient bet in the group.
00:09:05.360 | - So the new results from DeepMind,
00:09:08.160 | it turned out that you were correct.
00:09:10.160 | That's what the-
00:09:11.000 | - Well, it was technically not.
00:09:12.680 | - Technically incorrect, but one point away.
00:09:15.040 | Aman was very enthusiastic about this stuff.
00:09:16.960 | - Yeah.
00:09:17.800 | - And before, Aman had this like scaling laws T-shirt
00:09:21.160 | that he would walk around with,
00:09:22.360 | where it had the like charts and like the formulas on it.
00:09:25.240 | - So you like felt the AGI or you felt the scaling laws?
00:09:28.640 | - Yeah, I distinctly remember
00:09:30.100 | there was this one conversation I had with Michael,
00:09:33.600 | where before I hadn't thought super deeply
00:09:36.220 | and critically about scaling laws.
00:09:38.520 | And he kind of posed the question,
00:09:40.560 | why isn't scaling all you need,
00:09:42.640 | or why isn't scaling gonna result
00:09:44.200 | in massive gains in progress?
00:09:46.360 | And I think I went through like the stages of grief.
00:09:49.440 | There is anger, denial, and then finally at the end,
00:09:51.880 | just thinking about it, acceptance.
00:09:55.900 | And I think I've been quite hopeful
00:10:00.020 | and optimistic about progress since.
00:10:03.220 | I think one thing I'll caveat is,
00:10:05.700 | I think it also depends on like which domains
00:10:07.340 | you're gonna see progress.
00:10:08.160 | Like math is a great domain,
00:10:09.840 | because especially like formal theorem proving,
00:10:12.740 | because you get this fantastic signal
00:10:15.340 | of actually verifying if the thing was correct.
00:10:18.180 | And so this means something like RL
00:10:19.660 | can work really, really well.
00:10:21.220 | And I think like you could have systems
00:10:22.820 | that are perhaps very superhuman at math
00:10:25.380 | and still not technically have AGI.
00:10:27.680 | - Okay, so can we take it all the way to Cursor?
00:10:30.840 | And what is Cursor?
00:10:32.280 | It's a fork of VS Code.
00:10:34.560 | And VS Code is one of the most popular editors
00:10:37.560 | for a long time.
00:10:38.400 | Like everybody fell in love with it.
00:10:39.560 | Everybody loved Vim.
00:10:41.240 | I left Emacs for it.
00:10:43.000 | Sorry.
00:10:43.840 | So it unified, in some fundamental way,
00:10:49.960 | the developer community.
00:10:52.960 | And then you look at the space of things,
00:10:54.840 | you look at the scaling laws,
00:10:55.960 | AI is becoming amazing.
00:10:58.280 | And you decided, okay,
00:10:59.880 | it's not enough to just write an extension for your VS Code,
00:11:02.880 | because there's a lot of limitations to that.
00:11:06.200 | Where we need,
00:11:07.340 | if AI is gonna keep getting better, better, better,
00:11:09.240 | we need to really like rethink
00:11:11.080 | how the AI is gonna be part of the editing process.
00:11:14.260 | And so you decided to fork VS Code
00:11:16.700 | and start to build a lot of the amazing features
00:11:19.460 | we'll be able to talk about.
00:11:22.160 | But what was that decision like?
00:11:23.300 | Because there's a lot of extensions,
00:11:25.920 | including Copilot of VS Code
00:11:28.600 | that are doing sort of AI type stuff.
00:11:30.320 | What was the decision like to just fork VS Code?
00:11:33.300 | - So the decision to do an editor
00:11:35.320 | seemed kind of self-evident to us
00:11:37.880 | for at least what we wanted to do and achieve.
00:11:40.440 | Because when we started working on the editor,
00:11:42.380 | the idea was these models are gonna get much better,
00:11:44.400 | their capabilities are gonna improve,
00:11:45.520 | and it's gonna entirely change how you build software.
00:11:47.680 | Both in a, you will have big productivity gains,
00:11:49.960 | but also radical in how like the act of building software
00:11:52.200 | is going to change a lot.
00:11:53.860 | And so you're very limited
00:11:55.740 | in the control you have over a code editor,
00:11:58.180 | if you're a plugin to an existing coding environment.
00:12:01.580 | And we didn't wanna get locked in by those limitations.
00:12:04.940 | We wanted to be able to just build the most useful stuff.
00:12:08.100 | - Okay, well then the natural question is,
00:12:10.340 | you know, VS Code with Copilot is kind of a competitor.
00:12:15.460 | So how do you win?
00:12:17.320 | Is it basically just the speed
00:12:18.780 | and the quality of the features?
00:12:20.200 | - Yeah, I mean, I think this is a space
00:12:23.000 | that is quite interesting, perhaps quite unique,
00:12:26.280 | where if you look at previous tech waves,
00:12:29.760 | maybe there's kind of one major thing that happened
00:12:31.780 | and it unlocked a new wave of companies.
00:12:34.200 | But every single year, every single model capability
00:12:37.720 | or jump you get in model capabilities,
00:12:39.920 | you now unlock this new wave of features,
00:12:43.560 | things that are possible, especially in programming.
00:12:46.880 | And so I think in AI programming,
00:12:49.760 | being even just a few months ahead,
00:12:51.700 | let alone a year ahead,
00:12:53.380 | makes your product much, much, much more useful.
00:12:55.780 | I think the Cursor a year from now
00:12:57.780 | will need to make the Cursor of today look obsolete.
00:13:00.880 | And I think, you know, Microsoft has done a number
00:13:04.980 | of like fantastic things,
00:13:06.480 | but I don't think they're in a great place
00:13:08.340 | to really keep innovating and pushing on this
00:13:10.980 | in the way that a startup can.
00:13:13.120 | - Just rapidly implementing features.
00:13:15.820 | - And push, yeah, like, and kind of doing
00:13:18.980 | the research experimentation necessary
00:13:21.260 | to really push the ceiling.
00:13:24.060 | - I don't know if I think of it in terms of features
00:13:26.060 | as I think of it in terms of like capabilities
00:13:28.300 | for programmers.
00:13:29.660 | It's that like, you know, as, you know,
00:13:33.020 | the new o1 model came out
00:13:34.900 | and I'm sure there are going to be more models
00:13:37.280 | of different types, like longer context and maybe faster.
00:13:40.700 | Like there's all these crazy ideas that you can try
00:13:44.700 | and hopefully 10% of the crazy ideas
00:13:47.740 | will make it into something kind of cool and useful.
00:13:50.700 | And we want people to have that sooner.
00:13:55.700 | To rephrase, it's like an underrated fact
00:13:57.580 | is we're making it for ourselves.
00:13:59.300 | When we started Cursor,
00:14:00.820 | you really felt this frustration that, you know, models,
00:14:04.160 | you could see models getting better,
00:14:06.700 | but the Copilot experience had not changed.
00:14:08.780 | It was like, man, these guys,
00:14:11.740 | like the ceiling is getting higher.
00:14:13.000 | Like, why are they not making new things?
00:14:14.700 | Like they should be making new things.
00:14:16.100 | They should be like,
00:14:16.940 | like where's all the alpha features?
00:14:19.340 | There were no alpha features.
00:14:21.060 | It was like, I'm sure it was selling well.
00:14:24.700 | I'm sure it was a great business,
00:14:26.140 | but it didn't feel,
00:14:27.660 | I'm one of these people that really want to try
00:14:30.740 | and use new things.
00:14:31.780 | And it was just, there's no new thing
00:14:33.660 | for like a very long while.
00:14:35.380 | - Yeah, it's interesting.
00:14:37.300 | I don't know how you put that into words,
00:14:38.740 | but when you compare Cursor with Copilot,
00:14:41.460 | Copilot pretty quickly became,
00:14:43.640 | started to feel stale for some reason.
00:14:45.760 | - Yeah, I think one thing that I think helps us
00:14:49.560 | is that we're sort of doing it all in one
00:14:52.760 | where we're developing the UX
00:14:55.400 | and the way you interact with the model.
00:14:57.480 | At the same time as we're developing,
00:14:59.680 | like how we actually make the model give better answers.
00:15:02.440 | So we're like, how you build up the prompter
00:15:05.400 | or like, how do you find the context?
00:15:06.960 | And for a Cursor tab, like how do you train the model?
00:15:10.320 | So I think that helps us to have all of it,
00:15:12.520 | like sort of like the same people working
00:15:15.020 | on the entire experience end-to-end.
00:15:17.380 | - Yeah, it's like the person making the UI
00:15:19.380 | and the person training the model,
00:15:20.660 | like sit to like 18 feet away.
00:15:23.620 | - Often the same person even.
00:15:25.740 | - Yeah, often even the same person.
00:15:27.340 | So you can create things that are sort of not possible
00:15:30.980 | if you're not talking, you're not experimenting.
00:15:34.340 | - And you're using, like you said, Cursor to write Cursor.
00:15:37.180 | - Of course, oh yeah.
00:15:38.780 | - Well, let's talk about some of these features.
00:15:40.760 | Let's talk about the all-knowing,
00:15:43.120 | the all-powerful, praise be to the tab.
00:15:46.000 | You know, auto-complete on steroids, basically.
00:15:50.760 | So how does tab work?
00:15:51.980 | What is tab?
00:15:53.180 | - To highlight and summarize at a high level,
00:15:54.800 | I'd say that there are two things
00:15:57.000 | that Cursor is pretty good at right now.
00:15:58.920 | There are other things that it does,
00:16:01.400 | but two things that it helps programmers with.
00:16:04.800 | One is this idea of looking over your shoulder
00:16:08.320 | and being like a really fast colleague
00:16:10.440 | who can kind of jump ahead of you and type
00:16:12.720 | and figure out what you're gonna do next.
00:16:15.080 | And that was the original idea behind,
00:16:18.720 | that was kind of the kernel of the idea
00:16:20.000 | behind a good auto-complete
00:16:21.280 | was predicting what you're gonna do next.
00:16:23.240 | But you can make that concept even more ambitious
00:16:26.120 | by not just predicting the characters after your Cursor,
00:16:29.640 | but actually predicting the next entire change
00:16:31.160 | you're gonna make, the next diff,
00:16:32.120 | next place you're gonna jump to.
00:16:35.200 | And the second thing Cursor is pretty good at right now too
00:16:40.200 | is helping you sometimes jump ahead of the AI
00:16:42.680 | and tell it what to do and go from instructions to code.
00:16:47.120 | And on both of those, we've done a lot of work
00:16:48.560 | on making the editing experience for those things ergonomic
00:16:51.240 | and also making those things smart and fast.
00:16:54.520 | - One of the things we really wanted
00:16:56.240 | was we wanted the model to be able to edit code for us.
00:16:59.060 | That was kind of a wish.
00:17:00.200 | And we had multiple attempts at it
00:17:02.640 | before we had a sort of a good model
00:17:04.920 | that could edit code for you.
00:17:06.360 | Then after we had a good model,
00:17:09.760 | I think there's been a lot of effort
00:17:11.680 | to make the inference fast for having a good experience.
00:17:16.680 | And we've been starting to incorporate,
00:17:22.560 | I mean, Michael sort of mentioned this like ability
00:17:24.480 | to jump to different places.
00:17:26.280 | And that jump to different places,
00:17:27.720 | I think came from a feeling of,
00:17:30.400 | once you accept an edit,
00:17:32.340 | it's like, man, it should be just really obvious
00:17:36.720 | where to go next.
00:17:37.800 | It's like, I'd made this change,
00:17:39.960 | the model should just know that like the next place to go to
00:17:42.600 | is like 18 lines down.
00:17:44.680 | Like if you're a Vim user,
00:17:46.400 | you could press 18JJ or whatever.
00:17:48.920 | But like, why am I doing this?
00:17:52.080 | Like the model should just know it.
00:17:54.040 | And then so the idea was you just pressed tab,
00:17:56.800 | it would go 18 lines down
00:17:58.080 | and then show you the next edit and you would press tab.
00:18:01.720 | So it was just you, as long as you could keep pressing tab.
00:18:04.680 | And so the internal competition was
00:18:06.280 | how many tabs can we make someone press?
00:18:08.480 | Once you have like the idea,
00:18:10.560 | more sort of abstractly the thing to think about
00:18:14.920 | is sort of like, how are the edits sort of zero entropy?
00:18:18.960 | So once you've sort of expressed your intent
00:18:20.920 | and the edit is,
00:18:22.440 | there's no like new bits of information
00:18:25.120 | to finish your thought,
00:18:27.440 | but you still have to type some characters
00:18:29.640 | to like make the computer understand
00:18:31.360 | what you're actually thinking.
00:18:33.080 | Then maybe the model should just sort of read your mind
00:18:35.800 | and all the zero entropy bits should just be like
00:18:39.240 | tabbed away.
00:18:40.600 | - Yeah.
00:18:41.440 | - That was sort of the abstract.
00:18:42.600 | - There's this interesting thing where
00:18:43.800 | if you look at language model loss on different domains,
00:18:46.960 | I believe the bits per byte,
00:18:49.360 | which is kind of character normalized loss for code
00:18:53.400 | is lower than language, which means in general,
00:18:56.040 | there are a lot of tokens in code
00:18:57.200 | that are super predictable.
00:18:58.840 | A lot of characters that are super predictable.
00:19:00.960 | And this is, I think, even magnified
00:19:03.040 | when you're not just trying to auto-complete code,
00:19:05.560 | but predicting what the user is going to do next
00:19:08.560 | in their editing of existing code.
00:19:10.880 | And so, you know, the goal of cursor tabs,
00:19:12.520 | let's eliminate all the low entropy actions
00:19:15.320 | you take inside of the editor.
00:19:16.760 | When the intent is effectively determined,
00:19:19.680 | let's just jump you forward in time, skip you forward.
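
As a brief aside for readers: "bits per byte" is just token-level cross-entropy converted from nats to bits and normalized by how many bytes each token covers, which takes the tokenizer out of the comparison. Below is a minimal sketch of that conversion; the numbers in the example are made up purely to illustrate the direction of the comparison described above.

```python
import math

def bits_per_byte(mean_loss_nats_per_token: float, avg_bytes_per_token: float) -> float:
    """Convert token-level cross-entropy (in nats) to bits per byte.

    Dividing by ln(2) converts nats to bits; dividing by the average number of
    UTF-8 bytes per token normalizes away the tokenizer, which is what makes
    losses on code and on natural language comparable.
    """
    return mean_loss_nats_per_token / (math.log(2) * avg_bytes_per_token)

# Made-up numbers purely to illustrate the comparison: code tends to be more
# predictable per byte than prose.
print(f"prose-ish: {bits_per_byte(1.7, 3.2):.3f} bits/byte")
print(f"code-ish:  {bits_per_byte(1.1, 3.8):.3f} bits/byte")
```
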
00:19:22.400 | - Well, what's the intuition
00:19:23.960 | and what's the technical details
00:19:25.080 | of how to do next cursor prediction?
00:19:27.520 | That jump, that's not so intuitive, I think, to people.
00:19:31.440 | - Yeah.
00:19:32.520 | I think I can speak to a few of the details
00:19:35.520 | on how to make these things work.
00:19:37.280 | They're incredibly low latency.
00:19:38.480 | So you need to train small models on this task.
00:19:43.160 | In particular, they're incredibly pre-fill token hungry.
00:19:48.160 | What that means is they have these really,
00:19:49.840 | really long prompts where they see a lot of your code
00:19:52.600 | and they're not actually generating that many tokens.
00:19:54.880 | And so the perfect fit for that is using a sparse model,
00:19:58.680 | meaning an MOE model.
00:19:59.840 | So that was kind of one breakthrough we made
00:20:03.360 | that substantially improved its performance
00:20:05.000 | at longer context.
00:20:06.280 | The other being a variant of speculative decoding
00:20:10.000 | that we kind of built out called speculative edits.
00:20:13.280 | These are two, I think, important pieces
00:20:15.320 | of what make it quite high quality and very fast.
00:20:20.360 | - Okay, so MoE, mixture of experts.
00:20:22.800 | The input is huge, the output is small.
00:20:24.960 | - Yeah.
00:20:25.800 | - Okay, so what else can you say about how to make,
00:20:28.560 | does caching play a role in this particular--
00:20:31.200 | - Caching plays a huge role.
00:20:32.840 | Because you're dealing with this many input tokens,
00:20:36.480 | if every single keystroke that you're typing
00:20:39.080 | in a given line, you had to rerun the model
00:20:41.720 | on all of those tokens passed in,
00:20:44.120 | you're just going to, one, significantly degrade latency,
00:20:47.400 | two, you're gonna kill your GPUs with load.
00:20:49.880 | So you need to design the actual prompts you use
00:20:53.800 | for the model such that they're caching aware.
00:20:57.040 | And then, yeah, you need to reuse the KV cache
00:20:59.840 | across requests just so that you're spending less work,
00:21:03.000 | less compute.
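
To make the caching point concrete, here is a minimal sketch, not Cursor's actual prompt format, of what a "caching aware" prompt layout can mean: the large, slowly-changing context sits in a stable prefix that a prefix-caching inference server can reuse across keystrokes, and the rapidly-changing cursor-local text goes last so only a few tokens need fresh prefill work per request.

```python
# Sketch only: a prompt split into a stable prefix (reusable KV cache) and a
# volatile suffix (recomputed every keystroke). Section headers are made up.

def build_prompt(file_context: str, cursor_window: str) -> str:
    stable_prefix = (      # identical across many consecutive keystrokes
        "### File context\n"
        f"{file_context}\n"
    )
    volatile_suffix = (    # changes on every keystroke
        "### Around the cursor\n"
        f"{cursor_window}\n"
        "### Predict the next edit:\n"
    )
    return stable_prefix + volatile_suffix

# Two consecutive keystrokes: the prefix (and hence its KV cache) is unchanged.
ctx = "def greet(name):\n    return f'hello {name}'\n"
print(build_prompt(ctx, "def gre"))
print(build_prompt(ctx, "def gree"))
```
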
00:21:04.440 | - Again, what are the things that TAB is supposed
00:21:07.320 | to be able to do kind of in the near term,
00:21:11.040 | just to like sort of linger on that?
00:21:13.520 | Generate code, like fill empty space,
00:21:18.440 | also edit code across multiple lines,
00:21:21.600 | and then jump to different locations inside the same file?
00:21:24.320 | - Yeah. - And then like--
00:21:25.160 | - Hopefully jump to different files also.
00:21:26.920 | So if you make an edit in one file,
00:21:28.680 | and maybe you have to go to another file
00:21:32.480 | to finish your thought,
00:21:33.400 | it should go to the second file also, yeah.
00:21:36.200 | - And then the full generalization
00:21:38.000 | is like next action prediction.
00:21:40.680 | Like sometimes you need to run a command in the terminal,
00:21:44.000 | and it should be able to suggest the command
00:21:46.880 | based on the code that you wrote too.
00:21:48.920 | Or sometimes you actually need to,
00:21:53.200 | like it suggests something,
00:21:54.080 | but it's hard for you to know if it's correct,
00:21:57.120 | because you actually need some more information to learn.
00:21:59.680 | Like you need to know the type
00:22:01.200 | to be able to verify that it's correct.
00:22:02.720 | And so maybe it should actually take you to a place
00:22:05.560 | that's like the definition of something,
00:22:07.520 | and then take you back
00:22:09.160 | so that you have all the requisite knowledge
00:22:11.160 | to be able to accept the next completion.
00:22:13.200 | - So providing the human the knowledge.
00:22:15.640 | - Yes.
00:22:17.200 | - Right.
00:22:18.240 | Can you integrate, like,
00:22:19.400 | I just got to know a guy named Primeagen,
00:22:22.600 | who I believe has a thing where
00:22:24.920 | you can order coffee via SSH.
00:22:27.760 | - Oh yeah.
00:22:29.520 | - Oh, we did that.
00:22:30.360 | - We did that.
00:22:31.200 | - So can that also the model do that?
00:22:32.920 | Like feed you and provide you with caffeine?
00:22:37.360 | Okay, so that's the general framework.
00:22:39.280 | - Yeah, yeah.
00:22:40.120 | And the magic moment would be if it is,
00:22:44.680 | programming is this weird discipline
00:22:46.120 | where sometimes the next five minutes,
00:22:50.000 | not always, but sometimes the next five minutes,
00:22:51.680 | what you're gonna do is actually predictable
00:22:52.920 | from the stuff you've done recently.
00:22:54.360 | And so can you get to a world
00:22:55.480 | where that next five minutes either happens
00:22:57.160 | by you disengaging and it taking you through,
00:22:59.440 | or maybe a little bit more of just you seeing next step,
00:23:02.920 | what it's gonna do, and you're like,
00:23:03.760 | okay, that's good, that's good, that's good, that's good.
00:23:05.320 | And you can just sort of tap, tap, tap
00:23:07.120 | through these big changes.
00:23:09.000 | - As we're talking about this,
00:23:10.120 | I should mention that one of the really cool
00:23:12.680 | and noticeable things about Cursor is that
00:23:14.880 | there's this whole diff interface situation going on.
00:23:17.760 | So like the model suggests with the red and the green
00:23:22.560 | of like, here's how we're gonna modify the code.
00:23:24.440 | And in the chat window, you can apply
00:23:27.280 | and it shows you the diff and you can accept the diff.
00:23:29.880 | So maybe can you speak to whatever direction of that?
00:23:32.680 | - We'll probably have like four
00:23:34.040 | or five different kinds of diffs.
00:23:37.480 | So we have optimized the diff for the autocomplete.
00:23:40.880 | So that has a different diff interface
00:23:42.800 | than when you're reviewing larger blocks of code.
00:23:47.680 | And then we're trying to optimize another diff thing
00:23:50.720 | for when you're doing multiple different files
00:23:53.240 | and sort of at a high level, the difference is
00:23:57.480 | for when you're doing autocomplete,
00:24:00.480 | it should be really, really fast to read.
00:24:02.520 | Actually, it should be really fast to read in all situations
00:24:06.680 | but in autocomplete, it's sort of,
00:24:08.560 | you're really like your eyes focused in one area.
00:24:11.640 | You can't be in too many,
00:24:13.560 | the humans can't look in too many different places.
00:24:15.400 | - So you're talking about on the interface side?
00:24:17.200 | - On the interface side.
00:24:18.040 | So it currently has this box on the side.
00:24:20.440 | So we have the current box.
00:24:22.200 | And if it tries to delete code in some place
00:24:25.360 | and tries to add other code,
00:24:27.120 | it tries to show you a box on the side.
00:24:28.680 | - You can maybe show it if we pull it up on cursor.com.
00:24:31.600 | This is what we're talking about.
00:24:33.480 | - So that box, it was like three or four different attempts
00:24:38.400 | at trying to make this thing work.
00:24:40.760 | Where first attempt was like this blue crossed out line.
00:24:45.600 | So before it was a box on the side.
00:24:48.080 | It used to show you the code to delete
00:24:50.400 | by showing you like Google Docs style,
00:24:53.320 | you would see like a line through it.
00:24:55.240 | Then you would see the new code.
00:24:57.840 | And that was super distracting.
00:24:59.960 | And then we tried many different,
00:25:02.240 | there was sort of deletions,
00:25:03.800 | there was trying to read highlight.
00:25:06.320 | Then the next iteration of it, which is sort of funny,
00:25:09.040 | you would hold the Option button on Mac.
00:25:14.040 | So it would sort of highlight a region of code
00:25:17.080 | to show you that there might be something coming.
00:25:19.600 | So maybe in this example,
00:25:21.720 | like the input and the value would all get blue.
00:25:26.120 | And the blue would to highlight
00:25:28.360 | that the AI had a suggestion for you.
00:25:30.200 | So instead of directly showing you the thing,
00:25:32.960 | it would show you that the AI,
00:25:34.600 | it would just hint that the AI had a suggestion.
00:25:36.440 | And if you really wanted to see it,
00:25:38.160 | you would hold the option button,
00:25:40.360 | and then you would see the new suggestion.
00:25:42.520 | Then if you release the option button,
00:25:45.120 | you would then see your original code.
00:25:47.560 | - So that's, by the way, that's pretty nice,
00:25:49.480 | but you have to know to hold the option button.
00:25:51.120 | - Yeah.
00:25:51.960 | - So by the way, I'm not a Mac user, but I got it.
00:25:54.600 | (laughs)
00:25:55.440 | - It was-
00:25:56.280 | - It's a button I guess, you people have.
00:25:59.080 | - It's again, it's just non-intuitive.
00:26:01.840 | I think that's the key thing.
00:26:03.680 | - And there's a chance this is also
00:26:05.280 | not the final version of it.
00:26:06.920 | - I am personally very excited
00:26:08.200 | for making a lot of improvements in this area.
00:26:13.200 | Like we often talk about it as the verification problem,
00:26:17.920 | where these diffs are great for small edits.
00:26:21.680 | For large edits, or like when it's multiple files
00:26:24.920 | or something, it's actually a little bit prohibitive
00:26:29.920 | to review these diffs.
00:26:32.520 | And so there are like a couple of different ideas here.
00:26:36.400 | Like one idea that we have is, okay, you know,
00:26:38.800 | like parts of the diffs are important.
00:26:41.040 | They have a lot of information.
00:26:42.400 | And then parts of the diff are just very low entropy.
00:26:46.480 | They're like the same thing over and over again.
00:26:49.520 | And so maybe you can highlight the important pieces
00:26:52.640 | and then gray out the not so important pieces.
00:26:55.200 | Or maybe you can have a model that looks at the diff
00:26:58.520 | and sees, oh, there's a likely bug here.
00:27:00.960 | I will like mark this with a little red squiggly
00:27:03.920 | and say like, you should probably like review
00:27:05.760 | this part of the diff.
00:27:07.680 | And ideas in that vein, I think are exciting.
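
As a toy illustration of that triage idea (not a real product feature), the sketch below scores each added line of a diff with a crude surprisal proxy and tags the low-information lines for skimming; a real system would presumably use per-token log-probabilities from a language model rather than simple token frequencies, and the median cutoff here is arbitrary.

```python
import math
from collections import Counter

def surprisal(line: str, background: Counter, total: int) -> float:
    """Crude proxy: average negative log-frequency of the line's tokens."""
    toks = line.split() or [""]
    return sum(-math.log((background[t] + 1) / (total + 1)) for t in toks) / len(toks)

def triage_diff(added_lines: list[str]) -> list[tuple[str, str]]:
    background = Counter(t for l in added_lines for t in l.split())
    total = sum(background.values())
    scores = [(l, surprisal(l, background, total)) for l in added_lines]
    cutoff = sorted(s for _, s in scores)[len(scores) // 2]  # median split
    return [("REVIEW" if s > cutoff else "skim", l) for l, s in scores]

diff = [
    "import logging",
    "import os",
    "retries = MAX_RETRIES - attempt_count",
    "import sys",
    "if retries < 0: raise RuntimeError('out of retries')",
]
for tag, line in triage_diff(diff):
    print(f"[{tag}] {line}")
```
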
00:27:11.400 | - Yeah, that's a really fascinating space
00:27:13.720 | of like UX design engineering.
00:27:15.960 | So you're basically trying to guide the human programmer
00:27:20.840 | through all the things they need to read and nothing more.
00:27:23.800 | - Yeah. - Like optimally.
00:27:25.280 | - Yeah, and you want an intelligent model to do it.
00:27:27.880 | Like currently diff algorithms are, they're like,
00:27:31.600 | they're just like normal algorithms.
00:27:36.080 | There is no intelligence.
00:27:38.000 | There's like intelligence that went into designing
00:27:39.800 | the algorithm, but then there's no, like,
00:27:42.320 | you don't care if it's about this thing or this thing,
00:27:45.080 | as you want a model to do this.
00:27:47.280 | - So I think the general question is like,
00:27:50.320 | Matt, these models are going to get much smarter.
00:27:53.640 | As the models get much smarter,
00:27:55.920 | the changes they will be able to propose are much bigger.
00:27:58.880 | So as the changes gets bigger and bigger and bigger,
00:28:02.040 | the humans have to do more and more
00:28:03.560 | and more verification work.
00:28:04.960 | It gets more and more and more hard.
00:28:06.640 | Like it's just, you need to help them out.
00:28:08.880 | It's sort of, I don't want to spend all my time
00:28:11.720 | reviewing code.
00:28:12.560 | - Can you say a little more about diffs across multiple files?
00:28:20.040 | - Yeah, I mean, so GitHub tries to solve this, right?
00:28:23.560 | With code review.
00:28:25.120 | When you're doing code review,
00:28:26.080 | you're reviewing multiple diffs across multiple files.
00:28:29.640 | But like Arvid said earlier,
00:28:32.080 | I think you can do much better than code review.
00:28:34.960 | You know, code review kind of sucks.
00:28:36.960 | Like you spend a lot of time trying to grok this code
00:28:39.680 | that's often quite unfamiliar to you.
00:28:42.200 | And it often like doesn't even actually catch that many bugs.
00:28:47.200 | And I think you can significantly improve
00:28:50.240 | that review experience using language models,
00:28:52.120 | for example, using the kinds of tricks
00:28:54.000 | that Arvid had described of maybe pointing you
00:28:56.720 | towards the regions that actually matter.
00:28:58.920 | I think also, if the code is produced
00:29:03.960 | by these language models,
00:29:05.440 | and it's not produced by someone else,
00:29:07.560 | like the code review experience is designed
00:29:12.160 | for both the reviewer and the person that produced the code.
00:29:16.240 | In the case where the person that produced the code
00:29:18.360 | is a language model,
00:29:20.040 | you don't have to care that much about their experience.
00:29:22.160 | And you can design the entire thing around the reviewer
00:29:24.920 | such that the reviewer's job is as fun,
00:29:29.000 | as easy, as productive as possible.
00:29:31.000 | And I think that feels like the issue
00:29:34.360 | with just kind of naively trying to make
00:29:36.960 | these things look like code review.
00:29:39.440 | I think you can be a lot more creative
00:29:41.040 | and push the boundary on what's possible.
00:29:43.120 | - Just one idea there is I think ordering matters.
00:29:46.680 | Generally, when you review a PR,
00:29:48.320 | you have this list of files
00:29:50.120 | and you're reviewing them from top to bottom,
00:29:52.320 | but actually you actually want to understand this part first
00:29:55.840 | because that came logically first.
00:29:57.560 | And then you want to understand the next part.
00:29:59.040 | And you don't want to have to figure out that yourself.
00:30:02.680 | You want a model to guide you through the thing.
00:30:05.360 | - And is the step of creation going to be more
00:30:08.000 | and more natural language is the goal
00:30:10.480 | versus with actual writing?
00:30:12.120 | - I think sometimes.
00:30:14.120 | I don't think it's going to be the case
00:30:15.560 | that all of programming will be natural language.
00:30:18.360 | And the reason for that is if I'm pair programming
00:30:21.960 | with Sualeh and Sualeh is at the computer and the keyboard,
00:30:24.440 | and sometimes if I'm driving, I want to say to Sualeh,
00:30:29.440 | "Hey, implement this function."
00:30:31.640 | And that works.
00:30:33.040 | And then sometimes it's just so annoying
00:30:35.800 | to explain to Sualeh what I want him to do.
00:30:37.800 | And so I actually take over the keyboard
00:30:40.360 | and I show him, I write part of the example,
00:30:43.520 | and then it makes sense.
00:30:45.480 | And that's the easiest way to communicate.
00:30:47.400 | And so I think that's also the case for AI.
00:30:49.600 | Sometimes the easiest way to communicate with AI
00:30:51.680 | will be to show an example,
00:30:52.680 | and then it goes and does the thing everywhere else.
00:30:54.920 | Or sometimes if you're making a website, for example,
00:30:57.760 | the easiest way to show to the AI what you want
00:31:00.920 | is not to tell it what to do,
00:31:02.360 | but drag things around or draw things.
00:31:05.000 | And yeah, and maybe eventually we will get
00:31:09.160 | to brain machine interfaces or whatever,
00:31:11.120 | and it can understand what you're thinking.
00:31:12.760 | And so I think natural language will have a place.
00:31:14.720 | I think it will definitely not be the way
00:31:17.760 | most people program most of the time.
00:31:20.560 | - I'm really feeling the AGI with this editor.
00:31:23.000 | (laughing)
00:31:24.040 | It feels like there's a lot of machine learning
00:31:25.680 | going on underneath.
00:31:27.760 | Tell me about some of the ML stuff that makes it all work.
00:31:31.200 | - Well, Cursor really works via this ensemble
00:31:34.600 | of custom models that we've trained alongside
00:31:37.640 | the frontier models that are fantastic
00:31:39.160 | at the reasoning intense things.
00:31:40.800 | And so Cursor tab, for example, is a great example
00:31:43.840 | of where you can specialize this model to be even better
00:31:46.480 | than even frontier models.
00:31:47.520 | If you look at evals on the task we set it at.
00:31:50.360 | The other domain, which it's kind of surprising
00:31:53.080 | that it requires custom models,
00:31:54.200 | but it's kind of necessary and works quite well is in apply.
00:31:58.080 | So I think these models are like the frontier models
00:32:03.000 | are quite good at sketching out plans for code
00:32:05.200 | and generating like rough sketches of like the change,
00:32:07.720 | but actually creating diffs is quite hard
00:32:13.080 | for frontier models, for your training models.
00:32:15.440 | Like you try to do this with Sonnet, with O1,
00:32:21.200 | any frontier model, and it really messes up stupid things
00:32:24.200 | like counting line numbers,
00:32:26.280 | especially in super, super large files.
00:32:28.400 | And so what we've done to alleviate this
00:32:31.840 | is we let the model kind of sketch out this rough code block
00:32:35.200 | that indicates what the change will be.
00:32:37.920 | And we train a model to then apply that change to the file.
00:32:42.440 | - And we should say that apply is,
00:32:45.120 | the model looks at your code.
00:32:47.440 | It gives you a really damn good suggestion
00:32:49.480 | of what new things to do.
00:32:52.400 | And the seemingly for humans trivial step
00:32:55.080 | of combining the two, you're saying is not so trivial.
00:32:59.400 | - Contrary to popular perception,
00:33:01.160 | it is not a deterministic algorithm.
00:33:03.080 | - Yeah.
00:33:04.160 | I think like you see shallow copies of apply elsewhere,
00:33:09.160 | and it just breaks like most of the time
00:33:11.480 | because you think you can kind of try
00:33:12.720 | to do some deterministic matching,
00:33:14.000 | and then it fails, at least 40% of the time.
00:33:18.160 | And that just results in a terrible product experience.
00:33:21.480 | I think in general, this regime of,
00:33:26.160 | you are going to get smarter and smarter models.
00:33:29.080 | And like, so one other thing that apply lets you do
00:33:31.800 | is it lets you use fewer tokens
00:33:35.240 | with the most intelligent models.
00:33:37.280 | This is both expensive in terms of latency
00:33:39.960 | for generating all these tokens and cost.
00:33:44.200 | So you can give this very, very rough sketch
00:33:47.160 | and then have your small models go and implement it
00:33:49.880 | because it's a much easier task to implement
00:33:52.000 | this very, very sketched out code.
00:33:54.360 | And I think that this regime will continue
00:33:56.200 | where you can use smarter and smarter models
00:33:58.280 | to do the planning.
00:33:59.120 | And then maybe the implementation details
00:34:01.600 | can be handled by the less intelligent ones.
00:34:03.560 | Perhaps you'll have, you know, maybe o1,
00:34:05.880 | maybe it'll be even more capable models
00:34:08.360 | given an even higher level plan
00:34:10.800 | that is kind of recursively applied by Sonnet
00:34:15.640 | and then the apply model.
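
Here is a hedged sketch of the two-stage flow described above, under the assumption that you have two prompt-to-completion callables, one backed by a frontier model and one by a small apply model. The function names and prompt wording are hypothetical; this is not Cursor's implementation.

```python
from typing import Callable

def propose_and_apply(frontier_llm: Callable[[str], str],
                      apply_llm: Callable[[str], str],
                      file_text: str,
                      instruction: str) -> str:
    # 1) The expensive model only sketches the change. It can elide unchanged
    #    code and does not need to get line numbers right, which keeps its
    #    output short (fewer generated tokens = lower latency and cost).
    sketch = frontier_llm(
        f"File:\n{file_text}\n\nInstruction: {instruction}\n"
        "Sketch only the code that should change; you may elide the rest."
    )
    # 2) The small model does the mechanical merge, which, per the discussion
    #    above, is surprisingly hard to do with a deterministic patch algorithm.
    return apply_llm(
        f"Original file:\n{file_text}\n\nProposed change:\n{sketch}\n"
        "Rewrite the complete file with the change applied."
    )
```
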
00:34:16.640 | - Maybe we should talk about how to make it fast.
00:34:18.960 | - Yeah.
00:34:19.800 | - I feel like fast is always an interesting detail.
00:34:21.720 | Fast is good.
00:34:22.800 | - Yeah. How do you make it fast?
00:34:25.120 | - Yeah, so one big component of making it fast
00:34:28.280 | is speculative edits.
00:34:29.960 | So speculative edits are a variant of speculative decoding.
00:34:33.080 | And maybe it'd be helpful to briefly describe
00:34:35.720 | speculative decoding.
00:34:37.800 | With speculative decoding,
00:34:39.240 | what you do is you can kind of take advantage
00:34:41.960 | of the fact that, you know, most of the time,
00:34:45.240 | and I'll add the caveat that it would be
00:34:47.640 | when you're memory bound in language model generation.
00:34:50.600 | If you process multiple tokens at once,
00:34:55.920 | it is faster than generating one token at a time.
00:34:58.760 | So this is like the same reason why
00:35:00.480 | if you look at tokens per second
00:35:02.640 | with prompt tokens versus generated tokens,
00:35:05.200 | it's much, much faster for prompt tokens.
00:35:07.520 | So what we do is instead of using
00:35:12.200 | what speculative decoding normally does,
00:35:13.800 | which is using a really small model
00:35:15.760 | to predict these draft tokens
00:35:17.160 | that your larger model will then go in and verify.
00:35:20.600 | With code edits, we have a very strong prior
00:35:24.000 | of what the existing code will look like.
00:35:25.920 | And that prior is literally the same exact code.
00:35:29.600 | So what you can do is you could just feed chunks
00:35:31.560 | of the original code back into the model.
00:35:35.000 | And then the model will just pretty much agree
00:35:37.920 | most of the time that, okay,
00:35:39.200 | I'm just gonna spit this code back out.
00:35:40.760 | And so you can process all of those lines in parallel.
00:35:43.480 | And you just do this with sufficiently many chunks.
00:35:45.320 | And then eventually you'll reach a point of disagreement
00:35:47.720 | where the model will now predict text that is different
00:35:51.200 | from the ground truth original code.
00:35:53.400 | It'll generate those tokens.
00:35:54.640 | And then we kind of will decide after enough tokens
00:35:57.160 | match the original code to restart speculating
00:36:01.160 | in chunks of code.
00:36:02.240 | What this actually ends up looking like
00:36:04.960 | is just a much faster version of normal editing code.
00:36:08.960 | So it looks like a much faster version
00:36:12.120 | of the model rewriting all the code.
00:36:13.680 | So we can use the same exact interface
00:36:16.560 | that we use for diffs,
00:36:19.120 | but it will just stream down a lot faster.
00:36:21.920 | - And then the advantage is that while it's streaming,
00:36:24.600 | you can just also start reviewing the code before it's done.
00:36:28.880 | So there's no big loading screen.
00:36:32.040 | So maybe that is part of the advantage.
00:36:36.440 | - So the human can start reading before the thing is done.
00:36:39.520 | - I think the interesting riff here is something like,
00:36:42.120 | like speculation is a fairly common idea nowadays.
00:36:45.680 | It's like not only in language models.
00:36:47.120 | I mean, there's obviously speculation in CPUs
00:36:49.240 | and there's like speculation for databases
00:36:51.560 | and speculation all over the place.
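
To make the speculative-edits control flow concrete, here is a toy, runnable sketch using a deterministic stand-in model: draft chunks are taken from the original file and accepted in bulk wherever the model agrees, real generation only happens around the edited region, and speculation resumes once the output re-aligns with the original. ToyModel and the realignment heuristic are illustrative assumptions, not Cursor's implementation.

```python
class ToyModel:
    """Stands in for a real LLM: it deterministically 'wants' to emit `target`."""

    def __init__(self, target):
        self.target = list(target)

    def verify(self, pos, draft):
        """How many draft tokens match what the model would emit starting at `pos`?"""
        n = 0
        while (n < len(draft) and pos + n < len(self.target)
               and draft[n] == self.target[pos + n]):
            n += 1
        return n

    def generate_one(self, pos):
        return self.target[pos]


def speculative_edit(model, original, chunk_size=8, realign_window=3):
    out, i = [], 0  # i indexes the original file, which doubles as the draft
    while len(out) < len(model.target):  # toy stop condition; real decoders stop on EOS
        draft = original[i:i + chunk_size]
        accepted = model.verify(len(out), draft) if draft else 0
        out.extend(draft[:accepted])     # bulk-accept everything the model agrees with
        i += accepted
        if len(out) >= len(model.target):
            break
        if draft and accepted == len(draft):
            continue                     # whole chunk matched; keep speculating
        # Disagreement (the edit starts here): fall back to normal decoding.
        out.append(model.generate_one(len(out)))
        # Once the last few emitted tokens reappear verbatim in the original,
        # assume we are past the edited region and resume speculating there.
        tail = out[-realign_window:]
        for j in range(i, len(original) - realign_window + 1):
            if original[j:j + realign_window] == tail:
                i = j + realign_window
                break
    return out


original = list("def add(a,b): return a+b")
edited = speculative_edit(ToyModel("def add(a, b):\n    return a + b"), original)
print("".join(edited))
```
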
00:36:54.680 | - Well, let me ask this sort of the ridiculous question
00:36:56.920 | of which LLM is better at coding.
00:36:59.960 | GPT, Claude, who wins in the context of programming?
00:37:04.160 | And I'm sure the answer is much more nuanced
00:37:05.920 | because it sounds like every single part of this
00:37:08.880 | involves a different model.
00:37:12.080 | - Yeah, I think there's no model
00:37:15.200 | that Pareto dominates others,
00:37:18.960 | meaning it is better in all categories
00:37:21.440 | that we think matter.
00:37:22.760 | The categories being speed,
00:37:27.600 | ability to edit code,
00:37:29.080 | ability to process lots of code, long context,
00:37:32.000 | you know, a couple of other things
00:37:33.080 | and kind of coding capabilities.
00:37:35.560 | The one that I'd say right now is just kind of net best
00:37:38.840 | is Sonnet.
00:37:39.680 | I think this is a consensus opinion.
00:37:41.240 | O1 is really interesting
00:37:42.480 | and it's really good at reasoning.
00:37:44.960 | So if you give it really hard programming interview
00:37:48.920 | style problems or lead code problems,
00:37:51.480 | it can do quite, quite well on them.
00:37:53.280 | But it doesn't feel like it kind of understands
00:37:57.400 | your rough intent as well as Sonnet does.
00:38:00.400 | Like if you look at a lot of the other frontier models,
00:38:05.600 | one qualm I have is it feels like
00:38:08.080 | they're not necessarily over,
00:38:09.560 | I'm not saying they train on benchmarks,
00:38:12.040 | but they perform really well on benchmarks
00:38:14.440 | relative to kind of everything that's kind of in the middle.
00:38:17.800 | So if you try it in all of these benchmarks
00:38:19.240 | and things that are in the distribution of the benchmarks
00:38:21.320 | they're evaluated on, you know, they'll do really well,
00:38:23.360 | but when you push them a little bit outside of that,
00:38:25.680 | Sonnet's I think the one that kind of does best
00:38:28.480 | at kind of maintaining that same capability.
00:38:31.480 | Like you kind of have the same capability in the benchmark
00:38:33.480 | as when you try to instruct it to do anything with coding.
00:38:37.320 | - What, another ridiculous question,
00:38:39.400 | is the difference between the normal programming experience
00:38:42.440 | versus what benchmarks represent.
00:38:44.920 | Like where do benchmarks fall short, do you think,
00:38:47.400 | when we're evaluating these models?
00:38:49.280 | - By the way, that's like a really, really hard,
00:38:51.560 | it's like critically important detail,
00:38:54.400 | like how different like benchmarks are
00:38:56.480 | versus like real coding.
00:38:58.640 | Where real coding, it's not interview style coding,
00:39:03.640 | it's you're doing these, you know,
00:39:06.720 | humans are saying like half broken English sometimes,
00:39:10.400 | and sometimes you're saying like, oh, do what I did before.
00:39:14.640 | Sometimes you're saying, you know, go add this thing
00:39:19.120 | and then do this other thing for me
00:39:20.480 | and then make this UI element.
00:39:21.880 | And then, you know, it's just like a lot of things
00:39:25.760 | are sort of context dependent.
00:39:27.880 | You really want to like understand the human
00:39:30.040 | and then do what the human wants
00:39:31.720 | as opposed to sort of this,
00:39:33.240 | maybe the way to put it is sort of abstractly
00:39:35.560 | is the interview problems are very well specified.
00:39:40.560 | They lean a lot on specification
00:39:45.120 | while the human stuff is less specified.
00:39:49.760 | - Yeah, I think that this benchmark question
00:39:51.960 | is both complicated by what Sualeh just mentioned.
00:39:55.040 | And then also to what Aman was getting into
00:39:59.240 | is that even if you like, you know,
00:40:00.640 | there's this problem of like the skew
00:40:01.800 | between what can you actually model in a benchmark
00:40:03.400 | versus real programming.
00:40:05.680 | And that can be sometimes hard to encapsulate
00:40:07.360 | because it's like real programming is like very messy
00:40:10.240 | and sometimes things aren't super well specified,
00:40:12.760 | what's correct or what isn't.
00:40:14.040 | But then it's also doubly hard
00:40:16.560 | because of this public benchmark problem.
00:40:18.240 | And that's both because public benchmarks
00:40:20.320 | are sometimes kind of hill climbed on,
00:40:21.800 | but then it's like really, really hard
00:40:23.240 | to also get the data from the public benchmarks
00:40:26.280 | out of the models.
00:40:28.200 | And so for instance, like one of the most popular
00:40:31.560 | like agent benchmarks, SWE-bench,
00:40:33.480 | is really, really contaminated
00:40:36.560 | in the training data of these foundation models.
00:40:39.280 | And so if you ask these foundation models
00:40:40.840 | to do a SWE-bench problem,
00:40:42.360 | you actually don't give them the context of a code base.
00:40:44.120 | They can like hallucinate the right file paths.
00:40:45.840 | They can hallucinate the right function names.
00:40:47.800 | And so it's also just the public aspect
00:40:52.040 | of these things is tricky.
00:40:53.360 | - Yeah, like in that case,
00:40:54.520 | it could be trained on the literal issues
00:40:56.920 | or pull requests themselves.
00:40:58.640 | And maybe the labs will start to do a better job
00:41:02.240 | or they've already done a good job
00:41:03.760 | at decontaminating those things,
00:41:05.360 | but they're not going to emit the actual training data
00:41:08.200 | of the repository itself.
00:41:09.760 | Like these are all like some of the most popular
00:41:11.600 | Python repositories, like SymPy is one example.
00:41:14.720 | I don't think they're going to handicap their models
00:41:17.680 | on SymPy and all these popular Python repositories
00:41:20.240 | in order to get true evaluation scores in these benchmarks.
00:41:24.120 | - I think that given the dearth in benchmarks,
00:41:27.280 | there have been like a few interesting crutches
00:41:30.320 | that places that build systems with these models
00:41:32.520 | or build these models actually use
00:41:35.000 | to get a sense of are they going the right direction or not?
00:41:37.120 | And in a lot of places,
00:41:39.400 | people will actually just have humans play with the things
00:41:41.920 | and give qualitative feedback on these.
00:41:44.160 | Like one or two of the foundation model companies,
00:41:45.920 | they have people who, that's a big part of their role.
00:41:49.000 | And internally we also qualitatively assess these models
00:41:53.080 | and actually lean on that a lot
00:41:54.040 | in addition to like private evals that we have.
00:41:56.560 | - It's like the vibe.
00:41:57.880 | - The vibe, yeah.
00:41:58.720 | - It's like the vibe.
00:41:59.560 | - The vibe benchmark, human benchmark.
00:42:02.120 | - Yeah.
00:42:02.960 | - You pull in the humans to do a vibe check.
00:42:05.600 | - Yeah.
00:42:06.440 | - Okay.
00:42:07.280 | I mean, that's kind of what I do,
00:42:08.120 | like just like reading online forums and Reddit and X,
00:42:12.520 | just like, well, I don't know how to properly load
00:42:17.520 | in people's opinions 'cause they'll say things like,
00:42:20.640 | I feel like Claude or GPT's gotten dumber or something.
00:42:25.200 | They'll say, I feel like,
00:42:27.680 | and then I sometimes feel like that too,
00:42:29.860 | but I wonder if it's the model's problem or mine.
00:42:34.000 | - Yeah, with Claude, there's an interesting take I heard
00:42:36.400 | where I think AWS has different chips
00:42:41.560 | and I suspect they have slightly different numerics
00:42:44.520 | than NVIDIA GPUs.
00:42:47.000 | And someone speculated that Claude's degraded performance
00:42:51.320 | had to do with maybe using the quantized version
00:42:54.360 | that existed on AWS Bedrock
00:42:56.040 | versus whatever was running on Anthropic's GPUs.
00:43:00.780 | - I interview a bunch of people that have conspiracy theories,
00:43:03.000 | so I'm glad you spoke to this conspiracy theory.
00:43:05.680 | - Well, it's not like a conspiracy theory as much.
00:43:09.420 | They're just, they're like, they're, you know,
00:43:12.000 | humans are humans and there's these details
00:43:14.520 | and, you know, you're doing like this crazy amount of flops
00:43:19.200 | and that chips are messy and man, you can just have bugs.
00:43:22.600 | Like bugs are, it's hard to overstate
00:43:26.400 | how hard bugs are to avoid.
00:43:27.880 | - What's the role of a good prompt in all of this?
00:43:32.840 | Suali, you mentioned that benchmarks
00:43:34.600 | have really structured, well-formulated prompts.
00:43:39.400 | What should a human be doing to maximize success?
00:43:44.400 | And what's the importance of what the human,
00:43:46.580 | you wrote a blog post on, you called it prompt design.
00:43:50.780 | - Yeah, I think it depends on which model you're using
00:43:55.780 | and all of them are slightly different
00:43:57.380 | and they respond differently to different prompts.
00:44:00.000 | But I think the original GPT-4
00:44:05.000 | and the original sort of pre-developed models last year,
00:44:08.580 | they were quite sensitive to the prompts
00:44:10.560 | and they also had a very small context window.
00:44:13.600 | And so we have all of these pieces of information
00:44:16.600 | around the code base that would maybe be relevant
00:44:19.880 | in the prompt, like you have the docs,
00:44:21.480 | you have the files that you add,
00:44:23.120 | you have the conversation history.
00:44:24.600 | And then there's a problem like,
00:44:26.640 | how do you decide what you actually put in the prompt
00:44:28.800 | and when you have a limited space.
00:44:30.720 | And even for today's models,
00:44:31.880 | even when you have long context,
00:44:33.880 | filling out the entire context window
00:44:35.800 | means that it's slower.
00:44:37.920 | It means that sometimes the model actually gets confused
00:44:40.540 | and some models get more confused than others.
00:44:43.180 | And we have this one system internally
00:44:45.460 | that we call preempt,
00:44:46.700 | which helps us with that a little bit.
00:44:50.040 | And I think it was built for the era before,
00:44:55.040 | when we only had 8,000-token context windows.
00:45:00.620 | And it's a little bit similar
00:45:03.420 | to when you're making a website,
00:45:05.820 | you wanted to work on mobile,
00:45:09.840 | you wanted to work on a desktop screen
00:45:12.200 | and you have this dynamic information,
00:45:16.720 | which you don't have, for example,
00:45:18.100 | if you're designing a print magazine,
00:45:19.920 | you know exactly where you can put stuff.
00:45:22.160 | But when you have a website or when you have a prompt,
00:45:24.240 | you have these inputs
00:45:25.800 | and then you need to format them to always work.
00:45:27.840 | Even if the input is really big,
00:45:29.020 | then you might have to cut something down.
00:45:31.280 | And so the idea was, okay, let's take some inspiration.
00:45:34.460 | What's the best way to design websites?
00:45:37.440 | Well, the thing that we really like is React
00:45:40.960 | and the declarative approach
00:45:42.080 | where you use JSX in JavaScript
00:45:47.080 | and then you declare, this is what I want.
00:45:50.360 | And I think this has higher priority
00:45:53.120 | or like this has higher Z index than something else.
00:45:56.480 | And then you have this rendering engine in web design.
00:46:00.120 | It's like Chrome.
00:46:01.120 | And in our case, it's a preempt renderer,
00:46:04.000 | which then fits everything onto the page.
00:46:07.140 | And so you just declare what you want,
00:46:09.000 | and then it figures out how to render it for you.
00:46:11.760 | And so we have found that to be quite helpful.
00:46:14.540 | And I think the role of it has sort of shifted over time,
00:46:19.080 | where initially it was to fit
00:46:20.180 | to these small context windows.
00:46:21.800 | Now it's really useful because, you know,
00:46:24.180 | it helps us with splitting up the data
00:46:27.760 | that goes into the prompt and the actual rendering of it.
00:46:30.240 | And so it's easier to debug
00:46:33.200 | because you can change the rendering of the prompt
00:46:35.480 | and then try it on old prompts
00:46:37.840 | because you have the raw data that went into the prompt.
00:46:40.320 | And then you can see,
00:46:41.160 | did my change actually improve it
00:46:42.480 | for like this entire eval set?
00:46:45.160 | - So do you literally prompt with JSX?
00:46:48.060 | - Yes. - Yeah.
00:46:49.240 | - So it kind of looks like React.
00:46:50.520 | There are components, like we have one component
00:46:52.320 | that's a file component and it takes in like the cursor,
00:46:57.320 | like usually there's like one line
00:46:59.140 | where the cursor is in your file.
00:47:00.840 | And that's like probably the most important line
00:47:02.280 | because that's the one you're looking at.
00:47:03.560 | And so then you can give priorities.
00:47:04.960 | So like that line has the highest priority
00:47:07.120 | and then you subtract one for every line
00:47:09.640 | that is farther away.
00:47:11.480 | And then eventually when it's rendered,
00:47:13.160 | it'd figure out how many lines can actually fit
00:47:14.960 | and it centers around that thing.
00:47:17.040 | - That's amazing. - Yeah.
00:47:17.880 | - And you can do like other fancy things
00:47:19.780 | where if you have lots of code blocks
00:47:21.780 | from the entire code base,
00:47:22.920 | you could use retrieval and things like embedding
00:47:26.320 | and re-ranking scores to add priorities
00:47:28.960 | for each of these components.
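
To make the declarative-prompt idea concrete, here is a minimal Python sketch (not Cursor's actual preempt system, which is described as using JSX-style components): each chunk declares a priority, lines lose priority the farther they sit from the cursor, and a tiny renderer keeps the highest-priority chunks that fit in a token budget. The names and the whitespace-based token estimate are illustrative assumptions.

```python
# A minimal sketch (not Cursor's preempt) of priority-based prompt rendering:
# components declare priorities, and a renderer greedily keeps the highest-priority
# pieces within a fixed token budget. Token counting is crudely approximated.

from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    priority: float

def file_component(lines, cursor_line, base_priority=100.0):
    """Each line's priority decays by 1 for every line it sits away from the cursor."""
    return [Chunk(text=l, priority=base_priority - abs(i - cursor_line))
            for i, l in enumerate(lines)]

def render(chunks, token_budget):
    """Keep the highest-priority chunks that fit, then restore original order."""
    chosen, used = [], 0
    for idx, chunk in sorted(enumerate(chunks), key=lambda p: -p[1].priority):
        cost = len(chunk.text.split()) + 1  # crude token estimate
        if used + cost <= token_budget:
            chosen.append((idx, chunk))
            used += cost
    return "\n".join(c.text for _, c in sorted(chosen))

# Example: a file plus a retrieved snippet with its own (e.g. reranker-derived) priority.
lines = [f"line {i}: some code" for i in range(200)]
chunks = file_component(lines, cursor_line=120)
chunks.append(Chunk(text="# retrieved doc: API usage example", priority=90.0))
print(render(chunks, token_budget=120))
```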
00:47:30.880 | - So should humans, when they ask questions,
00:47:33.400 | also try to use something like that?
00:47:35.720 | Like would it be beneficial to write JSX in the problem
00:47:39.960 | or the whole idea is you should be loose and messy?
00:47:43.320 | - I think our goal is kind of that
00:47:45.840 | you should just do whatever is the most natural thing
00:47:48.760 | for you.
00:47:49.680 | And then we, our job is to figure out
00:47:52.880 | how do we actually like retrieve the relevant thing
00:47:55.360 | so that your thing actually makes sense.
00:47:56.920 | - Well, this is sort of the discussion I had
00:47:58.840 | with Aravind of Perplexity.
00:48:01.520 | It's like, his whole idea is like,
00:48:03.080 | you should let the person be as lazy as he wants.
00:48:06.120 | - Yeah.
00:48:06.960 | - But like, yeah, that's a beautiful thing.
00:48:10.360 | But I feel like you're allowed to ask more of programmers.
00:48:14.000 | Right? - Yes.
00:48:14.840 | - So like if you say, just do what you want,
00:48:16.760 | I mean, humans are lazy.
00:48:19.080 | There's a kind of tension between just being lazy
00:48:21.520 | versus like provide more as,
00:48:23.360 | be prompted, almost like the system pressuring you
00:48:28.720 | or inspiring you to be articulate.
00:48:32.200 | - Yeah.
00:48:33.040 | - Not in terms of the grammar of the sentences,
00:48:34.400 | but in terms of the depth of thoughts
00:48:36.280 | that you convey inside the prompts.
00:48:38.960 | - I think even as a system gets closer
00:48:40.760 | to some level of perfection,
00:48:43.320 | often when you ask the model for something,
00:48:47.120 | you just are not, not enough intent is conveyed
00:48:50.520 | to know what to do.
00:48:51.880 | And there are like a few ways to resolve that intent.
00:48:54.400 | One is the simple thing of having the model just ask you,
00:48:58.040 | I'm not sure how to do these parts based on your query.
00:49:01.440 | Could you clarify that?
00:49:02.800 | I think the other could be,
00:49:06.200 | maybe if you, there are five or six possible generations,
00:49:11.200 | given the uncertainty present in your query so far,
00:49:14.680 | why don't we just actually show you all of those
00:49:16.480 | and let you pick them?
00:49:17.600 | - How hard is it to, for the model to choose to speak,
00:49:22.520 | talk back?
00:49:23.360 | Sort of versus, that's hard.
00:49:27.160 | It's sort of like how to deal with the uncertainty.
00:49:30.080 | Do I choose to ask for more information
00:49:34.200 | to reduce the ambiguity?
00:49:36.080 | - So, I mean, one of the things we do is,
00:49:39.440 | it's like a recent addition is,
00:49:41.520 | try to suggest files that you can add.
00:49:44.360 | So, and while you're typing,
00:49:46.720 | one can guess what the uncertainty is
00:49:50.720 | and maybe suggest that like,
00:49:53.840 | maybe you're writing your API
00:49:56.200 | and we can guess using the commits
00:50:02.200 | that you've made previously in the same file
00:50:05.640 | that the client and the server are super useful.
00:50:09.280 | And there's like a hard technical problem
00:50:12.720 | of how do you resolve it across all commits?
00:50:15.360 | Which files are the most important
00:50:16.920 | given your current prompt?
00:50:19.160 | And we're still sort of, initial version is rolled out
00:50:23.000 | and I'm sure we can make it much more accurate.
00:50:25.280 | It's very experimental,
00:50:28.000 | but then the idea is we show you like,
00:50:29.880 | do you just want to add this file, this file, this file also
00:50:33.000 | to tell the model to edit those files for you?
00:50:36.280 | Because if you're, maybe you're making the API,
00:50:38.960 | like you should also edit the client that is using the API
00:50:41.160 | and the server that is resolving the API.
00:50:44.200 | And so that'll be kind of cool as,
00:50:46.240 | both there's the phase where you're writing a prompt
00:50:49.320 | and there's before you even click enter,
00:50:52.000 | maybe we can help resolve some of the uncertainty.
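
One simple way to approximate the "which files matter for this change" signal described above is co-change counts from git history. This is only a sketch of the general idea, not Cursor's actual system, and `co_changed_files` is a hypothetical helper name.

```python
# A simple co-change heuristic (a sketch, not Cursor's system): rank files by how
# often they appeared in the same commits as the file currently being edited.

import subprocess
from collections import Counter

def co_changed_files(repo_path, current_file, max_commits=500, top_k=5):
    # Commits that touched the current file.
    commits = subprocess.run(
        ["git", "-C", repo_path, "log", f"-n{max_commits}",
         "--pretty=format:%H", "--", current_file],
        capture_output=True, text=True, check=True,
    ).stdout.split()

    counts = Counter()
    for sha in commits:
        # Files touched by each of those commits.
        files = subprocess.run(
            ["git", "-C", repo_path, "show", "--name-only", "--pretty=format:", sha],
            capture_output=True, text=True, check=True,
        ).stdout.split("\n")
        counts.update(f for f in files if f and f != current_file)
    return counts.most_common(top_k)

# e.g. co_changed_files(".", "server/api.py") might surface a client/api_client.ts
```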
00:50:54.360 | - To what degree do you use agentic approaches?
00:50:57.080 | How do you use agents?
00:50:59.080 | - We think agents are really, really cool.
00:51:02.480 | Like, I think agents is like,
00:51:05.040 | it's like, it resembles sort of like a human,
00:51:07.960 | it's sort of like the things,
00:51:09.280 | like you can kind of feel that it,
00:51:11.080 | like you're getting closer to AGI
00:51:12.600 | because you see a demo where it acts as a human would.
00:51:17.600 | And it's really, really cool.
00:51:19.720 | I think agents are not yet super useful for many things.
00:51:24.720 | They, I think we're getting close
00:51:28.720 | to where they will actually be useful.
00:51:30.520 | And so I think there are certain types of tasks
00:51:35.520 | where having an agent would be really nice.
00:51:39.600 | Like I would love to have an agent.
00:51:40.720 | For example, if like we have a bug
00:51:42.120 | where you sometimes can't Command-C
00:51:44.760 | and Command-V inside our chat input box.
00:51:48.720 | And that's a task that's super well specified.
00:51:50.840 | I just want to say like in two sentences,
00:51:52.640 | this does not work, please fix it.
00:51:54.360 | And then I would love to have an agent
00:51:56.240 | that just goes off, does it.
00:51:58.320 | And then a day later I come back and I reviewed the thing.
00:52:02.800 | - You mean it goes, finds the right file.
00:52:05.480 | - Yeah, it finds the right files.
00:52:06.920 | It like tries to reproduce the bug.
00:52:08.800 | It like fixes the bug
00:52:10.160 | and then it verifies that it's correct.
00:52:11.720 | And this could be a process that takes a long time.
00:52:14.800 | And so I think I would love to have that.
00:52:17.560 | And then I think a lot of programming,
00:52:20.560 | like there is often this belief
00:52:22.000 | that agents will take over all of programming.
00:52:24.760 | I don't think we think that that's the case
00:52:28.360 | because a lot of programming,
00:52:29.800 | a lot of the value is in iterating
00:52:32.480 | or you don't actually want to specify something upfront
00:52:35.440 | because you don't really know what you want
00:52:37.320 | until you've seen an initial version
00:52:39.080 | and then you want to iterate on that.
00:52:41.120 | And then you provide more information.
00:52:43.040 | And so for a lot of programming,
00:52:44.800 | I think you actually want a system that's instant
00:52:47.280 | that gives you an initial version instantly back
00:52:49.120 | and then you can iterate super, super quickly.
00:52:52.160 | - What about something like
00:52:53.920 | that recently came out, Replit Agent,
00:52:56.320 | that does also like setting up the development environment,
00:52:59.800 | installing software packages, configuring everything,
00:53:02.320 | configuring the databases and actually deploying the app?
00:53:05.680 | - Yeah.
00:53:06.520 | - Is that also in the set of things you dream about?
00:53:09.880 | - I think so.
00:53:10.720 | I think that would be really cool
00:53:11.640 | for certain types of programming.
00:53:14.160 | It would be really cool.
00:53:15.200 | - Is that within scope of Cursor?
00:53:17.760 | - Yeah.
00:53:18.600 | We aren't actively working on it right now,
00:53:20.800 | but it's definitely like,
00:53:22.840 | we want to make the programmer's life easier and more fun.
00:53:27.680 | And some things are just really tedious
00:53:30.040 | and you need to go through a bunch of steps
00:53:31.480 | and you want to delegate that to an agent.
00:53:34.120 | And then some things,
00:53:35.040 | you can actually have an agent in the background
00:53:36.840 | while you're working.
00:53:37.840 | Like, let's say you have a PR
00:53:39.520 | that's both backend and frontend
00:53:41.520 | and you're working in the frontend
00:53:42.480 | and then you can have a background agent
00:53:44.080 | that does some work and figure out
00:53:46.000 | kind of what you're doing.
00:53:47.040 | And then when you get to the backend part of your PR,
00:53:50.120 | then you have some like initial piece of code
00:53:52.840 | that you can iterate on.
00:53:54.120 | And so that would also be really cool.
00:53:58.520 | - One of the things we already talked about is speed,
00:54:01.400 | but I wonder if we can just linger on that some more
00:54:04.600 | in the various places that the technical details involved
00:54:09.600 | in making this thing really fast.
00:54:11.720 | So every single aspect of Cursor,
00:54:14.200 | most aspects of Cursor, feel really fast.
00:54:16.360 | Like I mentioned, the apply is probably the slowest thing.
00:54:18.480 | And for me, I'm sorry, the pain on Arvid's face.
00:54:21.800 | - I know, it's the pain.
00:54:23.000 | It's the pain that we're feeling
00:54:24.120 | and we're working on fixing it.
00:54:26.000 | - Yeah, I mean, it says something that feels,
00:54:30.200 | I don't know what it is,
00:54:31.040 | like one second or two seconds, that feels slow.
00:54:33.960 | That actually shows
00:54:36.800 | that everything else is just really, really fast.
00:54:39.640 | So is there some technical details
00:54:40.960 | about how to make some of these models,
00:54:42.840 | how to make the chat fast, how to make the diffs fast?
00:54:47.320 | Is there something that just jumps to mind?
00:54:49.120 | - Yeah, I mean, so we can go over
00:54:50.480 | a lot of the strategies that we use.
00:54:51.800 | One interesting thing is cache warming.
00:54:53.880 | And so what you can do is if, as the user's typing,
00:54:59.640 | you can have, you're probably going to use
00:55:03.600 | some piece of context and you can know that
00:55:05.880 | before the user's done typing.
00:55:07.440 | So, you know, as we discussed before,
00:55:10.600 | reusing the KVCache results in lower latency,
00:55:13.440 | lower costs across requests.
00:55:15.840 | So as the user starts typing,
00:55:17.000 | you can immediately warm the cache with like,
00:55:19.200 | let's say the current file contents.
00:55:21.160 | And then when they've pressed enter,
00:55:23.040 | there's very few tokens it actually has to pre-fill
00:55:27.040 | and compute before starting the generation.
00:55:28.680 | This will significantly lower TTFT, the time to first token.
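
A minimal sketch of the cache-warming idea, assuming nothing about Cursor's real serving stack: `prefill` is a stand-in for whatever call actually populates the KV cache, and the point is just that the known prefix (the file contents) can be prefetched while the user is still typing.

```python
# Sketch of cache warming (illustrative only; `prefill` is a stand-in for whatever
# the inference server exposes): start prefilling the likely prompt prefix while the
# user is still typing, so on Enter only the last few tokens need to be computed.

prefix_cache = {}  # prefix text -> opaque KV-cache handle

def prefill(text):
    return f"kv-cache for {len(text)} chars"  # stand-in for a real prefill call

def on_user_typing(current_file_contents):
    # Warm the cache with context we already know will be in the prompt.
    prefix_cache.setdefault(current_file_contents, prefill(current_file_contents))

def on_submit(current_file_contents, user_message):
    cache = prefix_cache.get(current_file_contents) or prefill(current_file_contents)
    # Only the short user message still needs prefill before generation starts.
    return cache, user_message
```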
00:55:30.640 | - Can you explain how the KV cache works?
00:55:33.000 | - Yeah, so the way transformers work,
00:55:35.880 | (laughing)
00:55:37.720 | - I like it.
00:55:38.560 | - I mean, like one of the mechanisms
00:55:41.920 | that allow transformers to not just independently,
00:55:45.240 | like the mechanism that allows transformers
00:55:46.840 | to not just independently look at each token,
00:55:48.480 | but see previous tokens, are the keys and values in attention.
00:55:52.520 | And generally the way attention works
00:55:54.480 | is you have at your current token, some query,
00:55:58.480 | and then you've all the keys and values
00:56:00.440 | of all your previous tokens,
00:56:01.520 | which are some kind of representation
00:56:04.000 | that the model stores internally of all the previous tokens
00:56:06.720 | in the prompt.
00:56:08.000 | And like by default, when you're doing a chat,
00:56:12.880 | the model has to, for every single token,
00:56:15.800 | do this forward pass through the entire model.
00:56:19.320 | That's a lot of matrix multiplies that happen.
00:56:21.440 | And that is really, really slow.
00:56:23.600 | Instead, if you have already done that
00:56:26.120 | and you stored the keys and values
00:56:28.080 | and you keep that in the GPU,
00:56:30.400 | then, let's say I have that stored
00:56:32.440 | for the last N tokens, if I now want to compute
00:56:35.480 | the output token for the N plus one token,
00:56:38.640 | I don't need to pass those first N tokens
00:56:42.040 | through the entire model because I already have
00:56:44.600 | all those keys and values.
00:56:46.400 | And so you just need to do the forward pass
00:56:48.280 | through that last token.
00:56:49.960 | And then when you're doing attention,
00:56:52.120 | you're reusing those keys and values that have been computed,
00:56:54.840 | which is the only kind of sequential part
00:56:57.680 | or sequentially dependent part of the transformer.
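
Here is a toy, single-head version of that mechanism in Python with random weights: once the keys and values for earlier tokens are cached, decoding the next token only requires projecting that one token and attending over the cache.

```python
# Toy single-head attention with a KV cache (illustrative, random weights): decoding
# the (N+1)-th token only projects that one token and attends over cached keys/values.

import numpy as np

d = 16
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def decode_step(self, x_new):
        # Only the newest token goes through the projections...
        q, k, v = x_new @ Wq, x_new @ Wk, x_new @ Wv
        self.keys.append(k)
        self.values.append(v)
        K, V = np.stack(self.keys), np.stack(self.values)
        # ...and attention reuses every previously cached key and value.
        attn = softmax(q @ K.T / np.sqrt(d))
        return attn @ V

cache = KVCache()
for t in range(5):                 # process a 5-token "prompt" one token at a time
    out = cache.decode_step(rng.standard_normal(d))
print(out.shape, len(cache.keys))  # (16,) 5
```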
00:56:59.920 | - Is there like higher level caching
00:57:02.040 | or like caching of the prompts
00:57:04.280 | or that kind of stuff that could help?
00:57:06.560 | - Yeah, there's other types of caching you can kind of do.
00:57:10.560 | One interesting thing that you can do for Cursor Tab
00:57:15.920 | is you can basically predict ahead
00:57:20.600 | as if the user would have accepted the suggestion
00:57:23.400 | and then trigger another request.
00:57:26.680 | And so then you've cached, you've done a speculative,
00:57:29.160 | it's a mix of speculation and caching, right?
00:57:31.080 | 'Cause you're speculating what would happen
00:57:32.840 | if they accepted it.
00:57:34.040 | And then you have this value that is cached,
00:57:36.880 | this suggestion.
00:57:38.560 | And then when they press tab,
00:57:39.600 | the next one would be waiting for them immediately.
00:57:41.880 | It's a kind of clever heuristic slash trick
00:57:44.600 | that uses a higher level caching
00:57:47.280 | and can make it feel fast
00:57:51.280 | despite there not actually being any changes
00:57:53.360 | in the model.
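
A sketch of that speculate-on-accept trick (illustrative only; `suggest` is a stand-in for the real model call): as soon as a suggestion is shown, the follow-up suggestion is precomputed as if the user had already accepted it, keyed by the document state that acceptance would produce.

```python
# Sketch of speculative caching for tab suggestions: once a suggestion is shown,
# precompute the next suggestion as if the user had accepted it, so pressing Tab
# again feels instant. `suggest` stands in for the real model request.

speculation_cache = {}  # document text -> precomputed next suggestion

def suggest(document):
    return " + next_edit"  # stand-in for a model call

def show_suggestion(document):
    suggestion = speculation_cache.pop(document, None) or suggest(document)
    # Speculate: if the user accepts, the document becomes document + suggestion,
    # so compute and cache the next suggestion for that state right away.
    accepted_state = document + suggestion
    speculation_cache[accepted_state] = suggest(accepted_state)
    return suggestion

doc = "x = load()"
s1 = show_suggestion(doc)          # model call
s2 = show_suggestion(doc + s1)     # served from the speculation cache
```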
00:57:54.200 | - And if you can make the KV cache smaller,
00:57:56.320 | one of the advantages you get is like,
00:57:58.200 | maybe you can speculate even more.
00:57:59.920 | Maybe you can guess here's the 10 things
00:58:01.960 | that could be useful.
00:58:03.640 | I don't know, like you predict the next 10
00:58:06.720 | and then like it's possible the user hits
00:58:09.280 | the one of the 10.
00:58:10.840 | It's like much higher chance than the user hits
00:58:12.600 | like the exact one that you showed them.
00:58:14.800 | Maybe they type in other character
00:58:16.840 | and we sort of hit something else in the cache.
00:58:19.080 | So there's all these tricks where,
00:58:20.880 | the general phenomena here is,
00:58:25.320 | I think it's also super useful for RL is,
00:58:28.560 | maybe a single sample from the model isn't very good.
00:58:33.240 | But if you predict like 10 different things,
00:58:37.920 | it turns out that one of the 10 that's right
00:58:40.720 | is the probability is much higher.
00:58:42.520 | So there's these pass@k curves
00:58:44.560 | and part of RL, like what RL does is,
00:58:47.760 | you can exploit this pass@k phenomenon
00:58:51.200 | to make many different predictions
00:58:53.400 | and one way to think about this,
00:58:57.240 | the model sort of knows internally has like,
00:58:59.560 | has some uncertainty over like,
00:59:01.280 | which of the k things is correct
00:59:03.040 | or like which of the k things does the human want.
00:59:05.440 | When we RL our cursor tab model,
00:59:09.480 | one of the things we're doing is,
00:59:11.920 | we're predicting which of the hundred different suggestions
00:59:16.920 | the model produces is more amenable for humans.
00:59:20.480 | Like which of them do humans more like than other things.
00:59:23.560 | Maybe like there's something where the model
00:59:26.400 | can predict very far ahead versus like a little bit
00:59:28.600 | and maybe somewhere in the middle
00:59:30.080 | and then you can give a reward to the things
00:59:34.280 | that humans would like more
00:59:35.400 | and sort of punish the things that humans wouldn't like
00:59:37.840 | and sort of then train the model
00:59:39.040 | to output the suggestions that humans would like more.
00:59:40.920 | You have these like RL loops that are very useful
00:59:43.200 | that exploit these pass@k curves.
00:59:45.840 | Aman maybe can go into even more detail.
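
The curve being referenced is usually computed with the standard unbiased pass@k estimator (popularized by the Codex paper): given n samples of which c pass, it is the probability that at least one of k randomly drawn samples is correct.

```python
# The standard unbiased pass@k estimator: with n samples per problem of which c
# passed, the chance that at least one of k randomly drawn samples is correct.
# This is why pass@10 can be much higher than pass@1 even when single samples
# are mediocre.

from math import comb

def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 100 samples, 20 correct: a single sample passes 20% of the time,
# but at least one of 10 samples passes roughly 90% of the time.
print(pass_at_k(100, 20, 1))   # 0.2
print(pass_at_k(100, 20, 10))  # ~0.905
```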
00:59:48.640 | - Yeah, it is a little different than speed.
00:59:50.840 | But I mean, like technically you tie it back in
00:59:55.880 | because you can get away with the smaller model
00:59:57.280 | if you RL your smaller model
00:59:58.920 | and it gets the same performances as the bigger one.
01:00:01.480 | That's like, and Suali was mentioning stuff about the KV cache,
01:00:07.200 | about reducing the size of your KV cache.
01:00:08.800 | There are other techniques there as well
01:00:10.120 | that are really helpful for speed.
01:00:11.800 | So kind of back in the day,
01:00:15.080 | like all the way two years ago,
01:00:17.640 | people mainly use multi-head attention.
01:00:20.120 | And I think there's been a migration
01:00:21.480 | towards more efficient attention schemes
01:00:24.320 | like group query or multi-query attention.
01:00:28.720 | And this is really helpful for then
01:00:31.640 | with larger batch sizes,
01:00:33.600 | being able to generate the tokens much faster.
01:00:36.640 | The interesting thing here is
01:00:38.200 | this now has no effect on that
01:00:42.120 | time to first token pre-fill speed.
01:00:45.280 | The thing this matters for is now generating tokens.
01:00:48.880 | And why is that?
01:00:49.880 | 'Cause when you're generating tokens,
01:00:51.760 | instead of being bottlenecked
01:00:54.560 | by doing these super parallelizable matrix multiplies
01:00:57.680 | across all your tokens,
01:00:59.040 | you're bottlenecked, for long context
01:01:02.400 | with large batch sizes,
01:01:04.200 | by how quickly you can read those cached keys and values.
01:01:07.200 | And so then that's memory bandwidth
01:01:10.760 | and how can we make this faster?
01:01:12.560 | We can try to compress the size of these keys and values.
01:01:15.280 | So multi-query attention is the most aggressive of these,
01:01:18.280 | where normally with multi-head attention,
01:01:20.880 | you have some number of key-value heads
01:01:24.320 | and some number of kind of query heads.
01:01:29.320 | Multi-query just preserves the query heads,
01:01:32.200 | gets rid of all the key value heads.
01:01:34.480 | So there's only one kind of key value head
01:01:37.920 | and there's all the remaining query heads.
01:01:41.320 | With group query, you instead preserve all the query heads
01:01:46.320 | and then your keys and values are kind of...
01:01:51.960 | There are fewer heads for the keys and values,
01:01:53.680 | but you're not reducing it to just one.
01:01:56.040 | But anyways, the whole point here
01:01:57.280 | is you're just reducing the size of your KV cache.
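
A toy comparison of what gets cached per token under the three schemes; all the dimensions here are made up, and only the number of key/value heads changes.

```python
# Per-token KV-cache entries under the three attention schemes (a toy comparison;
# the dimensions are made up). Queries always keep n_q heads; only cached K/V shrink.

n_q_heads, head_dim, n_layers = 32, 128, 48

def cached_floats_per_token(n_kv_heads):
    # 2 tensors (K and V) x layers x kv-heads x head dimension
    return 2 * n_layers * n_kv_heads * head_dim

schemes = {
    "multi-head (MHA)":    n_q_heads,  # one K/V head per query head
    "grouped-query (GQA)": 8,          # a few K/V heads shared by groups of query heads
    "multi-query (MQA)":   1,          # a single K/V head shared by all query heads
}
for name, n_kv in schemes.items():
    print(f"{name:22s} {cached_floats_per_token(n_kv):>10,d} floats/token")
```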
01:02:00.480 | - And then there is MLA.
01:02:02.480 | - Yeah, multi-head latent attention.
01:02:04.200 | That's a little more complicated.
01:02:06.040 | And the way that this works
01:02:07.800 | is it kind of turns the entirety of your keys and values
01:02:12.280 | across all your heads into this kind of one latent vector
01:02:16.760 | that is then kind of expanded inference time.
01:02:19.760 | - But MLA is from this company called DeepSeek.
01:02:23.960 | It's quite an interesting algorithm.
01:02:26.280 | Maybe the key idea is sort of,
01:02:27.920 | in both MQA and in other places,
01:02:32.480 | what you're doing is you're sort of reducing
01:02:35.920 | like the number of KV heads.
01:02:38.720 | The advantage you get from that is there's less of them,
01:02:43.480 | but maybe the theory is that you actually want
01:02:47.680 | a lot of different,
01:02:48.840 | like you want each of the keys and values
01:02:51.840 | to actually be different.
01:02:52.840 | So one way to reduce the size is you keep
01:02:55.880 | one big shared vector for all the keys and values,
01:03:01.360 | and then you have smaller vectors for every single token
01:03:03.880 | so that you can store only the smaller thing
01:03:07.560 | as some sort of like low-rank reduction.
01:03:10.080 | And with the low-rank reduction,
01:03:11.480 | at the end,
01:03:12.800 | when you eventually want to compute the final thing,
01:03:16.080 | remember that you're memory bound,
01:03:17.600 | which means that you still have some compute left
01:03:20.320 | that you can use for these things.
01:03:21.400 | And so if you can expand the latent vector back out,
01:03:26.400 | and somehow this is far more efficient
01:03:29.920 | because you're reducing, for example,
01:03:33.120 | maybe like reducing it by 32 or something,
01:03:36.240 | like the size of the vector that you're keeping.
01:03:37.960 | - Yeah, there's perhaps some richness
01:03:39.920 | in having a separate set of keys and values
01:03:43.880 | and query that kind of pairwise match up
01:03:45.960 | versus compressing that all into one,
01:03:47.760 | and that interaction at least.
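
A conceptual toy of the latent-compression trick (it ignores real MLA details such as DeepSeek's handling of rotary embeddings): only a small per-token latent is cached, and it is expanded back into full keys and values at attention time, trading spare compute for memory bandwidth.

```python
# Conceptual toy of multi-head latent attention's cache trick (illustrative shapes):
# cache only a small per-token latent and expand it back into full keys and values
# when attention is computed.

import numpy as np

d_model, d_latent, n_heads, head_dim = 1024, 64, 16, 64
rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.02    # compress hidden -> latent
W_up_k = rng.standard_normal((d_latent, n_heads * head_dim)) * 0.02
W_up_v = rng.standard_normal((d_latent, n_heads * head_dim)) * 0.02

hidden = rng.standard_normal(d_model)        # one token's hidden state
latent = hidden @ W_down                     # this 64-dim vector is all we cache
k = (latent @ W_up_k).reshape(n_heads, head_dim)   # expanded at attention time
v = (latent @ W_up_v).reshape(n_heads, head_dim)

full_cache = 2 * n_heads * head_dim          # floats/token if we cached full K and V
print(d_latent, "cached floats/token vs", full_cache, "for full K/V")  # 64 vs 2048
```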
01:03:51.320 | - Okay, and all of that is dealing with being memory bound.
01:03:55.400 | - Yeah.
01:03:56.240 | - And what, I mean, ultimately,
01:03:59.000 | how does that map to the user experience?
01:04:01.400 | Trying to get the-
01:04:02.240 | - Yeah, the two things that it maps to
01:04:03.880 | is you can now make your cache a lot larger
01:04:06.640 | because you have less space allocated for the KV cache.
01:04:09.600 | You can maybe cache a lot more aggressively
01:04:11.240 | in a lot more things.
01:04:12.400 | So you get more cache hits,
01:04:14.120 | which are helpful for reducing the time to first token
01:04:17.280 | for the reasons that were kind of described earlier.
01:04:19.520 | And then the second being,
01:04:20.680 | when you start doing inference with more and more requests
01:04:24.160 | and larger and larger batch sizes,
01:04:25.920 | you don't see much of a slowdown
01:04:28.400 | in as it's generating the tokens, the speed of that.
01:04:31.720 | - Would it also allow you to make your prompt bigger
01:04:33.960 | for certain-
01:04:34.800 | - Yeah, yeah.
01:04:35.640 | So like the basic, the size of your KV cache
01:04:38.280 | is both the size of all your prompts
01:04:41.280 | multiplied by the number of prompts
01:04:42.480 | being processed in parallel.
01:04:43.600 | So you could increase either of those dimensions, right?
01:04:45.800 | The batch size or the size of your prompts
01:04:48.120 | without degrading the latency of generating tokens.
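
A back-of-the-envelope illustration with made-up model dimensions: because the cache scales with prompt length times batch size, shrinking the per-token cache can be spent on either longer prompts or larger batches at the same memory budget.

```python
# Back-of-the-envelope (made-up numbers): the KV cache scales with prompt length x
# batch size, so a smaller per-token cache can buy longer prompts or bigger batches
# at the same memory budget.

def kv_cache_gb(bytes_per_token, prompt_tokens, batch_size):
    return bytes_per_token * prompt_tokens * batch_size / 1e9

mha_bytes = 2 * 48 * 32 * 128 * 2     # K+V, 48 layers, 32 kv-heads, head dim 128, fp16
gqa_bytes = 2 * 48 * 8 * 128 * 2      # the same model with 8 kv-heads

print(kv_cache_gb(mha_bytes, 8_000, 32))   # ~200 GB
print(kv_cache_gb(gqa_bytes, 8_000, 32))   # ~50 GB
print(kv_cache_gb(gqa_bytes, 32_000, 32))  # 4x the context for the same ~200 GB
```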
01:04:51.920 | - Arvid, you wrote a blog post, Shadow Workspace.
01:04:54.360 | - Yeah.
01:04:55.200 | - Iterating on code in the background.
01:04:56.520 | - Yeah.
01:04:57.360 | - So what's going on?
01:04:58.520 | - So to be clear,
01:04:59.680 | we want there to be a lot of stuff happening
01:05:02.840 | in the background
01:05:03.680 | and we're experimenting with a lot of things.
01:05:05.760 | Right now, we don't have much of that happening
01:05:09.120 | other than like the cache warming
01:05:10.960 | or like figuring out the right context
01:05:13.920 | that goes into your Command-K prompts, for example.
01:05:16.520 | But the idea is,
01:05:17.800 | if you can actually spend computation in the background,
01:05:20.320 | then you can help the user
01:05:24.760 | maybe like at a slightly longer time horizon
01:05:27.840 | than just predicting the next few lines
01:05:30.120 | that you're gonna make.
01:05:30.960 | But actually like in the next 10 minutes,
01:05:32.880 | what are you going to make?
01:05:34.040 | And by doing it in the background,
01:05:35.680 | you can spend more computation doing that.
01:05:38.760 | And so the idea of the Shadow Workspace
01:05:41.560 | that we implemented
01:05:42.680 | and we use it internally for like experiments
01:05:45.760 | is that to actually get advantage
01:05:49.280 | of doing stuff in the background,
01:05:50.880 | you want some kind of feedback signal
01:05:53.440 | to give back to the model.
01:05:54.840 | Because otherwise, like you can get higher performance
01:05:57.480 | by just letting the model think for longer.
01:06:00.760 | And so like O1 is a good example of that.
01:06:03.000 | But another way you can improve performance
01:06:04.800 | is by letting the model iterate and get feedback.
01:06:08.040 | And so one very important piece of feedback
01:06:11.200 | when you're a programmer is the language server,
01:06:15.000 | which is this thing,
01:06:16.960 | it exists for most different languages
01:06:20.080 | and there's like a separate language server per language.
01:06:22.640 | And it can tell you, you know,
01:06:24.680 | you're using the wrong type here
01:06:26.120 | and then it gives you an error
01:06:27.320 | or it can allow you to go to definition
01:06:29.400 | and sort of understands the structure of your code.
01:06:31.880 | So language servers are extensions developed by,
01:06:34.920 | like there's a TypeScript language server
01:06:36.280 | that was developed by the TypeScript people,
01:06:38.160 | a Rust language server that was developed by the Rust people,
01:06:40.120 | and then they all interface
01:06:41.480 | over the language server protocol to VS Code.
01:06:43.480 | So that VS Code doesn't need to have
01:06:45.120 | all of the different languages built into VS Code,
01:06:47.480 | but rather you can use
01:06:49.000 | the existing compiler infrastructure.
01:06:50.880 | - For linting purposes, what?
01:06:52.880 | - It's for linting, it's for going to definition
01:06:56.000 | and for like seeing the right types that you're using.
01:06:59.600 | - So it's doing like type checking also?
01:07:01.400 | - Yes, type checking and going to references.
01:07:03.960 | And that's like, when you're working in a big project,
01:07:07.040 | you kind of need that.
01:07:08.440 | If you don't have that,
01:07:09.600 | it's like really hard to code in a big project.
01:07:12.720 | - Can you say again how that's being used inside Cursor,
01:07:16.320 | the language server protocol communication thing?
01:07:20.440 | - So it's being used in Cursor to show to the programmer,
01:07:22.520 | just like in VS Code.
01:07:23.680 | But then the idea is you want to show that same information
01:07:26.840 | to the models, the AI models.
01:07:30.040 | And you want to do that in a way
01:07:31.840 | that doesn't affect the user,
01:07:33.400 | because you want to do it in background.
01:07:34.800 | And so the idea behind the shadow workspace was,
01:07:38.000 | okay, like one way we can do this
01:07:40.040 | is we spawn a separate window of Cursor that's hidden.
01:07:45.040 | And so you can set this flag in Electron that it's hidden.
01:07:48.920 | There is a window, but you don't actually see it.
01:07:50.720 | And inside of this window,
01:07:52.680 | the AI agents can modify code however they want,
01:07:56.000 | as long as they don't save it,
01:07:57.080 | because it's still the same folder,
01:07:59.280 | and then can get feedback from the linters
01:08:01.720 | and go to definition and iterate on their code.
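
A toy version of that iterate-with-feedback loop, not the shadow workspace itself: Python's built-in compile() stands in for a real language server's diagnostics, and `propose_fix` is a hypothetical stand-in for a model call.

```python
# Toy version of the iterate-with-feedback idea: an "agent" proposes code, gets
# diagnostics without ever writing to disk, and retries. compile() stands in for
# a real language server; propose_fix stands in for a model call.

def diagnostics(source, filename="<shadow>"):
    try:
        compile(source, filename, "exec")
        return []
    except SyntaxError as e:
        return [f"{filename}:{e.lineno}: {e.msg}"]

def propose_fix(source, errors):
    # Stand-in for asking the model to fix the reported errors.
    return source.replace("retrun", "return")

draft = "def add(a, b):\n    retrun a + b\n"
for _ in range(3):
    errs = diagnostics(draft)
    if not errs:
        break
    draft = propose_fix(draft, errs)
print(draft)
```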
01:08:04.080 | - So like literally run everything in the background,
01:08:06.800 | like as if, right?
01:08:08.560 | - Yeah.
01:08:09.400 | - Maybe even run the code?
01:08:10.840 | - So that's the eventual version.
01:08:13.280 | That's what you want.
01:08:14.120 | And a lot of the blog post is actually about
01:08:16.280 | how do you make that happen?
01:08:19.120 | Because it's a little bit tricky.
01:08:20.760 | You want it to be on the user's machine
01:08:22.280 | so that it exactly mirrors the user's environment.
01:08:25.880 | And then on Linux, you can do this cool thing
01:08:29.080 | where you can actually mirror the file system
01:08:31.400 | and have the AI make changes to the files.
01:08:35.360 | And it thinks that it's operating on the file level,
01:08:38.680 | but actually that's stored in memory
01:08:42.840 | and you can create this kernel extension to make it work.
01:08:47.200 | Whereas on Mac and Windows,
01:08:49.840 | it's a little bit more difficult,
01:08:51.920 | but it's a fun technical problem, so that's why.
01:08:57.400 | - One maybe hacky, but interesting idea that I like
01:08:59.640 | is holding a lock on saving.
01:09:02.200 | And so basically you can then have the language model
01:09:04.720 | kind of hold the lock on saving to disk.
01:09:07.360 | And then instead of you operating in the ground truth
01:09:09.960 | version of the files that are saved to disk,
01:09:12.240 | you actually are operating
01:09:13.320 | what was the shadow workspace before
01:09:14.800 | and these unsaved things that only exist in memory
01:09:16.600 | that you still get linter errors for and you can code in.
01:09:19.120 | And then when you try to maybe run code,
01:09:21.320 | it's just like, there's a small warning that there's a lock
01:09:23.960 | and then you kind of will take back the lock
01:09:25.800 | from the language server
01:09:28.560 | or from the shadow workspace
01:09:29.800 | if you're trying to do things concurrently.
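
A minimal sketch of the save-lock idea, under the assumption that edits live in in-memory buffers: the agent only edits while it holds the lock, and a user-initiated save reclaims it so the on-disk ground truth is never touched concurrently.

```python
# Toy sketch of the "hold a lock on saving" idea: the agent edits in-memory buffers
# while holding a save lock; a user-initiated save reclaims the lock so the on-disk
# ground truth is never written to concurrently.

import threading

save_lock = threading.Lock()
in_memory_buffers = {}   # path -> unsaved contents the agent is iterating on

def agent_edit(path, new_contents):
    with save_lock:                        # agent works only while it holds the lock
        in_memory_buffers[path] = new_contents

def user_save(path, contents):
    with save_lock:                        # a user save takes the lock back
        in_memory_buffers.pop(path, None)  # the agent's unsaved view is now stale
        with open(path, "w") as f:
            f.write(contents)
```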
01:09:31.720 | - That's such an exciting future, by the way.
01:09:33.880 | It's a bit of a tangent,
01:09:34.840 | but like to allow a model to change files,
01:09:38.400 | it's scary for people, but like, it's really cool
01:09:42.120 | to be able to just like let the agent do a set of tasks
01:09:46.000 | and you come back the next day and kind of observe
01:09:49.680 | like it's a colleague or something like that.
01:09:51.920 | - Yeah, and I think there may be different versions
01:09:53.960 | of like runnability where for the simple things
01:09:57.560 | where you're doing things in the span of a few minutes
01:09:59.960 | on behalf of the user as they're programming,
01:10:02.040 | it makes sense to make something work locally
01:10:04.840 | in their machine.
01:10:05.800 | I think for the more aggressive things
01:10:07.200 | where you're making larger changes
01:10:08.640 | that take longer periods of time,
01:10:10.360 | you'll probably want to do this
01:10:11.600 | in some sandbox remote environment.
01:10:13.480 | And that's another incredibly tricky problem
01:10:15.800 | of how do you exactly reproduce or mostly reproduce
01:10:20.120 | to the point of it being effectively equivalent
01:10:22.480 | for running code, the user's environment
01:10:24.960 | with this remote sandbox.
01:10:27.240 | - I'm curious what kind of agents you want for coding.
01:10:30.680 | Do you want them to find bugs?
01:10:32.920 | Do you want them to like implement new features?
01:10:35.040 | Like what agents do you want?
01:10:36.720 | - So by the way, when I think about agents,
01:10:38.560 | I don't think just about coding.
01:10:41.400 | I think, so for the practicalities of this particular podcast,
01:10:45.120 | there's video editing, and a lot of, if you look in Adobe,
01:10:47.920 | there's a lot of code behind it.
01:10:50.320 | It's very poorly documented code,
01:10:52.080 | but you can interact with Premiere, for example,
01:10:55.240 | using code and basically all the uploading,
01:10:58.640 | everything I do on YouTube,
01:10:59.640 | everything as you could probably imagine,
01:11:01.480 | I do all of that through code and including translation
01:11:04.920 | and overdubbing all of this.
01:11:06.640 | So I envision all of those kinds of tasks.
01:11:10.160 | So automating many of the tasks
01:11:11.840 | that don't have to do directly with the editing.
01:11:14.120 | So that, okay.
01:11:16.080 | That's what I was thinking about.
01:11:17.000 | But in terms of coding,
01:11:19.520 | I would be fundamentally thinking about bug finding,
01:11:23.960 | like many levels of kind of bug finding
01:11:27.480 | and also bug finding like logical bugs,
01:11:30.200 | not logical, like spiritual bugs or something.
01:11:32.520 | (all laughing)
01:11:34.280 | Ones like sort of big directions of implementation,
01:11:37.400 | that kind of stuff.
01:11:38.680 | - That's a bind on bug finding.
01:11:40.000 | - Yeah, I mean, it's really interesting
01:11:41.960 | that these models are so bad at bug finding
01:11:44.840 | when just naively prompted to find a bug.
01:11:48.720 | They're incredibly poorly calibrated.
01:11:51.320 | - Even the smartest models.
01:11:52.720 | - Exactly, even O1.
01:11:54.800 | - How do you explain that?
01:11:56.480 | Is there a good intuition?
01:11:57.840 | - I think these models are really strong reflection
01:12:02.080 | of the pre-training distribution.
01:12:04.560 | And I do think they generalize
01:12:06.880 | as the loss gets lower and lower,
01:12:08.520 | but I don't think the loss and the scale are quite there,
01:12:11.360 | or the loss isn't low enough,
01:12:13.000 | such that they're like really fully generalizing in code.
01:12:16.360 | Like the things that we use these things for,
01:12:18.680 | the frontier models that they're quite good at
01:12:21.360 | are really code generation and question answering.
01:12:25.120 | And these things exist in massive quantities
01:12:28.440 | in pre-training with all of the code on GitHub
01:12:30.880 | on the scale of many, many trillions of tokens
01:12:33.160 | and questions and answers on things like Stack Overflow
01:12:37.400 | and maybe GitHub issues.
01:12:39.160 | And so when you try to push one of these things
01:12:41.880 | that really don't exist very much online,
01:12:46.160 | like for example, the cursor tab objective
01:12:48.680 | of predicting the next edit,
01:12:49.960 | given the edits done so far,
01:12:52.040 | the brittleness kind of shows.
01:12:53.720 | And then bug detection is another great example
01:12:55.880 | where there aren't really that many examples
01:12:58.080 | of like actually detecting real bugs
01:12:59.720 | and then proposing fixes.
01:13:01.080 | And the models just kind of like really struggle at it.
01:13:05.920 | But I think it's a question of transferring the model,
01:13:08.520 | like in the same way that you get this fantastic transfer
01:13:11.920 | from pre-trained models just on code in general
01:13:14.680 | to the cursor tab objective,
01:13:17.000 | you'll see a very, very similar thing
01:13:19.080 | with generalized models that are really good at code
01:13:21.560 | to bug detection.
01:13:22.400 | It just takes like a little bit of kind of nudging
01:13:24.280 | in that direction.
01:13:25.240 | - Like to be clear,
01:13:26.080 | I think they sort of understand code really well.
01:13:28.200 | Like while they're being pre-trained,
01:13:30.280 | like the representation that's being built up,
01:13:33.400 | like almost certainly like somewhere in the stream,
01:13:36.960 | there's the model knows
01:13:38.880 | that maybe there's something sketchy going on, right?
01:13:42.000 | It sort of has some sketchiness,
01:13:43.560 | but actually eliciting the sketchiness to,
01:13:46.920 | like actually like part of it
01:13:51.320 | is that humans are really calibrated
01:13:52.920 | on which bugs are really important.
01:13:55.000 | It's not just actually saying
01:13:57.080 | like there's something sketchy.
01:13:58.040 | It's like, is this sketchy trivial?
01:14:00.480 | Is this sketchy like you're gonna take the server down?
01:14:03.040 | It's like part of it is maybe the cultural knowledge
01:14:05.720 | of like, why is a staff engineer a staff engineer?
01:14:09.240 | A staff engineer is good
01:14:10.720 | because they know that three years ago,
01:14:12.800 | like someone wrote a really, you know,
01:14:15.000 | a sketchy piece of code that took the server down.
01:14:17.960 | And as opposed to like,
01:14:20.160 | as opposed to maybe you're just like, you know,
01:14:21.920 | you just, this thing is like an experiment.
01:14:25.920 | So like a few bugs are fine.
01:14:27.440 | Like you're just trying to experiment
01:14:28.720 | and get the feel of the thing.
01:14:30.320 | And so if the model gets really annoying
01:14:31.960 | when you're writing an experiment, that's really bad.
01:14:34.560 | But if you're writing something for super production,
01:14:36.920 | you're like writing a database, right?
01:14:38.320 | You're writing code in Postgres or Linux or whatever,
01:14:40.880 | like you're Linus Torvalds.
01:14:42.680 | It's sort of unacceptable to have even an edge case
01:14:45.400 | and just having the calibration of like,
01:14:47.640 | how paranoid is the user?
01:14:51.600 | - But even then, like,
01:14:52.720 | if you're putting it on maximum paranoia,
01:14:55.120 | it's still just like, doesn't quite get it.
01:14:57.680 | - Yeah, yeah, yeah.
01:14:58.800 | - I mean, but this is hard for humans too,
01:15:01.000 | to understand which line of code is important
01:15:04.120 | and which is not.
01:15:05.320 | Like you, I think one of your principles on a website says,
01:15:08.400 | if a code can do a lot of damage,
01:15:11.520 | one should add a comment that say,
01:15:13.640 | this line of code is dangerous.
01:15:17.000 | - And all caps, repeat it 10 times.
01:15:20.720 | - No, you say like, for every single line of code
01:15:24.640 | inside the function, you have to add,
01:15:26.400 | and that's quite profound.
01:15:28.120 | That says something about human beings
01:15:30.160 | because the engineers move on,
01:15:33.400 | even the same person might just forget
01:15:36.360 | how it can sink the Titanic, a single function.
01:15:38.520 | Like you don't, you might not intuit that quite clearly
01:15:41.040 | by looking at the single piece of code.
01:15:42.920 | - Yeah, and I think that one is also,
01:15:45.440 | partially also for today's AI models,
01:15:48.120 | where if you actually write dangerous, dangerous, dangerous
01:15:51.840 | in every single line,
01:15:52.800 | like the models will pay more attention to that
01:15:57.520 | and will be more likely to find bugs in that region.
01:16:00.480 | - That's actually just straight up a really good practice
01:16:03.600 | of labeling code, of how much damage this can do.
01:16:08.280 | - Yeah, I mean, it's controversial.
01:16:10.160 | Some people think it's ugly.
01:16:11.720 | - Well, I actually think it's like,
01:16:14.640 | in fact, I actually think this is one of the things
01:16:18.240 | I learned from Arvid is, you know,
01:16:18.240 | like I sort of aesthetically, I don't like it,
01:16:22.080 | but I think there's certainly something
01:16:24.240 | where like it's useful for the models
01:16:26.520 | and humans just forget a lot.
01:16:28.200 | And it's really easy to make a small mistake
01:16:30.480 | and cause like, bring down, you know,
01:16:33.920 | like just bring down the server and like,
01:16:36.080 | like, of course we like test a lot and whatever,
01:16:39.360 | but there's always these things
01:16:41.360 | that you have to be very careful.
01:16:42.480 | - Yeah, like with just normal doc strings,
01:16:44.400 | I think people will often just skim it
01:16:46.320 | when making a change and think,
01:16:47.440 | oh, I know how to do this.
01:16:49.520 | And you kind of really need to point it out to them
01:16:53.360 | so that that doesn't slip through.
01:16:55.800 | - Yeah, you have to be reminded
01:16:57.000 | that you can do a lot of damage.
01:16:58.600 | That's like, we don't really think about that.
01:17:02.080 | You think about, okay, how do I figure out how this works
01:17:04.960 | so I can improve it?
01:17:05.800 | You don't think about the other direction.
01:17:08.720 | - Until we have formal verification for everything,
01:17:12.760 | then you can do whatever you want
01:17:14.200 | and you know for certain
01:17:16.560 | that you have not introduced a bug if the proofs pass.
01:17:18.920 | - But concretely, what do you think
01:17:20.000 | that future would look like?
01:17:22.000 | - I think people will just not write tests anymore
01:17:26.000 | and the model will suggest,
01:17:29.960 | like you write a function,
01:17:31.280 | the model will suggest a spec
01:17:32.960 | and you review the spec.
01:17:34.200 | And in the meantime,
01:17:36.920 | smart reasoning model computes a proof
01:17:39.440 | that the implementation follows the spec.
01:17:42.120 | And I think that happens for most functions.
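
A deliberately tiny Lean 4 example of that workflow, just to show the shape of it: an implementation, a spec, and a machine-checked proof that the implementation meets the spec (real specs would of course be far richer than this).

```lean
-- A deliberately tiny Lean 4 example of the workflow being described: an
-- implementation, a spec, and a machine-checked proof that the implementation
-- meets the spec.

def double (n : Nat) : Nat :=
  n + n

-- The "spec": double n is n added to itself.
theorem double_meets_spec (n : Nat) : double n = n + n := by
  rfl
```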
01:17:44.280 | - Don't you think this gets at a little bit,
01:17:46.360 | some of the stuff you were talking about earlier
01:17:47.680 | with the difficulty of specifying intent
01:17:49.400 | for what you want with software,
01:17:51.680 | where sometimes it might be,
01:17:53.160 | because the intent is really hard to specify,
01:17:54.800 | it's also then going to be really hard to prove
01:17:56.680 | that it's actually matching whatever your intent is.
01:17:58.440 | - Like you think that spec is hard to generate?
01:18:00.880 | - Yeah, or just like for a given spec,
01:18:06.720 | maybe you can, I think there is a question of like,
01:18:08.920 | can you actually do the formal verification?
01:18:10.960 | Like that's like, is that possible?
01:18:12.880 | I think that there's like more to dig into there.
01:18:15.000 | But then also-
01:18:15.880 | - Even if you have the spec?
01:18:17.480 | - If you have the spec.
01:18:18.320 | - But how do you-
01:18:19.160 | - Even if you have the spec.
01:18:20.000 | Is the spec written in natural language?
01:18:20.960 | Or is it more formal?
01:18:21.800 | - No, the spec would be formal.
01:18:24.840 | - But how easy would that be to draw?
01:18:25.680 | - So then I think that you care about things
01:18:28.080 | that are not going to be easily well-specified
01:18:29.640 | in the spec language.
01:18:30.840 | - I see, I see.
01:18:31.680 | Yeah, yeah.
01:18:32.760 | - Maybe an argument against formal verification
01:18:35.160 | is all you need.
01:18:36.000 | - Yeah.
01:18:36.840 | - The worry is there's this massive document.
01:18:38.360 | - Replacing something like unit tests.
01:18:40.760 | Sure.
01:18:41.600 | - Yeah, yeah.
01:18:42.440 | I think you can probably also evolve the spec languages
01:18:47.040 | to capture some of the things
01:18:48.560 | that they don't really capture right now.
01:18:51.160 | But I don't know.
01:18:53.640 | I think it's very exciting.
01:18:55.000 | - And you're speaking not just about like single functions.
01:18:57.920 | You're speaking about entire code bases.
01:19:00.120 | - I think entire code bases is harder,
01:19:01.600 | but that is what I would love to have.
01:19:03.920 | And I think it should be possible.
01:19:05.920 | And because you can even,
01:19:07.440 | there's like a lot of work recently
01:19:09.160 | where you can prove, formally verify down to the hardware.
01:19:13.640 | So like through the, you formally verify the C code
01:19:16.680 | and then you formally verify through the GCC compiler
01:19:19.600 | and then through the Verilog down to the hardware.
01:19:22.280 | And that's like incredibly big system,
01:19:25.600 | but it actually works.
01:19:26.720 | And I think big code bases are sort of similar
01:19:28.960 | in that they're like multi-layered system.
01:19:31.120 | And if you can decompose it and formally verify each part,
01:19:35.040 | then I think it should be possible.
01:19:36.560 | I think the specification problem is a real problem, but.
01:19:39.080 | - How do you handle side effects?
01:19:40.880 | Or how do you handle, I guess, external dependencies
01:19:44.200 | like calling the Stripe API?
01:19:46.320 | - Maybe Stripe would write a spec for the API.
01:19:48.600 | - But like, you can't do this for everything.
01:19:49.920 | Like, can you do this for everything you use?
01:19:52.200 | Like, how do you do it for, if there's a language model,
01:19:55.160 | like maybe like people will use language models
01:19:57.320 | as primitives in the programs they write.
01:19:59.440 | And there's like a dependence on it.
01:20:00.680 | And like, how do you now include that?
01:20:02.680 | - I think you might be able to prove that still.
01:20:05.920 | - Prove what about language models?
01:20:07.600 | - I think it feels possible that you could actually prove
01:20:10.800 | that a language model is aligned, for example.
01:20:14.920 | Or like you can prove that it actually gives
01:20:17.200 | the right answer.
01:20:18.920 | - That's the dream.
01:20:21.360 | - Yeah, that is.
01:20:22.200 | I mean, if it's possible, that's your, I have a dream speech.
01:20:26.040 | If it's possible, that will certainly help with, you know,
01:20:29.680 | making sure your code doesn't have bugs
01:20:31.880 | and making sure AI doesn't destroy all of human civilization.
01:20:35.040 | So the full spectrum of AI safety to just bug finding.
01:20:39.300 | So you said the models struggle with bug finding.
01:20:42.720 | What's the hope?
01:20:43.880 | - You know, my hope initially is,
01:20:46.040 | and I can let Michael chime in too,
01:20:48.600 | but it was like, it should, you know,
01:20:52.800 | first help with the stupid bugs.
01:20:54.560 | Like it should very quickly catch the stupid bugs,
01:20:56.960 | like off by one error is like,
01:20:58.880 | sometimes you write something in a comment
01:21:00.360 | and do it the other way.
01:21:01.960 | It's like very common.
01:21:02.800 | Like I do this, I write like less than in a comment
01:21:04.960 | and like I maybe write a greater than sign
01:21:06.560 | or something like that.
01:21:07.920 | And the model is like, yeah, it looks sketchy.
01:21:10.240 | Like, are you sure you want to do that?
01:21:13.000 | But eventually it should be able to catch harder bugs too.
01:21:16.160 | - Yeah, and I think that it's also important to note
01:21:19.040 | that this is, having good bug finding models
01:21:21.800 | feels necessary to get to the highest reaches
01:21:24.640 | of having AI do more and more programming for you,
01:21:27.840 | where you're going to, you know,
01:21:29.200 | if AI is building more and more of the system for you,
01:21:31.160 | you need to not just generate, but also verify.
01:21:33.800 | And without that, some of the problems
01:21:35.880 | that we've talked about before
01:21:37.520 | with programming with these models
01:21:39.880 | will just become untenable.
01:21:42.400 | So it's not just for humans, like you write a bug,
01:21:45.680 | I write a bug, find the bug for me,
01:21:47.160 | but it's also being able to verify the AI's code
01:21:50.200 | and check it is really important.
01:21:52.560 | - Yeah, and then how do you actually do this?
01:21:54.120 | Like we have had a lot of contentious dinner discussions
01:21:56.320 | of how do you actually train a bug model?
01:21:57.720 | But one very popular idea is, you know,
01:22:00.720 | it's potentially easier to introduce a bug
01:22:04.160 | than to actually find the bug.
01:22:05.360 | And so you can train a model to introduce bugs
01:22:08.200 | in existing code.
01:22:09.560 | And then you can train a reverse bug model
01:22:13.320 | then that can find bugs using this synthetic data.
01:22:17.360 | So that's like one example.
01:22:18.720 | But yeah, there are lots of ideas for how to-
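
A toy sketch of that synthetic-data idea: mutate known-good code with simple rule-based bug introducers and keep the (buggy code, bug description) pairs; a learned bug-introducing model would of course produce far subtler mutations than these string swaps.

```python
# Toy synthetic-data generator for bug detection (a sketch of the idea, not the real
# pipeline): mutate known-good code with simple rule-based "bug introducers" and keep
# (buggy_code, description_of_bug) pairs for training a bug-finding model.

import random

MUTATIONS = [
    ("<=", "<",  "off-by-one: <= became <"),
    ("==", "!=", "flipped equality check"),
    ("+ 1", "- 1", "sign flip on an offset"),
]

def introduce_bug(good_code, rng):
    candidates = [(a, b, desc) for a, b, desc in MUTATIONS if a in good_code]
    if not candidates:
        return None
    a, b, desc = rng.choice(candidates)
    return good_code.replace(a, b, 1), desc

good = "for i in range(0, n + 1):\n    if xs[i] == target:\n        return i"
rng = random.Random(0)
buggy, label = introduce_bug(good, rng)
print(label)
print(buggy)
```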
01:22:21.920 | - You can also do a bunch of work,
01:22:24.600 | not even at the model level, of taking the biggest models
01:22:27.400 | and then maybe giving them access to a lot of information
01:22:30.760 | that's not just the code.
01:22:32.320 | Like it's kind of a hard problem to like stare at a file
01:22:34.360 | and be like, where's the bug?
01:22:35.680 | And you know, that's hard for humans often, right?
01:22:38.120 | And so often you have to run the code
01:22:39.720 | and being able to see things like traces
01:22:41.320 | and step through a debugger.
01:22:43.280 | There's a whole nother direction
01:22:44.560 | where it like kind of tends toward that.
01:22:46.160 | And it could also be that there are kind of
01:22:47.360 | two different product form factors here.
01:22:48.680 | It could be that you have a really specialty model
01:22:50.640 | that's quite fast, that's kind of running in the background
01:22:52.440 | and trying to spot bugs.
01:22:53.880 | And it might be that sometimes,
01:22:55.520 | sort of to Arvid's earlier example about, you know,
01:22:57.960 | some nefarious input box bug,
01:22:59.680 | it might be that sometimes you wanna like,
01:23:01.280 | you know there's a bug,
01:23:02.200 | you're not just like checking hypothesis-free,
01:23:04.080 | you're like, this is a problem, I really wanna solve it.
01:23:06.560 | And you zap that with tons and tons and tons of compute
01:23:08.840 | and you're willing to put in like $50 to solve that bug
01:23:11.160 | or something even more.
01:23:12.760 | - Have you thought about integrating money
01:23:14.720 | into this whole thing?
01:23:15.560 | Like I would pay probably a large amount of money
01:23:17.200 | for if you found a bug
01:23:19.240 | or even generated a code that I really appreciated.
01:23:21.320 | Like I had a moment a few days ago
01:23:23.680 | when I started using Cursor, it generated
01:23:25.800 | perfect,
01:23:29.160 | like three perfect functions
01:23:32.720 | for interacting with the YouTube API
01:23:36.080 | to update captions
01:23:38.960 | and for localization like in different languages.
01:23:42.400 | The API documentation is not very good.
01:23:45.160 | And the code across, like if I Googled it for a while,
01:23:48.280 | I couldn't find exactly,
01:23:49.520 | there's a lot of confusing information
01:23:51.320 | and cursor generated perfectly.
01:23:53.240 | And I was like, I just sat back,
01:23:54.840 | I read the code and I was like, this is correct.
01:23:56.520 | I tested it, it's correct.
01:23:58.160 | I was like, I wanna tip, like a button that goes,
01:24:01.840 | - Yeah.
01:24:02.760 | - Here's $5.
01:24:03.920 | One that's really good just to support the company
01:24:05.920 | and support what the interface is.
01:24:08.080 | And the other is that probably sends a strong signal,
01:24:10.800 | like good job, right?
01:24:14.080 | So there's this much stronger signal
01:24:15.560 | than just accepting the code, right?
01:24:17.000 | You just actually send like a strong, good job.
01:24:20.200 | That and for bug finding, obviously,
01:24:22.480 | like there's a lot of people,
01:24:24.920 | that would pay a huge amount of money for a bug,
01:24:28.800 | like a bug bounty thing, right?
01:24:32.400 | Do you guys think about that?
01:24:33.720 | - Yeah, it's a controversial idea inside the company.
01:24:37.000 | I think it sort of depends on how much
01:24:38.960 | you believe in humanity almost, you know?
01:24:42.440 | Like, I think it would be really cool
01:24:45.480 | if like you spend nothing to try to find a bug.
01:24:49.080 | And if it doesn't find a bug, you spend $0.
01:24:51.560 | And then if it does find a bug and you click accept,
01:24:54.560 | then it also shows like in parentheses, like $1.
01:24:58.080 | And so you spend $1 to accept the bug.
01:25:00.480 | And then of course there's a worry like,
01:25:02.160 | okay, we spent a lot of computation,
01:25:03.600 | like maybe people will just copy paste.
01:25:05.560 | I think that's a worry.
01:25:08.480 | And then there's also the worry that like introducing money
01:25:10.960 | into the product makes it like kind of,
01:25:14.280 | you know, like it doesn't feel as fun anymore.
01:25:16.160 | Like you have to like think about money
01:25:18.080 | and all you want to think about is like the code.
01:25:21.080 | And so maybe it actually makes more sense to separate it out
01:25:23.520 | and like you pay some fee like every month
01:25:26.760 | and then you get all of these things for free.
01:25:29.320 | - But there could be a tipping component,
01:25:30.800 | which is not like it costs this.
01:25:32.360 | - Yes, but it still has that like dollar symbol.
01:25:34.600 | I think it's fine.
01:25:35.440 | But I also see the point where like,
01:25:38.560 | maybe you don't want to introduce it.
01:25:40.160 | - Yeah, I was gonna say the moment
01:25:41.000 | that feels like people do this is when they share it,
01:25:43.240 | when they have this fantastic example,
01:25:45.120 | they just kind of share it with their friends.
01:25:46.880 | There is also a potential world
01:25:48.040 | where there's a technical solution to this,
01:25:49.880 | like an honor-system problem too,
01:25:51.560 | where if we can get to a place where we understand
01:25:54.000 | the output of the system more,
01:25:55.960 | I mean, to the stuff we were talking about with like,
01:25:57.880 | you know, error checking with the LSP
01:25:59.360 | and then also running the code.
01:26:00.720 | But if you could get to a place
01:26:01.560 | where you could actually somehow verify,
01:26:03.560 | oh, I have fixed the bug.
01:26:05.040 | Maybe then the bounty system
01:26:07.200 | doesn't need to rely on the honor system too.
01:26:09.120 | - How much interaction is there
01:26:10.400 | between the terminal and the code?
01:26:12.240 | Like how much information is gained
01:26:14.120 | from if you run the code in the terminal?
01:26:16.360 | Can you use, can you do like a loop where it runs the code
01:26:22.280 | and suggests how to change the code
01:26:24.720 | if the code at runtime gives an error?
01:26:27.760 | Or right now, are they separate worlds completely?
01:26:30.720 | Like I know you can like do control K inside the terminal
01:26:34.080 | to help you write the code.
01:26:35.040 | - You can use terminal context as well
01:26:38.080 | inside of Cmd+K, kind of, everything.
01:26:40.960 | We don't have the looping part yet,
01:26:44.640 | though we suspect something like this
01:26:46.080 | could make a lot of sense.
01:26:47.360 | There's a question of whether it happens
01:26:48.560 | in the foreground too,
01:26:49.640 | or if it happens in the background,
01:26:51.400 | like what we've been discussing.
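A minimal sketch of what such a run-and-fix loop could look like, assuming a hypothetical ask_model call standing in for whatever LLM API is used (this is not Cursor's actual implementation):

```python
import subprocess

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to an LLM API."""
    raise NotImplementedError

def run_and_fix(path: str, max_iters: int = 3) -> bool:
    """Run a Python script; on failure, feed the traceback back to the model and retry."""
    for _ in range(max_iters):
        result = subprocess.run(["python", path], capture_output=True, text=True)
        if result.returncode == 0:
            return True  # script ran cleanly
        with open(path) as f:
            code = f.read()
        prompt = (
            "This script failed. Return a corrected version of the whole file.\n\n"
            f"--- code ---\n{code}\n\n--- stderr ---\n{result.stderr}"
        )
        with open(path, "w") as f:
            f.write(ask_model(prompt))
    return False
```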
01:26:52.680 | - Sure, the background is pretty cool.
01:26:54.200 | Like it could be running the code in different ways.
01:26:56.640 | Plus there's a database side to this,
01:26:58.280 | which is, how do you protect it from modifying the database?
01:27:01.120 | But okay.
01:27:01.960 | - I mean, there's certainly cool solutions there.
01:27:06.080 | There's this new API that is being developed.
01:27:10.360 | It's not in AWS, but I think
01:27:15.280 | it's in PlanetScale.
01:27:16.480 | I don't know if PlanetScale was the first one to add it.
01:27:18.760 | It's this ability to sort of add branches to a database,
01:27:22.360 | which is like if you're working on a feature
01:27:25.520 | and you want to test against the prod database,
01:27:27.680 | but you don't actually want to test
01:27:28.920 | against the prod database,
01:27:29.880 | you could sort of add a branch to the database.
01:27:31.960 | And the way to do that is to add a branch
01:27:33.440 | to the write-ahead log.
01:27:35.200 | And there's obviously a lot of technical complexity
01:27:37.400 | in doing it correctly.
01:27:38.480 | I guess database companies need new things to do.
01:27:41.640 | They have good databases now.
01:27:47.520 | And I think like Turbopuffer,
01:27:50.160 | which is one of the databases we use,
01:27:52.080 | is going to add maybe branching to the write-ahead log.
01:27:57.080 | And so maybe the AI agents will use branching.
01:28:03.040 | They'll like test against some branch
01:28:05.400 | and it's sort of gonna be a requirement for the database
01:28:08.760 | to like support branching or something.
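To make the branching idea concrete, here is a toy copy-on-write branch over a key-value store; it only illustrates the concept, not PlanetScale's or Turbopuffer's actual write-ahead-log mechanism:

```python
class Branch:
    """Toy copy-on-write branch: writes land in an overlay, reads fall through
    to the parent, so a test branch never touches the production data."""
    def __init__(self, parent=None):
        self.parent = parent
        self.overlay = {}       # keys written on this branch
        self.deleted = set()    # keys deleted on this branch

    def get(self, key):
        if key in self.deleted:
            return None
        if key in self.overlay:
            return self.overlay[key]
        return self.parent.get(key) if self.parent else None

    def put(self, key, value):
        self.deleted.discard(key)
        self.overlay[key] = value

    def delete(self, key):
        self.overlay.pop(key, None)
        self.deleted.add(key)

    def branch(self):
        return Branch(parent=self)

# prod = Branch(); prod.put("user:1", {"name": "Ada"})
# test = prod.branch(); test.put("user:1", {"name": "Eve"})
# prod.get("user:1")  # still {"name": "Ada"}
```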
01:28:10.600 | - It'd be really interesting
01:28:11.440 | if you could branch a file system, right?
01:28:13.680 | - Yeah, I feel like everything needs branching.
01:28:15.720 | It's like that.
01:28:17.040 | - Yeah, it's the problem with the multiverse, right?
01:28:22.000 | Like if you branch on everything, that's like a lot.
01:28:24.360 | - I mean, there's obviously these like super clever
01:28:26.320 | algorithms to make sure that you don't actually
01:28:28.520 | sort of use a lot of space or CPU or whatever.
01:28:32.280 | - Okay, this is a good place to ask about infrastructure.
01:28:34.880 | So you guys mostly use AWS.
01:28:36.880 | What are some interesting details?
01:28:38.240 | What are some interesting challenges?
01:28:39.640 | Why'd you choose AWS?
01:28:41.400 | Why is AWS still winning?
01:28:44.080 | Hashtag.
01:28:45.000 | - AWS is just really, really good.
01:28:48.280 | It's really good.
01:28:49.120 | Like whenever you use an AWS product,
01:28:54.120 | you just know that it's going to work.
01:28:56.840 | Like it might be absolute hell to go through the steps
01:29:00.480 | to set it up.
01:29:02.120 | - Why is the interface so horrible?
01:29:04.200 | - Because it's just so good.
01:29:06.200 | It doesn't need to-
01:29:07.040 | - It's the nature of winning.
01:29:08.920 | - I think it's exactly, it's just nature of winning.
01:29:11.240 | Yeah, yeah.
01:29:12.440 | But AWS, you can always trust, like it will always work.
01:29:15.240 | And if there is a problem, it's probably your problem.
01:29:18.600 | Yeah.
01:29:20.920 | - Okay.
01:29:21.760 | Is there some interesting like challenges to,
01:29:23.640 | you guys have a pretty new startup to get scaling
01:29:26.840 | to like, to so many people and-
01:29:29.320 | - Yeah, I think that there,
01:29:30.680 | it has been an interesting journey adding, you know,
01:29:35.440 | each extra zero to the requests per second.
01:29:37.920 | You run into all of these issues where, you know,
01:29:39.520 | the general components you're using for caching
01:29:41.520 | and databases run into issues as you make things
01:29:43.720 | bigger and bigger.
01:29:44.560 | At the scale where we get, like, you know,
01:29:45.760 | int overflows on our tables and things like that.
01:29:48.720 | And then also there have been some custom systems
01:29:51.800 | that we've built, like for instance,
01:29:53.200 | our retrieval system for computing a semantic index
01:29:57.040 | of your code base and answering questions about a code base
01:30:00.120 | that have continually, I feel like been,
01:30:02.280 | well, one of the trickier things to scale.
01:30:04.360 | - I have a few friends who are super, super senior engineers
01:30:07.520 | and one of their sort of lines is like,
01:30:09.040 | it's very hard to predict where systems will break
01:30:11.840 | when you scale them.
01:30:13.360 | You can sort of try to predict in advance,
01:30:17.000 | but like, there's always something weird
01:30:18.960 | that's going to happen when you add this extra zero.
01:30:22.040 | You thought you thought through everything,
01:30:23.720 | but you didn't actually think through everything.
01:30:26.320 | But I think for that particular system,
01:30:30.520 | so for concrete details,
01:30:34.640 | the thing we do is, obviously,
01:30:36.880 | we chunk up all of your code
01:30:41.120 | and then we send up sort of the code for embedding
01:30:44.720 | and we embed the code.
01:30:46.280 | And then we store the embeddings in a database,
01:30:49.280 | but we don't actually store any of the code.
01:30:51.800 | And then there's reasons around making sure
01:30:53.560 | that we don't introduce client bugs
01:30:56.320 | because we're very, very paranoid about client bugs.
01:30:59.080 | We store much of the details on the server,
01:31:03.520 | like everything is sort of encrypted.
01:31:06.680 | So one of the technical challenges
01:31:09.840 | is always making sure that the local index,
01:31:12.720 | the local code base state is the same as the state
01:31:16.160 | that is on the server.
01:31:17.920 | And the way sort of technically we ended up doing that is,
01:31:21.840 | so for every single file, you can sort of keep this hash.
01:31:25.800 | And then for every folder, you can sort of keep a hash,
01:31:28.640 | which is the hash of all of its children.
01:31:31.160 | And you can sort of recursively do that until the top.
01:31:33.880 | And why do something complicated?
01:31:37.640 | One thing you could do is you could keep a hash
01:31:39.720 | for every file.
01:31:40.920 | Then every minute you could try to download
01:31:43.440 | the hashes that are on the server,
01:31:44.720 | figure out what are the files that don't exist on the server.
01:31:47.360 | Maybe you just created a new file.
01:31:48.880 | Maybe you just deleted a file.
01:31:50.320 | Maybe you checked out a new branch
01:31:52.400 | and try to reconcile the state
01:31:54.480 | between the client and the server.
01:31:56.160 | But that introduces like absolutely ginormous
01:31:59.960 | network overhead, both on the client side.
01:32:03.680 | I mean, nobody really wants us to hammer their wifi
01:32:06.880 | all the time if you're using cursor.
01:32:09.360 | But also like, I mean, it would introduce
01:32:11.000 | like ginormous overhead in the database.
01:32:13.680 | It would sort of be reading this tens-of-terabytes database,
01:32:18.680 | sort of approaching like 20 terabytes
01:32:23.200 | or something, like every second.
01:32:25.680 | That's just sort of kind of crazy.
01:32:28.120 | You definitely don't want to do that.
01:32:30.560 | So what do you do?
01:32:31.400 | You sort of, you just try to reconcile the single hash,
01:32:34.320 | which is at the root of the project.
01:32:35.760 | And then if something mismatches, then you go,
01:32:37.800 | you find where all the things disagree.
01:32:39.600 | Maybe you look at the children and see if the hashes match.
01:32:42.000 | And if the hashes don't match,
01:32:43.000 | go look at their children and so on.
01:32:44.640 | But you only do that in this scenario
01:32:46.280 | where things don't match.
01:32:47.280 | And for most people, most of the time the hashes match.
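A rough sketch of that hierarchical hashing and top-down reconciliation (the hash choices and the one-sided descent are simplifications; a real sync also has to handle files that exist only on the server):

```python
import hashlib
import os

def file_hash(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def tree_hash(root: str) -> dict:
    """Return {path: hash}, where a folder's hash is the hash of its children's hashes."""
    hashes = {}
    def walk(path: str) -> str:
        if os.path.isfile(path):
            h = file_hash(path)
        else:
            children = sorted(os.listdir(path))
            combined = "".join(walk(os.path.join(path, c)) for c in children)
            h = hashlib.sha256(combined.encode()).hexdigest()
        hashes[path] = h
        return h
    walk(root)
    return hashes

def find_mismatches(local: dict, remote: dict, root: str) -> list:
    """Descend only into subtrees whose hashes disagree; most of the time the root matches
    and nothing else is touched. (One-sided sketch: deletions on the server need a
    symmetric pass.)"""
    if local.get(root) == remote.get(root):
        return []
    if os.path.isfile(root):
        return [root]
    out = []
    for child in sorted(os.listdir(root)):
        out.extend(find_mismatches(local, remote, os.path.join(root, child)))
    return out
```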
01:32:50.000 | - So it's a kind of like hierarchical reconciliation.
01:32:53.240 | - Yeah, something like that.
01:32:54.800 | Yeah, it's called the Merkle tree.
01:32:56.360 | - Yeah, Merkle, yeah.
01:32:58.120 | I mean, so yeah, this is cool to see that
01:33:00.080 | you kind of have to think through all these problems.
01:33:01.800 | - And I mean, the point of, like the reason it's gotten hard
01:33:04.480 | is just because, like the number of people using it
01:33:07.080 | and if some of your customers
01:33:09.280 | have really, really large code bases,
01:33:13.040 | to the point where, you know,
01:33:15.880 | we originally re-indexed our code base, which is big,
01:33:18.680 | but I mean, it's just not the size of some company
01:33:21.360 | that's been there for 20 years
01:33:22.600 | and sort of has a ginormous number of files.
01:33:25.080 | And you sort of want to scale that across programmers.
01:33:28.200 | There's all these details where like
01:33:30.000 | building a simple thing is easy,
01:33:31.360 | but scaling it to a lot of people, like a lot of companies
01:33:34.400 | is obviously a difficult problem.
01:33:36.240 | And sort of, you know, independent of that,
01:33:38.360 | there's part of this that's scaling
01:33:39.720 | our current solution, and part is also, you know,
01:33:41.800 | coming up with new ideas that obviously we're working on,
01:33:45.520 | but then scaling all of that in the last few weeks, months.
01:33:48.440 | - Yeah, and there are a lot of clever things,
01:33:50.640 | like additional things that go into this indexing system.
01:33:53.640 | For example, the bottleneck in terms of costs
01:33:57.160 | is not storing things in the vector database or the database,
01:33:58.960 | it's actually embedding the code.
01:34:01.040 | And you don't want to re-embed the code base
01:34:02.680 | for every single person in a company
01:34:04.720 | that is using the same exact code,
01:34:07.400 | except for maybe they're in a different branch
01:34:08.960 | with a few different files,
01:34:09.880 | or they've made a few local changes.
01:34:12.320 | And so, because again, embeddings are the bottleneck,
01:34:14.600 | you can do just one clever trick
01:34:16.160 | and not have to worry about like the complexity
01:34:18.320 | of like dealing with branches and the other databases,
01:34:20.600 | where you just have some cache on the actual vectors
01:34:25.600 | computed from the hash of a given chunk.
01:34:29.560 | And so this means that when the nth person at a company
01:34:33.720 | goes and embeds their code base, it's really, really fast.
01:34:36.680 | And you do all this without actually storing any code
01:34:39.240 | on our servers at all.
01:34:40.120 | No code data is stored.
01:34:41.680 | We just store the vectors in the vector database
01:34:43.400 | and the vector cache.
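A sketch of that cache-on-chunk-hash trick; embed_fn and the dict cache here are placeholders for whatever embedding model and shared cache are actually used:

```python
import hashlib

def chunk_key(chunk: str) -> str:
    """Cache key derived only from the chunk contents, so identical code across
    users and branches maps to the same vector and is embedded only once."""
    return hashlib.sha256(chunk.encode()).hexdigest()

def embed_chunks(chunks, cache, embed_fn):
    """Embed only the cache misses; everything else is a lookup."""
    missing = [c for c in chunks if chunk_key(c) not in cache]
    if missing:
        for chunk, vector in zip(missing, embed_fn(missing)):
            cache[chunk_key(chunk)] = vector
    return [cache[chunk_key(c)] for c in chunks]

# In practice the cache would be a shared server-side store; a dict works for the sketch:
# vectors = embed_chunks(code_chunks, cache={}, embed_fn=my_embedding_model)
```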
01:34:45.480 | - What's the biggest gains at this time
01:34:49.120 | you get from indexing the code base?
01:34:51.680 | Just out of curiosity, like what benefit do users have?
01:34:56.000 | It seems like longer term,
01:34:57.400 | there'll be more and more benefit,
01:34:58.680 | but in the short term,
01:34:59.600 | just asking questions of the code base,
01:35:02.880 | what's the usefulness of that?
01:35:06.080 | - I think the most obvious one is just,
01:35:10.640 | you want to find out where something is happening
01:35:13.120 | in your large code base.
01:35:14.600 | And you sort of have a fuzzy memory of,
01:35:16.960 | okay, I want to find the place where we do X,
01:35:19.320 | but you don't exactly know what to search for
01:35:22.240 | in a normal text search.
01:35:23.440 | And so you ask a chat,
01:35:25.240 | you hit command enter to ask with the code base chat,
01:35:27.920 | and then very often it finds the right place
01:35:31.120 | that you were thinking of.
01:35:32.160 | I think, like you mentioned,
01:35:34.760 | in the future, I think this is only going to get more
01:35:37.080 | and more powerful,
01:35:38.320 | where we're working a lot
01:35:39.440 | on improving the quality of our retrieval.
01:35:42.120 | And I think the ceiling for that is really, really much
01:35:44.080 | higher than people give it credit for.
01:35:45.960 | - One question that's good to ask here,
01:35:47.920 | have you considered and why haven't you much done
01:35:50.320 | sort of local stuff to where you can do the,
01:35:53.680 | I mean, it seems like everything we just discussed
01:35:55.640 | is exceptionally difficult to do.
01:35:57.240 | To go to the cloud,
01:35:58.520 | you have to think about all these things
01:35:59.840 | with the caching and the large code base
01:36:04.840 | with a large number of programmers
01:36:06.520 | are using the same code base.
01:36:07.520 | You have to figure out the puzzle of that.
01:36:09.400 | A lot of it, most software just does stuff,
01:36:13.160 | this heavy computational stuff locally.
01:36:16.360 | Have you considered doing sort of embeddings locally?
01:36:18.800 | - Yeah, we thought about it
01:36:19.840 | and I think it would be cool to do it locally.
01:36:22.640 | I think it's just really hard.
01:36:24.600 | And one thing to keep in mind is that,
01:36:28.120 | some of our users use the latest MacBook Pro
01:36:30.800 | but most of our users,
01:36:33.240 | like more than 80% of our users are on Windows machines,
01:36:36.240 | and many of them are not very powerful.
01:36:39.760 | And so local models really only works
01:36:42.600 | on the latest computers.
01:36:44.240 | And it's also a big overhead to build that in.
01:36:48.400 | And so even if we would like to do that,
01:36:50.440 | it's currently not something that we are able to focus on.
01:36:54.360 | And I think there are some people that do that.
01:36:57.600 | And I think that's great.
01:36:58.880 | But especially as models get bigger and bigger
01:37:02.640 | and you want to do fancier things with like bigger models,
01:37:05.920 | it becomes even harder to do it locally.
01:37:07.920 | - Yeah, and it's not a problem of like weaker computers.
01:37:11.640 | It's just that, for example, if you're some big company,
01:37:15.680 | you have big company code base,
01:37:17.680 | it's just really hard to process big company code base
01:37:20.240 | even on the beefiest MacBook Pros.
01:37:22.280 | So it's not even a matter of, like,
01:37:24.560 | whether you're just a student or something.
01:37:28.040 | I think if you're like the best programmer at a big company,
01:37:31.760 | you're still going to have a horrible experience
01:37:34.440 | if you do everything locally.
01:37:35.760 | I mean, you could do edge and sort of scrape by,
01:37:38.680 | but like, again, it wouldn't be fun anymore.
01:37:40.840 | - Yeah, like, approximate nearest neighbors
01:37:42.440 | on this massive code base is going to just eat up
01:37:44.480 | your memory and your CPU.
01:37:46.280 | And that's just that.
01:37:50.080 | Like, let's talk about like also the modeling side
01:37:52.800 | where, as Arvid said, there are these massive headwinds
01:37:55.800 | against local models where one,
01:37:59.560 | things seem to be moving towards MOEs,
01:38:01.680 | which like one benefit is maybe
01:38:03.320 | they're more memory bandwidth bound,
01:38:05.320 | which plays in favor of local versus using GPUs
01:38:09.600 | or using NVIDIA GPUs.
01:38:12.320 | But the downside is these models are just bigger in total.
01:38:16.520 | And they're going to need to fit often,
01:38:18.960 | not even on a single node, but multiple nodes.
01:38:22.160 | There's no way that's going to fit inside
01:38:24.240 | of even really good MacBooks.
01:38:26.520 | And I think, especially for coding,
01:38:28.880 | it's not a question as much of like,
01:38:31.480 | does it clear some bar of like the models good enough
01:38:34.840 | to do these things and then like we're satisfied,
01:38:37.320 | which may be the case for other problems
01:38:39.680 | and maybe where local models shine,
01:38:41.640 | but people are always going to want the best,
01:38:43.480 | the most intelligent, the most capable things.
01:38:46.200 | And that's going to be really, really hard to run
01:38:48.480 | for almost all people locally.
01:38:51.320 | - Don't you want the most capable model?
01:38:53.800 | Like you want Sonnet?
01:38:56.160 | - And also with O1.
01:38:58.160 | - I like how you're pitching me.
01:39:00.520 | Would you be satisfied with an inferior model?
01:39:03.220 | Listen, I'm yes, I'm one of those,
01:39:05.520 | but there's some people that like to do stuff locally,
01:39:07.960 | especially like really, there's a whole,
01:39:11.080 | obviously open source movement that kind of resists.
01:39:13.640 | And it's good that they exist actually,
01:39:15.500 | because you want to resist the power centers
01:39:18.880 | that are growing.
01:39:20.080 | - There's actually an alternative to local models
01:39:23.000 | that I am particularly fond of.
01:39:25.200 | I think it's still very much in the research stage,
01:39:28.520 | but you could imagine doing homomorphic encryption
01:39:32.560 | for language model inference.
01:39:34.360 | So you encrypt your input on your local machine,
01:39:36.920 | then you send that up,
01:39:37.920 | and then the server can use lots of computation.
01:39:42.920 | They can run models that you cannot run locally
01:39:45.040 | on this encrypted data,
01:39:46.920 | but they cannot see what the data is.
01:39:48.520 | And then they send back the answer
01:39:49.760 | and you decrypt the answer and only you can see the answer.
01:39:52.480 | So I think that's still very much research
01:39:55.880 | and all of it is about trying to make the overhead lower
01:40:00.720 | because right now the overhead is really big.
01:40:02.800 | But if you can make that happen,
01:40:04.480 | I think that would be really, really cool.
01:40:07.240 | And I think it would be really, really impactful
01:40:10.080 | because I think one thing that's actually kind of worrisome
01:40:12.160 | is that as these models get better and better,
01:40:14.840 | they're going to become more and more economically useful.
01:40:17.880 | And so more and more of the world's information and data
01:40:21.040 | will flow through, you know, one or two centralized actors.
01:40:26.040 | And then there are worries about, you know,
01:40:29.480 | there can be traditional hacker attempts,
01:40:31.400 | but it also creates this kind of scary part
01:40:35.040 | where if all of the world's information
01:40:37.480 | is flowing through one node in plain text,
01:40:39.800 | you can have surveillance in very bad ways.
01:40:43.960 | And sometimes that will happen for, you know,
01:40:47.680 | initially it will be for like good reasons,
01:40:49.800 | like people will want to try to protect against
01:40:52.720 | like bad actors using AI models in bad ways.
01:40:55.680 | And then you will add in some surveillance code
01:40:57.480 | and then someone else will come in and, you know,
01:40:59.640 | you're in a slippery slope and then you start
01:41:01.840 | doing bad things with a lot of the world's data.
01:41:06.880 | And so I'm very hopeful that we can solve
01:41:10.480 | homomorphic encryption for language model inference.
01:41:12.640 | - Doing privacy-preserving machine learning.
01:41:14.320 | But I would say like that's the challenge we have
01:41:16.240 | with all software these days.
01:41:18.680 | It's like there's so many features that can be provided
01:41:22.240 | from the cloud and all of us increasingly rely on it
01:41:25.160 | and make our life awesome, but there's downsides.
01:41:27.720 | And that's why you rely on really good security
01:41:29.600 | to protect from basic attacks.
01:41:31.600 | But there's also only a small set of companies
01:41:35.320 | that are controlling that data, you know,
01:41:37.800 | and they obviously have leverage
01:41:40.040 | and they can be infiltrated in all kinds of ways.
01:41:42.000 | That's the world we live in.
01:41:43.600 | - Yeah, I mean, the thing I'm just actually quite worried
01:41:46.640 | about is sort of the world where, I mean,
01:41:48.560 | so Anthropic has this responsible scaling policy
01:41:51.440 | and so we're on like the low ASLs,
01:41:55.120 | which is the Anthropic security level or whatever
01:41:57.200 | of the models.
01:41:58.920 | But as we get to like, quote-unquote, ASL-3, ASL-4,
01:42:02.320 | whatever models, which are sort of very powerful.
01:42:06.440 | But for mostly reasonable security reasons,
01:42:11.120 | you would want to monitor all the prompts.
01:42:13.600 | But I think that's sort of reasonable
01:42:16.280 | and understandable where everyone is coming from.
01:42:18.560 | But, man, it'd be really horrible
01:42:20.960 | if sort of like all the world's information
01:42:23.080 | is sort of monitored that heavily.
01:42:24.800 | It's way too centralized.
01:42:27.000 | It's like sort of this really fine line you're walking
01:42:30.600 | where on the one side, like,
01:42:33.360 | you don't want the models to go rogue.
01:42:35.160 | On the other side, like, man, it's humans, like,
01:42:38.160 | I don't know if I trust like all the world's information
01:42:41.040 | to pass through like three model providers.
01:42:43.400 | - Yeah.
01:42:44.640 | - Why do you think it's different than cloud providers?
01:42:47.600 | - Because I think this is,
01:42:51.440 | a lot of this data would never have gone
01:42:54.080 | to the cloud providers in the first place.
01:42:56.520 | Where this is often like,
01:43:00.560 | you want to give more data to the AI models.
01:43:02.440 | You want to give personal data
01:43:04.480 | that you would never have put online in the first place
01:43:07.400 | to these companies or to these models.
01:43:10.960 | And it also centralizes control
01:43:15.080 | where right now for cloud,
01:43:19.040 | you can often use your own encryption keys
01:43:21.080 | and like AWS can't really do much.
01:43:24.400 | But here it's just centralized actors
01:43:29.240 | that see the exact plain text of everything.
01:43:31.640 | - On the topic of context,
01:43:34.160 | that's actually been a friction for me.
01:43:36.080 | When I'm writing code, you know, in Python,
01:43:38.040 | there's a bunch of stuff imported.
01:43:40.120 | There's a, you could probably intuit
01:43:42.680 | the kind of stuff I would like to include in the context.
01:43:45.520 | Is there, like how hard is it
01:43:48.040 | to auto figure out the context?
01:43:51.040 | - It's tricky.
01:43:52.800 | I think we can do a lot better
01:43:54.640 | at computing the context automatically in the future.
01:43:58.680 | One thing that's important to notice,
01:44:00.120 | there are trade-offs with including automatic context.
01:44:03.600 | So the more context you include for these models,
01:44:06.720 | first of all, the slower they are.
01:44:09.640 | And the more expensive those requests are,
01:44:12.200 | which means you can then do fewer model calls
01:44:13.880 | and do less fancy stuff in the background.
01:44:16.040 | Also for a lot of these models,
01:44:17.480 | they get confused if you have a lot of information
01:44:19.200 | in the prompt.
01:44:20.160 | So the bar for accuracy
01:44:23.080 | and for relevance of the context you include
01:44:25.080 | should be quite high.
01:44:26.120 | But we already do some automatic context
01:44:31.640 | in some places within the product.
01:44:33.040 | It's definitely something we wanna get a lot better at.
01:44:35.360 | And I think that there are a lot of cool ideas
01:44:39.440 | to try there,
01:44:40.280 | both on the learning better retrieval systems,
01:44:45.680 | like better embedding models, better re-rankers.
01:44:48.400 | I think that there are also cool academic ideas,
01:44:52.120 | stuff we've tried out internally,
01:44:53.280 | but also stuff the field is grappling with writ large,
01:44:55.880 | about whether you can get language models to a place
01:44:58.200 | where you can actually just have the model itself,
01:45:00.280 | like understand a new corpus of information.
01:45:02.640 | And the most popular talked about version of this is,
01:45:05.640 | can you make the context windows infinite?
01:45:07.520 | Then if you make the context windows infinite,
01:45:08.880 | can you make the model actually pay attention
01:45:10.480 | to the infinite context?
01:45:11.680 | And then after you can make it pay attention
01:45:13.120 | to the infinite context,
01:45:14.320 | to make it somewhat feasible to actually do it,
01:45:16.680 | can you then do caching for that infinite context?
01:45:18.760 | You don't have to recompute that all the time.
01:45:20.920 | But there are other cool ideas that are being tried
01:45:23.440 | that are a little bit more analogous to fine tuning
01:45:25.760 | of actually learning this information
01:45:27.120 | in the weights of the model.
01:45:28.640 | And it might be that you actually get
01:45:30.760 | sort of a qualitatively different type of understanding
01:45:34.760 | if you do it more at the weight level
01:45:36.000 | than if you do it at the in-context learning level.
01:45:37.720 | I think the jury's still a little bit out
01:45:40.640 | on how this is all gonna work in the end.
01:45:43.040 | But in the interim, us as a company,
01:45:44.640 | we are really excited about better retrieval systems
01:45:47.200 | and picking the parts of the code base
01:45:49.200 | that are most relevant to what you're doing.
01:45:51.120 | We could do that a lot better.
01:45:52.520 | - Like one interesting proof of concept
01:45:54.440 | for the learning this knowledge directly in the weights
01:45:58.280 | is with VS Code.
01:46:00.400 | So we're in a VS Code fork and VS Code,
01:46:03.400 | the code is all public.
01:46:04.920 | So these models in pre-training have seen all the code.
01:46:08.680 | They've probably also seen questions and answers about it.
01:46:11.080 | And then they've been fine-tuned and RLHF'd
01:46:13.360 | to be able to answer questions about code in general.
01:46:16.040 | So when you ask it a question about VS Code,
01:46:18.880 | sometimes it'll hallucinate,
01:46:20.080 | but sometimes it actually does a pretty good job
01:46:22.960 | at answering the question.
01:46:24.760 | And I think like this is just by,
01:46:27.480 | it happens to be okay.
01:46:29.560 | But what if you could actually like specifically
01:46:31.840 | train or post-train a model
01:46:33.040 | such that it really was built to understand this code base?
01:46:37.520 | It's an open research question,
01:46:40.040 | one that we're quite interested in.
01:46:41.400 | And then there's also uncertainty of like,
01:46:43.000 | do you want the model to be the thing
01:46:44.640 | that end-to-end is doing everything?
01:46:46.800 | I.e. it's doing the retrieval in its internals
01:46:49.640 | and then kind of answering the question, creating the code,
01:46:51.840 | or do you want to separate the retrieval
01:46:55.200 | from the frontier model where maybe, you know,
01:46:58.080 | you'll get some really capable models
01:46:59.520 | that are much better than like the best open source ones
01:47:01.960 | in a handful of months.
01:47:03.280 | And then you'll want to separately train
01:47:07.120 | a really good open source model to be the retriever,
01:47:09.400 | to be the thing that feeds in the context
01:47:12.320 | to these larger models.
01:47:14.280 | - Can you speak a little more to the post-training a model
01:47:16.880 | to understand the code base?
01:47:18.800 | Like, what do you mean by that with,
01:47:20.320 | is this a synthetic data direction?
01:47:22.360 | Is this-
01:47:23.200 | - Yeah, I mean, there are many possible ways
01:47:25.560 | you could try doing it.
01:47:26.800 | There's certainly no shortage of ideas.
01:47:30.200 | It's just a question of going in
01:47:31.280 | and like trying all of them and being empirical
01:47:33.240 | about which one works best.
01:47:34.600 | You know, one very naive thing is to try to replicate
01:47:38.840 | what's done with VS Code and these frontier models.
01:47:43.080 | So let's like continue pre-training,
01:47:45.840 | some kind of continued pre-training
01:47:46.880 | that includes general code data,
01:47:48.040 | but also throws in a lot of the data
01:47:50.440 | of some particular repository that you care about.
01:47:53.120 | And then in post-training, meaning in,
01:47:56.400 | let's just start with instruction fine-tuning,
01:47:58.360 | you have like a normal instruction fine-tuning data set
01:48:00.440 | about code, but you throw in a lot of questions
01:48:03.480 | about code in that repository.
01:48:07.040 | So you could either get ground truth ones,
01:48:09.680 | which might be difficult,
01:48:10.520 | or you could do what you kind of hinted at
01:48:12.200 | or suggested using synthetic data,
01:48:14.800 | i.e. kind of having the model ask questions
01:48:19.800 | about various pieces of the code.
01:48:22.880 | So you kind of take the pieces of the code,
01:48:24.440 | then prompt the model or have a model propose a question
01:48:27.800 | for that piece of code,
01:48:28.960 | and then add those as instruction fine-tuning data points.
01:48:32.560 | And then in theory, this might unlock the model's ability
01:48:36.200 | to answer questions about that code base.
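A minimal sketch of that synthetic question-generation idea, with ask_model as a hypothetical chat call and the prompts purely illustrative:

```python
def make_repo_qa_dataset(code_chunks, ask_model):
    """For each chunk of the repository, have a model propose a question that the
    chunk answers, then pair them up as (instruction, response) fine-tuning examples."""
    examples = []
    for chunk in code_chunks:
        question = ask_model(
            "Write one question a developer might ask that the following code answers:\n\n"
            + chunk
        )
        answer = ask_model(
            f"Answer the question using only this code.\n\nQuestion: {question}\n\nCode:\n{chunk}"
        )
        examples.append({"instruction": question, "response": answer})
    return examples
```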
01:48:39.400 | - Let me ask you about OpenAI o1.
01:48:42.440 | What do you think is the role of that kind of
01:48:44.480 | test-time compute system in programming?
01:48:47.280 | - I think test-time compute is really, really interesting.
01:48:50.800 | So there's been the pre-training regime,
01:48:52.600 | which will kind of, as you scale up the amount of data
01:48:57.040 | and the size of your model,
01:48:58.040 | get you better and better performance,
01:48:59.480 | both on loss and then on downstream benchmarks,
01:49:02.600 | and just general performance when we use it for coding
01:49:05.200 | or other tasks.
01:49:07.000 | We're starting to hit a bit of a data wall,
01:49:12.600 | meaning it's going to be hard
01:49:13.960 | to continue scaling up this regime.
01:49:16.040 | And so scaling up test-time compute
01:49:18.360 | is an interesting way of now, you know,
01:49:20.120 | increasing the number of inference time flops that we use,
01:49:24.600 | but still getting, like, yeah,
01:49:27.240 | as you increase the number of flops you use inference time,
01:49:29.280 | getting corresponding improvements
01:49:31.760 | in the performance of these models.
01:49:33.400 | Traditionally, we just had to literally train a bigger model
01:49:35.560 | that always used that many more flops,
01:49:38.840 | but now we could perhaps use the same size model
01:49:41.480 | and run it for longer to be able to get an answer
01:49:45.400 | at the quality of a much larger model.
01:49:46.760 | And so the really interesting thing I like about this
01:49:49.480 | is there are some problems that perhaps require
01:49:53.200 | 100 trillion parameter model intelligence
01:49:55.200 | trained on 100 trillion tokens,
01:49:56.760 | but that's, like, maybe 1%,
01:50:00.200 | maybe, like, 0.1% of all queries.
01:50:02.920 | So are you going to spend all of this effort,
01:50:05.560 | all this compute training a model that costs that much
01:50:09.560 | and then run it so infrequently?
01:50:12.080 | It feels completely wasteful
01:50:13.840 | when instead you could
01:50:16.040 | train the model that's capable of doing
01:50:18.160 | the 99.9% of queries,
01:50:20.240 | then you have a way of inference time running it longer
01:50:23.560 | for those few people that really,
01:50:25.120 | really want max intelligence.
01:50:26.960 | - How do you figure out which problem
01:50:30.560 | requires what level of intelligence?
01:50:33.320 | Is that possible to dynamically figure out
01:50:35.120 | when to use GPT-4, when to use,
01:50:37.480 | like, when to use a small model
01:50:39.000 | and when you need the O-1?
01:50:41.680 | - I mean, yeah, that's an open research problem, certainly.
01:50:47.240 | I don't think anyone's actually cracked
01:50:48.760 | this model routing problem quite well.
01:50:51.040 | We'd like to.
01:50:51.880 | We have, like, initial implementations of this
01:50:55.600 | for something like Cursor Tab,
01:50:57.040 | but at the level of, like,
01:50:59.520 | going between 4o, Sonnet, and o1,
01:51:02.600 | it's a bit trickier.
01:51:04.880 | There's also a question of, like,
01:51:05.800 | what level of intelligence do you need
01:51:07.720 | to determine if the thing is too hard
01:51:12.200 | for the 4o-level model?
01:51:13.800 | Maybe you need the O-1 level model.
01:51:17.680 | It's really unclear.
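For illustration only, two simple shapes a router could take; the callables here (classify_difficulty, verify, and the two models) are assumptions, and as noted above nobody has cracked this problem properly yet:

```python
def route(query, classify_difficulty, small_model, large_model, threshold=0.7):
    """Score-based routing: send easy queries to the cheap model, escalate hard ones.
    classify_difficulty could itself be a tiny model returning a score in [0, 1]."""
    return large_model(query) if classify_difficulty(query) >= threshold else small_model(query)

def cascade(query, small_model, large_model, verify):
    """Cascade routing: try the cheap model first and escalate only if a verifier
    (tests, a checker model, etc.) rejects its answer."""
    answer = small_model(query)
    return answer if verify(query, answer) else large_model(query)
```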
01:51:19.320 | - But you mentioned, so there's a pre-training process,
01:51:23.560 | then there's post-training,
01:51:25.160 | and then there's, like, test-time compute.
01:51:27.080 | Is that fair to sort of separate?
01:51:28.680 | Where's the biggest gains?
01:51:30.080 | - Well, it's weird, because, like, test-time compute,
01:51:33.600 | there's, like, a whole training strategy needed
01:51:36.120 | to get test-time compute to work,
01:51:38.040 | and the other really weird thing about this is no one,
01:51:42.680 | like, outside of the big labs,
01:51:44.520 | and maybe even just OpenAI,
01:51:45.960 | no one really knows how it works.
01:51:47.680 | Like, there have been some really interesting papers
01:51:49.840 | that show hints of what they might be doing.
01:51:53.680 | And so perhaps they're doing something
01:51:56.680 | with tree search using process reward models.
01:52:00.080 | But yeah, I just, I think the issue is
01:52:02.520 | we don't quite know exactly what it looks like,
01:52:04.840 | so it would be hard to kind of comment
01:52:06.320 | on, like, where it fits in.
01:52:07.960 | I would put it in post-training,
01:52:09.400 | but maybe, like, the compute spent for this kind of,
01:52:12.160 | for getting test-time compute to work for a model
01:52:14.680 | is going to dwarf pre-training eventually.
01:52:17.520 | - So we don't even know if O1 is using
01:52:21.600 | just, like, chain-of-thought RL.
01:52:23.800 | We don't know how they're using any of these.
01:52:26.000 | We don't know anything.
01:52:27.240 | - It's fun to speculate.
01:52:28.320 | (all laughing)
01:52:30.520 | - Like, if you were to build a competing model,
01:52:33.360 | what would you do?
01:52:35.000 | - Yeah, so one thing to do would be,
01:52:38.240 | I think you probably need to train a process reward model,
01:52:41.040 | which is, so maybe we can get into reward models
01:52:43.720 | and outcome reward models versus process reward models.
01:52:46.320 | Outcome reward models are the kind of
01:52:48.000 | traditional reward models that people train
01:52:50.560 | for language modeling,
01:52:53.880 | and it's just looking at the final thing.
01:52:55.360 | So if you're doing some math problem,
01:52:56.520 | let's look at that final thing you've done, everything,
01:52:59.120 | and let's assign a grade to it,
01:53:02.080 | how likely we think, like,
01:53:03.640 | what's the reward for this outcome.
01:53:05.760 | Process reward models, instead,
01:53:07.120 | try to grade the chain of thought.
01:53:09.240 | And so OpenAI had some preliminary paper on this,
01:53:11.600 | I think last summer,
01:53:13.800 | where they used human labelers
01:53:17.120 | to get this pretty large, several hundred thousand dataset
01:53:20.280 | of grading chains of thought.
01:53:21.960 | Ultimately, it feels like,
01:53:24.840 | I haven't seen anything interesting
01:53:26.720 | in the ways that people use process reward models
01:53:29.280 | outside of just using it as a means of
01:53:33.160 | affecting how we choose between a bunch of samples.
01:53:36.400 | So like what people do in all these papers
01:53:39.000 | is they sample a bunch of outputs from the language model,
01:53:42.000 | and then use the process reward models
01:53:44.440 | to grade all those generations
01:53:47.200 | alongside maybe some other heuristics,
01:53:49.040 | and then use that to choose the best answer.
01:53:51.640 | The really interesting thing that people think might work
01:53:55.000 | and people want to work
01:53:56.320 | is tree search with these process reward models,
01:53:58.760 | because if you really can grade every single step
01:54:02.280 | of the chain of thought,
01:54:03.640 | then you can kind of branch out
01:54:05.720 | and explore multiple paths of this chain of thought,
01:54:08.880 | and then use these process reward models
01:54:10.400 | to evaluate how good is this branch that you're taking.
01:54:14.000 | - Yeah, when the quality of the branch
01:54:16.600 | is somehow strongly correlated
01:54:18.240 | with the quality of the outcome at the very end.
01:54:20.480 | So like you have a good model
01:54:21.760 | of knowing which branch to take.
01:54:23.440 | So not just in the short term,
01:54:24.960 | and like in the long term.
01:54:25.920 | - Yeah, and like the interesting work
01:54:27.400 | that I think has been done
01:54:28.240 | is figuring out how to properly train the process,
01:54:30.880 | or the interesting work that has been open sourced
01:54:33.600 | and people I think talk about
01:54:35.520 | is how to train the process reward models,
01:54:38.880 | maybe in a more automated way.
01:54:41.000 | I could be wrong here,
01:54:42.200 | could not be mentioning something,
01:54:43.400 | because I haven't seen anything super,
01:54:46.000 | that seems to work really well
01:54:47.440 | for using the process reward models creatively
01:54:50.800 | to do tree searching code.
01:54:52.720 | - This is kind of an AI safety,
01:54:54.160 | maybe a bit of a philosophy question.
01:54:55.840 | So OpenAI says that they're hiding the chain of thought
01:54:58.560 | from the user.
01:54:59.960 | And they've said that that was a difficult decision to make.
01:55:03.120 | They, instead of showing the chain of thought,
01:55:06.080 | they're asking the model to summarize the chain of thought.
01:55:09.280 | They're also in the background saying
01:55:10.560 | they're going to monitor the chain of thought
01:55:13.000 | to make sure the model is not trying to manipulate the user,
01:55:15.840 | which is a fascinating possibility.
01:55:17.760 | But anyway,
01:55:18.600 | what do you think about hiding the chain of thought?
01:55:21.160 | - One consideration for OpenAI,
01:55:22.720 | and this is completely speculative,
01:55:24.560 | could be that they wanna make it hard for people
01:55:26.920 | to distill these capabilities out of their model.
01:55:29.720 | It might actually be easier
01:55:31.120 | if you had access to that hidden chain of thought
01:55:33.600 | to replicate the technology,
01:55:36.040 | 'cause that's pretty important data,
01:55:37.120 | like seeing the steps that the model took
01:55:38.840 | to get to the final result.
01:55:39.920 | - So you could probably train on that also.
01:55:42.360 | - And there was sort of a mirror situation with this,
01:55:45.240 | with some of the large language model providers,
01:55:47.040 | and also this is speculation,
01:55:48.760 | but some of these APIs
01:55:52.120 | used to offer easy access to log probabilities
01:55:55.360 | for all the tokens that they're generating,
01:55:57.640 | and also log probabilities for the prompt tokens.
01:55:59.960 | And then some of these APIs took those away.
01:56:02.640 | And again, complete speculation,
01:56:03.880 | but one of the thoughts is that
01:56:07.360 | the reason those were taken away is
01:56:08.840 | if you have access to log probabilities,
01:56:11.080 | similar to this hidden chain of thought,
01:56:12.520 | that can give you even more information
01:56:13.840 | to try and distill these capabilities out of the APIs,
01:56:16.960 | out of these biggest models,
01:56:18.680 | into models you control.
01:56:20.040 | As an asterisk on also the previous discussion
01:56:23.200 | about us integrating O1,
01:56:26.120 | I think that we're still learning how to use this model.
01:56:29.320 | So we made O1 available in Cursor
01:56:31.120 | because when we got the model,
01:56:33.880 | we were really interested in trying it out.
01:56:35.840 | I think a lot of programmers
01:56:37.280 | are gonna be interested in trying it out,
01:56:38.960 | but O1 is not part of the default Cursor experience
01:56:43.520 | in any way.
01:56:44.360 | And we still haven't found a way
01:56:47.480 | to get it integrated into the editor
01:56:51.240 | in a way that we reach for sort of every hour,
01:56:54.880 | maybe even every day.
01:56:56.200 | And so I think the jury's still out
01:56:58.560 | on how to use the model.
01:57:00.080 | And we haven't seen examples yet
01:57:04.120 | of people releasing things where it seems really clear,
01:57:07.360 | like, oh, that's like now the use case.
01:57:09.760 | The obvious one to return to
01:57:11.240 | is maybe this can make it easier
01:57:12.880 | for you to have these background things running, right?
01:57:15.240 | To have these models in loops,
01:57:16.120 | to have these models be agentic.
01:57:17.800 | But we're still discovering.
01:57:22.560 | - To be clear, we have ideas.
01:57:24.040 | We just need to try
01:57:25.760 | and get something incredibly useful
01:57:28.160 | before we put it out there.
01:57:29.600 | - But it has these significant limitations.
01:57:31.720 | Like, even like barring capabilities,
01:57:35.640 | it does not stream.
01:57:37.600 | And that means it's really, really painful to use
01:57:40.560 | for things where you want to supervise the output.
01:57:43.280 | And instead, you're just waiting
01:57:45.240 | for the wall of text to show up.
01:57:47.320 | Also, it does feel like the early innings
01:57:49.480 | of test-time compute and search,
01:57:50.840 | where it's just like very, very much a V0.
01:57:54.640 | And there's so many things that like don't feel quite right.
01:57:58.760 | And I suspect in parallel
01:58:01.760 | to people increasing the amount of pre-training data
01:58:05.800 | and the size of the models in pre-training
01:58:07.080 | and finding tricks there,
01:58:08.240 | you'll now have this other thread
01:58:09.920 | of getting search to work better and better.
01:58:12.640 | - So let me ask you about Strawberry Tomorrow Eyes.
01:58:19.840 | So it looks like GitHub Copilot
01:58:24.440 | might be integrating O1 in some kind of way.
01:58:28.280 | And I think some of the comments are saying,
01:58:29.920 | does this mean Cursor is done?
01:58:31.520 | I think I saw one comment saying that.
01:58:35.000 | - I saw, time to shut down Cursor.
01:58:37.120 | - Time to shut down Cursor, thank you.
01:58:39.440 | So is it time to shut down Cursor?
01:58:41.400 | - I think this space is a little bit different
01:58:43.080 | from past software spaces over the 2010s,
01:58:46.840 | where I think that the ceiling here
01:58:49.160 | is really, really, really incredibly high.
01:58:51.280 | And so I think that the best product in three to four years
01:58:54.360 | will just be so much more useful
01:58:55.680 | than the best product today.
01:58:57.320 | And you can like wax poetic about moats this
01:59:01.640 | and brand that, and this is our advantage.
01:59:05.000 | But I think in the end, just if you don't have,
01:59:07.560 | like if you stop innovating on the product, you will lose.
01:59:10.800 | And that's also great for startups.
01:59:13.360 | That's great for people trying to enter this market
01:59:16.040 | because it means you have an opportunity
01:59:17.960 | to win against people who have, you know,
01:59:19.800 | lots of users already by just building something better.
01:59:23.000 | And so I think, yeah, over the next few years,
01:59:26.120 | it's just about building the best product,
01:59:28.800 | building the best system, and that both comes down
01:59:31.240 | to the modeling engine side of things.
01:59:34.480 | And it also comes down to the editing experience.
01:59:37.440 | - Yeah, I think most of the additional value
01:59:40.120 | from Cursor versus everything else out there
01:59:42.520 | is not just integrating the new model fast, like o1.
01:59:46.160 | And it comes from all of the kind of depth
01:59:49.480 | that goes into these custom models
01:59:51.560 | that you don't realize are working for you
01:59:53.480 | in kind of every facet of the product,
01:59:55.480 | as well as like the really thoughtful UX
01:59:59.400 | with every single feature.
02:00:00.720 | - All right, from that profound answer,
02:00:03.800 | let's descend back down to the technical.
02:00:05.560 | You mentioned you have a taxonomy of synthetic data.
02:00:08.480 | - Oh, yeah.
02:00:09.720 | - Can you please explain?
02:00:10.600 | - Yeah, I think there are three main kinds of synthetic data.
02:00:15.240 | The first is, so what is synthetic data first?
02:00:18.200 | So there's normal data, like non-synthetic data,
02:00:20.400 | which is just data that's naturally created,
02:00:23.800 | i.e. usually it'll be from humans having done things.
02:00:27.120 | So from some human process, you get this data.
02:00:30.480 | Synthetic data, the first one would be distillation.
02:00:34.720 | So having a language model, kind of output tokens
02:00:38.080 | or probability distributions over tokens.
02:00:41.760 | And then you can train some less capable model on this.
02:00:45.640 | This approach is not gonna get you a net,
02:00:47.960 | like more capable model than the original one
02:00:49.880 | that has produced the tokens.
02:00:51.320 | But it's really useful for if there's some capability
02:00:55.360 | you wanna elicit from some really expensive
02:00:58.040 | high latency model, you can then distill that down
02:01:00.880 | into some smaller task specific model.
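As an aside, a generic sketch of what distillation on teacher token distributions can look like (a standard soft-label KL loss in PyTorch; not any particular lab's recipe):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Push the student's token distribution toward the teacher's.
    Logits have shape (batch, seq, vocab)."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 as is conventional
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)
```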
02:01:03.400 | The second kind is when like one direction of the problem
02:01:09.280 | is easier than the reverse.
02:01:11.840 | And so a great example of this is bug detection,
02:01:16.120 | like we mentioned earlier, where it's a lot easier
02:01:19.840 | to introduce reasonable looking bugs
02:01:22.600 | than it is to actually detect them.
02:01:24.960 | And this is probably the case for humans too.
02:01:27.200 | And so what you can do is you can get a model
02:01:31.440 | that's not training that much data, that's not that smart
02:01:34.320 | to introduce a bunch of bugs in code.
02:01:35.840 | And then you can use that to then train,
02:01:38.320 | use this synthetic data to train a model
02:01:39.800 | that can be really good at detecting bugs.
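A toy version of that forward direction, using hand-written mutations where a real pipeline would use a model to propose realistic-looking bugs; the buggy/clean pairs then become labels for training a detector:

```python
import random

def introduce_bug(code: str) -> tuple[str, str]:
    """Apply one cheap mutation to a snippet and record what was changed."""
    mutations = [
        ("==", "!="),
        ("<", "<="),
        ("+ 1", "- 1"),
        (" and ", " or "),
    ]
    random.shuffle(mutations)
    for old, new in mutations:
        if old in code:
            return code.replace(old, new, 1), f"replaced '{old}' with '{new}'"
    return code, "no mutation applied"

# dataset = [(buggy, "buggy"), (clean, "clean"), ...] -> train a bug detector on it
```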
02:01:42.240 | The last category, I think is, I guess the main one
02:01:45.240 | that it feels like the big labs are doing
02:01:48.360 | for synthetic data, which is producing texts
02:01:52.800 | with language models that can then be verified easily.
02:01:57.360 | So like, extreme example of this
02:01:59.920 | is if you have a verification system that can detect
02:02:02.840 | if language is Shakespeare level
02:02:05.760 | and then you have a bunch of monkeys typing on typewriters.
02:02:08.160 | Like, you can eventually get enough training data
02:02:10.640 | to train a Shakespeare level language model.
02:02:12.640 | And I mean, this is the case, like very much the case
02:02:14.760 | for math where verification is actually really, really easy
02:02:19.160 | for formal languages.
02:02:22.680 | And then what you can do is you can have an okay model,
02:02:26.200 | generate a ton of rollouts and then choose the ones
02:02:29.600 | that you know have actually proved
02:02:31.840 | the ground truth theorems and train that further.
02:02:34.680 | There's similar things you can do for code
02:02:36.360 | with LeetCode like problems where if you have some set
02:02:40.400 | of tests that you know correspond to,
02:02:42.440 | if something passes these tests,
02:02:43.760 | it is actually solved the problem.
02:02:45.600 | You could do the same thing where you verify
02:02:46.880 | that it's passed the tests and then train the model
02:02:48.760 | on the outputs that have passed the tests.
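A sketch of that generate-then-verify loop for LeetCode-style problems, with generate as a hypothetical sampling call and a subprocess run standing in for the test harness:

```python
import subprocess
import tempfile

def passes_tests(solution_code: str, test_code: str, timeout: int = 10) -> bool:
    """Run a candidate solution against a known test suite in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def filter_rollouts(prompt: str, tests: str, generate, n: int = 16) -> list:
    """Sample many solutions from an okay model and keep only the verified ones
    as synthetic training data."""
    keepers = []
    for _ in range(n):
        candidate = generate(prompt)
        if passes_tests(candidate, tests):
            keepers.append({"prompt": prompt, "solution": candidate})
    return keepers
```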
02:02:51.680 | I think it's gonna be a little tricky getting this to work
02:02:54.280 | in all domains or just in general.
02:02:57.720 | Like having the perfect verifier feels really, really hard
02:03:00.440 | to do with just like open-ended miscellaneous tasks
02:03:04.760 | you give the model, or more like long-horizon tasks,
02:03:07.800 | even in coding.
02:03:09.040 | - That's 'cause you're not as optimistic as Arvid, but yeah.
02:03:12.280 | So yeah, so that third category requires having a verifier.
02:03:16.560 | - Yeah, verification is, it feels like it's best
02:03:18.880 | when you know for a fact that it's correct.
02:03:20.520 | And like, then it wouldn't be like using a language model
02:03:23.720 | to verify, it would be using tests or formal systems.
02:03:28.440 | - Or running the thing too.
02:03:30.640 | Doing like the human form of verification
02:03:32.440 | where you just do manual quality control.
02:03:34.280 | - Yeah, yeah.
02:03:35.360 | - But like the language model version of that
02:03:37.000 | where it's like running the thing
02:03:37.840 | and it actually understands the output.
02:03:39.760 | - Yeah, no, that's true.
02:03:40.600 | - Sort of somewhere between.
02:03:41.880 | - Yeah, I think that's the category that is most likely
02:03:45.680 | to result in like massive gains.
02:03:48.280 | - What about the RL-with-feedback side, RLHF versus RLAIF?
02:03:52.520 | What's the role of that in getting better performance
02:03:57.920 | on the models?
02:04:00.080 | - Yeah, so RLHF is when the reward model you use
02:04:05.080 | is trained from some labels you've collected
02:04:09.880 | from humans giving feedback.
02:04:11.400 | I think this works if you have the ability
02:04:15.360 | to get a ton of human feedback
02:04:18.280 | for this kind of task that you care about.
02:04:20.840 | RLAIF is interesting because you're kind of depending on,
02:04:26.840 | like this is actually kind of going to,
02:04:30.000 | it's depending on the constraint that verification
02:04:33.200 | is actually a decent bit easier than generation.
02:04:36.880 | Because it feels like, okay, like, what are you doing?
02:04:38.920 | Are you using this language model
02:04:40.080 | to look at the language model outputs
02:04:41.320 | and then improve the language model?
02:04:42.680 | But no, it actually may work if the language model
02:04:46.720 | has a much easier time verifying some solution
02:04:49.960 | than it does generating it.
02:04:50.880 | Then you actually could perhaps get this kind of recursive.
02:04:54.240 | I don't think it's going to look exactly like that.
02:04:56.840 | The other thing you could do is,
02:04:59.040 | that we kind of do is like a little bit of a mix
02:05:03.200 | of RLAIF and RLHF,
02:05:05.440 | where usually the model is actually quite correct.
02:05:07.640 | And this is in the case of Cursor Tab,
02:05:09.840 | picking between like two possible generations
02:05:13.360 | of what is the better one.
02:05:15.040 | And then it just needs like a hand,
02:05:16.720 | a little bit of human nudging
02:05:18.880 | with only like on the order of 50, 100 examples
02:05:24.080 | to like kind of align that prior the model has
02:05:27.240 | exactly with what you want.
02:05:29.200 | It looks different than I think normal RLHF
02:05:31.240 | where you're usually training these reward models
02:05:33.120 | on tons of examples.
02:05:34.520 | - What's your intuition when you compare generation
02:05:39.320 | and verification, or generation and ranking?
02:05:42.360 | Is ranking way easier than generation?
02:05:45.840 | - My intuition would just say, yeah, it should be.
02:05:49.160 | Like this is kind of going back to,
02:05:53.800 | like if you believe P does not equal NP,
02:05:56.600 | then there's this massive class of problems
02:05:59.520 | that are much, much easier to verify given a proof
02:06:02.240 | than actually proving it.
02:06:03.920 | - I wonder if the same thing will prove P not equal to NP
02:06:07.240 | or P equal to NP.
02:06:08.480 | - That would be, that would be really cool.
02:06:11.640 | - That'd be, whatever, a Fields Medal by AI.
02:06:16.200 | Who gets the credit?
02:06:17.800 | Another open philosophical question.
02:06:19.640 | - I'm actually--
02:06:22.040 | - Whoever prompted it.
02:06:22.880 | (laughs)
02:06:24.240 | - I'm actually surprisingly curious what like a good bet
02:06:27.760 | for when AI will get the Fields Medal will be.
02:06:31.280 | I actually don't have--
02:06:32.120 | - Isn't this Aman's specialty?
02:06:33.120 | - I don't know what Aman's bet here is.
02:06:35.400 | - Oh, sorry, Nobel Prize or Fields Medal first?
02:06:37.760 | - Fields Medal.
02:06:38.600 | - Well, Fields Medal level.
02:06:39.720 | - Fields Medal comes first, I think.
02:06:41.280 | - Fields Medal comes first.
02:06:42.520 | Well, you would say that, of course.
02:06:44.840 | - But it's also this like isolated system
02:06:46.600 | you can verify and--
02:06:47.880 | - Sure.
02:06:48.920 | - Like, I don't even know if I would--
02:06:49.760 | - You don't need to do--
02:06:50.600 | - I feel like I have much more to do there.
02:06:51.720 | I felt like the path to get to IMO
02:06:53.520 | was a little bit more clear
02:06:55.160 | because it already could get a few IMO problems.
02:06:57.720 | And there were a bunch of like,
02:06:59.040 | there was a bunch of low hanging fruit
02:07:00.360 | given the literature at the time
02:07:01.600 | of like what tactics people could take.
02:07:04.000 | I think I am, one, much less versed in the space
02:07:06.520 | of theorem proving now.
02:07:07.760 | And two, yeah, less intuition about how close we are
02:07:11.720 | to solving these really, really hard open problems.
02:07:15.600 | - So you think it'll be Fields Medal first?
02:07:17.280 | It won't be like in physics or in--
02:07:20.400 | - Oh, 100%.
02:07:21.240 | I think that's probably more likely.
02:07:23.840 | Like, it's probably much more likely that it'll get in.
02:07:26.800 | Yeah, yeah, yeah, yeah.
02:07:27.640 | Well, I think it goes to like, I don't know,
02:07:29.080 | like BSD, which is the Birch and Swinnerton-Dyer conjecture,
02:07:32.240 | or like Riemann hypothesis,
02:07:33.680 | or any one of these like hard math problems,
02:07:36.720 | which is actually really hard.
02:07:38.560 | It's sort of unclear what the path to get
02:07:41.400 | even a solution looks like.
02:07:42.920 | Like, we don't even know what a path looks like,
02:07:44.640 | let alone--
02:07:45.480 | - And you don't buy the idea
02:07:47.480 | that this is like an isolated system
02:07:49.120 | and you can actually have a good reward system,
02:07:51.280 | and it feels like it's easier to train for that.
02:07:56.000 | - I think we might get Fields Medal before AGI.
02:07:59.520 | - I think--
02:08:00.360 | - I mean, I'd be very happy.
02:08:02.840 | (laughs)
02:08:03.680 | I'd be very happy.
02:08:04.500 | But I don't know. I think 2028, 2030.
02:08:08.520 | (laughs)
02:08:09.720 | - For Fields Medal?
02:08:10.880 | - Fields Medal.
02:08:11.720 | - All right.
02:08:12.960 | It feels like forever from now,
02:08:15.040 | given how fast things have been going.
02:08:17.560 | - Speaking of how fast things have been going,
02:08:19.160 | let's talk about scaling laws.
02:08:21.440 | So for people who don't know,
02:08:23.000 | maybe it's good to talk about this whole idea
02:08:28.920 | of scaling laws.
02:08:30.040 | What are they?
02:08:31.000 | Where do you think we stand?
02:08:32.200 | And where do you think things are going?
02:08:34.360 | - I think it was interesting,
02:08:35.200 | the original scaling laws paper by OpenAI
02:08:37.160 | was slightly wrong,
02:08:38.000 | 'cause, I think, of some issues they had
02:08:40.480 | with learning rate schedules.
02:08:43.160 | And then Chinchilla showed a more correct version.
02:08:46.520 | And then from then people have, again,
02:08:48.400 | kind of deviated from doing the compute optimal thing,
02:08:50.360 | 'cause people start now optimizing more so
02:08:53.360 | for making the thing work really well,
02:08:56.400 | given an inference budget.
02:08:58.920 | And I think there are a lot more dimensions to these curves
02:09:03.280 | than what we originally used of just compute,
02:09:06.680 | number of parameters and data.
02:09:09.640 | Like inference compute is the obvious one.
02:09:12.600 | I think context length is another obvious one.
02:09:14.720 | So if you care,
02:09:15.560 | like let's say you care about the two things
02:09:16.800 | of inference compute and then context window,
02:09:21.240 | maybe the thing you wanna train is some kind of SSM
02:09:24.680 | because they're much, much cheaper and faster
02:09:27.480 | at super, super long context.
02:09:28.920 | And even if it maybe has 10X worse scaling properties
02:09:31.680 | during training,
02:09:32.520 | meaning you've spent 10X more compute
02:09:34.520 | to train the thing to get the same level of capabilities,
02:09:37.840 | it's worth it because you care most
02:09:40.080 | about that inference budget for really long context windows.
02:09:43.400 | So it'll be interesting to see how people kind of play
02:09:46.000 | with all these dimensions.
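As a rough back-of-the-envelope illustration of the trade-off being described, the sketch below splits a fixed training-FLOP budget using the common C ≈ 6·N·D approximation and a tokens-per-parameter knob; the 20-tokens-per-parameter figure is the usual Chinchilla rule of thumb, and the larger values stand in for "overtrained," inference-friendly models. The numbers and helper names are assumptions for illustration, not figures from any particular paper.

```python
# Back-of-the-envelope Chinchilla-style allocation (illustrative approximations only).
# C ~= 6 * N * D (training FLOPs), with the rough heuristic D ~= 20 * N at the
# compute-optimal point; real fits differ, and labs now often "overtrain"
# (more tokens per parameter) to buy cheaper inference.

def chinchilla_split(train_flops: float, tokens_per_param: float = 20.0):
    """Return (params N, tokens D) for a given training-FLOP budget."""
    n = (train_flops / (6.0 * tokens_per_param)) ** 0.5
    d = tokens_per_param * n
    return n, d

def inference_cost_per_token(n_params: float) -> float:
    """Very rough: ~2 FLOPs per parameter per generated token."""
    return 2.0 * n_params

budget = 1e24  # hypothetical training budget in FLOPs
for tpp in (20.0, 100.0, 500.0):  # compute-optimal vs. increasingly overtrained
    n, d = chinchilla_split(budget, tokens_per_param=tpp)
    print(f"tokens/param={tpp:>5.0f}  params={n:.2e}  tokens={d:.2e}  "
          f"inference FLOPs/token={inference_cost_per_token(n):.2e}")
```

Running it shows the basic shape of the trade-off: at a fixed training budget, pushing tokens-per-parameter up gives a smaller model trained on more data, which is cheaper per token at inference time.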
02:09:47.520 | - So, yeah.
02:09:48.360 | I mean, you speak to the multiple dimensions, obviously.
02:09:49.880 | The original conception was just looking at the variables
02:09:52.400 | of the size of the model as measured by parameters
02:09:55.480 | and the size of the data as measured
02:09:56.920 | by the number of tokens and looking at the ratio of the two.
02:09:59.760 | - Yeah.
02:10:00.600 | - And it's kind of a compelling notion
02:10:02.520 | that there is a number or at least a minimum.
02:10:06.360 | And it seems like one was emerging.
02:10:10.440 | Do you still believe that there is a kind of,
02:10:13.200 | bigger is better?
02:10:14.240 | - I mean, I think bigger is certainly better
02:10:19.080 | for just raw performance.
02:10:21.520 | - And raw intelligence.
02:10:22.480 | - And raw intelligence.
02:10:23.560 | I think that the path that people might take is,
02:10:25.640 | I'm particularly bullish on distillation.
02:10:28.200 | And like, yeah.
02:10:29.040 | Like, how many knobs can you turn so that,
02:10:31.160 | if we spend like a ton, ton of money on training,
02:10:34.840 | we get the most capable, cheap model, right?
02:10:38.360 | Like really, really caring as much as you can.
02:10:40.360 | 'Cause like the naive version of caring as much as you can
02:10:42.960 | about inference time compute
02:10:43.920 | is what people have already done with like the Lama models
02:10:46.000 | or just overtraining the shit out of 7B models
02:10:50.160 | on way, way, way more tokens than is Chinchilla optimal.
02:10:54.160 | Right, but if you really care about it,
02:10:55.160 | maybe the thing to do is what Gemma did,
02:10:56.360 | which is let's not just train on tokens.
02:10:59.040 | Let's literally train on minimizing the KL divergence
02:11:04.040 | with the distribution of Gemma 27B, right?
02:11:08.480 | So knowledge distillation there.
02:11:11.000 | And you're spending the compute
02:11:12.720 | of literally training this 27-billion-parameter
02:11:15.760 | model on all these tokens
02:11:17.640 | just to get out this, I don't know, smaller model.
02:11:20.320 | - And the distillation gives you just a faster model.
02:11:22.480 | Smaller means faster.
02:11:23.840 | - Yeah, distillation in theory is,
02:11:25.680 | I think getting out more signal
02:11:29.080 | from the data that you're training on.
02:11:30.600 | And it's like another,
02:11:31.440 | it's perhaps another way of getting over,
02:11:33.800 | not like completely over,
02:11:35.000 | but like partially helping with the data wall.
02:11:37.640 | Where like you only have so much data to train on,
02:11:39.400 | let's like train this really, really big model
02:11:41.120 | on all these tokens and we'll distill it into a smaller one.
02:11:43.760 | And maybe we can get more signal per token
02:11:48.280 | for this much smaller model
02:11:50.040 | than we would have originally if we trained it.
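A minimal sketch of what "train on minimizing the KL divergence with the teacher's distribution" looks like in code, assuming the student and teacher share a vocabulary. This is generic knowledge distillation for illustration, not Gemma's actual recipe; the shapes and the temperature are made up.

```python
# Generic knowledge-distillation loss sketch (not Gemma's actual training recipe).
# The student is trained to match the teacher's full next-token distribution
# (soft targets) instead of only the observed one-hot token.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student), summed over positions and vocab, averaged over the batch.

    Both logit tensors have shape (batch, seq_len, vocab_size).
    """
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    # kl_div with log_target=True expects log-probabilities for both arguments.
    return F.kl_div(s_logprobs, t_logprobs, log_target=True,
                    reduction="batchmean") * (temperature ** 2)

# Toy shapes standing in for a big frozen teacher and a small student sharing a vocab.
batch, seq_len, vocab = 2, 16, 32000
teacher_logits = torch.randn(batch, seq_len, vocab)  # would come from the frozen teacher
student_logits = torch.randn(batch, seq_len, vocab, requires_grad=True)

loss = distillation_loss(student_logits, teacher_logits, temperature=2.0)
loss.backward()  # gradients flow only into the student
print(loss.item())
```

The design point this illustrates is the one made above: every token carries the teacher's whole distribution rather than a single hard label, which is one way of squeezing more signal per token out of a fixed dataset.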
02:11:51.600 | - So if I gave you $10 trillion, how would you spend it?
02:11:55.560 | I mean, you can't buy an island or whatever.
02:11:58.640 | How would you allocate it
02:12:00.360 | in terms of improving the big model
02:12:03.600 | versus maybe paying for the HF in RLHF?
02:12:08.600 | - Yeah, I think there's a lot of these secrets
02:12:14.000 | and details about training these large models
02:12:16.720 | that I just don't know
02:12:18.400 | and are only privy to the large labs.
02:12:19.960 | And the issue is I would waste a lot of that money
02:12:22.600 | if I even attempted this,
02:12:24.040 | because I wouldn't know those things.
02:12:26.360 | Suspending a lot of disbelief
02:12:28.200 | and assuming, like, you had the know-how,
02:12:32.960 | or are you saying, like, you have to operate
02:12:35.200 | with, like, the limited information you have now?
02:12:37.800 | - No, no, no.
02:12:38.640 | Actually, I would say you swoop in
02:12:40.800 | and you get all the information,
02:12:42.040 | all the little heuristics, all the little parameters,
02:12:44.560 | all the parameters that define how the thing is trained.
02:12:49.280 | If we look at how to invest money for the next five years
02:12:54.320 | in terms of maximizing what you called raw intelligence.
02:12:57.480 | - I mean, isn't the answer like really simple?
02:12:59.280 | You just try to get as much compute as possible?
02:13:02.200 | Like at the end of the day, all you need to buy is the GPUs
02:13:05.000 | and then sort of the researchers can find all the,
02:13:08.840 | like they can sort of, you can tune whether you want
02:13:12.320 | to pre-train a big model or a small model.
02:13:15.200 | - Well, this gets into the question of like,
02:13:16.560 | are you really limited by compute and money
02:13:18.920 | or are you limited by these other things?
02:13:21.040 | - I'm more partial to Arvid's belief
02:13:24.360 | that we're sort of idea-limited, but there's always-
02:13:27.760 | - But if you have a lot of compute,
02:13:30.640 | you can run a lot of experiments.
02:13:32.760 | - So you would run a lot of experiments
02:13:34.920 | versus like use that compute to train a gigantic model.
02:13:38.560 | - I would, but I do believe that we are limited
02:13:42.560 | in terms of ideas that we have.
02:13:44.600 | - I think, yeah, 'cause even with all this compute
02:13:47.960 | and like, you know, all the data you could collect
02:13:49.920 | in the world, I think you really are ultimately limited
02:13:52.680 | by not even ideas, but just like really good engineering.
02:13:58.920 | Like even with all the capital in the world,
02:14:00.880 | would you really be able to assemble,
02:14:03.560 | like there aren't that many people in the world
02:14:05.520 | who really can like make the difference here.
02:14:08.000 | And there's so much work that goes into research
02:14:11.640 | that is just like pure, really, really hard engineering work.
02:14:15.760 | As like a very kind of hand-wavy example,
02:14:18.680 | if you look at the original "Transformer" paper,
02:14:20.680 | how much of the work was kind of joining together
02:14:22.800 | a lot of these really interesting concepts
02:14:25.160 | embedded in the literature versus then going in
02:14:28.720 | and writing all the code,
02:14:30.160 | like maybe the CUDA kernels, maybe whatever else,
02:14:31.880 | I don't know if it ran on GPUs or TPUs originally,
02:14:34.000 | such that it actually saturated the GPU performance, right?
02:14:38.360 | Getting Noam to go in and do all of this code, right?
02:14:41.200 | And Noam is like probably one of the best engineers
02:14:42.920 | in the world, or maybe going a step further,
02:14:45.160 | like the next generation of models, having these things,
02:14:47.720 | like getting model parallelism to work
02:14:49.480 | and scaling it on like, you know, thousands of,
02:14:51.680 | or maybe tens of thousands of like V100s,
02:14:54.280 | which I think GPT-3 may have been.
02:14:57.160 | There's just so much engineering effort
02:14:58.720 | that has to go into all of these things to make it work.
02:15:01.760 | If you really brought that cost down to like, you know,
02:15:07.680 | maybe not zero, but just made it 10X easier,
02:15:10.280 | made it super easy for someone with really fantastic ideas
02:15:13.560 | to immediately get to the version of like
02:15:16.000 | the new architecture they dreamed up
02:15:17.480 | that is like getting 50, 40% utilization on the GPUs.
02:15:22.840 | I think that would just speed up research by a ton.
02:15:27.640 | - I mean, I think if you see a clear path to improvement,
02:15:30.360 | you should always sort of take
02:15:31.720 | the low-hanging fruit first, right?
02:15:33.040 | And I think probably OpenAI and all the other labs
02:15:36.720 | did the right thing to pick off the low-hanging fruit,
02:15:39.280 | where the low-hanging fruit is like sort of,
02:15:41.920 | you could scale up to a GPT 4.25 scale,
02:15:47.680 | and you just keep scaling,
02:15:50.960 | and like things keep getting better.
02:15:53.200 | And as long as, like, you know,
02:15:55.440 | there's no point of experimenting with new ideas
02:15:57.440 | when like everything is working.
02:15:59.480 | And you should sort of bang on it
02:16:01.560 | and try to get as much juice out of it as possible.
02:16:04.120 | And then maybe when you really need new ideas,
02:16:07.040 | I think if you're spending 10 trillion dollars,
02:16:08.960 | you probably want to spend some of it,
02:16:10.720 | you know, to then actually, like, re-evaluate your ideas.
02:16:13.320 | Like, probably you're idea-limited at that point.
02:16:15.480 | - I think all of us believe new ideas
02:16:18.120 | are probably needed to get, you know,
02:16:20.520 | all the way there to AGI.
02:16:22.760 | And all of us also probably believe
02:16:27.160 | there exist ways of testing out those ideas
02:16:30.160 | at smaller scales and being fairly confident
02:16:34.040 | that they'll play out.
02:16:35.680 | It's just quite difficult for the labs
02:16:39.080 | in their current position to dedicate
02:16:41.400 | their very limited research and engineering talent
02:16:45.200 | to exploring all these other ideas
02:16:47.240 | when there's like this core thing
02:16:48.600 | that will probably like improve performance
02:16:52.640 | for some like decent amount of time.
02:16:54.560 | - Yeah, but also these big labs like winning.
02:16:59.040 | So they're just going wild.
02:17:02.400 | Okay, so how, big question looking out into the future.
02:17:07.400 | You're now at the center of the programming world.
02:17:12.000 | How do you think programming,
02:17:13.240 | the nature of programming changes
02:17:14.840 | in the next few months, in the next year,
02:17:17.600 | in the next two years, the next five years, 10 years?
02:17:20.840 | - I think we're really excited about a future
02:17:23.800 | where the programmer's in the driver's seat for a long time.
02:17:27.960 | And you've heard us talk about this a little bit,
02:17:30.320 | but one that emphasizes speed and agency
02:17:34.240 | for the programmer and control,
02:17:36.200 | the ability to modify anything you want to modify,
02:17:38.640 | the ability to iterate really fast
02:17:40.200 | in what you're building.
02:17:41.720 | And this is a little different, I think,
02:17:45.280 | than where some people are jumping to in the space,
02:17:50.280 | where I think one idea that's captivated people
02:17:54.200 | is can you talk to your computer?
02:17:58.280 | Can you have it build software for you
02:17:59.520 | as if you're talking to like an engineering department
02:18:01.400 | or an engineer over Slack?
02:18:02.680 | And can it just be this sort of isolated text box?
02:18:05.640 | And part of the reason we're not excited about that
02:18:10.720 | is some of the stuff we've talked about with latency.
02:18:12.760 | But then a big piece, a reason we're not excited about that
02:18:16.040 | is because that comes with giving up a lot of control.
02:18:19.080 | It's much harder to be really specific
02:18:20.640 | when you're talking in the text box.
02:18:22.360 | And if you're necessarily just going to communicate
02:18:25.760 | with a thing, like you would be communicating
02:18:27.200 | with an engineering department,
02:18:28.040 | you're actually abdicating tons and tons
02:18:29.520 | of really important decisions to this bot.
02:18:32.480 | And this kind of gets at fundamentally what engineering is.
02:18:38.600 | I think that some people
02:18:40.440 | who are a little bit more removed from engineering
02:18:41.840 | might think of it as the spec is completely written out
02:18:44.920 | and then the engineers just come and they just implement.
02:18:47.880 | And it's just about making the thing happen in code
02:18:49.960 | and making the thing exist.
02:18:52.040 | But I think a lot of the best engineering,
02:18:55.160 | the engineering we enjoy,
02:18:56.400 | involves tons of tiny micro decisions
02:18:59.520 | about what exactly you're building
02:19:01.240 | and about really hard trade-offs between speed and cost
02:19:05.080 | and just all the other things involved in a system.
02:19:08.320 | And we want, as long as humans
02:19:12.600 | are actually the ones designing the software
02:19:15.440 | and the ones specifying what they want to be built,
02:19:18.400 | and it's not just like company run by all AIs,
02:19:20.760 | we think you'll really want the human in a driver's seat
02:19:23.560 | dictating these decisions.
02:19:26.240 | And so the jury's still out on kind of what that looks like.
02:19:30.640 | I think that one weird idea for what that could look like
02:19:34.200 | is it could look like you can control
02:19:37.200 | the level of abstraction you view a code base at.
02:19:39.760 | And you can point at specific parts of a code base,
02:19:43.200 | and maybe you digest a code base
02:19:46.720 | by looking at it in the form of pseudocode.
02:19:49.120 | And you can actually edit that pseudocode too,
02:19:52.560 | and then have changes get made down
02:19:54.320 | at the sort of formal programming level.
02:19:56.520 | And you can gesture at any piece of logic
02:20:01.520 | in your software.
02:20:04.120 | It keeps the in-flow, text-editing component of programming,
02:20:07.120 | it keeps that control:
02:20:08.560 | you can even go down into the code,
02:20:10.040 | you can go at higher levels of abstraction,
02:20:12.320 | while also giving you these big productivity gains.
02:20:14.640 | - It'd be nice if you can go up and down
02:20:16.320 | the abstraction stack.
02:20:18.280 | - Yeah, and there are a lot of details to figure out there
02:20:20.200 | that's sort of like a fuzzy idea,
02:20:21.800 | time will tell if it actually works,
02:20:23.200 | but these principles of control and speed
02:20:25.760 | in the human in the driver's seat
02:20:26.640 | we think are really important.
02:20:28.680 | We think for some things, like Arvid mentioned before,
02:20:31.080 | for some styles of programming,
02:20:32.360 | you can kind of hand it off chatbot style,
02:20:34.800 | if you have a bug that's really well-specified,
02:20:36.800 | but that's not most of programming,
02:20:39.240 | and that's also not most of the programming
02:20:41.800 | we think a lot of people value.
02:20:43.440 | - What about like the fundamental skill of programming?
02:20:46.080 | There's a lot of people, like young people right now,
02:20:49.840 | kind of scared, like thinking,
02:20:53.800 | 'cause they like love programming,
02:20:55.240 | but they're scared about like,
02:20:56.280 | will I be able to have a future
02:20:57.640 | if I pursue this career path?
02:20:59.800 | Do you think the very skill of programming
02:21:01.840 | will change fundamentally?
02:21:04.040 | - I actually think this is a really, really exciting time
02:21:06.600 | to be building software.
02:21:08.040 | Like we remember what programming was like in 2013, 2012,
02:21:13.040 | whatever it was,
02:21:14.760 | and there was just so much more cruft and boilerplate
02:21:20.360 | and looking up something really gnarly,
02:21:25.320 | and that stuff still exists, it's definitely not at zero,
02:21:28.640 | but programming today is way more fun than back then.
02:21:32.520 | It's like, we're really getting down
02:21:34.200 | to the delight concentration,
02:21:36.720 | and all the things that really draw people to programming,
02:21:39.520 | like for instance, this element of being able
02:21:41.160 | to build things really fast and speed,
02:21:44.120 | and also individual control,
02:21:45.840 | like all those are just being turned up a ton.
02:21:48.320 | And so I think it's just gonna be,
02:21:50.280 | I think it's gonna be a really, really fun time
02:21:51.720 | for people who build software.
02:21:53.720 | I think that the skills will probably change too.
02:21:56.120 | I think that people's taste in creative ideas
02:21:58.600 | will be magnified, and it will be less about,
02:22:02.160 | maybe less a little bit about boilerplate text editing,
02:22:05.160 | maybe even a little bit less about carefulness,
02:22:07.840 | which I think is really important today.
02:22:10.760 | If you're a programmer, I think it'll be a lot more fun.
02:22:13.440 | - What do you guys think?
02:22:15.200 | - I agree.
02:22:16.120 | I'm very excited to be able to change,
02:22:18.320 | like just, one thing that happened recently
02:22:22.800 | was like we wanted to do a relatively big migration
02:22:25.800 | to our code base.
02:22:26.640 | We were using async local storage in Node.js,
02:22:30.120 | which is known to be not very performant,
02:22:31.960 | and we wanted to migrate to our context object.
02:22:33.760 | And this is a big migration
02:22:35.440 | and affects the entire code base.
02:22:37.640 | And Sualeh and I spent, I don't know, five days
02:22:41.360 | working through this, even with today's AI tools.
02:22:43.640 | And I am really excited for a future
02:22:47.040 | where I can just show a couple of examples,
02:22:50.520 | and then the AI applies that to all of the locations.
02:22:54.120 | And then it highlights, oh, this is a new example,
02:22:56.960 | like what should I do?
02:22:57.800 | And then I show exactly what to do there.
02:22:59.440 | And then that can be done in like 10 minutes.
02:23:02.520 | And then you can iterate much, much faster.
02:23:04.920 | Then you don't have to think as much upfront
02:23:08.280 | and stand at the blackboard and like,
02:23:10.520 | think exactly like, how are we going to do this?
02:23:12.360 | Because the cost is so high,
02:23:13.800 | but you can just try something first and you realize,
02:23:16.480 | oh, this is not actually exactly what I want.
02:23:18.400 | And then you can change it instantly again after.
02:23:20.800 | And so, yeah, I think being a programmer in the future
02:23:24.960 | is going to be a lot of fun.
02:23:26.560 | - Yeah, I really liked that point about,
02:23:29.840 | it feels like a lot of the time with programming,
02:23:31.280 | there are two ways you can go about it.
02:23:33.560 | One is like, you think really hard, carefully upfront
02:23:37.240 | about the best possible way to do it.
02:23:39.760 | And then you spend your limited time of engineering
02:23:42.160 | to actually implement it.
02:23:43.480 | But I much prefer just getting in the code
02:23:46.080 | and like, taking a crack at it,
02:23:47.720 | seeing how it kind of lays out,
02:23:49.920 | and then iterating really quickly on that.
02:23:52.680 | That feels more fun.
02:23:55.880 | - Yeah, like you're speaking to,
02:23:57.240 | generating the boilerplate is great.
02:23:59.320 | So you just focus on the difficult design,
02:24:01.840 | nuanced, difficult design decisions.
02:24:04.320 | Migration, I feel like this is a cool one.
02:24:07.960 | Like, it seems like large language models
02:24:09.520 | are able to basically translate
02:24:11.200 | from one programming language to another,
02:24:12.560 | or like translate, like migrate,
02:24:15.400 | in the general sense of what migrate is.
02:24:17.400 | But that's in the current moment.
02:24:20.720 | So I mean, the fear has to do with like,
02:24:22.640 | okay, as these models get better and better,
02:24:24.920 | then you're doing less and less creative decisions.
02:24:27.120 | And is it going to kind of move to a place
02:24:28.880 | where it's, you're operating in the design space
02:24:33.040 | of natural language,
02:24:34.000 | where natural language is the main programming language.
02:24:37.320 | And I guess I could ask that by way of advice.
02:24:39.280 | Like, if somebody is interested in programming now,
02:24:41.520 | what do you think they should learn?
02:24:43.240 | Like, do they, you guys started in some Java,
02:24:47.320 | and I forget the, oh, some PHP.
02:24:53.000 | - Objective C.
02:24:54.120 | - Objective C.
02:24:54.960 | There you go.
02:24:56.320 | I mean, in the end,
02:24:57.160 | we all know JavaScript is going to win.
02:24:58.960 | (laughs)
02:25:01.040 | And not TypeScript.
02:25:02.440 | It's just, it's going to be like vanilla JavaScript.
02:25:04.680 | It's just going to eat the world,
02:25:06.840 | and maybe a little bit of PHP.
02:25:08.360 | And I mean, it also brings up the question of like,
02:25:10.800 | I think Don Knuth has this idea
02:25:14.040 | that some percent of the population is geeks.
02:25:16.680 | And like, there's a particular kind of psychology in mind
02:25:20.280 | required for programming.
02:25:22.200 | And it feels like, more and more,
02:25:25.760 | the kind of person that can do
02:25:27.680 | great programming might expand.
02:25:30.920 | - I think different people do programming
02:25:34.920 | for different reasons.
02:25:36.600 | But I think the true, maybe like the best programmers
02:25:39.760 | are the ones that really love,
02:25:43.400 | just like absolutely love programming.
02:25:46.560 | For example, there are folks in our team
02:25:48.400 | who literally when they get back from work,
02:25:53.400 | they go and then they boot up Cursor,
02:25:58.440 | and then they start coding on their side projects
02:26:00.600 | for the entire night.
02:26:01.440 | And they stay up till 3 a.m. doing that.
02:26:03.360 | And when they're sad, they say,
02:26:06.680 | "I just really need to code."
02:26:09.600 | (laughs)
02:26:11.160 | And I think like, you know,
02:26:14.000 | there's that level of programmer
02:26:15.480 | where like this obsession and love of programming,
02:26:18.040 | I think makes really the best programmers.
02:26:22.920 | And I think these types of people
02:26:24.400 | will really get into the details of how things work.
02:26:29.400 | - I guess the question I'm asking,
02:26:30.720 | that exact problem, let's think about that person.
02:26:33.560 | When the super tab, the super awesome,
02:26:37.640 | praise be the tab, succeeds,
02:26:40.400 | and you keep pressing tab.
02:26:42.400 | - That person in the team loves Cursor tab
02:26:44.560 | more than anybody else.
02:26:45.800 | - Yeah, and it's also not just, like,
02:26:48.240 | pressing tab; "just pressing tab"
02:26:50.600 | is, like, the easy way to say it,
02:26:51.840 | the catchphrase, you know?
02:26:54.440 | But what you're actually doing when you're pressing tab
02:26:56.600 | is that you're injecting intent
02:26:59.880 | all the time while you're doing it.
02:27:02.360 | Sometimes you're rejecting it,
02:27:03.440 | sometimes you're typing a few more characters.
02:27:05.960 | And that's the way that you're sort of shaping
02:27:10.800 | the things that's being created.
02:27:12.200 | And I think programming will change a lot
02:27:14.920 | to just what is it that you want to make?
02:27:17.680 | - It's sort of higher bandwidth.
02:27:18.880 | The communication to the computer
02:27:20.200 | just becomes higher and higher bandwidth
02:27:21.760 | whereas just typing is much lower bandwidth
02:27:25.800 | than communicating intent.
02:27:27.760 | - I mean, this goes to your manifesto
02:27:31.280 | titled Engineering Genius.
02:27:33.840 | We are an applied research lab
02:27:35.840 | building extraordinarily productive human-AI systems.
02:27:38.480 | So speaking to this like hybrid element.
02:27:41.640 | To start, we're building the engineer of the future,
02:27:44.880 | a human AI programmer.
02:27:47.080 | That's an order of magnitude more effective
02:27:48.800 | than any one engineer.
02:27:50.720 | This hybrid engineer will have effortless control
02:27:53.240 | over their code base and no low entropy keystrokes.
02:27:56.920 | They will iterate at the speed of their judgment,
02:27:59.880 | even in the most complex systems.
02:28:02.160 | Using a combination of AI and human ingenuity,
02:28:05.280 | they will outsmart and out-engineer
02:28:07.440 | the best pure AI systems.
02:28:09.640 | We are a group of researchers and engineers.
02:28:12.080 | We build software and models to invent
02:28:14.480 | at the edge of what's useful and what's possible.
02:28:16.240 | Our work has already improved the lives
02:28:18.560 | of hundreds of thousands of programmers.
02:28:21.240 | And on the way to that,
02:28:22.880 | we'll at least make programming more fun.
02:28:24.880 | So thank you for talking today.
02:28:26.720 | - Thank you.
02:28:27.540 | - Thanks for having us.
02:28:28.380 | - Thank you. - Thank you.
02:28:29.960 | - Thanks for listening to this conversation
02:28:31.360 | with Michael, Sualeh, Arvid, and Aman.
02:28:34.640 | To support this podcast,
02:28:35.640 | please check out our sponsors in the description.
02:28:38.240 | And now let me leave you with a random funny
02:28:42.600 | and perhaps profound programming quote I saw on Reddit.
02:28:45.560 | Nothing is as permanent as a temporary solution that works.
02:28:51.120 | Thank you for listening and hope to see you next time.
02:28:55.340 | (upbeat music)
02:28:57.920 | (upbeat music)