Cursor Team: Future of Programming with AI | Lex Fridman Podcast #447
Chapters
0:00 Introduction
0:59 Code editor basics
3:09 GitHub Copilot
10:27 Cursor
16:54 Cursor Tab
23:08 Code diff
31:20 ML details
36:54 GPT vs Claude
43:28 Prompt engineering
50:54 AI agents
64:51 Running code in background
69:31 Debugging
74:58 Dangerous code
86:09 Branching file systems
89:20 Scaling challenges
103:32 Context
108:39 OpenAI o1
120:01 Synthetic data
123:48 RLHF vs RLAIF
125:34 Fields Medal for AI
128:17 Scaling laws
137:06 The future of programming
00:00:01.760 |
with the founding members of the Cursor team, 00:00:04.520 |
Michael Truell, Sualeh Asif, Arvid Lundmark, and Aman Sanger. 00:00:14.260 |
that has a lot of powerful features for AI-assisted coding. 00:00:17.940 |
It has captivated the attention and excitement 00:00:23.900 |
So I thought this is an excellent opportunity 00:00:26.760 |
to dive deep into the role of AI in programming. 00:00:33.040 |
that is bigger than just about one code editor. 00:00:38.440 |
and in general, the future of human-AI collaboration 00:00:52.160 |
And now, dear friends, here's Michael, Sualeh, Arvid, 00:01:17.320 |
the place where you text edit a formal programming language. 00:01:23.640 |
is like a really souped-up word processor for programmers, 00:01:31.440 |
And so the quote-unquote word processor, the code editor, 00:01:37.280 |
that word processors sort of in the writing space 00:01:39.680 |
haven't been able to do for people editing text there. 00:01:49.120 |
to letting you navigate around the code base, 00:01:51.000 |
sort of like you're navigating around the internet 00:01:53.280 |
Going to sort of definitions of things you're using, 00:01:55.680 |
to error checking, to catch rudimentary bugs. 00:02:00.280 |
And so traditionally, that's what a code editor has meant. 00:02:10.040 |
is going to change a lot over the next 10 years 00:02:16.800 |
- I think also a code editor should just be fun. 00:02:22.240 |
And it's actually sort of an underrated aspect 00:02:30.280 |
and then we try them out, we do an experiment 00:02:47.080 |
- Like fundamentally, I think one of the things 00:02:50.920 |
that draws a lot of people to building stuff on computers 00:03:16.280 |
It'd be interesting to get your kind of explanation 00:03:29.960 |
and how did that lead to your journey with Cursor? 00:03:47.360 |
it was around the time that Copilot came out, 00:03:57.920 |
the only code editor in which it was available. 00:04:07.440 |
was more than good enough to convince me to switch. 00:04:14.680 |
- And maybe we should explain what Copilot does. 00:04:29.360 |
you know, like when you have a close friendship 00:04:34.040 |
Like when it's done well, there's an intimate feeling. 00:04:37.320 |
There's probably a better word than intimate, 00:04:52.240 |
the feeling that it gets me overpowers that it doesn't. 00:04:55.160 |
- And I think actually one of the underrated aspects 00:04:57.080 |
of GitHub Copilot is that even when it's wrong, 00:04:59.320 |
it's like a little bit annoying, but it's not that bad 00:05:05.800 |
or you type another character and then it gets you. 00:05:11.840 |
I mean, the other underrated part of Copilot for me 00:05:18.000 |
So the first language model consumer product. 00:05:21.440 |
- So Copilot was kind of like the first killer app for LLMs. 00:05:34.160 |
- So around 2020, the scaling laws papers came out 00:05:46.040 |
it looks like you can make these models a lot better 00:05:49.720 |
- By the way, we'll probably talk for three to four hours 00:05:56.000 |
- But just to summarize, it's a paper and a set of papers 00:05:59.520 |
and a set of ideas that say bigger might be better 00:06:05.720 |
- It's bigger and better, but predictably better. 00:06:13.080 |
there were like a lot of conceptual conversations 00:06:18.520 |
for all these different knowledge worker fields 00:06:25.160 |
And then I think there were a couple of moments 00:06:27.840 |
where like the theoretical gains predicted in that paper 00:06:37.040 |
if you wanted to work on, do useful work in AI, 00:06:40.320 |
actually felt like now there was this whole set of systems 00:06:47.160 |
which was playing with the early bit of Copilot, 00:07:02.400 |
and the step-up in capabilities felt enormous. 00:07:06.960 |
we had been working on a couple of different projects. 00:07:08.780 |
We had been, because of Copilot, because of scaling laws, 00:07:13.040 |
because of our prior interest in the technology, 00:07:15.060 |
we had been tinkering around with tools for programmers, 00:07:20.500 |
So, we were building tools for financial professionals 00:07:27.120 |
can you do static analysis with these models? 00:07:31.240 |
look, that really made concrete the theoretical gains 00:07:46.520 |
this wasn't just gonna be a point solution thing. 00:08:05.140 |
and there's a competition in the U.S. called the Putnam, 00:08:48.020 |
it was like, even though I sort of believed in progress, 00:09:01.100 |
but that was maybe the most prescient bet in the group. 00:09:17.800 |
- And before, Aman had this like scaling laws T-shirt 00:09:22.360 |
where it had the like charts and like the formulas on it. 00:09:25.240 |
- So you like felt the AGI or you felt the scaling laws? 00:09:30.100 |
there was this one conversation I had with Michael, 00:09:46.360 |
And I think I went through like the stages of grief. 00:09:49.440 |
There is anger, denial, and then finally at the end, 00:10:05.700 |
I think it also depends on like which domains 00:10:09.840 |
because especially like formal theorem proving, 00:10:15.340 |
of actually verifying if the thing was correct. 00:10:27.680 |
- Okay, so can we take it all the way to Cursor? 00:10:34.560 |
And VS Code is one of the most popular editors 00:10:59.880 |
it's not enough to just write an extension for your VS Code, 00:11:02.880 |
because there's a lot of limitations to that. 00:11:07.340 |
if AI is gonna keep getting better, better, better, 00:11:11.080 |
how the AI is gonna be part of the editing process. 00:11:16.700 |
and start to build a lot of the amazing features 00:11:30.320 |
What was the decision like to just fork VS Code? 00:11:37.880 |
for at least what we wanted to do and achieve. 00:11:40.440 |
Because when we started working on the editor, 00:11:42.380 |
the idea was these models are gonna get much better, 00:11:45.520 |
and it's gonna entirely change how you build software. 00:11:47.680 |
Both in a, you will have big productivity gains, 00:11:49.960 |
but also radical in how like the act of building software 00:11:58.180 |
if you're a plugin to an existing coding environment. 00:12:01.580 |
And we didn't wanna get locked in by those limitations. 00:12:04.940 |
We wanted to be able to just build the most useful stuff. 00:12:10.340 |
you know, VS Code is kind of with Copilot a competitor. 00:12:23.000 |
that is quite interesting, perhaps quite unique, 00:12:29.760 |
maybe there's kind of one major thing that happened 00:12:34.200 |
But every single year, every single model capability 00:12:43.560 |
things that are possible, especially in programming. 00:12:53.380 |
makes your product much, much, much more useful. 00:12:57.780 |
will need to make the cursor of today look obsolete. 00:13:00.880 |
And I think, you know, Microsoft has done a number 00:13:08.340 |
to really keep innovating and pushing on this 00:13:24.060 |
- I don't know if I think of it in terms of features 00:13:26.060 |
as I think of it in terms of like capabilities 00:13:34.900 |
and I'm sure there are going to be more models 00:13:37.280 |
of different types, like longer context and maybe faster. 00:13:40.700 |
Like there's all these crazy ideas that you can try 00:13:47.740 |
will make it into something kind of cool and useful. 00:14:00.820 |
you really felt this frustration that, you know, models, 00:14:27.660 |
I'm one of these people that really want to try 00:14:45.760 |
- Yeah, I think one thing that I think helps us 00:14:59.680 |
like how we actually make the model give better answers. 00:15:06.960 |
And for a Cursor tab, like how do you train the model? 00:15:27.340 |
So you can create things that are sort of not possible 00:15:30.980 |
if you're not talking, you're not experimenting. 00:15:34.340 |
- And you're using, like you said, Cursor to write Cursor. 00:15:38.780 |
- Well, let's talk about some of these features. 00:15:46.000 |
You know, auto-complete on steroids, basically. 00:15:53.180 |
- To highlight and summarize at a high level, 00:16:01.400 |
but two things that it helps programmers with. 00:16:04.800 |
One is this idea of looking over your shoulder 00:16:23.240 |
But you can make that concept even more ambitious 00:16:26.120 |
by not just predicting the characters after your Cursor, 00:16:29.640 |
but actually predicting the next entire change 00:16:35.200 |
And the second thing Cursor is pretty good at right now too 00:16:40.200 |
is helping you sometimes jump ahead of the AI 00:16:42.680 |
and tell it what to do and go from instructions to code. 00:16:47.120 |
And on both of those, we've done a lot of work 00:16:48.560 |
on making the editing experience for those things ergonomic 00:16:56.240 |
was we wanted the model to be able to edit code for us. 00:17:11.680 |
to make the inference fast for having a good experience. 00:17:22.560 |
I mean, Michael sort of mentioned this like ability 00:17:32.340 |
it's like, man, it should be just really obvious 00:17:39.960 |
the model should just know that like the next place to go to 00:17:54.040 |
And then so the idea was you just pressed tab, 00:17:58.080 |
and then show you the next edit and you would press tab. 00:18:01.720 |
So it was just you, as long as you could keep pressing tab. 00:18:10.560 |
more sort of abstractly the thing to think about 00:18:14.920 |
is sort of like, how are the edits sort of zero entropy? 00:18:33.080 |
Then maybe the model should just sort of read your mind 00:18:35.800 |
and all the zero entropy bits should just be like 00:18:43.800 |
if you look at language model loss on different domains, 00:18:49.360 |
which is kind of character normalized loss for code 00:18:53.400 |
is lower than language, which means in general, 00:18:58.840 |
A lot of characters that are super predictable. 00:19:03.040 |
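(The "character-normalized loss" referred to here is usually reported as bits per byte; as a reference formula, not a quote from the conversation:)

$$ \text{bits per byte} \;=\; \frac{1}{N_{\text{bytes}}} \sum_{i=1}^{N_{\text{tokens}}} -\log_2 p_\theta\!\left(t_i \mid t_{<i}\right) $$

A lower value on code than on natural language simply means the model finds the average next character in code easier to predict.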
when you're not just trying to auto-complete code, 00:19:05.560 |
but predicting what the user is going to do next 00:19:19.680 |
let's just jump you forward in time, skip you forward. 00:19:27.520 |
That jump, that's not so intuitive, I think, to people. 00:19:38.480 |
So you need to train small models on this task. 00:19:43.160 |
In particular, they're incredibly pre-fill token hungry. 00:19:49.840 |
really long prompts where they see a lot of your code 00:19:52.600 |
and they're not actually generating that many tokens. 00:19:54.880 |
And so the perfect fit for that is using a sparse model, 00:20:06.280 |
The other being a variant of speculative decoding 00:20:10.000 |
that we kind of built out called speculative edits. 00:20:15.320 |
of what make it quite high quality and very fast. 00:20:25.800 |
- Okay, so what else can you say about how to make, 00:20:28.560 |
does caching play a role in this particular-- 00:20:32.840 |
Because you're dealing with this many input tokens, 00:20:44.120 |
you're just going to, one, significantly degrade latency, 00:20:49.880 |
So you need to design the actual prompts you use 00:20:53.800 |
for the model such that they're caching aware. 00:20:57.040 |
And then, yeah, you need to reuse the KV cache 00:20:59.840 |
across requests just so that you're spending less work, 00:21:04.440 |
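(One way to read "caching aware" is sketched below; this is a generic illustration, not Cursor's actual prompt layout. The idea is to keep slowly changing context in a stable prefix, so its KV cache can be reused across keystrokes, and to append the volatile parts last.)

```python
# Minimal sketch of a cache-aware prompt layout (illustrative only; the exact
# pieces Cursor puts in its prompts are not specified in the conversation).
def build_prompt(system_rules: str, file_context: str,
                 recent_edits: list[str], cursor_window: str) -> str:
    # Stable-to-volatile ordering: the long, slowly changing context goes first
    # so its KV cache can be reused across requests; the text around the cursor,
    # which changes on every keystroke, goes last, so only a short suffix has
    # to be re-prefilled each time.
    stable_prefix = "\n".join([system_rules, file_context])
    volatile_suffix = "\n".join(recent_edits + [cursor_window])
    return stable_prefix + "\n" + volatile_suffix
```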
- Again, what are the things that TAB is supposed 00:21:21.600 |
and then jump to different locations inside the same file? 00:21:40.680 |
Like sometimes you need to run a command in the terminal, 00:21:54.080 |
but it's hard for you to know if it's correct, 00:21:57.120 |
because you actually need some more information to learn. 00:22:02.720 |
And so maybe it should actually take you to a place 00:22:50.000 |
not always, but sometimes the next five minutes, 00:22:57.160 |
by you disengaging and it taking you through, 00:22:59.440 |
or maybe a little bit more of just you seeing next step, 00:23:03.760 |
okay, that's good, that's good, that's good, that's good. 00:23:14.880 |
there's this whole diff interface situation going on. 00:23:17.760 |
So like the model suggests with the red and the green 00:23:22.560 |
of like, here's how we're gonna modify the code. 00:23:27.280 |
and it shows you the diff and you can accept the diff. 00:23:29.880 |
So maybe can you speak to whatever direction of that? 00:23:37.480 |
So we have optimized the diff for the autocomplete. 00:23:42.800 |
than when you're reviewing larger blocks of code. 00:23:47.680 |
And then we're trying to optimize another diff thing 00:23:50.720 |
for when you're doing multiple different files 00:23:53.240 |
and sort of at a high level, the difference is 00:24:02.520 |
Actually, it should be really fast to read in all situations 00:24:08.560 |
you're really like your eyes focused in one area. 00:24:13.560 |
the humans can't look in too many different places. 00:24:15.400 |
- So you're talking about on the interface side? 00:24:28.680 |
- You can maybe show it if we pull it up on cursor.com. 00:24:33.480 |
- So that box, it was like three or four different attempts 00:24:40.760 |
Where first attempt was like this blue crossed out line. 00:25:06.320 |
Then the next iteration of it, which is sort of funny, 00:25:09.040 |
you would hold, on Mac, the option button. 00:25:14.040 |
So it would sort of highlight a region of code 00:25:17.080 |
to show you that there might be something coming. 00:25:21.720 |
like the input and the value would all get blue. 00:25:30.200 |
So instead of directly showing you the thing, 00:25:34.600 |
it would just hint that the AI had a suggestion. 00:25:49.480 |
but you have to know to hold the option button. 00:25:51.960 |
- So by the way, I'm not a Mac user, but I got it. 00:26:08.200 |
for making a lot of improvements in this area. 00:26:13.200 |
Like we often talk about it as the verification problem, 00:26:21.680 |
For large edits, or like when it's multiple files 00:26:24.920 |
or something, it's actually a little bit prohibitive 00:26:32.520 |
And so there are like a couple of different ideas here. 00:26:36.400 |
Like one idea that we have is, okay, you know, 00:26:42.400 |
And then parts of the diff are just very low entropy. 00:26:46.480 |
They're like the same thing over and over again. 00:26:49.520 |
And so maybe you can highlight the important pieces 00:26:52.640 |
and then gray out the not so important pieces. 00:26:55.200 |
Or maybe you can have a model that looks at the diff 00:27:00.960 |
I will like mark this with a little red squiggly 00:27:03.920 |
and say like, you should probably like review 00:27:07.680 |
And ideas in that vein, I think are exciting. 00:27:15.960 |
So you're basically trying to guide the human programmer 00:27:20.840 |
through all the things they need to read and nothing more. 00:27:25.280 |
- Yeah, and you want an intelligent model to do it. 00:27:27.880 |
Like currently diff algorithms are, they're like, 00:27:38.000 |
There's like intelligence that went into designing 00:27:42.320 |
you don't care if it's about this thing or this thing, 00:27:50.320 |
Man, these models are going to get much smarter. 00:27:55.920 |
the changes they will be able to propose are much bigger. 00:27:58.880 |
So as the changes gets bigger and bigger and bigger, 00:28:08.880 |
It's sort of, I don't want to spend all my time 00:28:12.560 |
- Can you say a little more about diffs across multiple files? 00:28:20.040 |
- Yeah, I mean, so GitHub tries to solve this, right? 00:28:26.080 |
you're reviewing multiple diffs across multiple files. 00:28:32.080 |
I think you can do much better than code review. 00:28:36.960 |
Like you spend a lot of time trying to grok this code 00:28:42.200 |
And it often like doesn't even actually catch that many bugs. 00:28:50.240 |
that review experience using language models, 00:28:54.000 |
that Arvid had described of maybe pointing you 00:29:12.160 |
for both the reviewer and the person that produced the code. 00:29:16.240 |
In the case where the person that produced the code 00:29:20.040 |
you don't have to care that much about their experience. 00:29:22.160 |
And you can design the entire thing around the reviewer 00:29:43.120 |
- Just one idea there is I think ordering matters. 00:29:50.120 |
and you're reviewing them from top to bottom, 00:29:52.320 |
but actually you actually want to understand this part first 00:29:57.560 |
And then you want to understand the next part. 00:29:59.040 |
And you don't want to have to figure out that yourself. 00:30:02.680 |
You want a model to guide you through the thing. 00:30:05.360 |
- And is the step of creation going to be more 00:30:15.560 |
that all of programming will be natural language. 00:30:18.360 |
And the reason for that is if I'm pair programming 00:30:21.960 |
with Sualeh and Sualeh is at the computer and the keyboard, 00:30:24.440 |
and sometimes if I'm driving, I want to say to Sualeh, 00:30:49.600 |
Sometimes the easiest way to communicate with AI 00:30:52.680 |
and then it goes and does the thing everywhere else. 00:30:54.920 |
Or sometimes if you're making a website, for example, 00:30:57.760 |
the easiest way to show to the AI what you want 00:31:12.760 |
And so I think natural language will have a place. 00:31:20.560 |
- I'm really feeling the AGI with this editor. 00:31:24.040 |
It feels like there's a lot of machine learning 00:31:27.760 |
Tell me about some of the ML stuff that makes it all work. 00:31:31.200 |
- Well, Cursor really works via this ensemble 00:31:34.600 |
of custom models that we've trained alongside 00:31:40.800 |
And so Cursor tab, for example, is a great example 00:31:43.840 |
of where you can specialize this model to be even better 00:31:47.520 |
If you look at evals on the task we set it at. 00:31:50.360 |
The other domain, which it's kind of surprising 00:31:54.200 |
but it's kind of necessary and works quite well is in apply. 00:31:58.080 |
So I think these models are like the frontier models 00:32:03.000 |
are quite good at sketching out plans for code 00:32:05.200 |
and generating like rough sketches of like the change, 00:32:13.080 |
for frontier models, for your training models. 00:32:15.440 |
Like you try to do this with Sonnet, with O1, 00:32:21.200 |
any frontier model, and it really messes up stupid things 00:32:31.840 |
is we let the model kind of sketch out this rough code block 00:32:37.920 |
And we train a model to then apply that change to the file. 00:32:55.080 |
of combining the two, you're saying is not so trivial. 00:33:04.160 |
I think like you see shallow copies of apply elsewhere, 00:33:18.160 |
And that just results in a terrible product experience. 00:33:26.160 |
you are going to get smarter and smarter models. 00:33:29.080 |
And like, so one other thing that apply lets you do 00:33:47.160 |
and then have your small models go and implement it 00:34:10.800 |
that is kind of recursively applied by Sonnet 00:34:16.640 |
- Maybe we should talk about how to make it fast. 00:34:19.800 |
- I feel like fast is always an interesting detail. 00:34:25.120 |
- Yeah, so one big component of making it fast 00:34:29.960 |
So speculative edits are a variant of speculative decoding. 00:34:33.080 |
And maybe it'd be helpful to briefly describe 00:34:39.240 |
what you do is you can kind of take advantage 00:34:41.960 |
of the fact that, you know, most of the time, 00:34:47.640 |
when you're memory bound in language model generation. 00:34:55.920 |
it is faster than generating one token at a time. 00:35:17.160 |
that your larger model will then go in and verify. 00:35:25.920 |
And that prior is literally the same exact code. 00:35:29.600 |
So what you can do is you could just feed chunks 00:35:35.000 |
And then the model will just pretty much agree 00:35:40.760 |
And so you can process all of those lines in parallel. 00:35:43.480 |
And you just do this with sufficiently many chunks. 00:35:45.320 |
And then eventually you'll reach a point of disagreement 00:35:47.720 |
where the model will now predict text that is different 00:35:54.640 |
And then we kind of will decide after enough tokens 00:35:57.160 |
match the original code to restart speculating 00:36:04.960 |
is just a much faster version of normal editing code. 00:36:21.920 |
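(A rough Python sketch of the speculative-edits loop as described, not Cursor's actual implementation. `score_chunk(ctx, draft)` and `next_token(ctx)` are hypothetical stand-ins for the model: `score_chunk` returns the model's greedy prediction at every draft position in a single parallel forward pass.)

```python
def speculative_edit(next_token, score_chunk, prompt, original,
                     chunk=32, realign=8, max_detour=256, eos=None):
    """Rewrite `original` guided by `prompt`, speculating that most output
    tokens will simply copy the original code verbatim."""
    out, ctx, i = [], list(prompt), 0
    while i < len(original):
        draft = original[i:i + chunk]
        predicted = score_chunk(ctx, draft)  # one parallel pass per chunk
        # Accept the longest prefix where the model agrees with the original.
        n = 0
        while n < len(draft) and predicted[n] == draft[n]:
            n += 1
        out += draft[:n]; ctx += draft[:n]; i += n
        if n == len(draft):
            continue  # the whole chunk was copied, keep speculating
        # Disagreement: the model is editing here. Decode one token at a time
        # until the last `realign` generated tokens reappear verbatim in the
        # original file, then jump back to speculating from that point.
        fresh = []
        for _ in range(max_detour):
            tok = next_token(ctx)
            if tok == eos:
                return out + fresh
            fresh.append(tok); ctx.append(tok)
            if len(fresh) >= realign:
                j = find(original, fresh[-realign:], start=i)
                if j >= 0:
                    i = j + realign  # re-anchor to the original code
                    break
        else:
            return out + fresh  # never realigned; a real system keeps decoding
        out += fresh
    return out

def find(seq, pattern, start=0):
    """Index of the first contiguous occurrence of `pattern` in `seq[start:]`, or -1."""
    for k in range(start, len(seq) - len(pattern) + 1):
        if seq[k:k + len(pattern)] == pattern:
            return k
    return -1
```

The speedup comes from the `score_chunk` calls: most of the file is accepted in big parallel chunks instead of one token per forward pass.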
- And then the advantage is that while it's streaming, 00:36:24.600 |
you can just also start reviewing the code before it's done. 00:36:36.440 |
- So the human can start reading before the thing is done. 00:36:39.520 |
- I think the interesting riff here is something like, 00:36:42.120 |
like speculation is a fairly common idea nowadays. 00:36:47.120 |
I mean, there's obviously speculation in CPUs 00:36:54.680 |
- Well, let me ask this sort of the ridiculous question 00:36:59.960 |
GPT, Claude, who wins in the context of programming? 00:37:05.920 |
because it sounds like every single part of this 00:37:29.080 |
ability to process lots of code, long context, 00:37:35.560 |
The one that I'd say right now is just kind of net best 00:37:44.960 |
So if you give it really hard programming interview 00:37:53.280 |
But it doesn't feel like it kind of understands 00:38:00.400 |
Like if you look at a lot of the other frontier models, 00:38:14.440 |
relative to kind of everything that's kind of in the middle. 00:38:19.240 |
and things that are in the distribution of the benchmarks 00:38:21.320 |
they're evaluated on, you know, they'll do really well, 00:38:23.360 |
but when you push them a little bit outside of that, 00:38:25.680 |
Sonnet's I think the one that kind of does best 00:38:31.480 |
Like you kind of have the same capability in the benchmark 00:38:33.480 |
as when you try to instruct it to do anything with coding. 00:38:39.400 |
is the difference between the normal programming experience 00:38:44.920 |
Like where do benchmarks fall short, do you think, 00:38:49.280 |
- By the way, that's like a really, really hard, 00:38:58.640 |
Where real coding, it's not interview style coding, 00:39:06.720 |
humans are saying like half broken English sometimes, 00:39:10.400 |
and sometimes you're saying like, oh, do what I did before. 00:39:14.640 |
Sometimes you're saying, you know, go add this thing 00:39:21.880 |
And then, you know, it's just like a lot of things 00:39:33.240 |
maybe the way to put it is sort of abstractly 00:39:35.560 |
is the interview problems are very well specified. 00:39:51.960 |
is both complicated by what Sualeh just mentioned. 00:40:01.800 |
between what can you actually model in a benchmark 00:40:05.680 |
And that can be sometimes hard to encapsulate 00:40:07.360 |
because it's like real programming is like very messy 00:40:10.240 |
and sometimes things aren't super well specified, 00:40:23.240 |
to also get the data from the public benchmarks 00:40:28.200 |
And so for instance, like one of the most popular 00:40:36.560 |
in the training data of these foundation models. 00:40:42.360 |
you actually don't give them the context of a code base. 00:40:44.120 |
They can like hallucinate the right file paths. 00:40:45.840 |
They can hallucinate the right function names. 00:40:58.640 |
And maybe the labs will start to do a better job 00:41:05.360 |
but they're not going to emit the actual training data 00:41:09.760 |
Like these are all like some of the most popular 00:41:11.600 |
Python repositories, like SymPy is one example. 00:41:14.720 |
I don't think they're going to handicap their models 00:41:17.680 |
on SymPy and all these popular Python repositories 00:41:20.240 |
in order to get true evaluation scores in these benchmarks. 00:41:24.120 |
- I think that given the dearth in benchmarks, 00:41:27.280 |
there have been like a few interesting crutches 00:41:30.320 |
that places that build systems with these models 00:41:35.000 |
to get a sense of are they going the right direction or not? 00:41:39.400 |
people will actually just have humans play with the things 00:41:44.160 |
Like one or two of the foundation model companies, 00:41:45.920 |
they have people who, that's a big part of their role. 00:41:49.000 |
And internally we also qualitatively assess these models 00:41:54.040 |
in addition to like private evals that we have. 00:42:08.120 |
like just like reading online forums and Reddit and X, 00:42:12.520 |
just like, well, I don't know how to properly load 00:42:17.520 |
in people's opinions 'cause they'll say things like, 00:42:20.640 |
I feel like Claude or GPT's gotten dumber or something. 00:42:29.860 |
but I wonder if it's the model's problem or mine. 00:42:34.000 |
- Yeah, with Claude, there's an interesting take I heard 00:42:41.560 |
and I suspect they have slightly different numerics 00:42:47.000 |
And someone speculated that Claude's degraded performance 00:42:51.320 |
had to do with maybe using the quantized version 00:42:56.040 |
versus whatever was running on Anthropic's GPUs. 00:43:00.780 |
- I interview a bunch of people that have conspiracy theories, 00:43:03.000 |
so I'm glad you spoke to this conspiracy theory. 00:43:05.680 |
- Well, it's not like a conspiracy theory as much. 00:43:09.420 |
They're just, they're like, they're, you know, 00:43:14.520 |
and, you know, you're doing like this crazy amount of flops 00:43:19.200 |
and that chips are messy and man, you can just have bugs. 00:43:27.880 |
- What's the role of a good prompt in all of this? 00:43:34.600 |
have really structured, well-formulated prompts. 00:43:39.400 |
What should a human be doing to maximize success? 00:43:46.580 |
you wrote a blog post on, you called it prompt design. 00:43:50.780 |
- Yeah, I think it depends on which model you're using 00:43:57.380 |
and they respond differently to different prompts. 00:44:05.000 |
and the original sort of pre-developed models last year, 00:44:10.560 |
and they also had a very small context window. 00:44:13.600 |
And so we have all of these pieces of information 00:44:16.600 |
around the code base that would maybe be relevant 00:44:26.640 |
how do you decide what you actually put in the prompt 00:44:37.920 |
It means that sometimes the model actually gets confused 00:44:40.540 |
and some models get more confused than others. 00:45:18.100 |
if you're making like designing a print magazine, 00:45:19.920 |
you have like, you know exactly where you can put stuff. 00:45:22.160 |
But when you have a website or when you have a prompt, 00:45:25.800 |
and then you need to format them to always work. 00:45:31.280 |
And so the idea was, okay, let's take some inspiration. 00:45:53.120 |
or like this has higher Z index than something else. 00:45:56.480 |
And then you have this rendering engine in web design. 00:46:07.140 |
And as you declare it, it will decide what you want, 00:46:11.760 |
And so we have found that to be quite helpful. 00:46:14.540 |
And I think the role of it has sort of shifted over time, 00:46:27.760 |
that goes into the prompt and the actual rendering of it. 00:46:33.200 |
because you can change the rendering of the prompt 00:46:37.840 |
because you have the raw data that went into the prompt. 00:46:50.520 |
There are components, like we have one component 00:46:52.320 |
that's a file component and it takes in like the cursor, 00:47:00.840 |
And that's like probably the most important line 00:47:13.160 |
it'd figure out how many lines can actually fit 00:47:22.920 |
you could use retrieval and things like embedding 00:47:35.720 |
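(A toy sketch of the declarative, priority-based prompt rendering idea described above; the names are made up for illustration and this is not the team's actual library. Components declare a priority, and the renderer drops the lowest-priority pieces until the prompt fits the context window, much like a responsive layout adapting to the available space.)

```python
from dataclasses import dataclass

@dataclass
class Component:
    text: str
    priority: int  # higher = more important, e.g. the line under the cursor

def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

def render(components: list[Component], budget: int) -> str:
    # Keep the highest-priority components that fit the token budget, but emit
    # them in their declared order so the prompt still reads naturally.
    by_priority = sorted(range(len(components)),
                         key=lambda i: -components[i].priority)
    kept, used = set(), 0
    for i in by_priority:
        cost = count_tokens(components[i].text)
        if used + cost <= budget:
            kept.add(i)
            used += cost
    return "\n".join(components[i].text
                     for i in range(len(components)) if i in kept)
```

Retrieval or re-ranking scores can then feed directly into these priorities, so the most important line always survives when the window gets tight.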
Like would it be beneficial to write JSX in the prompt 00:47:39.960 |
or the whole idea is you should be loose and messy? 00:47:45.840 |
you should just do whatever is the most natural thing 00:47:52.880 |
how do we actually like retrieve the relevant thing 00:48:03.080 |
you should let the person be as lazy as he wants. 00:48:10.360 |
But I feel like you're allowed to ask more of programmers. 00:48:19.080 |
There's a kind of tension between just being lazy 00:48:23.360 |
be prompted, almost like the system pressuring you 00:48:33.040 |
- Not in terms of the grammar of the sentences, 00:48:47.120 |
you just are not, not enough intent is conveyed 00:48:51.880 |
And there are like a few ways to resolve that intent. 00:48:54.400 |
One is the simple thing of having the model just ask you, 00:48:58.040 |
I'm not sure how to do these parts based on your query. 00:49:06.200 |
maybe if you, there are five or six possible generations, 00:49:11.200 |
given the uncertainty present in your query so far, 00:49:14.680 |
why don't we just actually show you all of those 00:49:17.600 |
- How hard is it to, for the model to choose to speak, 00:49:27.160 |
It's sort of like how to deal with the uncertainty. 00:50:05.640 |
that the client and the server is super useful. 00:50:19.160 |
And we're still sort of, initial version is rolled out 00:50:23.000 |
and I'm sure we can make it much more accurate. 00:50:29.880 |
do you just want to add this file, this file, this file also 00:50:33.000 |
to tell the model to edit those files for you? 00:50:36.280 |
Because if you're, maybe you're making the API, 00:50:38.960 |
like you should also edit the client and the server 00:50:41.160 |
that is using the API and the other one resolving the API. 00:50:46.240 |
both there's the phase where you're writing a prompt 00:50:52.000 |
maybe we can help resolve some of the uncertainty. 00:50:54.360 |
- To what degree do you use agentic approaches? 00:51:05.040 |
it's like, it resembles sort of like a human, 00:51:12.600 |
because you see a demo where it acts as a human would. 00:51:19.720 |
I think agents are not yet super useful for many things. 00:51:30.520 |
And so I think there are certain types of tasks 00:51:48.720 |
And that's a task that's super well specified. 00:51:58.320 |
And then a day later I come back and I reviewed the thing. 00:52:11.720 |
And this could be a process that takes a long time. 00:52:22.000 |
that agents will take over all of programming. 00:52:32.480 |
or you don't actually want to specify something upfront 00:52:44.800 |
I think you actually want a system that's instant 00:52:47.280 |
that gives you an initial version instantly back 00:52:49.120 |
and then you can iterate super, super quickly. 00:52:56.320 |
that does also like setting up the development environment, 00:52:59.800 |
installing software packages, configuring everything, 00:53:02.320 |
configuring the databases and actually deploying the app? 00:53:06.520 |
- Is that also in the set of things you dream about? 00:53:22.840 |
we want to make the programmer's life easier and more fun. 00:53:35.040 |
you can actually have an agent in the background 00:53:47.040 |
And then when you get to the backend part of your PR, 00:53:50.120 |
then you have some like initial piece of code 00:53:58.520 |
- One of the things we already talked about is speed, 00:54:01.400 |
but I wonder if we can just linger on that some more 00:54:04.600 |
in the various places that the technical details involved 00:54:16.360 |
Like I mentioned, the apply is probably the slowest thing. 00:54:18.480 |
And for me, I'm sorry, the pain on Arvid's face. 00:54:26.000 |
- Yeah, I mean, it says something that feels, 00:54:31.040 |
like one second or two seconds, that feels slow. 00:54:36.800 |
that everything else is just really, really fast. 00:54:42.840 |
how to make the chat fast, how to make the diffs fast? 00:54:53.880 |
And so what you can do is if, as the user's typing, 00:55:10.600 |
reusing the KV cache results in lower latency, 00:55:17.000 |
you can immediately warm the cache with like, 00:55:23.040 |
there's very few tokens it actually has to pre-fill 00:55:41.920 |
that allow transformers to not just independently, 00:55:46.840 |
to not just independently look at each token, 00:55:48.480 |
but see previous tokens, which are the keys and values in attention. 00:55:54.480 |
is you have at your current token, some query, 00:56:04.000 |
that the model stores internally of all the previous tokens 00:56:08.000 |
And like by default, when you're doing a chat, 00:56:15.800 |
do this forward pass through the entire model. 00:56:19.320 |
That's a lot of matrix multiplies that happen. 00:56:32.440 |
for the last N tokens, if I now want to compute 00:56:42.040 |
through the entire model because I already have 00:56:52.120 |
you're reusing those keys and values that have been computed, 00:56:57.680 |
or sequentially dependent part of the transformer. 00:57:06.560 |
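(For reference, this is the standard transformer attention math rather than anything Cursor-specific: at step t the new token's attention output only needs its own query together with all previous keys and values, which is exactly what the KV cache stores.)

$$ \mathrm{attn}(q_t) = \operatorname{softmax}\!\left(\frac{q_t K_{1:t}^{\top}}{\sqrt{d}}\right) V_{1:t}, \qquad K_{1:t} = [K_{1:t-1};\ k_t], \quad V_{1:t} = [V_{1:t-1};\ v_t] $$

With $K_{1:t-1}$ and $V_{1:t-1}$ cached, each new token only requires computing its own $k_t$ and $v_t$ per layer instead of re-running the entire prefix through the model.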
- Yeah, there's other types of caching you can kind of do. 00:57:10.560 |
One interesting thing that you can do for Cursor Tab 00:57:20.600 |
as if the user would have accepted the suggestion 00:57:26.680 |
And so then you've cached, you've done a speculative, 00:57:29.160 |
it's a mix of speculation and caching, right? 00:57:39.600 |
the next one would be waiting for them immediately. 00:58:10.840 |
It's like much higher chance than the user hits 00:58:16.840 |
and we sort of hit something else in the cache. 00:58:28.560 |
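(A toy sketch of that mix of speculation and caching: assume the user accepts, request the next suggestion ahead of time, and serve it from the cache on accept. `request_suggestion` is a hypothetical async call, not a real API.)

```python
import asyncio

async def on_suggestion_shown(request_suggestion, document: str,
                              suggestion: str, cache: dict) -> None:
    # Optimistically assume the user will press tab: compute the document as
    # if they had accepted, and start fetching the *next* suggestion now.
    predicted_doc = document + suggestion
    if predicted_doc not in cache:
        cache[predicted_doc] = asyncio.create_task(request_suggestion(predicted_doc))

async def on_suggestion_accepted(request_suggestion, new_document: str,
                                 cache: dict) -> str:
    task = cache.pop(new_document, None)
    if task is not None:
        return await task  # usually already finished, so it feels instant
    return await request_suggestion(new_document)  # cache miss: normal path
```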
maybe a single sample from the model isn't very good. 00:59:03.040 |
or like which of the key things does the human want. 00:59:11.920 |
we're predicting which of the hundred different suggestions 00:59:16.920 |
the model produces is more amenable for humans. 00:59:20.480 |
Like which of them do humans more like than other things. 00:59:26.400 |
can predict very far ahead versus like a little bit 00:59:35.400 |
and sort of punish the things that it would like 00:59:39.040 |
to output the suggestions that humans would like more. 00:59:40.920 |
You have these like RL loops that are very useful 00:59:50.840 |
But I mean, like technically you tie it back in 00:59:55.880 |
because you can get away with the smaller model 00:59:58.920 |
and it gets the same performances as the bigger one. 01:00:01.480 |
That's like, and Sualeh was mentioning stuff about the KV cache, 01:00:33.600 |
being able to generate the tokens much faster. 01:00:45.280 |
The thing this matters for is now generating tokens. 01:00:54.560 |
by doing these super parallelizable matrix multiplies 01:00:59.040 |
you're bottlenecked by how quickly it's for long context 01:01:04.200 |
by how quickly you can read those cache keys and values. 01:01:12.560 |
We can try to compress the size of these keys and values. 01:01:15.280 |
So multi-query attention is the most aggressive of these, 01:01:41.320 |
With group query, you instead preserve all the query heads 01:01:51.960 |
There are fewer heads for the keys and values, 01:01:57.280 |
is you're just reducing the size of your KV cache. 01:01:57.280 |
is it kind of turns the entirety of your keys and values 01:02:12.280 |
across all your heads into this kind of one latent vector 01:02:16.760 |
that is then kind of expanded inference time. 01:02:19.760 |
- But MLA is from this company called DeepSeek. 01:02:19.760 |
The advantage you get from that is there's less of them, 01:02:43.480 |
but maybe the theory is that you actually want 01:02:55.880 |
one big shared vector for all the keys and values, 01:03:01.360 |
and then you have smaller vectors for every single token 01:03:12.800 |
when you eventually want to compute the final thing, 01:03:17.600 |
which means that you still have some compute left 01:03:21.400 |
And so if you can expand the latent vector back out, 01:03:36.240 |
like the size of the vector that you're keeping. 01:03:51.320 |
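(A back-of-the-envelope version of why fewer key/value heads help, using the standard KV-cache size formula; the model shapes below are made up for illustration.)

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # Keys and values are each (n_layers x n_kv_heads x head_dim) per token,
    # hence the factor of 2; bytes_per_elem=2 assumes fp16/bf16 storage.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative shapes: 32 layers, 128-dim heads, 16k tokens of context.
full_mha = kv_cache_bytes(32, 32, 128, 16_384)  # 32 KV heads -> ~8.6 GB
gqa      = kv_cache_bytes(32, 8, 128, 16_384)   #  8 KV heads -> ~2.1 GB
mqa      = kv_cache_bytes(32, 1, 128, 16_384)   #  1 KV head  -> ~0.27 GB
```

MLA goes further by storing one smaller latent vector per token per layer and expanding it back into keys and values at inference time, trading a bit of extra compute for an even smaller cache.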
- Okay, and all of that is dealing with being memory bound. 01:04:06.640 |
because you have less space allocated for the KV cache. 01:04:06.640 |
which are helpful for reducing the time to first token 01:04:17.280 |
for the reasons that were kind of described earlier. 01:04:20.680 |
when you start doing inference with more and more requests 01:04:28.400 |
in as it's generating the tokens, the speed of that. 01:04:31.720 |
- Would it also allow you to make your prompt bigger 01:04:43.600 |
So you could increase either of those dimensions, right? 01:04:48.120 |
without degrading the latency of generating tokens. 01:04:51.920 |
- Arvid, you wrote a blog post, Shadow Workspace. 01:05:03.680 |
and we're experimenting with a lot of things. 01:05:05.760 |
Right now, we don't have much of that happening 01:05:13.920 |
that goes into your command key prompts, for example. 01:05:17.800 |
if you can actually spend computation in the background, 01:05:42.680 |
and we use it internally for like experiments 01:05:54.840 |
Because otherwise, like you can get higher performance 01:06:04.800 |
is by letting the model iterate and get feedback. 01:06:11.200 |
when you're a programmer is the language server, 01:06:20.080 |
and there's like a separate language server per language. 01:06:29.400 |
and sort of understands the structure of your code. 01:06:31.880 |
So language servers are extensions 01:06:36.280 |
that were developed by, like, the TypeScript people, 01:06:38.160 |
a Rust language server that was developed by the Rust people, 01:06:41.480 |
over the language server protocol to VS Code. 01:06:45.120 |
all of the different languages built into VS Code, 01:06:52.880 |
- It's for linting, it's for going to definition 01:06:56.000 |
and for like seeing the right types that you're using. 01:07:01.400 |
- Yes, type checking and going to references. 01:07:03.960 |
And that's like, when you're working in a big project, 01:07:09.600 |
it's like really hard to code in a big project. 01:07:12.720 |
- Can you say again how that's being used inside Cursor, 01:07:16.320 |
the language server protocol communication thing? 01:07:20.440 |
- So it's being used in Cursor to show to the programmer, 01:07:23.680 |
But then the idea is you want to show that same information 01:07:34.800 |
And so the idea behind the shadow workspace was, 01:07:40.040 |
is we spawn a separate window of Cursor that's hidden. 01:07:45.040 |
And so you can set this flag in Electron and it's hidden. 01:07:48.920 |
There is a window, but you don't actually see it. 01:07:52.680 |
the AI agents can modify code however they want, 01:08:01.720 |
and go to definition and iterate on their code. 01:08:04.080 |
- So like literally run everything in the background, 01:08:22.280 |
so that it exactly mirrors the user's environment. 01:08:25.880 |
And then on Linux, you can do this cool thing 01:08:29.080 |
where you can actually mirror the file system 01:08:35.360 |
And it thinks that it's operating on the file level, 01:08:42.840 |
and you can create this kernel extension to make it work. 01:08:51.920 |
but it's a fun technical problem, so that's why. 01:08:57.400 |
- One maybe hacky, but interesting idea that I like 01:09:02.200 |
And so basically you can then have the language model 01:09:07.360 |
And then instead of you operating in the ground truth 01:09:14.800 |
and these unsaved things that only exist in memory 01:09:16.600 |
that you still get linter errors for and you can code in. 01:09:21.320 |
it's just like, there's a small warning that there's a lock 01:09:31.720 |
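(A minimal sketch of the "unsaved, in-memory files" idea; this is not the shadow workspace implementation, just an overlay that shadows the real files. Anything reading through it, say a language server fed these buffers, sees the agent's edits while the files on disk stay untouched.)

```python
from pathlib import Path

class ShadowOverlay:
    """In-memory edits that shadow files on disk without ever saving them."""

    def __init__(self) -> None:
        self._edits: dict[str, str] = {}  # path -> agent's unsaved contents

    def write(self, path: str, contents: str) -> None:
        self._edits[path] = contents      # never touches the disk

    def read(self, path: str) -> str:
        if path in self._edits:
            return self._edits[path]      # the agent's (and linter's) view
        return Path(path).read_text()     # fall through to the user's file

    def pending_changes(self) -> dict[str, str]:
        # What the user would be shown to accept once the agent is done.
        return {p: c for p, c in self._edits.items()
                if not Path(p).exists() or Path(p).read_text() != c}
```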
- That's such an exciting future, by the way. 01:09:38.400 |
it's scary for people, but like, it's really cool 01:09:42.120 |
to be able to just like let the agent do a set of tasks 01:09:46.000 |
and you come back the next day and kind of observe 01:09:49.680 |
like it's a colleague or something like that. 01:09:51.920 |
- Yeah, and I think there may be different versions 01:09:53.960 |
of like runability where for the simple things 01:09:57.560 |
where you're doing things in the span of a few minutes 01:09:59.960 |
on behalf of the user as they're programming, 01:10:02.040 |
it makes sense to make something work locally 01:10:15.800 |
of how do you exactly reproduce or mostly reproduce 01:10:20.120 |
to the point of it being effectively equivalent 01:10:27.240 |
- I'm curious what kind of agents you want for coding. 01:10:32.920 |
Do you want them to like implement new features? 01:10:41.400 |
I think so for the practices, this particular podcast, 01:10:45.120 |
there's video editing and a lot of, if you look in Adobe, 01:10:52.080 |
but you can interact with Premiere, for example, 01:11:01.480 |
I do all of that through code and including translation 01:11:11.840 |
that don't have to do directly with the editing. 01:11:19.520 |
I would be fundamentally thinking about bug finding, 01:11:30.200 |
not logical, like spiritual bugs or something. 01:11:34.280 |
Ones like sort of big directions of implementation, 01:11:57.840 |
- I think these models are really strong reflection 01:12:08.520 |
but I don't think the loss in the scale is quite, 01:12:13.000 |
such that they're like really fully generalizing in code. 01:12:16.360 |
Like the things that we use these things for, 01:12:18.680 |
the frontier models that they're quite good at 01:12:21.360 |
are really code generation and question answering. 01:12:28.440 |
in pre-training with all of the code on GitHub 01:12:30.880 |
on the scale of many, many trillions of tokens 01:12:33.160 |
and questions and answers on things like Stack Overflow 01:12:39.160 |
And so when you try to push one of these things 01:12:53.720 |
And then bug detection is another great example 01:13:01.080 |
And the models just kind of like really struggle at it. 01:13:05.920 |
But I think it's a question of transferring the model, 01:13:08.520 |
like in the same way that you get this fantastic transfer 01:13:11.920 |
from pre-trained models just on code in general 01:13:19.080 |
with generalized models that are really good at code 01:13:22.400 |
It just takes like a little bit of kind of nudging 01:13:26.080 |
I think they sort of understand code really well. 01:13:30.280 |
like the representation that's being built up, 01:13:33.400 |
like almost certainly like somewhere in the stream, 01:13:38.880 |
that maybe there's something sketchy going on, right? 01:14:00.480 |
Is this sketchy like you're gonna take the server down? 01:14:03.040 |
It's like part of it is maybe the cultural knowledge 01:14:05.720 |
of like, why is a staff engineer a staff engineer? 01:14:15.000 |
a sketchy piece of code that took the server down. 01:14:20.160 |
as opposed to maybe you're just like, you know, 01:14:31.960 |
when you're writing an experiment, that's really bad. 01:14:34.560 |
But if you're writing something for super production, 01:14:38.320 |
You're writing code in Postgres or Linux or whatever, 01:14:42.680 |
It's sort of unacceptable to have even an edge case 01:15:01.000 |
to understand which line of code is important 01:15:05.320 |
Like you, I think one of your principles on a website says, 01:15:20.720 |
- No, you say like, for every single line of code 01:15:36.360 |
how it can sink the Titanic, a single function. 01:15:38.520 |
Like you don't, you might not intuit that quite clearly 01:15:48.120 |
where if you actually write dangerous, dangerous, dangerous 01:15:52.800 |
like the models will pay more attention to that 01:15:57.520 |
and will be more likely to find bugs in that region. 01:16:00.480 |
- That's actually just straight up a really good practice 01:16:03.600 |
of labeling code, of how much damage this can do. 01:16:14.640 |
in fact, I actually think this is one of the things 01:16:18.240 |
like I sort of aesthetically, I don't like it, 01:16:36.080 |
like, of course we like test a lot and whatever, 01:16:49.520 |
And you kind of really need to point it out to them 01:16:58.600 |
That's like, we don't really think about that. 01:17:02.080 |
You think about, okay, how do I figure out how this works 01:17:08.720 |
- Until we have formal verification for everything, 01:17:16.560 |
that you have not introduced a bug if the proof pass. 01:17:22.000 |
- I think people will just not write tests anymore 01:17:46.360 |
some of the stuff you were talking about earlier 01:17:53.160 |
because the intent is really hard to specify, 01:17:54.800 |
it's also then going to be really hard to prove 01:17:56.680 |
that it's actually matching whatever your intent is. 01:17:58.440 |
- Like you think that spec is hard to generate? 01:18:06.720 |
maybe you can, I think there is a question of like, 01:18:12.880 |
I think that there's like more to dig into there. 01:18:28.080 |
that are not going to be easily well-specified 01:18:32.760 |
- Maybe an argument against formal verification 01:18:36.840 |
- The worry is there's this massive document. 01:18:42.440 |
I think you can probably also evolve the spec languages 01:18:55.000 |
- And you're speaking not just about like single functions. 01:19:09.160 |
where you can prove, formally verify down to the hardware. 01:19:13.640 |
So like through the, you formally verify the C code 01:19:16.680 |
and then you formally verify through the GCC compiler 01:19:19.600 |
and then through the Verilog down to the hardware. 01:19:19.600 |
And I think big code bases are sort of similar 01:19:31.120 |
And if you can decompose it and formally verify each part, 01:19:36.560 |
I think the specification problem is a real problem, but. 01:19:40.880 |
Or how do you handle, I guess, external dependencies 01:19:46.320 |
- Maybe Stripe would write a spec for the API. 01:19:48.600 |
- But like, you can't do this for everything. 01:19:49.920 |
Like, can you do this for everything you use? 01:19:52.200 |
Like, how do you do it for, if there's a language model, 01:19:55.160 |
like maybe like people will use language models 01:20:02.680 |
- I think you might be able to prove that still. 01:20:07.600 |
- I think it feels possible that you could actually prove 01:20:10.800 |
that a language model is aligned, for example. 01:20:22.200 |
I mean, if it's possible, that's your, I have a dream speech. 01:20:26.040 |
If it's possible, that will certainly help with, you know, 01:20:31.880 |
and making sure AI doesn't destroy all of human civilization. 01:20:35.040 |
So the full spectrum of AI safety to just bug finding. 01:20:39.300 |
So you said the models struggle with bug finding. 01:20:54.560 |
Like it should very quickly catch the stupid bugs, 01:21:02.800 |
Like I do this, I write like less than in a comment 01:21:04.960 |
and like I maybe write a greater than sign 01:21:07.920 |
And the model is like, yeah, it looks sketchy. 01:21:13.000 |
But eventually it should be able to catch harder bugs too. 01:21:16.160 |
- Yeah, and I think that it's also important to note 01:21:21.800 |
feels necessary to get to the highest reaches 01:21:24.640 |
of having AI do more and more programming for you, 01:21:29.200 |
if AI is building more and more of the system for you, 01:21:31.160 |
you need to not just generate, but also verify. 01:21:42.400 |
So it's not just for humans, like you write a bug, 01:21:47.160 |
but it's also being able to verify the AI's code 01:21:52.560 |
- Yeah, and then how do you actually do this? 01:21:54.120 |
Like we have had a lot of contentious dinner discussions 01:22:00.720 |
it's kind of potentially easy to introduce a bug 01:22:05.360 |
And so you can train a model to introduce bugs 01:22:13.320 |
then that can find bugs using this synthetic data. 01:22:18.720 |
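(A naive sketch of bootstrapping that synthetic data; the conversation describes training a model to introduce bugs, whereas here a few rule-based mutations stand in for that model, producing buggy/clean pairs a detector could train on.)

```python
import random

MUTATIONS = [("<=", "<"), ("<", "<="), ("==", "!="),
             ("+ 1", "- 1"), (" and ", " or "), ("True", "False")]

def inject_bug(clean_code: str, rng: random.Random) -> tuple[str, str] | None:
    """Return a (buggy, clean) training pair, or None if nothing to mutate."""
    candidates = [(old, new) for old, new in MUTATIONS if old in clean_code]
    if not candidates:
        return None
    old, new = rng.choice(candidates)
    buggy = clean_code.replace(old, new, 1)  # flip one occurrence only,
    return buggy, clean_code                 # so the bug stays localized
```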
But yeah, there are lots of ideas for how to- 01:22:24.600 |
not even at the model level, of taking the biggest models 01:22:27.400 |
and then maybe giving them access to a lot of information 01:22:32.320 |
Like it's kind of a hard problem to like stare at a file 01:22:35.680 |
And you know, that's hard for humans often, right? 01:22:48.680 |
It could be that you have a really specialty model 01:22:50.640 |
that's quite fast, that's kind of running in the background 01:22:55.520 |
sort of to Arvid's earlier example about, you know, 01:23:02.200 |
you're not just like checking hypothesis-free, 01:23:04.080 |
you're like, this is a problem, I really wanna solve it. 01:23:06.560 |
And you zap that with tons and tons and tons of compute 01:23:08.840 |
and you're willing to put in like $50 to solve that bug 01:23:15.560 |
Like I would pay probably a large amount of money 01:23:19.240 |
or even generated a code that I really appreciated. 01:23:38.960 |
and for localization like in different languages. 01:23:45.160 |
And the code across, like if I Googled it for a while, 01:23:54.840 |
I read the code and I was like, this is correct. 01:23:58.160 |
I was like, I wanna tip on a button that goes. 01:24:03.920 |
One that's really good just to support the company 01:24:08.080 |
And the other is that probably sends a strong signal, 01:24:17.000 |
You just actually send like a strong, good job. 01:24:24.920 |
that would pay a huge amount of money for a bug, 01:24:33.720 |
- Yeah, it's a controversial idea inside the company. 01:24:45.480 |
if like you spend nothing to try to find a bug. 01:24:51.560 |
And then if it does find a bug and you click accept, 01:24:54.560 |
then it also shows like in parentheses, like $1. 01:25:08.480 |
And then there's also the worry that like introducing money 01:25:14.280 |
you know, like it doesn't feel as fun anymore. 01:25:18.080 |
and all you want to think about is like the code. 01:25:21.080 |
And so maybe it actually makes more sense to separate it out 01:25:26.760 |
and then you get all of these things for free. 01:25:32.360 |
- Yes, but it still has that like dollar symbol. 01:25:41.000 |
that feels like people do this is when they share it, 01:25:45.120 |
they just kind of share it with their friends. 01:25:51.560 |
where if we can get to a place where we understand 01:25:55.960 |
I mean, to the stuff we were talking about with like, 01:26:07.200 |
doesn't need to rely on the honor system too. 01:26:16.360 |
Can you use, can you do like a loop where it runs the code 01:26:27.760 |
Is right now they're separate worlds completely? 01:26:30.720 |
Like I know you can like do control K inside the terminal 01:26:54.200 |
Like we do running the code in different ways. 01:26:58.280 |
which how do you protect it from not modifying the database? 01:27:01.960 |
- I mean, there's certainly cool solutions there. 01:27:06.080 |
There's this new API that is being developed for, 01:27:16.480 |
I don't know if PlanetScale was the first one to add it. 01:27:18.760 |
It's this ability to sort of add branches to a database, 01:27:25.520 |
and you want to test against a broad database, 01:27:29.880 |
you could sort of add a branch to the database. 01:27:35.200 |
And there's obviously a lot of technical complexity 01:27:38.480 |
I guess database companies need new things to do. 01:27:52.080 |
is going to add maybe branching to the write-ahead log. 01:27:57.080 |
And so maybe the AI agents will use branching. 01:28:05.400 |
and it's sort of gonna be a requirement for the database 01:28:13.680 |
- Yeah, I feel like everything needs branching. 01:28:17.040 |
- Yeah, it's the problem with the multiverse, right? 01:28:22.000 |
Like if you branch on everything, that's like a lot. 01:28:24.360 |
- I mean, there's obviously these like super clever 01:28:26.320 |
algorithms to make sure that you don't actually 01:28:28.520 |
sort of use a lot of space or CPU or whatever. 01:28:32.280 |
- Okay, this is a good place to ask about infrastructure. 01:28:56.840 |
Like it might be absolute hell to go through the steps 01:29:08.920 |
- I think it's exactly, it's just nature of winning. 01:29:12.440 |
But AWS, you can always trust, like it will always work. 01:29:15.240 |
And if there is a problem, it's probably your problem. 01:29:21.760 |
Is there some interesting like challenges to, 01:29:23.640 |
you guys have a pretty new startup to get scaling 01:29:30.680 |
it has been an interesting journey adding, you know, 01:29:37.920 |
You run into all of these with like, you know, 01:29:39.520 |
the general components you're using for caching 01:29:41.520 |
and databases run into issues as you make things 01:29:45.760 |
into overflows on our tables and things like that. 01:29:48.720 |
And then also there have been some custom systems 01:29:53.200 |
our retrieval system for computing a semantic index 01:29:57.040 |
of your code base and answering questions about a code base 01:30:04.360 |
- I have a few friends who are super, super senior engineers 01:30:09.040 |
it's very hard to predict where systems will break 01:30:18.960 |
that's going to happen when you add this extra zero. 01:30:23.720 |
but you didn't actually think through everything. 01:30:26.320 |
But I think for that particular system, we've, 01:30:41.120 |
and then we send up sort of the code for embedding 01:30:46.280 |
And then we store the embeddings in a database, 01:30:56.320 |
because we're very, very paranoid about client bugs. 01:31:12.720 |
the local code base state is the same as the state 01:31:17.920 |
And the way sort of technically we ended up doing that is, 01:31:21.840 |
so for every single file, you can sort of keep this hash. 01:31:25.800 |
And then for every folder, you can sort of keep a hash, 01:31:31.160 |
And you can sort of recursively do that until the top. 01:31:37.640 |
One thing you could do is you could keep a hash 01:31:44.720 |
figure out what are the files that don't exist on the server. 01:31:56.160 |
But that introduces like absolutely ginormous 01:32:03.680 |
I mean, nobody really wants us to hammer their wifi 01:32:13.680 |
It would sort of be reading this tens of terabyte database, 01:32:31.400 |
You sort of, you just try to reconcile the single hash, 01:32:35.760 |
And then if something mismatches, then you go, 01:32:39.600 |
Maybe you look at the children and see if the hashes match. 01:32:47.280 |
And for most people, most of the time the hashes match. 01:32:50.000 |
- So it's a kind of like hierarchical reconciliation. 01:33:00.080 |
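(A sketch of that hierarchical hash reconciliation, in the spirit of a Merkle tree; this is an illustration, not the actual client, and the shape of `server_hashes` is an assumption. A real implementation would cache the hashes instead of recomputing them on every call.)

```python
import hashlib
from pathlib import Path

def tree_hash(path: Path) -> str:
    """A file's hash is the hash of its bytes; a folder's hash is the hash of
    its children's names and hashes, so one root hash summarizes everything."""
    if path.is_file():
        return hashlib.sha256(path.read_bytes()).hexdigest()
    children = sorted(f"{child.name}:{tree_hash(child)}" for child in path.iterdir())
    return hashlib.sha256("\n".join(children).encode()).hexdigest()

def reconcile(path: Path, server_hashes: dict[str, str], out_of_sync: list[Path]) -> None:
    """Descend only into subtrees whose hash disagrees with the server's copy."""
    if tree_hash(path) == server_hashes.get(str(path)):
        return                       # the common case: nothing changed here
    if path.is_file():
        out_of_sync.append(path)     # this file needs to be re-indexed
        return
    for child in path.iterdir():
        reconcile(child, server_hashes, out_of_sync)
```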
you kind of have to think through all these problems. 01:33:01.800 |
- And I mean, the point of, like the reason it's gotten hard 01:33:04.480 |
is just because, like the number of people using it 01:33:15.880 |
we originally reordered our code base, which is big, 01:33:18.680 |
but I mean, it's just not the size of some company 01:33:25.080 |
And you sort of want to scale that across programmers. 01:33:31.360 |
but scaling it to a lot of people, like a lot of companies 01:33:36.240 |
Which sort of, you know, independent of actually, 01:33:41.800 |
coming up with new ideas that obviously we're working on, 01:33:45.520 |
but then scaling all of that in the last few weeks, months. 01:33:48.440 |
- Yeah, and there are a lot of clever things, 01:33:50.640 |
like additional things that go into this indexing system. 01:33:53.640 |
For example, the bottleneck in terms of costs 01:33:57.160 |
is not storing things in the vector database, 01:33:58.960 |
or the database that's actually embedding the code. 01:34:07.400 |
except for maybe they're in a different branch 01:34:12.320 |
And so, because again, embeddings are the bottleneck, 01:34:16.160 |
and not have to worry about like the complexity 01:34:18.320 |
of like dealing with branches and the other databases, 01:34:20.600 |
where you just have some cache on the actual vectors 01:34:29.560 |
And so this means that when the nth person at a company 01:34:33.720 |
goes and embeds their code base, it's really, really fast. 01:34:36.680 |
And you do all this without actually storing any code 01:34:41.680 |
We just store the vectors in the vector database 01:34:51.680 |
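(A sketch of the hash-keyed embedding cache being described; `embed` and `cache` are hypothetical stand-ins for the embedding model and the shared server-side store. Because the key is a hash of the chunk's contents, the nth person at a company to index the same code never pays for embedding it again, and no source code has to be stored, only vectors.)

```python
import hashlib

def embed_chunk(chunk: str, cache: dict[str, list[float]], embed) -> list[float]:
    key = hashlib.sha256(chunk.encode()).hexdigest()  # content-addressed key
    vector = cache.get(key)
    if vector is None:
        vector = embed(chunk)   # the expensive call, paid once per unique chunk
        cache[key] = vector
    return vector
```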
Just out of curiosity, like what benefit do users have? 01:35:10.640 |
you want to find out where something is happening 01:35:16.960 |
okay, I want to find the place where we do X, 01:35:19.320 |
but you don't exactly know what to search for 01:35:25.240 |
you hit command enter to ask with the code base chat, 01:35:34.760 |
in the future, I think this is only going to get more 01:35:42.120 |
And I think the ceiling for that is really, really much 01:35:47.920 |
have you considered and why haven't you much done 01:35:53.680 |
I mean, it seems like everything we just discussed 01:36:16.360 |
Have you considered doing sort of embeddings locally? 01:36:19.840 |
and I think it would be cool to do it locally. 01:36:33.240 |
like more than 80% of our users are on Windows machines, 01:36:36.240 |
which, and many of them are not very powerful. 01:36:44.240 |
And it's also a big overhead to build that in. 01:36:50.440 |
it's currently not something that we are able to focus on. 01:36:54.360 |
And I think there are some people that do that. 01:36:58.880 |
But especially as models get bigger and bigger 01:37:02.640 |
and you want to do fancier things with like bigger models, 01:37:07.920 |
- Yeah, and it's not a problem of like weaker computers. 01:37:11.640 |
It's just that, for example, if you're some big company, 01:37:17.680 |
it's just really hard to process big company code base 01:37:28.040 |
I think if you're like the best programmer at a big company, 01:37:31.760 |
you're still going to have a horrible experience 01:37:35.760 |
I mean, you could do edge and sort of scrape by, 01:37:40.840 |
- Yeah, like an approximate nearest neighbors search 01:37:42.440 |
on this massive code base is going to just eat up 01:37:50.080 |
Like, let's talk about like also the modeling side 01:37:52.800 |
where, as Arvid said, there are these massive headwinds 01:38:05.320 |
which plays in favor of local versus using GPUs 01:38:12.320 |
But the downside is these models are just bigger in total. 01:38:18.960 |
not even on a single node, but multiple nodes. 01:38:31.480 |
does it clear some bar of like the models good enough 01:38:34.840 |
to do these things and then like we're satisfied, 01:38:41.640 |
but people are always going to want the best, 01:38:43.480 |
the most intelligent, the most capable things. 01:38:46.200 |
And that's going to be really, really hard to run 01:39:00.520 |
Would you be satisfied with an inferior model? 01:39:05.520 |
but there's some people that like to do stuff locally, 01:39:11.080 |
and obviously the open-source movement that kind of resists. 01:39:20.080 |
- There's actually an alternative to local models 01:39:25.200 |
I think it's still very much in the research stage, 01:39:28.520 |
but you could imagine doing homomorphic encryption 01:39:34.360 |
So you encrypt your input on your local machine, 01:39:37.920 |
and then the server can use lots of computation. 01:39:42.920 |
They can run models that you cannot run locally 01:39:49.760 |
and you decrypt the answer and only you can see the answer. 01:39:55.880 |
and all of it is about trying to make the overhead lower 01:40:00.720 |
because right now the overhead is really big. 01:40:07.240 |
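As a toy illustration of that flow (encrypt locally, compute on the server without ever decrypting, decrypt only on the client), here is a single encrypted linear layer using the additively homomorphic Paillier scheme from the python-paillier (`phe`) library. Real transformer inference under fully homomorphic encryption needs far heavier machinery, which is exactly the overhead being described; the weights and inputs below are made up.

```python
# pip install phe  (python-paillier: additively homomorphic encryption)
from phe import paillier

# Client: generate keys and encrypt the input features.
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)
features = [0.12, -0.5, 0.33]
encrypted = [public_key.encrypt(x) for x in features]

# Server: sees only ciphertexts. Paillier allows ciphertext + ciphertext and
# ciphertext * plaintext scalar, which is enough for a weighted sum plus a bias.
weights, bias = [0.7, 1.5, -2.0], 0.1
encrypted_out = encrypted[0] * weights[0]
for enc, w in zip(encrypted[1:], weights[1:]):
    encrypted_out += enc * w
encrypted_out += bias

# Client: only the private key holder can read the result.
print(private_key.decrypt(encrypted_out))   # approximately -1.226
```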
And I think it would be really, really impactful 01:40:10.080 |
because I think one thing that's actually kind of worrisome 01:40:12.160 |
is that as these models get better and better, 01:40:14.840 |
they're going to become more and more economically useful. 01:40:17.880 |
And so more and more of the world's information and data 01:40:21.040 |
will flow through, you know, one or two centralized actors. 01:40:43.960 |
And sometimes that will happen for, you know, 01:40:49.800 |
like people will want to try to protect against 01:40:55.680 |
And then you will add in some surveillance code 01:40:57.480 |
and then someone else will come in and, you know, 01:40:59.640 |
you're on a slippery slope and then you start 01:41:01.840 |
doing bad things with a lot of the world's data. 01:41:10.480 |
homomorphic encryption for language model inference. 01:41:14.320 |
But I would say like that's the challenge we have 01:41:18.680 |
It's like there's so many features that can be provided 01:41:22.240 |
from the cloud and all of us increasingly rely on it 01:41:25.160 |
and make our life awesome, but there's downsides. 01:41:27.720 |
And that's why you rely on really good security 01:41:31.600 |
But there's also only a small set of companies 01:41:40.040 |
and they can be infiltrated in all kinds of ways. 01:41:43.600 |
- Yeah, I mean, the thing I'm just actually quite worried 01:41:48.560 |
so Anthropic has this responsible scaling policy 01:41:55.120 |
which is the Anthropic security level or whatever 01:41:58.920 |
But as we get to, like, quote-unquote ASL-3, ASL-4, 01:42:02.320 |
whatever models, which are sort of very powerful. 01:42:16.280 |
and understandable where everyone is coming from. 01:42:27.000 |
It's like sort of this really fine line you're walking 01:42:35.160 |
On the other side, like, man, it's humans, like, 01:42:38.160 |
I don't know if I trust like all the world's information 01:42:44.640 |
- Why do you think it's different than cloud providers? 01:43:00.560 |
you want to give more data to the AI models. 01:43:00.560 |
that you would never have put online in the first place 01:43:42.680 |
the kind of stuff I would like to include in the context. 01:43:54.640 |
at computing the context automatically in the future. 01:44:00.120 |
there are trade-offs with including automatic context. 01:44:03.600 |
So the more context you include for these models, 01:44:17.480 |
they get confused if you have a lot of information 01:44:26.120 |
But this is, already we do some automatic context 01:44:33.040 |
It's definitely something we wanna get a lot better at. 01:44:35.360 |
And I think that there are a lot of cool ideas 01:44:40.280 |
both on the learning better retrieval systems, 01:44:45.680 |
like better embedding models, better re-rankers. 01:44:48.400 |
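A minimal sketch of that two-stage retrieval shape: a cheap embedding similarity pass narrows the code base down, then a more careful re-ranker orders the shortlist. Both embed and rerank_score are toy placeholders standing in for learned models.

```python
import math

def embed(text: str) -> list[float]:
    # Placeholder embedding: real systems call a learned embedding model here.
    vec = [0.0] * 64
    for i, byte in enumerate(text.encode()):
        vec[i % 64] += byte / 255.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-9)

def rerank_score(query: str, chunk: str) -> float:
    # Placeholder re-ranker: real systems use a cross-encoder that reads both together.
    return sum(1.0 for token in query.lower().split() if token in chunk.lower())

def retrieve(query: str, chunks: list[str], k_coarse: int = 50, k_final: int = 5) -> list[str]:
    q = embed(query)
    # Stage 1: cheap vector similarity over the whole index.
    coarse = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k_coarse]
    # Stage 2: a more expensive re-ranker orders the shortlist.
    return sorted(coarse, key=lambda c: rerank_score(query, c), reverse=True)[:k_final]
```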
I think that there are also cool academic ideas, 01:44:53.280 |
but that the field writ large is also grappling with, 01:44:53.280 |
where you can actually just have the model itself, 01:45:02.640 |
And the most popular talked about version of this is, 01:45:07.520 |
Then if you make the context windows infinite, 01:45:08.880 |
can you make the model actually pay attention 01:45:14.320 |
to make it somewhat feasible to actually do it, 01:45:16.680 |
can you then do caching for that infinite context? 01:45:18.760 |
You don't have to recompute that all the time. 01:45:20.920 |
But there are other cool ideas that are being tried 01:45:23.440 |
that are a little bit more analogous to fine tuning 01:45:30.760 |
sort of a qualitatively different type of understanding 01:45:36.000 |
than if you do it at the in-context learning level. 01:45:44.640 |
we are really excited about better retrieval systems 01:45:54.440 |
for learning this knowledge directly in the weights 01:45:54.440 |
So these models in pre-training have seen all the code. 01:46:08.680 |
They've probably also seen questions and answers about it. 01:46:11.080 |
And then they've been fine-tuned and RLHF'd 01:46:11.080 |
to be able to answer questions about code in general. 01:46:20.080 |
but sometimes it actually does a pretty good job 01:46:29.560 |
But what if you could actually like specifically 01:46:33.040 |
such that it really was built to understand this code base? 01:46:46.800 |
I.e., it's doing the retrieval in its internals 01:46:46.800 |
and then kind of answering the question, creating the code, 01:46:55.200 |
from the frontier model where maybe, you know, 01:46:59.520 |
that are much better than like the best open source ones 01:47:07.120 |
a really good open source model to be the retriever, 01:47:14.280 |
- Can you speak a little more to the post-training a model 01:47:31.280 |
and like trying all of them and being empirical 01:47:34.600 |
You know, one very naive thing is to try to replicate 01:47:38.840 |
what's done with VS Code and these frontier models. 01:47:50.440 |
of some particular repository that you care about. 01:47:56.400 |
let's just start with instruction fine-tuning, 01:47:58.360 |
you have like a normal instruction fine-tuning data set 01:48:00.440 |
about code, but you throw in a lot of questions 01:48:24.440 |
then prompt the model or have a model propose a question 01:48:28.960 |
and then add those as instruction fine-tuning data points. 01:48:32.560 |
And then in theory, this might unlock the model's ability to answer questions about that code base. 01:48:32.560 |
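A hedged sketch of that data-generation loop: for each chunk of the repository, have a model propose a question and answer about it, and fold those pairs into an otherwise normal instruction fine-tuning set. The ask_model function is a hypothetical stand-in for whatever LLM call you would actually use; here it returns a canned string so the sketch runs.

```python
import json

def ask_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call; returns a canned reply so this runs.
    return "placeholder response"

def synthesize_codebase_qa(chunks: list[str]) -> list[dict]:
    """Turn repository chunks into synthetic instruction fine-tuning examples."""
    examples = []
    for chunk in chunks:
        question = ask_model(
            "Propose one question a developer new to this code might ask:\n\n" + chunk
        )
        answer = ask_model(f"Code:\n{chunk}\n\nQuestion: {question}\nAnswer concisely:")
        examples.append({"instruction": question, "input": chunk, "output": answer})
    return examples

def write_sft_file(examples: list[dict], path: str = "codebase_sft.jsonl") -> None:
    # Append these to a normal instruction-tuning dataset about code.
    with open(path, "w") as f:
        for example in examples:
            f.write(json.dumps(example) + "\n")
```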
What do you think is the role of that kind of test-time compute? 01:48:42.440 |
- I think test-time compute is really, really interesting. 01:48:52.600 |
which will kind of, as you scale up the amount of data 01:48:59.480 |
both on loss and then on downstream benchmarks, 01:49:02.600 |
and just general performance when we use it for coding 01:49:20.120 |
increasing the number of inference time flops that we use, 01:49:27.240 |
as you increase the number of flops you use inference time, 01:49:33.400 |
Traditionally, we just had to literally train a bigger model 01:49:38.840 |
but now we could perhaps use the same size model 01:49:41.480 |
and run it for longer to be able to get an answer 01:49:46.760 |
And so the really interesting thing I like about this 01:49:49.480 |
is there are some problems that perhaps require 01:50:02.920 |
So are you going to spend all of this effort, 01:50:05.560 |
all this compute training a model that costs that much 01:50:16.040 |
that you train the model that's capable of doing 01:50:20.240 |
then you have a way of inference time running it longer 01:50:41.680 |
- I mean, yeah, that's an open research problem, certainly. 01:50:51.880 |
We have, like, initial implementations of this for things, 01:51:19.320 |
- But you mentioned, so there's a pre-training process, 01:51:30.080 |
- Well, it's weird, because, like, test-time compute, 01:51:33.600 |
there's, like, a whole training strategy needed 01:51:38.040 |
and the other really weird thing about this is no one, 01:51:47.680 |
Like, there have been some really interesting papers 01:51:56.680 |
with tree search using process reward models. 01:52:02.520 |
we don't quite know exactly what it looks like, 01:52:09.400 |
but maybe, like, the compute spent for this kind of, 01:52:12.160 |
for getting test-time compute to work for a model 01:52:23.800 |
We don't know how they're using any of these. 01:52:30.520 |
- Like, if you were to build a competing model, 01:52:38.240 |
I think you probably need to train a process reward model, 01:52:41.040 |
which is, so maybe we can get into reward models 01:52:43.720 |
and outcome reward models versus process reward models. 01:52:48.000 |
traditional reward models that people train, 01:52:48.000 |
let's look at that final thing you've done, everything, 01:53:09.240 |
And so OpenAI had some preliminary paper on this, 01:53:17.120 |
to get this pretty large, several hundred thousand dataset 01:53:26.720 |
in the ways that people use process reward models 01:53:33.160 |
affecting how we choose between a bunch of samples. 01:53:39.000 |
is they sample a bunch of outputs from the language model, 01:53:51.640 |
The really interesting thing that people think might work 01:53:56.320 |
is tree search with these process reward models, 01:53:58.760 |
because if you really can grade every single step 01:54:05.720 |
and explore multiple paths of this chain of thought, 01:54:10.400 |
to evaluate how good is this branch that you're taking. 01:54:18.240 |
with the quality of the outcome at the very end. 01:54:28.240 |
is figuring out how to properly train the process reward model, 01:54:28.240 |
or the interesting work that has been open sourced 01:54:47.440 |
for using the process reward models creatively 01:54:55.840 |
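A rough sketch of the sample-and-choose use of a process reward model described above: draw N chains of thought, score each prefix with the reward model, and keep the best candidate. The generator and scorer below are random placeholders, and the min-over-steps aggregation is just one common convention, not any lab's actual recipe.

```python
import random

def sample_solution(problem: str) -> list[str]:
    # Placeholder generator: a real system samples a chain of thought from an LLM.
    return [f"step {i + 1} for: {problem}" for i in range(random.randint(2, 4))]

def step_reward(problem: str, steps_so_far: list[str]) -> float:
    # Placeholder process reward model: scores a partial chain of thought in [0, 1].
    return random.random()

def score_chain(problem: str, steps: list[str]) -> float:
    # One common aggregation: the worst-scored step sinks the whole chain.
    return min(step_reward(problem, steps[: i + 1]) for i in range(len(steps)))

def best_of_n(problem: str, n: int = 16) -> list[str]:
    candidates = [sample_solution(problem) for _ in range(n)]
    return max(candidates, key=lambda steps: score_chain(problem, steps))
```

Tree search is the natural extension: instead of only scoring finished samples, the per-step scores can prune bad branches while the chain of thought is still being generated.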
So OpenAI says that they're hiding the chain of thought 01:54:59.960 |
And they've said that that was a difficult decision to make. 01:55:03.120 |
They, instead of showing the chain of thought, 01:55:06.080 |
they're asking the model to summarize the chain of thought. 01:55:10.560 |
they're going to monitor the chain of thought 01:55:13.000 |
to make sure the model is not trying to manipulate the user, 01:55:18.600 |
what do you think about hiding the chain of thought? 01:55:24.560 |
could be that they wanna make it hard for people 01:55:26.920 |
to distill these capabilities out of their model. 01:55:31.120 |
if you had access to that hidden chain of thought 01:55:42.360 |
- And there was sort of a mirror situation with this, 01:55:45.240 |
with some of the large language model providers, 01:55:52.120 |
used to offer easy access to log probabilities 01:55:57.640 |
and also log probabilities for the prompt tokens. 01:56:13.840 |
to try and distill these capabilities out of the APIs, 01:56:20.040 |
As an asterisk on also the previous discussion 01:56:26.120 |
I think that we're still learning how to use this model. 01:56:38.960 |
but O1 is not part of the default Cursor experience 01:56:51.240 |
in a way that we reach for sort of every hour, 01:57:04.120 |
of people releasing things where it seems really clear, 01:57:12.880 |
for you to have these background things running, right? 01:57:37.600 |
And that means it's really, really painful to use 01:57:40.560 |
for things where you want to supervise the output. 01:57:54.640 |
And there's so many things that like don't feel quite right. 01:58:01.760 |
to people increasing the amount of pre-training data 01:58:12.640 |
- So let me ask you about Strawberry Tomorrow Eyes. 01:58:41.400 |
- I think this space is a little bit different 01:58:51.280 |
And so I think that the best product in three to four years 01:59:05.000 |
But I think in the end, just if you don't have, 01:59:07.560 |
like if you stop innovating on the product, you will lose. 01:59:13.360 |
That's great for people trying to enter this market 01:59:19.800 |
lots of users already by just building something better. 01:59:23.000 |
And so I think, yeah, over the next few years, 01:59:28.800 |
building the best system, and that both comes down 01:59:34.480 |
And it also comes down to the editing experience. 01:59:42.520 |
is not just integrating the new model fast, like o1. 01:59:42.520 |
You mentioned you have a taxonomy of synthetic data. 02:00:10.600 |
- Yeah, I think there are three main kinds of synthetic data. 02:00:15.240 |
The first is, so what is synthetic data first? 02:00:18.200 |
So there's normal data, like non-synthetic data, 02:00:23.800 |
i.e. usually it'll be from humans having done things. 02:00:27.120 |
So from some human process, you get this data. 02:00:30.480 |
Synthetic data, the first one would be distillation. 02:00:34.720 |
So having a language model, kind of output tokens 02:00:41.760 |
And then you can train some less capable model on this. 02:00:47.960 |
like more capable model than the original one 02:00:51.320 |
But it's really useful for if there's some capability 02:00:58.040 |
high latency model, you can then distill that down 02:01:03.400 |
The second kind is when like one direction of the problem 02:01:11.840 |
And so a great example of this is bug detection, 02:01:16.120 |
like we mentioned earlier, where it's a lot easier 02:01:24.960 |
And this is probably the case for humans too. 02:01:27.200 |
And so what you can do is you can get a model 02:01:31.440 |
that's not trained on that much data, that's not that smart, to introduce bugs in code. 02:01:42.240 |
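A toy sketch of that easier reverse direction: take code assumed to be correct, apply cheap mutations to introduce bugs, and the resulting (clean, buggy) pairs become training data for the much harder detection direction. The mutation table is deliberately simplistic.

```python
import random

MUTATIONS = [("<=", "<"), (">=", ">"), ("==", "!="), ("+ 1", "- 1"), (" and ", " or ")]

def inject_bug(clean_code: str) -> tuple[str, str] | None:
    """Apply one random mutation; return (buggy_code, description) or None."""
    applicable = [(a, b) for a, b in MUTATIONS if a in clean_code]
    if not applicable:
        return None
    a, b = random.choice(applicable)
    return clean_code.replace(a, b, 1), f"replaced first '{a}' with '{b}'"

def build_detector_dataset(snippets: list[str]) -> list[dict]:
    dataset = []
    for code in snippets:
        dataset.append({"code": code, "label": "clean"})
        mutated = inject_bug(code)
        if mutated is not None:
            buggy, why = mutated
            dataset.append({"code": buggy, "label": "buggy", "why": why})
    return dataset
```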
The last category, I think is, I guess the main one 02:01:52.800 |
with language models that can then be verified easily. 02:01:59.920 |
is if you have a verification system that can detect 02:02:05.760 |
and then you have a bunch of monkeys typing on typewriters. 02:02:08.160 |
Like, you can eventually get enough training data 02:02:12.640 |
And I mean, this is the case, like very much the case 02:02:14.760 |
for math where verification is actually really, really easy 02:02:22.680 |
And then what you can do is you can have an okay model, 02:02:26.200 |
generate a ton of rollouts and then choose the ones 02:02:31.840 |
the ground truth theorems and train that further. 02:02:36.360 |
with LeetCode-like problems where if you have some set 02:02:46.880 |
that it's passed the test and then train the model 02:02:51.680 |
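A minimal sketch of that verify-then-train loop for test-checkable problems: sample many candidate programs, keep only the ones the tests accept, and the survivors become fine-tuning data. The candidate generator is a hard-coded placeholder where a real system would sample from a language model.

```python
def generate_candidates(prompt: str, n: int) -> list[str]:
    # Placeholder sampler: a real system draws n programs from a language model.
    return ["def add(a, b):\n    return a + b",
            "def add(a, b):\n    return a - b"][:n]

def passes_tests(candidate_src: str, tests: list[tuple[tuple, object]]) -> bool:
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)          # the verifier actually runs the code
        fn = next(v for k, v in namespace.items()
                  if callable(v) and not k.startswith("__"))
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False

def rejection_sample(prompt: str, tests: list[tuple[tuple, object]], n: int = 64) -> list[str]:
    """Only verified solutions survive; these become fine-tuning targets."""
    return [c for c in generate_candidates(prompt, n) if passes_tests(c, tests)]

# The tests pin down the behavior, so only the correct candidate is kept.
verified = rejection_sample("write add(a, b)", tests=[((2, 3), 5), ((0, 0), 0)])
```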
I think it's gonna be a little tricky getting this to work 02:02:57.720 |
Like having the perfect verifier feels really, really hard 02:03:00.440 |
to do with just like open-ended miscellaneous tasks 02:03:04.760 |
you give the model, or more like long-horizon tasks, 02:03:09.040 |
- That's 'cause you're not as optimistic as Arvid, but yeah. 02:03:12.280 |
So yeah, so that third category requires having a verifier. 02:03:16.560 |
- Yeah, verification is, it feels like it's best 02:03:20.520 |
And like, then it wouldn't be like using a language model 02:03:23.720 |
to verify, it would be using tests or formal systems. 02:03:35.360 |
- But like the language model version of that 02:03:41.880 |
- Yeah, I think that's the category that is most likely 02:03:48.280 |
- What about the RL-with-feedback side? RLHF versus RLAIF? 02:03:52.520 |
What's the role of that in getting better performance 02:04:00.080 |
- Yeah, so RLHF is when the reward model you use 02:04:20.840 |
RLAIF is interesting because you're kind of depending on, 02:04:30.000 |
it's depending on the constraint that verification 02:04:33.200 |
is actually a decent bit easier than generation. 02:04:36.880 |
Because it feels like, okay, like, what are you doing? 02:04:42.680 |
But no, it actually may work if the language model 02:04:46.720 |
has a much easier time verifying some solution 02:04:50.880 |
Then you actually could perhaps get this kind of recursive loop. 02:04:54.240 |
I don't think it's going to look exactly like that. 02:04:59.040 |
that we kind of do is like a little bit of a mix 02:05:05.440 |
where usually the model is actually quite correct. 02:05:09.840 |
picking between like two possible generations 02:05:18.880 |
with only like on the order of 50, 100 examples 02:05:24.080 |
to like kind of align that prior the model has 02:05:31.240 |
where you're usually training these reward models 02:05:34.520 |
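A toy sketch of the preference step being contrasted here: given (chosen, rejected) pairs, fit a reward model with the standard pairwise logistic (Bradley-Terry style) objective. The hand-rolled features and linear model are placeholders; the point is only the shape of the loss, and that a small number of pairs can nudge a model whose prior is already close.

```python
import math
import random

def features(text: str) -> list[float]:
    # Toy featurizer; a real reward model reads the text with a transformer.
    return [len(text) / 100.0, text.count("def ") / 5.0, text.count("TODO") / 5.0]

def reward(w: list[float], text: str) -> float:
    return sum(wi * xi for wi, xi in zip(w, features(text)))

def train_reward_model(prefs: list[tuple[str, str]], lr: float = 0.1, steps: int = 200) -> list[float]:
    """prefs holds (chosen, rejected) pairs; maximize log sigmoid(r(chosen) - r(rejected))."""
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        chosen, rejected = random.choice(prefs)
        margin = reward(w, chosen) - reward(w, rejected)
        grad_scale = 1.0 / (1.0 + math.exp(margin))   # sigmoid(-margin), gradient wrt the margin
        for i, (xc, xr) in enumerate(zip(features(chosen), features(rejected))):
            w[i] += lr * grad_scale * (xc - xr)
    return w

# Even a handful of pairs moves the weights when the signal is this clean.
weights = train_reward_model([("def f():\n    return 1", "def f():\n    return 1  # TODO fix")])
```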
- What's your intuition when you compare generation 02:05:45.840 |
- My intuition would just say, yeah, it should be. 02:05:59.520 |
that are much, much easier to verify given a proof 02:06:03.920 |
- I wonder if the same thing will prove P not equal to NP 02:06:24.240 |
- I'm actually surprisingly curious what like a good bet 02:06:27.760 |
for when AI will get the Fields Medal will be. 02:06:35.400 |
- Oh, sorry, Nobel Prize or Fields Medal first? 02:06:55.160 |
because it already could get a few IMO problems. 02:07:04.000 |
I think I'm, one, much less versed in the space 02:07:04.000 |
And two, yeah, less intuition about how close we are 02:07:11.720 |
to solving these really, really hard open problems. 02:07:23.840 |
Like, it's probably much more likely that it'll get in. 02:07:29.080 |
like BSD, which is the Birch and Swinnerton-Dyer conjecture, 02:07:29.080 |
Like, we don't even know what a path looks like, 02:07:49.120 |
and you can actually have a good reward system, 02:07:51.280 |
and it feels like it's easier to train for that. 02:07:56.000 |
- I think we might get Fields Medal before AGI. 02:08:17.560 |
- Speaking of how fast things have been going, 02:08:23.000 |
maybe it's good to talk about this whole idea 02:08:43.160 |
And then Chinchilla showed a more correct version. 02:08:48.400 |
kind of deviated from doing the compute optimal thing, 02:08:58.920 |
And I think there are a lot more dimensions to these curves 02:09:03.280 |
than what we originally used of just compute, 02:09:12.600 |
I think context length is another obvious one. 02:09:16.800 |
of inference, compute, and then context window, 02:09:21.240 |
maybe the thing you wanna train is some kind of SSM 02:09:24.680 |
because they're much, much cheaper and faster 02:09:28.920 |
And even if maybe it has 10X worse scaling properties 02:09:28.920 |
to train the thing to get the same level of capabilities, 02:09:40.080 |
about that inference budget for really long context windows. 02:09:43.400 |
So it'll be interesting to see how people kind of play 02:09:48.360 |
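For reference, the Chinchilla-style fit being referred to, in the usual notation (N parameters, D training tokens, training compute C roughly 6ND); the fitted exponents and the roughly twenty-tokens-per-parameter rule of thumb are from Hoffmann et al. (2022) and are approximate.

```latex
% Parametric loss fit and the compute-optimal allocation it implies:
\[
  L(N, D) \;=\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
  \qquad
  N_{\mathrm{opt}}(C) \propto C^{a},\quad
  D_{\mathrm{opt}}(C) \propto C^{b},\quad
  a \approx b \approx 0.5 .
\]
% In practice this works out to roughly D ~ 20 N (about twenty training tokens
% per parameter), versus the earlier Kaplan et al. fits that favored putting
% much more of the compute budget into model size than into data.
```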
I mean, you speak to the multiple dimensions, obviously. 02:09:49.880 |
The original conception was just looking at the variables 02:09:52.400 |
of the size of the model as measured by parameters, 02:09:52.400 |
and the size of the data as measured by the number of tokens, and looking at the ratio of the two. 02:09:56.920 |
that there is a number or at least a minimum. 02:10:10.440 |
Do you still believe that there is a kind of, 02:10:23.560 |
I think that the path that people might take is, 02:10:31.160 |
if we spend like a ton, ton of money on training, 02:10:34.840 |
like get the most capable, cheap model, right? 02:10:38.360 |
Like really, really caring as much as you can. 02:10:40.360 |
'Cause like the naive version of caring as much as you can 02:10:43.920 |
is what people have already done with like the Llama models 02:10:43.920 |
or just overtraining the shit out of 7B models 02:10:50.160 |
on way, way, way more tokens than is Chinchilla-optimal. 02:10:50.160 |
Let's literally train on minimizing the KL divergence 02:11:17.640 |
just to get out this, I don't know, smaller model. 02:11:20.320 |
- And the distillation gives you just a faster model. 02:11:35.000 |
but like partially helping with the data wall. 02:11:37.640 |
Where like you only have so much data to train on, 02:11:39.400 |
let's like train this really, really big model 02:11:41.120 |
on all these tokens and we'll distill it into a smaller one. 02:11:50.040 |
than we would have originally if we trained it. 02:11:51.600 |
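A toy sketch of the distillation objective mentioned above: match the student's next-token distribution to the teacher's by minimizing the KL divergence, position by position. The logits below are made-up numbers over a four-token vocabulary; real setups run this over every position of a large corpus, usually with a temperature.

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q): how badly the student q models the teacher p at one position."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits: list[float], student_logits: list[float],
                      temperature: float = 2.0) -> float:
    # The full teacher distribution carries far more signal per token than a single
    # "correct" next token, which is part of why distillation is so data-efficient.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return kl_divergence(p, q)

# One position, four-token vocabulary:
print(distillation_loss([4.0, 1.0, 0.5, -2.0], [2.0, 1.5, 0.0, -1.0]))
```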
- So if I gave you $10 trillion, how would you spend it? 02:12:08.600 |
- Yeah, I think there's a lot of these secrets 02:12:14.000 |
and details about training these large models 02:12:19.960 |
And the issue is I would waste a lot of that money 02:12:35.200 |
with like the limited information you have now. 02:12:42.040 |
all the little heuristics, all the little parameters, 02:12:44.560 |
all the parameters that define how the thing is trained. 02:12:49.280 |
If we look at how to invest money for the next five years 02:12:49.280 |
in terms of maximizing what you called raw intelligence. 02:12:57.480 |
- I mean, isn't the answer like really simple? 02:12:59.280 |
You just try to get as much compute as possible? 02:13:02.200 |
Like at the end of the day, all you need to buy is the GPUs 02:13:05.000 |
and then sort of the researchers can find all the, 02:13:08.840 |
like they can sort of, you can tune whether you want 02:13:24.360 |
that we're sort of idea-limited, but there's always- 02:13:24.360 |
versus like use that compute to train a gigantic model. 02:13:38.560 |
- I would, but I do believe that we are limited 02:13:44.600 |
- I think, yeah, 'cause even with all this compute 02:13:47.960 |
and like, you know, all the data you could collect 02:13:49.920 |
in the world, I think you really are ultimately limited 02:13:52.680 |
by not even ideas, but just like really good engineering. 02:14:03.560 |
like there aren't that many people in the world 02:14:05.520 |
who really can like make the difference here. 02:14:08.000 |
And there's so much work that goes into research 02:14:11.640 |
that is just like pure, really, really hard engineering work. 02:14:18.680 |
if you look at the original "Transformer" paper, 02:14:20.680 |
you know how much work was kind of joining together 02:14:25.160 |
embedded in the literature versus then going in 02:14:30.160 |
like maybe the CUDA kernels, maybe whatever else, 02:14:31.880 |
I don't know if it ran on GPUs or TPUs originally, 02:14:34.000 |
such that it actually saturated the GPU performance, right? 02:14:38.360 |
Getting Noam to go in and do all of this code, right? 02:14:38.360 |
And Noam is like probably one of the best engineers 02:14:41.200 |
like the next generation of models, having these things, 02:14:49.480 |
and scaling it on like, you know, thousands of, 02:14:58.720 |
that has to go into all of these things to make it work. 02:15:01.760 |
If you really brought that cost down to like, you know, 02:15:10.280 |
made it super easy for someone with really fantastic ideas 02:15:17.480 |
that is like getting 50, 40% utilization on the GPUs. 02:15:22.840 |
I think that would just speed up research by a ton. 02:15:27.640 |
- I mean, I think if you see a clear path to improvement, 02:15:33.040 |
And I think probably OpenAI and all the other labs 02:15:36.720 |
that did the right thing to pick off the low-hanging fruit, 02:15:55.440 |
there's no point of experimenting with new ideas 02:16:01.560 |
and try to get as much juice out as possible. 02:16:01.560 |
And then maybe when you really need new ideas for, 02:16:07.040 |
I think if you're spending 10 trillion dollars, 02:16:10.720 |
so you know, then actually like re-evaluate your ideas. 02:16:13.320 |
Like, probably you're idea-limited at that point. 02:16:13.320 |
their very limited research and engineering talent 02:16:54.560 |
- Yeah, but also these big labs like winning. 02:17:02.400 |
Okay, so how, big question looking out into the future. 02:17:07.400 |
You're now at the center of the programming world. 02:17:17.600 |
in the next two years, the next five years, 10 years? 02:17:20.840 |
- I think we're really excited about a future 02:17:23.800 |
where the programmer's in the driver's seat for a long time. 02:17:27.960 |
And you've heard us talk about this a little bit, 02:17:36.200 |
the ability to modify anything you want to modify, 02:17:45.280 |
than where some people are jumping to in the space, 02:17:50.280 |
where I think one idea that's captivated people 02:17:59.520 |
as if you're talking to like an engineering department 02:18:02.680 |
And can it just be this sort of isolated text box? 02:18:05.640 |
And part of the reason we're not excited about that 02:18:10.720 |
is some of the stuff we've talked about with latency. 02:18:12.760 |
But then a big piece, a reason we're not excited about that 02:18:16.040 |
is because that comes with giving up a lot of control. 02:18:22.360 |
And if you're necessarily just going to communicate 02:18:25.760 |
with a thing, like you would be communicating 02:18:32.480 |
And this kind of gets at fundamentally what engineering is. 02:18:40.440 |
who are a little bit more removed from engineering 02:18:41.840 |
might think of it as the spec is completely written out 02:18:44.920 |
and then the engineers just come and they just implement. 02:18:47.880 |
And it's just about making the thing happen in code 02:19:01.240 |
and about really hard trade-offs between speed and cost 02:19:05.080 |
and just all the other things involved in a system. 02:19:15.440 |
and the ones specifying what they want to be built, 02:19:18.400 |
and it's not just like a company run by all AIs, 02:19:20.760 |
we think you'll really want the human in a driver's seat 02:19:26.240 |
And so the jury's still out on kind of what that looks like. 02:19:30.640 |
I think that one weird idea for what that could look like 02:19:37.200 |
the level of abstraction you view a code base at. 02:19:39.760 |
And you can point at specific parts of a code base 02:19:49.120 |
And you can actually edit that pseudocode too, 02:20:04.120 |
You keep the in-flow text editing component of programming, 02:20:04.120 |
while also giving you these big productivity gains. 02:20:18.280 |
- Yeah, and there are a lot of details to figure out there 02:20:28.680 |
We think for some things, like Arvid mentioned before, 02:20:34.800 |
if you have a bug that's really well-specified, 02:20:43.440 |
- What about like the fundamental skill of programming? 02:20:46.080 |
There's a lot of people, like young people right now, 02:21:04.040 |
- I actually think this is a really, really exciting time 02:21:08.040 |
Like we remember what programming was like in 2013, 2012, 02:21:14.760 |
and there was just so much more cruft and boilerplate 02:21:25.320 |
and that stuff still exists, it's definitely not at zero, 02:21:28.640 |
but programming today is way more fun than back then. 02:21:36.720 |
and all the things that really draw people to programming, 02:21:39.520 |
like for instance, this element of being able 02:21:45.840 |
like all those are just being turned up a ton. 02:21:50.280 |
I think it's gonna be a really, really fun time 02:21:53.720 |
I think that the skills will probably change too. 02:21:56.120 |
I think that people's taste in creative ideas 02:21:58.600 |
will be magnified, and it will be less about, 02:22:02.160 |
maybe less a little bit about boilerplate text editing, 02:22:05.160 |
maybe even a little bit less about carefulness, 02:22:10.760 |
If you're a programmer, I think it'll be a lot more fun. 02:22:22.800 |
was like we wanted to do a relatively big migration 02:22:26.640 |
We were using AsyncLocalStorage in Node.js, 02:22:26.640 |
and we wanted to migrate to our context object. 02:22:37.640 |
And Sualeh and I spent, I don't know, five days 02:22:37.640 |
working through this, even with today's AI tools. 02:22:50.520 |
and then the AI applies that to all of the locations. 02:22:54.120 |
And then it highlights, oh, this is a new example, 02:22:59.440 |
And then that can be done in like 10 minutes. 02:23:10.520 |
think exactly like, how are we going to do this? 02:23:13.800 |
but you can just try something first and you realize, 02:23:16.480 |
oh, this is not actually exactly what I want. 02:23:18.400 |
And then you can change it instantly again after. 02:23:20.800 |
And so, yeah, I think being a programmer in the future 02:23:29.840 |
it feels like a lot of the time with programming, 02:23:33.560 |
One is like, you think really hard, carefully upfront 02:23:39.760 |
And then you spend your limited time of engineering 02:24:24.920 |
then you're doing less and less creative decisions. 02:24:28.880 |
where it's, you're operating in the design space 02:24:34.000 |
where natural language is the main programming language. 02:24:37.320 |
And I guess I could ask that by way of advice. 02:24:39.280 |
Like, if somebody is interested in programming now, 02:24:43.240 |
Like, do they, you guys started in some Java, 02:25:02.440 |
It's just, it's going to be like vanilla JavaScript. 02:25:08.360 |
And I mean, it also brings up the question of like, 02:25:14.040 |
that some percent of the population is geeks. 02:25:16.680 |
And like, there's a particular kind of psychology in mind 02:25:36.600 |
But I think the true, maybe like the best programmers 02:25:58.440 |
and then they start coding on their side projects 02:26:15.480 |
where like this obsession and love of programming, 02:26:24.400 |
will really get into the details of how things work. 02:26:30.720 |
that exact problem, let's think about that person. 02:26:54.440 |
But what you're actually doing when you're pressing tab 02:27:03.440 |
sometimes you're typing a few more characters. 02:27:05.960 |
And that's the way that you're sort of shaping 02:27:21.760 |
as opposed to just typing is much lower bandwidth 02:27:35.840 |
building extraordinary productive human AI systems. 02:27:41.640 |
To start, we're building the engineer of the future, 02:27:50.720 |
This hybrid engineer will have effortless control 02:27:53.240 |
over their code base and no low entropy keystrokes. 02:27:56.920 |
They will iterate at the speed of their judgment, 02:28:02.160 |
Using a combination of AI and human ingenuity, 02:28:14.480 |
at the edge of what's useful and what's possible. 02:28:35.640 |
please check out our sponsors in the description. 02:28:42.600 |
and perhaps profound programming quote I saw on Reddit. 02:28:42.600 |
Nothing is as permanent as a temporary solution that works. 02:28:51.120 |
Thank you for listening and hope to see you next time.