Code Generation and Maintenance at Scale: Morgante Pell

00:00:16.200 |
I'm going to talk about code generation and maintenance at scale, or CPUs still matter, 00:00:21.000 |
and what it takes to actually make one of these agentic workflows work in production. 00:00:24.680 |
Quinn was talking about how most people have not merged an AI-generated PR yet. 00:00:29.520 |
Grit has probably merged more PRs than any other company 00:00:32.260 |
in the world at this point, because we've focused very narrowly 00:00:35.100 |
and have done a lot of work above the model layer. 00:00:37.380 |
And we're going to talk about how we did that. 00:00:43.680 |
I worked at Google Cloud for five years 00:00:46.560 |
and built a lot of stuff at the DevOps layer, right? 00:00:49.560 |
Thinking about Kubernetes and how you orchestrate very large scale 00:00:52.080 |
systems, working on tools like Kustomize or Terraform templates. 00:00:57.300 |
And one of the biggest things I learned from this 00:00:59.520 |
was how rare it was for a customer to come to us and ask for a brand new application. 00:01:04.380 |
People didn't come and say, I want to build a new app on Google Cloud. 00:01:08.100 |
90% of the time customers came and said, I have this line of business application that I need to modernize. 00:01:18.200 |
That's what everything we built in the sort of pre-AI era of automation was focused on. 00:01:25.000 |
And that's why we started Grit, because every demo that you usually 00:01:28.140 |
see hyped, one of these ones on Twitter, 00:01:30.160 |
is usually: type a prompt, get a new application from scratch, right? 00:01:39.080 |
But in reality, developers spend most of their time modifying huge applications so that flights run on time. 00:01:45.360 |
And this goes into three sort of categories of developer tools. 00:01:49.660 |
The first is autocomplete. This clearly has the most product-market fit today. 00:01:52.700 |
Like I was saying, you just do auto-complete, right? 00:01:56.820 |
Then there's AI agents that are focused on lowering the floor. 00:01:59.100 |
They're allowing people to do tasks that they otherwise don't have the skills to do, 00:02:02.880 |
allowing a product manager or other non-technical user to build an application. 00:02:09.220 |
This is powerful, but I actually am pretty skeptical that that's 00:02:12.200 |
how most software's going to be built in the future. 00:02:14.040 |
It requires real thinking about how you actually spec things out, 00:02:18.660 |
basically the training we have as engineers that's required to build great software, 00:02:22.360 |
which is why with Grit, we focus on raising the ceiling of what great engineers can do. 00:02:27.660 |
Principal engineers, the most high-level engineers that you work with, can only be in one place at a time. 00:02:34.580 |
But AI agents can be in 10 places at once if there's the right engineer controlling them. 00:02:38.360 |
And that's what we focus on: supercharging the productivity of the top 1% of engineers. 00:02:43.360 |
It also helpfully gets around the problem of 95% of engineers not using AI. 00:02:47.360 |
The great thing about Grit is that at our customers, 95% of the engineers don't touch it, right? 00:02:55.360 |
There's one engineer who is deeply embedded with Grit and is generating hundreds of PRs with their agents. 00:03:06.360 |
The IDE that you have today, it's a scalpel, right? 00:03:08.360 |
It's focused on editing individual lines of code. 00:03:11.360 |
But it's not focused on editing hundreds of repos at once. 00:03:14.360 |
It's not focused on how you open 1,000 files and make changes in all of them. 00:03:19.360 |
So we want to have bulldozers for code, right? 00:03:23.360 |
How do you push code around in an effective way when you're not editing individual lines, 00:03:27.360 |
when you're working at a higher level of abstraction? 00:03:31.360 |
We've seen an explosion in how much code is being generated. 00:03:34.360 |
A lot of our customers are seeing 20% to 30% more code that's coming out of their teams now 00:03:39.360 |
just because there's more PRs, there's more CI, there's everything that's running because you have code gen. 00:03:44.360 |
Once we go from 5% to maybe 50% of people actually using AI, there's way, way more code in the world. 00:03:49.360 |
And we need better tools for managing that code once it's in production. 00:03:54.360 |
So just to give an example of what this looks like in practice. 00:04:04.360 |
We had a customer with hundreds of teams, and they wanted to use OpenTelemetry instead of their existing logging, right? 00:04:07.360 |
This is traditionally a massive effort, right? 00:04:09.360 |
You have to coordinate across hundreds of teams to get them to understand OpenTelemetry. 00:04:14.360 |
You have to get them to do actual code changes to swap out their logging library. 00:04:20.360 |
And it's actually very much a people and process problem usually, right? 00:04:23.360 |
Something where you have a program manager who has a massive Excel spreadsheet. 00:04:27.360 |
That's why I say that what Grit competes with is actually Excel, not any other AI dev tool. 00:04:31.360 |
It's that you have these spreadsheets where you can manage these changes, right? 00:04:35.360 |
And tens of thousands of developer hours go into a change like that. 00:04:38.360 |
So a lot of companies just say, you know what, we're just not going to do it. 00:04:45.360 |
You know, people still have millions of lines of COBOL because it's just so much work to do this kind of coordinated change. 00:04:50.360 |
With Grit, you don't have to do that coordination effort, right? 00:04:53.360 |
Because you can have one engineer who is actually coordinating that change and driving individual AI developer agents to do the changes. 00:05:00.360 |
You don't have to have a bunch of meetings because it's just one person telling their little agents what to do, right? 00:05:05.360 |
And you can do it with under 100 developer hours because they're just doing high-level orchestration. 00:05:09.360 |
And then there are thousands of compute hours where the AI is actually doing these changes. 00:05:14.360 |
And this is literally a project that they had postponed for years because it just was not feasible. 00:05:24.360 |
With Grit, they just opened up 1,000 PRs across all the repos, fixed them, iterated on the changes, merged, and migrated over. 00:05:39.360 |
So how does this actually work? We do both semantic indexing, using embeddings to understand the intent of each file. 00:05:44.360 |
But we also do a lot of traditional, more static analysis indexing. 00:05:47.360 |
So we understand what's the structure of the code, what's being imported from where, what's the dependency graph, right? 00:05:52.360 |
This is all the sort of thing you need to know to actually do really high-quality agentic flows. 00:05:56.360 |
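As a rough illustration of combining those two kinds of indexing, here's a minimal TypeScript sketch; the embedFile function is a hypothetical stand-in for whatever embedding model you use, and this is not Grit's actual indexer.

```ts
import * as ts from "typescript";

// Hypothetical embedding call standing in for the model that produces file embeddings.
declare function embedFile(text: string): Promise<number[]>;

interface RepoIndex {
  embeddings: Map<string, number[]>;   // file -> semantic vector
  imports: Map<string, Set<string>>;   // file -> modules it imports
}

async function buildIndex(files: Map<string, string>): Promise<RepoIndex> {
  const index: RepoIndex = { embeddings: new Map(), imports: new Map() };
  for (const [name, text] of files) {
    // Semantic side: one embedding per file, capturing its intent.
    index.embeddings.set(name, await embedFile(text));
    // Static side: parse the file and record what it imports, so we can
    // later answer dependency-graph questions without calling a model.
    const source = ts.createSourceFile(name, text, ts.ScriptTarget.Latest, true);
    const deps = new Set<string>();
    source.forEachChild((node) => {
      if (ts.isImportDeclaration(node) && ts.isStringLiteral(node.moduleSpecifier)) {
        deps.add(node.moduleSpecifier.text);
      }
    });
    index.imports.set(name, deps);
  }
  return index;
}
```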
Then once we have the plan of how we're going to make changes, we execute the plan, right? 00:06:00.360 |
So we use large language models that are going to take that change, delegate it to a sub-agent. 00:06:05.360 |
The sub-agent is going to make a modification in one file. 00:06:08.360 |
It uses something called GritQL, as well as diff generation from language models. 00:06:12.360 |
GritQL is our custom query engine that's able to actually index millions of lines of code 00:06:17.360 |
and find the right places to modify things, and then finally push it up for PR review. 00:06:21.360 |
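A minimal sketch of that plan-then-delegate structure is below; every name here is a hypothetical stand-in for illustration, not Grit's actual interface.

```ts
interface PlannedChange {
  file: string;
  instruction: string;   // e.g. "swap this logging call for an OpenTelemetry equivalent"
}

declare function findTargets(query: string): Promise<string[]>;        // query-engine-style search
declare function runSubAgent(change: PlannedChange): Promise<string>;  // returns a proposed diff

async function executePlan(query: string, instruction: string): Promise<string[]> {
  const files = await findTargets(query);
  // Each sub-agent sees only one file's worth of context, which keeps the
  // context window small and the resulting edit easy to review.
  return Promise.all(files.map((file) => runSubAgent({ file, instruction })));
}
```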
Developers are able to be the directors of it, right? 00:06:25.360 |
So a typical scenario is that there will be the principal engineer who's driving the change. 00:06:30.360 |
But for the rest of the developers, their primary interaction with Grit is just seeing a PR land in their repo 00:06:34.360 |
and then leaving a one-line comment that Grit will learn from. 00:06:37.360 |
But they don't actually ever open up the Grit UI, because they're just responding to the changes that come from Grit. 00:06:42.360 |
So a little bit more about how we find code, right? 00:06:46.360 |
So our goal here is to find all the error logs in the code base because we want to migrate those over to OpenTelemetry. 00:06:51.360 |
The naive approach, which you'd get if you went to many of the workshops yesterday, would be: all right, just chunk it, put it into a RAG pipeline. 00:06:56.360 |
You have a bunch of embeddings and, you know, that theoretically could work for maybe some document use cases. 00:07:02.360 |
I'll tell you that absolutely will not work for this problem, right? 00:07:05.360 |
If you just go to try to find stuff that looks like a logging error, it's going to find a lot of irrelevant context, right? 00:07:11.360 |
It's going to find anything that looks log-like. 00:07:13.360 |
An LLM has a hard time differentiating between a user-facing log, like an alert in a UI, and an actual log that you want to be putting into OpenTelemetry, right? 00:07:22.360 |
There's also unpredictable thresholds, right? 00:07:24.360 |
You don't actually know how much code you're looking for. 00:07:25.360 |
You can't do, you know, retrieve the top 10 closest matches. 00:07:31.360 |
In a lot of cases, developers don't even actually know how big of a change it is until Grit starts to propose it for them, right? 00:07:37.360 |
So that's where we built the GritQL query engine. 00:07:39.360 |
It's our own query engine that combines the best of static analysis with the best of AI. 00:07:45.360 |
So we've got this query here that's looking for logger with some set of args, right? 00:07:50.360 |
So we're just going to look for a function call, basically. 00:07:53.360 |
So we're just looking for all of our function calls across our entire code base. 00:07:57.360 |
And then we're going to say that our args should be like 'an error occurred', right? 00:08:01.360 |
There we're just giving an example of what an error message we're looking for might look like. 00:08:05.360 |
'Like' is the magical word that converts it into a semantic representation. 00:08:08.360 |
So we're saying: find code whose embedding, by cosine similarity, is sufficiently above a threshold 00:08:14.360 |
that this is an actual log message versus some other function call that we wouldn't want to modify. 00:08:23.360 |
We've got a built-in library to be able to understand the whole dependency graph. 00:08:26.360 |
So we can do things like make sure this is actually imported from log4j. 00:08:29.360 |
In this example, we wanted to make sure that we're only substituting our log4j logs. 00:08:33.360 |
And that would go and traverse the import graph earlier in the program. 00:08:38.360 |
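To make the semantic-threshold idea concrete, here's a minimal TypeScript sketch of filtering candidate call sites by embedding similarity and by import origin. The embed function and the CallSite shape are hypothetical stand-ins for illustration, not GritQL internals.

```ts
declare function embed(text: string): Promise<number[]>;

interface CallSite {
  file: string;
  callee: string;          // e.g. "logger.error"
  firstArg: string;        // the literal message argument, if any
  importedFrom?: string;   // module the callee resolves to
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function findErrorLogs(candidates: CallSite[], threshold = 0.8): Promise<CallSite[]> {
  const example = await embed("an error occurred");
  const matches: CallSite[] = [];
  for (const site of candidates) {
    if (site.importedFrom !== "log4j") continue;           // only the logger we care about
    const sim = cosine(await embed(site.firstArg), example);
    if (sim >= threshold) matches.push(site);              // semantically close enough to a real log
  }
  return matches;
}
```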
So that brings us to finding the code that we want to change. 00:08:41.360 |
But once we've actually found the code, how do we make reliable changes? 00:08:46.360 |
And unfortunately, really smart models still have a hard time doing this completely autonomously. 00:08:51.360 |
Just to give an example, I just used Claude Sonnet today. 00:08:57.360 |
We put entire files, a bunch of context, into the context window: 00:09:03.360 |
100,000 tokens from Grit's VS Code extension and some of our linter outputs. 00:09:08.360 |
And we wanted to just write a function that's going to convert from our linter JSON output 00:09:13.360 |
and puts it into diagnostics for the Grit VS Code extension. 00:09:18.360 |
I promise that everything that's required was in the context window. 00:09:21.360 |
It's not something where it had to go retrieve additional information. 00:09:25.360 |
It came back with a pretty reasonable completion. 00:09:33.360 |
Like, I imagine if you look at this code, you wouldn't be able to tell anything that's wrong with it. 00:09:37.360 |
I certainly couldn't tell anything that's wrong with it from eyeballing it. 00:09:43.360 |
And this is one of the main things to understand: humans also can't look at this code and tell whether it's correct. 00:09:48.360 |
This is why we have systems that allow us to type check and lint things. 00:09:52.360 |
That allows us to understand that code that even looks kind of correct is, in fact, incorrect. 00:09:58.360 |
But I went back and just asked Claude to fix the code for me. 00:10:06.360 |
As you can kind of imagine, it doesn't do any better than I do at just looking and eyeballing it. 00:10:12.360 |
So it just comes back with some totally irrelevant answer on how to fix it. 00:10:16.360 |
Because, again, it's not grounded in what the actual errors are. 00:10:31.360 |
And this is why compilers are great: we can actually get that information really close to the code. 00:10:37.760 |
With this information fed back into an LLM, it's able to correct that mistake, no problem. 00:10:43.920 |
It uses the convert-LSP-range-to-Grit-range helper, which, by the way, was in the context window. 00:10:49.620 |
It just didn't realize that it needed to use that until it had the compiler error forcing it to. 00:10:55.000 |
So I think it's important to see that IDEs are already making us superhuman, 00:10:59.780 |
and we need to make sure that all of our AI agents have access to the same tools that we have. 00:11:08.860 |
This basic flow of prompt, get some code, build it, type check it, and then fix it based on that output is essential. 00:11:17.360 |
This is probably half of what you need to do to build a really good agent: make sure it gets that compiler feedback. 00:11:24.060 |
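As a rough sketch of that loop, here's what it might look like; the generateCode, typecheck, and formatDiagnostics helpers are hypothetical stand-ins for your model call and build tooling, not Grit's APIs.

```ts
declare function generateCode(prompt: string): Promise<string>;
declare function typecheck(code: string): Promise<string[]>;        // e.g. run the build on a scratch copy
declare function formatDiagnostics(errors: string[]): string;

async function generateWithRepair(prompt: string, maxRepairs = 3): Promise<string> {
  let code = await generateCode(prompt);
  for (let attempt = 0; attempt < maxRepairs; attempt++) {
    const errors = await typecheck(code);
    if (errors.length === 0) return code;   // clean build: we're done
    // Ground the retry in the concrete compiler output instead of asking
    // the model to eyeball its own code.
    code = await generateCode(
      `${prompt}\n\nYour previous attempt failed to type check:\n` +
        `${formatDiagnostics(errors)}\n\nReturn a corrected version.`,
    );
  }
  return code;   // still failing after maxRepairs: hand off or revert to a checkpoint
}
```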
But builds are really slow when you're talking about enterprise code bases. 00:11:27.200 |
So this is real numbers from one of our customers. 00:11:30.320 |
It takes them 10 minutes to build their application from scratch. 00:11:36.360 |
And this is actually pretty typical if you look at very large scale enterprise code bases. 00:11:40.680 |
That's why large companies have had to build a lot of caching, because it's hard to build 00:11:44.280 |
a large code base from scratch. And this is completely different from what people usually assume. 00:11:49.360 |
People usually think inference takes a long time, right? 00:11:52.740 |
And this is actually a pretty long prompt, 30 seconds, right? 00:11:54.740 |
We're using a huge model to generate this code, takes 30 seconds. 00:11:57.980 |
But that's dwarfed by the 10 minutes to build the application, right? 00:12:01.120 |
This basically destroys our entire agentic flow if we're waiting 10 minutes for every single iteration. 00:12:08.280 |
But this is even more compounded if we're trying to do that in a loop, right? 00:12:11.740 |
If we're trying to do a single change, it might take a day if you're just doing this naively. 00:12:15.980 |
There are some agent projects that, in fact, do take a day to make very basic changes because 00:12:20.400 |
they haven't done this kind of optimization. 00:12:22.200 |
But you might ask, like, how are you able to make changes in your IDE at a fast rate? 00:12:27.140 |
You're not waiting 10 minutes every time you make a single keystroke to get a compiler error. 00:12:32.020 |
It's because there's been a lot of work with language servers to solve this so that you 00:12:36.460 |
can do a bunch of upfront prep so you can build the index in memory, have that in-memory 00:12:40.620 |
queryable index, and then only rewrite the parts or only recheck the parts that you've modified. 00:12:45.220 |
On every keystroke, most tools, like the TS server in TypeScript, for example, are doing live reconciliation. 00:12:54.440 |
You can do the 30-second prompt, then a one-second recompute from the TS server, then 30 seconds 00:12:59.100 |
to fix it, and this is a much more reasonable flow, right? 00:13:01.780 |
So you obviously want to be using the same kind of language server tools that you'd be using 00:13:05.560 |
as a human, not CLI-based tools, which often don't have the same incremental heuristics in place. 00:13:13.540 |
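As a rough illustration of that in-memory, incremental approach, here's a minimal sketch using TypeScript's own compiler API; it shows the idea of only rechecking the file you just edited, and is not Grit's actual setup.

```ts
import * as ts from "typescript";

// In-memory file store; bumping the version on edit lets the language
// service recheck only what changed instead of rebuilding the whole program.
const files = new Map<string, { text: string; version: number }>();

const host: ts.LanguageServiceHost = {
  getScriptFileNames: () => [...files.keys()],
  getScriptVersion: (f) => String(files.get(f)?.version ?? 0),
  getScriptSnapshot: (f) => {
    const entry = files.get(f);
    return entry ? ts.ScriptSnapshot.fromString(entry.text) : undefined;
  },
  getCurrentDirectory: () => process.cwd(),
  getCompilationSettings: () => ({ strict: true }),
  getDefaultLibFileName: (opts) => ts.getDefaultLibFilePath(opts),
  fileExists: ts.sys.fileExists,
  readFile: ts.sys.readFile,
};

const service = ts.createLanguageService(host, ts.createDocumentRegistry());

// Apply an agent's edit and get back just the diagnostics for that file.
function applyEdit(fileName: string, newText: string): ts.Diagnostic[] {
  const prev = files.get(fileName) ?? { text: "", version: 0 };
  files.set(fileName, { text: newText, version: prev.version + 1 });
  return [
    ...service.getSyntacticDiagnostics(fileName),
    ...service.getSemanticDiagnostics(fileName),
  ];
}
```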
And then ideally, you do this in a nice loop, eventually get to the point where you can commit 00:13:17.340 |
and get a fresh PR to do that migration to OpenTelemetry. 00:13:23.540 |
In practice, at some point, it hits an error that it can't fix, right? 00:13:28.340 |
It hits an error that gets into a loop, and it's continuously trying to fix the same error. 00:13:32.020 |
It uses five different techniques, then goes back, and your context window is completely polluted. 00:13:37.520 |
People often say, like, agents don't work. 00:13:40.100 |
This is probably half of the reason that agents don't work: you just have compounding errors. 00:13:44.220 |
We found any time we actually have more than 10 prompts in a row, our chance of a good outcome drops dramatically. 00:13:51.420 |
So the way we work around that is, instead of trying to repeatedly fix an error, we should 00:13:58.420 |
actually just save our original state, revert back to that, and then continue to edit from there. 00:14:04.700 |
So if we went down a path that's just a bad path and got stuck in a loop, we want to go 00:14:08.100 |
back to a known good checkpoint and then build from there. 00:14:11.100 |
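A minimal sketch of that checkpoint-and-revert loop follows, assuming a hypothetical workspace abstraction with snapshot/restore and an error-count heuristic; it illustrates the idea rather than Grit's implementation.

```ts
interface Workspace {
  snapshot(): Promise<string>;                // returns a checkpoint id
  restore(checkpoint: string): Promise<void>;
  errorCount(): Promise<number>;
  errors(): Promise<string[]>;
}

declare function attemptFix(ws: Workspace, error: string): Promise<void>; // one LLM repair step

async function fixWithCheckpoints(ws: Workspace, maxRounds = 8): Promise<boolean> {
  let checkpoint = await ws.snapshot();       // known good starting state
  let best = await ws.errorCount();

  for (let round = 0; round < maxRounds; round++) {
    const errors = await ws.errors();
    if (errors.length === 0) return true;     // clean: done
    await attemptFix(ws, errors[0]);
    const after = await ws.errorCount();
    if (after < best) {
      checkpoint = await ws.snapshot();       // real progress: promote a new checkpoint
      best = after;
    } else {
      // No progress: don't keep piling fixes onto a bad state with a
      // polluted context; revert to the last known good checkpoint.
      await ws.restore(checkpoint);
    }
  }
  return (await ws.errorCount()) === 0;
}
```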
This is actually how we're able to do this quickly. 00:14:12.420 |
We don't want to spend 10 minutes recomputing each time. 00:14:15.060 |
We want to actually build that in-memory graph that we were talking about with the TS server. 00:14:23.220 |
That's where we use Firecracker. It's a VM manager that's used for AWS Lambda, but we can actually use it for dev environments as well. 00:14:28.440 |
And we can actually take the in-memory state, snapshot that, and then fork it into 10 different 00:14:33.660 |
isolated environments that all have everything pre-computed. 00:14:36.580 |
You can try 10 different changes in them and then figure out which change is the most effective. 00:14:45.400 |
You can end up with an AI system that looks more like a distributed database than it does 00:14:49.760 |
a traditional agent or something that you're running on your laptop, right? 00:14:52.260 |
We actually have flows where we often have six to 10 different agents working in parallel. 00:14:59.700 |
They're supposed to report back once they're done. 00:15:02.260 |
And then we'll actually look at the different evaluations. 00:15:04.440 |
We look at both some LLM-based evals, but also heuristics like how many errors there are 00:15:08.580 |
currently in the code base and how many unit tests are currently passing, and then actually 00:15:12.680 |
compute which of these is the quorum, right? 00:15:15.300 |
It's actually similar to, again, a database system where you would have a vote on what the correct state is. 00:15:20.180 |
Here, it's: what's the new good state that we want to fork from? 00:15:22.840 |
There are four here that have similar states, so we want to use that as our known good state, 00:15:29.380 |
save that as our known good state, and then fork from there going forward. 00:15:31.800 |
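A minimal sketch of that quorum selection, assuming each parallel agent reports back a candidate with simple heuristic scores; the Candidate shape is illustrative, not Grit's actual data model.

```ts
interface Candidate {
  id: string;
  compileErrors: number;
  testsPassing: number;
}

// Bucket candidates by their heuristic outcome; the largest bucket of
// "similar" states is the quorum we promote to the new known good state.
function pickQuorum(candidates: Candidate[]): Candidate[] {
  const buckets = new Map<string, Candidate[]>();
  for (const c of candidates) {
    const key = `${c.compileErrors}:${c.testsPassing}`;
    const bucket = buckets.get(key) ?? [];
    bucket.push(c);
    buckets.set(key, bucket);
  }
  let quorum: Candidate[] = [];
  for (const bucket of buckets.values()) {
    if (bucket.length > quorum.length) quorum = bucket;
  }
  // Ties could additionally be broken by fewest errors or most passing tests.
  return quorum;
}
```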
And this ends up being much, much more reliable, because we can have an entire PR where, yes, 00:15:37.020 |
we've done 30 or 40 different generations on it, but the final chain was only a few edits long, 00:15:44.280 |
because we had one that went back to a known good state, and then the next one was operating from there. 00:15:54.300 |
If you're doing 40 different edits to make a single PR across very large files, that's 00:15:59.500 |
a lot of money that you're spending on inference. 00:16:02.020 |
This is a common problem with making good edits. 00:16:04.100 |
Everyone naively just asks for, generate the whole file again, right? 00:16:07.640 |
You definitely should start with that if you're building your own AI tool. 00:16:10.880 |
But then you run into the classic problem of laziness. 00:16:15.160 |
It still said, you know, the start of the function remains the same. 00:16:17.740 |
It left this comment in because it didn't want to output that code. 00:16:19.780 |
And it's just because output tokens are fundamentally more expensive. 00:16:22.220 |
And if you look at GPT-4o, it's $5 per million input tokens versus $15 per million output tokens. 00:16:32.820 |
And then response limits are not growing at the same rate as context size, right? 00:16:36.300 |
We've got models out there that have 1.5 million tokens, 2 million tokens in their context 00:16:41.000 |
window, and still only output 4,000 tokens at a time, right? 00:16:44.780 |
Because generation is autoregressive, it gets a bunch more expensive. 00:16:47.060 |
So you really don't want to output entire large files as you're making edits. 00:16:58.220 |
You can say, like, generate a unified diff for this. 00:17:03.600 |
LLMs are still not very good at knowing what the right line number is, even if you give them the line numbers. 00:17:10.940 |
The real-world code they're trained on is largely not in diff format, right? 00:17:16.500 |
You can do simple search and replace with function calls. 00:17:19.220 |
The problem with this is function calls are underneath, for the most part, JSON. 00:17:25.520 |
You end up using a lot of tokens just for escape characters, right? 00:17:31.180 |
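To see why, compare the same small edit as raw text versus embedded in a JSON tool call:

```ts
// The same three-line edit, as raw text vs. embedded in a JSON tool call.
const rawEdit = `if (err) {
  logger.error("an error occurred", err);
}`;

// JSON.stringify escapes every newline and quote, inflating the payload the
// model has to generate token by token.
const asToolCall = JSON.stringify({ replacement: rawEdit });
console.log(asToolCall);
// {"replacement":"if (err) {\n  logger.error(\"an error occurred\", err);\n}"}
```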
So that's why we actually developed a GritQL loose search and replace. 00:17:34.800 |
So we could actually do something that's similar to the way a model naturally writes an edit. 00:17:39.140 |
And this is something you might have, like, in a tutorial, which is: replace this snippet with that snippet. 00:17:43.020 |
And this is actually the exact same output that comes from the LLM. 00:17:47.500 |
We'll do a loose match to try to find the code that looks like the 'before' snippet and replace it 00:17:51.460 |
with the 'after' snippet, right? 00:17:55.020 |
Because we can elide irrelevant details, like what's currently 00:17:58.100 |
inside the matched function, and just give enough detail to make the replacement. 00:18:03.160 |
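A minimal sketch of what a loose search-and-replace might look like follows; a real implementation would match on the syntax tree, but a simple line-window similarity is enough to show the idea.

```ts
function similarity(a: string, b: string): number {
  const norm = (s: string) => s.replace(/\s+/g, " ").trim();
  const ta = new Set(norm(a).split(" "));
  const tb = new Set(norm(b).split(" "));
  let shared = 0;
  for (const t of ta) if (tb.has(t)) shared++;
  return shared / Math.max(ta.size, tb.size, 1);
}

// Find the block of lines that best matches the "before" snippet (ignoring
// whitespace and elided details) and splice in the "after" snippet.
function looseReplace(file: string, before: string, after: string): string {
  const lines = file.split("\n");
  const window = before.split("\n").length;
  let bestStart = -1;
  let bestScore = 0;
  for (let i = 0; i + window <= lines.length; i++) {
    const score = similarity(lines.slice(i, i + window).join("\n"), before);
    if (score > bestScore) {
      bestScore = score;
      bestStart = i;
    }
  }
  if (bestStart < 0 || bestScore < 0.5) return file;   // nothing close enough: leave untouched
  lines.splice(bestStart, window, after);
  return lines.join("\n");
}
```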
And I just want to leave you with where we're going next. 00:18:08.520 |
Right now it's still, you know, basically an AI workflow. 00:18:11.780 |
It looks kind of like your CI, even though it's thousands of agents executing. 00:18:15.620 |
I'm really excited about where we go next with this, figuring out what it looks like to manage this at a higher level of abstraction. 00:18:21.480 |
I think of SimCity as the ultimate version of this, where you can zoom in and out, and understand the system 00:18:25.960 |
at different levels of granularity and edit things there.