So I'm Morgante, the founder of Grit. I'm going to talk about code generation and maintenance at scale, or why CPUs still matter, and what it takes to actually make one of these agentic workflows work in production. Quinn was talking about how most people have not merged an AI-generated PR yet. That's 100% true.
Grit has probably merged more AI-generated PRs than any other company in the world at this point, because we've focused very narrowly and have done a lot of work above the model layer. We're going to talk about how we did that. It's helpful to know why I started Grit. My background is all developer tools.
I worked at Google Cloud for five years and built a lot of stuff at the DevOps layer: thinking about Kubernetes and how you orchestrate very large-scale systems, working on tools like Kustomize and Terraform templates. One of the biggest things I learned from that was how rare it was for a customer to come to us and ask for a brand new application.
Right? People didn't come and say, I want to build a new app on Google Cloud. It sounds cool. 90% of the time customers came and said, I have this line of business application that is doing $100 million in revenue. How do I run that on Kubernetes, right? And that's what all of our templates did.
That's what everything we built in the pre-AI era of automation was about: how you run existing applications. And that's why we started Grit. Every hyped demo you see on Twitter is usually: type a prompt, get a new application from scratch.
Build something brand new. It's exciting, and it does really well on Twitter. But that's not what devs do day in and day out. Developers spend most of their time modifying huge applications so that flights run on time. And this falls into three rough categories of developer tools. There's IDE developer assistance.
This clearly has the most product-market fit today. It's really easy, right? Like I was saying, you just do auto-complete. That's a very simple thing. Then there are AI agents that are focused on lowering the floor: they allow people to do tasks they otherwise don't have the skills for.
Allowing a product manager or another non-technical user to build an application they don't have the skill set for. This is powerful, but I'm actually pretty skeptical that that's how most software is going to be built in the future. It requires real thinking about how you actually spec things out.
How do you think about edge cases? Basically, the way we're trained as engineers is what's required to build great software. Which is why, with Grit, we focus on raising the ceiling of what great engineers can do. Principal engineers, the most senior engineers you work with, are primarily limited by time.
They can't be in 10 places at once. But AI agents can be in 10 places at once if the right engineer is controlling them. What we focus on is supercharging the productivity of the top 1% of engineers. It also helpfully gets around the problem of 95% of engineers not using AI.
The great thing about Grit is that at our customers, 95% of engineers never touch it. There will be 100 engineers on a team who are not using Grit, and one engineer who is deeply embedded with Grit and is generating hundreds of PRs with their agents. But to do this, tools need to change.
The IDE that you have today is a scalpel. It's focused on editing individual lines of code, and it's great for that. But it's not built for editing hundreds of repos at once, or for opening 1,000 files and making changes in all of them. And that's why we built Grit.
We want bulldozers for code. When you're generating huge quantities of code, how do you push it around in an effective way, working at a higher level of abstraction instead of editing individual lines? And this is super necessary: we've seen an explosion in how much code is being generated.
A lot of our customers are seeing 20% to 30% more code coming out of their teams now, just because there are more PRs, more CI, more of everything running because you have code gen. And this is just going to accelerate: once we go from 5% to maybe 50% of people actually using AI, there's way, way more code in the world.
And we need better tools for managing that code once it's in production. To give an example of what this looks like in practice: this is a real customer of ours. They've been around a long time. They've got thousands of repos and thousands of developers. And they wanted to use OpenTelemetry instead of their existing logging.
This is traditionally a massive effort, right? You have to coordinate across hundreds of teams to get them to understand OpenTelemetry, to understand how to instrument their code. You have to get them to do actual code changes to swap out their logging library. You have to do a bunch of education efforts.
And it's usually very much a people-and-process problem: something where you have a program manager with a massive Excel spreadsheet. That's why I say what Grit really competes with is Excel, not any other AI dev tool. You have these spreadsheets where you manage these changes.
And tens of thousands of developer hours go into a change like that. So a lot of companies just say, you know what, that's not worth it. I'm not going to migrate to OpenTelemetry. I'm not going to move to the cloud. I'm going to stay with my old ways. People still have millions of lines of COBOL because it's just so much work to do this kind of coordinated change.
With Grit, you don't have to do that coordination effort, because you can have one engineer who actually coordinates the change and drives individual AI developer agents to do the work. You don't have to have a bunch of meetings, because it's just one person telling their agents what to do.
And you can do it with under 100 developer hours, because the human is just doing high-level orchestration; the thousands of compute hours of actually making the changes are done by the AI. This is literally a project they had postponed for years because it just was not feasible.
They couldn't get it onto the roadmap. They got it done with Grit in a week. Grit opened up 1,000 PRs across all the repos, iterated on the changes, got them merged, and migrated everything over. So how do we actually do a change like that? It's roughly a three-step process.
Planning is a big part of it. We index the entire code base. We do semantic indexing, so embeddings and understanding the intent of each file, but we also do a lot of traditional static-analysis indexing, so we understand the structure of the code, what's being imported from where, and what the dependency graph is.
That's the sort of thing you need to know to do really high-quality agentic flows. Then, once we have the plan of how we're going to make changes, we execute the plan. We use large language models that take each change and delegate it to a sub-agent.
Each sub-agent makes a modification in one file. It uses something called GritQL, our custom query engine that can index millions of lines of code and find the right places to modify, as well as diff generation from language models. And then finally the change is pushed up for PR review.
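To make that concrete, here's a minimal sketch of that plan-then-execute split. None of these names are Grit's real APIs; the index hits, sub-agent runner, and PR helper are hypothetical parameters. It just illustrates the shape: one planned change per target site, one sub-agent per file, and the results grouped into a reviewable PR per repo.

```typescript
// Hypothetical sketch of a plan/execute pipeline, not Grit's actual code.

interface IndexHit { repo: string; file: string; snippet: string }
interface PlannedChange { repo: string; file: string; instruction: string }

function plan(hits: IndexHit[]): PlannedChange[] {
  // One planned change per index hit; the instruction is what the sub-agent sees.
  return hits.map((hit) => ({
    repo: hit.repo,
    file: hit.file,
    instruction: `Migrate this logging call to OpenTelemetry:\n${hit.snippet}`,
  }));
}

async function execute(
  changes: PlannedChange[],
  runSubAgent: (change: PlannedChange) => Promise<string>, // returns a diff for one file
  openPullRequest: (repo: string, diffs: string[]) => Promise<void>,
): Promise<void> {
  // Fan out: each sub-agent works on a single file, independently.
  const diffs = await Promise.all(
    changes.map(async (c) => ({ repo: c.repo, diff: await runSubAgent(c) })),
  );

  // Group the per-file diffs back into one reviewable PR per repo.
  const byRepo = new Map<string, string[]>();
  for (const { repo, diff } of diffs) {
    byRepo.set(repo, [...(byRepo.get(repo) ?? []), diff]);
  }
  for (const [repo, repoDiffs] of byRepo) {
    await openPullRequest(repo, repoDiffs);
  }
}
```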
Developers act as the director of this process. A typical scenario is that there's a principal engineer driving the change who reviews the PRs. For the rest of the developers, their primary interaction with Grit is just seeing a PR land in their repo.
They might leave a one-line comment that Grit will learn from, but they never actually open the Grit UI; they're just responding to the changes that come from Grit. Cool. So, a little bit more about how we find code. Our goal here is to find all the error logs in the code base, because we want to migrate those over to OpenTelemetry.
The naive approach, if you went to many of the workshops yesterday, would be: all right, just chunk it and put it into a RAG pipeline. You have a bunch of embeddings, and that could theoretically work for some document use cases. I'll tell you it absolutely will not work for this problem.
If you just try to find stuff that looks like a logging error, you're going to pull in a lot of irrelevant context: anything that looks log-like. An LLM has a hard time differentiating between a user-facing log, like an alert in a UI, and an actual log statement that you want to move to OpenTelemetry.
There are also unpredictable thresholds. You don't actually know how much code you're looking for, so you can't just retrieve the top 10 closest matches. In some cases you want to retrieve 10,000. In a lot of cases, developers don't even know how big a change is until Grit starts to propose it for them.
So that's why we built the GritQL query engine. It's our own query system that combines the best of static analysis with the best of AI. We've got a query here that's looking for `logger` with some set of args; basically, we're just looking for a function call.
That's a syntactic query, so we're looking for all of those function calls across our entire code base. Then we say that the args should be `like` "an error occurred": we're giving an example of the kind of error message we're trying to match.
`like` is the magic word that converts it into a semantic representation. We're asking: is the cosine similarity from embedding search sufficiently above a threshold that this is an actual log message, versus some other function call that we wouldn't want to modify?
And then, finally, we can add an `imported from` clause. We've got a built-in library that understands the whole dependency graph, so we can make sure this is actually imported from log4j. In this example, we wanted to make sure we were only substituting our log4j logs.
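Putting those three pieces together, the query being described looks roughly like this. This is a reconstruction from the description above, not copied from the slide, and the exact GritQL syntax for the semantic and import built-ins may differ from what Grit actually ships:

```
`$logger($args)` where {
  $args <: like "an error occurred",
  $logger <: imported_from(from="log4j")
}
```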
That `imported from` clause traverses the import graph earlier in the program. So that covers finding the code we want to change. But once we've actually found the code, how do we make reliable changes? Unfortunately, really smart models still have a hard time doing this completely autonomously.
Just to give an example, I used Claude 3.5 Sonnet today. It's a really good model. We put entire files, about 100,000 tokens of context from Grit's VS Code extension plus some of our linter outputs, into the context window, and asked it to write a function that converts our linter's JSON output into diagnostics for the Grit VS Code extension.
Pretty simple task. I promise that everything required was in the context window; it's not something where it had to go retrieve additional information. It was all there. It came back with a pretty reasonable completion that converts ESLint output to LSP diagnostics. This looks reasonable to me; I imagine if you looked at this code, you wouldn't be able to tell anything is wrong with it.
I certainly couldn't tell anything was wrong with it from eyeballing it. But it is wrong. And this is one of the main things to understand: humans also can't look at this code and see what's wrong with it. That's why we have systems that let us type check and lint things.
They allow us to see that code that looks correct is, in fact, incorrect and will fail in production. So I went back and just asked Claude to fix the code for me. I said, this broke in production; I tried to put it in my VS Code extension.
It broke, and I just asked it why. As you can imagine, it doesn't do any better than I do at eyeballing the code and figuring out why it's wrong. It just comes back with a totally irrelevant answer on how to fix it.
Because, again, it's not grounded in what the actual errors are, it doesn't do any better. And this is why compilers are great: we can get that information really close in the dev loop.
With that information fed back into the LLM, it's able to correct the mistake, no problem. It's a pretty easy change. It uses the helper that converts an LSP range to a Grit range, which, by the way, was in the context window. It could have used that before; it just didn't realize it needed to until the compiler error forced it to.
So I think it's important to see that IDEs are already making us superhuman, and we need to make sure all of our AI agents have access to the same tools that make them super-AI. Compilers rock. The basic flow is: prompt, get some code, build it, type check it, and then feed that output back to the LLM to fix it.
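As a minimal sketch of that loop (the `generate` parameter is a hypothetical stand-in for whatever LLM call you use), here's roughly what it looks like in TypeScript, using the TypeScript compiler API for the type-check step:

```typescript
import ts from "typescript";

// Generate code, type check it, and feed the compiler errors back until it compiles.
async function generateUntilItCompiles(
  fileName: string,
  prompt: string,
  generate: (prompt: string) => Promise<string>, // hypothetical LLM call
  maxAttempts = 3,
): Promise<string> {
  let source = await generate(prompt);

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const diagnostics = typeCheck(fileName, source);
    if (diagnostics.length === 0) return source; // it compiles, we're done

    // Ground the next attempt in the actual compiler errors.
    const errors = diagnostics
      .map((d) => ts.flattenDiagnosticMessageText(d.messageText, "\n"))
      .join("\n");
    source = await generate(
      `${prompt}\n\nYour previous attempt failed to type check:\n${errors}\n\nPrevious attempt:\n${source}`,
    );
  }
  return source;
}

function typeCheck(fileName: string, source: string): readonly ts.Diagnostic[] {
  // Compile a single in-memory file; a real setup would reuse the project's tsconfig.
  const host = ts.createCompilerHost({ strict: true });
  const originalGetSourceFile = host.getSourceFile.bind(host);
  host.getSourceFile = (name, languageVersion) =>
    name === fileName
      ? ts.createSourceFile(name, source, languageVersion, true)
      : originalGetSourceFile(name, languageVersion);
  const program = ts.createProgram([fileName], { strict: true, noEmit: true }, host);
  return ts.getPreEmitDiagnostics(program);
}
```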
This flow is actually really powerful. It's probably half of what you need to build a really good agent: make sure it works reliably. But it's really slow when you're talking about enterprise code bases. These are real numbers from one of our customers.
It takes them 10 minutes to build their application from scratch, and that's just for type checking; it's not even pushing a production build. This is actually pretty typical for very large-scale enterprise code bases. It's why large companies have had to build a lot of caching: it's hard to build a large code base from scratch. And it's completely different from what people usually expect with AI.
People usually think inference takes a long time: you're waiting for the AI. And this is actually a pretty long prompt; we're using a huge model to generate this code, and it takes 30 seconds. But that's dwarfed by the 10 minutes to build the application. This basically destroys our entire agentic flow if we're waiting 10 minutes for every single change to validate whether it's correct.
And it compounds if we're trying to do this in a loop. A single change might take a day if you do this naively. There are agent projects out there that do, in fact, take a day to make very basic changes, because they haven't done this level of optimization.
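(To put rough numbers on that: if a change needs, say, 40 generate-and-validate iterations, that's 40 × (30 seconds of inference + 10 minutes of building), or about 7 hours, and almost all of that time is spent waiting on the build rather than the model.)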
But you might ask: how are you able to make changes in your IDE so quickly? You're not waiting 10 minutes after every keystroke to get a compiler error; it's much faster than that. That's because a lot of work has gone into language servers to solve this: you do a bunch of up-front prep to build an in-memory, queryable index, and then only re-check the parts you've modified.
On every keystroke, most tools, like tsserver for TypeScript, do live reconciliation of just that specific file. This is much, much faster: you can do the 30-second prompt, a one-second recompute from tsserver, and then another 30-second prompt to fix it. That's a much more reasonable flow.
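Here's a rough sketch of that idea using the TypeScript language service API (the host setup is simplified; a real setup would load the project's tsconfig and file list):

```typescript
import ts from "typescript";

// Keep an in-memory language service alive and only re-check the file the
// agent just edited, instead of rebuilding the whole project each time.
const files = new Map<string, { text: string; version: number }>();

const host: ts.LanguageServiceHost = {
  getScriptFileNames: () => [...files.keys()],
  getScriptVersion: (name) => String(files.get(name)?.version ?? 0),
  getScriptSnapshot: (name) => {
    const file = files.get(name);
    return file ? ts.ScriptSnapshot.fromString(file.text) : undefined;
  },
  getCurrentDirectory: () => process.cwd(),
  getCompilationSettings: () => ({ strict: true, noEmit: true }),
  getDefaultLibFileName: (options) => ts.getDefaultLibFilePath(options),
  fileExists: ts.sys.fileExists,
  readFile: ts.sys.readFile,
};

const service = ts.createLanguageService(host, ts.createDocumentRegistry());

// Called after every LLM edit: bump the version so the service knows exactly
// which file changed, then ask for diagnostics on just that file.
function checkEdit(fileName: string, newText: string): readonly ts.Diagnostic[] {
  const previous = files.get(fileName);
  files.set(fileName, { text: newText, version: (previous?.version ?? 0) + 1 });
  return [
    ...service.getSyntacticDiagnostics(fileName),
    ...service.getSemanticDiagnostics(fileName),
  ];
}
```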
So you obviously want to be using the same kind of language server tools you'd use as a human, not CLI-based tools, which often don't have the same optimizations in place. And then ideally you do this in a nice loop and eventually get to the point where you can commit and get a fresh PR for that migration to OpenTelemetry.
That's what it looks like in theory. In practice, at some point it hits an error it can't fix. It gets into a loop, continuously trying to fix the same error; it tries five different techniques, goes back, and your context window is completely polluted with the wrong errors.
People often say agents don't work. This is probably half the reason: you get compounding failures. We found that any time we have more than 10 prompts in a row, our chance of a successful PR is dramatically lower. So instead of trying to repeatedly fix an error, we save our original state, revert back to it, and continue editing from there.
If we went down a bad path and got stuck, we want to go back to a known good checkpoint and build from there. This is also how we're able to do it quickly: we don't want to spend 10 minutes recomputing each time.
We want to take the in-memory graph we were just talking about with tsserver and save it as a snapshot of memory. So we use Firecracker. It's the VM manager used for AWS Lambda, but we can use it for dev environments too.
We take the in-memory state, snapshot it, and fork it into 10 different isolated environments that all have everything pre-computed. You can try 10 different changes in them and then figure out which change is most likely to yield good results. In fact, this becomes massively parallel.
You end up with an AI system that looks more like a distributed database than a traditional agent running on your laptop. We often have six to ten different agents working in parallel, all working from a known good state.
They report back once they're done, and then we look at different evaluations: some LLM-based evals, but also heuristics like how many errors are currently in the code base and how many unit tests are currently passing. Then we compute which of these states forms the quorum.
It's similar to a database system where you'd vote on the new master. Here it's: what's the new good state we want to fork from? In this case there are four agents that reached similar states, so we take that as our known good state, save it, and fork from there going forward.
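A minimal sketch of that quorum step, with made-up field names and heuristics, might look like this: score each forked agent's resulting state, keep only the healthy ones, group identical results, and promote the largest group as the next checkpoint.

```typescript
// Hypothetical quorum selection over the states produced by parallel agents.
interface CandidateState {
  agentId: string;
  typeErrors: number;   // e.g. from the language service
  failingTests: number; // e.g. from the test runner
  stateHash: string;    // hash of the resulting file contents
}

function pickNextCheckpoint(candidates: CandidateState[]): CandidateState | undefined {
  // Keep only candidates that pass the basic health checks.
  const healthy = candidates.filter((c) => c.typeErrors === 0 && c.failingTests === 0);

  // Group healthy candidates by the state they ended up in.
  const groups = new Map<string, CandidateState[]>();
  for (const c of healthy) {
    groups.set(c.stateHash, [...(groups.get(c.stateHash) ?? []), c]);
  }

  // The largest group is the quorum; the next round of agents forks from it.
  let quorum: CandidateState[] = [];
  for (const group of groups.values()) {
    if (group.length > quorum.length) quorum = group;
  }
  return quorum[0];
}
```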
And this ends up being much, much more reliable, because we can have an entire PR where, yes, we've done 30 or 40 different generations, but the final chain only contains four of them, since each step operates from the quorum state at the previous checkpoint.
But these edits get pretty expensive. If you're doing 40 different edits to make a single PR across very large files, that's a lot of money you're spending on inference. This is a common problem with making good edits. Everyone naively just asks the model to regenerate the whole file.
It's the simplest approach, and you definitely should start with it if you're building your own AI tool. But then you run into the classic problem of laziness. This is actually from Sonnet: it said "the start of the function remains the same" and left that comment in because it didn't want to output the rest of the code.
And that's because output tokens are fundamentally more expensive. GPT-4o is $5 per million input tokens versus $15 per million output tokens; Claude 3.5 Sonnet is $3 versus $15. This is pretty consistent across the board. And response limits are not growing at the same rate as context sizes.
We've got models out there with 1.5 million or 2 million tokens of context window that still only output 4,000 tokens at a time. Because generation is autoregressive, it gets a lot more expensive. So you really don't want to output entire large files as you're making edits.
You want to find a good edit format. The whole-file format works well, but it's very expensive. You can do diffs: generate a unified diff and try to apply it. There are some problems with this. One is line numbers: LLMs are still not very good at knowing what the right line number is, even if you give them the numbers.
They're not that good at the math part. It's also off-distribution: the real-world code they're trained on is mostly full files, not diffs. You can do simple search-and-replace with function calls, but the problem is that function calls are, for the most part, JSON underneath.
Escaping code in JSON is terrible; you end up spending a lot of tokens just on escape characters. It's just not a very good format to use. So that's why we developed GritQL loose search-and-replace, where the model outputs just a before snippet and an after snippet.
It's the kind of thing you might see in a tutorial: replace this with that. And this is the exact output that comes from the LLM. We then do a loose match to find the code that looks like the before snippet and replace it with the code that looks most like the after snippet.
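Here's a toy sketch of that "loose" application step. Grit's real implementation matches on syntax via GritQL; this whitespace-insensitive string version just illustrates the idea of applying a before/after snippet without exact line numbers.

```typescript
// Apply a before/after snippet by finding the closest-looking region of the file.
function looselyReplace(source: string, before: string, after: string): string {
  // Compare with all whitespace collapsed so formatting differences don't matter.
  const normalize = (s: string) => s.replace(/\s+/g, " ").trim();
  const target = normalize(before);

  const lines = source.split("\n");
  const windowSize = before.split("\n").length;

  for (let start = 0; start + windowSize <= lines.length; start++) {
    const window = lines.slice(start, start + windowSize).join("\n");
    if (normalize(window) === target) {
      // Splice the after snippet in place of the matched window.
      return [...lines.slice(0, start), after, ...lines.slice(start + windowSize)].join("\n");
    }
  }
  return source; // no sufficiently close match found; leave the file untouched
}
```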
And this works really, really well, because we get to elide irrelevant details, like what's currently inside the function being matched, and just give enough detail to make the replacement. Cool. I just want to leave you with where we're going next. This is our current UI.
It's still very traditional: it's basically an AI workflow view that looks kind of like your CI, even though there are thousands of agents executing. I'm really excited about where we go next with this, figuring out what it looks like to manage an entire code base.
I think of SimCity as the ultimate interface, where you can zoom in and out, understand things at different levels of granularity, and edit them there. Cool. Thanks so much. And we are hiring, so scan the QR code.