NATHAN SOBO: Hey, everybody. Thanks for coming to my talk. I'm Nathan Sobo. I'm the co-founder of Zed. We are an AI-enabled code editor. And what sets us apart is we are not a fork of VS Code. We are implemented from scratch in Rust. We literally engineered the entire system like a video game, around roughly 1,200 lines of shader programs that run on the GPU.
And the rest of the system is organized to deliver frames at 120 frames per second. And recently, because time is short, I recorded a video. This is obviously sped up massively, but we launched agentic editing in Zed. And I wanted to talk about the approach that we took to test this, be empirical, and deliver a reliable product that does this effectively.
So here's what we're going to be talking about. The rest of the talk is going to be looking at code and just talking about our experience. So testing, evaluation, what's the difference? First, I just want to start by saying, without an empirical approach to the software that we've been building since 2018 or 2021, depending on how you measure it, Zed would crash every eight seconds.
Like, we have probably tens of thousands of tests at this point. And here's an example of the extreme to which we take empirical software development. We literally are starting a server, creating two clients, and we have a simulated scheduler that gets passed in here. And we run 50 different iterations of this with every random interleaving of anything that could possibly be concurrent.
So that if, say, the 375,000th iteration of a particular interleaving of network packets goes wrong (this example only runs 50), we can replay that exact iteration over and over again, freeze it, and have full control. So we're very hardcore about testing and being empirical at Zed. But up until very recently, until shipping some of these AI-related features, we've been able to be fully deterministic.
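The skeleton of that kind of test looks roughly like this. This is a minimal sketch assuming the rand crate, not Zed's actual harness, but it shows the key property: every random choice flows through one seeded RNG, so any failing iteration can be replayed exactly.

```rust
use rand::{rngs::StdRng, Rng, SeedableRng};

// One simulated run: in the real harness this would drive the server, the two
// clients, and the simulated scheduler; here it just makes a random choice so
// the seeding machinery is visible.
fn run_one_interleaving(rng: &mut StdRng) -> Result<(), String> {
    let _first_client = rng.gen_range(0..2);
    // ... deliver packets in a randomized order, assert invariants ...
    Ok(())
}

#[test]
fn random_interleavings_are_replayable() {
    // Override SEED in the environment to replay a specific failing iteration.
    let base_seed: u64 = std::env::var("SEED")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(0);
    for iteration in 0..50u64 {
        let seed = base_seed + iteration;
        let mut rng = StdRng::seed_from_u64(seed);
        if let Err(err) = run_one_interleaving(&mut rng) {
            panic!("iteration failed at seed {seed}: {err}");
        }
    }
}
```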
Even with sort of non-deterministic things, we've been able to really lock things down and never have flaky tests on CI. But as soon as an LLM enters the picture, that's all out the window, right? We're assuming that we're using frontier models, but even if we were to use our own model and control the sampling of the logits, you change one token in the input and you're going to get a completely different output.
So it's just a fundamentally different problem where we have to embrace stochastic behavior. And so as we built this system, our first eval that we really hammered on was something that, if you've seen something like SWE-bench, probably looks pretty familiar. It's very data-driven. And it seems like in the machine learning world, this is what an eval is: input, output.
But in the programmatic software world, an eval is more like a test that passes or fails. Like, we come from this very different perspective of automated testing. And so right away, you know, we had this traditional data-driven eval, and then backing that, the same program that runs all these evals basically compiles a headless copy of Zed, checks out a repo, runs the agent, and tries to make it do things.
And we right away got into, like, making the eval more programmatic. So you can see here the conversation is literally a function. And then our ability to sort of come in here and write code that performs assertions about what the agent did, even, you know, getting quite a bit more granular.
So the problem is when that big eval fails, it's like, what do we do? There's a million ways that that thing could go wrong. And so when we wrote this eval, we were able to drive out one really simple failure mode, which is when we run the grep tool, this is what our original, like, dumb implementation of the grep tool looked like.
So you can see in this case we're saying we want to add a window argument to this tool's run trait method in this particular file, right? But if this is what the model sees, then we're in trouble, right? And so what we ended up driving from this stochastic test here is a more deterministic test, right?
Where I'm going to go ahead and set up the project, perform this search for fn run, and then we use Tree-sitter, which is Max's parser generator framework, to actually expand the match out to the syntactic boundaries. And so that was able to drive a substantial improvement in the behavior of the agent.
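Here's a rough sketch of that expansion, assuming the tree-sitter and tree-sitter-rust crates (exact APIs differ a bit between versions). It's not Zed's implementation, just the shape of the idea: take the byte range of a plain-text match like "fn run" and grow it to the enclosing function item, so the model sees the whole definition instead of a single line.

```rust
// Given the byte range of a raw text match, expand it to the surrounding
// function definition using Tree-sitter.
fn expand_match_to_function(source: &str, start: usize, end: usize) -> Option<String> {
    let mut parser = tree_sitter::Parser::new();
    parser
        .set_language(&tree_sitter_rust::LANGUAGE.into())
        .ok()?;
    let tree = parser.parse(source, None)?;
    let mut node = tree.root_node().descendant_for_byte_range(start, end)?;
    // Walk up the syntax tree until we reach a function definition.
    while node.kind() != "function_item" {
        node = node.parent()?;
    }
    Some(source[node.byte_range()].to_string())
}
```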
So right there, like, that was an interesting discovery of just being empirical and driving out what was ultimately an algorithmic problem with a stochastic test. But then we moved on to the editing. And when we first implemented editing, it was just done with tool calls. But the problem with tool calls, if anybody's worked with them, is they don't really stream very well.
Like, you get key, value, key, value, and it all comes back at once. So what we ended up doing instead is performing a small tool call that just describes the edits. Then we loop back around, for the moment to the same model, because it has everything already loaded in its cache, and ask it to emit these old text, new text blocks.
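Concretely, an edit block streamed back by the model looks something like this. The tag names here are illustrative assumptions for the example, not necessarily Zed's exact format.

```rust
// Purely illustrative: the rough shape of an old-text/new-text edit block
// streamed back by the model.
const EXAMPLE_EDIT: &str = r#"<old_text>
    todo!("implement run")
</old_text>
<new_text>
    return 42;
</new_text>"#;
```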
But there's some challenges there. Like, now we have to parse that data as it's coming back, and handle all kinds of weird stuff that the model's going to be doing. So I'll just dive in and show you some of the examples. So here's like-- there were a bunch of different tests.
Like, these are part of our main regular old test suite. And we just added this eval function. And we say, I want to run this eval 200 times. In this particular case, it didn't start this way, but eventually we got to this: we want 100% of these 200 examples to pass, or this test should literally fail the build.
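The mechanics of that are roughly the following. This is a minimal sketch of the watermark idea, not Zed's eval function; run_agent_once is a hypothetical stand-in for loading the conversation, running the agent, and judging the result.

```rust
// Placeholder for "load the conversation, run the agent, check the outcome".
fn run_agent_once() -> bool {
    true // stand-in; the real check is programmatic or LLM-judged
}

// Run the stochastic check N times and fail the build if the pass ratio
// drops below the required threshold.
fn assert_pass_ratio(iterations: usize, min_ratio: f64, mut run: impl FnMut() -> bool) {
    let passed = (0..iterations).filter(|_| run()).count();
    let ratio = passed as f64 / iterations as f64;
    assert!(
        ratio >= min_ratio,
        "only {passed}/{iterations} runs passed ({ratio:.2} < {min_ratio:.2})"
    );
}

#[test]
fn agent_edits_reliably() {
    // 200 runs, 100% required: any single failure fails the build.
    assert_pass_ratio(200, 1.0, run_agent_once);
}
```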
And so we kind of set a watermark that way. And then here, you can see I'm just loading in a conversation with the LLM. And then ultimately, in this case, I'm using an LLM as judge. But there are some cases you saw earlier where we do things more programmatically.
In this case, we're just, like, verifying if this thing worked correctly. And so we go from a test like this, a stochastic test, down into the particular problems that it drives out. But it's funny, a lot of the things that went wrong here were really just, like, basic things, right?
Like, and so again, these are things that we can-- they're non-deterministic, but the non-determinism is sort of bounded. So in this case, we're going to run 100 iterations of this very simple parsing test, where we're randomly chunking up the input into arbitrary boundaries and just making sure that the parser for this data can handle arbitrary chunking of the text.
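A sketch of that chunking test, assuming the rand crate; the real test would feed the chunks to the edit-block parser rather than to a trivial accumulator, but the property is the same: arbitrary chunk boundaries must not change the result.

```rust
use rand::{rngs::StdRng, Rng, SeedableRng};

// Split the input into chunks at random character boundaries.
fn random_chunks(input: &str, rng: &mut StdRng) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        // Pick a split point, then nudge it forward to a valid char boundary.
        let mut split = rng.gen_range(1..=rest.len());
        while !rest.is_char_boundary(split) {
            split += 1;
        }
        let (chunk, remainder) = rest.split_at(split);
        chunks.push(chunk.to_string());
        rest = remainder;
    }
    chunks
}

#[test]
fn parser_handles_arbitrary_chunking() {
    let input = "<old_text>foo</old_text><new_text>bar</new_text>";
    for seed in 0..100 {
        let mut rng = StdRng::seed_from_u64(seed);
        let streamed: String = random_chunks(input, &mut rng).concat();
        // Stand-in assertion; the real test compares the parsed edit events.
        assert_eq!(streamed, input);
    }
}
```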
Similarly here, the fuzzy matching algorithm. That's a critical piece: if the model is fuzzy or generates something slightly wrong, being able to do this dynamic programming that gets us an approximate match saved us from a lot of the tool-calling failures. But again, it's something that could be deterministically tested.
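For flavor, here's what that kind of dynamic programming looks like. This is a classic approximate-substring match, not Zed's actual algorithm: it returns where the best fuzzy match ends and how many edits it took, and a real implementation would also backtrack to recover the start of the match.

```rust
// Find the substring of `haystack` that matches `needle` with the fewest
// edits (insertions, deletions, substitutions). Returns (char offset just
// past the best match, edit cost).
fn fuzzy_find(haystack: &str, needle: &str) -> (usize, usize) {
    let h: Vec<char> = haystack.chars().collect();
    let n: Vec<char> = needle.chars().collect();
    // Row 0 is all zeros: the match may start anywhere in the haystack free.
    let mut prev = vec![0usize; h.len() + 1];
    let mut curr = vec![0usize; h.len() + 1];
    for i in 1..=n.len() {
        curr[0] = i; // a needle prefix against empty text costs its length
        for j in 1..=h.len() {
            let sub = prev[j - 1] + if n[i - 1] == h[j - 1] { 0 } else { 1 };
            curr[j] = sub.min(prev[j] + 1).min(curr[j - 1] + 1);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    // After the final swap, `prev` holds the last row; the cheapest cell
    // marks where the best approximate match ends.
    prev.iter()
        .enumerate()
        .min_by_key(|&(_, &c)| c)
        .map(|(j, &c)| (j, c))
        .unwrap()
}

#[test]
fn finds_approximate_match() {
    let haystack = "impl Tool { fn run(&self) {} }";
    // The needle has a stray double space; it still lands in the right place.
    let (end, cost) = fuzzy_find(haystack, "fn  run(");
    assert_eq!(cost, 1);
    assert_eq!(end, "impl Tool { fn run(".chars().count());
}
```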
And then finally, there's this idea of a streaming diff, that as the model is emitting new text, we need to be comparing the new text it's emitting with the old text, and then making a dynamic decision of, if I don't see some text that was present in the old text, is that because it was deleted or it just hasn't been streamed out yet?
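A simplified sketch of that decision is below. The hunk names and the greedy prefix matching are illustrative, not Zed's real diff; the point is that unmatched old text stays "pending" until the stream ends.

```rust
#[derive(Debug, PartialEq)]
enum Hunk {
    Kept(String),
    Deleted(String),
    Pending(String),
}

// While the model is still emitting new_text, any trailing part of old_text
// we haven't seen yet may simply not have streamed in, so it is Pending
// rather than Deleted.
fn streaming_diff(old_text: &str, new_so_far: &str, stream_done: bool) -> Vec<Hunk> {
    // Greedy common-prefix matching keeps the sketch short; the real thing
    // would run a proper diff over the streamed portion.
    let common: usize = old_text
        .chars()
        .zip(new_so_far.chars())
        .take_while(|(a, b)| a == b)
        .count();
    let mut hunks = Vec::new();
    if common > 0 {
        hunks.push(Hunk::Kept(old_text.chars().take(common).collect()));
    }
    let rest: String = old_text.chars().skip(common).collect();
    if !rest.is_empty() {
        if stream_done {
            hunks.push(Hunk::Deleted(rest));
        } else {
            hunks.push(Hunk::Pending(rest));
        }
    }
    hunks
}

#[test]
fn unseen_old_text_is_pending_until_the_stream_ends() {
    let old = "let x = 1;\nlet y = 2;\n";
    let partial = "let x = 1;\n";
    assert_eq!(
        streaming_diff(old, partial, false),
        vec![
            Hunk::Kept("let x = 1;\n".into()),
            Hunk::Pending("let y = 2;\n".into())
        ]
    );
    assert_eq!(
        streaming_diff(old, partial, true),
        vec![
            Hunk::Kept("let x = 1;\n".into()),
            Hunk::Deleted("let y = 2;\n".into())
        ]
    );
}
```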
So that's another completely deterministic thing that still was critical to making the system work correctly. So yet another deterministic test. But then we get into some of the fun stuff with the behavior of the model itself. So one thing that we noticed right away was that the model, when it was trying to insert at the top or the end of the document, in some cases would just emit an empty old text tag, which, if you're doing old-text/new-text matching, is not very useful.
And so we found that we were able to just add this, you know, simple thing to the prompt and improve the eval. But again, it would still do it, you know, 1 or 2% of the time. And so just being able to test that, you know, we can handle that case robustly, you know, ended up being really helpful.
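One way to be robust to it is sketched below. This fallback is illustrative of the point, not necessarily exactly what Zed does: an empty old text is interpreted as an insertion at a document edge instead of an error.

```rust
#[derive(Debug, PartialEq)]
enum ResolvedEdit {
    Replace { old: String, new: String },
    // Insert at the start or end of the file when there is nothing to match.
    InsertAtEdge { new: String },
}

// If the model sends an empty old_text, fall back to treating the edit as an
// insertion instead of erroring out.
fn resolve_edit(old_text: &str, new_text: &str) -> ResolvedEdit {
    if old_text.trim().is_empty() {
        ResolvedEdit::InsertAtEdge { new: new_text.to_string() }
    } else {
        ResolvedEdit::Replace {
            old: old_text.to_string(),
            new: new_text.to_string(),
        }
    }
}

#[test]
fn empty_old_text_becomes_an_insertion() {
    // Simulated model output: nothing to match, only new text.
    assert_eq!(
        resolve_edit("", "use std::fmt;\n"),
        ResolvedEdit::InsertAtEdge { new: "use std::fmt;\n".to_string() }
    );
}
```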
And this is just, again, a case where I'm simulating the output of the LLM. Another case that we encountered a lot was that it would mismatch the XML tags. So we would have old text, and then it would end like this with new text. And so, again, we were able to get a certain distance with the prompt.
Initially, 40% of the time we were getting this mismatch and blowing up. We were able to move that to something like 95% by saying, always close all tags properly, but we still had that last 5%. What do we do with that? And so, again, it's just driving it into a deterministic test where we have old text and new text, and we need to assert that even with a mismatched closing tag, we're robust and accept crazy stuff from the LLM in its output.
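For example, a deliberately lenient extraction like this. It's illustrative only: whatever closing tag appears is treated as terminating the currently open section, so a wrong tag name doesn't reject the whole edit.

```rust
// Extract the body of <open_tag>...</whatever>, tolerating a mismatched
// closing tag name from the model.
fn extract_section<'a>(input: &'a str, open_tag: &str) -> Option<&'a str> {
    let start = input.find(&format!("<{open_tag}>"))? + open_tag.len() + 2;
    // Be lenient about which closing tag the model actually used.
    let end = input[start..].find("</").map(|i| start + i)?;
    Some(&input[start..end])
}

#[test]
fn tolerates_mismatched_closing_tags() {
    let response = "<old_text>foo()</new_text>"; // wrong closing tag
    assert_eq!(extract_section(response, "old_text"), Some("foo()"));
}
```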
And another interesting case was indentation. So here's an eval that we wrote, which was this idea of having an outer function, and then inside here there's an inner function. And we have this todo! macro here. And we want to say, replace this todo! with return 42. And we would see the model doing things like this, right, where the indentation level was completely flattened out, right?
It says, "replace fn_inner," but fn_inner has, like, all this leading indentation on the front of it. But otherwise, like, it's perfectly fine. And so, in this case, what we did is the strategy of, like, detecting this indent delta. So we figure out, basically, sort of renormalize the indent, if that makes sense.
So if the text in the buffer is indented by a certain amount, but we otherwise match, then we also look at the text that the LLM emitted and compute this delta. And then, again, driving it back to a deterministic test: here, we build a buffer with lorem ipsum dolor sit with this very interesting indentation pattern.
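A sketch of that normalization, in the same spirit as the lorem-ipsum test; the helper names are made up for the example, not Zed's.

```rust
fn leading_spaces(line: &str) -> usize {
    line.chars().take_while(|c| *c == ' ').count()
}

// Indentation the matched buffer text has that the model's text is missing
// (taking the per-line minimum so mixed indentation is handled).
fn indent_delta(buffer_match: &str, model_text: &str) -> usize {
    buffer_match
        .lines()
        .zip(model_text.lines())
        .filter(|(b, m)| !b.trim().is_empty() && !m.trim().is_empty())
        .map(|(b, m)| leading_spaces(b).saturating_sub(leading_spaces(m)))
        .min()
        .unwrap_or(0)
}

// Re-apply the lost indentation to every non-blank line of the new text.
fn reindent(new_text: &str, delta: usize) -> String {
    new_text
        .lines()
        .map(|line| {
            if line.trim().is_empty() {
                line.to_string()
            } else {
                format!("{}{}", " ".repeat(delta), line)
            }
        })
        .collect::<Vec<_>>()
        .join("\n")
}

#[test]
fn reapplies_lost_indentation() {
    // The matched region in the buffer is indented by eight spaces...
    let buffer_match = "        ipsum dolor sit";
    // ...but the model emitted its old and new text fully out-dented.
    let model_old = "ipsum dolor sit";
    let model_new = "ipsum dolor sit amet";
    let delta = indent_delta(buffer_match, model_old);
    assert_eq!(delta, 8);
    assert_eq!(reindent(model_new, delta), "        ipsum dolor sit amet");
}
```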
And then you can see here, we want to replace ipsum dolor sit with ipsum dolor sit amet, where it's out-dented, right? And you can see that when we actually do our indentation normalization, we're able to handle that correctly. Another tricky thing that we ran into was this weird escaping behavior, in certain cases where we have weird constructs. Like, this is a Rust raw string, basically, that can include things like quotes inside of it, due to this interesting escaping syntax.
And doing something as simple as saying, just tweak this, this is the behavior we want, we would notice, especially with Gemini, all kinds of crazy stuff: HTML escape codes, this sort of backslash escaping. Yeah, with new lines and stuff, we would see crazy escaping, like double-escaping the newlines.
And so in this case, that was just a pure prompt fix; we didn't try yet, at least, to detect any of this escaping any further. Although, obviously, that's an opportunity to keep going even further. So, yeah, I mean, from my perspective, the lesson we learned from this entire experience is that rigorous testing is fundamental to building reliable software, period.
We have this new fancy parlance of, like, an eval, which comes, I think, out of the machine learning field of kind of input, output, and having lots of examples of that. But I think a lot of the techniques of just traditional, good old-fashioned software engineering are still really applicable.
But we have to embrace this more statistical approach where we're running it 100 times, 200 times, and asserting a threshold of pass versus fail. But I think-- I mean, this is a real-world example of just the stuff that we saw trying to implement streaming edits or an agent that can go search and edit.
A lot of those problems are not-- I don't know. They're not advanced machine learning problems or anything. It's just stupid things that the model will try to do that we need to account for. And so I think this motion of sort of starting with the zoomed-out eval, then zooming into a sort of stochastic unit test that's still random and interacting with the model, but is more focused in on a particular aspect of the experience.
And then, finally, driving that even further to an actual good old-fashioned test, that's been the process that we've discovered. But we, so far, haven't needed to use, like, special external tools or eval frameworks or anything like that. It's just in our test suite so far and kind of using the same underlying infrastructure that's used to do any other kind of software testing and, honestly, using just a lot of those same skills.
So for us, it was just: be empirical, just like we've always been. But there's this new X factor of the LLM doing all these crazy things. Yeah, so that's a little story of how we got here. And this is all open source, so you can check it out. Zed is under the GPL license.
I'd love help improving it all, actually. Maybe some of you are like, what are you doing? You could do better in the following five ways. I'd love contributions if that's interesting to folks. And, yeah, it's working. With the Claude 4 models, I'm finally able to write Rust agentically, really efficiently.
And I'm just loving it. So that's how we got there. I appreciate your attention. Thank you.