CI in the Era of AI: From Unit Tests to Stochastic Evals — Nathan Sobo, Zed

00:00:24.000 |
And what sets us apart is we are not a fork of VS Code. 00:00:34.840 |
We literally engineered the entire system like a video game 00:00:38.060 |
around roughly 1,200 lines of shader code that runs on the GPU. 00:00:41.540 |
And the rest of the system is organized to deliver frames 00:00:47.160 |
And recently, I recorded a video because time is short. 00:01:01.040 |
and deliver a reliable product that does this effectively. 00:01:04.240 |
So here's what we're going to be talking about. 00:01:06.340 |
The rest of the talk is going to be looking at code 00:01:12.080 |
So testing, evaluation, what's the difference? 00:01:19.460 |
without an empirical approach to the software that we've been building since 2018 or 2021, 00:01:25.140 |
depending on how you measure it, Zed would crash every eight seconds. 00:01:29.340 |
Like, we have probably tens of thousands of tests at this point. 00:01:34.040 |
And here's an example of the extreme to which we take empirical software development. 00:01:37.940 |
We literally are starting a server, creating two clients, and we have a simulated scheduler 00:01:44.740 |
And we run 50 different iterations of this with every random interleaving of anything that 00:01:51.760 |
So that if, say, iteration 375,000 (this run is only 50) of this particular interleaving of network packets 00:02:01.000 |
goes wrong, we can replay that over and over again, freeze it, and have full control. 00:02:06.040 |
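As a rough illustration (not Zed's actual gpui test harness), here is a minimal sketch of that seeded-replay idea, assuming the rand crate and a hypothetical SEED environment variable: every randomized run is driven by a single u64 seed, so any failing interleaving can be frozen and replayed exactly.

```rust
// Minimal sketch (not Zed's real harness): all randomness flows through one
// seeded RNG, so any failing iteration can be replayed deterministically.
use rand::{rngs::StdRng, SeedableRng};

fn run_randomized_test(test_body: impl Fn(&mut StdRng)) {
    // Hypothetical replay hook: e.g. `SEED=375000 cargo test ...` reruns
    // exactly one interleaving under full control.
    if let Ok(seed) = std::env::var("SEED") {
        let seed: u64 = seed.parse().expect("SEED must be a u64");
        test_body(&mut StdRng::seed_from_u64(seed));
        return;
    }
    for seed in 0..50u64 {
        println!("iteration with seed {seed}");
        test_body(&mut StdRng::seed_from_u64(seed));
    }
}
```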
So we're very hardcore about testing and being empirical at Zed. 00:02:10.000 |
But up until very recently, until shipping some of these AI related features, we've been able to be fully deterministic. 00:02:16.960 |
Even with sort of non-deterministic things, we've been able to really lock things down and never have flaky tests on CI. 00:02:24.140 |
But as soon as an LLM enters the picture, that's all out the window, right? 00:02:28.080 |
Because, sure, we're assuming that we're using frontier models, but even if we were to use our own model 00:02:37.140 |
and control the sampling of the logits, you change one token in the input and you're going to get a completely different output. 00:02:43.960 |
So it's just a fundamentally different problem where we have to embrace stochastic behavior. 00:02:48.820 |
And so as we built this system, our first eval that we really hammered on was something that if you've seen something 00:02:55.640 |
like SWE-bench, probably looks pretty familiar. 00:02:58.740 |
And it seems like in the machine learning world, this is what an eval is, input, output. 00:03:04.820 |
But in the programmatic software world, an eval is more like a test that passes or fails. 00:03:11.820 |
Like, we come from this very different perspective of automated testing. 00:03:15.620 |
And so right away, you know, we had this traditional data-driven eval, and then backing that, 00:03:20.820 |
the same program that runs all these evals basically compiles a headless copy of Zed, checks out a repo, runs the agent, and tries to make it do things. 00:03:30.820 |
And we right away got into, like, making the eval more programmatic. 00:03:35.680 |
So you can see here the conversation is literally a function. 00:03:38.680 |
And then our ability to sort of come in here and write code that performs assertions about what the agent did, even, you know, getting quite a bit more granular. 00:03:49.680 |
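As a hedged sketch of what "the conversation is literally a function" looks like in practice, here is the rough shape; the Transcript and ToolCall types and the prompt text are invented stand-ins, not Zed's actual API.

```rust
// Hypothetical sketch: the eval is ordinary test code that drives the agent
// and then asserts on what it did, rather than a static input/output pair.
struct ToolCall {
    name: String,
}

struct Transcript {
    tool_calls: Vec<ToolCall>,
    final_diff: String,
}

fn eval_add_window_argument(run_agent: &mut impl FnMut(&str) -> Transcript) {
    let transcript = run_agent(
        "Add a window argument to the run method of the tool trait and update all callers.",
    );

    // Granular, programmatic assertions about the agent's behavior.
    assert!(
        transcript.tool_calls.iter().any(|call| call.name == "grep"),
        "agent should search for the trait before editing"
    );
    assert!(
        transcript.final_diff.contains("fn run("),
        "agent should have edited the run method"
    );
}
```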
So the problem is when that big eval fails, it's like, what do we do? 00:03:53.680 |
There's a million ways that that thing could go wrong. 00:03:55.820 |
And so when we wrote this eval, we were able to drive out one really simple failure mode, which is when we run the grep tool, 00:04:04.140 |
this is what our original, like, dumb implementation of the grep tool looked like. 00:04:07.940 |
So you can see in this case we're saying we want to add a window argument to this tool run trait method in this particular file name, right? 00:04:17.800 |
But if this is what the model sees, then we're in trouble, right? 00:04:21.740 |
And so what we ended up driving from this stochastic test here is a more deterministic test, right? 00:04:31.080 |
Where I'm going to go ahead and set up the project, perform this search for fn run, and then we use Tree-sitter, 00:04:39.100 |
which is Max's parser generator framework, to actually expand the match out to its syntactic boundaries. 00:04:48.440 |
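Here's a rough sketch of that expansion step, assuming the tree_sitter and tree_sitter_rust crates (exact APIs vary by version); it illustrates the idea rather than reproducing Zed's implementation.

```rust
// Sketch: expand a raw grep match to the enclosing syntactic unit so the
// model sees a whole definition instead of a truncated line range.
use std::ops::Range;

fn expand_to_syntax_boundaries(source: &str, match_range: Range<usize>) -> Range<usize> {
    let mut parser = tree_sitter::Parser::new();
    parser
        .set_language(&tree_sitter_rust::LANGUAGE.into())
        .expect("failed to load Rust grammar");
    let tree = parser.parse(source, None).expect("failed to parse source");

    // Start at the smallest node that covers the match...
    let mut node = tree
        .root_node()
        .descendant_for_byte_range(match_range.start, match_range.end)
        .unwrap_or_else(|| tree.root_node());

    // ...then walk outward until we hit something definition-shaped.
    while !matches!(node.kind(), "function_item" | "impl_item" | "struct_item") {
        match node.parent() {
            Some(parent) => node = parent,
            None => break,
        }
    }
    node.start_byte()..node.end_byte()
}
```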
And so that was able to, you know, drive a substantial improvement in the behavior of the agent. 00:04:54.860 |
So right there, like, that was an interesting discovery of just being empirical and driving out what was ultimately an algorithmic problem with a stochastic test. 00:05:06.300 |
And when we first implemented editing, it was just done with tool calls. 00:05:09.860 |
But the problem with tool calls, if anybody's worked with them, is they don't really stream very well. 00:05:14.400 |
Like, you get key, value, key, value, and it all comes back at once. 00:05:19.840 |
So what we ended up doing is deciding-- instead, we perform a small tool call that just describes the edits. 00:05:26.540 |
Then we loop back around for the moment to just the same model, because it has everything already loaded in its cache, 00:05:32.960 |
and ask it to emit these old text, new text blocks. 00:05:38.920 |
Like, now we have to parse that data as it's coming back, and handle all kinds of weird stuff that the model's going to be doing. 00:05:45.520 |
So I'll just dive in and show you some of the examples. 00:05:49.520 |
So here's like-- there were a bunch of different tests. 00:05:54.080 |
Like, these are part of our main regular old test suite. 00:05:59.660 |
And we say, I want to run this eval 200 times. 00:06:02.760 |
And in this particular case (it didn't start this way), 00:06:08.360 |
we want 100% of these 200 examples to pass, or this test should, like, literally fail the build. 00:06:16.760 |
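The shape of that assertion is roughly the following sketch; assert_pass_rate and the eval closure it would wrap are illustrative stand-ins, not Zed's actual test macros.

```rust
// Sketch: run a stochastic eval N times and fail the build if the pass rate
// drops below a threshold (100% in the example described above).
fn assert_pass_rate(iterations: usize, required: f64, mut run_once: impl FnMut(usize) -> bool) {
    let passed = (0..iterations).filter(|&i| run_once(i)).count();
    let rate = passed as f64 / iterations as f64;
    assert!(
        rate >= required,
        "eval passed {passed}/{iterations} ({:.1}% < {:.1}%)",
        rate * 100.0,
        required * 100.0
    );
}

// Usage (hypothetical): assert_pass_rate(200, 1.0, |i| run_streaming_edit_eval(i));
```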
And then here, you can see I'm just, like, loading in a conversation with the LLM. 00:06:22.640 |
And then ultimately, in this case, I'm using an LLM as judge. 00:06:27.140 |
But there are some cases you saw earlier where we do things more programmatically. 00:06:30.140 |
In this case, we're just, like, verifying if this thing worked correctly. 00:06:34.940 |
And so we go from a test like this, a stochastic test, into the particular problems that drive this. 00:06:42.600 |
So, I mean, but it's funny, a lot of the things that went wrong here were really just, like, basic things, right? 00:06:48.480 |
Like, and so again, these are things that are non-deterministic, but the non-determinism is sort of bounded. 00:06:57.340 |
So in this case, we're going to run 100 iterations of this very simple parsing test, where we're randomly chunking up the input 00:07:05.760 |
into arbitrary boundaries and just making sure that the parser for this data can handle arbitrary chunking of the text. 00:07:18.600 |
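Here's a minimal sketch of that test shape, with a toy line-splitting LineStream standing in for the real streaming edit parser, and rand 0.8-style seeding assumed.

```rust
// Sketch: push randomly sized chunks into an incremental parser and require
// the same output as a one-shot pass over the whole input.
use rand::{rngs::StdRng, Rng, SeedableRng};

#[derive(Default)]
struct LineStream {
    buffer: String,
    lines: Vec<String>,
}

impl LineStream {
    fn push(&mut self, chunk: &str) {
        self.buffer.push_str(chunk);
        while let Some(newline) = self.buffer.find('\n') {
            let line: String = self.buffer.drain(..=newline).collect();
            self.lines.push(line.trim_end_matches('\n').to_string());
        }
    }

    fn finish(mut self) -> Vec<String> {
        if !self.buffer.is_empty() {
            self.lines.push(std::mem::take(&mut self.buffer));
        }
        self.lines
    }
}

#[test]
fn handles_arbitrary_chunking() {
    let input = "<old_text>\nfoo()\n</old_text>\n<new_text>\nbar()\n</new_text>\n";
    let expected: Vec<String> = input.lines().map(str::to_string).collect();
    for seed in 0..100 {
        let mut rng = StdRng::seed_from_u64(seed);
        let mut parser = LineStream::default();
        let mut rest = input;
        while !rest.is_empty() {
            // Input is ASCII here; real code must respect char boundaries.
            let split = rng.gen_range(1..=rest.len());
            let (chunk, tail) = rest.split_at(split);
            parser.push(chunk);
            rest = tail;
        }
        assert_eq!(parser.finish(), expected);
    }
}
```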
That's a critical piece: you know, if the model's fuzzy or generates something slightly wrong, being able to do this dynamic programming that gets us an approximate match 00:07:31.760 |
really saved a lot of the tool calling failures for us. 00:07:34.640 |
But again, something that could be deterministically tested. 00:07:38.180 |
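The classic tool for that kind of approximate match is a semi-global edit distance; the sketch below is a generic illustration of the dynamic programming idea (it finds where the needle best ends in the haystack and how many edits that costs), not Zed's exact algorithm.

```rust
// Sketch: find the best approximate occurrence of `needle` inside `haystack`.
// Semi-global alignment: the match may start and end anywhere in the haystack
// for free; edits inside the match each cost 1.
// Returns (end position in chars, edit cost).
fn best_match_end(haystack: &str, needle: &str) -> (usize, usize) {
    let h: Vec<char> = haystack.chars().collect();
    let n: Vec<char> = needle.chars().collect();
    // dp[i][j] = cheapest way to match needle[..i] ending at haystack[..j].
    let mut dp = vec![vec![0usize; h.len() + 1]; n.len() + 1];
    for i in 1..=n.len() {
        dp[i][0] = i;
        for j in 1..=h.len() {
            let substitution = if n[i - 1] == h[j - 1] { 0 } else { 1 };
            dp[i][j] = (dp[i - 1][j - 1] + substitution)
                .min(dp[i - 1][j] + 1) // skip a needle char
                .min(dp[i][j - 1] + 1); // skip a haystack char inside the match
        }
    }
    // Best end position = cheapest cell in the last row; tracing the start
    // position back through the table is omitted here for brevity.
    (0..=h.len())
        .map(|j| (j, dp[n.len()][j]))
        .min_by_key(|&(_, cost)| cost)
        .unwrap()
}
```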
And then finally, there's this idea of a streaming diff, that as the model is emitting new text, 00:07:43.480 |
we need to be comparing the new text it's emitting with the old text, 00:07:47.400 |
and then making a dynamic decision of, if I don't see some text that was present in the old text, 00:07:54.380 |
is that because it was deleted or it just hasn't been streamed out yet? 00:07:58.740 |
So that's another, you know, completely deterministic thing that still was critical to making the system work correctly. 00:08:09.580 |
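A hedged, deliberately naive sketch of that decision: while the stream is still open, unmatched old text is reported as pending rather than deleted.

```rust
// Sketch of the streaming-diff decision: while the new text is still
// streaming in, old text we haven't matched yet is "pending", not "deleted".
// This naive line-based version only illustrates the idea.
#[derive(Debug, PartialEq)]
enum Hunk {
    Unchanged(String),
    Inserted(String),
    Deleted(String),
    Pending(String), // old text not seen yet; it may still arrive
}

fn streaming_diff(old: &str, new_prefix: &str, stream_done: bool) -> Vec<Hunk> {
    let mut hunks = Vec::new();
    let mut old_lines = old.lines().peekable();
    for new_line in new_prefix.lines() {
        match old_lines.peek() {
            Some(&old_line) if old_line == new_line => {
                hunks.push(Hunk::Unchanged(old_line.to_string()));
                old_lines.next();
            }
            _ => hunks.push(Hunk::Inserted(new_line.to_string())),
        }
    }
    // Leftover old text only becomes a deletion once the stream has finished.
    for old_line in old_lines {
        hunks.push(if stream_done {
            Hunk::Deleted(old_line.to_string())
        } else {
            Hunk::Pending(old_line.to_string())
        });
    }
    hunks
}
```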
But then we get into some of the fun stuff with the behavior of the model itself. 00:08:17.380 |
So one thing that we noticed right away was the model, when it was trying to insert at the top or the end of the document, 00:08:24.920 |
in some cases, would just have an empty old text tag, which, like, if you're doing old text, new text matching is not very useful. 00:08:31.920 |
And so we found that we were able to just add this, you know, simple thing to the prompt and improve the eval. 00:08:39.660 |
But again, it would still do it, you know, 1 or 2% of the time. 00:08:44.720 |
And so just being able to test that, you know, we can handle that case robustly, you know, ended up being really helpful. 00:08:52.660 |
And this is just like, again, I'm in this case simulating the output of the LLM. 00:08:56.760 |
Another case that we encountered a lot was that it would mismatch the XML tags. 00:09:04.500 |
And so we would have old text, and then it would end like this with new text. 00:09:11.060 |
And so, again, we were able to get a certain distance with the prompt. 00:09:14.800 |
Initially, sort of 40% of the time we were getting this mismatch and blowing up. 00:09:20.020 |
We were able to move that to, like, 95% by saying, always close all tags properly, but we still have that last, like, 5%. 00:09:30.480 |
And so, again, it's just like driving it into a deterministic test where we have old text, new text, and we just need to be robust and accept crazy stuff from the LLM in its output. 00:09:48.020 |
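To make that concrete, here is a hedged sketch of a lenient parser for old_text/new_text blocks; the tag names and tolerance rules are illustrative, not Zed's exact format.

```rust
// Sketch: a lenient parser for the model's edit blocks. It accepts the two
// quirks described above: an empty <old_text> (meaning "insert" rather than
// "replace") and a block that closes with the wrong tag.
#[derive(Debug, PartialEq)]
struct Edit {
    old_text: String, // empty means insert instead of replace
    new_text: String,
}

fn parse_edits(output: &str) -> Vec<Edit> {
    let mut edits = Vec::new();
    let mut rest = output;
    while let Some(start) = rest.find("<old_text>") {
        rest = &rest[start + "<old_text>".len()..];
        // Tolerate a mismatched closing tag: take whichever comes first.
        let old_end = match (rest.find("</old_text>"), rest.find("</new_text>")) {
            (Some(a), Some(b)) => a.min(b),
            (a, b) => match a.or(b) {
                Some(end) => end,
                None => break,
            },
        };
        let old_text = rest[..old_end].trim_matches('\n').to_string();
        rest = &rest[old_end..];

        let Some(new_start) = rest.find("<new_text>") else { break };
        rest = &rest[new_start + "<new_text>".len()..];
        let new_end = rest.find("</new_text>").unwrap_or(rest.len());
        let new_text = rest[..new_end].trim_matches('\n').to_string();
        rest = &rest[new_end..];

        edits.push(Edit { old_text, new_text });
    }
    edits
}

#[test]
fn accepts_mismatched_closing_tag() {
    let output = "<old_text>foo()</new_text>\n<new_text>bar()</new_text>";
    assert_eq!(
        parse_edits(output),
        vec![Edit { old_text: "foo()".into(), new_text: "bar()".into() }]
    );
}
```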
And another interesting case was indentation. 00:09:53.160 |
So here's an eval that we wrote, which was this idea of having an outer function, and then inside here there's an inner function. 00:10:02.180 |
And we want to say, replace this to-do with return 42. 00:10:06.060 |
And we would see the model doing things like this, right, where the indentation level was completely flattened out, right? 00:10:13.200 |
It says, "replace fn_inner," but fn_inner has, like, all this leading indentation on the front of it. 00:10:22.200 |
And so, in this case, what we did is the strategy of, like, detecting this indent delta. 00:10:28.340 |
So we figure out, basically, sort of renormalize the indent, if that makes sense. 00:10:34.340 |
So if the text in the buffer is indented by a certain amount, but we otherwise match, then we also detect the text that the LLM emitted and compute this delta. 00:10:45.480 |
And then, again, driving it back to a deterministic test of, here, we build a buffer with lorem ipsum dolor sit with this very interesting indentation pattern. 00:10:57.620 |
And then you can see here, we want to replace ipsum dolor sit with ipsum dolor sit amet, where it's out-dented, right? 00:11:06.620 |
And you can see that when we actually do our indentation normalization, we're able to handle that correctly. 00:11:14.840 |
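A hedged sketch of that normalization: measure the indent delta on the first non-empty line and shift the new text by the same amount (spaces only, for simplicity; real logic would also handle tabs).

```rust
// Sketch: if the model's old_text matches the buffer except for a uniform
// indentation offset, measure that delta and shift new_text by the same amount.
fn leading_spaces(line: &str) -> usize {
    line.chars().take_while(|c| *c == ' ').count()
}

// Buffer indent minus model indent, measured on the first non-empty line.
fn indent_delta(buffer_text: &str, model_old_text: &str) -> isize {
    let buffer_line = buffer_text.lines().find(|l| !l.trim().is_empty()).unwrap_or("");
    let model_line = model_old_text.lines().find(|l| !l.trim().is_empty()).unwrap_or("");
    leading_spaces(buffer_line) as isize - leading_spaces(model_line) as isize
}

// Shift every non-empty line of the model's new_text by the delta.
fn reindent(new_text: &str, delta: isize) -> String {
    new_text
        .lines()
        .map(|line| {
            if line.trim().is_empty() {
                line.to_string()
            } else if delta >= 0 {
                format!("{}{}", " ".repeat(delta as usize), line)
            } else {
                let strip = (-delta as usize).min(leading_spaces(line));
                line[strip..].to_string()
            }
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```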
Another tricky thing that we ran into was this weird escaping behavior, in certain cases where we have unusual constructs. Like, this is a Rust raw string, basically, that can include things like quotes inside of it, due to this interesting escaping syntax. 00:11:33.040 |
And doing something as simple as saying, just, like, tweak this (this is the behavior we want), 00:11:39.700 |
we would notice, especially with Gemini, all kinds of crazy stuff, like HTML escape codes, or this sort of backslash escaping. 00:11:50.700 |
Yeah, with newlines and stuff, we would see, like, you know, crazy escaping that it would do, double-escaping the newlines. 00:11:58.700 |
And so in this case, like, that was just a pure prompt fix and didn't really-- we didn't try yet, at least, to kind of detect any of this escaping any further. 00:12:10.700 |
Although, obviously, like, that's an opportunity, like, that's an opportunity to keep going even further. 00:12:16.700 |
So, yeah, I mean, from my perspective, like, the lessons we learned from this entire experience is that, like, rigorous testing is fundamental to building reliable software, period. 00:12:29.700 |
We have this new fancy parlance of, like, an eval, which comes, I think, out of the machine learning field of kind of input, output, and having lots of examples of that. 00:12:37.700 |
But I think a lot of the techniques of just traditional, good old-fashioned software engineering are still really applicable. 00:12:44.700 |
But we have to embrace this more statistical approach where we're running it 100 times, 200 times, and asserting a threshold of pass versus fail. 00:12:53.700 |
But I think-- I mean, this is a real-world example of just the stuff that we saw trying to implement streaming edits or an agent that can go search and edit. 00:13:03.700 |
A lot of those problems are not-- I don't know. 00:13:06.700 |
They're not advanced machine learning problems or anything. 00:13:09.700 |
It's just stupid things that the model will try to do that we need to account for. 00:13:14.700 |
And so I think this motion of sort of starting with the zoomed-out eval, then zooming into a sort of stochastic unit test that's still random and interacting with the model, 00:13:25.700 |
but is more focused in on a particular aspect of the experience. 00:13:28.700 |
And then, finally, driving that even further to an actual good old-fashioned test, that's been the process that we've discovered. 00:13:36.700 |
But we, so far, haven't needed to use, like, special external tools or eval frameworks or anything like that. 00:13:45.700 |
It's just in our test suite so far and kind of using the same underlying infrastructure that's used to do any other kind of software testing 00:13:53.700 |
and, honestly, using just a lot of those same skills. 00:13:58.700 |
So for us, it was, like, just be empirical, just like we've always been. 00:14:02.700 |
But it's this new X factor of the LLM doing all these crazy things. 00:14:08.700 |
Yeah, so that's a little story of how we-- and this is all open source, so you can check it out. 00:14:20.700 |
Maybe some of you are like, what are you doing? 00:14:22.700 |
You could do better in the following five ways. 00:14:24.700 |
I'd love contributions if that's interesting to folks. 00:14:30.700 |
I'm-- with the Claude 4 models, I'm finally able to write Rust agentically, really efficiently.