AI-powered entomology: Lessons from millions of AI code reviews — Tomas Reimers, Graphite

Thank you all so much for coming to this talk. Thank you for being at this conference generally. I'm here to talk to you about AI-powered entomology. If you don't know, entomology is the study of bugs. It's something that is very near and dear to our hearts. So Graphite, for those of you that don't know, is a code review tool.
What we started to notice was that the amount of code being written by AI was going up and up and up, and we thought that bugs might actually be part and parcel of that, and what we needed to do was find a way to catch them. We started to do things like ask Claude to review our pull requests, and we were pretty impressed with the early results.
Here's an example actually pulled from this week from our codebase, where it turns out that in certain instances we'd be returning one of our database ORM classes uninstantiated, which would crash our server. Here's an example that came up on Twitter this week from our bot, which found that in certain instances there was math being done on border radii that would lead to a division by a negative number and crash the front end. So to answer the question: it turns out AI can find bugs.
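
In spirit, the very first version of this is just a prompt over a diff. Here's a minimal sketch of that idea, assuming the official Anthropic TypeScript SDK and an ANTHROPIC_API_KEY in the environment; the prompt wording and model name are illustrative, not Graphite's actual setup.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // picks up ANTHROPIC_API_KEY from the environment

// Ask the model to review a unified diff and flag likely bugs.
async function reviewDiff(diff: string): Promise<string> {
  const message = await client.messages.create({
    model: "claude-sonnet-4-20250514", // illustrative; swap in whatever model you use
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content:
          "Review this pull request diff. List likely bugs only, " +
          "citing the file and line for each:\n\n" + diff,
      },
    ],
  });
  // The response is a list of content blocks; keep only the text ones.
  return message.content
    .map((block) => (block.type === "text" ? block.text : ""))
    .join("");
}
```
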
But it would also leave comments like: you should update this code to do what it already does. Or: you should revert this code to what it used to do, because it used to do it. And so we started to realize that maybe there's actually more than one type of bug.
Maybe there's more than one type of thing an LLM can find. And so we started with the most basic division: well, there's probably stuff that LLMs are good at catching and stuff that they're not good at catching. At the end of the day, LLMs ultimately try to mimic an answer to the question: what kind of code review comments would be left on this PR? So there are things within the model's capability and things that are not within its capability.
What we found, though, was that even when we categorized those, the LLM would start to leave comments like this: you should add a comment describing what this class does; you should extract this logic out into a function. And I think this was actually one of the most insightful moments. We started to actually go through past comments, both those left by our bot and by humans in our own codebase. The developers were all pretty much on the same page about which ones were welcome, and our designers were actually kind of baffled by it. And I think what's happening here in the mind of the developer is: if you read a type of comment like this from a bot, maybe you find it pedantic, frustrating, annoying, but you're much more welcoming to it when it comes from a human.
So we started to think about a second axis here: there's stuff LLMs can catch and stuff LLMs can't catch, but there's also stuff that humans want to receive from an LLM and stuff that humans don't want to receive from an LLM.
And so what we went ahead and did was actually take 10,000 comments from our own codebase and from open source codebases, and then we summarized those comments. What we ended up with was actually this chart, which says there are quite a few different types of comments that you see left on codebases in the wild.
Ignoring LLMs for a second, just talking about humans: there are bugs, logical inconsistencies that lead the code to behave in a way it isn't intended to behave; this actually shows up more than you would expect. There's documentation that says one thing while the code does another, and it's not clear which one's right. And there are convention comments: we don't do it this way, or in this codebase we follow this other pattern.
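
One way to picture the two axes is as a small lookup table. Here's a sketch; the category names are paraphrased from the talk, and the quadrant assignments are illustrative assumptions.

```typescript
// The two axes: can the LLM catch it, and do humans want to hear it from a bot?
interface Quadrant {
  llmCanCatch: boolean;       // axis 1: within the model's capability
  humansWantFromLlm: boolean; // axis 2: welcome when a bot says it
}

type CommentCategory =
  | "logical-bug"      // code behaves in a way it isn't intended to
  | "doc-mismatch"     // the docs say one thing, the code does another
  | "tribal-knowledge" // "we don't do it this way anymore because of blank"
  | "generic-nitpick"; // "comment this function", "add tests", "extract this"

const taxonomy: Record<CommentCategory, Quadrant> = {
  "logical-bug":      { llmCanCatch: true,  humansWantFromLlm: true  },
  "doc-mismatch":     { llmCanCatch: true,  humansWantFromLlm: true  },
  "tribal-knowledge": { llmCanCatch: false, humansWantFromLlm: true  }, // lives in senior devs' heads
  "generic-nitpick":  { llmCanCatch: true,  humansWantFromLlm: false }, // always applicable, rarely welcome
};

// Only prompt for the quadrant where both axes are true.
const allowedCategories = (Object.keys(taxonomy) as CommentCategory[]).filter(
  (c) => taxonomy[c].llmCanCatch && taxonomy[c].humansWantFromLlm,
);
```
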
So in the bottom right, where humans want to receive it but the LLMs don't seem to be able to get there yet: one class of comment that you'll see a lot in PRs is, we don't do it this way anymore because of blank. That knowledge exists in the heads of your senior developers. And that's wonderful, but it's really hard for an AI to pick up on.
On the left side, where LLMs definitely can catch it but humans don't necessarily want to receive it from a bot: examples of these that we've found are comment this function, add tests, extract this type out into a different type. I think it's really hard for an LLM to know when these apply. As a human, you're applying some kind of barometer: well, in this codebase this logic is particularly tricky, so we should extract it out; versus, well, in this codebase it's fine as it is. But a bot can pretty much always leave this comment. The question is whether it's welcome in the codebase.
And one thing I'm going to say, sort of outside of all of this: as you add more context, this area of what people are comfortable with seems to become larger. But for now, given the context that we have, the codebase, the past history, your style guide and rules, this is roughly where the line sits. And so we ended up with this idea of a map of the kinds of comments that we think humans and LLMs can both create, split by what LLMs can catch and what humans want to receive.
Now, if you've worked with LLMs, you know that these kinds of judgments work for initial categorizations, but the much harder question is: how do you know that you're right continuously? So, as the story goes, we went ahead. We basically started to characterize the comments that LLMs leave. We updated our prompts to only ask the LLM for things that were within its capability and that humans wanted to receive. And people anecdotally started to like it a lot more.
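
Concretely, "only prompt for what's in capability and wanted" can be as simple as an allowlist baked into the system prompt. A sketch; the wording is invented, not Graphite's actual prompt:

```typescript
// Hypothetical system prompt restricting the reviewer to the quadrant that
// is both within the model's capability and welcome from a bot.
const reviewerSystemPrompt = `
You are a code review bot. Only leave comments in these categories:
- Logical bugs: the code will not behave as intended.
- Doc mismatches: the documentation and the code disagree.

Never ask the author to add comments, add tests, or extract code into
functions, even if you believe it would improve the code.
If you find nothing in the allowed categories, reply "No issues found."
`;

// `client` is the Anthropic client from the earlier sketch.
async function reviewDiffRestricted(diff: string) {
  return client.messages.create({
    model: "claude-sonnet-4-20250514", // illustrative
    max_tokens: 1024,
    system: reviewerSystemPrompt,
    messages: [{ role: "user", content: diff }],
  });
}
```
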
But then we started to think: well, as the models change, as we got onto Claude 4, or Opus instead of Sonnet, how do we know that the set of things the model can catch isn't growing on us? And actually, maybe there are even more types of comments that we could be leaving that we're not leaving already. We started by just looking at what kinds of comments get left. For us, this is roughly the proportion we see. So, think about what they can catch and what they can't catch.
What we started to do was actually add feedback mechanisms. We let you emoji-react to these comments, and those reactions pretty much tell us when the LLM is hallucinating or reaching beyond its capabilities and we need to tone it down. The humans-want-to-receive versus humans-don't-want-to-receive axis was something we weren't really sure how to get at, but downvotes turn out to be a decent proxy. We see less than about a 4% downvote rate these days.
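
Here's a sketch of what that feedback loop might look like; the Reaction shape and the category names are hypothetical, not Graphite's schema.

```typescript
// Aggregate emoji reactions per comment category to watch the downvote
// rate over time.
interface Reaction {
  commentId: string;
  category: string; // e.g. "logical-bug", "doc-mismatch"
  emoji: "thumbs_up" | "thumbs_down";
}

function downvoteRateByCategory(reactions: Reaction[]): Map<string, number> {
  const counts = new Map<string, { up: number; down: number }>();
  for (const r of reactions) {
    const c = counts.get(r.category) ?? { up: 0, down: 0 };
    if (r.emoji === "thumbs_down") c.down++;
    else c.up++;
    counts.set(r.category, c);
  }
  const rates = new Map<string, number>();
  for (const [category, { up, down }] of counts) {
    rates.set(category, down / (up + down)); // 0.04 would match the ~4% above
  }
  return rates;
}
```

A rising rate in one category is the signal to tone that category down or pull it out of the prompt.
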
The second signal, as we started to think about it, was action. You leave a comment in code review ultimately so that someone actually updates the code to reflect it. And so our question was: well, can we measure that? Do comments actually lead to the change that they describe? On open source repos, and on the variety of repos that Graphite, which is a code review tool, has access to, can we actually start to measure that number?
And I think one of the most fascinating things we found was that only about 50% of human comments lead to changes.
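
Here's a sketch of one way to operationalize "led to a change within the PR": treat a comment as actioned if a commit pushed after it touches the commented file near the commented line. The shapes and the five-line window are assumptions for illustration, not how Graphite actually computes it.

```typescript
interface ReviewComment {
  file: string;
  line: number;
  createdAt: Date;
}

interface Hunk {
  file: string;
  startLine: number;
  endLine: number;
}

interface Commit {
  pushedAt: Date;
  hunks: Hunk[];
}

// True if any commit pushed after the comment modifies lines near it.
function commentWasActioned(
  comment: ReviewComment,
  commits: Commit[],
  window = 5, // lines of slack around the commented line
): boolean {
  return commits.some(
    (commit) =>
      commit.pushedAt > comment.createdAt &&
      commit.hunks.some(
        (hunk) =>
          hunk.file === comment.file &&
          comment.line >= hunk.startLine - window &&
          comment.line <= hunk.endLine + window,
      ),
  );
}

// The metric is then the fraction of comments for which this returns true.
```
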
And so we started to ask the question: well, could we get the LLM to at least match this? If it can, it's at least leaving comments on the level of fidelity of a human reviewer. Now, you in the audience might be saying: well, why don't 100% of comments lead to action? To be clear, I'm saying lead to action within that PR itself.
A lot of comments are fixed forward after the PR lands. A lot of comments are also like: hey, as a heads up, in the future if you do this, maybe you can do it this other way. And some of them are just purely preferential: I would do it this way, but you don't have to. In healthy code review cultures, that space for disagreement is normal. And we started to say: could we get the bot here? And the answer is that if you start to actually prompt it correctly, you can. If you want to try any of these findings in production, come find us at Graphite.