
AI-powered entomology: Lessons from millions of AI code reviews — Tomas Reimers, Graphite



00:00:14.740 | Thank you all so much for coming to this talk.
00:00:17.040 | Thank you for being at this conference generally.
00:00:18.880 | My name is Tomas.
00:00:19.820 | I'm one of the co-founders of Graphite,
00:00:21.800 | and I'm here to talk to you about AI-powered entomology.
00:00:24.100 | If you don't know, entomology is the study of bugs.
00:00:26.080 | It's something that is very near and dear to our heart,
00:00:28.960 | and part of what our product does.
00:00:30.860 | So Graphite, for those of you that don't know,
00:00:33.720 | builds a product called Diamond.
00:00:34.800 | Diamond's an AI-powered code reviewer.
00:00:36.400 | You go ahead, you connect it to your GitHub,
00:00:39.520 | and it goes ahead and finds bugs.
00:00:41.340 | The project started about a year ago.
00:00:42.940 | What we started to notice was that the amount of code
00:00:45.220 | being written by AI was going up and up and up,
00:00:47.580 | but so was the amount of bugs.
00:00:49.160 | And after really thinking about it,
00:00:50.680 | we thought that this might actually be part and parcel,
00:00:54.240 | and what we need to do is we need to find a way
00:00:55.880 | to better address these bugs in general.
00:00:59.880 | Given the technological advances,
00:01:01.520 | the first thing we turned to was AI itself,
00:01:03.680 | and we started to ask, well,
00:01:05.140 | maybe AI is creating the bugs,
00:01:06.940 | but can it also find the bugs?
00:01:08.200 | Can it help us?
00:01:09.180 | We started to go ahead and do things like ask Claude,
00:01:13.040 | hey, here's a PR.
00:01:14.640 | Can you find bugs on this PR?
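
A minimal sketch of that first experiment in TypeScript, assuming the Anthropic SDK's messages.create call; the prompt wording, model name, and the reviewDiff helper are illustrative, not Graphite's actual code.

    import Anthropic from "@anthropic-ai/sdk";

    // Hypothetical helper: send a PR diff to Claude and ask it to find bugs.
    async function reviewDiff(diff: string): Promise<string> {
      const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
      const response = await client.messages.create({
        model: "claude-3-5-sonnet-latest",
        max_tokens: 1024,
        messages: [
          { role: "user", content: `Here's a PR. Can you find bugs on this PR?\n\n${diff}` },
        ],
      });
      // Return the text of the first block of the reply as the review.
      const block = response.content[0];
      return block.type === "text" ? block.text : "";
    }
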
00:01:16.420 | And we were pretty impressed with the early results.
00:01:18.720 | Here's an example actually pulled from this week
00:01:20.760 | from our code base where it turns out that in certain instances,
00:01:23.600 | we'd be returning one of our database ORM classes
00:01:26.180 | uninstantiated, which would go ahead and crash our server.
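
Roughly the shape of that bug, with hypothetical names rather than Graphite's actual code: in one branch the function returns the ORM class itself instead of an instance, so callers that expect an instance crash at runtime.

    class PullRequestModel {
      constructor(public id: string) {}
      toJSON() {
        return { id: this.id };
      }
    }

    // The loose `any` return type is how a bug like this slips past the type checker.
    function loadPullRequest(id?: string): any {
      if (!id) {
        return PullRequestModel; // bug: returns the class itself, uninstantiated
      }
      return new PullRequestModel(id);
    }

    // At runtime this throws, because the class constructor has no toJSON instance method:
    // loadPullRequest(undefined).toJSON();
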
00:01:29.100 | Here's an example that came up on Twitter this week
00:01:30.960 | from our bot that found that in certain instances,
00:01:34.640 | there would be math being done around border radiuses
00:01:36.980 | that would lead to a division by a negative number,
00:01:38.920 | and would then go ahead and crash the front end.
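
As a rough sketch of that flavor of front-end bug, with hypothetical names rather than the actual code from the tweet: border-radius math where a derived value can go negative, and a division on it blows up instead of being clamped.

    // Derive a corner scale from an outer radius and a border width.
    function innerCornerScale(outerRadius: number, borderWidth: number): number {
      const inner = outerRadius - borderWidth; // negative when the border is thicker than the radius
      const scale = outerRadius / inner;       // divides by a negative (or zero) value
      if (!Number.isFinite(scale) || scale < 0) {
        // This is the kind of case that surfaces as a crash instead of being handled.
        throw new Error(`invalid corner scale: ${scale}`);
      }
      return scale;
    }

    // A safer variant would clamp `inner` at zero and guard the division before using it.
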
00:01:41.900 | So to answer the question, it turns out AI can find bugs.
00:01:44.920 | That's the end of the talk.
00:01:46.200 | I'm kidding. If you've tried this,
00:01:48.600 | you know you've probably had a really,
00:01:49.960 | really frustrating experience.
00:01:51.780 | We also went ahead and saw things like,
00:01:53.940 | "you should update this code to do what it already does,"
00:01:57.060 | "CSS doesn't work this way" (when it does),
00:01:59.820 | or my favorite, "you should revert this code
00:02:02.140 | to what it used to do because it used to do it."
00:02:04.320 | Getting those lost us a lot of confidence,
00:02:08.560 | but we started to think, well,
00:02:10.340 | we're seeing some really good things
00:02:11.660 | and we're seeing some really bad things,
00:02:13.080 | and maybe there's actually more than one type of bug.
00:02:15.200 | Maybe there's more than one type of thing an LLM can find.
00:02:18.000 | And so we started with the most basic division of,
00:02:20.920 | well, there's probably stuff that LLMs are good at catching
00:02:24.100 | and things that they're not good at catching.
00:02:25.700 | At the end of the day, LLMs ultimately try and mimic
00:02:28.180 | the thing that you're asking them to do.
00:02:29.560 | And if you ask them, hey,
00:02:30.600 | what kind of code review comments would be left on this PR?
00:02:32.960 | It goes ahead and leaves everything,
00:02:34.280 | both those that are within its capability
00:02:36.180 | and things that are not within its capability.
00:02:38.680 | And so we started to categorize those.
00:02:40.740 | What we found though was even when we categorize those,
00:02:43.480 | the LLM would start to leave comments like this.
00:02:46.160 | You should add a comment describing what this class does.
00:02:48.860 | You should extract this logic out into a function,
00:02:51.440 | or you should make sure this code has tests.
00:02:53.860 | While these are technically correct,
00:02:57.200 | to developers, they're really frustrating.
00:02:58.820 | And I think this was actually one of the most insightful moments
00:03:01.320 | for us in building this project,
00:03:02.960 | was when we sat down with our design team
00:03:04.880 | and we started to actually go through past comments,
00:03:08.180 | both those left by our bot and by humans in our own codebase.
00:03:13.220 | The developers were all pretty much on the same page of like,
00:03:15.640 | yep, I'd be okay if an LLM left that.
00:03:17.640 | No, I would not be okay if an LLM left that.
00:03:19.740 | Yes, I'd be okay.
00:03:21.100 | And our designers were actually kind of baffled by it.
00:03:23.180 | They're like, well, but like,
00:03:24.100 | that kind of looks like that other comment.
00:03:26.320 | And I think that what's happening here in the mind of the developer
00:03:28.920 | is if you go ahead and you read a type of comment like this,
00:03:32.520 | maybe you find it pedantic, frustrating, annoying
00:03:35.400 | when it comes from LLM,
00:03:36.700 | and you're much more welcoming to it when it comes from a human.
00:03:40.060 | And so, as we started to think more around
00:03:41.700 | sort of that classification of bugs,
00:03:43.700 | we started to think around actually a second axis here,
00:03:45.940 | which was there's stuff LLMs can catch and LLMs can't catch,
00:03:49.200 | but there's also stuff that humans want to receive from an LLM
00:03:52.080 | and humans don't want to receive from an LLM.
00:03:54.480 | And so, what we went ahead and did was
00:04:00.300 | we actually took 10,000 comments from our own codebase
00:04:03.780 | and from open source codebases,
00:04:07.280 | and we fed them to various LLMs
00:04:08.820 | and we asked them to categorize them.
00:04:10.720 | And we did that not just once,
00:04:13.140 | but we did that quite a few times.
00:04:14.360 | And then we went ahead and we summarized those comments.
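
A minimal sketch of what that offline categorization pass can look like, again assuming the Anthropic SDK; the category list and the majority-vote aggregation are illustrative assumptions, not Graphite's actual pipeline.

    import Anthropic from "@anthropic-ai/sdk";

    const CATEGORIES = [
      "bug", "accidentally committed code", "performance/security",
      "documentation mismatch", "stylistic", "tribal knowledge",
      "code cleanliness / best practice",
    ];

    // Ask the model to put one review comment into exactly one category.
    async function categorizeComment(client: Anthropic, comment: string): Promise<string> {
      const response = await client.messages.create({
        model: "claude-3-5-sonnet-latest",
        max_tokens: 16,
        messages: [{
          role: "user",
          content: `Categorize this code review comment as exactly one of: ${CATEGORIES.join(", ")}.\n` +
            `Reply with the category only.\n\nComment: ${comment}`,
        }],
      });
      const block = response.content[0];
      return block.type === "text" ? block.text.trim() : "unknown";
    }

    // Run the categorization several times per comment and keep the most common label.
    async function categorizeWithVoting(client: Anthropic, comment: string, runs = 3): Promise<string> {
      const counts = new Map<string, number>();
      for (let i = 0; i < runs; i++) {
        const label = await categorizeComment(client, comment);
        counts.set(label, (counts.get(label) ?? 0) + 1);
      }
      return [...counts.entries()].sort((a, b) => b[1] - a[1])[0][0];
    }
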
00:04:16.620 | And what we ended up with was actually this chart,
00:04:19.400 | where it says there's actually quite a few different types
00:04:21.720 | of comments that you see left on codebases in the wild.
00:04:24.500 | Ignoring LLMs for a second, just talking around humans,
00:04:27.240 | you see things which are bugs,
00:04:28.660 | those are logical inconsistencies that lead the code
00:04:30.880 | to behave in a way it isn't intended to behave.
00:04:32.740 | There's also accidentally committed code.
00:04:34.440 | This actually shows up more than you would expect.
00:04:36.740 | There are performance and security concerns.
00:04:38.620 | There's documentation where the code says one thing
00:04:40.800 | and does another and it's not clear which one's right.
00:04:43.440 | There's stylistic changes, things like,
00:04:45.440 | hey, you should update this comment
00:04:48.380 | or in this codebase we follow this other pattern.
00:04:50.800 | And then there's a lot of stuff outside
00:04:52.300 | of sort of that top right quadrant.
00:04:53.660 | So in the bottom right where humans want to receive it,
00:04:57.440 | but the LLMs don't seem to be able to get there yet,
00:04:59.900 | are things like tribal knowledge.
00:05:01.680 | One class of comment that you'll see a lot in PRs is,
00:05:04.220 | hey, we used to do it this way.
00:05:06.500 | We don't do it this way anymore because of blank.
00:05:08.500 | This documentation doesn't exist.
00:05:10.500 | It exists in the heads of your senior developers.
00:05:12.500 | And that's wonderful, but it's really hard for an AI
00:05:15.860 | to be able to mind-read that.
00:05:17.860 | On the left side where LLMs definitely can catch it,
00:05:20.500 | but humans don't want to receive
00:05:21.880 | are those things I showed you earlier,
00:05:23.400 | code cleanliness and best practice.
00:05:25.220 | Examples of these that we've found are "comment this function,"
00:05:28.940 | "add tests," "extract this type out into a different type,"
00:05:31.720 | "extract this logic out into a function."
00:05:34.080 | While this is always correct to say,
00:05:37.180 | I think it's really hard for an LLM to know when to apply it.
00:05:39.900 | I think as a human, you're applying some kind of barometer
00:05:42.560 | of, well, in this codebase, this logic is particularly tricky
00:05:46.080 | and I think someone's gonna get tripped up
00:05:47.260 | so we should extract it out versus, well, in this codebase,
00:05:50.320 | it's actually fine.
00:05:51.700 | But a bot can pretty much always leave this comment.
00:05:54.060 | I'd actually make the argument a human
00:05:55.360 | can pretty much always leave this comment
00:05:56.680 | and it'd be technically correct.
00:05:58.280 | The question is whether it's welcome in the codebase.
00:06:01.320 | And one thing I'm gonna say sort of outside of all of this
00:06:04.680 | is that as you add more context, this area of what people
00:06:08.620 | are comfortable with seems to become larger, but for now,
00:06:11.080 | given the context that we have, given the codebase,
00:06:13.340 | the past history, your style guide and rules,
00:06:17.000 | we have what we have.
00:06:19.580 | And so we ended up with this idea of, well, it turns out that
00:06:23.320 | these are basically the classes of comments
00:06:26.440 | that we think LLMs can both create
00:06:30.620 | and humans want to receive.
00:06:32.100 | Now, if you've worked with LLMs, you know that these kinds of
00:06:37.260 | offline passes and first passes are great
00:06:39.640 | for initial categorizations, but the much harder question is,
00:06:42.580 | how do you know that you're right continuously, right?
00:06:44.800 | So, as the story goes, we went ahead.
00:06:47.800 | We basically started to characterize comments that LLMs leave.
00:06:50.660 | We updated our prompts to only prompt the LLM to do things
00:06:53.240 | that were within its capabilities and that humans wanted to receive.
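
A sketch of what that kind of prompt restriction can look like in practice; the wording is illustrative, not Graphite's actual prompt.

    // Illustrative review instructions that keep the model in the "can catch,
    // and humans want to receive" quadrant; not Graphite's actual prompt.
    const REVIEW_INSTRUCTIONS = `
    You are reviewing a pull request. Only leave comments in these categories:
    - logical bugs that change runtime behavior
    - accidentally committed code (debug logging, leftover hacks, secrets)
    - performance or security concerns
    - documentation that contradicts what the code does

    Do NOT comment on adding tests, adding doc comments, extracting functions or
    types, or other general best-practice advice. If you are not confident a
    comment belongs to an allowed category, do not leave it.
    `.trim();
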
00:06:56.740 | And people anecdotally started to like it a lot more.
00:06:59.580 | But as we started to then think around, well,
00:07:03.600 | how do we know that this is going right?
00:07:05.140 | As we think around new LLMs,
00:07:06.640 | as we got into Claude 4 or Opus instead of Sonnet,
00:07:10.420 | how do we know that we're actually staying
00:07:12.460 | in this top right quadrant?
00:07:13.760 | And as we increase the context,
00:07:15.460 | how do we know that this isn't growing on us?
00:07:19.040 | And actually, maybe there are even more types of comments
00:07:20.840 | that we could be leaving that we're not leaving already.
00:07:24.280 | And so first and foremost,
00:07:26.680 | we started by just looking at what kinds of comments
00:07:28.720 | the thing is currently leaving.
00:07:30.660 | Your mileage may vary.
00:07:31.760 | For us, this is roughly the proportion we see
00:07:34.080 | of comments being left by the LLM right now
00:07:36.780 | based just on what we've seen.
00:07:39.040 | But the deeper question for us was,
00:07:41.340 | how do we measure the success, right?
00:07:43.620 | Like given this quadrant,
00:07:44.620 | how do we know that we're in the top right?
00:07:47.260 | The first one was easy for us.
00:07:48.820 | So, thinking about what they can catch and what they can't catch:
00:07:51.680 | What we started to do was we started to actually add
00:07:53.860 | upvotes and downvotes to the product.
00:07:55.240 | So we let you go ahead and emoji react in these comments,
00:07:57.880 | and they pretty much tell us when the LLM's hallucinating,
00:08:00.240 | when we start to see a downvote spike.
00:08:04.600 | We know that, okay,
00:08:05.820 | we might be trying to extend this thing
00:08:07.300 | beyond its capabilities, we need to tone it down.
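
A sketch of the kind of monitoring those reactions enable; the window size and thresholds here are illustrative assumptions.

    interface CommentFeedback {
      commentId: string;
      upvotes: number;
      downvotes: number;
      createdAt: Date;
    }

    // Fraction of recently voted-on comments whose reactions are net-negative.
    function downvoteRate(feedback: CommentFeedback[], windowDays = 7): number {
      const cutoff = Date.now() - windowDays * 24 * 60 * 60 * 1000;
      const recent = feedback.filter((f) => f.createdAt.getTime() >= cutoff);
      const voted = recent.filter((f) => f.upvotes + f.downvotes > 0);
      if (voted.length === 0) return 0;
      return voted.filter((f) => f.downvotes > f.upvotes).length / voted.length;
    }

    // Flag a spike, e.g. when the weekly rate is more than double the ~4% baseline.
    function isDownvoteSpike(feedback: CommentFeedback[], baseline = 0.04): boolean {
      return downvoteRate(feedback) > 2 * baseline;
    }
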
00:08:10.600 | But the second one was a lot harder.
00:08:12.020 | The "humans want to receive" versus "don't want to receive" axis
00:08:14.760 | was something that we weren't really sure how to get at.
00:08:18.800 | And so upvote, downvote, we implemented it.
00:08:21.600 | We see about a less than a 4% downvote rate these days.
00:08:24.260 | We felt pretty good about that.
00:08:26.500 | The second one, as we started to think around it,
00:08:28.320 | well, what we realized was,
00:08:29.320 | well, what's the point of a comment?
00:08:30.440 | Why do you leave a comment in code review?
00:08:32.000 | You leave a comment in code review ultimately
00:08:33.880 | so that someone actually updates the code to reflect that.
00:08:36.440 | And so our question was, well, can we measure that?
00:08:38.560 | Can we measure what percent of comments
00:08:40.440 | actually lead to the change that they describe?
00:08:43.080 | And so we started to do that.
00:08:44.360 | And we started to ask that question of,
00:08:45.920 | on open source repos and on the variety of repos
00:08:48.400 | that Graphite, which is a code review tool, has access to,
00:08:51.140 | can we actually start to measure that number?
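
One way to approximate that measurement, as a sketch: show a model the comment plus the diff of the commits pushed after it, and ask for a yes/no judgment. This again assumes the Anthropic SDK, and the prompt wording is illustrative rather than Graphite's actual implementation.

    import Anthropic from "@anthropic-ai/sdk";

    // Hypothetical judge: did the commits pushed after a review comment make the
    // change that the comment describes?
    async function commentWasAddressed(
      client: Anthropic,
      comment: string,
      diffAfterComment: string
    ): Promise<boolean> {
      const response = await client.messages.create({
        model: "claude-3-5-sonnet-latest",
        max_tokens: 8,
        messages: [{
          role: "user",
          content:
            `Review comment:\n${comment}\n\n` +
            `Diff of commits pushed after the comment:\n${diffAfterComment}\n\n` +
            `Did the later commits make the change the comment asks for? Answer "yes" or "no".`,
        }],
      });
      const block = response.content[0];
      return block.type === "text" && /yes/i.test(block.text);
    }
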
00:08:52.820 | And I think one of the most fascinating things we found
00:08:56.540 | was that only about 50% of human comments lead to changes.
00:08:59.960 | And so we started to ask the question of, well,
00:09:01.860 | could we get the LLM to at least match this, right?
00:09:04.580 | Because if we get it to at least that level,
00:09:05.900 | it's at least leaving comments on the level of fidelity
00:09:08.520 | that humans are.
00:09:09.540 | Now, you might be sitting in the audience thinking,
00:09:11.680 | well, why don't 100% of comments lead to action?
00:09:15.680 | I want to caveat this number.
00:09:17.460 | I'm saying lead to action within that PR itself.
00:09:20.140 | And so a lot of comments are sometimes fixed forward,
00:09:22.700 | where people are like, hey, I hear you,
00:09:25.180 | and I'm going to fix this in a follow-up.
00:09:26.920 | A lot of comments are also like, hey, as a heads up,
00:09:29.860 | in the future if you do this, maybe you can do it this other way,
00:09:32.520 | but they don't need to be acted on.
00:09:33.700 | And some of them
00:09:36.100 | are just purely preferential: I would do it this way,
00:09:38.740 | someone disagrees.
00:09:40.060 | In healthy code review cultures, that space for disagreement
00:09:42.400 | exists.
00:09:43.420 | And so we started to measure this.
00:09:44.920 | And we started to say, could we get the bot here?
00:09:47.860 | Over time, we actually have.
00:09:48.960 | So as of March, we're at 52%, which
00:09:50.580 | is to say that if you start to actually prompt it correctly,
00:09:53.740 | you can get there.
00:09:54.860 | And I think our sort of broader thesis
00:09:56.820 | is that with this measuring in place,
00:09:59.700 | catching bugs via an LLM does actually work.
00:10:05.060 | If you want to try any of these findings in production,
00:10:08.260 | Diamond is our product that offers it.
00:10:10.420 | We have a booth over there.
00:10:13.060 | Thank you.