
AI-powered entomology: Lessons from millions of AI code reviews — Tomas Reimers, Graphite



00:00:14.740 | Thank you all so much for coming to this talk.
00:00:17.040 | Thank you for being at this conference generally.
00:00:18.880 | My name is Tomas.
00:00:19.820 | I'm one of the co-founders of Graphite,
00:00:21.800 | and I'm here to talk to you about AI-powered entomology.
00:00:24.100 | If you don't know, entomology is the study of bugs.
00:00:26.080 | It's something that is very near and dear to our heart,
00:00:28.960 | and part of what our product does.
00:00:30.860 | So Graphite, for those of you that don't know,
00:00:33.720 | builds a product called Diamond.
00:00:34.800 | Diamond's an AI-powered code reviewer.
00:00:36.400 | You go ahead, you connect it to your GitHub,
00:00:39.520 | and it goes ahead and finds bugs.
00:00:41.340 | The project started about a year ago.
00:00:42.940 | What we started to notice was that the amount of code
00:00:45.220 | being written by AI was going up and up and up,
00:00:47.580 | but so was the amount of bugs.
00:00:49.160 | And after really thinking about it,
00:00:50.680 | we thought that this might actually be part and parcel,
00:00:54.240 | and what we need to do is we need to find a way
00:00:55.880 | to better address these bugs in general.
00:00:59.880 | Given the technological advances,
00:01:01.520 | the first thing we turned to was AI itself,
00:01:03.680 | and we started to ask, well,
00:01:05.140 | maybe AI is creating the bugs,
00:01:06.940 | but can it also find the bugs?
00:01:08.200 | Can it help us?
00:01:09.180 | We started to go ahead and do things like ask Claude,
00:01:13.040 | hey, here's a PR.
00:01:14.640 | Can you find bugs on this PR?
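
A minimal sketch of that first experiment in TypeScript, assuming the Anthropic SDK's messages.create call; the prompt wording, model name, and the reviewDiff helper are illustrative, not Graphite's actual code.

    import Anthropic from "@anthropic-ai/sdk";

    // Hypothetical helper: send a PR diff to Claude and ask it to find bugs.
    async function reviewDiff(diff: string): Promise<string> {
      const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
      const response = await client.messages.create({
        model: "claude-3-5-sonnet-latest",
        max_tokens: 1024,
        messages: [
          { role: "user", content: `Here's a PR. Can you find bugs on this PR?\n\n${diff}` },
        ],
      });
      // Return the text of the first block of the reply as the review.
      const block = response.content[0];
      return block.type === "text" ? block.text : "";
    }
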
00:01:16.420 | And we were pretty impressed with the early results.
00:01:18.720 | Here's an example actually pulled from this week
00:01:20.760 | from our code base where it turns out that in certain instances,
00:01:23.600 | we'd be returning one of our database ORM classes
00:01:26.180 | uninstantiated, which would go ahead and crash our server.
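
Roughly the shape of that bug, with hypothetical names rather than Graphite's actual code: in one branch the function returns the ORM class itself instead of an instance, so callers that expect an instance crash at runtime.

    class PullRequestModel {
      constructor(public id: string) {}
      toJSON() {
        return { id: this.id };
      }
    }

    // The loose `any` return type is how a bug like this slips past the type checker.
    function loadPullRequest(id?: string): any {
      if (!id) {
        return PullRequestModel; // bug: returns the class itself, uninstantiated
      }
      return new PullRequestModel(id);
    }

    // At runtime this throws, because the class constructor has no toJSON instance method:
    // loadPullRequest(undefined).toJSON();
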
00:01:29.100 | Here's an example that came up on Twitter this week
00:01:30.960 | from our bot that found that in certain instances,
00:01:34.640 | there would be math being done around border radiuses
00:01:36.980 | that would lead to a division by a negative number,
00:01:38.920 | and would then go ahead and crash the front end.
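
As a rough sketch of that flavor of front-end bug, with hypothetical names rather than the actual code from the tweet: border-radius math where a derived value can go negative, and a division on it blows up instead of being clamped.

    // Derive a corner scale from an outer radius and a border width.
    function innerCornerScale(outerRadius: number, borderWidth: number): number {
      const inner = outerRadius - borderWidth; // negative when the border is thicker than the radius
      const scale = outerRadius / inner;       // divides by a negative (or zero) value
      if (!Number.isFinite(scale) || scale < 0) {
        // This is the kind of case that surfaces as a crash instead of being handled.
        throw new Error(`invalid corner scale: ${scale}`);
      }
      return scale;
    }

    // A safer variant would clamp `inner` at zero and guard the division before using it.
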
00:01:41.900 | So to answer the question, it turns out AI can find bugs.
00:01:44.920 | That's the end of the talk.
00:01:46.200 | I'm kidding. If you've tried this,
00:01:48.600 | you know you've probably had a really,
00:01:49.960 | really frustrating experience.
00:01:51.780 | We also went ahead and saw things like,
00:01:53.940 | "you should update this code to do what it already does,"
00:01:57.060 | "CSS doesn't work this way" (when it does),
00:01:59.820 | or my favorite, "you should revert this code
00:02:02.140 | to what it used to do because it used to do it."
00:02:04.320 | Getting those lost us a lot of confidence,
00:02:08.560 | but we started to think, well,
00:02:10.340 | we're seeing some really good things
00:02:11.660 | and we're seeing some really bad things,
00:02:13.080 | and maybe there's actually more than one type of bug.
00:02:15.200 | Maybe there's more than one type of thing an LLM can find.
00:02:18.000 | And so we started with the most basic division of,
00:02:20.920 | well, there's probably stuff that LLMs are good at catching
00:02:24.100 | and things that they're not good at catching.
00:02:25.700 | At the end of the day, LLMs ultimately try and mimic
00:02:28.180 | the thing that you're asking them to do.
00:02:29.560 | And if you ask them, hey,
00:02:30.600 | what kind of code review comments would be left on this PR?
00:02:32.960 | It goes ahead and leaves everything,
00:02:34.280 | both those that are within its capability
00:02:36.180 | and things that are not within its capability.
00:02:38.680 | And so we started to categorize those.
00:02:40.740 | What we found though was even when we categorize those,
00:02:43.480 | the LLM would start to leave comments like this.
00:02:46.160 | You should add a comment describing what this class does.
00:02:48.860 | You should extract this logic out into a function,
00:02:51.440 | or you should make sure this code has tests.
00:02:53.860 | While these are technically correct,
00:02:57.200 | to developers, they're really frustrating.
00:02:58.820 | And I think this was actually one of the most insightful moments
00:03:01.320 | for us in building this project,
00:03:02.960 | was when we sat down with our design team
00:03:04.880 | and we started to actually go through past comments,
00:03:08.180 | both those left by our bot and by humans in our own codebase.
00:03:13.220 | The developers were all pretty much on the same page of like,
00:03:15.640 | yep, I'd be okay if an LLM left that.
00:03:17.640 | No, I would not be okay if an LLM left that.
00:03:19.740 | Yes, I'd be okay.
00:03:21.100 | And our designers were actually kind of baffled by it.
00:03:23.180 | They're like, well, but like,
00:03:24.100 | that kind of looks like that other comment.
00:03:26.320 | And I think that what's happening here in the mind of the developer
00:03:28.920 | is if you go ahead and you read a type of comment like this,
00:03:32.520 | maybe you find it pedantic, frustrating, annoying
00:03:35.400 | when it comes from LLM,
00:03:36.700 | and you're much more welcoming to it when it comes from a human.
00:03:40.060 | And so, as we started to think more around
00:03:41.700 | sort of that classification of bugs,
00:03:43.700 | we started to think around actually a second axis here,
00:03:45.940 | which was there's stuff LLMs can catch and LLMs can't catch,
00:03:49.200 | but there's also stuff that humans want to receive from an LLM
00:03:52.080 | and humans don't want to receive from an LLM.
00:03:54.480 | And so, what we went ahead and did was
00:04:00.300 | we actually took 10,000 comments from our own codebase
00:04:03.780 | and from open source codebases,
00:04:07.280 | and we fed them to various LLMs
00:04:08.820 | and we asked them to categorize them.
00:04:10.720 | And we did that not just once,
00:04:13.140 | but we did that quite a few times.
00:04:14.360 | And then we went ahead and we summarized those comments.
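
A minimal sketch of what that offline categorization pass can look like, again assuming the Anthropic SDK; the category list and the majority-vote aggregation are illustrative assumptions, not Graphite's actual pipeline.

    import Anthropic from "@anthropic-ai/sdk";

    const CATEGORIES = [
      "bug", "accidentally committed code", "performance/security",
      "documentation mismatch", "stylistic", "tribal knowledge",
      "code cleanliness / best practice",
    ];

    // Ask the model to put one review comment into exactly one category.
    async function categorizeComment(client: Anthropic, comment: string): Promise<string> {
      const response = await client.messages.create({
        model: "claude-3-5-sonnet-latest",
        max_tokens: 16,
        messages: [{
          role: "user",
          content: `Categorize this code review comment as exactly one of: ${CATEGORIES.join(", ")}.\n` +
            `Reply with the category only.\n\nComment: ${comment}`,
        }],
      });
      const block = response.content[0];
      return block.type === "text" ? block.text.trim() : "unknown";
    }

    // Run the categorization several times per comment and keep the most common label.
    async function categorizeWithVoting(client: Anthropic, comment: string, runs = 3): Promise<string> {
      const counts = new Map<string, number>();
      for (let i = 0; i < runs; i++) {
        const label = await categorizeComment(client, comment);
        counts.set(label, (counts.get(label) ?? 0) + 1);
      }
      return [...counts.entries()].sort((a, b) => b[1] - a[1])[0][0];
    }
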
00:04:16.620 | And what we ended up with was actually this chart,
00:04:19.400 | where it says there's actually quite a few different types
00:04:21.720 | of comments that you see left on codebases in the wild.
00:04:24.500 | Ignoring LLMs for a second, just talking around humans,
00:04:27.240 | you see things which are bugs,
00:04:28.660 | those are logical inconsistencies that lead the code
00:04:30.880 | to behave in a way it isn't intended to behave.
00:04:32.740 | There's also accidentally committed code.
00:04:34.440 | This actually shows up more than you would expect.
00:04:36.740 | There are performance and security concerns.
00:04:38.620 | There's documentation where the code says one thing
00:04:40.800 | and does another and it's not clear which one's right.
00:04:43.440 | There's stylistic changes, things like,
00:04:45.440 | hey, you should update this comment
00:04:48.380 | or in this codebase we follow this other pattern.
00:04:50.800 | And then there's a lot of stuff outside
00:04:52.300 | of sort of that top right quadrant.
00:04:53.660 | So in the bottom right where humans want to receive it,
00:04:57.440 | but the LLMs don't seem to be able to get there yet,
00:04:59.900 | are things like tribal knowledge.
00:05:01.680 | One class of comment that you'll see a lot in PRs is,
00:05:04.220 | hey, we used to do it this way.
00:05:06.500 | We don't do it this way anymore because of blank.
00:05:08.500 | This documentation doesn't exist.
00:05:10.500 | It exists in the heads of your senior developers.
00:05:12.500 | And that's wonderful, but it's really hard for an AI
00:05:15.860 | to be able to mind-read that.
00:05:17.860 | On the left side where LLMs definitely can catch it,
00:05:20.500 | but humans don't want to receive
00:05:21.880 | are those things I showed you earlier,
00:05:23.400 | code cleanliness and best practice.
00:05:25.220 | Examples of these that we've found are "comment this function,"
00:05:28.940 | "add tests," "extract this type out into a different type,"
00:05:31.720 | "extract this logic out into a function."
00:05:34.080 | While this is always correct to say,
00:05:37.180 | I think it's really hard for an LLM to know when to apply it.
00:05:39.900 | I think as a human, you're applying some kind of barometer
00:05:42.560 | of, well, in this codebase, this logic is particularly tricky
00:05:46.080 | and I think someone's gonna get tripped up
00:05:47.260 | so we should extract it out versus, well, in this codebase,
00:05:50.320 | it's actually fine.
00:05:51.700 | But a bot can pretty much always leave this comment.
00:05:54.060 | I'd actually make the argument a human
00:05:55.360 | can pretty much always leave this comment
00:05:56.680 | and it'd be technically correct.
00:05:58.280 | The question is whether it's welcome in the codebase.
00:06:01.320 | And one thing I'm gonna say sort of outside of all of this
00:06:04.680 | is that as you add more context, this area of what people
00:06:08.620 | are comfortable with seems to become larger, but for now,
00:06:11.080 | given the context that we have, given the codebase,
00:06:13.340 | the past history, your style guide and rules,
00:06:17.000 | we have what we have.
00:06:19.580 | And so we ended up with this idea of, well, it turns out that
00:06:23.320 | these are basically the classes of comments
00:06:26.440 | that we think LLMs can both create
00:06:30.620 | and humans want to receive.
00:06:32.100 | Now, if you've worked with LLMs, you know that these kinds of
00:06:37.260 | offline passes and first passes are great
00:06:39.640 | for initial categorizations, but the much harder question is,
00:06:42.580 | how do you know that you're right continuously, right?
00:06:44.800 | So, as the story goes, we went ahead.
00:06:47.800 | We basically started to characterize comments that LLMs leave.
00:06:50.660 | We updated our prompts to only prompt the LLM to do things
00:06:53.240 | that were within its capabilities and that humans wanted to receive.
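
A sketch of what that kind of prompt restriction can look like in practice; the wording is illustrative, not Graphite's actual prompt.

    // Illustrative review instructions that keep the model in the "can catch,
    // and humans want to receive" quadrant; not Graphite's actual prompt.
    const REVIEW_INSTRUCTIONS = `
    You are reviewing a pull request. Only leave comments in these categories:
    - logical bugs that change runtime behavior
    - accidentally committed code (debug logging, leftover hacks, secrets)
    - performance or security concerns
    - documentation that contradicts what the code does

    Do NOT comment on adding tests, adding doc comments, extracting functions or
    types, or other general best-practice advice. If you are not confident a
    comment belongs to an allowed category, do not leave it.
    `.trim();
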
00:06:56.740 | And people anecdotally started to like it a lot more.
00:06:59.580 | But as we started to then think around, well,
00:07:03.600 | how do we know that this is going right?
00:07:05.140 | As we think around new LLMs,
00:07:06.640 | as we got into Claude 4 or Opus instead of Sonnet,
00:07:10.420 | how do we know that we're actually staying
00:07:12.460 | in this top right quadrant?
00:07:13.760 | And as we increase the context,
00:07:15.460 | how do we know that this isn't growing on us?
00:07:19.040 | And actually, maybe there are even more types of comments
00:07:20.840 | that we could be leaving that we're not leaving already.
00:07:24.280 | And so first and foremost,
00:07:26.680 | we started by just looking at what kinds of comments
00:07:28.720 | the thing is currently leaving.
00:07:30.660 | Your mileage may vary.
00:07:31.760 | For us, this is roughly the proportion we see
00:07:34.080 | of comments being left by the LLM right now
00:07:36.780 | based just on what we've seen.
00:07:39.040 | But the deeper question for us was,
00:07:41.340 | how do we measure the success, right?
00:07:43.620 | Like given this quadrant,
00:07:44.620 | how do we know that we're in the top right?
00:07:47.260 | The first one was easy for us.
00:07:48.820 | So, thinking about what they can catch and what they can't catch:
00:07:51.680 | What we started to do was we started to actually add
00:07:53.860 | upvotes and downvotes to the product.
00:07:55.240 | So we let you go ahead and emoji react in these comments,
00:07:57.880 | and they pretty much tell us when the LLM's hallucinating,
00:08:00.240 | when we start to see a downvote spike.
00:08:04.600 | We know that, okay,
00:08:05.820 | we might be trying to extend this thing
00:08:07.300 | beyond its capabilities, we need to tone it down.
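
A sketch of the kind of monitoring those reactions enable; the window size and thresholds here are illustrative assumptions.

    interface CommentFeedback {
      commentId: string;
      upvotes: number;
      downvotes: number;
      createdAt: Date;
    }

    // Fraction of recently voted-on comments whose reactions are net-negative.
    function downvoteRate(feedback: CommentFeedback[], windowDays = 7): number {
      const cutoff = Date.now() - windowDays * 24 * 60 * 60 * 1000;
      const recent = feedback.filter((f) => f.createdAt.getTime() >= cutoff);
      const voted = recent.filter((f) => f.upvotes + f.downvotes > 0);
      if (voted.length === 0) return 0;
      return voted.filter((f) => f.downvotes > f.upvotes).length / voted.length;
    }

    // Flag a spike, e.g. when the weekly rate is more than double the ~4% baseline.
    function isDownvoteSpike(feedback: CommentFeedback[], baseline = 0.04): boolean {
      return downvoteRate(feedback) > 2 * baseline;
    }
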
00:08:10.600 | But the second one was a lot harder.
00:08:12.020 | The "humans want to receive" versus "don't want to receive" axis
00:08:14.760 | was something that we weren't really sure how to get at.
00:08:18.800 | And so upvote, downvote, we implemented it.
00:08:21.600 | We see about a less than a 4% downvote rate these days.
00:08:24.260 | We felt pretty good about that.
00:08:26.500 | The second one, as we started to think around it,
00:08:28.320 | well, what we realized was,
00:08:29.320 | well, what's the point of a comment?
00:08:30.440 | Why do you leave a comment in code review?
00:08:32.000 | You leave a comment in code review ultimately
00:08:33.880 | so that someone actually updates the code to reflect that.
00:08:36.440 | And so our question was, well, can we measure that?
00:08:38.560 | Can we measure what percent of comments
00:08:40.440 | actually lead to the change that they describe?
00:08:43.080 | And so we started to do that.
00:08:44.360 | And we started to ask that question of,
00:08:45.920 | on open source repos and on the variety of repos
00:08:48.400 | that Graphite, which is a code review tool, has access to,
00:08:51.140 | can we actually start to measure that number?
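
One way to approximate that measurement, as a sketch: show a model the comment plus the diff of the commits pushed after it, and ask for a yes/no judgment. This again assumes the Anthropic SDK, and the prompt wording is illustrative rather than Graphite's actual implementation.

    import Anthropic from "@anthropic-ai/sdk";

    // Hypothetical judge: did the commits pushed after a review comment make the
    // change that the comment describes?
    async function commentWasAddressed(
      client: Anthropic,
      comment: string,
      diffAfterComment: string
    ): Promise<boolean> {
      const response = await client.messages.create({
        model: "claude-3-5-sonnet-latest",
        max_tokens: 8,
        messages: [{
          role: "user",
          content:
            `Review comment:\n${comment}\n\n` +
            `Diff of commits pushed after the comment:\n${diffAfterComment}\n\n` +
            `Did the later commits make the change the comment asks for? Answer "yes" or "no".`,
        }],
      });
      const block = response.content[0];
      return block.type === "text" && /yes/i.test(block.text);
    }
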
00:08:52.820 | And I think one of the most fascinating things we found
00:08:56.540 | was that only about 50% of human comments lead to changes.
00:08:59.960 | And so we started to ask the question of, well,
00:09:01.860 | could we get the LLM to at least match this, right?
00:09:04.580 | Because if we get it to at least that level,
00:09:05.900 | it's at least leaving comments on the level of fidelity
00:09:08.520 | that humans are.
00:09:09.540 | Now, you might be sitting in the audience thinking,
00:09:11.680 | well, why don't 100% of comments lead to action?
00:09:15.680 | I want to caveat this number.
00:09:17.460 | I'm saying lead to action within that PR itself.
00:09:20.140 | And so a lot of comments are sometimes fixed forward,
00:09:22.700 | where people are like, hey, I hear you,
00:09:25.180 | and I'm going to fix this in a follow-up.
00:09:26.920 | A lot of comments are also like, hey, as a heads up,
00:09:29.860 | in the future if you do this, maybe you can do it this other way,
00:09:32.520 | but they don't need to be acted on.
00:09:33.700 | And some of them
00:09:36.100 | are just purely preferential: I would do it this way,
00:09:38.740 | someone disagrees.
00:09:40.060 | In healthy code review cultures, that space for disagreement
00:09:42.400 | exists.
00:09:43.420 | And so we started to measure this.
00:09:44.920 | And we started to say, could we get the bot here?
00:09:47.860 | Over time, we actually have.
00:09:48.960 | So as of March, we're at 52%, which
00:09:50.580 | is to say that if you start to actually prompt it correctly,
00:09:53.740 | you can get there.
00:09:54.860 | And I think our sort of broader thesis
00:09:56.820 | is that with this measuring in place,
00:09:59.700 | catching bugs via an LLM does actually work.
00:10:05.060 | If you want to try any of these findings in production,
00:10:08.260 | Diamond is our product that offers it.
00:10:10.420 | We have a booth over there.
00:10:13.060 | Thank you.