How to Improve your Vibe Coding — Ian Butler


00:00:15.000 | My name's Ian.
00:00:16.400 | I'm the CEO of Bismuth.
00:00:17.960 | We're an end-to-end agentic coding solution,
00:00:19.920 | kind of like Codex.
00:00:21.840 | We've been working on evals for how good agents
00:00:24.880 | are at finding and fixing bugs for the last several months.
00:00:28.000 | And we dropped a benchmark yesterday discussing our results.
00:00:32.140 | So one thing to point out about agents currently
00:00:35.500 | is that they have a pretty low overall find rate for bugs.
00:00:38.440 | They actually generate a significant amount
00:00:40.420 | of false positives.
00:00:42.200 | You can see something like Devin and Cursor
00:00:44.140 | have a less than 10% true positive rate for finding bugs.
00:00:48.520 | This is an issue when you're vibe coding
00:00:50.320 | because these agents can quickly overrun your code base
00:00:53.340 | with unintended bugs that they're not
00:00:55.060 | able to actually find and later fix.
00:00:59.900 | Overall, too, it's worth noting that in terms of needle
00:01:03.400 | in a haystack, when we plant bugs in a code base,
00:01:06.400 | these agents struggle to navigate more broadly
00:01:09.100 | across those larger code bases and actually
00:01:11.260 | find the specific bugs.
00:01:14.400 | So here's the hard truth, right?
00:01:16.740 | Three out of six agents on our benchmark
00:01:18.540 | had a 10% or less true positive rate out of 900 plus reports.
00:01:23.700 | One agent actually gave us 70 issues for a single task,
00:01:27.060 | and all of them were false.
00:01:28.380 | And no developer is going to go through all those, right?
00:01:31.200 | You're not going to sit there and try to figure out
00:01:33.880 | what bugs actually exist.
00:01:37.160 | So bad vibes, right?
00:01:38.540 | Implications-- most popular agents
00:01:40.280 | are terrible at finding bugs.
00:01:41.960 | Cursor had a 97% false positive rate
00:01:44.760 | across 100-plus repos and 1,200-plus issues.
00:01:48.000 | The real-world impact of this is that when developers are actually
00:01:50.680 | building with this software, there's alert fatigue,
00:01:53.280 | and it erodes trust in these agents,
00:01:56.000 | which means bugs are going to go to prod.
00:01:59.420 | So how do you clean up some of the vibes?
00:02:01.400 | Like, we did this large benchmark.
00:02:03.020 | We've been doing this for months.
00:02:04.160 | We have practical tips for you when
00:02:06.140 | you're working in your IDE with these agents side by side.
00:02:09.440 | So the first thing to note is bug-focused rules.
00:02:13.100 | Every one of these agents has a rules type of file.
00:02:16.240 | You want to basically provide scoped instructions
00:02:18.460 | with additional detail on security issues,
00:02:21.260 | logical bugs, things like that.
00:02:23.980 | The second issue here is context management.
00:02:26.180 | So the biggest issue we saw with agents
00:02:27.960 | when navigating code bases was after a little bit of time,
00:02:30.500 | they'd get confused.
00:02:31.560 | They would lose logical links to stuff they've already read,
00:02:34.320 | and their ability to reason and come up
00:02:36.020 | with connections across a code base stumbled significantly.
00:02:40.180 | Obviously, when it comes to finding bugs,
00:02:41.880 | this is a problem because most significant and real bugs
00:02:44.400 | are complex multi-step processes that
00:02:46.080 | are nested deeply in code bases.
00:02:48.780 | And then finally, thinking models rock.
00:02:50.900 | Thinking models were significantly better
00:02:52.940 | at finding bugs in a code base.
00:02:54.600 | So whenever you're using something like Claude Code,
00:02:57.340 | Cursor, whatever, try to reach for thinking models.
00:02:59.460 | They are just significantly better at this problem.
00:03:04.540 | So I mentioned rules earlier.
00:03:05.940 | And I think there are some practical tips
00:03:07.700 | you can take away for improving your vibe coding.
00:03:10.720 | OWASP is the world's most popular security authority for bugs,
00:03:15.880 | I would say, give or take.
00:03:18.240 | When you're creating your rules files,
00:03:19.760 | try to feed some specific security information,
00:03:22.240 | like the OWASP Top 10, to the model.
00:03:24.560 | So what you're doing here is biasing the model.
00:03:26.280 | So when it's actually looking at your code,
00:03:27.720 | it's considering these things in the first place.
00:03:29.780 | Right now, we find when you don't actually supply models
00:03:32.560 | with security or bug-related information,
00:03:34.640 | their performance is significantly lower than otherwise.
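As a rough illustration of this kind of priming (the file name and wording below are hypothetical, not from the talk), a rules file might quote the OWASP Top 10 (2021) categories directly so the agent considers them on every pass over your code:

```markdown
# Hypothetical security rules snippet for a coding agent's rules file
When reviewing code in this repository, explicitly check against the OWASP Top 10 (2021):
A01 Broken Access Control, A02 Cryptographic Failures, A03 Injection, A04 Insecure Design,
A05 Security Misconfiguration, A06 Vulnerable and Outdated Components,
A07 Identification and Authentication Failures, A08 Software and Data Integrity Failures,
A09 Security Logging and Monitoring Failures, A10 Server-Side Request Forgery (SSRF).
For each finding, name the category, the affected file and line, and the concrete failure path.
```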
00:03:38.780 | Second, you're going to want to prioritize naming
00:03:40.560 | explicit classes of bugs in those rules.
00:03:42.620 | Like, don't be like, hey, Cursor, just try to find me some bugs
00:03:45.260 | in this repository.
00:03:46.480 | Be like, hey, Cursor, I want you to examine my repository
00:03:49.040 | for auth bypasses, prototype pollution,
00:03:51.980 | or SQL injection.
00:03:54.100 | You want to be explicit about this.
00:03:56.160 | That kind of primes the models to be looking for these issues.
00:03:59.440 | And then finally, with rules, you
00:04:01.200 | want to require fix validation.
00:04:02.720 | So you always want to tell the model, hey,
00:04:04.220 | you have to write and get tests to pass
00:04:05.720 | before this is coming into the code base.
00:04:07.280 | You have to ensure they actually fix the bugs.
00:04:10.740 | We've seen more broadly, across the 100 repositories
00:04:13.340 | we benchmarked and the thousands of issues we've seen
00:04:15.200 | from many agents, that structured rules eliminate the vague
00:04:18.920 | "check for bugs" requests that produce alert fatigue.
00:04:22.100 | Instead, they prime agents for much higher quality output.
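Putting the other two rule ideas together, here is a sketch of what naming explicit bug classes and requiring fix validation might look like in a rules file (hypothetical wording, not the speaker's actual rules):

```markdown
# Hypothetical bug-focused rules snippet
- Actively look for these specific bug classes: auth bypasses, prototype pollution,
  SQL injection, and the logical bugs that matter in this codebase.
- Do not report vague or speculative issues; every report needs a file, a line, and a
  concrete scenario in which the code misbehaves.
- Before claiming a bug is fixed, write or update a test that reproduces it and confirm
  the full test suite passes; do not mark the task complete until the tests are green.
```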
00:04:26.440 | So OK, context is key too, right?
00:04:28.460 | So I mentioned agents struggle significantly
00:04:30.420 | with cross-repo navigation and understanding.
00:04:34.820 | In fact, a lot of the agents, when they reach their context
00:04:37.120 | limits, kind of summarize or compact files down.
00:04:40.160 | When that compaction happens, the ability
00:04:42.080 | to detect and understand bugs reduces significantly.
00:04:44.840 | So it's actually on you as users in the IDE
00:04:47.300 | to kind of manage your context more thoroughly for these agents.
00:04:50.560 | You want to make sure you're feeding
00:04:53.160 | diffs of the code that was changed to the agent,
00:04:55.820 | so they're able to actually understand cause and effect
00:04:57.600 | better from that.
00:04:58.560 | You want to make sure key files aren't being summarized
00:05:01.200 | or being taken out of the context window.
00:05:02.820 | And you want to actually ask--
00:05:06.520 | one thing we found really effective in the benchmarking
00:05:09.740 | was asking agents to come up with a step-by-step component
00:05:13.500 | inventory of your code.
00:05:14.960 | So have it index, like, these are the classes.
00:05:17.040 | These are the variables.
00:05:19.220 | This is how they're used across the code base.
00:05:21.680 | When it does that inventory, it becomes much more
00:05:24.140 | able to find bugs.
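A sketch of how that might look as a single prompt in the IDE (the exact wording and the use of `git diff` here are illustrative assumptions, not prescribed by the talk):

```markdown
# Hypothetical context-management prompt for an IDE agent
First, build a step-by-step component inventory of this repository: list the main
classes and modules, the key variables and data structures, and where each one is
used across the code base. Keep that inventory in view for the rest of this session.

Here is the diff of what just changed (output of `git diff main...HEAD`):
<paste diff here>

Using the inventory and the diff, explain what this change affects downstream, and
check those call sites for auth bypasses, prototype pollution, and SQL injection.
```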
00:05:30.920 | Thinking models rock.
00:05:32.840 | We saw across our benchmarking, basically just implicitly,
00:05:36.100 | that thinking models were far more able to find bugs.
00:05:38.880 | If you go through their thought traces,
00:05:40.480 | you're actually able to see them kind of expand
00:05:42.040 | across a few different considerations in the code base.
00:05:44.500 | And then when they find those considerations,
00:05:46.460 | they will actually dive deeper into the chain of thought
00:05:48.880 | for finding those bugs.
00:05:50.420 | That means in practice, they do find deeper bugs
00:05:52.680 | than just non-thinking models were able to across the benchmark.
00:05:55.420 | However, I still want to note here,
00:05:56.880 | even with thinking models, there's
00:05:58.380 | a pretty significant limitation in their ability
00:06:00.540 | to actually holistically look at a file.
00:06:03.460 | We found, again, over hundreds of repos and thousands
00:06:06.240 | of issues that when agents were run,
00:06:08.380 | the top line number of bugs found would remain the same.
00:06:11.380 | But they would actually-- the bugs themselves
00:06:13.360 | would change run to run.
00:06:14.880 | So agents are never holistically really looking
00:06:17.060 | at a file like you or I would be looking at a file.
00:06:19.800 | There's high variability across runs.
00:06:21.880 | We think that's a very big limitation of current agents,
00:06:25.760 | by the way.
00:06:26.260 | We think for consumers, you shouldn't
00:06:27.420 | have to run your agents 100 times to get
00:06:29.260 | the whole holistic bug breakdown.
00:06:31.640 | But that's kind of a still in progress problem.
00:06:35.240 | So they're more thorough, and they just
00:06:39.680 | perform better across the benchmark than other models
00:06:42.040 | were able to.
00:06:44.840 | I'm going to quickly plug us.
00:06:45.840 | So we're bismuth.sh.
00:06:47.260 | We create PRs automatically.
00:06:48.640 | We're linked into GitHub, GitLab, Jira, and Linear.
00:06:51.320 | We scan for vulnerabilities.
00:06:52.600 | We provide reviews.
00:06:54.080 | And we also have on-prem deployments,
00:06:55.360 | which I know is a big sticking point for people.
00:06:57.360 | May your vibes be immaculate.
00:07:01.600 | If you scan this QR code, it'll take you to our site.
00:07:03.760 | There we have a link to the full benchmark
00:07:05.140 | with a breakdown of methodology and results.
00:07:08.540 | You can dive into the actual data itself,
00:07:11.880 | and you can see the SM100 benchmark here,
00:07:15.340 | along with our full data set and exploration
00:07:17.240 | so you can understand just how well current agents actually
00:07:20.620 | are at finding and fixing bugs.
00:07:22.800 | I'm Ian Butler.
00:07:23.440 | Thank you so much.
00:07:24.380 | Thank you.