
How to Improve your Vibe Coding — Ian Butler


Transcript

My name's Ian. I'm the CEO of Bismuth. We're an end-to-end agentic coding solution, kind of like Codex. We've been working on evals for how good agents are at finding and fixing bugs for the last several months. And we dropped a benchmark yesterday discussing our results. So one thing to point out about agents currently is that they have a pretty low overall find rate for bugs.

They actually generate a significant amount of false positives. You can see something like Devin and Cursor have a less than 10% true positive rate for finding bugs. This is an issue when you're vibe coding, because these agents can quickly overrun your code base with unintended bugs that they're not able to actually find and later fix.

Overall, too, it's worth noting that in terms of needle in a haystack, when we plant bugs in a code base, these agents struggle to navigate more broadly across those larger code bases and actually find the specific bugs. So here's the hard truth, right? Three out of six agents on our benchmark had a 10% or less true positive rate out of 900 plus reports.

One agent actually gave us 70 issues for a single task, and all of them were false. And no developer is going to go through all those, right? You're not going to sit there and try to figure out what bugs actually exist. So bad vibes, right? Implications-- most popular agents are terrible at finding bugs.

Cursor had a 97% false positive rate across 100-plus repos and 1,200-plus issues. The real-world impact is that when developers are actually building with this software, there's alert fatigue, trust in these agents erodes, and bugs end up going to prod.

So how do you clean up some of the vibes? Like, we did this large benchmark. We've been doing this for months. We have practical tips for you when you're working in your IDE with these agents side by side. So the first thing to note is bug-focused rules. Every one of these agents has a rules type of file.

You want to basically provide scoped instructions that add detail on security issues, logical bugs, things like that. The second thing here is context management. So the biggest issue we saw with agents when navigating code bases was that after a little bit of time, they'd get confused. They would lose logical links to stuff they'd already read, and their ability to reason and draw connections across a code base dropped off significantly.

Obviously, when it comes to finding bugs, this is a problem, because most significant, real bugs are complex, multi-step processes nested deeply in code bases. And then finally, thinking models rock. Thinking models were significantly better at finding bugs in a code base. So whenever you're using something like Claude Code, Cursor, whatever, try to reach for thinking models.

They are just significantly better at this problem. OK. So I mentioned rules earlier. And I think there are some practical tips you can take away for improving your vibe coding. OWASP is the world's most popular security authority for bugs, I would say, give or take. When you're creating your rules files, try to feed some specific security information, like the OWASP Top 10, to the model.

So what you're doing here is biasing the model, so that when it's actually looking at your code, it's considering these things in the first place. Right now, we find that when you don't supply models with security or bug-related information, their performance is significantly lower than otherwise. Second, you're going to want to prioritize naming explicit classes of bugs in those rules.

Like, don't be like, hey, Cursor, just try to find me some bugs in this repository. Be like, hey, Cursor, I want you to examine my repository for auth bypasses, prototype pollution, SQL injection. You want to be explicit about this. That kind of primes the models to be looking for these issues.

And then finally, with rules, you want to require fix validation. So you always want to tell the model, hey, you have to write tests and get them to pass before this comes into the code base. You have to ensure they actually fix the bugs. We've seen more broadly, across the 100 repositories we benchmarked and the thousands of issues we've seen from many agents, that structured rules eliminate the vague check-for-bugs requests that produce alert fatigue. Instead, they prime agents for much higher quality output.
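
As a concrete illustration (this sketch is ours, not from the talk), a bug-focused rules file combining those three tips might look something like the following. The exact filename and syntax depend on the agent you're using (for example, .cursorrules for Cursor or CLAUDE.md for Claude Code), and the specific rule text is just an assumed starting point to adapt:

```
# Bug-focused rules (illustrative sketch; adapt to your agent's rules format)

## Security priming
- When writing or reviewing code, check it against the OWASP Top 10
  (injection, broken access control, cryptographic failures, security
  misconfiguration, and so on).

## Explicit bug classes
- Prioritize these classes of bugs: auth bypasses, SQL injection,
  prototype pollution, unvalidated input at trust boundaries, and
  race conditions around shared state.

## Fix validation
- Before claiming a bug is fixed, add or update a test that reproduces it,
  run the test suite, and confirm the new test passes.
- Do not mark a task complete while any test is failing.
```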

So OK, context is key too, right? So I mentioned agents struggle significantly with navigating and understanding code across a repository. In fact, a lot of the agents, when they reach their context limits, kind of summarize or compact files down. When that compaction happens, their ability to detect and understand bugs drops significantly.

So it's actually on you as users in the IDE to manage your context more thoroughly for these agents. You want to make sure you're feeding diffs of the code that was changed to the agent; they're able to actually understand cause and effect better from that. You want to make sure key files aren't being summarized or taken out of the context window.

And you want to actually ask-- one thing we found really effective in the benchmarking was asking agents to come up with a step-by-step component inventory of your code. So have it build an index: these are the classes, these are the variables, and this is how they're used across the code base.
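
To make that concrete, here are a couple of illustrative prompts (our wording, not the speaker's) for feeding the agent a diff and asking for a component inventory:

```
# Feeding a diff (paste the output of `git diff` or `git show HEAD`):
"Here is the diff for my latest change: <paste diff>. Before suggesting
anything else, walk through what this change affects and why."

# Asking for a component inventory:
"Before looking for bugs, build a step-by-step component inventory of this
repository: list the main modules and classes, the key variables and state
each one owns, and where each is used across the code base. Keep that
inventory in mind while you review."
```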

When it does that inventory, it becomes much more able to find bugs. OK. Thinking models rock. We saw across our benchmarking, basically just implicitly, that thinking models were far more able to find bugs. If you go through their thought traces, you're actually able to see them kind of expand across a few different considerations in the code base.

And then when they find those considerations, they will actually dive deeper into the chain of thought for finding those bugs. That means, in practice, they do find deeper bugs than non-thinking models were able to across the benchmark. However, I still want to note here that even with thinking models, there's a pretty significant limitation in their ability to actually look at a file holistically.

We found, again, over hundreds of repos and thousands of issues, that when agents were run repeatedly, the top-line number of bugs found would remain the same, but the bugs themselves would change run to run. So agents are never really looking at a file holistically, the way you or I would look at a file.

There's high variability across runs. We think that's a very big limitation of current agents, by the way. We think, for consumers, you shouldn't have to run your agents 100 times to get the whole, holistic bug breakdown. But that's kind of a still-in-progress problem.

So they're more thorough, and they just perform better across the benchmark than other models were able to. I'm going to quickly plug us. So we're bismuth.sh. We create PRs automatically. We're linked into GitHub, GitLab, Jira, and Linear. We scan for vulnerabilities. We provide reviews. And we also have on-prem deployments, which I know is a big sticking point for people.

May your vibes be immaculate. If you scan this QR code, it'll take you to our site. There we have a link to the full benchmark, with a breakdown of the methodology and results. You can dive into the actual data itself, and you can see the SM100 benchmark there, along with our full data set and exploration, so you can understand just how well current agents are actually finding and fixing bugs.

Yep. I'm Ian Butler. Thank you so much.