How to Improve your Vibe Coding — Ian Butler


00:00:15.000 | My name's Ian.
00:00:16.400 | I'm the CEO of Bismuth.
00:00:17.960 | We're an end-to-end agentic coding solution,
00:00:19.920 | kind of like Codex.
00:00:21.840 | We've been working on evals for how good agents
00:00:24.880 | are at finding and fixing bugs for the last several months.
00:00:28.000 | And we dropped a benchmark yesterday discussing our results.
00:00:32.140 | So one thing to point out about agents currently
00:00:35.500 | is that they have a pretty low overall find rate for bugs.
00:00:38.440 | They actually generate a significant amount
00:00:40.420 | of false positives.
00:00:42.200 | You can see something like Devin and Cursor
00:00:44.140 | have a less than 10% true positive rate for finding bugs.
00:00:48.520 | This is an issue when you're vibe coding
00:00:50.320 | because these agents can quickly overrun your code base
00:00:53.340 | with unintended bugs that they're not
00:00:55.060 | able to actually find and later fix.
00:00:59.900 | Overall, too, it's worth noting that in terms of needle
00:01:03.400 | in a haystack, when we plant bugs in a code base,
00:01:06.400 | these agents struggle to navigate more broadly
00:01:09.100 | across those larger code bases and actually
00:01:11.260 | find the specific bugs.
00:01:14.400 | So here's the hard truth, right?
00:01:16.740 | Three out of six agents on our benchmark
00:01:18.540 | had a 10% or less true positive rate out of 900 plus reports.
00:01:23.700 | One agent actually gave us 70 issues for a single task,
00:01:27.060 | and all of them were false.
00:01:28.380 | And no developer is going to go through all those, right?
00:01:31.200 | You're not going to sit there and try to figure out
00:01:33.880 | what bugs actually exist.
00:01:37.160 | So bad vibes, right?
00:01:38.540 | Implications-- most popular agents
00:01:40.280 | are terrible at finding bugs.
00:01:41.960 | Cursor had a 97% false positive rate
00:01:44.760 | across 100-plus repos and 1,200-plus issues.
00:01:48.000 | The real-world impact of this is that when developers are actually
00:01:50.680 | building with this software, there's alert fatigue,
00:01:53.280 | and it erodes trust in these agents,
00:01:56.000 | which means bugs are going to go to prod.
00:01:59.420 | So how do you clean up some of the vibes?
00:02:01.400 | Like, we did this large benchmark.
00:02:03.020 | We've been doing this for months.
00:02:04.160 | We have practical tips for you when
00:02:06.140 | you're working in your IDE with these agents side by side.
00:02:09.440 | So the first thing to note is bug-focused rules.
00:02:13.100 | Every one of these agents has a rules type of file.
00:02:16.240 | You want to basically provide scoped instructions
00:02:18.460 | with additional detail on security issues,
00:02:21.260 | logical bugs, things like that.
00:02:23.980 | The second issue here is context management.
00:02:26.180 | So the biggest issue we saw with agents
00:02:27.960 | when navigating code bases was after a little bit of time,
00:02:30.500 | they'd get confused.
00:02:31.560 | They would lose logical links to stuff they've already read,
00:02:34.320 | and their ability to reason and come up
00:02:36.020 | with connections across a code base stumbled significantly.
00:02:40.180 | Obviously, when it comes to finding bugs,
00:02:41.880 | this is a problem because most significant and real bugs
00:02:44.400 | are complex multi-step processes that
00:02:46.080 | are nested deeply in code bases.
00:02:48.780 | And then finally, thinking models rock.
00:02:50.900 | Thinking models were significantly better
00:02:52.940 | at finding bugs in a code base.
00:02:54.600 | So whenever you're using something like Claude Code,
00:02:57.340 | Cursor, whatever, try to reach for thinking models.
00:02:59.460 | They are just significantly better at this problem.
00:03:04.540 | So I mentioned rules earlier.
00:03:05.940 | And I think there are some practical tips
00:03:07.700 | you can take away for improving your vibe coding.
00:03:10.720 | OWASP is the world's most popular security authority for bugs,
00:03:15.880 | I would say, give or take.
00:03:18.240 | When you're creating your rules files,
00:03:19.760 | try to feed some specific security information,
00:03:22.240 | like the OWASP Top 10, to the model.
00:03:24.560 | So what you're doing here is biasing the model.
00:03:26.280 | So when it's actually looking at your code,
00:03:27.720 | it's considering these things in the first place.
00:03:29.780 | Right now, we find when you don't actually supply models
00:03:32.560 | with security or bug-related information,
00:03:34.640 | their performance is significantly lower than otherwise.
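As a rough illustration of this kind of priming (the file name and wording below are hypothetical, not from the talk), a rules file might quote the OWASP Top 10 (2021) categories directly so the agent considers them on every pass over your code:

```markdown
# Hypothetical security rules snippet for a coding agent's rules file
When reviewing code in this repository, explicitly check against the OWASP Top 10 (2021):
A01 Broken Access Control, A02 Cryptographic Failures, A03 Injection, A04 Insecure Design,
A05 Security Misconfiguration, A06 Vulnerable and Outdated Components,
A07 Identification and Authentication Failures, A08 Software and Data Integrity Failures,
A09 Security Logging and Monitoring Failures, A10 Server-Side Request Forgery (SSRF).
For each finding, name the category, the affected file and line, and the concrete failure path.
```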
00:03:38.780 | Second, you're going to want to prioritize naming
00:03:40.560 | explicit classes of bugs in those rules.
00:03:42.620 | Like, don't be like, hey, Cursor, just try to find me some bugs
00:03:45.260 | in this repository.
00:03:46.480 | Be like, hey, Cursor, I want you to examine my repository
00:03:49.040 | for auth bypasses, prototype pollution,
00:03:51.980 | or SQL injection.
00:03:54.100 | You want to be explicit about this.
00:03:56.160 | That kind of primes the models to be looking for these issues.
00:03:59.440 | And then finally, with rules, you
00:04:01.200 | want to require fix validation.
00:04:02.720 | So you always want to tell the model, hey,
00:04:04.220 | you have to write and get tests to pass
00:04:05.720 | before this is coming into the code base.
00:04:07.280 | You have to ensure they actually fix the bugs.
00:04:10.740 | We've seen more broadly, across the 100 repositories
00:04:13.340 | we benchmarked and the thousands of issues we've seen
00:04:15.200 | from many agents, that structured rules eliminate the vague
00:04:18.920 | "check for bugs" requests that produce alert fatigue.
00:04:22.100 | Instead, they prime agents for much higher quality output.
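Putting the other two rule ideas together, here is a sketch of what naming explicit bug classes and requiring fix validation might look like in a rules file (hypothetical wording, not the speaker's actual rules):

```markdown
# Hypothetical bug-focused rules snippet
- Actively look for these specific bug classes: auth bypasses, prototype pollution,
  SQL injection, and the logical bugs that matter in this codebase.
- Do not report vague or speculative issues; every report needs a file, a line, and a
  concrete scenario in which the code misbehaves.
- Before claiming a bug is fixed, write or update a test that reproduces it and confirm
  the full test suite passes; do not mark the task complete until the tests are green.
```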
00:04:26.440 | So OK, context is key too, right?
00:04:28.460 | So I mentioned agents struggle significantly
00:04:30.420 | with cross-repo navigation and understanding.
00:04:34.820 | In fact, a lot of the agents, when they reach their context
00:04:37.120 | limits, kind of summarize or compact files down.
00:04:40.160 | When that compaction happens, the ability
00:04:42.080 | to detect and understand bugs reduces significantly.
00:04:44.840 | So it's actually on you as users in the IDE
00:04:47.300 | to kind of manage your context more thoroughly for these agents.
00:04:50.560 | You want to make sure you're feeding
00:04:53.160 | diffs of the code that was changed to the agent,
00:04:55.820 | so they're able to actually understand cause and effect
00:04:57.600 | better from that.
00:04:58.560 | You want to make sure key files aren't being summarized
00:05:01.200 | or being taken out of the context window.
00:05:02.820 | And you want to actually ask--
00:05:06.520 | one thing we found really effective in the benchmarking
00:05:09.740 | was asking agents to come up with a step-by-step component
00:05:13.500 | inventory of your code.
00:05:14.960 | So have it index, like, these are the classes.
00:05:17.040 | These are the variables.
00:05:19.220 | This is how they're used across the code base.
00:05:21.680 | When it does that inventory, it becomes much more
00:05:24.140 | able to find bugs.
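A sketch of how that might look as a single prompt in the IDE (the exact wording and the use of `git diff` here are illustrative assumptions, not prescribed by the talk):

```markdown
# Hypothetical context-management prompt for an IDE agent
First, build a step-by-step component inventory of this repository: list the main
classes and modules, the key variables and data structures, and where each one is
used across the code base. Keep that inventory in view for the rest of this session.

Here is the diff of what just changed (output of `git diff main...HEAD`):
<paste diff here>

Using the inventory and the diff, explain what this change affects downstream, and
check those call sites for auth bypasses, prototype pollution, and SQL injection.
```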
00:05:30.920 | Thinking models rock.
00:05:32.840 | We saw across our benchmarking, basically just implicitly,
00:05:36.100 | that thinking models were far more able to find bugs.
00:05:38.880 | If you go through their thought traces,
00:05:40.480 | you're actually able to see them kind of expand
00:05:42.040 | across a few different considerations in the code base.
00:05:44.500 | And then when they find those considerations,
00:05:46.460 | they will actually dive deeper into the chain of thought
00:05:48.880 | for finding those bugs.
00:05:50.420 | That means in practice, they do find deeper bugs
00:05:52.680 | than just non-thinking models were able to across the benchmark.
00:05:55.420 | However, I still want to note here,
00:05:56.880 | even with thinking models, there's
00:05:58.380 | a pretty significant limitation in their ability
00:06:00.540 | to actually holistically look at a file.
00:06:03.460 | We found, again, over hundreds of repos and thousands
00:06:06.240 | of issues that when agents were run,
00:06:08.380 | the top line number of bugs found would remain the same.
00:06:11.380 | But they would actually-- the bugs themselves
00:06:13.360 | would change run to run.
00:06:14.880 | So agents are never holistically really looking
00:06:17.060 | at a file like you or I would be looking at a file.
00:06:19.800 | There's high variability across runs.
00:06:21.880 | We think that's a very big limitation of current agents,
00:06:25.760 | by the way.
00:06:26.260 | We think for consumers, you shouldn't
00:06:27.420 | have to run your agents 100 times to get
00:06:29.260 | the whole holistic bug breakdown.
00:06:31.640 | But that's kind of a still in progress problem.
00:06:35.240 | So they're more thorough, and they just
00:06:39.680 | perform better across the benchmark than other models
00:06:42.040 | were able to.
00:06:44.840 | I'm going to quickly plug us.
00:06:45.840 | So we're bismuth.sh.
00:06:47.260 | We create PRs automatically.
00:06:48.640 | We're linked into GitHub, GitLab, Jira, and Linear.
00:06:51.320 | We scan for vulnerabilities.
00:06:52.600 | We provide reviews.
00:06:54.080 | And we also have on-prem deployments,
00:06:55.360 | which I know is a big sticking point for people.
00:06:57.360 | May your vibes be immaculate.
00:07:01.600 | If you scan this QR code, it'll take you to our site.
00:07:03.760 | There we have a link to the full benchmark
00:07:05.140 | with a breakdown of methodology and results.
00:07:08.540 | You can dive into the actual data itself,
00:07:11.880 | and you can see the SM100 benchmark here,
00:07:15.340 | along with our full data set and exploration
00:07:17.240 | so you can understand just how well current agents actually
00:07:20.620 | are at finding and fixing bugs.
00:07:22.800 | I'm Ian Butler.
00:07:23.440 | Thank you so much.
00:07:24.380 | Thank you.