How to Improve your Vibe Coding — Ian Butler

We've been working on evals for how good agents are at finding and fixing bugs for the last several months, and we dropped a benchmark yesterday discussing our results.
So one thing to point out about agents currently is that they have a pretty low overall find rate for bugs: less than a 10% true positive rate for finding bugs. That matters because these agents can quickly overrun your code base with false positives, burying the bugs they are able to actually find and then later fix. Overall, too, it's worth noting that in terms of needle in a haystack, when we plant bugs in a code base, these agents struggle to navigate more broadly; across the board they had a 10% or less true positive rate out of 900-plus reports.
One agent actually gave us 70 issues for a single task. And no developer is going to go through all of those, right? You're not going to sit there and try to figure out which of them are real. The real-world impact of this is that when developers are actually building with this software, there's alert fatigue, and it erodes trust in these agents. So here are some things you can do when you're working in your IDE with these agents side by side.
So the first thing to note is bug-focused rules. Every one of these agents has a rules type of file. You want to basically provide scoped instructions that add detail on security issues.
One thing we noticed about agents when navigating code bases was that, after a little bit of time, they would lose logical links to stuff they had already read, and their ability to reason about connections across a code base stumbled significantly. This is a problem because most significant, real bugs span more than one file or component. So whenever you're using something like Claude Code, Cursor, whatever, try to reach for thinking models. They are just significantly better at this problem.
That's another thing you can take away for improving your vibe coding. OWASP is the world's most popular security authority for bugs, so in those rules files, try to feed in some specific security information from sources like that. What you're doing here is biasing the model so that it's considering these things in the first place. Right now, we find that when you don't actually supply models with this kind of guidance, their performance is significantly lower than otherwise.
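To make that concrete, here is a rough sketch of what a bug-focused rules file could look like. The file name, location, and exact wording are assumptions for illustration only (Cursor reads rules from a .cursor/rules directory, Claude Code reads CLAUDE.md, and other agents have their own equivalents); the point is the scoped, security-specific instructions rather than the particular format:

```
# .cursor/rules/security-review.md  (hypothetical file; adapt to your agent's rules format)

When reviewing or writing code in this repository:

- Check changes against the OWASP Top 10, especially injection,
  broken access control, and security misconfiguration.
- Flag SQL built by string concatenation; require parameterized queries.
- Flag user input that reaches HTML or templates without escaping (XSS).
- Flag secrets, tokens, or credentials hard-coded in source or config.
- For every issue you report, cite the exact file and line and explain
  why it is exploitable; do not report an issue you cannot point to.
```

That last rule is aimed at the alert-fatigue problem from earlier: requiring evidence for each finding is one way to push the agent toward fewer, higher-quality reports.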
Second, you're going to want to prioritize naming explicit classes of bugs in those rules. Don't be like, "Hey, Cursor, just try to find me some bugs." Be like, "Hey, Cursor, I want you to examine my repository for these specific classes of bugs." That kind of primes the models to be looking for these issues.
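As an illustration (the wording and the specific bug classes here are mine, not quoted from the talk), the difference looks roughly like this:

```
Vague request (tends to produce noisy, low-value reports):
  "Hey Cursor, try to find me some bugs in this repo."

Explicit request (names concrete classes of bugs to look for):
  "Examine this repository for SQL injection, cross-site scripting,
   missing authorization checks on API routes, and race conditions
   around shared state. For each finding, give the file, the line,
   and a short explanation of how it could be triggered."
```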
You also have to ensure they actually fix the bugs. More broadly, across the 100 repositories we benchmarked and the thousands of issues we've seen from many agents, structured rules eliminate the vague "check for bugs" requests that produce alert fatigue. Instead, they prime agents for much higher quality output.
The next thing is that agents struggle with cross-repo navigation and understanding. In fact, a lot of the agents, when they reach their context limits, kind of summarize or compact files down, and when that happens, their ability to detect and understand bugs reduces significantly.
So you want to kind of manage your context more thoroughly for these agents. If you provide diffs of the code that was changed to the agent, they're able to actually understand cause and effect. And you want to make sure key files aren't being summarized or compacted out of the context.
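Here is one rough way to do that. The git command is standard, but the branch name, file paths, and prompt wording are placeholders made up for illustration:

```
# Capture the diff for the change under review (branch name is a placeholder)
git diff main...feature/checkout-refactor > change.diff

# Then hand the diff to the agent with explicit context instructions, e.g.:
"Attached is change.diff, the full diff for this change. Review it hunk by
 hunk and explain the cause and effect of each change on its callers.
 Keep src/payments/charge.py and src/payments/models.py in context
 verbatim; do not summarize or compact them."
```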
Also, one thing we found really effective in the benchmarking was asking agents to come up with a step-by-step component inventory. So have it index things like: these are the classes, and this is how they're being used across the code base. When it does that inventory, it becomes much more effective at actually finding and understanding bugs.
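A prompt along these lines is one way to ask for that inventory (again, the wording is mine, just to illustrate the idea):

```
"Before looking for any bugs, build a step-by-step inventory of this repository:
 1. List each module and the main classes or functions it defines.
 2. For each one, note where it is used elsewhere in the code base.
 3. Call out shared state, external inputs, and trust boundaries.
Then use that inventory to decide where to look, and reference it when
you explain each finding."
```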
We saw across our benchmarking, basically just implicitly, that thinking models were far more able to find bugs. With those models, you're actually able to see them kind of expand across a few different considerations in the code base, and then when they find those considerations, they will actually dive deeper into them in the chain of thought. That means, in practice, they do find deeper bugs than non-thinking models were able to across the benchmark.
That said, there's still a pretty significant limitation in their ability to look at code holistically. We found, again, over hundreds of repos and thousands of issues, that the top-line number of bugs found would remain the same, but the bugs themselves would be different from run to run. So agents are never holistically really looking at a file like you or I would be looking at a file. We think that's a very big limitation of current agents, that whole holistic kind of bug breakdown, but that's kind of a still-in-progress problem.
Our own agent performed better across the benchmark than other models. We're linked into GitHub, GitLab, JIRA, and Linear, which I know is a big sticking point for people. If you scan this QR code, it'll take you to our site, so you can understand just how well current agents are actually finding and fixing bugs.