Does AI Actually Boost Developer Productivity? (100k Devs Study) - Yegor Denisov-Blanch, Stanford

Chapters
0:00 Introduction and the context of AI in software development, including Mark Zuckerberg's bold claims.
4:37 Limitations of existing studies on AI's impact on developer productivity.
7:19 The methodology used by the Stanford research group to measure productivity.
9:50 The overall impact of AI on developer productivity, including the concept of "rework."
11:42 How productivity gains vary by task complexity and project maturity (Greenfield vs. Brownfield).
14:21 The impact of programming language popularity on AI's effectiveness.
15:42 How codebase size affects AI-driven productivity gains.
17:22 The final conclusions of the study.
In January of this year, Mark Zuckerberg said that he was going to replace all of the mid-level engineers at Meta with AI by the end of the year. I think Mark was a bit optimistic. He was probably acting like a good CEO would, to inspire a vision, and probably also to keep the Facebook stock price up. But what Mark also did was create a lot of trouble for CTOs worldwide. Why? Because after Mark said that, almost every CEO in the world turned to their CTO and said, "Hey, Mark says he's going to replace all of his developers with AI. Where are we on that journey?" And the answer probably was, honestly, "not very far, and we're not sure we're going to do that." And hopefully this doesn't change, but personally I don't think AI is going to replace developers entirely, at least not this year, let alone at Meta.
But I do think that AI increases developer productivity, and there are also cases in which it decreases developer productivity. Using AI for coding is not a one-size-fits-all solution, and there are cases in which it shouldn't be used.

For the past three years, we've been running one of the largest studies on software engineering productivity at Stanford, and we've done this in a time-series and cross-sectional way. Time series, meaning that even if a participant joins in 2025, we get access to their Git history, so we can see trends across time: we can see COVID, we can see AI, we can see all of these things that happened. And cross-sectional, because we have more than 600 companies participating: enterprises, mid-sized companies, and also startups. This means we have more than 100,000 software engineers in our data set right now, tens of millions of commits, and billions of lines of code. Most importantly, most of this data is from private repositories. This matters because if you use a public repo to measure someone's productivity, that public repo is not self-contained: someone could be working on it on the weekend or only once in a while. A private repo is much more self-contained, which makes it much easier to measure the productivity of a team, a company, an organization.
Late last year, there was a big controversy around "ghost engineers." That came from the same research group, our research group, and Elon Musk was kind enough to retweet us. What we found is that roughly 10% of software engineers in our data set at the time, about 50,000, were what we called ghost engineers: people who collect a paycheck but do basically no work. That was very surprising for some people, and very unsurprising for others.
Some of the people on this research team: Simon comes from industry. He was CTO at a unicorn, which he exited, and he had a team of about 700 developers. As CTO, he was always the last person to know when something was up with his engineering team, and so he thought, okay, how can I change this? Myself, I've been at Stanford since 2022, and I focus on what I call data-driven decision making in software engineering. In a past life, I looked after digital transformation for a large company with thousands of engineers. Also on the team is Professor Kosinski, who is at Stanford; his research focuses on human behavior in a digital environment. He was the Cambridge Analytica whistleblower back in the day, if you recall that.
So today, we're going to talk about three things. We'll start with the limitations of existing studies that seek to quantify the impact of AI on developer productivity. Then we'll showcase our methodology. And lastly, we'll spend most of the time looking at some of the results: what is the impact of AI on developer productivity, and what are the ways we can slice and dice these results to make them more meaningful?

There's lots of research being done on this topic, but a lot of it is led by vendors who are trying to sell you their own AI coding tools, so there's sometimes a bit of a conflict of interest. The three biggest limitations I see are these. First, a lot of these studies revolve around commits, PRs, and tasks: hey, we completed more commits, more PRs, the time between commits decreased. The problem is that task size varies, so delivering more commits does not necessarily mean more productivity. In fact, what we found very often is that by using AI, you're introducing new tasks that are bug fixes for the code the AI just wrote, so in that case you're kind of spinning your wheels in place, which is kind of funny.
Secondly, there's a bunch of studies that say: we grabbed a bunch of developers, split them into two groups, and gave one group AI and the other nothing. What usually happens there is that these are greenfield tasks, where developers are asked to build something from scratch with zero context. And there, of course, AI decimates the non-AI group, because AI is really good at greenfield, boilerplate code. But most software engineering isn't greenfield and isn't boilerplate: there's usually an existing code base, and there are usually dependencies. So these studies can't be applied too well to those situations either.

And then we also have surveys, which we found to be an ineffective predictor of productivity. We ran a small experiment with 43 developers, where we asked every developer to evaluate themselves relative to the global mean or median, in five-percentile buckets from 0 to 100, and then compared that to their measured productivity (we'll get into what that means later). What we found is that asking someone how productive they think they are is almost as good as flipping a coin: there's very little correlation. People misjudged their productivity by about 30 percentile points, and only one in three people estimated their productivity within one quartile. I think surveys are great; they're valuable for surfacing morale and other issues that cannot be derived from metrics. But surveys shouldn't be used to measure developer productivity, much less the impact of AI on developers. You can use them to measure how happy developers are using AI, I suppose.
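To make that comparison concrete, here is a minimal sketch in Python of the kind of check described above: self-assessed percentiles against measured percentiles, with the correlation and the average misjudgment. The numbers below are made-up placeholders, not study data.

    # Hypothetical illustration of the self-assessment experiment.
    # The percentile values are fabricated placeholders; the talk
    # reports ~30-point average misjudgment and near-zero correlation.
    from statistics import correlation, mean

    self_assessed = [90, 75, 60, 85, 50, 70]  # placeholder inputs
    measured = [55, 80, 30, 45, 65, 20]       # placeholder inputs

    r = correlation(self_assessed, measured)  # Pearson's r
    miss = mean(abs(s - m) for s, m in zip(self_assessed, measured))
    print(f"correlation: {r:.2f}")
    print(f"mean misjudgment: {miss:.0f} percentile points")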
Great, so now let's dive into our methodology. In an ideal world, you would have an engineer who writes code, and that code would be evaluated by a panel of 10 or 15 experts who, separately and without knowing each other's answers, evaluate it on quality, maintainability, output, how long it would take them, how good it is: a bucket of questions. Then you aggregate those results. We found two things. The first is that this panel actually agrees with itself: it turns out one engineering expert agrees with another engineering expert when they're evaluating the same piece of code in front of them. Secondly, and probably most importantly, you can use a panel like this to predict reality. The problem is that this is very slow, it's not scalable, and it's expensive.

So what we did is build a model that essentially automates this. It correlates well with the panel, and it's fast, scalable, and affordable. The way it works is that it plugs into Git, and the model analyzes the source code changes of every commit and quantifies them along those dimensions. And since every commit has a unique author, a unique SHA, and a unique timestamp, you can treat the productivity of a team as the functionality of the code they deliver across time: not the lines of code, not the number of commits, but what that code actually does. Then you can put this in a dashboard, overlay it across time, and get something similar to this.
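To make the shape of that pipeline concrete, here is a minimal sketch, assuming a hypothetical score_commit() stand-in for the panel-trained model (the real model isn't public): walk the Git history, score each commit's changes, and aggregate output per author per month.

    # Sketch of the aggregation pipeline, not the actual Stanford model.
    import subprocess
    from collections import defaultdict

    def commits(repo):
        """Yield (sha, author, month) for every commit in the repo."""
        log = subprocess.run(
            ["git", "-C", repo, "log", "--pretty=%H|%ae|%ad",
             "--date=format:%Y-%m"],
            capture_output=True, text=True, check=True).stdout
        for line in log.splitlines():
            sha, author, month = line.split("|")
            yield sha, author, month

    def score_commit(repo, sha):
        """Hypothetical stand-in for the panel-trained model: rate the
        functionality delivered by one commit's source changes."""
        diff = subprocess.run(
            ["git", "-C", repo, "show", "--unified=0", "--format=", sha],
            capture_output=True, text=True, check=True).stdout
        return 0.0  # placeholder; the real model scores what the diff does,
                    # not its size

    # Output per (author, month): the unique SHA/author/timestamp on each
    # commit is what makes this aggregation possible.
    output = defaultdict(float)
    for sha, author, month in commits("path/to/repo"):
        output[(author, month)] += score_commit("path/to/repo", sha)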
Great, so now let's dive into some of our results. Here, September is when this company implemented AI. This is a team of about 120 developers, and they were piloting whether they wanted to use AI in their regular workflow. Every bar is the sum total of the output delivered in that month, using our methodology, not lines of code. Green is added functionality; gray is removed functionality; blue is refactoring; and orange is rework. Rework versus refactoring: both alter existing code, but rework alters code that is much more recent, meaning it's wasteful, whereas refactoring may or may not be wasteful. And from the get-go, you see that by implementing AI, you get a bunch more rework. What happens is that you feel like you're delivering more code, because there's simply more volume of code being written, more commits, more stuff being pushed; but not all of it is actually useful. To be clear, based on this chart and overall, there is a productivity boost of about 15 to 20%. But a lot of the gains you're seeing are this kind of rework, which is a bit misleading.
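One simple way to operationalize that rework-versus-refactoring distinction (illustrative only; the talk doesn't give the study's actual cutoff) is to check, via git blame, how old the code being replaced is, and count a change as rework when the replaced code is younger than some threshold:

    # Illustrative rework classifier. REWORK_WINDOW is an assumed
    # threshold, not a number from the study.
    import subprocess
    from datetime import datetime, timedelta, timezone

    REWORK_WINDOW = timedelta(days=21)  # assumption for illustration

    def line_written_at(repo, rev, path, line_no):
        """When was this line last written, as of revision `rev`?"""
        blame = subprocess.run(
            ["git", "-C", repo, "blame", "--porcelain",
             "-L", f"{line_no},{line_no}", rev, "--", path],
            capture_output=True, text=True, check=True).stdout
        for field in blame.splitlines():
            if field.startswith("author-time "):
                return datetime.fromtimestamp(
                    int(field.split()[1]), tz=timezone.utc)

    def classify_change(repo, commit_time, parent_rev, path, line_no):
        """Label a deleted or modified line as rework or refactoring."""
        written = line_written_at(repo, parent_rev, path, line_no)
        age = commit_time - written
        return "rework" if age < REWORK_WINDOW else "refactoring"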
So if I could summarize it into one chart, glossing over many nuances, it would be something like this: with AI coding, you increase your output by roughly 30 to 40%; you're delivering more code. However, you then have to go back and fix some of the bugs that code introduced, and clean up the mess the AI made, which in turn gives you an average productivity gain, across all industries and all sectors, of roughly 15 to 20%.
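As a rough worked example, with illustrative numbers rather than figures from the study: if a team's baseline is 100 units of useful functionality per month, AI might push delivered volume to about 135 units; but if roughly 15 of those units are rework, i.e. fixing code the AI just wrote, the net is about 120 useful units, a real gain of around 20%.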
There's a lot of nuance here, which we're going to see in just a second. Here we have two violin charts, plotting the distributions of the productivity gains from using AI. The y-axis is the gain; note that it starts at minus 20% and goes up from there. Four pieces of data are shown: blue is low-complexity tasks, red is high-complexity tasks; the chart on the left is greenfield tasks, and the chart on the right is brownfield tasks.
Right from the get-go, the first conclusion is that AI performs better on simpler coding tasks. That's good, and it's borne out by the data. The second thing we see is that for low-complexity greenfield tasks, the distribution is much more elongated and much higher on average. Keep in mind that this is for enterprise settings; it doesn't apply to personal projects or vibe-coding something for yourself from scratch, where the improvements would be much bigger. This is for real-world company settings. And the third thing we see is that high-complexity tasks not only show lower gains on average than low-complexity ones, but in some cases are more likely to decrease an engineer's productivity.
Now, this decrease could happen for many reasons, but that's what we see in the data; the underlying causes are still not entirely clear to us. If we translate this into a more digestible chart, the columns show the average or median gain, and the line represents the interquartile range: the bottom of the line is the 25th percentile and the top is roughly the 75th percentile. Here it's very clear that we get more gains from low-complexity tasks and fewer gains from high-complexity tasks, and that in brownfield projects it's harder to leverage AI for productivity increases than in greenfield ones. So if there's one slide you could show to your leadership team, it could be this one, or the next one.
Here we have a matrix that really simplifies things; reality is a bit more complicated. On one axis is task complexity, low versus high; on the other is project maturity, greenfield versus brownfield. We see that low-complexity greenfield work gets 30 to 40 percent gains from AI. High-complexity greenfield work gets more modest gains, 10 to 15 percent. Low-complexity brownfield work is pretty good, 15 to 20 percent. And most importantly, high-complexity brownfield tasks get 0 to 10 percent. These are indicative guidelines based on what we see in the data. And I forgot to mention: this slide is based on a sample of 136 teams across 27 companies, so it's fairly representative, and the chart is derived from that data.
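For reference, the same figures as a simple table:

                         Greenfield      Brownfield
    Low complexity       30-40% gain     15-20% gain
    High complexity      10-15% gain     0-10% gain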
Then here we have a similar matrix, except the second axis is language popularity. Low popularity means languages like COBOL, Haskell, or Elixir, really quite obscure stuff; high popularity means languages like Python, Java, JavaScript, TypeScript. What we see is that for low-popularity languages, AI doesn't really help even with low-complexity tasks. It can help a bit, but it's not terribly useful, and what ends up happening is that people just don't use it: if it's only helpful two times out of five, you're not going to reach for it very often. What's funny, or interesting, is that for low-popularity languages and complex tasks, AI can actually decrease productivity, because it's so bad at coding in COBOL, Haskell, or Elixir that it just makes you slower. Granted, this situation covers maybe five or ten percent of global development work, if that; most development work sits in the high-popularity part of the chart. And there, you have gains of around 20% for low-complexity tasks, and 10 to 15% for high-complexity tasks.
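Again as a table, using the figures quoted above:

                         High-popularity language    Low-popularity language
    Low complexity       ~20% gain                   small or no gain
    High complexity      10-15% gain                 can be negative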
So now, moving into something a bit more theoretical, less empirically proven but consistent with what we're seeing in the data. This is an illustrative chart with the productivity gain from AI on the y-axis and code base size, on a logarithmic scale from 1,000 lines of code to 10 million, on the x-axis. We see that as code base size increases, the gains you get from AI decrease sharply. And most code bases nowadays, depending on your use case, are bigger than a thousand lines of code, unless you're a YC startup that spun up a couple of months ago. There are really three reasons for this: context window limitations (we're going to see in a second how performance decreases even within large context windows); the signal-to-noise ratio, which effectively confuses the model; and the fact that larger code bases have more dependencies and more domain-specific logic.
Then, borrowing from a paper called NoLiMa, which scores LLMs on a scale of 0 to 100 as context length increases, you see that as context grows from 1,000 to 32,000 tokens, performance decreases. We see all these models here. For example, Gemini 1.5 Pro has a context window of 2 million tokens, and you might think, "Whoa, I can just throw my entire code base into it, and it's going to retrieve and code perfectly." But even at 32,000 tokens, it's already showing a drop in performance from about 90% to about 50%. So what's going to happen when you move from 32,000 tokens to 64,000, or 128,000? You're going to see really, really poor performance there. So, in short, AI does increase developer productivity.
You should use AI in most cases, but it doesn't increase the productivity of developers all the time or equally. It depends on things like task complexity, code base maturity, language popularity, code base size, and context length. Thank you so much for listening. If you'd like to learn more about our research, you can visit our research portal at softwareengineeringproductivity.stanford.edu. You can also reach me by email or LinkedIn; I'm always happy to talk about this topic. Thank you so much.