
Does AI Actually Boost Developer Productivity? (100k Devs Study) - Yegor Denisov-Blanch, Stanford


Chapters

0:00 Introduction and the context of AI in software development, including Mark Zuckerberg's bold claims.
4:37 Limitations of existing studies on AI's impact on developer productivity.
7:19 The methodology used by the Stanford research group to measure productivity.
9:50 The overall impact of AI on developer productivity, including the concept of "rework."
11:42 How productivity gains vary by task complexity and project maturity (Greenfield vs. Brownfield).
14:21 The impact of programming language popularity on AI's effectiveness.
15:42 How codebase size affects AI-driven productivity gains.
17:22 The final conclusions of the study.

Transcript

In January of this year, Mark Zuckerberg said that he was going to replace all of the mid-level engineers at Meta with AI by the end of the year. I think Mark was a bit optimistic. And he was probably acting like a good CEO would to inspire a vision and also probably to keep the Facebook stock price up.

But what Mark also did was create a lot of trouble for CTOs worldwide. Why? Because after Mark said that, almost every CEO in the world turned to their CTO and said, "Hey, Mark says he's going to replace all of his developers with AI. Where are we in that journey?" And the answer probably was, honestly, not very far.

And we're not sure we're going to do that. And so, personally, I don't think AI is going to replace developers entirely, at least not this year, not even at Meta, right? But I do think that AI increases developer productivity, although there are also cases in which it decreases developer productivity.

So AI, or using AI for coding, is not a one-size-fits-all solution. And there are cases in which it shouldn't be used. For the past three years, we've been running one of the largest studies on software engineering productivity at Stanford. And we've done this in a time-series and cross-sectional way.

So time series, meaning that even if a participant joins in 2025, we get access to their Git history, so we can see trends across time: we can see COVID, we can see AI, we can see all of these trends and things that happened. And then also cross-sectional, because we have more than 600 companies participating: enterprise, mid-sized, and also startups.

And so this means that we have more than 100,000 software engineers in our data set right now, tens of millions of commits, and billions of lines of code. And most importantly, most of this data is from private repositories. This is important because if you use a public repo to measure someone's productivity, that public repo is not self-contained.

Someone could be working on that repo on the weekend or once in a while, right? Whereas if you have a private repo, it's much more self-contained and much easier to measure the productivity of a team, of a company, of an organization. So late last year, there was a huge controversial thing around ghost engineers.

So this came from the same research group, our research group. And here, Elon Musk was kind enough to retweet us. What we found is that roughly 10% of the software engineers in our data set at the time, which was about 50,000 engineers, were what we called ghost engineers: people who collect a paycheck but basically do no work.

So that was very surprising for some people, very unsurprising for others. Some of the people on this research team are, for example, Simon, from industry. He was CTO at a unicorn, which he exited, and he had a team of about 700 developers. And as CTO, he was always the last person to know when something was up with his engineering team, right?

And so he thought, okay, how can I change this? Myself, I've been at Stanford since 2022, and I focus on what I call data-driven decision-making in software engineering. In a past life, I was looking after digital transformation for a large company with thousands of engineers. Part of the team is also Professor Kosinski, who is at Stanford; his research focuses on human behavior in a digital environment.

And basically, he was the Cambridge Analytica whistleblower back in the day, if you recall that. So today, we're going to be talking about three things. We're going to start off with the limitations of existing studies that seek to quantify the impact of AI on developer productivity. We're going to showcase our methodology.

And lastly, we're going to spend most of the time looking at some of the results. What is the impact of AI on dev productivity? And what are some ways we can slice and dice these results to make them more meaningful? And so there's lots of research being done on this topic.

But a lot of it is led by vendors who are themselves trying to sell you their own AI coding tools, right? So there's a bit of a conflict of interest there sometimes. And the three biggest limitations that I see are these. First, a lot of these studies revolve around commits and PRs and tasks.

Hey, we completed more commits, more PRs, the time between commits decreased. The problem here is that task size varies, right? And so delivering more commits does not necessarily mean more productivity. And in fact, what we found very often is that by using AI, you're introducing new tasks that are bug fixes to the stuff that the AI just coded before.

So in that case, you're kind of spinning your wheels in place, right? So that's kind of funny. Secondly, there's a bunch of studies that say, well, we grabbed a bunch of developers, split them into two groups, gave one group AI and the other we didn't.

And what usually happens there is that these are greenfield tasks, where they're asked to build something from scratch with zero context. And there, of course, AI decimates the non-AI group. But that's largely because AI is just really good at greenfield, boilerplate code, right?

But actually, most of software engineering isn't greenfield and isn't always boilerplate, right? There's usually an existing code base, there are usually dependencies. So these studies can't be applied too well to those situations either. And then we also have surveys, which we found to be an ineffective predictor of productivity. We ran a small experiment with 43 developers, where we asked every developer to evaluate themselves relative to the global mean or median, in five percentile buckets, from zero to 100.

And then we compared that to their measured productivity (we'll get into what that means later). What we found is that asking someone how productive they think they are is almost as good as flipping a coin; there's very little correlation, right? People misjudged their productivity by about 30 percentile points.

Only one in three people actually estimated their productivity within one quartile. And I think surveys are great; they're valuable for surfacing morale and other issues that cannot be derived from metrics. But surveys shouldn't be used to measure developer productivity, much less the impact of AI on developers.
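
As a rough illustration of the kind of comparison described above, here is a minimal sketch in Python. The data is synthetic, and treating "within one quartile" as 25 percentile points is an assumption; this is not the study's actual analysis code.

```python
# Illustrative only: compare self-reported productivity percentiles with
# measured percentiles. Synthetic data stands in for the real survey results.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_devs = 43  # matches the experiment size mentioned in the talk

# Self-reported percentile (0-100) vs. measured percentile (0-100).
self_reported = rng.integers(0, 101, size=n_devs).astype(float)
measured = rng.integers(0, 101, size=n_devs).astype(float)

# Rank correlation between self-assessment and measurement.
rho, p_value = spearmanr(self_reported, measured)

# Average misjudgment in percentile points.
mean_abs_error = np.abs(self_reported - measured).mean()

# Share of developers whose estimate lands within one quartile (assumed 25 pts).
within_one_quartile = (np.abs(self_reported - measured) <= 25).mean()

print(f"Spearman rho: {rho:.2f} (p={p_value:.2f})")
print(f"Mean misjudgment: {mean_abs_error:.0f} percentile points")
print(f"Within one quartile: {within_one_quartile:.0%}")
```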

You can use surveys to gauge how happy developers are using AI, I suppose, but not to measure productivity. Great, so now let's dive into our methodology. In an ideal world, you would have an engineer who writes code, and this code is evaluated by a panel of 10 or 15 experts who, separately, without seeing each other's answers, evaluate that code based on quality, maintainability, output, how long would this take me?

How good is it? Right? So kind of like a bucket of questions. And then you aggregate those results. We found two things. The first is that this panel actually agrees with one another: it turns out that one engineering expert agrees with another engineering expert when they're evaluating the same code in front of them.

And secondly, and probably most importantly, you can use this to predict reality; a panel like this predicts real-world outcomes well. The problem is that this is very slow, not scalable, and expensive. So what we did is build a model that essentially automates this: it correlates pretty well with the panel, and it's fast, scalable, and affordable.

The way it works is it plugs into Git, and the model analyzes the source code changes of every commit and quantifies them along those same dimensions. And since every commit has a unique author, a unique SHA, a unique timestamp, you can understand that, okay, the productivity of a team is basically the functionality of the code they deliver across time: not the lines of code, not the number of commits, but the functionality, what that code is actually doing, right?
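
To make the shape of that pipeline concrete, here is a minimal sketch using GitPython. The scoring model itself is not public, so score_commit below is a hypothetical placeholder, and the repo path and branch name are illustrative assumptions, not details from the study.

```python
# Sketch of a commit-level aggregation pipeline (illustrative, not the study's code).
from collections import defaultdict
from git import Repo  # pip install GitPython

def score_commit(diff_text: str) -> float:
    """Hypothetical placeholder: score the functionality delivered by one commit's diff."""
    return 0.0

repo = Repo("path/to/private/repo")        # assumption: a local clone of the repo
output_by_author_month = defaultdict(float)

for commit in repo.iter_commits("main"):   # assumption: branch name
    # Every commit has a unique SHA, author, and timestamp.
    month = commit.committed_datetime.strftime("%Y-%m")
    parent = commit.parents[0] if commit.parents else None
    diff_text = repo.git.diff(parent, commit) if parent else repo.git.show(commit)
    output_by_author_month[(commit.author.email, month)] += score_commit(diff_text)

# Team output per month = functionality delivered across time, not lines of code.
monthly_totals = defaultdict(float)
for (_author, month), score in output_by_author_month.items():
    monthly_totals[month] += score
```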

And so then you can put this in a dashboard, overlay it across time, and get something similar to this. Great. So now let's dive into some of our results. Here, September is when this company implemented AI. This is a team of about 120 developers, and they were piloting whether they wanted to use AI in their regular workflow.

And we have these bars, where every bar is the sum total of the output delivered in that month using our methodology, not lines of code. In green is added functionality, in gray is removed code, in blue is refactoring, and in orange is rework. Rework versus refactoring: they both alter existing code, but rework alters code that's much more recent, meaning it's wasteful.

Refactoring could be wasteful, could be not wasteful. And so from the get go, you see that by implementing AI, you get a bunch more of rework. What happens is that you feel like you're delivering more code, because there's just like more volume of code being written, more commits, more stuff being pushed.

But not all of that is actually useful. To be clear, based on this chart and overall, there is a productivity boost of about 15 to 20%. But a lot of the gains you're seeing are basically this kind of rework, which is a bit misleading.
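
As a rough illustration of how the rework-versus-refactoring distinction could be operationalized, here is a minimal sketch that checks how recently a changed line was last written. The 21-day window is an illustrative assumption, not the study's actual definition.

```python
# Sketch: classify a modified line as rework vs. refactoring based on how
# recently that line was last authored before the current commit changed it.
# The 21-day window is an illustrative assumption.
import subprocess
from datetime import datetime, timedelta, timezone

REWORK_WINDOW = timedelta(days=21)  # assumed cutoff for "much more recent" code

def git(repo: str, *args: str) -> str:
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True, check=True).stdout

def classify_modified_line(repo: str, commit: str, path: str, line_no: int) -> str:
    """Return 'rework' if the changed line was written recently, else 'refactor'.

    line_no refers to the line's position in the parent version of the file.
    """
    # When was this line last authored, as of the parent of `commit`?
    blame = git(repo, "blame", "--porcelain", "-L", f"{line_no},{line_no}",
                f"{commit}^", "--", path)
    authored_ts = next(int(line.split()[1]) for line in blame.splitlines()
                       if line.startswith("author-time"))
    # When was the current commit made?
    commit_ts = int(git(repo, "show", "-s", "--format=%ct", commit).strip())
    age = (datetime.fromtimestamp(commit_ts, tz=timezone.utc)
           - datetime.fromtimestamp(authored_ts, tz=timezone.utc))
    return "rework" if age <= REWORK_WINDOW else "refactor"
```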

So if I could summarize it into one chart, with many caveats, it would be something like this. With AI coding, you increase your productivity by roughly 30 to 40%, in the sense that you're delivering more code. However, you've got to go back and fix some of the bugs that code introduced and clean up the mess that the AI made, which in turn gives you an average productivity gain, across all industries, all sectors, everything, of roughly 15 to 20%.

There's a lot of nuance here, which we're going to see in just a second. Here we have two violin charts, and they plot the distributions of the gains in productivity from using AI. The Y axis is the gains; it starts from minus 20%, take note, and then it goes up.

And here we have four pieces of data being shown. In blue is low complexity tasks, and in red is high complexity tasks. The chart on the left is greenfield tasks; the chart on the right is brownfield tasks. So right from the get go, the first conclusion we have is that, sure, AI performs better on simpler coding tasks.

That's good. It's proven by data. That's awesome. The second thing we see is that, hey, it sounds like for low complexity greenfield tasks, there is a much more elongated distribution and a much higher distribution on average. Keep in mind that this is for enterprise settings. This doesn't apply for kind of like personal projects or vibe coding something for yourself from scratch.

The improvements there would be much bigger. This is kind of for like real world working company settings. And the third thing we see is that if you look at the high complexity tasks, I mean, they're lower than the low complexity ones on average in terms of the distribution. But also, in some cases, they are more likely to decrease an engineer's productivity.

Now, this decrease could be for many reasons, but that's what we see in the data, right? The underlying causes are still not super clear to us. If we translate this to a chart like this, which is a bit more digestible, the bars, the columns, show the average or median gain.

And then the line represents the interquartile range. So the bottom of the line is the 25th percentile and the top of the line is roughly 75th percentile. And so here, it's very clear to see how we have, you know, more gains from low complexity tasks, less gains from high complexity tasks.

And then brownfield, it's harder to leverage AI to make increases in productivity there compared to greenfield. So if there is maybe a slide that you could show to your leadership team, it could be this one or could also be this one. So here we have a matrix, really simplifying things.

You know, reality is a bit more difficult than this, but here we have kind of, on one axis, task complexity, low and high. On the other one, project maturity, greenfield versus brownfield. Kind of, we see that, hey, low complexity, greenfield, 30 to 40 percent gains, right, from AI. High complexity, but greenfield, more modest gains, 10 to 15.

Brownfield and low complexity, pretty good, 15 to 20 percent. And most importantly, high complexity brownfield tasks, 0 to 10 percent. These are indicative guidelines based on what we see in the data. And I forgot to mention, this slide has a sample size of 136 teams across 27 companies, so it's pretty representative.

This chart is derived from that data. Then here, we have a similar matrix, except at the bottom, we have language popularity. For low popularity, we have examples such as COBOL, Haskell, Elixir, really kind of obscure stuff. And high popularity is things like Python, Java, JavaScript, TypeScript.

And what we see is that AI doesn't really help, even with low complexity tasks for low popularity languages. It can help a bit, but it's not terribly useful. And what ends up happening is that people just don't use it, because if it's only helpful two times out of five, you're just not gonna use it very often.

What's funny or interesting is that for low popularity languages and complex tasks, AI can actually decrease productivity, because it's so bad at coding in COBOL, or Haskell, or Elixir, that it just makes you slower, right? Granted, this doesn't happen very often; it may be five or 10% of global development work, if that, right?

Most development work is probably somewhere in the high language popularity part of the chart. And here, you have gains of roughly 20% for low complexity tasks, and 10 to 15% for high complexity. So now, moving into something a bit more theoretical, less empirically proven, but more like what we're seeing in the data, right?

This is like an illustrative chart, which has kind of productivity gain from AI on the y-axis and a logarithmic scale of the code base size, right, from 1000 lines of code to 10 million on the x-axis. And we see that as the code base size increases, the gains you get from AI decrease sharply, right?

And I think most code bases nowadays are somewhere in between, depending on your use case, right? But they're bigger than a thousand lines of code, unless you're a YC startup or something that spun up a couple of months ago, right? And there are really three reasons for this.

Context window limitations: we're gonna see in a second how performance decreases even with larger context windows. The signal-to-noise ratio, which kind of confuses the model, if you will. And then, of course, larger code bases have more dependencies and more domain-specific logic present. And so then, borrowing work from this paper called NoLiMa, which shows you, on a scale of 0 to 100, how LLMs perform on coding tasks, you see that as context length increases from 1,000 to 32,000 tokens, performance decreases.

And so we see all these models here. For example, Gemini 1.5 Pro has a context window of 2 million tokens. And you might think, "Whoa! I can just throw my entire code base into it, and it's gonna retrieve and code perfectly," right? But what we see is that even at 32,000 tokens, it's already showing a decrease in performance, from 90% to about 50%, right?

So what's gonna happen when you move from 32 to 64 or 128, right? You're gonna see really, really poor performance here. And so, in short, AI does increase developer productivity. You should use AI for most cases, but it doesn't increase the productivity of developers all the time and equally.

It depends on things like task complexity, code base maturity, language popularity, code base size, and also context length. Thank you so much for listening. If you'd like to learn more about our research, you can access our research portal, which is softwareengineeringproductivity.stanford.edu. You can also reach me by email or LinkedIn.

Super happy to talk about this topic at any time. Thank you so much.