
Does AI Actually Boost Developer Productivity? (100k Devs Study) - Yegor Denisov-Blanch, Stanford


Chapters

0:00 Introduction and the context of AI in software development, including Mark Zuckerberg's bold claims.
4:37 Limitations of existing studies on AI's impact on developer productivity.
7:19 The methodology used by the Stanford research group to measure productivity.
9:50 The overall impact of AI on developer productivity, including the concept of "rework."
11:42 How productivity gains vary by task complexity and project maturity (Greenfield vs. Brownfield).
14:21 The impact of programming language popularity on AI's effectiveness.
15:42 How codebase size affects AI-driven productivity gains.
17:22 The final conclusions of the study.

Whisper Transcript

00:00:00.000 | In January of this year, Mark Zuckerberg said that he was going to replace all of the mid-level
00:00:21.960 | engineers at Meta with AI by the end of the year. I think Mark was a bit optimistic.
00:00:29.340 | And he was probably acting like a good CEO would to inspire a vision and also probably
00:00:34.800 | to keep the Facebook stock price up. But what Mark also did was create a lot of trouble
00:00:41.640 | for CTOs worldwide. Why? Because after Mark said that, almost every CEO in the world
00:00:50.820 | turned to their CTO and said, "Hey, Mark says he's going to replace all of his developers
00:00:56.580 | with AI. Where are we in that journey?" And the answer probably was, honestly, not very
00:01:01.800 | far. And we're not sure we're going to do that.
00:01:05.480 | And so, personally, and hopefully this won't change, I don't
00:01:12.080 | think AI is going to replace developers entirely, at least not this year, not even at Meta, right?
00:01:19.580 | But I do think that AI increases developer productivity; there are also cases in which it decreases developer
00:01:25.880 | productivity. So using AI for coding is not a one-size-fits-all solution. And there
00:01:31.720 | are cases in which it shouldn't be used. And so for the past three years, we've been running
00:01:38.920 | one of the largest studies on software engineering productivity at Stanford. And we've done this
00:01:46.380 | in a time-series and cross-sectional way. So time-series, meaning that even if a participant
00:01:51.480 | joins in 2025, we get access to their Git history, meaning we can see trends in the data across time,
00:01:58.080 | we can see COVID, we can see AI, we can see all of these trends and things that happened. And then also cross-sectional
00:02:04.680 | because we have more than 600 companies participating, enterprise, mid-sized, and also startups.
00:02:11.900 | And so this means that we have more than 100,000 software engineers in our data set right now, tens
00:02:17.980 | of millions of commits, and billions of lines of code. And most importantly, most of this data
00:02:24.480 | is private repositories. This is important because if you use a public repo to measure someone's
00:02:30.200 | productivity, that public repo is not self-contained. Someone could be working on that repo on the
00:02:35.000 | weekend or once in a while, right? Whereas if you have a private repo, it's much more self-contained
00:02:41.060 | and much easier to measure the productivity of a team, of a company, of an organization.
00:02:49.340 | So late last year, there was a big controversy around ghost engineers. This came
00:02:55.820 | from the same research group, our research group. And here, Elon Musk was kind enough to
00:03:01.380 | retweet us. But what we found is that roughly 10% of the software engineers in our data set at the
00:03:07.000 | time (then about 50,000 engineers) were what we called ghost engineers. These people collect a paycheck but
00:03:14.160 | basically do no work. So that was very surprising for some people, and very unsurprising for others.
00:03:21.380 | And so some of the people on this research team are, for example, Simon, from industry. He was CTO at
00:03:29.280 | a unicorn, which he exited, and he had a team of about 700 developers. And as CTO, he was always the
00:03:35.840 | last person to know when something was up with his engineering team, right? And so he thought, okay, how can I
00:03:40.880 | change this? Myself, I've been at Stanford since 2022, and I focus on what I call data-driven decision-
00:03:47.840 | making in software engineering. And in a past life, I was looking after digital transformation
00:03:53.840 | for a large company with thousands of engineers. Part of the team is also Professor Kosinski,
00:04:00.160 | who is at Stanford, and his research focuses on human behavior in a digital environment. And basically,
00:04:05.600 | he was the Cambridge Analytica whistleblower back in the day, if you recall that.
00:04:08.880 | So today, we're going to be talking about three things. We're going to start off with the limitations
00:04:16.320 | of existing studies that seek to quantify the impact of AI on developer productivity. We're going to
00:04:23.200 | showcase our methodology. And lastly, we're going to spend most of the time looking at some of the
00:04:27.360 | results. What is the impact of AI on dev productivity? And what are ways we can slice and dice these results
00:04:32.880 | to make them more meaningful? And so there's lots of research being done on this topic. But a lot of it
00:04:42.720 | is led by vendors who themselves are trying to sell you their own AI coding tools, right? And so there's
00:04:49.520 | a bit of a conflict of interest there sometimes. And the three biggest limitations that I see are that
00:04:54.880 | a lot of these studies revolve around commits and PRs and tasks. Hey, we completed more commits,
00:04:59.840 | more PRs, the time between commits decreased. The problem here is that task size varies, right? And
00:05:07.360 | so delivering more commits does not necessarily mean more productivity. And in fact, what we found very
00:05:14.480 | often is that by using AI, you're introducing new tasks that are bug fixes to the stuff that the AI just
00:05:21.040 | coded before. In that case, you're kind of spinning your wheels in place, right? So that's
00:05:26.560 | kind of funny. Secondly, there's a bunch of studies that say, well, we grabbed a bunch of developers,
00:05:32.560 | we split them into two groups, and we gave one group AI and the other we didn't. And what usually
00:05:38.640 | happens there is that these are greenfield tasks where they're asked to build something with
00:05:43.120 | zero context, from scratch. And there, of course, the AI group decimates the non-AI group. But that's
00:05:49.120 | largely because AI is just really good at greenfield, boilerplate code, right? But actually,
00:05:54.800 | most of software engineering isn't greenfield and isn't always boilerplate, right? And so there's
00:06:01.440 | usually an existing code base, there are usually dependencies. So these studies can't be applied
00:06:06.000 | too well to those situations either. And then we also have surveys, which we found to be an ineffective
00:06:13.440 | predictor of productivity. We found this by running a small experiment with 43 developers, whereby we asked
00:06:20.240 | every developer to evaluate themselves, relative to the global mean or median, in five-percentile buckets
00:06:27.280 | from 0 to 100. And then we compared that to their measured productivity; we'll get into what that
00:06:33.840 | means later. But what we found is that asking someone how productive they think they are,
00:06:38.720 | is almost as good as flipping a coin, there's very little correlation, right? And so we found that
00:06:43.920 | people misjudged their productivity by about 30 percentile points. Only one in three people
00:06:50.320 | actually estimated their productivity within one quartile. And I think surveys are great;
00:06:56.960 | they're valuable for surfacing, you know, morale and other issues that cannot be derived from metrics. But
00:07:03.120 | surveys shouldn't be used to measure developer productivity, much less the impact of AI on
00:07:07.840 | developers. At most, I suppose, you can use surveys to see how happy developers are using AI.
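To make the comparison concrete, here is a toy sketch in Python. The numbers are randomly generated stand-ins, not the study's data; it only shows how self-estimates can be scored against measured percentiles (error in percentile points, within-quartile rate, correlation).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 43  # same cohort size as the experiment described above

# Stand-in data: measured productivity percentile and a noisy self-estimate
measured = rng.uniform(0, 100, n)
self_est = np.clip(measured + rng.normal(0, 30, n), 0, 100)

def quartile(p):
    # Map a 0-100 percentile to a quartile index 0..3
    return np.minimum(p // 25, 3)

abs_err = np.abs(self_est - measured).mean()               # mean misjudgment
same_quartile = (quartile(self_est) == quartile(measured)).mean()
r = np.corrcoef(self_est, measured)[0, 1]                  # self vs. measured

print(f"mean error: {abs_err:.0f} pts, same quartile: {same_quartile:.0%}, r = {r:.2f}")
```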
00:07:14.320 | Great, so now let's dive into our methodology. So in an ideal world, you would have
00:07:23.520 | an engineer who writes code. And this code is evaluated by a panel of 10 or 15 experts, who separately, without
00:07:32.400 | knowing each other's answers, evaluate that code based on quality, maintainability, output:
00:07:39.200 | how long would this take me, how good is it, right? Kind of like a bucket of questions.
00:07:43.920 | And then what happens is that you aggregate those results. And we found two things. The first one is
00:07:51.280 | that this panel actually agrees with one another. So it turns out that one engineering expert agrees with
00:07:57.280 | another engineering expert when they're evaluating the same piece of code in front of them. And
00:08:02.960 | secondly, and probably most importantly, you can use this to predict reality. Reality is
00:08:08.400 | predicted well by a panel like this. The problem then is that this is very slow, it's not scalable, and it's expensive.
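As a rough illustration of the panel mechanics (the rubric, scales, and data below are invented, not the study's): each expert scores the same commits independently, the scores are averaged into a panel score, and inter-rater agreement can be checked with pairwise correlations.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n_experts, n_commits = 12, 30

# Invented data: each expert independently rates each commit on a 1-10 rubric;
# ratings share a common "true quality" signal plus per-expert noise
true_quality = rng.uniform(1, 10, n_commits)
ratings = np.clip(true_quality + rng.normal(0, 1.5, (n_experts, n_commits)), 1, 10)

# Aggregate: the panel score for each commit is the mean across experts
panel_score = ratings.mean(axis=0)

# Crude agreement check: average pairwise correlation between experts
pairs = combinations(range(n_experts), 2)
agreement = np.mean([np.corrcoef(ratings[i], ratings[j])[0, 1] for i, j in pairs])
print(f"mean pairwise inter-rater correlation: {agreement:.2f}")
```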
00:08:14.640 | And so what we did is we built a model that essentially automates this. It correlates pretty well with the panel,
00:08:20.320 | it's fast, it's scalable, and it's affordable. The way it works is it plugs into Git, and then
00:08:27.120 | the model analyzes the source code changes of every commit, and quantifies them based on a bunch of
00:08:32.800 | these dimensions. And then since every commit has a unique author, a unique SHA, a unique timestamp,
00:08:41.440 | then you can understand that, okay, the productivity of a team is basically the functionality
00:08:46.320 | of the code they deliver across time: not the lines of code, not the number of commits, but the functionality,
00:08:52.400 | like what that code is actually doing, right?
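To make the bookkeeping concrete, here is a minimal sketch of that pipeline. The study's actual scoring model isn't public, so `score_commit` below is a stub placeholder; the rest is plain `git log` plumbing.

```python
import subprocess
from collections import defaultdict

def iter_commits(repo: str):
    """Yield (sha, author, iso_date) for every commit, via `git log`."""
    fmt = "%H|%ae|%aI"  # hash | author email | author date (ISO 8601)
    out = subprocess.run(["git", "-C", repo, "log", f"--pretty={fmt}"],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        sha, author, date = line.split("|")
        yield sha, author, date

def score_commit(repo: str, sha: str) -> float:
    """Stub: in the study, a model quantifies the functional change in the
    commit's diff along several dimensions. Returns a placeholder here."""
    return 1.0

# Team output per (author, month) = sum of per-commit functionality scores
output = defaultdict(float)
for sha, author, date in iter_commits("path/to/repo"):
    output[(author, date[:7])] += score_commit("path/to/repo", sha)
```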
00:08:58.240 | And so then you can put all of this in a dashboard, overlay it across time, and get something similar to this. Great. So now let's dive into some of our results.
00:09:10.240 | So here in September is when this company implemented AI. This is a team of about 120 developers,
00:09:17.200 | and they were piloting whether they wanted to use AI in their regular workflow. And
00:09:22.800 | we have here these bars, where every bar is the sum total of the output delivered in that month,
00:09:28.720 | measured using our methodology, not lines of code. And we can see that in green, it's added functionality;
00:09:35.680 | in gray, it's removed functionality. In blue is refactoring, and in orange is rework. Rework versus refactoring:
00:09:43.040 | they both alter existing code, but rework alters code that's much more recent, meaning it's wasteful.
00:09:49.040 | Refactoring could be wasteful or not.
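A minimal sketch of how that distinction can be operationalized. The talk doesn't give an exact age cutoff, so the three-week window below is purely a hypothetical parameter:

```python
from datetime import datetime, timedelta

# Hypothetical cutoff: changed code younger than this counts as "recent"
REWORK_WINDOW = timedelta(days=21)

def classify_modification(commit_date: datetime, line_written: datetime) -> str:
    """Classify a change to an existing line. `line_written` would come from
    `git blame` on the parent commit (when that line was last authored)."""
    age = commit_date - line_written
    return "rework" if age <= REWORK_WINDOW else "refactor"

# Example: code changed 10 days after it was written counts as rework
print(classify_modification(datetime(2024, 9, 25), datetime(2024, 9, 15)))
```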
00:09:55.760 | And so from the get-go, you see that by implementing AI, you get a bunch more rework. What happens is that you feel like you're delivering more code,
00:10:01.440 | because there's just more volume of code being written, more commits, more stuff being pushed.
00:10:05.440 | But not all of that is actually useful. To be clear, based on this chart and overall,
00:10:11.440 | there is a productivity boost of about 15 to 20%. But a lot of the gains you're seeing are basically
00:10:19.040 | this kind of rework, which is a bit misleading. So if I could summarize it into one
00:10:26.960 | chart, with all its caveats, it would be something like this. So with AI coding, you increase
00:10:33.840 | your productivity by roughly 30 to 40%, as in, you're delivering more code. However, you've got to go
00:10:38.880 | back and fix some of the bugs that code introduced and, you know, clean up the
00:10:44.640 | mess that the AI made, which in turn gives you an average productivity gain, across all industries,
00:10:50.320 | all sectors, everything, of roughly 15 to 20%. There's a lot of nuance here, which we're going
00:10:58.880 | to see in just a second. So here we have two violin charts, and they plot the distributions of
00:11:08.720 | the gains in productivity from using AI. The y-axis is the gains; note that it starts at minus
00:11:15.840 | 20% and then goes up. And here we have four pieces of data being shown.
00:11:22.080 | In blue is low complexity tasks, and in red is high complexity tasks. And
00:11:32.880 | the chart on the left is greenfield tasks; the chart on the right is brownfield tasks.
00:11:37.760 | So right from the get go, the first conclusion we have is that sure, it seems like AI performs
00:11:44.000 | better on simpler coding tasks. That's good. It's proven by data. That's awesome. The second thing
00:11:50.000 | we see is that, hey, it sounds like for low complexity greenfield tasks, there is a much more elongated
00:11:58.000 | distribution and a much higher distribution on average. Keep in mind that this is for enterprise
00:12:03.920 | settings. This doesn't apply for kind of like personal projects or vibe coding something for
00:12:08.000 | yourself from scratch. The improvements there would be much bigger. This is kind of for like
00:12:12.400 | real world working company settings. And the third thing we see is that if you look at the high complexity
00:12:19.520 | tasks, I mean, they're lower than the low complexity ones on average in terms of the distribution. But also,
00:12:25.040 | in some cases, they are more likely to decrease an engineer's productivity.
00:12:31.360 | Now, this decrease could happen for many reasons, but that's kind of what we see in the
00:12:37.680 | data, right? The underlying causes are still not super clear to us. If we translate this to a chart like
00:12:45.840 | this, which is a bit more digestible, the bars, the columns, show the average or the
00:12:53.920 | median gain. And the line represents the interquartile range. So the bottom of the line is the 25th
00:13:00.480 | percentile and the top of the line is roughly 75th percentile. And so here, it's very clear to see how we
00:13:06.480 | have, you know, more gains from low complexity tasks, less gains from high complexity tasks. And then brownfield,
00:13:12.800 | it's harder to leverage AI to make increases in productivity there compared to greenfield.
00:13:18.880 | So if there is maybe a slide that you could show to your leadership team, it could be this one or could also be this one.
00:13:27.760 | So here we have a matrix, really simplifying things. You know, reality is a bit more difficult than this,
00:13:32.800 | but here we have kind of, on one axis, task complexity, low and high. On the other one,
00:13:37.520 | project maturity, greenfield versus brownfield. Kind of, we see that, hey, low complexity, greenfield,
00:13:43.040 | 30 to 40 percent gains, right, from AI. High complexity, but greenfield, more modest gains, 10 to 15.
00:13:50.160 | Brownfield and low complexity, pretty good, 15 to 20 percent. And most importantly,
00:13:57.520 | high complexity brownfield tasks, 0 to 10 percent. These are rough, indicative guidelines based on what
00:14:04.480 | we see in the data. And I forgot to mention, this slide has a sample size of 136 teams across 27 companies,
00:14:12.080 | so pretty representative. This chart is derived from that data.
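For reference, here are the ranges as stated on the slide, captured as a small lookup table. This is just a summary of the talk's numbers, not an artifact of the study itself:

```python
# Median productivity gain from AI: (task complexity, project maturity) -> range
AI_GAIN = {
    ("low",  "greenfield"): (0.30, 0.40),
    ("high", "greenfield"): (0.10, 0.15),
    ("low",  "brownfield"): (0.15, 0.20),
    ("high", "brownfield"): (0.00, 0.10),
}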
00:14:17.920 | Then here, we have a similar matrix, except at the bottom, we have language popularity. So for
00:14:27.280 | low, we have examples such as COBOL, Haskell, Elixir, really kind of obscure stuff. And high is things
00:14:34.160 | like Python, Java, you know, JavaScript, TypeScript. And what we see is that AI doesn't really help,
00:14:40.720 | even with low complexity tasks for low popularity languages. It can help a bit, but it's not terribly
00:14:46.480 | useful. And what ends up happening is that people just don't use it, because if it's only helpful two times out of five,
00:14:51.280 | you're just not gonna use it very often. What's funny or interesting is that for low language
00:14:57.120 | popularity and complex tasks, AI can actually decrease productivity, because it's so bad at coding in COBOL,
00:15:02.400 | or Haskell, or Elixir, that it just makes you slower, right? Granted, this does happen,
00:15:09.440 | but it may be 5 or 10% of global development work, if that, right? Most of the
00:15:15.920 | development work is probably somewhere in the high-popularity part
00:15:20.080 | of the chart. And here, you have gains of roughly 20% for low complexity, and 10 to 15% for high complexity.
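And the second matrix in the same form; the descriptive entries paraphrase the talk where no number was given:

```python
# Effect of AI by (task complexity, language popularity), per the talk
AI_GAIN_BY_POPULARITY = {
    ("low",  "low popularity"):  "small gain; in practice often goes unused",
    ("high", "low popularity"):  "can be negative (e.g., COBOL, Haskell, Elixir)",
    ("low",  "high popularity"): "around 20%",
    ("high", "high popularity"): "10-15%",
}
```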
00:15:27.360 | So now, moving into something a bit more theoretical, less empirically proven, but more so
00:15:41.520 | kind of like what we're seeing in the data, right? This is like an illustrative chart, which has kind of
00:15:46.480 | productivity gain from AI on the y-axis and a logarithmic scale of the code base size, right,
00:15:52.080 | from 1000 lines of code to 10 million on the x-axis. And we see that as the code base size increases,
00:15:59.440 | the gains you get from AI decrease sharply, right? And I think most code bases nowadays are
00:16:05.520 | somewhere in the middle, depending on your use case, right? But they're bigger than a thousand lines of code,
00:16:09.920 | unless you are a YC startup or something that spun out a couple of months ago, right?
00:16:14.960 | And that's because, you know, there are three reasons for this, really. Context window limitations: we're
00:16:20.320 | gonna see in a second how performance decreases even within large context windows. The signal-to-noise
00:16:25.600 | ratio, which kind of confuses the model, if you will. And then, of course, larger code bases have more
00:16:32.480 | dependencies and more domain-specific logic present. And so then, borrowing work from this paper called
00:16:41.840 | NoLiMa, which shows you, on a scale of 0 to 100, how LLMs perform on coding tasks, you see that as
00:16:50.960 | context length increases from 1,000 to 32,000 tokens, performance decreases. And so we see all these
00:16:57.920 | models here. For example, Gemini 1.5 Pro has a context window of 2 million tokens. And you might think,
00:17:04.000 | "Whoa! I can just throw my entire code base into it, and it's gonna retrieve and encode perfectly,"
00:17:08.800 | right? And what we see is that even at 32,000 tokens, it's already showing a decrease in performance, from
00:17:16.320 | 90% down to about 50%, right? So what's gonna happen when you move from 32k to 64k or 128k tokens? You're gonna
00:17:22.240 | see really, really poor performance there.
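One practical implication, sketched below under assumptions of mine rather than anything prescribed in the talk: rather than pasting a whole code base into a large context window, select only the files most relevant to the task and stay well under the budget where retrieval quality starts to degrade. The relevance score and the 4-characters-per-token estimate are both crude placeholders.

```python
from pathlib import Path

TOKEN_BUDGET = 16_000   # stay below the ~32k mark where quality visibly drops
CHARS_PER_TOKEN = 4     # rough heuristic for estimating token counts

def estimate_tokens(text: str) -> int:
    return len(text) // CHARS_PER_TOKEN

def relevance(task: str, text: str) -> int:
    """Crude placeholder: count task keywords appearing in the file."""
    lowered = text.lower()
    return sum(lowered.count(word) for word in set(task.lower().split()))

def select_context(repo: Path, task: str) -> list[Path]:
    """Greedily pick the most relevant source files that fit the budget."""
    files = [p for p in repo.rglob("*.py") if p.is_file()]
    ranked = sorted(files,
                    key=lambda p: relevance(task, p.read_text(errors="ignore")),
                    reverse=True)
    picked, used = [], 0
    for p in ranked:
        cost = estimate_tokens(p.read_text(errors="ignore"))
        if used + cost <= TOKEN_BUDGET:
            picked.append(p)
            used += cost
    return picked
```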
00:17:30.160 | And so, in short, AI does increase developer productivity. You should use AI in most cases, but it doesn't increase the productivity of developers all the
00:17:37.280 | time or equally. It depends on things like task complexity, code base maturity, language popularity,
00:17:44.160 | code base size, and also context length. Thank you so much for listening. If you'd like to learn more
00:17:51.040 | about our research, you can access our research portal, which is softwareengineeringproductivity.stanford.edu.
00:17:57.760 | You can also reach me by email or LinkedIn. Super happy to talk about this topic at any time. Thank you so much.