Does AI Actually Boost Developer Productivity? (100k Devs Study) - Yegor Denisov-Blanch, Stanford

Chapters
0:00 Introduction and the context of AI in software development, including Mark Zuckerberg's bold claims.
4:37 Limitations of existing studies on AI's impact on developer productivity.
7:19 The methodology used by the Stanford research group to measure productivity.
9:50 The overall impact of AI on developer productivity, including the concept of "rework."
11:42 How productivity gains vary by task complexity and project maturity (Greenfield vs. Brownfield).
14:21 The impact of programming language popularity on AI's effectiveness.
15:42 How codebase size affects AI-driven productivity gains.
17:22 The final conclusions of the study.
In January of this year, Mark Zuckerberg said that he was going to replace all of the mid-level engineers at Meta with AI by the end of the year. I think Mark was a bit optimistic. He was probably acting like a good CEO would, to inspire a vision, and probably also to keep the Facebook stock price up. But what Mark also did was create a lot of trouble for CTOs worldwide. Why? Because after Mark said that, almost every CEO in the world turned to their CTO and said, "Hey, Mark says he's going to replace all of his developers with AI. Where are we on that journey?" And the answer probably was, honestly, "not very far, and we're not sure we're going to do that." And hopefully this doesn't change, but personally I don't think AI is going to replace developers entirely, at least not this year, let alone at Meta.
But I do think that AI increases developer productivity, and there are also cases in which it decreases developer productivity. Using AI for coding is not a one-size-fits-all solution, and there are cases in which it shouldn't be used.

For the past three years, we've been running one of the largest studies on software engineering productivity at Stanford, and we've done this in a time-series and cross-sectional way. Time series, meaning that even if a participant joins in 2025, we get access to their Git history, so we can see trends across time: we can see COVID, we can see AI, we can see all of these things that happened. And cross-sectional, because we have more than 600 companies participating: enterprises, mid-sized companies, and also startups. This means we have more than 100,000 software engineers in our data set right now, tens of millions of commits, and billions of lines of code. Most importantly, most of this data is from private repositories. This matters because if you use a public repo to measure someone's productivity, that public repo is not self-contained: someone could be working on it on the weekend or only once in a while. A private repo is much more self-contained, which makes it much easier to measure the productivity of a team, a company, an organization.
Late last year, there was a big controversy around "ghost engineers." That came from the same research group, our research group, and Elon Musk was kind enough to retweet us. What we found is that roughly 10% of software engineers in our data set at the time, about 50,000, were what we called ghost engineers: people who collect a paycheck but do basically no work. That was very surprising for some people, and very unsurprising for others.
Some of the people on this research team: Simon comes from industry. He was CTO at a unicorn, which he exited, and he had a team of about 700 developers. As CTO, he was always the last person to know when something was up with his engineering team, and so he thought, okay, how can I change this? Myself, I've been at Stanford since 2022, and I focus on what I call data-driven decision making in software engineering. In a past life, I looked after digital transformation for a large company with thousands of engineers. Also on the team is Professor Kosinski, who is at Stanford; his research focuses on human behavior in a digital environment. He was the Cambridge Analytica whistleblower back in the day, if you recall that.
So today, we're going to talk about three things. We'll start with the limitations of existing studies that seek to quantify the impact of AI on developer productivity. Then we'll showcase our methodology. And lastly, we'll spend most of the time looking at some of the results: what is the impact of AI on developer productivity, and what are the ways we can slice and dice these results to make them more meaningful?

There's lots of research being done on this topic, but a lot of it is led by vendors who are trying to sell you their own AI coding tools, so there's sometimes a bit of a conflict of interest. The three biggest limitations I see are these. First, a lot of these studies revolve around commits, PRs, and tasks: hey, we completed more commits, more PRs, the time between commits decreased. The problem is that task size varies, so delivering more commits does not necessarily mean more productivity. In fact, what we found very often is that by using AI, you're introducing new tasks that are bug fixes for the code the AI just wrote, so in that case you're kind of spinning your wheels in place, which is kind of funny.
Secondly, there's a bunch of studies that say: we grabbed a bunch of developers, split them into two groups, and gave one group AI and the other nothing. What usually happens there is that these are greenfield tasks, where developers are asked to build something from scratch with zero context. And there, of course, AI decimates the non-AI group, because AI is really good at greenfield, boilerplate code. But most software engineering isn't greenfield and isn't boilerplate: there's usually an existing code base, and there are usually dependencies. So these studies can't be applied too well to those situations either.

And then we also have surveys, which we found to be an ineffective predictor of productivity. We ran a small experiment with 43 developers, where we asked every developer to evaluate themselves relative to the global mean or median, in five-percentile buckets from 0 to 100, and then compared that to their measured productivity (we'll get into what that means later). What we found is that asking someone how productive they think they are is almost as good as flipping a coin: there's very little correlation. People misjudged their productivity by about 30 percentile points, and only one in three people estimated their productivity within one quartile. I think surveys are great; they're valuable for surfacing morale and other issues that cannot be derived from metrics. But surveys shouldn't be used to measure developer productivity, much less the impact of AI on developers. You can use them to measure how happy developers are using AI, I suppose.
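To make that comparison concrete, here is a minimal sketch in Python of the kind of check described above: self-assessed percentiles against measured percentiles, with the correlation and the average misjudgment. The numbers below are made-up placeholders, not study data.

    # Hypothetical illustration of the self-assessment experiment.
    # The percentile values are fabricated placeholders; the talk
    # reports ~30-point average misjudgment and near-zero correlation.
    from statistics import correlation, mean

    self_assessed = [90, 75, 60, 85, 50, 70]  # placeholder inputs
    measured = [55, 80, 30, 45, 65, 20]       # placeholder inputs

    r = correlation(self_assessed, measured)  # Pearson's r
    miss = mean(abs(s - m) for s, m in zip(self_assessed, measured))
    print(f"correlation: {r:.2f}")
    print(f"mean misjudgment: {miss:.0f} percentile points")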
Great, so now let's dive into our methodology. In an ideal world, you would have an engineer who writes code, and that code would be evaluated by a panel of 10 or 15 experts who, separately and without knowing each other's answers, evaluate it on quality, maintainability, output, how long it would take them, how good it is: a bucket of questions. Then you aggregate those results. We found two things. The first is that this panel actually agrees with itself: it turns out one engineering expert agrees with another engineering expert when they're evaluating the same piece of code in front of them. Secondly, and probably most importantly, you can use a panel like this to predict reality. The problem is that this is very slow, it's not scalable, and it's expensive.

So what we did is build a model that essentially automates this. It correlates well with the panel, and it's fast, scalable, and affordable. The way it works is that it plugs into Git, and the model analyzes the source code changes of every commit and quantifies them along those dimensions. And since every commit has a unique author, a unique SHA, and a unique timestamp, you can treat the productivity of a team as the functionality of the code they deliver across time: not the lines of code, not the number of commits, but what that code actually does. Then you can put this in a dashboard, overlay it across time, and get something similar to this.
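To make the shape of that pipeline concrete, here is a minimal sketch, assuming a hypothetical score_commit() stand-in for the panel-trained model (the real model isn't public): walk the Git history, score each commit's changes, and aggregate output per author per month.

    # Sketch of the aggregation pipeline, not the actual Stanford model.
    import subprocess
    from collections import defaultdict

    def commits(repo):
        """Yield (sha, author, month) for every commit in the repo."""
        log = subprocess.run(
            ["git", "-C", repo, "log", "--pretty=%H|%ae|%ad",
             "--date=format:%Y-%m"],
            capture_output=True, text=True, check=True).stdout
        for line in log.splitlines():
            sha, author, month = line.split("|")
            yield sha, author, month

    def score_commit(repo, sha):
        """Hypothetical stand-in for the panel-trained model: rate the
        functionality delivered by one commit's source changes."""
        diff = subprocess.run(
            ["git", "-C", repo, "show", "--unified=0", "--format=", sha],
            capture_output=True, text=True, check=True).stdout
        return 0.0  # placeholder; the real model scores what the diff does,
                    # not its size

    # Output per (author, month): the unique SHA/author/timestamp on each
    # commit is what makes this aggregation possible.
    output = defaultdict(float)
    for sha, author, month in commits("path/to/repo"):
        output[(author, month)] += score_commit("path/to/repo", sha)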
Great, so now let's dive into some of our results. Here, September is when this company implemented AI. This is a team of about 120 developers, and they were piloting whether they wanted to use AI in their regular workflow. Every bar is the sum total of the output delivered in that month, using our methodology, not lines of code. Green is added functionality; gray is removed functionality; blue is refactoring; and orange is rework. Rework versus refactoring: both alter existing code, but rework alters code that is much more recent, meaning it's wasteful, whereas refactoring may or may not be wasteful. And from the get-go, you see that by implementing AI, you get a bunch more rework. What happens is that you feel like you're delivering more code, because there's simply more volume of code being written, more commits, more stuff being pushed; but not all of it is actually useful. To be clear, based on this chart and overall, there is a productivity boost of about 15 to 20%. But a lot of the gains you're seeing are this kind of rework, which is a bit misleading.
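One simple way to operationalize that rework-versus-refactoring distinction (illustrative only; the talk doesn't give the study's actual cutoff) is to check, via git blame, how old the code being replaced is, and count a change as rework when the replaced code is younger than some threshold:

    # Illustrative rework classifier. REWORK_WINDOW is an assumed
    # threshold, not a number from the study.
    import subprocess
    from datetime import datetime, timedelta, timezone

    REWORK_WINDOW = timedelta(days=21)  # assumption for illustration

    def line_written_at(repo, rev, path, line_no):
        """When was this line last written, as of revision `rev`?"""
        blame = subprocess.run(
            ["git", "-C", repo, "blame", "--porcelain",
             "-L", f"{line_no},{line_no}", rev, "--", path],
            capture_output=True, text=True, check=True).stdout
        for field in blame.splitlines():
            if field.startswith("author-time "):
                return datetime.fromtimestamp(
                    int(field.split()[1]), tz=timezone.utc)

    def classify_change(repo, commit_time, parent_rev, path, line_no):
        """Label a deleted or modified line as rework or refactoring."""
        written = line_written_at(repo, parent_rev, path, line_no)
        age = commit_time - written
        return "rework" if age < REWORK_WINDOW else "refactoring"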
So if I could summarize it into one chart, glossing over many nuances, it would be something like this: with AI coding, you increase your output by roughly 30 to 40%; you're delivering more code. However, you then have to go back and fix some of the bugs that code introduced, and clean up the mess the AI made, which in turn gives you an average productivity gain, across all industries and all sectors, of roughly 15 to 20%.
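As a rough worked example, with illustrative numbers rather than figures from the study: if a team's baseline is 100 units of useful functionality per month, AI might push delivered volume to about 135 units; but if roughly 15 of those units are rework, i.e. fixing code the AI just wrote, the net is about 120 useful units, a real gain of around 20%.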
There's a lot of nuance here, which we're going to see in just a second. Here we have two violin charts, plotting the distributions of the productivity gains from using AI. The y-axis is the gain; note that it starts at minus 20% and goes up from there. Four pieces of data are shown: blue is low-complexity tasks, red is high-complexity tasks; the chart on the left is greenfield tasks, and the chart on the right is brownfield tasks.
Right from the get-go, the first conclusion is that AI performs better on simpler coding tasks. That's good, and it's borne out by the data. The second thing we see is that for low-complexity greenfield tasks, the distribution is much more elongated and much higher on average. Keep in mind that this is for enterprise settings; it doesn't apply to personal projects or vibe-coding something for yourself from scratch, where the improvements would be much bigger. This is for real-world company settings. And the third thing we see is that high-complexity tasks not only show lower gains on average than low-complexity ones, but in some cases are more likely to decrease an engineer's productivity.
Now, this decrease could happen for many reasons, but that's what we see in the data; the underlying causes are still not entirely clear to us. If we translate this into a more digestible chart, the columns show the average or median gain, and the line represents the interquartile range: the bottom of the line is the 25th percentile and the top is roughly the 75th percentile. Here it's very clear that we get more gains from low-complexity tasks and fewer gains from high-complexity tasks, and that in brownfield projects it's harder to leverage AI for productivity increases than in greenfield ones. So if there's one slide you could show to your leadership team, it could be this one, or the next one.
Here we have a matrix that really simplifies things; reality is a bit more complicated. On one axis is task complexity, low versus high; on the other is project maturity, greenfield versus brownfield. We see that low-complexity greenfield work gets 30 to 40 percent gains from AI. High-complexity greenfield work gets more modest gains, 10 to 15 percent. Low-complexity brownfield work is pretty good, 15 to 20 percent. And most importantly, high-complexity brownfield tasks get 0 to 10 percent. These are indicative guidelines based on what we see in the data. And I forgot to mention: this slide is based on a sample of 136 teams across 27 companies, so it's fairly representative, and the chart is derived from that data.
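For reference, the same figures as a simple table:

                         Greenfield      Brownfield
    Low complexity       30-40% gain     15-20% gain
    High complexity      10-15% gain     0-10% gain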
Then here we have a similar matrix, except the second axis is language popularity. Low popularity means languages like COBOL, Haskell, or Elixir, really quite obscure stuff; high popularity means languages like Python, Java, JavaScript, TypeScript. What we see is that for low-popularity languages, AI doesn't really help even with low-complexity tasks. It can help a bit, but it's not terribly useful, and what ends up happening is that people just don't use it: if it's only helpful two times out of five, you're not going to reach for it very often. What's funny, or interesting, is that for low-popularity languages and complex tasks, AI can actually decrease productivity, because it's so bad at coding in COBOL, Haskell, or Elixir that it just makes you slower. Granted, this situation covers maybe five or ten percent of global development work, if that; most development work sits in the high-popularity part of the chart. And there, you have gains of around 20% for low-complexity tasks, and 10 to 15% for high-complexity tasks.
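Again as a table, using the figures quoted above:

                         High-popularity language    Low-popularity language
    Low complexity       ~20% gain                   small or no gain
    High complexity      10-15% gain                 can be negative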
So now, moving into something a bit more theoretical, less empirically proven but consistent with what we're seeing in the data. This is an illustrative chart with the productivity gain from AI on the y-axis and code base size, on a logarithmic scale from 1,000 lines of code to 10 million, on the x-axis. We see that as code base size increases, the gains you get from AI decrease sharply. And most code bases nowadays, depending on your use case, are bigger than a thousand lines of code, unless you're a YC startup that spun up a couple of months ago. There are really three reasons for this: context window limitations (we're going to see in a second how performance decreases even within large context windows); the signal-to-noise ratio, which effectively confuses the model; and the fact that larger code bases have more dependencies and more domain-specific logic.
Then, borrowing from a paper called NoLiMa, which scores LLMs on a scale of 0 to 100 as context length increases, you see that as context grows from 1,000 to 32,000 tokens, performance decreases. We see all these models here. For example, Gemini 1.5 Pro has a context window of 2 million tokens, and you might think, "Whoa, I can just throw my entire code base into it, and it's going to retrieve and code perfectly." But even at 32,000 tokens, it's already showing a drop in performance from about 90% to about 50%. So what's going to happen when you move from 32,000 tokens to 64,000, or 128,000? You're going to see really, really poor performance there. So, in short, AI does increase developer productivity.
You should use AI in most cases, but it doesn't increase the productivity of developers all the time or equally. It depends on things like task complexity, code base maturity, language popularity, code base size, and context length. Thank you so much for listening. If you'd like to learn more about our research, you can visit our research portal at softwareengineeringproductivity.stanford.edu. You can also reach me by email or LinkedIn; I'm always happy to talk about this topic. Thank you so much.