I want to introduce our first speaker, Beyang Liu, CTO of Sourcegraph, and I won't spoil the topic of the talk, but it's going to be a good one. Thank you very much. Awesome. Thank you, Peter. How's everyone doing this morning? Good. Yeah? Everyone awake? Bright and early? Thanks for coming out.
Almost awake. Awesome. Before I dive into the talk here, I just want to get a sense of who we all have in the room. Who here is like a head of engineering or a VP of engineering? Okay. A good number of you. Who here is just an IC dev interested in evaluating how things are going?
And then what do the rest of you do? Just shout out your roles. Anyone? Anyone? Middle management. Middle management. Okay. Cool. And who here has a really quantitative, very precise, thought-through evaluation framework for measuring the ROI of AI tools? Okay. One hand in the back.
What company are you from, sir? Broad signals. Broad signals. Okay. Cool. So we've got one person in the back. And then who here is sort of like we kind of are evaluating it, but it's really kind of like vibes at this point. Anyone? Anyone brave enough to -- okay.
Cool. So you're in the right place. Now, who am I? Why am I qualified to talk on this topic? So I'm the CTO and co-founder of a company called Sourcegraph. We're a developer tools company. If you haven't heard of us, we have two products.
One is a code search engine. So we started the company because we were developers ourselves, my co-founder and I, and we were really tired of the slog of diving through large complex code bases and trying to make sense of what was happening in them. The other product that we have is an AI coding assistant.
So what this product does is -- oh, sorry, before I get to that, I should mention that we have great adoption among really great companies. We started this company ten years ago to solve this problem of understanding code in large code bases, and today we're very fortunate to have customers that range from early-stage startups all the way through to the Fortune 500 and even some government agencies.
So what's the tie-in to AI ROI? About two years ago, we released a new product called Cody, which is an AI coding assistant that ties into our code search engine. You can kind of think of it as a Perplexity for code, whereas your vanilla, run-of-the-mill AI coding assistant only uses very local context -- it only has access to the open files in your editor.
We spent the past ten years building this great code search engine that is really good at surfacing relevant code snippets from across your code base. And it just so turns out that that's a great superpower for AI to have, right? For those of us who have started using Perplexity, we can see the appeal.
And a big piece of the puzzle is not just the language model itself, but the ability to fetch and rank relevant context from a whole universe of data and information.
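To make that concrete, here is a toy sketch of the retrieve-then-prompt idea -- not Sourcegraph's actual implementation or API. The repository here is just an in-memory dict and relevance is naive keyword overlap; the real search engine does far more sophisticated ranking.

```python
# Toy sketch of "code search as context" -- hypothetical, not Cody's real pipeline.
def retrieve_context(question: str, repo: dict[str, str], k: int = 3) -> list[tuple[str, str]]:
    """Return the k files whose contents share the most words with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        ((len(q_words & set(content.lower().split())), path, content)
         for path, content in repo.items()),
        reverse=True,
    )
    return [(path, content) for score, path, content in scored[:k] if score > 0]

def build_prompt(question: str, repo: dict[str, str]) -> str:
    """Pair the retrieved snippets with the question before handing it to an LLM."""
    context = "\n\n".join(f"# {path}\n{content}" for path, content in retrieve_context(question, repo))
    return f"{context}\n\nQuestion: {question}\nAnswer:"

if __name__ == "__main__":
    repo = {
        "auth/login.py": "# validate login and password\ndef login(user, password): ...",
        "billing/invoice.py": "# create an invoice for a customer\ndef create_invoice(customer): ...",
    }
    print(build_prompt("How does login validate the password?", repo))
```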
We had to solve that problem in order to sell Cody into the likes of 1Password, Palo Alto Networks, and Leidos, which is a big government contractor. If you flew in here from somewhere else, you probably passed through one of their security machines. Each one of these organizations has a different way of measuring ROI -- they have different frameworks that they apply -- and so we had to answer that question in a multitude of ways.
And that's what I'm here to talk about. Okay. So I want to start out with how I would describe the value prop of AI to someone who's actually using it, to a developer. So coding without AI, I think, you know, we've all felt this before, if you've ever written a line of code, you start out by asking yourself, like, oh, this task, it's straightforward.
It should be easy. Let me just go build this feature. I should mention, I poached these slides from a director of engineering at Palo Alto Networks. He's actually giving another talk at this conference, Gunjan Patel. I thought he did an excellent job of describing the value prop that he was solving for as a director of engineering when they were purchasing a coding AI system.
So this is the way he described it. You think it should be straightforward. But then there's all these side quests that you end up going on as a developer. It's like, uh-oh. Like, I got to go install a bunch of dependencies. Or maybe, you know, there is this UI component that I have to go and figure out, this framework that I need to learn.
So the gaps appear. And then without AI, bridging the gaps takes both time and focus. You kind of have to, you know, spin off a side process or go on a little mini quest. And then, you know, 30 minutes and two cups of coffee later, you're like, okay, I got it.
I filled the gap. But what was I doing again? And so AI helps bridge those gaps. It helps solve this problem. It helps developers really stay in flow and stay kind of, like, cognizant of the high level of what they're trying to accomplish. And so what this means is that more and more developers can actually do the thing that we want to do, which is build an amazing feature and deliver an amazing experience, instead of giving up.
Now the question is, how do we measure this in a way that we can demonstrate the business impact of what this does to the rest of the organization? And so the answer is, beans. Okay. Who here drinks coffee? Okay. Cool. Do you know the difference between a really great coffee bean and, you know, your kind of run-of-the-mill Folgers or, you know, the thing that you buy at the supermarket?
Okay. It turns out we're all in the bean business. We think we're in the software business, but we're really selling beans in a way. So in every company, there is someone, let's call him Bob, who grows the beans, essentially the developer. There's another person, let's call him Pat, who sells the beans.
That's your kind of CRO or sales lead. And then you have Alice on the side, who's kind of your CFO or CEO, and Alice has got to count the beans. At the end of the day, you know, not to diss the finance people if there are any in the room, but it's all about counting the beans and seeing how they add up.
Now, I don't know if any of you have been paying attention, but in the past two years, the bean business has been revolutionized. This thing called AI has appeared. And so what does the bean business look like now? Well, Bob grows the beans with AI. And Pat is selling beans, but with AI.
And then Alice is on the side being like, well, I'm counting all the beans, and where's the ROI? And so this is the answer that, you know, basically Bob has to answer. Pat has kind of got it easy because Pat's job is just selling the beans. It's a much more quantifiable task.
Those of us that are involved in software engineering and product development, it's a bit harder to measure. So there's tension in the bean shop, you know. Alice is asking Bob, you know, how many more beans are we growing now with AI, Bob? And Bob's like, well, it's complicated, Alice.
Not all beans are the same. You know, there's some good beans and there's some very bad beans. We're making our beans better. And Alice is like, well, okay, the bean AI tool costs money, and we've got to measure its impact somehow. Anyone feel that tension? Anyone have this kind of, like, conversation?
We've talked to a lot of heads of engineering who feel this tension very acutely with other parts of the org, specifically between finance and engineering. And I think the core of the problem is that measuring AI ROI for functions where the work is not directly quantifiable through a number is what I like to call MP-hard.
So how many people are familiar with the term NP-hard? Okay. Cool. We're all pretty technical. MP-hard works the same way: if you can show that your problem is at least as hard as a class of very difficult problems, it probably means your problem is not really solvable.
And measuring AI ROI is at least as hard as measuring developer productivity -- or the productivity of whatever class of knowledge worker you're managing -- because if you could measure AI ROI precisely, you could also measure developer productivity. And who here knows how to measure developer productivity? It's kind of an open question, right?
So using the logic of your standard reduction proof, this problem is intractable. So that's the end of my talk -- I'm just here to tell you that this problem is intractable and we should give up, right? Well, in the real world, we often find tractable solutions to intractable problems. And so the meat of this talk is really sharing a set of evaluation frameworks that we've presented to different customers.
Not all of these are used by, you know, any given customer. But I wanted to give kind of like a sampling of the conversations that we've had. And hopefully there will be some time for Q&A at the end where we can kind of talk through this and see what other people are doing.
Okay. So framework number one is the famous roles eliminated framework. So this question gets asked a lot these days, especially, you know, on social media. Like, AI is here to take your job. So how does this framework work? Well, in the classic framing, you buy the tool, the labor-saving tool.
You observe, you know, you're in the bean business or the widget business. You observe that this tool yields an X percent increase in your capacity to build widgets. And then you can cut your workforce to meet the demands for whatever widgets you're selling. Now, in practice, we have not encountered this framework at all in the realm of software development.
We do see it more prevalent in other kind of business units, you know, things that are more viewed as, like, cost centers, like support and things like that, especially, like, consumer-facing customer support. But for software engineering, for whatever reason, we haven't encountered this yet in any of our customers.
And we think that the reason here is that, number one, if you view your org as a widget factory, then you're going to prioritize outputting widgets. But the thing is that very few effective engineering leaders these days view themselves as widget builders. The widgets are an abstraction that doesn't really apply to the craft of software engineering.
The other observation here is that the widgets we're building -- software, great user experiences -- are not really supply-limited. So if you have an extensive backlog, the question is: if we made your engineers 20% more productive, would you go work through 20% more of your backlog, or would you just cut 20% of your workforce and say, it's fine, we don't need to get to the backlog?
And for 99% of the companies out there that are building software, the answer is no -- we want to build a better user experience; these issues in our backlog are very important, we just can't get to them. So framework number one gets talked about frequently, but in practice we haven't really seen it used as an evaluation criterion.
Framework number two is what I like to call A/B testing velocity. So how this works is you basically segment off your organization into two groups, the test group and the control group. And then you say group one gets the tool and group two does not. And then you go through your standard planning process.
Most people, as part of that planning process, go and estimate the time it will take to resolve certain issues: how long is this feature going to take to build? How long is it going to take to work through this bug backlog? And because you've now divided the org into two groups, you have some notion of how well each one is executing against its timeline, so you can basically run these A/B tests.
So Palo Alto Networks, one of our customers, ran something similar to this, and the conclusion they drew was that the estimated timelines got accelerated 20 to 30 percent using Cody. This is a fairly rigorous, scientific framework. We see it come up now and then, especially when companies are of a certain size and are very thoughtful about this question.
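As a rough illustration of the mechanics, here is a minimal sketch of scoring each group against its planning estimates. The issue data and the exact metric are invented; a real evaluation would run much longer and control for team differences, as discussed next.

```python
# Hedged sketch of the A/B-testing-velocity framework: compare actual delivery
# time to the planning estimate for each group. All numbers are invented.
def speed_ratio(issues: list[tuple[float, float]]) -> float:
    """Total actual days as a fraction of total estimated days (lower = faster)."""
    return sum(actual for _, actual in issues) / sum(est for est, _ in issues)

# (estimated_days, actual_days) per completed issue
with_tool    = [(5, 4.0), (3, 2.5), (8, 6.5), (2, 1.5)]   # test group
without_tool = [(5, 5.5), (3, 3.0), (8, 9.0), (2, 2.5)]   # control group

tool, control = speed_ratio(with_tool), speed_ratio(without_tool)
print(f"test group: {tool:.0%} of estimate, control group: {control:.0%} of estimate")
print(f"relative speedup: {(control - tool) / control:.0%}")
```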
The criticisms of this framework: no two teams are exactly the same, right? If you split your development org, you might have dev infrastructure on one side, maybe backend, and frontend teams on the other, and it's hard to compare those teams to each other because software development is very different in different parts of your organization.
There are also confounding factors: maybe team X had an important leader or contributor depart, maybe team Y suffered a bout of COVID that blew through the team, things like that. So you have to account for these things when you make your evaluation.
And this framework is also high cost and effort. You have to do the subdivision, give one group access to the tool, and then run it for an extended period of time in order to gain enough confidence. But provided you have the resources and the time to run it, we think this is a pretty good framework for honestly testing the efficacy of an AI tool.
Okay, framework number three, I call this time saved as a function of engagement. So if you have a productivity tool, using the product should make people more productive, right? So if you have a code search engine, the more code searches that people do, that probably saves them time. If it didn't save them time, there would be no reason why they would go to the search engine.
And so in this framework, what you do is you basically go look at your product metrics, and you break down all the different time-saving actions, you identify them, and then you kind of tag them with an approximate estimate of how much time is saved in each action. And if you want to be conservative, you can lower bound it.
You could say, like, a code search probably saves me two minutes. That's maybe like a lower bound, because there's definitely searches where, oh, my gosh, it saved me like half a day or maybe like a whole week of work because it prevented me from going down an unproductive rabbit hole.
But you can lower bound it and just say, like, okay, we're going to get a lower bound estimate on the total amount of time saved. And then you go and ask your vendor, hey, can you build some analytics and share them with me? So this is something that we built for Cody.
Very fine-grained analytics that show the admins and leaders in the org exactly what actions are being taken: how many explicit invocations, how many chats, how many questions about the code base, how many inline code generation actions -- and those all map to a certain amount of time saved.
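Here's a minimal sketch of that tally, assuming hypothetical event names and per-action lower bounds -- this isn't Cody's actual analytics schema, just the shape of the calculation.

```python
# Conservative (lower-bound) minutes saved per explicit, opt-in action.
# Event names and estimates are hypothetical, for illustration only.
MINUTES_SAVED = {
    "code_search": 2,
    "chat_question": 5,
    "inline_code_generation": 3,
}

def lower_bound_minutes_saved(events: list[str]) -> int:
    """Sum lower-bound time savings over a list of usage events."""
    return sum(MINUTES_SAVED.get(event, 0) for event in events)

# A month of invented usage events for one team.
events = ["code_search"] * 400 + ["chat_question"] * 120 + ["inline_code_generation"] * 200
print(f"lower bound: ~{lower_bound_minutes_saved(events) / 60:.0f} developer-hours saved this month")
```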
There's one caveat here, which is that in products where you have an implicit trigger, like autocomplete, you can't rely purely on engagement, because it's not the human opting in to engage the product each time -- it's implicitly shown to you. So there, you tend to go with more of an acceptance-rate criterion.
You don't want to do just raw engagements because then the product could just push a bunch of like low-quality completions to you, and that would not be a good metric of time saved. So we have a lot of customers that do this. They appreciate it. One of the nice things about this is that because we're lower bounding, it makes the value of the software very clear.
So a lot of developer tools, us included -- I think Cody is about $9 per month, and Sourcegraph is a little bit more than that -- if you back out the math on how much an hour of a developer's time is worth, typically around $100 to $200, then saving a couple of minutes a month means the tool kind of pays for itself in terms of productivity.
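Here's the back-of-the-envelope version of that math, using the figures from the talk; the break-even point depends on which end of the hourly-rate range you assume.

```python
# Payback math with the figures from the talk: ~$9 per seat per month,
# and a developer-hour valued at $100-$200 (conservative end used here).
seat_cost_per_month = 9.0     # USD per developer per month
dev_hour_value = 100.0        # USD per developer-hour

minutes_to_break_even = seat_cost_per_month / dev_hour_value * 60
print(f"break-even: ~{minutes_to_break_even:.1f} minutes saved per developer per month")
# ~5.4 minutes at $100/hour, ~2.7 minutes at $200/hour; anything beyond is upside.
```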
The criticisms of this framework: of course, it's a lower bound, so you're not fully assessing the value. If you go back to that picture I showed earlier of the dev journey, where you're bridging the gaps, I think a big value of AI is actually completing tasks that hitherto were just not completed, because people got fed up, or they got lazy, or they just had other things to do.
So this doesn't capture that. It doesn't account for the second-order effects of the velocity boost, and it's good for the day-to-day question of, hey, is this speeding up the team? But it doesn't capture some of the impact on key initiatives of the company. So that leads to the fourth evaluation framework.
So a lot of our customers track certain KPIs that they think are correlated with engineering quality and business impact. So lines of code generated, does anyone here think lines of code generated is a good metric of developer productivity? Okay. I think in 2024 we can all say that it's not.
We have seen this resurface in the context of measuring the ROI of AI, because it's almost like we've forgotten all the lessons that we learned about human developer productivity, and with AI code-generation tools, now it's like, oh, you generated hundreds of lines of code for a developer in a day.
And we noticed that we were actually losing some deals on lines of code generated. At first, when we went and looked at the product experience, we thought, hey, maybe there's a product improvement here that we should be making, because people aren't accepting as many lines generated by Cody.
But when we dug into it, we realized the competitor's product was just triggering more aggressively, and that's not the sort of business we want to be in. So more and more, we're pushing our customers not to tie to generic, high-level metrics, but to identify certain KPIs that correspond to changes you actually want to make in your organization.
So with Leidos, our big government contracting customer, they identified a set of zones or actions that they felt were really important. They wanted to reduce the amount of time developers spent answering questions and bugging teammates, and increase the time spent in the areas they identified as value-add.
Three things, mainly: building features, writing unit tests, and reviewing code. And so that's what we tracked for their evaluation period.
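A sketch of what that tracking can look like: share of developer time per category, baseline versus pilot. Only the category names come from the talk; the hours are invented for illustration.

```python
# Hypothetical time-allocation tracking for the KPI framework described above.
def time_share(hours_by_activity: dict[str, float]) -> dict[str, float]:
    """Convert tracked hours per activity into a share of total time."""
    total = sum(hours_by_activity.values())
    return {activity: hours / total for activity, hours in hours_by_activity.items()}

baseline = {"answering questions": 10, "building features": 18, "writing unit tests": 4, "reviewing code": 8}
pilot    = {"answering questions": 5,  "building features": 21, "writing unit tests": 6, "reviewing code": 8}

for activity in baseline:
    print(f"{activity:>20}: {time_share(baseline)[activity]:.0%} -> {time_share(pilot)[activity]:.0%}")
```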
A fifth framework is impact on key initiatives. This is the map-your-product-to-OKRs framework. There are a couple of companies that are in the midst of a big code migration -- say they're trying to migrate from COBOL to Java, or maybe from React to Svelte or the latest JS framework -- and these are top-level goals that the VP of engineering really cares about. So if you have a product that accelerates progress towards that, then the ROI is really just the value of bringing it forward by X number of months or, in some cases, X number of years, or of making it possible at all.
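One hedged way to put a number on "bringing it forward": if the legacy system has a known monthly carrying cost (licenses, maintenance, operational risk), finishing N months earlier is worth roughly that cost times N. Every figure below is an assumption for illustration.

```python
# Rough value-of-acceleration math for the key-initiatives framework.
monthly_carrying_cost = 50_000.0   # USD: assumed legacy licenses, maintenance, risk
months_pulled_forward = 6          # assumed acceleration attributable to the tool

acceleration_value = monthly_carrying_cost * months_pulled_forward
print(f"finishing {months_pulled_forward} months early is worth roughly ${acceleration_value:,.0f}")
```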
How do you measure how much you pulled it forward? That's a good question. So the question was: how do you measure how much you pulled it forward? It's really a judgment call with the engineering leader at that point. By the time they have this conversation with us, they've typically already started the migration, or have a few of these under their belt, and have an idea of the pain.
And then they can assess kind of the shape of the product and the things that we do and estimate how much quicker it will be done. We also have case studies demonstrating, like, hey, this thing that used to take, you know, a year or longer, we squished it into the span of, you know, a couple months.
Okay. And then the last framework is what I'll call survey. So this sounds like the least rigorous framework, but I think it's still highly valuable. Basically it's just run a pilot, have your developers use it, you can compare against another tool, and then at the end of it, just ask your developers, you know, which one was best.
More and more nowadays, we don't see this in an unbounded fashion. In the 2021 ZIRP period, people were just like, whatever makes the developers happy, let's go buy that. Instead, we see it in more of a bounded fashion: a lot of orgs allocate some part of their budget toward investing in developer productivity and developer productivity tools.
That chart over there shows the range across orgs surveyed by the Pragmatic Engineer newsletter, and it typically lands somewhere between 5 and 25 percent. So within that allocation, you have a certain amount of budget for tools, and then you basically say, subject to that constraint, let me go ask my developers which tools they want the most.
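A quick sketch of that bounded framing: take the low end of the 5 to 25 percent range and a fully loaded headcount cost, and you get a rough per-developer tooling budget. The headcount figure is an assumption, and in practice that allocation also covers internal platform work, so this overstates the pure tools budget.

```python
# Rough per-developer tooling budget under the bounded-budget framing.
fully_loaded_cost_per_dev = 200_000.0  # USD per year, assumed
productivity_share = 0.05              # low end of the 5-25% range from the survey

annual_budget_per_dev = fully_loaded_cost_per_dev * productivity_share
print(f"~${annual_budget_per_dev:,.0f} per developer per year for productivity tooling")
```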
Okay, so I'm basically out of time. But hopefully this gave you kind of like a sampling of the flavors of different frameworks that are involved. This is something that we've had to work through, through a lot of customers ranging from very small startups to very large companies in the Fortune 500.
I just want to say: be skeptical of anyone claiming P equals NP -- of anyone saying they have a precise way to measure AI ROI or developer productivity. No framework is perfect. The most important thing, I think, is to find clear success criteria. That's something that you, your internal teams, and your stakeholders will appreciate, and also the vendor, because then they know what success looks like.
And then the last note here is: productivity tools are often bottom-up, but we've actually found that top-down mandates with AI can sometimes help, because developers can be a bit of a skeptical crowd. But if you believe firmly that this is where the future is going, and that people have to update their skill set to make productive use of LLMs and AI to code more productively, we've actually seen success in this case.
Where a CEO basically says: we're adopting Cody, we're adopting code AI, go figure out how to use this. It's not going to be perfect, but it's something that's going to play out over the next decade. And then, sorry, one final thought: as we move towards more automation, I like to keep two pictures in mind.
So one picture is the graph on the right, which is kind of like a landscape of code AI tools. So on the left-hand side, you have the kind of like inline completions, you know, very basic, completing the next tokens that you're typing. And then on the far right, you have kind of like the fully automated offline agents.
And we're trying to make all these solutions more reliable, right? Because like generative AI is sort of inherently unreliable. We're trying to make it more general and, you know, make it productive in more languages and more scenarios. And we actually think that the path there is to go from the left-hand side to the right-hand side, you know, not jumping straight to the full automation, because that's a very difficult problem.
So we think the next phase of evolution for us is going from these inline code completion scenarios to more of what we call online agents -- agents that live in your editor but can still react to human feedback and guidance. And then the second picture is The Mythical Man-Month.
I think a lot of people are familiar with this classic work. It talks about the classic fallacies with respect to developer productivity that a lot of companies make. One question I would pose to all of you is, have the fundamentals really changed? You know, as we have more quote-unquote AI developers, should we measure them by the same evaluation criteria?
And the challenge question I would pose to all of you is this: I think the lesson from The Mythical Man-Month is that you'd prefer one very smart engineer who's highly productive over 10, maybe even 100, mediocre developers who are kind of productive but need a lot of guidance.
And so as AI automation increases, the question is: do you want 100 mediocre AI developers, or do you want a 100x lever for the human developers -- the really smart people who are going to craft the user experience? All right, that's it for me. If you want to check out Cody, that's the URL, and that's my contact info if you want to reach out later.
Do we have time for questions? We have time for perhaps one or two questions. I'll run around with the microphone, just so we can catch it on audio. Cool. Thanks. Thanks for Cody -- you guys are the best. I tried them all.
I'm staying with you. Thanks. You've got really good insight into what's happening now with software development. So what's your intuition: are we going to move away from coding, with code becoming boilerplate and us moving to metaprogramming, or will code still be the main output of the senior engineer?
Yeah, that's a really good question. The usage pattern we're observing now is that more and more code is being written through natural language prompts. I had an experience just on Friday, actually, where Cody wrote 80% of the code because I just kept asking it to write the different functions that I wanted.
And that was nice because I was thinking at the function level rather than the line-by-line code level, and it allowed me to stay in flow. But at the same time, the output was still code. And I really do think we're never going to see a full replacement of code by natural language, because code lets you describe what you want very precisely.
And that precision is important as a source of truth for what the software actually does. I think that's probably it for now. Thank you very much. Thank you.