The ROI of AI: Why You Need an Eval Framework - Beyang Liu

00:00:00.120 |
I want to introduce our first speaker, Beyang Liu, CTO of Sourcegraph, and I won't spoil 00:00:17.560 |
the topic of the talk, but it's going to be a good one. 00:00:31.280 |
Before I dive into the talk here, I just want to get a sense of who we all have in the room. 00:00:37.040 |
Who here is like a head of engineering or a VP of engineering? 00:00:43.640 |
Who here is just an IC dev interested in evaluating how things are going? 00:01:02.060 |
And who here has a really quantitative, very precise, thought-through evaluation framework in place? 00:01:20.060 |
And then who here is sort of like, we kind of are evaluating it, but it's really kind of ad hoc? 00:01:34.060 |
So I'm the CTO and co-founder of a company called Sourcegraph. 00:01:38.560 |
If you haven't heard of us, we have two products. 00:01:42.560 |
So we started the company because we were developers ourselves, my co-founder and I, and we were 00:01:49.000 |
really tired of the slog of diving through large, complex code bases and trying to make sense of them. 00:01:55.480 |
The other product that we have is an AI coding assistant. 00:01:59.340 |
So what this product does is it's essentially -- oh, sorry. 00:02:05.100 |
I should mention that we have great adoption among really great companies. 00:02:08.380 |
So we started this company ten years ago to solve this problem of understanding large, complex code bases. 00:02:14.320 |
And today we're very fortunate to have, you know, customers that range from early stage startups 00:02:20.500 |
all the way through to the Fortune 500 and even some government agencies. 00:02:28.180 |
So about two years ago, we released a new product called Cody, which is an AI coding assistant that uses context from across your entire code base. 00:02:36.280 |
So you can kind of think of it as, like, a Perplexity for code, whereas, you know, your 00:02:40.620 |
kind of vanilla run-of-the-mill AI coding assistant only uses very local context, only has access 00:02:47.140 |
to kind of, like, the open files in your editor. 00:02:49.860 |
We spent the past ten years building this great code search engine that is really good at surfacing 00:02:54.840 |
relevant code snippets from across your code base. 00:02:57.320 |
And it just so turns out that, like, that's a great superpower for AI to have, right? 00:03:01.100 |
You know, for those of us that have started using Perplexity, we can kind of see the appeal. 00:03:05.440 |
And a big piece of the puzzle is not just the language model itself, but the ability to fetch 00:03:10.460 |
and rank relevant context from, you know, a whole universe of data and information. 00:03:15.540 |
So we had to solve this problem in order to sell Cody into the likes of 1Password, Palo Alto 00:03:24.200 |
Networks, and Leidos, which is a big government contractor. 00:03:27.900 |
If you flew in here from somewhere else, you probably entered through one of their security checkpoints. 00:03:34.880 |
And so each one of these organizations has kind of, like, a different way of measuring ROI. 00:03:40.220 |
They have different frameworks that they apply. 00:03:42.560 |
And so we had to answer that question in a multitude of ways. 00:03:49.440 |
So I want to start out with how I would describe the value prop of AI to someone who's actually writing the code. 00:03:58.400 |
So coding without AI, I think, you know, we've all felt this before, if you've ever written 00:04:02.860 |
a line of code, you start out by asking yourself, like, oh, this task, it's straightforward. 00:04:11.440 |
I should mention, I poached these slides from a director of engineering at Palo Alto Networks. 00:04:18.580 |
He's actually giving another talk at this conference, Gunjan Patel. 00:04:21.540 |
I thought he did an excellent job of describing the value prop that he was solving for as a director 00:04:25.960 |
of engineering when they were purchasing a coding AI system. 00:04:33.720 |
But then there's all these side quests that you end up going on as a developer. 00:04:38.880 |
Like, I got to go install a bunch of dependencies. 00:04:40.940 |
Or maybe, you know, there is this UI component that I have to go and figure out, in this framework I'm not familiar with. 00:04:48.280 |
And then without AI, bridging the gaps takes both time and focus. 00:04:53.760 |
You kind of have to, you know, spin off a side process or go on a little mini quest. 00:04:57.580 |
And then, you know, 30 minutes and two cups of coffee later, you're like, okay, I got it. 00:05:09.220 |
With AI bridging those gaps, developers really stay in flow and stay kind of, like, cognizant of the high-level goal. 00:05:18.100 |
And so what this means is that more and more developers can actually do the thing that we 00:05:21.280 |
want to do, which is build an amazing feature and deliver an amazing experience, instead of going on all those little side quests. 00:05:26.640 |
Now the question is, how do we measure this in a way that we can demonstrate the business 00:05:32.180 |
impact of what this does to the rest of the organization? 00:05:44.660 |
Do you know the difference between a really great coffee bean and, you know, your kind 00:05:49.380 |
of run-of-the-mill Folgers or, you know, the thing that you buy at the supermarket? 00:05:59.000 |
We think we're in the software business, but we're really selling beans in a way. 00:06:02.800 |
So in every company, there is someone, let's call him Bob, who grows the beans, essentially your engineering org. 00:06:08.720 |
There's another person, let's call him Pat, who sells the beans. 00:06:15.400 |
And then you have Alice on the side, who's kind of your CFO or CEO, and Alice has got the books. 00:06:21.560 |
At the end of the day, you know, not to diss the finance people if there are any in the room, 00:06:26.820 |
but it's all about counting the beans and seeing how they add up. 00:06:29.860 |
Now, I don't know if any of you have been paying attention, but in the past two years, AI has changed the game. 00:06:38.500 |
And so what does the bean business look like now? 00:06:47.800 |
And then Alice is on the side being like, well, I'm counting all the beans, and where's the ROI? 00:06:54.140 |
And so this is the question that, you know, basically Bob has to answer. 00:06:57.800 |
Pat has kind of got it easy because Pat's job is just selling the beans. 00:07:04.340 |
Those of us that are involved in software engineering and product development, it's a bit harder 00:07:10.060 |
So there's tension in the bean shop, you know. 00:07:12.980 |
Alice is asking Bob, you know, how many more beans are we growing now with AI, Bob? 00:07:16.740 |
And Bob's like, well, it's complicated, Alice. 00:07:19.560 |
You know, there's some good beans and there's some very bad beans. 00:07:24.020 |
And Alice is like, well, okay, the bean AI tool costs money, and we've got to measure the return on it. 00:07:33.960 |
Anyone have this kind of, like, conversation? 00:07:36.560 |
We've talked to a lot of heads of engineering who see this tension as very real with other parts 00:07:40.460 |
of the org, specifically between finance and engineering. 00:07:45.380 |
And I think the core of the problem is that measuring AI ROI for functions where the work 00:07:50.760 |
is not directly quantifiable through a number is what I like to call NP-hard. 00:07:56.120 |
So how many people are familiar with the term NP-hard here? 00:08:02.380 |
So NP-hard basically means if you have a tough challenge, if you have a problem, and you can 00:08:07.180 |
basically reduce it to a class of very difficult problems, it probably means your problem is very hard too. 00:08:14.380 |
And measuring AI ROI reduces to measuring developer productivity, or the productivity of whatever class of worker is using the tool. 00:08:22.880 |
And so that implies if you can measure AI ROI precisely, you can also measure developer productivity. 00:08:28.960 |
And who here knows how to measure developer productivity? 00:08:34.880 |
So using the logic of your standard reduction proof, this problem is intractable. 00:08:42.520 |
I'm just here to tell you that this problem is intractable and we should give up, right? 00:08:47.060 |
Well, in the real world, we often find tractable solutions to intractable problems. 00:08:52.780 |
And so the meat of this talk is really sharing a set of evaluation frameworks that we've seen customers use. 00:08:59.460 |
Not all of these are used by, you know, any given customer. 00:09:03.800 |
But I wanted to give kind of like a sampling of the conversations that we've had. 00:09:07.600 |
And hopefully there will be some time for Q&A at the end where we can kind of talk through some of these in more detail. 00:09:17.280 |
So framework number one is the famous roles eliminated framework. 00:09:22.820 |
So this question gets asked a lot these days, especially, you know, on social media. 00:09:30.440 |
Well, in the classic framing, you buy the tool, the labor-saving tool. 00:09:34.040 |
You observe, you know, you're in the bean business or the widget business. 00:09:38.040 |
You observe that this tool yields an X percent increase in your capacity to build widgets. 00:09:43.040 |
And then you can cut your workforce to meet the demands for whatever widgets you're selling. 00:09:48.040 |
Now, in practice, we have not encountered this framework at all in the realm of software engineering. 00:09:54.760 |
We do see it more prevalent in other kinds of business units, you know, things that are more 00:09:58.960 |
viewed as, like, cost centers, like support and things like that, especially, like, consumer-facing support. 00:10:07.600 |
But for software engineering, for whatever reason, we haven't encountered this yet in any of our customer conversations. 00:10:13.100 |
And we think that the reason here is that, number one, you know, if you view your org as a widget 00:10:17.920 |
factory, then you're going to prioritize outputting widgets. 00:10:22.220 |
But the thing is that very few engineering leaders, effective engineering leaders these days, view their org as a widget factory. 00:10:28.300 |
You know, the widgets are kind of an abstraction that don't apply to the craft of software engineering. 00:10:33.840 |
The other observation here is that the widgets that we're building, which is software, great 00:10:38.240 |
user experiences, they're not really supply-limited. 00:10:40.560 |
So if you have, like, an extensive backlog, the question is, you know, if we made your engineers 00:10:45.540 |
20% more productive, would you go and, you know, do more -- 20% more of your backlog, or you 00:10:51.540 |
just cut down 20% of your workforce and say, like, you know, it's fine, we don't need to get to the rest. 00:10:56.660 |
And for 99% of the companies out there that are building software, the answer is no, we want to do more. 00:11:02.840 |
These issues in our backlog are very important, we just can't get to them. 00:11:07.320 |
So framework number one is kind of like talked about frequently, but in practice we haven't really seen it used. 00:11:16.920 |
Framework number two is what I like to call A/B testing velocity. 00:11:21.240 |
So how this works is you basically segment off your organization into two groups, the test group and the control group. 00:11:29.240 |
And then you say group one gets the tool and group two does not. 00:11:34.300 |
And then you go through your standard planning process. 00:11:37.540 |
Most people, as part of that planning process, what you do is you go and estimate the time each piece of work will take. 00:11:44.400 |
So how long is this feature going to take to build? 00:11:46.800 |
How long is it going to take to work through these bug backlogs? 00:11:50.400 |
And then because you've divided the groups into two now, you have some notion of, like, how 00:11:55.240 |
well you're executing against your timeline, so you basically run these A/B tests. 00:12:00.220 |
So Palo Alto Networks, one of our customers, ran something similar to this, and the conclusion 00:12:05.380 |
they drew was that their estimated timelines got accelerated 20 to 30 percent using Cody. 00:12:12.400 |
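As a rough illustration of how the numbers from an A/B test like this could be crunched, here is a minimal sketch; the work items, estimates, and helper function below are hypothetical, not Palo Alto Networks' actual data or methodology.

```python
from statistics import mean

# Hypothetical planning data: (estimated_days, actual_days) for completed work items.
# Group A had the AI coding assistant during the pilot; group B did not.
group_a = [(5, 4.0), (8, 6.5), (3, 2.5), (13, 10.0)]   # test group, with the tool
group_b = [(5, 5.5), (8, 8.0), (3, 3.5), (13, 14.0)]   # control group, without

def schedule_ratio(tasks):
    """Average of actual/estimated time; below 1.0 means finishing ahead of plan."""
    return mean(actual / estimated for estimated, actual in tasks)

ratio_a = schedule_ratio(group_a)
ratio_b = schedule_ratio(group_b)
speedup = 1 - ratio_a / ratio_b  # relative acceleration of the test group

print(f"test group:    {ratio_a:.2f}x of estimated time")
print(f"control group: {ratio_b:.2f}x of estimated time")
print(f"approximate timeline acceleration: {speedup:.0%}")
```

With numbers in that range, the test group comes out roughly a quarter faster against plan than the control group, which is the kind of 20 to 30 percent figure described above; the caveats that follow are exactly the confounders such a comparison has to account for.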
And so this is a very kind of, like, rigorous scientific framework. 00:12:16.520 |
We see it come up now and then, especially when companies are of a certain size and they're running a more formal evaluation. 00:12:24.640 |
The criticisms about this framework are no two teams are exactly the same, right? 00:12:28.820 |
Like, if you lop off your development org, you have, you know, your dev infrastructure 00:12:32.500 |
on this side, maybe backend, and then you have frontend teams on this side. 00:12:36.000 |
It's hard sometimes to compare these different teams to each other because software development 00:12:41.560 |
is very different in different parts of your organization. 00:12:46.720 |
You know, maybe team X, you know, had an important leader or contributor depart. 00:12:52.480 |
Maybe team Y, you know, suffered a bout of COVID that, you know, blew through the team, or things like that. 00:12:59.720 |
So you have to account for these things when you make your evaluation. 00:13:02.600 |
And this framework is also high cost and effort. 00:13:04.280 |
You basically have to do the subdivision, you give one group access to the tool, and then 00:13:07.820 |
you have to run it for an extended period of time in order to gain enough confidence. 00:13:12.140 |
But provided you have the resources and the time to run it, we think this is a pretty 00:13:16.240 |
good framework for honestly testing the efficacy of an AI tool. 00:13:22.460 |
Okay, framework number three, I call this time saved as a function of engagement. 00:13:28.640 |
So if you have a productivity tool, using the product should make people more productive, right? 00:13:35.900 |
So if you have a code search engine, the more code searches that people do, that probably means it's saving them time. 00:13:42.680 |
If it didn't save them time, there would be no reason why they would go to the search engine. 00:13:45.960 |
And so in this framework, what you do is you basically go look at your product metrics, 00:13:50.740 |
and you break down all the different time-saving actions, you identify them, and then you kind 00:13:56.160 |
of tag them with an approximate estimate of how much time is saved in each action. 00:14:00.820 |
And if you want to be conservative, you can lower bound it. 00:14:04.040 |
You could say, like, a code search probably saves me two minutes. 00:14:07.860 |
That's maybe like a lower bound, because there's definitely searches where, oh, my gosh, it saved 00:14:12.780 |
me like half a day or maybe like a whole week of work because it prevented me from going down the wrong path. 00:14:20.500 |
But you can lower bound it and just say, like, okay, we're going to get a lower bound estimate of the time saved. 00:14:26.040 |
And then you go and ask your vendor, hey, can you build some analytics and share them with us? 00:14:32.580 |
You know, very fine-grained analytics to show the admins and leaders in the org exactly what actions people are taking. 00:14:41.660 |
You know, how many explicit invocations, you know, how many chats, how many questions about 00:14:45.660 |
the code base, how many inline code generation actions, and those all map to a certain amount of time saved. 00:14:51.660 |
There's one caveat here, which is in products where you have like an implicit trigger, like 00:14:57.800 |
an autocomplete, you can't rely purely on engagement because it's not the human opting in to engage 00:15:08.280 |
And so there, you tend to go with more of an acceptance rate criteria. 00:15:12.160 |
You don't want to do just raw engagements because then the product could just push a bunch of like 00:15:16.420 |
low-quality completions to you, and that would not be a good metric of time saved. 00:15:26.080 |
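To make the mechanics concrete, here is a minimal sketch of this kind of lower-bound accounting, assuming hypothetical per-action time-saved estimates and a hypothetical month of usage data; none of the numbers or field names come from Sourcegraph's actual analytics.

```python
# Hypothetical lower-bound time-saved estimates for explicit, opt-in actions (minutes).
MINUTES_SAVED = {
    "code_search": 2,      # "a code search probably saves me two minutes"
    "chat_question": 5,
    "inline_edit": 3,
}
AUTOCOMPLETE_MINUTES = 0.5  # implicit trigger: only count accepted completions

# Hypothetical usage counts for one team over one month.
explicit_events = {"code_search": 1200, "chat_question": 300, "inline_edit": 450}
completions_shown, completions_accepted = 8000, 2400

# Explicit actions: the user opted in, so every engagement counts.
explicit_minutes = sum(MINUTES_SAVED[k] * n for k, n in explicit_events.items())

# Implicit actions: weight by acceptances, not raw triggers, so a product that
# spams low-quality completions doesn't inflate the estimate.
acceptance_rate = completions_accepted / completions_shown
implicit_minutes = completions_accepted * AUTOCOMPLETE_MINUTES

total_hours = (explicit_minutes + implicit_minutes) / 60
print(f"completion acceptance rate: {acceptance_rate:.0%}")
print(f"lower-bound time saved this month: ~{total_hours:.0f} hours")
```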
One of the nice things about this is that because we're lower bounding, it makes the value of the tool easy to justify relative to its cost. 00:15:32.120 |
So a lot of developer tools, us included, I think like Cody is like $9 per month and Sourcegraph 00:15:37.460 |
is a little bit more than that, but it's like if you back out the math of how much an hour 00:15:41.280 |
of a developer's time is worth, you know, typically it's around like $100 to $200. 00:15:46.280 |
It's like if you save, you know, a couple minutes a month, this kind of pays for itself in terms of ROI. 00:15:54.620 |
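Backing that math out explicitly, with $150/hour assumed as a midpoint of the $100 to $200 range quoted above:

```latex
\text{break-even usage} \;=\; \frac{\$9\ \text{per month}}{\$150\ \text{per hour}}
\;=\; 0.06\ \text{hours per month} \;\approx\; 3.6\ \text{minutes per month}
```

So any developer who saves more than a few minutes a month has already covered the seat cost, which is why the lower-bound framing is usually enough to make the case.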
The criticism of this framework is, of course, it's a lower bound, so you're not fully assessing the value. 00:15:59.520 |
If you go back to that picture I showed earlier of, you know, the dev journey where you're kind 00:16:02.960 |
of bridging the gaps, I think a big value of AI is actually completing tasks that hitherto 00:16:09.780 |
or beforehand were just not completed because people got fed up or they got lazy or they just didn't have the time. 00:16:18.600 |
It doesn't account for the second order effects of the velocity boost, and it's good for kind 00:16:24.600 |
of like day to day, like, hey, is this speeding up the team? 00:16:26.600 |
But it doesn't capture some of the impact on key initiatives of the company. 00:16:32.040 |
So that leads to the fourth evaluation framework. 00:16:35.600 |
So a lot of our customers track certain KPIs that they think are correlated with engineering productivity. 00:16:42.700 |
So lines of code generated, does anyone here think lines of code generated is a good metric for productivity? 00:16:50.420 |
I think in 2024 we can all say that it's not. 00:16:53.960 |
We have seen this resurfaced in the context of measuring ROI of AI, because it's almost like 00:16:59.160 |
we've forgotten all the lessons that we learned about human developer productivity, and with 00:17:02.780 |
AI code generation tools, now it's like, oh, like, you know, you generated hundreds of lines of code, so it must be valuable. 00:17:13.120 |
And so we noted that we were actually losing some deals on lines of code generated. 00:17:17.380 |
And when we actually went and looked at the product experience, we're like -- at first, 00:17:20.020 |
we're like, hey, you know, maybe we should -- there's a product improvement here that we should 00:17:23.240 |
be making because people aren't accepting as many lines generated by Cody. 00:17:26.440 |
But when we dug into this, we're like, oh, like, you know, the competitor's product, it's 00:17:29.700 |
just more aggressively kind of, like, triggering, and that's not, like, the sort of business value you actually want to optimize for. 00:17:35.660 |
So more and more, we're kind of pushing our customers to not tie to generic metrics or, 00:17:41.320 |
like, high-level metrics, but identify certain KPIs that map to changes that you want to drive. 00:17:48.680 |
So with Leidos, our big kind of government contracting customer, they identified a set of zones or actions that they considered value-add. 00:18:00.120 |
They wanted to reduce the amount of time spent answering questions, spent bugging teammates, 00:18:04.560 |
and spend more developer time in these areas that they identified as value-add. 00:18:09.800 |
Three things mainly: building features, writing unit tests, and reviewing code. 00:18:14.140 |
And so that's what we tracked for their evaluation period. 00:18:19.700 |
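As a sketch of what tracking that kind of KPI can look like, here is a minimal example assuming a hypothetical breakdown of where developer time goes before and during an evaluation period; the categories follow the talk, but the numbers are invented, not Leidos data.

```python
# Hypothetical share of developer time by activity, before vs. during the pilot.
VALUE_ADD = {"building_features", "writing_unit_tests", "reviewing_code"}

baseline = {
    "building_features": 0.35, "writing_unit_tests": 0.08, "reviewing_code": 0.10,
    "answering_questions": 0.22, "bugging_teammates": 0.15, "other": 0.10,
}
pilot = {
    "building_features": 0.42, "writing_unit_tests": 0.12, "reviewing_code": 0.12,
    "answering_questions": 0.15, "bugging_teammates": 0.09, "other": 0.10,
}

def value_add_share(breakdown):
    """Fraction of time spent in activities the org tagged as value-add."""
    return sum(share for activity, share in breakdown.items() if activity in VALUE_ADD)

print(f"value-add time, baseline: {value_add_share(baseline):.0%}")  # 53%
print(f"value-add time, pilot:    {value_add_share(pilot):.0%}")     # 66%
```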
A fifth framework is impact on key initiatives. 00:18:22.460 |
So this is the kind of, like, map-your-product-to-OKRs framework. 00:18:27.460 |
And so there are a couple of companies where they're in the midst of a big code migration. 00:18:30.720 |
Like, they're trying to migrate from, you know, COBOL to Java, or maybe, you know, do a big React migration. 00:18:40.020 |
And these are kind of, like, top-level goals that the VP of engineering really cares about. 00:18:45.120 |
And so if you have a product that accelerates progress towards this, then the ROI is really 00:18:50.560 |
just what's the value of bringing that forward by, you know, X number of months or, in some 00:18:54.860 |
cases, X number of years, or making it possible at all. 00:18:58.180 |
How do you measure how much you pulled it forward? 00:19:04.120 |
So the question was, how do you measure how much you pulled it forward? 00:19:06.400 |
It is really a kind of judgment call with the engineering leader at that point. 00:19:11.060 |
So by the time they have this conversation with us, they've typically already started 00:19:15.300 |
it or have had a few of these under their belt, and they have an idea of the pain. 00:19:19.900 |
And then they can assess kind of the shape of the product and the things that we do and estimate how much it will accelerate the effort. 00:19:25.700 |
We also have case studies demonstrating, like, hey, this thing that used to take, you know, 00:19:29.220 |
a year or longer, we squished it into the span of, you know, a couple months. 00:19:35.700 |
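One rough way to put a number on "pulling it forward", assuming the engineering leader can estimate what each month of delay costs (running the old and new systems in parallel, deferred feature work, and so on); this is a simplification of that judgment call, not a formula any customer actually uses:

```latex
\text{ROI}_{\text{initiative}} \;\approx\; (\text{months pulled forward}) \times (\text{cost per month of delay}) \;-\; (\text{cost of the tool})
```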
And then the last framework is what I'll call survey. 00:19:37.940 |
So this sounds like the least rigorous framework, but I think it's still highly valuable. 00:19:42.180 |
Basically it's just run a pilot, have your developers use it, you can compare against another tool, 00:19:47.620 |
and then at the end of it, just ask your developers, you know, which one was best. 00:19:52.440 |
More and more nowadays, we don't see this in an unbounded fashion. 00:19:56.180 |
Like, you know, in the kind of like 2021 ZIRP period, people were just like, you know, whatever 00:20:01.160 |
makes the developers happy, let's just go buy that. 00:20:04.700 |
And in other words, we see it in more of a bounded fashion, which is, you know, a lot 00:20:08.940 |
of orgs, you know, allocate some part of their budget toward investing in developer productivity tooling. 00:20:17.360 |
So that chart over there shows kind of the range across the orgs surveyed by the 00:20:22.580 |
Pragmatic Engineer newsletter, and it typically ranges somewhere between five and 25%. 00:20:28.640 |
So within that budget allocation, you have a certain amount of budget allocated to tools. 00:20:32.880 |
Then you basically say, subject to that constraint, let me go ask my developers what tools they prefer. 00:20:44.140 |
But hopefully this gave you kind of like a sampling of the flavors of different frameworks that we've encountered. 00:20:48.800 |
This is something that we've had to work through with a lot of customers, ranging 00:20:51.520 |
from very small startups to very large companies in the Fortune 500. 00:20:56.040 |
I just want to say, you know, be skeptical of anyone saying P equals NP, of anyone saying they have 00:21:00.880 |
a precise way to measure AI ROI or developer productivity. 00:21:06.040 |
The most important thing, I think, is to find clear success criteria. 00:21:09.680 |
That's something that you, your internal teams, and stakeholders will appreciate, and also 00:21:13.500 |
the vendor, because then they know what success looks like. 00:21:17.180 |
And then the kind of final note here is, productivity tools are often adopted bottoms-up, 00:21:21.580 |
but we've actually found that top-down mandates with AI can sometimes help, because developers don't always pick these tools up on their own. 00:21:28.580 |
But if you believe firmly that this is where the future is going, and that people have to 00:21:32.540 |
update their skill set to make productive use of LLMs and AI, we've seen the top-down approach work. 00:21:39.960 |
Where a CEO basically says, we're adopting Cody, we're adopting code AI, go figure out how to make it work for you. 00:21:44.980 |
It's not going to be perfect, but it's something that is going to play out over the next decade. 00:21:54.220 |
As we move towards more automation, I like to keep two pictures in mind. 00:21:59.060 |
So one picture is the graph on the right, which is kind of like a landscape of code AI tools. 00:22:04.840 |
So on the left-hand side, you have the kind of like inline completions, you know, very 00:22:08.860 |
basic, completing the next tokens that you're typing. 00:22:10.860 |
And then on the far right, you have kind of like the fully automated offline agents. 00:22:15.820 |
And we're trying to make all these solutions more reliable, right? 00:22:18.180 |
Because like generative AI is sort of inherently unreliable. 00:22:21.620 |
We're trying to make it more general and, you know, make it productive in more languages 00:22:29.140 |
And we actually think that the path there is to go from the left-hand side to the right-hand 00:22:34.580 |
side, you know, not jumping straight to the full automation, because that's a very difficult leap to make. 00:22:39.700 |
So we think the next phase of evolution for us is going from kind of like these inline code 00:22:44.120 |
completion scenarios to more of what we call online agents that live in your editor but 00:22:48.540 |
can still react to human feedback and guidance. 00:22:52.700 |
And then the second picture is, you know, The Mythical Man-Month. 00:22:56.640 |
I think a lot of people are familiar with this classic work. 00:22:58.680 |
It talks about the classic fallacies with respect to developer productivity that a lot of companies fall into. 00:23:04.860 |
One question I would pose to all of you is, have the fundamentals really changed? 00:23:08.540 |
You know, as we have more quote-unquote AI developers, should we measure them by the same standards? 00:23:15.540 |
And I would -- the challenge question I would pose to all of you is, you know, I think the 00:23:20.860 |
lesson from The Mythical Man-Month is you prefer to have one very smart engineer who's highly 00:23:25.440 |
productive, over 10, maybe even 100 mediocre developers who are kind of productive but need a lot of coordination overhead. 00:23:32.700 |
And so as AI automation increases, the question is, do you want 100 mediocre AI developers or do 00:23:38.920 |
you want a 100x lever for the human developers, the really smart people who are going to craft great software? 00:23:50.140 |
If you want to check out Cody, that's the URL, and that's my contact info if you want to reach out. 00:23:58.180 |
We have time for perhaps one or two questions. 00:24:01.180 |
I will run around with the microphone if anyone has a question, just so we can catch it on audio. 00:24:15.180 |
You've got really good insight into what's happening now with software development. 00:24:20.180 |
Are we going to get away from coding, where code becomes boilerplate and we move 00:24:26.640 |
to metaprogramming, or will code still be the main output of the senior engineer? 00:24:37.220 |
The usage patterns that we're observing now are that more and more code is being written through AI. 00:24:43.220 |
I had an experience just on Friday actually where Cody wrote 80% of the code because I just 00:24:48.480 |
kept asking it to write different functions that I wanted. 00:24:51.820 |
And that was nice because I was thinking at the function level rather than the line-by-line level. 00:24:57.440 |
And it was nice because it allowed me to stay in flow. 00:24:59.420 |
But at the same time, the output was still code. 00:25:02.260 |
And I really do think that we're never going to see a full replacement of code by natural language 00:25:06.220 |
because code is nice because you can describe what you want very precisely. 00:25:10.380 |
And that precision is important as a source of truth for what the software actually does.