The ROI of AI: Why you need Eval Framework

00:00:00.120 | I want to introduce our first speaker, Byung Liu, CTO of Sourcegraph, and I won't spoil

00:00:17.560 | the topic of the talk, but it's going to be a good one.

00:00:19.180 | Thank you very much.

00:00:20.180 | Awesome.

00:00:21.180 | Thank you, Peter.

00:00:22.180 | How's everyone doing this morning?

00:00:24.280 | Good.

00:00:25.280 | Yeah?

00:00:26.280 | Everyone awake?

00:00:27.280 | Bright and early?

00:00:28.280 | Thanks for coming out.

00:00:29.280 | Almost awake.

00:00:30.280 | Awesome.

00:00:31.280 | Before I dive into the talk here, I just want to get a sense of who we all have in the room.

00:00:37.040 | Who here is like a head of engineering or a VP of engineering?

00:00:41.280 | Okay.

00:00:42.280 | A good number of you.

00:00:43.640 | Who here is just an IC dev interested in evaluating how things are going?

00:00:51.600 | And then what do the rest of you do?

00:00:54.060 | Just shout out your roles.

00:00:56.060 | Anyone?

00:00:57.060 | Anyone?

00:00:58.060 | Middle management.

00:00:59.060 | Middle management.

00:01:00.060 | Okay.

00:01:01.060 | Cool.

00:01:02.060 | And who here has a really quantitative, very precise thought through evaluation framework

00:01:07.560 | for measuring the ROI of AI tools?

00:01:10.060 | Okay.

00:01:11.060 | One hand in the back.

00:01:12.060 | Okay.

00:01:13.060 | One hand in the back.

00:01:14.060 | What company are you from, sir?

00:01:15.060 | Broad signals.

00:01:16.060 | Broad signals.

00:01:17.060 | Okay.

00:01:18.060 | Cool.

00:01:19.060 | So we've got one person in the back.

00:01:20.060 | And then who here is sort of like we kind of are evaluating it, but it's really kind of like

00:01:25.060 | vibes at this point.

00:01:26.060 | Anyone?

00:01:27.060 | Anyone brave enough to -- okay.

00:01:28.060 | Cool.

00:01:29.060 | So you're in the right place.

00:01:30.060 | So you're in the right place.

00:01:31.060 | Now, who am I?

00:01:32.060 | Why am I qualified to talk on this topic?

00:01:34.060 | So I'm the CTO and co-founder of a company called Sourcegraph.

00:01:37.060 | We're a developer tools company.

00:01:38.560 | If you haven't heard of us, we have two products.

00:01:41.560 | One is a code search engine.

00:01:42.560 | So we started the company because we were developers ourselves, my co-founder and I, and we were

00:01:49.000 | really tired of the slog of diving through large complex code bases and trying to make sense

00:01:53.500 | of what was happening in them.

00:01:55.480 | The other product that we have is an AI coding assistant.

00:01:59.340 | So what this product does is it's essentially -- oh, sorry.

00:02:05.100 | I should mention that we have great adoption among really great companies.

00:02:08.380 | So we started this company ten years ago to solve this problem of tackling understanding

00:02:12.560 | code in large code bases.

00:02:14.320 | And today we're very fortunate to have, you know, customers that range from early stage startups

00:02:20.500 | all the way through to the Fortune 500 and even some government agencies.

00:02:25.300 | So what's the tie-in to AI ROI?

00:02:28.180 | So about two years ago, we released a new product called Codi, which is an AI coding assistant that

00:02:33.920 | ties into our code search engine.

00:02:36.280 | So you can kind of think of it as, like, a perplexity for code, whereas, you know, your

00:02:40.620 | kind of vanilla run-of-the-mill AI coding assistant only uses very local context, only has access

00:02:47.140 | to kind of, like, the open files in your editor.

00:02:49.860 | We spent the past ten years building this great code search engine that is really good at servicing

00:02:54.840 | relevant code snippets from across your code base.

00:02:57.320 | And it just so turns out that, like, that's a great superpower for AI to have, right?

00:03:01.100 | You know, for those of us that have started using perplexity, we can kind of see the appeal.

00:03:05.440 | And a big piece of the puzzle is not just the language model itself, but the ability to fetch

00:03:10.460 | and rank relevant context from, you know, a whole universe of data and information.

00:03:15.540 | So we had to solve this problem in order to sell Codi into the likes of 1Password, Palo Alto

00:03:24.200 | Networks, and Leidos, which is a big government contractor.

00:03:27.900 | If you flew in here from somewhere else, you probably entered through one of their security

00:03:31.880 | machines.

00:03:32.880 | We sold Codi to all these organizations.

00:03:34.880 | And so each one of these organizations has kind of, like, a different way of measuring

00:03:39.220 | ROI.

00:03:40.220 | They have different frameworks that they apply.

00:03:42.560 | And so we had to answer that question in a multitude of ways.

00:03:46.760 | And that's what I'm here to talk about.

00:03:48.440 | Okay.

00:03:49.440 | So I want to start out with how I would describe the value prop of AI to someone who's actually

00:03:55.780 | using it, to a developer.

00:03:58.400 | So coding without AI, I think, you know, we've all felt this before, if you've ever written

00:04:02.860 | a line of code, you start out by asking yourself, like, oh, this task, it's straightforward.

00:04:08.680 | It should be easy.

00:04:09.680 | Let me just go build this feature.

00:04:11.440 | I should mention, I poached these slides from a director of engineering at Palo Alto Networks.

00:04:18.580 | He's actually giving another talk at this conference, Gunjan Patel.

00:04:21.540 | I thought he did an excellent job of describing the value prop that he was solving for as a director

00:04:25.960 | of engineering when they were purchasing a coding AI system.

00:04:29.740 | So this is the way he described it.

00:04:32.060 | You think it should be straightforward.

00:04:33.720 | But then there's all these side quests that you end up going on as a developer.

00:04:37.880 | It's like, uh-oh.

00:04:38.880 | Like, I got to go install a bunch of dependencies.

00:04:40.940 | Or maybe, you know, there is this UI component that I have to go and figure out, this framework

00:04:45.500 | that I need to learn.

00:04:46.800 | So the gaps appear.

00:04:48.280 | And then without AI, bridging the gaps takes both time and focus.

00:04:53.760 | You kind of have to, you know, spin off a side process or go on a little mini quest.

00:04:57.580 | And then, you know, 30 minutes and two cups of coffee later, you're like, okay, I got it.

00:05:02.400 | I filled the gap.

00:05:03.400 | But what was I doing again?

00:05:06.060 | And so AI helps bridge those gaps.

00:05:07.960 | It helps solve this problem.

00:05:09.220 | It helps developers really stay in flow and stay kind of, like, cognizant of the high level

00:05:15.280 | of what they're trying to accomplish.

00:05:18.100 | And so what this means is that more and more developers can actually do the thing that we

00:05:21.280 | want to do, which is build an amazing feature and deliver an amazing experience, instead of

00:05:25.460 | giving up.

00:05:26.640 | Now the question is, how do we measure this in a way that we can demonstrate the business

00:05:32.180 | impact of what this does to the rest of the organization?

00:05:36.700 | And so the answer is, beans.

00:05:39.720 | Okay.

00:05:40.880 | Who here drinks coffee?

00:05:42.660 | Okay.

00:05:43.660 | Cool.

00:05:44.660 | Do you know the difference between a really great coffee bean and, you know, your kind

00:05:49.380 | of run-of-the-mill Folgers or, you know, the thing that you buy at the supermarket?

00:05:53.900 | Okay.

00:05:54.900 | It turns out we're all in the bean business.

00:05:59.000 | We think we're in the software business, but we're really selling beans in a way.

00:06:02.800 | So in every company, there is someone, let's call him Bob, who grows the beans, essentially

00:06:07.720 | the developer.

00:06:08.720 | There's another person, let's call him Pat, who sells the beans.

00:06:13.260 | That's your kind of CRO or sales lead.

00:06:15.400 | And then you have Alice on the side, who's kind of your CFO or CEO, and Alice has got

00:06:20.560 | to count the beans.

00:06:21.560 | At the end of the day, you know, not to diss the finance people if there are any in the room,

00:06:26.820 | but it's all about counting the beans and seeing how they add up.

00:06:29.860 | Now, I don't know if any of you have been paying attention, but in the past two years,

00:06:34.420 | the bean business has been revolutionized.

00:06:37.000 | This thing called AI has appeared.

00:06:38.500 | And so what does the bean business look like now?

00:06:42.060 | Well, Bob grows the beans with AI.

00:06:45.800 | And Pat is selling beans, but with AI.

00:06:47.800 | And then Alice is on the side being like, well, I'm counting all the beans, and where's the

00:06:53.140 | ROI?

00:06:54.140 | And so this is the answer that, you know, basically Bob has to answer.

00:06:57.800 | Pat has kind of got it easy because Pat's job is just selling the beans.

00:07:00.800 | It's a much more quantifiable task.

00:07:04.340 | Those of us that are involved in software engineering and product development, it's a bit harder

00:07:08.360 | to measure.

00:07:10.060 | So there's tension in the bean shop, you know.

00:07:12.980 | Alice is asking Bob, you know, how many more beans are we growing now with AI, Bob?

00:07:16.740 | And Bob's like, well, it's complicated, Alice.

00:07:18.560 | Not all beans are the same.

00:07:19.560 | You know, there's some good beans and there's some very bad beans.

00:07:22.860 | We're making our beans better.

00:07:24.020 | And Alice is like, well, okay, the bean AI tool costs money, and we've got to measure

00:07:28.880 | its impact somehow.

00:07:32.180 | Anyone feel that tension?

00:07:33.960 | Anyone have this kind of, like, conversation?

00:07:36.560 | We've talked to a lot of heads of engineering who see this tension very real with other parts

00:07:40.460 | of the org, specifically between finance and engineering.

00:07:45.380 | And I think the core of the problem is that measuring AI ROI for functions where the work

00:07:50.760 | is not directly quantifiable through a number is what I like to call MP hard.

00:07:56.120 | So how many people are familiar with the term NP hard here?

00:07:59.380 | Okay.

00:08:00.380 | Cool.

00:08:01.380 | We're all pretty technical.

00:08:02.380 | So MP hard basically means if you have a tough challenge, if you have a problem, and you can

00:08:07.180 | basically reduce it to a class of very difficult problems, it probably means your problem is

00:08:12.480 | not solvable.

00:08:14.380 | And measuring AI ROI reduces to measuring developer productivity or the productivity of whatever class

00:08:20.060 | of knowledge worker that you're managing.

00:08:22.880 | And so that implies if you can measure AI ROI precisely, you can also measure developer productivity.

00:08:28.960 | And who here knows how to measure developer productivity?

00:08:33.320 | It's kind of an open question, right?

00:08:34.880 | So using the logic of your standard reduction proof, this problem is intractable.

00:08:40.480 | So that's the end of my talk.

00:08:42.520 | I'm just here to tell you that this problem is intractable and we should give up, right?

00:08:47.060 | Well, in the real world, we often find tractable solutions to intractable problems.

00:08:52.780 | And so what the meat of this talk is is really sharing a set of evaluation frameworks that

00:08:56.720 | we've presented to different customers.

00:08:59.460 | Not all of these are used by, you know, any given customer.

00:09:03.800 | But I wanted to give kind of like a sampling of the conversations that we've had.

00:09:07.600 | And hopefully there will be some time for Q&A at the end where we can kind of talk through

00:09:12.260 | this and see what other people are doing.

00:09:16.280 | Okay.

00:09:17.280 | So framework number one is the famous roles eliminated framework.

00:09:22.820 | So this question gets asked a lot these days, especially, you know, on social media.

00:09:26.320 | Like, AI is here to take your job.

00:09:28.320 | So how does this framework work?

00:09:30.440 | Well, in the classic framing, you buy the tool, the labor-saving tool.

00:09:34.040 | You observe, you know, you're in the bean business or the widget business.

00:09:38.040 | You observe that this tool yields an X percent increase in your capacity to build widgets.

00:09:43.040 | And then you can cut your workforce to meet the demands for whatever widgets you're selling.

00:09:48.040 | Now, in practice, we have not encountered this framework at all in the realm of software

00:09:53.080 | development.

00:09:54.760 | We do see it more prevalent in other kind of business units, you know, things that are more

00:09:58.960 | viewed as, like, cost centers, like support and things like that, especially, like, consumer-facing

00:10:05.960 | customer support.

00:10:07.600 | But for software engineering, for whatever reason, we haven't encountered this yet in any

00:10:12.000 | of our customers.

00:10:13.100 | And we think that the reason here is that, number one, you know, if you view your org as a widget

00:10:17.920 | factory, then you're going to prioritize outputting widgets.

00:10:22.220 | But the thing is that very few engineering leaders, effective engineering leaders these

00:10:25.700 | days, view themselves as widget builders.

00:10:28.300 | You know, the widgets are kind of an abstraction that don't apply to the craft of software engineering.

00:10:33.840 | The other observation here is that the widgets that we're building, which is software, great

00:10:38.240 | user experiences, they're not really supply-limited.

00:10:40.560 | So if you have, like, an extensive backlog, the question is, you know, if we made your engineers

00:10:45.540 | 20% more productive, would you go and, you know, do more -- 20% more of your backlog, or you

00:10:51.540 | just cut down 20% of your workforce and say, like, you know, it's fine, we don't need to

00:10:55.000 | get to the backlog.

00:10:56.660 | And for 99% of the companies out there that are building software, the answer is no, we want

00:11:01.300 | to build a better user experience.

00:11:02.840 | These issues in our backlog are very important, we just can't get to them.

00:11:07.320 | So framework number one is kind of like talked about frequently, but in practice we haven't

00:11:12.560 | really seen it as an evaluation criteria.

00:11:16.920 | Framework number two is what I like to call A/B testing velocity.

00:11:21.240 | So how this works is you basically segment off your organization into two groups, the test

00:11:27.540 | group and the control group.

00:11:29.240 | And then you say group one gets the tool and group two does not.

00:11:34.300 | And then you go through your standard planning process.

00:11:37.540 | Most people, as part of that planning process, what you do is you go and estimate the time

00:11:41.780 | that it will take to resolve certain issues.

00:11:44.400 | So how long is this feature going to take to build?

00:11:46.800 | How long is it going to take to work through these bug backlogs?

00:11:50.400 | And then because you've divided the groups into two now, you have some notion of, like, how

00:11:55.240 | well you're executing against your timeline, so you basically run these A/B tests.

00:12:00.220 | So Palo Alto Networks, one of our customers, ran something similar to this, and the conclusion

00:12:05.380 | they drew was the approximate timelines got accelerated 20 to 30 percent using Kodi.

00:12:12.400 | And so this is a very kind of, like, rigorous scientific framework.

00:12:16.520 | We see it come up now and then, especially when companies are of a certain size and they're

00:12:21.860 | very thoughtful about this question.

00:12:24.640 | The criticisms about this framework are no two teams are exactly the same, right?

00:12:28.820 | Like, if you lop off your development org, you have, you know, your dev infrastructure

00:12:32.500 | on this side, maybe backend, and then you have frontend teams on this side.

00:12:36.000 | It's hard sometimes to compare these different teams to each other because software development

00:12:41.560 | is very different in different parts of your organization.

00:12:45.260 | There's also confounding factors.

00:12:46.720 | You know, maybe team X, you know, had an important leader or contributor to part.

00:12:52.480 | Maybe team Y, you know, suffered a bout of COVID that, you know, blew through the team or things

00:12:58.720 | like that.

00:12:59.720 | So you have to account for these things when you make your evaluation.

00:13:02.600 | And this framework is also high cost and effort.

00:13:04.280 | You basically have to do the subdivision, you give one group access to the tool, and then

00:13:07.820 | you have to run it for an extended period of time in order to gain enough confidence.

00:13:12.140 | But provided you have the resources and the time to estimate it, we think this is a pretty

00:13:16.240 | good framework for honestly testing the efficacy of an AI tool.

00:13:22.460 | Okay, framework number three, I call this time saved as a function of engagement.

00:13:28.640 | So if you have a productivity tool, using the product should make people more productive,

00:13:34.900 | right?

00:13:35.900 | So if you have a code search engine, the more code searches that people do, that probably

00:13:41.680 | saves them time.

00:13:42.680 | If it didn't save them time, there would be no reason why they would go to the search engine.

00:13:45.960 | And so in this framework, what you do is you basically go look at your product metrics,

00:13:50.740 | and you break down all the different time-saving actions, you identify them, and then you kind

00:13:56.160 | of tag them with an approximate estimate of how much time is saved in each action.

00:14:00.820 | And if you want to be conservative, you can lower bound it.

00:14:04.040 | You could say, like, a code search probably saves me two minutes.

00:14:07.860 | That's maybe like a lower bound, because there's definitely searches where, oh, my gosh, it saved

00:14:12.780 | me like half a day or maybe like a whole week of work because it prevented me from going

00:14:17.240 | down an unproductive rabbit hole.

00:14:20.500 | But you can lower bound it and just say, like, okay, we're going to get a lower bound estimate

00:14:23.840 | on the total amount of time saved.

00:14:26.040 | And then you go and ask your vendor, hey, can you build some analytics and share them with

00:14:30.580 | me?

00:14:31.580 | So this is something that we built for Cody.

00:14:32.580 | You know, very fine-grained analytics to show the admins and leaders in the org exactly what

00:14:39.580 | actions are being taken.

00:14:41.660 | You know, how many explicit invocations, you know, how many chats, how many questions about

00:14:45.660 | the code base, how many inline code generation actions, and those all map to a certain amount

00:14:51.000 | of time saved.

00:14:51.660 | There's one caveat here, which is in products where you have like an implicit trigger, like

00:14:57.800 | an autocomplete, you can't rely purely on engagement because it's not the human opting in to engage

00:15:04.060 | the product each time.

00:15:05.740 | It's sort of like implicitly shown to you.

00:15:08.280 | And so there, you tend to go with more of an acceptance rate criteria.

00:15:12.160 | You don't want to do just raw engagements because then the product could just push a bunch of like

00:15:16.420 | low-quality completions to you, and that would not be a good metric of time saved.

00:15:22.960 | So we have a lot of customers that do this.

00:15:25.080 | They appreciate it.

00:15:26.080 | One of the nice things about this is that because we're lower bounding, it makes the value of

00:15:30.520 | the software very clear.

00:15:32.120 | So a lot of developer tools, us included, I think like Cody is like $9 per month and Sourcegraph

00:15:37.460 | is a little bit more than that, but it's like if you back out the math of how much an hour

00:15:41.280 | of a developer's time is worth, you know, typically it's around like $100 to $200.

00:15:46.280 | It's like if you save, you know, a couple minutes a month, this kind of pays for itself in terms

00:15:51.780 | of productivity.

00:15:54.620 | The criticisms of this framework is, of course, it's a lower bound, so you're not fully assessing

00:15:58.520 | the value.

00:15:59.520 | If you go back to that picture I showed earlier of, you know, the dev journey where you're kind

00:16:02.960 | of bridging the gaps, I think a big value of AI is actually completing tasks that hitherto

00:16:09.780 | or beforehand were just not completed because people got fed up or they got lazy or they

00:16:14.860 | just had other things to do.

00:16:16.780 | So this doesn't capture that.

00:16:18.600 | It doesn't account for the second order effects of the velocity boost, and it's good for kind

00:16:24.600 | of like day to day, like, hey, is this speeding up the team?

00:16:26.600 | But it doesn't capture some of the impact on key initiatives of the company.

00:16:32.040 | So that leads to the fourth evaluation framework.

00:16:35.600 | So a lot of our customers track certain KPIs that they think are correlated with engineering

00:16:40.600 | quality and business impact.

00:16:42.700 | So lines of code generated, does anyone here think lines of code generated is a good metric

00:16:48.020 | of developer productivity?

00:16:49.420 | Okay.

00:16:50.420 | I think in 2024 we can all say that it's not.

00:16:53.960 | We have seen this resurfaced in the context of measuring ROI of AI, because it's almost like

00:16:59.160 | we've forgotten all the lessons that we learned about human developer productivity, and with

00:17:02.780 | AI-generated tools, now it's like, oh, like, you generated, like, you know, hundreds of lines

00:17:09.000 | of code for a developer in a day.

00:17:13.120 | And so we noted that we were actually losing some deals on lines of code generated.

00:17:17.380 | And when we actually went and looked at the product experience, we're like -- at first,

00:17:20.020 | we're like, hey, you know, maybe we should -- there's a product improvement here that we should

00:17:23.240 | be making because people aren't accepting as many lines generated by Cody.

00:17:26.440 | But when we dug into this, we're like, oh, like, you know, the competitor's product, it's

00:17:29.700 | just more aggressively kind of, like, triggering, and that's not, like, the sort of business

00:17:34.200 | that we want to be.

00:17:35.660 | So more and more, we're kind of pushing our customers to not tie to generic metrics or,

00:17:41.320 | like, high-level metrics, but identify certain KPIs that attend to changes that you want to

00:17:46.780 | make in your organization.

00:17:48.680 | So with Leidos, our big kind of government contracting customer, they identified a set of zones or actions

00:17:57.660 | that they felt were really important.

00:18:00.120 | They wanted to reduce the amount of time spent answering questions, spent bugging teammates,

00:18:04.560 | and spend more developer time in these areas that they identified as value-add.

00:18:09.800 | Three things mainly: building features, writing unit tests, and reviewing code.

00:18:14.140 | And so that's what we tracked for their evaluation period.

00:18:19.700 | A fifth framework is impact on key initiatives.

00:18:22.460 | So this is the kind of, like, map your product to OKR's framework.

00:18:27.460 | And so there are a couple of companies where they're in the midst of a big code migration.

00:18:30.720 | Like, they're trying to migrate from, you know, Cobalt to Java, or maybe, you know, from React

00:18:37.120 | to Svelte or the latest JS framework.

00:18:40.020 | And these are kind of, like, top-level goals that the VP of engineering really cares about.

00:18:45.120 | And so if you have a product that accelerates progress towards this, then the ROI is really

00:18:50.560 | just what's the value of bringing that forward by, you know, X number of months or, in some

00:18:54.860 | cases, X number of years, or making it possible at all.

00:18:58.180 | How do you measure how much you pulled it forward?

00:19:02.620 | That's a good question.

00:19:04.120 | So the question was, how do you measure how much you pulled it forward?

00:19:06.400 | It is really a kind of judgment call with the engineering leader at that point.

00:19:11.060 | So by the time they have this conversation with us, they've typically already started

00:19:15.300 | it or have had a few of these under their belt and had to have an idea of the pain.

00:19:19.900 | And then they can assess kind of the shape of the product and the things that we do and estimate

00:19:24.440 | how much quicker it will be done.

00:19:25.700 | We also have case studies demonstrating, like, hey, this thing that used to take, you know,

00:19:29.220 | a year or longer, we squished it into the span of, you know, a couple months.

00:19:33.820 | Okay.

00:19:35.700 | And then the last framework is what I'll call survey.

00:19:37.940 | So this sounds like the least rigorous framework, but I think it's still highly valuable.

00:19:42.180 | Basically it's just run a pilot, have your developers use it, you can compare against another tool,

00:19:47.620 | and then at the end of it, just ask your developers, you know, which one was best.

00:19:52.440 | More and more nowadays, we don't see this in an unbounded fashion.

00:19:56.180 | Like, you know, in the kind of like 2021 ZERP period, people were just like, you know, whatever

00:20:01.160 | makes the developers happy, let's just go buy that.

00:20:04.700 | And in other ways, we see it in more of a bounded fashion, which is, you know, a lot

00:20:08.940 | of orgs, you know, allocate some part of their budget toward investing in developer productivity

00:20:14.940 | and to develop productivity tools.

00:20:17.360 | So that chart over there shows kind of the range of orgs surveyed in a survey run by the

00:20:22.580 | pragmatic engineer newsletter, and it typically ranges somewhere between five and 25%.

00:20:28.640 | So within that budget allocation, you have a certain amount of budget allocated tools.

00:20:32.880 | Then you basically say, subject to that constraint, let me go ask my developers what tools they

00:20:38.820 | want the most.

00:20:40.880 | Okay, so I'm basically out of time.

00:20:44.140 | But hopefully this gave you kind of like a sampling of the flavors of different frameworks that

00:20:47.800 | are involved.

00:20:48.800 | This is something that we've had to work through, through a lot of customers ranging

00:20:51.520 | from very small startups to very large companies in the Fortune 500.

00:20:56.040 | I just want to say, you know, be skeptical of anyone saying P equals NP, of saying they have

00:21:00.880 | a precise way to measure AI ROI or developer productivity.

00:21:04.800 | No framework is perfect.

00:21:06.040 | The most important thing, I think, is to find clear success criteria.

00:21:09.680 | That's something that both you, your internal teams, and stakeholders will appreciate, and also

00:21:13.500 | the vendor, because they know what success looks like.

00:21:17.180 | And then the last kind of final note here is, productivity tools are often bottoms up,

00:21:21.580 | but we've actually found that top-down mandates with AI can sometimes help, because developers

00:21:26.280 | can be a little bit of a skeptical crowd.

00:21:28.580 | But if you believe firmly that this is where the future is going, and that people have to

00:21:32.540 | update their skill set to make productive use of LLMs and AI to code more productively, we've

00:21:37.920 | actually seen success in this case.

00:21:39.960 | Where a CEO basically says, we're adopting Kodi, we're adopting code AI, go figure out

00:21:43.980 | how to use this.

00:21:44.980 | It's not going to be perfect, but it's something that is going to play out over the next decade.

00:21:51.480 | And then, sorry, one final thought.

00:21:54.220 | As we move towards more automation, I like to think of two pictures in mind.

00:21:59.060 | So one picture is the graph on the right, which is kind of like a landscape of code AI tools.

00:22:04.840 | So on the left-hand side, you have the kind of like inline completions, you know, very

00:22:08.860 | basic, completing the next tokens that you're typing.

00:22:10.860 | And then on the far right, you have kind of like the fully automated offline agents.

00:22:15.820 | And we're trying to make all these solutions more reliable, right?

00:22:18.180 | Because like generative AI is sort of inherently unreliable.

00:22:21.620 | We're trying to make it more general and, you know, make it productive in more languages

00:22:27.020 | and more scenarios.

00:22:29.140 | And we actually think that the path there is to go from the left-hand side to the right-hand

00:22:34.580 | side, you know, not jumping straight to the full automation, because that's a very difficult

00:22:38.700 | problem.

00:22:39.700 | So we think the next phase of evolution for us is going from kind of like these inline code

00:22:44.120 | completion scenarios to more of what we call online agents that live in your editor but

00:22:48.540 | can still react to human feedback and guidance.

00:22:52.700 | And then the second picture is, you know, the mythical man month.

00:22:56.640 | I think a lot of people are familiar with this classic work.

00:22:58.680 | It talks about the classic fallacies with respect to developer productivity that a lot of companies

00:23:03.380 | make.

00:23:04.860 | One question I would pose to all of you is, have the fundamentals really changed?

00:23:08.540 | You know, as we have more quote-unquote AI developers, should we measure them by the same

00:23:13.920 | evaluation criteria?

00:23:15.540 | And I would -- the challenge question I would pose to all of you is, you know, I think the

00:23:20.860 | lessons from the mythical man month is you prefer to have one very smart engineer who's highly

00:23:25.440 | productive, over 10, maybe even 100 mediocre developers who are kind of productive but need

00:23:31.400 | a lot of guidance.

00:23:32.700 | And so as AI automation increases, the question is, do you want 100 mediocre AI developers or do

00:23:38.920 | you want 100X lever for the human developers, the really smart people who are going to craft

00:23:43.820 | the user experience?

00:23:47.140 | All right.

00:23:48.140 | That's it for me.

00:23:49.140 | Yeah.

00:23:50.140 | If you want to check out Cody, that's the URL, and that's my contact info if you want to reach

00:23:55.260 | out later.

00:23:56.260 | Do we have time for questions, or?

00:23:58.180 | We have time for perhaps one or two questions.

00:24:00.180 | One or two questions?

00:24:01.180 | I will run around with the microphone if anyone, just so we can catch it on audio.

00:24:05.860 | Cool.

00:24:06.860 | Thanks.

00:24:07.860 | Thanks.

00:24:08.860 | Thanks for Cody.

00:24:10.180 | You are the best guys.

00:24:12.180 | I tried them all.

00:24:13.180 | I'm staying with you.

00:24:14.180 | Thanks.

00:24:15.180 | You've got really good insight what's happening now with software development.

00:24:19.300 | So what is your intuition?

00:24:20.180 | Are we going to get away from coding and the code would become a boilerplate and we move

00:24:26.640 | to metaprogramming or it will be still code as a main output of the senior engineer?

00:24:33.500 | Yeah, that's a really good question.

00:24:37.220 | The usage patterns that we're observing now is more and more code is being written through

00:24:41.540 | natural language prompts.

00:24:43.220 | I had an experience just on Friday actually where Cody wrote 80% of the code because I just

00:24:48.480 | kept asking it to write different functions that I wanted.

00:24:51.820 | And that was nice because I was thinking at the function level rather than the line-by-line

00:24:55.440 | code level.

00:24:56.440 | Yeah.

00:24:57.440 | And it was nice because it allowed me to stay in flow.

00:24:59.420 | But at the same time, the output was still code.

00:25:02.260 | And I really do think that we're never going to see a full replacement of code by natural language

00:25:06.220 | because code is nice because you can describe what you want very precisely.

00:25:10.380 | And that precision is important as a source of truth for what the software actually does.

00:25:14.540 | I think that's probably it now.

00:25:17.260 | Thank you very much.

00:25:18.520 | Thank you.

00:25:19.520 | Thank you.

00:25:20.520 | Thank you.

00:25:21.520 | Thank you.

The ROI of AI: Why you need Eval Framework - Beyang Liu