From Pilot to Platform: Agentic Developer Products with LangGraph

00:00:00.000 |
All right. Hello, everyone. Thanks for being here and joining us on this nice Wednesday afternoon. 00:00:18.000 |
My name is Matas Ristanis, and this is my colleague. 00:00:23.000 |
And today we're going to present how we built AI developer tools at Uber using LangGraph. 00:00:29.000 |
So to start off a little bit of context, Uber is a massive company serving 33 million trips a day across 15,000 cities. 00:00:36.000 |
This is enabled by a massive code base with hundreds of millions of lines of code. 00:00:41.000 |
And it is our job, the job of developer platform, to make sure that code keeps churning smoothly. 00:00:48.000 |
Now, all you really need to know is that we have 5,000 hard-to-please developers that we have to keep happy. 00:00:58.000 |
To accomplish that, we built out a large corpus of dev tools for our engineers. 00:01:03.000 |
And today we'll present a few of them to you and some of the key insights that we found out while building them. 00:01:13.000 |
All right. So we'll dive right in by talking about the 10,000-foot view of the AI developer tools landscape at Uber. 00:01:21.000 |
As part of that, we'll highlight a couple of products we've built. 00:01:24.000 |
We'll actually show you what the user experience is like. 00:01:26.000 |
And then we'll tell you what are the reusable tools and agents that power them. 00:01:31.000 |
After that, you know, we can only focus on a couple. 00:01:34.000 |
But we'll quickly blow through a couple more products we've built, just to show you how this has proliferated all through Uber. 00:01:40.000 |
And finally, we'll just, you know, tell you what we learned and hopefully there's something reusable there for you. 00:01:49.000 |
So our AI DevTools strategy at Uber is built primarily on three pillars, right? 00:01:54.000 |
The first one is these products, or bets, that we have to take. 00:01:58.000 |
So we've picked things that directly improve developer workflow. 00:02:02.000 |
So these are things that our developers perform today. 00:02:08.000 |
It is things like writing tests and reviewing code, which can be laborious. 00:02:11.000 |
And we're like, okay, how do we make this better? 00:02:19.000 |
It's based on, you know, where we think we can make the most impact, but we're also always learning. 00:02:25.000 |
We see what everyone else is up to and see what else we can target. 00:02:29.000 |
The second pillar of our strategy is we've got to build the right, what we call like cross-cutting primitives. 00:02:35.000 |
These are foundational AI technologies that show up in pretty much, you know, all of our solutions; you'll probably feel it too. 00:02:42.000 |
And having the right, you know, abstractions in place, the right frameworks, the right tooling helps us build more solutions and build them faster. 00:02:52.000 |
And lastly, what I'd say is probably the cornerstone of this strategy is what we call intentional tech transfer. 00:03:03.000 |
But we do stop and be deliberate about, hey, what here is reusable? 00:03:08.000 |
What can be spun out into something that reduces the barrier for the next problem we want to solve? 00:03:13.000 |
And so LangFX is the opinionated framework we built that wraps LangGraph and LangChain and makes them work better with Uber systems. 00:03:26.000 |
We had the first couple of products emerge and they wanted to solve problems in an agentic manner. 00:03:32.000 |
They wanted to build reusable nodes, and LangGraph was the perfect fit to do it, because we saw it proliferating across the organization. 00:03:51.000 |
So the first product we'll showcase today is called Validator. 00:03:55.000 |
Now, what it is is an in-IDE experience that automatically flags best-practice violations and security issues in engineers' code. 00:04:03.000 |
So it is effectively a LangGraph agent that we built a nice IDE UX around. 00:04:09.000 |
And, you know, let's take a look at how it works. 00:04:11.000 |
So we have a screenshot here that shows a user opening a Go file. 00:04:18.000 |
And what they have there is they're notified of a violation in this case. 00:04:21.000 |
So they have a little bit of a diagnostic that they can mouse over. 00:04:24.000 |
And they got a nice modal saying, hey, in this case, you're using the incorrect method to create a temporary test file. 00:04:35.000 |
You want to have them automatically cleaned up for you. 00:04:42.000 |
They can apply a pre-computed fix that we have prepared for them in the background. 00:04:46.000 |
Or, if they prefer, they can ship off the fix to their IDE agentic assistant. 00:04:51.000 |
So that's what we have in the next slide, actually: the fix request has been shipped off. 00:05:04.000 |
Here are some of the key ideas that we found while building this. 00:05:09.000 |
The main thing is that the agent abstraction allows us to actually compose multiple sub-agents 00:05:14.000 |
under a central validator agent for now, for example. 00:05:17.000 |
So we have, for instance, a sub-agent under Validator that calls into the LLM with a list of practices and sort of gets those points of feedback returned. 00:05:29.000 |
But there's also a deterministic bit where, for example, we want to discover lint issues from static linters. 00:05:35.000 |
So there's nothing stopping us from running a lint tool and then passing the findings on through the rest of the graph, which allows us to, you know, pre-compute a fix even for those. 00:05:44.000 |
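To make that composition concrete, here is a minimal LangGraph sketch of the pattern: a deterministic lint node and an LLM-backed best-practices node run in parallel, and their findings are joined before a fix pre-computation step. The node names, state fields, and helper stubs are illustrative assumptions, not Uber's internal code.

```python
import operator
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, START, END


class ValidatorState(TypedDict):
    file_path: str
    source: str
    findings: Annotated[list[dict], operator.add]  # merged from all sub-agents
    fixes: list[dict]


def run_linter(path: str) -> list[dict]:
    """Stand-in for a real static-linter invocation (e.g. a subprocess call)."""
    return [{"rule": "lint/unused-var", "line": 12}]


def llm_check(source: str) -> list[dict]:
    """Stand-in for an LLM call that reviews code against curated best practices."""
    return [{"rule": "best-practice/temp-test-file", "line": 7}]


def lint_node(state: ValidatorState) -> dict:
    # Deterministic sub-agent: static analysis, no LLM involved.
    return {"findings": run_linter(state["file_path"])}


def best_practices_node(state: ValidatorState) -> dict:
    # LLM sub-agent: checks the source against the curated practice list.
    return {"findings": llm_check(state["source"])}


def precompute_fix_node(state: ValidatorState) -> dict:
    # Prepare a fix for every finding so the IDE can offer it instantly later.
    return {"fixes": [{**f, "patch": "..."} for f in state["findings"]]}


graph = StateGraph(ValidatorState)
graph.add_node("lint", lint_node)
graph.add_node("best_practices", best_practices_node)
graph.add_node("precompute_fix", precompute_fix_node)
graph.add_edge(START, "lint")
graph.add_edge(START, "best_practices")
# Join: fixes are computed only after both sub-agents have reported their findings.
graph.add_edge(["lint", "best_practices"], "precompute_fix")
graph.add_edge("precompute_fix", END)
validator = graph.compile()

print(validator.invoke({"file_path": "uploader.go", "source": "...", "findings": [], "fixes": []}))
```

The point is that deterministic and LLM-driven sub-agents write into the same state, so everything downstream treats their findings uniformly.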
And in terms of impact, you know, we're seeing thousands of fix interactions a day from satisfied engineers who fix problems in their code before they come back later to bite them. 00:05:56.000 |
And I think, you know, we think we've built a compelling experience here, right? 00:05:59.000 |
We've met developers where they are in the IDE. 00:06:04.000 |
It can combine, you know, deterministic capabilities; for example, we use AST parsing tools. 00:06:09.000 |
We find out where each of the test boundaries lie. 00:06:12.000 |
We're able to evaluate each one of these against a set of curated best practices, flag up violations, figure out the most expressive way to deliver this back to the user, show it in the IDE, and give them a way of applying fixes. 00:06:33.000 |
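Uber's tooling does this over Go with AST parsers; as a rough illustration of the idea only, here is a small Python sketch that walks a syntax tree and records the boundaries of each test function so each one can be evaluated against the curated practices separately.

```python
import ast


def test_boundaries(source: str) -> list[tuple[str, int, int]]:
    """Return (test name, first line, last line) for each test function in a file."""
    tree = ast.parse(source)
    spans = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test_"):
            spans.append((node.name, node.lineno, node.end_lineno))
    return spans


example = """
def test_upload_succeeds():
    assert upload("a.txt")

def helper():
    pass

def test_upload_rejects_empty_file():
    assert not upload("")
"""
print(test_boundaries(example))
# [('test_upload_succeeds', 2, 3), ('test_upload_rejects_empty_file', 8, 9)]
```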
Let's help engineers by authoring their tests from the get-go. 00:06:37.000 |
Now, you know, the second tool we're showing off here is called AutoCover. 00:06:40.000 |
And it is a tool to help engineers build -- or generate, rather -- tests that build, pass, raise coverage, test business cases, and, you know, are validated and mutation-tested. 00:06:51.000 |
So, like, really high-quality tests is what we're shooting for here. 00:06:57.000 |
They want to get their tests quickly and move on to the next business feature that they want to implement. 00:07:00.000 |
So the way we got to this is actually we took a bunch of domain expert agents. 00:07:06.000 |
We actually threw Validator in there as well, and more on that later. 00:07:09.000 |
And then we arrive at a test generation tool. 00:07:14.000 |
We have a screenshot of, you know, a Go source file, as an example. 00:07:19.000 |
And the user can, you know, invoke AutoCover in multiple ways. 00:07:22.000 |
If they want to invoke it for the whole file and sort of bulk generate, they can do a right-click, as shown in the screenshot, and then just invoke it. 00:07:28.000 |
And then once the user clicks the button, what happens next is a whole bunch of stuff happens in the background. 00:07:33.000 |
So we start with adding a new target to the build system. 00:07:39.000 |
We run an initial coverage check to get a sort of a target space for us to operate on. 00:07:43.000 |
All while that is being done, we also analyze the surrounding source to get the business context out, so that we know what to test against. 00:07:52.000 |
And what the user sees really is just they get switched to an empty test file in this case. 00:07:57.000 |
And then because we did all that stuff in the background, we're starting to already generate tests. 00:08:02.000 |
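As a hedged sketch of that flow, the three prep steps are independent, so they can run concurrently while the editor switches the user to the empty test file; the function names below are illustrative stand-ins, not Uber's APIs.

```python
import asyncio


async def add_build_target(pkg: str) -> None:
    """Register a new test target with the build system (stub)."""
    await asyncio.sleep(0.1)


async def baseline_coverage(pkg: str) -> float:
    """Run an initial coverage check to establish the target space (stub)."""
    await asyncio.sleep(0.2)
    return 0.42


async def extract_business_context(src_path: str) -> str:
    """Analyze the surrounding source for the business behavior to test (stub)."""
    await asyncio.sleep(0.1)
    return "upload handler: validates size, retries on transient errors"


async def prepare(pkg: str, src_path: str):
    # All three prep steps run concurrently, so generation can start almost as
    # soon as the user lands on the empty test file.
    return await asyncio.gather(
        add_build_target(pkg),
        baseline_coverage(pkg),
        extract_business_context(src_path),
    )


print(asyncio.run(prepare("//src/uploader", "uploader.go")))
```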
And what the user will see is, they'll see a stream of tests come in. 00:08:16.000 |
Some tests might get removed because they're redundant. 00:08:18.000 |
You might see benchmark tests, like concurrency tests, come in later. 00:08:22.000 |
And so, you know, the user is sort of watching this experience. 00:08:26.000 |
And then at the end, they arrive at a nice set of validated, vetted tests. 00:08:36.000 |
Let's dive a bit deeper into the graph here to see how it actually functions. 00:08:41.000 |
On the bottom right, you can actually see Validator, which is the same agent that we just talked about. 00:08:48.000 |
So you can already see some of the composability learnings that we found useful. 00:08:56.000 |
We looked at the sort of heuristics that an engineer would use while writing tests. 00:09:02.000 |
And so, for example, you want to prepare your test environment. 00:09:05.000 |
You want to think about which business cases to test. 00:09:10.000 |
And then you want to think up new test cases, whether it be for extending existing tests or creating new ones. 00:09:18.000 |
And then you want to run your builds, your tests. 00:09:20.000 |
And then if those are passing, you want to run a coverage check to see what you missed. 00:09:25.000 |
And so we go on to complete the graph this way. 00:09:29.000 |
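Here is a rough, hypothetical sketch of that loop as a LangGraph graph: prepare the environment, generate candidate tests, build and run them, measure coverage, and loop until a target is hit. The real graph has more nodes (including Validator embedded as a sub-agent), and the node bodies below are stubs.

```python
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class AutoCoverState(TypedDict):
    source: str
    tests: str
    coverage: float
    target: float


def prepare_environment(state: AutoCoverState) -> dict:
    # Build target, baseline coverage, business context (stubbed).
    return {"tests": "", "coverage": 0.0}


def generate_tests(state: AutoCoverState) -> dict:
    # Think up new cases, extending existing tests or adding new ones (stubbed).
    return {"tests": state["tests"] + "\n// another generated test"}


def build_and_run(state: AutoCoverState) -> dict:
    # Build and execute the candidate tests; drop the ones that fail (stubbed).
    return {}


def measure_coverage(state: AutoCoverState) -> dict:
    # Re-run coverage to see what is still missed (stubbed).
    return {"coverage": min(state["coverage"] + 0.3, 1.0)}


def done_or_loop(state: AutoCoverState) -> str:
    return "done" if state["coverage"] >= state["target"] else "loop"


g = StateGraph(AutoCoverState)
g.add_node("prepare_environment", prepare_environment)
g.add_node("generate_tests", generate_tests)
g.add_node("build_and_run", build_and_run)
g.add_node("measure_coverage", measure_coverage)
g.add_edge(START, "prepare_environment")
g.add_edge("prepare_environment", "generate_tests")
g.add_edge("generate_tests", "build_and_run")
g.add_edge("build_and_run", "measure_coverage")
g.add_conditional_edges("measure_coverage", done_or_loop, {"loop": "generate_tests", "done": END})
autocover = g.compile()

print(autocover.invoke({"source": "...", "tests": "", "coverage": 0.0, "target": 0.5}))
```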
And then because we no longer have a human involved, we can actually supercharge the graph 00:09:33.000 |
and sort of juice it up so that we can do 100 iterations of code generation at the same 00:09:38.000 |
time, and then 100 executions at the same time. 00:09:41.000 |
We've seen, you know, for a sufficiently large source file, you can do that. 00:09:44.000 |
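One way to express that kind of fan-out in LangGraph is the Send API: a planning node emits one task per candidate case, each task runs as its own parallel branch, and a reducer merges the results back into shared state. This is a minimal illustrative sketch, not the production graph.

```python
import operator
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.types import Send


class FanoutState(TypedDict):
    cases: list[str]
    tests: Annotated[list[str], operator.add]  # results from all branches are appended


def plan(state: FanoutState) -> dict:
    # In the real graph this node reasons about the source to pick business cases.
    return {"cases": ["EmptyInput", "Timeout", "HappyPath"]}


def fan_out(state: FanoutState):
    # Launch one generation branch per planned case -- effectively "100 at a time".
    return [Send("generate_one", {"case": c}) for c in state["cases"]]


def generate_one(task: dict) -> dict:
    # Each branch would call the LLM and/or execute a test in isolation (stubbed).
    return {"tests": [f"func Test{task['case']}(t *testing.T) {{ /* ... */ }}"]}


g = StateGraph(FanoutState)
g.add_node("plan", plan)
g.add_node("generate_one", generate_one)
g.add_edge(START, "plan")
g.add_conditional_edges("plan", fan_out, ["generate_one"])
g.add_edge("generate_one", END)
fanout = g.compile()

print(fanout.invoke({"cases": [], "tests": []}))
```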
And that's sort of where our key learning comes in is we found that having these super-capable 00:09:50.000 |
domain expert agents gives us unparalleled performance, sort of exceptional performance. 00:09:57.000 |
So we benchmarked it against, you know, the industry agentic coding tools that are available 00:10:02.000 |
And we get about two to three times more coverage in about half the time compared to them 00:10:08.000 |
because of the speed-ups that we did in creating this graph here and sort of the custom 00:10:13.000 |
bespoke knowledge that we built into our agents. 00:10:16.000 |
And in terms of impact, we have -- the tool has helped raise developer platform coverage 00:10:24.000 |
So that maps to about 21,000 dev hours saved, which we're super happy about. 00:10:28.000 |
And we're seeing continued use of thousands of tests generated monthly. 00:10:36.000 |
Yeah, so we didn't want to stop at 5,000 tests a week. 00:10:40.000 |
Just wanted to give you a sneak peek of what else we've been able to do in the organization with this. 00:10:45.000 |
So what you see on screen right now is our Uber Assistant Builder. 00:10:48.000 |
Think of it like our internal custom GPT store where you can build chatbots that are, you know, steeped in Uber knowledge. 00:10:54.000 |
So, like, one of them you see on the screen is the Security Scorebot. 00:10:57.000 |
And it has access to some of the same tools that we showcased earlier. 00:11:01.000 |
So it knows; it's steeped in Uber's best practices. 00:11:05.000 |
So even before I get to the point of I'm in my IDE writing code, I can ask questions about architecture and figure out whether my implementation is secure or not. 00:11:19.000 |
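As a sketch of the "chatbot with access to the same reusable tools" idea, LangGraph's prebuilt ReAct agent can be handed tool functions directly; the tool bodies and model identifier below are placeholders rather than Uber's internal services.

```python
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent


@tool
def check_best_practices(code: str) -> str:
    """Check a code snippet against curated security and style practices."""
    return "no violations found"  # stand-in for the shared Validator tooling


@tool
def search_security_guidelines(query: str) -> str:
    """Look up internal security guidance relevant to the question."""
    return "never hard-code credentials; use the internal secrets service"


# Recent LangGraph versions accept a model id string or a chat-model instance here;
# "openai:gpt-4o" is a placeholder, not the model the internal bots actually use.
scorebot = create_react_agent(
    "openai:gpt-4o",
    tools=[check_best_practices, search_security_guidelines],
)

result = scorebot.invoke(
    {"messages": [("user", "Is it OK to store the API key in a config file?")]}
)
print(result["messages"][-1].content)
```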
Another one is for Picasso, which is our internal workflow management platform. 00:11:30.000 |
And it can give you feedback grounded in product truth, like, aware of what the product does. 00:11:36.000 |
Third thing I want to show you, and this is not an exhaustive list, right, is our tool called uReview. 00:11:45.000 |
We tried to flag anti-patterns earlier in the process. 00:11:48.000 |
But sometimes things still slip through the cracks. 00:11:51.000 |
You know, why not reinforce and make sure quality is enforced before, you know, code gets landed, before your PR gets merged. 00:11:58.000 |
So, again, powered with some of the same tools that you saw earlier that power, like, Validator and Test Generator, 00:12:04.000 |
we're able to flag, you know, both code review comments and code suggestions that developers can apply during review time. 00:12:12.000 |
I think with that, we'll just jump over to the learnings. 00:12:16.000 |
So, in terms of the learnings, we already sort of talked about this. 00:12:20.000 |
But we found that building domain expert agents that are super capable is actually the way to go to get outsized results. 00:12:32.000 |
And then, you know, the end result is much better. 00:12:35.000 |
So, an example that I already talked about is the executor agent. 00:12:39.000 |
So, we were able to finagle our build system to allow us to execute 100 tests on the same test file without colliding. 00:12:48.000 |
That's an example of a domain expert that's super capable and gives us that performance that we want. 00:12:53.000 |
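A hedged illustration of that isolation idea: give each candidate run its own scratch workspace and a bounded concurrency slot, so many executions of the same test file cannot collide. The build invocation below is a placeholder for the real build system.

```python
import asyncio
import shutil
import tempfile
from pathlib import Path


async def run_one(candidate_id: int, test_source: str, sem: asyncio.Semaphore) -> bool:
    async with sem:
        # Each candidate gets its own throwaway workspace, so identical test files
        # never step on each other's build outputs.
        workspace = Path(tempfile.mkdtemp(prefix=f"exec-{candidate_id}-"))
        try:
            (workspace / "candidate_test.go").write_text(test_source)
            # Placeholder for invoking the build system against this workspace.
            proc = await asyncio.create_subprocess_exec("true", cwd=workspace)
            return await proc.wait() == 0
        finally:
            shutil.rmtree(workspace, ignore_errors=True)


async def run_all(candidates: list[str]) -> list[bool]:
    sem = asyncio.Semaphore(100)  # "100 executions at the same time"
    return await asyncio.gather(
        *(run_one(i, src, sem) for i, src in enumerate(candidates))
    )


print(asyncio.run(run_all(["func TestA(t *testing.T) {}"] * 5)))
```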
Secondly, we found that when possible, composing agents with deterministic sub-agents, or just making the whole agent deterministic, 00:13:01.000 |
makes a lot of sense if you can solve the problem in a deterministic way. 00:13:05.000 |
So, you know, one example of that was the lint agent under Validator. 00:13:09.000 |
We want to have reliable output, and if we have a deterministic tool that can give us that intelligence, 00:13:17.000 |
we can have that reliable output, pass the learnings on to the rest of the graph, and have the issues fixed. 00:13:23.000 |
And then third, we found that we can scale up our dev efforts quite a bit by solving a bounded problem, 00:13:29.000 |
by creating an agent, and then reusing it in multiple applications. 00:13:32.000 |
So you already saw it with Validator: the standalone experience, and Validator within AutoCover for test generation validation. 00:13:38.000 |
But I'm going to give you one more lower-level example, and that's the build system agent. 00:13:42.000 |
That's actually used through both of those products. 00:13:45.000 |
That's an even lower-level abstraction that is required for us to be able to, you know, 00:13:50.000 |
have the agents be able to, like, execute builds and, like, execute tests in our build system. 00:13:55.000 |
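In LangGraph terms, that kind of reuse can look like compiling a small build-system graph once and adding it as a node inside each parent graph that needs it; the state fields and stubbed build step below are made up for illustration.

```python
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class BuildState(TypedDict):
    target: str
    build_ok: bool


def run_build(state: BuildState) -> dict:
    # Placeholder for driving the real build system: registering targets,
    # executing builds, running tests.
    return {"build_ok": True}


build_graph = StateGraph(BuildState)
build_graph.add_node("run_build", run_build)
build_graph.add_edge(START, "run_build")
build_graph.add_edge("run_build", END)
build_agent = build_graph.compile()  # solve the bounded problem once, reuse everywhere


class ValidatorState(TypedDict):
    target: str
    build_ok: bool
    findings: list[str]


def flag_violations(state: ValidatorState) -> dict:
    return {"findings": [] if state["build_ok"] else ["build is broken"]}


parent = StateGraph(ValidatorState)
parent.add_node("build", build_agent)  # the compiled subgraph is just another node
parent.add_node("flag_violations", flag_violations)
parent.add_edge(START, "build")
parent.add_edge("build", "flag_violations")
parent.add_edge("flag_violations", END)
validator = parent.compile()
# An AutoCover-style parent graph would embed the exact same compiled build_agent node.

print(validator.invoke({"target": "//src/uploader", "build_ok": False, "findings": []}))
```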
So, Sourabh, take us through some of the strategic learnings now. 00:13:59.000 |
Sourabh talked us through some of the tech benefits, but this is the one I'm probably most excited to share. 00:14:04.000 |
Like, you can set up your organization for success if you want to build agentic AI. 00:14:09.000 |
And I think we've done a pretty good job of it at Uber. 00:14:14.000 |
We're all building in collaboration, and I think these are our biggest takeaways. 00:14:18.000 |
The first being just, you know, that encapsulation boosts collaboration. 00:14:23.000 |
When there are well-thought-out abstractions, like LangGraph, and there are opinions on how to do things like handle state management, how to deal with concurrency, 00:14:34.000 |
it really allows us to scale development horizontally. 00:14:37.000 |
It lets us tackle more problems and more complex problems without creating this operational bottleneck, right? 00:14:45.000 |
An example I'll give you is our security team was able to write rules for Validator, like the product we showcased earlier. 00:14:52.000 |
It's able to detect security anti-patterns, but the security team knew nothing about-- 00:14:56.000 |
Well, this part of the security team knew nothing about AI agents and how the graph was constructed, 00:15:00.000 |
but they were still able to add value and improve the lives of our developers. 00:15:04.000 |
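One way to picture that encapsulation: contributed rules are plain data, and the node that renders them into the LLM prompt never has to change when a new team adds rules. The rules and rendering below are invented for illustration.

```python
RULES = [
    {
        "id": "sec-001",
        "owner": "security",
        "description": "Never log raw authentication tokens.",
    },
    {
        "id": "test-007",
        "owner": "devplatform",
        "description": "Create temporary test files with helpers that clean up automatically.",
    },
]


def build_prompt(source: str, rules: list[dict]) -> str:
    """Render contributed rules into the prompt used by the best-practices node."""
    rendered = "\n".join(f"- [{r['id']}] {r['description']}" for r in rules)
    return (
        "Review the following code against these practices and report violations "
        f"by rule id:\n{rendered}\n\nCode:\n{source}"
    )


print(build_prompt('log.Printf("token=%s", token)', RULES))
```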
And so, like, a natural segue from that is if you're able to encapsulate, you know, work into these well-defined nodes, 00:15:12.000 |
then, like, graphs are the next thing you think about, right? 00:15:15.000 |
Like, graphs help us model these interactions perfectly. 00:15:19.000 |
They oftentimes mirror how developers already interact with the system. 00:15:24.000 |
So, when we do the classic process engineering and identify process bottlenecks and inefficiencies, 00:15:31.000 |
it doesn't just help accelerate or boost the AI workloads. 00:15:34.000 |
It also helps improve the experience for people not even interacting with the AI tools, right? 00:15:39.000 |
So, it's not, like, an arms race either, of should we build agentic systems or should we improve our existing systems? 00:15:45.000 |
It usually segues into, like, helping each other. 00:15:48.000 |
Like, just, you know, we spoke about our agentic test generation, 00:15:51.000 |
and we found multiple inefficiencies through, like, how do you do mock generation quickly? 00:15:57.000 |
How do you modify build files, invoke, like, interact with the build system? 00:16:04.000 |
And in the process of, like, fixing all these paper cuts, 00:16:09.000 |
we improved the experience for just, like, non-agentic applications, 00:16:13.000 |
just for developers interacting directly with our systems. 00:16:18.000 |
And, you know, with that, I want to bring this talk to an end. 00:16:26.000 |
Hopefully, you all learned something and you'll take something back to your companies.