All right. Hello, everyone. Thanks for being here and joining us on this nice Wednesday afternoon. My name is Matas Ristanis, and this is my colleague. Hey, folks. I'm Saurabh Sherhati. And today we're going to present how we built AI developer tools at Uber using LangGraph. So to start off with a little bit of context: Uber is a massive company serving 33 million trips a day across 15,000 cities.
This is enabled by a massive code base with hundreds of millions of lines of code. And it is our job, the job of developer platform, to make sure that code keeps churning along smoothly. Really, all you need to know is that we have 5,000 hard-to-please developers that we have to keep happy.
So that's not so easy on us. To accomplish that, we built out a large corpus of dev tools for our engineers. Today we'll present a few of them to you, along with some of the key insights we gained while building them. So, Saurabh, take us through the agenda.
All right. So we'll dive right in by talking about the 10,000-foot view of the AI developer tools landscape at Uber. As part of that, we'll highlight a couple of products we've built. We'll actually show you what the user experience is like, and then we'll tell you about the reusable tools and agents that power them.
After that, you know, we can only focus on a couple, but we'll quickly run through a couple more products we've built, just to show you how this has proliferated all through Uber. And finally, we'll tell you what we learned, and hopefully there's something reusable there for you.
So let's do it. Let's do it. Okay. So our AI DevTools strategy at Uber is built primarily on three pillars. The first one is the products, or bets, that we've chosen to take. We've picked things that directly improve developer workflows, things that our developers perform today.
It could be writing tests. Yes, I know it's boring. It is reviewing code, which can also be laborious. And we're like, okay, how do we make this better? How do we make this faster? How do we eliminate toil for developers? We've taken a bet on a few areas, based on where we think we can make the most impact, but we're also always learning.
You know, that's why we're here: to see what everyone else is up to and see what else we can target. The second pillar of our strategy is that we've got to build the right cross-cutting primitives, as we call them. These are foundational AI technologies that show up in pretty much all of your solutions; you've probably felt that too.
And having the right, you know, abstractions in place, the right frameworks, the right tooling helps us build more solutions and build them faster. And lastly, what I'd say is probably the cornerstone of this strategy is what we call intentional tech transfer. We've taken a bet on a few product areas.
We want to build them. We want to build them as fast as possible. But we do stop and are deliberate about, hey, what here is reusable? What can be spun out into something that reduces the barrier for the next problem we want to solve? And so LangFX is the opinionated framework we built that wraps LangGraph and LangChain and makes them work better with Uber systems.
And it was born out of necessity, right? We had the first couple of products emerge, and they wanted to solve problems in an agentic manner. They wanted to build reusable nodes, and LangGraph was the perfect fit for that, and we saw it proliferating across the organization.
We made it available. We built an opinionated framework around it. So, you know, I think that's enough of the overview. Let's just dive into one of the products. Matas, can you walk us through Validator? Yeah, absolutely. So the first product we'll showcase today is called Validator. What it is is an in-IDE experience that automatically flags best-practice violations and security issues in code for engineers.
So it is effectively a LangGraph agent that we built a nice IDE UX around. And, you know, let's take a look at how it works. We have a screenshot here that shows a user opening a Go file. And what happens is they're notified of a violation, in this case.
So they have a little diagnostic that they can mouse over, and they get a nice modal saying, hey, in this case, you're using the incorrect method to create a temporary test file. This will leak onto the host; you want it automatically cleaned up for you.
So what do you do about it? What can the user do? Well, they have multiple choices. They can apply a pre-computed fix that we have prepared for them in the background. Or, if they so choose, they can ship the fix off to their agentic IDE assistant instead.
That's what we have on the next slide, actually: the fix request has been shipped off, and we got a fix back from the IDE assistant. So the issue is no longer present, the user is happy, and they no longer have a code smell.
So that's super. Now, some of the key insights we found while building this. The main thing is that the agent abstraction allows us to compose multiple sub-agents under a central Validator agent. So we have, for example, a best-practices sub-agent that calls into the LLM with a list of practices and returns those points of feedback.
But there's also a deterministic bit where, for example, we want to discover lint issues from static linters. There's nothing stopping us from running a lint tool and then passing the findings on through the rest of the graph, which allows us to pre-compute a fix even for those.
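To make that composition concrete, here is a minimal LangGraph sketch of the idea: an LLM-backed best-practices sub-agent and a deterministic lint sub-agent writing findings into one shared state. The node names, state fields, and the `ask_llm_for_violations` / `run_linter` helpers are illustrative assumptions, not Uber's actual Validator code.

```python
# Illustrative sketch only: names and helpers are hypothetical, not Uber's code.
import operator
from typing import Annotated, TypedDict

from langgraph.graph import END, START, StateGraph


class ValidatorState(TypedDict):
    source: str                                   # file contents under review
    findings: Annotated[list[str], operator.add]  # violations merged across sub-agents


def ask_llm_for_violations(source: str) -> list[str]:
    # Hypothetical placeholder: a real version would prompt a chat model with
    # the curated best-practice list plus the source file.
    return []


def run_linter(source: str) -> list[str]:
    # Hypothetical placeholder: a real version would shell out to a static linter.
    return []


def best_practices_node(state: ValidatorState) -> dict:
    # LLM-backed sub-agent: check the source against curated best practices.
    return {"findings": ask_llm_for_violations(state["source"])}


def lint_node(state: ValidatorState) -> dict:
    # Deterministic sub-agent: no LLM involved, just run a linter and pass its
    # findings down the graph so later nodes can pre-compute fixes.
    return {"findings": run_linter(state["source"])}


builder = StateGraph(ValidatorState)
builder.add_node("best_practices", best_practices_node)
builder.add_node("lint", lint_node)
builder.add_edge(START, "best_practices")  # both sub-agents fan out from the start
builder.add_edge(START, "lint")
builder.add_edge("best_practices", END)
builder.add_edge("lint", END)
validator_graph = builder.compile()

# result = validator_graph.invoke({"source": some_go_source, "findings": []})
```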
So that's the learning. And in terms of impact, we're seeing thousands of fix interactions a day from satisfied engineers who fix problems in their code before those problems come back to bite them later. And we think we've built a compelling experience here, right? We've met developers where they are, in the IDE.
We have tooling that runs in the background. It combines deterministic capabilities, like AST parsing tools, to find out where each of the test boundaries lies. We're able to evaluate each one of these against a set of curated best practices, flag up violations, figure out the most expressive way to deliver this back to the user, show it in the IDE, and give them a way of applying fixes.
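As a rough illustration of that deterministic piece, here is an analogous sketch of finding test boundaries with plain AST parsing. Uber does this for Go; this example uses Python's standard `ast` module purely to show the idea, so the sample code and naming convention are hypothetical.

```python
# Analogous sketch: deterministically locate test boundaries with AST parsing
# (no LLM involved). Uber's tooling does this for Go; this is a Python stand-in.
import ast


def test_boundaries(source: str) -> list[tuple[str, int, int]]:
    """Return (test name, start line, end line) for each top-level test function."""
    tree = ast.parse(source)
    return [
        (node.name, node.lineno, node.end_lineno)
        for node in tree.body
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test_")
    ]


sample = """
def test_fare_calculation():
    assert 1 + 1 == 2

def helper():
    pass
"""
print(test_boundaries(sample))  # [('test_fare_calculation', 2, 3)]
```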
But we thought, why stop there? For sure. So why stop at validating? Let's help engineers by authoring their tests from the get-go. The second tool we're showing off here is called AutoCover. It is a tool to help engineers generate tests that build, pass, raise coverage, exercise business cases, and are validated and mutation-tested.
So, like, really high-quality tests is what we're shooting for here. And the intent is to save the engineer time. So they're developing code. They want to get their tests quickly and move on to the next business feature that they want to implement. So the way we got to this is actually we took a bunch of domain expert agents.
We actually threw Validator in there as well, and more on that later. And then we arrive at a test generation tool. So let's take a look at how it works. We have a screenshot of a Go source file as an example. And the user can invoke AutoCover in multiple ways.
If they want to invoke it for the whole file and bulk generate, they can do a right-click, as shown in the screenshot, and just invoke it. And once the user clicks the button, a whole bunch of stuff happens in the background.
So we start by adding a new target to the build system. We set up a test file. We run an initial coverage check to get a target space for us to operate on. While all of that is being done, we also analyze the surrounding source to extract the business context, so that we know what to test against.
And what the user really sees is just that they get switched to a test file, empty in this case; it can also be pre-populated. And because we did all that work in the background, we're already starting to generate tests. What the user will see is a stream of tests coming in.
And the file will be in constant flux. Tests will be coming in at high speed. We'll do a build; if a test didn't pass, we'll take it out. Some tests might get merged. Some tests might get removed because they're redundant. You might see benchmark or concurrency tests come in later.
And so the user is watching this experience and, at the end, arriving at a nice set of validated, vetted tests. That's what we want. That's the magic we want for our users here. Let's dive a bit deeper into the graph to see how it actually functions.
So here's the graph. On the bottom right, you can actually see Validator, which is the same agent we just talked about, so you can already see some of the composability learnings that we found useful. How did we arrive at this graph? We looked at the heuristics that an engineer would use while writing tests.
And so, for example, you want to prepare your test environment. You want to think about which business cases to test. That's the job of the scaffolder. And then you want to think up new test cases, whether it be for extending existing tests or just writing new tests altogether. That's the job of the generator.
And then you want to run your builds and your tests. And if those are passing, you want to run a coverage check to see what you missed. That's the job of the executor. And so we go on to complete the graph this way. And then, because we no longer have a human involved, we can actually supercharge the graph and juice it up so that we can do 100 iterations of code generation at the same time, and then 100 executions at the same time.
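As a rough sketch of that shape, here is what a scaffolder-to-generator-to-executor pipeline with a concurrent fan-out could look like in LangGraph, using the `Send` API to run one generator branch per business case at the same time. The node names follow the talk, but the state fields and placeholder logic are illustrative assumptions, not Uber's actual AutoCover graph (which also composes Validator at the end).

```python
# Sketch of the AutoCover-style graph shape; state fields and placeholder
# logic are hypothetical, not Uber's implementation.
import operator
from typing import Annotated, TypedDict

from langgraph.graph import END, START, StateGraph
from langgraph.types import Send


class CoverState(TypedDict):
    source: str                                # source file under test
    cases: list[str]                           # business cases to cover
    tests: Annotated[list[str], operator.add]  # generated tests, merged across branches
    coverage: float                            # latest measured coverage


class CaseState(TypedDict):
    source: str
    case: str


def scaffolder(state: CoverState) -> dict:
    # Prepare the test environment and decide which business cases to test.
    return {"cases": ["happy path", "error path"]}  # placeholder analysis


def fan_out(state: CoverState) -> list[Send]:
    # One generator branch per case; LangGraph runs these concurrently.
    return [Send("generator", {"source": state["source"], "case": c})
            for c in state["cases"]]


def generator(state: CaseState) -> dict:
    # A real generator node would call an LLM to write a test for this case.
    return {"tests": [f"// TODO: test for {state['case']}"]}


def executor(state: CoverState) -> dict:
    # Build and run the merged tests, drop failures, and re-check coverage.
    return {"coverage": 0.0}  # placeholder


builder = StateGraph(CoverState)
builder.add_node("scaffolder", scaffolder)
builder.add_node("generator", generator)
builder.add_node("executor", executor)
builder.add_edge(START, "scaffolder")
builder.add_conditional_edges("scaffolder", fan_out, ["generator"])
builder.add_edge("generator", "executor")
builder.add_edge("executor", END)
autocover_graph = builder.compile()
```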
We've seen that for a sufficiently large source file, you can do that. And that's where our key learning comes in: we found that having these super-capable domain expert agents gives us exceptional, unparalleled performance compared to other agentic coding tools. So we benchmarked it against the industry agentic coding tools that are available for test generation.
And we get about two to three times more coverage in about half the time compared to them, because of the speed-ups we built into this graph and the bespoke knowledge we built into our agents. And in terms of impact, the tool has helped raise developer platform coverage by about 10%.
That maps to about 21,000 dev hours saved, which we're super happy about. And we're seeing continued use, with thousands of tests generated monthly. So, yeah, we're very happy about that. Saurabh, take us through some more products. Yeah, so we didn't want to stop at 5,000 tests a week.
Like, we've built these primitives, right? Just wanted to give you a sneak peek of what else we've been able to do in the organization with this. So what you see on screen right now is our Uber Assistant Builder. Think of it like our internal custom GPT store where you can build chatbots that are, you know, steeped in Uber knowledge.
So, like, one of them you see on the screen is the Security Scorebot. And it has access to some of the same tools that we showcased earlier, so it's steeped in Uber's best practices and can detect security anti-patterns. So even before I get to the point where I'm in my IDE writing code, I can ask questions about architecture and figure out whether my implementation is secure or not.
Right? Same primitives, powering a different experience. So, next up we have Picasso. Picasso is our internal workflow management platform, and we built a conversational AI on top of it that we call Genie. It understands workflow automation, it understands the source of truth, and it can give you feedback grounded in product truth, aware of what the product does.
The third thing I want to show you, and this is not an exhaustive list, is our tool called uReview. Obviously, we built stuff in the IDE and tried to flag anti-patterns earlier in the process, but sometimes things still slip through the cracks. So why not reinforce that and make sure quality is enforced before code gets landed, before your PR gets merged?
So, again, powered by some of the same tools that you saw earlier, the ones behind Validator and the test generator, we're able to flag both code review comments and code suggestions that developers can apply during review time. I think with that, we'll just jump over to the learnings.
Yeah, sounds good. So, in terms of the learnings, we already sort of talked about this, but we found that building domain expert agents that are super capable is actually the way to go to get outsized results. They use context better. You can encode things in rich state. They hallucinate less.
And the resulting output is much better. An example I already talked about is the executor agent. We were able to finagle our build system to let us execute 100 tests against the same test file without colliding, and also get separate coverage reports.
That's an example of a domain expert that's super capable and gives us the performance we want. Secondly, we found that composing agents with deterministic sub-agents, or just making the whole agent deterministic, makes a lot of sense if you can solve the problem in a deterministic way.
One example of that was the lint agent under Validator. We want reliable output, and if we have a deterministic tool that can give us that intelligence, we don't need to rely on an LLM. We can get that reliable output, pass the findings on to the rest of the graph, and have them fixed.
And then third, we found that we can scale up our dev efforts quite a bit by solving a bounded problem, creating an agent for it, and then reusing it in multiple applications. You already saw it with Validator: the standalone experience, and Validator within AutoCover for test generation validation. But I'm going to give you one more lower-level example, and that's the build system agent.
That's actually used by both of those products. It's an even lower-level abstraction that the agents need in order to execute builds and run tests in our build system.
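Since it came up a few times, here is a sketch of that reuse pattern in general LangGraph terms: compile a bounded agent once, then add it as a node in several product graphs that share the same state keys. The `build_node` and `product_work` below are hypothetical stand-ins, not Uber's actual build system agent.

```python
# Reuse pattern sketch: one bounded agent, compiled once, dropped into
# multiple product graphs. All names and logic here are hypothetical.
from typing import TypedDict

from langgraph.graph import END, START, StateGraph


class BuildState(TypedDict):
    target: str   # build target to compile and test
    passed: bool  # did the build/test run succeed?


def build_node(state: BuildState) -> dict:
    # Bounded job: invoke the build system for one target and report the
    # result. A real version would shell out to the build tool.
    return {"passed": True}


# Compile the bounded agent once...
sub = StateGraph(BuildState)
sub.add_node("build", build_node)
sub.add_edge(START, "build")
sub.add_edge("build", END)
build_agent = sub.compile()


def product_work(state: BuildState) -> dict:
    # Placeholder for product-specific work (flag violations, generate tests, ...)
    # that ends by picking a build target to verify.
    return {"target": "//example/service:tests"}


def make_product_graph():
    # ...then reuse it as a subgraph node in any product sharing its state keys.
    g = StateGraph(BuildState)
    g.add_node("work", product_work)
    g.add_node("build_system", build_agent)  # compiled subgraph added as a node
    g.add_edge(START, "work")
    g.add_edge("work", "build_system")
    g.add_edge("build_system", END)
    return g.compile()


validator_like = make_product_graph()   # e.g. a Validator-style product
autocover_like = make_product_graph()   # e.g. an AutoCover-style product
```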
So, Saurabh, take us through some of the strategic learnings now. Yeah. Matas talked us through some of the technical learnings, but this is the one I'm probably most excited to share: you can set up your organization for success if you want to build agentic AI. And I think we've done a pretty good job of it at Uber. We haven't devolved into an AI arms race.
We're all building in collaboration, and I think these are our biggest takeaways. The first is that encapsulation boosts collaboration. When there are well-thought-out abstractions, like LangGraph, and there are opinions on how to do things like state management and how to deal with concurrency, it really allows us to scale development horizontally.
It lets us tackle more problems, and more complex problems, without creating an operational bottleneck, right? An example I'll give you is that our security team was able to write rules for Validator, the product we showcased earlier, so it's able to detect security anti-patterns. This part of the security team knew nothing about AI agents or how the graph was constructed, but they were still able to add value and improve the lives of our developers.
And a natural segue from that is: if you're able to encapsulate work into these well-defined nodes, then graphs are the next thing you think about, right? Graphs help us model these interactions perfectly. They oftentimes mirror how developers already interact with the system. So, when we do the classic process engineering and identify process bottlenecks and inefficiencies, it doesn't just accelerate the AI workloads.
It also helps improve the experience for people who aren't even interacting with the AI tools, right? So it's not an arms race either, a question of whether we should build agentic systems or improve our existing systems; it usually ends up with the two helping each other. For example, we spoke about our agentic test generation, and along the way we found multiple inefficiencies: how do you do mock generation quickly?
How do you modify build files and interact with the build system? How do you even execute the tests? And in the process of fixing all these paper cuts, we improved the experience for non-agentic applications too, for developers interacting directly with our systems. And it's been hugely beneficial.
And, you know, with that, I want to bring this talk to an end. We really enjoyed presenting here. Thank you for the opportunity. Hopefully you all learned something and will take something back to your companies. Thank you.