LangChain Interrupt 2025 Making Devin

Muscle previously led machine learning at CLA-R, and also one-built Tesla AutoPAD. I've known Muscle for a number of years and I'm very excited to talk, so welcome, Muscle. Thank you, Muscle. Thank you so much for having me, really excited to share a little bit more about how we made Devon.

So my name is Muscle Kaplan, I'm the president at Cognition, we're the company behind Devon. And as a quick show of hands, how many of you have heard of Devon? Alright, almost everyone. So Devon is an AI software engineer. But we are really focused specifically on working within existing codebases.

There's lots of amazing AI tools out there for coding, and what we found is that as the codebases get larger, the problem gets harder. And most of our customers and users around the world are teams. They're teams of engineers or companies full of engineers that are trying to ship real-world products.

And so today I want to talk a little bit more about what Devon is, but more importantly, how we built it. And I'm going to share some new sort of technical information we're releasing on exactly how this works, and I'm really excited to the present to you all. First, what are we seeing in AI for coding?

Obviously, this is a really fascinating field. And in many ways, software engineering is one of the first large-scale successful applications of Devon AI. It started in many ways with co-pilots, real-time text completion inside your editor that makes you as an engineer go a bit faster. And now we also have AI ITEs.

Again, a development environment for you as an individual engineer to get even more letters for sometimes delegating entire tasks or snippets and really coding in flow with AI systems. We see Devon as part of the third wave of AI technology tools, which is on the fully autonomous agent end of the spectrum.

more AI teaming than AI programming. Companies around the world are using Devon like another member of their engineering team, going directly from Tingle to pull requests, collaborating with Devon in Slack or Jira or Linear. And we see the large majority of Devon samples and Devon PRs are starting from within these other tools, the same way we might interact with another human request engineer.

So architecturally, this is really different from something that runs locally on your computer. Devon is a cloud AI agent. And what we've seen is that it's very daunting to these local AI policy tools. When you are coding yourself and you want to stay in flow and get that stable, you use a local AI policy tool.

Where people use Devon is when they're ready to delegate their tasks entirely. And this is a very different set of capital trade-offs. You get large-scale parallelism, asynchristness, and the ability to completely delegate individuals. In the team setting, this also means that Devon is not remotely. And so they share the same environment across different tasks.

So you can try many different things in parallel, combined with Devon. And the teams of engineers who use Devon will break up large-scale engineering outcomes to small individual tasks, delegated into a fleet of Devon, and then coalesced together inside of Devon. And maintain how you look for it for that focus that we created.

It also changes the learning model. In the cloud AI agent setting, Devon is not just for you. It's for your team and for your organization. And so as Devon learns from your interactions, those learnings are not only with you. But instead, they're incorporated as part of the team and as part of the organization.

And this reliance on organizational knowledge is something that we've seen is really important for working with existing large-scale offices. because working with large-scale offices is really hard. And so I'm going to go into some more detail on exactly how we do this unlimited. Part one is all about context.

If you want to build an AI software engineer, you need to understand the existing code. You don't want your AI code contributions to be a new framework, adding new dependencies, or being done in isolation of what you already have. And code-based understanding is pretty hard. LLMs are amazing at so many things.

But they have limited context windows. And the code-based fits inside the context window. The effective context window is often a lot lower than the advertised context window. We have a series of internal benchmarks that measure effective easing capacity across the context. And we find very consistently that the advertised context window is much higher than the effective easing context window.

Large code bases also have complex dependencies. You can expand multiple services, multiple repositories, and they can be intertwined in very complicated ways, even for humans. There are huge variations in code quality. There might be some parts of the code base you want them to emulate, and some parts you really want them to stay away from, when it's learning how to be harassing a member of other teams.

Same thing is true for documentation. The code might have comments, might have missing comments, might have documentation that's outrightly correct for missing. All of these things are our technical challenges we work on. The last critical piece of the code base is that the larger the code base, the more custom it tends to be.

Teams and companies build their own proprietary frameworks. They have their own specific jargon. And so these are the research questions that we set out to solve to make that an actual user is available. And the first thing I'm going to go to is something that we actually recently released free and publicly for all open source repositories.

It's called DeepWiki. DeepWiki is a real-time, continually updated index of your code base, published as an interactive meeting, almost like a real-time conference game with documentation, diagrams, the ability to ask questions about your code base. We had this originally as an internal data structure for Devon. It wasn't a product.

It was just a tool. The Devon used to get time-level context about the code. And what we realized is that human engineers wanted to see this information, too. And so we decided to release it as a standalone project service. So you can take any GitHub URL today and just change the GitHub to deepwiki.com.

And for any open source repo, you'll get a full interactive repo. This also works on your private repos when they're integrated with Devon. And so I looked at the LinkChain in Fabrico. And we have a full updated, up-to-date documentation page for LinkChain that has not only the pros of how it's organized, the key concepts in LinkChain's code base, but also architectural diagrams, data flows.

And we've gotten a lot of feedback from the community that these diagrams are, in some cases or in many cases, actually better than the diagram of the official documentation of very popular open source projects. Right? Whether it's folks who are on the TypeScript steering committee, the DLM maintainers, or others.

We're getting lots of ways to repeat that on how great DLM is before. And we've had, you know, thousands of code bases start from DLM to DLM as part of their official documentation. So definitely check this out if you're working on open source code yourself. How does this work under the hood?

We just said that LMs are really bad at raising about large code bases. I'll give you the algorithm of what we're doing under the hood to generate these pieces. Step one, it's actually not about the code. It's about the concepts. What are the key principles inside this code base that are going to form our table of context for how we lay out the matter of context of this code base?

And what we've found is that in many cases those concepts, you don't just get them from the source code itself. There's extremely rich information in the library around the source code. For example, was that source code added in as part of the full request? Well, which member of the team added that full request?

What else did they contribute into? Was there a discussion in that full request about the code? Are there comments? Is there documentation? The gate to make history. All of this metadata is a really useful source for building these high context pieces. Once you have those concepts, then you can connect them to the code.

So what are the connections between the various code files and the proprietary or specific concepts inside this code base? And after you have that, you need to connect the code itself. There's different sections of the code base. There's some files that are more related, less related. There's small cases and flows.

There's a specific way that these different components of the code base connect to each other. You can look at things like the symbol graph. You can look at the call graph. You can look at how these files can be used together. And once you have those code to code connections, then you can actually generate a loop.

And for each concept, what we do is we use an agent to go research that concept in the context of the specific code base. We generally engage about it. And when you put this all together, you get very rich representations of code. And we use graphs as a critical part of those representations.

And so this is a graph of the LangStream code base where you can see at a high level that different files are more or less related to each other with a lot of logic before. And then maybe outskirts that are motivated to test artists, documentation, specific integrations with third parties, etc.

And these data structures power a lot of how they actually work inside a large, you know, million, multi-million, like code base. So we've got our region. But we also need to be able to search for them. And this is another feature that's now mainline in Devon but started as an integral tool for Devon, the AI software engineers.

That's the trend we're seeing, is to make Devon a great software engineer to these building tools that are so useful that human engineers want to use them too. And so we have Devon Search, which is essentially deep research on your proprietary code base. Again, whether it's open source or internal.

You can ask questions about the code. Devon will scan through that code, try to understand what's going on, using both the micro code in individual files, but also the background context it has from this VGA structure. And then we'll find this information. For example, I ask, how do I enforce structured output in microchains?

And Devon will have found the right section of the documentation for microchains as well as the actual implementation code for what to do. Devon Search gives Devon context. It's an essential part under the hood of how Devon, the autonomous AI agent, can actually make useful changes inside the larger team-like code bases.

Once you get a query, you need to do re-processing, and of course, RAG is a component of that. But we end up doing a lot more under the hood than just RAG, including junk removal, some RGS, filtering of less relevant information, and re-ranking multi-hop search, to end up with this set of context that we think is very, very knowledgeable in the script.

And that context, again, includes good source files, but also wiki pages. You need the micro and macro context to provide really useful, really useful recommendations. And from that, you can get the grounding answer. People don't want hallucinations in their vikings, and they don't want hallucinations in their search. So the grounding is essential for this to actually be useful.

The second part of how we optimize and customize to existing code bases is a bit more research-oriented. And I'm excited to share a little bit more of some of the post-training RL that we do under the hood to make Devon work well inside specific narrow domains. We recently released a new model, an open-source-free model, called Kevin.

Uh, Colonel Kevin. Kevin 32B. Kevin is, uh, uh, outperforms many state-of-the-art combination models on the narrow domain of writing PUDO kernels. Raise your hand if you've ever heard of a PUDO kernel. All right. You know, the audience is very familiar with the underpinnings that I'm on. For those who haven't heard of PUDO kernels, um, this is the source code that you use to write GPs.

Uh, GPU-optimized, uh, GPU-optimized limitations for NVIDIA to use. And so, under the hood, when you're using TypeLorge or TensorFlow, um, those high-level graph operations are being executed under the hood by PUDO kernels. And the, the, the domain of writing PUDO kernels is extremely fixed. Because this is a very low-level program relative to what many of us operate for typically day-to-day, say PUDO.

And, uh, PUDO kernels were released as a, uh, a kernel bench. Uh, kernel bench was released as a benchmark by Ann, Simon, and Azalea, uh, to estimate Google's capabilities of generating these very, very specific CUDO kernels at high performance and high reliability. And this work from automation was done by Carlo, Pietro, then supervised by Cypress.

Uh, these are our research interns, uh, who, who did, who got really, really exciting results, um, from the single project. So, let's talk about what this work does more specifically. Uh, the goal is to take high-level machine learning learning, say, a few different calls to PUDO, and rewrite it as a highly optimized, performing, correct CUDO kernels.

Uh, this is a very detailed public domain that many, uh, low-level machine learning researchers spent in their entire careers optimizing. Uh, the design space is quite far. It's quite hard. It's quite hard. It's quite hard to write for optimal CUDA kernels. And, for example, uh, what we see in practice in the ML community is that a lot of progress in machine learning is really driven by performance on the hardware.

And so, even if your algorithm for your new paper is big O, optimal, like a linear tension mechanism, uh, under the hood, if your implementation is not efficient, cache-friendly, uh, perform it on actual GPU hardware, it tends to not be at least. So, this is a really active research domain for our ML researchers.

And we want, uh, Kevin to be good at regulating these optimized CUDO kernels. So, how does this work? Uh, the first step is to define your reward machine. And one of the great things about software, and in particular, the CUDO kernels, is that it's often easy to get automatically verifiable.

Can you verify the correctness of your code automatically? Well, in this case, we have a less performance reference implementation that we can use to check the records. And so, whenever Kevin, uh, which is our host-trained, uh, for MLM for this project, whenever Kevin writes a kernel, we run through a series of checks, right?

First of all, does that code parse? Uh, is it actually valid CUDA? Does it compile? Does it run? Uh, and then after all of that, is it correct? And only if it's correct, do we then rate it for performance. How much faster or slower is it than the reference implementation?

So, with this reward function, notice we don't need a machine learning model here. This is barely a set of, uh, automatically verifiable steps, uh, which makes, which makes this very, very friendly for high compute RL. Once you have a reward function, you can use it for multi-term retainer. And so we use multi-term GRBO.

Uh, and for those who aren't familiar, what's going on here is we're taking multiple different trajectories in sequence for this model to get better at writing CUDA code. So, on the left here, we have an initial prompt, which results in a chain of thought from the model and output.

Now output may or may not be correct. Uh, when we move to the second, the middle of this diagram, we provide eval info back to the model. And this eval info is the results from trying to run that kernel in a real-world GPU environment. Uh, there's a lot of work you have to do in terms of sandboxing and isolation to make sure these incorrect CUDA kernels don't mess up your training process or crash, which you use, and then you're getting accurate performance benchmarks.

But we package all that up into almost like a struct of eval information. With that model, you can see as it tries again. And it tries again with the second chain of thought and the second kernel. That gets passed to another step. And this process works over several steps.

And the result is hopefully correct and confirmed. Then you have to distribute the rewards to train this information. And what we found is that you don't want to just reward based on the final object and its correctness or non and its performance or lack of performance. Actually, the path to get there is also non.

So, you'll notice in red at the bottom here, we have this sum of different rewards discounted by gamma over time. And what that's showing is the very first trajectory, the very first step of that trajectory, gets a reward even if it wasn't correct itself. If it led to a correct solution and a performance solution downstream.

Are you marking up the red tree? It's basically a reward in one view. And what we found in this project is that being able to do this over multiple iterations for these discounted reports was really important for this to work. And once you do this, you can find that it's not impossible to very deeply optimize for these narrow problem domains.

So, in this graph, we have the correctness on the left of how many other kernels were written correctly on this log. And Kevin32B is getting 91% correct on this session on the kernel benchmark that we're focused on. And you can see that compared to even in 04 mini or 03, this is a significant improvement.

This is a narrow domain where high-QRL lets you outperform existing models. On the right, you see performance. So, we rewarded that proportional to how much speedup Kevin got in this project. And so, as the kernels got faster and faster, it got more and more reward. And what we found is that even from a performance standpoint, Kevin32B is able to outperform a larger scale.

And this was a really interesting result to us because it kind of flies in the face of many, many sort of broad discussions that, oh, these foundation models are going to be the best at everything. And we don't use them exclusively for everything. But what we see internally all the time is that for any given narrow domain, if you can set up your environment to do high compute RL in that domain, it's very feasible to outperform an out-of-the-the-box application model.

Especially as the open source base models that you start with have improved. So, to actually make this work in practice, it's important that you keep your model doing what you actually want it to do and not cheating along the way. And this is called word hacking in RL. In many cases, it's actually challenging for events.

So, I want to show you a few ways that Kevin misbehaved when he had to sort of steer back. So, one is Kevin realized that he could write the CUDA and then wrap the whole thing in a triangle set block. And it's all mapped to the existing high portion limitations.

And, you know, they would always score 100% correct in that case. And it had some chance of being faster than average, but it wasn't. It's a result to the 1x. So, that was a very uncorrupted direction for Kevin to go down during the RL process. And we had to make sure that we updated the word function to recognize this type of word hacking.

The second is even a bit more subtle. And, so, the test partners to make sure that Kevin's code was correct had a class, in this case called model view, that inherited from the model. And you can see here what Kevin realized is that it could implement the model as a subclass of an end.module, which is attempt to optimize the code.

And then you can just overwrite that classroom in the namespace. And, so, you can see you can find a second model view, that this, in this case, just generates directly from the correct model view. So, these models got very great at all about how to sort of get around your attention.

And, this is a challenge in RL. So, making sure you directly define your environment is really critical to success. And, for those of you who have used any really popular commercial models, like some really popular models for coding, you might have seen that as the models get better, sometimes they're more aggressive at doing things like commenting out your test cases to make sure the laws still pass.

That's what's going on in the hood. This is, this is a smell of reward hacking. And, so, it's a constant term path between the researchers who are trying to steer these models to do what we actually want, and the models are trying to exploit every possible way to get this high quality.

So, what we learned from this, custom post training, again, it does not perform frontier problems on specific variable domains. For reinforcement learning specifically, especially in code, it's more compute now than data now. You know, KernelBetch, the subset of KernelBetch that we trained on, only had 180 tasks, which is really not that many people think about it.

But, by applying IPvRL, all that directories again and again, there is very, very rich rewards that we can learn from. And, that's really good at the same software. We have an order that can help these rewards. We actually have the environment. And, this, in my opinion, is one of the reasons that software, and coding specifically, has accelerated particularly fast as AI capability.

Is that code is one of the few things where this property holds. I used to be a machine learning that scale, which provides close range of data for many of the . and it gets really hard to label by hand, high-quality, high accuracy data, as it all gets hard.

But, code doesn't have that quality because you can seemingly scale based on automatic signals. And, that's really the third key. Automatic verification allows you to scale. So, if you want code bases in your own process, putting in the CI systems, putting in the test coverage, putting in the harnesses that allow the operator infusion, is going to future-proof your code as RL, as AI gets better.

And, we see many of them, they first take their code base with Debit and go fix all the test coverage issues. And, now, they have full test coverage. It's even faster if you've got it. The last big point here is, I just showed you the example on . But, to me, the more interesting, deeper application of this research is, every code base is, in some sense, a narrow domain.

There are specific things to your code that don't exist in anyone else's code. And, that's more and more true to the larger code bases. So, you can imagine a future where high-to-cube RL, and preferred code-based customization, leads to significantly outperforming agents on each individual domain. The equivalent of hiring a software engineer and giving them millions of years of experience working specifically in your environment.

So, this is some of the research work we've been doing in this conversation that powers Debit around the hood. If you'd like to play around and try this yourself, you can use, you can go to Debit.ai and sign up your account. Connect it with your distinct code, give it a task, and go from taking it to PR.

Thank you so much for having me. Thank you so much. Thank you very much. Thank you very much. Thank you very much. Thank you very much. Thank you very much. All right. Thank you very much.

LangChain Interrupt 2025 Making Devin – Russell Kaplan

Transcript