Multi-Agent Frontiers: Making Devin

Langchain team thank you so much for having me really excited to share a little bit more about how we made Devon so my name is Russell Kaplan I'm president at cognition we're the company behind Devon and as a quick show of hands how many of you have heard of Devon before all right almost everyone so Devon is an AI software engineer but we are really focused specifically on working within existing code bases there's lots of amazing AI tools out there for coding and what we found is that as the code bases get larger the problem gets harder and most of our customers and users around the world are teams they're teams of engineers or companies full of engineers they're trying to ship real-world products and so today I want to talk a little bit more about what Devon is but more importantly how we built it and I'm gonna share some new sort of technical information we're releasing on exactly how this works under the hood that I'm really excited to to present to you all first what are we seeing in AI for coding obviously this is a really fast-moving field and in many ways software engineering is one of the first large-scale successful applications of generative AI it started in many ways with co-pilots real-time text completion inside your editor that makes you as an engineer go a bit faster and now we also have AI IDEs again a development environment for you as an individual engineer to get even more leverage and for sometimes delegating entire tasks or snippets and really coding in flow with AI assistance we see Devon as part of a third wave of AI developer tools which is on the fully autonomous agent end of the spectrum more AI teammate than AI copilot companies around the world are using Devon like another member of their engineering team going directly from ticket to pull request collaborating with Devon in slack or Jira or linear and we see the large majority of Devon sessions and Devon PRs are started from within these other tools the same way you might interact with another human engineer architecturally this is really different from something that runs locally on your computer Devon is a cloud AI agent and what we've seen is that it's very complimentary to these local AI development tools when you are coding yourself and you want to stay in flow and get that speed up you use a local AI development tool where people use Devon is when they're really ready to delegate the task entirely and this is a very different set of technical trade-off you get large-scale parallelism asynchronousness and the ability to completely delegate individual units of work in the team setting this also means that Devon's run remotely not locally and so they share the same environment across different runs so you can try many different things in parallel combine them together and the teams of engineers who use Devon will break up large-scale engineering outcomes into small individual tasks delegated to a fleet of Devons and then coalesce together inside the code base and the main thing our users look for is for that code from Devon to get merged as pull requests it also changes the learning model for Devon in the cloud AI agent setting Devon is not just for you it's for your team and for your organization and so as Devon learns from your interactions those learnings are not kept only with you instead they're incorporated as part of your team and as part of your organization and this reliance on organizational knowledge is something that we've seen is really important for working with existing large-scale code bases because working with large code bases is really hard and so I'm going to go into some more detail on exactly how we do this under the hood with Devon part one is all about context if you want to build an AI software engineer you need to understand existing code you don't want your AI code contributions to be using a new framework adding new dependencies or being done in isolation of what you already have and code base understanding is pretty hard LLMs are amazing at so many things but they have limited context windows and even if a code base fits inside the context window the effective context window is often a lot lower than the advertised context window we have a series of internal benchmarks that measures effective reasoning capacity across a context and we find very consistently that the advertised context window is much higher than the effective reasoning context window Large code bases also have complex dependencies they can span multiple services multiple repositories and it can be intertwined in very complicated ways even for human engineers there's huge variations in code quality there might be some parts of the code base you want Devon to emulate and some parts you really want Devon to stay away from when it's learning how to be a productive member of your team same thing is true for documentation the code might have comments might have comments might have documentations that's outright incorrect or misleading all of these things are part of the technical challenges we work on to make Devon work in real world code bases the last critical piece of real world code bases is that the larger the code base the more custom it tends to be teams and companies build their own proprietary frameworks they have their own specific jargon there's context in the code that's not inside the code itself but the organizational workflow around the code and so these are the research questions that we set out to solve to make Devon actually useful in the real world and the first thing I'm going to go into more detail on is something we actually recently released free and publicly for all open source repositories it's called deep wiki deep wiki is a real-time continually updated index of your code base published as an interactive wiki almost like a real-time confluence page with documentation diagrams and the ability to ask questions about your code we had this originally as an internal data structure for Devon it wasn't a it wasn't a product it was just a tool that Devon could use to get high-level context about the code and what we realized is that human engineers wanted to see this information too and so we decided to release it as a standalone product and service so you can take any github URL today and just change the github to deep wiki.com and for any open source repo you'll get a full interactive wiki this also works on your private repos when they're integrated with Devon and so I looked at the Langchain repo and we have a full updated up-to-date documentation page for Langchain that has not only the pros of how it's organized the key concepts in Langchain's code base but also architectural diagrams and data flows and we've gotten a lot of feedback from the community that these diagrams are in some cases or in many cases actually better than the diagrams of the official documentation of very popular open source projects right whether it's folks who are on the typescript steering committee the vllm maintainers or others we're getting lots of amazing feedback on how great deep wiki is and we've had you know thousands of code bases start linking to deep wiki as part of their official documentation so definitely check this out if you're working on open source code yourself how does this work under the hood we just said that lms are really bad at raising about large code bases so let's give you the high-level algorithm of what we're doing under the hood to generate these wikis step one it's actually not about the code it's about the concepts what are the key principles inside this code base that are going to form our table of contents for how we lay out the macro context of this code base and what we found is that the in many cases those concepts um you don't just get them from the source code itself there is extremely rich information in the metadata around the source code for example was that source code added in as part of a pull request well which member of the team added that pull request what else have they contributed to was there discussion in that pull request about the code um what were are there comments is there documentation the git commit history all of this metadata data is a really useful source for building these high context wikis once you have those concepts then you can connect them to the code so what are the connections between the various code files and the proprietary or specific concepts inside this code base and after you have that you need to connect the code itself there's different sections of the code base uh you know some files that are more related less related there's uh call you know sort of call traces and flows and there's a specific way that these different components of the code base connect to each other you can look at things like the symbol the symbol graph you can look at the call graph uh and you can look at how um how these files tend to be used together and once you have those code to code connections then you can actually generate a wiki and for each concept what we do is we use an agent to go research that concept in the context of the specific code base uh we generate a wiki page about it and then we also provide those intermediate artifacts as context uh and as tools and when you put this all together uh you get very rich representations of code uh and we use graphs as a critical part of those representations uh and so this is a graph of uh the lang chain code base where you can see uh at a high level that different files are more or less related to each other uh with a lot of a lot of logic in the core and then maybe outskirts that are more related to test harnesses documentation specific integrations with third parties and so on and these data structures power a lot of how devon actually works inside large you know million multi-million line of code code bases so we've got our wiki but we also need to be able to search the code and this is another feature that's now mainline in devon but started as an internal tool for devon the ai software engineer that's the trend we're seeing is to make devon a great software engineer you need to build tools that are so useful that human engineers want to use them too and so we have devon search which is essentially deep research on your proprietary code base again whether it's open source or internal you can ask questions about the code devon will scan through that code try to understand what's going on using both the micro code the individual files but also the macro context it has from this wiki data structure uh and it will find this information for example i asked how do i enforce structured output in length chain and devon went and found the right section of the documentation from length chain as well as the actual implementation code for what to do devon search gives devon context it's an essential part under the hood of how devon the autonomous ai agent can actually make useful changes inside larger team wide code bases once you get a query you need to do pre-processing and of course rag is a component of that but we end up doing a lot more under the hood than just rag including junk removal some more advanced filtering of less relevant information and re-ranking multi-hop search to end up with this set of context that we think is very very relevant for this query um and that context again includes both source files but also wiki pages you need the micro and the macro context to provide really useful really useful recommendations um and from that we can get a grounded answer people don't want hallucinations in their wikis and they don't want hallucinations in their search so the grounding is essential for this to actually be useful the second part of how we optimize and customize to existing code bases is a bit more research oriented uh and i'm excited to share a little bit more of some of the post training and rl that we do under the hood to make devon work well inside specific narrow domains we recently released a new model an open source free model called kevin uh colonel devon kevin 32b kevin is uh uh outperforms many state-of-the-art foundation models on the narrow domain of writing cuda kernels raise your hand if you've ever heard of a cuda kernel all right we have an audience that's very familiar with the under fittings of ml for those who haven't heard of cuda kernels um this is the source code that you use to write gpu optimized uh gpu optimized implementations for nvidia gpus and so under the hood when you're using pytorch or tensorflow um those high-level graph operations are being executed under the hood by cuda kernels and the the domain of writing cuda kernels is extremely rich because this is a very low-level programming language relative to what many of us operate in more typically day-to-day say python and uh cuda kernels were released as a uh uh kernel bench was released as a benchmark by anne simon azalea uh to estimate models capabilities of generating these very niche very specific cuda kernels at high performance and high reliability and this work from cognition was done by carlo pietro ben supervised by silas uh these were our research uh these were our research interns uh who who did who got really really exciting results um from a from a single project so let's talk about what this work does more specifically um the goal is to take high-level machine learning code say a few different calls to pytorch and rewrite it as a highly optimized performant correct cuda kernel this is a very detailed problem domain that many low-level machine learning researchers spend you know their entire careers optimizing uh the design space is quite large for how to write optimal cuda kernels and it's quite challenging um what we see in practice in the ml community is that a lot of progress in machine learning in machine learning is really driven by performance on the hardware and so even if your algorithm or your new paper is big o optimal like a linear attention mechanism um under the hood if your implementation is not efficient cache friendly uh performant on actual gpu hardware it tends to not be that useful so this is a really active research domain for ml researchers and we want uh kevin to be good at writing these optimized cuda kernels so how does this work uh the first step is to define your reward function and one of the great things about software and in particular writing cuda kernels is that it's often easy to get automatically verifiable reward can you verify the correctness of your code automatically well in this case we have a less performant reference implementation that we can use to check correctness and so whenever kevin uh which is our post train uh sort of llm for this project whenever kevin writes a kernel we run it through a series of checks right first of all does that code parse um is it actually valid cuda does it compile does it run uh and then after all that is it correct and only if it's correct do we then grade it for performance how much faster or slower is it than the reference implementation so with this reward function notice we don't need a machine learning model here this is purely a set of um automatically verifiable steps um which makes which makes this very very friendly for high compute rl once you have the reward function you can use it for multi-turn training and so we use multi-turn grpo uh and for those who aren't familiar what's going on here is we're taking we're taking multiple different trajectories in sequence for this model to get better at writing cuda code so on the left here we have an initial prompt which results in a chain of thought from the model and an output and that output may or may not be correct uh when we move to the second the middle of this diagram we provide eval info back to the model and this eval info is the results from trying to run that kernel in a real world gpu environment um there's a lot of work you have to do in terms of sandboxing and isolation to make sure these incorrect cuda kernels don't mess up your training process or crash your gpus and then you're getting accurate performance benchmarks but we package all that up into almost like a struct of eval information that that model can then see as it tries again and it tries again with a second chain of thought a second kernel that gets passed to another step and this process repeats over several steps and the result is hopefully a correct kernel then you have to distribute your rewards to train this information and what we found is that you don't want to just reward based on the final output uh and its correctness or not and its performance or lack of performance actually the the path to get there is also valuable so you'll notice in red at the bottom here we have this uh this sum of different rewards discounted by gamma over time and what that's showing is the very first trajectory the very first step of that trajectory gets a reward even if it wasn't correct itself if it led to a correct solution and a performance solution downstream are you barking up the right tree is basically the reward we want to give the model and what we found in this project is that being able to do this over multiple iterations with these discounted rewards was really important for this to work because writing cooter kernels is hard and so the reward signal is going to be sparse if you only get one shot and once you do this uh you can you can find that it's not impossible to very deeply optimize for these narrow problem domains so in this graph we have the um correctness on the left how many of the kernels were written correctly by this model and kevin 32b is getting 91 correct on this section of the kernel bench benchmark that we focused on and you can see that compared to even 04 mini or 03 this is a significant improvement um this is a narrow domain where high compute rl lets you outperform existing models on the right you see performance so we rewarded devon proportional to how much speedup uh you know kevin got in this project and so as the kernels got faster and faster it got more and more reward and what we found is that even from a performance standpoint kevin 32b is able to outperform these larger scale foundation models and this was really interesting result to us because um it kind of flies in the face of many many sort of broad discussions of oh these foundation models are going to be the best at everything and you should use them exclusively for everything but what we see internally all the time is that for any given narrow domain if you can set up your environment to do high compute rl in that domain it's very feasible to outperform an out-of-the-box foundation model especially as the open source based models that you start with have improved to actually make this work in practice it's important that you keep your model doing what you actually want it to do and not cheating along the way and this is called reward hacking in rl and it's uh many cases actually challenging to prevent so i want to show you a few ways that kevin misbehaved uh that we had to sort of steer back so one is kevin realized that it could write the cuda and then wrap the whole thing in a try except block and just fall back to the existing pi torch implementation and you know it would always score 100 percent correct in that case and it had some chance of being faster than average but if it wasn't it it sort of defaulted to the one x so that was a very unproductive direction for for kevin to go down during the rl process and uh we had to make sure that we update the reward function to recognize this type of reward hacking the second is even a bit more subtle uh and so the test harness to make sure that kevin's code was correct uh had a class you know in this case called model new uh that inherited from model and you can see here what kevin realized is that it could implement the model as a uh you know as a subclass of nn.module with its attempt at optimize code code and then it could just overwrite that uh class name in the namespace and so you can see it defined a second model new that this in this case just inherits directly from the correct model implementation so these models got very creative at all uh how to sort of get around your intentions and this is a challenge in rl and so making sure you correctly define your environment is is really critical to success and for those of you who have used maybe really popular commercial models like some of the most popular models for coding you might have seen that as the models get better sometimes they're more aggressive at doing things like commenting out your test cases to make sure the works will pass that's what's going on under the hood is this is uh this is a smell of reward hacking and so it's a constant sort of cat and mouse game between the researchers who are trying to steer these models to do what we actually want and the models they're trying to exploit every possible way to get this high quality reward so what do we learn from this um custom post training can and does outperform frontier models on specific narrow domains for reinforcement learning specifically um especially in code it's more compute bound than data bound you know kernel bench the subset of kernel bench that we trained on only had 180 tasks which is really not that many when you think about it but by applying high compute rl rolling out these trajectories again and again there is very very rich reward signal to learn from and that's because in software we have an oracle that can help with these rewards we actually have the environment we can run the code we can see if it compiles we can see how fast it is and this in my opinion is one of the reasons that um software and coding specifically has accelerated particularly fast as an ai capability is that code is one of the few domains where this property holds i used to lead machine learning at scale ai which provides post training human data for many of the large scale foundation model labs and it gets really hard to label by hand high quality high accuracy data as the models get smarter but code doesn't have that bottleneck because you can continually scale um based on automatic signals of correctness and that's really the third key is automatic verification allows you to scale so for your own code bases and your own process putting in the ci systems putting in the test coverage putting in the harnesses that allow that automatic verification is going to future proof your code as rl and as ai gets better and we see many of our users of devin they first take their code base with devin and go fix all the test coverage issues and now that they have full test coverage it's even faster to use devin to ship new more pull requests the last big point here is i just showed you an example on cuda kernels but to me the the more interesting deeper implication of this research is every code base is in some sense a narrow domain there are specific things to your code that don't exist in anyone else's code and that's more and more true the larger your code base is so you can imagine a future where high compute rl and per code base customization leads to significantly outperforming agents on each individual domain the equivalent of hiring a software engineer and giving them millions of years of experience working specifically in your environment so this is some of the research work we've been doing at cognition that powers devin under the hood if you'd like to play around and try this yourself you can use you can go to devin.ai and sign up for an account connect it with your existing code give it a task and go from ticket to pr thank you so much for having me thank you thank you Thank you.

Thank you.

Multi-Agent Frontiers: Making Devin

Transcript