Multi-Agent Frontiers: Making Devin

00:00:00.000 | Langchain team thank you so much for having me really excited to share a

00:00:11.820 | little bit more about how we made Devon so my name is Russell Kaplan I'm

00:00:15.840 | president at cognition we're the company behind Devon and as a quick show of

00:00:20.040 | hands how many of you have heard of Devon before all right almost everyone

00:00:24.720 | so Devon is an AI software engineer but we are really focused specifically on

00:00:31.760 | working within existing code bases there's lots of amazing AI tools out

00:00:36.420 | there for coding and what we found is that as the code bases get larger the

00:00:40.480 | problem gets harder and most of our customers and users around the world are

00:00:44.460 | teams they're teams of engineers or companies full of engineers they're

00:00:47.960 | trying to ship real-world products and so today I want to talk a little bit more

00:00:53.360 | about what Devon is but more importantly how we built it and I'm gonna share

00:00:56.900 | some new sort of technical information we're releasing on exactly how this works

00:01:00.380 | under the hood that I'm really excited to to present to you all first what are we

00:01:05.600 | seeing in AI for coding obviously this is a really fast-moving field and in many

00:01:12.020 | ways software engineering is one of the first large-scale successful applications

00:01:16.040 | of generative AI it started in many ways with co-pilots real-time text completion

00:01:22.580 | inside your editor that makes you as an engineer go a bit faster and now we also

00:01:29.000 | have AI IDEs again a development environment for you as an individual

00:01:33.620 | engineer to get even more leverage and for sometimes delegating entire tasks or

00:01:38.240 | snippets and really coding in flow with AI assistance we see Devon as part of a

00:01:44.660 | third wave of AI developer tools which is on the fully autonomous agent end of the

00:01:50.440 | spectrum more AI teammate than AI copilot companies around the world are using Devon like another

00:01:58.660 | member of their engineering team going directly from ticket to pull request

00:02:02.600 | collaborating with Devon in slack or Jira or linear and we see the large majority of

00:02:09.200 | Devon sessions and Devon PRs are started from within these other tools the same way you might interact with another human engineer

00:02:16.860 | architecturally this is really different from something that runs locally on your computer Devon is a cloud AI agent and what we've seen is that it's very complimentary to these local AI development tools

00:02:30.860 | when you are coding yourself and you want to stay in flow and get that speed up you use a local AI development tool where people use Devon is when they're really ready to delegate the task entirely and this is a very different set of

00:02:45.860 | technical trade-off you get large-scale parallelism asynchronousness and the ability to completely delegate individual units of work

00:02:54.080 | in the team setting this also means that Devon's run remotely not locally and so they share the same environment across different runs so you can try many different things in parallel combine them together

00:03:06.080 | and the teams of engineers who use Devon will break up large-scale engineering outcomes into small individual tasks delegated to a fleet of Devons and then coalesce together inside the code base

00:03:18.080 | and the main thing our users look for is for that code from Devon to get merged as pull requests

00:03:24.080 | it also changes the learning model for Devon

00:03:28.080 | in the cloud AI agent setting Devon is not just for you it's for your team and for your organization

00:03:34.080 | and so as Devon learns from your interactions those learnings are not kept only with you

00:03:40.080 | instead they're incorporated as part of your team and as part of your organization

00:03:45.080 | and this reliance on organizational knowledge is something that we've seen is really important for working with existing large-scale code bases

00:03:53.080 | because working with large code bases is really hard and so I'm going to go into some more detail on exactly how we do this under the hood with Devon

00:04:02.080 | part one is all about context if you want to build an AI software engineer you need to understand existing code

00:04:12.080 | you don't want your AI code contributions to be using a new framework adding new dependencies or being done in isolation of what you already have

00:04:21.080 | and code base understanding is pretty hard LLMs are amazing at so many things but they have limited context windows

00:04:30.080 | and even if a code base fits inside the context window the effective context window is often a lot lower than the advertised context window

00:04:36.080 | we have a series of internal benchmarks that measures effective reasoning capacity across a context and we find very consistently that the advertised context window is much higher than the effective reasoning context window

00:04:48.080 | Large code bases also have complex dependencies they can span multiple services multiple repositories and it can be intertwined in very complicated ways even for human engineers

00:05:00.080 | there's huge variations in code quality there might be some parts of the code base you want Devon to emulate and some parts you really want Devon to stay away from when it's learning how to be a productive member of your team

00:05:11.080 | same thing is true for documentation the code might have comments might have comments might have documentations that's outright incorrect or misleading

00:05:20.080 | all of these things are part of the technical challenges we work on to make Devon work in real world code bases

00:05:26.080 | the last critical piece of real world code bases is that the larger the code base the more custom it tends to be

00:05:32.080 | teams and companies build their own proprietary frameworks they have their own specific jargon there's context in the code that's not inside the code itself

00:05:40.080 | but the organizational workflow around the code and so these are the research questions that we set out to solve to make Devon actually useful in the real world

00:05:47.080 | and the first thing I'm going to go into more detail on is something we actually recently released free and publicly for all open source repositories

00:05:58.080 | it's called deep wiki deep wiki is a real-time continually updated index of your code base published as an interactive wiki almost like a real-time confluence page with documentation diagrams

00:06:14.080 | and the ability to ask questions about your code we had this originally as an internal data structure for Devon

00:06:20.080 | it wasn't a it wasn't a product it was just a tool that Devon could use to get high-level context about the code

00:06:26.080 | and what we realized is that human engineers wanted to see this information too and so we decided to release it as a standalone product and service

00:06:33.080 | so you can take any github URL today and just change the github to deep wiki.com and for any open source repo you'll get a full interactive wiki

00:06:43.080 | this also works on your private repos when they're integrated with Devon

00:06:47.080 | and so I looked at the Langchain repo and we have a full updated up-to-date documentation page for Langchain

00:06:55.080 | that has not only the pros of how it's organized the key concepts in Langchain's code base but also architectural diagrams and data flows

00:07:03.080 | and we've gotten a lot of feedback from the community that these diagrams are in some cases or in many cases actually better than the diagrams of the official documentation of very popular open source projects

00:07:15.080 | right whether it's folks who are on the typescript steering committee the vllm maintainers or others we're getting lots of amazing feedback on how great deep wiki is and we've had you know thousands of code bases start linking to deep wiki as part of their official documentation

00:07:29.080 | so definitely check this out if you're working on open source code yourself how does this work under the hood we just said that lms are really bad at raising about large code bases

00:07:43.080 | so let's give you the high-level algorithm of what we're doing under the hood to generate these wikis

00:07:47.080 | step one it's actually not about the code it's about the concepts what are the key principles inside this code base that are going to form our table of contents for how we lay out the macro context of this code base

00:08:01.080 | and what we found is that the in many cases those concepts um you don't just get them from the source code itself there is extremely rich information in the metadata around the source code for example was that source code added in as part of a pull request

00:08:17.080 | well which member of the team added that pull request what else have they contributed to was there discussion in that pull request about the code um what were are there comments is there documentation the git commit history all of this metadata

00:08:29.080 | data is a really useful source for building these high context wikis once you have those concepts then you can connect them to the code so what are the connections between the various code files and the proprietary or specific concepts inside this code base

00:08:48.080 | and after you have that you need to connect the code itself there's different sections of the code base uh you know some files that are more related less related

00:08:57.080 | there's uh call you know sort of call traces and flows and there's a specific way that these different components of the code base connect to each other

00:09:06.080 | you can look at things like the symbol the symbol graph you can look at the call graph uh and you can look at how um how these files tend to be used together

00:09:14.080 | and once you have those code to code connections then you can actually generate a wiki and for each concept what we do is we use an agent to go research that concept in the context of the specific code base

00:09:25.080 | uh we generate a wiki page about it and then we also provide those intermediate artifacts as context uh and as tools

00:09:32.080 | and when you put this all together uh you get very rich representations of code uh and we use graphs as a critical part of those representations uh and so this is a graph of uh the lang chain code base where you can see uh at a high level that different files are more or less related to each other uh with a lot of a lot of logic in the core and then maybe outskirts that are more related to test harnesses documentation

00:09:53.080 | specific integrations with third parties and so on and these data structures power a lot of how devon actually works inside large you know million multi-million line of code code bases

00:10:07.080 | so we've got our wiki but we also need to be able to search the code and this is another feature that's now mainline in devon but started as an internal tool for devon the ai software engineer

00:10:18.080 | that's the trend we're seeing is to make devon a great software engineer you need to build tools that are so useful that human engineers want to use them too

00:10:25.080 | and so we have devon search which is essentially deep research on your proprietary code base again whether it's open source or internal

00:10:35.080 | you can ask questions about the code devon will scan through that code try to understand what's going on

00:10:40.080 | using both the micro code the individual files but also the macro context it has from this wiki data structure

00:10:47.080 | uh and it will find this information for example i asked how do i enforce structured output in length chain

00:10:52.080 | and devon went and found the right section of the documentation from length chain as well as the actual implementation code for what to do

00:11:01.080 | devon search gives devon context it's an essential part under the hood of how devon the autonomous ai agent can actually make useful changes inside larger team wide code bases

00:11:11.080 | once you get a query you need to do pre-processing and of course rag is a component of that but we end up doing a lot more under the hood than just rag

00:11:20.080 | including junk removal some more advanced filtering of less relevant information and re-ranking multi-hop search to end up with this set of context that we think is very very relevant for this query

00:11:32.080 | um and that context again includes both source files but also wiki pages you need the micro and the macro context to provide really useful really useful recommendations

00:11:42.080 | um and from that we can get a grounded answer people don't want hallucinations in their wikis and they don't want hallucinations in their search

00:11:49.080 | so the grounding is essential for this to actually be useful

00:11:54.080 | the second part of how we optimize and customize to existing code bases is a bit more research oriented uh and i'm excited to share a little bit more of some of the post training and rl

00:12:04.080 | that we do under the hood to make devon work well inside specific narrow domains

00:12:10.080 | we recently released a new model an open source free model called kevin uh colonel devon kevin 32b

00:12:18.080 | kevin is uh uh outperforms many state-of-the-art foundation models on the narrow domain of writing cuda kernels

00:12:26.080 | raise your hand if you've ever heard of a cuda kernel

00:12:30.080 | all right we have an audience that's very familiar with the under fittings of ml for those who haven't heard of cuda kernels

00:12:36.080 | um this is the source code that you use to write gpu optimized uh gpu optimized implementations for nvidia gpus

00:12:44.080 | and so under the hood when you're using pytorch or tensorflow um those high-level graph operations are being executed under the hood by cuda kernels

00:12:52.080 | and the the domain of writing cuda kernels is extremely rich because this is a very low-level programming language

00:12:59.080 | relative to what many of us operate in more typically day-to-day say python and uh cuda kernels were released

00:13:06.080 | as a uh uh kernel bench was released as a benchmark by anne simon azalea uh to estimate models capabilities

00:13:15.080 | of generating these very niche very specific cuda kernels at high performance and high reliability

00:13:21.080 | and this work from cognition was done by carlo pietro ben supervised by silas uh these were our research

00:13:28.080 | uh these were our research interns uh who who did who got really really exciting results um from a from a single project

00:13:35.080 | so let's talk about what this work does more specifically um the goal is to take high-level

00:13:42.080 | machine learning code say a few different calls to pytorch and rewrite it as a highly optimized performant

00:13:49.080 | correct cuda kernel this is a very detailed problem domain that many low-level machine learning researchers

00:13:56.080 | spend you know their entire careers optimizing uh the design space is quite large for how to write optimal cuda kernels

00:14:03.080 | and it's quite challenging um what we see in practice in the ml community is that a lot of progress in machine learning

00:14:09.080 | in machine learning is really driven by performance on the hardware and so even if your algorithm or your new paper

00:14:16.080 | is big o optimal like a linear attention mechanism um under the hood if your implementation is not efficient

00:14:22.080 | cache friendly uh performant on actual gpu hardware it tends to not be that useful so this is a really active

00:14:28.080 | research domain for ml researchers and we want uh kevin to be good at writing these optimized cuda kernels

00:14:36.080 | so how does this work uh the first step is to define your reward function and one of the great things about software

00:14:43.080 | and in particular writing cuda kernels is that it's often easy to get automatically verifiable reward

00:14:50.080 | can you verify the correctness of your code automatically well in this case we have a less performant reference

00:14:55.080 | implementation that we can use to check correctness and so whenever kevin uh which is our post train uh sort of

00:15:03.080 | llm for this project whenever kevin writes a kernel we run it through a series of checks right first of all

00:15:09.080 | does that code parse um is it actually valid cuda does it compile does it run uh and then after all that

00:15:15.080 | is it correct and only if it's correct do we then grade it for performance how much faster or slower is it

00:15:21.080 | than the reference implementation so with this reward function notice we don't need a machine learning model here

00:15:26.080 | this is purely a set of um automatically verifiable steps um which makes which makes this very very

00:15:33.080 | friendly for high compute rl once you have the reward function you can use it for multi-turn training

00:15:39.080 | and so we use multi-turn grpo uh and for those who aren't familiar what's going on here is we're taking

00:15:45.080 | we're taking multiple different trajectories in sequence for this model to get better at writing

00:15:51.080 | cuda code so on the left here we have an initial prompt which results in a chain of thought from the

00:15:57.080 | model and an output and that output may or may not be correct uh when we move to the second the middle of

00:16:02.080 | this diagram we provide eval info back to the model and this eval info is the results from trying to run that

00:16:09.080 | kernel in a real world gpu environment um there's a lot of work you have to do in terms of sandboxing

00:16:14.080 | and isolation to make sure these incorrect cuda kernels don't mess up your training process or crash your

00:16:19.080 | gpus and then you're getting accurate performance benchmarks but we package all that up into almost

00:16:23.080 | like a struct of eval information that that model can then see as it tries again and it tries again with

00:16:30.080 | a second chain of thought a second kernel that gets passed to another step and this process repeats over

00:16:35.080 | several steps and the result is hopefully a correct kernel then you have to distribute your rewards to

00:16:43.080 | train this information and what we found is that you don't want to just reward based on the final output

00:16:49.080 | uh and its correctness or not and its performance or lack of performance actually the the path to get

00:16:55.080 | there is also valuable so you'll notice in red at the bottom here we have this uh this sum of different

00:17:02.080 | rewards discounted by gamma over time and what that's showing is the very first trajectory the

00:17:08.080 | very first step of that trajectory gets a reward even if it wasn't correct itself if it led to a

00:17:15.080 | correct solution and a performance solution downstream are you barking up the right tree is basically the

00:17:20.080 | reward we want to give the model and what we found in this project is that being able to do this over

00:17:24.080 | multiple iterations with these discounted rewards was really important for this to work because writing

00:17:29.080 | cooter kernels is hard and so the reward signal is going to be sparse if you only get one shot

00:17:35.080 | and once you do this uh you can you can find that it's not impossible to very deeply optimize for these

00:17:43.080 | narrow problem domains so in this graph we have the um correctness on the left how many of the kernels were written

00:17:52.080 | correctly by this model and kevin 32b is getting 91 correct on this section of the kernel bench benchmark that we focused on

00:18:00.080 | and you can see that compared to even 04 mini or 03 this is a significant improvement um this is a narrow

00:18:06.080 | domain where high compute rl lets you outperform existing models on the right you see performance so we rewarded devon proportional to how much speedup uh you know kevin got in this project

00:18:20.080 | and so as the kernels got faster and faster it got more and more reward and what we found is that even from a performance standpoint

00:18:28.080 | kevin 32b is able to outperform these larger scale foundation models

00:18:34.080 | and this was really interesting result to us because um it kind of flies in the face of many many sort of broad discussions of oh these foundation models are going to be the best at everything and you should use them exclusively for everything

00:18:48.080 | but what we see internally all the time is that for any given narrow domain if you can set up your environment to do high compute rl in that domain

00:18:56.080 | it's very feasible to outperform an out-of-the-box foundation model especially as the open source based models that you start with have improved

00:19:06.080 | to actually make this work in practice it's important that you keep your model doing what you actually want it to do and not cheating along the way

00:19:14.080 | and this is called reward hacking in rl and it's uh many cases actually challenging to prevent so i want to show you a few ways that kevin misbehaved

00:19:22.080 | uh that we had to sort of steer back so one is kevin realized that it could write the cuda and then wrap the whole thing in a try except block

00:19:30.080 | and just fall back to the existing pi torch implementation and you know it would always score 100

00:19:36.080 | percent correct in that case and it had some chance of being faster than average but if it wasn't it it sort of defaulted to the one x

00:19:42.080 | so that was a very unproductive direction for for kevin to go down during the rl process and uh we had to make sure

00:19:48.080 | that we update the reward function to recognize this type of reward hacking the second is even a bit more subtle uh and so

00:19:56.080 | the test harness to make sure that kevin's code was correct uh had a class you know in this case called model new uh that inherited

00:20:04.080 | from model and you can see here what kevin realized is that it could implement the model as a uh you know as a subclass of

00:20:13.080 | nn.module with its attempt at optimize code code and then it could just overwrite that uh class name in the namespace

00:20:20.080 | and so you can see it defined a second model new that this in this case just inherits directly from the correct

00:20:25.080 | model implementation so these models got very creative at all uh how to sort of get around your intentions and

00:20:31.080 | this is a challenge in rl and so making sure you correctly define your environment is is really critical to success

00:20:37.080 | and for those of you who have used maybe really popular commercial models like some of the most popular models for coding

00:20:42.080 | you might have seen that as the models get better sometimes they're more aggressive at doing things like commenting out your test cases

00:20:49.080 | to make sure the works will pass that's what's going on under the hood is this is uh this is a smell of reward hacking

00:20:57.080 | and so it's a constant sort of cat and mouse game between the researchers who are trying to steer these models to do what we actually want

00:21:03.080 | and the models they're trying to exploit every possible way to get this high quality reward

00:21:10.080 | so what do we learn from this um custom post training can and does outperform frontier models on specific

00:21:17.080 | narrow domains for reinforcement learning specifically um especially in code it's more compute bound than data bound

00:21:24.080 | you know kernel bench the subset of kernel bench that we trained on only had 180 tasks which is really not that many when you think about it

00:21:31.080 | but by applying high compute rl rolling out these trajectories again and again there is very very rich reward signal to learn from

00:21:38.080 | and that's because in software we have an oracle that can help with these rewards we actually have the environment

00:21:43.080 | we can run the code we can see if it compiles we can see how fast it is

00:21:46.080 | and this in my opinion is one of the reasons that um software and coding specifically has accelerated particularly fast

00:21:53.080 | as an ai capability is that code is one of the few domains where this property holds

00:21:58.080 | i used to lead machine learning at scale ai which provides post training human data for many of the large scale foundation model labs

00:22:05.080 | and it gets really hard to label by hand high quality high accuracy data as the models get smarter but code doesn't have that bottleneck

00:22:13.080 | because you can continually scale um based on automatic signals of correctness

00:22:19.080 | and that's really the third key is automatic verification allows you to scale

00:22:23.080 | so for your own code bases and your own process putting in the ci systems putting in the test coverage

00:22:28.080 | putting in the harnesses that allow that automatic verification is going to future proof your code as rl and as ai gets better

00:22:35.080 | and we see many of our users of devin they first take their code base with devin and go fix all the test coverage issues

00:22:42.080 | and now that they have full test coverage it's even faster to use devin to ship new more pull requests

00:22:48.080 | the last big point here is i just showed you an example on cuda kernels

00:22:54.080 | but to me the the more interesting deeper implication of this research is every code base is in some sense a narrow domain

00:23:00.080 | there are specific things to your code that don't exist in anyone else's code

00:23:04.080 | and that's more and more true the larger your code base is so you can imagine a future where high compute rl and per code base customization leads to significantly outperforming agents on each individual domain

00:23:16.080 | the equivalent of hiring a software engineer and giving them millions of years of experience working specifically in your environment

00:23:24.080 | so this is some of the research work we've been doing at cognition that powers devin under the hood

00:23:28.080 | if you'd like to play around and try this yourself you can use you can go to devin.ai and sign up for an account

00:23:34.080 | connect it with your existing code give it a task and go from ticket to pr thank you so much for having me

00:23:40.080 | thank you

00:23:42.080 | thank you

00:23:44.080 | Thank you.

00:23:45.080 | Thank you.