back to indexMulti-Agent Frontiers: Making Devin

00:00:00.000 |
Langchain team thank you so much for having me really excited to share a 00:00:11.820 |
little bit more about how we made Devon so my name is Russell Kaplan I'm 00:00:15.840 |
president at cognition we're the company behind Devon and as a quick show of 00:00:20.040 |
hands how many of you have heard of Devon before all right almost everyone 00:00:24.720 |
so Devon is an AI software engineer but we are really focused specifically on 00:00:31.760 |
working within existing code bases there's lots of amazing AI tools out 00:00:36.420 |
there for coding and what we found is that as the code bases get larger the 00:00:40.480 |
problem gets harder and most of our customers and users around the world are 00:00:44.460 |
teams they're teams of engineers or companies full of engineers they're 00:00:47.960 |
trying to ship real-world products and so today I want to talk a little bit more 00:00:53.360 |
about what Devon is but more importantly how we built it and I'm gonna share 00:00:56.900 |
some new sort of technical information we're releasing on exactly how this works 00:01:00.380 |
under the hood that I'm really excited to to present to you all first what are we 00:01:05.600 |
seeing in AI for coding obviously this is a really fast-moving field and in many 00:01:12.020 |
ways software engineering is one of the first large-scale successful applications 00:01:16.040 |
of generative AI it started in many ways with co-pilots real-time text completion 00:01:22.580 |
inside your editor that makes you as an engineer go a bit faster and now we also 00:01:29.000 |
have AI IDEs again a development environment for you as an individual 00:01:33.620 |
engineer to get even more leverage and for sometimes delegating entire tasks or 00:01:38.240 |
snippets and really coding in flow with AI assistance we see Devon as part of a 00:01:44.660 |
third wave of AI developer tools which is on the fully autonomous agent end of the 00:01:50.440 |
spectrum more AI teammate than AI copilot companies around the world are using Devon like another 00:01:58.660 |
member of their engineering team going directly from ticket to pull request 00:02:02.600 |
collaborating with Devon in slack or Jira or linear and we see the large majority of 00:02:09.200 |
Devon sessions and Devon PRs are started from within these other tools the same way you might interact with another human engineer 00:02:16.860 |
architecturally this is really different from something that runs locally on your computer Devon is a cloud AI agent and what we've seen is that it's very complimentary to these local AI development tools 00:02:30.860 |
when you are coding yourself and you want to stay in flow and get that speed up you use a local AI development tool where people use Devon is when they're really ready to delegate the task entirely and this is a very different set of 00:02:45.860 |
technical trade-off you get large-scale parallelism asynchronousness and the ability to completely delegate individual units of work 00:02:54.080 |
in the team setting this also means that Devon's run remotely not locally and so they share the same environment across different runs so you can try many different things in parallel combine them together 00:03:06.080 |
and the teams of engineers who use Devon will break up large-scale engineering outcomes into small individual tasks delegated to a fleet of Devons and then coalesce together inside the code base 00:03:18.080 |
and the main thing our users look for is for that code from Devon to get merged as pull requests 00:03:28.080 |
in the cloud AI agent setting Devon is not just for you it's for your team and for your organization 00:03:34.080 |
and so as Devon learns from your interactions those learnings are not kept only with you 00:03:40.080 |
instead they're incorporated as part of your team and as part of your organization 00:03:45.080 |
and this reliance on organizational knowledge is something that we've seen is really important for working with existing large-scale code bases 00:03:53.080 |
because working with large code bases is really hard and so I'm going to go into some more detail on exactly how we do this under the hood with Devon 00:04:02.080 |
part one is all about context if you want to build an AI software engineer you need to understand existing code 00:04:12.080 |
you don't want your AI code contributions to be using a new framework adding new dependencies or being done in isolation of what you already have 00:04:21.080 |
and code base understanding is pretty hard LLMs are amazing at so many things but they have limited context windows 00:04:30.080 |
and even if a code base fits inside the context window the effective context window is often a lot lower than the advertised context window 00:04:36.080 |
we have a series of internal benchmarks that measures effective reasoning capacity across a context and we find very consistently that the advertised context window is much higher than the effective reasoning context window 00:04:48.080 |
Large code bases also have complex dependencies they can span multiple services multiple repositories and it can be intertwined in very complicated ways even for human engineers 00:05:00.080 |
there's huge variations in code quality there might be some parts of the code base you want Devon to emulate and some parts you really want Devon to stay away from when it's learning how to be a productive member of your team 00:05:11.080 |
same thing is true for documentation the code might have comments might have comments might have documentations that's outright incorrect or misleading 00:05:20.080 |
all of these things are part of the technical challenges we work on to make Devon work in real world code bases 00:05:26.080 |
the last critical piece of real world code bases is that the larger the code base the more custom it tends to be 00:05:32.080 |
teams and companies build their own proprietary frameworks they have their own specific jargon there's context in the code that's not inside the code itself 00:05:40.080 |
but the organizational workflow around the code and so these are the research questions that we set out to solve to make Devon actually useful in the real world 00:05:47.080 |
and the first thing I'm going to go into more detail on is something we actually recently released free and publicly for all open source repositories 00:05:58.080 |
it's called deep wiki deep wiki is a real-time continually updated index of your code base published as an interactive wiki almost like a real-time confluence page with documentation diagrams 00:06:14.080 |
and the ability to ask questions about your code we had this originally as an internal data structure for Devon 00:06:20.080 |
it wasn't a it wasn't a product it was just a tool that Devon could use to get high-level context about the code 00:06:26.080 |
and what we realized is that human engineers wanted to see this information too and so we decided to release it as a standalone product and service 00:06:33.080 |
so you can take any github URL today and just change the github to deep wiki.com and for any open source repo you'll get a full interactive wiki 00:06:43.080 |
this also works on your private repos when they're integrated with Devon 00:06:47.080 |
and so I looked at the Langchain repo and we have a full updated up-to-date documentation page for Langchain 00:06:55.080 |
that has not only the pros of how it's organized the key concepts in Langchain's code base but also architectural diagrams and data flows 00:07:03.080 |
and we've gotten a lot of feedback from the community that these diagrams are in some cases or in many cases actually better than the diagrams of the official documentation of very popular open source projects 00:07:15.080 |
right whether it's folks who are on the typescript steering committee the vllm maintainers or others we're getting lots of amazing feedback on how great deep wiki is and we've had you know thousands of code bases start linking to deep wiki as part of their official documentation 00:07:29.080 |
so definitely check this out if you're working on open source code yourself how does this work under the hood we just said that lms are really bad at raising about large code bases 00:07:43.080 |
so let's give you the high-level algorithm of what we're doing under the hood to generate these wikis 00:07:47.080 |
step one it's actually not about the code it's about the concepts what are the key principles inside this code base that are going to form our table of contents for how we lay out the macro context of this code base 00:08:01.080 |
and what we found is that the in many cases those concepts um you don't just get them from the source code itself there is extremely rich information in the metadata around the source code for example was that source code added in as part of a pull request 00:08:17.080 |
well which member of the team added that pull request what else have they contributed to was there discussion in that pull request about the code um what were are there comments is there documentation the git commit history all of this metadata 00:08:29.080 |
data is a really useful source for building these high context wikis once you have those concepts then you can connect them to the code so what are the connections between the various code files and the proprietary or specific concepts inside this code base 00:08:48.080 |
and after you have that you need to connect the code itself there's different sections of the code base uh you know some files that are more related less related 00:08:57.080 |
there's uh call you know sort of call traces and flows and there's a specific way that these different components of the code base connect to each other 00:09:06.080 |
you can look at things like the symbol the symbol graph you can look at the call graph uh and you can look at how um how these files tend to be used together 00:09:14.080 |
and once you have those code to code connections then you can actually generate a wiki and for each concept what we do is we use an agent to go research that concept in the context of the specific code base 00:09:25.080 |
uh we generate a wiki page about it and then we also provide those intermediate artifacts as context uh and as tools 00:09:32.080 |
and when you put this all together uh you get very rich representations of code uh and we use graphs as a critical part of those representations uh and so this is a graph of uh the lang chain code base where you can see uh at a high level that different files are more or less related to each other uh with a lot of a lot of logic in the core and then maybe outskirts that are more related to test harnesses documentation 00:09:53.080 |
specific integrations with third parties and so on and these data structures power a lot of how devon actually works inside large you know million multi-million line of code code bases 00:10:07.080 |
so we've got our wiki but we also need to be able to search the code and this is another feature that's now mainline in devon but started as an internal tool for devon the ai software engineer 00:10:18.080 |
that's the trend we're seeing is to make devon a great software engineer you need to build tools that are so useful that human engineers want to use them too 00:10:25.080 |
and so we have devon search which is essentially deep research on your proprietary code base again whether it's open source or internal 00:10:35.080 |
you can ask questions about the code devon will scan through that code try to understand what's going on 00:10:40.080 |
using both the micro code the individual files but also the macro context it has from this wiki data structure 00:10:47.080 |
uh and it will find this information for example i asked how do i enforce structured output in length chain 00:10:52.080 |
and devon went and found the right section of the documentation from length chain as well as the actual implementation code for what to do 00:11:01.080 |
devon search gives devon context it's an essential part under the hood of how devon the autonomous ai agent can actually make useful changes inside larger team wide code bases 00:11:11.080 |
once you get a query you need to do pre-processing and of course rag is a component of that but we end up doing a lot more under the hood than just rag 00:11:20.080 |
including junk removal some more advanced filtering of less relevant information and re-ranking multi-hop search to end up with this set of context that we think is very very relevant for this query 00:11:32.080 |
um and that context again includes both source files but also wiki pages you need the micro and the macro context to provide really useful really useful recommendations 00:11:42.080 |
um and from that we can get a grounded answer people don't want hallucinations in their wikis and they don't want hallucinations in their search 00:11:49.080 |
so the grounding is essential for this to actually be useful 00:11:54.080 |
the second part of how we optimize and customize to existing code bases is a bit more research oriented uh and i'm excited to share a little bit more of some of the post training and rl 00:12:04.080 |
that we do under the hood to make devon work well inside specific narrow domains 00:12:10.080 |
we recently released a new model an open source free model called kevin uh colonel devon kevin 32b 00:12:18.080 |
kevin is uh uh outperforms many state-of-the-art foundation models on the narrow domain of writing cuda kernels 00:12:26.080 |
raise your hand if you've ever heard of a cuda kernel 00:12:30.080 |
all right we have an audience that's very familiar with the under fittings of ml for those who haven't heard of cuda kernels 00:12:36.080 |
um this is the source code that you use to write gpu optimized uh gpu optimized implementations for nvidia gpus 00:12:44.080 |
and so under the hood when you're using pytorch or tensorflow um those high-level graph operations are being executed under the hood by cuda kernels 00:12:52.080 |
and the the domain of writing cuda kernels is extremely rich because this is a very low-level programming language 00:12:59.080 |
relative to what many of us operate in more typically day-to-day say python and uh cuda kernels were released 00:13:06.080 |
as a uh uh kernel bench was released as a benchmark by anne simon azalea uh to estimate models capabilities 00:13:15.080 |
of generating these very niche very specific cuda kernels at high performance and high reliability 00:13:21.080 |
and this work from cognition was done by carlo pietro ben supervised by silas uh these were our research 00:13:28.080 |
uh these were our research interns uh who who did who got really really exciting results um from a from a single project 00:13:35.080 |
so let's talk about what this work does more specifically um the goal is to take high-level 00:13:42.080 |
machine learning code say a few different calls to pytorch and rewrite it as a highly optimized performant 00:13:49.080 |
correct cuda kernel this is a very detailed problem domain that many low-level machine learning researchers 00:13:56.080 |
spend you know their entire careers optimizing uh the design space is quite large for how to write optimal cuda kernels 00:14:03.080 |
and it's quite challenging um what we see in practice in the ml community is that a lot of progress in machine learning 00:14:09.080 |
in machine learning is really driven by performance on the hardware and so even if your algorithm or your new paper 00:14:16.080 |
is big o optimal like a linear attention mechanism um under the hood if your implementation is not efficient 00:14:22.080 |
cache friendly uh performant on actual gpu hardware it tends to not be that useful so this is a really active 00:14:28.080 |
research domain for ml researchers and we want uh kevin to be good at writing these optimized cuda kernels 00:14:36.080 |
so how does this work uh the first step is to define your reward function and one of the great things about software 00:14:43.080 |
and in particular writing cuda kernels is that it's often easy to get automatically verifiable reward 00:14:50.080 |
can you verify the correctness of your code automatically well in this case we have a less performant reference 00:14:55.080 |
implementation that we can use to check correctness and so whenever kevin uh which is our post train uh sort of 00:15:03.080 |
llm for this project whenever kevin writes a kernel we run it through a series of checks right first of all 00:15:09.080 |
does that code parse um is it actually valid cuda does it compile does it run uh and then after all that 00:15:15.080 |
is it correct and only if it's correct do we then grade it for performance how much faster or slower is it 00:15:21.080 |
than the reference implementation so with this reward function notice we don't need a machine learning model here 00:15:26.080 |
this is purely a set of um automatically verifiable steps um which makes which makes this very very 00:15:33.080 |
friendly for high compute rl once you have the reward function you can use it for multi-turn training 00:15:39.080 |
and so we use multi-turn grpo uh and for those who aren't familiar what's going on here is we're taking 00:15:45.080 |
we're taking multiple different trajectories in sequence for this model to get better at writing 00:15:51.080 |
cuda code so on the left here we have an initial prompt which results in a chain of thought from the 00:15:57.080 |
model and an output and that output may or may not be correct uh when we move to the second the middle of 00:16:02.080 |
this diagram we provide eval info back to the model and this eval info is the results from trying to run that 00:16:09.080 |
kernel in a real world gpu environment um there's a lot of work you have to do in terms of sandboxing 00:16:14.080 |
and isolation to make sure these incorrect cuda kernels don't mess up your training process or crash your 00:16:19.080 |
gpus and then you're getting accurate performance benchmarks but we package all that up into almost 00:16:23.080 |
like a struct of eval information that that model can then see as it tries again and it tries again with 00:16:30.080 |
a second chain of thought a second kernel that gets passed to another step and this process repeats over 00:16:35.080 |
several steps and the result is hopefully a correct kernel then you have to distribute your rewards to 00:16:43.080 |
train this information and what we found is that you don't want to just reward based on the final output 00:16:49.080 |
uh and its correctness or not and its performance or lack of performance actually the the path to get 00:16:55.080 |
there is also valuable so you'll notice in red at the bottom here we have this uh this sum of different 00:17:02.080 |
rewards discounted by gamma over time and what that's showing is the very first trajectory the 00:17:08.080 |
very first step of that trajectory gets a reward even if it wasn't correct itself if it led to a 00:17:15.080 |
correct solution and a performance solution downstream are you barking up the right tree is basically the 00:17:20.080 |
reward we want to give the model and what we found in this project is that being able to do this over 00:17:24.080 |
multiple iterations with these discounted rewards was really important for this to work because writing 00:17:29.080 |
cooter kernels is hard and so the reward signal is going to be sparse if you only get one shot 00:17:35.080 |
and once you do this uh you can you can find that it's not impossible to very deeply optimize for these 00:17:43.080 |
narrow problem domains so in this graph we have the um correctness on the left how many of the kernels were written 00:17:52.080 |
correctly by this model and kevin 32b is getting 91 correct on this section of the kernel bench benchmark that we focused on 00:18:00.080 |
and you can see that compared to even 04 mini or 03 this is a significant improvement um this is a narrow 00:18:06.080 |
domain where high compute rl lets you outperform existing models on the right you see performance so we rewarded devon proportional to how much speedup uh you know kevin got in this project 00:18:20.080 |
and so as the kernels got faster and faster it got more and more reward and what we found is that even from a performance standpoint 00:18:28.080 |
kevin 32b is able to outperform these larger scale foundation models 00:18:34.080 |
and this was really interesting result to us because um it kind of flies in the face of many many sort of broad discussions of oh these foundation models are going to be the best at everything and you should use them exclusively for everything 00:18:48.080 |
but what we see internally all the time is that for any given narrow domain if you can set up your environment to do high compute rl in that domain 00:18:56.080 |
it's very feasible to outperform an out-of-the-box foundation model especially as the open source based models that you start with have improved 00:19:06.080 |
to actually make this work in practice it's important that you keep your model doing what you actually want it to do and not cheating along the way 00:19:14.080 |
and this is called reward hacking in rl and it's uh many cases actually challenging to prevent so i want to show you a few ways that kevin misbehaved 00:19:22.080 |
uh that we had to sort of steer back so one is kevin realized that it could write the cuda and then wrap the whole thing in a try except block 00:19:30.080 |
and just fall back to the existing pi torch implementation and you know it would always score 100 00:19:36.080 |
percent correct in that case and it had some chance of being faster than average but if it wasn't it it sort of defaulted to the one x 00:19:42.080 |
so that was a very unproductive direction for for kevin to go down during the rl process and uh we had to make sure 00:19:48.080 |
that we update the reward function to recognize this type of reward hacking the second is even a bit more subtle uh and so 00:19:56.080 |
the test harness to make sure that kevin's code was correct uh had a class you know in this case called model new uh that inherited 00:20:04.080 |
from model and you can see here what kevin realized is that it could implement the model as a uh you know as a subclass of 00:20:13.080 |
nn.module with its attempt at optimize code code and then it could just overwrite that uh class name in the namespace 00:20:20.080 |
and so you can see it defined a second model new that this in this case just inherits directly from the correct 00:20:25.080 |
model implementation so these models got very creative at all uh how to sort of get around your intentions and 00:20:31.080 |
this is a challenge in rl and so making sure you correctly define your environment is is really critical to success 00:20:37.080 |
and for those of you who have used maybe really popular commercial models like some of the most popular models for coding 00:20:42.080 |
you might have seen that as the models get better sometimes they're more aggressive at doing things like commenting out your test cases 00:20:49.080 |
to make sure the works will pass that's what's going on under the hood is this is uh this is a smell of reward hacking 00:20:57.080 |
and so it's a constant sort of cat and mouse game between the researchers who are trying to steer these models to do what we actually want 00:21:03.080 |
and the models they're trying to exploit every possible way to get this high quality reward 00:21:10.080 |
so what do we learn from this um custom post training can and does outperform frontier models on specific 00:21:17.080 |
narrow domains for reinforcement learning specifically um especially in code it's more compute bound than data bound 00:21:24.080 |
you know kernel bench the subset of kernel bench that we trained on only had 180 tasks which is really not that many when you think about it 00:21:31.080 |
but by applying high compute rl rolling out these trajectories again and again there is very very rich reward signal to learn from 00:21:38.080 |
and that's because in software we have an oracle that can help with these rewards we actually have the environment 00:21:43.080 |
we can run the code we can see if it compiles we can see how fast it is 00:21:46.080 |
and this in my opinion is one of the reasons that um software and coding specifically has accelerated particularly fast 00:21:53.080 |
as an ai capability is that code is one of the few domains where this property holds 00:21:58.080 |
i used to lead machine learning at scale ai which provides post training human data for many of the large scale foundation model labs 00:22:05.080 |
and it gets really hard to label by hand high quality high accuracy data as the models get smarter but code doesn't have that bottleneck 00:22:13.080 |
because you can continually scale um based on automatic signals of correctness 00:22:19.080 |
and that's really the third key is automatic verification allows you to scale 00:22:23.080 |
so for your own code bases and your own process putting in the ci systems putting in the test coverage 00:22:28.080 |
putting in the harnesses that allow that automatic verification is going to future proof your code as rl and as ai gets better 00:22:35.080 |
and we see many of our users of devin they first take their code base with devin and go fix all the test coverage issues 00:22:42.080 |
and now that they have full test coverage it's even faster to use devin to ship new more pull requests 00:22:48.080 |
the last big point here is i just showed you an example on cuda kernels 00:22:54.080 |
but to me the the more interesting deeper implication of this research is every code base is in some sense a narrow domain 00:23:00.080 |
there are specific things to your code that don't exist in anyone else's code 00:23:04.080 |
and that's more and more true the larger your code base is so you can imagine a future where high compute rl and per code base customization leads to significantly outperforming agents on each individual domain 00:23:16.080 |
the equivalent of hiring a software engineer and giving them millions of years of experience working specifically in your environment 00:23:24.080 |
so this is some of the research work we've been doing at cognition that powers devin under the hood 00:23:28.080 |
if you'd like to play around and try this yourself you can use you can go to devin.ai and sign up for an account 00:23:34.080 |
connect it with your existing code give it a task and go from ticket to pr thank you so much for having me