back to index

Multi-Agent Frontiers: Making Devin


Whisper Transcript | Transcript Only Page

00:00:00.000 | Langchain team thank you so much for having me really excited to share a
00:00:11.820 | little bit more about how we made Devon so my name is Russell Kaplan I'm
00:00:15.840 | president at cognition we're the company behind Devon and as a quick show of
00:00:20.040 | hands how many of you have heard of Devon before all right almost everyone
00:00:24.720 | so Devon is an AI software engineer but we are really focused specifically on
00:00:31.760 | working within existing code bases there's lots of amazing AI tools out
00:00:36.420 | there for coding and what we found is that as the code bases get larger the
00:00:40.480 | problem gets harder and most of our customers and users around the world are
00:00:44.460 | teams they're teams of engineers or companies full of engineers they're
00:00:47.960 | trying to ship real-world products and so today I want to talk a little bit more
00:00:53.360 | about what Devon is but more importantly how we built it and I'm gonna share
00:00:56.900 | some new sort of technical information we're releasing on exactly how this works
00:01:00.380 | under the hood that I'm really excited to to present to you all first what are we
00:01:05.600 | seeing in AI for coding obviously this is a really fast-moving field and in many
00:01:12.020 | ways software engineering is one of the first large-scale successful applications
00:01:16.040 | of generative AI it started in many ways with co-pilots real-time text completion
00:01:22.580 | inside your editor that makes you as an engineer go a bit faster and now we also
00:01:29.000 | have AI IDEs again a development environment for you as an individual
00:01:33.620 | engineer to get even more leverage and for sometimes delegating entire tasks or
00:01:38.240 | snippets and really coding in flow with AI assistance we see Devon as part of a
00:01:44.660 | third wave of AI developer tools which is on the fully autonomous agent end of the
00:01:50.440 | spectrum more AI teammate than AI copilot companies around the world are using Devon like another
00:01:58.660 | member of their engineering team going directly from ticket to pull request
00:02:02.600 | collaborating with Devon in slack or Jira or linear and we see the large majority of
00:02:09.200 | Devon sessions and Devon PRs are started from within these other tools the same way you might interact with another human engineer
00:02:16.860 | architecturally this is really different from something that runs locally on your computer Devon is a cloud AI agent and what we've seen is that it's very complimentary to these local AI development tools
00:02:30.860 | when you are coding yourself and you want to stay in flow and get that speed up you use a local AI development tool where people use Devon is when they're really ready to delegate the task entirely and this is a very different set of
00:02:45.860 | technical trade-off you get large-scale parallelism asynchronousness and the ability to completely delegate individual units of work
00:02:54.080 | in the team setting this also means that Devon's run remotely not locally and so they share the same environment across different runs so you can try many different things in parallel combine them together
00:03:06.080 | and the teams of engineers who use Devon will break up large-scale engineering outcomes into small individual tasks delegated to a fleet of Devons and then coalesce together inside the code base
00:03:18.080 | and the main thing our users look for is for that code from Devon to get merged as pull requests
00:03:24.080 | it also changes the learning model for Devon
00:03:28.080 | in the cloud AI agent setting Devon is not just for you it's for your team and for your organization
00:03:34.080 | and so as Devon learns from your interactions those learnings are not kept only with you
00:03:40.080 | instead they're incorporated as part of your team and as part of your organization
00:03:45.080 | and this reliance on organizational knowledge is something that we've seen is really important for working with existing large-scale code bases
00:03:53.080 | because working with large code bases is really hard and so I'm going to go into some more detail on exactly how we do this under the hood with Devon
00:04:02.080 | part one is all about context if you want to build an AI software engineer you need to understand existing code
00:04:12.080 | you don't want your AI code contributions to be using a new framework adding new dependencies or being done in isolation of what you already have
00:04:21.080 | and code base understanding is pretty hard LLMs are amazing at so many things but they have limited context windows
00:04:30.080 | and even if a code base fits inside the context window the effective context window is often a lot lower than the advertised context window
00:04:36.080 | we have a series of internal benchmarks that measures effective reasoning capacity across a context and we find very consistently that the advertised context window is much higher than the effective reasoning context window
00:04:48.080 | Large code bases also have complex dependencies they can span multiple services multiple repositories and it can be intertwined in very complicated ways even for human engineers
00:05:00.080 | there's huge variations in code quality there might be some parts of the code base you want Devon to emulate and some parts you really want Devon to stay away from when it's learning how to be a productive member of your team
00:05:11.080 | same thing is true for documentation the code might have comments might have comments might have documentations that's outright incorrect or misleading
00:05:20.080 | all of these things are part of the technical challenges we work on to make Devon work in real world code bases
00:05:26.080 | the last critical piece of real world code bases is that the larger the code base the more custom it tends to be
00:05:32.080 | teams and companies build their own proprietary frameworks they have their own specific jargon there's context in the code that's not inside the code itself
00:05:40.080 | but the organizational workflow around the code and so these are the research questions that we set out to solve to make Devon actually useful in the real world
00:05:47.080 | and the first thing I'm going to go into more detail on is something we actually recently released free and publicly for all open source repositories
00:05:58.080 | it's called deep wiki deep wiki is a real-time continually updated index of your code base published as an interactive wiki almost like a real-time confluence page with documentation diagrams
00:06:14.080 | and the ability to ask questions about your code we had this originally as an internal data structure for Devon
00:06:20.080 | it wasn't a it wasn't a product it was just a tool that Devon could use to get high-level context about the code
00:06:26.080 | and what we realized is that human engineers wanted to see this information too and so we decided to release it as a standalone product and service
00:06:33.080 | so you can take any github URL today and just change the github to deep wiki.com and for any open source repo you'll get a full interactive wiki
00:06:43.080 | this also works on your private repos when they're integrated with Devon
00:06:47.080 | and so I looked at the Langchain repo and we have a full updated up-to-date documentation page for Langchain
00:06:55.080 | that has not only the pros of how it's organized the key concepts in Langchain's code base but also architectural diagrams and data flows
00:07:03.080 | and we've gotten a lot of feedback from the community that these diagrams are in some cases or in many cases actually better than the diagrams of the official documentation of very popular open source projects
00:07:15.080 | right whether it's folks who are on the typescript steering committee the vllm maintainers or others we're getting lots of amazing feedback on how great deep wiki is and we've had you know thousands of code bases start linking to deep wiki as part of their official documentation
00:07:29.080 | so definitely check this out if you're working on open source code yourself how does this work under the hood we just said that lms are really bad at raising about large code bases
00:07:43.080 | so let's give you the high-level algorithm of what we're doing under the hood to generate these wikis
00:07:47.080 | step one it's actually not about the code it's about the concepts what are the key principles inside this code base that are going to form our table of contents for how we lay out the macro context of this code base
00:08:01.080 | and what we found is that the in many cases those concepts um you don't just get them from the source code itself there is extremely rich information in the metadata around the source code for example was that source code added in as part of a pull request
00:08:17.080 | well which member of the team added that pull request what else have they contributed to was there discussion in that pull request about the code um what were are there comments is there documentation the git commit history all of this metadata
00:08:29.080 | data is a really useful source for building these high context wikis once you have those concepts then you can connect them to the code so what are the connections between the various code files and the proprietary or specific concepts inside this code base
00:08:48.080 | and after you have that you need to connect the code itself there's different sections of the code base uh you know some files that are more related less related
00:08:57.080 | there's uh call you know sort of call traces and flows and there's a specific way that these different components of the code base connect to each other
00:09:06.080 | you can look at things like the symbol the symbol graph you can look at the call graph uh and you can look at how um how these files tend to be used together
00:09:14.080 | and once you have those code to code connections then you can actually generate a wiki and for each concept what we do is we use an agent to go research that concept in the context of the specific code base
00:09:25.080 | uh we generate a wiki page about it and then we also provide those intermediate artifacts as context uh and as tools
00:09:32.080 | and when you put this all together uh you get very rich representations of code uh and we use graphs as a critical part of those representations uh and so this is a graph of uh the lang chain code base where you can see uh at a high level that different files are more or less related to each other uh with a lot of a lot of logic in the core and then maybe outskirts that are more related to test harnesses documentation
00:09:53.080 | specific integrations with third parties and so on and these data structures power a lot of how devon actually works inside large you know million multi-million line of code code bases
00:10:07.080 | so we've got our wiki but we also need to be able to search the code and this is another feature that's now mainline in devon but started as an internal tool for devon the ai software engineer
00:10:18.080 | that's the trend we're seeing is to make devon a great software engineer you need to build tools that are so useful that human engineers want to use them too
00:10:25.080 | and so we have devon search which is essentially deep research on your proprietary code base again whether it's open source or internal
00:10:35.080 | you can ask questions about the code devon will scan through that code try to understand what's going on
00:10:40.080 | using both the micro code the individual files but also the macro context it has from this wiki data structure
00:10:47.080 | uh and it will find this information for example i asked how do i enforce structured output in length chain
00:10:52.080 | and devon went and found the right section of the documentation from length chain as well as the actual implementation code for what to do
00:11:01.080 | devon search gives devon context it's an essential part under the hood of how devon the autonomous ai agent can actually make useful changes inside larger team wide code bases
00:11:11.080 | once you get a query you need to do pre-processing and of course rag is a component of that but we end up doing a lot more under the hood than just rag
00:11:20.080 | including junk removal some more advanced filtering of less relevant information and re-ranking multi-hop search to end up with this set of context that we think is very very relevant for this query
00:11:32.080 | um and that context again includes both source files but also wiki pages you need the micro and the macro context to provide really useful really useful recommendations
00:11:42.080 | um and from that we can get a grounded answer people don't want hallucinations in their wikis and they don't want hallucinations in their search
00:11:49.080 | so the grounding is essential for this to actually be useful
00:11:54.080 | the second part of how we optimize and customize to existing code bases is a bit more research oriented uh and i'm excited to share a little bit more of some of the post training and rl
00:12:04.080 | that we do under the hood to make devon work well inside specific narrow domains
00:12:10.080 | we recently released a new model an open source free model called kevin uh colonel devon kevin 32b
00:12:18.080 | kevin is uh uh outperforms many state-of-the-art foundation models on the narrow domain of writing cuda kernels
00:12:26.080 | raise your hand if you've ever heard of a cuda kernel
00:12:30.080 | all right we have an audience that's very familiar with the under fittings of ml for those who haven't heard of cuda kernels
00:12:36.080 | um this is the source code that you use to write gpu optimized uh gpu optimized implementations for nvidia gpus
00:12:44.080 | and so under the hood when you're using pytorch or tensorflow um those high-level graph operations are being executed under the hood by cuda kernels
00:12:52.080 | and the the domain of writing cuda kernels is extremely rich because this is a very low-level programming language
00:12:59.080 | relative to what many of us operate in more typically day-to-day say python and uh cuda kernels were released
00:13:06.080 | as a uh uh kernel bench was released as a benchmark by anne simon azalea uh to estimate models capabilities
00:13:15.080 | of generating these very niche very specific cuda kernels at high performance and high reliability
00:13:21.080 | and this work from cognition was done by carlo pietro ben supervised by silas uh these were our research
00:13:28.080 | uh these were our research interns uh who who did who got really really exciting results um from a from a single project
00:13:35.080 | so let's talk about what this work does more specifically um the goal is to take high-level
00:13:42.080 | machine learning code say a few different calls to pytorch and rewrite it as a highly optimized performant
00:13:49.080 | correct cuda kernel this is a very detailed problem domain that many low-level machine learning researchers
00:13:56.080 | spend you know their entire careers optimizing uh the design space is quite large for how to write optimal cuda kernels
00:14:03.080 | and it's quite challenging um what we see in practice in the ml community is that a lot of progress in machine learning
00:14:09.080 | in machine learning is really driven by performance on the hardware and so even if your algorithm or your new paper
00:14:16.080 | is big o optimal like a linear attention mechanism um under the hood if your implementation is not efficient
00:14:22.080 | cache friendly uh performant on actual gpu hardware it tends to not be that useful so this is a really active
00:14:28.080 | research domain for ml researchers and we want uh kevin to be good at writing these optimized cuda kernels
00:14:36.080 | so how does this work uh the first step is to define your reward function and one of the great things about software
00:14:43.080 | and in particular writing cuda kernels is that it's often easy to get automatically verifiable reward
00:14:50.080 | can you verify the correctness of your code automatically well in this case we have a less performant reference
00:14:55.080 | implementation that we can use to check correctness and so whenever kevin uh which is our post train uh sort of
00:15:03.080 | llm for this project whenever kevin writes a kernel we run it through a series of checks right first of all
00:15:09.080 | does that code parse um is it actually valid cuda does it compile does it run uh and then after all that
00:15:15.080 | is it correct and only if it's correct do we then grade it for performance how much faster or slower is it
00:15:21.080 | than the reference implementation so with this reward function notice we don't need a machine learning model here
00:15:26.080 | this is purely a set of um automatically verifiable steps um which makes which makes this very very
00:15:33.080 | friendly for high compute rl once you have the reward function you can use it for multi-turn training
00:15:39.080 | and so we use multi-turn grpo uh and for those who aren't familiar what's going on here is we're taking
00:15:45.080 | we're taking multiple different trajectories in sequence for this model to get better at writing
00:15:51.080 | cuda code so on the left here we have an initial prompt which results in a chain of thought from the
00:15:57.080 | model and an output and that output may or may not be correct uh when we move to the second the middle of
00:16:02.080 | this diagram we provide eval info back to the model and this eval info is the results from trying to run that
00:16:09.080 | kernel in a real world gpu environment um there's a lot of work you have to do in terms of sandboxing
00:16:14.080 | and isolation to make sure these incorrect cuda kernels don't mess up your training process or crash your
00:16:19.080 | gpus and then you're getting accurate performance benchmarks but we package all that up into almost
00:16:23.080 | like a struct of eval information that that model can then see as it tries again and it tries again with
00:16:30.080 | a second chain of thought a second kernel that gets passed to another step and this process repeats over
00:16:35.080 | several steps and the result is hopefully a correct kernel then you have to distribute your rewards to
00:16:43.080 | train this information and what we found is that you don't want to just reward based on the final output
00:16:49.080 | uh and its correctness or not and its performance or lack of performance actually the the path to get
00:16:55.080 | there is also valuable so you'll notice in red at the bottom here we have this uh this sum of different
00:17:02.080 | rewards discounted by gamma over time and what that's showing is the very first trajectory the
00:17:08.080 | very first step of that trajectory gets a reward even if it wasn't correct itself if it led to a
00:17:15.080 | correct solution and a performance solution downstream are you barking up the right tree is basically the
00:17:20.080 | reward we want to give the model and what we found in this project is that being able to do this over
00:17:24.080 | multiple iterations with these discounted rewards was really important for this to work because writing
00:17:29.080 | cooter kernels is hard and so the reward signal is going to be sparse if you only get one shot
00:17:35.080 | and once you do this uh you can you can find that it's not impossible to very deeply optimize for these
00:17:43.080 | narrow problem domains so in this graph we have the um correctness on the left how many of the kernels were written
00:17:52.080 | correctly by this model and kevin 32b is getting 91 correct on this section of the kernel bench benchmark that we focused on
00:18:00.080 | and you can see that compared to even 04 mini or 03 this is a significant improvement um this is a narrow
00:18:06.080 | domain where high compute rl lets you outperform existing models on the right you see performance so we rewarded devon proportional to how much speedup uh you know kevin got in this project
00:18:20.080 | and so as the kernels got faster and faster it got more and more reward and what we found is that even from a performance standpoint
00:18:28.080 | kevin 32b is able to outperform these larger scale foundation models
00:18:34.080 | and this was really interesting result to us because um it kind of flies in the face of many many sort of broad discussions of oh these foundation models are going to be the best at everything and you should use them exclusively for everything
00:18:48.080 | but what we see internally all the time is that for any given narrow domain if you can set up your environment to do high compute rl in that domain
00:18:56.080 | it's very feasible to outperform an out-of-the-box foundation model especially as the open source based models that you start with have improved
00:19:06.080 | to actually make this work in practice it's important that you keep your model doing what you actually want it to do and not cheating along the way
00:19:14.080 | and this is called reward hacking in rl and it's uh many cases actually challenging to prevent so i want to show you a few ways that kevin misbehaved
00:19:22.080 | uh that we had to sort of steer back so one is kevin realized that it could write the cuda and then wrap the whole thing in a try except block
00:19:30.080 | and just fall back to the existing pi torch implementation and you know it would always score 100
00:19:36.080 | percent correct in that case and it had some chance of being faster than average but if it wasn't it it sort of defaulted to the one x
00:19:42.080 | so that was a very unproductive direction for for kevin to go down during the rl process and uh we had to make sure
00:19:48.080 | that we update the reward function to recognize this type of reward hacking the second is even a bit more subtle uh and so
00:19:56.080 | the test harness to make sure that kevin's code was correct uh had a class you know in this case called model new uh that inherited
00:20:04.080 | from model and you can see here what kevin realized is that it could implement the model as a uh you know as a subclass of
00:20:13.080 | nn.module with its attempt at optimize code code and then it could just overwrite that uh class name in the namespace
00:20:20.080 | and so you can see it defined a second model new that this in this case just inherits directly from the correct
00:20:25.080 | model implementation so these models got very creative at all uh how to sort of get around your intentions and
00:20:31.080 | this is a challenge in rl and so making sure you correctly define your environment is is really critical to success
00:20:37.080 | and for those of you who have used maybe really popular commercial models like some of the most popular models for coding
00:20:42.080 | you might have seen that as the models get better sometimes they're more aggressive at doing things like commenting out your test cases
00:20:49.080 | to make sure the works will pass that's what's going on under the hood is this is uh this is a smell of reward hacking
00:20:57.080 | and so it's a constant sort of cat and mouse game between the researchers who are trying to steer these models to do what we actually want
00:21:03.080 | and the models they're trying to exploit every possible way to get this high quality reward
00:21:10.080 | so what do we learn from this um custom post training can and does outperform frontier models on specific
00:21:17.080 | narrow domains for reinforcement learning specifically um especially in code it's more compute bound than data bound
00:21:24.080 | you know kernel bench the subset of kernel bench that we trained on only had 180 tasks which is really not that many when you think about it
00:21:31.080 | but by applying high compute rl rolling out these trajectories again and again there is very very rich reward signal to learn from
00:21:38.080 | and that's because in software we have an oracle that can help with these rewards we actually have the environment
00:21:43.080 | we can run the code we can see if it compiles we can see how fast it is
00:21:46.080 | and this in my opinion is one of the reasons that um software and coding specifically has accelerated particularly fast
00:21:53.080 | as an ai capability is that code is one of the few domains where this property holds
00:21:58.080 | i used to lead machine learning at scale ai which provides post training human data for many of the large scale foundation model labs
00:22:05.080 | and it gets really hard to label by hand high quality high accuracy data as the models get smarter but code doesn't have that bottleneck
00:22:13.080 | because you can continually scale um based on automatic signals of correctness
00:22:19.080 | and that's really the third key is automatic verification allows you to scale
00:22:23.080 | so for your own code bases and your own process putting in the ci systems putting in the test coverage
00:22:28.080 | putting in the harnesses that allow that automatic verification is going to future proof your code as rl and as ai gets better
00:22:35.080 | and we see many of our users of devin they first take their code base with devin and go fix all the test coverage issues
00:22:42.080 | and now that they have full test coverage it's even faster to use devin to ship new more pull requests
00:22:48.080 | the last big point here is i just showed you an example on cuda kernels
00:22:54.080 | but to me the the more interesting deeper implication of this research is every code base is in some sense a narrow domain
00:23:00.080 | there are specific things to your code that don't exist in anyone else's code
00:23:04.080 | and that's more and more true the larger your code base is so you can imagine a future where high compute rl and per code base customization leads to significantly outperforming agents on each individual domain
00:23:16.080 | the equivalent of hiring a software engineer and giving them millions of years of experience working specifically in your environment
00:23:24.080 | so this is some of the research work we've been doing at cognition that powers devin under the hood
00:23:28.080 | if you'd like to play around and try this yourself you can use you can go to devin.ai and sign up for an account
00:23:34.080 | connect it with your existing code give it a task and go from ticket to pr thank you so much for having me
00:23:40.080 | thank you
00:23:42.080 | thank you
00:23:44.080 | Thank you.
00:23:45.080 | Thank you.