Thank you all so much for having me; it's exciting to be back. I was last here at AI Engineer one year ago, which is kind of crazy. I've been telling Swyx that we need to have these conferences way more often: if it's going to be about AI software engineering, it should probably be every two months or so, given the pace of everything. But it's going to be fun to talk a little bit about what we've seen in the space and what we've learned over the last 12 to 18 months of building Devin.

I want to start with Moore's law for AI agents. You can think of the capability of an AI in terms of how much work it can do uninterrupted before you have to step in and intervene or steer it. With GPT-3, for example, if you asked it to do something, you could probably get through a few words before it said something that was clearly not the right thing to say. GPT-3.5 was better, and GPT-4 was better still. When people measure these task lengths, what you see in general is that the doubling time is about every seven months, which is already pretty crazy. But in code it's even faster: every 70 days, roughly every two to three months. If you look at software engineering tasks, starting from the simplest single lines and single functions, we're now doing tasks that take hours of human time, and an AI agent is able to do all of that. Doubling every 70 days means four to six doublings every year, which means the amount of work an AI agent can do in code grows somewhere between 16x and 64x per year, at least for the last couple of years that we've seen.
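As a sanity check on that arithmetic, here is the compounding worked out explicitly. The 70-day doubling figure is from the talk; everything else is just exponent math, nothing Devin-specific:

```python
# Compounding check on the talk's numbers: if the length of coding task
# an agent can do doubles every 70 days, how much growth is that per year?

doublings_per_year = 365 / 70        # about 5.2 doublings
growth = 2 ** doublings_per_year     # about 37x in a single year

# The 16x-64x range comes from rounding "70 days" to two-to-three months:
low = 2 ** (365 / 91)                # ~4 doublings/year -> ~16x
high = 2 ** (365 / 61)               # ~6 doublings/year -> ~64x

print(f"{doublings_per_year:.1f} doublings/year, ~{growth:.0f}x growth")
print(f"rounded range: ~{low:.0f}x to ~{high:.0f}x")
```

So the exact 70-day figure implies roughly 37x per year; the 16x-64x spread is just what you get from rounding the doubling period to "two to three months."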
It's kind of crazy to think about, but that sounds about right for what we've seen. Eighteen months ago, really the only product experience that had PMF in code was tab completion: here's what I have so far, predict the next line for me. That was all you could really do in a way that actually worked. And we've gone from that to a full AI engineer that goes and does these tasks for you end to end.

People ask all the time: what is the future interface, what is the right way to do this, what are the most important capabilities to solve for? Funnily enough, the answer to all of these questions is that it changes every two or three months. Every time you get to the next tier, the bottleneck you're running into, the most important capability, and the right way to interface with the agent all change. So I wanted to talk about some of those tiers for us over the last year. When we got started at the end of 2023, agents were not even a concept; now everyone is talking about coding agents, people are doing more and more, and it's very cool to see. Each of these stages has been almost a discrete tier for us.

Right around a year ago, when we were giving the last AI Engineer talk, the biggest use case that was getting broad adoption was what I'll call repetitive migrations: JavaScript to TypeScript, upgrading your Angular version from this one to that one, going from this Java version to that Java version, and so on. With those tasks, you typically have some massive codebase that you want to apply the migration to, you have to go file by file and do every single one, and the set of steps is usually pretty clear. If you go to the Angular website, it will tell you: here's what you have to do, step by step. It's not so routine that a classical deterministic program solves it, but there is a clear set of steps, and if you can follow those steps very well, you can do the task.

This mattered for us because that was all you could really trust agents to do at the time. You could do harder things once in a while, and some really cool stuff occasionally, but as far as something consistent enough to do over and over, these repetitive migrations across, say, 10,000 files were in many ways the easiest thing. Which was convenient, because they were also kind of the most annoying thing for humans to do. That's generally been the trend: AI takes on the boilerplate, tedious, repetitive tasks, and we get to do the more fun, creative stuff. As time has gone on, it has taken on more and more of that boilerplate.

For a problem like this, what you need is for Devin to execute a set of steps super reliably, so the big capability problem to solve was mostly instruction following. We built a system called Playbooks, where you outline a very clear set of steps, have Devin follow them one by one, and do exactly what's said.
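Cognition hasn't published the internal playbook format, so purely as an illustration, here is a minimal sketch of the idea: an ordered list of steps applied verbatim to every file in a migration. The step text and the runner below are hypothetical, not Devin's actual implementation:

```python
# Hypothetical sketch of a migration "playbook": an ordered list of
# instructions the agent follows, in order, for every file it touches.

PLAYBOOK = [
    "Rename the file from .js to .ts",
    "Add explicit types to all function signatures",
    "Replace require() calls with import statements",
    "Run the TypeScript compiler and fix any new errors",
    "Run the existing test suite and confirm it still passes",
]

def run_playbook(files, execute_step):
    """Apply every playbook step, in order, to every file; collect a report."""
    report = {}
    for path in files:
        report[path] = [execute_step(path, step) for step in PLAYBOOK]
    return report

# Toy "agent" that just records what it was asked to do at each step:
log = run_playbook(["src/app.js", "src/util.js"],
                   lambda path, step: f"{path}: {step}")
print(len(log), "files,", len(PLAYBOOK), "steps each")
```

The point of the structure is exactly what the talk describes: the agent's job is narrowed to reliable instruction following, one clearly specified step at a time, rather than open-ended judgment.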
Now, obviously a lot of software engineering does not fall under the category of "literally just follow ten steps and do exactly what's said," but migrations do, and that allowed us to actually go and do them. This was, I'd say, the first big use case of Devin that really took off.

One of the other big systems that got built around that time, which we've since rebuilt many times, is knowledge, or memory. If you're doing the same task over and over, the human will often have feedback: hey, by the way, remember to do X, or you need to do Y every time you see this. So we built the ability to maintain and understand those learnings and use them to improve the agent on every future run. Those were the big problems of the time, and that was the summer of last year.

Around the end of summer and into fall, as these systems got more capable, the big thing that started coming up was that instead of just the most routine migrations, you could do still fairly isolated but somewhat broader general bugs and features, where you can just tell it what you want and have it do it. For example: "Hey Devin, in this repo's select dropdown, can you please list the currently selected items at the top? Having the checkboxes scattered throughout just doesn't really work." And Devin will just go and do that. It's something like the level of task you would give an intern.

There are a few particular things you have to solve for here. Usually these changes are pretty isolated and contained, so it's one or maybe two files that you really have to look at and change. But you do still need to be able to set up the repo and work with it, and so you
want to be able to run lint, run CI, and all these other things, so you at least have basic checks of whether things work. One of the big things we built around then was the ability to set up your repository ahead of time and build a snapshot that you could start from, reload, and roll back, along with all of those kinds of primitives: a clean remote VM that could run your CI, your linter, and so on.

That's when we started to see broader value. Migrations are one particular thing, and for that particular thing we were showing a ton of value; now, with these bug fixes and the like, you could generally get value from Devin as almost a junior buddy of yours.

Then in the fall, things really moved toward much broader bugs and requests. Here you're jumping another order of magnitude: most changes don't contain themselves to one file. Often you have to go look at what's going on, diagnose things, figure out what's happening, work across files, and make the right changes. These changes are often hundreds of lines if the task is "hey, I've got this bug, let's figure out what's going on and solve it." A lot of things started to become important here, but one in particular I'll point out: there's a lot you can do by not just looking at the code as text but treating it as a whole hierarchy. Understanding call hierarchies and running a language server is a big deal. You have git commit history, which informs how different files relate to one another. You have your linter and things like that. You're able to reference things across files. One of the big problems here was working with that context and getting to the point where Devin could make changes across several files, stay consistent across those changes, and understand the codebase as a whole.

This was really the point where you could just tag Devin on an issue and have it build the thing for you. Slack was a huge part of the workflow then, and it made sense, because Slack is where you discuss your issues and set these things up. You would tag Devin in Slack and say, "hey, by the way, we've got this bug, please take a look," or "could you please go build this thing?" This was an especially fun period for us, because it's right around when we went GA, and a lot of that was because you truly could just get set up with Devin, ask it a lot of these broad tasks, and have it do them.

A lot of the work we did was around giving Devin a better and better understanding of the codebase. From the human lens it's the same way: on your first day on the job, super fresh in the codebase, it's tough to know exactly what you're supposed to do. A lot of those details are things you come to understand over time, a representation of the codebase that you build up. Devin had to do the same thing: how do I plan this task out before I solve it, how do I identify all the files that need to change, and how do I go from there and make that diff?
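To make the "code as a hierarchy, not just text" point concrete, here is a tiny, self-contained sketch (not anything from Devin's internals) of extracting a call graph with Python's ast module. A real language server also resolves references across files, types, and imports, but the structural idea is the same:

```python
# Treating code as structure rather than text: extract which functions
# call which, so a change to one function's signature can be traced to
# its callers instead of grepping for strings.
import ast

SOURCE = """
def load(path):
    return open(path).read()

def process(path):
    data = load(path)
    return data.upper()
"""

def call_graph(source):
    """Map each function definition to the simple names it calls."""
    tree = ast.parse(source)
    graph = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            graph[node.name] = [
                n.func.id
                for n in ast.walk(node)
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
            ]
    return graph

graph = call_graph(SOURCE)
print(graph)  # -> {'load': ['open'], 'process': ['load']}
```

Even this toy version shows why the graph matters: it tells an agent that changing load's signature means process has to change too, which is exactly the cross-file consistency problem described above.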
Around the spring of this year (again, every gap is about two or three months) we got to an interesting point: once you start getting to harder and harder tasks, you as the human don't necessarily know everything you want done at the time you're giving the task. If you're saying, "I'd like to improve the architecture of this," or "this function is slow, let's profile it and see what needs to be done," or "we really should handle this error case better, but let's look at all the possibilities and figure out what the right logic should be in each," then the whole idea of a two- or three-line prompt turning directly into a Devin task is no longer sufficient. You want to be able to really work with Devin and specify a lot more.

Around this time, along with this better codebase intelligence, a few different things came up. We released DeepWiki, for example. Funnily enough, the idea behind DeepWiki was that Devin had its own internal representation of the codebase, and it turned out to be great for humans to look at too, to understand what was going on and to ask questions quickly about the codebase. Closely related to that was Search, the ability to just ask questions about a codebase and understand some piece of it. The workflow that really started to come up was more iterative: first you ask a few questions, you get a more interactive experience where you explore the codebase with your agent, you figure out what has to be done in the task, and then you set your agent off to do it, because these more complex tasks kind of need that. That was a big paradigm shift for us. This is also what came along with Devin 2.0 and the in-IDE experience: often you want points where you closely monitor Devin for 10 or 20 percent of the task, then have it work on its own for the other 80 or 90 percent.

Lastly, most recently in June, which is to say now: the ability to truly just kill your backlog, hand Devin a ton of tasks, and have it do all of them at once. In many ways this is a culmination of many of the things that had to be done before. You have to work with and integrate into all these systems; certainly you want to work with Linear or Jira. You have to scope out a task and understand what's actually meant. You have to decide when to go to the human for approval or with questions. You have to work across several files, and often you even have to figure out which repo is the right one to make the change in, if your org has multiple, or which part of the codebase needs to change.

To get to the point where you can do this autonomously, first of all, you have to have a really great sense of confidence. Rather than just going off and doing things immediately, the agent has to be able to say, "okay, I'm quite sure this is the task, and I'm going to go execute it now," versus "I don't understand what's going on, human, please help me." The other piece is that this is the era where testing, and asynchronous testing in particular, gets really, really important. If you want something to deliver entire PRs for you, especially for these larger tasks, you want to know that it can test its own work. The agent often needs an iterative loop to do that: it needs to run the code locally, know what to test, and know what to look for. In many ways, testing itself is a much higher-context problem to solve.
And that brings us to now, which is a pretty fun time, because what we're thinking about is: instead of doing just one task, how do we tackle an entire project? And after a project, what comes after that?

One point I'll make here: we talk about all these 2x's that happen every couple of months, and from a kind of cosmic perspective all the 2x's look the same, but in practice every 2x is a different one. When we were just doing single-line tab completion, it really was just a text problem: take in the single file so far and predict the next line. Over the last year or year and a half, we've had to think about so much more. How do you work with the human in Linear or Slack? How do you take in feedback and steering? How do you help the human plan these things out? And on top of that there's a ton of tooling and capabilities work: how does Devin test on its own, how does it make longer-term decisions on its own, how does it debug its own outputs or run the right shell commands to figure out what the feedback is and go from there?

It's super exciting that there are a lot more coding agents in the space now; it's very fun to see. And I think we're going to see another 16x to 64x over the next 12 months as well. So yeah, super excited. Awesome, well, that's all. Thank you guys so much for having me.