Devin 2.0 and the Future of SWE - Scott Wu, Cognition

00:00:00.000 | yeah well thank you guys so much for having me it's exciting to be back I was last here

00:00:18.780 | at AI Engineer one year ago it's kind of crazy I've been telling swyx that we

00:00:24.720 | need to have these conferences way more often if it's going to be about AI software engineering

00:00:28.020 | probably should be like every two months or something like that with the pace of everything's

00:00:31.380 | done but it's gonna be fun to talk a little bit about you know what we've seen in the

00:00:36.300 | space and what we've learned over the last 12 or 18 months building Devin over this time and I

00:00:44.220 | want to start this off with Moore's law for AI agents and so you can kind of think of the

00:00:50.520 | capability or the capacity of an AI by how much work it can do uninterrupted until you have

00:00:58.380 | to come in and step in and intervene or steer it or whatever it is right and you know in GPT-3 for

00:01:04.380 | example if you were to go and ask GPT-3 to do something you know you could probably get through a

00:01:08.940 | few words or so and then it'll say something where it's like okay you know this is probably not the

00:01:13.140 | right thing to say and GPT-3.5 was better and GPT-4 was better right and so people talk

00:01:19.080 | about these lengths of tasks and what you see in general is that that doubling time is about every

00:01:23.700 | seven months which already is pretty crazy actually but in code it's actually even faster it's every 70

00:01:30.660 | days which is two or three months and so you know if you look at various software engineering tasks that

00:01:36.300 | start from the simplest single functions or single lines and you go all the way to you know we're doing

00:01:42.540 | tasks now that take hours of humans time and and an AI agent is able to just do all of that right

00:01:48.300 | and if you think about doubling every 70 days I mean basically you know every two to three months means

00:01:54.380 | you get four to six doublings every year which means that the amount of work that an AI agent can do

00:02:00.460 | in code grows by something between 16 and 64x every year at least for the last couple years that

00:02:07.660 | we've seen and it's kind of crazy to think about but that sounds about right actually for what

00:02:13.100 | we've seen you know 18 months ago I would say really the only product experience that had PMF

00:02:18.780 | in code was just tab completion right it was just like here's what I have so far predict the next line

00:02:24.940 | for me that was kind of all you really could do in a way that really worked and we've gone from

00:02:30.380 | that obviously to a full AI engineer that goes and just does all these tasks for you right

00:02:35.580 | and implements a ton of these things and people ask all the time you know

00:02:42.380 | what is the future interface or what is the right way to do this or what are the most important

00:02:46.940 | capabilities to solve for and I think funnily enough the answer to all these questions actually is

00:02:51.100 | it changes every two or three months like every time you get to the next tier the bottleneck

00:02:57.660 | that you're running into or the most important capability or the right way you should be

00:03:01.020 | interfacing with it like all these actually change at each point and so I wanted to talk a bit about

00:03:07.660 | some of those tiers for us over the last year or so and you know over the course of that

00:03:15.020 | time you know when we got started at the end of 2023 agents were not even a

00:03:19.820 | concept and now everyone's talking about coding agents people are doing more and

00:03:23.980 | more and it's very cool to see and each of these has kind of been almost a

00:03:28.940 | discrete tier for us and so right around a year ago when we were doing the last AI Engineer

00:03:35.420 | talk actually the biggest use case that we really saw that was getting broad adoption was

00:03:40.780 | what i'll kind of call these repetitive migrations and so i'm talking like JavaScript to TypeScript

00:03:46.620 | or like upgrading your Angular version from this one to that one or going from this Java version to that Java

00:03:52.220 | version or something like that and with those kinds of tasks in particular what you typically

00:03:58.220 | see is you have some massive code base that you want to apply this whole migration to

00:04:05.580 | you have to go file by file and do every single one and usually the set of steps is pretty clear

00:04:10.300 | right if you go to the Angular website or something like that it'll tell you all right here's what you

00:04:14.060 | have to do this this this this this and you want to go and execute each of these steps it's not so

00:04:20.220 | routine that a classical deterministic program solves it but there's

00:04:25.180 | kind of a clear set of steps and if you can follow those steps very well then you can do the task and

00:04:29.820 | you know this was the thing for us because that was all you could really trust agents to do at the time

00:04:35.020 | you know you could do harder things once in a while and you could do some really cool stuff occasionally

00:04:39.020 | but as far as something that was consistent enough that you could do it over and over and over

00:04:44.300 | these kinds of repetitive migrations that you would be doing for you know 10,000 files

00:04:48.540 | were you know in many ways the easiest thing which was cool actually because

00:04:53.340 | it was also kind of the most annoying thing for humans to do and i think that's generally been

00:04:59.100 | the trend where AI has always done these more boilerplate tasks and the more tedious stuff more

00:05:04.780 | repetitive stuff and we get to do the more fun creative stuff and obviously as time has gone on

00:05:10.220 | it's taken on more and more of that boilerplate but for a problem like this one a lot of what you

00:05:15.820 | need to do is you need Devin to be able to go and execute a set of steps super reliably and so

00:05:22.780 | you know i would say the big capability problem to solve was mostly instruction following and so

00:05:28.220 | we built this system called playbooks where basically you could just outline a very clear set of steps

00:05:33.340 | have it follow each of those step by step and then do exactly what's said now if you think about it

00:05:37.660 | obviously a lot of software engineering does not fall under the category of literally just follow

00:05:42.060 | 10 steps step by step and do exactly what it says but migrations do and it allowed us to go and actually

00:05:48.460 | do these and this was i would say the first big use case of Devin that really came

00:05:53.100 | up i think one of the other big systems that got built around that time which we've since rebuilt many

00:05:57.900 | times is knowledge or memory right which is you know if you're doing the same task over and over and

00:06:03.820 | over again then often the human will have feedback on hey by the way you have to remember to do x thing

00:06:08.780 | or you need to do y thing every time you see this right and so basically an

00:06:16.300 | ability to just maintain and understand the learnings from that and use that to improve the agent in every

00:06:22.060 | future one and those were kind of the big problems of the time you know and that was summer of last year

00:06:26.380 | and around end of summer or fall or so you know i think the kind of big thing that started

00:06:33.340 | coming up was as these systems got more and more capable instead of just doing the most routine

00:06:38.300 | migrations you could do these still pretty isolated but a bit broader

00:06:44.300 | general kinds of bugs or features where you can actually just tell it what you want to do and

00:06:49.420 | have it do it right and so for example hey Devin in this repo select dropdown can you please

00:06:55.180 | just list the currently selected ones at the top like having the checkboxes throughout just doesn't

00:06:59.820 | really work and Devin will just go and do that right and so if you think about it you know

00:07:05.500 | it's something like the kind of level of task that you would give an intern

00:07:07.980 | and there are a few particular things that you have to solve for with this first of all

00:07:14.060 | usually these changes are pretty isolated and pretty contained and so it's one maybe two files

00:07:20.060 | that you really have to look at and change to do a task like this but at least you do still need to

00:07:24.300 | be able to set up the repo and work with the repo right and so you want to be able to run lint you

00:07:29.340 | want to be able to run CI all these other things so you know to at least have the basic checks of whether

00:07:34.380 | things work one of the big things that we built around then was the ability to really set up your

00:07:39.580 | repository ahead of time and build a snapshot that you could start from that you could reload that

00:07:45.580 | you could roll back and all of these kinds of primitives as well right so having this clean remote

00:07:50.060 | VM that could run all these things it could run your CI it could run your linter and so on

00:07:56.860 | but that's when we started to really see i would say a bit broader value right i mean migrations

00:08:01.500 | is one particular thing and for that particular thing we were showing a ton of value and then we

00:08:05.260 | started to see where you know with these bug fixes or things like that you would be able to

00:08:09.180 | just generally get value from Devin as almost like a junior buddy of yours

00:08:14.380 | and then in fall things really moved towards just much broader bugs and requests and here you

00:08:21.500 | know again you're jumping another order of magnitude most changes

00:08:26.940 | don't just contain themselves to one file right often you have to go and look see what's going on you have

00:08:31.980 | to diagnose things you have to figure out what's happening you have to work across files and make the

00:08:36.140 | right changes often these changes are you know hundreds of lines if it's like hey i've got this bug let's figure

00:08:41.340 | out what's going on let's solve it right and you know there are a lot of things here that really

00:08:47.260 | started to make sense and really started to be important but but one in particular i'll just point

00:08:50.700 | out was there's a lot of stuff that you can do with not just looking at the code as text but thinking of

00:08:57.340 | it as this whole hierarchy right so understanding call hierarchies running a language server is a big deal

00:09:03.500 | you have git commit history which you can look at which informs how these different files

00:09:07.660 | relate to one another obviously you have your linter and things like that but

00:09:13.660 | you're able to kind of reference things across files and so like one of the big problems here i think

00:09:17.500 | was kind of working with the context of it and getting to the point where it could make changes

00:09:24.460 | across several files it could be consistent across those changes it would be able to understand across the

00:09:29.260 | code base and here was really the point i would say where you started to be able to just tag it

00:09:33.420 | and have it do an issue and just have it build it for you and so Slack was you know a huge part

00:09:38.860 | of the workflow then and it made sense because it's where you discuss your issues

00:09:44.860 | and it's where you set these things up right so you would tag Devin in Slack and say hey by the way

00:09:48.780 | we've got this bug please take a look or you know could you please go build this thing

00:09:52.860 | this was an especially fun part for us because this is right around when we went GA and a lot of that

00:09:58.060 | was because it got to the point where you truly could just get set up with Devin and ask it a

00:10:03.100 | lot of these broad tasks and just have it do it but a lot of the work

00:10:08.140 | that we did was around having Devin have better and better understanding of the code base right and if

00:10:13.900 | you think about it you know from the human lens it's the same way where on your first day on the job for

00:10:18.620 | example being super fresh in the code base it's kind of tough to know exactly what you're supposed to do

00:10:22.700 | like a lot of these details are things that you understand over time or a representation of the

00:10:26.940 | code base that you build over time right and Devin had to do the same thing and had to understand

00:10:31.580 | how do i plan this task out before i solve it how do i understand all the files that need to be changed

00:10:36.060 | how do i go from there and make that diff
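
The point above about git commit history informing how files relate to one another can be pictured with a small co-change count. This is only a sketch of the general idea, not a description of how Devin works: the function names and the sample history are hypothetical, and in practice the per-commit file lists would come from something like `git log --name-only`.

```python
from collections import Counter
from itertools import combinations


def co_change_counts(commits):
    """Count how often each pair of files was modified in the same commit."""
    pairs = Counter()
    for files in commits:
        # Sort so each pair has one canonical ordering.
        for a, b in combinations(sorted(set(files)), 2):
            pairs[(a, b)] += 1
    return pairs


def related_files(pairs, path, top=3):
    """Files most often changed together with `path`."""
    scores = Counter()
    for (a, b), n in pairs.items():
        if a == path:
            scores[b] += n
        elif b == path:
            scores[a] += n
    return [f for f, _ in scores.most_common(top)]


# Hypothetical history: each inner list is the files touched by one commit.
history = [
    ["api/users.py", "api/schemas.py"],
    ["api/users.py", "api/schemas.py", "tests/test_users.py"],
    ["ui/dropdown.tsx"],
    ["api/users.py", "tests/test_users.py"],
]
pairs = co_change_counts(history)
neighbours = related_files(pairs, "api/users.py")
```

Files that keep showing up in the same commits are a cheap hint about which other files an agent should open before it edits one of them.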

00:10:38.060 | and around the spring of this year again every gap is like two or three months you know we got

00:10:48.060 | to an interesting point which is once you start to get to harder and harder tasks you as the human

00:10:53.820 | don't necessarily know everything that you want done at the time that you're giving the task right if

00:10:58.940 | you're saying hey you know i'd like to go and um improve the architecture of this or you know this this

00:11:04.620 | function is slow like let's let's profile it and look into it and see what needs to be done or hey like

00:11:10.780 | you know we really should should handle this this error case better but like let's look at all the

00:11:15.420 | possibilities and see what we should you know what the right logic should be in each of these right

00:11:19.660 | and basically what it meant is that this whole idea of taking a two-line prompt or a three-line prompt

00:11:24.540 | or something and then just having that result in a Devin task was not sufficient and you wanted to

00:11:30.380 | really be able to work with Devin and specify a lot more and around this time along with this kind of

00:11:36.220 | better codebase intelligence we had a few different things that came up and so we

00:11:40.460 | released DeepWiki for example and the whole idea of DeepWiki funnily enough is

00:11:45.100 | Devin had its own internal representation of the code base but it turns out that

00:11:48.540 | for humans it was great to look at that too to be able to understand what was going on or to be able

00:11:54.140 | to ask questions quickly about the code base closely related to that was Devin Search which is the ability

00:11:59.900 | to really just ask questions about a code base and understand some piece of this and a lot of

00:12:06.140 | the workflow that really started to come up was actually basically this more iterative workflow where

00:12:10.940 | the first thing that you would do is you would ask a few questions you would understand you would

00:12:14.940 | basically have a more L2 experience where you can go and explore the code base with your agent

00:12:19.500 | figure out what has to be done in the task and then set your agent off to go do that because

00:12:25.260 | for these more complex tasks you kind of needed that right and so you know that was i would

00:12:31.100 | say kind of a big paradigm shift for us then and this is what also came

00:12:35.020 | along with Devin 2.0 for example and the in-IDE experience where often you want to be able to

00:12:40.540 | have points where you closely monitor Devin for 10 percent of the task 20 percent of the task and then have

00:12:45.580 | it do work on its own for the other 80 90 percent and then lastly most recently in june which is now

00:12:54.780 | it was really the ability to just truly kill your backlog and hand it a ton of

00:13:00.540 | tasks and have it do all these at once and you know if you think about this task and in many ways i would

00:13:04.540 | say it's almost like a culmination of many of these different things that had to be done in the

00:13:08.300 | past you have to work with all these systems obviously you have to integrate into all these

00:13:11.900 | certainly you want to be able to work with Linear or with Jira or systems like that but you have to

00:13:16.540 | be able to scope out a task to understand what's meant by what's going on you have to decide when

00:13:21.260 | to go to the human for more approval or for questions or things like that you have to work across several

00:13:25.980 | different files often you have to understand even what repo is the right repo to make the change in if

00:13:31.660 | your org has multiple repos or what part of the code base is the right part of the code base that

00:13:35.820 | needs to change um and to really get to the point where you can go and do this more autonomously

00:13:40.700 | first of all you have to have like a really great sense of confidence right and so

00:13:47.260 | you know rather than just going off and doing things immediately you have to be able to say okay

00:13:51.500 | i'm quite sure that this is the task and i'm going to go execute it now versus i don't understand what's

00:13:57.420 | going on human please give me help basically right but the other piece of it is

00:14:03.420 | this is i think the era where testing and this asynchronous testing gets really really important

00:14:08.300 | right which is if you want something to just deliver entire PRs for you for tasks that you do especially

00:14:14.220 | for these larger tasks you want to know that it can test it itself and often the agent actually

00:14:20.140 | needs this iterative loop to be able to go and do that right so it needs to be able to run all the code

00:14:24.540 | locally it needs to know what to test it needs to know what to look for and in many ways it's just a much

00:14:29.580 | higher context problem to solve for right this testing itself and that brings us to now and

00:14:36.700 | obviously it's a pretty fun time to see because now what we're thinking about is hey maybe

00:14:40.700 | instead of doing just one task you know how do we think about tackling an entire project right

00:14:46.140 | and after we do a project you know what goes after that and maybe one point that i would just make here is

00:14:53.180 | we talk about all these two x's you know that happen every couple months and i think from a kind of

00:14:59.180 | cosmic perspective all the two x's look the same right but in practice every two x actually is a

00:15:03.340 | different one right and so when we were just doing you know tab completion single line completion

00:15:08.780 | it really was just a text problem it is just like take in the single file so far and just predict what

00:15:14.300 | the line is next right over the last year or year and a half we've had to think about so much more

00:15:18.780 | how do you work with the human in Linear or Slack how do you take in feedback or steering

00:15:23.820 | how do you help the human plan out and do all these things right and moreover obviously there's a ton of

00:15:30.140 | the tooling and the capabilities work that has to be done of how does Devin test on its own

00:15:35.260 | how does Devin you know make a lot of these longer term decisions on its own how does it

00:15:40.620 | debug its own outputs or run the right shell commands to figure out what the feedback is

00:15:46.060 | and go from there and so it's super exciting now that there's a lot more coding

00:15:50.620 | agents in the space it's very fun to see and i think that you know we're going to see another 16

00:15:56.460 | to 64x over the next 12 months as well and so yeah super excited awesome well that's all

00:16:04.300 | thank you guys so much for having me
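
The doubling arithmetic from the talk (a doubling every 70 days in code, roughly every seven months in general, and "four to six doublings" giving 16 to 64x a year) can be checked with a few lines. This is a worked illustration only; the function name is made up for this note, and the 365-day year and 12-month conversion are assumptions.

```python
def yearly_multiplier(doubling_days: float, days_per_year: float = 365.0) -> float:
    """Growth factor over one year for a quantity that doubles every `doubling_days`."""
    doublings_per_year = days_per_year / doubling_days
    return 2.0 ** doublings_per_year


# Rounded range quoted in the talk: four to six doublings per year.
low, high = 2 ** 4, 2 ** 6  # 16 and 64

# A 70-day doubling time is about 5.2 doublings per year.
code_multiplier = yearly_multiplier(70)

# A seven-month doubling time, with a month taken as 365 / 12 days.
general_multiplier = yearly_multiplier(7 * 365 / 12)

print(f"code: ~{code_multiplier:.0f}x per year (quoted range {low}x to {high}x)")
print(f"general: ~{general_multiplier:.1f}x per year")
```

Under these assumptions the 70-day figure lands at roughly 37x a year, inside the 16 to 64x range quoted, while a seven-month doubling time works out to a bit over 3x a year.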