back to indexDevin 2.0 and the Future of SWE - Scott Wu, Cognition

00:00:00.000 |
yeah well thank you guys so much for having me it's exciting to be back it's uh I was last here 00:00:18.780 |
at AI engineer one year ago it's kind of crazy I've always been I've been telling Swix that we 00:00:24.720 |
need to have these conferences way more often if it's going to be about AI software engineering 00:00:28.020 |
probably should be like every two months or something like that with the pace of everything's 00:00:31.380 |
done but but but it's gonna be fun to to talk a little bit about you know what we've seen in the 00:00:36.300 |
space and and what we've learned over the last 12 or 18 months building Devon over this time and I 00:00:44.220 |
want to start this off with Moore's law for AI agents and so you can kind of think of the the 00:00:50.520 |
the capability or the capacity of an AI by how much work it can do uninterrupted until you have 00:00:58.380 |
to come in and step in and intervene or steer it or whatever it is right and you know in GPT-3 for 00:01:04.380 |
example it's if you were to go and ask GPT-3 to do something you know you could probably get through a 00:01:08.940 |
few words or so and then it'll say something where it's like okay you know this is probably not the 00:01:13.140 |
right thing to say hey and GPT-3.5 was better and GPT-4 was better right and and so people talk 00:01:19.080 |
about these lengths of tasks and what you see in general is that that doubling time is about every 00:01:23.700 |
seven months which already is pretty crazy actually but in code it's actually even faster it's every 70 00:01:30.660 |
days which is two or three months and so you know if you look at various software engineering tasks that 00:01:36.300 |
start from the simplest single functions or single lines and you go all the way to you know we're doing 00:01:42.540 |
tasks now that take hours of humans time and and an AI agent is able to just do all of that right 00:01:48.300 |
and if you think about doubling every 70 days I mean basically you know every two to three months means 00:01:54.380 |
you get four to six doublings every year which means that the amount of work that an AI agent can do 00:02:00.460 |
in code goes something between 16 and 64x in a year every year at least for the last couple years that 00:02:07.660 |
we've seen um and it's kind of crazy to think about but but that sounds about right actually for for what 00:02:13.100 |
we've seen you know 18 months ago I would say the only really the only product experience that had pmf 00:02:18.780 |
in code was just tab completion right it was just like here's what I have so far predict the next line 00:02:24.940 |
for me that was kind of all you really could do um in a way that really worked and we've gone from 00:02:30.380 |
that obviously to full AI engineer that goes and just does does does all these tasks for you right 00:02:35.580 |
and implements a ton of these things and people ask all the time what is the um you know what what 00:02:42.380 |
what is the the future interface or what is the right way to do this or what are the most important 00:02:46.940 |
capabilities to solve for and I think funnily enough the answer to all these questions actually is 00:02:51.100 |
it changes every two or three months like every time you get to the next tier the the the bottleneck 00:02:57.660 |
that you're running into or the most important capability or the right way you should be 00:03:01.020 |
interfacing with it like all these actually change at each point and so I wanted to talk a bit about 00:03:07.660 |
some of the the those tiers for us over the last year or so um and you know over the course of that 00:03:15.020 |
time obviously you know when we got started um in the end of 2023 obviously agents were not even a 00:03:19.820 |
concept um and now everyone has you know everyone's talking about coding agents people are doing more and 00:03:23.980 |
more and more and more uh and it's very cool to see um and and each of these has kind of been almost a 00:03:28.940 |
discrete tier for us um and so right right around a year ago when we were doing the last ai engineer 00:03:35.420 |
talk actually um the the biggest use case that we really saw that that was getting broad adoption was 00:03:40.780 |
what i'll kind of call these repetitive migrations and so i'm talking like javascript to typescript 00:03:46.620 |
or like upgrading your angular version from this one to that one or going from this java version to that java 00:03:52.220 |
version or something like that um and those those kinds of tasks in particular what you typically 00:03:58.220 |
see is you are you you have some massive code base that you want to apply this whole migration for 00:04:05.580 |
you have to go file by file and do every single one and usually the set of steps is pretty clear 00:04:10.300 |
right if you go to the angular website or something like that it'll tell you all right here's what you 00:04:14.060 |
have to do this this this this this and um you want to go and execute each of these steps it's not so 00:04:20.220 |
routine that there you know there's no classical deterministic program that solves that but there's 00:04:25.180 |
kind of a clear set of steps and if you can follow those steps very well then you can do the task and 00:04:29.820 |
you know this was the thing for us because that was all you could really trust agents to do at the time 00:04:35.020 |
you know you could do harder things once in a while and you could do some really cool stuff occasionally 00:04:39.020 |
but as far as something that was consistent enough that you could do it over and over and over 00:04:44.300 |
these kinds of like repetitive migrations that you would be doing for you know 10 000 files 00:04:48.540 |
were you know in many ways the the easiest thing which was cool actually because 00:04:53.340 |
it was also kind of the the most annoying thing for humans to do and i think that's generally been 00:04:59.100 |
the trend where um ai has always done these more boilerplate tasks and the more tedious stuff more 00:05:04.780 |
repetitive stuff and we get to do the the the more fun creative stuff um and obviously as time has gone on 00:05:10.220 |
it's it's taken on more and more of that boilerplate but for a problem like this one a lot of what you 00:05:15.820 |
need to do is you need devin to be able to go and execute a set of steps super reliably and so a lot of 00:05:22.780 |
this was you know i would say the big capabilities problems to solve was mostly instruction following and so 00:05:28.220 |
we built this system called playbooks where basically you could just outline a very clear set of steps 00:05:33.340 |
have it follow each of those step by step and then do exactly what's said now if you think about it 00:05:37.660 |
obviously a lot of software engineering does not fall under the category of literally just follow 00:05:42.060 |
10 steps step by step and do exactly what it said but migration does and it allowed us to go and actually 00:05:48.460 |
do these and this was kind of i would say the first big use case of devin that really um that really came 00:05:53.100 |
up i think one of the other big systems that got built around that time which we've since rebuilt many 00:05:57.900 |
times is knowledge or memory right which is you know if you're doing the same task over and over and 00:06:03.820 |
over again then often the human will have feedback on hey by the way you have to remember to do x thing 00:06:08.780 |
or you have to you know you need to do y thing every time when you see this right um and so basically an 00:06:16.300 |
ability to to just maintain and understand the learnings from that and use that to improve the agent in every 00:06:22.060 |
future one and those were kind of the big problems of the time you know and that was summer of last year 00:06:26.380 |
and around end of summer fall or so you know i think that the the kind of big thing that started 00:06:33.340 |
coming up was as these systems got more and more capable instead of just doing the most routine 00:06:38.300 |
migrations you could do you know these more still pretty isolated but but but but a bit broader of 00:06:44.300 |
these general kind of bugs or features where you can actually just tell it what you want to do and 00:06:49.420 |
have you have it do it right and so for example hey devon in this uh repo select drop down can you please 00:06:55.180 |
just list the currently selected ones at the top like having the checkboxes throughout is just doesn't 00:06:59.820 |
really and and devon will just go and do that right and so if you think about it it's you know it's it's 00:07:05.500 |
it's something like the kind of level of tasks that you would give an intern 00:07:07.980 |
and there are a few particular things that you have to solve for um with this first of all 00:07:14.060 |
usually these these these changes are pretty isolated and pretty contained and so it's one maybe two files 00:07:20.060 |
that you really have to look at and change to do a task like this but at least you do still need to 00:07:24.300 |
be able to set up the repo and work with the repo right and so you want to be able to run lint you 00:07:29.340 |
want to be able to run ci all these other things so you know to at least have the basic checks of whether 00:07:34.380 |
things work one of the big things that we built around then was the ability to really set up your 00:07:39.580 |
repository uh ahead of time and build a snapshot um that that you could start off that you could reload that 00:07:45.580 |
you could roll back and all of these kinds of primitives as well right so having this clean remote 00:07:50.060 |
vm that could run all these things it could run your ci it could run your linter uh and and so on um 00:07:56.860 |
but that's when we started to really see i would say a bit more broad of value right i mean migrations 00:08:01.500 |
is one particular thing and for that particular thing we were showing a ton of value and then we 00:08:05.260 |
started to see where you know with these bug fixes or things like that you would be able to 00:08:09.180 |
just generally get value from devon as as almost like a junior buddy of yours 00:08:14.380 |
and then in fall things really moved towards just much broader bugs and requests and here it's you 00:08:21.500 |
know most most changes again you know you're jumping another order of magnitude most changes 00:08:26.940 |
don't just contain themselves to one file right often you have to go and look see what's going on you have 00:08:31.980 |
to diagnose things you have to figure out what's happening you have to work across files and make the 00:08:36.140 |
right changes often these changes are you know hundreds of lines if it's like hey i've got this bug let's figure 00:08:41.340 |
out what's going on let's solve it right and you know there are a lot of things here that that really 00:08:47.260 |
started to make sense and really started to be important but but one in particular i'll just point 00:08:50.700 |
out was there's a lot of stuff that you can do with not just looking at the code as text but thinking of 00:08:57.340 |
it as this whole hierarchy right so so understanding call hierarchies running a language server uh is a big deal 00:09:03.500 |
you have you have git commit history which you can look at which informs how these different files 00:09:07.660 |
relate to one another you have um um obviously you have like your linter and things like that but but 00:09:13.660 |
you're able to kind of reference things across files and so like one of the big problems here i think 00:09:17.500 |
was uh kind of working with the context of it and getting to the point where it could make changes 00:09:24.460 |
across several files it could be consistent across those changes it would be able to understand across the 00:09:29.260 |
code base and here was really the point i would say where you started to be able to just tag it 00:09:33.420 |
and have it do an issue and just have it build it for you um and so slack was a you know a huge part 00:09:38.860 |
of the workflow then um and and it was just it it made sense because it's where you discuss your issues 00:09:44.860 |
and it's where you set these things up right so you would tag devon and slack and say hey by the way 00:09:48.780 |
we've got this bug please take a look or you know could you please go build this thing 00:09:52.860 |
uh this is especially fun part for us because this is right around when we went ga and a lot of that 00:09:58.060 |
was because it was it got to the point where you truly could just get set up with devon and ask it a 00:10:03.100 |
lot of these broad tasks and and just have it do it um but but a lot of these you know a lot of the work 00:10:08.140 |
that we did was around having devon have better and better understanding of the code base right and if 00:10:13.900 |
you think about it you know from the human lens it's the same way where on your first day on the job for 00:10:18.620 |
example being super fresh in the code base it's kind of tough to know exactly what you're supposed to do 00:10:22.700 |
like a lot of these details are things that you understand over time or that a representation of the 00:10:26.940 |
code base that you build over time right um and devon had to do the same thing and had to understand 00:10:31.580 |
how do i plan this task out before i solve it how do i understand all the files that need to be changed 00:10:38.060 |
and around the spring of this year um again every every gap is like two or three months you know we got 00:10:48.060 |
to an interesting point which is once you start to get to harder and harder tasks you as the human 00:10:53.820 |
don't necessarily know everything that you want done at the time that you're giving the task right if 00:10:58.940 |
you're saying hey you know i'd like to go and um improve the architecture of this or you know this this 00:11:04.620 |
function is slow like let's let's profile it and look into it and see what needs to be done or hey like 00:11:10.780 |
you know we really should should handle this this error case better but like let's look at all the 00:11:15.420 |
possibilities and see what we should you know what the right logic should be in each of these right 00:11:19.660 |
and basically what it meant is that this whole idea of taking a two-line prompt or a three-line prompt 00:11:24.540 |
or something and then just having that result in a devon task was was not sufficient and you wanted to 00:11:30.380 |
really be able to work with devon and specify a lot more and around this time along with this kind of 00:11:36.220 |
like better code-based intelligence um we had a few different things that that that came up and so we 00:11:40.460 |
released deep wiki for example um and the whole idea of deep wiki was you know funnily enough is 00:11:45.100 |
devon had its own internal representation of the code base but it turns out that 00:11:48.540 |
for humans it was great to look at that too to be able to understand what was going on or to be able 00:11:54.140 |
to ask questions quickly about the code base closely related to that was with search which is the ability 00:11:59.900 |
to really just ask questions about a code base and understand um some some piece of this and a lot of 00:12:06.140 |
the workflow that really started to come up was actually basically this this more iterative workflow where 00:12:10.940 |
the first thing that you would do is you would ask a few questions you would understand you would 00:12:14.940 |
basically have a more l2 experience where you can go and explore the code base with your agent 00:12:19.500 |
figure out what has to be done in the task and then set your agent off to go do that because 00:12:25.260 |
for these more complex tasks you kind of needed that right um and so so you know that was a i would 00:12:31.100 |
say kind of like a big paradigm shift for us then is is understanding you know this is what also came 00:12:35.020 |
along with devon 2.0 for example and the in ide experience where often yeah you want to be able to 00:12:40.540 |
have points where you closely monitor devon for 10 percent of the task 20 of the task and then have 00:12:45.580 |
it do uh work on its own for the other 80 90 percent um and then lastly most recently in june which is now 00:12:54.780 |
it was kind of yeah the really the ability to just truly just kill your backlog and hand it a ton of 00:13:00.540 |
tasks and have it do all these at once and you know if you think about this task and in many ways i would 00:13:04.540 |
say it's it's almost like a culmination of of many of these different things that that had to be done in the 00:13:08.300 |
past you have to work with all these systems obviously you have to integrate into all these 00:13:11.900 |
certainly you want to be able to to work with linear or with jira or systems like that but you have to 00:13:16.540 |
be able to scope out a task to understand what's meant by what's going on you have to decide when 00:13:21.260 |
to go to the human for more approval or for questions or things like that you have to work across several 00:13:25.980 |
different files often you have to understand even what repo is the right repo to make the change in if 00:13:31.660 |
your if your org has multiple repos or what part of the code base is the right part of the code base that 00:13:35.820 |
needs to change um and to really get to the point where you can go and do this more autonomously 00:13:40.700 |
first of all um you you have to have like a really great sense of confidence right and so 00:13:47.260 |
um you know rather than just going off and doing things immediately you have to be able to say okay 00:13:51.500 |
i'm quite sure that this is the task and i'm going to go execute it now versus i don't understand what's 00:13:57.420 |
going on human please give me help basically right but but the other piece of it is 00:14:03.420 |
um this is i think the era where testing and this asynchronous testing gets really really important 00:14:08.300 |
right which is if you want something to just deliver entire prs for you for tasks that you do especially 00:14:14.220 |
for these larger tasks you want to know that it is can contest it itself and often the agent actually 00:14:20.140 |
needs this iterative loop to be able to go and do that right so it needs to be able to run all the code 00:14:24.540 |
locally it needs to know what to test it needs to know what to look for um and in many ways it's just a much 00:14:29.580 |
higher context problem to solve for right is this testing itself and that brings us to now and 00:14:36.700 |
obviously it's a it's a pretty fun time to see because now what we're thinking about is hey maybe 00:14:40.700 |
if instead of doing it just one task it's you know how do we think about tackling an entire project right 00:14:46.140 |
and after we do a project you know what goes after that and maybe one point that i would just make here is 00:14:53.180 |
we talk about all these two x's you know that happen every couple months and i think from a kind of 00:14:59.180 |
cosmic perspective all the two x's look the same right but in practice every two x actually is a 00:15:03.340 |
different one right and so when we were just doing you know tab completion line single line completion 00:15:08.780 |
it really was just a text problem it is just like take in the single file so far and just predict what 00:15:14.300 |
the line is next right over the last year or year and a half we've had to think about so much more how do you 00:15:18.780 |
how do you work with the human in linear or slack which you how do you take in feedback or steering 00:15:23.820 |
how do you help the human plan out and do all these things right and moreover obviously there's a ton of 00:15:30.140 |
the tooling and the capabilities work that have to be done of how does how does devon test on its own 00:15:35.260 |
how does devon you know make a lot of these longer term decisions on its own how does it 00:15:40.620 |
debug its own outputs or run the right shell commands to figure out what the feedback is 00:15:46.060 |
and go from there and so it's super exciting now that there's a lot more there's a lot more coding 00:15:50.620 |
agents in the space it's a it's very fun to see and i think that you know we're going to see another 16 00:15:56.460 |
to 64 x over the next 12 months as well and uh and so yeah super super excited awesome well that's all