Software Development Agents: What Works and What Doesn't - Robert Brennan, All Hands AI / OpenHands

Today I'm going to talk a little bit about coding agents. If you're anything like me, you've found a lot of things that work and a lot of things that don't.
I've been building open source development tools. My team and I have created an open source software development agent called OpenHands, formerly known as OpenDevin.
To state the obvious, in 2025, software development is changing. Our jobs are very different now than they were two years ago, and they're going to be very different two years from now.
The thing I want to convince you of is that coding is going away. We're going to be spending a lot less time actually writing code, but that doesn't mean that software engineering is going away. Our job is no longer to type out every line ourselves, but to actually think critically about the problems that are in front of us.
If we do AI-driven development correctly, it will mean we spend less time leaning forward and squinting into our IDE, and more time sitting back in our chairs and thinking: What problems are we trying to solve as an organization? How can we architect this in a way that sets us up for the future?
The AI is very good at that inner loop of development: write code, run the code, write code, run the code. It's not very good at those big-picture tasks, and that's where we come in as software engineers.
So let's talk a little about what a coding agent actually is. I think this word "agent" gets thrown around a lot these days, but at the core of it is this concept of agency: the idea of taking action out in the real world.
For a coding agent, that means the main tools of a software engineer's job: a code editor to actually modify your code base, a terminal to run the code, and a web browser to look up documentation and maybe copy and paste some code from Stack Overflow. These are the core tools of the job, and these are the tools that we give to our agents.
It's worth distinguishing agents from some of the more tactical code-gen tools that are out there. We started a couple of years ago with things like GitHub Copilot's autocomplete feature, where, wherever your cursor is pointed, it's just filling out two or three more lines of code. Over time, things have gotten more and more agentic. We've got AI-powered IDEs that can maybe take a few steps on their own, and now you've got tools like Devin and OpenHands, where you're really giving the agent just one or two sentences describing what you want, and it goes off and works for five, ten, fifteen minutes on its own. Meanwhile, you can focus on communicating with your coworkers, or goofing off on Reddit, while these agents are working for you. It takes some getting used to, but it's a much more powerful way of working.
So I want to talk a little bit about how these agents work under the hood. I feel like once you understand what's happening under the surface, it really helps you build an intuition for how to use agents effectively. At its core, an agent is a loop between a large language model and the outside world. The large language model serves as the brain, and we repeatedly take actions in the external world on its behalf. Basically, at every step of this loop, we're asking the LLM: what's the next thing you want to do to get one step closer to your goal? It might say, okay, I want to read this file, I want to make this edit, I want to run this command. We go out and take that action in the real world, get some kind of output, whether it's the contents of a web page or the output of a command, and then stick that back into the LLM so it can decide on its next step.
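To make that loop concrete, here's a minimal sketch in Python. Everything in it is hypothetical: the `llm.complete` call and the tool names are placeholders, not the OpenHands API, but it captures the act-and-observe cycle just described.

```python
import json

def run_agent(llm, tools, goal, max_steps=50):
    """Minimal agent loop: ask the LLM for an action, execute it, feed back the result."""
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        # Ask the brain: what's the next step toward the goal?
        action = json.loads(llm.complete(history))  # e.g. {"tool": "run", "args": {"cmd": "pytest"}}
        if action["tool"] == "finish":
            return action["args"].get("summary", "done")
        # Take the action in the real world and capture the observation...
        observation = tools[action["tool"]](**action["args"])
        # ...then put both back into the context for the next iteration.
        history.append({"role": "assistant", "content": json.dumps(action)})
        history.append({"role": "tool", "content": str(observation)})
    return "step limit reached"
```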
Let me talk a little bit about the core tools, starting with the code editor. File editing actually turns out to be a fairly interesting problem. The naive solution would be to just give the old file to the LLM and have it write out the new file in its entirety. That's not a very efficient way to work, though: you end up printing out all the lines that are staying the same. So most contemporary agents use a find-and-replace type editor, where the LLM specifies a snippet of existing text and what to replace it with. A lot of times they'll also provide an abstract syntax tree to help the agent navigate the code base.
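A find-and-replace editor can be surprisingly simple. Here's an illustrative sketch (the function name and error messages are my own, not OpenHands' actual tool); note how it forces the agent to supply a unique snippet before it will edit anything:

```python
from pathlib import Path

def str_replace(path: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` with `new`, or explain why not."""
    text = Path(path).read_text()
    count = text.count(old)
    if count == 0:
        return f"Error: snippet not found in {path}."
    if count > 1:
        return f"Error: snippet appears {count} times in {path}; add context to make it unique."
    Path(path).write_text(text.replace(old, new, 1))
    return f"Edited {path}."
```

Returning errors as strings rather than raising matters here: the message goes straight back into the loop as an observation, so the agent can correct itself on the next step.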
The second tool is the terminal. Again, you would think text in, text out should be pretty simple, but there are a lot of questions that pop up here. What do you do when there's a long-running command that never exits? What happens if you want to run multiple commands in parallel? Lots of really interesting problems crop up when you have an agent interacting with the terminal.
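For example, one common way to handle the long-running-command problem is to impose a timeout and report partial output back to the agent rather than hanging forever. A rough sketch (my own illustration, not necessarily how OpenHands does it):

```python
import subprocess

def run_command(cmd: str, timeout: float = 30.0) -> str:
    """Run a shell command, returning combined output without hanging forever."""
    try:
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
        return result.stdout + result.stderr
    except subprocess.TimeoutExpired as e:
        out = e.stdout or ""
        if isinstance(out, bytes):  # partial output may come back as bytes
            out = out.decode(errors="replace")
        # Tell the agent the command is still running so it can decide what to do next.
        return f"[no exit after {timeout}s; partial output below]\n{out}"
```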
And then probably the most complicated tool is the web browser. Again, there's a naive solution here, where the agent gives you a URL and you stuff the raw HTML back into the LLM. That's very expensive, because there's a bunch of cruft inside that HTML that the model doesn't need. We've had a lot of luck passing it accessibility trees, or converting the page to markdown and passing that to the LLM, or allowing the LLM to scroll through the web page a chunk at a time. And then, if you start to add interaction, it gets more complicated still. You can let the LLM write JavaScript against the page, or, and we've actually had a lot of luck with this, you can give it a screenshot and let it act on what it sees. I would say this is definitely a space to watch.
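As a sketch of the "strip the cruft" idea, here's one way to turn a page into cheap plain text and hand it to the model one window at a time (using requests and BeautifulSoup; this is an assumption-laden illustration, not OpenHands' actual browsing pipeline):

```python
import requests
from bs4 import BeautifulSoup

def fetch_page_text(url: str, offset: int = 0, window: int = 8000) -> str:
    """Fetch a page, drop non-content tags, and return one window of plain text."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()  # the cruft the LLM doesn't need
    text = " ".join(soup.get_text(separator=" ").split())
    # Letting the agent pass an offset is a crude form of "scrolling".
    return text[offset:offset + window]
```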
And then I also want to talk about sandboxing. This is really important, because if agents are going to run autonomously for several minutes on their own, without you watching everything they're doing, you want to make sure they're not doing anything dangerous. So all of our agents run inside of a Docker container by default. They're totally separated out from your workstation, so there's no chance of one running rm -rf on your home directory. Increasingly, though, we're giving agents access to third-party APIs. You might give one a GitHub token, or access to your AWS account. It's super, super important to make sure those credentials are tightly scoped, and that you're following the principle of least privilege as you grant agents access to these things.
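To illustrate the kind of isolation this buys you, here's a sketch that runs a command in a throwaway container. The image name, resource caps, and flags are illustrative defaults, not OpenHands' actual sandbox configuration:

```python
import subprocess

def run_sandboxed(cmd: str, image: str = "python:3.12-slim") -> str:
    """Run a command in an ephemeral container with no host filesystem access."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",          # container is destroyed afterwards
            "--network", "none",              # no network unless the task needs it
            "--memory", "1g", "--cpus", "1",  # cap resources
            image, "bash", "-lc", cmd,
        ],
        capture_output=True, text=True,
    )
    return result.stdout + result.stderr
```

The same thinking applies to credentials: if the agent needs GitHub access, pass in a fine-grained token scoped to a single repository rather than your personal token.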
All right, I want to move into some best practices. My biggest advice for folks who are just getting started is to start small. The best tasks are things that can be completed pretty quickly, a single commit, with a clear definition of done. You want the agent to be able to verify that it's finished: the tests pass, the merge conflicts have been resolved, et cetera. And tasks that are easy for you as an engineer to verify are the best place to start. I like to tell people to begin with small chores. Very frequently you might have a pull request where there's one test failing, or some lint errors, or merge conflicts: bits of toil that you don't really like doing as a developer. Those are great tasks to just shove off to the AI. As you build trust, you'll find that you can give it bigger and bigger tasks, and you'll understand how to communicate with the agent effectively. For me, something like 90% of my code now goes through the agent, and it's pretty rare that I have to open up my IDE and get my hands dirty in the code base again.
Being very clear with the agent about what you want matters a lot. Don't just tell it what you want; tell it how you want it done. Mention specific frameworks that you want it to use. If you want it to follow a test-driven development strategy, say so explicitly. Mention any specific files or function names that it should go for. This not only helps it be more accurate, because it's clearer what exactly you want the output to be, it also means it doesn't have to spend as long exploring the code base, because you've told it: I want you to edit this exact file. This can save you a bunch of time and energy.
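As a made-up example of what that specificity looks like in practice (the endpoint, file names, and library here are purely illustrative):

```
Add rate limiting to the /api/search endpoint.
- Use the slowapi library we already depend on elsewhere.
- The handler is search_items() in server/routes/search.py.
- Write failing tests in tests/test_search.py first, then make them pass.
```

Each line removes a round of exploration the agent would otherwise have to do on its own.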
I also like to remind folks that in an AI-driven development world, you can kick off work from anywhere. I love it: if I have an idea on my walk to work, I'll just tell OpenHands with my voice, do X, Y, and Z, and then the result is waiting for me when I get to work. 50% of the time, it looks great, and I just merge it. It's really fun to be able to rapidly prototype like that.
And I would also say: if you try to work with the agent and it doesn't get things quite right, maybe it's close and you can just keep iterating. If it's way off, though, just throw away that work and start fresh with a new prompt based on what you learned. It's a new sort of muscle memory you have to develop. Sometimes it's hard to throw away tens of thousands of lines of generated code, because you're used to that representing weeks of work. These days it's very easy to just start from scratch again.
This is probably the most important bit of advice I can give folks: you need to review the code that the AI writes. I've seen more than one organization run into trouble thinking that they could just vibe code their way to a production application, automatically merging everything that came out of the AI. If you just don't review anything, you'll find that your code base degrades quickly. So make sure you're reviewing the code that the agent outputs, and make sure you're pulling the code and running it on your workstation, or running it inside of an ephemeral environment, just to make sure the agent has actually solved the problem.
As you work with agents over time, you'll build an intuition for what they do well and what they don't do well, and you can generally trust them to operate the same way today as they did yesterday.
One of our big learnings with OpenHands: in the early days, if you opened up a pull request with OpenHands, that pull request would show up as owned by OpenHands, with the little hands logo next to it. That caused two problems. One, it meant that the human who had triggered the pull request could then approve it and basically bypass our whole code review system; you didn't need a second human in the loop before merging. And two, oftentimes those pull requests would just languish. Nobody was jumping in to make sure the tests passed, and they would just sit there and not get merged. Or, if they did get merged and something went wrong, we didn't really know who to go to. There was nobody we could hold accountable for that breakage. So now, if you open up a pull request with OpenHands, it's attributed to you, and you're responsible for any breakage it might cause down the line.
And then I do want to close by going through a handful of use cases. Agents can take on a lot as long as you break things down into small, verifiable tasks, but there are a few that are really great day-one use cases. The first is fixing merge conflicts. This is maybe the biggest chore in my job. OpenHands itself is a very fast-moving code base, and I love just being able to jump in and say, @OpenHands, fix the merge conflicts on this PR. OpenHands knocks this out 99% of the time.
Addressing code review feedback is another great one. Somebody else has already taken the time to clearly articulate what they want changed, and all you have to do is say, @OpenHands, address this feedback. In one example, OpenHands did exactly what the reviewer wanted. Our front-end engineer was like, do X, Y, and Z, and he mentioned a whole bunch of buzzwords that I don't know. OpenHands knew all of it and was able to address his feedback. In another example, we had an input that was a text input that needed a small fix. If I wasn't lazy, I could have dug through my code base and made the change myself. But it was really easy for me to just quickly, I think I did this one directly from inside of Slack, say: @OpenHands, fix this thing we were just talking about. I don't even have to fire up my IDE.
Small infrastructure changes are another good use case. Usually these involve looking up some really esoteric syntax in the Terraform docs or something like that. OpenHands and the underlying LLMs tend to just know that stuff, and if not, they can look up the documentation using the browser. Sometimes we'll get an out-of-memory exception in Slack and immediately say, okay, OpenHands, increase the memory. This is an area where I find I often leave best practices behind; the LLM tends to be really great about following all the best practices. Again, it's kind of a rote task for developers.
Fixing failing tests is another. If you've already got the code 90% of the way there, and there's just a unit test failing because there was a breaking API change, it's very easy to call in an agent to clean up the failing tests. Expanding test coverage is another one I love, because it's a very safe task: as long as the tests are passing, it's generally safe to just merge. If you notice a spot in your code base that feels under-tested, just ask your agent to expand your test coverage in that area. It's a great quick win to make your code base a little bit safer.
Then everybody's favorite: building apps from scratch. I would say, if you're shipping production code, again, don't just vibe code your way to a production application. But increasingly, internally at our company, we're finding there's often a little internal app we want to build. For instance, we built a way to debug OpenHands trajectories: a whole web application that, since it's just an internal tool, we don't really need to review line by line. This has been a really, really fun thing for our business, being able to churn out these quick applications. So yeah, greenfield is a great, great use case for agents.
We'd love to have you all join the OpenHands community. You can find us on GitHub at github.com/All-Hands-AI/OpenHands.