
Software Development Agents: What Works and What Doesn't - Robert Brennan, AllHands/OpenHands



00:00:03.000 | Today I'm going to talk a little bit about coding agents
00:00:18.000 | and how to use them effectively, really.
00:00:20.000 | If you're anything like me, you've found a lot of things
00:00:23.000 | that work really well and a lot of things
00:00:25.000 | that don't work very well.
00:00:28.000 | A little bit about me.
00:00:30.000 | My name is Robert Brennan.
00:00:31.000 | I've been building open source development tools
00:00:33.000 | for over a decade now.
00:00:35.000 | My team and I have created an open source software development agent
00:00:40.000 | called OpenHands, formerly known as OpenDevin.
00:00:43.000 | To state the obvious, in 2025, software development is changing.
00:00:49.000 | Our jobs are very different now than they were two years ago,
00:00:53.000 | and they're going to be very different two years from now.
00:00:55.000 | The thing I want to convince you of is that coding is going away.
00:00:58.000 | We're going to be spending a lot less time actually writing code,
00:01:01.000 | but that doesn't mean that software engineering is going away.
00:01:04.000 | We're paid not to type on our keyboard,
00:01:07.000 | but to actually think critically about the problems that are in front of us.
00:01:10.000 | If we do AI-driven development correctly, it will mean we spend less time
00:01:16.000 | actually leaning forward and squinting into our IDE
00:01:19.000 | and more time kind of sitting back in our chair and thinking,
00:01:22.000 | what does the user actually want here?
00:01:24.000 | What are we actually trying to build?
00:01:26.000 | What problems are we trying to solve as an organization?
00:01:28.000 | How can we architect this in a way that sets us up for the future?
00:01:31.000 | The AI is very good at that inner loop of development,
00:01:35.000 | the write code, run the code, write code, run the code.
00:01:37.000 | It's not very good at those kinds of big-picture tasks,
00:01:40.000 | the ones where you have to empathize with the end user
00:01:44.000 | and take business-level objectives into account,
00:01:46.000 | and that's where we come in as software engineers.
00:01:49.000 | So let's talk a little bit about what a coding agent actually is.
00:01:55.000 | I think this word "agent" gets thrown around a lot these days.
00:01:58.000 | The meaning has started to drift over time,
00:02:01.000 | but at the core of it is this concept of agency.
00:02:04.000 | It's this idea of taking action out in the real world.
00:02:08.000 | And these are the main tools of a software engineer's job, right?
00:02:12.000 | We have a code editor to modify
00:02:15.000 | and navigate our code base.
00:02:16.000 | We have a terminal to actually run the code
00:02:19.000 | that we're writing.
00:02:21.000 | And we need a web browser to look up documentation
00:02:24.000 | and maybe copy and paste some code from Stack Overflow.
00:02:27.000 | So these are kind of the core tools of the job,
00:02:29.000 | and these are the tools that we give to our agents
00:02:31.000 | to let them do their whole development loop.
00:02:35.000 | I also want to contrast coding agents
00:02:38.000 | from some more tactical code gen tools that are out there.
00:02:41.000 | You know, we kind of started a couple years ago
00:02:43.000 | with things like GitHub Copilot's autocomplete feature
00:02:46.000 | where, you know, it's literally wherever your cursor is pointed
00:02:48.000 | in the code base right now,
00:02:50.000 | it's just filling out two or three more lines of code.
00:02:53.000 | And then over time, things have gotten more and more agentic,
00:02:56.000 | more and more asynchronous, right?
00:02:58.000 | So we've got like AI-powered IDEs that can maybe take a few steps
00:03:01.000 | at a time without a developer intervening.
00:03:04.000 | And now you've got these tools like Devin and OpenHands
00:03:07.000 | where you're really giving an agent, you know, one or two sentences
00:03:10.000 | describing what you want it to do.
00:03:12.000 | It goes off and works for five, ten, fifteen minutes on its own
00:03:15.000 | and then comes back to you with a solution.
00:03:17.000 | This is a much more powerful way of working.
00:03:19.000 | You can get a lot done.
00:03:20.000 | You can send off multiple agents at once.
00:03:23.000 | You know, you can focus on communicating with your coworkers,
00:03:27.000 | or goofing off on Reddit while these agents are working for you.
00:03:30.000 | It's a very different way of working,
00:03:33.000 | but a much more powerful one.
00:03:35.000 | So I want to talk a little bit about how these agents work under the hood.
00:03:40.000 | I feel like once you understand what's happening under the surface,
00:03:44.000 | it really helps you build an intuition for how to use agents effectively.
00:03:49.000 | And at its core, an agent is this loop between a large language model
00:03:54.000 | and the external world.
00:03:56.000 | So the large language model kind of serves as the brain,
00:03:59.000 | and then we have to repeatedly take actions in the external world,
00:04:02.000 | get some kind of feedback from the world,
00:04:05.000 | and pass that back into the LLM.
00:04:08.000 | So basically at every step of this loop, we're asking the LLM,
00:04:11.000 | what's the next thing you want to do in order to get one step closer to your goal?
00:04:15.000 | It might say, okay, I want to read this file,
00:04:17.000 | I want to make this edit, I want to run this command,
00:04:19.000 | I want to look at this web page.
00:04:21.000 | We go out and take that action in the real world,
00:04:23.000 | get some kind of output, whether it's the contents of a web page
00:04:26.000 | or the output of a command, and then stick that back into the LLM
00:04:29.000 | for the next turn of the loop.
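That loop is simple enough to sketch in code. Below is a minimal illustration in Python, not OpenHands' actual implementation; the `llm.next_action()` and `execute()` helpers are hypothetical stand-ins for the model call and the tool dispatcher.

```python
# A minimal agent loop. `llm` and `execute` are hypothetical stand-ins,
# not OpenHands' real interfaces.

def agent_loop(llm, goal: str, max_steps: int = 50) -> str:
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        # Ask the LLM: what's the next action that gets one step closer?
        action = llm.next_action(history)  # e.g. read a file, edit, run a command
        if action.type == "finish":
            return action.summary
        # Take the action in the real world and capture the output
        # (file contents, command output, the text of a web page, ...).
        observation = execute(action)
        # Feed the action and its result back in for the next turn.
        history.append({"role": "assistant", "content": str(action)})
        history.append({"role": "tool", "content": observation})
    raise RuntimeError("agent hit the step limit without finishing")
```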
00:04:30.000 | Let me talk a little bit about the core tools
00:04:36.000 | that are at the agent's disposal.
00:04:38.000 | The first one, again, is a code editor.
00:04:40.000 | You might think this is really simple.
00:04:42.000 | It actually turns out to be a fairly interesting problem.
00:04:45.000 | The naive solution would be to just give the old file to the LLM
00:04:49.000 | and then have it output the entire new file.
00:04:51.000 | That's not a very efficient way to work, though.
00:04:53.000 | If you've got a thousand lines of code
00:04:57.000 | and you want to just change one line,
00:04:59.000 | you're going to waste a lot of tokens
00:05:01.000 | printing out all the lines that are staying the same.
00:05:03.000 | So most contemporary agents use a find-and-replace-style editor
00:05:09.000 | or a diff-based editor to allow the LLM
00:05:11.000 | to just make tactical edits inside the file.
00:05:15.000 | A lot of times they'll also provide an abstract syntax tree
00:05:19.000 | or some kind of way to allow the agent
00:05:21.000 | to navigate the code base more effectively.
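The find-and-replace idea is easy to sketch. Here's an illustrative single-file version in Python; real agent editors, OpenHands' included, are more careful about validation and also expose viewing and creation commands.

```python
def str_replace(path: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` in the file with `new`."""
    with open(path) as f:
        content = f.read()
    count = content.count(old)
    if count == 0:
        return f"Error: text to replace not found in {path}"
    if count > 1:
        return f"Error: text occurs {count} times; include more context to make it unique"
    with open(path, "w") as f:
        f.write(content.replace(old, new))
    return f"Edited {path}"
```

The point is token economy: the LLM emits only the changed snippet plus enough surrounding context to locate it, instead of reprinting a thousand unchanged lines.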
00:05:25.000 | Next up is the terminal.
00:05:26.000 | Again, you would think text in, text out should be pretty simple,
00:05:29.000 | but there are a lot of questions that pop up here.
00:05:31.000 | What do you do when there's a long-running command
00:05:33.000 | that has no standard out for a long time?
00:05:35.000 | Do you kill it?
00:05:36.000 | Do you let the LLM wait?
00:05:37.000 | What happens if you want to run multiple commands in parallel,
00:05:40.000 | run commands in the background?
00:05:41.000 | Maybe you want to start a server
00:05:42.000 | and then run curl against that server.
00:05:44.000 | Lots of really interesting problems that crop up
00:05:47.000 | when you have an agent interacting with the terminal.
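One common answer to the long-running-command problem is a timeout that returns control to the loop without killing the process. A sketch of that pattern, not OpenHands' actual terminal backend:

```python
import subprocess

def run_command(cmd: str, timeout_s: float = 30.0) -> str:
    proc = subprocess.Popen(
        cmd, shell=True,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return out
    except subprocess.TimeoutExpired:
        # No exit within the window. The process keeps running; we just
        # tell the LLM, and it can decide to wait longer, background the
        # command, or kill it on the next turn.
        return f"[command still running after {timeout_s}s with no exit]"
```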
00:05:51.000 | And then probably the most complicated tool is the web browser.
00:05:54.000 | Again, there's a naive solution here where the agent just gives you a URL
00:05:58.000 | and you give it a bunch of HTML.
00:06:00.000 | That's very expensive because there's a bunch of cruft inside that HTML
00:06:04.000 | that the LLM doesn't really need to see.
00:06:06.000 | We've had a lot of luck passing it accessibility trees
00:06:09.000 | or converting to markdown and passing that to the LLM,
00:06:12.000 | or allowing the LLM to maybe scroll through the web page
00:06:15.000 | if there's a ton of content there.
00:06:17.000 | And then also, if you start to add interaction,
00:06:19.000 | things get even more complicated.
00:06:21.000 | You can let the LLM write JavaScript against the page,
00:06:24.000 | or we've actually had a lot of luck basically giving it a screenshot
00:06:27.000 | of the page with labeled nodes,
00:06:29.000 | and it can say what it wants to click on.
00:06:31.000 | This is an area of active research.
00:06:33.000 | We just had a contribution about a month ago
00:06:36.000 | that doubled our accuracy on web browsing.
00:06:38.000 | I would say this is definitely a space to watch.
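The "strip the cruft" idea can be illustrated with nothing but the standard library: drop scripts and styles and keep only readable text. Real implementations use accessibility trees or proper HTML-to-markdown converters, so treat this as the flavor rather than the practice.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style content."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def page_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```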
00:06:41.000 | And then I also want to talk about sandboxing.
00:06:45.000 | This is a really important thing for agents
00:06:47.000 | because if they're going to run autonomously for several minutes
00:06:50.000 | on their own without you watching everything they're doing,
00:06:53.000 | you want to make sure that they're not doing anything dangerous.
00:06:56.000 | And so all of our agents run inside of a Docker container by default.
00:07:01.000 | They're totally separated out from your workstation,
00:07:03.000 | so there's no chance of it running rm -rf on your home directory.
00:07:07.000 | Increasingly, though, we're giving agents access to third-party APIs, right?
00:07:12.000 | So you might give it access to a GitHub token or access to your AWS account.
00:07:16.000 | Super, super important to make sure that those credentials are tightly scoped
00:07:19.000 | and that you're following the principle of least privilege
00:07:22.000 | as you're granting agents access to do these things.
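As a sketch of the sandboxing idea, here's a command runner that shells out to the `docker` CLI with only the workspace mounted and networking disabled. The image name and limits are arbitrary choices for illustration; OpenHands' actual runtime is considerably more involved.

```python
import subprocess

def run_in_sandbox(cmd: str, workspace: str) -> str:
    """Run a shell command inside a throwaway container."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",              # no surprise network access
            "-v", f"{workspace}:/workspace",  # only the workspace is visible
            "-w", "/workspace",
            "python:3.12-slim",               # arbitrary base image
            "bash", "-lc", cmd,
        ],
        capture_output=True, text=True, timeout=120,
    )
    return result.stdout + result.stderr
```

The same least-privilege thinking applies to credentials: a fine-grained GitHub token scoped to a single repository beats a personal token that can touch everything you own.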
00:07:27.000 | All right, I want to move into some best practices.
00:07:31.000 | My biggest advice for folks who are just getting started is to start small.
00:07:35.000 | The best tasks are things that can be completed pretty quickly,
00:07:39.000 | you know, a single commit where there's a clear definition of done.
00:07:42.000 | You know, you want the agent to be able to verify,
00:07:44.000 | okay, the tests are passing.
00:07:45.000 | I must have done it correctly.
00:07:47.000 | Or, you know, the merge conflicts have been solved, et cetera.
00:07:50.000 | And tasks that are easy for you as an engineer to verify
00:07:53.000 | were done completely and correctly.
00:07:55.000 | I like to tell people to start with small chores.
00:07:58.000 | Very frequently you might have a pull request
00:08:00.000 | where there's, you know, one test that's failing
00:08:02.000 | or there's some lint errors or there's merge conflicts.
00:08:04.000 | Bits of toil that you don't really like doing as a developer.
00:08:07.000 | Those are great tasks to just shove off to the AI.
00:08:09.000 | They tend to be very rote.
00:08:12.000 | The AI does them very well.
00:08:14.000 | But as your intuition grows here,
00:08:16.000 | as you get used to working with an agent,
00:08:18.000 | you'll find that you can give it bigger and bigger tasks.
00:08:20.000 | You'll understand how to communicate with the agent effectively.
00:08:23.000 | And I would say for me, my co-founders,
00:08:26.000 | and our biggest power users,
00:08:28.000 | about 90% of our code now goes through the agent,
00:08:31.000 | and it's only maybe 10% of the time
00:08:33.000 | that we have to drop back into the IDE
00:08:35.000 | and get our hands dirty in the code base again.
00:08:38.000 | Being very clear with the agent about what you want
00:08:42.000 | is super important.
00:08:43.000 | I specifically like to say,
00:08:45.000 | you need to tell it not just what you want,
00:08:47.000 | but also how you want it done.
00:08:48.000 | You know, mention specific frameworks that you want it to use.
00:08:51.000 | If you want it to do like a test-driven development strategy,
00:08:54.000 | tell it that.
00:08:55.000 | Mention any specific files or function names it should target.
00:08:59.000 | This not only helps it be more accurate
00:09:04.000 | about what exactly you want the output to be,
00:09:06.000 | it also makes it go faster, right?
00:09:08.000 | It doesn't have to spend as long exploring the code base
00:09:10.000 | if you tell it, I want you to edit this exact file.
00:09:13.000 | This can save you a bunch of time and energy,
00:09:16.000 | and it can save a lot of tokens,
00:09:19.000 | a lot of actual inference cost.
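As an example of the level of specificity this advice is pointing at, a prompt might look like the following (the file names, function names, and test setup here are hypothetical):

```
Add retry logic to the fetch_user function in src/api/users.ts.
Use exponential backoff with a maximum of 3 attempts.
Work test-driven: first add failing cases to src/api/users.test.ts
using our existing test setup, then make them pass.
Don't modify any other files.
```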
00:09:22.000 | I also like to remind folks that in an AI-driven development world,
00:09:26.000 | code is cheap.
00:09:27.000 | You can throw code away.
00:09:28.000 | You can experiment and prototype.
00:09:30.000 | I love that if I have an idea on my walk to work,
00:09:33.000 | I'll just tell OpenHands with my voice,
00:09:37.000 | do X, Y, and Z, and then when I get to work,
00:09:40.000 | I'll have a PR waiting for me.
00:09:42.000 | 50% of the time, I'll just throw it away.
00:09:43.000 | It didn't really work.
00:09:44.000 | 50% of the time, it looks great, and I just merge it,
00:09:46.000 | and it's awesome.
00:09:48.000 | It's really fun to be able to just rapidly prototype
00:09:51.000 | using AI-driven development.
00:09:53.000 | And I would also say, you know, if you try to work with the agent
00:09:58.000 | on a particular task and it gets it wrong,
00:10:00.000 | maybe it's close and you can just keep iterating
00:10:02.000 | within the same conversation
00:10:03.000 | and it's already built up some context.
00:10:05.000 | If it's way off, though, just throw away that work.
00:10:07.000 | Start fresh with a new prompt based on what you learned
00:10:10.000 | from the last one.
00:10:11.000 | It's a new sort of muscle memory, I think,
00:10:14.000 | that you have to develop,
00:10:16.000 | to just throw things away.
00:10:18.000 | Sometimes it's hard to throw away tens of thousands of lines of code
00:10:23.000 | that have been generated, because you're used to
00:10:25.000 | that much code being very expensive to produce.
00:10:28.000 | These days it's very easy to kind of just start from scratch again.
00:10:34.000 | This is probably the most important bit of advice I can give folks.
00:10:37.000 | You need to review the code that the AI writes.
00:10:40.000 | I've seen more than one organization run into trouble thinking
00:10:43.000 | that they could just vibe code their way to a production application
00:10:46.000 | and just automatically merging everything that came out of the AI.
00:10:50.000 | But if you just don't review anything, you'll find that your code base
00:10:55.000 | just grows and grows with tech debt.
00:10:57.000 | You'll find duplicate code everywhere.
00:10:59.000 | Things get out of hand very quickly.
00:11:01.000 | So make sure you're reviewing the code that it outputs
00:11:03.000 | and make sure you're pulling the code and running it on your workstation
00:11:06.000 | or running it inside of an ephemeral environment
00:11:09.000 | just to make sure that the agent has actually solved the problem
00:11:11.000 | that you asked it to solve.
00:11:13.000 | And I like to say trust but verify.
00:11:17.000 | As you work with agents over time, you'll build an intuition
00:11:20.000 | for what they do well and what they don't do well.
00:11:22.000 | You can generally trust them to operate the same way today
00:11:26.000 | that they did yesterday.
00:11:28.000 | But you really do need a human in the loop.
00:11:31.000 | One of our big learnings with OpenHands: in the early days,
00:11:35.000 | if you opened up a pull request with OpenHands,
00:11:39.000 | that pull request would show up as owned by OpenHands.
00:11:42.000 | It would be the little hands logo next to the pull request.
00:11:45.000 | And that caused two problems.
00:11:47.000 | One, it meant that the human who had triggered that pull request
00:11:50.000 | could then approve it and basically bypass our whole code review system.
00:11:53.000 | You didn't need a second human in the loop before merging.
00:11:56.000 | And two, oftentimes those pull requests would just languish.
00:12:00.000 | Nobody would really take ownership of them.
00:12:02.000 | If there was a failing unit test,
00:12:04.000 | nobody was jumping in to make sure the test passed.
00:12:07.000 | And they would just sit there and not get merged.
00:12:10.000 | Or if they did get merged and something went wrong,
00:12:12.000 | the code didn't actually work,
00:12:13.000 | we didn't really know who to go to and be like,
00:12:15.000 | you know, who caused this?
00:12:16.000 | There was nobody we could hold accountable for that breakage.
00:12:19.000 | And so now if you open up a pull request with OpenHands,
00:12:21.000 | your face is on that pull request.
00:12:23.000 | You're responsible for getting it merged.
00:12:25.000 | You're responsible for any breakage it might cause down the line.
00:12:30.000 | Cool.
00:12:31.000 | And then I do want to just close by going through a handful of use cases.
00:12:34.000 | This is always kind of a tricky topic
00:12:36.000 | because agents are great generalists.
00:12:38.000 | They can hypothetically do anything
00:12:40.000 | as long as you break things down
00:12:42.000 | into bite-sized steps that they can take on.
00:12:45.000 | But in the spirit of starting small,
00:12:48.000 | I think there are a bunch of use cases
00:12:50.000 | that are really great day-one use cases for agents.
00:12:53.000 | My favorite is resolving merge conflicts.
00:12:55.000 | This is the biggest chore in my job.
00:12:58.000 | OpenHands itself is a very fast-moving code base.
00:13:01.000 | I can say there's probably no PR that I make
00:13:03.000 | where I get away with zero merge conflicts.
00:13:06.000 | And I love just being able to jump in and say,
00:13:08.000 | @OpenHands, fix the merge conflicts on this PR.
00:13:10.000 | It comes in, and it's such a rote task.
00:13:13.000 | It's usually very obvious.
00:13:14.000 | What changed before?
00:13:15.000 | What changed in this PR?
00:13:16.000 | What's the intention behind those changes?
00:13:18.000 | And OpenHands knocks this out 99% of the time.
00:13:23.000 | Addressing PR feedback is also a favorite.
00:13:25.000 | This one's great because somebody else has already taken the time
00:13:28.000 | to clearly articulate what they want changed,
00:13:31.000 | and all you have to do is say, @OpenHands,
00:13:33.000 | do what that guy said.
00:13:34.000 | And again, like you can see in this example,
00:13:37.000 | OpenHands did exactly what this person wanted.
00:13:39.000 | I don't know React super well.
00:13:41.000 | And our front-end engineer said, do X, Y, and Z,
00:13:44.000 | and mentioned a whole bunch of buzzwords that I don't know.
00:13:46.000 | OpenHands knew all of it and was able to address his feedback
00:13:49.000 | exactly how he wanted.
00:13:51.000 | Fixing quick little bugs.
00:13:55.000 | You can see in this example, we had an input that was a text input,
00:13:59.000 | should have been a number input.
00:14:00.000 | If I wasn't lazy, I could have dug through my code base,
00:14:03.000 | found the right file.
00:14:04.000 | But it was really easy for me to just quickly --
00:14:07.000 | I think I did this one directly from inside Slack --
00:14:09.000 | just say, @OpenHands, fix this thing we were just talking about.
00:14:12.000 | And it's just really -- I don't even have to fire up my IDE.
00:14:17.000 | It's a really, really fun way to work.
00:14:20.000 | Infrastructure changes I really like.
00:14:23.000 | Usually these involve looking up some really esoteric syntax
00:14:27.000 | inside the Terraform docs or something like that.
00:14:30.000 | OpenHands and the underlying LLMs tend to just know
00:14:34.000 | the right Terraform syntax.
00:14:36.000 | And if not, they can look up the documentation using the browser.
00:14:39.000 | So this stuff is really great.
00:14:41.000 | Sometimes we'll just get an out-of-memory exception in Slack
00:14:43.000 | and immediately say, okay, OpenHands, increase the memory.
00:14:48.000 | Database migrations are another great one.
00:14:51.000 | This is one where I find I often leave best practices behind.
00:14:54.000 | I won't put indexes on the right things.
00:14:56.000 | I won't set up foreign keys the right way.
00:14:58.000 | The LLM tends to be really great about following all best practices
00:15:01.000 | around database migrations.
00:15:03.000 | Again, it's kind of like a rote task for developers.
00:15:05.000 | It's not very fun.
00:15:07.000 | The LLM is great at it.
00:15:11.000 | Fixing failing tests on a PR.
00:15:13.000 | If you've already got the code 90% of the way there,
00:15:16.000 | there's just a unit test failing because there was a breaking API change.
00:15:19.000 | Very easy to call in an agent to just clean up the failing tests.
00:15:23.000 | Expanding test coverage is another one I love because it's a very safe task.
00:15:29.000 | As long as the tests are passing, it's generally safe to just merge that.
00:15:33.000 | If you notice a spot in your code base where you're like,
00:15:35.000 | hey, we have really low coverage here,
00:15:37.000 | just ask your agent to expand your test coverage in that area of the code base.
00:15:42.000 | It's a great quick win to make your code base a little bit safer.
00:15:47.000 | Then everybody's favorite, building apps from scratch.
00:15:50.000 | You know, I would say if you're shipping production code,
00:15:53.000 | again, don't just vibe code your way to a production application.
00:15:56.000 | But we're finding increasingly internally at our company,
00:15:59.000 | a lot of times there's a little internal app we want to build.
00:16:02.000 | For instance, we built a way to debug OpenHands trajectories,
00:16:06.000 | to debug OpenHands sessions.
00:16:08.000 | We built a whole web application, and since it's just an internal application,
00:16:12.000 | we can vibe code it a little bit.
00:16:14.000 | We don't really need to review every line of code.
00:16:16.000 | It's not really facing end users.
00:16:18.000 | This has been a really, really fun thing for our business
00:16:20.000 | to just be able to churn out these really quick applications
00:16:23.000 | just to serve our own internal needs.
00:16:25.000 | So yeah, greenfield development is a great use case for agents.
00:16:29.000 | That's all I've got.
00:16:31.000 | We'd love to have you all join the OpenHands community.
00:16:33.000 | You can find us on GitHub at All-Hands-AI/OpenHands.
00:16:36.000 | Join us on Slack, Discord.
00:16:38.000 | We'd love to build with you.