Today I'm going to talk a little bit about coding agents and how to use them effectively. If you're anything like me, you've found a lot of things that work really well and a lot of things that don't work very well. A little bit about me: my name is Robert Brennan.
I've been building open source development tools for over a decade now. My team and I have created an open source software development agent called OpenHands, formerly known as OpenDevin. To state the obvious, in 2025, software development is changing. Our jobs are very different now than they were two years ago, and they're going to be very different two years from now.
The thing I want to convince you of is that coding is going away. We're going to be spending a lot less time actually writing code, but that doesn't mean that software engineering is going away. We're paid not to type on our keyboard, but to actually think critically about the problems that are in front of us.
If we do AI-driven development correctly, it will mean we spend less time actually leaning forward and squinting into our IDE and more time kind of sitting back in our chair and thinking, what does the user actually want here? What are we actually trying to build? What problems are we trying to solve as an organization?
How can we architect this in a way that sets us up for the future? The AI is very good at that inner loop of development: write code, run the code, write code, run the code. It's not very good at the big-picture tasks, where you have to empathize with the end user and take business-level objectives into account, and that's where we come in as software engineers.
So let's talk a little about what actually a coding agent is. I think this word "agent" gets thrown around a lot these days. The meaning has started to drift over time, but at the core of it is this concept of agency. It's this idea of taking action out in the real world.
And these are the main tools of a software engineer's job, right? We have a code editor to actually modify our code base, navigate our code base. You have a terminal to help you actually run the code that you're writing. And you need a web browser in order to look up documentation and maybe copy and paste some code from Stack Overflow.
So these are kind of the core tools of the job, and these are the tools that we give to our agents to let them do their whole development loop. I also want to contrast coding agents with some more tactical code-gen tools that are out there. You know, we kind of started a couple years ago with things like GitHub Copilot's autocomplete feature, where it's literally just filling out two or three more lines of code wherever your cursor is pointed in the code base right now.
And then over time, things have gotten more and more agentic, more and more asynchronous, right? So we've got AI-powered IDEs that can maybe take a few steps at a time without a developer intervening. And now you've got these tools like Devin and OpenHands where you're really giving an agent, you know, one or two sentences describing what you want it to do.
It goes off and works for five, ten, fifteen minutes on its own and then comes back to you with a solution. This is a much more powerful way of working. You can get a lot done. You can send off multiple agents at once. You know, you can focus on communicating with your coworkers, or goofing off on Reddit while these agents are working for you.
And it's a very different way of working, but it's a much more powerful way of working. So I want to talk a little bit about how these agents work under the hood. I feel like once you understand what's happening under the surface, it really helps you build an intuition for how to use agents effectively.
And at its core, an agent is this loop between a large language model and the external world. So the large language model kind of serves as the brain, and then we have to repeatedly take actions in the external world, get some kind of feedback from the world, and pass that back into the LLM.
So basically at every step of this loop, we're asking the LLM, what's the next thing you want to do in order to get one step closer to your goal? It might say, okay, I want to read this file, I want to make this edit, I want to run this command, I want to look at this web page.
We go out and take that action in the real world, get some kind of output, whether it's the contents of a web page or the output of a command, and then stick that back into the LLM for the next turn of the loop.
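Just to make that concrete, here's a rough sketch of what that loop might look like in Python. The `run_agent` function, the tool names, and the `llm` interface are made up for illustration; this isn't our actual implementation.

```python
# Minimal sketch of an agent loop: ask the LLM for the next action, take it
# in the real world, and feed the observation back in. All names are illustrative.
from typing import Callable

def run_agent(llm: Callable[[list], dict],
              tools: dict[str, Callable[[dict], str]],
              goal: str,
              max_steps: int = 50) -> str:
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        # Ask the model: what's the next thing you want to do?
        action = llm(history)  # e.g. {"tool": "run_command", "args": {"cmd": "pytest"}}
        if action["tool"] == "finish":
            return action["args"].get("summary", "")
        # Take the action (edit a file, run a command, fetch a page)
        observation = tools[action["tool"]](action["args"])
        # Feed the observation back in for the next turn of the loop
        history.append({"role": "assistant", "content": str(action)})
        history.append({"role": "tool", "content": observation})
    return "step limit reached"
```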
Now let me talk a little bit about the core tools that are at the agent's disposal. The first one, again, is a code editor. You might think this is really simple. It actually turns out to be a fairly interesting problem. The naive solution would be to just give the old file to the LLM and then have it output the entire new file. It's not a very efficient way to work, though.
If you've got a thousand lines of code and you want to change just one line, you're going to waste a lot of tokens printing out all the lines that are staying the same. So most contemporary agents use a find-and-replace-style editor or a diff-based editor to allow the LLM to just make tactical edits inside the file.
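At its simplest, such an edit tool might look like the sketch below. This is just one reasonable design, not any particular agent's editor; the function name and error handling are my own.

```python
# Sketch of a find-and-replace edit tool: the LLM supplies a unique snippet of
# the existing file plus its replacement, so only the changed lines are generated.
from pathlib import Path

def str_replace_edit(path: str, old_snippet: str, new_snippet: str) -> str:
    text = Path(path).read_text()
    count = text.count(old_snippet)
    if count == 0:
        return f"Error: snippet not found in {path}"
    if count > 1:
        return f"Error: snippet appears {count} times in {path}; make it unique"
    Path(path).write_text(text.replace(old_snippet, new_snippet, 1))
    return f"Edited {path}"
```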
A lot of times they'll also provide an abstract syntax tree or some kind of way to allow the agent to navigate the code base more effectively. Next up is the terminal. Again, you would think text in, text out should be pretty simple, but there are a lot of questions that pop up here.
What do you do when there's a long-running command that produces no standard output for a long time? Do you kill it? Do you let the LLM wait? What happens if you want to run multiple commands in parallel, or run commands in the background? Maybe you want to start a server and then run curl against that server.
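One common way to handle this, sketched below, is to give every command a timeout, return whatever partial output exists, and let the LLM decide whether to keep waiting or kill the process. The function and limits here are illustrative, not a specific agent's implementation.

```python
# Sketch of a terminal tool with timeouts and background execution.
import subprocess

def run_command(cmd: str, timeout_seconds: int = 30, background: bool = False) -> str:
    if background:
        # e.g. starting a dev server the agent will curl against later
        proc = subprocess.Popen(cmd, shell=True,
                                stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return f"Started background process with pid {proc.pid}"
    try:
        result = subprocess.run(cmd, shell=True, capture_output=True,
                                text=True, timeout=timeout_seconds)
        return (result.stdout + result.stderr)[-10_000:]  # truncate huge output
    except subprocess.TimeoutExpired as exc:
        partial = exc.stdout or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return f"Command still running after {timeout_seconds}s. Output so far:\n{partial}"
```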
Lots of really interesting problems crop up when you have an agent interacting with the terminal. And then probably the most complicated tool is the web browser. Again, there's a naive solution here where the agent gives you a URL and you give it a bunch of HTML. That's very expensive, because there's a bunch of cruft inside that HTML that the LLM doesn't really need to see.
We've had a lot of luck passing it accessibility trees or converting to markdown and passing that to the LLM, or allowing the LLM to maybe scroll through the web page if there's a ton of content there. And then also, if you start to add interaction, things get even more complicated.
You can let the LLM write JavaScript against the page, or we've actually had a lot of luck basically giving it a screenshot of the page with labeled nodes, and it can say what it wants to click on. This is an area of active research. We just had a contribution about a month ago that doubled our accuracy on web browsing.
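Just to show what the simpler text-based approach looks like in practice, here's a rough sketch using requests and BeautifulSoup. This is one of the strategies mentioned above, not our actual browsing pipeline.

```python
# Sketch: fetch a page, strip the cruft, and hand the LLM plain text instead of raw HTML.
import requests
from bs4 import BeautifulSoup

def fetch_page_for_llm(url: str, max_chars: int = 20_000) -> str:
    html = requests.get(url, timeout=15).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()  # drop markup the LLM doesn't need to see
    text = " ".join(soup.get_text(separator=" ").split())
    # For very long pages, a real agent would paginate or let the LLM "scroll"
    return text[:max_chars]
```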
I would say this is definitely a space to watch. And then I also want to talk about sandboxing. This is a really important thing for agents because if they're going to run autonomously for several minutes on their own without you watching everything they're doing, you want to make sure that they're not doing anything dangerous.
And so all of our agents run inside of a Docker container by default. They're totally separated out from your workstation, so there's no chance of one running rm -rf on your home directory. Increasingly, though, we're giving agents access to third-party APIs, right? So you might give it access to a GitHub token or access to your AWS account.
Super, super important to make sure that those credentials are tightly scoped and that you're following the principle of least privilege as you're granting agents access to do these things.
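To make the sandboxing and least-privilege idea concrete, here's a rough sketch of launching an agent runtime in a locked-down container. The image name, mount paths, and limits are placeholders, not OpenHands' actual configuration.

```python
# Sketch: run the agent's runtime in a container that only sees one workspace
# directory, with capabilities dropped and resources capped.
import subprocess

def start_sandbox(workspace: str, image: str = "agent-runtime:latest") -> None:
    subprocess.run([
        "docker", "run", "--rm", "--detach",
        "--memory", "2g", "--cpus", "2",        # cap resources
        "--cap-drop", "ALL",                    # drop Linux capabilities
        "--mount", f"type=bind,src={workspace},dst=/workspace",  # only this dir is visible
        image,
    ], check=True)

# Credentials follow the same principle: pass in a token scoped to one repo
# with minimal permissions, never a broad personal access token.
```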
All right, I want to move into some best practices. My biggest advice for folks who are just getting started is to start small. The best tasks are things that can be completed pretty quickly, you know, a single commit where there's a clear definition of done. You want the agent to be able to verify: okay, the tests are passing, I must have done it correctly. Or, you know, the merge conflicts have been resolved, et cetera.
And tasks that are easy for you as an engineer to verify were done completely and correctly. I like to tell people to start with small chores. Very frequently you might have a pull request where there's, you know, one test that's failing or there's some lint errors or there's merge conflicts.
Bits of toil that you don't really like doing as a developer. Those are great tasks to just shove off to the AI. They tend to be very rote. The AI does them very well. But as your intuition grows here, as you get used to working with an agent, you'll find that you can give it bigger and bigger tasks.
You'll understand how to communicate with the agent effectively. And I would say for me, my co-founders, and our biggest power users, something like 90% of my code now goes through the agent, and it's only maybe 10% of the time that I have to drop back into my IDE and kind of get my hands dirty in the code base again.
Being very clear with the agent about what you want is super important. I specifically like to say, you know, you need to tell it not just what you want, but you need to tell it how you want it to do it. You know, mention specific frameworks that you want it to use.
If you want it to follow a test-driven development strategy, tell it that. Mention any specific files or function names it should work with. This not only helps it be more accurate and clearer about what exactly you want the output to be, it also makes it go faster, right?
It doesn't have to spend as long exploring the code base if you tell it, I want you to edit this exact file. This can save you a bunch of time and energy, and it can save a lot of tokens, a lot of actual inference cost. I also like to remind folks that in an AI-driven development world, code is cheap.
You can throw code away. You can experiment and prototype. I love that if I have an idea on my walk to work, I can just tell OpenHands with my voice, do X, Y, and Z, and then when I get to work, I'll have a PR waiting for me.
50% of the time, I'll just throw it away. It didn't really work. 50% of the time, it looks great, and I just merge it, and it's awesome. It's really fun to be able to just rapidly prototype using AI-driven development. And I would also say, you know, if you try to work with the agent on a particular task and it gets it wrong, maybe it's close and you can just keep iterating within the same conversation and it's already built up some context.
If it's way off, though, just throw away that work. Start fresh with a new prompt based on what you learned from the last one. It's really a new sort of muscle memory you have to develop, being willing to just throw things away. Sometimes it's hard to throw away tens of thousands of lines of generated code, because you're used to that being a very expensive pile of code.
These days it's very easy to just start from scratch again. This is probably the most important bit of advice I can give folks: you need to review the code that the AI writes. I've seen more than one organization run into trouble by thinking they could just vibe code their way to a production application, automatically merging everything that came out of the AI.
But if you don't review anything, you'll find that your code base grows and grows with tech debt. You'll find duplicate code everywhere. Things get out of hand very quickly. So make sure you're reviewing the code it outputs, and make sure you're pulling the code and running it on your workstation, or running it inside an ephemeral environment, just to confirm the agent has actually solved the problem you asked it to solve.
And I like to say trust but verify. As you work with agents over time, you'll build an intuition for what they do well and what they don't do well. You can generally trust them to operate the same way today that they did yesterday. But you really do need a human in the loop.
One of our big learnings with OpenHands: in the early days, if you opened up a pull request with OpenHands, that pull request would show up as owned by OpenHands. There would be the little hands logo next to the pull request. And that caused two problems. One, it meant that the human who had triggered that pull request could then approve it and basically bypass our whole code review system.
You didn't need a second human in the loop before merging. And two, oftentimes those pull requests would just languish. Nobody would really take ownership of them. If there was a failing unit test, nobody was jumping in to make sure it passed. They would just kind of sit there and not get merged.
Or if they did get merged and something went wrong, the code didn't actually work, we didn't really know who to go to and ask, who caused this? There was nobody we could hold accountable for that breakage. And so now if you open up a pull request with OpenHands, your face is on that pull request.
You're responsible for getting it merged. You're responsible for any breakage it might cause down the line. Cool. And then I do want to just close by going through a handful of use cases. This is always kind of a tricky topic because agents are great generalists. They can hypothetically do anything as long as you kind of like break things down into bite-sized steps that they can take on.
But in the spirit of starting small, I think there are a bunch of use cases that are really great day-one use cases for agents. My favorite is resolving merge conflicts. This is like the biggest chore as a part of my job. OpenHands itself is a very fast-moving code base.
There's probably no PR I make where I get away with zero merge conflicts. And I love just being able to jump in and say, @OpenHands, fix the merge conflicts on this PR. It comes in, and it's such a rote task. It's usually very obvious.
What changed before? What changed in this PR? What's the intention behind those changes? And OpenHands knocks this out 99% of the time. Addressing PR feedback is also a favorite. This one's great because somebody else has already taken the time to clearly articulate what they want changed, and all you have to do is say, @OpenHands, do what that guy said.
And again, as you can see in this example, OpenHands did exactly what this person wanted. I don't know React super well, and our front-end engineer was like, do X, Y, and Z, and he mentioned a whole bunch of buzzwords that I don't know. OpenHands knew all of it and was able to address his feedback exactly how he wanted.
Fixing quick little bugs. You can see in this example, we had an input that was a text input and should have been a number input. If I weren't lazy, I could have dug through my code base and found the right file. But it was really easy to just quickly -- I think I did this one directly from inside Slack -- say, @OpenHands, fix this thing we were just talking about.
And I don't even have to fire up my IDE. It's a really, really fun way to work. Infrastructure changes I really like. Usually these involve looking up some really esoteric syntax in the Terraform docs or something like that. OpenHands and, you know, the underlying LLMs tend to just know the right Terraform syntax.
And if not, they can look up the documentation using the browser. So this stuff is really great. Sometimes we'll just get an out-of-memory exception in Slack and immediately say, okay, OpenHands, increase the memory. Database migrations are another great one. This is one where I find I often leave best practices behind.
I won't put indexes on the right things. I won't set up foreign keys the right way. The LLM tends to be really great about following all best practices around database migrations. Again, it's kind of like a rote task for developers. It's not very fun. The LLM is great at it.
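For a sense of what I mean, here's the kind of migration the LLM will reliably get right: an Alembic sketch that adds a foreign key and an index alongside a new column. The table and column names are made up for illustration.

```python
# Sketch of an Alembic migration with the best practices I tend to skip:
# a foreign key constraint and an index on the new column.
from alembic import op
import sqlalchemy as sa

def upgrade() -> None:
    op.add_column("orders", sa.Column("customer_id", sa.Integer(), nullable=False))
    op.create_foreign_key("fk_orders_customer", "orders", "customers",
                          ["customer_id"], ["id"])
    # Index the new column so lookups by customer stay fast
    op.create_index("ix_orders_customer_id", "orders", ["customer_id"])

def downgrade() -> None:
    op.drop_index("ix_orders_customer_id", table_name="orders")
    op.drop_constraint("fk_orders_customer", "orders", type_="foreignkey")
    op.drop_column("orders", "customer_id")
```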
Fixing failing tests on a PR. If you've already got the code 90% of the way there and there's just a unit test failing because of a breaking API change, it's very easy to call in an agent to clean up the failing tests. Expanding test coverage is another one I love, because it's a very safe task.
As long as the tests are passing, it's generally safe to just merge that. If you notice a spot in your code base where you're like, hey, we have really low coverage here, just ask your agent to expand your test coverage in that area of the code base. It's a great quick win to make your code base a little bit safer.
Then everybody's favorite, building apps from scratch. You know, I would say if you're shipping production code, again, don't just like vibe code your way to a production application. But we're finding increasingly internally at our company, a lot of times there's like a little internal app we want to build.
Like, for instance, we built a way to debug OpenHands trajectories, debug OpenHands sessions. We built a whole web application, and since it's just an internal application, we can vibe code it a little bit. We don't really need to review every line of code. It's not facing end users.
This has been a really, really fun thing for our business, to just be able to churn out these quick applications to serve our own internal needs. So yeah, greenfield is a great, great use case for agents. That's all I've got. We'd love to have you all join the OpenHands community.
You can find us on GitHub at All-Hands-AI/OpenHands. Join us on Slack or Discord. We'd love to build with you.