
Software Development Agents: What Works and What Doesn't - Robert Brennan, AllHands/OpenHands



00:00:03.000 | Today I'm going to talk a little bit about coding agents
00:00:18.000 | and how to use them effectively, really.
00:00:20.000 | If you're anything like me, you've found a lot of things
00:00:23.000 | that work really well and a lot of things
00:00:25.000 | that don't work very well.
00:00:28.000 | A little bit about me.
00:00:30.000 | My name is Robert Brennan.
00:00:31.000 | I've been building open source development tools
00:00:33.000 | for over a decade now.
00:00:35.000 | My team and I have created an open source software development agent
00:00:40.000 | called OpenHands, formerly known as OpenDevin.
00:00:43.000 | To state the obvious, in 2025, software development is changing.
00:00:49.000 | Our jobs are very different now than they were two years ago,
00:00:53.000 | and they're going to be very different two years from now.
00:00:55.000 | The thing I want to convince you of is that coding is going away.
00:00:58.000 | We're going to be spending a lot less time actually writing code,
00:01:01.000 | but that doesn't mean that software engineering is going away.
00:01:04.000 | We're paid not to type on our keyboard,
00:01:07.000 | but to actually think critically about the problems that are in front of us.
00:01:10.000 | If we do AI-driven development correctly, it will mean we spend less time
00:01:16.000 | actually leaning forward and squinting into our IDE
00:01:19.000 | and more time kind of sitting back in our chair and thinking,
00:01:22.000 | what does the user actually want here?
00:01:24.000 | What are we actually trying to build?
00:01:26.000 | What problems are we trying to solve as an organization?
00:01:28.000 | How can we architect this in a way that sets us up for the future?
00:01:31.000 | The AI is very good at that inner loop of development,
00:01:35.000 | the write code, run the code, write code, run the code.
00:01:37.000 | It's not very good at those kinds of big-picture tasks,
00:01:40.000 | the ones where you have to empathize with the end user
00:01:44.000 | and take business-level objectives into account,
00:01:46.000 | and that's where we come in as software engineers.
00:01:49.000 | So let's talk a little bit about what a coding agent actually is.
00:01:55.000 | I think this word "agent" gets thrown around a lot these days.
00:01:58.000 | The meaning has started to drift over time,
00:02:01.000 | but at the core of it is this concept of agency.
00:02:04.000 | It's this idea of taking action out in the real world.
00:02:08.000 | And these are the main tools of a software engineer's job, right?
00:02:12.000 | We have a code editor to modify
00:02:15.000 | and navigate our code base.
00:02:16.000 | We have a terminal to actually run the code
00:02:19.000 | that we're writing.
00:02:21.000 | And we need a web browser to look up documentation
00:02:24.000 | and maybe copy and paste some code from Stack Overflow.
00:02:27.000 | So these are kind of the core tools of the job,
00:02:29.000 | and these are the tools that we give to our agents
00:02:31.000 | to let them do their whole development loop.
00:02:35.000 | I also want to contrast coding agents
00:02:38.000 | from some more tactical code gen tools that are out there.
00:02:41.000 | You know, we kind of started a couple years ago
00:02:43.000 | with things like GitHub Copilot's autocomplete feature
00:02:46.000 | where, you know, it's literally wherever your cursor is pointed
00:02:48.000 | in the code base right now,
00:02:50.000 | it's just filling out two or three more lines of code.
00:02:53.000 | And then over time, things have gotten more and more agentic,
00:02:56.000 | more and more asynchronous, right?
00:02:58.000 | So we've got like AI-powered IDEs that can maybe take a few steps
00:03:01.000 | at a time without a developer intervening.
00:03:04.000 | And now you've got these tools like Devin and OpenHands
00:03:07.000 | where you're really giving an agent, you know, one or two sentences
00:03:10.000 | describing what you want it to do.
00:03:12.000 | It goes off and works for five, ten, fifteen minutes on its own
00:03:15.000 | and then comes back to you with a solution.
00:03:17.000 | This is a much more powerful way of working.
00:03:19.000 | You can get a lot done.
00:03:20.000 | You can send off multiple agents at once.
00:03:23.000 | You know, you can focus on communicating with your coworkers,
00:03:27.000 | or goofing off on Reddit while these agents are working for you.
00:03:30.000 | It's a very different way of working,
00:03:33.000 | but a much more powerful one.
00:03:35.000 | So I want to talk a little bit about how these agents work under the hood.
00:03:40.000 | I feel like once you understand what's happening under the surface,
00:03:44.000 | it really helps you build an intuition for how to use agents effectively.
00:03:49.000 | And at its core, an agent is this loop between a large language model
00:03:54.000 | and the external world.
00:03:56.000 | So the large language model kind of serves as the brain,
00:03:59.000 | and then we have to repeatedly take actions in the external world,
00:04:02.000 | get some kind of feedback from the world,
00:04:05.000 | and pass that back into the LLM.
00:04:08.000 | So basically at every step of this loop, we're asking the LLM,
00:04:11.000 | what's the next thing you want to do in order to get one step closer to your goal?
00:04:15.000 | It might say, okay, I want to read this file,
00:04:17.000 | I want to make this edit, I want to run this command,
00:04:19.000 | I want to look at this web page.
00:04:21.000 | We go out and take that action in the real world,
00:04:23.000 | get some kind of output, whether it's the contents of a web page
00:04:26.000 | or the output of a command, and then stick that back into the LLM
00:04:29.000 | for the next turn of the loop.
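That loop is simple enough to sketch in code. Below is a minimal illustration in Python, not OpenHands' actual implementation; the `llm.next_action()` and `execute()` helpers are hypothetical stand-ins for the model call and the tool dispatcher.

```python
# A minimal agent loop. `llm` and `execute` are hypothetical stand-ins,
# not OpenHands' real interfaces.

def agent_loop(llm, goal: str, max_steps: int = 50) -> str:
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        # Ask the LLM: what's the next action that gets one step closer?
        action = llm.next_action(history)  # e.g. read a file, edit, run a command
        if action.type == "finish":
            return action.summary
        # Take the action in the real world and capture the output
        # (file contents, command output, the text of a web page, ...).
        observation = execute(action)
        # Feed the action and its result back in for the next turn.
        history.append({"role": "assistant", "content": str(action)})
        history.append({"role": "tool", "content": observation})
    raise RuntimeError("agent hit the step limit without finishing")
```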
00:04:30.000 | Let me talk a little bit about the core tools
00:04:36.000 | that are at the agent's disposal.
00:04:38.000 | The first one, again, is a code editor.
00:04:40.000 | You might think this is really simple.
00:04:42.000 | It actually turns out to be a fairly interesting problem.
00:04:45.000 | The naive solution would be to just give the old file to the LLM
00:04:49.000 | and then have it output the entire new file.
00:04:51.000 | That's not a very efficient way to work, though.
00:04:53.000 | If you've got a thousand lines of code
00:04:57.000 | and you want to just change one line,
00:04:59.000 | you're going to waste a lot of tokens
00:05:01.000 | printing out all the lines that are staying the same.
00:05:03.000 | So most contemporary agents use a find-and-replace-style editor
00:05:09.000 | or a diff-based editor to allow the LLM
00:05:11.000 | to just make tactical edits inside the file.
00:05:15.000 | A lot of times they'll also provide an abstract syntax tree
00:05:19.000 | or some kind of way to allow the agent
00:05:21.000 | to navigate the code base more effectively.
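The find-and-replace idea is easy to sketch. Here's an illustrative single-file version in Python; real agent editors, OpenHands' included, are more careful about validation and also expose viewing and creation commands.

```python
def str_replace(path: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` in the file with `new`."""
    with open(path) as f:
        content = f.read()
    count = content.count(old)
    if count == 0:
        return f"Error: text to replace not found in {path}"
    if count > 1:
        return f"Error: text occurs {count} times; include more context to make it unique"
    with open(path, "w") as f:
        f.write(content.replace(old, new))
    return f"Edited {path}"
```

The point is token economy: the LLM emits only the changed snippet plus enough surrounding context to locate it, instead of reprinting a thousand unchanged lines.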
00:05:25.000 | Next up is the terminal.
00:05:26.000 | Again, you would think text in, text out should be pretty simple,
00:05:29.000 | but there are a lot of questions that pop up here.
00:05:31.000 | What do you do when there's a long-running command
00:05:33.000 | that has no standard out for a long time?
00:05:35.000 | Do you kill it?
00:05:36.000 | Do you let the LLM wait?
00:05:37.000 | What happens if you want to run multiple commands in parallel,
00:05:40.000 | run commands in the background?
00:05:41.000 | Maybe you want to start a server
00:05:42.000 | and then run curl against that server.
00:05:44.000 | Lots of really interesting problems that crop up
00:05:47.000 | when you have an agent interacting with the terminal.
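One common answer to the long-running-command problem is a timeout that returns control to the loop without killing the process. A sketch of that pattern, not OpenHands' actual terminal backend:

```python
import subprocess

def run_command(cmd: str, timeout_s: float = 30.0) -> str:
    proc = subprocess.Popen(
        cmd, shell=True,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return out
    except subprocess.TimeoutExpired:
        # No exit within the window. The process keeps running; we just
        # tell the LLM, and it can decide to wait longer, background the
        # command, or kill it on the next turn.
        return f"[command still running after {timeout_s}s with no exit]"
```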
00:05:51.000 | And then probably the most complicated tool is the web browser.
00:05:54.000 | Again, there's a naive solution here where the agent just gives you a URL
00:05:58.000 | and you give it a bunch of HTML.
00:06:00.000 | That's very expensive because there's a bunch of cruft inside that HTML
00:06:04.000 | that the LLM doesn't really need to see.
00:06:06.000 | We've had a lot of luck passing it accessibility trees
00:06:09.000 | or converting to markdown and passing that to the LLM,
00:06:12.000 | or allowing the LLM to maybe scroll through the web page
00:06:15.000 | if there's a ton of content there.
00:06:17.000 | And then also, if you start to add interaction,
00:06:19.000 | things get even more complicated.
00:06:21.000 | You can let the LLM write JavaScript against the page,
00:06:24.000 | or we've actually had a lot of luck basically giving it a screenshot
00:06:27.000 | of the page with labeled nodes,
00:06:29.000 | and it can say what it wants to click on.
00:06:31.000 | This is an area of active research.
00:06:33.000 | We just had a contribution about a month ago
00:06:36.000 | that doubled our accuracy on web browsing.
00:06:38.000 | I would say this is definitely a space to watch.
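The "strip the cruft" idea can be illustrated with nothing but the standard library: drop scripts and styles and keep only readable text. Real implementations use accessibility trees or proper HTML-to-markdown converters, so treat this as the flavor rather than the practice.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style content."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def page_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```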
00:06:41.000 | And then I also want to talk about sandboxing.
00:06:45.000 | This is a really important thing for agents
00:06:47.000 | because if they're going to run autonomously for several minutes
00:06:50.000 | on their own without you watching everything they're doing,
00:06:53.000 | you want to make sure that they're not doing anything dangerous.
00:06:56.000 | And so all of our agents run inside of a Docker container by default.
00:07:01.000 | They're totally separated out from your workstation,
00:07:03.000 | so there's no chance of it running rm -rf on your home directory.
00:07:07.000 | Increasingly, though, we're giving agents access to third-party APIs, right?
00:07:12.000 | So you might give it access to a GitHub token or access to your AWS account.
00:07:16.000 | Super, super important to make sure that those credentials are tightly scoped
00:07:19.000 | and that you're following the principle of least privilege
00:07:22.000 | as you're granting agents access to do these things.
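As a sketch of the sandboxing idea, here's a command runner that shells out to the `docker` CLI with only the workspace mounted and networking disabled. The image name and limits are arbitrary choices for illustration; OpenHands' actual runtime is considerably more involved.

```python
import subprocess

def run_in_sandbox(cmd: str, workspace: str) -> str:
    """Run a shell command inside a throwaway container."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",              # no surprise network access
            "-v", f"{workspace}:/workspace",  # only the workspace is visible
            "-w", "/workspace",
            "python:3.12-slim",               # arbitrary base image
            "bash", "-lc", cmd,
        ],
        capture_output=True, text=True, timeout=120,
    )
    return result.stdout + result.stderr
```

The same least-privilege thinking applies to credentials: a fine-grained GitHub token scoped to a single repository beats a personal token that can touch everything you own.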
00:07:27.000 | All right, I want to move into some best practices.
00:07:31.000 | My biggest advice for folks who are just getting started is to start small.
00:07:35.000 | The best tasks are things that can be completed pretty quickly,
00:07:39.000 | you know, a single commit where there's a clear definition of done.
00:07:42.000 | You know, you want the agent to be able to verify,
00:07:44.000 | okay, the tests are passing.
00:07:45.000 | I must have done it correctly.
00:07:47.000 | Or, you know, the merge conflicts have been solved, et cetera.
00:07:50.000 | And tasks that are easy for you as an engineer to verify
00:07:53.000 | were done completely and correctly.
00:07:55.000 | I like to tell people to start with small chores.
00:07:58.000 | Very frequently you might have a pull request
00:08:00.000 | where there's, you know, one test that's failing
00:08:02.000 | or there's some lint errors or there's merge conflicts.
00:08:04.000 | Bits of toil that you don't really like doing as a developer.
00:08:07.000 | Those are great tasks to just shove off to the AI.
00:08:09.000 | They tend to be very rote.
00:08:12.000 | The AI does them very well.
00:08:14.000 | But as your intuition grows here,
00:08:16.000 | as you get used to working with an agent,
00:08:18.000 | you'll find that you can give it bigger and bigger tasks.
00:08:20.000 | You'll understand how to communicate with the agent effectively.
00:08:23.000 | And I would say for me, my co-founders,
00:08:26.000 | and our biggest power users,
00:08:28.000 | about 90% of our code now goes through the agent,
00:08:31.000 | and it's only maybe 10% of the time
00:08:33.000 | that we have to drop back into the IDE
00:08:35.000 | and get our hands dirty in the code base again.
00:08:38.000 | Being very clear with the agent about what you want
00:08:42.000 | is super important.
00:08:43.000 | I specifically like to say,
00:08:45.000 | you need to tell it not just what you want,
00:08:47.000 | but also how you want it done.
00:08:48.000 | You know, mention specific frameworks that you want it to use.
00:08:51.000 | If you want it to do like a test-driven development strategy,
00:08:54.000 | tell it that.
00:08:55.000 | Mention any specific files or function names it should target.
00:08:59.000 | This not only helps it be more accurate
00:09:04.000 | about what exactly you want the output to be,
00:09:06.000 | it also makes it go faster, right?
00:09:08.000 | It doesn't have to spend as long exploring the code base
00:09:10.000 | if you tell it, I want you to edit this exact file.
00:09:13.000 | This can save you a bunch of time and energy,
00:09:16.000 | and it can save a lot of tokens,
00:09:19.000 | a lot of actual inference cost.
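As an example of the level of specificity this advice is pointing at, a prompt might look like the following (the file names, function names, and test setup here are hypothetical):

```
Add retry logic to the fetch_user function in src/api/users.ts.
Use exponential backoff with a maximum of 3 attempts.
Work test-driven: first add failing cases to src/api/users.test.ts
using our existing test setup, then make them pass.
Don't modify any other files.
```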
00:09:22.000 | I also like to remind folks that in an AI-driven development world,
00:09:26.000 | code is cheap.
00:09:27.000 | You can throw code away.
00:09:28.000 | You can experiment and prototype.
00:09:30.000 | I love that if I have an idea on my walk to work,
00:09:33.000 | I'll just tell OpenHands with my voice,
00:09:37.000 | do X, Y, and Z, and then when I get to work,
00:09:40.000 | I'll have a PR waiting for me.
00:09:42.000 | 50% of the time, I'll just throw it away.
00:09:43.000 | It didn't really work.
00:09:44.000 | 50% of the time, it looks great, and I just merge it,
00:09:46.000 | and it's awesome.
00:09:48.000 | It's really fun to be able to just rapidly prototype
00:09:51.000 | using AI-driven development.
00:09:53.000 | And I would also say, you know, if you try to work with the agent
00:09:58.000 | on a particular task and it gets it wrong,
00:10:00.000 | maybe it's close and you can just keep iterating
00:10:02.000 | within the same conversation
00:10:03.000 | and it's already built up some context.
00:10:05.000 | If it's way off, though, just throw away that work.
00:10:07.000 | Start fresh with a new prompt based on what you learned
00:10:10.000 | from the last one.
00:10:11.000 | It's a new sort of muscle memory, I think,
00:10:14.000 | that you have to develop,
00:10:16.000 | to just throw things away.
00:10:18.000 | Sometimes it's hard to throw away tens of thousands of lines of code
00:10:23.000 | that have been generated, because you're used to
00:10:25.000 | that much code being very expensive to produce.
00:10:28.000 | These days it's very easy to kind of just start from scratch again.
00:10:34.000 | This is probably the most important bit of advice I can give folks.
00:10:37.000 | You need to review the code that the AI writes.
00:10:40.000 | I've seen more than one organization run into trouble thinking
00:10:43.000 | that they could just vibe code their way to a production application
00:10:46.000 | and just automatically merging everything that came out of the AI.
00:10:50.000 | But if you just don't review anything, you'll find that your code base
00:10:55.000 | just grows and grows with tech debt.
00:10:57.000 | You'll find duplicate code everywhere.
00:10:59.000 | Things get out of hand very quickly.
00:11:01.000 | So make sure you're reviewing the code that it outputs
00:11:03.000 | and make sure you're pulling the code and running it on your workstation
00:11:06.000 | or running it inside of an ephemeral environment
00:11:09.000 | just to make sure that the agent has actually solved the problem
00:11:11.000 | that you asked it to solve.
00:11:13.000 | And I like to say trust but verify.
00:11:17.000 | As you work with agents over time, you'll build an intuition
00:11:20.000 | for what they do well and what they don't do well.
00:11:22.000 | You can generally trust them to operate the same way today
00:11:26.000 | that they did yesterday.
00:11:28.000 | But you really do need a human in the loop.
00:11:31.000 | One of our big learnings with OpenHands: in the early days,
00:11:35.000 | if you opened up a pull request with OpenHands,
00:11:39.000 | that pull request would show up as owned by OpenHands.
00:11:42.000 | It would be the little hands logo next to the pull request.
00:11:45.000 | And that caused two problems.
00:11:47.000 | One, it meant that the human who had triggered that pull request
00:11:50.000 | could then approve it and basically bypass our whole code review system.
00:11:53.000 | You didn't need a second human in the loop before merging.
00:11:56.000 | And two, oftentimes those pull requests would just languish.
00:12:00.000 | Nobody would really take ownership of them.
00:12:02.000 | If there was a failing unit test,
00:12:04.000 | nobody was jumping in to make sure the test passed.
00:12:07.000 | And they would just sit there and not get merged.
00:12:10.000 | Or if they did get merged and something went wrong,
00:12:12.000 | the code didn't actually work,
00:12:13.000 | we didn't really know who to go to and be like,
00:12:15.000 | you know, who caused this?
00:12:16.000 | There was nobody we could hold accountable for that breakage.
00:12:19.000 | And so now if you open up a pull request with OpenHands,
00:12:21.000 | your face is on that pull request.
00:12:23.000 | You're responsible for getting it merged.
00:12:25.000 | You're responsible for any breakage it might cause down the line.
00:12:30.000 | Cool.
00:12:31.000 | And then I do want to just close by going through a handful of use cases.
00:12:34.000 | This is always kind of a tricky topic
00:12:36.000 | because agents are great generalists.
00:12:38.000 | They can hypothetically do anything
00:12:40.000 | as long as you break things down
00:12:42.000 | into bite-sized steps that they can take on.
00:12:45.000 | But in the spirit of starting small,
00:12:48.000 | I think there are a bunch of use cases
00:12:50.000 | that are really great day-one use cases for agents.
00:12:53.000 | My favorite is resolving merge conflicts.
00:12:55.000 | This is the biggest chore in my job.
00:12:58.000 | OpenHands itself is a very fast-moving code base.
00:13:01.000 | I can say there's probably no PR that I make
00:13:03.000 | where I get away with zero merge conflicts.
00:13:06.000 | And I love just being able to jump in and say,
00:13:08.000 | @OpenHands, fix the merge conflicts on this PR.
00:13:10.000 | It comes in, and it's such a rote task.
00:13:13.000 | It's usually very obvious.
00:13:14.000 | What changed before?
00:13:15.000 | What changed in this PR?
00:13:16.000 | What's the intention behind those changes?
00:13:18.000 | And OpenHands knocks this out 99% of the time.
00:13:23.000 | Addressing PR feedback is also a favorite.
00:13:25.000 | This one's great because somebody else has already taken the time
00:13:28.000 | to clearly articulate what they want changed,
00:13:31.000 | and all you have to do is say, @OpenHands,
00:13:33.000 | do what that guy said.
00:13:34.000 | And again, like you can see in this example,
00:13:37.000 | OpenHands did exactly what this person wanted.
00:13:39.000 | I don't know React super well.
00:13:41.000 | And our front-end engineer said, do X, Y, and Z,
00:13:44.000 | and mentioned a whole bunch of buzzwords that I don't know.
00:13:46.000 | OpenHands knew all of it and was able to address his feedback
00:13:49.000 | exactly how he wanted.
00:13:51.000 | Fixing quick little bugs.
00:13:55.000 | You can see in this example, we had an input that was a text input,
00:13:59.000 | should have been a number input.
00:14:00.000 | If I wasn't lazy, I could have dug through my code base,
00:14:03.000 | found the right file.
00:14:04.000 | But it was really easy for me to just quickly --
00:14:07.000 | I think I did this one directly from inside Slack --
00:14:09.000 | just say, @OpenHands, fix this thing we were just talking about.
00:14:12.000 | And it's just really -- I don't even have to fire up my IDE.
00:14:17.000 | It's a really, really fun way to work.
00:14:20.000 | Infrastructure changes I really like.
00:14:23.000 | Usually these involve looking up some really esoteric syntax
00:14:27.000 | inside the Terraform docs or something like that.
00:14:30.000 | OpenHands and the underlying LLMs tend to just know
00:14:34.000 | the right Terraform syntax.
00:14:36.000 | And if not, they can look up the documentation using the browser.
00:14:39.000 | So this stuff is really great.
00:14:41.000 | Sometimes we'll just get an out-of-memory exception in Slack
00:14:43.000 | and immediately say, okay, OpenHands, increase the memory.
00:14:48.000 | Database migrations are another great one.
00:14:51.000 | This is one where I find I often leave best practices behind.
00:14:54.000 | I won't put indexes on the right things.
00:14:56.000 | I won't set up foreign keys the right way.
00:14:58.000 | The LLM tends to be really great about following all best practices
00:15:01.000 | around database migrations.
00:15:03.000 | Again, it's kind of like a rote task for developers.
00:15:05.000 | It's not very fun.
00:15:07.000 | The LLM is great at it.
00:15:11.000 | Fixing failing tests on a PR.
00:15:13.000 | If you've already got the code 90% of the way there,
00:15:16.000 | there's just a unit test failing because there was a breaking API change.
00:15:19.000 | Very easy to call in an agent to just clean up the failing tests.
00:15:23.000 | Expanding test coverage is another one I love because it's a very safe task.
00:15:29.000 | As long as the tests are passing, it's generally safe to just merge that.
00:15:33.000 | If you notice a spot in your code base where you're like,
00:15:35.000 | hey, we have really low coverage here,
00:15:37.000 | just ask your agent to expand your test coverage in that area of the code base.
00:15:42.000 | It's a great quick win to make your code base a little bit safer.
00:15:47.000 | Then everybody's favorite, building apps from scratch.
00:15:50.000 | You know, I would say if you're shipping production code,
00:15:53.000 | again, don't just vibe code your way to a production application.
00:15:56.000 | But we're finding increasingly internally at our company,
00:15:59.000 | a lot of times there's a little internal app we want to build.
00:16:02.000 | For instance, we built a way to debug OpenHands trajectories,
00:16:06.000 | to debug OpenHands sessions.
00:16:08.000 | We built a whole web application, and since it's just an internal application,
00:16:12.000 | we can vibe code it a little bit.
00:16:14.000 | We don't really need to review every line of code.
00:16:16.000 | It's not really facing end users.
00:16:18.000 | This has been a really, really fun thing for our business
00:16:20.000 | to just be able to churn out these really quick applications
00:16:23.000 | just to serve our own internal needs.
00:16:25.000 | So yeah, greenfield development is a great use case for agents.
00:16:29.000 | That's all I've got.
00:16:31.000 | We'd love to have you all join the OpenHands community.
00:16:33.000 | You can find us on GitHub at All-Hands-AI/OpenHands.
00:16:36.000 | Join us on Slack, Discord.
00:16:38.000 | We'd love to build with you.