(upbeat music) - Okay, hi everyone. So I was given the task of talking about agents in 2024 and this is an impossible task because there are so many agents, so many agents in 2024. So this is gonna be strongly colored by like my personal experience and what I think is interesting and important, but I think it's an important topic.
So let's go ahead. So the first thing I'd like to think about is, let's say I gave you, you know, a highly competent human some tools. Let's say I give you a web browser and a terminal or a file system and the ability to edit text or code. What could you do with that?
Everything, yeah. Probably a lot of things. This is like 99% of my, you know, daily life I guess when I'm working. So I think this is a pretty powerful tool set and what I am trying to do and what I think some other people are trying to do is come up with agents that are able to, you know, manipulate these things, web browsing, coding, running code in successful ways.
So here's a little bit about my profile. I'm a professor at CMU, chief scientist at All Hands AI, building open source coding agents. I'm a maintainer of OpenHands, which is an open source coding agent framework. And I'm also a software developer and I like doing lots of coding and, you know, shipping new features and stuff like this.
So building agents that help me to do this, you know, is kind of an interesting thing, very close to me. So the first thing I'd like to do is I'd like to try some things that I haven't actually tried before. If anybody has, you know, tried to give a live demo, you know, this is very, very scary whenever you do it and it might not work.
So it might not work this time either. But I wanna show you like three things that I typically do with coding agents in my everyday work. I use coding agents maybe five to 10 times a day to help me solve my own problems. And so this is a first one.
This is a data science task, which says I want to create scatter plots that show the increase of the SWE-bench score over time. And so I wrote a kind of concrete prompt about this. Agents work better with like somewhat concrete prompts. And I'm gonna throw this into OpenHands and let it work.
And I'll go back to that in a second. Another thing that I do is I create new software. And I've been using a service, a particular service, I won't name it, for sending emails and I'm not very happy with it. So I want to switch over to this new service called resend.com, which makes it easier to send emails.
And so I'm going to ask it to read the docs for the resend.com API and come up with a script that allows me to send emails. The input to the script should be a CSV file and the subject and body should be provided in Jinja2 templates. So I'll start another agent and try to get it to do that for me.
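To give a sense of what I'm asking for, here's a rough sketch of the kind of script the agent might produce. The file names are placeholders I made up, and the endpoint and field names are just my reading of the resend.com docs, so treat this as illustrative rather than the agent's actual output:

```python
import csv
import os

import requests
from jinja2 import Template

# Assumed to be set in the environment before running the script.
RESEND_API_KEY = os.environ["RESEND_API_KEY"]


def send_email(to_address: str, subject: str, body_html: str) -> None:
    # Resend's REST endpoint for sending a single email (per their public docs).
    resp = requests.post(
        "https://api.resend.com/emails",
        headers={"Authorization": f"Bearer {RESEND_API_KEY}"},
        json={
            "from": "me@example.com",  # placeholder sender address
            "to": [to_address],
            "subject": subject,
            "html": body_html,
        },
        timeout=30,
    )
    resp.raise_for_status()


def main() -> None:
    # Placeholder file names: one CSV row per recipient, plus Jinja2 templates
    # for the subject line and the body.
    with open("subject.j2") as f:
        subject_tmpl = Template(f.read())
    with open("body.html.j2") as f:
        body_tmpl = Template(f.read())

    with open("recipients.csv", newline="") as f:
        for row in csv.DictReader(f):
            send_email(
                row["email"],
                subject_tmpl.render(**row),
                body_tmpl.render(**row),
            )


if __name__ == "__main__":
    main()
```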
And let's go with the last one. The last one I do is improving existing software. And in order, you know, once you write software, you usually don't throw it away. You go in and like actually improve it iteratively. This software that I have is something I created without writing any code.
It's basically software to monitor how much our agents are contributing to the OpenHands repository. And on the, let me make that a little bit bigger. On the left side, I have the number of issues where it sent a pull request, whether it was merged in purple, closed in red, or is still open in green.
And so these are like, you know, it's helping us monitor. But one thing it doesn't tell me is the total number. And I kind of want that feature added to this software. So I'm gonna try to add that too. So I'll take this, I'll take this prompt. And here I want to open up specifically that GitHub repo.
So I'll open up that repo and paste in the prompt asking it, I asked it to make a pie chart for each of these and give me the total over the entire time period that I'm monitoring. So we'll do that. And so now I have, let's see, I have some agents.
Oh, this one already finished. Let's see. So this one already finished. You can see it finished analyzing the SWE-bench repository. Yeah, I'm trying to do that now, actually. It wrote a demonstration of how much each of the systems has improved over time. And I asked it to label the top three for each of the datasets.
And so it labeled OpenHands as being the best one for regular SWE-bench. For SWE-bench Verified, it has like the Amazon Q agent and OpenHands. For SWE-bench Lite, it has three over here. So you can see like, that's pretty useful, right? If you're a researcher, you do data analysis all the time.
I did it while I was talking to all of you and making a presentation. So that's pretty nice. I doubt the other two are finished yet. That would be impressive if the, yeah. So I think they're still working. So maybe we'll get back to them at the end of the presentation.
So these are the kinds of things that I do every day with coding agents, or software development agents, now. And it's pretty impressive. The next thing I'd like to talk about a little bit is things I worry about when designing agents. So we're designing agents to, you know, do a very difficult task of like navigating websites, writing code, other things like this.
And within 2024, there's been like a huge improvement in the methodology that we use to do this. But there's a bunch of things we think about. There's a bunch of interesting papers and I'd like to introduce a few of them. So the first thing I worry about is the agent computer interface.
Like how do we get an agent to interact with computers? And how do we provide agents with the tools to do the job? And within OpenHands, we are doing the thing on the right, but there's also a lot of agents that do the thing on the left. So the thing on the left is you give like agents kind of granular tools.
You give them tools like, let's say your instruction is: I want to determine the most cost-effective country to purchase the smartphone model Kodak One. Other countries to consider are the USA, Japan, Germany, and India. And you have a bunch of available APIs. And so what you do for some agents is you provide all of these APIs as tools that they can call.
And so in this particular case, in order to solve this problem, you'd have to make about like 30 tool calls, right? You'd have to call lookup rates for Germany. You'd have to look it up for the US, Japan, and India. That's four tool calls. And then you'd go through and do all of these things separately.
And the method that we adopt in OpenHands instead is we provide these tools, but we provide them by just giving a coding agent the ability to call arbitrary Python code. And in the arbitrary Python code, it can call these tools. We expose these tools as APIs that the model can call.
And what that allows us to do is instead of writing 20 tool calls, making 20 LLM calls, you write a program that runs all of these all at once, and it gets the result. And of course it can execute that program. It can make a mistake. It can get errors back and fix things, but that makes our job a lot easier.
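To make that contrast concrete, here's a rough sketch in the code-as-action style. The tool functions and the numbers inside them are made-up stand-ins just so the example runs; the point is that one generated program replaces a long chain of individual tool calls:

```python
# Stand-ins for the tools the agent would really be given; the bodies are dummy
# stubs with made-up numbers so the sketch runs end to end.

def lookup_rates(country: str) -> float:
    """Stub for a 'look up tax/exchange rate' tool."""
    return {"USA": 0.08, "Japan": 0.10, "Germany": 0.19, "India": 0.18}[country]


def lookup_phone_price(model: str, country: str) -> float:
    """Stub for a 'look up the local price in USD' tool."""
    return {"USA": 699.0, "Japan": 680.0, "Germany": 720.0, "India": 650.0}[country]


def estimate_shipping(country: str) -> float:
    """Stub for an 'estimate shipping cost' tool."""
    return {"USA": 0.0, "Japan": 25.0, "Germany": 20.0, "India": 30.0}[country]


# With one tool call per LLM turn, the model would have to emit
# lookup_rates("Germany"), lookup_rates("USA"), ... as separate calls and wait
# for the LLM in between each one. With code-as-action, it just writes and runs:

def find_cheapest_country(model: str, countries: list[str]) -> str:
    totals = {
        c: lookup_phone_price(model, c) * (1 + lookup_rates(c)) + estimate_shipping(c)
        for c in countries
    }
    return min(totals, key=totals.get)


print(find_cheapest_country("Kodak One", ["USA", "Japan", "Germany", "India"]))
```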
And this has been really like instrumental to our success, I think. Another part of this is what tools does the agent need? And I think this depends on your use case. We're kind of extreme, and we're only giving the agent five tools, or maybe six tools. And what are they?
The first one is program execution. So it can execute Bash programs, and it can execute Jupyter notebooks. It can execute cells in Jupyter notebooks. So those are two tools. Another one is a file editing tool. And the file editing tool allows you to browse parts of files, and kind of read them, overwrite them, other stuff like this.
And then we have another global search and replace tool. So it's actually two tools for file editing. And then a final one is web browsing. Web browsing, I'm kind of cheating when I call it only one tool. You actually have like scroll and text input and click and other stuff like that.
But these are basically the only things we allow the agent to do. Then the question is like, what if we want to allow it to do something else? And the answer is, well, you know, human programmers already have a bunch of things that they use. They have the requests PyPI library.
They have the pdftotext PyPI library. They have like all these other libraries in the Python ecosystem that they can use. And so if we provide a coding agent with all these libraries, it can do things like data visualization and other stuff that I just showed you. So it can also git clone repositories and other things like this.
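For example, because the agent can run arbitrary Python, it doesn't need a bespoke GitHub tool; it can just hit the public GitHub REST API with the requests library. A small sketch, where the owner, repo, and issue number are placeholders:

```python
import requests

# Placeholder repository coordinates; swap in a real owner/repo/issue to try it.
owner, repo, issue_number = "some-org", "some-repo", 123

resp = requests.get(
    f"https://api.github.com/repos/{owner}/{repo}/issues/{issue_number}/comments",
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
resp.raise_for_status()

# Print who commented and the start of each comment on the issue.
for comment in resp.json():
    print(comment["user"]["login"], "->", comment["body"][:80])
```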
The agents are super good at using the GitHub API also. So they can do things on GitHub, like finding all of the comments on your issues or checking GitHub actions and stuff. The second thing I think about is the human agent interface. So this is like, how do we get humans to interact with agents?
I already showed you one variety of our human agent interface. It's basically a chat window where you can browse through the agent's results and things like this. This is very, very difficult. I don't think anybody has a good answer to this. And I don't think we have a good answer to this, but the guiding principles that I'm trying to follow are we want to present enough info to the user.
So we want to present them with, you know, what the agent is doing in the form of a kind of English description. So you can see here, every time it takes an action, it says like, I will help you create a script for sending emails.
When it runs a bash command, sorry, that's a little small. When it runs a bash command, it will say ran a bash command. It won't actually show you the whole bash command or the whole Jupyter Notebook because it can be really large, but you can open it up and see if you want to by clicking on this.
So like, if you want to explore more, you can click over to the Jupyter Notebook and see what's displayed in the Jupyter Notebook. And you get like lots and lots of information. So that's one thing. Another thing is go where the user is. So like if the user is already interacting in a particular setting, then I'd like to, you know, integrate into that setting, but only to a point.
So at OpenHands, we have a chat UI for interaction. We have a GitHub plugin for tagging and resolving issues. So basically what you do is you do @OpenHandsAgent and the OpenHandsAgent will like see that comment and be able to go in and fix things. So if you say @OpenHandsAgent, tests are failing on this PR, please fix the tests.
It will go in and fix the tests for you and stuff like this. Another thing we have is a remote runtime for launching headless jobs. So if you want to launch like a fleet of agents to solve, you know, five different problems at once, you can also do that through an API.
So we have these interfaces. And this probably depends on the use case. So like, if you're a coding agent, you want to do things one way. If you're like an insurance auditing agent, you'll want to do things other ways, obviously. Another thing I think about a lot is choosing a language model.
And for agentic LMs, we have to have a bunch of things work really well. The first thing is really, really good instruction following ability. And if you have really good instruction following ability, it opens up like a ton of possible applications for you. Tool use and coding ability. So if you provide tools, it needs to be able to use them well.
Environment understanding. So, like if you're building a web agent, it needs to be able to understand web pages either through vision or through text. And error awareness and recovery ability. So if it makes a mistake, it needs to be able to, you know, figure out why it made a mistake, come up with alternative strategies and other things like this.
Under the hood, in all of the demos that I did just now, we're using Claude. Claude has all of these abilities. Very good, not perfect, but very good. Most others don't have these abilities quite as much. So like GPT-4o doesn't have very good error recovery ability. And so because of this, it will go into loops and do the same thing over and over and over again, whereas Claude does not do this.
Claude, if you use the agents enough, you get used to their kind of like personality, and Claude says, hmm, let me try a different approach a lot. So, you know, obviously it's been trained in some way to, you know, elicit this ability. We did an evaluation. This is old and we need to update this basically, but we evaluated Claude, GPT-4o, o1-mini, Llama 405B, and DeepSeek 2.5 on being a good code agent within our framework.
And Claude was kind of head and shoulders above the rest. GPT-4o was kind of okay. The best open source model was Llama 3.1 405B. This needs to be updated 'cause this is like a few months old by now and, you know, things are moving really, really fast, but I still am under the impression that Claude is the best.
The other closed models are, you know, not quite as good. And then the open models are a little bit behind that. Grok, we haven't tried Grok at all actually. So it's a good question. If you want to try it, I'd be happy to help. Cool, another thing is planning.
And so there's a few considerations for planning. The first one is whether you have a curated plan or you have it generated on the fly. And so for solving GitHub issues, you can kind of have an overall plan. Like the plan is first reproduce. If there's an issue, first write tests to reproduce the issue or to demonstrate the issue.
After that, run the tests and make sure they fail. Then go in and fix the issue, run the tests again to make sure they pass, and then you're done. So that's like a pretty good workflow for like solving coding issues. And you could curate that ahead of time. Another option is to let the language model basically generate its own plan.
And both of these are perfectly valid. Another one is explicit structure versus implicit structure. So let's say you generate a plan. If you have explicit structure, you could like write a multi-agent system. And the multi-agent system would have your reproducer agent and then it would have your test writer agent and your bug fixer agent and lots of different agents.
And you would explicitly write this all out in code and then use it that way. On the other hand, you could just provide a prompt that says, please do all of these things in order. So in OpenHands, we do very light planning. We have a single prompt, we don't have any multi-agent systems, but we do provide like instructions about like what to do first, what to do next and other things like this.
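Just to illustrate what that light, implicit structure can look like, here's a sketch of plan-style instructions living inside a single system prompt. The wording here is mine, not OpenHands' actual prompt:

```python
# A sketch of "implicit structure": the whole plan lives in one prompt for one
# agent, instead of being split across a reproducer agent, a test-writer agent,
# a bug-fixer agent, and so on. Illustrative wording, not OpenHands' real prompt.
SYSTEM_PROMPT = """\
You are a software engineering agent working in a real repository.
When asked to fix an issue, work in this order unless it stops making sense:
1. Read the relevant code and reproduce the issue.
2. Write a test that demonstrates the issue and confirm that it fails.
3. Fix the underlying problem.
4. Re-run the tests and confirm that they pass.
If a step fails or the plan no longer fits, say why and try a different approach.
"""
```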
I'm not against doing it the other way, but I laid out some kind of justification for this in this blog called Don't Sleep on Single Agent Systems. And the basic idea behind this is if you have a really, really good instruction following agent, it will follow the instructions as long as things are working according to your plan.
But let's say you need to deviate from your plan, you still have the flexibility to do this. And if you do explicit structure through a multi-agent system, it becomes a lot harder to do that. Like you get stuck when things deviate from your plan. There's also some other examples and I wanted to introduce a few papers.
So one paper I liked recently is this paper called CoAct, where you generate plans and then go in and fix them. And so the basic idea is like if you need to deviate from your plan, you can figure out that your plan was not working and go back and revise it.
Another thing I think about a lot is specifying common workflows. So we're trying to tackle software development and I already showed like three use cases where we do software development. And when we do software development, we do a ton of different things, but we do them over and over and over again.
So just to give an example, we fix GitHub actions when GitHub actions are failing and we do that over and over and over again. That's not the number one thing that software engineers do, but it's high up on the list. So how can we get a list of all of the workflows that people are working on?
And there's a few research works that people have done in this direction. One example is manual prompting. So there's this nice paper called SteP that got state-of-the-art on the WebArena web navigation benchmark, where they came up with a bunch of manual workflows for solving different web navigation tasks.
And we also have a paper recently called Agent Workflow Memory where the basic idea behind this is we want to create self-improving agents that learn from their past successes. And the way it works is we have a memory that has an example of lots of the previous workflows that people have used.
And every time the agent finishes a task and it self-judges that it did a good job at that task, you take that task, you break it down into the individual workflows included in it, and then you put it back in the prompt for the agent to use next time. And we demonstrated that this leads to a 22.5% increase on WebArena after 40 examples.
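That 22.5% increase comes from a pretty simple loop. Here's a heavily simplified sketch of the idea, not the paper's actual code, and the helper functions are trivial stand-ins:

```python
# Heavily simplified sketch of agent workflow memory: after each task the agent
# judges its own success, distills reusable workflows from the trajectory, and
# adds them back into the prompt for future tasks. The helpers are stand-ins.

workflow_memory: list[str] = []


def run_agent(task: str, extra_context: str) -> list[str]:
    """Stand-in for actually running the agent; returns a toy trajectory."""
    return [f"navigate to the right page for: {task}", f"complete: {task}"]


def self_judge_success(trajectory: list[str]) -> bool:
    """Stand-in for an LLM self-judgment of whether the task went well."""
    return len(trajectory) > 0


def extract_workflows(trajectory: list[str]) -> list[str]:
    """Stand-in for breaking a successful trajectory into reusable workflows."""
    return [" -> ".join(trajectory)]


def solve(task: str) -> None:
    # Past workflows are simply added back into the prompt for the next task.
    context = "Useful workflows from past tasks:\n" + "\n".join(workflow_memory)
    trajectory = run_agent(task, extra_context=context)
    if self_judge_success(trajectory):
        workflow_memory.extend(extract_workflows(trajectory))


for t in ["find a cheap laptop", "post on the forum"]:
    solve(t)
print("\n".join(workflow_memory))
```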
So that's a pretty huge increase by kind of self-learning and self-improvement. Another thing is exploration. And one thing I think about is like, how can agents learn more about their environment before acting? And I work on coding and web agents and there's a few good examples of this in both areas.
Within coding, I view this as like repository understanding, understanding the code base that you're dealing with. And there's a couple of examples of this, one example being Agentless, where they basically create a map of the repo, and based on the map of the repo, they feed that into the agent so the agent can then navigate the repo and better know where things are.
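As a flavor of what a repository map can look like, here's a toy sketch that just lists each Python file with its top-level classes and functions. It's my own simplification, not what Agentless actually builds:

```python
import ast
from pathlib import Path


def repo_map(root: str) -> str:
    """Toy repository map: each Python file plus its top-level classes and
    functions. The resulting text gets pasted into the agent's prompt so it
    knows roughly where things live before it starts editing."""
    lines = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue
        names = [
            node.name
            for node in tree.body
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        ]
        lines.append(f"{path}: {', '.join(names) or '(no top-level definitions)'}")
    return "\n".join(lines)


print(repo_map("."))
```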
And for web agents, there's an example of a paper called Bagel. And basically what they do is they have the agent just do random tasks on a website, explore the website, better understand the structure of the website. And then after that, they feed that in as a part of the prompt.
Part seven is search. Right now in OpenHands, we just let the agent go on a linear search path. So it's just solving the problem once. We're using a good agent that can kind of like recover from errors and try alternative things when things are not working properly, but still we only have a linear search path.
But there's also some nice work in 2024 that is about exploring multiple paths. So one example of this is a paper called Tree Search for Language Model Agents, and they basically expand multiple paths, check whether the paths are going well, and if they aren't going well, you rewind back.
And on the web, this is kind of tricky because like how do you rewind when you accidentally ordered something you don't want on Amazon? It's kind of not the easiest thing to do. For code, it's a little bit easier 'cause you can just revert any changes that you made.
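For the code case, here's a minimal sketch of what that rewinding could look like: snapshot the repo state, try a candidate fix, and roll back with git if it doesn't pan out. This is just the backtracking mechanism, not the actual tree search from the paper, and try_candidate is a stand-in for letting the agent attempt a fix and run the tests:

```python
import subprocess


def snapshot() -> str:
    """Record the current commit so we can rewind to it later."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def rewind(commit: str) -> None:
    """Throw away the working-tree edits and go back to the known-good state."""
    subprocess.run(["git", "checkout", commit, "--", "."], check=True)


def try_candidate(plan: str) -> bool:
    """Stand-in for 'let the agent apply one candidate fix and run the tests'."""
    print(f"trying: {plan}")
    return False


base = snapshot()
for plan in ["fix the off-by-one", "fix the cache path", "fix the config"]:  # hypothetical
    if try_candidate(plan):
        break          # keep this branch of the search
    rewind(base)       # otherwise rewind the repo and try the next branch
```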
But I think that's an interesting topic too. And then finally, evaluation. So within our development, for evaluation, we want to do a number of things. The first one is fast sanity checks. And in order to do this, we want things we can run really fast, really cheaply. So for web, we have something called MiniWoB, Mini World of Bits, which is basically these trivial kind of web navigation things.
We also use something called the Aider code editing benchmark, which is just about editing individual files. But we also want highly realistic evaluation. So for the web, we have something called WebArena that we created at CMU. This is web navigation on real open source websites. So it's open source websites that are actually used to serve shops or like bulletin boards or other things like this.
And for code, we use SWE-bench, which I think a lot of people may have heard of. It's basically a coding benchmark that comes from real world pull requests on GitHub. So if you can solve those, you can also probably solve other real world pull requests. I would say we still don't have benchmarks for the full versatility of agents.
So for example, we don't have benchmarks that test whether agents can code and do web navigation, but we're working on that and hoping to release something in the next week or two. So if that sounds interesting to you, come talk to me and I will tell you more about it.
- Cool, so I don't like making predictions, but I was told that I should be somewhat controversial, I guess, so I will try to do it anyway, although maybe none of these will be very controversial. The first thing is agent-oriented LLMs, like large language models for agents. My prediction is every large LLM trainer will be focusing on training models as agents.
So every large language model will be a better agent model by mid 2025. Competition will increase, prices will go down, smaller models will become competitive as agents. So right now, actually agents are somewhat expensive to run in some cases, but I expect that that won't last six months. I bet we'll have much better agent models in six months.
Another thing is instruction following ability, specifically in agentic contexts, will increase. And what that means is we'll have to do less manual engineering of agentic workflows and be able to do more by just prompting agents in more complex ways. Claude is already really good at this.
It's not perfect, but it's already really, really good. And I expect the other models will catch up to Claude pretty soon. Error correction ability will increase, less getting stuck in loops. Again, this is something that Claude's already pretty good at. And I expect the others will follow. Agent benchmarks.
Agent benchmarks will start saturating. So right now we have WebArena and SWE-bench. I think WebArena is already too easy. It's not super easy, but it's already a bit too easy because the tasks we do in there are ones that take like two minutes for a human. So not too hard.
And kind of historically, in 2023, our benchmarks were too easy. So we built harder benchmarks; WebArena and SWE-bench were both built in 2023. In 2024, our agents were too bad, so we built agents, and now we're building better agents. In 2025, our benchmarks will be too easy, so we'll build better benchmarks, I'm guessing.
So I would expect to see much more challenging agent benchmarks come out and we're already seeing some of them. In 2026, I don't know. I didn't write AGI, but we'll see. Then the human agent computer interface. I think one thing that we'll want to think about is what do we do at 75% success rate at things that we like actually care about.
Right now we have 53% or 55% on SWE-bench Verified, which is real world GitHub PRs. My impression is that the actual ability of models is maybe closer to 30 to 40%. So 30 to 40% of the things that I want an agent to solve on my own repos, it just solves without any human intervention.
80 to 90% it can solve without me opening an IDE, but I need to give it feedback. So how do we make that interaction smooth so that humans can audit the work of agents that are really, really good, but not perfect is going to be a big challenge. How can we expose the power of programming agents to other industries?
So as programmers, I think not all of us are using agents every day in our programming, although we probably will be in months or maybe a year, but I think it will come very naturally to us as programmers because we know code, we know how to architect software and stuff like that.
So I think the question is how do we put this in the hands of a lawyer or a chemist or somebody else and have them also be able to interact with it as naturally as we can. Another interesting thing is how can we redesign our existing systems for agents?
So we had a paper on API-based web agents, and basically what we showed is if you take a web agent and the agent interacts not with a website, but with APIs, the accuracy goes way up, just because APIs are way easier to interact with. And in fact, like when I ask our agent, our agent is able to browse websites, but whenever I want it to interact with GitHub, I tell it do not browse the GitHub website, use the GitHub API because it's way more successful at doing that.
So maybe every website is gonna need to have an API because we're gonna be having agents interact with them. About progress, I think progress will get faster. It's already fast. A lot of people are already overwhelmed, but I think it will continue. The reason why is agents are building agents and better agents will build better agents faster.
So if you haven't interacted with a coding agent yet, it's pretty magical, like the stuff that it can do. So, yeah. And I have a call to action. Honestly, I've been working on natural language processing and language models for what, 15 years now? And even for me, it's pretty impressive what AI agents powered by strong language models can do.
On the other hand, I believe that we should really make these powerful tools accessible. And what I mean by this is I don't think like, we should have these be opaque or limited to only a certain set of people. I feel like they should be affordable. They shouldn't be increasing the difference in the amount of power that people have.
If anything, I'd really like them to kind of make it possible for people who weren't able to do things before to be able to do them well. Open source is one way to do that. That's why I'm working on open source. There are other ways to do that. Make things cheap, make things so you can serve them to people who aren't able to afford them easily.
Like Duolingo is one example where they get all the people in the US to pay them $20 a month. So that they can give all the people in South America free language education so they can learn English and become more attractive on the job market, for instance. And so I think we can all think of ways that we can do that sort of thing.
And if that resonates with you, please contribute. Of course, I'd be happy if you contribute to Open Hands and use it. But another way you can do that is just use open source solutions, contribute to them, research with them, and train strong open source models. So I see some people in the room who are already training models.
It'd be great if you could train models for coding agents and make them cheap and yeah. Yeah, please, I was thinking about you, among others. Cool, yeah, that's all I have, thanks. - Slightly controversial thing is probably the nicest way to say hot takes. Any hot takes questions, actual hot takes?
- Oh, I can also show the other agents that were working if anybody's interested, but yeah, sorry, go ahead. - Yeah, I have a couple of questions. So they're kind of paired maybe. The first thing is that you said that you're estimating that your agent is successfully resolving something like 30 to 40% of your issues, but that's like below what you saw on SWE-bench.
So I guess I'm wondering where that discrepancy is coming from. And then I guess my other second question, which is maybe broader in scope, is that like if you think of an agent as like a junior developer, and I say, go do something, then I expect maybe tomorrow to get a Slack message being like, hey, I ran into this issue.
How can I resolve it? And like you said, your agent is like successfully solving like 90% of issues where you give it direct feedback. So are you thinking about how to get the agent to reach out, like, for planning when it's stuck or something like that? Like to identify when it runs into a hole like that?
- Yeah, so great. These are great questions. - Oh, sorry, a third question to go with the first two: if so, are you going to add a benchmark for that second question? - Okay, great. Yeah, great questions. Okay, so the first question was, why do I think it's resolving less than 50% of the issues on SWE-bench?
So first, SWE-bench is on popular open source repos and all of these popular open source repos were included in the training data for all of the language models. And so the language models already know these repos. In some cases, the language models already know the individual issues in SWE-bench.
So basically like some of the training data has leaked. And so it definitely will overestimate with respect to that. I don't think it's like horribly, horribly off, but I think it's boosting the accuracy by a little bit. So maybe that's the biggest reason why. In terms of asking for help and whether we're benchmarking asking for help, yes, we are.
So one thing we're working on now, which we're hoping to put out soon, is we basically made super vague SWE-bench issues. Like, I'm having a problem with the matrix multiply, please help. (laughs) Because if anybody's run a popular open source framework, these are what half your issues are like.
You're like, users show up and say, my screen doesn't work, what's wrong, or something. And so then you need to ask them questions about how to reproduce it. So yeah, we're working on that. I think, my impression is that agents are not very good at asking for help, even Claude.
So like when they ask for help, they'll ask for help when they don't need it and then won't ask for help when they do need it. So this is definitely like an issue, I think. - Thanks for the great talk. I also have two questions. The first one: can you talk a bit more about how the web agent interacts with websites?
So is there a VLM that looks at the webpage layout and then you parse the HTML and select which buttons to click on? And if so, do you think there's a future where, so I work at Bing, Microsoft AI, do you think there's a future where there's like the same web index, but an agent-friendly web index where all the processing is done offline, so that you don't need to spend time cleaning up the HTML and figuring out what to click online.
And any thoughts on that? - Yeah, so great question. There's a lot of work on web agents. I didn't go into like all of the details, but I think there's three main ways that agents interact with websites. The first way is the simplest way and the newest way, but it doesn't work very well, which is you take a screenshot of the website and then you click on a particular pixel value on the website.
And like models are not very good at that at the moment. Like they'll misclick. There was this thing about how like Claude computer use started like looking at pictures of Yellowstone National Park or something like this. I don't know if you heard about this anecdote, but like people were like, oh, it's so human.
It's looking for a vacation. And it was like, no, it probably just misclicked on the wrong pixels and accidentally clicked on an ad. So like, this is the simplest way. The second simplest way is you take the HTML and you basically identify elements in the HTML. You don't use any vision whatsoever.
And then you say, okay, I want to click on this element. I want to enter text in this element or something like that. But HTML is too huge. So it actually, it usually gets condensed down into something called an accessibility tree, which was made for screen readers for visually impaired people.
And so that's another way. And then the third way is kind of a hybrid where you present the screenshot, but you also present like a textual summary of the output. And that's the one that I think will probably work best. What we're using is we're just using text at the moment.
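To give a flavor of that kind of text-only observation, here's a toy sketch that lists the interactive elements of a page with ids the agent can refer to. It's nowhere near a real accessibility tree, and it's not OpenHands' actual format, just the general idea:

```python
from bs4 import BeautifulSoup


def condense(html: str) -> str:
    """Toy text observation: list the page's interactive elements with ids the
    agent can click or type into. Real accessibility trees carry much more."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    elements = soup.find_all(["a", "button", "input", "select", "textarea"])
    for i, el in enumerate(elements):
        label = el.get_text(strip=True) or el.get("placeholder") or el.get("name") or ""
        lines.append(f"[{i}] <{el.name}> {label}")
    return "\n".join(lines)


print(condense("<button>Add to cart</button><input placeholder='Search products'>"))
```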
And that's just an implementation issue that we haven't implemented the visual stuff yet, but that's kind of like we're working on it now. Another thing that I should point out is we actually have two modalities for web browsing. Very recently, we implemented this. And the reason why is because if you want to interact with full websites, you will need to click on all of the elements or have the ability to click on all of the elements.
But most of our work that we need websites for is just web browsing and like gathering information. So we have another modality where we convert all of it to markdown because that's like way more concise and easier for the agent to deal with. And then can we create an index specifically for agents?
Maybe a markdown index or something like that would, you know, make sense. Oh, how would I make a successor to SWE-bench? So, I mean, a first thing is there's LiveCodeBench, which is basically continuously updated to make sure it doesn't leak into language model training data.
That's easy to do for SWE-bench because it comes from real repositories and those real repositories are getting new issues all the time. So you could just do it on the same repositories that they have there. There's also like a pretty large number of things covering various coding tasks. So like, for example, SWE-bench is mainly fixing issues, but there's also like documentation.
There's generating tests that actually test the functionality that you want. And there was a paper by a student at CMU on generating tests and stuff like that. So I feel like SWE-bench is one piece of the puzzle, but you could also have like 10 different other tasks. And then you could have like a composite benchmark where you test all of these abilities, not just that particular one.
Lots of other things too, but yeah. - Question from across. Use your mic, it would help. - Yeah, great talk, thank you. My question is about your experience designing agent architectures specifically. How much did you have to separate concerns in terms of task specific agents versus having one agent to do three or five things with a gigantic prompt with conditional paths and so on?
- Yeah, so that's a great question. So we have a basic coding and browsing agent. And I won't say basic, like it's a good agent, but it does coding and browsing. It has instructions about how to do coding and browsing. That is enough for most things, especially given a strong language model that has a lot of background knowledge about how to solve different types of tasks and how to use different APIs and stuff like that.
We do have a mechanism for something called microagents. And microagents are basically something that gets added to the prompt when a trigger is triggered. Right now it's very, very rudimentary. It's like if you detect the word GitHub anywhere, you get instructions about how to interact with GitHub, like use the API and don't browse.
Also, another one that I just added is for npm, the JavaScript package manager. And npm, when it runs and hits a failure, it drops into an interactive terminal prompt that says, would you like to quit? Enter yes. And when that happens, it stalls our agent until the timeout, which is like two minutes.
So I added a new microagent: whenever it starts using npm, it gets instructions about how to avoid the interactive terminal and stuff like that. So that's our current solution. Honestly, I like it a lot. It's simple, it's easy to maintain. It works really well and stuff like that.
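Mechanically, the idea is tiny. Here's a sketch of keyword-triggered prompt snippets; the trigger words and the instruction text are illustrative, not OpenHands' actual microagents:

```python
# Sketch of the microagent idea: keyword-triggered snippets appended to the
# prompt. The triggers and instruction text here are illustrative only.
MICROAGENTS = {
    "github": "When working with GitHub, prefer the REST API over browsing the website.",
    "npm": "npm can drop into an interactive prompt on failure; run it "
           "non-interactively so the session does not stall waiting for input.",
}


def augment_prompt(base_prompt: str, user_message: str) -> str:
    """Append the instructions for every microagent whose trigger word appears."""
    text = user_message.lower()
    extras = [tip for trigger, tip in MICROAGENTS.items() if trigger in text]
    return base_prompt + ("\n\n" + "\n".join(extras) if extras else "")


print(augment_prompt("You are a coding agent.", "Please fix the failing GitHub action."))
```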
But I think there is a world where you would want something more complex than that. - Got it, thank you. - I got a question about MCP. I believe this is the Anthropic Model Context Protocol. It seems like the most successful type of this, like standardization of interactions between computers and agents.
Are you guys adopting it? Is there any other competing standard? Any thoughts about it? - Yeah, I think the, so the Anthropic MCP is like a way, it's essentially a collection of APIs that you can use to interact with different things on the internet. I think it's not a bad idea, but there's a few things that bug me a little bit about it.
It's like, we already have an API for GitHub. So why do we need an MCP for GitHub, right? You know, like GitHub has an API, the GitHub API is evolving. We can look up the GitHub API documentation. So it seems like kind of duplicated a little bit. And also they have a setting where it's like, you have to spin up a server to serve your GitHub stuff and you have to spin up a server to serve your like, you know, other stuff.
And so I think it makes sense if you really care about like separation of concerns and security and like other things like this. But right now we haven't seen, we haven't seen that to have a lot more value than interacting directly with the tools that are already provided. And that kind of goes into my general philosophy, which is we're already developing things for programmers.
You know, how is an agent different from a programmer? And it is different, obviously, you know, like agents are different from programmers, but they're not that different at this point. So we can kind of interact with the interfaces we create for programmers. Yeah. I might change my mind later though.
So we'll see. - Yeah, hi, thanks. Very interesting talk. You were saying that the agents you have right now solve like maybe 30% of your issues out of the gate. I'm curious, of the things that it doesn't do, is there like a pattern that you observe? Like, oh, like these are the sorts of things that it just seems to really struggle with or is it just seemingly random?
- It's definitely not random. It's like, if it intuitively seems more complex, then it's more likely to fail. I've gotten a bit better at prompting also. So like, just to give an example, it will sometimes fail to fix a GitHub workflow because it will not look at the GitHub workflow and understand what the GitHub workflow is doing before it solves the problem.
So I think actually probably the biggest thing that it fails at, or that our agent plus Claude fails at, is insufficient information gathering before trying to solve the task. And so if you provide instructions that it should do information gathering beforehand, it tends to do well.
If you don't provide sufficient instructions, it will try to solve the task without like fully understanding the task first and then fail and then you need to go back and give additional feedback. Another example, like, I love this example. While I was developing the monitor website that I showed here, we had a really tricky bug where it was writing out a cache file to a different directory than it was reading the cache file from.
And I had no idea, I had no idea what was going on. I thought the bug was in a different part of the code. But what I asked it to do was come up with five possible reasons why this could be failing, in decreasing order of likelihood, and examine all of them.
And that worked. And it could just go in and like do that. So like, I think a certain level of like scaffolding about like how it should sufficiently gather all the information that's necessary in order to solve the task is like, if that's missing, then that's probably the biggest failure point at the moment.
- Thanks. - Yeah. - I'm just using this as a chance to ask you all my questions. You had a slide on here about like self-improving agents or something like that with memory. It's like a really throwaway slide for like a super powerful idea. It got me thinking about how I would do it.
I have no idea how. So I just wanted you to expand on this a bit more. - Yeah, self-improving. So I think the simplest possible way to create a self-improving agent is to have a really, really strong language model with infinite context. And it can just go back and look at like all of its past experiences and, you know, learn from them.
You might also want to remove the bad stuff just so it doesn't over-index on its like failed past experiences. But the problem is a really powerful language model is large, infinite context is expensive. We don't have a good way to index into it because like RAG, at least in my experience, RAG from language to code doesn't work super well.
So I think in the end, it's like, that's the way I would like to solve this problem. I'd like to have an infinite context and somehow be able to index into it appropriately. And I think that would mostly solve it. Another thing you can do is fine-tuning. So I think like RAG is one way to get information into your model.
Fine-tuning is another way to get information into your model. So that might be another way of continuously improving. Like you identify when you did a good job and then just add all of the good examples into your model. - Yeah, so you know how like Voyager tries to write code into a skill library and then reuses the skill library, right?
So it improves in the sense that it just builds up the skill library over time. - Yep. - One thing I was like thinking about, and there's this idea from Devin, your arch nemesis, of playbooks. I don't know if you've seen them. - Yeah, I mean, we're calling them workflows, but they're simpler.
- Yeah, so like basically like you should, like once a workflow works, you can kind of like persist them as a skill library. - Yep. - Right, like I feel like that's like some in between, like you said, you know, it's hard to do RAG between language and code, but I feel like that is RAG for, like I've done this before.
Last time I did it, this worked. So I'm just going to shortcut all the stuff that failed before. - Yeah, I totally, I think it's possible. It's just, you know, not trivial at the same time. - Yeah. - I'll explain the two curves. So basically the baseline is just an agent that does it from scratch every time.
And this curve up here is agent workflow memory, where it's like adding the successful experiences back into the prompt. Why is this improving? The reason why is because just it failed on the first few examples, and for the average to catch up, it took a little bit of time.
So it's not like this is actually improving it. You could just basically view the, this one is constant. And then this one is like improving like this. Basically you can see it's continuing to go up, yeah. - How do you think we're going to solve the authentication problem for agents right now?
- When you say authentication, you mean like credentials, like, yeah. - Yeah, 'cause I've seen a few startup solutions today, but it seems like it's limited to the amount of websites or actual authentication methods that it's capable of performing today. - Yeah, great question. So my preferred solution to this at the moment is GitHub fine-grained authentication tokens.
And GitHub fine-grained authentication tokens allow you to specify on a very granular basis. On this repo, you have permission to do this. On this repo, you have permission to do this. You also can prevent people from pushing to the main branch unless they get approved. You can do all of these other things.
And I think these were all developed for human developers or like the branch protection rules were developed for human developers. The fine-grained authentication tokens were developed for GitHub apps. I think for GitHub, maybe just pushing this like a little bit more is the way to do this. For other things, they're totally not prepared to give that sort of fine-grained control.
Like most APIs don't have something like a fine-grained authentication token. And that goes into my like comment that we're gonna need to prepare the world for agents, I think. But I think like the GitHub authentication tokens are like a good template for how you could start doing that maybe.
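Concretely, handing an agent a repo-scoped fine-grained token just looks like handing it an ordinary GitHub credential, except GitHub enforces the narrow permissions. A sketch, with placeholder names for the environment variable and the repository:

```python
import os

import requests

# A fine-grained personal access token created in GitHub's settings UI and
# scoped to one repository with only the permissions you want the agent to
# have (say, read/write on issues). Names below are placeholders.
token = os.environ["GITHUB_FINE_GRAINED_TOKEN"]
owner, repo, issue_number = "some-org", "some-repo", 123

resp = requests.post(
    f"https://api.github.com/repos/{owner}/{repo}/issues/{issue_number}/comments",
    headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    },
    json={"body": "Opened a candidate fix in the linked pull request."},
    timeout=30,
)
resp.raise_for_status()  # fails if the token's permissions don't cover this action
```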
But yeah, I don't know. I don't have an answer. - I'll let you know if I find one. - Okay, yeah, thank you. Cool. I'm gonna finish up. Let me just see. Okay, so this one did write a script. I'm not gonna actually read it for you. And then the other one, let's see.
Yeah, so it sent a PR. Sorry, what is the PR URL? (silence) So I don't know if this... Sorry, that's taking way longer than it should. Okay, cool. Yeah, so this one sent a PR. I'll tell you later if this actually like successfully... Oh, no, it's deployed on Vercel.
So I can actually show you. But let me try this real quick. Sorry, I know I don't have time. Yeah, there you go. I have pie charts now, so yeah. It's so fun. It's so fun to play with these things 'cause you could just do that while I'm giving a talk.
Things like that. So yeah, thanks. (audience applauds)