back to index

Best of 2024 in Agents (from #1 on SWE-Bench Full, Prof. Graham Neubig of OpenHands/AllHands)


Chapters

0:00 Welcome to Latent Space Live at NeurIPS 2024
0:29 State of LLM Agents in 2024
2:20 Professor Graham Neubig's Insights on Agents
3:57 Live Demo: Coding Agents in Action
8:20 Designing Effective Agents
14:13 Choosing the Right Language Model for Agents
16:24 Planning and Workflow for Agents
22:21 Evaluation and Future Predictions for Agents
25:31 Future of Agent Development
25:56 Human-Agent Interaction Challenges
26:48 Expanding Agent Use Beyond Programming
27:25 Redesigning Systems for Agent Efficiency
28:03 Accelerating Progress with Agent Technology
28:28 Call to Action for Open Source Contributions
30:36 Q&A: Agent Performance and Benchmarks
33:23 Q&A: Web Agents and Interaction Methods
37:16 Q&A: Agent Architectures and Improvements
43:09 Q&A: Self-Improving Agents and Authentication
47:31 Live Demonstration and Closing Remarks

Whisper Transcript | Transcript Only Page

00:00:00.000 | (upbeat music)
00:00:02.580 | - Okay, hi everyone.
00:00:08.520 | So I was given the task of talking about agents in 2024
00:00:13.520 | and this is an impossible task
00:00:16.780 | because there are so many agents, so many agents in 2024.
00:00:21.600 | So this is gonna be strongly covered
00:00:23.460 | by like my personal experience
00:00:25.160 | and what I think is interesting and important,
00:00:26.960 | but I think it's an important topic.
00:00:29.120 | So let's go ahead.
00:00:30.320 | So the first thing I'd like to think about is,
00:00:36.480 | let's say I gave you, you know,
00:00:38.800 | a highly competent human, some tools.
00:00:41.360 | Let's say I give you a web browser
00:00:44.760 | and a terminal or a file system
00:00:47.520 | and the ability to edit text or code.
00:00:51.580 | What could you do with that?
00:00:55.260 | Everything, yeah.
00:00:58.280 | Probably a lot of things.
00:00:59.360 | This is like 99% of my, you know,
00:01:01.720 | daily life I guess when I'm working.
00:01:05.560 | So I think this is a pretty powerful tool set
00:01:09.960 | and what I am trying to do
00:01:12.360 | and what I think some other people are trying to do
00:01:14.560 | is come up with agents that are able to, you know,
00:01:16.820 | manipulate these things,
00:01:18.240 | web browsing, coding, running code in successful ways.
00:01:21.800 | So there was a little bit about my profile.
00:01:25.360 | I'm a professor at CMU, chief scientist at All Hands AI,
00:01:28.240 | building open source coding agents.
00:01:30.480 | I'm maintainer of Open Hands,
00:01:32.560 | which is an open source coding agent framework.
00:01:35.400 | And I'm also a software developer
00:01:38.480 | and I like doing lots of coding and, you know,
00:01:43.480 | shipping new features and stuff like this.
00:01:45.480 | So building agents that help me to do this, you know,
00:01:48.180 | is kind of an interesting thing, very close to me.
00:01:50.800 | So the first thing I'd like to do
00:01:54.000 | is I'd like to try some things
00:01:55.760 | that I haven't actually tried before.
00:01:58.160 | If anybody has, you know, tried to give a live demo,
00:02:01.460 | you know, this is very, very scary whenever you do it
00:02:04.960 | and it might not work.
00:02:05.800 | So it might not work this time either.
00:02:08.040 | But I wanna show you like three things
00:02:10.520 | that I typically do with coding agents in my everyday work.
00:02:15.080 | I use coding agents maybe five to 10 times a day
00:02:18.600 | to help me solve my own problems.
00:02:21.800 | And so this is a first one.
00:02:23.000 | This is a data science task,
00:02:25.400 | which says I want to create scatter plots
00:02:28.760 | that show the increase of the SWE bench score over time.
00:02:32.100 | And so I wrote a kind of concrete prompt about this.
00:02:36.480 | Agents work better with like somewhat concrete prompts.
00:02:39.760 | And I'm gonna throw this into open hands and let it work.
00:02:44.760 | And I'll go back to that in a second.
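
For a concrete picture, here is a minimal sketch of the kind of plotting script a prompt like this might produce; the CSV filename and the "date"/"system"/"score" columns are assumptions for illustration, not the data the agent actually used.

```python
# Sketch only: hypothetical CSV layout, not the agent's actual output.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("swe_bench_scores.csv", parse_dates=["date"])  # assumed file/columns

fig, ax = plt.subplots(figsize=(8, 5))
for system, group in df.groupby("system"):
    ax.scatter(group["date"], group["score"], label=system)

ax.set_xlabel("Date")
ax.set_ylabel("SWE-Bench resolve rate (%)")
ax.set_title("SWE-Bench scores over time")
ax.legend()
fig.tight_layout()
fig.savefig("swe_bench_over_time.png")
```
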
00:02:52.320 | Another thing that I do is I create new software.
00:02:56.640 | And I've been using a service,
00:03:02.920 | a particular service, I won't name it,
00:03:06.440 | for sending emails and I'm not very happy with it.
00:03:09.120 | So I want to switch over to this new service
00:03:11.380 | called resend.com, which makes it easier to send emails.
00:03:15.040 | And so I'm going to ask it to read the docs
00:03:17.760 | for the resend.com API and come up with a script
00:03:20.360 | that allows me to send emails.
00:03:22.320 | The input to the script should be a CSV file
00:03:24.440 | and the subject and body should be provided
00:03:26.720 | in Jinja2 templates.
00:03:28.840 | So I'll start another agent
00:03:32.800 | and try to get it to do that for me.
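
As a rough idea of what is being asked for, a minimal sketch of such a script might look like the following; the CSV columns, template filenames, and the Resend endpoint and JSON fields are assumptions that should be checked against the resend.com docs.

```python
# Sketch only: verify the Resend API details against their documentation.
import csv
import os

import requests
from jinja2 import Template

API_KEY = os.environ["RESEND_API_KEY"]

with open("subject.j2") as f:          # assumed Jinja2 template files
    subject_tmpl = Template(f.read())
with open("body.j2") as f:
    body_tmpl = Template(f.read())

with open("recipients.csv", newline="") as f:   # assumed columns: email, name, ...
    for row in csv.DictReader(f):
        resp = requests.post(
            "https://api.resend.com/emails",    # assumed endpoint
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                "from": "me@example.com",
                "to": [row["email"]],
                "subject": subject_tmpl.render(**row),
                "text": body_tmpl.render(**row),
            },
            timeout=30,
        )
        resp.raise_for_status()
```
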
00:03:37.000 | And let's go with the last one.
00:03:42.600 | The last one I do is improving existing software.
00:03:46.820 | And in order, you know, once you write software,
00:03:50.640 | you usually don't throw it away.
00:03:51.760 | You go in and like actually improve it iteratively.
00:03:55.240 | This software that I have is something I created
00:03:59.000 | without writing any code.
00:04:01.080 | It's basically software to monitor
00:04:03.520 | how much our agents are contributing
00:04:06.720 | to the open hands repository.
00:04:09.040 | And on the, let me make that a little bit bigger.
00:04:15.440 | On the left side, I have the number of issues
00:04:18.180 | where it sent a pull request,
00:04:27.580 | whether it was merged in purple,
00:04:29.940 | closed in red, or is still open in green.
00:04:33.420 | And so these are like, you know, it's helping us monitor.
00:04:38.380 | But one thing it doesn't tell me is the total number.
00:04:40.700 | And I kind of want that feature added to this software.
00:04:43.980 | So I'm gonna try to add that too.
00:04:46.080 | So I'll take this, I'll take this prompt.
00:04:51.080 | And here I want to open up specifically that GitHub repo.
00:05:03.120 | So I'll open up that repo and paste in the prompt asking it,
00:05:09.600 | I asked it to make a pie chart for each of these
00:05:11.760 | and give me the total over the entire time period
00:05:14.140 | that I'm monitoring.
00:05:14.980 | So we'll do that.
00:05:17.540 | And so now I have, let's see, I have some agents.
00:05:21.080 | Oh, this one already finished.
00:05:23.340 | Let's see.
00:05:25.440 | So this one already finished.
00:05:29.460 | You can see it finished analyzing the SWE-Bench repository.
00:05:33.540 | It wrote a demonstration of,
00:05:40.200 | yeah, I'm trying to do that now, actually.
00:05:42.340 | It wrote a demonstration of how much each of the systems
00:05:51.580 | have improved over time.
00:05:53.280 | And I asked it to label the top three
00:05:56.160 | for each of the datasets.
00:05:57.220 | And so it labeled OpenHands as being the best one
00:05:59.480 | for SWE-Bench normal.
00:06:01.840 | For SWE-Bench Verified,
00:06:03.120 | it has like the Amazon Q agent and OpenHands.
00:06:06.360 | For SWE-Bench Lite, it has three over here.
00:06:11.360 | So you can see like, that's pretty useful, right?
00:06:15.840 | If you're a researcher, you do data analysis all the time.
00:06:18.360 | I did it while I was talking to all of you
00:06:19.880 | and making a presentation.
00:06:21.320 | So that's pretty nice.
00:06:24.320 | I doubt the other two are finished yet.
00:06:26.440 | That would be impressive if the, yeah.
00:06:27.920 | So I think they're still working.
00:06:29.360 | So maybe we'll get back to them
00:06:30.520 | at the end of the presentation.
00:06:32.040 | So these are the kinds of things
00:06:35.960 | that I do every day with coding agents now.
00:06:38.200 | And it's, or software development agents.
00:06:40.440 | It's pretty impressive.
00:06:41.600 | The next thing I'd like to talk about a little bit
00:06:46.320 | is things I worry about when designing agents.
00:06:48.440 | So we're designing agents to, you know,
00:06:50.560 | do a very difficult task of like navigating websites,
00:06:54.800 | writing code, other things like this.
00:06:57.160 | And within 2024, there's been like a huge improvement
00:07:00.640 | in the methodology that we use to do this.
00:07:04.480 | But there's a bunch of things we think about.
00:07:06.320 | There's a bunch of interesting papers
00:07:07.680 | and I'd like to introduce a few of them.
00:07:09.640 | So the first thing I worry about
00:07:12.440 | is the agent computer interface.
00:07:14.920 | Like how do we get an agent to interact with computers?
00:07:18.200 | And how do we provide agents with the tools to do the job?
00:07:23.200 | And within OpenHands, we are doing the thing on the right,
00:07:28.880 | but there's also a lot of agents
00:07:31.640 | that do the thing on the left.
00:07:33.400 | So the thing on the left is you give like agents
00:07:36.480 | kind of granular tools.
00:07:38.680 | You give them tools like,
00:07:39.960 | or let's say your instruction is,
00:07:43.320 | I want to determine the most cost-effective country
00:07:45.600 | to purchase the smartphone model Kodak One.
00:07:48.440 | Other countries to consider are the USA,
00:07:50.240 | Japan, Germany, and India.
00:07:52.360 | And you have a bunch of available APIs.
00:07:54.800 | And so what you do for some agents
00:07:57.640 | is you provide them all of these tools,
00:07:59.920 | APIs as tools that they can call.
00:08:02.800 | And so in this particular case,
00:08:05.000 | in order to solve this problem,
00:08:06.280 | you'd have to make about like 30 tool calls, right?
00:08:08.560 | You'd have to call lookup rates for Germany.
00:08:12.320 | You'd have to look it up for the US, Japan, and India.
00:08:14.840 | That's four tool calls.
00:08:16.480 | And then you'd go through
00:08:17.320 | and do all of these things separately.
00:08:20.720 | And the method that we adopt in OpenHands instead
00:08:24.240 | is we provide these tools,
00:08:26.120 | but we provide them by just giving a coding agent
00:08:28.600 | the ability to call arbitrary Python code.
00:08:32.280 | And in the arbitrary Python code, it can call these tools.
00:08:36.560 | We expose these tools as APIs that the model can call.
00:08:39.680 | And what that allows us to do
00:08:40.880 | is instead of writing 20 tool calls, making 20 LLM calls,
00:08:45.160 | you write a program that runs all of these all at once,
00:08:47.680 | and it gets the result.
00:08:49.000 | And of course it can execute that program.
00:08:50.600 | It can make a mistake.
00:08:51.960 | It can get errors back and fix things,
00:08:54.960 | but that makes our job a lot easier.
00:08:56.560 | And this has been really like instrumental
00:08:58.180 | to our success, I think.
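
To make the contrast concrete, here is a toy sketch in the spirit of that approach; lookup_rates and lookup_phone_price are hypothetical stand-ins for the APIs in the example, not OpenHands' actual tool interface.

```python
# A sketch of code-acting vs. tool-calling. The two functions below are
# hypothetical APIs; they return fixed numbers so the sketch runs.
def lookup_rates(country: str) -> tuple[float, float]:
    """Hypothetical API: return (tax_rate, shipping_cost) for a country."""
    return 0.10, 25.0

def lookup_phone_price(model: str, country: str) -> float:
    """Hypothetical API: return the local price of a phone model."""
    return 699.0

# Tool-calling style: each lookup would be a separate LLM round-trip.
# Code-acting style: the agent writes one program that calls the same APIs.
countries = ["USA", "Japan", "Germany", "India"]
costs = {}
for country in countries:
    tax_rate, shipping = lookup_rates(country)
    price = lookup_phone_price("Kodak One", country)
    costs[country] = price * (1 + tax_rate) + shipping

best = min(costs, key=costs.get)
print(f"Most cost-effective country: {best} ({costs[best]:.2f})")
```
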
00:09:01.200 | Another part of this is what tools does the agent need?
00:09:05.220 | And I think this depends on your use case.
00:09:07.700 | We're kind of extreme,
00:09:09.160 | and we're only giving the agent five tools,
00:09:13.240 | or maybe six tools.
00:09:15.280 | And what are they?
00:09:16.960 | The first one is program execution.
00:09:19.600 | So it can execute Bash programs,
00:09:21.400 | and it can execute Jupyter notebooks.
00:09:23.960 | It can execute cells in Jupyter notebooks.
00:09:26.600 | So those are two tools.
00:09:30.200 | Another one is a file editing tool.
00:09:32.360 | And the file editing tool allows you
00:09:35.200 | to browse parts of files,
00:09:36.920 | and kind of read them, overwrite them,
00:09:40.320 | other stuff like this.
00:09:41.560 | And then we have another global search and replace tool.
00:09:43.800 | So it's actually two tools for file editing.
00:09:46.160 | And then a final one is web browsing.
00:09:49.000 | Web browsing, I'm kind of cheating
00:09:50.360 | when I call it only one tool.
00:09:51.640 | You actually have like scroll and text input
00:09:54.360 | and click and other stuff like that.
00:09:56.120 | But these are basically the only things
00:09:58.360 | we allow the agent to do.
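
Roughly, that tool surface could look like the sketch below; the names and signatures are illustrative rather than OpenHands' actual tool definitions, and the Jupyter-cell and browsing tools are left out because they need a kernel and a browser driver rather than a few lines.

```python
# Sketch of a minimal coding-agent tool set (illustrative names/signatures).
import subprocess

def execute_bash(command: str) -> str:
    """Run a shell command and return its combined output."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

def edit_file(path: str, start_line: int, end_line: int, new_text: str) -> None:
    """Overwrite a line range in a file (browse/read/overwrite style editing)."""
    with open(path) as f:
        lines = f.readlines()
    if not new_text.endswith("\n"):
        new_text += "\n"
    lines[start_line - 1:end_line] = [new_text]
    with open(path, "w") as f:
        f.writelines(lines)

def search_replace(path: str, old: str, new: str) -> None:
    """Global search-and-replace within a single file."""
    with open(path) as f:
        text = f.read()
    with open(path, "w") as f:
        f.write(text.replace(old, new))

TOOLS = {
    "execute_bash": execute_bash,
    "edit_file": edit_file,
    "search_replace": search_replace,
}
```
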
00:10:00.640 | What, then the question is like,
00:10:03.600 | what if we want it to allow it to do something else?
00:10:06.480 | And the answer is, well, you know,
00:10:09.560 | human programmers already have a bunch of things
00:10:11.960 | that they use.
00:10:13.200 | They have the requests PyPI library.
00:10:15.040 | They have the PDF-to-text PyPI library.
00:10:18.640 | They have like all these other libraries
00:10:20.400 | in the Python ecosystem that they can use.
00:10:22.680 | And so if we provide a coding agent
00:10:24.840 | with all these libraries,
00:10:25.720 | it can do things like data visualization
00:10:27.800 | and other stuff that I just showed you.
00:10:29.160 | So it can also git clone repositories
00:10:32.200 | and other things like this.
00:10:34.360 | The agents are super good at using the GitHub API also.
00:10:37.480 | So they can do things on GitHub,
00:10:40.320 | like finding all of the comments on your issues
00:10:43.200 | or checking GitHub actions and stuff.
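
For instance, an agent that can write plain Python can fetch issue comments with a few lines like the sketch below; the repo name and issue number are placeholders, and the token is assumed to live in an environment variable.

```python
# Sketch: standard GitHub REST route for issue comments; repo/issue are placeholders.
import os
import requests

def issue_comments(owner: str, repo: str, number: int) -> list:
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/issues/{number}/comments",
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

for comment in issue_comments("some-org", "some-repo", 1):
    print(comment["user"]["login"], ":", comment["body"][:80])
```
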
00:10:45.040 | The second thing I think about
00:10:48.920 | is the human agent interface.
00:10:50.360 | So this is like, how do we get humans
00:10:52.240 | to interact with agents?
00:10:54.040 | I already showed you one variety
00:10:56.040 | of our human agent interface.
00:10:57.200 | It's basically a chat window
00:10:58.400 | where you can browse through the agent's results
00:11:00.160 | and things like this.
00:11:01.240 | This is very, very difficult.
00:11:04.400 | I don't think anybody has a good answer to this.
00:11:07.080 | And I don't think we have a good answer to this,
00:11:08.800 | but the guiding principles that I'm trying to follow
00:11:13.160 | are we want to present enough info to the user.
00:11:16.200 | So we want to present them with, you know,
00:11:19.200 | what the agent is doing
00:11:21.920 | in the form of a kind of English description.
00:11:26.040 | So you can see here,
00:11:27.720 | you can see here, every time it takes an action,
00:11:32.280 | it says like, I will help you create a script
00:11:34.360 | for sending emails.
00:11:35.840 | When it runs a bash command,
00:11:39.880 | sorry, that's a little small.
00:11:41.400 | When it runs a bash command,
00:11:43.280 | it will say ran a bash command.
00:11:46.360 | It won't actually show you the whole bash command
00:11:48.440 | or the whole Jupyter Notebook
00:11:49.600 | because it can be really large,
00:11:50.800 | but you can open it up and see
00:11:52.440 | if you want to by clicking on this.
00:11:54.840 | So like, if you want to explore more,
00:11:57.280 | you can click over to the Jupyter Notebook
00:11:59.160 | and see what's displayed in the Jupyter Notebook.
00:12:01.400 | And you get like lots and lots of information.
00:12:04.200 | So that's one thing.
00:12:05.360 | Another thing is go where the user is.
00:12:13.560 | So like if the user is already interacting
00:12:16.200 | in a particular setting,
00:12:17.760 | then I'd like to, you know, integrate into that setting,
00:12:20.360 | but only to a point.
00:12:22.520 | So at OpenHands, we have a chat UI for interaction.
00:12:26.320 | We have a GitHub plugin for tagging and resolving issues.
00:12:29.280 | So basically what you do is you do @OpenHandsAgent
00:12:33.360 | and the OpenHandsAgent will like see that comment
00:12:37.240 | and be able to go in and fix things.
00:12:38.680 | So if you say @OpenHandsAgent,
00:12:41.000 | tests are failing on this PR, please fix the tests.
00:12:43.800 | It will go in and fix the tests for you
00:12:45.280 | and stuff like this.
00:12:46.280 | Another thing we have is a remote runtime
00:12:50.840 | for launching headless jobs.
00:12:52.480 | So if you want to launch like a fleet of agents
00:12:54.600 | to solve, you know, five different problems at once,
00:12:57.800 | you can also do that through an API.
00:12:59.240 | So we have these interfaces.
00:13:02.840 | And this probably depends on the use case.
00:13:04.600 | So like depending, if you're a coding agent,
00:13:06.920 | you want to do things one way.
00:13:08.040 | If you're like insurance auditing agent,
00:13:10.800 | you'll want to do things other ways, obviously.
00:13:13.000 | Another thing I think about a lot
00:13:16.680 | is choosing a language model.
00:13:19.760 | And for agentic LMs, we have to have a bunch of things
00:13:24.760 | work really well.
00:13:26.520 | The first thing is really, really good
00:13:28.440 | instruction following ability.
00:13:30.480 | And if you have really good instruction following ability,
00:13:33.160 | it opens up like a ton of possible applications for you.
00:13:36.620 | Tool use and coding ability.
00:13:39.360 | So if you provide tools,
00:13:40.440 | it needs to be able to use them well.
00:13:42.280 | Environment understanding.
00:13:44.880 | So it needs, like if you're building a web agent,
00:13:48.440 | it needs to be able to understand web pages
00:13:50.320 | either through a vision or through text.
00:13:53.320 | And error awareness and recovery ability.
00:13:57.200 | So if it makes a mistake, it needs to be able to,
00:13:59.720 | you know, figure out why it made a mistake,
00:14:01.520 | come up with alternative strategies
00:14:03.440 | and other things like this.
00:14:04.800 | Under the hood, in all of the demos that I did now,
00:14:12.480 | Claude, we're using Claude.
00:14:15.120 | Claude has all of these abilities.
00:14:17.400 | Very good, not perfect, but very good.
00:14:20.440 | Most others don't have these abilities quite as much.
00:14:24.480 | So like GPT-4o doesn't have very good
00:14:27.560 | error recovery ability.
00:14:29.240 | And so because of this, it will go into loops
00:14:31.200 | and do the same thing over and over and over again,
00:14:33.040 | whereas Claude does not do this.
00:14:35.120 | Claude, if you use the agents enough,
00:14:38.400 | you get used to their kind of like personality
00:14:40.800 | and Claude says, hmm, let me try a different approach a lot.
00:14:44.680 | So, you know, obviously it's been trained in some way
00:14:47.640 | to, you know, elicit this ability.
00:14:49.480 | We did an evaluation.
00:14:52.800 | This is old and we need to update this basically,
00:14:56.280 | but we evaluated Claude, GPT-4o, o1-mini,
00:15:01.280 | Llama 405B, and DeepSeek 2.5
00:15:05.280 | on being a good code agent within our framework.
00:15:07.880 | And Claude was kind of head and shoulders above the rest.
00:15:11.440 | GPT-4o was kind of okay.
00:15:12.960 | The best open source model was Llama 3.1 405B.
00:15:16.680 | This needs to be updated
00:15:17.720 | 'cause this is like a few months old by now
00:15:19.520 | and, you know, things are moving really, really fast,
00:15:21.800 | but I still am under the impression that Claude is the best.
00:15:24.920 | The other closed models are, you know, not quite as good.
00:15:27.560 | And then the open models are a little bit behind that.
00:15:30.320 | Grok, we haven't tried Grok at all actually.
00:15:34.560 | So it's a good question.
00:15:35.560 | If you want to try it, I'd be happy to help.
00:15:41.520 | Cool, another thing is planning.
00:15:43.280 | And so there's a few considerations for planning.
00:15:47.440 | The first one is whether you have a curated plan
00:15:50.640 | or you have it generated on the fly.
00:15:53.360 | And so for solving GitHub issues,
00:15:57.280 | you can kind of have an overall plan.
00:15:59.760 | Like the plan is first reproduce.
00:16:03.160 | If there's an issue,
00:16:05.280 | first write tests to reproduce the issue
00:16:07.560 | or to demonstrate the issue.
00:16:09.400 | After that, run the tests and make sure they fail.
00:16:12.760 | Then go in and fix the code,
00:16:14.640 | run the tests again to make sure they pass
00:16:16.280 | and then you're done.
00:16:17.240 | So that's like a pretty good workflow
00:16:19.080 | for like solving coding issues.
00:16:22.080 | And you could curate that ahead of time.
00:16:24.880 | Another option is to let the language model
00:16:27.560 | basically generate its own plan.
00:16:29.640 | And both of these are perfectly valid.
00:16:31.920 | Another one is explicit structure versus implicit structure.
00:16:36.520 | So let's say you generate a plan.
00:16:39.040 | If you have explicit structure,
00:16:41.520 | you could like write a multi-agent system.
00:16:44.520 | And the multi-agent system would have your reproducer agent
00:16:48.480 | and then it would have your test writer agent
00:16:53.480 | and your bug fixer agent and lots of different agents.
00:16:57.480 | And you would explicitly write this all out in code
00:17:00.000 | and then use it that way.
00:17:02.520 | On the other hand, you could just provide a prompt
00:17:04.640 | that says, please do all of these things in order.
00:17:07.200 | So in OpenHands, we do very light planning.
00:17:14.120 | We have a single prompt,
00:17:15.240 | we don't have any multi-agent systems,
00:17:17.920 | but we do provide like instructions about like
00:17:20.400 | what to do first, what to do next
00:17:21.880 | and other things like this.
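
As an illustration of that "implicit structure" style, the curated issue-fixing workflow from a moment ago can just live inside one prompt; the wording below is illustrative, not OpenHands' actual system prompt.

```python
# Sketch of a single-agent prompt carrying an ordered workflow.
ISSUE_FIXING_PROMPT = """\
You are a software engineering agent working in a checked-out repository.
Follow this workflow, but deviate from it if the situation requires:
1. Read the issue and the relevant code.
2. Write a test that reproduces the issue.
3. Run the test and confirm that it fails.
4. Fix the code.
5. Re-run the tests and confirm they pass before finishing.

Issue:
{issue_text}
"""

def build_prompt(issue_text: str) -> str:
    return ISSUE_FIXING_PROMPT.format(issue_text=issue_text)
```
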
00:17:23.400 | I'm not against doing it the other way,
00:17:26.480 | but I laid out some kind of justification for this
00:17:30.560 | in this blog called Don't Sleep on Single Agent Systems.
00:17:33.600 | And the basic idea behind this is
00:17:35.600 | if you have a really, really good instruction
00:17:37.480 | following agent, it will follow the instructions
00:17:40.800 | as long as things are working according to your plan.
00:17:43.480 | But let's say you need to deviate from your plan,
00:17:45.880 | you still have the flexibility to do this.
00:17:47.800 | And if you do explicit structure
00:17:49.400 | through a multi-agent system,
00:17:50.480 | it becomes a lot harder to do that.
00:17:51.880 | Like you get stuck when things deviate from your plan.
00:17:55.460 | There's also some other examples
00:17:59.600 | and I wanted to introduce a few papers.
00:18:02.200 | So one paper I liked recently is this paper called CoAct
00:18:05.360 | where you generate plans and then go in and fix them.
00:18:09.240 | And so the basic idea is like
00:18:12.080 | if you need to deviate from your plan,
00:18:13.560 | you can figure out that your plan was not working
00:18:17.400 | and go back and deviate from it.
00:18:19.000 | Another thing I think about a lot
00:18:23.600 | is specifying common workflows.
00:18:25.400 | So we're trying to tackle software development
00:18:28.040 | and I already showed like three use cases
00:18:30.840 | where we do software development.
00:18:35.560 | And when we do software development,
00:18:40.040 | we do a ton of different things,
00:18:41.560 | but we do them over and over and over again.
00:18:43.120 | So just to give an example,
00:18:45.320 | we fix GitHub actions when GitHub actions are failing
00:18:49.520 | and we do that over and over and over again.
00:18:51.640 | That's not the number one thing that software engineers do,
00:18:53.940 | but it's a high up on the list.
00:18:56.200 | So how can we get a list of all of like the workflows
00:18:58.640 | that people are working on?
00:19:01.000 | And there's a few research works
00:19:03.600 | that people have done in this direction.
00:19:05.920 | One example is manual prompting.
00:19:07.560 | So there's this nice paper called Step
00:19:09.760 | that got state-of-the-art
00:19:10.880 | on the Web Arena Web Navigation Benchmark
00:19:12.760 | where they came up with a bunch of manual workflows
00:19:14.920 | for solving different web navigation tasks.
00:19:18.440 | And we also have a paper recently
00:19:20.200 | called Agent Workflow Memory
00:19:22.200 | where the basic idea behind this
00:19:23.960 | is we want to create self-improving agents
00:19:26.120 | that learn from their past successes.
00:19:29.200 | And the way it works is we have a memory
00:19:32.280 | that has an example of lots of the previous workflows
00:19:35.440 | that people have used.
00:19:37.040 | And every time the agent finishes a task
00:19:39.920 | and it self-judges that it did a good job at that task,
00:19:43.240 | you take that task,
00:19:44.120 | you break it down into individual workflows included in that
00:19:47.600 | and then you put it back in the prompt
00:19:49.160 | for the agent to work next time.
00:19:51.140 | And we demonstrated that this leads to a 22.5% increase
00:19:56.900 | on Web Arena after 40 examples.
00:20:00.400 | So that's a pretty huge increase
00:20:02.540 | by kind of self-learning and self-improvement.
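
A rough sketch of that loop is below; run_agent, self_judge, and extract_workflows are placeholders for LLM calls, and this is a simplification of the idea rather than the Agent Workflow Memory implementation.

```python
# Sketch of a self-improving loop: judge finished episodes, distill workflows,
# and prepend them to future prompts. The three helpers are placeholders.
workflow_memory: list[str] = []

def run_agent(prompt: str) -> tuple[str, list[str]]:
    """Placeholder: run the agent, return (final answer, action trajectory)."""
    raise NotImplementedError

def self_judge(task: str, answer: str) -> bool:
    """Placeholder: ask the model whether it solved the task well."""
    raise NotImplementedError

def extract_workflows(trajectory: list[str]) -> list[str]:
    """Placeholder: ask the model to distill reusable sub-workflows."""
    raise NotImplementedError

def solve(task: str) -> str:
    # Workflows induced from past successes are prepended to the prompt.
    prompt = "\n".join(workflow_memory + ["Task: " + task])
    answer, trajectory = run_agent(prompt)
    if self_judge(task, answer):
        workflow_memory.extend(extract_workflows(trajectory))
    return answer
```
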
00:20:04.860 | Another thing is exploration.
00:20:09.920 | And one thing I think about is like,
00:20:17.140 | how can agents learn more about their environment
00:20:19.300 | before acting?
00:20:20.940 | And I work on coding and web agents
00:20:24.360 | and there's a few good examples of this in both areas.
00:20:28.520 | Within coding, I view this as like repository understanding,
00:20:33.320 | understanding the code base that you're dealing with.
00:20:36.080 | And there's an example of this
00:20:38.200 | or a couple of examples of this,
00:20:39.400 | one example being Agentless,
00:20:41.500 | where they basically create a map of the repo
00:20:44.760 | and based on the map of the repo,
00:20:46.400 | they feed that into the agent
00:20:47.580 | so the agent can then navigate the repo
00:20:50.420 | and better know where things are.
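
In that spirit, a very small repo-map sketch might just walk the tree and collect top-level definitions to paste into the prompt; the details below are illustrative, not how Agentless actually builds its map.

```python
# Sketch: list Python files with their top-level functions/classes.
import ast
import os

def repo_map(root: str) -> str:
    lines = []
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            lines.append(os.path.relpath(path, root))
            try:
                with open(path, encoding="utf-8") as f:
                    tree = ast.parse(f.read())
            except SyntaxError:
                continue  # skip files that don't parse
            for node in tree.body:
                if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                    lines.append(f"    {node.name}")
    return "\n".join(lines)

print(repo_map("."))
```
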
00:20:53.380 | And for web agents,
00:20:55.020 | there's an example of a paper called Bagel.
00:20:57.300 | And basically what they do is they have the agent
00:21:00.820 | just do random tasks on a website,
00:21:03.620 | explore the website,
00:21:04.680 | better understand the structure of the website.
00:21:06.300 | And then after that, they feed that in
00:21:08.860 | as a part of the prompt.
00:21:10.060 | Part seven is search.
00:21:16.220 | Right now in open hands,
00:21:19.300 | we just let the agent go on a linear search path.
00:21:21.500 | So it's just solving the problem once.
00:21:24.300 | We're using a good agent that can kind of like
00:21:26.460 | recover from errors and try alternative things
00:21:28.700 | when things are not working properly,
00:21:30.140 | but still we only have a linear search path.
00:21:33.180 | But there's also some nice work in 2024
00:21:36.660 | that is about exploring multiple paths.
00:21:39.100 | So one example of this is,
00:21:40.980 | there's a paper called Tree Search for Language Model Agents,
00:21:43.780 | and they basically expand multiple paths,
00:21:46.380 | check whether the paths are going well,
00:21:49.320 | and if they aren't going well, you rewind back.
00:21:51.840 | And on the web, this is kind of tricky
00:21:54.440 | because like how do you rewind
00:21:56.680 | when you accidentally ordered
00:21:57.960 | something you don't want on Amazon?
00:21:59.400 | It's kind of not the easiest thing to do.
00:22:02.120 | For code, it's a little bit easier
00:22:03.480 | 'cause you can just revert any changes that you made.
00:22:06.920 | But I think that's an interesting topic too.
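
For code, the revert step really is that simple, as in this toy sketch; propose_patches and score_patch stand in for the model and an evaluator, and this captures only the flavor of the idea, not the cited paper's algorithm.

```python
# Toy sketch: try several candidate patches, score each, rewind with git,
# then keep only the best branch. Placeholders mark where the LLM would go.
import subprocess

def propose_patches(task: str, n: int) -> list[str]:
    """Placeholder: ask the model for n candidate diffs."""
    raise NotImplementedError

def score_patch() -> float:
    """Placeholder: e.g. run the test suite and return a pass rate."""
    raise NotImplementedError

def apply_patch(diff: str) -> None:
    subprocess.run(["git", "apply", "-"], input=diff, text=True, check=True)

def rewind() -> None:
    # Drop working-tree changes so the failed branch leaves no trace.
    subprocess.run(["git", "checkout", "--", "."], check=True)

def tree_search_step(task: str) -> str:
    best_diff, best_score = "", float("-inf")
    for diff in propose_patches(task, n=3):
        apply_patch(diff)
        score = score_patch()
        if score > best_score:
            best_diff, best_score = diff, score
        rewind()
    apply_patch(best_diff)  # keep only the best branch
    return best_diff
```
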
00:22:09.600 | And then finally, evaluation.
00:22:13.600 | So within our development for evaluation,
00:22:18.240 | we want to do a number of things.
00:22:19.960 | The first one is fast sanity checks.
00:22:23.000 | And in order to do this,
00:22:23.960 | we want things we can run really fast, really cheaply.
00:22:27.000 | So for web, we have something called Mini World of Bits,
00:22:30.400 | which is basically these trivial
00:22:32.200 | kind of web navigation things.
00:22:36.480 | We have something called the Aider code editing benchmark,
00:22:38.760 | where it's just about editing individual files that we use.
00:22:42.560 | But we also want highly realistic evaluation.
00:22:47.320 | So for the web, we have something called Web Arena
00:22:49.600 | that we created at CMU.
00:22:50.880 | This is web navigation on real open source websites.
00:22:55.680 | So it's open source websites
00:22:57.040 | that are actually used to serve shops
00:23:00.440 | or like bulletin boards or other things like this.
00:23:05.440 | And for code, we use SWE-Bench,
00:23:07.760 | which I think a lot of people may have heard of.
00:23:10.400 | It's basically a coding benchmark
00:23:12.440 | that comes from real world pull requests on GitHub.
00:23:14.920 | So if you can solve those,
00:23:15.920 | you can also probably solve other real world pull requests.
00:23:19.400 | I would say we still don't have benchmarks
00:23:24.200 | for the full versatility of agents.
00:23:26.520 | So for example, we don't have benchmarks
00:23:29.200 | that test whether agents can code and do web navigation,
00:23:32.840 | but we're working on that
00:23:34.080 | and hoping to release something in the next week or two.
00:23:36.720 | So if that sounds interesting to you, come talk to me
00:23:40.240 | and I will tell you more about it.
00:23:43.880 | - Cool, so I don't like making predictions,
00:23:46.880 | but I was told that I should be somewhat controversial,
00:23:50.480 | I guess, so I will try to do it anyway,
00:23:54.560 | although maybe none of these will be very controversial.
00:23:57.320 | The first thing is agent-oriented LLMs,
00:24:02.320 | like large language models for agents.
00:24:04.720 | My prediction is every large LLM trainer
00:24:08.120 | will be focusing on training models as agents.
00:24:10.280 | So every large language model will be a better agent model
00:24:13.920 | by mid 2025.
00:24:16.040 | Competition will increase, prices will go down,
00:24:21.200 | smaller models will become competitive as agents.
00:24:23.760 | So right now, actually agents are somewhat expensive
00:24:25.960 | to run in some cases,
00:24:27.080 | but I expect that that won't last six months.
00:24:29.400 | I bet we'll have much better agent models in six months.
00:24:32.680 | Another thing is instruction for LLMs.
00:24:38.600 | Another thing is instruction following ability
00:24:41.160 | specifically in agentic contexts will increase.
00:24:44.800 | And what that means is we'll have to do less
00:24:47.400 | manual engineering of agentic workflows
00:24:51.360 | and be able to do more by just prompting agents
00:24:54.080 | in more complex ways.
00:24:56.040 | Claude is already really good at this.
00:24:57.840 | It's not perfect, but it's already really, really good.
00:24:59.840 | And I expect the other models
00:25:00.960 | will catch up to Claude pretty soon.
00:25:02.720 | Error correction ability will increase,
00:25:06.560 | less getting stuck in loops.
00:25:07.720 | Again, this is something that Claude's
00:25:09.200 | already pretty good at.
00:25:10.520 | And I expect the others will follow.
00:25:13.680 | Agent benchmarks.
00:25:17.920 | Agent benchmarks will start saturating.
00:25:20.240 | So right now we have WebArena and SWE-Bench.
00:25:25.240 | I think WebArena is already too easy.
00:25:29.560 | It's not super easy, but it's already a bit too easy
00:25:35.720 | because the tasks we do in there
00:25:38.080 | are ones that take like two minutes for a human.
00:25:40.520 | So not too hard.
00:25:42.440 | And kind of historically in 2023,
00:25:46.880 | our benchmarks were too easy.
00:25:48.200 | So we built harder benchmarks like WebArena and SWE-Bench
00:25:51.120 | were both built in 2023.
00:25:52.800 | In 2024, our agents were too bad.
00:25:55.960 | So we built agents and now we're building better agents.
00:26:00.040 | In 2025, our benchmarks will be too easy.
00:26:02.400 | So we'll build better benchmarks, I'm guessing.
00:26:05.240 | So I would expect to see much more challenging
00:26:08.600 | agent benchmarks come out
00:26:10.040 | and we're already seeing some of them.
00:26:12.760 | In 2026, I don't know.
00:26:14.800 | I didn't write AGI, but we'll see.
00:26:19.320 | Then the human agent computer interface.
00:26:24.040 | I think one thing that we'll want to think about
00:26:27.080 | is what do we do at 75% success rate
00:26:29.880 | at things that we like actually care about.
00:26:33.600 | Right now we have 53% or 55% on SWE-Bench Verified,
00:26:38.600 | which is real world GitHub PRs.
00:26:43.280 | My impression is that the actual ability of models
00:26:47.800 | is maybe closer to 30 to 40%.
00:26:51.800 | So 30 to 40% of the things that I want an agent
00:26:54.680 | to solve on my own repos,
00:26:55.960 | it just solves without any human intervention.
00:26:59.520 | 80 to 90% it can solve without me opening an IDE,
00:27:03.080 | but I need to give it feedback.
00:27:05.320 | So how do we make that interaction smooth
00:27:09.280 | so that humans can audit the work of agents
00:27:13.240 | that are really, really good, but not perfect
00:27:15.720 | is going to be a big challenge.
00:27:17.280 | How can we expose the power of programming agents
00:27:22.480 | to other industries?
00:27:23.320 | So as programmers, I think not all of us
00:27:26.880 | are using agents every day in our programming,
00:27:29.560 | although we probably will be in months or maybe a year,
00:27:34.560 | but I think it will come very naturally to us as programmers
00:27:39.840 | because we know code, we know how to architect software
00:27:44.840 | and stuff like that.
00:27:47.080 | So I think the question is how do we put this in the hands
00:27:52.080 | of a lawyer or a chemist or somebody else
00:27:56.160 | and have them also be able to interact with it
00:27:58.640 | as naturally as we can.
00:27:59.760 | Another interesting thing is how can we redesign
00:28:03.960 | our existing systems for agents?
00:28:05.400 | So we had a paper on API-based web agents,
00:28:07.960 | and basically what we showed is if you take a web agent
00:28:11.440 | and the agent interacts not with a website,
00:28:14.120 | but with APIs, the accuracy goes way up,
00:28:16.640 | just because APIs are way easier to interact with.
00:28:18.800 | And in fact, like when I ask our agent,
00:28:23.800 | our agent is able to browse websites,
00:28:26.120 | but whenever I want it to interact with GitHub,
00:28:28.080 | I tell it do not browse the GitHub website,
00:28:30.120 | use the GitHub API because it's way more successful
00:28:32.320 | at doing that.
00:28:33.360 | So maybe every website is gonna need to have an API
00:28:36.760 | because we're gonna be having agents interact with them.
00:28:39.560 | About progress, I think progress will get faster.
00:28:45.840 | It's already fast.
00:28:46.840 | A lot of people are already overwhelmed,
00:28:48.520 | but I think it will continue.
00:28:50.880 | The reason why is agents are building agents
00:28:54.000 | and better agents will build better agents faster.
00:28:56.320 | So I expect that if you haven't interacted
00:29:01.320 | with a coding agent yet, it's pretty magical,
00:29:04.600 | like the stuff that it can do.
00:29:06.840 | So, yeah.
00:29:08.720 | And I have a call to action.
00:29:13.280 | I'm honestly, like I've been working
00:29:17.600 | on natural language processing and language models
00:29:21.520 | for what, 15 years now?
00:29:23.320 | And even for me, it's pretty impressive
00:29:25.480 | what like AI agents powered by strong language models
00:29:28.640 | can do.
00:29:29.480 | On the other hand, I believe that we should really make
00:29:33.880 | these powerful tools accessible.
00:29:35.800 | And what I mean by this is I don't think like,
00:29:39.680 | we should have these be opaque or limited
00:29:43.520 | to only a certain set of people.
00:29:46.280 | I feel like they should be affordable.
00:29:48.360 | They shouldn't be increasing the difference
00:29:51.640 | in the amount of power that people have.
00:29:53.760 | If anything, I'd really like them to kind of make it possible
00:29:58.160 | for people who weren't able to do things before
00:30:00.200 | to be able to do them well.
00:30:01.800 | Open source is one way to do that.
00:30:05.280 | That's why I'm working on open source.
00:30:08.280 | There are other ways to do that.
00:30:09.800 | Make things cheap, make things so you can serve them
00:30:13.480 | to people who aren't able to afford them easily.
00:30:16.480 | Like Duolingo is one example where they get all the people
00:30:19.840 | in the US to pay them $20 a month.
00:30:23.480 | So that they can give all the people in South America
00:30:26.160 | free language education so they can learn English
00:30:28.920 | and become more attractive on the job market, for instance.
00:30:33.920 | And so I think we can all think of ways
00:30:39.080 | that we can do that sort of thing.
00:30:41.520 | And if that resonates with you, please contribute.
00:30:43.840 | Of course, I'd be happy if you contribute to Open Hands
00:30:46.120 | and use it.
00:30:47.640 | But another way you can do that is just use
00:30:50.200 | open source solutions, contribute to them,
00:30:52.600 | research with them, and train strong open source models.
00:30:55.440 | So I see some people in the room
00:30:58.640 | who are already training models.
00:30:59.880 | It'd be great if you could train models for coding agents
00:31:02.640 | and make them cheap and yeah.
00:31:04.360 | Yeah, please, I was thinking about you, among others.
00:31:10.320 | Cool, yeah, that's all I have, thanks.
00:31:12.880 | - Slightly controversial thing is probably the nicest way
00:31:20.680 | to say hot takes.
00:31:21.760 | Any hot takes questions, actual hot takes?
00:31:28.400 | - Oh, I can also show the other agents that were working
00:31:32.520 | if anybody's interested, but yeah, sorry, go ahead.
00:31:34.480 | - Yeah, I have a couple of questions.
00:31:37.600 | So they're kind of paired maybe.
00:31:39.760 | The first thing is that you said that you're estimating
00:31:42.960 | that your agent is successfully resolving
00:31:47.960 | something like 30 to 40% of your issues,
00:31:50.040 | but that's like below what you saw on SWE-Bench.
00:31:52.880 | So I guess I'm wondering where that discrepancy
00:31:55.640 | is coming from.
00:31:56.800 | And then I guess my other second question,
00:31:58.360 | which is maybe broader in scope,
00:31:59.760 | is that like if you think of an agent
00:32:01.960 | as like a junior developer, and I say, go do something,
00:32:05.800 | then I expect maybe tomorrow to get a Slack message
00:32:09.000 | being like, hey, I ran into this issue.
00:32:10.840 | How can I resolve it?
00:32:12.280 | And like you said, your agent is like successfully solving
00:32:16.640 | like 90% of issues where you give it direct feedback.
00:32:19.240 | So are you thinking about how to get the agent
00:32:21.400 | to reach out to like, for planning when it's stuck
00:32:25.720 | or something like that?
00:32:26.760 | For like identify when it runs into a hole like that?
00:32:29.680 | - Yeah, so great.
00:32:32.160 | These are great questions.
00:32:33.200 | - Oh, sorry.
00:32:34.040 | The third question, which is a good,
00:32:35.480 | so this is the first two.
00:32:36.840 | And if so, are you going to add a benchmark
00:32:39.480 | for that second question?
00:32:41.480 | - Okay, great.
00:32:42.320 | Yeah, great questions.
00:32:43.160 | Okay, so the first question was,
00:32:45.120 | why do I think it's resolving less than 50%
00:32:47.360 | of the issues on SWE-Bench?
00:32:49.080 | So first, SWE-Bench is on popular open source repos
00:32:54.080 | and all of these popular open source repos
00:32:56.760 | were included in the training data
00:32:59.160 | for all of the language models.
00:33:01.040 | And so the language models already know these repos.
00:33:04.760 | In some cases, the language models already know
00:33:06.600 | the individual issues in SWE-Bench.
00:33:08.680 | So basically like some of the training data has leaked.
00:33:12.200 | And so it definitely will overestimate
00:33:14.920 | with respect to that.
00:33:15.760 | I don't think it's like horribly, horribly off,
00:33:18.880 | but I think it's boosting the accuracy by a little bit.
00:33:21.400 | So maybe that's the biggest reason why.
00:33:23.800 | In terms of asking for help
00:33:29.480 | and whether we're benchmarking asking for help,
00:33:32.320 | yes, we are.
00:33:34.800 | So one thing we're working on now,
00:33:38.520 | which we're hoping to put out soon
00:33:39.720 | is we basically made super vague SWE-Bench issues.
00:33:43.360 | Like I'm having a problem with the matrix multiply,
00:33:46.720 | please help.
00:33:48.160 | (laughs)
00:33:49.120 | Because these are like,
00:33:50.280 | if anybody's run a popular open source like framework,
00:33:55.120 | these are what half your issues are.
00:33:57.000 | You're like users show up and say like,
00:33:59.680 | my screen doesn't work, what's wrong or something.
00:34:02.680 | And so then you need to ask them questions
00:34:04.600 | and how to reproduce.
00:34:05.440 | So yeah, we're working on that.
00:34:08.120 | I think it, my impression is that agents
00:34:12.640 | are not very good at asking for help, even Claude.
00:34:15.840 | So like when they ask for help,
00:34:19.280 | they'll ask for help when they don't need it
00:34:20.800 | and then won't ask for help when they do need it.
00:34:22.600 | So this is definitely like an issue, I think.
00:34:25.280 | - Thanks for the great talk.
00:34:30.320 | I also have two questions.
00:34:32.200 | It's first one, can you talk a bit more
00:34:34.200 | about how the web agent interacts with websites?
00:34:37.880 | So is there a VLM that looks at the webpage layout
00:34:40.760 | and then you parse the HTML
00:34:42.000 | and select which buttons to click on?
00:34:44.360 | And if so, do you think there's a future
00:34:47.560 | where there's like, so I work at Bing, Microsoft AI.
00:34:51.560 | Do you think there's a future
00:34:52.520 | where they're like the same web index,
00:34:54.920 | but there's an agent-friendly web index
00:34:56.480 | where all the processing is done offline
00:34:58.600 | so that you don't need to spend time cleaning up,
00:35:02.880 | like cleaning up the HTML
00:35:04.240 | and figuring out what to click online.
00:35:06.160 | And any thoughts on that?
00:35:09.400 | - Yeah, so great question.
00:35:13.120 | There's a lot of work on web agents.
00:35:14.480 | I didn't go into like all of the details,
00:35:16.120 | but I think there's three main ways
00:35:20.200 | that agents interact with websites.
00:35:22.440 | The first way is the simplest way and the newest way,
00:35:26.160 | but it doesn't work very well,
00:35:27.600 | which is you take a screenshot of the website
00:35:32.600 | and then you click on a particular pixel value
00:35:35.560 | on the website.
00:35:37.320 | And like models are not very good at that at the moment.
00:35:41.160 | Like they'll misclick.
00:35:42.440 | There was this thing about how like Claude Computer Use
00:35:45.480 | started like looking at pictures
00:35:47.960 | of Yellowstone National Park or something like this.
00:35:50.400 | I don't know if you heard about this anecdote,
00:35:52.680 | but like people were like, oh, it's so human.
00:35:55.400 | It's looking for a vacation.
00:35:56.480 | And it was like, no, it probably just misclicked
00:35:58.560 | on the wrong pixels and accidentally clicked on an ad.
00:36:01.520 | So like, this is the simplest way.
00:36:04.360 | The second simplest way is you take the HTML
00:36:08.640 | and you basically identify elements in the HTML.
00:36:12.160 | You don't use any vision whatsoever.
00:36:14.840 | And then you say, okay, I want to click on this element.
00:36:17.520 | I want to enter text in this element
00:36:18.960 | or something like that.
00:36:19.960 | But HTML is too huge.
00:36:21.360 | So it actually, it usually gets condensed down
00:36:23.240 | into something called an accessibility tree,
00:36:25.080 | which was made for screen readers
00:36:26.400 | for visually impaired people.
00:36:28.280 | And so that's another way.
00:36:31.560 | And then the third way is kind of a hybrid
00:36:33.120 | where you present the screenshot,
00:36:34.400 | but you also present like a textual summary of the output.
00:36:38.160 | And that's the one that I think will probably work best.
00:36:42.320 | What we're using is we're just using text at the moment.
00:36:44.800 | And that's just an implementation issue
00:36:46.400 | that we haven't implemented the visual stuff yet,
00:36:49.240 | but that's kind of like we're working on it now.
00:36:52.000 | Another thing that I should point out
00:36:53.440 | is we actually have two modalities for web browsing.
00:36:56.040 | Very recently, we implemented this.
00:36:57.680 | And the reason why is because
00:36:59.280 | if you want to interact with full websites,
00:37:02.120 | you will need to click on all of the elements
00:37:04.040 | or have the ability to click on all of the elements.
00:37:05.920 | But most of our work that we need websites for
00:37:08.280 | is just web browsing and like gathering information.
00:37:11.560 | So we have another modality
00:37:12.760 | where we convert all of it to markdown
00:37:14.840 | because that's like way more concise
00:37:17.200 | and easier for the agent to deal with.
00:37:19.080 | And then can we create an index specifically for agents?
00:37:24.080 | Maybe a markdown index or something like that would be,
00:37:26.720 | you know, would make sense.
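
Sketching the markdown modality: one common way to get a concise page rendering is an HTML-to-markdown converter such as html2text, as below; this is illustrative and not OpenHands' actual browsing pipeline.

```python
# Sketch: fetch a page and hand the agent markdown instead of raw HTML.
import html2text
import requests

def page_as_markdown(url: str) -> str:
    html = requests.get(url, timeout=30).text
    converter = html2text.HTML2Text()
    converter.ignore_images = True  # keep the text concise for the prompt
    converter.body_width = 0        # don't hard-wrap lines
    return converter.handle(html)

print(page_as_markdown("https://example.com")[:2000])
```
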
00:37:28.200 | Oh, how would I make a successor to SWE-Bench?
00:37:32.280 | So, I mean, a first thing is there's like LiveCodeBench,
00:37:37.280 | which LiveCodeBench is basically continuously updating
00:37:40.640 | to make sure it doesn't leak
00:37:41.720 | into language model training data.
00:37:43.960 | That's easy to do for SWE-Bench
00:37:45.480 | because it comes from real websites
00:37:47.120 | and those real websites are getting new issues all the time.
00:37:49.320 | So you could just do it
00:37:51.080 | on the same benchmarks that they have there.
00:37:53.960 | There's also like a pretty large number of things
00:37:59.600 | covering various coding tasks.
00:38:02.040 | So like, for example, SWE-Bench is mainly fixing issues,
00:38:04.880 | but there's also like documentation.
00:38:07.960 | There's generating tests
00:38:10.800 | that actually test the functionality that you want.
00:38:14.120 | And there was a paper by a student at CMU
00:38:17.400 | on generating tests and stuff like that.
00:38:19.200 | So I feel like SWE-Bench is one piece of the puzzle,
00:38:23.000 | but you could also have like 10 different other tasks.
00:38:25.640 | And then you could have like a composite benchmark
00:38:27.400 | where you test all of these abilities,
00:38:28.840 | not just that particular one.
00:38:32.200 | Lots of other things too, but yeah.
00:38:35.160 | - Question from across.
00:38:40.840 | Use your mic, it would help.
00:38:42.240 | - Yeah, great talk, thank you.
00:38:46.720 | My question is about your experience
00:38:50.800 | designing agent architectures specifically.
00:38:54.640 | How much did you have to separate concerns
00:38:57.400 | in terms of task specific agents
00:39:00.960 | versus having one agent to do three or five things
00:39:04.400 | with a gigantic prompt with conditional paths and so on?
00:39:08.160 | - Yeah, so that's a great question.
00:39:09.600 | So we have a basic coding and browsing agent.
00:39:13.280 | And I won't say basic, like it's a good agent,
00:39:18.240 | but it does coding and browsing.
00:39:20.400 | It has instructions about how to do coding and browsing.
00:39:24.400 | That is enough for most things,
00:39:27.360 | especially given a strong language model
00:39:30.920 | that has a lot of background knowledge
00:39:32.240 | about how to solve different types of tasks
00:39:34.200 | and how to use different APIs and stuff like that.
00:39:37.520 | We do have a mechanism for something called microagents.
00:39:41.280 | And microagents are basically something
00:39:42.920 | that gets added to the prompt when a trigger is triggered.
00:39:46.080 | Right now it's very, very rudimentary.
00:39:48.160 | It's like if you detect the word GitHub anywhere,
00:39:52.280 | you get instructions about how to interact with GitHub,
00:39:54.680 | like use the API and don't browse.
00:39:56.840 | Also, another one that I just added is for NPM,
00:40:02.360 | the like JavaScript package manager.
00:40:04.840 | And NPM, when it runs and it hits a failure,
00:40:07.840 | it like hits in interactive terminals where it says,
00:40:12.280 | would you like to quit?
00:40:14.120 | Enter yes.
00:40:15.120 | And if that does it,
00:40:15.960 | it like stalls our agent for the timeout
00:40:17.760 | until like two minutes.
00:40:18.720 | So like I added a new microagent.
00:40:21.480 | Whenever it started using NPM,
00:40:23.120 | it would like get instructions
00:40:26.120 | about how to not use the interactive terminal
00:40:28.120 | and stuff like that.
00:40:28.960 | So that's our current solution.
00:40:31.560 | Honestly, I like it a lot.
00:40:32.880 | It's simple, it's easy to maintain.
00:40:34.480 | It works really well and stuff like that.
00:40:36.160 | But I think there is a world
00:40:37.160 | where you would want something more complex than that.
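
The mechanism can be sketched in a few lines; the trigger words and instruction text below are illustrative examples, not the actual OpenHands microagents.

```python
# Sketch of trigger-based microagents: keyword in the conversation ->
# extra instructions appended to the prompt.
MICROAGENTS = {
    "github": "When working with GitHub, prefer the REST API over browsing the website.",
    "npm": "Run npm non-interactively (e.g. CI=true) so it never waits for keyboard input.",
}

def augment_prompt(base_prompt: str, conversation: str) -> str:
    extras = [text for trigger, text in MICROAGENTS.items()
              if trigger in conversation.lower()]
    return base_prompt + ("\n\n" + "\n".join(extras) if extras else "")
```
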
00:40:39.680 | - Got it, thank you.
00:40:40.680 | - I got a question about MCP.
00:40:50.960 | I feel like this is the Anthropic Model Context Protocol.
00:40:50.960 | It seems like the most successful type of this,
00:40:54.040 | like standardization of interactions
00:40:56.160 | between computers and agents.
00:40:57.720 | Are you guys adopting it?
00:41:00.160 | Is there any other competing standard?
00:41:03.720 | Anything thought about it?
00:41:05.760 | - Yeah, I think the,
00:41:06.840 | so the Anthropic MCP is like a way,
00:41:11.400 | it's essentially a collection of APIs
00:41:13.600 | that you can use to interact
00:41:14.720 | with different things on the internet.
00:41:18.560 | I think it's not a bad idea,
00:41:21.480 | but it's like,
00:41:26.480 | there's a few things that bug me a little bit about it.
00:41:29.240 | It's like, we already have an API for GitHub.
00:41:33.000 | So why do we need an MCP for GitHub, right?
00:41:35.960 | You know, like GitHub has an API,
00:41:37.600 | the GitHub API is evolving.
00:41:39.480 | We can look up the GitHub API documentation.
00:41:43.240 | So it seems like kind of duplicated a little bit.
00:41:46.880 | And also they have a setting where it's like,
00:41:50.280 | you have to spin up a server to serve your GitHub stuff
00:41:54.200 | and you have to spin up a server
00:41:55.560 | to serve your like, you know, other stuff.
00:41:58.240 | And so I think it makes sense
00:42:01.680 | if you really care about like separation of concerns
00:42:04.160 | and security and like other things like this.
00:42:07.760 | But right now we haven't seen,
00:42:11.680 | we haven't seen that to have a lot more value
00:42:13.960 | than interacting directly with the tools
00:42:15.560 | that are already provided.
00:42:16.520 | And that kind of goes into my general philosophy,
00:42:18.480 | which is we're already developing things for programmers.
00:42:22.040 | You know, how is an agent different from a programmer?
00:42:27.040 | And it is different, obviously, you know,
00:42:31.200 | like agents are different from programmers,
00:42:33.560 | but they're not that different at this point.
00:42:35.440 | So we can kind of interact with the interfaces
00:42:38.080 | we create for programmers.
00:42:40.200 | Yeah.
00:42:41.560 | I might change my mind later though.
00:42:43.000 | So we'll see.
00:42:45.560 | - Yeah, hi, thanks.
00:42:47.480 | Very interesting talk.
00:42:48.720 | You were saying that the agents you have right now
00:42:52.560 | solve like maybe 30% of your issues out of the gate.
00:42:57.360 | I'm curious, of the things that it doesn't do,
00:43:01.280 | is there like a pattern that you observe?
00:43:03.680 | Like, oh, like these are the sorts of things
00:43:05.400 | that it just seems to really struggle with
00:43:07.240 | or is it just seemingly random?
00:43:09.080 | - It's definitely not random.
00:43:12.040 | It's like, if you think it's more complex,
00:43:14.160 | then it's like, just intuitively,
00:43:16.400 | it's more likely to fail.
00:43:17.960 | I've gotten a bit better at prompting also.
00:43:22.840 | So like, just to give an example,
00:43:24.880 | it will sometimes fail to fix a GitHub workflow
00:43:35.320 | because it will not look at the GitHub workflow
00:43:38.160 | and understand what the GitHub workflow is doing
00:43:40.040 | before it solves the problem.
00:43:42.240 | So I think actually probably the biggest thing
00:43:44.560 | that it fails at is,
00:43:46.080 | or that our agent plus Claude fails at
00:43:49.880 | is insufficient information gathering
00:43:51.880 | before trying to solve the task.
00:43:53.880 | And so if you provide all,
00:43:56.040 | if you provide instructions
00:43:57.680 | that it should do information gathering beforehand,
00:44:00.120 | it tends to do well.
00:44:01.120 | If you don't provide sufficient instructions,
00:44:02.960 | it will try to solve the task
00:44:04.640 | without like fully understanding the task first
00:44:07.000 | and then fail and then you need to go back
00:44:08.640 | and give additional feedback.
00:44:12.480 | Another example, like, I love this example.
00:44:15.120 | While I was developing the monitor website
00:44:19.240 | that I showed here,
00:44:20.560 | we had a really tricky bug
00:44:22.120 | where it was writing out a cache file
00:44:25.040 | to a different directory
00:44:26.120 | than it was reading the cache file from.
00:44:28.120 | And I had no idea,
00:44:30.120 | I had no idea what was going on.
00:44:31.800 | I thought the bug was in a different part of the code.
00:44:34.600 | But what I asked it to do was
00:44:37.400 | come up with five possible reasons
00:44:39.120 | why this could be failing
00:44:40.120 | and decreasing order of likelihood
00:44:41.880 | and examine all of them.
00:44:43.400 | And that worked.
00:44:44.240 | And it could just go in and like do that.
00:44:46.200 | So like, I think a certain level of like scaffolding
00:44:50.160 | about like how it should sufficiently
00:44:54.600 | gather all the information that's necessary
00:44:56.480 | in order to solve the task is like,
00:44:58.320 | if that's missing,
00:44:59.160 | then that's probably the biggest failure point at the moment.
00:45:02.160 | - Thanks.
00:45:04.040 | - Yeah.
00:45:08.280 | - I'm just using this as a chance to ask you all my questions.
00:45:11.600 | You had a slide on here about like self-improving agents
00:45:14.480 | or something like that with memory.
00:45:16.240 | It's like a really throwaway slide
00:45:19.800 | for like a super powerful idea.
00:45:21.920 | It got me thinking about how I would do it.
00:45:24.120 | I have no idea how.
00:45:25.760 | So I just wanted you to chain a thought more on this.
00:45:28.760 | - Yeah, self-improving.
00:45:31.680 | So I think the biggest reason,
00:45:36.000 | like the simplest possible way
00:45:38.200 | to create a self-improving agent
00:45:40.520 | is to have a really, really strong language model
00:45:42.720 | that with infinite context.
00:45:44.720 | And it can just go back
00:45:45.960 | and look at like all of its past experiences
00:45:48.280 | and, you know, learn from them.
00:45:50.400 | You might also want to remove the bad stuff
00:45:53.280 | just so it doesn't over-index
00:45:54.600 | on its like failed past experiences.
00:45:56.960 | But the problem is a really powerful language model
00:46:00.920 | is large, infinite context is expensive.
00:46:04.240 | We don't have a good way to index into it
00:46:06.200 | because like RAG, at least in my experience,
00:46:10.320 | RAG from language to code doesn't work super well.
00:46:13.920 | So I think in the end, it's like,
00:46:16.240 | that's the way I would like to solve this problem.
00:46:17.960 | I'd like to have an infinite context
00:46:19.240 | and somehow be able to index into it appropriately.
00:46:21.880 | And I think that would mostly solve it.
00:46:24.560 | Another thing you can do is fine-tuning.
00:46:26.560 | So I think like RAG is one way
00:46:28.760 | to get information into your model.
00:46:30.040 | Fine-tuning is another way
00:46:30.960 | to get information into your model.
00:46:32.280 | So that might be another way of continuously improving.
00:46:36.000 | Like you identify when you did a good job
00:46:38.000 | and then just add all of the good examples into your model.
00:46:41.720 | - Yeah, so you know how like Voyager
00:46:44.480 | tries to write code into a skill library
00:46:46.320 | and then reuses the skill library, right?
00:46:47.880 | So it improves in the sense that
00:46:49.840 | it just builds up the skill library over time.
00:46:51.720 | - Yep.
00:46:53.000 | - One thing I was like thinking about,
00:46:55.120 | and there's this idea from Devin, your arch nemesis,
00:47:00.320 | of playbooks.
00:47:01.480 | I don't know if you've seen them.
00:47:02.760 | - Yeah, I mean, we're calling them workflows,
00:47:04.680 | but they're simpler.
00:47:05.520 | - Yeah, so like basically like you should,
00:47:07.240 | like once a workflow works,
00:47:09.520 | you can kind of like persist them as a skill library.
00:47:11.680 | - Yep.
00:47:12.520 | - Right, like I feel like that's like some in between,
00:47:16.600 | like you said, you know,
00:47:17.560 | it's hard to do RAG between language and code,
00:47:19.600 | but I feel like that is RAG for,
00:47:22.120 | like I've done this before.
00:47:23.480 | Last time I did it, this worked.
00:47:25.560 | So I'm just going to shortcut
00:47:26.920 | all the stuff that failed before.
00:47:29.680 | - Yeah, I totally, I think it's possible.
00:47:31.440 | It's just, you know, not trivial at the same time.
00:47:35.200 | - Yeah.
00:47:36.040 | - I'll explain the two curves.
00:47:37.200 | So basically the baseline is just an agent
00:47:40.280 | that does it from scratch every time.
00:47:42.360 | And this curve up here is agent workflow memory,
00:47:45.720 | where it's like adding the successful experiences
00:47:49.560 | back into the prompt.
00:47:50.880 | Why is this improving?
00:47:53.840 | The reason why is because just it failed
00:47:56.400 | on the first few examples,
00:47:57.520 | and for the average to catch up,
00:47:59.480 | it took a little bit of time.
00:48:01.280 | So it's not like this is actually improving it.
00:48:03.320 | You could just basically view the,
00:48:05.920 | this one is constant.
00:48:08.240 | And then this one is like improving like this.
00:48:10.960 | Basically you can see it's continuing to go up, yeah.
00:48:13.880 | - How do you think we're going to solve
00:48:17.320 | the authentication problem for agents right now?
00:48:19.880 | - When you say authentication,
00:48:22.520 | you mean like credentials, like, yeah.
00:48:25.200 | - Yeah, 'cause I've seen a few startup solutions today,
00:48:27.920 | but it seems like it's limited to the amount of websites
00:48:30.600 | or actual authentication methods
00:48:32.440 | that it's capable of performing today.
00:48:34.760 | - Yeah, great question.
00:48:36.320 | So my preferred solution to this at the moment
00:48:41.040 | is GitHub fine-grained authentication tokens.
00:48:44.680 | And GitHub fine-grained authentication tokens
00:48:47.240 | allow you to specify on a very granular basis.
00:48:53.120 | On this repo, you have permission to do this.
00:48:55.400 | On this repo, you have permission to do this.
00:48:57.640 | You also can prevent people from pushing to the main branch
00:49:01.640 | unless they get approved.
00:49:03.640 | You can do all of these other things.
00:49:05.080 | And I think these were all developed for human developers
00:49:08.200 | or like the branch protection rules
00:49:09.760 | were developed for human developers.
00:49:11.120 | The fine-grained authentication tokens
00:49:12.480 | were developed for GitHub apps.
00:49:14.080 | I think for GitHub, maybe just pushing this
00:49:19.880 | like a little bit more is the way to do this.
00:49:22.640 | For other things, they're totally not prepared
00:49:26.360 | to give that sort of fine-grained control.
00:49:28.560 | Like most APIs don't have something
00:49:30.200 | like a fine-grained authentication token.
00:49:32.280 | And that goes into my like comment
00:49:33.640 | that we're gonna need to prepare the world for agents,
00:49:35.880 | I think.
00:49:37.520 | But I think like the GitHub authentication tokens
00:49:39.880 | are like a good template
00:49:41.240 | for how you could start doing that maybe.
00:49:42.640 | But yeah, I don't know.
00:49:43.800 | I don't have an answer.
00:49:45.440 | - I'll let you know if I find one.
00:49:46.560 | - Okay, yeah, thank you.
00:49:47.760 | Cool.
00:49:50.560 | I'm gonna finish up.
00:49:51.680 | Let me just see.
00:49:53.040 | Okay, so this one did write a script.
00:50:00.800 | I'm not gonna actually read it for you.
00:50:03.560 | And then the other one, let's see.
00:50:06.320 | Yeah, so it sent a PR.
00:50:12.960 | Sorry, what is the PR URL?
00:50:18.000 | (silence)
00:50:20.160 | So I don't know if this...
00:50:24.920 | Sorry, that's taking way longer than it should.
00:50:29.680 | Okay, cool.
00:50:32.240 | Yeah, so this one sent a PR.
00:50:35.800 | I'll tell you later if this actually like successfully...
00:50:40.800 | Oh, no, it's deployed on Vercel.
00:50:42.240 | So I can actually show you.
00:50:44.400 | But let me try this real quick.
00:50:46.640 | Sorry, I know I don't have time.
00:50:48.880 | Yeah, there you go.
00:51:11.680 | I have pie charts now, so yeah.
00:51:15.760 | It's so fun.
00:51:16.600 | It's so fun to play with these things
00:51:17.920 | 'cause you could just do that while I'm giving a talk.
00:51:21.040 | Things like that.
00:51:21.880 | So yeah, thanks.
00:51:23.040 | (audience applauds)