back to index

Best of 2024 in Agents (from #1 on SWE-Bench Full, Prof. Graham Neubig of OpenHands/AllHands)


Chapters

0:00 Welcome to Latent Space Live at NeurIPS 2024
0:29 State of LLM Agents in 2024
2:20 Professor Graham Neubig's Insights on Agents
3:57 Live Demo: Coding Agents in Action
8:20 Designing Effective Agents
14:13 Choosing the Right Language Model for Agents
16:24 Planning and Workflow for Agents
22:21 Evaluation and Future Predictions for Agents
25:31 Future of Agent Development
25:56 Human-Agent Interaction Challenges
26:48 Expanding Agent Use Beyond Programming
27:25 Redesigning Systems for Agent Efficiency
28:03 Accelerating Progress with Agent Technology
28:28 Call to Action for Open Source Contributions
30:36 Q&A: Agent Performance and Benchmarks
33:23 Q&A: Web Agents and Interaction Methods
37:16 Q&A: Agent Architectures and Improvements
43:09 Q&A: Self-Improving Agents and Authentication
47:31 Live Demonstration and Closing Remarks

Whisper Transcript | Transcript Only Page

00:00:00.000 | (upbeat music)
00:00:02.580 | - Okay, hi everyone.
00:00:08.520 | So I was given the task of talking about agents in 2024
00:00:13.520 | and this is an impossible task
00:00:16.780 | because there are so many agents, so many agents in 2024.
00:00:21.600 | So this is gonna be strongly covered
00:00:23.460 | by like my personal experience
00:00:25.160 | and what I think is interesting and important,
00:00:26.960 | but I think it's an important topic.
00:00:29.120 | So let's go ahead.
00:00:30.320 | So the first thing I'd like to think about is,
00:00:36.480 | let's say I gave you, you know,
00:00:38.800 | a highly competent human, some tools.
00:00:41.360 | Let's say I give you a web browser
00:00:44.760 | and a terminal or a file system
00:00:47.520 | and the ability to edit text or code.
00:00:51.580 | What could you do with that?
00:00:55.260 | Everything, yeah.
00:00:58.280 | Probably a lot of things.
00:00:59.360 | This is like 99% of my, you know,
00:01:01.720 | daily life I guess when I'm working.
00:01:05.560 | So I think this is a pretty powerful tool set
00:01:09.960 | and what I am trying to do
00:01:12.360 | and what I think some other people are trying to do
00:01:14.560 | is come up with agents that are able to, you know,
00:01:16.820 | manipulate these things,
00:01:18.240 | web browsing, coding, running code in successful ways.
00:01:21.800 | So there was a little bit about my profile.
00:01:25.360 | I'm a professor at CMU, chief scientist at All Hands AI,
00:01:28.240 | building open source coding agents.
00:01:30.480 | I'm maintainer of Open Hands,
00:01:32.560 | which is an open source coding agent framework.
00:01:35.400 | And I'm also a software developer
00:01:38.480 | and I like doing lots of coding and, you know,
00:01:43.480 | shipping new features and stuff like this.
00:01:45.480 | So building agents that help me to do this, you know,
00:01:48.180 | is kind of an interesting thing, very close to me.
00:01:50.800 | So the first thing I'd like to do
00:01:54.000 | is I'd like to try some things
00:01:55.760 | that I haven't actually tried before.
00:01:58.160 | If anybody has, you know, tried to give a live demo,
00:02:01.460 | you know, this is very, very scary whenever you do it
00:02:04.960 | and it might not work.
00:02:05.800 | So it might not work this time either.
00:02:08.040 | But I wanna show you like three things
00:02:10.520 | that I typically do with coding agents in my everyday work.
00:02:15.080 | I use coding agents maybe five to 10 times a day
00:02:18.600 | to help me solve my own problems.
00:02:21.800 | And so this is a first one.
00:02:23.000 | This is a data science task,
00:02:25.400 | which says I want to create scatter plots
00:02:28.760 | that show the increase of the SWE bench score over time.
00:02:32.100 | And so I wrote a kind of concrete prompt about this.
00:02:36.480 | Agents work better with like somewhat concrete prompts.
00:02:39.760 | And I'm gonna throw this into open hands and let it work.
00:02:44.760 | And I'll go back to that in a second.
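
For a concrete picture, here is a minimal sketch of the kind of plotting script a prompt like this might produce; the CSV filename and the "date"/"system"/"score" columns are assumptions for illustration, not the data the agent actually used.

```python
# Sketch only: hypothetical CSV layout, not the agent's actual output.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("swe_bench_scores.csv", parse_dates=["date"])  # assumed file/columns

fig, ax = plt.subplots(figsize=(8, 5))
for system, group in df.groupby("system"):
    ax.scatter(group["date"], group["score"], label=system)

ax.set_xlabel("Date")
ax.set_ylabel("SWE-Bench resolve rate (%)")
ax.set_title("SWE-Bench scores over time")
ax.legend()
fig.tight_layout()
fig.savefig("swe_bench_over_time.png")
```
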
00:02:52.320 | Another thing that I do is I create new software.
00:02:56.640 | And I've been using a service,
00:03:02.920 | a particular service, I won't name it,
00:03:06.440 | for sending emails and I'm not very happy with it.
00:03:09.120 | So I want to switch over to this new service
00:03:11.380 | called resend.com, which makes it easier to send emails.
00:03:15.040 | And so I'm going to ask it to read the docs
00:03:17.760 | for the resend.com API and come up with a script
00:03:20.360 | that allows me to send emails.
00:03:22.320 | The input to the script should be a CSV file
00:03:24.440 | and the subject and body should be provided
00:03:26.720 | in Jinja2 templates.
00:03:28.840 | So I'll start another agent
00:03:32.800 | and try to get it to do that for me.
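
As a rough idea of what is being asked for, a minimal sketch of such a script might look like the following; the CSV columns, template filenames, and the Resend endpoint and JSON fields are assumptions that should be checked against the resend.com docs.

```python
# Sketch only: verify the Resend API details against their documentation.
import csv
import os

import requests
from jinja2 import Template

API_KEY = os.environ["RESEND_API_KEY"]

with open("subject.j2") as f:          # assumed Jinja2 template files
    subject_tmpl = Template(f.read())
with open("body.j2") as f:
    body_tmpl = Template(f.read())

with open("recipients.csv", newline="") as f:   # assumed columns: email, name, ...
    for row in csv.DictReader(f):
        resp = requests.post(
            "https://api.resend.com/emails",    # assumed endpoint
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                "from": "me@example.com",
                "to": [row["email"]],
                "subject": subject_tmpl.render(**row),
                "text": body_tmpl.render(**row),
            },
            timeout=30,
        )
        resp.raise_for_status()
```
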
00:03:37.000 | And let's go with the last one.
00:03:42.600 | The last one I do is improving existing software.
00:03:46.820 | And in order, you know, once you write software,
00:03:50.640 | you usually don't throw it away.
00:03:51.760 | You go in and like actually improve it iteratively.
00:03:55.240 | This software that I have is something I created
00:03:59.000 | without writing any code.
00:04:01.080 | It's basically software to monitor
00:04:03.520 | how much our agents are contributing
00:04:06.720 | to the open hands repository.
00:04:09.040 | And on the, let me make that a little bit bigger.
00:04:15.440 | On the left side, I have the number of issues
00:04:18.180 | where it sent a pull request,
00:04:27.580 | whether it was merged in purple,
00:04:29.940 | closed in red, or is still open in green.
00:04:33.420 | And so these are like, you know, it's helping us monitor.
00:04:38.380 | But one thing it doesn't tell me is the total number.
00:04:40.700 | And I kind of want that feature added to this software.
00:04:43.980 | So I'm gonna try to add that too.
00:04:46.080 | So I'll take this, I'll take this prompt.
00:04:51.080 | And here I want to open up specifically that GitHub repo.
00:05:03.120 | So I'll open up that repo and paste in the prompt asking it,
00:05:09.600 | I asked it to make a pie chart for each of these
00:05:11.760 | and give me the total over the entire time period
00:05:14.140 | that I'm monitoring.
00:05:14.980 | So we'll do that.
00:05:17.540 | And so now I have, let's see, I have some agents.
00:05:21.080 | Oh, this one already finished.
00:05:23.340 | Let's see.
00:05:25.440 | So this one already finished.
00:05:29.460 | You can see it finished analyzing the SWE-Bench repository.
00:05:33.540 | It wrote a demonstration of,
00:05:40.200 | yeah, I'm trying to do that now, actually.
00:05:42.340 | It wrote a demonstration of how much each of the systems
00:05:51.580 | have improved over time.
00:05:53.280 | And I asked it to label the top three
00:05:56.160 | for each of the datasets.
00:05:57.220 | And so it labeled OpenHands as being the best one
00:05:59.480 | for SWE-Bench normal.
00:06:01.840 | For SWE-Bench Verified,
00:06:03.120 | it has like the Amazon Q agent and OpenHands.
00:06:06.360 | For SWE-Bench Lite, it has three over here.
00:06:11.360 | So you can see like, that's pretty useful, right?
00:06:15.840 | If you're a researcher, you do data analysis all the time.
00:06:18.360 | I did it while I was talking to all of you
00:06:19.880 | and making a presentation.
00:06:21.320 | So that's pretty nice.
00:06:24.320 | I doubt the other two are finished yet.
00:06:26.440 | That would be impressive if the, yeah.
00:06:27.920 | So I think they're still working.
00:06:29.360 | So maybe we'll get back to them
00:06:30.520 | at the end of the presentation.
00:06:32.040 | So these are the kinds of things
00:06:35.960 | that I do every day with coding agents now.
00:06:38.200 | And it's, or software development agents.
00:06:40.440 | It's pretty impressive.
00:06:41.600 | The next thing I'd like to talk about a little bit
00:06:46.320 | is things I worry about when designing agents.
00:06:48.440 | So we're designing agents to, you know,
00:06:50.560 | do a very difficult task of like navigating websites,
00:06:54.800 | writing code, other things like this.
00:06:57.160 | And within 2024, there's been like a huge improvement
00:07:00.640 | in the methodology that we use to do this.
00:07:04.480 | But there's a bunch of things we think about.
00:07:06.320 | There's a bunch of interesting papers
00:07:07.680 | and I'd like to introduce a few of them.
00:07:09.640 | So the first thing I worry about
00:07:12.440 | is the agent computer interface.
00:07:14.920 | Like how do we get an agent to interact with computers?
00:07:18.200 | And how do we provide agents with the tools to do the job?
00:07:23.200 | And within OpenHands, we are doing the thing on the right,
00:07:28.880 | but there's also a lot of agents
00:07:31.640 | that do the thing on the left.
00:07:33.400 | So the thing on the left is you give like agents
00:07:36.480 | kind of granular tools.
00:07:38.680 | You give them tools like,
00:07:39.960 | or let's say your instruction is,
00:07:43.320 | I want to determine the most cost-effective country
00:07:45.600 | to purchase the smartphone model Kodak One.
00:07:48.440 | Other countries to consider are the USA,
00:07:50.240 | Japan, Germany, and India.
00:07:52.360 | And you have a bunch of available APIs.
00:07:54.800 | And so what you do for some agents
00:07:57.640 | is you provide them all of these tools,
00:07:59.920 | APIs as tools that they can call.
00:08:02.800 | And so in this particular case,
00:08:05.000 | in order to solve this problem,
00:08:06.280 | you'd have to make about like 30 tool calls, right?
00:08:08.560 | You'd have to call lookup rates for Germany.
00:08:12.320 | You'd have to look it up for the US, Japan, and India.
00:08:14.840 | That's four tool calls.
00:08:16.480 | And then you'd go through
00:08:17.320 | and do all of these things separately.
00:08:20.720 | And the method that we adopt in OpenHands instead
00:08:24.240 | is we provide these tools,
00:08:26.120 | but we provide them by just giving a coding agent
00:08:28.600 | the ability to call arbitrary Python code.
00:08:32.280 | And in the arbitrary Python code, it can call these tools.
00:08:36.560 | We expose these tools as APIs that the model can call.
00:08:39.680 | And what that allows us to do
00:08:40.880 | is instead of writing 20 tool calls, making 20 LLM calls,
00:08:45.160 | you write a program that runs all of these all at once,
00:08:47.680 | and it gets the result.
00:08:49.000 | And of course it can execute that program.
00:08:50.600 | It can make a mistake.
00:08:51.960 | It can get errors back and fix things,
00:08:54.960 | but that makes our job a lot easier.
00:08:56.560 | And this has been really like instrumental
00:08:58.180 | to our success, I think.
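
To make the contrast concrete, here is a toy sketch in the spirit of that approach; lookup_rates and lookup_phone_price are hypothetical stand-ins for the APIs in the example, not OpenHands' actual tool interface.

```python
# A sketch of code-acting vs. tool-calling. The two functions below are
# hypothetical APIs; they return fixed numbers so the sketch runs.
def lookup_rates(country: str) -> tuple[float, float]:
    """Hypothetical API: return (tax_rate, shipping_cost) for a country."""
    return 0.10, 25.0

def lookup_phone_price(model: str, country: str) -> float:
    """Hypothetical API: return the local price of a phone model."""
    return 699.0

# Tool-calling style: each lookup would be a separate LLM round-trip.
# Code-acting style: the agent writes one program that calls the same APIs.
countries = ["USA", "Japan", "Germany", "India"]
costs = {}
for country in countries:
    tax_rate, shipping = lookup_rates(country)
    price = lookup_phone_price("Kodak One", country)
    costs[country] = price * (1 + tax_rate) + shipping

best = min(costs, key=costs.get)
print(f"Most cost-effective country: {best} ({costs[best]:.2f})")
```
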
00:09:01.200 | Another part of this is what tools does the agent need?
00:09:05.220 | And I think this depends on your use case.
00:09:07.700 | We're kind of extreme,
00:09:09.160 | and we're only giving the agent five tools,
00:09:13.240 | or maybe six tools.
00:09:15.280 | And what are they?
00:09:16.960 | The first one is program execution.
00:09:19.600 | So it can execute Bash programs,
00:09:21.400 | and it can execute Jupyter notebooks.
00:09:23.960 | It can execute cells in Jupyter notebooks.
00:09:26.600 | So those are two tools.
00:09:30.200 | Another one is a file editing tool.
00:09:32.360 | And the file editing tool allows you
00:09:35.200 | to browse parts of files,
00:09:36.920 | and kind of read them, overwrite them,
00:09:40.320 | other stuff like this.
00:09:41.560 | And then we have another global search and replace tool.
00:09:43.800 | So it's actually two tools for file editing.
00:09:46.160 | And then a final one is web browsing.
00:09:49.000 | Web browsing, I'm kind of cheating
00:09:50.360 | when I call it only one tool.
00:09:51.640 | You actually have like scroll and text input
00:09:54.360 | and click and other stuff like that.
00:09:56.120 | But these are basically the only things
00:09:58.360 | we allow the agent to do.
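
Roughly, that tool surface could look like the sketch below; the names and signatures are illustrative rather than OpenHands' actual tool definitions, and the Jupyter-cell and browsing tools are left out because they need a kernel and a browser driver rather than a few lines.

```python
# Sketch of a minimal coding-agent tool set (illustrative names/signatures).
import subprocess

def execute_bash(command: str) -> str:
    """Run a shell command and return its combined output."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

def edit_file(path: str, start_line: int, end_line: int, new_text: str) -> None:
    """Overwrite a line range in a file (browse/read/overwrite style editing)."""
    with open(path) as f:
        lines = f.readlines()
    if not new_text.endswith("\n"):
        new_text += "\n"
    lines[start_line - 1:end_line] = [new_text]
    with open(path, "w") as f:
        f.writelines(lines)

def search_replace(path: str, old: str, new: str) -> None:
    """Global search-and-replace within a single file."""
    with open(path) as f:
        text = f.read()
    with open(path, "w") as f:
        f.write(text.replace(old, new))

TOOLS = {
    "execute_bash": execute_bash,
    "edit_file": edit_file,
    "search_replace": search_replace,
}
```
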
00:10:00.640 | What, then the question is like,
00:10:03.600 | what if we want it to allow it to do something else?
00:10:06.480 | And the answer is, well, you know,
00:10:09.560 | human programmers already have a bunch of things
00:10:11.960 | that they use.
00:10:13.200 | They have the requests PyPI library.
00:10:15.040 | They have the PDF-to-text PyPI library.
00:10:18.640 | They have like all these other libraries
00:10:20.400 | in the Python ecosystem that they can use.
00:10:22.680 | And so if we provide a coding agent
00:10:24.840 | with all these libraries,
00:10:25.720 | it can do things like data visualization
00:10:27.800 | and other stuff that I just showed you.
00:10:29.160 | So it can also git clone repositories
00:10:32.200 | and other things like this.
00:10:34.360 | The agents are super good at using the GitHub API also.
00:10:37.480 | So they can do things on GitHub,
00:10:40.320 | like finding all of the comments on your issues
00:10:43.200 | or checking GitHub actions and stuff.
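
For instance, an agent that can write plain Python can fetch issue comments with a few lines like the sketch below; the repo name and issue number are placeholders, and the token is assumed to live in an environment variable.

```python
# Sketch: standard GitHub REST route for issue comments; repo/issue are placeholders.
import os
import requests

def issue_comments(owner: str, repo: str, number: int) -> list:
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/issues/{number}/comments",
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

for comment in issue_comments("some-org", "some-repo", 1):
    print(comment["user"]["login"], ":", comment["body"][:80])
```
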
00:10:45.040 | The second thing I think about
00:10:48.920 | is the human agent interface.
00:10:50.360 | So this is like, how do we get humans
00:10:52.240 | to interact with agents?
00:10:54.040 | I already showed you one variety
00:10:56.040 | of our human agent interface.
00:10:57.200 | It's basically a chat window
00:10:58.400 | where you can browse through the agent's results
00:11:00.160 | and things like this.
00:11:01.240 | This is very, very difficult.
00:11:04.400 | I don't think anybody has a good answer to this.
00:11:07.080 | And I don't think we have a good answer to this,
00:11:08.800 | but the guiding principles that I'm trying to follow
00:11:13.160 | are we want to present enough info to the user.
00:11:16.200 | So we want to present them with, you know,
00:11:19.200 | what the agent is doing
00:11:21.920 | in the form of a kind of English description.
00:11:26.040 | So you can see here,
00:11:27.720 | you can see here, every time it takes an action,
00:11:32.280 | it says like, I will help you create a script
00:11:34.360 | for sending emails.
00:11:35.840 | When it runs a bash command,
00:11:39.880 | sorry, that's a little small.
00:11:41.400 | When it runs a bash command,
00:11:43.280 | it will say ran a bash command.
00:11:46.360 | It won't actually show you the whole bash command
00:11:48.440 | or the whole Jupyter Notebook
00:11:49.600 | because it can be really large,
00:11:50.800 | but you can open it up and see
00:11:52.440 | if you want to by clicking on this.
00:11:54.840 | So like, if you want to explore more,
00:11:57.280 | you can click over to the Jupyter Notebook
00:11:59.160 | and see what's displayed in the Jupyter Notebook.
00:12:01.400 | And you get like lots and lots of information.
00:12:04.200 | So that's one thing.
00:12:05.360 | Another thing is go where the user is.
00:12:13.560 | So like if the user is already interacting
00:12:16.200 | in a particular setting,
00:12:17.760 | then I'd like to, you know, integrate into that setting,
00:12:20.360 | but only to a point.
00:12:22.520 | So at OpenHands, we have a chat UI for interaction.
00:12:26.320 | We have a GitHub plugin for tagging and resolving issues.
00:12:29.280 | So basically what you do is you do @OpenHandsAgent
00:12:33.360 | and the OpenHandsAgent will like see that comment
00:12:37.240 | and be able to go in and fix things.
00:12:38.680 | So if you say @OpenHandsAgent,
00:12:41.000 | tests are failing on this PR, please fix the tests.
00:12:43.800 | It will go in and fix the tests for you
00:12:45.280 | and stuff like this.
00:12:46.280 | Another thing we have is a remote runtime
00:12:50.840 | for launching headless jobs.
00:12:52.480 | So if you want to launch like a fleet of agents
00:12:54.600 | to solve, you know, five different problems at once,
00:12:57.800 | you can also do that through an API.
00:12:59.240 | So we have these interfaces.
00:13:02.840 | And this probably depends on the use case.
00:13:04.600 | So like depending, if you're a coding agent,
00:13:06.920 | you want to do things one way.
00:13:08.040 | If you're like insurance auditing agent,
00:13:10.800 | you'll want to do things other ways, obviously.
00:13:13.000 | Another thing I think about a lot
00:13:16.680 | is choosing a language model.
00:13:19.760 | And for agentic LMs, we have to have a bunch of things
00:13:24.760 | work really well.
00:13:26.520 | The first thing is really, really good
00:13:28.440 | instruction following ability.
00:13:30.480 | And if you have really good instruction following ability,
00:13:33.160 | it opens up like a ton of possible applications for you.
00:13:36.620 | Tool use and coding ability.
00:13:39.360 | So if you provide tools,
00:13:40.440 | it needs to be able to use them well.
00:13:42.280 | Environment understanding.
00:13:44.880 | So it needs, like if you're building a web agent,
00:13:48.440 | it needs to be able to understand web pages
00:13:50.320 | either through a vision or through text.
00:13:53.320 | And error awareness and recovery ability.
00:13:57.200 | So if it makes a mistake, it needs to be able to,
00:13:59.720 | you know, figure out why it made a mistake,
00:14:01.520 | come up with alternative strategies
00:14:03.440 | and other things like this.
00:14:04.800 | Under the hood, in all of the demos that I did now,
00:14:12.480 | Claude, we're using Claude.
00:14:15.120 | Claude has all of these abilities.
00:14:17.400 | Very good, not perfect, but very good.
00:14:20.440 | Most others don't have these abilities quite as much.
00:14:24.480 | So like GPT-4o doesn't have very good
00:14:27.560 | error recovery ability.
00:14:29.240 | And so because of this, it will go into loops
00:14:31.200 | and do the same thing over and over and over again,
00:14:33.040 | whereas Claude does not do this.
00:14:35.120 | Claude, if you use the agents enough,
00:14:38.400 | you get used to their kind of like personality
00:14:40.800 | and Claude says, hmm, let me try a different approach a lot.
00:14:44.680 | So, you know, obviously it's been trained in some way
00:14:47.640 | to, you know, elicit this ability.
00:14:49.480 | We did an evaluation.
00:14:52.800 | This is old and we need to update this basically,
00:14:56.280 | but we evaluated Claude, GPT-4o, o1-mini,
00:15:01.280 | Llama 405B, and DeepSeek 2.5
00:15:05.280 | on being a good code agent within our framework.
00:15:07.880 | And Claude was kind of head and shoulders above the rest.
00:15:11.440 | GPT-4o was kind of okay.
00:15:12.960 | The best open source model was Llama 3.1 405B.
00:15:16.680 | This needs to be updated
00:15:17.720 | 'cause this is like a few months old by now
00:15:19.520 | and, you know, things are moving really, really fast,
00:15:21.800 | but I still am under the impression that Claude is the best.
00:15:24.920 | The other closed models are, you know, not quite as good.
00:15:27.560 | And then the open models are a little bit behind that.
00:15:30.320 | Grok, we haven't tried Grok at all actually.
00:15:34.560 | So it's a good question.
00:15:35.560 | If you want to try it, I'd be happy to help.
00:15:41.520 | Cool, another thing is planning.
00:15:43.280 | And so there's a few considerations for planning.
00:15:47.440 | The first one is whether you have a curated plan
00:15:50.640 | or you have it generated on the fly.
00:15:53.360 | And so for solving GitHub issues,
00:15:57.280 | you can kind of have an overall plan.
00:15:59.760 | Like the plan is first reproduce.
00:16:03.160 | If there's an issue,
00:16:05.280 | first write tests to reproduce the issue
00:16:07.560 | or to demonstrate the issue.
00:16:09.400 | After that, run the tests and make sure they fail.
00:16:12.760 | Then go in and fix the code,
00:16:14.640 | run the tests again to make sure they pass
00:16:16.280 | and then you're done.
00:16:17.240 | So that's like a pretty good workflow
00:16:19.080 | for like solving coding issues.
00:16:22.080 | And you could curate that ahead of time.
00:16:24.880 | Another option is to let the language model
00:16:27.560 | basically generate its own plan.
00:16:29.640 | And both of these are perfectly valid.
00:16:31.920 | Another one is explicit structure versus implicit structure.
00:16:36.520 | So let's say you generate a plan.
00:16:39.040 | If you have explicit structure,
00:16:41.520 | you could like write a multi-agent system.
00:16:44.520 | And the multi-agent system would have your reproducer agent
00:16:48.480 | and then it would have your test writer agent
00:16:53.480 | and your bug fixer agent and lots of different agents.
00:16:57.480 | And you would explicitly write this all out in code
00:17:00.000 | and then use it that way.
00:17:02.520 | On the other hand, you could just provide a prompt
00:17:04.640 | that says, please do all of these things in order.
00:17:07.200 | So in OpenHands, we do very light planning.
00:17:14.120 | We have a single prompt,
00:17:15.240 | we don't have any multi-agent systems,
00:17:17.920 | but we do provide like instructions about like
00:17:20.400 | what to do first, what to do next
00:17:21.880 | and other things like this.
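
As an illustration of that "implicit structure" style, the curated issue-fixing workflow from a moment ago can just live inside one prompt; the wording below is illustrative, not OpenHands' actual system prompt.

```python
# Sketch of a single-agent prompt carrying an ordered workflow.
ISSUE_FIXING_PROMPT = """\
You are a software engineering agent working in a checked-out repository.
Follow this workflow, but deviate from it if the situation requires:
1. Read the issue and the relevant code.
2. Write a test that reproduces the issue.
3. Run the test and confirm that it fails.
4. Fix the code.
5. Re-run the tests and confirm they pass before finishing.

Issue:
{issue_text}
"""

def build_prompt(issue_text: str) -> str:
    return ISSUE_FIXING_PROMPT.format(issue_text=issue_text)
```
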
00:17:23.400 | I'm not against doing it the other way,
00:17:26.480 | but I laid out some kind of justification for this
00:17:30.560 | in this blog called Don't Sleep on Single Agent Systems.
00:17:33.600 | And the basic idea behind this is
00:17:35.600 | if you have a really, really good instruction
00:17:37.480 | following agent, it will follow the instructions
00:17:40.800 | as long as things are working according to your plan.
00:17:43.480 | But let's say you need to deviate from your plan,
00:17:45.880 | you still have the flexibility to do this.
00:17:47.800 | And if you do explicit structure
00:17:49.400 | through a multi-agent system,
00:17:50.480 | it becomes a lot harder to do that.
00:17:51.880 | Like you get stuck when things deviate from your plan.
00:17:55.460 | There's also some other examples
00:17:59.600 | and I wanted to introduce a few papers.
00:18:02.200 | So one paper I liked recently is this paper called CoAct
00:18:05.360 | where you generate plans and then go in and fix them.
00:18:09.240 | And so the basic idea is like
00:18:12.080 | if you need to deviate from your plan,
00:18:13.560 | you can figure out that your plan was not working
00:18:17.400 | and go back and deviate from it.
00:18:19.000 | Another thing I think about a lot
00:18:23.600 | is specifying common workflows.
00:18:25.400 | So we're trying to tackle software development
00:18:28.040 | and I already showed like three use cases
00:18:30.840 | where we do software development.
00:18:35.560 | And when we do software development,
00:18:40.040 | we do a ton of different things,
00:18:41.560 | but we do them over and over and over again.
00:18:43.120 | So just to give an example,
00:18:45.320 | we fix GitHub actions when GitHub actions are failing
00:18:49.520 | and we do that over and over and over again.
00:18:51.640 | That's not the number one thing that software engineers do,
00:18:53.940 | but it's a high up on the list.
00:18:56.200 | So how can we get a list of all of like the workflows
00:18:58.640 | that people are working on?
00:19:01.000 | And there's a few research works
00:19:03.600 | that people have done in this direction.
00:19:05.920 | One example is manual prompting.
00:19:07.560 | So there's this nice paper called Step
00:19:09.760 | that got state-of-the-art
00:19:10.880 | on the Web Arena Web Navigation Benchmark
00:19:12.760 | where they came up with a bunch of manual workflows
00:19:14.920 | for solving different web navigation tasks.
00:19:18.440 | And we also have a paper recently
00:19:20.200 | called Agent Workflow Memory
00:19:22.200 | where the basic idea behind this
00:19:23.960 | is we want to create self-improving agents
00:19:26.120 | that learn from their past successes.
00:19:29.200 | And the way it works is we have a memory
00:19:32.280 | that has an example of lots of the previous workflows
00:19:35.440 | that people have used.
00:19:37.040 | And every time the agent finishes a task
00:19:39.920 | and it self-judges that it did a good job at that task,
00:19:43.240 | you take that task,
00:19:44.120 | you break it down into individual workflows included in that
00:19:47.600 | and then you put it back in the prompt
00:19:49.160 | for the agent to work next time.
00:19:51.140 | And we demonstrated that this leads to a 22.5% increase
00:19:56.900 | on Web Arena after 40 examples.
00:20:00.400 | So that's a pretty huge increase
00:20:02.540 | by kind of self-learning and self-improvement.
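
A rough sketch of that loop is below; run_agent, self_judge, and extract_workflows are placeholders for LLM calls, and this is a simplification of the idea rather than the Agent Workflow Memory implementation.

```python
# Sketch of a self-improving loop: judge finished episodes, distill workflows,
# and prepend them to future prompts. The three helpers are placeholders.
workflow_memory: list[str] = []

def run_agent(prompt: str) -> tuple[str, list[str]]:
    """Placeholder: run the agent, return (final answer, action trajectory)."""
    raise NotImplementedError

def self_judge(task: str, answer: str) -> bool:
    """Placeholder: ask the model whether it solved the task well."""
    raise NotImplementedError

def extract_workflows(trajectory: list[str]) -> list[str]:
    """Placeholder: ask the model to distill reusable sub-workflows."""
    raise NotImplementedError

def solve(task: str) -> str:
    # Workflows induced from past successes are prepended to the prompt.
    prompt = "\n".join(workflow_memory + ["Task: " + task])
    answer, trajectory = run_agent(prompt)
    if self_judge(task, answer):
        workflow_memory.extend(extract_workflows(trajectory))
    return answer
```
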
00:20:04.860 | Another thing is exploration.
00:20:09.920 | And one thing I think about is like,
00:20:17.140 | how can agents learn more about their environment
00:20:19.300 | before acting?
00:20:20.940 | And I work on coding and web agents
00:20:24.360 | and there's a few good examples of this in both areas.
00:20:28.520 | Within coding, I view this as like repository understanding,
00:20:33.320 | understanding the code base that you're dealing with.
00:20:36.080 | And there's an example of this
00:20:38.200 | or a couple of examples of this,
00:20:39.400 | one example being Agentless,
00:20:41.500 | where they basically create a map of the repo
00:20:44.760 | and based on the map of the repo,
00:20:46.400 | they feed that into the agent
00:20:47.580 | so the agent can then navigate the repo
00:20:50.420 | and better know where things are.
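
In that spirit, a very small repo-map sketch might just walk the tree and collect top-level definitions to paste into the prompt; the details below are illustrative, not how Agentless actually builds its map.

```python
# Sketch: list Python files with their top-level functions/classes.
import ast
import os

def repo_map(root: str) -> str:
    lines = []
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            lines.append(os.path.relpath(path, root))
            try:
                with open(path, encoding="utf-8") as f:
                    tree = ast.parse(f.read())
            except SyntaxError:
                continue  # skip files that don't parse
            for node in tree.body:
                if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                    lines.append(f"    {node.name}")
    return "\n".join(lines)

print(repo_map("."))
```
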
00:20:53.380 | And for web agents,
00:20:55.020 | there's an example of a paper called Bagel.
00:20:57.300 | And basically what they do is they have the agent
00:21:00.820 | just do random tasks on a website,
00:21:03.620 | explore the website,
00:21:04.680 | better understand the structure of the website.
00:21:06.300 | And then after that, they feed that in
00:21:08.860 | as a part of the prompt.
00:21:10.060 | Part seven is search.
00:21:16.220 | Right now in open hands,
00:21:19.300 | we just let the agent go on a linear search path.
00:21:21.500 | So it's just solving the problem once.
00:21:24.300 | We're using a good agent that can kind of like
00:21:26.460 | recover from errors and try alternative things
00:21:28.700 | when things are not working properly,
00:21:30.140 | but still we only have a linear search path.
00:21:33.180 | But there's also some nice work in 2024
00:21:36.660 | that is about exploring multiple paths.
00:21:39.100 | So one example of this is,
00:21:40.980 | there's a paper called Tree Search for Language Model Agents,
00:21:43.780 | and they basically expand multiple paths,
00:21:46.380 | check whether the paths are going well,
00:21:49.320 | and if they aren't going well, you rewind back.
00:21:51.840 | And on the web, this is kind of tricky
00:21:54.440 | because like how do you rewind
00:21:56.680 | when you accidentally ordered
00:21:57.960 | something you don't want on Amazon?
00:21:59.400 | It's kind of not the easiest thing to do.
00:22:02.120 | For code, it's a little bit easier
00:22:03.480 | 'cause you can just revert any changes that you made.
00:22:06.920 | But I think that's an interesting topic too.
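
For code, the revert step really is that simple, as in this toy sketch; propose_patches and score_patch stand in for the model and an evaluator, and this captures only the flavor of the idea, not the cited paper's algorithm.

```python
# Toy sketch: try several candidate patches, score each, rewind with git,
# then keep only the best branch. Placeholders mark where the LLM would go.
import subprocess

def propose_patches(task: str, n: int) -> list[str]:
    """Placeholder: ask the model for n candidate diffs."""
    raise NotImplementedError

def score_patch() -> float:
    """Placeholder: e.g. run the test suite and return a pass rate."""
    raise NotImplementedError

def apply_patch(diff: str) -> None:
    subprocess.run(["git", "apply", "-"], input=diff, text=True, check=True)

def rewind() -> None:
    # Drop working-tree changes so the failed branch leaves no trace.
    subprocess.run(["git", "checkout", "--", "."], check=True)

def tree_search_step(task: str) -> str:
    best_diff, best_score = "", float("-inf")
    for diff in propose_patches(task, n=3):
        apply_patch(diff)
        score = score_patch()
        if score > best_score:
            best_diff, best_score = diff, score
        rewind()
    apply_patch(best_diff)  # keep only the best branch
    return best_diff
```
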
00:22:09.600 | And then finally, evaluation.
00:22:13.600 | So within our development for evaluation,
00:22:18.240 | we want to do a number of things.
00:22:19.960 | The first one is fast sanity checks.
00:22:23.000 | And in order to do this,
00:22:23.960 | we want things we can run really fast, really cheaply.
00:22:27.000 | So for web, we have something called Mini World of Bits,
00:22:30.400 | which is basically these trivial
00:22:32.200 | kind of web navigation things.
00:22:36.480 | We have something called the Aider code editing benchmark,
00:22:38.760 | where it's just about editing individual files that we use.
00:22:42.560 | But we also want highly realistic evaluation.
00:22:47.320 | So for the web, we have something called Web Arena
00:22:49.600 | that we created at CMU.
00:22:50.880 | This is web navigation on real open source websites.
00:22:55.680 | So it's open source websites
00:22:57.040 | that are actually used to serve shops
00:23:00.440 | or like bulletin boards or other things like this.
00:23:05.440 | And for code, we use SWE-Bench,
00:23:07.760 | which I think a lot of people may have heard of.
00:23:10.400 | It's basically a coding benchmark
00:23:12.440 | that comes from real world pull requests on GitHub.
00:23:14.920 | So if you can solve those,
00:23:15.920 | you can also probably solve other real world pull requests.
00:23:19.400 | I would say we still don't have benchmarks
00:23:24.200 | for the full versatility of agents.
00:23:26.520 | So for example, we don't have benchmarks
00:23:29.200 | that test whether agents can code and do web navigation,
00:23:32.840 | but we're working on that
00:23:34.080 | and hoping to release something in the next week or two.
00:23:36.720 | So if that sounds interesting to you, come talk to me
00:23:40.240 | and I will tell you more about it.
00:23:43.880 | - Cool, so I don't like making predictions,
00:23:46.880 | but I was told that I should be somewhat controversial,
00:23:50.480 | I guess, so I will try to do it anyway,
00:23:54.560 | although maybe none of these will be very controversial.
00:23:57.320 | The first thing is agent-oriented LLMs,
00:24:02.320 | like large language models for agents.
00:24:04.720 | My prediction is every large LLM trainer
00:24:08.120 | will be focusing on training models as agents.
00:24:10.280 | So every large language model will be a better agent model
00:24:13.920 | by mid 2025.
00:24:16.040 | Competition will increase, prices will go down,
00:24:21.200 | smaller models will become competitive as agents.
00:24:23.760 | So right now, actually agents are somewhat expensive
00:24:25.960 | to run in some cases,
00:24:27.080 | but I expect that that won't last six months.
00:24:29.400 | I bet we'll have much better agent models in six months.
00:24:32.680 | Another thing is instruction for LLMs.
00:24:38.600 | Another thing is instruction following ability
00:24:41.160 | specifically in agentic contexts will increase.
00:24:44.800 | And what that means is we'll have to do less
00:24:47.400 | manual engineering of agentic workflows
00:24:51.360 | and be able to do more by just prompting agents
00:24:54.080 | in more complex ways.
00:24:56.040 | Claude is already really good at this.
00:24:57.840 | It's not perfect, but it's already really, really good.
00:24:59.840 | And I expect the other models
00:25:00.960 | will catch up to Claude pretty soon.
00:25:02.720 | Error correction ability will increase,
00:25:06.560 | less getting stuck in loops.
00:25:07.720 | Again, this is something that Claude's
00:25:09.200 | already pretty good at.
00:25:10.520 | And I expect the others will follow.
00:25:13.680 | Agent benchmarks.
00:25:17.920 | Agent benchmarks will start saturating.
00:25:20.240 | So right now we have WebArena and SWE-Bench.
00:25:25.240 | I think WebArena is already too easy.
00:25:29.560 | It's not super easy, but it's already a bit too easy
00:25:35.720 | because the tasks we do in there
00:25:38.080 | are ones that take like two minutes for a human.
00:25:40.520 | So not too hard.
00:25:42.440 | And kind of historically in 2023,
00:25:46.880 | our benchmarks were too easy.
00:25:48.200 | So we built harder benchmarks like WebArena and SWE-Bench
00:25:51.120 | were both built in 2023.
00:25:52.800 | In 2024, our agents were too bad.
00:25:55.960 | So we built agents and now we're building better agents.
00:26:00.040 | In 2025, our benchmarks will be too easy.
00:26:02.400 | So we'll build better benchmarks, I'm guessing.
00:26:05.240 | So I would expect to see much more challenging
00:26:08.600 | agent benchmarks come out
00:26:10.040 | and we're already seeing some of them.
00:26:12.760 | In 2026, I don't know.
00:26:14.800 | I didn't write AGI, but we'll see.
00:26:19.320 | Then the human agent computer interface.
00:26:24.040 | I think one thing that we'll want to think about
00:26:27.080 | is what do we do at 75% success rate
00:26:29.880 | at things that we like actually care about.
00:26:33.600 | Right now we have 53% or 55% on SWE-Bench Verified,
00:26:38.600 | which is real world GitHub PRs.
00:26:43.280 | My impression is that the actual ability of models
00:26:47.800 | is maybe closer to 30 to 40%.
00:26:51.800 | So 30 to 40% of the things that I want an agent
00:26:54.680 | to solve on my own repos,
00:26:55.960 | it just solves without any human intervention.
00:26:59.520 | 80 to 90% it can solve without me opening an IDE,
00:27:03.080 | but I need to give it feedback.
00:27:05.320 | So how do we make that interaction smooth
00:27:09.280 | so that humans can audit the work of agents
00:27:13.240 | that are really, really good, but not perfect
00:27:15.720 | is going to be a big challenge.
00:27:17.280 | How can we expose the power of programming agents
00:27:22.480 | to other industries?
00:27:23.320 | So as programmers, I think not all of us
00:27:26.880 | are using agents every day in our programming,
00:27:29.560 | although we probably will be in months or maybe a year,
00:27:34.560 | but I think it will come very naturally to us as programmers
00:27:39.840 | because we know code, we know how to architect software
00:27:44.840 | and stuff like that.
00:27:47.080 | So I think the question is how do we put this in the hands
00:27:52.080 | of a lawyer or a chemist or somebody else
00:27:56.160 | and have them also be able to interact with it
00:27:58.640 | as naturally as we can.
00:27:59.760 | Another interesting thing is how can we redesign
00:28:03.960 | our existing systems for agents?
00:28:05.400 | So we had a paper on API-based web agents,
00:28:07.960 | and basically what we showed is if you take a web agent
00:28:11.440 | and the agent interacts not with a website,
00:28:14.120 | but with APIs, the accuracy goes way up,
00:28:16.640 | just because APIs are way easier to interact with.
00:28:18.800 | And in fact, like when I ask our agent,
00:28:23.800 | our agent is able to browse websites,
00:28:26.120 | but whenever I want it to interact with GitHub,
00:28:28.080 | I tell it do not browse the GitHub website,
00:28:30.120 | use the GitHub API because it's way more successful
00:28:32.320 | at doing that.
00:28:33.360 | So maybe every website is gonna need to have an API
00:28:36.760 | because we're gonna be having agents interact with them.
00:28:39.560 | About progress, I think progress will get faster.
00:28:45.840 | It's already fast.
00:28:46.840 | A lot of people are already overwhelmed,
00:28:48.520 | but I think it will continue.
00:28:50.880 | The reason why is agents are building agents
00:28:54.000 | and better agents will build better agents faster.
00:28:56.320 | So I expect that if you haven't interacted
00:29:01.320 | with a coding agent yet, it's pretty magical,
00:29:04.600 | like the stuff that it can do.
00:29:06.840 | So, yeah.
00:29:08.720 | And I have a call to action.
00:29:13.280 | I'm honestly, like I've been working
00:29:17.600 | on natural language processing and language models
00:29:21.520 | for what, 15 years now?
00:29:23.320 | And even for me, it's pretty impressive
00:29:25.480 | what like AI agents powered by strong language models
00:29:28.640 | can do.
00:29:29.480 | On the other hand, I believe that we should really make
00:29:33.880 | these powerful tools accessible.
00:29:35.800 | And what I mean by this is I don't think like,
00:29:39.680 | we should have these be opaque or limited
00:29:43.520 | to only a certain set of people.
00:29:46.280 | I feel like they should be affordable.
00:29:48.360 | They shouldn't be increasing the difference
00:29:51.640 | in the amount of power that people have.
00:29:53.760 | If anything, I'd really like them to kind of make it possible
00:29:58.160 | for people who weren't able to do things before
00:30:00.200 | to be able to do them well.
00:30:01.800 | Open source is one way to do that.
00:30:05.280 | That's why I'm working on open source.
00:30:08.280 | There are other ways to do that.
00:30:09.800 | Make things cheap, make things so you can serve them
00:30:13.480 | to people who aren't able to afford them easily.
00:30:16.480 | Like Duolingo is one example where they get all the people
00:30:19.840 | in the US to pay them $20 a month.
00:30:23.480 | So that they can give all the people in South America
00:30:26.160 | free language education so they can learn English
00:30:28.920 | and become more attractive on the job market, for instance.
00:30:33.920 | And so I think we can all think of ways
00:30:39.080 | that we can do that sort of thing.
00:30:41.520 | And if that resonates with you, please contribute.
00:30:43.840 | Of course, I'd be happy if you contribute to Open Hands
00:30:46.120 | and use it.
00:30:47.640 | But another way you can do that is just use
00:30:50.200 | open source solutions, contribute to them,
00:30:52.600 | research with them, and train strong open source models.
00:30:55.440 | So I see some people in the room
00:30:58.640 | who are already training models.
00:30:59.880 | It'd be great if you could train models for coding agents
00:31:02.640 | and make them cheap and yeah.
00:31:04.360 | Yeah, please, I was thinking about you, among others.
00:31:10.320 | Cool, yeah, that's all I have, thanks.
00:31:12.880 | - Slightly controversial thing is probably the nicest way
00:31:20.680 | to say hot takes.
00:31:21.760 | Any hot takes questions, actual hot takes?
00:31:28.400 | - Oh, I can also show the other agents that were working
00:31:32.520 | if anybody's interested, but yeah, sorry, go ahead.
00:31:34.480 | - Yeah, I have a couple of questions.
00:31:37.600 | So they're kind of paired maybe.
00:31:39.760 | The first thing is that you said that you're estimating
00:31:42.960 | that your agent is successfully resolving
00:31:47.960 | something like 30 to 40% of your issues,
00:31:50.040 | but that's like below what you saw on SWE-Bench.
00:31:52.880 | So I guess I'm wondering where that discrepancy
00:31:55.640 | is coming from.
00:31:56.800 | And then I guess my other second question,
00:31:58.360 | which is maybe broader in scope,
00:31:59.760 | is that like if you think of an agent
00:32:01.960 | as like a junior developer, and I say, go do something,
00:32:05.800 | then I expect maybe tomorrow to get a Slack message
00:32:09.000 | being like, hey, I ran into this issue.
00:32:10.840 | How can I resolve it?
00:32:12.280 | And like you said, your agent is like successfully solving
00:32:16.640 | like 90% of issues where you give it direct feedback.
00:32:19.240 | So are you thinking about how to get the agent
00:32:21.400 | to reach out to like, for planning when it's stuck
00:32:25.720 | or something like that?
00:32:26.760 | For like identify when it runs into a hole like that?
00:32:29.680 | - Yeah, so great.
00:32:32.160 | These are great questions.
00:32:33.200 | - Oh, sorry.
00:32:34.040 | The third question, which is a good,
00:32:35.480 | so this is the first two.
00:32:36.840 | And if so, are you going to add a benchmark
00:32:39.480 | for that second question?
00:32:41.480 | - Okay, great.
00:32:42.320 | Yeah, great questions.
00:32:43.160 | Okay, so the first question was,
00:32:45.120 | why do I think it's resolving less than 50%
00:32:47.360 | of the issues on SWE-Bench?
00:32:49.080 | So first, SWE-Bench is on popular open source repos
00:32:54.080 | and all of these popular open source repos
00:32:56.760 | were included in the training data
00:32:59.160 | for all of the language models.
00:33:01.040 | And so the language models already know these repos.
00:33:04.760 | In some cases, the language models already know
00:33:06.600 | the individual issues in SWE-Bench.
00:33:08.680 | So basically like some of the training data has leaked.
00:33:12.200 | And so it definitely will overestimate
00:33:14.920 | with respect to that.
00:33:15.760 | I don't think it's like horribly, horribly off,
00:33:18.880 | but I think it's boosting the accuracy by a little bit.
00:33:21.400 | So maybe that's the biggest reason why.
00:33:23.800 | In terms of asking for help
00:33:29.480 | and whether we're benchmarking asking for help,
00:33:32.320 | yes, we are.
00:33:34.800 | So one thing we're working on now,
00:33:38.520 | which we're hoping to put out soon
00:33:39.720 | is we basically made super vague SWE-Bench issues.
00:33:43.360 | Like I'm having a problem with the matrix multiply,
00:33:46.720 | please help.
00:33:48.160 | (laughs)
00:33:49.120 | Because these are like,
00:33:50.280 | if anybody's run a popular open source like framework,
00:33:55.120 | these are what half your issues are.
00:33:57.000 | You're like users show up and say like,
00:33:59.680 | my screen doesn't work, what's wrong or something.
00:34:02.680 | And so then you need to ask them questions
00:34:04.600 | and how to reproduce.
00:34:05.440 | So yeah, we're working on that.
00:34:08.120 | I think it, my impression is that agents
00:34:12.640 | are not very good at asking for help, even Claude.
00:34:15.840 | So like when they ask for help,
00:34:19.280 | they'll ask for help when they don't need it
00:34:20.800 | and then won't ask for help when they do need it.
00:34:22.600 | So this is definitely like an issue, I think.
00:34:25.280 | - Thanks for the great talk.
00:34:30.320 | I also have two questions.
00:34:32.200 | It's first one, can you talk a bit more
00:34:34.200 | about how the web agent interacts with websites?
00:34:37.880 | So is there a VLM that looks at the webpage layout
00:34:40.760 | and then you parse the HTML
00:34:42.000 | and select which buttons to click on?
00:34:44.360 | And if so, do you think there's a future
00:34:47.560 | where there's like, so I work at Bing, Microsoft AI.
00:34:51.560 | Do you think there's a future
00:34:52.520 | where they're like the same web index,
00:34:54.920 | but there's an agent-friendly web index
00:34:56.480 | where all the processing is done offline
00:34:58.600 | so that you don't need to spend time cleaning up,
00:35:02.880 | like cleaning up the HTML
00:35:04.240 | and figuring out what to click online.
00:35:06.160 | And any thoughts on that?
00:35:09.400 | - Yeah, so great question.
00:35:13.120 | There's a lot of work on web agents.
00:35:14.480 | I didn't go into like all of the details,
00:35:16.120 | but I think there's three main ways
00:35:20.200 | that agents interact with websites.
00:35:22.440 | The first way is the simplest way and the newest way,
00:35:26.160 | but it doesn't work very well,
00:35:27.600 | which is you take a screenshot of the website
00:35:32.600 | and then you click on a particular pixel value
00:35:35.560 | on the website.
00:35:37.320 | And like models are not very good at that at the moment.
00:35:41.160 | Like they'll misclick.
00:35:42.440 | There was this thing about how like Claude Computer Use
00:35:45.480 | started like looking at pictures
00:35:47.960 | of Yellowstone National Park or something like this.
00:35:50.400 | I don't know if you heard about this anecdote,
00:35:52.680 | but like people were like, oh, it's so human.
00:35:55.400 | It's looking for a vacation.
00:35:56.480 | And it was like, no, it probably just misclicked
00:35:58.560 | on the wrong pixels and accidentally clicked on an ad.
00:36:01.520 | So like, this is the simplest way.
00:36:04.360 | The second simplest way is you take the HTML
00:36:08.640 | and you basically identify elements in the HTML.
00:36:12.160 | You don't use any vision whatsoever.
00:36:14.840 | And then you say, okay, I want to click on this element.
00:36:17.520 | I want to enter text in this element
00:36:18.960 | or something like that.
00:36:19.960 | But HTML is too huge.
00:36:21.360 | So it actually, it usually gets condensed down
00:36:23.240 | into something called an accessibility tree,
00:36:25.080 | which was made for screen readers
00:36:26.400 | for visually impaired people.
00:36:28.280 | And so that's another way.
00:36:31.560 | And then the third way is kind of a hybrid
00:36:33.120 | where you present the screenshot,
00:36:34.400 | but you also present like a textual summary of the output.
00:36:38.160 | And that's the one that I think will probably work best.
00:36:42.320 | What we're using is we're just using text at the moment.
00:36:44.800 | And that's just an implementation issue
00:36:46.400 | that we haven't implemented the visual stuff yet,
00:36:49.240 | but that's kind of like we're working on it now.
00:36:52.000 | Another thing that I should point out
00:36:53.440 | is we actually have two modalities for web browsing.
00:36:56.040 | Very recently, we implemented this.
00:36:57.680 | And the reason why is because
00:36:59.280 | if you want to interact with full websites,
00:37:02.120 | you will need to click on all of the elements
00:37:04.040 | or have the ability to click on all of the elements.
00:37:05.920 | But most of our work that we need websites for
00:37:08.280 | is just web browsing and like gathering information.
00:37:11.560 | So we have another modality
00:37:12.760 | where we convert all of it to markdown
00:37:14.840 | because that's like way more concise
00:37:17.200 | and easier for the agent to deal with.
00:37:19.080 | And then can we create an index specifically for agents?
00:37:24.080 | Maybe a markdown index or something like that would be,
00:37:26.720 | you know, would make sense.
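
Sketching the markdown modality: one common way to get a concise page rendering is an HTML-to-markdown converter such as html2text, as below; this is illustrative and not OpenHands' actual browsing pipeline.

```python
# Sketch: fetch a page and hand the agent markdown instead of raw HTML.
import html2text
import requests

def page_as_markdown(url: str) -> str:
    html = requests.get(url, timeout=30).text
    converter = html2text.HTML2Text()
    converter.ignore_images = True  # keep the text concise for the prompt
    converter.body_width = 0        # don't hard-wrap lines
    return converter.handle(html)

print(page_as_markdown("https://example.com")[:2000])
```
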
00:37:28.200 | Oh, how would I make a successor to SWE-Bench?
00:37:32.280 | So, I mean, a first thing is there's like LiveCodeBench,
00:37:37.280 | which LiveCodeBench is basically continuously updating
00:37:40.640 | to make sure it doesn't leak
00:37:41.720 | into language model training data.
00:37:43.960 | That's easy to do for SWE-Bench
00:37:45.480 | because it comes from real websites
00:37:47.120 | and those real websites are getting new issues all the time.
00:37:49.320 | So you could just do it
00:37:51.080 | on the same benchmarks that they have there.
00:37:53.960 | There's also like a pretty large number of things
00:37:59.600 | covering various coding tasks.
00:38:02.040 | So like, for example, SWE-Bench is mainly fixing issues,
00:38:04.880 | but there's also like documentation.
00:38:07.960 | There's generating tests
00:38:10.800 | that actually test the functionality that you want.
00:38:14.120 | And there was a paper by a student at CMU
00:38:17.400 | on generating tests and stuff like that.
00:38:19.200 | So I feel like SWE-Bench is one piece of the puzzle,
00:38:23.000 | but you could also have like 10 different other tasks.
00:38:25.640 | And then you could have like a composite benchmark
00:38:27.400 | where you test all of these abilities,
00:38:28.840 | not just that particular one.
00:38:32.200 | Lots of other things too, but yeah.
00:38:35.160 | - Question from across.
00:38:40.840 | Use your mic, it would help.
00:38:42.240 | - Yeah, great talk, thank you.
00:38:46.720 | My question is about your experience
00:38:50.800 | designing agent architectures specifically.
00:38:54.640 | How much did you have to separate concerns
00:38:57.400 | in terms of task specific agents
00:39:00.960 | versus having one agent to do three or five things
00:39:04.400 | with a gigantic prompt with conditional paths and so on?
00:39:08.160 | - Yeah, so that's a great question.
00:39:09.600 | So we have a basic coding and browsing agent.
00:39:13.280 | And I won't say basic, like it's a good agent,
00:39:18.240 | but it does coding and browsing.
00:39:20.400 | It has instructions about how to do coding and browsing.
00:39:24.400 | That is enough for most things,
00:39:27.360 | especially given a strong language model
00:39:30.920 | that has a lot of background knowledge
00:39:32.240 | about how to solve different types of tasks
00:39:34.200 | and how to use different APIs and stuff like that.
00:39:37.520 | We do have a mechanism for something called microagents.
00:39:41.280 | And microagents are basically something
00:39:42.920 | that gets added to the prompt when a trigger is triggered.
00:39:46.080 | Right now it's very, very rudimentary.
00:39:48.160 | It's like if you detect the word GitHub anywhere,
00:39:52.280 | you get instructions about how to interact with GitHub,
00:39:54.680 | like use the API and don't browse.
00:39:56.840 | Also, another one that I just added is for NPM,
00:40:02.360 | the like JavaScript package manager.
00:40:04.840 | And NPM, when it runs and it hits a failure,
00:40:07.840 | it like hits in interactive terminals where it says,
00:40:12.280 | would you like to quit?
00:40:14.120 | Enter yes.
00:40:15.120 | And if that does it,
00:40:15.960 | it like stalls our agent for the timeout
00:40:17.760 | until like two minutes.
00:40:18.720 | So like I added a new microagent.
00:40:21.480 | Whenever it started using NPM,
00:40:23.120 | it would like get instructions
00:40:26.120 | about how to not use the interactive terminal
00:40:28.120 | and stuff like that.
00:40:28.960 | So that's our current solution.
00:40:31.560 | Honestly, I like it a lot.
00:40:32.880 | It's simple, it's easy to maintain.
00:40:34.480 | It works really well and stuff like that.
00:40:36.160 | But I think there is a world
00:40:37.160 | where you would want something more complex than that.
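
The mechanism can be sketched in a few lines; the trigger words and instruction text below are illustrative examples, not the actual OpenHands microagents.

```python
# Sketch of trigger-based microagents: keyword in the conversation ->
# extra instructions appended to the prompt.
MICROAGENTS = {
    "github": "When working with GitHub, prefer the REST API over browsing the website.",
    "npm": "Run npm non-interactively (e.g. CI=true) so it never waits for keyboard input.",
}

def augment_prompt(base_prompt: str, conversation: str) -> str:
    extras = [text for trigger, text in MICROAGENTS.items()
              if trigger in conversation.lower()]
    return base_prompt + ("\n\n" + "\n".join(extras) if extras else "")
```
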
00:40:39.680 | - Got it, thank you.
00:40:40.680 | - I got a question about MCP.
00:40:50.960 | I feel like this is the Anthropic Model Context Protocol.
00:40:50.960 | It seems like the most successful type of this,
00:40:54.040 | like standardization of interactions
00:40:56.160 | between computers and agents.
00:40:57.720 | Are you guys adopting it?
00:41:00.160 | Is there any other competing standard?
00:41:03.720 | Anything thought about it?
00:41:05.760 | - Yeah, I think the,
00:41:06.840 | so the Anthropic MCP is like a way,
00:41:11.400 | it's essentially a collection of APIs
00:41:13.600 | that you can use to interact
00:41:14.720 | with different things on the internet.
00:41:18.560 | I think it's not a bad idea,
00:41:21.480 | but it's like,
00:41:26.480 | there's a few things that bug me a little bit about it.
00:41:29.240 | It's like, we already have an API for GitHub.
00:41:33.000 | So why do we need an MCP for GitHub, right?
00:41:35.960 | You know, like GitHub has an API,
00:41:37.600 | the GitHub API is evolving.
00:41:39.480 | We can look up the GitHub API documentation.
00:41:43.240 | So it seems like kind of duplicated a little bit.
00:41:46.880 | And also they have a setting where it's like,
00:41:50.280 | you have to spin up a server to serve your GitHub stuff
00:41:54.200 | and you have to spin up a server
00:41:55.560 | to serve your like, you know, other stuff.
00:41:58.240 | And so I think it makes sense
00:42:01.680 | if you really care about like separation of concerns
00:42:04.160 | and security and like other things like this.
00:42:07.760 | But right now we haven't seen,
00:42:11.680 | we haven't seen that to have a lot more value
00:42:13.960 | than interacting directly with the tools
00:42:15.560 | that are already provided.
00:42:16.520 | And that kind of goes into my general philosophy,
00:42:18.480 | which is we're already developing things for programmers.
00:42:22.040 | You know, how is an agent different from a programmer?
00:42:27.040 | And it is different, obviously, you know,
00:42:31.200 | like agents are different from programmers,
00:42:33.560 | but they're not that different at this point.
00:42:35.440 | So we can kind of interact with the interfaces
00:42:38.080 | we create for programmers.
00:42:40.200 | Yeah.
00:42:41.560 | I might change my mind later though.
00:42:43.000 | So we'll see.
00:42:45.560 | - Yeah, hi, thanks.
00:42:47.480 | Very interesting talk.
00:42:48.720 | You were saying that the agents you have right now
00:42:52.560 | solve like maybe 30% of your issues out of the gate.
00:42:57.360 | I'm curious, of the things that it doesn't do,
00:43:01.280 | is there like a pattern that you observe?
00:43:03.680 | Like, oh, like these are the sorts of things
00:43:05.400 | that it just seems to really struggle with
00:43:07.240 | or is it just seemingly random?
00:43:09.080 | - It's definitely not random.
00:43:12.040 | It's like, if you think it's more complex,
00:43:14.160 | then it's like, just intuitively,
00:43:16.400 | it's more likely to fail.
00:43:17.960 | I've gotten a bit better at prompting also.
00:43:22.840 | So like, just to give an example,
00:43:24.880 | it will sometimes fail to fix a GitHub workflow
00:43:35.320 | because it will not look at the GitHub workflow
00:43:38.160 | and understand what the GitHub workflow is doing
00:43:40.040 | before it solves the problem.
00:43:42.240 | So I think actually probably the biggest thing
00:43:44.560 | that it fails at is,
00:43:46.080 | or that our agent plus Claude fails at
00:43:49.880 | is insufficient information gathering
00:43:51.880 | before trying to solve the task.
00:43:53.880 | And so if you provide all,
00:43:56.040 | if you provide instructions
00:43:57.680 | that it should do information gathering beforehand,
00:44:00.120 | it tends to do well.
00:44:01.120 | If you don't provide sufficient instructions,
00:44:02.960 | it will try to solve the task
00:44:04.640 | without like fully understanding the task first
00:44:07.000 | and then fail and then you need to go back
00:44:08.640 | and give additional feedback.
00:44:12.480 | Another example, like, I love this example.
00:44:15.120 | While I was developing the monitor website
00:44:19.240 | that I showed here,
00:44:20.560 | we had a really tricky bug
00:44:22.120 | where it was writing out a cache file
00:44:25.040 | to a different directory
00:44:26.120 | than it was reading the cache file from.
00:44:28.120 | And I had no idea,
00:44:30.120 | I had no idea what was going on.
00:44:31.800 | I thought the bug was in a different part of the code.
00:44:34.600 | But what I asked it to do was
00:44:37.400 | come up with five possible reasons
00:44:39.120 | why this could be failing
00:44:40.120 | and decreasing order of likelihood
00:44:41.880 | and examine all of them.
00:44:43.400 | And that worked.
00:44:44.240 | And it could just go in and like do that.
00:44:46.200 | So like, I think a certain level of like scaffolding
00:44:50.160 | about like how it should sufficiently
00:44:54.600 | gather all the information that's necessary
00:44:56.480 | in order to solve the task is like,
00:44:58.320 | if that's missing,
00:44:59.160 | then that's probably the biggest failure point at the moment.
00:45:02.160 | - Thanks.
00:45:04.040 | - Yeah.
00:45:08.280 | - I'm just using this as a chance to ask you all my questions.
00:45:11.600 | You had a slide on here about like self-improving agents
00:45:14.480 | or something like that with memory.
00:45:16.240 | It's like a really throwaway slide
00:45:19.800 | for like a super powerful idea.
00:45:21.920 | It got me thinking about how I would do it.
00:45:24.120 | I have no idea how.
00:45:25.760 | So I just wanted you to chain a thought more on this.
00:45:28.760 | - Yeah, self-improving.
00:45:31.680 | So I think the biggest reason,
00:45:36.000 | like the simplest possible way
00:45:38.200 | to create a self-improving agent
00:45:40.520 | is to have a really, really strong language model
00:45:42.720 | that with infinite context.
00:45:44.720 | And it can just go back
00:45:45.960 | and look at like all of its past experiences
00:45:48.280 | and, you know, learn from them.
00:45:50.400 | You might also want to remove the bad stuff
00:45:53.280 | just so it doesn't over-index
00:45:54.600 | on its like failed past experiences.
00:45:56.960 | But the problem is a really powerful language model
00:46:00.920 | is large, infinite context is expensive.
00:46:04.240 | We don't have a good way to index into it
00:46:06.200 | because like RAG, at least in my experience,
00:46:10.320 | RAG from language to code doesn't work super well.
00:46:13.920 | So I think in the end, it's like,
00:46:16.240 | that's the way I would like to solve this problem.
00:46:17.960 | I'd like to have an infinite context
00:46:19.240 | and somehow be able to index into it appropriately.
00:46:21.880 | And I think that would mostly solve it.
00:46:24.560 | Another thing you can do is fine-tuning.
00:46:26.560 | So I think like RAG is one way
00:46:28.760 | to get information into your model.
00:46:30.040 | Fine-tuning is another way
00:46:30.960 | to get information into your model.
00:46:32.280 | So that might be another way of continuously improving.
00:46:36.000 | Like you identify when you did a good job
00:46:38.000 | and then just add all of the good examples into your model.
00:46:41.720 | - Yeah, so you know how like Voyager
00:46:44.480 | tries to write code into a skill library
00:46:46.320 | and then reuses the skill library, right?
00:46:47.880 | So it improves in the sense that
00:46:49.840 | it just builds up the skill library over time.
00:46:51.720 | - Yep.
00:46:53.000 | - One thing I was like thinking about,
00:46:55.120 | and there's this idea from Devin, your arch nemesis,
00:47:00.320 | of playbooks.
00:47:01.480 | I don't know if you've seen them.
00:47:02.760 | - Yeah, I mean, we're calling them workflows,
00:47:04.680 | but they're simpler.
00:47:05.520 | - Yeah, so like basically like you should,
00:47:07.240 | like once a workflow works,
00:47:09.520 | you can kind of like persist them as a skill library.
00:47:11.680 | - Yep.
00:47:12.520 | - Right, like I feel like that's like some in between,
00:47:16.600 | like you said, you know,
00:47:17.560 | it's hard to do RAG between language and code,
00:47:19.600 | but I feel like that is RAG for,
00:47:22.120 | like I've done this before.
00:47:23.480 | Last time I did it, this worked.
00:47:25.560 | So I'm just going to shortcut
00:47:26.920 | all the stuff that failed before.
00:47:29.680 | - Yeah, I totally, I think it's possible.
00:47:31.440 | It's just, you know, not trivial at the same time.
00:47:35.200 | - Yeah.
00:47:36.040 | - I'll explain the two curves.
00:47:37.200 | So basically the baseline is just an agent
00:47:40.280 | that does it from scratch every time.
00:47:42.360 | And this curve up here is agent workflow memory,
00:47:45.720 | where it's like adding the successful experiences
00:47:49.560 | back into the prompt.
00:47:50.880 | Why is this improving?
00:47:53.840 | The reason why is because just it failed
00:47:56.400 | on the first few examples,
00:47:57.520 | and for the average to catch up,
00:47:59.480 | it took a little bit of time.
00:48:01.280 | So it's not like this is actually improving it.
00:48:03.320 | You could just basically view the,
00:48:05.920 | this one is constant.
00:48:08.240 | And then this one is like improving like this.
00:48:10.960 | Basically you can see it's continuing to go up, yeah.
00:48:13.880 | - How do you think we're going to solve
00:48:17.320 | the authentication problem for agents right now?
00:48:19.880 | - When you say authentication,
00:48:22.520 | you mean like credentials, like, yeah.
00:48:25.200 | - Yeah, 'cause I've seen a few startup solutions today,
00:48:27.920 | but it seems like it's limited to the amount of websites
00:48:30.600 | or actual authentication methods
00:48:32.440 | that it's capable of performing today.
00:48:34.760 | - Yeah, great question.
00:48:36.320 | So my preferred solution to this at the moment
00:48:41.040 | is GitHub fine-grained authentication tokens.
00:48:44.680 | And GitHub fine-grained authentication tokens
00:48:47.240 | allow you to specify on a very granular basis.
00:48:53.120 | On this repo, you have permission to do this.
00:48:55.400 | On this repo, you have permission to do this.
00:48:57.640 | You also can prevent people from pushing to the main branch
00:49:01.640 | unless they get approved.
00:49:03.640 | You can do all of these other things.
00:49:05.080 | And I think these were all developed for human developers
00:49:08.200 | or like the branch protection rules
00:49:09.760 | were developed for human developers.
00:49:11.120 | The fine-grained authentication tokens
00:49:12.480 | were developed for GitHub apps.
00:49:14.080 | I think for GitHub, maybe just pushing this
00:49:19.880 | like a little bit more is the way to do this.
00:49:22.640 | For other things, they're totally not prepared
00:49:26.360 | to give that sort of fine-grained control.
00:49:28.560 | Like most APIs don't have something
00:49:30.200 | like a fine-grained authentication token.
00:49:32.280 | And that goes into my like comment
00:49:33.640 | that we're gonna need to prepare the world for agents,
00:49:35.880 | I think.
00:49:37.520 | But I think like the GitHub authentication tokens
00:49:39.880 | are like a good template
00:49:41.240 | for how you could start doing that maybe.
00:49:42.640 | But yeah, I don't know.
00:49:43.800 | I don't have an answer.
00:49:45.440 | - I'll let you know if I find one.
00:49:46.560 | - Okay, yeah, thank you.
00:49:47.760 | Cool.
00:49:50.560 | I'm gonna finish up.
00:49:51.680 | Let me just see.
00:49:53.040 | Okay, so this one did write a script.
00:50:00.800 | I'm not gonna actually read it for you.
00:50:03.560 | And then the other one, let's see.
00:50:06.320 | Yeah, so it sent a PR.
00:50:12.960 | Sorry, what is the PR URL?
00:50:18.000 | (silence)
00:50:20.160 | So I don't know if this...
00:50:24.920 | Sorry, that's taking way longer than it should.
00:50:29.680 | Okay, cool.
00:50:32.240 | Yeah, so this one sent a PR.
00:50:35.800 | I'll tell you later if this actually like successfully...
00:50:40.800 | Oh, no, it's deployed on Vercel.
00:50:42.240 | So I can actually show you.
00:50:44.400 | But let me try this real quick.
00:50:46.640 | Sorry, I know I don't have time.
00:50:48.880 | Yeah, there you go.
00:51:11.680 | I have pie charts now, so yeah.
00:51:15.760 | It's so fun.
00:51:16.600 | It's so fun to play with these things
00:51:17.920 | 'cause you could just do that while I'm giving a talk.
00:51:21.040 | Things like that.
00:51:21.880 | So yeah, thanks.
00:51:23.040 | (audience applauds)