The new Claude 3.5 Sonnet, Computer Use, and Building SOTA Agents — with Erik Schluntz, Anthropic
Chapters
0:00 Introductions
3:39 What is SWE-Bench?
12:22 SWE-Bench vs HumanEval vs others
15:21 SWE-Agent architecture and runtime
21:18 Do you need code indexing?
24:50 Giving the agent tools
27:47 Sandboxing for coding agents
29:16 Why not write tests?
30:31 Redesigning engineering tools for LLMs
35:53 Multi-agent systems
37:52 Why XML so good?
42:57 Thoughts on agent frameworks
45:12 How many turns can an agent do?
47:12 Using multiple model types
51:40 Computer use and agent use cases
59:4 State of AI robotics
64:24 Robotics in manufacturing
65:1 Hardware challenges in robotics
69:21 Is self-driving a good business?
00:00:06.240 |
This is Alessio, partner and CTO at Decibel Partners. 00:00:16.920 |
to have Erik Schluntz from Anthropic with us. 00:00:21.480 |
I'm a member of technical staff at Anthropic, 00:00:23.860 |
working on tool use, computer use, and SWE-Bench. 00:00:27.760 |
Yeah, well, how did you get into just the whole AI journey? 00:00:32.760 |
I think you spent some time at SpaceX as well? 00:00:37.120 |
Yeah, there's a lot of overlap between the robotics people 00:00:43.320 |
between language models for robots right now. 00:00:51.840 |
But before joining Anthropic, I was the CTO and co-founder 00:01:01.400 |
would patrol through an office building or a warehouse, 00:01:07.720 |
We would just call a remote operator if we saw anything. 00:01:11.800 |
So we have about 100 of those out in the world, 00:01:15.840 |
We actually got acquired about six months ago. 00:01:20.560 |
because I was starting to get a lot more excited about AI. 00:01:23.160 |
I had been writing a lot of my code with things like Copilot. 00:01:26.160 |
And I was like, wow, this is actually really cool. 00:01:28.200 |
If you had told me 10 years ago that AI would 00:01:30.800 |
be writing a lot of my code, I would say, hey, 00:01:34.240 |
And so I realized that we had passed this level. 00:01:37.560 |
We're like, wow, this is actually really useful 00:01:47.080 |
and then doing a lot of reading and research myself, 00:01:50.240 |
and decided, hey, I want to go be at the core of this 00:01:56.960 |
Did you consider maybe some of the robotics companies? 00:02:04.080 |
any sort of negative things I say about robotics 00:02:06.600 |
or hardware is coming from a place of burnout. 00:02:09.680 |
I reserve my right to change my opinion in a few years. 00:02:12.680 |
Yeah, I looked around, but ultimately I knew a lot of people 00:02:15.440 |
that I really trusted and I thought were incredibly smart 00:02:18.440 |
at Anthropic, and I think that was the big deciding factor 00:02:24.400 |
but sort of like the most nice and kind people that I know. 00:02:26.760 |
And so I just felt I could be a really good culture fit. 00:02:28.840 |
And ultimately, like, I do care a lot about AI safety 00:02:32.880 |
I don't want to build something that's used for bad purposes. 00:02:40.520 |
these labs kind of look like huge organizations 00:02:43.480 |
that have this like obscure ways to organize. 00:02:49.280 |
on like SWE-Bench and some of the stuff you publish, 00:02:51.480 |
or you kind of join and then you figure out where you land? 00:02:54.560 |
I think people are always curious to learn more. 00:02:59.080 |
is very bottoms up and sort of very sort of receptive 00:03:03.480 |
And so I joined sort of being very transparent of like, 00:03:06.320 |
hey, I'm most excited about code generation and AI 00:03:09.040 |
that can actually go out and sort of touch the world 00:03:13.360 |
And, you know, those weren't my initial projects. 00:03:16.600 |
hey, I want to do the most valuable possible thing 00:03:20.960 |
And, you know, like, let me find the balance of those. 00:03:23.120 |
So I was working on lots of things at the beginning, 00:03:28.000 |
and then sort of as it became more and more relevant, 00:03:43.480 |
I feel like there's just been a series of releases 00:03:48.080 |
Around about two, three months ago, 3.5 Sonnet came out 00:03:51.880 |
and it was a step ahead in terms of a lot of, 00:03:55.200 |
people immediately fell in love with it for coding. 00:03:59.280 |
you released a new updated version of Claude Sonnet. 00:04:01.840 |
We're not going to talk about the training for that 00:04:04.760 |
but I think Anthropic's done a really good job 00:04:11.200 |
but then also we're going to talk a little bit 00:04:15.960 |
about like why you looked at SWE-Bench Verified 00:04:18.840 |
and you actually like came up with a whole system 00:04:26.760 |
- Yeah, so I'm on a sub team called product research. 00:04:32.000 |
is to really understand like what end customers care about 00:04:42.240 |
on sort of these more abstract general benchmarks 00:04:47.960 |
but we really care about like finding the things 00:04:50.800 |
and making sure the models are great at those. 00:04:52.720 |
And so because I had been interested in coding agents, 00:04:55.640 |
sort of, I knew that this would be a really valuable thing. 00:04:59.280 |
and our customers trying to build coding agents 00:05:04.080 |
"a really good benchmark to be able to measure that 00:05:16.480 |
It fell to me to sort of both implement the benchmark, 00:05:20.600 |
and then also to sort of make sure we had an agent 00:05:26.120 |
maybe I'd call it, that could do very well on it. 00:05:36.560 |
and get sort of the most out of it as possible. 00:05:38.920 |
So with this blog post we released on SWE-Bench, 00:05:44.120 |
that we gave the model to be able to do well. 00:05:49.320 |
I think the general perception is they're like 00:06:22.640 |
And it's just 12 of these popular open source repos. 00:06:37.920 |
a very limited subset of real engineering tasks, 00:06:48.520 |
are really kind of these much more artificial setups 00:06:53.360 |
they're more like coding interview style questions 00:07:02.000 |
you all like get to use recursion in your day-to-day job, 00:07:11.440 |
It's like how different interview questions are. 00:07:16.320 |
But I think the, one of the most interesting things 00:07:18.720 |
about SWE-Bench is that all these other benchmarks 00:07:31.640 |
to the problem of finding the relevant files. 00:07:34.280 |
And this is a huge part of real engineering is, 00:07:37.000 |
it's actually again, pretty rare that you're starting 00:07:40.520 |
You need to go and figure out where in a code base 00:07:56.200 |
I don't even know if you can actually get to 100% 00:07:58.160 |
because some of the data is not actually solvable. 00:08:04.480 |
Because when you look at like the model releases, 00:08:06.440 |
it's like, oh, it's like 92% instead of like 89, 90% 00:08:15.240 |
Which is like, before 45% was state of the art, 00:08:18.360 |
but maybe like six months ago, it was like 30%, 00:08:24.960 |
Or do you think they're just going to run in parallel? 00:08:32.720 |
about just sort of greenfield code generation. 00:08:35.440 |
And so I don't think that everything needs to go 00:08:42.360 |
is that SWE-Bench is certainly hard to implement 00:09:09.200 |
And maybe hopefully there's just sort of harder versions 00:09:16.960 |
Do you think that's something where it's like 00:09:38.400 |
but a lot of those, even though a human did them, 00:09:42.400 |
given the information that comes with the task. 00:09:46.880 |
is the test looks for a very specific error string, 00:09:55.280 |
And unless you know that's exactly what you're looking for, 00:10:07.160 |
and they hired humans to go review all these tasks 00:10:24.200 |
how difficult they thought the problems would be 00:10:30.760 |
an hour to four hours and greater than four hours. 00:10:41.880 |
with some of the remaining failures that I see, 00:10:47.920 |
sort of operates at the wrong level of abstraction. 00:10:55.080 |
when really the task is asking for a bigger refactor. 00:10:58.200 |
And some of those, you know, is the model's fault, 00:11:00.400 |
but a lot of times if you're just seeing the, 00:11:03.360 |
if you're just sort of seeing the GitHub issue, 00:11:05.600 |
it's not exactly clear like which way you should do. 00:11:25.560 |
even though our models are very good at vision. 00:11:38.080 |
and the model will just say, okay, it looks great. 00:11:42.520 |
So there's certainly extra juice to squeeze there 00:11:44.160 |
of just making sure the model really understands 00:11:52.480 |
So this is something that I have not looked at, 00:11:57.400 |
what is the union of all of the different tasks 00:12:00.040 |
that have been solved by at least one attempt 00:12:03.600 |
There's a ton of submissions to the benchmark. 00:12:07.440 |
how many of those 500 tasks, at least someone has solved. 00:12:11.160 |
And I think, you know, there's probably a bunch 00:12:14.840 |
And I think it'd be interesting to look at those and say, 00:12:19.360 |
Or are they just really hard and only a human could do them? 00:12:22.200 |
- Yeah, like specifically, is there a category of problems 00:12:27.480 |
- Yeah, yeah, and I think there definitely are. 00:12:28.680 |
The question is, are those fairly inaccessible 00:12:32.480 |
or are they just impossible because of the descriptions? 00:12:36.840 |
especially the ones that the human graders reviewed 00:12:40.160 |
as like taking longer than four hours are extremely difficult. 00:12:51.600 |
- They certainly did less than, yeah, than four hours. 00:12:56.360 |
with like human estimated time, you know what I mean? 00:12:58.520 |
Or do we have sort of more of X paradox type situations 00:13:06.400 |
- I actually haven't done like done the stats on that, 00:13:09.280 |
but I think that'd be really interesting to see 00:13:15.200 |
What is the likelihood of success with difficulty? 00:13:18.120 |
I think actually a really interesting thing that I saw, 00:13:21.360 |
one of my coworkers who was also working on this 00:13:23.960 |
named Simon, he was focusing just specifically 00:13:28.600 |
the ones that are said to take longer than four hours. 00:13:39.160 |
but a lower score overall in the whole benchmark. 00:13:43.240 |
which is sort of much more simple and bare bones, 00:13:49.960 |
And I think some of that is the really detailed prompt 00:13:56.640 |
'Cause honestly, a lot of the SWE-Bench problems, 00:14:00.200 |
and where it's like, hey, this crashes if this is none, 00:14:02.600 |
and really all you need to do is put a check if none. 00:14:05.040 |
And so sometimes like trying to make the model 00:14:07.360 |
think really deeply, like it'll think in circles 00:14:11.140 |
which certainly human engineers are capable of as well. 00:14:17.040 |
might not be the best prompt for easy problems. 00:14:20.080 |
Are you supposed to fix it at the model level? 00:14:22.240 |
Like how do I know what prompt I'm supposed to use? 00:14:25.600 |
- Yeah, and I'll say this was a very small effect size. 00:14:29.280 |
I think this isn't like worth obsessing over, 00:14:31.780 |
but I would say that as people are building systems 00:14:35.200 |
around agents, I think the more you can separate out 00:14:39.000 |
the different kinds of work the agent needs to do, 00:14:41.840 |
the better you can tailor a prompt for that task. 00:14:46.440 |
for instance, if you were trying to make an agent 00:14:48.040 |
that could both, you know, solve hard programming tasks, 00:14:52.200 |
and it could just like, you know, write quick test files 00:14:55.880 |
for something that someone else had already made, 00:15:02.400 |
where they first sort of have a classification 00:15:04.600 |
and then route the problem to two different prompts. 00:15:09.000 |
because one, it makes the two different prompts 00:15:13.760 |
And it means you can have someone work on one of the prompts 00:15:16.460 |
without any risk of affecting the other tasks. 00:15:18.740 |
So it creates like a nice separation of concerns. 00:15:20.760 |
- Yeah, and the other model behavior thing you mentioned, 00:15:28.000 |
You know, I think that's maybe like the lazy model question 00:15:33.400 |
why are you not just generating the whole code 00:15:44.200 |
the easier solution rather than the hard solution. 00:15:47.580 |
I think what you're talking about is like the lazy model 00:15:49.340 |
is like when the model says like dot, dot, dot, 00:15:54.000 |
- I think honestly, like that just comes as like, 00:15:57.300 |
people on the internet will do stuff like that. 00:15:59.260 |
And like, dude, if you were talking to a friend 00:16:01.460 |
and you asked them like to give you some example code, 00:16:06.900 |
And so I think that's just a matter of like, you know, 00:16:09.380 |
sometimes you actually do just want like the relevant changes 00:16:14.380 |
this is something where a lot of times like, you know, 00:16:19.240 |
So I think that like the more explicit you can be 00:16:36.940 |
Lex Fridman just dropped his five hour pod with Dario 00:16:41.580 |
And Dario actually made this interesting observation 00:16:45.700 |
we complain about models being too chatty in text 00:16:50.980 |
And so like getting that right is kind of an awkward bar 00:16:57.700 |
but then you also want it to be complete in code. 00:17:02.540 |
which is something that Anthropic has also released 00:17:05.620 |
with, you know, like the fast edit stuff that you guys did. 00:17:08.740 |
And then the other thing I wanted to also double back on 00:17:17.020 |
I think we'll go into SWE-Agent in a little bit, 00:17:28.020 |
I think something that Anthropic has done really well 00:17:32.980 |
And so why can't you just develop a meta-prompt 00:17:39.340 |
Obviously I'm probably hand-waving a little bit, 00:17:42.900 |
to try the Anthropic Workbench meta-prompting system 00:17:47.700 |
I went to the build day recently at Anthropic HQ 00:17:57.620 |
- Yeah, no, Claude is great at writing prompts for Claude. 00:18:04.620 |
even like very smart humans still use sort of checklists 00:18:13.340 |
And certainly, you know, a very senior engineer 00:18:20.660 |
And so I always try to anthropomorphize the models 00:18:31.860 |
And would you need to give them a lot of instruction 00:18:52.220 |
built sort of these very hard-coded and rigid workflows 00:19:05.940 |
But one of the things that we really wanted to explore 00:19:08.300 |
was like, let's really give Claude the reins here 00:19:20.020 |
what we did is like the most extreme version of this 00:19:24.940 |
and it's able to keep calling the tools, keep thinking, 00:19:28.020 |
and then yeah, keep doing that until it thinks it's done. 00:19:31.060 |
And that's sort of the most minimal agent framework 00:19:51.140 |
And I think that's something that you didn't see 00:19:55.140 |
Some of the existing agent frameworks that I looked at, 00:19:57.260 |
they had whole systems built to try to detect loops 00:20:00.700 |
and see, oh, is the model doing the same thing, 00:20:07.220 |
the less you need that kind of extra scaffolding. 00:20:13.740 |
until it thinks it's done was the most minimal framework 00:20:18.260 |
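(A minimal sketch of that "just let it run until it thinks it's done" loop, assuming the Anthropic Messages API tool-use flow; the tool set, prompt, model alias, and run_tool implementation here are illustrative, not the actual SWE-Bench harness.)

```python
import anthropic

client = anthropic.Anthropic()

# One simple tool; a file-editing tool would be declared the same way.
tools = [
    {"name": "bash", "description": "Run a shell command in the repo.",
     "input_schema": {"type": "object",
                      "properties": {"command": {"type": "string"}},
                      "required": ["command"]}},
]

def run_tool(name: str, tool_input: dict) -> str:
    """Execute the requested tool inside the sandbox and return its output (hypothetical)."""
    ...

messages = [{"role": "user", "content": "Here is the GitHub issue to fix: ..."}]

while True:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=4096,
        tools=tools,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})

    # No tool call means the model thinks it's done; stop the loop.
    if response.stop_reason != "tool_use":
        break

    # Execute every tool call and feed the results back as the next user turn.
    results = [
        {"type": "tool_result", "tool_use_id": block.id,
         "content": run_tool(block.name, block.input)}
        for block in response.content if block.type == "tool_use"
    ]
    messages.append({"role": "user", "content": results})
```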
- So you're not pruning like bad paths from the context. 00:20:25.220 |
- Yes, and so I would say the downside of this 00:20:27.380 |
is that this is sort of a very token expensive way 00:20:30.540 |
- But still, it's very common to prune bad paths 00:20:34.220 |
- Yeah, but I'd say that, yeah, 3.5 is not getting stuck 00:20:44.060 |
this is definitely an area of future research, 00:20:48.580 |
that are going to take a human more than four hours. 00:20:56.900 |
be able to accomplish this task within 200K tokens. 00:20:59.940 |
So certainly I think there's like future research 00:21:03.300 |
but it's not necessary to do well on these benchmarks. 00:21:06.140 |
- Another thing I always have questions about 00:21:09.660 |
there's a mini cottage industry of code indexers 00:21:18.540 |
And I think I'd say there's like two reasons for this. 00:21:27.420 |
Sonnet is very good at what we call agentic search 00:21:32.220 |
is letting the model decide how to search for something. 00:21:40.540 |
So if you read through a lot of the traces of the SWE-Bench, 00:21:44.260 |
the model is calling tools to view directories, 00:21:50.660 |
until it feels like it's found the file where the bug is 00:21:58.500 |
everything we did was about just giving Claude the full reins 00:22:10.940 |
- Or embedding things into a vector database. 00:22:16.300 |
But again, this is very, very token expensive. 00:22:19.660 |
And so certainly, and it also takes many, many turns. 00:22:27.860 |
- And just to make it clear, it's using the bash tool, 00:22:46.260 |
Like it'll only do an ls sort of two directories deep 00:22:52.780 |
I would say actually we did more engineering of the tools 00:23:29.460 |
you would need to do more of this exhaustive search 00:23:31.620 |
where an agentic search would take way too long. 00:23:33.660 |
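(Purely to illustrate what such a trajectory looks like, here is a made-up sequence of bash calls an agentic search might take; the repo layout and error string are invented.)

```python
# Hypothetical shape of an agentic-search trajectory: the model picks its own
# shell commands to localize a bug, with no code index or embeddings involved.
agentic_search_trace = [
    ("bash", "ls"),                                    # orient in the repo root
    ("bash", "ls src/"),                               # drill into the likely package
    ("bash", "grep -rn 'ZeroDivisionError' src/"),     # grep for the error string from the issue
    ("bash", "sed -n '1,80p' src/utils/math_ops.py"),  # read the suspicious file
    # ...then switch to the file-editing tool once the bug is located...
]
```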
- As someone who has spent the last few years 00:23:43.460 |
because there's so much virtualization that we do. 00:23:46.620 |
with where the code problems are actually appearing. 00:24:04.220 |
- I will say SWE-Bench just released SWE-Bench Multimodal 00:24:08.020 |
which I believe is either entirely JavaScript 00:24:17.340 |
I think it's on the list and there's interest, 00:24:34.700 |
- Yeah, sort of running on our own internal code base. 00:24:37.940 |
- Since you spend so much time on the tool design, 00:24:47.180 |
Is there some special way to look at files, feed them in? 00:24:50.580 |
- I would say the core of that tool is string replace. 00:24:56.900 |
with different ways to specify how to edit a file. 00:25:02.100 |
the model has to write out the existing version 00:25:08.100 |
We found that to be the most reliable way to do these edits. 00:25:21.300 |
And if you're in a very big file, it's cost prohibitive. 00:25:28.140 |
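(A minimal sketch of a string-replace edit in that spirit: the model supplies the exact existing text and its replacement, and missing or ambiguous matches come back as actionable errors. The function name and error wording are assumptions, not Anthropic's shipped tool.)

```python
from pathlib import Path

def str_replace(path: str, old_str: str, new_str: str) -> str:
    """Replace one exact occurrence of old_str in the file with new_str."""
    source = Path(path).read_text()
    count = source.count(old_str)
    if count == 0:
        return f"Error: old_str was not found in {path}; re-read the file and retry."
    if count > 1:
        return (f"Error: old_str occurs {count} times in {path}; "
                "include more surrounding lines so the match is unique.")
    Path(path).write_text(source.replace(old_str, new_str, 1))
    return f"Successfully edited {path}."
```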
And they actually have pretty big differences 00:25:34.940 |
where they explore some of these different methods 00:25:38.500 |
for editing files and they post results about them, 00:25:42.180 |
But I think this is like a really good example 00:25:54.940 |
like they're just writing an API for a computer. 00:25:59.700 |
it's sort of just the bare bones of what you'd need. 00:26:02.620 |
And honestly, like it's so hard for the models to use those. 00:26:06.700 |
I come back to anthropomorphizing these models. 00:26:10.820 |
and you just read this for the very first time 00:26:36.860 |
You want the model to literally write a patch file. 00:26:39.580 |
I think patch files have at the very beginning, 00:26:44.900 |
That means before the model has actually written the edit, 00:26:53.500 |
I'm pretty sure, I think it's something like that, 00:26:55.580 |
but I don't know if that's exactly the diff format, 00:27:06.420 |
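(For reference, a unified-diff hunk header looks like `@@ -42,7 +42,8 @@`, with illustrative numbers: it commits to replacing 7 lines starting at old line 42 with 8 new lines, so those counts have to be written before the edited lines themselves, which is exactly the awkward ordering being described.)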
goes into designing human interfaces for things. 00:27:09.700 |
This is like entirely what FrontEnd is about, 00:27:11.900 |
is creating better interfaces to kind of do the same things. 00:27:31.500 |
Yeah, it's all open source if people wanna check it out. 00:27:34.500 |
I'm curious if there's an environment element 00:27:43.060 |
'Cause that can be slow or resource intensive. 00:27:45.540 |
Do you have anything else that you would recommend? 00:27:49.220 |
about sort of public details or about private details 00:27:54.500 |
But obviously, we need to have sort of safe, secure, 00:27:59.100 |
for the models to be able to practice writing code 00:28:03.140 |
- I'm aware of a few startups working on agent sandboxing. 00:28:11.340 |
where they're focusing on snapshotting memory 00:28:20.900 |
Whereas here, I think that the kinds of tools 00:28:30.020 |
- Yeah, I think the computer use demo that we released 00:28:47.660 |
I'd say they're very specific for editing files 00:28:52.300 |
that's actually very general if you think about it. 00:28:57.180 |
or editing files, you can do with those tools. 00:29:08.140 |
rather than making tools that were very specific 00:29:10.700 |
for SWE-Bench, like run tests as its own tool, 00:29:22.340 |
and then you're running it against SWE-Bench anyway? 00:29:24.380 |
So it doesn't really need to write the test or? 00:29:26.740 |
- Yeah, so this is one of the interesting things 00:29:31.820 |
that the model's output is graded on are hidden from it. 00:29:34.860 |
That's basically so that the model can't cheat 00:29:37.060 |
by looking at the tests and writing the exact solution. 00:29:53.300 |
So the first thing the model does is try to reproduce that. 00:29:56.060 |
And so it's kind of then rerunning that script 00:30:03.140 |
that breaks some other test and it doesn't know about that. 00:30:05.540 |
- And should we be redesigning any tools, APIs? 00:30:08.780 |
We kind of talked about this on having more examples, 00:30:10.820 |
but I'm thinking even things of Q as a query parameter 00:30:15.340 |
It's easier for the model to re-query than read the Q. 00:30:19.860 |
but is there anything you've seen, like building this, 00:30:23.080 |
where it's like, "Hey, if I were to redesign some CLI tool, 00:30:26.740 |
"some API tool, I would change the way structure 00:30:31.420 |
- I don't think I've thought enough about that 00:30:34.820 |
but certainly just making everything more human-friendly. 00:30:37.840 |
Like having like more detailed documentation and examples. 00:30:45.340 |
Like so many, like just using the Linux command line, 00:30:54.340 |
Like, I don't want to go read through a hundred flags. 00:30:57.820 |
And again, so things that would be useful for a human 00:31:08.080 |
that is useful for human is this access to the internet. 00:31:20.880 |
You can't like look around for similar implementations. 00:31:23.940 |
These are all things that I do when I try to fix code. 00:31:31.520 |
but then also it's kind of not being fair to these agents 00:31:33.760 |
because they're not operating in a real world situation. 00:31:37.680 |
of course I'm giving it access to the internet 00:31:41.200 |
I don't have a question in there, more just like, 00:31:47.520 |
- I think that that's really important for humans. 00:31:50.200 |
But honestly, the models have so much general knowledge 00:31:52.800 |
from pre-training that it's like less important for them. 00:31:59.240 |
that was like, that came after the knowledge cutoff, 00:32:03.280 |
I think actually this is like a broader problem 00:32:08.640 |
and like what customers will actually care about 00:32:11.120 |
who are working on a coding agent for real use. 00:32:13.640 |
And I think one of those there is like internet access 00:32:30.520 |
and like really make sure it has a very detailed 00:32:38.680 |
are gonna be much more interactive with the agent 00:32:41.920 |
rather than this kind of like one-shot system. 00:32:44.720 |
And right now there's no benchmark that measures that. 00:32:49.080 |
to have some benchmark that is more interactive. 00:32:52.520 |
I don't know if you're familiar with TauBench, 00:32:56.500 |
where there's basically one LLM that's playing 00:32:59.800 |
the user or the customer that's getting support 00:33:02.480 |
and another LLM that's playing the support agent 00:33:05.520 |
and they interact and try to resolve the issue. 00:33:10.240 |
- And they also did MTBench for people listening along. 00:33:17.480 |
where like before the SWE-Bench task starts, 00:33:30.560 |
and like just get the exact thing out of the human 00:33:35.720 |
But I think that will be a really interesting thing to see. 00:33:41.960 |
I think one of the really great UX things they do 00:33:52.360 |
like having a planning step at the beginning, 00:33:55.040 |
one, just having that plan will improve performance 00:33:59.200 |
just because it's kind of like a bigger chain of thought, 00:34:10.200 |
that sort of has a much slower time through each loop. 00:34:12.920 |
If the human has approved this implementation plan, 00:34:28.960 |
there's a couple of comments on names that you dropped. 00:34:30.680 |
Copilot also does the plan stage before it writes code. 00:34:38.960 |
because it's not prompt to code, it's prompt plan code. 00:34:42.360 |
So there's a little bit of friction in there, 00:34:44.760 |
Like it actually, you get a lot for what it's worth. 00:34:50.320 |
where you can sort of edit the plan as it goes along. 00:34:54.640 |
we hosted a sort of dev day pre-game with Repl.it 00:35:00.720 |
So like having two agents kind of bounce off of each other. 00:35:04.080 |
I think it's a similar approach to what you're talking about 00:35:08.200 |
just as in the prompts of clarifying what the agent wants. 00:35:12.560 |
But typically I think this would be implemented 00:35:14.360 |
as a tool calling another agent, like a sub-agent. 00:35:33.520 |
that a lot of people will kind of get confused by 00:35:37.920 |
but really it's sort of usually the same model 00:36:00.200 |
If you want a plan that's very thorough and detailed, 00:36:04.120 |
If you want a really quick, just like write this function, 00:36:17.200 |
oh, maybe you're just getting lucky with XML, 00:36:21.000 |
in your own agent prompts, so they must work. 00:36:23.840 |
And why is it so model specific to your family? 00:36:31.200 |
that internally we've preferred XML for the data. 00:36:37.520 |
is that if you look at certain kinds of outputs, 00:36:43.800 |
Like if you're trying to output a code in JSON, 00:36:47.760 |
there's a lot of extra escaping that needs to be done. 00:36:50.200 |
I mean, that actually hurts model performance 00:36:56.440 |
there's none of that sort of escaping that needs to happen. 00:36:59.080 |
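(A small illustration of the escaping difference, with a made-up two-line snippet: inside a JSON string every quote and newline must be escaped, while inside XML tags the code can be emitted verbatim.)

```python
# The same snippet as the model would have to emit it in JSON vs. in XML.
as_json = '{"code": "def greet(name):\\n    print(f\\"hi {name}\\")"}'  # escapes required

as_xml = """<code>
def greet(name):
    print(f"hi {name}")
</code>"""  # written exactly as-is
```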
That being said, I haven't tried having it write, 00:37:04.440 |
into weird escaping things there, I'm not sure. 00:37:08.200 |
But yeah, I'd say that's some historical reasons 00:37:24.200 |
where you're pretty sure, like example one start, 00:37:26.720 |
example one end, like that is one cohesive unit. 00:37:37.480 |
Claude was just the first one to popularize it, I think. 00:37:39.240 |
- I do definitely prefer to read XML than read JSON, so yeah. 00:37:43.200 |
- Any other details that are like maybe underappreciated? 00:37:46.640 |
I know, for example, you had the absolute paths 00:37:51.920 |
- Yeah, no, I think that's a good sort of anecdote 00:37:56.080 |
Like I said, spend time prompt engineering your tools 00:38:00.920 |
but like write the tool and then actually give it 00:38:04.880 |
to the model and like read a bunch of transcripts 00:38:09.960 |
And I think you will find, like by doing that, 00:38:12.360 |
you will find areas where the model misunderstands a tool 00:38:16.000 |
or makes mistakes and then basically change the tool 00:38:28.360 |
you can have like a plug that can fit either way 00:38:30.320 |
and that's dangerous, or you can make it asymmetric 00:38:32.560 |
so that like it can't fit this way, it has to go like this. 00:38:41.560 |
one of the things that we saw while testing these tools 00:38:44.080 |
is, oh, if the model has like, you know, done CD 00:38:49.520 |
it would often get confused when trying to use the tool 00:38:52.520 |
because it's like now in a different directory. 00:38:56.200 |
So we said, oh, look, let's just force the tool 00:39:00.760 |
And then, you know, that's easy for the model to understand. 00:39:03.080 |
It knows sort of where it is, it knows where the files are. 00:39:06.000 |
And then once we have it always giving absolute paths, 00:39:08.600 |
it never messes up even like no matter where it is, 00:39:10.800 |
because it just, if you're using an absolute path, 00:39:16.160 |
let us make the tool foolproof for the model. 00:39:18.880 |
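(A sketch of that "make the tool foolproof" idea: the schema and description demand absolute paths, and relative ones are rejected with a corrective error. The names and wording here are assumptions, not the exact tool definition that shipped.)

```python
import os

edit_tool = {
    "name": "edit_file",
    "description": ("View and edit files. `path` must be an ABSOLUTE path, "
                    "e.g. /repo/src/module.py; relative paths are rejected."),
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Absolute path to the file."},
            "old_str": {"type": "string"},
            "new_str": {"type": "string"},
        },
        "required": ["path", "old_str", "new_str"],
    },
}

def validate_path(path: str) -> str | None:
    # Fail loudly instead of resolving the path against wherever the model last cd'd.
    if not os.path.isabs(path):
        return f"Error: {path} is not an absolute path; paths must start with '/'."
    return None
```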
I'd say there's other categories of things where we see, 00:39:21.480 |
oh, if the model, you know, opens Vim, like, you know, 00:39:33.760 |
it just text in, text out, it's not interactive. 00:39:38.860 |
It's that the way that the tool is like hooked up 00:39:44.280 |
- Yes, I mean, there is the meme of no one knows 00:39:47.400 |
You know, basically we just added instructions 00:39:50.060 |
in the tool of like, hey, don't launch commands 00:39:53.960 |
Like, yeah, like don't launch Vim, don't launch whatever. 00:39:58.840 |
put an ampersand after it or launch it in the background. 00:40:08.640 |
And I think like that's an underutilized space 00:40:11.160 |
of prompt engineering where like people might try to do that 00:40:16.360 |
So the model knows that it's like for this tool, 00:40:20.380 |
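(Illustrative only: the kind of guidance that can live in the bash tool's own description rather than the main prompt; the exact wording is hypothetical.)

```python
bash_tool_description = (
    "Run a shell command and return its output. The session is text-in, "
    "text-out only: do not launch interactive programs such as vim, nano, "
    "less, or a Python REPL. For long-running commands (e.g. a dev server), "
    "append '&' to run them in the background and inspect their logs instead."
)
```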
- You said you worked on the function calling and tool use 00:40:23.160 |
before you actually started the SWE-Bench work, right? 00:40:26.760 |
Because you basically went from creator of that API 00:40:44.660 |
I think some way, like right now it just takes a, 00:40:47.780 |
I think we sort of force people to do the best practices 00:40:50.580 |
of writing out sort of these full JSON schemas, 00:40:54.180 |
if you could just pass in a Python function as a tool. 00:41:00.380 |
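(A hypothetical convenience layer along those lines: derive the tool definition from a plain Python function's signature and docstring instead of hand-writing the JSON schema. This is not an existing Anthropic SDK feature, just a sketch of the ergonomics being described.)

```python
import inspect

PY_TO_JSON = {str: "string", int: "integer", float: "number", bool: "boolean"}

def function_to_tool(fn) -> dict:
    """Build an Anthropic-style tool definition from a Python function."""
    sig = inspect.signature(fn)
    props = {name: {"type": PY_TO_JSON.get(p.annotation, "string")}
             for name, p in sig.parameters.items()}
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "input_schema": {
            "type": "object",
            "properties": props,
            "required": [n for n, p in sig.parameters.items()
                         if p.default is inspect.Parameter.empty],
        },
    }

def grep_repo(pattern: str, max_results: int = 20) -> str:
    """Search the repository for a regex pattern and return matching lines."""
    ...

tools = [function_to_tool(grep_repo)]  # instead of hand-writing the schema
```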
- Instructor, you know, I don't know if there, 00:41:03.100 |
if there's anyone else that is specializing for Anthropic, 00:41:06.100 |
maybe Jeremy Howard and Simon Willison and stuff. 00:41:14.140 |
I also wanted to spend a little bit of time with SWE-Agent. 00:41:19.780 |
apart from it's the same authors as SWE-Bench? 00:41:25.180 |
so it just felt sort of like the safest, most neutral option. 00:41:45.340 |
than what we wanted to do, but it's still very close. 00:41:54.540 |
and talked with the SWE-Bench people directly. 00:41:58.220 |
we already know the authors, this will be easy, 00:42:05.900 |
and where they go to school, it all makes sense. 00:42:16.620 |
And that's like, think, act, observe, like that's all ReAct. 00:42:23.540 |
if you actually read our traces of our submission, 00:42:26.220 |
you can actually see like, think, act, observe, 00:42:29.300 |
And like, we just didn't even like change the printing code. 00:42:34.100 |
it's like doing still function calls under the hood 00:42:36.540 |
and the model can do sort of multiple function calls 00:42:39.500 |
in a row without thinking in between if it wants to. 00:42:43.540 |
and a lot of things we inherited from SWE-Agent 00:42:47.260 |
- Yeah, any thoughts about other agent frameworks? 00:42:51.260 |
the whole gamut from very simple to like very complex. 00:42:56.980 |
I think I haven't explored a lot of them in detail. 00:43:00.820 |
I would say with agent frameworks in general, 00:43:03.140 |
they can certainly save you some like boilerplate, 00:43:05.820 |
but I think there's actually this like downside 00:43:08.340 |
of making agents too easy where you end up very quickly 00:43:12.140 |
like building a much more complex system than you need. 00:43:15.020 |
And suddenly, you know, instead of having one prompt, 00:43:17.460 |
you have five agents that are talking to each other 00:43:20.660 |
And it's like, because the framework made that 10 lines 00:43:28.220 |
to like try to start without these frameworks if you can, 00:43:34.540 |
and be able to sort of directly understand what's going on. 00:43:37.740 |
I think a lot of times these frameworks also, 00:43:40.260 |
by trying to make everything feel really magical, 00:43:43.300 |
you end up sort of really hiding what the actual prompt 00:43:58.460 |
what's really happening and making it too easy 00:44:03.820 |
So yeah, I would recommend people to like try it 00:44:08.220 |
Would you rather have like a framework of tools? 00:44:18.020 |
if I had an easy way to get the best tool from you 00:44:21.540 |
and like you maintain the definition or yeah, 00:44:23.700 |
any thoughts on how you want to formalize tool sharing? 00:44:27.540 |
that we're certainly interested in exploring. 00:44:29.900 |
And I think there is space for sort of these general tools 00:44:37.500 |
they do have, you know, much more specific things 00:44:44.420 |
but the ultimate end applications are going to be bespoke. 00:44:48.740 |
that the model's great at any tool that it uses, 00:44:52.780 |
- So everything bespoke, no frameworks, no anything. 00:44:57.100 |
- Yeah, I would say that like the best thing I've seen 00:45:03.100 |
and then you can use those as building blocks. 00:45:05.780 |
I have a utils folder where I call these scripts. 00:45:13.220 |
There's a startup hidden in every utils folder, you know? 00:45:17.220 |
like it's a startup, you know, like at some point. 00:45:21.500 |
is there a maximum length of turns that it took? 00:45:27.860 |
I mean, we had, it had basically infinite turns 00:45:42.820 |
I'm trying to remember like the longest successful run, 00:45:45.620 |
but I think it was definitely over a hundred turns 00:45:52.180 |
But certainly, you know, these things can be a lot of turns. 00:45:53.940 |
And I think that's because some of these things 00:45:55.660 |
are really hard where it's going to take, you know, 00:46:01.100 |
think about a task that takes a human four hours to do, 00:46:03.980 |
like think about how many different like files you read 00:46:07.100 |
and like times you edit a file in four hours. 00:46:16.260 |
what's kind of like the return on the extra compute now? 00:46:19.060 |
So like, you know, if you had thousands of turns 00:46:21.540 |
or like whatever, like how much better would it get? 00:46:26.860 |
I think sort of one of the open areas of research 00:46:38.820 |
So you mentioned earlier things like pruning bad paths. 00:46:41.900 |
I think there's a lot of interesting work around there. 00:46:51.500 |
that you could have something that uses way more tokens 00:47:01.260 |
can you make the model sort of losslessly summarize 00:47:05.700 |
what it's learned from trying different approaches 00:47:12.940 |
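(A sketch of that kind of compaction, assuming the Anthropic Messages API; the threshold, prompt wording, and model alias are made up, and a real implementation would keep the user/assistant turn structure well-formed.)

```python
def compact_history(client, messages, keep_recent=10, max_chars=200_000):
    """If the transcript is too long, replace older turns with model-written notes."""
    if sum(len(str(m)) for m in messages) < max_chars:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = client.messages.create(
        model="claude-3-5-haiku-latest",  # a cheaper model can do the compression
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": ("Here is an agent transcript so far:\n\n" + str(old) +
                        "\n\nSummarize what has been tried, what failed and why, "
                        "and which files matter, so work can continue without "
                        "the full transcript."),
        }],
    )
    notes = "".join(b.text for b in summary.content if b.type == "text")
    return [{"role": "user", "content": f"Notes from earlier attempts:\n{notes}"}] + recent
```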
So you have Haiku, which is like, you know, cheaper. 00:47:17.580 |
to do a lot of these smaller things and then put it back up? 00:47:22.260 |
that they actually have a separate model for file editing. 00:47:25.340 |
I'm trying to remember, I think they were on a, 00:47:27.300 |
maybe the Lex Fridman podcast where they said like, 00:47:29.580 |
they have a bigger model, like write what the code should be 00:47:39.100 |
that they worked with on, it's speculative decoding. 00:47:42.020 |
- But I think there's also really interesting things 00:47:43.780 |
about like, you know, paring down input tokens as well. 00:47:47.020 |
Especially sometimes the models trying to read 00:47:48.900 |
like a 10,000 line file, like that's a lot of tokens. 00:48:07.060 |
I think there's a lot of really interesting room 00:48:11.860 |
the simplest, most minimal thing and show that it works. 00:48:16.620 |
sort of the agent community builds things like that 00:48:22.140 |
You know, we're not going to go and do lots more submissions 00:48:24.420 |
to SWE-Bench and try to prompt engineer this 00:48:31.020 |
But yeah, so I think that's a really interesting one. 00:48:40.060 |
It itself is actually very smart, which is great. 00:48:44.300 |
with this like combination of the two models. 00:48:46.940 |
But yeah, I think that's one of the exciting things 00:48:51.980 |
shows that sort of even our smallest, fastest models 00:48:58.940 |
Like it's not just sort of for writing simple text anymore. 00:49:02.580 |
- And I know you're not going to talk about it, 00:49:03.980 |
but like Sonnet is not even supposed to be the best model. 00:49:06.860 |
You know, like Opus, it's kind of like we left it 00:49:11.620 |
At some point, I'm sure the new Opus will come out. 00:49:14.180 |
And if you had Opus Plus on it, that sounds very, very good. 00:49:20.500 |
but that's the official SWE-Bench guys doing it. 00:49:28.420 |
I mean, you could just change the model name. 00:49:32.740 |
but I think we included it in our model card. 00:49:48.180 |
so we didn't feel like they need to submit the benchmark. 00:49:51.260 |
- We can cut over to computer use if we're okay 00:49:53.140 |
with moving on to topics on this, if anything else. 00:49:57.540 |
I think, I'm trying to think if there's anything else 00:50:01.100 |
- It doesn't have to be also just specifically SWE-Bench, 00:50:15.940 |
But there's obviously a ton of low-hanging fruit 00:50:18.620 |
So just your thoughts on if you were to build 00:50:23.820 |
- I think the really interesting question for me 00:50:26.180 |
for all the startups out there is this kind of divergence 00:50:29.220 |
between the benchmarks and what real customers will want. 00:50:37.300 |
What are the differences that they're starting to make? 00:50:40.660 |
I'm actually very curious what they will see, 00:50:43.940 |
I feel like it's like slowed down a little bit 00:50:53.500 |
So we had Cosine on, they had like a 50-something on full, 00:50:58.220 |
on SWE-Bench Full, which is the hardest one. 00:51:00.580 |
And they were rejected because they didn't want 00:51:06.380 |
- We actually, tomorrow, we're talking to Bolt, 00:51:09.420 |
You guys actually published a case study with them. 00:51:22.820 |
My take on this is Anthropic shipped Adept as a feature, 00:51:28.700 |
- What was it like when you tried it for the first time? 00:51:30.900 |
Was it obvious that Claude had reached that stage 00:51:38.820 |
Like, I think, I actually, I had been on vacation, 00:51:41.200 |
and I came back, and everyone's like, computer use works. 00:51:44.580 |
And so it was kind of this very exciting moment. 00:51:46.980 |
I mean, after the first, just like, you know, go to Google, 00:51:48.900 |
I think I tried to have it play Minecraft or something, 00:51:50.940 |
and it actually like installed and like opened Minecraft. 00:51:59.660 |
there's certain things that it's not very good at yet. 00:52:02.380 |
But I'm really excited, I think, most broadly, 00:52:06.260 |
not just for like new things that weren't possible before, 00:52:10.140 |
but as a much lower friction way to implement tool use. 00:52:14.240 |
One anecdote from my days at Cobalt Robotics, 00:52:17.600 |
we wanted our robots to be able to ride elevators, 00:52:20.300 |
to go between floors and fully cover a building. 00:52:24.420 |
was doing API integrations with the elevator companies. 00:52:29.780 |
we could send that request, and it would move the elevator. 00:52:32.260 |
Each new company we did took like six months to do, 00:52:35.580 |
'cause they were very slow, they didn't really care. 00:52:40.940 |
- Even installing, like once we had it with the company, 00:52:43.380 |
they would have to like literally go install an API box 00:52:47.580 |
And that would sometimes take six months, so very slow. 00:52:51.260 |
And eventually we're like, okay, this is getting like, 00:52:54.200 |
slowing down all of our customer deployments. 00:52:56.640 |
And I was like, what if we just add an arm to the robot? 00:52:59.280 |
And I added this little arm that could literally go 00:53:08.060 |
and have the robot being able to use the elevators. 00:53:10.420 |
At the same time, it was slower than the API, 00:53:19.400 |
but it was slower and a little bit less reliable. 00:53:21.460 |
And I kind of see this as like an analogy to computer use 00:53:24.340 |
of like, anything you can do with computer use today, 00:53:29.660 |
and like integrate it with APIs to up to the language model. 00:53:33.280 |
But that's going to take a bunch of software engineering 00:53:38.100 |
With computer use, just give the thing a browser 00:53:40.700 |
that's logged into what you want to integrate with, 00:53:48.260 |
Of like, imagine like a customer support team, 00:53:51.380 |
where, okay, hey, you got this customer support bot, 00:53:54.480 |
but you need to go integrate it with all these things. 00:54:05.120 |
now, suddenly in one day, you could be up and rolling 00:54:09.920 |
that could go do all the actions you care about. 00:54:12.000 |
So I think that's the most exciting thing for me 00:54:13.700 |
about computer use is like reducing that friction 00:54:22.360 |
- Just go computer use, very high value use cases. 00:54:31.520 |
do you drive by vision or do you have special tools? 00:54:33.640 |
And vision is the universal tool to claim all tools. 00:54:37.520 |
There's trade-offs, but like there's situations 00:54:41.560 |
the one that we just put out had Stan Polu from DUST 00:54:50.360 |
between maybe like the high volume use cases, 00:54:54.280 |
you want APIs, and then the long tail, you want computer use. 00:55:00.680 |
with computer use, and then, hey, this is working. 00:55:09.040 |
- Yeah, I'd be interested to see a computer use agent 00:55:14.600 |
and then just dropping out of the equation altogether. 00:55:23.960 |
RPA for people listening is robotic process automation, 00:55:40.340 |
- Yeah, or have some way to turn Claude's actions 00:55:54.580 |
It's kind of like, "Hey, peace, run at your own risk." 00:55:58.160 |
- No, no, we launched it with, I think a VM or Docker, 00:56:01.880 |
- But it's not for your actual computer, right? 00:56:05.100 |
Like the Docker instance is like runs in the Docker. 00:56:23.180 |
We really care about providing a nice sort of, 00:56:30.120 |
And I mean, very quickly people made modifications 00:56:38.880 |
I would say also like from a product perspective right now, 00:56:44.320 |
I think a lot of the most useful use cases are, 00:56:56.380 |
you can't use your computer at the same time. 00:56:58.700 |
I think you actually like want it to have its own screen. 00:57:03.660 |
but only on one laptop versus you have two laptops. 00:57:09.140 |
- Yeah, I think it's just a better experience. 00:57:13.300 |
you want it to do for you on your own computer. 00:57:19.980 |
and maybe checking in on it every now and then. 00:57:24.640 |
half our audience is going to be too young to remember this, 00:57:35.680 |
that would be how you did like enterprise computing. 00:57:44.120 |
Is it a fun demo or is it like the future of Anthropic? 00:57:57.840 |
that then also like test the front end that they made. 00:58:01.240 |
So I think it's very cool to like use computer use 00:58:03.480 |
to be able to close the loop on a lot of things 00:58:05.380 |
that right now just a terminal based agent can't do. 00:58:18.440 |
this will be Amanda Askell, the head of Claude Character. 00:58:25.120 |
Giving it a name like computer use is very practical. 00:58:29.380 |
but maybe sometimes it's not about doing things, 00:58:35.480 |
In some way that's, you know, solving SWE-Bench, 00:58:37.120 |
like you should be allowed to use the internet 00:58:39.280 |
or you should be allowed to use a computer to solve it 00:58:45.120 |
with all these restrictions just 'cause we wanna play nice 00:58:50.480 |
a full AI will be able to do all these things, to think. 00:58:58.680 |
- Can we just do a, before we wrap, a robotics corner? 00:59:07.160 |
What's the state of AI robotics, under hyped, over hyped? 00:59:10.660 |
- Yeah, and I'll say like these are my opinions, 00:59:20.560 |
like there is really sort of incredible progress 00:59:26.320 |
that I think will be a big unlock for robotics. 00:59:28.620 |
The first is just general purpose language models. 00:59:33.020 |
that if to fully describe your task is harder 00:59:36.740 |
than to just do the task, you can never automate it. 00:59:51.440 |
and it's gonna know how do I make a Reuben sandwich? 00:59:56.280 |
Whereas before like the idea of even like a cooking thing, 00:59:59.040 |
it's like, oh God, like we're gonna have the team 01:00:02.980 |
for the long tail of anything, it'd be a disaster. 01:00:06.260 |
So I think that's one thing is that bringing common sense 01:00:09.120 |
really is like solves this huge problem describing tasks. 01:00:12.320 |
The second big innovation has been diffusion models 01:00:16.800 |
A lot of this work came out of Toyota Research. 01:00:19.760 |
There's a lot of startups now that are working on this 01:00:26.120 |
And the basic idea here is using a little bit of the, 01:00:29.800 |
I'd say maybe more inspiration from diffusion 01:00:39.720 |
Whereas previously all of robotics motion control 01:00:44.960 |
You either, you're programming in explicit motions 01:01:00.560 |
it's basically like learning from these examples. 01:01:05.920 |
And doing these in a way just like diffusion models 01:01:11.320 |
you can have it, the same model learn many different tasks. 01:01:14.840 |
And then the hope is that these start to generalize, 01:01:18.120 |
that if you've trained it on picking up coffee cups 01:01:21.200 |
and picking up books, then when I say pick up the backpack, 01:01:33.000 |
and then that's enough to really get it to generalize 01:01:40.880 |
have like measured some degree of generalization. 01:01:44.780 |
But at the end of the day, it's also like LLMs. 01:01:46.720 |
Like, you know, do you really care about the thing, 01:01:56.360 |
And you can just make sure it has good training 01:01:58.840 |
What you do care about then is like generalization 01:02:02.240 |
I've never seen this particular coffee mug before. 01:02:14.320 |
and diffusion inspired path planning algorithms. 01:02:23.520 |
is where self-driving cars were 10 years ago. 01:02:29.920 |
you had videos of people driving a car on the highway, 01:02:33.300 |
driving a car on a street with a safety driver, 01:02:37.060 |
but it's really taken a long time to go from there to, 01:02:41.220 |
And even then Waymo is only in SF and a few other cities. 01:02:44.540 |
And I think like it takes a long time for these things 01:02:55.940 |
That these models are really good at doing these demos 01:03:01.240 |
If they only work 99% of the time, like that sounds good, 01:03:08.080 |
Like imagine if like one out of every 100 dishes, 01:03:12.720 |
Like you would not want that robot in your house 01:03:15.080 |
or you certainly wouldn't want that in your factory 01:03:21.480 |
So I think for these things to really be useful, 01:03:24.080 |
they're gonna have to hit a very, very high level 01:03:32.360 |
for these models to move from like the 95% reliability 01:03:41.640 |
of how good the unit economics of these things will be. 01:03:45.320 |
These robots are gonna be very expensive to build. 01:03:54.600 |
it kind of sets an upper cap about how much you can charge. 01:03:57.440 |
And so, it seems like it's not that great a business. 01:04:09.520 |
which is like, it needs to be like very precise, 01:04:25.200 |
a lot of those traditional manufacturing robots 01:04:50.440 |
like sometimes you just have a servo that fails 01:04:53.520 |
and it takes a bunch of time to like fix that. 01:04:55.800 |
Is that holding back things or is the software still? 01:05:02.860 |
And I think a lot of the humanoid robot companies 01:05:05.060 |
now are really trying to build amazing hardware. 01:05:10.660 |
- You know, you build your first robot and it works, 01:05:13.860 |
Then you build 10 of them, five of them work, 01:05:16.300 |
three of them work half the time, two of them don't work, 01:05:18.460 |
and you built them all the same and you don't know why. 01:05:22.540 |
has like this level of detail and differences 01:05:34.200 |
Like imagine if every binary that you shipped to a customer, 01:05:36.880 |
each of those four loops was a little bit differently, 01:05:41.940 |
and sort of maintain quality of these things. 01:05:52.020 |
Where again, like you'll buy a batch of a hundred motors 01:05:57.340 |
a little bit differently to the same input command. 01:06:06.380 |
- We can't get the tolerance of motors down to- 01:06:14.700 |
- One of my horror stories was that at Cobalt, 01:06:20.900 |
that had a USB connection to the computer inside, 01:06:29.720 |
the user can just unplug it and plug it back in. 01:06:39.480 |
Again, because they assume someone will just unplug it 01:06:44.240 |
- I heard this too and I didn't listen to it. 01:06:48.480 |
a bunch of these thermal cameras started failing 01:06:56.520 |
Did the hardware design change around this node?" 01:07:00.920 |
looking at kernel logs of what's happening with this thing. 01:07:05.680 |
And finally, the procurement person was like, 01:07:09.120 |
I found this new vendor for USB cables last summer." 01:07:13.200 |
You switched which vendor we're buying USB cables from?" 01:07:16.080 |
And I'm like, "Yeah, it's the same exact cable. 01:07:32.680 |
and we'd need to reboot a big part of the system. 01:07:35.440 |
And it was all just 'cause the same exact spec, 01:07:38.080 |
these two different USB cables, like slightly different. 01:07:46.640 |
where they talked about buying tens of thousands of GPUs 01:08:02.560 |
Just the real world has this level of detail. 01:08:14.000 |
of complaints about hardware and supply chain. 01:08:17.040 |
And we know each other and we joke occasionally 01:08:25.920 |
The time of the real world is unlimited, right? 01:08:43.480 |
And yeah, I mean, this is like the whole thesis. 01:08:52.920 |
Like you're just kind of skeptical about self-driving 01:08:56.520 |
So I wanna like double click on this a little bit 01:08:59.200 |
because I mean, I think that that shouldn't be taken away. 01:09:03.760 |
Read from Waymo is pretty public with like their stats. 01:09:14.320 |
At some point they will recoup their investment, right? 01:09:25.960 |
like I don't know how much an Uber driver takes home a year, 01:09:30.400 |
that a Waymo is gonna be making in that same year. 01:09:51.720 |
like the cost of the car, the depreciation of the car. 01:10:00.800 |
- Well, they need to pre-assess the run Waymo 01:10:17.360 |
- I'm very excited to see a lot more LLM agents 01:10:23.000 |
I think there'll be the biggest limiting thing 01:10:28.880 |
And like, how do you trust the output of an agent 01:10:34.400 |
And if you can't find some way to trust that agent's work, 01:10:39.320 |
So I think that's gonna be a really important thing 01:10:43.040 |
but doing the work in a trustable, auditable way