The new Claude 3.5 Sonnet, Computer Use, and Building SOTA Agents — with Erik Schluntz, Anthropic
Chapters
0:00 Introductions
3:39 What is SWE-Bench?
12:22 SWE-Bench vs HumanEval vs others
15:21 SWE-Agent architecture and runtime
21:18 Do you need code indexing?
24:50 Giving the agent tools
27:47 Sandboxing for coding agents
29:16 Why not write tests?
30:31 Redesigning engineering tools for LLMs
35:53 Multi-agent systems
37:52 Why XML so good?
42:57 Thoughts on agent frameworks
45:12 How many turns can an agent do?
47:12 Using multiple model types
51:40 Computer use and agent use cases
59:4 State of AI robotics
64:24 Robotics in manufacturing
65:1 Hardware challenges in robotics
69:21 Is self-driving a good business?
00:00:06.240 |
This is Alessio, partner and CTO at Decibel Partners. 00:00:16.920 |
to have Erik Schluntz from Anthropic with us. 00:00:21.480 |
I'm a member of technical staff at Anthropic, 00:00:23.860 |
working on tool use, computer use, and SWE-Bench. 00:00:27.760 |
Yeah, well, how did you get into just the whole AI journey? 00:00:32.760 |
I think you spent some time at SpaceX as well? 00:00:37.120 |
Yeah, there's a lot of overlap between the robotics people 00:00:43.320 |
between language models for robots right now. 00:00:51.840 |
But before joining Anthropic, I was the CTO and co-founder 00:01:01.400 |
would patrol through an office building or a warehouse, 00:01:07.720 |
We would just call a remote operator if we saw anything. 00:01:11.800 |
So we have about 100 of those out in the world, 00:01:15.840 |
We actually got acquired about six months ago. 00:01:20.560 |
because I was starting to get a lot more excited about AI. 00:01:23.160 |
I had been writing a lot of my code with things like Copilot. 00:01:26.160 |
And I was like, wow, this is actually really cool. 00:01:28.200 |
If you had told me 10 years ago that AI would 00:01:30.800 |
be writing a lot of my code, I would say, hey, 00:01:34.240 |
And so I realized that we had passed this level. 00:01:37.560 |
We're like, wow, this is actually really useful 00:01:47.080 |
and then doing a lot of reading and research myself, 00:01:50.240 |
and decided, hey, I want to go be at the core of this 00:01:56.960 |
Did you consider maybe some of the robotics companies? 00:02:04.080 |
any sort of negative things I say about robotics 00:02:06.600 |
or hardware is coming from a place of burnout. 00:02:09.680 |
I reserve my right to change my opinion in a few years. 00:02:12.680 |
Yeah, I looked around, but ultimately I knew a lot of people 00:02:15.440 |
that I really trusted and I thought were incredibly smart 00:02:18.440 |
at Anthropic, and I think that was the big deciding factor 00:02:24.400 |
but sort of like the most nice and kind people that I know. 00:02:26.760 |
And so I just felt I could be a really good culture fit. 00:02:28.840 |
And ultimately, like, I do care a lot about AI safety 00:02:32.880 |
I don't want to build something that's used for bad purposes. 00:02:40.520 |
these labs kind of look like huge organizations 00:02:43.480 |
that have this like obscure ways to organize. 00:02:49.280 |
on like SWE-Bench and some of the stuff you publish, 00:02:51.480 |
or you kind of join and then you figure out where you land? 00:02:54.560 |
I think people are always curious to learn more. 00:02:59.080 |
is very bottoms up and sort of very sort of receptive 00:03:03.480 |
And so I joined sort of being very transparent of like, 00:03:06.320 |
hey, I'm most excited about code generation and AI 00:03:09.040 |
that can actually go out and sort of touch the world 00:03:13.360 |
And, you know, those weren't my initial projects. 00:03:16.600 |
hey, I want to do the most valuable possible thing 00:03:20.960 |
And, you know, like, let me find the balance of those. 00:03:23.120 |
So I was working on lots of things at the beginning, 00:03:28.000 |
and then sort of as it became more and more relevant, 00:03:43.480 |
I feel like there's just been a series of releases 00:03:48.080 |
Around about two, three months ago, 3.5 Sonnet came out 00:03:51.880 |
and it was a step ahead in terms of a lot of, 00:03:55.200 |
people immediately fell in love with it for coding. 00:03:59.280 |
you released a new updated version of Claude Sonnet. 00:04:01.840 |
We're not going to talk about the training for that 00:04:04.760 |
but I think Anthropic's done a really good job 00:04:11.200 |
but then also we're going to talk a little bit 00:04:15.960 |
about like why you looked at SWE-Bench Verified 00:04:18.840 |
and you actually like came up with a whole system 00:04:26.760 |
- Yeah, so I'm on a sub team called product research. 00:04:32.000 |
is to really understand like what end customers care about 00:04:42.240 |
on sort of these more abstract general benchmarks 00:04:47.960 |
but we really care about like finding the things 00:04:50.800 |
and making sure the models are great at those. 00:04:52.720 |
And so because I had been interested in coding agents, 00:04:55.640 |
sort of, I knew that this would be a really valuable thing. 00:04:59.280 |
and our customers trying to build coding agents 00:05:04.080 |
"a really good benchmark to be able to measure that 00:05:16.480 |
It fell to me to sort of both implement the benchmark, 00:05:20.600 |
and then also to sort of make sure we had an agent 00:05:26.120 |
maybe I'd call it, that could do very well on it. 00:05:36.560 |
and get sort of the most out of it as possible. 00:05:38.920 |
So with this blog post we released on SWE-Bench, 00:05:44.120 |
that we gave the model to be able to do well. 00:05:49.320 |
I think the general perception is they're like 00:06:22.640 |
And it's just 12 of these popular open source repos. 00:06:37.920 |
a very limited subset of real engineering tasks, 00:06:48.520 |
are really kind of these much more artificial setups 00:06:53.360 |
they're more like coding interview style questions 00:07:02.000 |
you all like get to use recursion in your day-to-day job, 00:07:11.440 |
It's like how different interview questions are. 00:07:16.320 |
But I think the, one of the most interesting things 00:07:18.720 |
about SWE-Bench is that all these other benchmarks 00:07:31.640 |
to the problem of finding the relevant files. 00:07:34.280 |
And this is a huge part of real engineering is, 00:07:37.000 |
it's actually again, pretty rare that you're starting 00:07:40.520 |
You need to go and figure out where in a code base 00:07:56.200 |
I don't even know if you can actually get to 100% 00:07:58.160 |
because some of the data is not actually solvable. 00:08:04.480 |
Because when you look at like the model releases, 00:08:06.440 |
it's like, oh, it's like 92% instead of like 89, 90% 00:08:15.240 |
Which is like, before 45% was state of the art, 00:08:18.360 |
but maybe like six months ago, it was like 30%, 00:08:24.960 |
Or do you think they're just going to run in parallel? 00:08:32.720 |
about just sort of greenfield code generation. 00:08:35.440 |
And so I don't think that everything needs to go 00:08:42.360 |
is that SWE-Bench is certainly hard to implement 00:09:09.200 |
And maybe hopefully there's just sort of harder versions 00:09:16.960 |
Do you think that's something where it's like 00:09:38.400 |
but a lot of those, even though a human did them, 00:09:42.400 |
given the information that comes with the task. 00:09:46.880 |
is the test looks for a very specific error string, 00:09:55.280 |
And unless you know that's exactly what you're looking for, 00:10:07.160 |
and they hired humans to go review all these tasks 00:10:24.200 |
how difficult they thought the problems would be 00:10:30.760 |
an hour to four hours and greater than four hours. 00:10:41.880 |
with some of the remaining failures that I see, 00:10:47.920 |
sort of operates at the wrong level of abstraction. 00:10:55.080 |
when really the task is asking for a bigger refactor. 00:10:58.200 |
And some of those, you know, is the model's fault, 00:11:00.400 |
but a lot of times if you're just seeing the, 00:11:03.360 |
if you're just sort of seeing the GitHub issue, 00:11:05.600 |
it's not exactly clear like which way you should do. 00:11:25.560 |
even though our models are very good at vision. 00:11:38.080 |
and the model will just say, okay, it looks great. 00:11:42.520 |
So there's certainly extra juice to squeeze there 00:11:44.160 |
of just making sure the model really understands 00:11:52.480 |
So this is something that I have not looked at, 00:11:57.400 |
what is the union of all of the different tasks 00:12:00.040 |
that have been solved by at least one attempt 00:12:03.600 |
There's a ton of submissions to the benchmark. 00:12:07.440 |
how many of those 500 tasks, at least someone has solved. 00:12:11.160 |
And I think, you know, there's probably a bunch 00:12:14.840 |
And I think it'd be interesting to look at those and say, 00:12:19.360 |
Or are they just really hard and only a human could do them? 00:12:22.200 |
- Yeah, like specifically, is there a category of problems 00:12:27.480 |
- Yeah, yeah, and I think there definitely are. 00:12:28.680 |
The question is, are those fairly inaccessible 00:12:32.480 |
or are they just impossible because of the descriptions? 00:12:36.840 |
especially the ones that the human graders reviewed 00:12:40.160 |
as like taking longer than four hours are extremely difficult. 00:12:51.600 |
- They certainly did less than, yeah, than four hours. 00:12:56.360 |
with like human estimated time, you know what I mean? 00:12:58.520 |
Or do we have sort of more of X paradox type situations 00:13:06.400 |
- I actually haven't done like done the stats on that, 00:13:09.280 |
but I think that'd be really interesting to see 00:13:15.200 |
What is the likelihood of success with difficulty? 00:13:18.120 |
I think actually a really interesting thing that I saw, 00:13:21.360 |
one of my coworkers who was also working on this 00:13:23.960 |
named Simon, he was focusing just specifically 00:13:28.600 |
the ones that are said to take longer than four hours. 00:13:39.160 |
but a lower score overall in the whole benchmark. 00:13:43.240 |
which is sort of much more simple and bare bones, 00:13:49.960 |
And I think some of that is the really detailed prompt 00:13:56.640 |
'Cause honestly, a lot of the SWE-Bench problems, 00:14:00.200 |
and where it's like, hey, this crashes if this is none, 00:14:02.600 |
and really all you need to do is put a check if none. 00:14:05.040 |
And so sometimes like trying to make the model 00:14:07.360 |
think really deeply, like it'll think in circles 00:14:11.140 |
which certainly human engineers are capable of as well. 00:14:17.040 |
might not be the best prompt for easy problems. 00:14:20.080 |
Are you supposed to fix it at the model level? 00:14:22.240 |
Like how do I know what prompt I'm supposed to use? 00:14:25.600 |
- Yeah, and I'll say this was a very small effect size. 00:14:29.280 |
I think this isn't like worth obsessing over, 00:14:31.780 |
but I would say that as people are building systems 00:14:35.200 |
around agents, I think the more you can separate out 00:14:39.000 |
the different kinds of work the agent needs to do, 00:14:41.840 |
the better you can tailor a prompt for that task. 00:14:46.440 |
for instance, if you were trying to make an agent 00:14:48.040 |
that could both, you know, solve hard programming tasks, 00:14:52.200 |
and it could just like, you know, write quick test files 00:14:55.880 |
for something that someone else had already made, 00:15:02.400 |
where they first sort of have a classification 00:15:04.600 |
and then route the problem to two different prompts. 00:15:09.000 |
because one, it makes the two different prompts 00:15:13.760 |
And it means you can have someone work on one of the prompts 00:15:16.460 |
without any risk of affecting the other tasks. 00:15:18.740 |
So it creates like a nice separation of concerns. 00:15:20.760 |
- Yeah, and the other model behavior thing you mentioned, 00:15:28.000 |
You know, I think that's maybe like the lazy model question 00:15:33.400 |
why are you not just generating the whole code 00:15:44.200 |
the easier solution rather than the hard solution. 00:15:47.580 |
I think what you're talking about is like the lazy model 00:15:49.340 |
is like when the model says like dot, dot, dot, 00:15:54.000 |
- I think honestly, like that just comes as like, 00:15:57.300 |
people on the internet will do stuff like that. 00:15:59.260 |
And like, dude, if you were talking to a friend 00:16:01.460 |
and you asked them like to give you some example code, 00:16:06.900 |
And so I think that's just a matter of like, you know, 00:16:09.380 |
sometimes you actually do just want like the relevant changes 00:16:14.380 |
this is something where a lot of times like, you know, 00:16:19.240 |
So I think that like the more explicit you can be 00:16:36.940 |
Lex Fridman just dropped his five hour pod with Dario 00:16:41.580 |
And Dario actually made this interesting observation 00:16:45.700 |
we complain about models being too chatty in text 00:16:50.980 |
And so like getting that right is kind of an awkward bar 00:16:57.700 |
but then you also want it to be complete in code. 00:17:02.540 |
which is something that Anthropic has also released 00:17:05.620 |
with, you know, like the fast edit stuff that you guys did. 00:17:08.740 |
And then the other thing I wanted to also double back on 00:17:17.020 |
I think we'll go into SWE-Agent in a little bit, 00:17:28.020 |
I think something that Anthropic has done really well 00:17:32.980 |
And so why can't you just develop a meta-prompt 00:17:39.340 |
Obviously I'm probably hand-waving a little bit, 00:17:42.900 |
to try the Anthropic Workbench meta-prompting system 00:17:47.700 |
I went to the build day recently at Anthropic HQ 00:17:57.620 |
- Yeah, no, Claude is great at writing prompts for Claude. 00:18:04.620 |
even like very smart humans still use sort of checklists 00:18:13.340 |
And certainly, you know, a very senior engineer 00:18:20.660 |
And so I always try to anthropomorphize the models 00:18:31.860 |
And would you need to give them a lot of instruction 00:18:52.220 |
built sort of these very hard-coded and rigid workflows 00:19:05.940 |
But one of the things that we really wanted to explore 00:19:08.300 |
was like, let's really give Claude the reins here 00:19:20.020 |
what we did is like the most extreme version of this 00:19:24.940 |
and it's able to keep calling the tools, keep thinking, 00:19:28.020 |
and then yeah, keep doing that until it thinks it's done. 00:19:31.060 |
And that's sort of the most minimal agent framework 00:19:51.140 |
And I think that's something that you didn't see 00:19:55.140 |
Some of the existing agent frameworks that I looked at, 00:19:57.260 |
they had whole systems built to try to detect loops 00:20:00.700 |
and see, oh, is the model doing the same thing, 00:20:07.220 |
the less you need that kind of extra scaffolding. 00:20:13.740 |
until it thinks it's done was the most minimal framework 00:20:18.260 |
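(A minimal sketch of that "just let it run until it thinks it's done" loop, assuming the Anthropic Messages API tool-use flow; the tool set, prompt, model alias, and run_tool implementation here are illustrative, not the actual SWE-Bench harness.)

```python
import anthropic

client = anthropic.Anthropic()

# One simple tool; a file-editing tool would be declared the same way.
tools = [
    {"name": "bash", "description": "Run a shell command in the repo.",
     "input_schema": {"type": "object",
                      "properties": {"command": {"type": "string"}},
                      "required": ["command"]}},
]

def run_tool(name: str, tool_input: dict) -> str:
    """Execute the requested tool inside the sandbox and return its output (hypothetical)."""
    ...

messages = [{"role": "user", "content": "Here is the GitHub issue to fix: ..."}]

while True:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=4096,
        tools=tools,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})

    # No tool call means the model thinks it's done; stop the loop.
    if response.stop_reason != "tool_use":
        break

    # Execute every tool call and feed the results back as the next user turn.
    results = [
        {"type": "tool_result", "tool_use_id": block.id,
         "content": run_tool(block.name, block.input)}
        for block in response.content if block.type == "tool_use"
    ]
    messages.append({"role": "user", "content": results})
```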
- So you're not pruning like bad paths from the context. 00:20:25.220 |
- Yes, and so I would say the downside of this 00:20:27.380 |
is that this is sort of a very token expensive way 00:20:30.540 |
- But still, it's very common to prune bad paths 00:20:34.220 |
- Yeah, but I'd say that, yeah, 3.5 is not getting stuck 00:20:44.060 |
this is definitely an area of future research, 00:20:48.580 |
that are going to take a human more than four hours. 00:20:56.900 |
be able to accomplish this task within 200K tokens. 00:20:59.940 |
So certainly I think there's like future research 00:21:03.300 |
but it's not necessary to do well on these benchmarks. 00:21:06.140 |
- Another thing I always have questions about 00:21:09.660 |
there's a mini cottage industry of code indexers 00:21:18.540 |
And I think I'd say there's like two reasons for this. 00:21:27.420 |
Sonnet is very good at what we call agentic search 00:21:32.220 |
is letting the model decide how to search for something. 00:21:40.540 |
So if you read through a lot of the traces of the SWE-Bench, 00:21:44.260 |
the model is calling tools to view directories, 00:21:50.660 |
until it feels like it's found the file where the bug is 00:21:58.500 |
everything we did was about just giving Claude the full reins 00:22:10.940 |
- Or embedding things into a vector database. 00:22:16.300 |
But again, this is very, very token expensive. 00:22:19.660 |
And so certainly, and it also takes many, many turns. 00:22:27.860 |
- And just to make it clear, it's using the bash tool, 00:22:46.260 |
Like it'll only do an ls sort of two directories deep 00:22:52.780 |
I would say actually we did more engineering of the tools 00:23:29.460 |
you would need to do more of this exhaustive search 00:23:31.620 |
where an agentic search would take way too long. 00:23:33.660 |
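(Purely to illustrate what such a trajectory looks like, here is a made-up sequence of bash calls an agentic search might take; the repo layout and error string are invented.)

```python
# Hypothetical shape of an agentic-search trajectory: the model picks its own
# shell commands to localize a bug, with no code index or embeddings involved.
agentic_search_trace = [
    ("bash", "ls"),                                    # orient in the repo root
    ("bash", "ls src/"),                               # drill into the likely package
    ("bash", "grep -rn 'ZeroDivisionError' src/"),     # grep for the error string from the issue
    ("bash", "sed -n '1,80p' src/utils/math_ops.py"),  # read the suspicious file
    # ...then switch to the file-editing tool once the bug is located...
]
```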
- As someone who has spent the last few years 00:23:43.460 |
because there's so much virtualization that we do. 00:23:46.620 |
with where the code problems are actually appearing. 00:24:04.220 |
- I will say SWE-Bench just released SWE-Bench Multimodal 00:24:08.020 |
which I believe is either entirely JavaScript 00:24:17.340 |
I think it's on the list and there's interest, 00:24:34.700 |
- Yeah, sort of running on our own internal code base. 00:24:37.940 |
- Since you spend so much time on the tool design, 00:24:47.180 |
Is there some special way to look at files, feed them in? 00:24:50.580 |
- I would say the core of that tool is string replace. 00:24:56.900 |
with different ways to specify how to edit a file. 00:25:02.100 |
the model has to write out the existing version 00:25:08.100 |
We found that to be the most reliable way to do these edits. 00:25:21.300 |
And if you're in a very big file, it's cost prohibitive. 00:25:28.140 |
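(A minimal sketch of a string-replace edit in that spirit: the model supplies the exact existing text and its replacement, and missing or ambiguous matches come back as actionable errors. The function name and error wording are assumptions, not Anthropic's shipped tool.)

```python
from pathlib import Path

def str_replace(path: str, old_str: str, new_str: str) -> str:
    """Replace one exact occurrence of old_str in the file with new_str."""
    source = Path(path).read_text()
    count = source.count(old_str)
    if count == 0:
        return f"Error: old_str was not found in {path}; re-read the file and retry."
    if count > 1:
        return (f"Error: old_str occurs {count} times in {path}; "
                "include more surrounding lines so the match is unique.")
    Path(path).write_text(source.replace(old_str, new_str, 1))
    return f"Successfully edited {path}."
```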
And they actually have pretty big differences 00:25:34.940 |
where they explore some of these different methods 00:25:38.500 |
for editing files and they post results about them, 00:25:42.180 |
But I think this is like a really good example 00:25:54.940 |
like they're just writing an API for a computer. 00:25:59.700 |
it's sort of just the bare bones of what you'd need. 00:26:02.620 |
And honestly, like it's so hard for the models to use those. 00:26:06.700 |
I come back to anthropomorphizing these models. 00:26:10.820 |
and you just read this for the very first time 00:26:36.860 |
You want the model to literally write a patch file. 00:26:39.580 |
I think patch files have at the very beginning, 00:26:44.900 |
That means before the model has actually written the edit, 00:26:53.500 |
I'm pretty sure, I think it's something like that, 00:26:55.580 |
but I don't know if that's exactly the diff format, 00:27:06.420 |
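(For reference, a unified-diff hunk header looks like `@@ -42,7 +42,8 @@`, with illustrative numbers: it commits to replacing 7 lines starting at old line 42 with 8 new lines, so those counts have to be written before the edited lines themselves, which is exactly the awkward ordering being described.)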
goes into designing human interfaces for things. 00:27:09.700 |
This is like entirely what FrontEnd is about, 00:27:11.900 |
is creating better interfaces to kind of do the same things. 00:27:31.500 |
Yeah, it's all open source if people wanna check it out. 00:27:34.500 |
I'm curious if there's an environment element 00:27:43.060 |
'Cause that can be slow or resource intensive. 00:27:45.540 |
Do you have anything else that you would recommend? 00:27:49.220 |
about sort of public details or about private details 00:27:54.500 |
But obviously, we need to have sort of safe, secure, 00:27:59.100 |
for the models to be able to practice writing code 00:28:03.140 |
- I'm aware of a few startups working on agent sandboxing. 00:28:11.340 |
where they're focusing on snapshotting memory 00:28:20.900 |
Whereas here, I think that the kinds of tools 00:28:30.020 |
- Yeah, I think the computer use demo that we released 00:28:47.660 |
I'd say they're very specific for editing files 00:28:52.300 |
that's actually very general if you think about it. 00:28:57.180 |
or editing files, you can do with those tools. 00:29:08.140 |
rather than making tools that were very specific 00:29:10.700 |
for SWE-Bench, like run tests as its own tool, 00:29:22.340 |
and then you're running it against SWE-Bench anyway? 00:29:24.380 |
So it doesn't really need to write the test or? 00:29:26.740 |
- Yeah, so this is one of the interesting things 00:29:31.820 |
that the model's output is graded on are hidden from it. 00:29:34.860 |
That's basically so that the model can't cheat 00:29:37.060 |
by looking at the tests and writing the exact solution. 00:29:53.300 |
So the first thing the model does is try to reproduce that. 00:29:56.060 |
And so it's kind of then rerunning that script 00:30:03.140 |
that breaks some other test and it doesn't know about that. 00:30:05.540 |
- And should we be redesigning any tools, APIs? 00:30:08.780 |
We kind of talked about this on having more examples, 00:30:10.820 |
but I'm thinking even things of Q as a query parameter 00:30:15.340 |
It's easier for the model to re-query than read the Q. 00:30:19.860 |
but is there anything you've seen, like building this, 00:30:23.080 |
where it's like, "Hey, if I were to redesign some CLI tool, 00:30:26.740 |
"some API tool, I would change the way structure 00:30:31.420 |
- I don't think I've thought enough about that 00:30:34.820 |
but certainly just making everything more human-friendly. 00:30:37.840 |
Like having like more detailed documentation and examples. 00:30:45.340 |
Like so many, like just using the Linux command line, 00:30:54.340 |
Like, I don't want to go read through a hundred flags. 00:30:57.820 |
And again, so things that would be useful for a human 00:31:08.080 |
that is useful for human is this access to the internet. 00:31:20.880 |
You can't like look around for similar implementations. 00:31:23.940 |
These are all things that I do when I try to fix code. 00:31:31.520 |
but then also it's kind of not being fair to these agents 00:31:33.760 |
because they're not operating in a real world situation. 00:31:37.680 |
of course I'm giving it access to the internet 00:31:41.200 |
I don't have a question in there, more just like, 00:31:47.520 |
- I think that that's really important for humans. 00:31:50.200 |
But honestly, the models have so much general knowledge 00:31:52.800 |
from pre-training that it's like less important for them. 00:31:59.240 |
that was like, that came after the knowledge cutoff, 00:32:03.280 |
I think actually this is like a broader problem 00:32:08.640 |
and like what customers will actually care about 00:32:11.120 |
who are working on a coding agent for real use. 00:32:13.640 |
And I think one of those there is like internet access 00:32:30.520 |
and like really make sure it has a very detailed 00:32:38.680 |
are gonna be much more interactive with the agent 00:32:41.920 |
rather than this kind of like one-shot system. 00:32:44.720 |
And right now there's no benchmark that measures that. 00:32:49.080 |
to have some benchmark that is more interactive. 00:32:52.520 |
I don't know if you're familiar with TauBench, 00:32:56.500 |
where there's basically one LLM that's playing 00:32:59.800 |
the user or the customer that's getting support 00:33:02.480 |
and another LLM that's playing the support agent 00:33:05.520 |
and they interact and try to resolve the issue. 00:33:10.240 |
- And they also did MTBench for people listening along. 00:33:17.480 |
where like before the SWE-Bench task starts, 00:33:30.560 |
and like just get the exact thing out of the human 00:33:35.720 |
But I think that will be a really interesting thing to see. 00:33:41.960 |
I think one of the really great UX things they do 00:33:52.360 |
like having a planning step at the beginning, 00:33:55.040 |
one, just having that plan will improve performance 00:33:59.200 |
just because it's kind of like a bigger chain of thought, 00:34:10.200 |
that sort of has a much slower time through each loop. 00:34:12.920 |
If the human has approved this implementation plan, 00:34:28.960 |
there's a couple of comments on names that you dropped. 00:34:30.680 |
Copilot also does the plan stage before it writes code. 00:34:38.960 |
because it's not prompt to code, it's prompt plan code. 00:34:42.360 |
So there's a little bit of friction in there, 00:34:44.760 |
Like it actually, you get a lot for what it's worth. 00:34:50.320 |
where you can sort of edit the plan as it goes along. 00:34:54.640 |
we hosted a sort of dev day pre-game with Repl.it 00:35:00.720 |
So like having two agents kind of bounce off of each other. 00:35:04.080 |
I think it's a similar approach to what you're talking about 00:35:08.200 |
just as in the prompts of clarifying what the agent wants. 00:35:12.560 |
But typically I think this would be implemented 00:35:14.360 |
as a tool calling another agent, like a sub-agent. 00:35:33.520 |
that a lot of people will kind of get confused by 00:35:37.920 |
but really it's sort of usually the same model 00:36:00.200 |
If you want a plan that's very thorough and detailed, 00:36:04.120 |
If you want a really quick, just like write this function, 00:36:17.200 |
oh, maybe you're just getting lucky with XML, 00:36:21.000 |
in your own agent prompts, so they must work. 00:36:23.840 |
And why is it so model specific to your family? 00:36:31.200 |
that internally we've preferred XML for the data. 00:36:37.520 |
is that if you look at certain kinds of outputs, 00:36:43.800 |
Like if you're trying to output a code in JSON, 00:36:47.760 |
there's a lot of extra escaping that needs to be done. 00:36:50.200 |
I mean, that actually hurts model performance 00:36:56.440 |
there's none of that sort of escaping that needs to happen. 00:36:59.080 |
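(A small illustration of the escaping difference, with a made-up two-line snippet: inside a JSON string every quote and newline must be escaped, while inside XML tags the code can be emitted verbatim.)

```python
# The same snippet as the model would have to emit it in JSON vs. in XML.
as_json = '{"code": "def greet(name):\\n    print(f\\"hi {name}\\")"}'  # escapes required

as_xml = """<code>
def greet(name):
    print(f"hi {name}")
</code>"""  # written exactly as-is
```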
That being said, I haven't tried having it write, 00:37:04.440 |
into weird escaping things there, I'm not sure. 00:37:08.200 |
But yeah, I'd say that's some historical reasons 00:37:24.200 |
where you're pretty sure, like example one start, 00:37:26.720 |
example one end, like that is one cohesive unit. 00:37:37.480 |
Claude was just the first one to popularize it, I think. 00:37:39.240 |
- I do definitely prefer to read XML than read JSON, so yeah. 00:37:43.200 |
- Any other details that are like maybe underappreciated? 00:37:46.640 |
I know, for example, you had the absolute paths 00:37:51.920 |
- Yeah, no, I think that's a good sort of anecdote 00:37:56.080 |
Like I said, spend time prompt engineering your tools 00:38:00.920 |
but like write the tool and then actually give it 00:38:04.880 |
to the model and like read a bunch of transcripts 00:38:09.960 |
And I think you will find, like by doing that, 00:38:12.360 |
you will find areas where the model misunderstands a tool 00:38:16.000 |
or makes mistakes and then basically change the tool 00:38:28.360 |
you can have like a plug that can fit either way 00:38:30.320 |
and that's dangerous, or you can make it asymmetric 00:38:32.560 |
so that like it can't fit this way, it has to go like this. 00:38:41.560 |
one of the things that we saw while testing these tools 00:38:44.080 |
is, oh, if the model has like, you know, done CD 00:38:49.520 |
it would often get confused when trying to use the tool 00:38:52.520 |
because it's like now in a different directory. 00:38:56.200 |
So we said, oh, look, let's just force the tool 00:39:00.760 |
And then, you know, that's easy for the model to understand. 00:39:03.080 |
It knows sort of where it is, it knows where the files are. 00:39:06.000 |
And then once we have it always giving absolute paths, 00:39:08.600 |
it never messes up even like no matter where it is, 00:39:10.800 |
because it just, if you're using an absolute path, 00:39:16.160 |
let us make the tool foolproof for the model. 00:39:18.880 |
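(A sketch of that "make the tool foolproof" idea: the schema and description demand absolute paths, and relative ones are rejected with a corrective error. The names and wording here are assumptions, not the exact tool definition that shipped.)

```python
import os

edit_tool = {
    "name": "edit_file",
    "description": ("View and edit files. `path` must be an ABSOLUTE path, "
                    "e.g. /repo/src/module.py; relative paths are rejected."),
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Absolute path to the file."},
            "old_str": {"type": "string"},
            "new_str": {"type": "string"},
        },
        "required": ["path", "old_str", "new_str"],
    },
}

def validate_path(path: str) -> str | None:
    # Fail loudly instead of resolving the path against wherever the model last cd'd.
    if not os.path.isabs(path):
        return f"Error: {path} is not an absolute path; paths must start with '/'."
    return None
```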
I'd say there's other categories of things where we see, 00:39:21.480 |
oh, if the model, you know, opens Vim, like, you know, 00:39:33.760 |
it just text in, text out, it's not interactive. 00:39:38.860 |
It's that the way that the tool is like hooked up 00:39:44.280 |
- Yes, I mean, there is the meme of no one knows 00:39:47.400 |
You know, basically we just added instructions 00:39:50.060 |
in the tool of like, hey, don't launch commands 00:39:53.960 |
Like, yeah, like don't launch Vim, don't launch whatever. 00:39:58.840 |
put an ampersand after it or launch it in the background. 00:40:08.640 |
And I think like that's an underutilized space 00:40:11.160 |
of prompt engineering where like people might try to do that 00:40:16.360 |
So the model knows that it's like for this tool, 00:40:20.380 |
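(Illustrative only: the kind of guidance that can live in the bash tool's own description rather than the main prompt; the exact wording is hypothetical.)

```python
bash_tool_description = (
    "Run a shell command and return its output. The session is text-in, "
    "text-out only: do not launch interactive programs such as vim, nano, "
    "less, or a Python REPL. For long-running commands (e.g. a dev server), "
    "append '&' to run them in the background and inspect their logs instead."
)
```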
- You said you worked on the function calling and tool use 00:40:23.160 |
before you actually started the SWE-Bench work, right? 00:40:26.760 |
Because you basically went from creator of that API 00:40:44.660 |
I think some way, like right now it just takes a, 00:40:47.780 |
I think we sort of force people to do the best practices 00:40:50.580 |
of writing out sort of these full JSON schemas, 00:40:54.180 |
if you could just pass in a Python function as a tool. 00:41:00.380 |
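(A hypothetical convenience layer along those lines: derive the tool definition from a plain Python function's signature and docstring instead of hand-writing the JSON schema. This is not an existing Anthropic SDK feature, just a sketch of the ergonomics being described.)

```python
import inspect

PY_TO_JSON = {str: "string", int: "integer", float: "number", bool: "boolean"}

def function_to_tool(fn) -> dict:
    """Build an Anthropic-style tool definition from a Python function."""
    sig = inspect.signature(fn)
    props = {name: {"type": PY_TO_JSON.get(p.annotation, "string")}
             for name, p in sig.parameters.items()}
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "input_schema": {
            "type": "object",
            "properties": props,
            "required": [n for n, p in sig.parameters.items()
                         if p.default is inspect.Parameter.empty],
        },
    }

def grep_repo(pattern: str, max_results: int = 20) -> str:
    """Search the repository for a regex pattern and return matching lines."""
    ...

tools = [function_to_tool(grep_repo)]  # instead of hand-writing the schema
```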
- Instructor, you know, I don't know if there, 00:41:03.100 |
if there's anyone else that is specializing for Anthropic, 00:41:06.100 |
maybe Jeremy Howard and Simon Willison and stuff. 00:41:14.140 |
I also wanted to spend a little bit of time with SWE-Agent. 00:41:19.780 |
apart from it's the same authors as SWE-Bench? 00:41:25.180 |
so it just felt sort of like the safest, most neutral option. 00:41:45.340 |
than what we wanted to do, but it's still very close. 00:41:54.540 |
and talked with the SWE-Bench people directly. 00:41:58.220 |
we already know the authors, this will be easy, 00:42:05.900 |
and where they go to school, it all makes sense. 00:42:16.620 |
And that's like, think, act, observe, like that's all ReAct. 00:42:23.540 |
if you actually read our traces of our submission, 00:42:26.220 |
you can actually see like, think, act, observe, 00:42:29.300 |
And like, we just didn't even like change the printing code. 00:42:34.100 |
it's like doing still function calls under the hood 00:42:36.540 |
and the model can do sort of multiple function calls 00:42:39.500 |
in a row without thinking in between if it wants to. 00:42:43.540 |
and a lot of things we inherited from SWE-Agent 00:42:47.260 |
- Yeah, any thoughts about other agent frameworks? 00:42:51.260 |
the whole gamut from very simple to like very complex. 00:42:56.980 |
I think I haven't explored a lot of them in detail. 00:43:00.820 |
I would say with agent frameworks in general, 00:43:03.140 |
they can certainly save you some like boilerplate, 00:43:05.820 |
but I think there's actually this like downside 00:43:08.340 |
of making agents too easy where you end up very quickly 00:43:12.140 |
like building a much more complex system than you need. 00:43:15.020 |
And suddenly, you know, instead of having one prompt, 00:43:17.460 |
you have five agents that are talking to each other 00:43:20.660 |
And it's like, because the framework made that 10 lines 00:43:28.220 |
to like try to start without these frameworks if you can, 00:43:34.540 |
and be able to sort of directly understand what's going on. 00:43:37.740 |
I think a lot of times these frameworks also, 00:43:40.260 |
by trying to make everything feel really magical, 00:43:43.300 |
you end up sort of really hiding what the actual prompt 00:43:58.460 |
what's really happening and making it too easy 00:44:03.820 |
So yeah, I would recommend people to like try it 00:44:08.220 |
Would you rather have like a framework of tools? 00:44:18.020 |
if I had an easy way to get the best tool from you 00:44:21.540 |
and like you maintain the definition or yeah, 00:44:23.700 |
any thoughts on how you want to formalize tool sharing? 00:44:27.540 |
that we're certainly interested in exploring. 00:44:29.900 |
And I think there is space for sort of these general tools 00:44:37.500 |
they do have, you know, much more specific things 00:44:44.420 |
but the ultimate end applications are going to be bespoke. 00:44:48.740 |
that the model's great at any tool that it uses, 00:44:52.780 |
- So everything bespoke, no frameworks, no anything. 00:44:57.100 |
- Yeah, I would say that like the best thing I've seen 00:45:03.100 |
and then you can use those as building blocks. 00:45:05.780 |
I have a utils folder where I call these scripts. 00:45:13.220 |
There's a startup hidden in every utils folder, you know? 00:45:17.220 |
like it's a startup, you know, like at some point. 00:45:21.500 |
is there a maximum length of turns that it took? 00:45:27.860 |
I mean, we had, it had basically infinite turns 00:45:42.820 |
I'm trying to remember like the longest successful run, 00:45:45.620 |
but I think it was definitely over a hundred turns 00:45:52.180 |
But certainly, you know, these things can be a lot of turns. 00:45:53.940 |
And I think that's because some of these things 00:45:55.660 |
are really hard where it's going to take, you know, 00:46:01.100 |
think about a task that takes a human four hours to do, 00:46:03.980 |
like think about how many different like files you read 00:46:07.100 |
and like times you edit a file in four hours. 00:46:16.260 |
what's kind of like the return on the extra compute now? 00:46:19.060 |
So like, you know, if you had thousands of turns 00:46:21.540 |
or like whatever, like how much better would it get? 00:46:26.860 |
I think sort of one of the open areas of research 00:46:38.820 |
So you mentioned earlier things like pruning bad paths. 00:46:41.900 |
I think there's a lot of interesting work around there. 00:46:51.500 |
that you could have something that uses way more tokens 00:47:01.260 |
can you make the model sort of losslessly summarize 00:47:05.700 |
what it's learned from trying different approaches 00:47:12.940 |
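(A sketch of that kind of compaction, assuming the Anthropic Messages API; the threshold, prompt wording, and model alias are made up, and a real implementation would keep the user/assistant turn structure well-formed.)

```python
def compact_history(client, messages, keep_recent=10, max_chars=200_000):
    """If the transcript is too long, replace older turns with model-written notes."""
    if sum(len(str(m)) for m in messages) < max_chars:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = client.messages.create(
        model="claude-3-5-haiku-latest",  # a cheaper model can do the compression
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": ("Here is an agent transcript so far:\n\n" + str(old) +
                        "\n\nSummarize what has been tried, what failed and why, "
                        "and which files matter, so work can continue without "
                        "the full transcript."),
        }],
    )
    notes = "".join(b.text for b in summary.content if b.type == "text")
    return [{"role": "user", "content": f"Notes from earlier attempts:\n{notes}"}] + recent
```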
So you have Haiku, which is like, you know, cheaper. 00:47:17.580 |
to do a lot of these smaller things and then put it back up? 00:47:22.260 |
that they actually have a separate model for file editing. 00:47:25.340 |
I'm trying to remember, I think they were on a, 00:47:27.300 |
maybe the Lex Fridman podcast where they said like, 00:47:29.580 |
they have a bigger model, like write what the code should be 00:47:39.100 |
that they worked with on, it's speculative decoding. 00:47:42.020 |
- But I think there's also really interesting things 00:47:43.780 |
about like, you know, paring down input tokens as well. 00:47:47.020 |
Especially sometimes the models trying to read 00:47:48.900 |
like a 10,000 line file, like that's a lot of tokens. 00:48:07.060 |
I think there's a lot of really interesting room 00:48:11.860 |
the simplest, most minimal thing and show that it works. 00:48:16.620 |
sort of the agent community builds things like that 00:48:22.140 |
You know, we're not going to go and do lots more submissions 00:48:24.420 |
to SWE-Bench and try to prompt engineer this 00:48:31.020 |
But yeah, so I think that's a really interesting one. 00:48:40.060 |
It itself is actually very smart, which is great. 00:48:44.300 |
with this like combination of the two models. 00:48:46.940 |
But yeah, I think that's one of the exciting things 00:48:51.980 |
shows that sort of even our smallest, fastest models 00:48:58.940 |
Like it's not just sort of for writing simple text anymore. 00:49:02.580 |
- And I know you're not going to talk about it, 00:49:03.980 |
but like Sonnet is not even supposed to be the best model. 00:49:06.860 |
You know, like Opus, it's kind of like we left it 00:49:11.620 |
At some point, I'm sure the new Opus will come out. 00:49:14.180 |
And if you had Opus Plus on it, that sounds very, very good. 00:49:20.500 |
but that's the official SWE-Bench guys doing it. 00:49:28.420 |
I mean, you could just change the model name. 00:49:32.740 |
but I think we included it in our model card. 00:49:48.180 |
so we didn't feel like they need to submit the benchmark. 00:49:51.260 |
- We can cut over to computer use if we're okay 00:49:53.140 |
with moving on to topics on this, if anything else. 00:49:57.540 |
I think, I'm trying to think if there's anything else 00:50:01.100 |
- It doesn't have to be also just specifically SWE-Bench, 00:50:15.940 |
But there's obviously a ton of low-hanging fruit 00:50:18.620 |
So just your thoughts on if you were to build 00:50:23.820 |
- I think the really interesting question for me 00:50:26.180 |
for all the startups out there is this kind of divergence 00:50:29.220 |
between the benchmarks and what real customers will want. 00:50:37.300 |
What are the differences that they're starting to make? 00:50:40.660 |
I'm actually very curious what they will see, 00:50:43.940 |
I feel like it's like slowed down a little bit 00:50:53.500 |
So we had Cosine on, they had like a 50-something on full, 00:50:58.220 |
on SWE-Bench Full, which is the hardest one. 00:51:00.580 |
And they were rejected because they didn't want 00:51:06.380 |
- We actually, tomorrow, we're talking to Bolt, 00:51:09.420 |
You guys actually published a case study with them. 00:51:22.820 |
My take on this is Anthropic shipped Adept as a feature, 00:51:28.700 |
- What was it like when you tried it for the first time? 00:51:30.900 |
Was it obvious that Claude had reached that stage 00:51:38.820 |
Like, I think, I actually, I had been on vacation, 00:51:41.200 |
and I came back, and everyone's like, computer use works. 00:51:44.580 |
And so it was kind of this very exciting moment. 00:51:46.980 |
I mean, after the first, just like, you know, go to Google, 00:51:48.900 |
I think I tried to have it play Minecraft or something, 00:51:50.940 |
and it actually like installed and like opened Minecraft. 00:51:59.660 |
there's certain things that it's not very good at yet. 00:52:02.380 |
But I'm really excited, I think, most broadly, 00:52:06.260 |
not just for like new things that weren't possible before, 00:52:10.140 |
but as a much lower friction way to implement tool use. 00:52:14.240 |
One anecdote from my days at Cobalt Robotics, 00:52:17.600 |
we wanted our robots to be able to ride elevators, 00:52:20.300 |
to go between floors and fully cover a building. 00:52:24.420 |
was doing API integrations with the elevator companies. 00:52:29.780 |
we could send that request, and it would move the elevator. 00:52:32.260 |
Each new company we did took like six months to do, 00:52:35.580 |
'cause they were very slow, they didn't really care. 00:52:40.940 |
- Even installing, like once we had it with the company, 00:52:43.380 |
they would have to like literally go install an API box 00:52:47.580 |
And that would sometimes take six months, so very slow. 00:52:51.260 |
And eventually we're like, okay, this is getting like, 00:52:54.200 |
slowing down all of our customer deployments. 00:52:56.640 |
And I was like, what if we just add an arm to the robot? 00:52:59.280 |
And I added this little arm that could literally go 00:53:08.060 |
and have the robot being able to use the elevators. 00:53:10.420 |
At the same time, it was slower than the API, 00:53:19.400 |
but it was slower and a little bit less reliable. 00:53:21.460 |
And I kind of see this as like an analogy to computer use 00:53:24.340 |
of like, anything you can do with computer use today, 00:53:29.660 |
and like integrate it with APIs to up to the language model. 00:53:33.280 |
But that's going to take a bunch of software engineering 00:53:38.100 |
With computer use, just give the thing a browser 00:53:40.700 |
that's logged into what you want to integrate with, 00:53:48.260 |
Of like, imagine like a customer support team, 00:53:51.380 |
where, okay, hey, you got this customer support bot, 00:53:54.480 |
but you need to go integrate it with all these things. 00:54:05.120 |
now, suddenly in one day, you could be up and rolling 00:54:09.920 |
that could go do all the actions you care about. 00:54:12.000 |
So I think that's the most exciting thing for me 00:54:13.700 |
about computer use is like reducing that friction 00:54:22.360 |
- Just go computer use, very high value use cases. 00:54:31.520 |
do you drive by vision or do you have special tools? 00:54:33.640 |
And vision is the universal tool to claim all tools. 00:54:37.520 |
There's trade-offs, but like there's situations 00:54:41.560 |
the one that we just put out had Stan Polu from DUST 00:54:50.360 |
between maybe like the high volume use cases, 00:54:54.280 |
you want APIs, and then the long tail, you want computer use. 00:55:00.680 |
with computer use, and then, hey, this is working. 00:55:09.040 |
- Yeah, I'd be interested to see a computer use agent 00:55:14.600 |
and then just dropping out of the equation altogether. 00:55:23.960 |
RPA for people listening is robotic process automation, 00:55:40.340 |
- Yeah, or have some way to turn Claude's actions 00:55:54.580 |
It's kind of like, "Hey, peace, run at your own risk." 00:55:58.160 |
- No, no, we launched it with, I think a VM or Docker, 00:56:01.880 |
- But it's not for your actual computer, right? 00:56:05.100 |
Like the Docker instance is like runs in the Docker. 00:56:23.180 |
We really care about providing a nice sort of, 00:56:30.120 |
And I mean, very quickly people made modifications 00:56:38.880 |
I would say also like from a product perspective right now, 00:56:44.320 |
I think a lot of the most useful use cases are, 00:56:56.380 |
you can't use your computer at the same time. 00:56:58.700 |
I think you actually like want it to have its own screen. 00:57:03.660 |
but only on one laptop versus you have two laptops. 00:57:09.140 |
- Yeah, I think it's just a better experience. 00:57:13.300 |
you want it to do for you on your own computer. 00:57:19.980 |
and maybe checking in on it every now and then. 00:57:24.640 |
half our audience is going to be too young to remember this, 00:57:35.680 |
that would be how you did like enterprise computing. 00:57:44.120 |
Is it a fun demo or is it like the future of Anthropic? 00:57:57.840 |
that then also like test the front end that they made. 00:58:01.240 |
So I think it's very cool to like use computer use 00:58:03.480 |
to be able to close the loop on a lot of things 00:58:05.380 |
that right now just a terminal based agent can't do. 00:58:18.440 |
this will be Amanda Askell, the head of Claude Character. 00:58:25.120 |
Giving it a name like computer use is very practical. 00:58:29.380 |
but maybe sometimes it's not about doing things, 00:58:35.480 |
In some way that's, you know, solving SWE-Bench, 00:58:37.120 |
like you should be allowed to use the internet 00:58:39.280 |
or you should be allowed to use a computer to solve it 00:58:45.120 |
with all these restrictions just 'cause we wanna play nice 00:58:50.480 |
a full AI will be able to do all these things, to think. 00:58:58.680 |
- Can we just do a, before we wrap, a robotics corner? 00:59:07.160 |
What's the state of AI robotics, under hyped, over hyped? 00:59:10.660 |
- Yeah, and I'll say like these are my opinions, 00:59:20.560 |
like there is really sort of incredible progress 00:59:26.320 |
that I think will be a big unlock for robotics. 00:59:28.620 |
The first is just general purpose language models. 00:59:33.020 |
that if to fully describe your task is harder 00:59:36.740 |
than to just do the task, you can never automate it. 00:59:51.440 |
and it's gonna know how do I make a Reuben sandwich? 00:59:56.280 |
Whereas before like the idea of even like a cooking thing, 00:59:59.040 |
it's like, oh God, like we're gonna have the team 01:00:02.980 |
for the long tail of anything, it'd be a disaster. 01:00:06.260 |
So I think that's one thing is that bringing common sense 01:00:09.120 |
really is like solves this huge problem describing tasks. 01:00:12.320 |
The second big innovation has been diffusion models 01:00:16.800 |
A lot of this work came out of Toyota Research. 01:00:19.760 |
There's a lot of startups now that are working on this 01:00:26.120 |
And the basic idea here is using a little bit of the, 01:00:29.800 |
I'd say maybe more inspiration from diffusion 01:00:39.720 |
Whereas previously all of robotics motion control 01:00:44.960 |
You either, you're programming in explicit motions 01:01:00.560 |
it's basically like learning from these examples. 01:01:05.920 |
And doing these in a way just like diffusion models 01:01:11.320 |
you can have it, the same model learn many different tasks. 01:01:14.840 |
And then the hope is that these start to generalize, 01:01:18.120 |
that if you've trained it on picking up coffee cups 01:01:21.200 |
and picking up books, then when I say pick up the backpack, 01:01:33.000 |
and then that's enough to really get it to generalize 01:01:40.880 |
have like measured some degree of generalization. 01:01:44.780 |
But at the end of the day, it's also like LLMs. 01:01:46.720 |
Like, you know, do you really care about the thing, 01:01:56.360 |
And you can just make sure it has good training 01:01:58.840 |
What you do care about then is like generalization 01:02:02.240 |
I've never seen this particular coffee mug before. 01:02:14.320 |
and diffusion inspired path planning algorithms. 01:02:23.520 |
is where self-driving cars were 10 years ago. 01:02:29.920 |
you had videos of people driving a car on the highway, 01:02:33.300 |
driving a car on a street with a safety driver, 01:02:37.060 |
but it's really taken a long time to go from there to, 01:02:41.220 |
And even then Waymo is only in SF and a few other cities. 01:02:44.540 |
And I think like it takes a long time for these things 01:02:55.940 |
That these models are really good at doing these demos 01:03:01.240 |
If they only work 99% of the time, like that sounds good, 01:03:08.080 |
Like imagine if like one out of every 100 dishes, 01:03:12.720 |
Like you would not want that robot in your house 01:03:15.080 |
or you certainly wouldn't want that in your factory 01:03:21.480 |
So I think for these things to really be useful, 01:03:24.080 |
they're gonna have to hit a very, very high level 01:03:32.360 |
for these models to move from like the 95% reliability 01:03:41.640 |
of how good the unit economics of these things will be. 01:03:45.320 |
These robots are gonna be very expensive to build. 01:03:54.600 |
it kind of sets an upper cap about how much you can charge. 01:03:57.440 |
And so, it seems like it's not that great a business. 01:04:09.520 |
which is like, it needs to be like very precise, 01:04:25.200 |
a lot of those traditional manufacturing robots 01:04:50.440 |
like sometimes you just have a servo that fails 01:04:53.520 |
and it takes a bunch of time to like fix that. 01:04:55.800 |
Is that holding back things or is the software still? 01:05:02.860 |
And I think a lot of the humanoid robot companies 01:05:05.060 |
now are really trying to build amazing hardware. 01:05:10.660 |
- You know, you build your first robot and it works, 01:05:13.860 |
Then you build 10 of them, five of them work, 01:05:16.300 |
three of them work half the time, two of them don't work, 01:05:18.460 |
and you built them all the same and you don't know why. 01:05:22.540 |
has like this level of detail and differences 01:05:34.200 |
Like imagine if every binary that you shipped to a customer, 01:05:36.880 |
each of those four loops was a little bit differently, 01:05:41.940 |
and sort of maintain quality of these things. 01:05:52.020 |
Where again, like you'll buy a batch of a hundred motors 01:05:57.340 |
a little bit differently to the same input command. 01:06:06.380 |
- We can't get the tolerance of motors down to- 01:06:14.700 |
- One of my horror stories was that at Cobalt, 01:06:20.900 |
that had a USB connection to the computer inside, 01:06:29.720 |
the user can just unplug it and plug it back in. 01:06:39.480 |
Again, because they assume someone will just unplug it 01:06:44.240 |
- I heard this too and I didn't listen to it. 01:06:48.480 |
a bunch of these thermal cameras started failing 01:06:56.520 |
Did the hardware design change around this node?" 01:07:00.920 |
looking at kernel logs of what's happening with this thing. 01:07:05.680 |
And finally, the procurement person was like, 01:07:09.120 |
I found this new vendor for USB cables last summer." 01:07:13.200 |
You switched which vendor we're buying USB cables from?" 01:07:16.080 |
And I'm like, "Yeah, it's the same exact cable. 01:07:32.680 |
and we'd need to reboot a big part of the system. 01:07:35.440 |
And it was all just 'cause the same exact spec, 01:07:38.080 |
these two different USB cables, like slightly different. 01:07:46.640 |
where they talked about buying tens of thousands of GPUs 01:08:02.560 |
Just the real world has this level of detail. 01:08:14.000 |
of complaints about hardware and supply chain. 01:08:17.040 |
And we know each other and we joke occasionally 01:08:25.920 |
The time of the real world is unlimited, right? 01:08:43.480 |
And yeah, I mean, this is like the whole thesis. 01:08:52.920 |
Like you're just kind of skeptical about self-driving 01:08:56.520 |
So I wanna like double click on this a little bit 01:08:59.200 |
because I mean, I think that that shouldn't be taken away. 01:09:03.760 |
Read from Waymo is pretty public with like their stats. 01:09:14.320 |
At some point they will recoup their investment, right? 01:09:25.960 |
like I don't know how much an Uber driver takes home a year, 01:09:30.400 |
that a Waymo is gonna be making in that same year. 01:09:51.720 |
like the cost of the car, the depreciation of the car. 01:10:00.800 |
- Well, they need to pre-assess the run Waymo 01:10:17.360 |
- I'm very excited to see a lot more LLM agents 01:10:23.000 |
I think there'll be the biggest limiting thing 01:10:28.880 |
And like, how do you trust the output of an agent 01:10:34.400 |
And if you can't find some way to trust that agent's work, 01:10:39.320 |
So I think that's gonna be a really important thing 01:10:43.040 |
but doing the work in a trustable, auditable way