
The new Claude 3.5 Sonnet, Computer Use, and Building SOTA Agents — with Erik Schluntz, Anthropic


Chapters

0:00 Introductions
3:39 What is SWE-Bench?
12:22 SWE-Bench vs HumanEval vs others
15:21 SWE-Agent architecture and runtime
21:18 Do you need code indexing?
24:50 Giving the agent tools
27:47 Sandboxing for coding agents
29:16 Why not write tests?
30:31 Redesigning engineering tools for LLMs
35:53 Multi-agent systems
37:52 Why XML so good?
42:57 Thoughts on agent frameworks
45:12 How many turns can an agent do?
47:12 Using multiple model types
51:40 Computer use and agent use cases
59:04 State of AI robotics
64:24 Robotics in manufacturing
65:01 Hardware challenges in robotics
69:21 Is self-driving a good business?


00:00:00.000 | [MUSIC PLAYING]
00:00:03.320 | Hey, everyone.
00:00:04.360 | Welcome to the Latent Space Podcast.
00:00:06.240 | This is Alessio, partner and CTO at Decibel Partners.
00:00:09.280 | And today, we're in the new studio
00:00:11.760 | with my usual co-host, Sean from Smol AI.
00:00:14.680 | Hey, and today, we are very blessed
00:00:16.920 | to have Erik Schluntz from Anthropic with us.
00:00:18.760 | Welcome.
00:00:19.480 | Hi, thanks very much.
00:00:20.600 | I'm Erik Schluntz.
00:00:21.480 | I'm a member of technical staff at Anthropic,
00:00:23.860 | working on tool use, computer use, and SWE-Bench.
00:00:27.760 | Yeah, well, how did you get into just the whole AI journey?
00:00:32.760 | I think you spent some time at SpaceX as well?
00:00:35.760 | Yeah.
00:00:36.260 | And robotics?
00:00:37.120 | Yeah, there's a lot of overlap between the robotics people
00:00:39.840 | and the AI people.
00:00:40.560 | And maybe there's some overlap of interest
00:00:43.320 | between language models and robots right now.
00:00:46.360 | Maybe just a little bit of background
00:00:47.960 | on how you got to where you are.
00:00:49.720 | Yeah, sure.
00:00:50.340 | I was at SpaceX a long time ago.
00:00:51.840 | But before joining Anthropic, I was the CTO and co-founder
00:00:55.520 | of Cobalt Robotics.
00:00:56.920 | We built security and inspection robots.
00:00:59.600 | These are five-foot-tall robots that
00:01:01.400 | would patrol through an office building or a warehouse,
00:01:04.200 | looking for anything out of the ordinary.
00:01:06.120 | Very friendly, no tasers or anything.
00:01:07.720 | We would just call a remote operator if we saw anything.
00:01:11.800 | So we have about 100 of those out in the world,
00:01:14.400 | and had a team of about 100.
00:01:15.840 | We actually got acquired about six months ago.
00:01:17.800 | But I had left Cobalt about a year ago now,
00:01:20.560 | because I was starting to get a lot more excited about AI.
00:01:23.160 | I had been writing a lot of my code with things like Copilot.
00:01:26.160 | And I was like, wow, this is actually really cool.
00:01:28.200 | If you had told me 10 years ago that AI would
00:01:30.800 | be writing a lot of my code, I would say, hey,
00:01:32.800 | I think that's AGI.
00:01:34.240 | And so I realized that we had passed this level.
00:01:37.560 | We're like, wow, this is actually really useful
00:01:39.560 | for engineering work.
00:01:40.520 | That got me a lot more excited about AI
00:01:43.160 | and learning about large language models.
00:01:45.120 | So I ended up taking a sabbatical
00:01:47.080 | and then doing a lot of reading and research myself,
00:01:50.240 | and decided, hey, I want to go be at the core of this
00:01:52.600 | and joined Anthropic.
00:01:53.840 | And why Anthropic?
00:01:55.600 | - Did you consider other labs?
00:01:56.960 | Did you consider maybe some of the robotics companies?
00:02:00.520 | - So I think at the time,
00:02:01.400 | I was a little burnt out of robotics.
00:02:02.880 | And so also for the rest of this,
00:02:04.080 | any sort of negative things I say about robotics
00:02:06.600 | or hardware is coming from a place of burnout.
00:02:09.680 | I reserve my right to change my opinion in a few years.
00:02:12.680 | Yeah, I looked around, but ultimately I knew a lot of people
00:02:15.440 | that I really trusted and I thought were incredibly smart
00:02:18.440 | at Anthropic, and I think that was the big deciding factor
00:02:20.720 | to come there.
00:02:21.560 | Like, hey, this team's amazing.
00:02:23.320 | They're not just brilliant,
00:02:24.400 | but sort of like the most nice and kind people that I know.
00:02:26.760 | And so I just felt I could be a really good culture fit.
00:02:28.840 | And ultimately, like, I do care a lot about AI safety
00:02:31.560 | and making sure that, you know,
00:02:32.880 | I don't want to build something that's used for bad purposes.
00:02:35.640 | And I felt like the best chance of that
00:02:37.960 | was joining Anthropic.
00:02:39.680 | - And from the outside,
00:02:40.520 | these labs kind of look like huge organizations
00:02:43.480 | that have these, like, obscure ways to organize.
00:02:45.680 | How did you get, you joined Anthropic,
00:02:47.840 | did you already know you were going to work
00:02:49.280 | on like SWE-Bench and some of the stuff you publish,
00:02:51.480 | or you kind of join and then you figure out where you land?
00:02:54.560 | I think people are always curious to learn more.
00:02:57.120 | - Yeah, I've been very happy that Anthropic
00:02:59.080 | is very bottoms up and sort of very sort of receptive
00:03:01.800 | to whatever your interests are.
00:03:03.480 | And so I joined sort of being very transparent of like,
00:03:06.320 | hey, I'm most excited about code generation and AI
00:03:09.040 | that can actually go out and sort of touch the world
00:03:11.600 | or sort of help people build things.
00:03:13.360 | And, you know, those weren't my initial projects.
00:03:15.760 | I also came in and said,
00:03:16.600 | hey, I want to do the most valuable possible thing
00:03:18.520 | for this company and help Anthropic succeed.
00:03:20.960 | And, you know, like, let me find the balance of those.
00:03:23.120 | So I was working on lots of things at the beginning,
00:03:25.640 | you know, function calling, tool use,
00:03:28.000 | and then sort of as it became more and more relevant,
00:03:31.120 | I was like, oh, hey, yeah, like let's,
00:03:32.280 | it's time to go work on coding agents
00:03:35.000 | and sort of started looking at SWE-Bench
00:03:36.800 | as sort of a really good benchmark for that.
00:03:39.320 | - So let's get right into SweetBench.
00:03:41.200 | That's one of the many claims to fame.
00:03:43.480 | I feel like there's just been a series of releases
00:03:46.280 | related to Claude 3.5 Sonnet.
00:03:48.080 | Around about two, three months ago, 3.5 Sonnet came out
00:03:51.880 | and it was a step ahead in terms of a lot of,
00:03:55.200 | people immediately fell in love with it for coding.
00:03:57.240 | And then last month,
00:03:59.280 | you released a new updated version of Claude Sonnet.
00:04:01.840 | We're not going to talk about the training for that
00:04:03.280 | 'cause that's still confidential,
00:04:04.760 | but I think Anthropic's done a really good job
00:04:07.080 | like applying the model to different things.
00:04:09.520 | So you took the lead on SWE-Bench,
00:04:11.200 | but then also we're going to talk a little bit
00:04:12.520 | about computer use later on.
00:04:14.400 | So yeah, maybe just give us a context
00:04:15.960 | about like why you looked at SWE-Bench Verified
00:04:18.840 | and you actually like came up with a whole system
00:04:21.560 | for building agents that, you know,
00:04:24.200 | would maximally use the model well.
00:04:26.760 | - Yeah, so I'm on a sub team called product research.
00:04:29.920 | And basically the idea of product research
00:04:32.000 | is to really understand like what end customers care about
00:04:36.040 | and want in the models
00:04:37.640 | and then work to try to make that happen.
00:04:40.520 | So, you know, we're not focused
00:04:42.240 | on sort of these more abstract general benchmarks
00:04:45.360 | like math problems or MMLU,
00:04:47.960 | but we really care about like finding the things
00:04:49.920 | that are really valuable
00:04:50.800 | and making sure the models are great at those.
00:04:52.720 | And so because I had been interested in coding agents,
00:04:55.640 | sort of, I knew that this would be a really valuable thing.
00:04:57.840 | And I knew there were a lot of startups
00:04:59.280 | and our customers trying to build coding agents
00:05:02.400 | with our models.
00:05:03.240 | And so I said, "Hey, this is going to be
00:05:04.080 | "a really good benchmark to be able to measure that
00:05:06.440 | "and do well on it."
00:05:07.280 | And I, you know, wasn't the first person
00:05:09.240 | at Anthropic to find SWE-Bench.
00:05:11.000 | And then, you know, there are lots of people
00:05:12.040 | that already knew about it
00:05:13.840 | and had done some internal efforts on it.
00:05:16.480 | It fell to me to sort of both implement the benchmark,
00:05:19.520 | which is very tricky,
00:05:20.600 | and then also to sort of make sure we had an agent
00:05:23.960 | and basically like a reference agent,
00:05:26.120 | maybe I'd call it, that could do very well on it.
00:05:28.760 | Ultimately, we want to provide
00:05:31.000 | how we implemented that reference agent
00:05:33.080 | so that people can build their own agents
00:05:35.160 | on top of our system
00:05:36.560 | and get sort of the most out of it as possible.
00:05:38.920 | So with this blog post we released on SWE-Bench,
00:05:41.600 | we released the exact tools and the prompt
00:05:44.120 | that we gave the model to be able to do well.
00:05:46.120 | - For people who don't know,
00:05:47.400 | who maybe haven't dived into SWE-Bench,
00:05:49.320 | I think the general perception is they're like
00:05:51.040 | tasks that a software engineer could do.
00:05:53.440 | I feel like that's an inaccurate description
00:05:55.800 | because it is basically,
00:05:57.720 | one, it's a subset of like 12 repos.
00:05:59.920 | It's everything they could find
00:06:01.200 | that every issue with like a matching commit
00:06:04.280 | that could be tested.
00:06:05.440 | So that's not every commit.
00:06:07.240 | And then SweetBench verified
00:06:08.480 | is further manually filtered by OpenAI.
00:06:11.280 | Is that an accurate description
00:06:12.440 | and anything you'd change about that?
00:06:13.640 | - Yes, SWE-Bench is,
00:06:15.520 | it certainly is a subset of all tasks.
00:06:18.800 | First of all, it's only Python repos.
00:06:21.000 | So already fairly limited there.
00:06:22.640 | And it's just 12 of these popular open source repos.
00:06:25.760 | And yes, it's only ones where there were
00:06:28.280 | tests that passed at the beginning
00:06:30.080 | and also new tests that were introduced
00:06:33.200 | that test the new feature that's added.
00:06:36.360 | So it is, I think,
00:06:37.920 | a very limited subset of real engineering tasks,
00:06:40.800 | but I think it's also very valuable
00:06:42.520 | because it's, even though it's a subset,
00:06:44.600 | it is true engineering tasks.
00:06:46.520 | And I think a lot of other benchmarks
00:06:48.520 | are really kind of these much more artificial setups
00:06:51.440 | of even if they're related to coding,
00:06:53.360 | they're more like coding interview style questions
00:06:55.920 | or puzzles that I think are very different
00:06:58.160 | from like day-to-day what you end up doing.
00:07:00.440 | Like, I don't know how frequently
00:07:02.000 | you all like get to use recursion in your day-to-day job,
00:07:05.160 | but whenever I do, it's like a treat.
00:07:07.640 | And I think it is, it's kind of,
00:07:08.960 | it's almost comical and a lot of people
00:07:10.400 | joke about this in the industry.
00:07:11.440 | It's like how different interview questions are.
00:07:13.040 | - Dynamic programming.
00:07:14.040 | - Yeah, exactly. - Like new code.
00:07:15.480 | - From the day-to-day job.
00:07:16.320 | But I think the, one of the most interesting things
00:07:18.720 | about SWE-Bench is that all these other benchmarks
00:07:21.720 | are usually just isolated puzzles
00:07:23.560 | and you're starting from scratch.
00:07:25.120 | Whereas SWE-Bench, you're starting
00:07:26.680 | in the context of an entire repository.
00:07:29.160 | And so it adds this entirely new dimension
00:07:31.640 | to the problem of finding the relevant files.
00:07:34.280 | And this is a huge part of real engineering is,
00:07:37.000 | it's actually again, pretty rare that you're starting
00:07:39.000 | something totally greenfield.
00:07:40.520 | You need to go and figure out where in a code base
00:07:43.040 | you're going to make a change and understand
00:07:45.120 | how your work is going to interact
00:07:46.800 | with the rest of the systems.
00:07:48.160 | And I think SWE-Bench does a really good job
00:07:49.920 | of like presenting that problem.
00:07:51.800 | - Why do we still use HumanEval?
00:07:54.240 | It's like 92%, I think.
00:07:56.200 | I don't even know if you can actually get to 100%
00:07:58.160 | because some of the data is not actually solvable.
00:08:01.120 | Do you see benchmarks like that,
00:08:03.040 | they should just get sunsetted?
00:08:04.480 | Because when you look at like the model releases,
00:08:06.440 | it's like, oh, it's like 92% instead of like 89, 90%
00:08:10.600 | on HumanEval versus, you know,
00:08:12.360 | SWE-Bench Verified, you have 49%, right?
00:08:15.240 | Which is like, before 45% was state of the art,
00:08:18.360 | but maybe like six months ago, it was like 30%,
00:08:20.680 | something like that.
00:08:21.640 | So is that a benchmark that you think
00:08:23.640 | is going to replace HumanEval?
00:08:24.960 | Or do you think they're just going to run in parallel?
00:08:27.440 | - I think there's still need
00:08:28.760 | for sort of many different varied evals.
00:08:31.440 | Like sometimes you do really care
00:08:32.720 | about just sort of greenfield code generation.
00:08:35.440 | And so I don't think that everything needs to go
00:08:37.800 | to sort of an agentic setup.
00:08:39.480 | - It would be very expensive to implement.
00:08:41.160 | - And the other thing I was going to say
00:08:42.360 | is that SWE-Bench is certainly hard to implement
00:08:46.120 | and expensive to run because each task,
00:08:49.000 | you have to parse a lot of the repo
00:08:51.600 | to understand where to put your code.
00:08:53.160 | And a lot of times you take many tries
00:08:55.520 | of writing code, running it, editing it.
00:08:57.720 | It can use a lot of tokens
00:08:59.600 | compared to something like HumanEval.
00:09:01.000 | So I think there's definitely a space
00:09:02.680 | for these more traditional coding evals
00:09:05.200 | that are sort of easy to implement,
00:09:06.800 | quick to run and do get you some signal.
00:09:09.200 | And maybe hopefully there's just sort of harder versions
00:09:12.000 | of HumanEval that get created.
00:09:14.000 | - How do we get SWE-Bench Verified to 92%?
00:09:16.960 | Do you think that's something where it's like
00:09:18.920 | line of sight to it?
00:09:20.000 | Or it's like, you know,
00:09:20.840 | we need a whole lot of things to go right.
00:09:23.040 | - Yeah, yeah.
00:09:23.880 | And actually maybe I'll start with SWE-Bench
00:09:26.200 | versus SWE-Bench Verified,
00:09:27.320 | which is I think something I missed earlier.
00:09:29.400 | So SWE-Bench is, as we described,
00:09:30.720 | this big set of tasks that were scraped.
00:09:33.240 | - Like 12,000 or something.
00:09:35.000 | - Yeah, I think it's 2,000 in the final set,
00:09:38.400 | but a lot of those, even though a human did them,
00:09:41.000 | they're actually impossible
00:09:42.400 | given the information that comes with the task.
00:09:45.240 | The most classic example of this
00:09:46.880 | is the test looks for a very specific error string,
00:09:50.520 | you know, like assert message equals error,
00:09:53.960 | something, something, something.
00:09:55.280 | And unless you know that's exactly what you're looking for,
00:09:58.160 | there's no way the model is going to write
00:09:59.800 | that exact same error message
00:10:01.120 | and so the tests are going to fail.
00:10:02.640 | So SWE-Bench Verified was actually made
00:10:05.400 | in partnership with OpenAI
00:10:07.160 | and they hired humans to go review all these tasks
00:10:10.680 | and pick out a subset to try to remove
00:10:13.440 | any obstacle like this
00:10:15.160 | that would make the tasks impossible.
00:10:16.960 | So in theory, all of these tasks
00:10:19.240 | should be fully doable by the model.
00:10:22.080 | And they also had humans grade
00:10:24.200 | how difficult they thought the problems would be
00:10:26.440 | between like 15, less than 15 minutes,
00:10:29.200 | I think 15 minutes to an hour,
00:10:30.760 | an hour to four hours and greater than four hours.
00:10:33.400 | So that's kind of this interesting sort of
00:10:35.920 | how big the problem is as well.
00:10:37.840 | To get SWE-Bench Verified to 90%,
00:10:40.560 | actually, maybe I'll also start off
00:10:41.880 | with some of the remaining failures that I see,
00:10:44.000 | like when running our model on SWE-Bench.
00:10:46.120 | I'd say the biggest cases are the model
00:10:47.920 | sort of operates at the wrong level of abstraction.
00:10:51.040 | And what I mean by that is
00:10:52.680 | the model puts in maybe a smaller band-aid
00:10:55.080 | when really the task is asking for a bigger refactor.
00:10:58.200 | And some of those, you know, is the model's fault,
00:11:00.400 | but a lot of times if you're just seeing the,
00:11:03.360 | if you're just sort of seeing the GitHub issue,
00:11:05.600 | it's not exactly clear like which way you should do.
00:11:07.560 | So even though these tasks are possible,
00:11:09.440 | there's still some ambiguity
00:11:11.360 | in how the tasks are described.
00:11:13.040 | That being said, I think in general,
00:11:14.480 | like language models frequently will produce
00:11:16.440 | like a smaller diff when possible
00:11:18.440 | rather than trying to do a big refactor.
00:11:20.440 | I think another area is sort of,
00:11:21.640 | at least the agent we created
00:11:22.920 | didn't have any multimodal abilities,
00:11:25.560 | even though our models are very good at vision.
00:11:27.360 | So I think that's just a missed opportunity.
00:11:29.560 | And if I read through some of the traces,
00:11:31.120 | there's some funny things where,
00:11:32.560 | especially the tasks on matplotlib,
00:11:34.520 | which is a graphing library,
00:11:36.240 | the test script will like save an image
00:11:38.080 | and the model will just say, okay, it looks great.
00:11:40.600 | You know, without looking at it.
00:11:42.520 | So there's certainly extra juice to squeeze there
00:11:44.160 | of just making sure the model really understands
00:11:46.640 | all the sides of the input that it's given,
00:11:48.200 | including multimodal.
00:11:49.520 | But yeah, I think like getting to 92%.
00:11:52.480 | So this is something that I have not looked at,
00:11:54.360 | but I'm very curious about.
00:11:55.880 | I want someone to look at like,
00:11:57.400 | what is the union of all of the different tasks
00:12:00.040 | that have been solved by at least one attempt
00:12:02.320 | at SWE-Bench Verified?
00:12:03.600 | There's a ton of submissions to the benchmark.
00:12:05.480 | And so I'd be really curious to see
00:12:07.440 | how many of those 500 tasks, at least someone has solved.
00:12:11.160 | And I think, you know, there's probably a bunch
00:12:12.840 | that none of the attempts have ever solved.
00:12:14.840 | And I think it'd be interesting to look at those and say,
00:12:16.560 | hey, is there some problem with these?
00:12:18.080 | Like, are these impossible?
00:12:19.360 | Or are they just really hard and only a human could do them?
00:12:22.200 | - Yeah, like specifically, is there a category of problems
00:12:24.040 | that are still unreachable by any LLM agent?
00:12:27.480 | - Yeah, yeah, and I think there definitely are.
00:12:28.680 | The question is, are those fairly inaccessible
00:12:32.480 | or are they just impossible because of the descriptions?
00:12:34.920 | But I think certainly some of the tasks,
00:12:36.840 | especially the ones that the human graders reviewed
00:12:40.160 | as like taking longer than four hours are extremely difficult.
00:12:43.920 | I think we got a few of them right,
00:12:46.760 | but not very many at all in the benchmark.
00:12:49.520 | - And did those take less than four hours?
00:12:51.600 | - They certainly did less than, yeah, than four hours.
00:12:53.960 | - Is there a correlation of length of time
00:12:56.360 | with like human estimated time, you know what I mean?
00:12:58.520 | Or do we have sort of more of Moravec's paradox type situations
00:13:01.800 | where it's something super easy for a model,
00:13:05.400 | but hard for a human?
00:13:06.400 | - I actually haven't, like, done the stats on that,
00:13:09.280 | but I think that'd be really interesting to see
00:13:10.680 | of like how many tokens does it take
00:13:12.800 | and how is that correlated with difficulty?
00:13:15.200 | What is the likelihood of success with difficulty?
00:13:18.120 | I think actually a really interesting thing that I saw,
00:13:21.360 | one of my coworkers who was also working on this
00:13:23.960 | named Simon, he was focusing just specifically
00:13:27.080 | on the very hard problems,
00:13:28.600 | the ones that are said to take longer than four hours.
00:13:31.440 | And he ended up sort of creating
00:13:33.320 | a much more detailed prompt than I used.
00:13:35.240 | And he got a higher score
00:13:37.120 | on the most difficult subset of problems,
00:13:39.160 | but a lower score overall in the whole benchmark.
00:13:41.960 | And the prompt that I made,
00:13:43.240 | which is sort of much more simple and bare bones,
00:13:45.800 | got a higher score on the overall benchmark,
00:13:47.640 | but lower score on the really hard problems.
00:13:49.960 | And I think some of that is the really detailed prompt
00:13:52.680 | made the model sort of overcomplicate
00:13:54.920 | a lot of the easy problems.
00:13:56.640 | 'Cause honestly, a lot of the SWE-Bench problems,
00:13:58.400 | they really do just ask for a bandaid
00:14:00.200 | and where it's like, hey, this crashes if this is none,
00:14:02.600 | and really all you need to do is put a check if none.
00:14:05.040 | And so sometimes like trying to make the model
00:14:07.360 | think really deeply, like it'll think in circles
00:14:10.200 | and overcomplicate something,
00:14:11.140 | which certainly human engineers are capable of as well.
00:14:14.120 | But I think there's some interesting thing
00:14:15.240 | of like the best prompt for hard problems
00:14:17.040 | might not be the best prompt for easy problems.
00:14:19.080 | - How do we fix that?
00:14:20.080 | Are you supposed to fix it at the model level?
00:14:22.240 | Like how do I know what prompt I'm supposed to use?
00:14:25.600 | - Yeah, and I'll say this was a very small effect size.
00:14:27.600 | And so I think this is not,
00:14:29.280 | I think this isn't like worth obsessing over,
00:14:31.780 | but I would say that as people are building systems
00:14:35.200 | around agents, I think the more you can separate out
00:14:39.000 | the different kinds of work the agent needs to do,
00:14:41.840 | the better you can tailor a prompt for that task.
00:14:44.560 | And I think that also creates a lot of like,
00:14:46.440 | for instance, if you were trying to make an agent
00:14:48.040 | that could both, you know, solve hard programming tasks,
00:14:52.200 | and it could just like, you know, write quick test files
00:14:55.880 | for something that someone else had already made,
00:14:57.680 | the best way to do those two tasks
00:14:59.200 | might be very different prompts.
00:15:00.660 | I see a lot of people build systems
00:15:02.400 | where they first sort of have a classification
00:15:04.600 | and then route the problem to two different prompts.
00:15:07.320 | And that's sort of a very effective thing
00:15:09.000 | because one, it makes the two different prompts
00:15:12.240 | much simpler and smaller.
00:15:13.760 | And it means you can have someone work on one of the prompts
00:15:16.460 | without any risk of affecting the other tasks.
00:15:18.740 | So it creates like a nice separation of concerns.
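
A rough sketch of that classify-then-route pattern, assuming a generic `call_claude(system, user)` helper; the category names and prompt text are placeholders, not anyone's production prompts:

```python
# Hypothetical classify-then-route: a first call labels the task, then the
# task is handed to a simpler, purpose-built prompt. `call_claude` stands in
# for whatever client call you already use.
HARD_TASK_PROMPT = "You are tackling a large refactor. Plan carefully before editing files."
EASY_TASK_PROMPT = "Make the smallest change that resolves the issue described."

def classify(task: str, call_claude) -> str:
    """First pass: ask the model for a one-word label and nothing else."""
    label = call_claude(system="Reply with exactly one word: HARD or EASY.", user=task)
    return label.strip().upper()

def solve(task: str, call_claude) -> str:
    """Route the task to the prompt matching its difficulty class."""
    prompt = HARD_TASK_PROMPT if classify(task, call_claude) == "HARD" else EASY_TASK_PROMPT
    return call_claude(system=prompt, user=task)
```

Each prompt stays small, and either one can be iterated on without touching the other.
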
00:15:20.760 | - Yeah, and the other model behavior thing you mentioned,
00:15:22.960 | they prefer to generate like shorter diffs.
00:15:26.320 | Why is that?
00:15:27.160 | Like, is there a way?
00:15:28.000 | You know, I think that's maybe like the lazy model question
00:15:32.560 | that people have is like,
00:15:33.400 | why are you not just generating the whole code
00:15:35.480 | instead of telling me to implement it?
00:15:36.320 | - Are you saving tokens?
00:15:37.600 | - Yeah, exactly.
00:15:38.440 | It's like conspiracy theory.
00:15:39.760 | - Yeah, yeah, yeah.
00:15:40.600 | - So there's two different things there.
00:15:41.880 | One is like the, I'd say maybe like doing
00:15:44.200 | the easier solution rather than the hard solution.
00:15:46.620 | And I'd say the second one,
00:15:47.580 | I think what you're talking about is like the lazy model
00:15:49.340 | is like when the model says like dot, dot, dot,
00:15:51.500 | code remains the same.
00:15:52.340 | - Code goes here.
00:15:53.160 | I'm like, thanks, dude.
00:15:54.000 | - I think honestly, like that just comes as like,
00:15:57.300 | people on the internet will do stuff like that.
00:15:59.260 | And like, dude, if you were talking to a friend
00:16:01.460 | and you asked them like to give you some example code,
00:16:04.020 | they would definitely do that.
00:16:04.860 | They're not going to reroll the whole thing.
00:16:06.900 | And so I think that's just a matter of like, you know,
00:16:09.380 | sometimes you actually do just want like the relevant changes
00:16:13.260 | and so I think it's,
00:16:14.380 | this is something where a lot of times like, you know,
00:16:16.180 | the models aren't good at mind reading
00:16:17.660 | of like which one you want.
00:16:19.240 | So I think that like the more explicit you can be
00:16:22.220 | in prompting to say, hey, you know,
00:16:23.540 | give me the entire thing, no elisions,
00:16:26.820 | versus just give me the relevant changes.
00:16:28.260 | And that's something, you know,
00:16:29.100 | we want to make the models always better
00:16:30.720 | at following those kinds of instructions.
00:16:33.020 | - I'll drop a couple of references here.
00:16:34.540 | We're recording this like a day after
00:16:36.940 | Lex Fridman just dropped his five-hour pod with Dario
00:16:39.380 | and Amanda and the rest of the crew.
00:16:41.580 | And Dario actually made this interesting observation
00:16:44.260 | that like, we actually don't want,
00:16:45.700 | we complain about models being too chatty in text
00:16:48.780 | and then not chatty enough in code.
00:16:50.980 | And so like getting that right is kind of a awkward bar
00:16:54.500 | because, you know, you don't want it to yap
00:16:56.740 | in its responses,
00:16:57.700 | but then you also want it to be complete in code.
00:17:00.420 | And then sometimes it's not complete.
00:17:01.420 | Sometimes you just want it to diff,
00:17:02.540 | which is something that Anthropic has also released
00:17:05.620 | with, you know, like the fast edit stuff that you guys did.
00:17:08.740 | And then the other thing I wanted to also double back on
00:17:11.060 | is the prompting stuff.
00:17:12.540 | You said it was a small effect,
00:17:13.820 | but it was a noticeable effect
00:17:15.020 | in terms of like picking a prompt.
00:17:17.020 | I think we'll go into suite agents in a little bit,
00:17:19.580 | but I kind of reject the fact
00:17:20.940 | that you need to choose one prompt
00:17:23.260 | and like have your whole performance
00:17:25.540 | be predicated on that one prompt.
00:17:28.020 | I think something that Anthropic has done really well
00:17:30.400 | is meta-prompting, prompting for a prompt.
00:17:32.980 | And so why can't you just develop a meta-prompt
00:17:34.700 | for all the other prompts?
00:17:35.780 | And, you know, if it's a simple task,
00:17:37.140 | make a simple prompt.
00:17:37.980 | If it's a hard task, make a hard prompt.
00:17:39.340 | Obviously I'm probably hand-waving a little bit,
00:17:41.020 | but I will definitely ask people
00:17:42.900 | to try the Anthropic Workbench meta-prompting system
00:17:46.580 | if they haven't tried it yet.
00:17:47.700 | I went to the build day recently at Anthropic HQ
00:17:50.780 | and it's the closest I've felt to an AGI,
00:17:53.860 | like learning how to operate itself.
00:17:55.540 | That, yeah, it's really magical.
00:17:57.620 | - Yeah, no, Claude is great at writing prompts for Claude.
00:17:59.900 | - Right, so meta-prompting.
00:18:00.900 | - Yeah, yeah.
00:18:01.860 | The way I think about this is that humans,
00:18:04.620 | even like very smart humans still use sort of checklists
00:18:07.900 | and use sort of scaffolding for themselves.
00:18:09.860 | Surgeons will still have checklists
00:18:11.580 | even though they're incredible experts.
00:18:13.340 | And certainly, you know, a very senior engineer
00:18:15.300 | needs less structure than a junior engineer,
00:18:18.300 | but there still is some of that structure
00:18:19.820 | that you want to keep.
00:18:20.660 | And so I always try to anthropomorphize the models
00:18:22.940 | and try to think about for a human,
00:18:24.380 | sort of what is the equivalent?
00:18:25.500 | And that's sort of, you know,
00:18:26.940 | how I think about these things is
00:18:28.500 | how much instruction would you give a human
00:18:30.700 | with the same task?
00:18:31.860 | And would you need to give them a lot of instruction
00:18:34.380 | or a little bit of instruction?
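
A minimal sketch of the meta-prompting idea mentioned above, assuming the same kind of generic `call_claude(system, user)` helper; the instructions here are illustrative, not the Anthropic Workbench prompt generator:

```python
# Hedged meta-prompting sketch: one call drafts a task-specific prompt, a
# second call solves the task with that prompt. Wording is hypothetical.
def meta_prompt(task: str, call_claude) -> str:
    """Ask the model to write the prompt it should be given for this task."""
    return call_claude(
        system=("You write prompts for another instance of yourself. Given a task, "
                "produce a prompt with the right amount of structure: a short "
                "checklist for hard tasks, almost none for easy ones."),
        user=task,
    )

def solve_with_meta_prompt(task: str, call_claude) -> str:
    """Generate a tailored prompt, then use it to do the actual work."""
    return call_claude(system=meta_prompt(task, call_claude), user=task)
```
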
00:18:35.860 | - Let's talk about the agent architecture.
00:18:37.700 | Maybe, so first, runtime.
00:18:39.740 | You let it run until it thinks it's done
00:18:42.340 | or it reaches 200K context window.
00:18:45.020 | How did you come up?
00:18:45.860 | - What's up with that?
00:18:46.700 | - Yeah.
00:18:47.540 | - Yeah, I mean this,
00:18:48.540 | so I'd say that a lot of previous agent work
00:18:52.220 | built sort of these very hard-coded and rigid workflows
00:18:56.300 | where the model is sort of pushed through
00:18:58.660 | certain flows of steps.
00:19:00.300 | And I think to some extent, you know,
00:19:01.980 | that's needed with smaller models
00:19:04.020 | and models that are less smart.
00:19:05.940 | But one of the things that we really wanted to explore
00:19:08.300 | was like, let's really give Claude the reins here
00:19:11.300 | and not force Claude to do anything,
00:19:13.820 | but let Claude decide, you know,
00:19:15.980 | how it should approach the problem,
00:19:17.340 | what steps it should do.
00:19:18.740 | And so really, you know,
00:19:20.020 | what we did is like the most extreme version of this
00:19:22.460 | is just give it some tools that it can call
00:19:24.940 | and it's able to keep calling the tools, keep thinking,
00:19:28.020 | and then yeah, keep doing that until it thinks it's done.
00:19:31.060 | And that's sort of the most minimal agent framework
00:19:35.020 | that we came up with.
00:19:36.460 | And I think that works very well.
00:19:37.980 | I think especially the new Sonnet 3.5
00:19:41.100 | is very, very good at self-correction.
00:19:43.660 | It has a lot of like grit.
00:19:44.940 | Claude will try things that fail
00:19:47.500 | and then try, you know, come back
00:19:49.420 | and sort of try different approaches.
00:19:51.140 | And I think that's something that you didn't see
00:19:53.580 | in a lot of previous models.
00:19:55.140 | Some of the existing agent frameworks that I looked at,
00:19:57.260 | they had whole systems built to try to detect loops
00:20:00.700 | and see, oh, is the model doing the same thing,
00:20:02.980 | you know, more than three times,
00:20:04.220 | and we have to pull it out.
00:20:05.660 | And I think like the smarter the models are,
00:20:07.220 | the less you need that kind of extra scaffolding.
00:20:09.140 | So yeah, just giving the model tools
00:20:11.420 | and letting it keep sampling and calling tools
00:20:13.740 | until it thinks it's done was the most minimal framework
00:20:16.580 | that we could think of.
00:20:17.420 | And so that's what we did.
00:20:18.260 | - So you're not pruning like bad paths from the context.
00:20:21.460 | If it tries to do something, it fails,
00:20:23.460 | you just burn all these tokens to-
00:20:25.220 | - Yes, and so I would say the downside of this
00:20:27.380 | is that this is sort of a very token expensive way
00:20:29.700 | to do this.
00:20:30.540 | - But still, it's very common to prune bad paths
00:20:32.820 | 'cause models get stuck.
00:20:34.220 | - Yeah, but I'd say that, yeah, 3.5 is not getting stuck
00:20:38.300 | as much as previous models.
00:20:39.420 | And so, yeah, we wanted to at least
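
For concreteness, here is one minimal version of that tools-in-a-loop setup, written against the public Anthropic Messages API. It is a sketch, not the reference agent from the SWE-Bench post: the single bash tool, the turn limit, and the unsandboxed `subprocess` call are all stand-ins.

```python
# Minimal "tools in a loop" sketch: keep sampling and executing tool calls
# until the model stops asking for tools. Not a production agent; the bash
# tool below has no sandboxing and simply truncates long output.
import subprocess
import anthropic

client = anthropic.Anthropic()

TOOLS = [{
    "name": "bash",
    "description": "Run a shell command in the repository and return its output.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string", "description": "The command to run."}},
        "required": ["command"],
    },
}]

def run_agent(task: str, max_turns: int = 50) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=4096,
            tools=TOOLS,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            # The model decided it is done; return its final text.
            return "".join(block.text for block in response.content if block.type == "text")
        # Otherwise execute every requested tool call and feed the results back.
        messages.append({"role": "assistant", "content": response.content})
        results = []
        for block in response.content:
            if block.type == "tool_use":
                out = subprocess.run(block.input["command"], shell=True,
                                     capture_output=True, text=True, timeout=120)
                results.append({"type": "tool_result", "tool_use_id": block.id,
                                "content": (out.stdout + out.stderr)[-10000:]})
        messages.append({"role": "user", "content": results})
    return "Stopped: hit the turn limit."
```
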
00:20:41.060 | just try the most minimal thing.
00:20:42.580 | I know I would say that, you know,
00:20:44.060 | this is definitely an area of future research,
00:20:46.380 | especially if we talk about these problems
00:20:48.580 | that are going to take a human more than four hours.
00:20:51.700 | Those might be things where we're gonna need
00:20:53.300 | to go prune bad paths to let the model
00:20:56.900 | be able to accomplish this task within 200K tokens.
00:20:59.940 | So certainly I think there's like future research
00:21:02.140 | to be done in that area,
00:21:03.300 | but it's not necessary to do well on these benchmarks.
00:21:06.140 | - Another thing I always have questions about
00:21:08.180 | on context window things,
00:21:09.660 | there's a mini cottage industry of code indexers
00:21:12.780 | that have sprung up for large codebases
00:21:15.100 | like the ones in SWE-Bench.
00:21:16.420 | You didn't need them?
00:21:17.700 | - We didn't.
00:21:18.540 | And I think I'd say there's like two reasons for this.
00:21:20.740 | One is like SWE-Bench-specific
00:21:22.420 | and the other is a more general thing.
00:21:25.380 | The more general thing is that I think
00:21:27.420 | Sonnet is very good at what we call agentic search
00:21:30.780 | and what this basically means
00:21:32.220 | is letting the model decide how to search for something.
00:21:35.460 | It gets the results and then it can decide
00:21:37.500 | should it keep searching or is it done?
00:21:38.980 | Does it have everything it needs?
00:21:40.540 | So if you read through a lot of the SWE-Bench traces,
00:21:44.260 | the model is calling tools to view directories,
00:21:47.140 | list out things, view files,
00:21:49.140 | and it will do a few of those
00:21:50.660 | until it feels like it's found the file where the bug is
00:21:54.220 | and then it will start working on that file.
00:21:56.340 | And I think like, again, this is all,
00:21:58.500 | everything we did was about just giving Claude the full reins
00:22:01.500 | so there's no hard-coded system.
00:22:03.340 | There's no search system that you're relying
00:22:05.700 | on getting the correct files into context.
00:22:08.620 | This just totally lets Claude do it.
00:22:10.940 | - Or embedding things into a vector database.
00:22:13.580 | - Exactly.
00:22:14.620 | - Oops.
00:22:15.460 | - No, no, I know.
00:22:16.300 | But again, this is very, very token expensive.
00:22:19.660 | And so certainly, and it also takes many, many turns.
00:22:22.220 | And so certainly if you want to do something
00:22:24.060 | in a single turn, you need to do rag
00:22:25.740 | and just push stuff into the first prompt.
00:22:27.860 | - And just to make it clear, it's using the bash tool,
00:22:30.540 | basically doing ls, looking at files,
00:22:32.740 | and then doing cat to the following context.
00:22:34.940 | - It can do that, but it's file editing tool
00:22:37.660 | also has a command in it called view.
00:22:39.980 | They can view a directory.
00:22:41.260 | It's very similar to ls,
00:22:43.060 | but it just sort of has some nice
00:22:44.780 | sort of quality of life improvements.
00:22:46.260 | Like it'll only do an ls sort of two directories deep
00:22:49.420 | so that the model doesn't get overwhelmed
00:22:51.140 | if it does this on a huge file.
00:22:52.780 | I would say actually we did more engineering of the tools
00:22:55.860 | than the overall prompt.
00:22:57.140 | But the one other thing I want to say
00:22:59.380 | about this agentic search is that
00:23:03.180 | for SWE-Bench specifically,
00:23:03.180 | a lot of the tasks are bug reports,
00:23:06.420 | which means they have a stack trace in them.
00:23:08.380 | And that means right in that first prompt,
00:23:10.260 | there is- - Tells you where to go.
00:23:11.260 | - It tells you where to go.
00:23:12.340 | And so I think this is a very easy case
00:23:14.140 | for the model to find the right files
00:23:15.980 | versus if you're using,
00:23:17.980 | this is a general coding assistant
00:23:19.740 | where there isn't a stack trace
00:23:20.980 | or you're asking it to insert a new feature.
00:23:23.900 | I think there it's much harder to know
00:23:25.700 | which files to look at.
00:23:28.020 | And that might be an area where
00:23:29.460 | you would need to do more of this exhaustive search
00:23:31.620 | where an agentic search would take way too long.
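
As a toy version of the depth-limited directory view mentioned a moment ago (listing only a couple of levels deep so the model isn't flooded with output), the sketch below uses a made-up function name and cutoff; it is not Anthropic's released tool.

```python
# Hypothetical depth-limited "view" helper: list a directory at most two
# levels deep so the listing stays small enough to put in the model's context.
import os

def view_directory(path: str, max_depth: int = 2) -> str:
    """Return an indented listing of `path`, showing at most `max_depth` levels."""
    lines = []
    root_depth = path.rstrip(os.sep).count(os.sep)
    for dirpath, dirnames, filenames in os.walk(path):
        depth = dirpath.rstrip(os.sep).count(os.sep) - root_depth
        if depth >= max_depth:
            dirnames[:] = []  # prune: stop os.walk from descending further
            continue
        indent = "  " * depth
        lines.append(f"{indent}{os.path.basename(dirpath) or dirpath}/")
        for name in sorted(filenames):
            lines.append(f"{indent}  {name}")
    return "\n".join(lines)

# Example: print(view_directory("path/to/repo"))
```
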
00:23:33.660 | - As someone who has spent the last few years
00:23:35.660 | in the JS world,
00:23:36.780 | it'd be interesting to see SWE-Bench JS
00:23:39.620 | because these stack traces are useless
00:23:43.460 | because there's so much virtualization that we do.
00:23:45.500 | So they're very, very disconnected
00:23:46.620 | with where the code problems are actually appearing.
00:23:50.700 | - That makes me feel better
00:23:51.620 | about my limited front end experiences.
00:23:53.180 | I've like always struggled with that.
00:23:54.780 | - It's not your fault.
00:23:56.100 | We've gotten ourselves
00:23:57.860 | into a very, very complicated situation
00:23:59.460 | and I'm not sure it's entirely needed,
00:24:01.620 | but if you talk to our friends at Vercel,
00:24:03.260 | they will say it is.
00:24:04.220 | - I will say SWE-Bench just released SWE-Bench Multimodal
00:24:08.020 | which I believe is either entirely JavaScript
00:24:10.700 | or largely JavaScript.
00:24:12.260 | And it's entirely things
00:24:13.540 | that have visual components of them.
00:24:15.340 | - Are you going to tackle that?
00:24:16.500 | - We will see.
00:24:17.340 | I think it's on the list and there's interest,
00:24:18.980 | but no guarantees yet.
00:24:20.460 | - Just as a side note,
00:24:21.380 | it occurs to me that every model lab,
00:24:24.100 | including Anthropic, but the others as well,
00:24:26.700 | you should have your own SWE-Bench.
00:24:28.500 | Whatever your bug tracker tool,
00:24:30.460 | this is a general methodology
00:24:31.820 | that you can use to track progress, I guess.
00:24:34.700 | - Yeah, sort of running on our own internal code base.
00:24:36.940 | Yeah, that's a fun idea.
00:24:37.940 | - Since you spend so much time on the tool design,
00:24:39.980 | so you have this edit tool
00:24:41.140 | that can make changes and whatnot.
00:24:42.540 | Any learnings from that
00:24:43.900 | that you wish the AI IDEs would take in?
00:24:47.180 | Is there some special way to look at files, feed them in?
00:24:50.580 | - I would say the core of that tool is string replace.
00:24:54.540 | And so we did a few different experiments
00:24:56.900 | with different ways to specify how to edit a file.
00:25:00.180 | And string replace, basically,
00:25:02.100 | the model has to write out the existing version
00:25:04.740 | of the string and then a new version,
00:25:06.420 | and that just gets swapped in.
00:25:08.100 | We found that to be the most reliable way to do these edits.
00:25:11.900 | Other things that we tried
00:25:12.980 | were having the model directly write a diff,
00:25:15.580 | having the model fully regenerate files.
00:25:18.060 | That one is actually the most accurate,
00:25:20.100 | it takes so many tokens.
00:25:21.300 | And if you're in a very big file, it's cost prohibitive.
00:25:24.460 | There's basically a lot of different ways
00:25:26.020 | to sort of represent the same task.
00:25:28.140 | And they actually have pretty big differences
00:25:30.100 | in terms of like model accuracy.
00:25:32.140 | I think Aider, they have a really good blog
00:25:34.940 | where they explore some of these different methods
00:25:38.500 | for editing files and they post results about them,
00:25:41.140 | which I think is interesting.
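
A bare-bones sketch of that string-replace approach: the model writes out the exact existing text plus its replacement, and the tool refuses ambiguous edits so the error can be fed straight back to the model. The function name and error wording here are illustrative, not the exact tool from the blog post.

```python
# Hypothetical string-replace edit tool: swap one exact occurrence of
# `old_str` for `new_str`, and return an error string the model can act on
# if the match is missing or ambiguous.
from pathlib import Path

def str_replace(path: str, old_str: str, new_str: str) -> str:
    text = Path(path).read_text()
    count = text.count(old_str)
    if count == 0:
        return f"Error: the text to replace was not found in {path}."
    if count > 1:
        return (f"Error: the text to replace appears {count} times in {path}; "
                "include more surrounding lines so the match is unique.")
    Path(path).write_text(text.replace(old_str, new_str, 1))
    return f"Edited {path}."
```
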
00:25:42.180 | But I think this is like a really good example
00:25:44.060 | of the broader idea
00:25:45.100 | that like you need to iterate on tools
00:25:47.820 | rather than just a prompt.
00:25:49.380 | And I think a lot of people,
00:25:50.780 | when they make tools for an LLM,
00:25:54.100 | they kind of treat it
00:25:54.940 | like they're just writing an API for a computer.
00:25:58.060 | And it's sort of very minimal,
00:25:59.700 | it's sort of just the bare bones of what you'd need.
00:26:02.620 | And honestly, like it's so hard for the models to use those.
00:26:05.860 | I really, again,
00:26:06.700 | I come back to anthropomorphizing these models.
00:26:08.900 | Like imagine you're a developer
00:26:10.820 | and you just read this for the very first time
00:26:13.260 | and you're trying to use it.
00:26:14.140 | Like you can do so much better
00:26:15.900 | than like just sort of the bare API spec
00:26:17.980 | of what you'd often see,
00:26:19.260 | like include examples in the description,
00:26:21.420 | include like really detailed explanations
00:26:23.300 | of how things work.
00:26:24.420 | And I think that, again,
00:26:25.380 | also think about what is the easiest way
00:26:28.340 | for the model to represent the change
00:26:30.500 | that it wants to make.
00:26:31.700 | For file editing as an example,
00:26:33.500 | writing a diff is actually,
00:26:35.340 | let's take the most extreme example.
00:26:36.860 | You want the model to literally write a patch file.
00:26:39.580 | I think patch files have at the very beginning,
00:26:41.860 | like numbers of how many total lines change.
00:26:44.900 | That means before the model has actually written the edit,
00:26:47.540 | it needs to decide how many numbers
00:26:50.660 | or how many lines are gonna change.
00:26:52.220 | Don't quote me on that.
00:26:53.500 | I'm pretty sure, I think it's something like that,
00:26:55.580 | but I don't know if that's exactly the diff format,
00:26:57.860 | but you can certainly have formats
00:26:59.660 | that are much easier to express
00:27:01.100 | without messing up than others.
00:27:02.540 | And I like to think about like,
00:27:03.860 | think about how much human effort
00:27:06.420 | goes into designing human interfaces for things.
00:27:08.860 | Like, it's incredible.
00:27:09.700 | This is like entirely what FrontEnd is about,
00:27:11.900 | is creating better interfaces to kind of do the same things.
00:27:14.940 | And I think that same amount of attention
00:27:16.660 | and effort needs to go
00:27:17.620 | into creating agent computer interfaces.
00:27:19.660 | - It's a topic we've discussed,
00:27:21.140 | ACI or whatever that looks like.
00:27:24.620 | I would also shout out that,
00:27:25.700 | I think you released some of these toolings
00:27:27.620 | as part of computer use as well,
00:27:29.500 | and people really liked it.
00:27:31.500 | Yeah, it's all open source if people wanna check it out.
00:27:34.500 | I'm curious if there's an environment element
00:27:37.660 | that complements the tools.
00:27:39.260 | So how do you, like, do you have a sandbox?
00:27:41.260 | Do you, is it just Docker?
00:27:43.060 | 'Cause that can be slow or resource intensive.
00:27:45.540 | Do you have anything else that you would recommend?
00:27:47.580 | - Yeah, I don't think I can talk
00:27:49.220 | about sort of public details or about private details
00:27:52.060 | about how we implement our sandboxing.
00:27:54.500 | But obviously, we need to have sort of safe, secure,
00:27:57.300 | and fast sandboxes for training,
00:27:59.100 | for the models to be able to practice writing code
00:28:00.980 | and working in an environment.
00:28:03.140 | - I'm aware of a few startups working on agent sandboxing.
00:28:06.940 | E2B is a close friend of ours
00:28:08.620 | that Alessio has led a round in.
00:28:10.100 | But also I think there's others
00:28:11.340 | where they're focusing on snapshotting memory
00:28:13.900 | so that it can do time travel for debugging,
00:28:16.620 | computer use where you can control the mouse
00:28:19.180 | or keyboard or something like that.
00:28:20.900 | Whereas here, I think that the kinds of tools
00:28:22.500 | that we offer it are very, very limited
00:28:25.460 | to coding agent work cases like bash, edit,
00:28:28.780 | you know, stuff like that.
00:28:30.020 | - Yeah, I think the computer use demo that we released
00:28:32.820 | is an extension of that of it.
00:28:34.060 | It has the same bash and edit tools,
00:28:36.100 | but it also has the computer tool
00:28:37.940 | that lets it get screenshots
00:28:39.180 | and move the mouse and keyboard.
00:28:40.820 | Yeah, so I definitely think
00:28:41.660 | there's sort of more general tools there.
00:28:43.340 | And again, the tools we released
00:28:45.500 | as part of SweetBench were,
00:28:47.660 | I'd say they're very specific for editing files
00:28:50.620 | and doing bash, but at the same time,
00:28:52.300 | that's actually very general if you think about it.
00:28:54.500 | Anything that you would do on a command line
00:28:57.180 | or editing files, you can do with those tools.
00:28:59.940 | And so we do want those tools to feel
00:29:02.100 | like any sort of computer terminal work
00:29:06.180 | could be done with those same tools,
00:29:08.140 | rather than making tools that were very specific
00:29:10.700 | for SWE-Bench, like run tests as its own tool,
00:29:14.380 | for instance.
00:29:15.220 | - Yeah, you had a question about tests.
00:29:16.220 | - Yeah, yeah, exactly.
00:29:17.540 | I saw there's no test writer tool.
00:29:19.740 | Is it because it generates the code
00:29:22.340 | and then you're running it against SWE-Bench anyway?
00:29:24.380 | So it doesn't really need to write the test or?
00:29:26.740 | - Yeah, so this is one of the interesting things
00:29:28.980 | about SWE-Bench is that the tests
00:29:31.820 | that the model's output is graded on are hidden from it.
00:29:34.860 | That's basically so that the model can't cheat
00:29:37.060 | by looking at the tests and writing the exact solution.
00:29:40.180 | But I'd say typically the model,
00:29:42.260 | the first thing it does is it usually writes
00:29:44.660 | a little script to reproduce the error.
00:29:47.340 | And again, most SWE-Bench tasks are like,
00:29:49.540 | "Hey, here's a bug that I found.
00:29:51.540 | "I run this and I get this error."
00:29:53.300 | So the first thing the model does is try to reproduce that.
00:29:56.060 | And so it's kind of then rerunning that script
00:29:58.220 | as a mini test.
00:29:59.640 | But yeah, sometimes the model
00:30:01.220 | will accidentally introduce a bug
00:30:03.140 | that breaks some other test and it doesn't know about that.
00:30:05.540 | - And should we be redesigning any tools, APIs?
00:30:08.780 | We kind of talked about this on having more examples,
00:30:10.820 | but I'm thinking even things of Q as a query parameter
00:30:14.500 | in many APIs.
00:30:15.340 | It's easier for the model to re-query than read the Q.
00:30:17.900 | I'm sure it learned the Q by this point,
00:30:19.860 | but is there anything you've seen, like building this,
00:30:23.080 | where it's like, "Hey, if I were to redesign some CLI tool,
00:30:26.740 | "some API tool, I would change the way structure
00:30:29.420 | "to make it better for LLMs."
00:30:31.420 | - I don't think I've thought enough about that
00:30:33.820 | off the top of my head,
00:30:34.820 | but certainly just making everything more human-friendly.
00:30:37.840 | Like having like more detailed documentation and examples.
00:30:42.460 | I think examples are really good
00:30:44.180 | in things like descriptions.
00:30:45.340 | Like so many, like just using the Linux command line,
00:30:47.980 | like how many times I do like dash dash help
00:30:50.380 | or look at the man page or something.
00:30:51.580 | It's like, just give me one example
00:30:53.220 | of like how I actually use this.
00:30:54.340 | Like, I don't want to go read through a hundred flags.
00:30:55.980 | Just give me the most common example.
00:30:57.820 | And again, so things that would be useful for a human
00:31:01.180 | I think are also very useful for a model.
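
In that spirit, a tool definition can read less like a bare API spec and more like documentation with an example. The sketch below uses the Anthropic tool-schema shape (name, description, input_schema), but the description text itself is made up for illustration.

```python
# Illustrative tool definition with a detailed, example-bearing description,
# rather than a one-line spec. The wording is hypothetical.
BASH_TOOL = {
    "name": "bash",
    "description": (
        "Run a command in a bash shell at the root of the repository.\n"
        "* Shell state (cd, environment variables, virtualenvs) persists between calls.\n"
        "* Very long output is truncated, so prefer targeted commands.\n"
        "Example: to find where a function is defined, run\n"
        "  grep -rn 'def resolve_lookup' src/"
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "command": {"type": "string", "description": "The bash command to run."},
        },
        "required": ["command"],
    },
}
```
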
00:31:03.140 | - Yeah, I mean, there's one thing
00:31:04.180 | that you cannot give to code agents
00:31:08.080 | that is useful for human is this access to the internet.
00:31:11.200 | I wonder how to design that in.
00:31:13.000 | Because one of the issues that I also had
00:31:15.040 | with just the idea of SWE-Bench
00:31:17.880 | is that you can't do follow-up questions.
00:31:20.880 | You can't like look around for similar implementations.
00:31:23.940 | These are all things that I do when I try to fix code.
00:31:27.480 | And we don't do that.
00:31:28.640 | It's not, it wouldn't be fair.
00:31:30.040 | Like it'd be too easy to cheat,
00:31:31.520 | but then also it's kind of not being fair to these agents
00:31:33.760 | because they're not operating in a real world situation.
00:31:36.240 | Like if I had a real world agent,
00:31:37.680 | of course I'm giving it access to the internet
00:31:39.200 | 'cause I'm not trying to pass a benchmark.
00:31:41.200 | I don't have a question in there, more just like,
00:31:43.480 | I feel like the most obvious tool,
00:31:45.680 | access to the internet is not being used.
00:31:47.520 | - I think that that's really important for humans.
00:31:50.200 | But honestly, the models have so much general knowledge
00:31:52.800 | from pre-training that it's like less important for them.
00:31:56.720 | - But like versioning, you know.
00:31:58.080 | - If you're working on a newer thing
00:31:59.240 | that was like, that came after the knowledge cutoff,
00:32:01.500 | then yes, I think that's very important.
00:32:03.280 | I think actually this is like a broader problem
00:32:05.640 | that there is a divergence between SWE-Bench
00:32:08.640 | and like what customers will actually care about
00:32:11.120 | who are working on a coding agent for real use.
00:32:13.640 | And I think one of those there is like internet access
00:32:16.280 | and being able to like,
00:32:17.120 | how do you pull in outside information?
00:32:19.200 | I think another one is like,
00:32:20.800 | if you have a real coding agent,
00:32:22.000 | you don't wanna have it start on a task
00:32:24.380 | and like spin its wheels for hours
00:32:26.320 | because you gave it a bad prompt.
00:32:27.840 | You want it to come back immediately
00:32:29.280 | and ask follow-up questions
00:32:30.520 | and like really make sure it has a very detailed
00:32:32.720 | understanding of what to do,
00:32:34.320 | then go off for a few hours and do work.
00:32:36.640 | So I think that like real tasks
00:32:38.680 | are gonna be much more interactive with the agent
00:32:41.920 | rather than this kind of like one-shot system.
00:32:44.720 | And right now there's no benchmark that measures that.
00:32:47.620 | And maybe I think it'd be interesting
00:32:49.080 | to have some benchmark that is more interactive.
00:32:52.520 | I don't know if you're familiar with TauBench,
00:32:54.160 | but it's a customer service benchmark
00:32:56.500 | where there's basically one LLM that's playing
00:32:59.800 | the user or the customer that's getting support
00:33:02.480 | and another LLM that's playing the support agent
00:33:05.520 | and they interact and try to resolve the issue.
00:33:07.760 | - Yeah, we talked to the LMSYS guys.
00:33:09.400 | - Awesome, yeah.
00:33:10.240 | - And they also did MT-Bench for people listening along.
00:33:13.200 | So maybe we need MT-SWE-Bench.
00:33:14.520 | - Sure.
00:33:15.800 | Yeah, so maybe you could have something
00:33:17.480 | where like before the SuiteBench task starts,
00:33:20.120 | you have like a few back and forths
00:33:22.320 | with kind of like the author
00:33:24.760 | who can answer follow-up questions
00:33:26.480 | about what they want the task to do.
00:33:27.880 | And of course you'd need to do that
00:33:29.080 | where it doesn't cheat
00:33:30.560 | and like just get the exact thing out of the human
00:33:34.240 | or out of the sort of user.
00:33:35.720 | But I think that will be a really interesting thing to see.
00:33:37.880 | If you look at sort of existing agent work
00:33:39.900 | like Repl.it's coding agent,
00:33:41.960 | I think one of the really great UX things they do
00:33:45.000 | is like first having the agent create a plan
00:33:48.040 | and then having the human approve that plan
00:33:49.960 | or give feedback.
00:33:51.000 | I think for agents in general,
00:33:52.360 | like having a planning step at the beginning,
00:33:55.040 | one, just having that plan will improve performance
00:33:58.040 | on the downstream task
00:33:59.200 | just because it's kind of like a bigger chain of thought,
00:34:01.480 | but also it's just such a better UX.
00:34:03.640 | It's way easier for a human
00:34:06.040 | to iterate on a plan with a model
00:34:08.200 | rather than iterating on the full task
00:34:10.200 | that sort of has a much slower time through each loop.
00:34:12.920 | If the human has approved this implementation plan,
00:34:16.020 | I think it makes the end result a lot more
00:34:18.240 | sort of auditable and trustable.
00:34:20.440 | So I think there's a lot of things
00:34:22.680 | sort of outside of SWE-Bench
00:34:24.080 | that will be very important
00:34:25.160 | for real agent usage in the world.
00:34:27.360 | - Yeah, I would say also,
00:34:28.960 | there's a couple of comments on names that you dropped.
00:34:30.680 | Copilot also does the plan stage before it writes code.
00:34:34.800 | I feel like those approaches
00:34:36.640 | have generally been less Twitter successful
00:34:38.960 | because it's not prompt to code, it's prompt plan code.
00:34:42.360 | So there's a little bit of friction in there,
00:34:43.920 | but it's not much.
00:34:44.760 | Like it actually, you get a lot for what it's worth.
00:34:47.640 | And I also like the way that Devin does it
00:34:50.320 | where you can sort of edit the plan as it goes along.
00:34:52.640 | And then the other thing with Repl.it,
00:34:54.640 | we hosted a sort of dev day pre-game with Repl.it
00:34:58.120 | and they also commented about multi-agents.
00:35:00.720 | So like having two agents kind of bounce off of each other.
00:35:04.080 | I think it's a similar approach to what you're talking about
00:35:05.880 | with kind of the few-shot example,
00:35:08.200 | just as in the prompts of clarifying what the agent wants.
00:35:12.560 | But typically I think this would be implemented
00:35:14.360 | as a tool calling another agent, like a sub-agent.
00:35:18.320 | I don't know if you explored that.
00:35:19.600 | Do you like that idea?
00:35:20.640 | - I haven't explored this enough,
00:35:22.000 | but I've definitely heard of people
00:35:23.360 | having good success with this,
00:35:25.080 | of almost like basically having
00:35:27.280 | a few different sort of personas of agents,
00:35:30.240 | even if they're all the same LLM.
00:35:31.960 | I think this is one thing with multi-agent
00:35:33.520 | that a lot of people will kind of get confused by
00:35:35.280 | is they think it has to be different models
00:35:37.080 | behind each thing,
00:35:37.920 | but really it's sort of usually the same model
00:35:40.440 | with different prompts.
00:35:41.560 | And yet having them have different personas
00:35:44.240 | to kind of bring different sort of thoughts
00:35:46.240 | and priorities to the table.
00:35:47.760 | I've seen that work very well
00:35:49.520 | and sort of create a much more thorough
00:35:52.640 | and thought out response.
00:35:54.040 | I think the downside is just that
00:35:55.160 | it adds a lot of complexity
00:35:56.440 | and it adds a lot of extra tokens.
00:35:58.600 | So I think it depends what you care about.
00:36:00.200 | If you want a plan that's very thorough and detailed,
00:36:03.080 | I think it's great.
00:36:04.120 | If you want a really quick, just like write this function,
00:36:06.800 | you know, you probably don't want to do that
00:36:08.080 | and have like a bunch of different calls
00:36:10.160 | before it does this.
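A sketch of what "different personas, same model" can look like in practice; the persona prompts and model name are invented for the example, and the token cost grows with the number of personas, which is the trade-off mentioned above.

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20241022"  # assumed model name for this sketch

# Same model every time; only the system prompt (the "persona") changes.
PERSONAS = {
    "architect": "You focus on overall design, interfaces, and long-term maintainability.",
    "security reviewer": "You focus on input validation, permissions, and failure modes.",
    "pragmatist": "You focus on shipping the simplest thing that solves the user's problem.",
}


def ask(system: str, prompt: str) -> str:
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


def multi_persona_plan(task: str) -> str:
    # Each persona weighs in independently...
    opinions = {
        name: ask(system, f"Give your take on how to approach this task:\n{task}")
        for name, system in PERSONAS.items()
    }
    # ...then one final call merges them. More thorough, but several times the tokens.
    merged = "\n\n".join(f"{name}:\n{text}" for name, text in opinions.items())
    return ask(
        "You are the lead engineer writing the final plan.",
        f"Task:\n{task}\n\nInput from the team:\n{merged}\n\nWrite the final plan.",
    )
```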
00:36:11.320 | - And just talking about the prompt,
00:36:12.560 | why are XML tags so good in Claude?
00:36:16.120 | I think initially people were like,
00:36:17.200 | oh, maybe you're just getting lucky with XML,
00:36:19.600 | but I saw obviously you use them
00:36:21.000 | in your own agent prompts, so they must work.
00:36:23.840 | And why is it so model specific to your family?
00:36:26.880 | - Yeah, I think that there's, again,
00:36:28.560 | I'm not sure how much I can say,
00:36:29.680 | but I think there's historical reasons
00:36:31.200 | that internally we've preferred XML for the data.
00:36:34.080 | I think also the one broader thing I'll say
00:36:37.520 | is that if you look at certain kinds of outputs,
00:36:41.120 | there is overhead to outputting in JSON.
00:36:43.800 | Like if you're trying to output code in JSON,
00:36:47.760 | there's a lot of extra escaping that needs to be done.
00:36:50.200 | I mean, that actually hurts model performance
00:36:51.960 | across the board, where versus like,
00:36:54.040 | if you're in just a single XML tag,
00:36:56.440 | there's none of that sort of escaping that needs to happen.
00:36:59.080 | That being said, I haven't tried having it write,
00:37:01.520 | you know, HTML and XML,
00:37:03.320 | which maybe then you start running
00:37:04.440 | into weird escaping things there, I'm not sure.
00:37:08.200 | But yeah, I'd say that's some historical reasons
00:37:10.560 | and there's less overhead of escaping.
00:37:13.080 | - I use XML in other models as well.
00:37:16.320 | And it's just a really nice way to make sure
00:37:18.080 | that the thing that ends is tied
00:37:20.600 | to the thing that starts.
00:37:22.000 | That's the only way to do code fences
00:37:24.200 | where you're pretty sure, like example one start,
00:37:26.720 | example one end, like that is one cohesive unit.
00:37:30.200 | - Because the braces are nondescriptive.
00:37:32.000 | - Yeah, exactly.
00:37:32.840 | That would be my simple reason.
00:37:35.400 | XML is good for everyone, not just Claude.
00:37:37.480 | Claude was just the first one to popularize it, I think.
00:37:39.240 | - I do definitely prefer to read XML than read JSON, so yeah.
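To make the escaping overhead concrete, here is a small, self-contained illustration (plain Python, no API calls) of the same snippet of code as a JSON string value versus inside a single XML tag.

```python
import json

code = 'print("hello")\nif x > 1:\n    return "done"'

# As a JSON string value, the quotes and newlines all have to be escaped,
# and the model has to get every backslash right while generating.
print(json.dumps({"code": code}))
# {"code": "print(\"hello\")\nif x > 1:\n    return \"done\""}

# Inside a single XML tag, the same code is emitted verbatim: no escaping at all.
print(f"<code>\n{code}\n</code>")
```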
00:37:43.200 | - Any other details that are like maybe underappreciated?
00:37:46.640 | I know, for example, you had the absolute paths
00:37:49.120 | versus relative.
00:37:50.280 | Any other, yeah, fun nuggets?
00:37:51.920 | - Yeah, no, I think that's a good sort of anecdote
00:37:54.200 | to mention about iterating on tools.
00:37:56.080 | Like I said, spend time prompt engineering your tools
00:37:59.560 | and don't just write the prompt,
00:38:00.920 | but like write the tool and then actually give it
00:38:04.880 | to the model and like read a bunch of transcripts
00:38:07.400 | about how the model tries to use the tool.
00:38:09.960 | And I think you will find, like by doing that,
00:38:12.360 | you will find areas where the model misunderstands a tool
00:38:16.000 | or makes mistakes and then basically change the tool
00:38:19.480 | to make it foolproof.
00:38:20.840 | And there's this Japanese term, poka-yoke,
00:38:23.440 | about like making tools mistake-proof.
00:38:26.400 | You know, the classic idea is you have like,
00:38:28.360 | you can have like a plug that can fit either way
00:38:30.320 | and that's dangerous, or you can make it asymmetric
00:38:32.560 | so that like it can't fit this way, it has to go like this.
00:38:35.120 | And like, that's a better tool
00:38:36.480 | because you can't use it the wrong way.
00:38:38.600 | So for this example of like absolute paths,
00:38:41.560 | one of the things that we saw while testing these tools
00:38:44.080 | is, oh, if the model has like, you know, done CD
00:38:47.720 | and moved to a different directory,
00:38:49.520 | it would often get confused when trying to use the tool
00:38:52.520 | because it's like now in a different directory.
00:38:54.960 | And so the paths aren't lining up.
00:38:56.200 | So we said, oh, look, let's just force the tool
00:38:58.560 | to always require an absolute path.
00:39:00.760 | And then, you know, that's easy for the model to understand.
00:39:03.080 | It knows sort of where it is, it knows where the files are.
00:39:06.000 | And then once we have it always giving absolute paths,
00:39:08.600 | it never messes up even like no matter where it is,
00:39:10.800 | because it just, if you're using an absolute path,
00:39:12.920 | it doesn't matter where you are.
00:39:13.960 | So like iterations like that, you know,
00:39:16.160 | let us make the tool foolproof for the model.
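Loosely, that "make the tool foolproof" idea lives in the tool definition itself. Below is only a sketch of what a file-reading tool with the absolute-path constraint might look like in the name / description / input_schema shape that tool definitions take; the tool name, wording, and handler are invented for the example.

```python
import os

# A hypothetical file-reading tool definition; the wording is illustrative only.
view_file_tool = {
    "name": "view_file",
    "description": (
        "Read a file and return its contents. "
        "The path MUST be an absolute path (starting with '/'); "
        "relative paths are rejected, so the result never depends on the "
        "current working directory."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {
                "type": "string",
                "description": "Absolute path to the file, e.g. /repo/src/main.py",
            }
        },
        "required": ["path"],
    },
}


def run_view_file(tool_input: dict) -> str:
    path = tool_input["path"]
    # Enforce the constraint in code too, not just in the description.
    if not os.path.isabs(path):
        return "Error: path must be absolute, e.g. /repo/src/main.py"
    with open(path) as f:
        return f.read()
```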
00:39:18.880 | I'd say there's other categories of things where we see,
00:39:21.480 | oh, if the model, you know, opens Vim, like, you know,
00:39:25.120 | it's never going to return.
00:39:26.200 | And so the tool is like stuck.
00:39:28.680 | - Did it get stuck?
00:39:29.520 | - Yeah.
00:39:30.340 | - Get out of Vim.
00:39:31.180 | - What?
00:39:32.020 | - Well, because the tool is like,
00:39:33.760 | it's just text in, text out, it's not interactive.
00:39:36.080 | So it's not like the model doesn't know
00:39:37.960 | how to get out of Vim.
00:39:38.860 | It's that the way that the tool is like hooked up
00:39:41.700 | to the computer is not interactive.
00:39:44.280 | - Yes, I mean, there is the meme of no one knows
00:39:46.320 | how to get out of Vim.
00:39:47.400 | You know, basically we just added instructions
00:39:50.060 | in the tool of like, hey, don't launch commands
00:39:53.120 | that don't return.
00:39:53.960 | Like, yeah, like don't launch Vim, don't launch whatever.
00:39:57.440 | If you do need to do something, you know,
00:39:58.840 | put an ampersand after it or launch it in the background.
00:40:01.520 | And so like, just, you know,
00:40:03.120 | putting kind of instructions like that,
00:40:06.040 | just right in the description for the tool
00:40:07.480 | really helps the model.
00:40:08.640 | And I think like that's an underutilized space
00:40:11.160 | of prompt engineering where like people might try to do that
00:40:13.620 | in the overall prompt,
00:40:14.640 | but just put that in the tool itself.
00:40:16.360 | So the model knows that it's like for this tool,
00:40:18.760 | this is what's relevant.
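In the same spirit, the guidance about non-returning commands can sit right in the tool description rather than in the overall prompt. The wording below is illustrative, not the actual tool text Anthropic ships, and the timeout is just a belt-and-suspenders fallback.

```python
import subprocess

# A hypothetical shell tool whose description carries the "don't launch
# things that never return" guidance.
bash_tool = {
    "name": "bash",
    "description": (
        "Run a shell command and return its stdout and stderr. "
        "This tool is text-in, text-out and NOT interactive: do not launch "
        "programs that wait for input or never return (vim, nano, top, ...). "
        "If you need a long-running process, start it in the background "
        "with a trailing '&' and check on it later."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}


def run_bash(tool_input: dict) -> str:
    # A timeout is the last line of defense if the model ignores the description.
    try:
        result = subprocess.run(
            tool_input["command"], shell=True, capture_output=True, text=True, timeout=120
        )
        return result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return "Error: command timed out after 120s (did it expect interactive input?)"
```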
00:40:20.380 | - You said you worked on the function calling and tool use
00:40:23.160 | before you actually started the SWE-Bench work, right?
00:40:25.120 | Was there any surprises?
00:40:26.760 | Because you basically went from creator of that API
00:40:30.520 | to user of that API.
00:40:32.040 | Any surprises or changes you would make
00:40:34.720 | now that you have extensively dog fooded
00:40:37.480 | in a state-of-the-art agent?
00:40:39.760 | - I want us to make like a,
00:40:41.540 | maybe like a little bit less verbose SDK.
00:40:44.660 | I think some way, like right now it just takes a,
00:40:47.780 | I think we sort of force people to do the best practices
00:40:50.580 | of writing out sort of these full JSON schemas,
00:40:53.340 | but it would be really nice
00:40:54.180 | if you could just pass in a Python function as a tool.
00:40:57.540 | I think that could be something that--
00:40:58.380 | - I think that there's a lot of like--
00:40:59.540 | - There's helper libraries.
00:41:00.380 | - Instructor, you know, I don't know if there,
00:41:03.100 | if there's anyone else that is specializing for Anthropic,
00:41:06.100 | maybe Jeremy Howard's and Simon Willison's stuff.
00:41:09.060 | I think they all have Claude-specific stuff
00:41:11.060 | that they are working on.
00:41:12.140 | - Claudette.
00:41:12.980 | - Claudette, exactly.
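For what it's worth, the "just pass in a Python function" ergonomics are not hard to approximate on top of the current SDK. This is a hypothetical helper, not an Anthropic or Claudette API: it derives a rough input_schema from a function's signature and docstring.

```python
import inspect

# Very rough mapping from Python annotations to JSON Schema types.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def function_to_tool(fn) -> dict:
    """Build a tool definition (name / description / input_schema) from a function."""
    sig = inspect.signature(fn)
    properties, required = {}, []
    for name, param in sig.parameters.items():
        json_type = _JSON_TYPES.get(param.annotation, "string")
        properties[name] = {"type": json_type}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or fn.__name__,
        "input_schema": {"type": "object", "properties": properties, "required": required},
    }


def word_count(path: str, ignore_blank: bool = True) -> int:
    """Count the number of words in a text file."""
    with open(path) as f:
        lines = [line for line in f if line.strip() or not ignore_blank]
    return sum(len(line.split()) for line in lines)


print(function_to_tool(word_count))
# {'name': 'word_count', 'description': 'Count the number of words in a text file.',
#  'input_schema': {'type': 'object', 'properties': {'path': {'type': 'string'},
#  'ignore_blank': {'type': 'boolean'}}, 'required': ['path']}}
```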
00:41:14.140 | I also wanted to spend a little bit of time with SWE-Agent.
00:41:16.660 | It seems like a very general framework.
00:41:18.420 | Like, is there a reason you picked it
00:41:19.780 | apart from it's the same authors as SWE-Bench?
00:41:21.820 | - The main thing we wanted to go with
00:41:23.180 | was that it was the same authors as SWE-Bench,
00:41:25.180 | so it just felt sort of like the safest, most neutral option.
00:41:28.460 | And it was, you know, very high quality.
00:41:30.100 | It was very easy to modify, to work with.
00:41:33.500 | I would say it also actually,
00:41:35.140 | their underlying framework is sort of this,
00:41:38.740 | it's like, you know, think, act, observe,
00:41:41.140 | that they kind of go through this loop,
00:41:43.060 | which is like a little bit more hard-coded
00:41:45.340 | than what we wanted to do, but it's still very close.
00:41:47.500 | That's still very general.
00:41:48.900 | So it felt like a good match
00:41:50.020 | is sort of the starting point for our agent.
00:41:52.660 | And we had already sort of worked with the,
00:41:54.540 | and talked with the SWE-Bench people directly.
00:41:56.220 | So it felt nice to just have, you know,
00:41:58.220 | we already know the authors, this will be easy,
00:41:59.980 | easy to work with.
00:42:00.820 | - I'll share a little bit of like,
00:42:02.220 | this all seems disconnected,
00:42:04.300 | but once you figure out the people
00:42:05.900 | and where they go to school, it all makes sense.
00:42:08.380 | So it's all Princeton.
00:42:09.980 | - Yeah, SWE-Bench and SWE-Agent,
00:42:11.500 | it's a group out of Princeton.
00:42:12.580 | - Yeah, we had Shunyu Yao on the pod
00:42:14.420 | and he came up with the React paradigm.
00:42:16.620 | And that's like, think, act, observe, like that's all React.
00:42:20.100 | So they're all friends.
00:42:21.260 | - Yep, yeah, exactly.
00:42:22.220 | And you know, our,
00:42:23.540 | if you actually read our traces of our submission,
00:42:26.220 | you can actually see like, think, act, observe,
00:42:28.460 | like in our logs.
00:42:29.300 | And like, we just didn't even like change the printing code.
00:42:31.660 | Like that's, so it's not actually,
00:42:34.100 | it's still doing function calls under the hood
00:42:36.540 | and the model can do sort of multiple function calls
00:42:39.500 | in a row without thinking in between if it wants to.
00:42:41.980 | But yeah, so a lot of similarities
00:42:43.540 | and a lot of things we inherited from SWE-Agent
00:42:45.420 | just as a starting point for the framework.
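A stripped-down version of that think/act/observe loop on top of plain function calling might look like the sketch below; the model name is an assumption, and `tools` / `run_tool` are whatever tool definitions and executor the caller supplies.

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20241022"  # assumed model name for this sketch


def agent_loop(task: str, tools: list[dict], run_tool, max_turns: int = 100):
    """run_tool(name, input_dict) -> str executes one tool call and returns its output."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        # "Think": the model reasons and decides whether to call tools.
        response = client.messages.create(
            model=MODEL, max_tokens=4096, tools=tools, messages=messages
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            return response  # no "act" requested, so the model is done

        # "Act": run every tool call in this turn, then "observe" the results.
        results = []
        for block in response.content:
            if block.type == "tool_use":
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": run_tool(block.name, block.input),
                })
        messages.append({"role": "user", "content": results})
    return response
```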
00:42:47.260 | - Yeah, any thoughts about other agent frameworks?
00:42:50.220 | I think there's, you know,
00:42:51.260 | the whole gamut from very simple to like very complex.
00:42:54.180 | - AutoGen, CrewAI, LangGraph.
00:42:56.140 | - Yeah, yeah.
00:42:56.980 | I think I haven't explored a lot of them in detail.
00:43:00.820 | I would say with agent frameworks in general,
00:43:03.140 | they can certainly save you some like boilerplate,
00:43:05.820 | but I think there's actually this like downside
00:43:08.340 | of making agents too easy where you end up very quickly
00:43:12.140 | like building a much more complex system than you need.
00:43:15.020 | And suddenly, you know, instead of having one prompt,
00:43:17.460 | you have five agents that are talking to each other
00:43:19.700 | and doing a dialogue.
00:43:20.660 | And it's like, because the framework made that 10 lines
00:43:23.700 | to do, you end up building something
00:43:25.140 | that's way too complex.
00:43:26.500 | So I think I would actually caution people
00:43:28.220 | to like try to start without these frameworks if you can,
00:43:32.340 | because you'll be closer to the raw prompts
00:43:34.540 | and be able to sort of directly understand what's going on.
00:43:37.740 | I think a lot of times these frameworks also,
00:43:40.260 | by trying to make everything feel really magical,
00:43:43.300 | you end up sort of really hiding what the actual prompt
00:43:47.580 | and output of the model is,
00:43:49.140 | and that can make it much harder to debug.
00:43:51.380 | So certainly these things have a place,
00:43:52.740 | and I think they do really help
00:43:54.020 | at getting rid of boilerplate,
00:43:55.700 | but they come with this cost of obfuscating
00:43:58.460 | what's really happening and making it too easy
00:44:01.180 | to very quickly add a lot of complexity.
00:44:03.820 | So yeah, I would recommend people to like try it
00:44:06.220 | from scratch and it's like not that bad.
00:44:08.220 | - Would you rather have like a framework of tools?
00:44:11.100 | You know, do you almost see like,
00:44:12.300 | hey, like it's maybe easier to get tools
00:44:14.860 | that are already well curated,
00:44:16.460 | like the ones that you build, you know,
00:44:18.020 | if I had an easy way to get the best tool from you
00:44:21.540 | and like you maintain the definition or yeah,
00:44:23.700 | any thoughts on how you want to formalize tool sharing?
00:44:26.580 | - Yeah, I think that's something
00:44:27.540 | that we're certainly interested in exploring.
00:44:29.900 | And I think there is space for sort of these general tools
00:44:33.260 | that will be very broadly applicable.
00:44:35.260 | But at the same time,
00:44:36.100 | most people that are building on these,
00:44:37.500 | they do have, you know, much more specific things
00:44:40.300 | that they're trying to do.
00:44:41.500 | You know, I think that might be useful
00:44:42.540 | for hobbyists and demos,
00:44:44.420 | but the ultimate end applications are going to be bespoke.
00:44:47.300 | And so we just want to make sure
00:44:48.740 | that the model's great at any tool that it uses,
00:44:51.380 | but certainly something we're exploring.
00:44:52.780 | - So everything bespoke, no frameworks, no anything.
00:44:55.580 | Just build. - For now, for now.
00:44:57.100 | - Yeah, I would say that like the best thing I've seen
00:44:59.180 | is people building up from like,
00:45:01.580 | build some good util functions
00:45:03.100 | and then you can use those as building blocks.
00:45:04.940 | - Yeah, yeah.
00:45:05.780 | I have a utils folder where I call these scripts.
00:45:08.180 | My framework is like def call_anthropic.
00:45:10.700 | And then I just put all the defaults.
00:45:12.300 | - Yeah, exactly.
00:45:13.220 | There's a startup hidden in every utils folder, you know?
00:45:15.820 | - No, totally not. - If you use it enough,
00:45:17.220 | like it's a startup, you know, like at some point.
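That joke lands because the utils function really is about ten lines. A sketch of a `call_anthropic` with the defaults baked in; the default model, system prompt, and settings here are just one reasonable choice, not a recommendation.

```python
import anthropic

_client = anthropic.Anthropic()


def call_anthropic(prompt: str,
                   system: str = "You are a concise, helpful engineering assistant.",
                   model: str = "claude-3-5-sonnet-20241022",  # assumed default
                   max_tokens: int = 1024,
                   temperature: float = 0.0) -> str:
    """One function with all the defaults baked in -- the whole 'framework'."""
    response = _client.messages.create(
        model=model,
        max_tokens=max_tokens,
        temperature=temperature,
        system=system,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```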
00:45:20.660 | I'm kind of curious,
00:45:21.500 | is there a maximum length of turns that it took?
00:45:25.980 | Like what was the longest run?
00:45:27.020 | - I actually don't.
00:45:27.860 | I mean, we had, it had basically infinite turns
00:45:31.020 | until it ran into 200K context.
00:45:33.740 | I should have looked this up.
00:45:35.540 | I don't know.
00:45:37.100 | And so for some of those failed cases
00:45:38.820 | where it eventually ran out of context,
00:45:40.420 | I mean, it was over a hundred turns.
00:45:42.820 | I'm trying to remember like the longest successful run,
00:45:45.620 | but I think it was definitely over a hundred turns
00:45:47.820 | that some of the times, you know?
00:45:48.660 | - Which is not that much.
00:45:49.500 | It's a coffee break.
00:45:50.340 | - Yeah, yeah.
00:45:52.180 | But certainly, you know, these things can be a lot of turns.
00:45:53.940 | And I think that's because some of these things
00:45:55.660 | are really hard where it's going to take, you know,
00:45:57.900 | many tries to do it.
00:45:59.460 | - Yeah, and if you think about like,
00:46:01.100 | think about a task that takes a human four hours to do,
00:46:03.980 | like think about how many different like files you read
00:46:07.100 | and like times you edit a file in four hours.
00:46:09.220 | Like that's a lot more than a hundred.
00:46:10.700 | - How many times you open Twitter?
00:46:12.140 | - Yeah.
00:46:12.980 | - Because you get distracted.
00:46:13.940 | But if you had a lot more compute,
00:46:16.260 | what's kind of like the return on the extra compute now?
00:46:19.060 | So like, you know, if you had thousands of turns
00:46:21.540 | or like whatever, like how much better would it get?
00:46:24.100 | - Yeah, this, I don't know.
00:46:25.220 | And I think this is,
00:46:26.860 | I think sort of one of the open areas of research
00:46:29.580 | in general with agents is memory
00:46:31.460 | and sort of how do you have something
00:46:33.660 | that can do work beyond its context length
00:46:37.460 | where you're just purely appending.
00:46:38.820 | So you mentioned earlier things like pruning bad paths.
00:46:41.900 | I think there's a lot of interesting work around there.
00:46:44.140 | Can you just roll back, but summarize,
00:46:46.380 | hey, don't go down this path.
00:46:47.900 | - There'll be dragons.
00:46:49.100 | - Yeah, I think that's very interesting
00:46:51.500 | that you could have something that uses way more tokens
00:46:54.420 | without ever using at a time more than 200K.
00:46:58.180 | So I think that's very interesting.
00:46:59.980 | I think the biggest thing is like,
00:47:01.260 | can you make the model sort of losslessly summarize
00:47:05.700 | what it's learned from trying different approaches
00:47:08.100 | and bring things back?
00:47:09.620 | I think that's sort of the big challenge.
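One way to picture that "roll back, but keep the lesson" idea, with a hypothetical `summarize` callable (for example, another LLM call) on top of an ordinary message list:

```python
def prune_failed_attempt(messages: list[dict], checkpoint: int, summarize) -> list[dict]:
    """Roll the conversation back to `checkpoint`, but keep a lossy summary of
    what was learned on the abandoned path so the model doesn't repeat it.

    `messages` is an ordinary chat transcript; `summarize` is any callable that
    turns a list of messages into a short piece of text. Total tokens used can
    far exceed the context window while the live context stays small.
    """
    failed_branch = messages[checkpoint:]
    lesson = summarize(failed_branch)
    pruned = messages[:checkpoint]
    pruned.append({
        "role": "user",
        "content": "Note from a previous attempt (do not go down this path again):\n" + lesson,
    })
    return pruned
```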
00:47:11.700 | - What about different models?
00:47:12.940 | So you have Haiku, which is like, you know, cheaper.
00:47:15.180 | So you're like, well, what if I have a Haiku
00:47:17.580 | to do a lot of these smaller things and then put it back up?
00:47:20.660 | - I think Cursor might have said
00:47:22.260 | that they actually have a separate model for file editing.
00:47:25.340 | I'm trying to remember, I think they were on a,
00:47:27.300 | maybe the Lex Fridman podcast where they said like,
00:47:29.580 | they have a bigger model, like write what the code should be
00:47:32.180 | and then a different model, like apply it.
00:47:34.220 | So I think there's a lot of interesting room
00:47:36.060 | for stuff like that.
00:47:36.900 | - Yeah, fast applying.
00:47:37.820 | We actually did a pod with Fireworks
00:47:39.100 | who they worked with on that; it's speculative decoding.
00:47:42.020 | - But I think there's also really interesting things
00:47:43.780 | about like, you know, paring down input tokens as well.
00:47:47.020 | Especially sometimes the models trying to read
00:47:48.900 | like a 10,000 line file, like that's a lot of tokens.
00:47:51.620 | And you know, most of it is actually
00:47:52.900 | not going to be relevant.
00:47:54.220 | I think it'd be really interesting
00:47:55.340 | to like delegate that to Haiku.
00:47:57.700 | Haiku read this file and just pull out
00:47:59.740 | the most relevant functions.
00:48:01.700 | And then, you know, Sonnet reads just those
00:48:04.620 | and you save 90% on tokens.
00:48:07.060 | I think there's a lot of really interesting room
00:48:08.740 | for things like that.
00:48:09.580 | And again, we were just trying to do sort of
00:48:11.860 | the simplest, most minimal thing and show that it works.
00:48:14.820 | I'm really hoping that people,
00:48:16.620 | sort of the agent community builds things like that
00:48:19.300 | on top of our models.
00:48:20.420 | That's again, why we released these tools.
00:48:22.140 | You know, we're not going to go and do lots more submissions
00:48:24.420 | to SWE-Bench and try to prompt engineer this
00:48:27.100 | and build a bigger system.
00:48:27.940 | We want people to, like the ecosystem,
00:48:29.620 | to do that on top of our models.
00:48:31.020 | But yeah, so I think that's a really interesting one.
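A sketch of that delegation pattern; the model IDs are assumptions, and the prompts are invented for the example.

```python
import anthropic

client = anthropic.Anthropic()
HAIKU = "claude-3-5-haiku-20241022"    # assumed model IDs for this sketch
SONNET = "claude-3-5-sonnet-20241022"


def ask(model: str, prompt: str) -> str:
    response = client.messages.create(
        model=model,
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


def answer_about_big_file(path: str, question: str) -> str:
    with open(path) as f:
        source = f.read()
    # Cheap pass: the small model pulls out only the parts that matter.
    relevant = ask(HAIKU,
        f"Question: {question}\n\nFrom the file below, copy out, verbatim, only "
        f"the functions and definitions relevant to the question.\n\n{source}")
    # Expensive pass: the big model reasons over the much smaller excerpt.
    return ask(SONNET,
        f"Relevant excerpts from {path}:\n{relevant}\n\nQuestion: {question}")
```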
00:48:32.700 | - It turns out, I think you did do 3.5 Haiku
00:48:35.260 | with your tools and it scored a 40.6.
00:48:38.500 | - Yes, yeah, so it did very well.
00:48:40.060 | It itself is actually very smart, which is great.
00:48:42.900 | But we haven't done any experiments
00:48:44.300 | with this like combination of the two models.
00:48:46.940 | But yeah, I think that's one of the exciting things
00:48:48.260 | is that how well Haiku 3.5 did on SWE-Bench
00:48:51.980 | shows that sort of even our smallest, fastest models
00:48:54.940 | is very good at sort of thinking agentically
00:48:57.660 | and working on hard problems.
00:48:58.940 | Like it's not just sort of for writing simple text anymore.
00:49:02.580 | - And I know you're not going to talk about it,
00:49:03.980 | but like Sonnet is not even supposed to be the best model.
00:49:06.860 | You know, like Opus, it's kind of like we left it
00:49:09.900 | at 3.0 back in the corner there.
00:49:11.620 | At some point, I'm sure the new Opus will come out.
00:49:14.180 | And if you had Opus Plus on it, that sounds very, very good.
00:49:18.580 | - There's a run with SWE-Agent plus Opus,
00:49:20.500 | but that's the official SWE-Bench guys doing it.
00:49:23.180 | - That was the older, you know, 3.0.
00:49:24.380 | - You didn't do yours.
00:49:25.460 | - Yeah.
00:49:26.300 | - Okay, did you want to, or did you just?
00:49:28.420 | I mean, you could just change the model name.
00:49:30.060 | - I think, I think we didn't submit it,
00:49:32.740 | but I think we included it in our model card.
00:49:35.020 | We included the score as a comparison.
00:49:36.740 | - Yeah.
00:49:37.580 | - Yeah, and Sonnet and Haiku, actually,
00:49:39.020 | I think the new ones, they both outperformed
00:49:42.220 | the original Opus.
00:49:43.060 | - Yeah, I did see that.
00:49:43.900 | - Yeah, it's a little bit hard to find.
00:49:45.860 | - Yeah, yeah, it's not an exciting score,
00:49:48.180 | so we didn't feel the need to submit it to the benchmark.
00:49:51.260 | - We can cut over to computer use if we're okay
00:49:53.140 | with moving on to topics on this, if anything else.
00:49:56.460 | - I think we're good.
00:49:57.540 | I think, I'm trying to think if there's anything else
00:49:59.780 | SWE-Bench related.
00:50:01.100 | - It doesn't have to be also just specifically SWE-Bench,
00:50:03.420 | but just your thoughts on building agents,
00:50:04.900 | 'cause you are one of the few people
00:50:06.140 | that have reached this leaderboard
00:50:08.060 | on building a coding agent.
00:50:10.060 | This is the state of the art.
00:50:11.380 | It's surprisingly not that hard to reach
00:50:14.380 | with some good principles, right?
00:50:15.940 | But there's obviously a ton of low-hanging fruit
00:50:17.780 | that we covered.
00:50:18.620 | So just your thoughts on if you were to build
00:50:20.620 | a coding agent startup, maybe, what next?
00:50:23.820 | - I think the really interesting question for me
00:50:26.180 | for all the startups out there is this kind of divergence
00:50:29.220 | between the benchmarks and what real customers will want.
00:50:31.780 | So I'm curious, maybe the next time you have
00:50:34.660 | a coding agent startup on the podcast,
00:50:36.460 | you should ask them that.
00:50:37.300 | What are the differences that they're starting to make?
00:50:38.620 | - Tomorrow.
00:50:39.460 | - Oh, perfect, perfect, yeah.
00:50:40.660 | I'm actually very curious what they will see,
00:50:42.580 | 'cause I also have seen,
00:50:43.940 | I feel like it's like slowed down a little bit
00:50:45.740 | if I don't see the startups submitting
00:50:49.340 | to SWE-Bench that much anymore.
00:50:51.700 | - 'Cause of the traces, the traces.
00:50:53.500 | So we had Cosine on, they had like a 50-something on full,
00:50:58.220 | on SWE-Bench full, which is the hardest one.
00:51:00.580 | And they were rejected because they didn't want
00:51:02.340 | to submit their traces.
00:51:03.540 | - Yep.
00:51:04.360 | - IP, you know?
00:51:05.200 | - Yeah, that makes sense, that makes sense.
00:51:06.380 | - We actually, tomorrow, we're talking to Bolt,
00:51:08.140 | which is a Claude customer.
00:51:09.420 | You guys actually published a case study with them.
00:51:12.260 | I assume you weren't involved with that,
00:51:13.940 | but they were very happy with Claude.
00:51:16.140 | (laughing)
00:51:17.500 | One of the biggest launches of the year.
00:51:18.660 | - Yeah, totally.
00:51:19.500 | - We actually happened to be sitting
00:51:20.780 | in Adept's former office.
00:51:22.820 | My take on this is Anthropic shipped Adept as a feature,
00:51:25.500 | or as like an open source demo.
00:51:26.900 | - It's still a beta feature, but yes.
00:51:28.700 | - What was it like when you tried it for the first time?
00:51:30.900 | Was it obvious that Claude had reached that stage
00:51:34.820 | where you could do computer use?
00:51:36.980 | - It was somewhat of a surprise to me.
00:51:38.820 | Like, I think, I actually, I had been on vacation,
00:51:41.200 | and I came back, and everyone's like, computer use works.
00:51:43.740 | (laughing)
00:51:44.580 | And so it was kind of this very exciting moment.
00:51:46.980 | I mean, after the first, just like, you know, go to Google,
00:51:48.900 | I think I tried to have it play Minecraft or something,
00:51:50.940 | and it actually like installed and like opened Minecraft.
00:51:53.220 | I was like, wow, this is pretty cool.
00:51:54.740 | So I was like, wow, yeah,
00:51:55.620 | this thing can actually use a computer.
00:51:57.780 | And certainly, it is still beta, you know,
00:51:59.660 | there's certain things that it's not very good at yet.
00:52:02.380 | But I'm really excited, I think, most broadly,
00:52:06.260 | not just for like new things that weren't possible before,
00:52:10.140 | but as a much lower friction way to implement tool use.
00:52:14.240 | One anecdote from my days at Cobalt Robotics,
00:52:17.600 | we wanted our robots to be able to ride elevators,
00:52:20.300 | to go between floors and fully cover a building.
00:52:23.160 | The first way that we did this
00:52:24.420 | was doing API integrations with the elevator companies.
00:52:27.740 | And some of them actually had APIs,
00:52:29.780 | we could send that request, and it would move the elevator.
00:52:32.260 | Each new company we did took like six months to do,
00:52:35.580 | 'cause they were very slow, they didn't really care.
00:52:39.060 | - They're an elevator company, not an API company.
00:52:40.940 | - Even installing, like once we had it with the company,
00:52:43.380 | they would have to like literally go install an API box
00:52:45.900 | on the elevator that we wanted to use.
00:52:47.580 | And that would sometimes take six months, so very slow.
00:52:51.260 | And eventually we're like, okay, this is getting like,
00:52:54.200 | slowing down all of our customer deployments.
00:52:56.640 | And I was like, what if we just add an arm to the robot?
00:52:59.280 | And I added this little arm that could literally go
00:53:02.220 | and press the elevator buttons,
00:53:03.500 | and we used computer vision to do this.
00:53:05.740 | And we could deploy that in a single day,
00:53:08.060 | and have the robot being able to use the elevators.
00:53:10.420 | At the same time, it was slower than the API,
00:53:13.900 | it wasn't quite as reliable, you know,
00:53:15.460 | sometimes it would miss
00:53:16.380 | and it would have to try to press it again.
00:53:18.580 | But it would get there,
00:53:19.400 | but it was slower and a little bit less reliable.
00:53:21.460 | And I kind of see this as like an analogy to computer use
00:53:24.340 | of like, anything you can do with computer use today,
00:53:26.980 | you could probably write tool use
00:53:33.280 | and like integrate it with APIs hooked up to the language model.
00:53:33.280 | But that's going to take a bunch of software engineering
00:53:35.180 | to go write those integrations,
00:53:36.460 | you'll have to do all this stuff.
00:53:38.100 | With computer use, just give the thing a browser
00:53:40.700 | that's logged into what you want to integrate with,
00:53:42.980 | and it's going to work immediately.
00:53:44.620 | And I see that like reduction and friction
00:53:46.500 | as being incredibly exciting.
00:53:48.260 | Of like, imagine like a customer support team,
00:53:51.380 | where, okay, hey, you got this customer support bot,
00:53:54.480 | but you need to go integrate it with all these things.
00:53:56.980 | And you don't have any engineers
00:53:58.180 | on your customer support team.
00:53:59.860 | But if you can just give the thing a browser
00:54:01.580 | that's logged into your systems
00:54:03.200 | that you need it to have access to,
00:54:05.120 | now, suddenly in one day, you could be up and rolling
00:54:07.580 | with a fully integrated customer service bot
00:54:09.920 | that could go do all the actions you care about.
00:54:12.000 | So I think that's the most exciting thing for me
00:54:13.700 | about computer use is like reducing that friction
00:54:16.880 | of integrations to almost zero.
00:54:18.960 | - Or farming on World of Warcraft.
00:54:21.520 | - Yes, or that.
00:54:22.360 | - Just go computer use, very high value use cases.
00:54:25.520 | - I always say about this is, you know,
00:54:26.860 | this is like the oldest question in robotics
00:54:29.900 | or self-driving, which is, you know,
00:54:31.520 | do you drive by vision or do you have special tools?
00:54:33.640 | And vision is the universal tool to claim all tools.
00:54:37.520 | There's trade-offs, but like there's situations
00:54:38.980 | in which that will come.
00:54:40.220 | But, you know, this week's podcast,
00:54:41.560 | the one that we just put out had Stan Polu from DUST
00:54:44.960 | saying that he doesn't see a future
00:54:46.880 | where it's like the significant workhorse.
00:54:49.160 | I think there could be a separation
00:54:50.360 | between maybe like the high volume use cases,
00:54:54.280 | you want APIs, and then the long tail, you want computer use.
00:54:57.600 | - I totally agree.
00:54:58.440 | - Right?
00:54:59.260 | So you'll start, you'll prototype something
00:55:00.680 | with computer use, and then, hey, this is working.
00:55:03.000 | Like customers have adopted this feature.
00:55:04.780 | Okay, like, let's go turn it into an API
00:55:07.160 | and it'll be faster and use less tokens.
00:55:09.040 | - Yeah, I'd be interested to see a computer use agent
00:55:11.720 | replace itself by figuring out the API
00:55:14.600 | and then just dropping out of the equation altogether.
00:55:17.960 | You know?
00:55:18.800 | - Yeah, that's really fun actually.
00:55:20.100 | - If I was running an RPA company,
00:55:21.720 | like you would have the RPA scripting,
00:55:23.960 | RPA for people listening is robotic process automation,
00:55:26.840 | where you would script things
00:55:28.420 | that like always show up in sequence.
00:55:30.380 | So you don't have an LLM in the loop.
00:55:32.380 | And so basically what you need to do
00:55:33.660 | is train an LLM to code that script.
00:55:35.900 | And then you can sort of naturally hand off
00:55:38.140 | from computer use to non-computer user.
00:55:40.340 | - Yeah, or have some way to turn Claude's actions
00:55:43.660 | of computer use into a saved script
00:55:45.700 | that you can then run repeatedly.
00:55:47.360 | - Yeah, it'd be interesting to record that.
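A rough sketch of what recording computer-use actions for later RPA-style replay could look like; `execute` here stands in for whatever actually performs a click or keypress, which is outside the scope of the example.

```python
import json
import time

LOG_PATH = "session_actions.jsonl"


def record_action(action: dict, execute):
    """Run one computer-use action (click, type, keypress, ...) via the
    caller-supplied `execute` callable and log it, so the whole session can
    later be replayed without a model in the loop."""
    result = execute(action)
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps({"ts": time.time(), "action": action}) + "\n")
    return result


def replay(execute, delay: float = 0.5) -> None:
    """Re-run the recorded session as a plain RPA-style script."""
    with open(LOG_PATH) as f:
        for line in f:
            execute(json.loads(line)["action"])
            time.sleep(delay)
```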
00:55:48.980 | - Why did you decide to not ship any
00:55:51.140 | like sandbox harness for computer use?
00:55:54.580 | It's kind of like, "Hey, peace, run at your own risk."
00:55:57.320 | - It's Docker, right?
00:55:58.160 | - No, no, we launched it with, I think a VM or Docker,
00:56:00.960 | a Docker-based system.
00:56:01.880 | - But it's not for your actual computer, right?
00:56:05.100 | Like the instance just runs inside the Docker container.
00:56:07.800 | It's not for-
00:56:08.720 | - Yeah, it runs its own browser.
00:56:09.960 | I think, I mean, the main reason for that
00:56:12.320 | is one is sort of security.
00:56:13.840 | You know, we don't want, you know,
00:56:15.560 | the model can do anything.
00:56:16.920 | So we wanted to give it a sandbox,
00:56:18.760 | not have people do their own computer,
00:56:21.400 | at least sort of for our default experience.
00:56:23.180 | We really care about providing a nice sort of,
00:56:25.600 | making the default safe,
00:56:27.640 | I think is the best way for us to do it.
00:56:30.120 | And I mean, very quickly people made modifications
00:56:32.920 | to let you run it on your own desktop.
00:56:34.740 | And that's fine.
00:56:35.580 | Someone else can do that,
00:56:36.400 | but we don't want that to be the official
00:56:37.440 | anthropic thing to run.
00:56:38.880 | I would say also like from a product perspective right now,
00:56:41.880 | because this is sort of still in beta,
00:56:44.320 | I think a lot of the most useful use cases are,
00:56:47.560 | like a sandbox is actually what you want.
00:56:49.260 | You want something where,
00:56:50.360 | hey, it can't mess up anything in here.
00:56:52.360 | It only has what I gave it.
00:56:54.740 | Also, if it's using your computer, you know,
00:56:56.380 | you can't use your computer at the same time.
00:56:58.700 | I think you actually like want it to have its own screen.
00:57:01.420 | It's like you and a person pair programming,
00:57:03.660 | but only on one laptop versus you have two laptops.
00:57:05.900 | - Everyone should totally have a side laptop
00:57:07.460 | where the computer is just doing its thing.
00:57:09.140 | - Yeah, I think it's just a better experience.
00:57:11.700 | Unless there's something very explicit
00:57:13.300 | you want it to do for you on your own computer.
00:57:15.660 | - It becomes like you're sort of
00:57:17.940 | shelling into a remote machine
00:57:19.980 | and maybe checking in on it every now and then.
00:57:22.960 | I have fond memories of,
00:57:24.640 | half our audience is going to be too young to remember this,
00:57:26.320 | but Citrix, like desktop experience,
00:57:28.880 | like you were sort of remote into a machine
00:57:32.960 | that someone else was operating.
00:57:34.440 | And for a long time,
00:57:35.680 | that would be how you did like enterprise computing.
00:57:38.480 | - It's a viewer.
00:57:39.320 | - Yeah, it's coming back.
00:57:42.360 | Any other implications of computer use?
00:57:44.120 | Is it a fun demo or is it like the future of Anthropic?
00:57:47.480 | - I'm very excited about it.
00:57:48.980 | I think that like there's a lot
00:57:50.280 | of sort of very repetitive work
00:57:51.640 | that like computer use will be great for.
00:57:54.260 | I think I've seen some examples
00:57:55.560 | of people build like coding agents
00:57:57.840 | that then also like test the front end that they made.
00:58:01.240 | So I think it's very cool to like use computer use
00:58:03.480 | to be able to close the loop on a lot of things
00:58:05.380 | that right now just a terminal based agent can't do.
00:58:09.000 | So I think that's very exciting.
00:58:09.840 | - It's kind of like end to end testing.
00:58:11.480 | - Exactly, yeah, yeah.
00:58:12.680 | The end sort of front end and web testing
00:58:14.640 | is something I'm very excited about.
00:58:16.240 | - Yeah, I've seen Amanda also talking,
00:58:18.440 | this will be Amanda Askell, the head of Claude Character.
00:58:21.520 | She goes on a lunch break
00:58:22.400 | and it generates research ideas for her.
00:58:25.120 | Giving it a name like computer use is very practical.
00:58:27.600 | It's like you're supposed to do things,
00:58:29.380 | but maybe sometimes it's not about doing things,
00:58:30.960 | it's about thinking.
00:58:32.200 | And thinking, in the process of thinking,
00:58:33.820 | you're using the computer.
00:58:35.480 | In some way that's, you know, solving SWE-Bench,
00:58:37.120 | like you should be allowed to use the internet
00:58:39.280 | or you should be allowed to use a computer to solve it
00:58:41.920 | and use your vision and use whatever.
00:58:43.400 | Like we're just sort of shackling it
00:58:45.120 | with all these restrictions just 'cause we wanna play nice
00:58:48.080 | for a benchmark, but really, you know,
00:58:50.480 | a full AI will be able to do all these things, to think.
00:58:53.940 | - Yeah, we'll definitely be able to.
00:58:54.780 | - To reason.
00:58:55.620 | - To Google and search for things.
00:58:56.920 | - Yeah.
00:58:57.760 | - Yeah, pull down inspiration.
00:58:58.680 | - Can we just do a, before we wrap, a robotics corner?
00:59:01.960 | - Oh, yeah, yeah, yeah.
00:59:02.800 | - People are always curious,
00:59:03.640 | especially with somebody that is not trying
00:59:05.720 | to hype their own company.
00:59:07.160 | What's the state of AI robotics, under hyped, over hyped?
00:59:10.660 | - Yeah, and I'll say like these are my opinions,
00:59:12.880 | not Anthropic's.
00:59:13.840 | And again, coming from a place
00:59:15.520 | of a burned out robotics founder.
00:59:17.080 | So take everything with a grain of salt.
00:59:19.520 | I would say on the positives,
00:59:20.560 | like there is really sort of incredible progress
00:59:24.400 | that's happened in the last five years
00:59:26.320 | that I think will be a big unlock for robotics.
00:59:28.620 | The first is just general purpose language models.
00:59:30.580 | I mean, there was an old saying in robotics
00:59:33.020 | that if to fully describe your task is harder
00:59:36.740 | than to just do the task, you can never automate it.
00:59:39.360 | 'Cause like, it's gonna take more effort
00:59:40.680 | to even tell the robot how to do this thing
00:59:42.360 | than to me just do it itself.
00:59:43.920 | LLM solved that.
00:59:45.200 | I no longer need to go exhaustively program
00:59:48.680 | in every little thing I could do.
00:59:50.200 | The thing just has common sense
00:59:51.440 | and it's gonna know how do I make a Reuben sandwich?
00:59:54.480 | I'm not gonna have to go program that in.
00:59:56.280 | Whereas before like the idea of even like a cooking thing,
00:59:59.040 | it's like, oh God, like we're gonna have the team
01:00:00.560 | of engineers that are hard coding recipes
01:00:02.980 | for the long tail of anything, it'd be a disaster.
01:00:06.260 | So I think that's one thing is that bringing common sense
01:00:09.120 | really is like solves this huge problem describing tasks.
01:00:12.320 | The second big innovation has been diffusion models
01:00:15.760 | for path planning.
01:00:16.800 | A lot of this work came out of Toyota Research.
01:00:19.760 | There's a lot of startups now that are working on this
01:00:21.800 | like Physical Intelligence (Pi), Chelsea Finn's
01:00:24.760 | startup out of Stanford.
01:00:26.120 | And the basic idea here is using a little bit of the,
01:00:29.800 | I'd say maybe more inspiration from diffusion
01:00:32.020 | rather than diffusion models themselves,
01:00:34.000 | but they're a way to basically learn
01:00:36.240 | an end to end sort of motion control.
01:00:39.720 | Whereas previously all of robotics motion control
01:00:42.680 | was sort of very hard coded.
01:00:44.960 | You either, you're programming in explicit motions
01:00:48.240 | or you're programming in an explicit goal
01:00:51.120 | and using an optimization library
01:00:52.820 | to find the shortest path to it.
01:00:54.760 | This is now something where you just give it
01:00:56.820 | a bunch of demonstrations.
01:00:58.200 | And again, just like using learning,
01:01:00.560 | it's basically like learning from these examples.
01:01:03.680 | What does it mean to go pick up a cup?
01:01:05.920 | And doing these in a way just like diffusion models
01:01:08.200 | where they're somewhat conditioned by text,
01:01:11.320 | you can have it, the same model learn many different tasks.
01:01:14.840 | And then the hope is that these start to generalize,
01:01:18.120 | that if you've trained it on picking up coffee cups
01:01:21.200 | and picking up books, then when I say pick up the backpack,
01:01:24.400 | it knows how to do that too,
01:01:25.920 | even though you've never trained it on that.
01:01:27.360 | That's kind of the holy grail here
01:01:29.040 | is that you train it on 500 different tasks
01:01:33.000 | and then that's enough to really get it to generalize
01:01:35.240 | to do anything you would need.
01:01:36.760 | I think that's like still a big TBD
01:01:39.280 | and these people are working,
01:01:40.880 | have like measured some degree of generalization.
01:01:44.780 | But at the end of the day, it's also like LLMs.
01:01:46.720 | Like, you know, do you really care about the thing,
01:01:48.640 | being able to do something
01:01:49.480 | that no one has ever shown in training data?
01:01:51.980 | People for like a home robot,
01:01:53.560 | there's gonna be like a hundred things
01:01:55.360 | that people really want it to do.
01:01:56.360 | And you can just make sure it has good training
01:01:58.000 | for those things.
01:01:58.840 | What you do care about then is like generalization
01:02:01.320 | within a task of, oh,
01:02:02.240 | I've never seen this particular coffee mug before.
01:02:05.000 | Can I still pick it up?
01:02:06.200 | And those, the models do seem very good at.
01:02:08.040 | So these kind of are the two big things
01:02:10.160 | that are going for robotics right now
01:02:11.600 | is LLMs for common sense
01:02:14.320 | and diffusion inspired path planning algorithms.
01:02:18.200 | I think this is very promising,
01:02:20.240 | but I think there's a lot of hype.
01:02:21.540 | And I think where we are right now
01:02:23.520 | is where self-driving cars were 10 years ago.
01:02:26.320 | I think we have very cool demos that work.
01:02:29.080 | I mean, 10 years ago,
01:02:29.920 | you had videos of people driving a car on the highway,
01:02:33.300 | driving a car on a street with a safety driver,
01:02:37.060 | but it's really taken a long time to go from there to,
01:02:39.780 | I took a Waymo here today.
01:02:41.220 | And even then Waymo is only in SF and a few other cities.
01:02:44.540 | And I think like it takes a long time for these things
01:02:47.600 | to actually like get everywhere
01:02:49.340 | and to get all the edge cases covered.
01:02:51.540 | I think that for robotics,
01:02:52.940 | the limiting factor is gonna be reliability.
01:02:55.940 | That these models are really good at doing these demos
01:02:58.620 | of like doing laundry or doing dishes.
01:03:01.240 | If they only work 99% of the time, like that sounds good,
01:03:04.800 | but that's actually really annoying.
01:03:06.200 | Like humans are really good at these tasks.
01:03:08.080 | Like imagine if like one out of every 100 dishes,
01:03:11.220 | it washed, it breaks.
01:03:12.720 | Like you would not want that robot in your house
01:03:15.080 | or you certainly wouldn't want that in your factory
01:03:17.320 | if one of every 100 boxes that it moves,
01:03:19.560 | it drops and breaks things inside it.
01:03:21.480 | So I think for these things to really be useful,
01:03:24.080 | they're gonna have to hit a very, very high level
01:03:26.760 | of reliability, just like self-driving cars.
01:03:29.440 | And I don't know how hard it's gonna be
01:03:32.360 | for these models to move from like the 95% reliability
01:03:36.060 | to 99.9.
01:03:37.360 | I think that's gonna be the big thing.
01:03:39.280 | And I think also like I'm a little skeptical
01:03:41.640 | of how good the unit economics of these things will be.
01:03:45.320 | These robots are gonna be very expensive to build.
01:03:48.080 | And if you're just trying to replace labor,
01:03:52.320 | like a one for one purchase,
01:03:54.600 | it kind of sets an upper cap about how much you can charge.
01:03:57.440 | And so, it seems like it's not that great a business.
01:04:01.320 | I'm also worried about that
01:04:02.280 | for the self-driving car industry.
01:04:03.680 | - Do you see most of the applications
01:04:05.920 | actually taking some of the older,
01:04:07.800 | especially manufacturing machinery,
01:04:09.520 | which is like, it needs to be like very precise,
01:04:12.080 | even if it's off by just a few millimeters,
01:04:14.080 | it cannot screw up the whole thing
01:04:15.640 | and be able to adjust at the edge.
01:04:18.200 | Or do you think like the net new use cases
01:04:20.760 | may be like the more interesting?
01:04:23.120 | - I think it'd be very hard to replace
01:04:25.200 | a lot of those traditional manufacturing robots
01:04:27.420 | because everything relies on that precision.
01:04:30.040 | If you have a model that can, again,
01:04:31.520 | only get there 99% of the time,
01:04:33.720 | you don't want 1% of your cars
01:04:35.400 | to have the weld in the wrong spot.
01:04:36.960 | Like, that's gonna be a disaster.
01:04:38.600 | And a lot of manufacturing is all about
01:04:41.680 | getting rid of as much sort of variance
01:04:44.800 | and uncertainty as possible.
01:04:46.480 | - And what about the hardware?
01:04:47.480 | A lot of my friends that work in robotics,
01:04:49.200 | one of their big issues,
01:04:50.440 | like sometimes you just have a servo that fails
01:04:52.680 | and then you gotta,
01:04:53.520 | and it takes a bunch of time to like fix that.
01:04:55.800 | Is that holding back things or is the software still?
01:04:58.580 | Anyway, not by right. - I think both.
01:04:59.700 | I think there's been a lot more progress
01:05:01.580 | in the software in the last few years.
01:05:02.860 | And I think a lot of the humanoid robot companies
01:05:05.060 | now are really trying to build amazing hardware.
01:05:07.020 | Hardware is just so hard.
01:05:08.900 | It's something where- - Classic.
01:05:10.660 | - You know, you build your first robot and it works,
01:05:12.940 | you're like, great.
01:05:13.860 | Then you build 10 of them, five of them work,
01:05:16.300 | three of them work half the time, two of them don't work,
01:05:18.460 | and you built them all the same and you don't know why.
01:05:20.340 | And it's just like the real world
01:05:22.540 | has like this level of detail and differences
01:05:25.020 | that software doesn't have.
01:05:27.160 | Like imagine if every for loop you wrote,
01:05:29.360 | some of them just didn't work.
01:05:30.520 | Some of them were slower than others.
01:05:31.960 | Like how do you deal with that?
01:05:34.200 | Like imagine if every binary that you shipped to a customer,
01:05:36.880 | each of those for loops was a little bit different,
01:05:38.820 | was a little different.
01:05:39.880 | It becomes just so hard to scale
01:05:41.940 | and sort of maintain quality of these things.
01:05:44.360 | And I think that's like,
01:05:45.760 | that's what makes hardware really hard
01:05:47.140 | is not building one of something,
01:05:48.320 | but repeatedly building something
01:05:50.200 | and making it work reliably.
01:05:52.020 | Where again, like you'll buy a batch of a hundred motors
01:05:55.220 | and each of those motors will behave
01:05:57.340 | a little bit differently to the same input command.
01:05:59.620 | - This is your lived experience at Cobalt.
01:06:01.340 | - And robotics is all about
01:06:03.380 | how do you build something that's robust
01:06:05.260 | despite these differences?
01:06:06.380 | - We can't get the tolerance of motors down to-
01:06:08.380 | - It's just everything.
01:06:09.220 | You know, you'll have-
01:06:10.060 | (laughing)
01:06:11.540 | - It's actually everything.
01:06:13.140 | No, I mean, one of-
01:06:14.700 | - One of my horror stories was that at Cobalt,
01:06:16.860 | this was many years ago,
01:06:17.780 | we had a thermal camera on the robot
01:06:20.900 | that had a USB connection to the computer inside,
01:06:23.160 | which is, first of all, is a big mistake.
01:06:24.400 | You're not supposed to use a USB.
01:06:25.980 | It is not a reliable protocol.
01:06:27.760 | It's designed that if there's mistakes,
01:06:29.720 | the user can just unplug it and plug it back in.
01:06:31.640 | - I see.
01:06:32.480 | - And so typically things that are USB,
01:06:35.000 | they're not designed to the same level
01:06:37.040 | of like very high reliability you need.
01:06:39.480 | Again, because they assume someone will just unplug it
01:06:41.600 | and replug it back in.
01:06:42.440 | - You just say someone sometime.
01:06:44.240 | - I heard this too and I didn't listen to it.
01:06:45.760 | I really wish I had before.
01:06:47.080 | Anyway, at a certain point,
01:06:48.480 | a bunch of these thermal cameras started failing
01:06:50.600 | and we couldn't figure out why.
01:06:52.480 | And I asked everyone on the team,
01:06:53.640 | like, "Hey, what's changed?
01:06:54.960 | Did the software change around this node?
01:06:56.520 | Did the hardware design change around this node?"
01:06:58.640 | And I was investigating all this stuff,
01:07:00.920 | looking at kernel logs of what's happening with this thing.
01:07:05.680 | And finally, the procurement person was like,
01:07:08.280 | "Oh yeah, well,
01:07:09.120 | I found this new vendor for USB cables last summer."
01:07:12.120 | And I'm like, "What?
01:07:13.200 | You switched which vendor we're buying USB cables from?"
01:07:16.080 | And they're like, "Yeah, it's the same exact cable.
01:07:18.040 | It's just a dollar cheaper."
01:07:19.440 | And it turns out this was the problem.
01:07:20.840 | This new cable had slightly worse resistance
01:07:25.320 | or slightly worse EMI interference.
01:07:27.800 | And it worked most of the time,
01:07:30.160 | but 1% of the time these cameras would fail
01:07:32.680 | and we'd need to reboot a big part of the system.
01:07:35.440 | And it was all just 'cause the same exact spec,
01:07:38.080 | these two different USB cables, like slightly different.
01:07:41.160 | And so these are the kind of things
01:07:42.360 | you deal with with hardware.
01:07:43.560 | - For listeners, we had a episode
01:07:45.440 | with Josh Albrecht of Imbue,
01:07:46.640 | where they talked about buying tens of thousands of GPUs
01:07:49.560 | and just some of them will just not do math.
01:07:51.680 | - Yeah, yeah, it's the same thing.
01:07:53.840 | - You run some tests to find the bad batch
01:07:57.720 | and then you return it to sender
01:07:58.840 | 'cause they just, GPUs won't do math, right?
01:08:00.960 | - Yeah, yeah, this is the thing.
01:08:02.560 | Just the real world has this level of detail.
01:08:05.880 | There's Eric Jang, he did AI at Google.
01:08:09.320 | - Yeah, 1X.
01:08:10.160 | - Yeah, and then joined 1X.
01:08:11.680 | I see him post on Twitter occasionally
01:08:14.000 | of complaints about hardware and supply chain.
01:08:17.040 | And we know each other and we joke occasionally
01:08:19.360 | that we've switched.
01:08:20.200 | I went from robotics into AI
01:08:21.720 | and he went from AI into robotics, and yeah.
01:08:24.840 | - Look, very, very promising.
01:08:25.920 | The time of the real world is unlimited, right?
01:08:28.720 | But just also a lot harder.
01:08:30.240 | And yeah, I do think,
01:08:31.520 | something I also tell people about
01:08:33.320 | for why working software agents
01:08:35.480 | is they're infinitely clonable.
01:08:37.200 | And they always work the same way,
01:08:38.640 | mostly unless you're using Python.
01:08:43.480 | And yeah, I mean, this is like the whole thesis.
01:08:47.320 | I'm also interested like in,
01:08:48.400 | you dropped a little bit of alpha there.
01:08:50.800 | I don't wanna make sure we don't lose it.
01:08:52.920 | Like you're just kind of skeptical about self-driving
01:08:55.560 | as a business.
01:08:56.520 | So I wanna like double click on this a little bit
01:08:59.200 | because I mean, I think that that shouldn't be taken away.
01:09:01.960 | We do have some public Waymo numbers.
01:09:03.760 | Waymo is pretty public with like their stats.
01:09:07.640 | They're exceeding 100,000 Waymo trips a week.
01:09:09.840 | If you assume like a $25 ride average,
01:09:11.840 | that's $130 million revenue run rate.
01:09:14.320 | At some point they will recoup their investment, right?
01:09:15.920 | Like what are we talking about here?
01:09:17.040 | Like why this skepticism?
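As a quick sanity check on those figures: 100,000 trips a week at a $25 average ride is $2.5 million a week, or roughly $130 million a year, which matches the run rate quoted above.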
01:09:19.440 | - I think, and again, I'm not an expert.
01:09:21.000 | I don't know their financials.
01:09:22.920 | I would say the thing I'm worried about
01:09:24.040 | is like compared to an Uber,
01:09:25.960 | like I don't know how much an Uber driver takes home a year,
01:09:28.360 | but like call that the revenue
01:09:30.400 | that a Waymo is gonna be making in that same year.
01:09:33.240 | Those cars are expensive.
01:09:34.520 | It's not about if you can hit profitability,
01:09:36.680 | it's about your cash conversion cycles.
01:09:38.600 | Like is building one Waymo,
01:09:40.600 | like how cheap can you make that
01:09:42.340 | compared to like how much you're earning
01:09:45.040 | sort of as the equivalent
01:09:46.200 | of what an Uber driver would take home?
01:09:47.600 | 'Cause remember, an Uber driver,
01:09:49.060 | you're not getting that whole revenue.
01:09:50.280 | You think about for the Uber driver,
01:09:51.720 | like the cost of the car, the depreciation of the car.
01:09:54.480 | I'm not convinced how much profit
01:09:57.720 | Waymo can actually make per car.
01:09:59.640 | That's, I think, my skepticism.
01:10:00.800 | - Well, they need to depreciate the cost to run a Waymo,
01:10:04.080 | because the car is like 100, 110 grand,
01:10:06.560 | something like that. - Yes, exactly.
01:10:07.400 | - Plus the LiDAR. - That's many years of,
01:10:10.000 | yeah, yeah, yeah, exactly, exactly.
01:10:11.800 | - Anything else?
01:10:12.640 | Parting thoughts?
01:10:13.640 | Call to action?
01:10:14.800 | Rants?
01:10:15.800 | The floor is yours.
01:10:17.360 | - I'm very excited to see a lot more LLM agents
01:10:20.240 | out there in the world doing things.
01:10:21.680 | And I think that like,
01:10:23.000 | I think there'll be the biggest limiting thing
01:10:25.320 | will start to become like,
01:10:26.880 | do people trust the output of these agents?
01:10:28.880 | And like, how do you trust the output of an agent
01:10:31.240 | that did five hours of work for you
01:10:32.680 | and is coming back with something?
01:10:34.400 | And if you can't find some way to trust that agent's work,
01:10:37.880 | it kind of wasn't valuable at all.
01:10:39.320 | So I think that's gonna be a really important thing
01:10:41.000 | is not just doing the work,
01:10:43.040 | but doing the work in a trustable, auditable way
01:10:45.560 | where you can also explain to the human,
01:10:47.560 | hey, here's exactly how this works and why,
01:10:49.680 | and how I came to it.
01:10:51.000 | I think that's gonna be really important.
01:10:52.720 | - Thank you so much.
01:10:53.560 | - Thank you. - Yeah, thanks.
01:10:54.440 | This was great.
01:10:55.440 | (upbeat music)
01:10:58.040 | (upbeat music)
01:11:00.640 | (upbeat music)
01:11:03.240 | (upbeat music)