Language Agents: From Reasoning to Acting — with Shunyu Yao of OpenAI, Harrison Chase of LangGraph

00:00:00.000 | Hi, everyone.

00:00:04.000 | Welcome to the Leyden Space podcast.

00:00:05.880 | This is Alessio, partner and CTO in residence

00:00:08.100 | at Decibel Partners.

00:00:09.040 | And I'm joined by my co-host Swix, founder of SmallAI.

00:00:12.000 | Hey, and today we have a super special episode.

00:00:14.440 | I actually always wanted to take a selfie and go like,

00:00:16.800 | you know, POV, you're about to revolutionize

00:00:19.440 | the world of agents, because we have two of the most awesome

00:00:23.200 | high-end agents in the house.

00:00:25.000 | So first, we're going to welcome back Harrison Chase.

00:00:27.280 | Welcome.

00:00:28.120 | Excited to be here.

00:00:29.800 | What's new with you recently in sort of like the 10,

00:00:32.320 | 20 second recap?

00:00:34.120 | Langchain, Langsmith, Langgraf, pushing on all of them.

00:00:37.440 | Lots of cool stuff related to a lot of the stuff

00:00:40.160 | that we're going to talk about today, probably.

00:00:41.880 | Yeah, we'll mention it in there.

00:00:43.120 | And the Celtics won the title.

00:00:44.520 | And the Celtics won the title.

00:00:45.760 | You got that going on for you.

00:00:47.040 | I don't know.

00:00:47.480 | Is that like floorball, handball, baseball?

00:00:51.360 | Basketball.

00:00:52.120 | Basketball, basketball.

00:00:53.400 | Patriots aren't looking good, though, so that's--

00:00:56.040 | And then Shunyu, you've also been on the pod,

00:00:58.160 | but only in like a sort of oral paper presentation capacity.

00:01:01.320 | But welcome officially to the Linux-based pod.

00:01:03.800 | Yeah, I've been a huge fan.

00:01:04.880 | So thanks for the invitation.

00:01:06.560 | Thanks.

00:01:07.400 | Well, it's an honor to have you on.

00:01:08.920 | You're one of like-- you're maybe the first PhD thesis defense

00:01:12.760 | I've ever watched in like this AI world,

00:01:16.400 | because most people just publish single papers.

00:01:18.600 | But every paper of yours is a banger.

00:01:20.800 | So congrats.

00:01:21.680 | Thanks.

00:01:22.680 | Yeah, maybe we'll just kick it off with, you know,

00:01:25.960 | what was your journey into using language models for agents?

00:01:28.520 | I like that your thesis advisor, I didn't catch his name,

00:01:31.560 | but he was like, you know--

00:01:32.400 | Karthik.

00:01:33.040 | Yeah, it's like this guy just wanted to use language models,

00:01:35.800 | and it was such a controversial pick at the time.

00:01:38.320 | Right.

00:01:38.800 | The full story is that in undergrad,

00:01:40.760 | I did some computer vision research,

00:01:42.240 | and that's how I got into AI.

00:01:44.200 | But at the time, I feel like, you know,

00:01:46.640 | you're just composing all the GAN or 3D perception

00:01:49.920 | or whatever together, and it's not exciting anymore.

00:01:53.040 | And one day, I just see this transformer paper,

00:01:56.440 | and that's really cool.

00:01:58.000 | But I really got into language model

00:02:00.480 | only when I entered my PhD and met my advisor Karthik.

00:02:04.640 | So he was actually the second author of GPT-1

00:02:07.280 | when he was like a visiting scientist at OpenAI.

00:02:09.360 | With Alec Redford?

00:02:10.320 | Yes.

00:02:10.720 | Wow.

00:02:11.280 | That's what he told me.

00:02:12.320 | It's like, back in OpenAI,

00:02:14.320 | they did this GPT-1 together,

00:02:15.680 | and Ilya just said, Karthik, you should stay,

00:02:19.040 | because we just solved the language.

00:02:20.440 | But apparently, Karthik is not fully convinced.

00:02:22.880 | So he went to Princeton, started his professorship,

00:02:25.760 | and I'm really grateful.

00:02:27.440 | So he accepted me as a student,

00:02:29.440 | even though I have no prior knowledge in NLP.

00:02:32.120 | And, you know, we just met for the first time,

00:02:34.240 | and he's like, you know, what do you want to do?

00:02:37.000 | And I'm like, you know,

00:02:38.200 | you have done those test game scenes.

00:02:40.160 | That's really cool.

00:02:40.960 | I wonder if we can just redo them with language models.

00:02:43.880 | And that's how the whole journey began.

00:02:45.760 | Awesome.

00:02:46.040 | And that was GPT-2, was that at the time?

00:02:47.760 | Yes, that was 2019.

00:02:49.240 | Yeah.

00:02:49.800 | Way too dangerous to release.

00:02:51.640 | Yeah.

00:02:52.040 | And then I guess the first work of yours that I came across was React.

00:02:56.600 | Sure.

00:02:57.080 | Which was a big part of your defense.

00:02:59.000 | But also, Harrison, when you came on the podcast last year,

00:03:01.240 | you said that was one of the first papers that you saw

00:03:03.720 | when you were getting inspired for Langchain.

00:03:05.160 | So maybe give a recap of why you thought it was cool,

00:03:08.120 | because you were already working in AI and machine learning.

00:03:10.840 | And then, yeah, you can kind of like enter the paper formally.

00:03:14.360 | But what was that interesting to you specifically?

00:03:16.360 | Yeah, I mean, I think the interesting part was using these language models to

00:03:20.360 | interact with the outside world in some form.

00:03:22.840 | And I think in the paper, you mostly deal with Wikipedia, and I think there's some other

00:03:27.320 | data sets as well, but the outside world is the outside world.

00:03:30.360 | And so interacting with things that weren't present in the LLM and APIs and calling into

00:03:34.680 | them and thinking about, and yeah, the React reasoning and acting and kind of like combining

00:03:40.120 | those together and getting better results.

00:03:42.360 | I had been playing around with LLMs, been talking with people who were playing around with LLMs.

00:03:46.200 | People were trying to get LLMs to call into APIs, do things.

00:03:48.760 | And it was always, how can they do it more reliably and better?

00:03:51.640 | And so this paper was basically a step in that direction.

00:03:54.520 | And I think really interesting and also really general as well.

00:03:58.200 | Like, I think that's part of the appeal is just how general and simple in a good way,

00:04:03.320 | I think the idea was.

00:04:04.920 | So that it was really appealing for all those reasons.

00:04:07.320 | Simple is always good.

00:04:08.200 | Yeah.

00:04:08.520 | Do you have a favorite part?

00:04:10.920 | Because I have one favorite part from your PhD defense,

00:04:13.560 | which I didn't understand when I read the paper.

00:04:16.040 | But you said something along the lines, React doesn't change the outside or the environment,

00:04:20.840 | but it does change the inside through the context, putting more things in the context.

00:04:24.360 | You're not actually changing any of the tools around you to work for you,

00:04:27.800 | but you're changing how the model thinks.

00:04:30.840 | And I think that was like a very profound thing when I,

00:04:33.320 | not that I've been using these tools for like 18 months, I'm like,

00:04:35.960 | I understand what you meant.

00:04:36.840 | But like to say that at the time you did the PhD defense was not trivial.

00:04:40.600 | Yeah.

00:04:41.080 | Another way to put it is like thinking can be an extra tool that's useful.

00:04:45.880 | It makes sense.

00:04:47.960 | It checks out.

00:04:48.840 | Who would have thought?

00:04:49.880 | I think it's also more controversial within his world because everyone was trying to use RL for

00:04:56.040 | agents.

00:04:56.440 | Right.

00:04:56.840 | And this is like the first kind of zero gradient type approach.

00:05:00.360 | Yeah.

00:05:01.080 | I think the bigger kind of historical context is that we have these two big branches of AI, right?

00:05:07.640 | So if you think about RL, right, that's pretty much the equivalent of agent at a time.

00:05:13.080 | And it's like agent is equivalent to reinforcement learning and reinforcement learning is equivalent to

00:05:17.640 | whatever game environment they're using, right?

00:05:19.880 | Atari game or go or whatever.

00:05:22.440 | So you have like a pretty much, you know, you have a biased kind of like set of methodologies in terms

00:05:28.600 | of reinforcement learning and represents agents.

00:05:31.400 | On the other hand, I think NLP is like a historical kind of subject.

00:05:36.200 | It's not really into agents, right?

00:05:38.360 | It's more about reasoning.

00:05:40.200 | It's more about solving those concrete tasks.

00:05:43.160 | And if you look at S.A.L., right, like each task has its own track, right?

00:05:46.680 | Summarization has a track.

00:05:48.120 | Question answering has a track.

00:05:49.320 | So I think really it's about rethinking agents in terms of what could be the new environments

00:05:56.920 | that we can interact with.

00:05:57.880 | It's not just Atari games or whatever, video games, but also those text games or language games.

00:06:02.840 | And also thinking about, could there be like a more general kind of methodology beyond just

00:06:07.960 | designing specific pipelines for each NLP task?

00:06:11.240 | That's like the bigger kind of context, I would say.

00:06:13.800 | Is there an inspiration spark moment that you remember?

00:06:17.320 | Or how did you come to this?

00:06:19.080 | We had Trida on the podcast and you mentioned he was really inspired working with like systems

00:06:23.160 | people to think about flash attention.

00:06:25.000 | What was your inspiration?

00:06:26.600 | Yeah, so actually before React, I spent the first two years of my PhD

00:06:31.160 | focusing on text-based games or in other words, text adventure games.

00:06:36.200 | It's a very kind of small kind of research area and quite ad hoc, I would say.

00:06:41.320 | And there are like, I don't know, like 10 people working on that at the time.

00:06:45.560 | And have you guys heard of Zork 1, for example?

00:06:49.880 | So basically the idea is you have this game and they have text observations.

00:06:53.720 | Like you see a monster, you see a dragon.

00:06:57.000 | You're eaten by a group.

00:06:58.360 | Yeah, you're eaten by a group.

00:06:59.720 | And you have actions like kill the group with a sword or whatever, right?

00:07:04.120 | And that's like a very typical setup of a text game.

00:07:07.080 | So I think one day after, you know, I've seen all the GPT-3 stuff.

00:07:11.400 | I just think, think about, you know, how can I solve the game?

00:07:15.320 | Like why those AI, you know, machine learning methods are pretty stupid, but we are pretty

00:07:20.600 | good at solving the game relatively, right?

00:07:22.360 | So for the context, the predominant method to solve this text game is obviously reinforcement

00:07:27.960 | learning.

00:07:28.280 | And the idea is you just try out an arrow in those games for like millions of steps and you

00:07:33.720 | kind of just overfeed to the game.

00:07:35.400 | But there's no language understanding at all.

00:07:37.320 | And I'm like, why can I solve the game better?

00:07:40.360 | And it's kind of like, because we think about the game, right?

00:07:44.920 | Like when we see this very complex text observation, like you see a guru and you might see a sword,

00:07:51.720 | you know, in the right of the room and you have to go through the wooden door to go to that room.

00:07:56.840 | You will think, you know, oh, I have to kill the monster and to kill that monster, I have to get

00:08:00.280 | the sword, I get to get the sword, I have to go, right?

00:08:02.760 | And this kind of thinking actually helped us kind of thorough shots of the game.

00:08:06.520 | And it's like, why don't we also enable the text agents to think?

00:08:10.520 | And that's kind of the prototype of React.

00:08:13.000 | And I think that's actually very interesting because the prototype, I think, was around

00:08:18.200 | November of 2021.

00:08:20.680 | So that's even before like chain of thought or whatever came up.

00:08:23.480 | So we did a bunch of experiments in the text game, but it was not really working that well.

00:08:29.160 | Like those texting are just too hard.

00:08:30.360 | I think today it's still very hard.

00:08:31.640 | Like if you use GPD-4 to solve it, it's still very hard.

00:08:34.280 | So the change came when I started the internship in Google and apparently Google care less about

00:08:41.320 | text game, they care more about what's more practical.

00:08:44.360 | So pretty much I just reapply the idea, but to more practical kind of environments like Wikipedia or

00:08:50.120 | or like simpler text games like R4 and it just worked.

00:08:55.720 | It's kind of like you first have the idea and then you try to find the domains and the problems

00:09:01.000 | to demonstrate the idea, which is, I would say, different from most of the AI research.

00:09:06.280 | But it kind of worked out for me in that case.

00:09:08.600 | Yeah.

00:09:09.320 | For Harrison, when you were implementing React, what were people applying React to in the early days?

00:09:13.880 | I think the first demo we did probably had like a calculator tool and a search tool.

00:09:17.800 | So like general things, we tried to make it pretty easy to write your own tools and plug in your own

00:09:22.840 | things. And so this is one of the things that we've seen in LinkedIn is people who build their own

00:09:27.400 | applications generally write their own tools. Like there are a few common ones. I'd say like the

00:09:31.720 | three common ones might be like a browser, a search tool and a code interpreter. But then other than that-

00:09:36.840 | The LMOs.

00:09:37.880 | Yep. Yeah, exactly. It matches up very nice with that.

00:09:40.680 | And we just, we actually just redid like our integrations docs page. And if you go to the tools

00:09:45.080 | section, we like highlight those three. And then there's a bunch of like other ones.

00:09:47.800 | Yeah.

00:09:48.280 | And there's such a long tail of other ones, but in practice, like when people go to production,

00:09:52.200 | they generally have their own tools or maybe one of those three, maybe some other ones, but like

00:09:56.200 | very, very few other ones. So yeah, I think the first demo was, was a search and the calculator one.

00:10:01.880 | And there's, what's the data set?

00:10:03.320 | Hot Pog QA.

00:10:04.920 | Yeah. Oh, so there's that one. And then there's like the celebrity one

00:10:07.640 | by the same author, I think.

00:10:09.240 | Olivia Wilde's boyfriend squared.

00:10:11.800 | Yeah.

00:10:12.040 | To the power of 0.23.

00:10:14.120 | Yeah.

00:10:14.840 | Right, right, right.

00:10:15.640 | There's, I'm forgetting the name of the author,

00:10:17.320 | I was like, we're going to over optimize for Olivia Wilde's boyfriend and it's going to

00:10:20.040 | change next year or something.

00:10:21.640 | There's a few data sets kind of like in that vein that require multi-step kind of like reasoning

00:10:26.120 | and thinking. So one of the questions I actually had for you in this vein, like the React paper,

00:10:31.000 | there's a thing, I think there's a few things in there, or at least when I think of that,

00:10:33.320 | there's a few things that I think of. There's kind of like the specific prompting strategy.

00:10:37.000 | Then there's like this general idea of kind of like thinking and then taking an action. And then

00:10:41.800 | there's just even more general idea of just like taking actions in a loop. Today, like obviously language models have

00:10:47.080 | changed a lot. We have tool calling, the specific prompting strategy probably isn't used super

00:10:51.400 | heavily anymore. Would you say that like the concept of React is still used though? Or like, do you think

00:10:58.200 | that tool calling and running tool calling in a loop, is that React in your mind?

00:11:02.920 | I would say like it's like more implicitly used than explicitly used. To be fair, I think the contribution

00:11:10.600 | of React is actually twofold. So first is this idea of, you know, we should be able to use calls in a very

00:11:17.480 | general way. Like there should be a single kind of general method to handle interaction with various

00:11:24.520 | environments. I think React is the first paper to demonstrate the idea. But then I think later, there are

00:11:30.600 | two form or whatever, and this become like a trivial idea. But I think at the time, that's like a pretty

00:11:36.200 | long trivial thing. And I think the second contribution is this idea of what people call

00:11:42.040 | like inner monologue or thinking or reasoning or whatever to be paired with tool use. I think that's

00:11:48.280 | still not trivial because if you look at the default function calling or whatever, like there's no inner

00:11:53.240 | monologue. And in practice, that actually is important, especially if the tool that you use is pretty different

00:12:00.360 | from the training distribution of the language model. So I think that's like, those are the two

00:12:06.360 | like main things that are kind of inherited.

00:12:09.800 | Yeah. On that note, I think OpenAI even recommended like when you're doing tool calling,

00:12:14.200 | it's sometimes helpful to put like a thought field in the tool along with all the actual acquired

00:12:18.600 | arguments and then have that one first. So it fills out that first and then, and that's, they've shown

00:12:22.680 | that that's yielded to kind of like better results. The reason I ask is just like the same concept is still

00:12:27.480 | alive and I don't know whether to call it like a React agent or not. Like I don't know what to call

00:12:31.640 | it. Like I think of it as React, like it's the same ideas that were in the paper, but it's obviously a

00:12:35.480 | very different implementation at this point in time. And so I just don't know what to call it.

00:12:39.240 | I feel like people will sometimes think more in terms of different tools, right? Because if you think

00:12:46.360 | about a web agent versus, you know, like a function calling agent and calling a Python API, you would

00:12:51.880 | think of them as very different. But in some sense, the methodology is the same. It depends on how you

00:12:57.240 | view them, right? And I think people will tend to think more in terms of the environment and the tools

00:13:03.000 | rather than the methodology. So, or in other words, I think the methodology is kind of trivial and simple.

00:13:08.280 | So people will try to focus more on the different tools, but I think it's good to have like a single

00:13:13.960 | underlying, principle underlying, all of those things. Yeah.

00:13:17.320 | How do you see the surface of React getting molded into the model? So a function calling is a good

00:13:21.880 | example of like, now the model does it. What about the thinking? Now, most models that you use kind of do

00:13:28.040 | chain of thought on their own. They kind of produce steps. Do you think that more and more of this logic will

00:13:32.440 | be in the model? Or do you think like the context window will still be the main driver of like reasoning

00:13:38.440 | and thinking? I think it's already default, right? Like you do some chain of thought and you do some

00:13:44.440 | tool call, like the cost of adding the chain of thought is kind of relatively low compared to other

00:13:50.040 | things. So it's not hurting to do that. And I think it's already kind of common practice, I would say.

00:13:56.200 | Is this a good place to bring in either tree of thought or reflection? Your pick.

00:14:00.920 | Maybe reflection like to respect the time order, I would say.

00:14:04.760 | Yeah. Any backstory as well, like, you know, the people involved with NOAA and like the Princeton

00:14:09.000 | group, I think, you know, we talked about this offline, but people don't understand how these

00:14:12.920 | research pieces come together and this ideation. Yeah. I think reflection is mostly NOAA's work.

00:14:18.200 | Like I'm more like advising kind of role. So the story is, I don't remember the time, but like one day we

00:14:24.200 | just see this preprint that's like reflection and autonomous agent with memory or whatever. And it's

00:14:31.320 | kind of like a extension to React, which uses this self-refaction. I'm like, oh, somehow it become very

00:14:37.640 | popular. Yeah.

00:14:38.040 | And NOAA reached out to me. It's like, do you want to collaborate on this and make this from like a

00:14:43.720 | archive preprint to something more solid, you know, like a conference submission? I'm like, sure.

00:14:48.040 | We start, started collaborating and we remain good friends today. And, uh, I think another interesting

00:14:54.360 | backstory is like NOAA was like, I think contacted by OpenAI at the time. It's like, this is pretty

00:14:59.320 | cool. Do you want to just work at OpenAI? And I think Sierra also reached out at the same time.

00:15:04.200 | It's like, this is pretty cool. Do you want to work at Sierra? And, and I think NOAA chose, uh, Sierra,

00:15:09.880 | but it's pretty cool because he was like, still like a second year undergrad and he's a very smart kid.

00:15:16.360 | Based on one paper? Based on one paper. Yeah.

00:15:18.600 | Oh my God. He's done some other research based on like programming language or chemistry or whatever,

00:15:23.800 | but I think that's the paper that got the attention of OpenAI and Sierra, right?

00:15:27.880 | Okay. For those who haven't gone too deep on it, the way that you present the inside of React,

00:15:32.120 | like, can you do that also for reflection? Yeah. I think one way to think of reflection is that

00:15:37.880 | the traditional idea of reinforcement learning is you have a scalar reward and then you, you somehow

00:15:43.240 | back-propagate the signal of the scalar reward to the rest of your neural network through whatever

00:15:48.040 | algorithm, like policy gradient or A2C or whatever. And if you think about the real life, you know, most

00:15:54.200 | of the reward signal is not scalar. It's like your boss told you, you know, you should have done a better

00:15:59.960 | job in this, but a good job on that or whatever. Right. It's not like a scalar reward, like 29 or

00:16:05.320 | something. I think in general, humans do more, deal more with, you know, long-scale reward, or you can

00:16:10.760 | say language feedback, right? And the way that they deal with language feedback also have this kind of

00:16:15.720 | back-propagation kind of process, right? Because you start from this, you did a good job on job B,

00:16:21.320 | and then you reflect, you know, what could have done different to, to change, to make it better.

00:16:26.520 | And you kind of change your prompt, right? Basically, you change your prompt, how to do job A,

00:16:30.920 | and how to do job B, and then you do the whole thing again. So it's really like a pipeline of

00:16:36.040 | language where, in self-gradient descent, you have something like text reasoning to replace

00:16:41.480 | those gradient descent algorithms. I think that's one way to think of reflection, yeah.

00:16:47.160 | One question I have about reflection is, how general do you think the algorithm

00:16:51.640 | there is? And so for context, I think at Langchain and at other places as well, we found it, like we and

00:16:56.920 | others found it pretty easy to implement, kind of like React in a standard way. You plug in any tools,

00:17:00.840 | and it kind of works off the shelf, you know, can get it up and running. I don't think we have like

00:17:05.720 | an off the shelf kind of like implementation of reflection and kind of like the general sense.

00:17:09.800 | I think the concepts like absolutely we see used in different kind of like specific cognitive

00:17:14.600 | architectures, but I don't think we have one that comes off the shelf. I don't think any of the other

00:17:18.920 | frameworks have one that comes off the shelf. And I'm curious whether that's because it's not general

00:17:24.280 | enough, or it's complex as well, because it also requires running it more times. Maybe that's not feasible.

00:17:29.800 | Like, I'm curious how you think about the generalality complexity. Why? Yeah, should we have

00:17:34.680 | one that comes off the shelf?

00:17:35.880 | I think the algorithm is general in the sense that it's just as general as like other algorithms,

00:17:41.720 | if you think about like positive reading or whatever, but it's not applicable to all tasks,

00:17:45.800 | just like other algorithms, right? So you can argue PPO is also general, but it works better for those set

00:17:51.560 | of tasks, but not on those set of tasks. I think it's the same situation for reflection. And I think a key

00:17:57.160 | bottleneck is the evaluator, right? Basically, you need to have a good sense of the signal.

00:18:02.040 | So for example, like if you're trying to do a very hard reasoning task, say mathematics,

00:18:07.320 | for example, and you don't have any tools, right? It's operating in this chain of thought setup. Then

00:18:12.280 | reflection will be pretty hard because in order to reflect upon your thoughts, you have to have a very

00:18:17.720 | good evaluator to judge whether your thought is good or not. But that might be as hard as solving the

00:18:23.720 | problem itself or even harder. The principle of self-reflection is probably more applicable if you have a good

00:18:28.680 | evaluator, for example, in the case of coding, right? Like if you have those arrows, then you can

00:18:33.400 | just reflect on that and how to solve the bug and stuff. So I think another criteria is that it

00:18:40.840 | depends on the application, right? Like if you have this latency or whatever need for like an actual

00:18:46.360 | application with an end user, right? And the user wouldn't let you, you know, do like two hours of three

00:18:51.240 | of thought or reflection, right? You need something as soon as possible. So in that case, maybe this is

00:18:56.520 | better to be used as like a training time technique, right? You do those reflection or tree of thought

00:19:01.800 | or whatever, you get a lot of data and then you try to use the data to train your model better. And then

00:19:06.600 | in test time, you still use something as simple as React, but that's already improved.

00:19:10.600 | And if you think of the Voyager paper as like a way to store skills and then reuse them, like how would you

00:19:16.440 | compare like this like reflective memory and like at what point it's just like ragging on the memory

00:19:22.520 | versus like you want to start to fine tune some of them or like what's the next step once you get a very

00:19:26.840 | long kind of like a reflective corpus? Yeah. So I think there are two questions here. The first question is

00:19:33.080 | what type of information or memory are you considering, right? Is it like somatic memory that stores, you know,

00:19:40.360 | knowledge about the word or is it the episodic memory that stores, you know, trajectories or behaviors or is it

00:19:46.680 | like more of a procedural memory? Like in Voyager's case, like skills or code snippets that you can use

00:19:52.680 | to do actions, right? That's that's one dimension. And the second dimension is obviously how you use the

00:19:58.120 | memory, either retrieving from it, using it in the context or or fine tuning it. I think the cognitive

00:20:05.080 | architecture for language agents paper have a good kind of categorization of all the different combinations.

00:20:10.600 | And of course, what which way you use depends on the concrete application and the concrete need and the concrete

00:20:17.240 | task. But I think in general, it's good to think of those like systematic dimensions and all the possible like

00:20:23.240 | options there. Yeah.

00:20:25.240 | Harrison also has in LangMem. I think you did a presentation in my meetup and I think you've done

00:20:30.680 | it at a couple other venues as well. User state, semantic memory and append only state. I think kind

00:20:36.040 | of maps to what you just said. What is LangMem? Can I give it like a quick...

00:20:39.720 | One of the modules of LinkedIn for a long time has been something around memory. And I think like,

00:20:43.720 | you know, we're still obviously figuring out what that means as is everyone kind of in the space. But

00:20:48.440 | one of the experiments that we did and one of the proof of concepts that we did was,

00:20:51.880 | technically what it was is you would basically create threads. You'd push messages to those

00:20:56.840 | threads in the background. We process the data in a few ways. One, we like put it into some semantic

00:21:01.880 | store. That's the semantic memory. And then two, we do some like extraction and reasoning over the

00:21:07.240 | memories to extract. And we let the user define this, but like extract key facts or anything that's

00:21:12.600 | of interest to the user. Those aren't exactly trajectories. They're maybe more closer to the,

00:21:17.640 | to the procedural memory. Is that how you'd think about it or classify it? Or is it like about like

00:21:23.080 | knowledge about the word or is it more like how to do something? It's reflections basically. So in

00:21:29.480 | generative worlds, generative agents, generative smallville. Yeah. The smallville one. So the way

00:21:33.800 | that they had their memory there was they had the sequence of events and that's kind of like the raw

00:21:37.880 | events that happened. But then every, and events did like run some synthesis over those events

00:21:43.080 | to, for the, for the LLM to insert its own memory basically. And it's, it's that type of memory. I

00:21:48.760 | don't know how that would be classified. I think of that more of the semantic memory, but to be fair,

00:21:53.080 | I think it's just one way to think of that. But whether it's semantic memory or procedure memory or whatever

00:21:58.360 | memory, that's like, like abstraction layer. But in terms of implementation, you can choose whatever

00:22:03.880 | implementation for whatever memory. So they're totally kind of orthogonal. I think it's more of a good way to

00:22:09.160 | think of the things because like from the history of cognitive science and you know, cognitive

00:22:14.120 | architecture and how people study even neuroscience, right? That's the way people think of how human brain

00:22:19.720 | organizes memory. And I think it's more useful as a way to think of things, but it's not like for semantic

00:22:25.480 | memory, you have to do this kind of like way to retrieve or fine tune. And for precision memory, you have to do that.

00:22:30.920 | Like, I think those are totally orthogonal kind of dimensions.

00:22:34.520 | How much background do you have in kind of like cognitive sciences and how much do you model some

00:22:38.440 | of your thoughts on?

00:22:39.480 | That's a great question actually. And, uh, I think one of the undergrad kind of influence for my like

00:22:47.400 | follow-up research is I was doing like an internship at MIT's computational cognitive science lab with like,

00:22:53.960 | you know, Josh Tanabam and he's like a very famous cognitive scientist. And I think a lot of,

00:22:59.880 | a lot of his ideas still influence me today. Like, uh, think of sensing like computational terms and

00:23:06.200 | get interested in language and a lot of stuff, you know, or even like developer psychology kind of stuff,

00:23:11.720 | I think still influencing me today.

00:23:12.920 | As a developer that tried out Langmem, the way I view it is just, it's a materialized view of a stream

00:23:19.400 | of logs. And if anything, that's just useful for context compression. I don't have to use the full

00:23:23.560 | context to run it over everything, but also it's kind of debuggable. Like if it's wrong, I can show it

00:23:27.880 | to the user, user can manually fix it and I can carry on.

00:23:30.120 | That's a really good analogy. Yeah, I like that. I'm going to steal that.

00:23:33.080 | Please, please. You know I'm bullish on memory databases. Um, I guess, Tree of Thoughts?

00:23:37.720 | Um, yeah, Tree of Thoughts. I mean, you had a...

00:23:40.200 | I feel like I'm relieving the defense again in like a podcast format.

00:23:44.680 | Yeah, no, I mean, it was a, you had a banger. Well, this is the one where you're already successful

00:23:49.000 | and would just like, you know, highlight the glory. It was really good. You mentioned that since thinking

00:23:54.440 | is kind of like taking an action, you can use like action searching algorithms to think of thinking.

00:23:59.320 | So just like you will use Tree Search to like find the next thing. And the idea behind Tree of Thoughts

00:24:04.280 | is like you generate all these possible outcomes and then find the best tree to get to the end.

00:24:09.240 | Maybe back to the latency question. You can't really do that if you have to respond in real time. So

00:24:13.720 | what are maybe some of the most helpful use cases for things like this? Where have you seen people

00:24:17.640 | adopt it where the high latency is actually worth the weight?

00:24:20.840 | For things that you don't care about latency, obviously, for example, if you're trying to do

00:24:25.480 | math, right? If you're just trying to come up with a proof. But I feel like one type of task is

00:24:29.880 | more about searching for a solution, right? You can try a hundred times, but if you find one solution,

00:24:35.480 | that's good. Like for example, if you're finding a math proof or if you're finding a good code to solve

00:24:39.880 | a problem or whatever. And I think another type of task is more like reacting, right? For example,

00:24:44.760 | if you're doing customer service, you're like a web agent booking a ticket for like a end user, right?

00:24:49.960 | Those are more like kind of reactive kind of tasks, right? You have to, or more real-time tasks,

00:24:54.040 | right? You have to do things fast. They might be easy, but you have to do it reliably. And you care more

00:24:59.080 | about like, can you solve 99% of the time out of a hundred, but for the type of search type of tasks,

00:25:05.640 | then you care more about, you know, can I find one solution out of a hundred? So it's kind of symmetric and different.

00:25:11.080 | Do you have any data or like intuition from your user base? Like what's the split of

00:25:16.360 | these type of use cases? Like how many people are doing more reactive things and how many people are

00:25:20.280 | experimenting with like kind of deep, long search?

00:25:22.680 | I would say like reacts probably like the most popular. I think there's aspects of reflection

00:25:27.800 | that get used tree of thought probably like the least. So there's a great tweet from Jason way.

00:25:34.120 | I think you're now colleague, and he was talking about like prompting strategies and how he thinks

00:25:38.520 | about them. And I think like the four things that he kind of had was like, one, how easy is it to

00:25:43.960 | implement? How much compute does it take? How many tasks does it kind of like solve? And how much does

00:25:49.240 | it improve on those tasks? And, and I'd add a fifth, which is like, how likely is it to be relevant with

00:25:53.720 | when the next generation of models kind of come out? And I think if you look at kind of like those axes,

00:25:58.040 | and then you look at like, uh, you know, react reflection, tree of thought, it tracks that the

00:26:04.040 | ones that score better are used more like react is pretty easy to implement tree of thoughts, pretty

00:26:08.680 | hard to implement. Um, right. Like the amount of compute. Yeah. A lot more for tree of thought,

00:26:14.360 | the tasks and how much it improves. I don't have amazing visibility there, but I think like,

00:26:19.000 | if we're comparing react versus tree of thought, react just dominates the first two axes so much

00:26:24.040 | that my question around that was going to be like, how do you think about like these prompting

00:26:28.120 | strategies, cognitive architectures, whatever you want to call them when you're, when you're thinking

00:26:31.400 | of them, what are the axes that you're judging them on in your head when you're thinking whether

00:26:36.200 | it's a good one or, or a less good one or. Right. Right. I think there is a difference between

00:26:41.800 | like a prompting method versus like a research in the sense that like for research, you don't really

00:26:48.360 | even care about does it actually work on practical tasks or does it help whatever. I think it's more

00:26:55.560 | about like the idea or the, the principle, right? Like what is the direction that you're like unblocking

00:27:02.040 | and whatever. And I think for the, for like an actual prompting method to solve like a concrete

00:27:07.480 | problem, I would say like simplicity is very important because the simpler it is, the less

00:27:13.720 | decision you have to make about it. And it's easier to design, it's easier to propagate and it's easier

00:27:18.120 | to, to do stuff. So always try to be as simple as possible. And I think latency obviously is important.

00:27:25.000 | Like if you can do things fast and you don't want to do things slow. And I think in terms of the actual

00:27:30.200 | prompting method to use for a particular problem, I'm a, I think we should all be in the minimum list

00:27:36.200 | kind of camp, right? You should try the minimum thing and see if it works and if it doesn't work and

00:27:41.720 | there's absolute reason to add something, then you add something, right? Like there's an absolute reason

00:27:46.280 | that you need some tool, then you should add the tool thing. If there's absolute reason to add

00:27:51.640 | reflection or whatever, you should add that. Otherwise, if chain of thought can already solve

00:27:54.760 | something, then you don't even need to use any of that. Yeah. Or if just better prompting can solve

00:27:58.600 | it. Like, you know, you could add a reflection step or you could make your instructions a little bit

00:28:02.200 | clearer and it's a lot easier to do that. I think another interesting thing is like, I personally have

00:28:07.160 | never done those kind of like weird tricks. I think all the prompts that I write are kind of like just

00:28:12.120 | talking to a human, right? It's like, I don't know, like, like I never say something like,

00:28:15.800 | your grandma is dying and you have to solve it too. I mean, those are cool, but I feel like

00:28:20.920 | we should all try to solve this in like a very intuitive way. Like, just like talking to your

00:28:25.400 | co-worker and that, that should work 99% of the time. That's my personal take. Yeah.

00:28:29.720 | The problem with how language models, at least in these sort of GPC3 era, was that they're,

00:28:35.640 | they over optimized to some sets of tokens in sequence. So like what reading the Kojima et al. paper,

00:28:42.200 | that was listening step by step, like he tried a bunch of them and they had wildly different results.

00:28:47.480 | It should not be the case, but it is the case. And hopefully we're getting better there.

00:28:50.760 | Yeah. I think it's also like a timing thing in the sense that if you think about this whole line of

00:28:55.960 | language model, right? Like at the time it was just like a text generator. We don't have any idea how it's

00:29:01.240 | going to be used. Right. And obviously at the time you will find all kinds of weird issues because

00:29:06.920 | it's not trained to do any of that. Right. But then I think we have this loop where once we realize

00:29:12.520 | chain of thought is important or agent is important or tool using is important. What we see is today's

00:29:17.240 | language models are heavily optimized towards those things. So I think in some sense they become more

00:29:22.600 | reliable and robust over those use cases. And, uh, you don't need to do as much prompt engineering

00:29:28.680 | tricks anymore to, to solve those things. I feel like in some sense, I feel like prompt engineering

00:29:33.560 | even is like a slightly negative word at the time because it refers to all those kind of weird tricks

00:29:38.440 | that you have to apply. But I think we don't have to do that anymore. Like given today's progress,

00:29:42.840 | you should just be able to talk to like a coworker and if you're clear and concrete and you know,

00:29:48.200 | being reasonable, then it should be reasonable things for you. Yeah. Yeah. The way I put this is,

00:29:52.040 | uh, you should not be a prompt engineer because it is the goal of the big labs to put you out of a job.

00:29:56.600 | You should just be a good communicator. Like if you're a good communicator to human,

00:30:00.680 | you should be a good communicator to them. And I think that's the key though, because oftentimes

00:30:04.600 | people aren't good communicators to these language models and that is a very important skill and that's

00:30:08.920 | still messing around with the prompt. And so it depends what you're talking about when you're

00:30:12.760 | saying prompt engineer. But do you think it's like very correlated with like, are they like a good

00:30:17.640 | communicator to humans? You know, it's like it may be, but I also think I would say on average,

00:30:22.120 | people are probably worse at communicating with language models than to humans. That's for sure.

00:30:25.640 | Right now, at least, because I think we're still figuring out how to do it. You kind of expect it

00:30:28.920 | to be magical and there's probably some correlation, but I'd say there's also just like people are worse

00:30:34.200 | at it right now than talking to you. We should, we should, uh, make it like a, you know, like an elementary

00:30:39.240 | school class or whatever, like how to talk to. Uh, yeah. I'm very proud of that. Yeah.

00:30:44.040 | Before we leave the topic of, uh, trees and searching, not specific about Q*, but there's

00:30:49.160 | a lot of questions about MCTS and this combination of tree search and language models. And I just had to

00:30:56.280 | get in a question there about how seriously should people take this?

00:30:58.920 | Again, I think it depends on the tasks, right? So MCTS was magical for Go, but it's probably not as

00:31:06.520 | magical for robotics, right? So I think right now, the problem is not even that we don't have good

00:31:12.520 | methodologies. It's more about we don't have good tasks. It's also very interesting, right? Because

00:31:17.080 | if you look at my citations, like obviously the most cited are React, Reflection, and Treehouse,

00:31:21.640 | all those are methodologies. But I think like equally important, if not more important,

00:31:26.520 | line of my work is like benchmarks and environments, right? Like web shop or suite venture or whatever.

00:31:31.400 | And I think in general, what people do in academia that I think is not good is they choose a very

00:31:38.760 | simple task, like Alfred, and then they apply overly complex methods and to show the improved 2%.

00:31:45.000 | I think like you should probably match, you know, the level of complexity of your task and your method.

00:31:53.000 | Right. I feel like where tasks are kind of far behind the method in some sense, right? Because

00:31:59.000 | we have some good test time approaches, like whatever, React or Reflection and Treehouse,

00:32:03.640 | all that are like, there are many, many more complicated testing methods afterwards. But on the

00:32:09.800 | benchmark side, we have made a lot of good progress this year, last year. But I think we still need more

00:32:15.720 | progress towards that, like better coding benchmark, better web agent benchmark, better agent benchmark,

00:32:22.040 | not even for web or code. I think in general, we need to catch up with, with tasks.

00:32:27.560 | What are the biggest reasons in your mind why, why it lags behind?

00:32:31.000 | I think incentive is one big reason. Like if you see, you know, all the massive paper are cited like

00:32:38.360 | a hundred times more than the task paper. And also making a good benchmark is actually quite hard. And

00:32:45.560 | it's almost like a different set of skills in some sense, right? I feel like if you want to build a good

00:32:51.480 | benchmark, you need to be like a good kind of product manager kind of mindset, right? You need to think about

00:32:56.680 | why people should use your benchmark, why it's challenging, why it's useful. If you think about

00:33:01.160 | like a PhD going into like a school, right? The prior skill that expected to have is more about, you know,

00:33:08.760 | can they code this method and can they just run experiments and can solve that? I think building

00:33:13.560 | a benchmark is not the typical prior skill that we have, but I think things are getting better. I think

00:33:19.560 | more and more people are starting to build benchmarks and people are saying that it's like a way to get more

00:33:26.040 | impact in some sense, right? Because like, if you have a really good benchmark, a lot of people are going to use it.

00:33:30.520 | But if you have a super complicated test time method, like it's very hard for people to use it.

00:33:35.480 | Are evaluation metrics also part of the reason, like for some of these tasks that we might want to ask these

00:33:41.000 | agents or language models to do, is it hard to evaluate them since so it's hard to get an automated benchmark?

00:33:46.600 | Obviously with Sweetbench, you can, and with coding, it's, it's easier, but.

00:33:50.200 | I think that's part of the, like the skill set thing that I mentioned, because I feel like it's like,

00:33:55.400 | it's like a product manager because there are many dimensions and you need to strike a balance and

00:34:00.680 | it's really hard, right? If you want to make sense very easy to all the gradable, like automatically

00:34:06.760 | gradable, like either to grade or either to evaluate, then you might lose some of the realness or

00:34:11.880 | practicality. Or like it might be practical, but it might not be as scalable, right? For example,

00:34:18.440 | if you think about text game, humans have pre-annotated all the rewards and all the language are

00:34:23.720 | real. So it's pretty good on auto-gradable dimension and the practical dimension. If you

00:34:29.800 | think about, you know, practical, like actual English being practical, but it's not scalable,

00:34:34.280 | right? Like it takes like a year for like experts to, to, to build that game. So it's not really that

00:34:39.240 | scalable. And I think part of the reason that Sweetbench is so popular now is it kind of hits the

00:34:44.200 | balance between the three dimensions, right? Either to evaluate and being actually practical and being

00:34:49.160 | scalable. Like if I were to criticize upon some of my prior work, I think webshop, like it's my initial

00:34:55.400 | attempt to get into benchmark work. And I'm trying to do a good job striking the balance, but obviously

00:35:01.960 | we make it all gradable and it's really scalable. But then I think the practicality is not as high as

00:35:08.520 | actually just using GitHub issues, right? Because you're just creating those like synthetic tasks.

00:35:13.480 | Are there other areas besides coding that jumped to mind as being really good for being auto-gradable?

00:35:19.160 | Maybe mathematics.

00:35:20.760 | Yeah, classic.

00:35:21.880 | Yeah.

00:35:22.760 | Do you have thoughts on AlphaProof, the, the new DeepMind paper?

00:35:27.720 | I think it's pretty cool.

00:35:28.520 | Yeah.

00:35:30.520 | I think it's more of a, you know, it's more of like a confidence boost or like a, sometimes, you know,

00:35:38.280 | the work is not even about, you know, the technical details or the methodology that it chooses or the,

00:35:44.840 | the concrete results. I think it's more about a signal, right?

00:35:47.480 | Yeah. Existence boost, like, yeah, yeah. It's like, can be done.

00:35:50.520 | This direction is exciting. It kind of encourages people to work more towards that direction. I think

00:35:55.400 | it's more like a boost of confidence, I would say. Yeah.

00:35:59.320 | So we're going to focus more on agents now. And, you know, we were a special,

00:36:03.960 | all of us have a special interest in coding agents. I would consider Devin to be the sort of

00:36:08.920 | biggest launch of the year as far as AI startups go. And you guys in the Princeton group worked on

00:36:16.040 | SuiAgent alongside of SuiBench. Tell us the story about SuiAgent.

00:36:19.400 | Sure. So I think it's kind of like a trilogy. It's actually a series of three works now. So

00:36:25.960 | actually the first work is called Intercode, but it's not as, it's not as famous, I know. And the

00:36:32.600 | second word is called SuiBench. And the third word is called SuiAgent. And I'm just really confused why

00:36:38.280 | nobody's working on coding. You know, it's like a year ago, but I mean, not everybody's working on coding,

00:36:44.520 | obviously, but a year ago, like literally nobody was working on coding. I was really confused. And

00:36:50.120 | the people that were working on coding are, you know, trying to solve human evil in like a sick

00:36:55.800 | to sick way. There's no agent, there's no chain of thought, there's no anything. They're just, you

00:37:00.360 | know, fine tuning the model and improve some points and whatever. Like I was really confused because

00:37:06.040 | obviously coding is the best application for agents because it's all degradable. It's super important.

00:37:12.440 | You can make everything like API or code action, right? So I was confused and I collaborated with

00:37:19.000 | some of the students in Princeton and we have this work called Intercode. And the idea is,

00:37:23.400 | first, if you care about coding, then you should solve coding in an interactive way, meaning more

00:37:28.520 | like a Jupyter notebook kind of way than just writing a program and seeing if fails or succeeds and stop,

00:37:35.320 | right? You should solve it in an interactive way. That's because that's exactly how humans solve it,

00:37:39.880 | right? If I tell you to, you know, write a program like next token, next token, next token and stop and

00:37:47.000 | never do any edits and you cannot really use any terminal or whatever tool, it doesn't make sense,

00:37:52.360 | right? And that's the way people are solving coding at the time. Basically like sampling a program from a

00:37:58.440 | language model without chain of thought, without tool call, without reflection, without anything.

00:38:02.280 | So first point is we should solve code coding in a very interactive way. And that's a very general

00:38:07.400 | principle that applies for various coding benchmarks. But also I think you can make a lot of the agent

00:38:14.760 | tasks kind of like interactive coding. If you have Python and you can call any package, then you can

00:38:20.920 | literally also browse internet or do whatever you want, like control a robot or whatever. So that seems

00:38:26.920 | to be a very general paradigm. But obviously I think a bottleneck is at the time we're still doing, you know,

00:38:32.840 | very simple tasks like human evil or whatever coding benchmark people proposed. Like they were super hard

00:38:37.640 | 2021, like 20%, but they're like 95% already in 2023. So obviously the next step is we need better benchmark. And

00:38:44.920 | Carlos and John, which are the first authors of Sweetbench. I think they come up with this great

00:38:50.840 | idea that we should just script GitHub and solve whatever human engineers are solving. And I think

00:38:56.520 | it's actually pretty easy to come up with this idea. And I think in the first week, they already made a lot of

00:39:01.720 | progress, like the script, the GitHub, and they make all the same, but then there's a lot of pain for

00:39:07.240 | infra work and whatever, you know, I think the idea is super easy, but the engineering is super hard. And I feel like

00:39:12.680 | that's a very typical signal of a good work in the AI era now.

00:39:16.840 | I think also, I think the filtering was challenging because if you look at open source PRs, like a lot

00:39:23.560 | of them are just like, you know, fixing typos.

00:39:25.800 | I think it's challenging. And to be honest, we didn't do a perfect job at the time. So if you

00:39:30.120 | look at the recent blog posts with OpenAI, like we improved the filtering so that, you know, it's more

00:39:36.520 | so I think OpenAI was just like, look, this is a thing now we have to fix this.

00:39:40.120 | Like these students just like, you know, rushed it.

00:39:44.520 | It's a good convergence of interest for me.

00:39:47.240 | Yeah. Was that tied to you joining OpenAI or like, was that just unrelated?

00:39:52.680 | It's a coincidence for me, but it's a good coincidence.

00:39:55.800 | There is a history of anytime a big lab adopts a benchmark, they fix it because, you know,

00:40:00.520 | otherwise it's a broken benchmark.

00:40:02.040 | Yeah. So naturally, once we propose Swimage, the next step is to solve it, right?

00:40:07.400 | But I think the typical way you solve something now is you collect some training samples or you

00:40:12.520 | design some complicated agent method, and then you try to solve it, right?

00:40:17.000 | Either a super complicated prompt or you build a better model with more trained data. But I think

00:40:22.040 | at the time we realized that even before those things, there's a fundamental problem with the

00:40:27.080 | interface or the tool that you're supposed to use, right? Because that's like a ignored problem in

00:40:33.480 | some sense, right? Like what your tool is or how that matters for your task. So what we found

00:40:40.200 | concretely is that if you just use the text terminal off the shelf as a tool for those agents,

00:40:45.800 | there's a lot of problems, right? For example, if you edit something, there's no feedback.

00:40:50.200 | So you don't know whether your edit is good or not. And that makes the agent very confused

00:40:54.520 | and makes a lot of mistakes. And there are a lot of like small problems, you would say. And

00:40:59.640 | well, you can try to do prompt engineering and improve that, but it turns out to be actually very

00:41:05.480 | hard. And we realized that the interface design is actually a very omitted kind of part of agent design.

00:41:12.520 | So we did this sweet agent work. And the key idea is just even before you talk about, you know,

00:41:17.240 | what the agent is, you should talk about what the environment is and you should make sure that the

00:41:21.400 | environment is actually friendly to whatever agent you're trying to apply, right? And that's the same

00:41:26.280 | idea for humans, right? Like if I give you like text terminal is good for some tasks like git pool or

00:41:33.240 | whatever, right? But it's not good if you want to look at, you know, browser and whatever, right?

00:41:39.080 | So also like, you know, browser is a good tool for some tasks, but it's not a good tool for other tasks.

00:41:44.680 | We need to talk about how to design an interface in some sense where we should treat agents as our

00:41:49.880 | customers, right? It's like when we treat humans as a customer, we design human computer interfaces, right?

00:41:56.360 | We design those beautiful desktops or browser or whatever, so that it's very intuitive and easy for

00:42:02.600 | humans to use. And this whole great subject of HCI is all about that. I think now the research idea

00:42:09.880 | of sweet agent is just we should treat agents as our customers and we should do like, you know, AICI.

00:42:16.280 | So what are the tools that a sweet agent should have or a coding agent in general?

00:42:24.360 | For sweet agent, it's like a modified text terminal, which kind of adapts to a lot of the patterns of

00:42:30.440 | language models to make it easier for language model to use. For example, now for edit, instead of having

00:42:36.200 | no feedback, it will actually have a feedback of, you know, actually here you introduce like a syntax

00:42:41.080 | error and you should probably want to fix that. And there's ended error there. And that makes it super

00:42:45.880 | easy for the model to actually do that. And there's other small things like how exactly you write

00:42:50.760 | arguments, right? Like, do you want to write like a multi-line edit or do you want to write a single

00:42:55.720 | line edit? I think it's more interesting to think about the way of the development process of ACI rather

00:43:02.280 | than the actual ACI for like a concrete application, because I think the general paradigm is very similar

00:43:07.640 | to HCI and psychology, right? Basically for how people develop HCI is they do behavior experiments on

00:43:14.600 | humans, right? I do A/B test, right? Like which interface is actually better? And I do those

00:43:20.200 | behavior experiments, kind of like psychology experiments to humans, and I change things.

00:43:24.440 | And I think what's really interesting for me for this three agent paper is we can probably do the

00:43:29.480 | same thing for agents, right? We can do A/B test for those agents and do behavior tests.

00:43:33.720 | And through the process, we not only invent better interfaces for those agents, that's the practical

00:43:39.000 | value, but we also better understand agents. Just like when we do those A/B tests, we do those

00:43:44.280 | HCI, we better understand humans. During those ACI experiments, we actually better understand agents.

00:43:49.400 | And that's pretty cool.

00:43:50.680 | Besides kind of like that A/B testing, what are other kind of like processes that people

00:43:55.400 | can use to think about this in a good way?

00:43:57.640 | That's a great question. And I think switch is like an initial work. And what we do is

00:44:02.520 | the kind of the live approach, right? You just try some interface and you see what's going

00:44:07.640 | wrong and then you try to fix that. You do this kind of iterative fixing. But I think what's really

00:44:12.680 | interesting is there will be a lot of future directions that's very promising if we can apply

00:44:17.880 | some of the HCI principles more systematically into the interface design. I think that would be a very

00:44:23.400 | cool interdisciplinary research opportunity.

00:44:25.680 | You talked a lot about kind of like agent computer interfaces and interactions. What about like human

00:44:32.440 | to agent kind of like UX patterns? I'm like, yeah, curious for any thoughts there that you might have.

00:44:38.440 | That's a great question. And in some sense, I feel like prompt engineering is about

00:44:43.400 | human agent interface. But I think there can be a lot of interesting research done about it. So prompting is

00:44:51.240 | about how humans can better communicate with the agent. But I think there could be interesting

00:44:55.960 | research on how agents can better communicate with humans, right? When to ask questions,

00:45:00.920 | how to ask questions, like what's the frequency of asking questions. And I think those kind of stuff

00:45:05.960 | could be very cool research.

00:45:07.240 | Yeah. I think some of the most interesting stuff that I saw here was also related to coding with

00:45:11.240 | Devon from Cognition. And they had the three or four different panels where you had like the chat,

00:45:16.200 | the browser, the terminal, and I guess the code editor as well.

00:45:19.080 | There's more now.

00:45:19.960 | There's more. Okay. I'm not up to date.

00:45:21.560 | Yeah. I think they also did a good job on ACI.

00:45:23.640 | They did. Yeah.

00:45:24.600 | Yeah.

00:45:24.680 | I think that's the main learning I have from Devon. They cracked that. They actually,

00:45:28.280 | there was no foundational planning breakthrough. The planner is like actually pretty simple,

00:45:32.520 | but ACI that they broke through on.

00:45:34.920 | I think making the tool good and reliable is probably like 90% of the whole agent. Once the tool is actually

00:45:41.720 | good, then the agent design can be much, much simpler. On the other hand, if the tool is bad,

00:45:47.240 | then no matter how much you put into the agent design planning or search or whatever,

00:45:51.720 | it's still going to be trash.

00:45:52.840 | Yeah. I'd argue the same, same with like context and instructions. Like, yeah, go hand in hand.

00:45:59.320 | On the tool, how do you think about the tension of like, for both of you, I mean, you're building

00:46:03.880 | a library. So even more for you, the tension between making now a language or a library that

00:46:09.080 | is like easy for the agent to grasp and write versus one that is easy for like the human to grasp

00:46:15.160 | and write, because you know, the trend is like more and more code gets written by the agent. So

00:46:18.840 | why wouldn't you optimize the framework to be as easy as possible for the model versus for the person?

00:46:24.200 | I think it's possible to design interface that's both friendly to humans and agents. But what do

00:46:28.840 | you think?

00:46:29.160 | We haven't thought about that from the perspective, like we're not trying to design

00:46:32.360 | LangChain or LangGraph to be friendly, but I mean, I think to be friendly for agents to write.

00:46:40.760 | But I mean, I think we see this with like, I saw some paper that used TypeScript notation instead of

00:46:46.760 | JSON notation for tool calling and it got a lot better performance. So it's definitely a thing. I haven't

00:46:51.960 | really heard of anyone designing like a syntax or a language explicitly for agents, but there's

00:46:57.160 | clearly syntaxes that are better.

00:46:58.600 | I think function calling is a good example where it's like a good interface for both

00:47:03.160 | human programmers and for agents, right? Like for developers, it's actually a very friendly

00:47:08.680 | interface because it's very concrete and you don't have to do prompt engineering anymore. You can

00:47:12.840 | be very systematic. And for models, it's also pretty good, right? Like it can use all the existing coding

00:47:18.600 | content. So I think we need more of those kinds of designs.

00:47:21.320 | I will mostly agree and then I'll slightly disagree in terms of this, which is like whether

00:47:26.440 | designing for humans also overlaps the designing for AI. So Malta Ubo, who's the CTO of Vercel,

00:47:31.960 | who is creating basically JavaScript's, you know, competitor to LangChain, they're observing that

00:47:37.160 | basically like if the API is easy to understand for humans, it's actually much easier to understand for

00:47:41.800 | LMs. For example, because they're not overloaded functions. They don't behave differently under different

00:47:46.120 | contexts. They do one thing and they always work the same way. It's easy for humans. It's easy for

00:47:51.000 | LMs. And like that makes a lot of sense. And obviously adding types is another one. Like type

00:47:55.640 | annotations only help give extra context, which is really great. So that's the agreement. And then a

00:48:00.120 | disagreement is that I've actually, when I use structured output to do my chain of thought, I have

00:48:05.720 | found that I change my field names to hint to the LLM of what the field is supposed to do. So instead of

00:48:12.840 | saying topics, I'll say candidate topics. And that gives me a better result because the LLM was like,

00:48:17.960 | "Ah, this is just a draft thing I can use for chain of thought." And instead of like summaries,

00:48:22.600 | I'll say topic summaries to link the previous field to the current field. So like little stuff like that,

00:48:27.320 | I find myself optimizing for the LLM where I as a human would never do that.

00:48:31.720 | Interesting. It's kind of like the way you optimize the prompt, it might be different for humans and for

00:48:37.720 | machines. You can have a common ground that's both clear for humans and agents, but to improve the human

00:48:43.880 | performance versus improving the agent performance, they might move to different directions.

00:48:48.280 | I move to different directions. There's a lot more use of metadata as well, like descriptions,

00:48:51.720 | comments, code comments, annotations and stuff like that. Yeah.

00:48:56.040 | I would argue that's just you communicating to the agent what it should do. And maybe you need

00:49:01.480 | to communicate a little bit more than to humans because models aren't quite good enough yet. But

00:49:06.200 | like, I don't think that's crazy. I don't think that's crazy.

00:49:09.000 | I will bring this in because it just happened to me yesterday. I was at the cursor office.

00:49:12.600 | They held their first user meetup and I was telling them about the LLMOS concept and why

00:49:19.560 | basically every interface, every tool was being redesigned for AIs to use rather than humans. And

00:49:24.200 | they're like, "Why? Can't we just use Bing and Google for LLM search? Why must I use EXA?" Or

00:49:30.440 | what's the other one that you guys work with? Tavili.

00:49:33.000 | Tavili. Web Search API dedicated for LLMs. What's the difference to Bing API?

00:49:38.120 | Exactly. There weren't great APIs for search. Like the best one, like the one that we used

00:49:42.600 | initially in LinkChain was SERP API, which is like maybe illegal. I'm not sure. And like, you know,

00:49:51.160 | and now they're like venture-backed companies. Shout out to DuckDuckGo, which is free.

00:49:55.320 | Yes. Yes. Yeah. I do think there are some differences though. I think you want,

00:50:00.360 | like, I think generally these APIs try to return small amounts of text information, clear legible

00:50:05.960 | field. It's not a massive JSON blob. And I think that matters. I think like when you talk about

00:50:10.520 | designing tools, it's not only the, it's the interface in the entirety, not only the inputs,

00:50:14.520 | but also the outputs that really matter. And so I think they try to make the outputs.

00:50:18.120 | They're doing ACI. Yeah. Yeah, absolutely. Really. Like there, there's a whole set of industries that

00:50:23.000 | are just being redone for ACI. It's weird. And so my, my simple answer to, to them was like the error

00:50:29.960 | messages. When you give error messages, they should be basically prompts for the LLM to take and then

00:50:35.560 | self-correct. Then your error messages get more verbose actually than you normally would with a human.

00:50:39.960 | stuff like that. Like a little, honestly, it's not that big. Again, like, is this worth a venture

00:50:44.840 | backed industry? Unless you can tell us, but like, I think code interpreter, I think is a new thing.

00:50:51.720 | I hope so. We invested in E2B. So I think that's a very interesting point. You're trying to optimize

00:50:56.440 | to the extreme. Then obviously they're going to be different. For example, the error, take it very

00:51:00.520 | seriously. Right. The error for, for like language model, the longer, the better. But for humans,

00:51:05.320 | that will make them very nervous and very tired. Right. But, but I guess that the point is more like,

00:51:10.520 | maybe we should try to find a co optimized common ground as much as possible. And then if we have

00:51:16.040 | divergence, then we should try to diverge. But it's more philosophical now. But I think like,

00:51:20.600 | part of it is like how you use it. So Google invented the page rank because ideally you only click

00:51:25.640 | on one link, you know, like the top three should have the answer. But like with models, it's like,

00:51:29.240 | well, you can get 20. So those searches are more like semantic grouping in a way. It's like,

00:51:34.600 | for this query, I'll return you like 20, 30 things that are kind of good, you know? So it's less about

00:51:40.040 | ranking and it's more about grouping. Another fundamental thing about ACI is the difference

00:51:45.800 | between human and machine's kind of memory limit. Right. So I think what's really interesting about this

00:51:51.560 | concept of ACI versus HCI is interfaces that's optimized for them. You can kind of understand some of the

00:51:57.320 | fundamental characteristics differences of humans and machines, right? Why, you know,

00:52:02.600 | if you look at find or whatever terminal command, you know, you can only look at it one thing at a time,

00:52:08.120 | or that's because we have a very small working memory. You can only deal with one thing at a time.

00:52:13.720 | You can only look at one paragraph of text at the same time. So the interface for us is by design,

00:52:19.320 | you know, a small piece of information, but more temporal steps. But for machines, that's, that should be the

00:52:25.000 | opposite, right? You should just give them a hundred different results and they should just

00:52:28.600 | decide the context was the most relevant stuff and trade off the context for temporal steps. That's

00:52:34.200 | actually also better for language models because like the cost is smaller or whatever. So it's

00:52:39.320 | interesting to connect those interfaces to the fundamental kind of differences of those.

00:52:43.480 | When you said earlier, you know, we should try to design these to maybe be similar as possible and

00:52:48.200 | diverge if we need to. I actually don't have a problem with them diverging now and seeing venture

00:52:53.160 | backed startups emerging now because we are different from machines, code AI, and it's just so early on,

00:53:00.200 | like they may still look kind of similar and they may still be small differences, but it's still just so

00:53:05.000 | early. And I think we'll only discover more ways that they differ. And so I'm totally fine with them kind of like

00:53:09.880 | diverging early and optimizing for the... I agree. I think, I think it's more like,

00:53:13.880 | you know, we should obviously try to optimize human interface just for humans. We're already doing

00:53:18.680 | that for 50 years. We should optimize agent interface just for agents, but we might also

00:53:24.680 | try to co-optimize both and see how far we can get. There's enough people to try all three directions.

00:53:30.600 | Yeah. There's a thesis I sometimes push, which is the sour lesson as opposed to the bitter lesson,

00:53:35.240 | which we always inspired by human development, but actually AI develops its own path.

00:53:40.280 | Right. We need to understand better, you know, what are the fundamental differences between those

00:53:44.520 | creatures. It's funny when really early on this pod, you were like, how much grounding do you have in

00:53:49.480 | cognitive development and human brain stuff? And I'm like, maybe that doesn't matter. And actually,

00:53:54.840 | so like I, in my original agent's blog post, I had a picture of the human brain and now it looks a lot

00:54:01.800 | more like a CPU. The canonical picture of the LNMOS is kind of like a CPU with all the input and output

00:54:07.160 | going into it. And I think that that's probably the more scalable system.

00:54:10.520 | I think the problem with like a lot of like cognitive scientists, like is that...

00:54:14.680 | They think by analogy, right?

00:54:15.800 | They think, you know, the only way to solve intelligence is through the human way. And therefore,

00:54:20.840 | they like have a lot of critics for whatever things that are not cognitive or human.

00:54:26.120 | But I think a more useful way to use those knowledge is to think of that as just a reference

00:54:31.240 | point. I don't think we should copy exactly what's going on with human all the way, but I think it's

00:54:35.640 | good to have a reference point because this is a working example how intelligence works.

00:54:39.720 | Yeah.

00:54:40.360 | And if you know all the knowledge and you compare them, I think that actually establishes more

00:54:45.960 | interesting insights as opposed to just copy that or not copying that or opposing that.

00:54:51.160 | I think comparing is the way to go.

00:54:53.080 | I feel like this is an unanswerable question, but I'll just put it out there anyway.

00:54:56.280 | If we can answer this, I think it'd be worth a lot, which is, can we separate intelligence from knowledge?

00:55:00.280 | That's a very deep question, actually.

00:55:03.240 | And to have a little history background, I think that's really the key thesis at the beginning of AI.

00:55:10.360 | If you think about Neville and Simon and all those symbolic AI people, right?

00:55:14.760 | Basically, they're trying to create intelligence by writing down all the knowledge.

00:55:21.000 | For example, they write like a checker or write a checker program.

00:55:24.920 | Basically how you would solve the checker, you write down all the knowledge and then implement that.

00:55:29.240 | And I think the whole thesis of symbolic AI is we should just be able to write down all the knowledge

00:55:33.960 | and that just creates intelligence.

00:55:35.640 | But that kind of fails. And I think really, I think a great like quote from Hinton is,

00:55:41.240 | I think there are two approaches to intelligence.

00:55:44.200 | One approach is let's deal with reasoning or thinking or knowledge, whatever you call that.

00:55:49.720 | And then let's worry about learning later.

00:55:51.560 | The other approach is let's deal with learning first.

00:55:54.360 | And then let's worry about, you know, whatever knowledge or reasoning or thinking later.

00:55:58.360 | And it turns out right now, at least like the second approach works and the first approach doesn't work.

00:56:04.520 | And I think there might be something deep about it.

00:56:06.680 | Does that answer your question?

00:56:07.880 | Partially, I think Apple intelligence might change that.

00:56:11.640 | Can you explain?

00:56:12.520 | If this year is the year of multi-modal models, next year is like on-device year.

00:56:16.440 | And Apple intelligence basically has hot swappable capabilities, right?

00:56:20.120 | Like they have like 50 LoRa's that they swap onto a base model that does different tasks.

00:56:25.720 | And that's the first instance that we have of the separation of intelligence and knowledge.

00:56:31.640 | And I think that's that's a really interesting approach.

00:56:34.120 | Obviously, it's not exactly knowledge.

00:56:35.640 | It's just more about context.

00:56:37.800 | Yeah, it's more about context.

00:56:38.760 | It's like you can have the same model deployed to 10 million phones with 10 million contacts and see if...

00:56:44.200 | For on-device deployment, I think it's super important.

00:56:46.120 | Like if you can boil out, like I actually have most of my problems with AI news when

00:56:51.880 | the model thinks it knows more than it knows because it combines knowledge of intelligence.

00:56:55.320 | I want it to have zero knowledge whatsoever.

00:56:57.400 | And it only has the ability to parse the things I tell it.

00:57:00.440 | I kind of get what you mean.

00:57:02.200 | I feel like it's more like memorization versus kind of just generalization in some sense.

00:57:06.680 | Yeah, raw ability to understand things.

00:57:08.120 | You don't want it to know like facts like, you know, who is the president of the United States.

00:57:12.840 | They should be able to just call internet and use a tool to solve it.

00:57:15.320 | Yes, because otherwise it's not going to call the tool if it thinks it knows.

00:57:19.240 | I kind of get what you mean.

00:57:20.280 | That's why it's valuable.

00:57:22.920 | Okay.

00:57:23.320 | So if that's the case, I guess my point is I don't think it's possible to fully separate them

00:57:28.680 | because like those kind of intelligence kind of emerges.

00:57:32.840 | Like even for humans, you can't just operate in an intelligent mode without knowledge, right?

00:57:38.520 | Throughout the years, you learn how to do things and what things are.

00:57:42.360 | And it's very hard to separate those things, I would say.

00:57:45.160 | Yeah.

00:57:45.640 | But what if we could?

00:57:48.520 | As a meta strategy, I'm trying to keep as a stack ranked list of like,

00:57:52.680 | what are the 10 most valuable questions in here?

00:57:55.160 | You can think of knowledge as a cache of intelligence in some sense.

00:57:59.480 | If you have like wikihall.com saying like you should tie a shoelace using the following step,

00:58:08.920 | you can think of that piece of text as like a cache to intelligence, right?

00:58:13.240 | I guess that's kind of like reflection anyway, right?

00:58:15.960 | It's like you're storing these things as memory,

00:58:17.800 | and then you put them back.

00:58:19.080 | So without the knowledge, you wouldn't have the intelligence to do it better.

00:58:22.680 | Right.

00:58:23.080 | Right.

00:58:23.480 | I had a couple of things.

00:58:24.280 | So we had Thomas Shalom from Meta to talk about Lama 3.1.

00:58:28.760 | They mentioned...

00:58:29.160 | Then he started talking about Lama 4.

00:58:30.440 | Yeah.

00:58:30.840 | I was like, whoa, okay.

00:58:32.360 | Great.

00:58:32.920 | And he said it's going to be like really focused on agents.

00:58:35.320 | I know you talked before about, you know, it's next token prediction enough to get to like problem

00:58:41.160 | solving.

00:58:41.560 | If you say you got the perfect environment, they got the terminal, they got everything.

00:58:46.360 | And if you were to now move down to the model level and say, I need to make a model that is better

00:58:50.440 | for like agentic workflow, where would you start?

00:58:52.840 | I think it's data.

00:58:54.520 | I think it's data because like changing architecture now is too hard and we don't have a good,

00:58:59.400 | better alternative solution now.

00:59:00.680 | I think it's mostly about data and agent data is obviously hard because people just write down the

00:59:06.840 | final result on the internet.

00:59:08.040 | They don't write down how they like step by step, how they do the thing on the internet.

00:59:11.560 | Right.

00:59:11.800 | So naturally it's easier for models to learn chain of thought than tool call or whatever agent self

00:59:19.000 | reflection or search, right?

00:59:20.920 | Like even if you do a search, you won't write down all the search processes on the internet.

00:59:24.600 | You would just write down the final result.

00:59:26.840 | And, uh, I think it's a great thing that NAMA 4 is going to be more towards agents.

00:59:32.360 | That means, I mean, that should mean a lot for a lot of people.

00:59:34.920 | Yeah.

00:59:35.320 | In terms of data, you think the right data looks like trajectories basically of,

00:59:40.120 | of a react agent or of.

00:59:42.600 | Yeah.

00:59:43.720 | I mean, I have a paper called fire act.

00:59:45.720 | Do you still remember?

00:59:46.520 | No.

00:59:47.320 | Okay.

00:59:48.120 | Tell us.

00:59:49.240 | Okay.

00:59:49.640 | That's one of the not famous papers, I guess.

00:59:52.120 | It's not even on your website.

00:59:53.160 | Are we supposed to find it?

00:59:55.000 | It's on.

00:59:55.560 | It's on this Google Scholar.

00:59:56.600 | I've got it pulled out.

00:59:57.480 | Okay.

00:59:57.800 | It's not, it's, it's, it's been rejected for, for like a couple of times.

01:00:02.520 | No.

01:00:03.000 | But now it's Elena's face.

01:00:04.280 | Yeah.

01:00:04.680 | Everybody will find it.

01:00:05.720 | Anyway, I think the idea is very simple.

01:00:07.480 | Like you can try a lot of different agent methods, right?

01:00:09.800 | React, chain of thought, reflection or whatever.

01:00:12.840 | And the idea is very simple.

01:00:14.280 | You just have very diverse data, like tasks, and you try very diverse agent methods and you filter all

01:00:20.280 | the correct solutions and you train a model on all of that.

01:00:22.760 | And then the benefit is that you should somehow learn, you know, how to use simpler methods for

01:00:28.040 | simpler tasks and harder methods for harder tasks.

01:00:30.360 | I guess the problem is we don't have diverse, high quality tasks.

01:00:33.320 | That's the bottleneck for.

01:00:34.280 | So it's going to be trained on all code.

01:00:36.600 | Yeah.

01:00:36.840 | Let's hope we have more better benchmarks.

01:00:38.520 | Yeah.

01:00:39.080 | In school, that kind of pissed me off a little bit.

01:00:41.320 | When you're doing like a homework, like exercises for like calculus, like they give you the problem,

01:00:46.120 | then they give you the solution.

01:00:47.320 | Right.

01:00:47.560 | But there's no way without the professor or the TA to get like the steps.

01:00:51.160 | Right.

01:00:51.400 | So actually how you got there.

01:00:52.440 | Right.

01:00:52.760 | And so I feel like because of how schools are structured, we never brought this thing down.

01:00:57.080 | But I feel like if you went to every university and it's like write down step by step the solution

01:01:01.320 | to every single problem in the set and make it available online, that's a start to make this dataset better.

01:01:06.600 | I think it's also because, you know, it's, it might be hard for you to write down your chain of thought,

01:01:11.960 | even when you're solving the same, because part of that is conscious in language, but maybe even part

01:01:17.960 | of that is not in language.

01:01:19.080 | And okay.

01:01:19.960 | So a funny side story.

01:01:21.400 | So when I wrote down the React thing, I would tell it to my Google manager, right?

01:01:25.800 | Like, you know, what we should do, we should just hire, you know, as many people as possible and

01:01:30.920 | let them use Google and write down exactly what they think, what they search on the internet.

01:01:35.240 | and we train them all on that.

01:01:36.600 | But I think it's not, not trivial to, to write down your thoughts.

01:01:39.960 | Like if you're not trained to do that, if I tell you like, okay, write down what you're thinking

01:01:44.920 | right now, it's actually not as trivial a task as you might imagine.

01:01:48.600 | It might be more of a diffusion process than the autoregressive process.

01:01:53.080 | But I think the problem is starting with the experts, you know, because there's so much like muscle

01:01:57.240 | memory and what you do once you've done it for so long, that's why we need to like get everybody

01:02:01.960 | to do it. And then you can see it like separate knowledge and intelligence.

01:02:05.720 | The simplest way to achieve AGI is literally just, just record the reaction of every human

01:02:12.040 | being and just put them together. You know, like, what do you have thought about?

01:02:16.200 | What do you have done? Let's say on the computer, right? Imagine like solid experiment. Like you,

01:02:21.640 | you write down literally everything you think about and everything you do on the computer and

01:02:25.640 | you record them and you train on all the successful trajectories by some metric of success. I think

01:02:30.680 | that should just lead us to AGI.

01:02:32.200 | My, my first work of fiction in like 10 years was exploring that idea of what if you recorded

01:02:38.200 | everything and uploaded yourself? I'm pretty science-based like, you know, but probably the most like

01:02:42.600 | spiritual woo-woo thing about me is I don't think that would lead to consciousness or AGI just because

01:02:47.240 | like there's something in there's like, there's a soul, you know, that is the unspeakable quality.

01:02:53.080 | That's if it emerges through skill.

01:02:55.240 | We can simulate that for sure.

01:02:58.120 | What do you think about the role of few-shot prompting for some of these like agent trajectories?

01:03:03.000 | That was a big part of the original react paper, I think. And as we talk about showing your, your work

01:03:09.160 | and how you think like. I feel like it's becoming less used than zero-shot prompting. What's your

01:03:14.760 | observation? I'm pretty bullish on it, to be honest, for a few reasons. Like one, I think it can maybe

01:03:20.280 | help for more complex things, but then also two, like it's a form of prompting and prompting is just

01:03:24.440 | communicating with the model what you want it to do. And sometimes it's easier to just show the model

01:03:28.200 | what you want it to do than write out detailed kind of like instructions. I think the practical reason

01:03:33.160 | it has become less used is because the agent kind of scaffold become more complex or the tasks you're

01:03:38.920 | trying to solve is becoming more complex. It's harder to annotate a few-shot examples, right?

01:03:43.880 | Like in the chain of thought era, she just write down three lines of things. It's very easy to write

01:03:48.120 | down a few-shot or whatever, but I feel like annotation difficulty has become harder.

01:03:53.720 | I think also one of the reasons that I'm bullish on it is because I think it's a really good way to

01:03:57.240 | achieve kind of like personalization. Like if you can collect this through feedback automatically,

01:04:01.160 | you can then use that in the system at a user level or something like that. Again,

01:04:04.680 | the issue with that is more complex things that doesn't really work.

01:04:08.200 | Probably more useful is like an automatic, you know, prompt, right? If you have some way to

01:04:13.160 | retrieve examples and put it in like automatic pipeline to prompt. But I feel like if you're

01:04:17.640 | a human, you're manually writing now, I feel like more people will try to use zero-shot.

01:04:22.760 | Yeah, but if you're doing a consumer product, you're probably not going to ask

01:04:26.280 | user-facing people to write a prompt or something like that. But I think the thing that you brought

01:04:31.160 | up is also really relevant here where you can collect feedback from a user, but it's usually at the top

01:04:36.120 | level. And so then if you have three or four or five or however many LLM calls down below, how do you

01:04:42.120 | disperse that feedback to those? And I don't have an answer for that.

01:04:45.320 | There's another super popular paper that you author called Koala, Cognitive

01:04:50.120 | Architectures for Language Agents. I'm not sure if it's super popular.

01:04:52.600 | People speak highly of it here within my circles. So shout out to Charles Frey, who

01:04:58.600 | told me about it. I think that was our most popular webinar.

01:05:00.920 | I think Harrison promoted the paper a lot. Thanks to him.

01:05:06.520 | I'll read what you wrote in here and then you can just kind of go take it wherever.

01:05:10.200 | Koala organizes agents along three key dimensions: their information storage divided into working and

01:05:15.320 | long-term memories, their action space divided into internal and external actions, and their

01:05:20.440 | decision-making procedure, which is structured as an interactive loop with planning and execution.

01:05:24.360 | By the way, I think your communication is very clear, so kudos on how you do these things.

01:05:28.360 | take us through the sort of three components. And you also have this development diagram,

01:05:31.960 | which I think is really cool. I think it's figure one on your paper for people reading along.

01:05:35.880 | Normally people have input, LLM, output. Then they develop into language agents that takes an action

01:05:41.880 | into an environment and has observations. And then they go into the Koala architecture.

01:05:46.600 | Shout out to my co-first author, Ted, who made figure one. He's like, you know,

01:05:52.440 | figure is really good. You don't even need color. You just...

01:05:56.760 | One of the motivations of Koala is we're seeing those agents become really complicated.

01:06:01.400 | I think my personal philosophy is to try to make things as simple as possible, but obviously this

01:06:06.120 | field has become more complex as a whole, and it's very hard to understand what's going on.

01:06:10.360 | And I think Koala provides a very good way to understand things in terms of those three dimensions.

01:06:17.480 | And I think they are pretty first principles, because I think this idea of memory is pretty first

01:06:23.560 | principle if you think about where memory, where information is stored. And you can even think of

01:06:27.560 | the ways of neural network as some kind of non-term memory, because that's also part of the information

01:06:32.120 | that is stored. I think a very first principle way of thinking of agents is pretty much just a neural

01:06:38.520 | network plus the code to call and use the neural network. Obviously also maybe plus some vector store

01:06:45.960 | or whatever other memory modules, right? And thinking through that, then you immediately realize

01:06:50.680 | that the kind of the non-term memory or the persistent information is first, the neural network,

01:06:56.760 | and second, the code associated with the agent that calls the neural network, and maybe also some other

01:07:02.680 | vector stores. But then there's obviously another kind of storage of information that's shorter horizon,

01:07:09.400 | right? Which is the context window or whatever episode that people are using. Like you're trying

01:07:15.240 | to solve this task and information happens there. But once this task is solved, the information is gone,

01:07:19.880 | right? So I think it's very systematic and first principle to think about where information is and

01:07:24.680 | thinking, organizing them through categories and time horizon, right? So once you have those information

01:07:31.080 | stores, then obviously for agent, the next thing is, what kind of action can you do? And that leads to the concept

01:07:37.400 | of action space, right? And I think one of the fundamental difference between language agents and

01:07:41.960 | the previous agents is that for traditional agents, if you think about Atari or video game,

01:07:47.000 | they only have like a predefined action space by the environment. They only have external actions,

01:07:51.320 | right? Because they don't have complicated memory or information and kind of devices to do internal

01:07:55.880 | thinking. I think the contribution of React is just to point out that we can also have internal actions

01:08:00.760 | called thinking. And obviously if you have long-term memory, then you also have retrieval or writing or

01:08:05.880 | whatever. And then third, once you have those actions, which action should you do? That's the

01:08:10.600 | problem of decision-making. And, uh, and the three parts should just fully describe our agent.

01:08:17.800 | We solved it. We have defined agents.

01:08:19.320 | Yeah, it's done.

01:08:20.280 | Does anything that you normally say about agents not fit in that framework? Because you also get

01:08:25.480 | asked this question a lot.

01:08:26.760 | Um, I think it's very aligned. Um, if we think about a lot of the stuff we do, I'm just thinking out loud now,

01:08:33.640 | but a lot of the stuff we do on agents now is through lane graph, lane graph, we would view as

01:08:38.120 | kind of like the code part of what defines some of these things.

01:08:41.800 | Also defines part of the decision-making decision. That's what I was thinking actually. Yeah.

01:08:45.640 | Yeah. And actually one analogy that I like there is like some of the code and part of lane graph,

01:08:51.960 | and I'm actually curious what you think about this, but like, sometimes I say that like the LLMs aren't

01:08:56.520 | great at planning yet. So we can help them plan by telling them how to plan and code, because that's very

01:09:00.600 | explicit and that's a good way of communicating how they should plan and stuff like that.

01:09:04.200 | That's a Devon playbook as well.

01:09:05.080 | What do you mean by that? Like giving them like a DFS algorithm or?

01:09:08.280 | No, something like much simpler. Like you could tell agent in a prompt like,

01:09:12.200 | "Hey, every time you do this, you need to also do this and make sure to check this." Or you could just

01:09:16.280 | put those as explicit checks in kind of like the decision-making procedure or something like that.

01:09:20.760 | Right.

01:09:21.320 | And the more complex it gets, I think the more we see people encoding that in code. And another

01:09:26.440 | way that I say this is like, all of life really is communication, right? And so you can do that

01:09:31.400 | through prompts or you can do that through code. And code is great at communicating things. It really is.

01:09:34.760 | Is this the most philosophical?

01:09:36.360 | This is the cheapest I've ever had.

01:09:37.960 | Okay, okay.

01:09:38.360 | That's good.

01:09:38.920 | That's good.

01:09:39.080 | That's good.

01:09:39.640 | Yeah.

01:09:39.880 | We're talking about agents, you know?

01:09:41.320 | Yeah.

01:09:41.960 | I think the biggest thing that we're thinking a lot about is just the memory component. And we touched on it a

01:09:46.760 | little bit earlier in the, in the episode, but I think it's still like very unsolved. I think like

01:09:51.320 | clearly semantic memory, episodic memory, or types of memory, I think, but like where the boundaries

01:09:56.920 | are, is there, are there other types? How to think about that? I think that to me is maybe one of the

01:10:01.960 | bigger unsolved things in terms of agents is just memory. Like what, what does memory even mean?

01:10:06.440 | That's another top high value question. Is it a knowledge graph?

01:10:09.800 | I think that's one type of memory.

01:10:14.680 | Yeah. If you're using a knowledge graph as a hammer to hit a nail, it's, it's, it's not that.

01:10:19.160 | But I think, I think practically what we see is it's still so application specific, what relevant

01:10:25.560 | memory is. And that also makes it really tough to answer generically, like what is memory? So like,

01:10:30.360 | it could be a knowledge graph. It could also be like, I don't know, a list of instructions that you

01:10:35.000 | just keep in a list.

01:10:35.800 | Yep.

01:10:36.520 | A meta point is I feel sometimes we underestimate some aspects where humans and agents are actually

01:10:42.600 | similar and we overestimate sometimes. The difference is, you know, I feel like, I mean,

01:10:47.000 | at one point I think that's shared by agents and humans is like, we all have very different types of

01:10:52.280 | memories, right? Some people use Google doc. Some people use notion and some people use paper and pen.

01:10:57.080 | Like you can argue those are different types of long term memories for people, right? And each person

01:11:02.840 | develops its own way to maintain their long term memory and diarrhea or whatever. It's a very kind of

01:11:09.000 | individual kind of thing. And I feel like for agents, probably there's no like single best solution. But

01:11:14.760 | what we can do is we can create as many good tools as possible, like Google Docker, Notion, equivalent

01:11:20.680 | of agent memory. And we should just give the choice to the agent. Like, what do you want to use? And through

01:11:26.440 | learning, they should be able to come up with their own way to use the long memory.

01:11:29.400 | You know, or give the choice to the developer who's building the agents, because I think it also that

01:11:34.520 | it might, it depends on the task. Like I use, I think we want to control that one. Right now,

01:11:38.920 | I would agree with that for sure, because I think you need that level of control. I use linear for

01:11:43.080 | planning for code. I don't use that for my grocery list, right? Like depending on what I'm trying to do,

01:11:47.560 | I have different types of long form memory. Maybe if you tried, you would have a gorgeous kitchen.

01:11:52.200 | Do you think our like tool making kind of progress is good or not good enough in terms of, you know,

01:11:58.360 | we have all sorts of different memory stores or retrieval methods or whatever? On the memory front

01:12:04.280 | in particular, I don't think it's very good. I think there's a lot to still be done. What do you think

01:12:08.120 | are lacking? Yeah, you have a memory service. What's missing? The memory service we launched, I don't

01:12:13.080 | think really found product market fit. I think like, I mean, I think there's a bunch of different

01:12:17.080 | types of memory. I'll probably write a blog. I mean, I have a blog that I published at some point on

01:12:22.120 | this, but I think like right off the bat, there's like procedural memory, which is like how you do

01:12:26.360 | things. I think this is basically episodic memory, like trajectories of correct things. But there's also,

01:12:31.480 | then I think a very different type is like personalization. Like I like Italian food.

01:12:35.320 | It's kind of a semantic memory.

01:12:36.440 | That's kind of, maybe like a system prompt.

01:12:38.600 | Yeah, exactly. Exactly. Yeah.

01:12:40.680 | It could be a, it depends if it's semantic over like raw events or over reflections over events.

01:12:45.720 | Right. Again, semantic procedure, whatever, it's just like a categorization. What really

01:12:49.560 | matters is the implementation, right? And so one of the things that

01:12:52.120 | we'll probably have released by the time this podcast comes out is right now in line graph,

01:12:56.200 | line graph is very stateful. You define a state for your graph and basically a run of an agent operates

01:13:01.480 | on a thread. It's very similar to threads in open AI's assistant API, but you can define the state

01:13:06.360 | however you want. You can define whatever keys, whatever values you want. Right now,

01:13:09.880 | they're all persistent for a single thread. We're going to add the ability to persist that between

01:13:14.200 | threads. So then if you basically want to scope a memory to a user ID or to an assistant or to an

01:13:20.040 | organization, then you can do that. And practically what that means is you can write to that channel

01:13:25.080 | whatever you want, and then that can be read in other threads. We're not making any kind of like

01:13:29.240 | claims around what the shape of memory is, right? You can write kind of like what you want there.

01:13:33.400 | I still think it's like so early on and we see people needing a lot of control over that. And so

01:13:38.760 | I think this is our current best thought. This is what we're doing around memory at the moment. It's

01:13:43.160 | basically extending the state to beyond a thread level. I feel like there's a trade-off between

01:13:47.800 | you know, complexity and control, right? For example, like Notion is more complex than Google Docs, but

01:13:52.680 | if you use it well, then it gives you more capability, right? And it's like different tool

01:13:57.560 | might suit different applications or scenarios or whatever. Yeah. We should make more good tools,

01:14:03.320 | I guess. My quick take is when I started writing about the AI engineer, this was kind of vaguely in

01:14:08.760 | my head, but like this is basically the job. Everything outside the LLM is the AI engineer

01:14:13.800 | that the researcher is not going to do. This basically maps to LLMOS. I would add in the

01:14:20.040 | code interpreter, the browser and the other stuff, but yeah, this is mostly it. Yeah, those are the

01:14:25.800 | I mean, those are the tools, yeah. Those are the external environment, which is a small box at the bottom.

01:14:30.040 | So then having this like reasonable level of confidence, like I know what things are,

01:14:34.600 | then I want to break it. I want to be like, okay, like what's coming up that's going to blindside me

01:14:38.520 | completely. And it really is maybe like omni-model where like everything in, everything out. And like,

01:14:45.080 | does that change anything? Like if you scale up models like a hundred times more, does that change

01:14:49.160 | anything? That's actually a great, great question. I think that's actually the last paragraph of the

01:14:54.520 | paper that's talking about this. I also got asked this question when I was interviewing with OpenAI.

01:14:59.880 | Please tell us how to pass OpenAI interviews.

01:15:05.240 | Is any of this still true if, you know, if you 100x everything, if we make the model much better?

01:15:11.160 | My longer answer to this, you should just refer to the last paragraph of the paper, which is like a more

01:15:16.600 | prepared, longer answer. I think the short answer is understanding is always better. It's like a way of

01:15:22.280 | understanding things. Like the solid experiment that I write at the end of the paper is, imagine you have

01:15:27.800 | GPT-10, which is really good. Like it doesn't even need a chain of thought, right? Just input,

01:15:32.520 | output, just stick to stick, right? It doesn't even need to do browsing or whatever, or maybe it still

01:15:37.560 | needs some tools, but let's say like, it's really powerful. Like then I think even in that point, I think

01:15:43.400 | something like Koala is still useful if we want to do some neuroscience on GPT-10. It's like kind of doing

01:15:48.440 | human kind of neuroscience, right? Which model actually could it be inspectable?

01:15:52.840 | Yeah. Like you want to expect what is episodic memory? What is the decision-making module? What

01:15:56.360 | is the, it's kind of like dissecting the human brain, right? And you need some kind of prior

01:16:00.600 | kind of framework to help you do this kind of discovery.

01:16:05.000 | Cool. Just one thing I want to highlight from your work, we don't have to go into it,

01:16:08.520 | it's a Tao Bench. Oh yeah, we should definitely cover this.

01:16:12.520 | Yeah. I'm a big fan of Simulity of AI. We had a summer of Simulity of AI.

01:16:16.600 | Another term we're trying to coin hasn't stuck, but I'm going to keep at it.

01:16:19.960 | I'm really glad you covered my zero citation work. I'm really happy.

01:16:23.160 | No, zero citation work. Now it's one, now it's one. First citation.

01:16:26.520 | It's me, it's me right now. We just cited it here, so that counts.

01:16:31.160 | It's like one citation. Does it show on Google?

01:16:32.840 | We'll write a paper about this episode. One citation, one citation.

01:16:36.520 | Let's go. Last time I checked, it's still zero.

01:16:39.400 | It's awesome. Okay.

01:16:41.960 | This one was funny because you have agents interact with like LM simulated person. So it's like actually

01:16:47.960 | just another agent, right? So it's like agent simulating with other agents. This has always

01:16:52.920 | been my thing with startups doing agents. I'm like, one day there's going to be training grounds for

01:16:58.760 | companies to train agents that they hire. Actually, Singapore is the first country to build the cyber

01:17:03.880 | range for cyber attack training. And I think you'll see more of that. So what was the inspiration there?

01:17:09.160 | Most of these models are bad at it, which is great. You know, we have some room for, I think the best

01:17:14.280 | model is for all at like 48% average. So there's a lot of room to go. Yeah. Any fun stories from there,

01:17:20.760 | directions that you hope that people take?

01:17:22.600 | Yeah. First, I think shout out to Sierra, which is this a very good startup, which was funded by

01:17:30.200 | Brad Taylor and Clay Beaver. And Sierra is a startup doing conversational AI. So what they do is they

01:17:37.240 | they build agents for businesses. Like suppose you have a business and you have a customer service,

01:17:42.920 | we want to automate that part. And then it becomes very interesting because it's very different from coding or

01:17:49.000 | web agent or whatever people are doing, because it's more about how can you do simple things reliably.

01:17:54.680 | It's not about, you know, can you sample a hundred times and you find one good mass proof or kill

01:17:59.240 | solution. It's more about you chat with a hundred different users on very simple things. Can you be

01:18:04.040 | robust to solve like 99% of the time? Right. And then we find there's no really good benchmark around this.

01:18:11.640 | So that's one thing. I guess another thing is obviously this kind of customer service kind

01:18:15.960 | of domain. Previously, there are some benchmarks, but they all have their limitations. And I think

01:18:21.480 | you want the task to be kind of hard and you want user simulation to be real. We don't have that until

01:18:28.120 | LLM. So data sets from 10 years ago, like either just have trajectories, conversating with humans,

01:18:34.440 | or they have very fake kind of simulators. I think right now is a good opportunity to, if you really just

01:18:39.400 | care about this task of customer service, then it's a good opportunity because now you have LLMs to

01:18:44.200 | simulate humans. But I think a more general motivation is we don't have enough agent benchmarks

01:18:48.760 | that target this kind of robustness, reliability kind of standpoint. It's more about, you know,

01:18:53.560 | code or web. So this is a very good addition to the landscape. If you have a model that can simulate

01:19:00.200 | the persona, like the user, the right way, shouldn't the model also be able to accomplish the task,

01:19:05.640 | right? If he has the knowledge of like what the person will want, then it means...

01:19:08.840 | This is a great question. I think it really stems from like asymmetry of information, right? Because

01:19:14.040 | if you think about the customer service agent, it has information that you cannot, you cannot access,

01:19:18.440 | right? Like the APIs it could call or, you know, the policies of internal company policy, whatever.

01:19:24.280 | And that I think very interesting for TallBench is like, it's kind of okay for the user to be kind of

01:19:29.720 | stupid. So you can imagine like there are failure cases, right? But I think in our case, as long as the

01:19:35.880 | user specified the need very clearly, then it's up to the agent to figure out, for example, what is the

01:19:42.040 | second cheapest flight from this to that under that constraint, very complicated reasoning involved. Like

01:19:47.080 | we shouldn't require users to be able to solve those things. They should just be able to clearly express

01:19:53.000 | their need. But then if the task failed, then it's up to the agent. That makes the evaluation much easier.

01:19:59.000 | Awesome. Anything else?

01:20:00.440 | I have one last question for Harrison, actually.

01:20:02.600 | Oh, no, that's not this podcast.

01:20:04.360 | You can't do it.

01:20:04.920 | I mean, there are a lot of questions around AI right now, but I feel like perhaps the biggest question is

01:20:12.360 | application. Because if we have great application, we have super app, whatever, that keeps the whole thing

01:20:16.840 | going, right? Obviously, we have problems with infra, with chip, with transformer, with whatever,

01:20:22.600 | S4, a lot of stuff. But I do think the biggest question is application. And I'm curious, like,

01:20:28.200 | from your perspective, like, is there any things that are actually already kind of working, but people

01:20:33.160 | don't know enough? Or like, is there any like promising application that you're seeing so far?

01:20:37.880 | Okay, so I think one big area where there's clearly been success is in customer support,

01:20:43.160 | both companies doing that as a service, but also larger enterprises doing that and building that

01:20:48.040 | functionality in inside, right? There's a bunch of people doing coding stuff. We've already talked

01:20:53.560 | about that. I think that's a little bit, I wouldn't say that's a success yet, but there's a lot of

01:20:58.760 | excitement and stuff there. One thing that I've seen more of recently, I guess the general category would

01:21:04.760 | be like research style agents, specific things recently would be like, I've seen a few like AISDR

01:21:11.080 | companies pop up, where they basically do some data enrichment, they get a company name, they go out,

01:21:16.440 | find like, you know, funding. What is SDR? Sales Development Rep. It's an entry level job title in B2B SaaS.

01:21:22.760 | Yeah. So I don't know, I know. The PhD mind cannot comprehend.

01:21:29.400 | And so I'd classify that under the general area of kind of like researchy style agents, I think like

01:21:36.760 | legal falls in this as well. I think legal is, yeah, they're a pretty good domain for this.

01:21:43.240 | I wonder how good hardware is doing. There was some debate, but they raised a lot of money. So who

01:21:49.480 | knows? I'd say those are, those are a few of the categories that jump to mind. Like entry type kind

01:21:54.600 | of research. On the topic of applications though, the thing that I think is most interesting in this

01:21:58.840 | space right now is probably all the UXs around these apps and the different things besides chat that might

01:22:04.200 | come out. I think two that I'm really interested in, one for the idea of this AISDR. I've seen a bunch of

01:22:10.600 | them do it in kind of like a spreadsheet style view where you have like, you know, 10 different companies

01:22:16.280 | or hundreds of different companies and five different attributes you want to run up and then each

01:22:20.040 | cell is an agent. And I guess the good, the good thing about this is like, you can already use the

01:22:23.720 | first couple of rules of spreadsheet as a few shot example or whatever. There's so many good things about it. Yeah. You can, you can test it out on a few. It's a great way for humans to run things in batch, which I don't, it's a great interface for that.

01:22:24.200 | It's still kind of elusive to do this kind of like PhD kind of research, but I think those kind of entry type research where it's more repetitive and it should be automated. And then the other UX I'm really, really interested is, is kind of like when you have agents running in the background, how can they, like ambient style agents, how can they reach out to you? So I think like as an example of this, I have an email assistant,

01:22:54.180 | um, that runs in the background, it treasers all my emails and it tries to respond to them. And then when it needs my input, like, Hey, do you want to do this podcast? It reaches out to me. It sends me a message.

01:23:03.180 | Oh, you actually, Oh, you, you have it. It is live.

01:23:05.180 | Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. I use it for all my emails.

01:23:07.180 | Thank you agent.

01:23:08.180 | Well, we did Twitter. I don't have that.

01:23:09.180 | Did you write it with Lanchain?

01:23:11.180 | Yeah.

01:23:12.180 | I will open source it at some point.

01:23:14.180 | Lengraph or Lanchain?

01:23:15.180 | Yeah. I want both.

01:23:16.180 | Yeah. Both.

01:23:17.180 | So at this point, Lengraph for the orchestration, Lengchain for the integrations with the different models.

01:23:23.180 | I'm curious how the low code kind of direction is going right now. Are people, we talked about this. Oh, yeah. It's not low code.

01:23:29.180 | Lengraph is not low code.

01:23:31.180 | So you can cut this.

01:23:32.180 | No, no, no, no, no. People, people will tune in just for this.

01:23:35.180 | Well, it actually, it actually has to do with UXs as well. So like probably sums back to this idea of, I think like what it means to build with AI is changing. Like I still really, really strongly believe that developers will be a core kind of like part of this, largely because we see like, you need a lot of control over these agents to get them to work reliably. But there's also very clearly components that you don't need to be a developer for prompting is kind of like the most obvious one. With Lengraph, one of the things that you don't need to be a developer for prompting is kind of like the most obvious one.

01:23:59.180 | With Lengraph, one of the things that we added recently was like a Lengraph studio. So it's, we called it kind of like an IDE for agents. You point it to your code file where you have your graph defined in code. It spins up a representation of the graph. You can interact with it there. You can test it out. We fucked it up to kind of like a persistence layer. So you can do time travel stuff, which I think is another really cool UX that I first saw in Devon and was, yeah.

01:24:21.180 | Devon's time travel is good.

01:24:23.180 | The UX for Devon in general, I think you said it, but that was the novel. That was the best part. But to the low code, no code part, the way that I think about it is you probably want to have your cognitive architecture defined in code.

01:24:35.980 | A decision making procedure.

01:24:37.660 | Yes. But then there's parts within that that are prompts or maybe configuration options, like something to do with RAG or something like that. We've seen that be a popular configuration option.

01:24:48.340 | So is it useful for programmers more or is it for like people who cannot program? I guess if you cannot program, it's still very complicated for them.

01:24:55.340 | It's useful for both. I think like we see it being useful for developers right now, but then we also see like there's often teams building this, right? It's not one person. And so I think there's this handoff where the engineer might define the cognitive architecture. They might do some initial prompt engineering.

01:25:08.340 | It's easier to communicate to the product manager.

01:25:10.340 | It's easier to show them what's going on and it's easier to let them control it and maybe they're doing the prompting. And so, yeah, I think what the TLDR is like what it means to build is changing.

01:25:19.340 | And also like UX is UX in general is interesting, whether it's for how to build these agents or for how to use them as end consumers. And there might also be overlap as well. And it's so early on and no one knows anything.

01:25:30.340 | But I think UX is one of the most exciting spaces to be innovating in right now.

01:25:34.340 | Let's do ACI. Yeah.

01:25:35.340 | Okay.

01:25:36.340 | Yeah.

01:25:37.340 | That's another theme that we cover on the pod. We had the first AI UX meetup and we're trying to get that going. It's not a job. It's just people just tinkering.

01:25:46.340 | Well, thank you guys so much for indulging us.

01:25:49.340 | Yeah, that was amazing.

01:25:50.340 | Yeah, thank you.

01:25:51.340 | Harrison, you're amazing as a co-host. We'd love to have you back. Like, that was awesome.

01:25:53.340 | I just try to listen to you guys for inspiration and stuff.

01:25:57.340 | It's actually really scary to have you as a listener because I don't want to misrepresent. Like, I talk about 100 companies, right? And God forbid I get one of them wrong and, you know.

01:26:06.340 | I'm sure all of them listen as well. Not to add pressure.

01:26:09.340 | Yeah, thank you so much. It's a pleasure to have you on. And you had one of the most impactful PhDs in this sort of AI wave. So I don't know how you do it, but I'm excited to see what you do at OpenAI.

01:26:21.340 | Thank you.

01:26:22.340 | Thank you.

01:26:23.340 | Thank you.

01:26:24.340 | Thank you.

01:26:26.340 | Thank you.

01:26:27.340 | Thank you.

01:26:28.340 | you

01:26:28.840 | you

01:26:29.340 | you

01:26:29.840 | you

01:26:30.340 | you

01:26:30.840 | you

01:26:31.340 | you

01:26:31.840 | you

Language Agents: From Reasoning to Acting — with Shunyu Yao of OpenAI, Harrison Chase of LangGraph

Chapters