Back to Index

Language Agents: From Reasoning to Acting — with Shunyu Yao of OpenAI, Harrison Chase of LangGraph


Chapters

0:00 Introductions
3:16 The ReAct paper
9:09 Early applications of ReAct in LangChain
14:15 Discussion of the Reflexion paper
19:35 Tree of Thoughts paper and search algorithms in language models
24:21 SWE-Agent and SWE-Bench for coding benchmarks
36:21 CoALA: Cognitive Architectures for Language Agents
42:24 Agent-Computer Interfaces (ACI) and tool design for agents
46:24 Designing frameworks for agents vs humans
50:52 UX design for AI applications and agents
58:53 Data and model improvements for agent capabilities
76:10 TauBench
80:09 Promising areas for AI

Transcript

Hi, everyone. Welcome to the Latent Space podcast. This is Alessio, partner and CTO in residence at Decibel Partners. And I'm joined by my co-host Swyx, founder of Smol AI. Hey, and today we have a super special episode. I actually always wanted to take a selfie and go like, you know, POV, you're about to revolutionize the world of agents, because we have two of the most awesome high-end agents in the house.

So first, we're going to welcome back Harrison Chase. Welcome. Excited to be here. What's new with you recently in sort of like the 10, 20 second recap? LangChain, LangSmith, LangGraph, pushing on all of them. Lots of cool stuff related to a lot of the stuff that we're going to talk about today, probably.

Yeah, we'll mention it in there. And the Celtics won the title. And the Celtics won the title. You got that going on for you. I don't know. Is that like floorball, handball, baseball? Basketball. Basketball, basketball. Patriots aren't looking good, though, so that's-- And then Shunyu, you've also been on the pod, but only in like a sort of oral paper presentation capacity.

But welcome officially to the Latent Space pod. Yeah, I've been a huge fan. So thanks for the invitation. Thanks. Well, it's an honor to have you on. You're one of like-- you're maybe the first PhD thesis defense I've ever watched in like this AI world, because most people just publish single papers.

But every paper of yours is a banger. So congrats. Thanks. Yeah, maybe we'll just kick it off with, you know, what was your journey into using language models for agents? I like that your thesis advisor, I didn't catch his name, but he was like, you know-- Karthik. Yeah, it's like this guy just wanted to use language models, and it was such a controversial pick at the time.

Right. The full story is that in undergrad, I did some computer vision research, and that's how I got into AI. But at the time, I feel like, you know, you're just composing all the GAN or 3D perception or whatever together, and it's not exciting anymore. And one day, I just see this transformer paper, and that's really cool.

But I really got into language model only when I entered my PhD and met my advisor Karthik. So he was actually the second author of GPT-1 when he was like a visiting scientist at OpenAI. With Alec Radford? Yes. Wow. That's what he told me. It's like, back in OpenAI, they did this GPT-1 together, and Ilya just said, Karthik, you should stay, because we just solved language.

But apparently, Karthik is not fully convinced. So he went to Princeton, started his professorship, and I'm really grateful. So he accepted me as a student, even though I have no prior knowledge in NLP. And, you know, we just met for the first time, and he's like, you know, what do you want to do?

And I'm like, you know, you have done those text game things. That's really cool. I wonder if we can just redo them with language models. And that's how the whole journey began. Awesome. And that was GPT-2, was that at the time? Yes, that was 2019. Yeah. Way too dangerous to release.

Yeah. And then I guess the first work of yours that I came across was React. Sure. Which was a big part of your defense. But also, Harrison, when you came on the podcast last year, you said that was one of the first papers that you saw when you were getting inspired for Langchain.

So maybe give a recap of why you thought it was cool, because you were already working in AI and machine learning. And then, yeah, you can kind of like enter the paper formally. But what was that interesting to you specifically? Yeah, I mean, I think the interesting part was using these language models to interact with the outside world in some form.

And I think in the paper, you mostly deal with Wikipedia, and I think there's some other data sets as well, but the outside world is the outside world. And so interacting with things that weren't present in the LLM and APIs and calling into them and thinking about, and yeah, the React reasoning and acting and kind of like combining those together and getting better results.

I had been playing around with LLMs, been talking with people who were playing around with LLMs. People were trying to get LLMs to call into APIs, do things. And it was always, how can they do it more reliably and better? And so this paper was basically a step in that direction.

And I think really interesting and also really general as well. Like, I think that's part of the appeal is just how general and simple in a good way, I think the idea was. So that it was really appealing for all those reasons. Simple is always good. Yeah. Do you have a favorite part?

Because I have one favorite part from your PhD defense, which I didn't understand when I read the paper. But you said something along the lines, React doesn't change the outside or the environment, but it does change the inside through the context, putting more things in the context. You're not actually changing any of the tools around you to work for you, but you're changing how the model thinks.

And I think that was like a very profound thing when I, now that I've been using these tools for like 18 months, I'm like, I understand what you meant. But like to say that at the time you did the PhD defense was not trivial. Yeah. Another way to put it is like thinking can be an extra tool that's useful.

It makes sense. It checks out. Who would have thought? I think it's also more controversial within his world because everyone was trying to use RL for agents. Right. And this is like the first kind of zero gradient type approach. Yeah. I think the bigger kind of historical context is that we have these two big branches of AI, right?

So if you think about RL, right, that's pretty much the equivalent of agent at the time. And it's like agent is equivalent to reinforcement learning and reinforcement learning is equivalent to whatever game environment they're using, right? Atari games or Go or whatever. So you pretty much have, you know, a biased kind of like set of methodologies in terms of reinforcement learning that represents agents.

On the other hand, I think NLP is like a historical kind of subject. It's not really into agents, right? It's more about reasoning. It's more about solving those concrete tasks. And if you look at ACL, right, like each task has its own track, right? Summarization has a track. Question answering has a track.

So I think really it's about rethinking agents in terms of what could be the new environments that we can interact with. It's not just Atari games or whatever, video games, but also those text games or language games. And also thinking about, could there be like a more general kind of methodology beyond just designing specific pipelines for each NLP task?

That's like the bigger kind of context, I would say. Is there an inspiration spark moment that you remember? Or how did you come to this? We had Tri Dao on the podcast and you mentioned he was really inspired working with like systems people to think about flash attention. What was your inspiration?

Yeah, so actually before React, I spent the first two years of my PhD focusing on text-based games or in other words, text adventure games. It's a very kind of small kind of research area and quite ad hoc, I would say. And there are like, I don't know, like 10 people working on that at the time.

And have you guys heard of Zork 1, for example? So basically the idea is you have this game and they have text observations. Like you see a monster, you see a dragon. You're eaten by a grue. Yeah, you're eaten by a grue. And you have actions like kill the grue with a sword or whatever, right?

And that's like a very typical setup of a text game. So I think one day after, you know, I've seen all the GPT-3 stuff. I just think, think about, you know, how can I solve the game? Like why those AI, you know, machine learning methods are pretty stupid, but we are pretty good at solving the game relatively, right?

So for the context, the predominant method to solve this text game is obviously reinforcement learning. And the idea is you just try out RL in those games for like millions of steps and you kind of just overfit to the game. But there's no language understanding at all. And I'm like, why can I solve the game better?

And it's kind of like, because we think about the game, right? Like when we see this very complex text observation, like you see a grue and you might see a sword, you know, in the right of the room and you have to go through the wooden door to go to that room.

You will think, you know, oh, I have to kill the monster, and to kill that monster, I have to get the sword, and to get the sword, I have to go there, right? And this kind of thinking actually helped us kind of zero-shot the game. And it's like, why don't we also enable the text agents to think?

And that's kind of the prototype of React. And I think that's actually very interesting because the prototype, I think, was around November of 2021. So that's even before like chain of thought or whatever came up. So we did a bunch of experiments in the text game, but it was not really working that well.

Like those text games are just too hard. I think today it's still very hard. Like if you use GPT-4 to solve it, it's still very hard. So the change came when I started the internship at Google and apparently Google cared less about text games, they cared more about what's more practical.

So pretty much I just reapplied the idea, but to more practical kind of environments like Wikipedia or like simpler text games like ALFWorld, and it just worked. It's kind of like you first have the idea and then you try to find the domains and the problems to demonstrate the idea, which is, I would say, different from most of the AI research.

But it kind of worked out for me in that case. Yeah. For Harrison, when you were implementing React, what were people applying React to in the early days? I think the first demo we did probably had like a calculator tool and a search tool. So like general things, we tried to make it pretty easy to write your own tools and plug in your own things.

And so this is one of the things that we've seen in LangChain is people who build their own applications generally write their own tools. Like there are a few common ones. I'd say like the three common ones might be like a browser, a search tool and a code interpreter.

But then other than that- The LLM OS. Yep. Yeah, exactly. It matches up very nicely with that. And we just, we actually just redid like our integrations docs page. And if you go to the tools section, we like highlight those three. And then there's a bunch of like other ones.

Yeah. And there's such a long tail of other ones, but in practice, like when people go to production, they generally have their own tools or maybe one of those three, maybe some other ones, but like very, very few other ones. So yeah, I think the first demo was, was a search and the calculator one.

And there's, what's the data set? HotpotQA. Yeah. Oh, so there's that one. And then there's like the celebrity one by the same author, I think. Olivia Wilde's boyfriend squared. Yeah. To the power of 0.23. Yeah. Right, right, right. There's, I'm forgetting the name of the author, I was like, we're going to over optimize for Olivia Wilde's boyfriend and it's going to change next year or something.

There's a few data sets kind of like in that vein that require multi-step kind of like reasoning and thinking. So one of the questions I actually had for you in this vein, like the React paper, there's a thing, I think there's a few things in there, or at least when I think of that, there's a few things that I think of.

There's kind of like the specific prompting strategy. Then there's like this general idea of kind of like thinking and then taking an action. And then there's just even more general idea of just like taking actions in a loop. Today, like obviously language models have changed a lot. We have tool calling, the specific prompting strategy probably isn't used super heavily anymore.
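
For illustration, here is a minimal sketch of that "tool calling in a loop" pattern, written against an OpenAI-style chat completions client. The model name, the `web_search` tool, and its stub implementation are placeholder assumptions, not anything from the paper or from LangChain.

```python
import json
from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()

def web_search(query: str) -> str:
    """Placeholder tool; a real agent would call an actual search API here."""
    return f"(pretend search results for: {query})"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web for up-to-date information.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def run_agent(question: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOLS
        )
        msg = response.choices[0].message
        if not msg.tool_calls:          # no more actions: the model is answering
            return msg.content
        messages.append(msg)            # keep the assistant turn with its tool calls
        for call in msg.tool_calls:     # execute each requested action
            args = json.loads(call.function.arguments)
            observation = web_search(**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": observation,   # feed the observation back into context
            })
    return "Stopped after max_steps without a final answer."
```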

Would you say that like the concept of React is still used though? Or like, do you think that tool calling and running tool calling in a loop, is that React in your mind? I would say like it's like more implicitly used than explicitly used. To be fair, I think the contribution of React is actually twofold.

So first is this idea of, you know, we should be able to use tools in a very general way. Like there should be a single kind of general method to handle interaction with various environments. I think React is the first paper to demonstrate the idea. But then I think later, there's Toolformer or whatever, and this became like a trivial idea.

But I think at the time, that's like a pretty non-trivial thing. And I think the second contribution is this idea of what people call like inner monologue or thinking or reasoning or whatever to be paired with tool use. I think that's still not trivial because if you look at the default function calling or whatever, like there's no inner monologue.

And in practice, that actually is important, especially if the tool that you use is pretty different from the training distribution of the language model. So I think that's like, those are the two like main things that are kind of inherited. Yeah. On that note, I think OpenAI even recommended like when you're doing tool calling, it's sometimes helpful to put like a thought field in the tool along with all the actual required arguments and then have that one first.
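
A minimal sketch of what that recommendation can look like: the tool's JSON schema gets an extra `thought` property listed (and required) first, so the model writes its reasoning before committing to the real arguments. The `book_flight` tool and its fields are made-up examples, not a documented OpenAI schema.

```python
# Hypothetical tool definition with a leading "thought" field, in the
# function-calling schema style discussed above.
book_flight_tool = {
    "type": "function",
    "function": {
        "name": "book_flight",
        "description": "Book a flight for the user.",
        "parameters": {
            "type": "object",
            "properties": {
                # The model fills this in first, giving it room to reason
                # before producing the actual arguments.
                "thought": {
                    "type": "string",
                    "description": "Step-by-step reasoning about which flight to book and why.",
                },
                "origin": {"type": "string"},
                "destination": {"type": "string"},
                "date": {"type": "string", "description": "YYYY-MM-DD"},
            },
            "required": ["thought", "origin", "destination", "date"],
        },
    },
}
```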

So it fills out that first and then, and that's, they've shown that that's yielded to kind of like better results. The reason I ask is just like the same concept is still alive and I don't know whether to call it like a React agent or not. Like I don't know what to call it.

Like I think of it as React, like it's the same ideas that were in the paper, but it's obviously a very different implementation at this point in time. And so I just don't know what to call it. I feel like people will sometimes think more in terms of different tools, right?

Because if you think about a web agent versus, you know, like a function calling agent and calling a Python API, you would think of them as very different. But in some sense, the methodology is the same. It depends on how you view them, right? And I think people will tend to think more in terms of the environment and the tools rather than the methodology.

So, or in other words, I think the methodology is kind of trivial and simple. So people will try to focus more on the different tools, but I think it's good to have like a single principle underlying all of those things. Yeah. How do you see the surface of React getting molded into the model?

So a function calling is a good example of like, now the model does it. What about the thinking? Now, most models that you use kind of do chain of thought on their own. They kind of produce steps. Do you think that more and more of this logic will be in the model?

Or do you think like the context window will still be the main driver of like reasoning and thinking? I think it's already default, right? Like you do some chain of thought and you do some tool call, like the cost of adding the chain of thought is kind of relatively low compared to other things.

So it's not hurting to do that. And I think it's already kind of common practice, I would say. Is this a good place to bring in either tree of thought or reflection? Your pick. Maybe reflection like to respect the time order, I would say. Yeah. Any backstory as well, like, you know, the people involved with NOAA and like the Princeton group, I think, you know, we talked about this offline, but people don't understand how these research pieces come together and this ideation.

Yeah. I think Reflexion is mostly Noah's work. Like I'm more in like an advising kind of role. So the story is, I don't remember the time, but like one day we just see this preprint that's like Reflexion, an autonomous agent with memory or whatever. And it's kind of like an extension to React, which uses this self-reflection.

I'm like, oh, somehow it became very popular. Yeah. And Noah reached out to me. It's like, do you want to collaborate on this and make this from like an arXiv preprint to something more solid, you know, like a conference submission? I'm like, sure. We started collaborating and we remain good friends today.

And, uh, I think another interesting backstory is like Noah was, I think, contacted by OpenAI at the time. It's like, this is pretty cool. Do you want to just work at OpenAI? And I think Sierra also reached out at the same time. It's like, this is pretty cool.

Do you want to work at Sierra? And, and I think Noah chose, uh, Sierra, but it's pretty cool because he was like, still like a second year undergrad and he's a very smart kid. Based on one paper? Based on one paper. Yeah. Oh my God. He's done some other research based on like programming languages or chemistry or whatever, but I think that's the paper that got the attention of OpenAI and Sierra, right?

Okay. For those who haven't gone too deep on it, the way that you presented the insight of React, like, can you do that also for Reflexion? Yeah. I think one way to think of Reflexion is that the traditional idea of reinforcement learning is you have a scalar reward and then you somehow back-propagate the signal of the scalar reward to the rest of your neural network through whatever algorithm, like policy gradient or A2C or whatever.

And if you think about the real life, you know, most of the reward signal is not scalar. It's like your boss told you, you know, you should have done a better job in this, but a good job on that or whatever. Right. It's not like a scalar reward, like 29 or something.

I think in general, humans deal more with, you know, language-based reward, or you can say language feedback, right? And the way that they deal with language feedback also has this kind of back-propagation kind of process, right? Because you start from this feedback, you did a good job on job B, and then you reflect, you know, what you could have done differently to make it better.

And you kind of change your prompt, right? Basically, you change your prompt, how to do job A, and how to do job B, and then you do the whole thing again. So it's really like a pipeline of language where, instead of gradient descent, you have something like text reasoning to replace those gradient descent algorithms.
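
As a rough sketch of that "text reasoning instead of gradient descent" loop: the agent tries, an evaluator returns language feedback rather than a scalar, and the model turns that feedback into a reflection that is prepended to the next attempt. The `llm` helper and `evaluate` function are placeholder assumptions; this illustrates the idea, not the Reflexion codebase.

```python
def llm(prompt: str) -> str:
    """Placeholder for a call to any chat model."""
    raise NotImplementedError

def evaluate(task: str, attempt: str) -> tuple[bool, str]:
    """Placeholder evaluator returning (success, language feedback),
    e.g. unit-test output for code or a rubric critique for writing."""
    raise NotImplementedError

def reflexion_loop(task: str, max_trials: int = 3) -> str:
    reflections: list[str] = []   # long-term "verbal" memory carried across trials
    attempt = ""
    for _ in range(max_trials):
        lessons = "\n".join(reflections)
        attempt = llm(f"Task: {task}\nLessons from earlier attempts:\n{lessons}\nSolve the task.")
        success, feedback = evaluate(task, attempt)
        if success:
            return attempt
        # Instead of back-propagating a scalar reward, ask the model to turn
        # the language feedback into a self-reflection that updates the prompt.
        reflection = llm(
            f"Task: {task}\nYour attempt:\n{attempt}\nFeedback:\n{feedback}\n"
            "In a few sentences, what should you do differently next time?"
        )
        reflections.append(reflection)
    return attempt
```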

I think that's one way to think of reflection, yeah. One question I have about reflection is, how general do you think the algorithm there is? And so for context, I think at Langchain and at other places as well, we found it, like we and others found it pretty easy to implement, kind of like React in a standard way.

You plug in any tools, and it kind of works off the shelf, you know, can get it up and running. I don't think we have like an off the shelf kind of like implementation of reflection and kind of like the general sense. I think the concepts like absolutely we see used in different kind of like specific cognitive architectures, but I don't think we have one that comes off the shelf.

I don't think any of the other frameworks have one that comes off the shelf. And I'm curious whether that's because it's not general enough, or it's complex as well, because it also requires running it more times. Maybe that's not feasible. Like, I'm curious how you think about the generality and complexity.

Why? Yeah, should we have one that comes off the shelf? I think the algorithm is general in the sense that it's just as general as like other algorithms, if you think about like policy gradient or whatever, but it's not applicable to all tasks, just like other algorithms, right? So you can argue PPO is also general, but it works better for this set of tasks, but not on that set of tasks.

I think it's the same situation for reflection. And I think a key bottleneck is the evaluator, right? Basically, you need to have a good sense of the signal. So for example, like if you're trying to do a very hard reasoning task, say mathematics, for example, and you don't have any tools, right?

It's operating in this chain of thought setup. Then reflection will be pretty hard because in order to reflect upon your thoughts, you have to have a very good evaluator to judge whether your thought is good or not. But that might be as hard as solving the problem itself or even harder.

The principle of self-reflection is probably more applicable if you have a good evaluator, for example, in the case of coding, right? Like if you have those errors, then you can just reflect on that and how to solve the bug and stuff. So I think another criterion is that it depends on the application, right?

Like if you have this latency or whatever need for like an actual application with an end user, right? And the user wouldn't let you, you know, do like two hours of tree of thought or reflection, right? You need something as soon as possible. So in that case, maybe this is better to be used as like a training time technique, right?

You do those reflection or tree of thought or whatever, you get a lot of data and then you try to use the data to train your model better. And then in test time, you still use something as simple as React, but that's already improved. And if you think of the Voyager paper as like a way to store skills and then reuse them, like how would you compare like this reflective memory, and at what point is it just doing RAG on the memory versus you want to start to fine-tune some of it, or like what's the next step once you get a very long kind of reflective corpus?

Yeah. So I think there are two questions here. The first question is what type of information or memory are you considering, right? Is it like semantic memory that stores, you know, knowledge about the world, or is it the episodic memory that stores, you know, trajectories or behaviors, or is it like more of a procedural memory?

Like in Voyager's case, like skills or code snippets that you can use to do actions, right? That's one dimension. And the second dimension is obviously how you use the memory, either retrieving from it, using it in the context, or fine-tuning it. I think the Cognitive Architectures for Language Agents paper has a good kind of categorization of all the different combinations.
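
A hedged sketch of those two dimensions (what kind of memory, and how it is used), loosely in the spirit of the CoALA categorization; the class name, fields, and the trivial keyword retrieval are illustrative placeholders, not an implementation from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Illustrative split of memory types along the lines discussed above."""
    semantic: list[str] = field(default_factory=list)    # facts / knowledge about the world
    episodic: list[str] = field(default_factory=list)    # past trajectories and experiences
    procedural: list[str] = field(default_factory=list)  # skills, e.g. reusable code snippets

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # The second dimension is how the memory is used: retrieved into context
        # (as here, with a crude keyword match standing in for embeddings),
        # or distilled into the model weights via fine-tuning.
        pool = self.semantic + self.episodic + self.procedural
        ranked = sorted(pool, key=lambda m: -sum(w in m for w in query.split()))
        return ranked[:k]
```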

And of course, which way you use depends on the concrete application and the concrete need and the concrete task. But I think in general, it's good to think of those like systematic dimensions and all the possible like options there. Yeah. Harrison also has LangMem. I think you did a presentation at my meetup and I think you've done it at a couple other venues as well.

User state, semantic memory and append only state. I think kind of maps to what you just said. What is LangMem? Can I give it like a quick... One of the modules of LangChain for a long time has been something around memory. And I think like, you know, we're still obviously figuring out what that means as is everyone kind of in the space.

But one of the experiments that we did and one of the proof of concepts that we did was, technically what it was is you would basically create threads. You'd push messages to those threads in the background. We process the data in a few ways. One, we like put it into some semantic store.

That's the semantic memory. And then two, we do some like extraction and reasoning over the memories to extract. And we let the user define this, but like extract key facts or anything that's of interest to the user. Those aren't exactly trajectories. They're maybe more closer to the, to the procedural memory.
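
A rough sketch of the flow being described here, purely as an illustration and not the actual LangMem API: messages are pushed to threads, indexed into a semantic store, and distilled into user-defined key facts. The class name, the `embed` and `extract_facts` callables are hypothetical stand-ins.

```python
from collections import defaultdict

class MemoryService:
    """Illustrative stand-in for the described flow: threads of messages are
    ingested, indexed into a semantic store, and distilled into key facts."""

    def __init__(self, embed, extract_facts):
        self.threads = defaultdict(list)       # thread_id -> raw messages
        self.semantic_store = []                # (embedding, message) pairs
        self.facts = []                         # extracted key facts of interest
        self.embed = embed                      # e.g. a call to an embeddings API
        self.extract_facts = extract_facts      # e.g. an LLM extraction prompt

    def push(self, thread_id: str, message: str) -> None:
        self.threads[thread_id].append(message)
        # 1) semantic memory: index the raw message for later retrieval
        self.semantic_store.append((self.embed(message), message))
        # 2) extraction and reasoning: pull out key facts the user cares about
        self.facts.extend(self.extract_facts(message))
```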

Is that how you'd think about it or classify it? Or is it like about like knowledge about the world or is it more like how to do something? It's reflections basically. So in generative worlds, Generative Agents, generative Smallville. Yeah. The Smallville one. So the way that they had their memory there was they had the sequence of events and that's kind of like the raw events that happened.

But then every so often, it did like run some synthesis over those events for the LLM to insert its own memory basically. And it's that type of memory. I don't know how that would be classified. I think of that more as the semantic memory, but to be fair, I think it's just one way to think of that.

But whether it's semantic memory or procedural memory or whatever memory, that's like an abstraction layer. But in terms of implementation, you can choose whatever implementation for whatever memory. So they're totally kind of orthogonal. I think it's more of a good way to think of things because like from the history of cognitive science and, you know, cognitive architecture and how people study even neuroscience, right?

That's the way people think of how the human brain organizes memory. And I think it's more useful as a way to think of things, but it's not like for semantic memory, you have to do this kind of like way to retrieve or fine-tune. And for procedural memory, you have to do that.

Like, I think those are totally orthogonal kind of dimensions. How much background do you have in kind of like cognitive sciences and how much do you model some of your thoughts on? That's a great question actually. And, uh, I think one of the undergrad kind of influences for my like follow-up research is I was doing like an internship at MIT's computational cognitive science lab with, like, you know, Josh Tenenbaum, and he's like a very famous cognitive scientist.

And I think a lot of, a lot of his ideas still influence me today. Like, uh, thinking of things in like computational terms and getting interested in language and a lot of stuff, you know, or even like developmental psychology kind of stuff, I think is still influencing me today. As a developer that tried out LangMem, the way I view it is just, it's a materialized view of a stream of logs.

And if anything, that's just useful for context compression. I don't have to use the full context to run it over everything, but also it's kind of debuggable. Like if it's wrong, I can show it to the user, user can manually fix it and I can carry on. That's a really good analogy.

Yeah, I like that. I'm going to steal that. Please, please. You know I'm bullish on memory databases. Um, I guess, Tree of Thoughts? Um, yeah, Tree of Thoughts. I mean, you had a... I feel like I'm reliving the defense again in like a podcast format. Yeah, no, I mean, it was a, you had a banger.

Well, this is the one where you're already successful and would just like, you know, highlight the glory. It was really good. You mentioned that since thinking is kind of like taking an action, you can use like search algorithms to search over thinking. So just like you would use tree search to like find the next thing.

And the idea behind Tree of Thoughts is like you generate all these possible outcomes and then find the best tree to get to the end. Maybe back to the latency question. You can't really do that if you have to respond in real time. So what are maybe some of the most helpful use cases for things like this?
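
A minimal sketch of that generate-then-search idea, assuming a `propose(state)` helper that asks the model for candidate next thoughts and a `score(state)` helper that rates partial solutions; this is a simplified breadth-first variant for illustration, not the paper's exact algorithm.

```python
def tree_of_thoughts(problem: str, propose, score, depth: int = 3, beam: int = 5) -> str:
    """Breadth-first search over partial 'thought' sequences.

    propose(state) -> list[str]: candidate next thoughts from the LLM
    score(state)   -> float:     LLM- or heuristic-based value of a partial solution
    """
    frontier = [problem]                       # each state is the problem plus thoughts so far
    for _ in range(depth):
        candidates = []
        for state in frontier:
            for thought in propose(state):     # expand each state with possible next steps
                candidates.append(state + "\n" + thought)
        # keep only the most promising partial solutions (the "find the best tree" part)
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return max(frontier, key=score)            # best complete thought sequence found
```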

Where have you seen people adopt it where the high latency is actually worth the wait? For things that you don't care about latency, obviously, for example, if you're trying to do math, right? If you're just trying to come up with a proof. But I feel like one type of task is more about searching for a solution, right?

You can try a hundred times, but if you find one solution, that's good. Like for example, if you're finding a math proof or if you're finding a good code to solve a problem or whatever. And I think another type of task is more like reacting, right? For example, if you're doing customer service, you're like a web agent booking a ticket for like a end user, right?

Those are more like kind of reactive kind of tasks, right? You have to, or more real-time tasks, right? You have to do things fast. They might be easy, but you have to do it reliably. And you care more about like, can you solve 99% of the time out of a hundred, but for the type of search type of tasks, then you care more about, you know, can I find one solution out of a hundred?

So it's kind of symmetric and different. Do you have any data or like intuition from your user base? Like what's the split of these type of use cases? Like how many people are doing more reactive things and how many people are experimenting with like kind of deep, long search?

I would say like React's probably like the most popular. I think there's aspects of reflection that get used; tree of thought probably like the least. So there's a great tweet from Jason Wei, I think now your colleague, and he was talking about like prompting strategies and how he thinks about them.

And I think like the four things that he kind of had was like, one, how easy is it to implement? How much compute does it take? How many tasks does it kind of like solve? And how much does it improve on those tasks? And, and I'd add a fifth, which is like, how likely is it to be relevant when the next generation of models kind of comes out?

And I think if you look at kind of like those axes, and then you look at, like, uh, you know, React, Reflection, Tree of Thought, it tracks that the ones that score better are used more. Like React is pretty easy to implement; Tree of Thoughts is pretty hard to implement.

Um, right. Like the amount of compute, yeah, a lot more for tree of thought. The tasks and how much it improves, I don't have amazing visibility there, but I think like, if we're comparing react versus tree of thought, react just dominates the first two axes so much. So my question around that was going to be like, how do you think about these prompting strategies, cognitive architectures, whatever you want to call them? When you're thinking of them, what are the axes that you're judging them on in your head when you're deciding whether it's a good one or a less good one?

Right. Right. I think there is a difference between like a prompting method versus like a research in the sense that like for research, you don't really even care about does it actually work on practical tasks or does it help whatever. I think it's more about like the idea or the, the principle, right?

Like what is the direction that you're like unblocking and whatever. And I think for the, for like an actual prompting method to solve like a concrete problem, I would say like simplicity is very important because the simpler it is, the less decision you have to make about it. And it's easier to design, it's easier to propagate and it's easier to, to do stuff.

So always try to be as simple as possible. And I think latency obviously is important. Like if you can do things fast and you don't want to do things slow. And I think in terms of the actual prompting method to use for a particular problem, I'm a, I think we should all be in the minimalist kind of camp, right?

You should try the minimum thing and see if it works and if it doesn't work and there's absolute reason to add something, then you add something, right? Like there's an absolute reason that you need some tool, then you should add the tool thing. If there's absolute reason to add reflection or whatever, you should add that.

Otherwise, if chain of thought can already solve something, then you don't even need to use any of that. Yeah. Or if just better prompting can solve it. Like, you know, you could add a reflection step or you could make your instructions a little bit clearer and it's a lot easier to do that.

I think another interesting thing is like, I personally have never done those kind of like weird tricks. I think all the prompts that I write are kind of like just talking to a human, right? It's like, I don't know, like, like I never say something like, your grandma is dying and you have to solve it too.

I mean, those are cool, but I feel like we should all try to solve this in like a very intuitive way. Like, just like talking to your co-worker and that, that should work 99% of the time. That's my personal take. Yeah. The problem with how language models, at least in the sort of GPT-3 era, was that they over-optimized to some sets of tokens in sequence.

So like when reading the Kojima et al. paper, the "let's think step by step" one, like he tried a bunch of them and they had wildly different results. It should not be the case, but it is the case. And hopefully we're getting better there. Yeah. I think it's also like a timing thing in the sense that if you think about this whole line of language models, right?

Like at the time it was just like a text generator. We don't have any idea how it's going to be used. Right. And obviously at the time you will find all kinds of weird issues because it's not trained to do any of that. Right. But then I think we have this loop where once we realize chain of thought is important or agent is important or tool using is important.

What we see is today's language models are heavily optimized towards those things. So I think in some sense they become more reliable and robust over those use cases. And, uh, you don't need to do as much prompt engineering tricks anymore to, to solve those things. I feel like in some sense, I feel like prompt engineering even is like a slightly negative word at the time because it refers to all those kind of weird tricks that you have to apply.

But I think we don't have to do that anymore. Like given today's progress, you should just be able to talk to it like a coworker, and if you're clear and concrete and, you know, being reasonable, then it should do reasonable things for you. Yeah. Yeah. The way I put this is, uh, you should not be a prompt engineer because it is the goal of the big labs to put you out of a job.

You should just be a good communicator. Like if you're a good communicator to humans, you should be a good communicator to models. And I think that's the key though, because oftentimes people aren't good communicators to these language models, and that is a very important skill and that's still messing around with the prompt.

And so it depends what you're talking about when you're saying prompt engineer. But do you think it's like very correlated with like, are they like a good communicator to humans? You know, it's like it may be, but I also think I would say on average, people are probably worse at communicating with language models than to humans.

That's for sure. Right now, at least, because I think we're still figuring out how to do it. You kind of expect it to be magical and there's probably some correlation, but I'd say there's also just like people are worse at it right now than talking to you. We should, we should, uh, make it like a, you know, like an elementary school class or whatever, like how to talk to.

Uh, yeah. I'm very proud of that. Yeah. Before we leave the topic of, uh, trees and searching, not specific about Q*, but there's a lot of questions about MCTS and this combination of tree search and language models. And I just had to get in a question there about how seriously should people take this?

Again, I think it depends on the tasks, right? So MCTS was magical for Go, but it's probably not as magical for robotics, right? So I think right now, the problem is not even that we don't have good methodologies. It's more about we don't have good tasks. It's also very interesting, right?

Because if you look at my citations, like obviously the most cited are React, Reflexion, and Tree of Thoughts, all those are methodologies. But I think an equally important, if not more important, line of my work is like benchmarks and environments, right? Like WebShop or SWE-bench or whatever. And I think in general, what people do in academia that I think is not good is they choose a very simple task, like ALFRED, and then they apply overly complex methods to show they improve 2%.

I think like you should probably match, you know, the level of complexity of your task and your method. Right. I feel like the tasks are kind of far behind the methods in some sense, right? Because we have some good test-time approaches, like whatever, React or Reflexion and Tree of Thoughts, and there are many, many more complicated test-time methods afterwards.

But on the benchmark side, we have made a lot of good progress this year, last year. But I think we still need more progress towards that, like better coding benchmark, better web agent benchmark, better agent benchmark, not even for web or code. I think in general, we need to catch up with, with tasks.

What are the biggest reasons in your mind why, why it lags behind? I think incentive is one big reason. Like if you see, you know, all the method papers are cited like a hundred times more than the task papers. And also making a good benchmark is actually quite hard.

And it's almost like a different set of skills in some sense, right? I feel like if you want to build a good benchmark, you need to be like a good kind of product manager kind of mindset, right? You need to think about why people should use your benchmark, why it's challenging, why it's useful.

If you think about like a PhD going into like a school, right? The prior skill that they're expected to have is more about, you know, can they code this method and can they just run experiments and solve that? I think building a benchmark is not the typical prior skill that we have, but I think things are getting better.

I think more and more people are starting to build benchmarks and people are saying that it's like a way to get more impact in some sense, right? Because like, if you have a really good benchmark, a lot of people are going to use it. But if you have a super complicated test time method, like it's very hard for people to use it.

Are evaluation metrics also part of the reason, like for some of these tasks that we might want to ask these agents or language models to do, is it hard to evaluate them, so it's hard to get an automated benchmark? Obviously with SWE-bench, you can, and with coding, it's, it's easier, but.

I think that's part of the, like the skill set thing that I mentioned, because I feel like it's like, it's like a product manager because there are many dimensions and you need to strike a balance and it's really hard, right? If you want to make something very easy to auto-grade, like automatically gradable, easy to grade or easy to evaluate, then you might lose some of the realness or practicality.

Or like it might be practical, but it might not be as scalable, right? For example, if you think about text games, humans have pre-annotated all the rewards and all the language is real. So it's pretty good on the auto-gradable dimension and the practical dimension, you know, it's actual English so it's practical, but it's not scalable, right?

Like it takes like a year for like experts to, to, to build that game. So it's not really that scalable. And I think part of the reason that SWE-bench is so popular now is it kind of hits the balance between the three dimensions, right? Easy to evaluate and being actually practical and being scalable.

Like if I were to criticize some of my prior work, I think WebShop, like, it's my initial attempt to get into benchmark work. And I was trying to do a good job striking the balance, but obviously we made it auto-gradable and it's really scalable. But then I think the practicality is not as high as actually just using GitHub issues, right?

Because you're just creating those like synthetic tasks. Are there other areas besides coding that jumped to mind as being really good for being auto-gradable? Maybe mathematics. Yeah, classic. Yeah. Do you have thoughts on AlphaProof, the, the new DeepMind paper? I think it's pretty cool. Yeah. I think it's more of a, you know, it's more of like a confidence boost or like a, sometimes, you know, the work is not even about, you know, the technical details or the methodology that it chooses or the, the concrete results.

I think it's more about a signal, right? Yeah. Existence proof, like, yeah, yeah. It's like, it can be done. This direction is exciting. It kind of encourages people to work more towards that direction. I think it's more like a boost of confidence, I would say. Yeah. So we're going to focus more on agents now.

And, you know, we were a special, all of us have a special interest in coding agents. I would consider Devin to be the sort of biggest launch of the year as far as AI startups go. And you guys in the Princeton group worked on SWE-agent alongside of SWE-bench. Tell us the story about SWE-agent.

Sure. So I think it's kind of like a trilogy. It's actually a series of three works now. So actually the first work is called InterCode, but it's not as, it's not as famous, I know. And the second work is called SWE-bench. And the third work is called SWE-agent. And I was just really confused why nobody was working on coding.

You know, it's like a year ago, but I mean, not everybody's working on coding, obviously, but a year ago, like literally nobody was working on coding. I was really confused. And the people that were working on coding are, you know, trying to solve HumanEval in like a seq2seq way.

There's no agent, there's no chain of thought, there's no anything. They're just, you know, fine-tuning the model and improving some points and whatever. Like I was really confused because obviously coding is the best application for agents because it's auto-gradable. It's super important. You can make everything like an API or code action, right?

So I was confused and I collaborated with some of the students in Princeton and we have this work called Intercode. And the idea is, first, if you care about coding, then you should solve coding in an interactive way, meaning more like a Jupyter notebook kind of way than just writing a program and seeing if fails or succeeds and stop, right?

You should solve it in an interactive way. That's because that's exactly how humans solve it, right? If I tell you to, you know, write a program like next token, next token, next token and stop and never do any edits and you cannot really use any terminal or whatever tool, it doesn't make sense, right?

And that's the way people are solving coding at the time. Basically like sampling a program from a language model without chain of thought, without tool call, without reflection, without anything. So first point is we should solve code coding in a very interactive way. And that's a very general principle that applies for various coding benchmarks.

But also I think you can make a lot of the agent tasks kind of like interactive coding. If you have Python and you can call any package, then you can literally also browse internet or do whatever you want, like control a robot or whatever. So that seems to be a very general paradigm.
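
A minimal sketch of that interactive-coding framing, where the agent's single action space is executing Python and reading back what happened, like a notebook cell. The `llm` helper is a placeholder, and a real setup would sandbox the execution.

```python
import contextlib
import io
import traceback

def execute(code: str, namespace: dict) -> str:
    """Run one code cell and return its output or traceback, notebook-style."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)               # NOTE: unsandboxed, illustration only
    except Exception:
        buffer.write(traceback.format_exc())
    return buffer.getvalue()

def interactive_coding_agent(task: str, llm, max_turns: int = 10) -> str:
    namespace: dict = {}
    transcript = f"Task: {task}\n"
    for _ in range(max_turns):
        code = llm(transcript + "\nWrite the next Python cell, or 'DONE' if finished.")
        if code.strip() == "DONE":
            break
        observation = execute(code, namespace)  # feedback the model can react to and edit against
        transcript += f"\n>>> {code}\n{observation}"
    return transcript
```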

But obviously I think a bottleneck is at the time we're still doing, you know, very simple tasks like HumanEval or whatever coding benchmark people proposed. Like they were super hard in 2021, like 20%, but they're like 95% already in 2023. So obviously the next step is we need a better benchmark.

And Carlos and John, who are the first authors of SWE-bench, I think they came up with this great idea that we should just scrape GitHub and solve whatever human engineers are solving. And I think it's actually pretty easy to come up with this idea. And I think in the first week, they already made a lot of progress, like they scraped GitHub and they made all of it, but then there's a lot of pain for infra work and whatever, you know. I think the idea is super easy, but the engineering is super hard.

And I feel like that's a very typical signal of a good work in the AI era now. I think also, I think the filtering was challenging because if you look at open source PRs, like a lot of them are just like, you know, fixing typos. I think it's challenging.

And to be honest, we didn't do a perfect job at the time. So if you look at the recent blog post with OpenAI, like we improved the filtering so that, you know, it's more solid. So I think OpenAI was just like, look, this is a thing now, we have to fix this.

Like these students just like, you know, rushed it. It's a good convergence of interest for me. Yeah. Was that tied to you joining OpenAI or like, was that just unrelated? It's a coincidence for me, but it's a good coincidence. There is a history of anytime a big lab adopts a benchmark, they fix it because, you know, otherwise it's a broken benchmark.

Yeah. So naturally, once we proposed SWE-bench, the next step is to solve it, right? But I think the typical way you solve something now is you collect some training samples or you design some complicated agent method, and then you try to solve it, right? Either a super complicated prompt or you build a better model with more training data.

But I think at the time we realized that even before those things, there's a fundamental problem with the interface or the tool that you're supposed to use, right? Because that's like a ignored problem in some sense, right? Like what your tool is or how that matters for your task.

So what we found concretely is that if you just use the text terminal off the shelf as a tool for those agents, there's a lot of problems, right? For example, if you edit something, there's no feedback. So you don't know whether your edit is good or not. And that makes the agent very confused and makes a lot of mistakes.

And there are a lot of like small problems, you would say. And well, you can try to do prompt engineering and improve that, but it turns out to be actually very hard. And we realized that the interface design is actually a very omitted kind of part of agent design.

So we did this SWE-agent work. And the key idea is just even before you talk about, you know, what the agent is, you should talk about what the environment is and you should make sure that the environment is actually friendly to whatever agent you're trying to apply, right?

And that's the same idea for humans, right? Like, a text terminal is good for some tasks like git pull or whatever, right? But it's not good if you want to look at, you know, a browser and whatever, right? So also like, you know, a browser is a good tool for some tasks, but it's not a good tool for other tasks.

We need to talk about how to design an interface in some sense where we should treat agents as our customers, right? It's like when we treat humans as a customer, we design human computer interfaces, right? We design those beautiful desktops or browser or whatever, so that it's very intuitive and easy for humans to use.

And this whole great subject of HCI is all about that. I think now the research idea of SWE-agent is just we should treat agents as our customers and we should do, like, you know, ACI. So what are the tools that a SWE-agent should have, or a coding agent in general?

For SWE-agent, it's like a modified text terminal, which kind of adapts to a lot of the patterns of language models to make it easier for language models to use. For example, now for edit, instead of having no feedback, it will actually have a feedback of, you know, actually here you introduced like a syntax error and you should probably want to fix that.

And there's an indentation error there. And that makes it super easy for the model to actually do that. And there are other small things like how exactly you write arguments, right? Like, do you want to write like a multi-line edit or do you want to write a single-line edit? I think it's more interesting to think about the development process of ACI rather than the actual ACI for like a concrete application, because I think the general paradigm is very similar to HCI and psychology, right?
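
A rough sketch of the kind of edit command being described: the agent replaces a line range, the file is immediately re-checked, and any syntax problem comes back as feedback instead of silence. The function name, 1-indexed arguments, and feedback wording are illustrative, not SWE-agent's actual interface, and the check here only covers Python files.

```python
import ast
from pathlib import Path

def edit_file(path: str, start: int, end: int, new_text: str) -> str:
    """Replace lines [start, end] (1-indexed) and report the result to the agent."""
    lines = Path(path).read_text().splitlines(keepends=True)
    lines[start - 1:end] = [line + "\n" for line in new_text.splitlines()]
    candidate = "".join(lines)
    try:
        ast.parse(candidate)                   # cheap syntax check before committing the edit
    except SyntaxError as e:
        return (f"Edit NOT applied. Your change introduces a syntax error on "
                f"line {e.lineno}: {e.msg}. Please fix the edit and try again.")
    Path(path).write_text(candidate)
    window = "".join(lines[max(0, start - 4):end + 3])
    return f"Edit applied. The file now looks like this around your change:\n{window}"
```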

Basically for how people develop HCI is they do behavior experiments on humans, right? I do A/B tests, right? Like which interface is actually better? And I do those behavior experiments, kind of like psychology experiments, on humans, and I change things. And I think what's really interesting for me for this SWE-agent paper is we can probably do the same thing for agents, right?

We can do A/B test for those agents and do behavior tests. And through the process, we not only invent better interfaces for those agents, that's the practical value, but we also better understand agents. Just like when we do those A/B tests, we do those HCI, we better understand humans.

During those ACI experiments, we actually better understand agents. And that's pretty cool. Besides kind of like that A/B testing, what are other kind of like processes that people can use to think about this in a good way? That's a great question. And I think SWE-agent is like an initial work.

And what we do is kind of the naive approach, right? You just try some interface and you see what's going wrong and then you try to fix that. You do this kind of iterative fixing. But I think what's really interesting is there will be a lot of future directions that are very promising if we can apply some of the HCI principles more systematically into the interface design.

I think that would be a very cool interdisciplinary research opportunity. You talked a lot about kind of like agent computer interfaces and interactions. What about like human to agent kind of like UX patterns? I'm like, yeah, curious for any thoughts there that you might have. That's a great question.

And in some sense, I feel like prompt engineering is about human agent interface. But I think there can be a lot of interesting research done about it. So prompting is about how humans can better communicate with the agent. But I think there could be interesting research on how agents can better communicate with humans, right?

When to ask questions, how to ask questions, like what's the frequency of asking questions. And I think those kind of stuff could be very cool research. Yeah. I think some of the most interesting stuff that I saw here was also related to coding with Devon from Cognition. And they had the three or four different panels where you had like the chat, the browser, the terminal, and I guess the code editor as well.

There's more now. There's more. Okay. I'm not up to date. Yeah. I think they also did a good job on ACI. They did. Yeah. Yeah. I think that's the main learning I have from Devin. They cracked that. They actually, there was no foundational planning breakthrough. The planner is like actually pretty simple, but ACI is what they broke through on.

I think making the tool good and reliable is probably like 90% of the whole agent. Once the tool is actually good, then the agent design can be much, much simpler. On the other hand, if the tool is bad, then no matter how much you put into the agent design planning or search or whatever, it's still going to be trash.

Yeah. I'd argue the same, same with like context and instructions. Like, yeah, go hand in hand. On the tool, how do you think about the tension of like, for both of you, I mean, you're building a library. So even more for you, the tension between making now a language or a library that is like easy for the agent to grasp and write versus one that is easy for like the human to grasp and write, because you know, the trend is like more and more code gets written by the agent.

So why wouldn't you optimize the framework to be as easy as possible for the model versus for the person? I think it's possible to design interface that's both friendly to humans and agents. But what do you think? We haven't thought about that from the perspective, like we're not trying to design LangChain or LangGraph to be friendly, but I mean, I think to be friendly for agents to write.

But I mean, I think we see this with like, I saw some paper that used TypeScript notation instead of JSON notation for tool calling and it got a lot better performance. So it's definitely a thing. I haven't really heard of anyone designing like a syntax or a language explicitly for agents, but there's clearly syntaxes that are better.

I think function calling is a good example where it's like a good interface for both human programmers and for agents, right? Like for developers, it's actually a very friendly interface because it's very concrete and you don't have to do prompt engineering anymore. You can be very systematic. And for models, it's also pretty good, right?

Like it can use all the existing coding content. So I think we need more of those kinds of designs. I will mostly agree and then I'll slightly disagree in terms of this, which is like whether designing for humans also overlaps the designing for AI. So Malta Ubo, who's the CTO of Vercel, who is creating basically JavaScript's, you know, competitor to LangChain, they're observing that basically like if the API is easy to understand for humans, it's actually much easier to understand for LMs.

For example, because there are no overloaded functions. They don't behave differently under different contexts. They do one thing and they always work the same way. It's easy for humans. It's easy for LLMs. And like that makes a lot of sense. And obviously adding types is another one. Like type annotations only help give extra context, which is really great.

So that's the agreement. And then a disagreement is that I've actually, when I use structured output to do my chain of thought, I have found that I change my field names to hint to the LLM of what the field is supposed to do. So instead of saying topics, I'll say candidate topics.

And that gives me a better result because the LLM was like, "Ah, this is just a draft thing I can use for chain of thought." And instead of like summaries, I'll say topic summaries to link the previous field to the current field. So like little stuff like that, I find myself optimizing for the LLM where I as a human would never do that.
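
For example, a hedged sketch of that field-naming trick using Pydantic-style structured output: the schema and descriptions are made up, but the renames (`candidate_topics`, `topic_summaries`) mirror the hinting being described.

```python
from pydantic import BaseModel, Field

class TopicExtraction(BaseModel):
    # "candidate_topics" hints that this field is scratch space / chain of thought,
    # not the polished final answer.
    candidate_topics: list[str] = Field(
        description="Draft list of possible topics; brainstorm freely here first."
    )
    # "topic_summaries" links this field back to the previous one by name.
    topic_summaries: list[str] = Field(
        description="One-sentence summary for each topic kept from candidate_topics."
    )
```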

Interesting. It's kind of like the way you optimize the prompt, it might be different for humans and for machines. You can have a common ground that's both clear for humans and agents, but to improve the human performance versus improving the agent performance, they might move in different directions.

There's a lot more use of metadata as well, like descriptions, comments, code comments, annotations and stuff like that. Yeah. I would argue that's just you communicating to the agent what it should do. And maybe you need to communicate a little bit more than to humans because models aren't quite good enough yet.

But like, I don't think that's crazy. I don't think that's crazy. I will bring this in because it just happened to me yesterday. I was at the cursor office. They held their first user meetup and I was telling them about the LLMOS concept and why basically every interface, every tool was being redesigned for AIs to use rather than humans.

And they're like, "Why? Can't we just use Bing and Google for LLM search? Why must I use EXA?" Or what's the other one that you guys work with? Tavili. Tavili. Web Search API dedicated for LLMs. What's the difference to Bing API? Exactly. There weren't great APIs for search. Like the best one, like the one that we used initially in LinkChain was SERP API, which is like maybe illegal.

I'm not sure. And like, you know, and now they're like venture-backed companies. Shout out to DuckDuckGo, which is free. Yes. Yes. Yeah. I do think there are some differences though. I think you want, like, I think generally these APIs try to return small amounts of text information, clear legible field.

It's not a massive JSON blob. And I think that matters. I think like when you talk about designing tools, it's not only the, it's the interface in the entirety, not only the inputs, but also the outputs that really matter. And so I think they try to make the outputs.

They're doing ACI. Yeah. Yeah, absolutely. Really. Like there, there's a whole set of industries that are just being redone for ACI. It's weird. And so my, my simple answer to, to them was like the error messages. When you give error messages, they should be basically prompts for the LLM to take and then self-correct.

Then your error messages get more verbose actually than you normally would with a human. stuff like that. Like a little, honestly, it's not that big. Again, like, is this worth a venture backed industry? Unless you can tell us, but like, I think code interpreter, I think is a new thing.
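As a rough sketch of that error-messages-as-prompts idea, here is a hypothetical tool wrapper that, on failure, returns a verbose, instruction-style error string for the agent to read and retry on; the run_tool_for_agent helper and the toy divide tool are made up for illustration.

```python
import json

def run_tool_for_agent(tool, args: dict) -> str:
    """Run a tool; on failure, return a verbose, prompt-like error the agent can act on."""
    try:
        return str(tool(**args))
    except Exception as e:
        # For a human we'd keep this terse; for an LLM, longer and more
        # prescriptive error text tends to make self-correction easier.
        return (
            f"ERROR: calling `{tool.__name__}` with arguments {json.dumps(args)} "
            f"failed with: {type(e).__name__}: {e}\n"
            "Check that every required argument is present and correctly typed, "
            "fix the arguments, and call the tool again. Do not repeat the same call verbatim."
        )

# Hypothetical usage with a toy tool:
def divide(a: float, b: float) -> float:
    return a / b

print(run_tool_for_agent(divide, {"a": 1, "b": 0}))
```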

I hope so. We invested in E2B. So I think that's a very interesting point. If you're trying to optimize to the extreme, then obviously they're going to be different. For example, take the error messages very seriously. Right. The error for a language model, the longer, the better. But for humans, that will make them very nervous and very tired.

Right. But, but I guess that the point is more like, maybe we should try to find a co optimized common ground as much as possible. And then if we have divergence, then we should try to diverge. But it's more philosophical now. But I think like, part of it is like how you use it.

So Google invented the page rank because ideally you only click on one link, you know, like the top three should have the answer. But like with models, it's like, well, you can get 20. So those searches are more like semantic grouping in a way. It's like, for this query, I'll return you like 20, 30 things that are kind of good, you know?

So it's less about ranking and it's more about grouping. Another fundamental thing about ACI is the difference between humans' and machines' kind of memory limits. Right. So I think what's really interesting about this concept of ACI versus HCI is that from the interfaces that are optimized for them, you can kind of understand some of the fundamental characteristic differences between humans and machines, right?

Why, you know, if you look at find or whatever terminal command, you can only look at one thing at a time, and that's because we have a very small working memory. You can only deal with one thing at a time. You can only look at one paragraph of text at the same time.

So the interface for us is, by design, you know, a small piece of information, but more temporal steps. But for machines, that should be the opposite, right? You should just give them a hundred different results and they should just decide in context what's the most relevant stuff, and trade off context for temporal steps.

That's actually also better for language models because like the cost is smaller or whatever. So it's interesting to connect those interfaces to the fundamental kind of differences of those. When you said earlier, you know, we should try to design these to maybe be similar as possible and diverge if we need to.

I actually don't have a problem with them diverging now and seeing venture backed startups emerging now because we are different from machines, code AI, and it's just so early on, like they may still look kind of similar and they may still be small differences, but it's still just so early.

And I think we'll only discover more ways that they differ. And so I'm totally fine with them kind of like diverging early and optimizing for the... I agree. I think, I think it's more like, you know, we should obviously try to optimize human interface just for humans. We're already doing that for 50 years.

We should optimize agent interface just for agents, but we might also try to co-optimize both and see how far we can get. There's enough people to try all three directions. Yeah. There's a thesis I sometimes push, which is the sour lesson as opposed to the bitter lesson: we're always inspired by human development, but actually AI develops its own path.

Right. We need to understand better, you know, what are the fundamental differences between those creatures. It's funny when really early on this pod, you were like, how much grounding do you have in cognitive development and human brain stuff? And I'm like, maybe that doesn't matter. And actually, so like I, in my original agent's blog post, I had a picture of the human brain and now it looks a lot more like a CPU.

The canonical picture of the LLM OS is kind of like a CPU with all the input and output going into it. And I think that that's probably the more scalable system. I think the problem with like a lot of like cognitive scientists, like is that... They think by analogy, right?

They think, you know, the only way to solve intelligence is through the human way. And therefore, they have a lot of criticism for whatever things are not cognitive or human. But I think a more useful way to use that knowledge is to think of it as just a reference point.

I don't think we should copy exactly what's going on with humans all the way, but I think it's good to have a reference point because this is a working example of how intelligence works. Yeah. And if you know all the knowledge and you compare them, I think that actually establishes more interesting insights, as opposed to just copying it or opposing it.

I think comparing is the way to go. I feel like this is an unanswerable question, but I'll just put it out there anyway. If we can answer this, I think it'd be worth a lot, which is, can we separate intelligence from knowledge? That's a very deep question, actually. And to have a little history background, I think that's really the key thesis at the beginning of AI.

If you think about Newell and Simon and all those symbolic AI people, right? Basically, they're trying to create intelligence by writing down all the knowledge. For example, they write a checkers program: basically, how you would solve checkers, you write down all the knowledge and then implement that.

And I think the whole thesis of symbolic AI is we should just be able to write down all the knowledge and that just creates intelligence. But that kind of fails. And I think really, I think a great like quote from Hinton is, I think there are two approaches to intelligence.

One approach is let's deal with reasoning or thinking or knowledge, whatever you call that. And then let's worry about learning later. The other approach is let's deal with learning first. And then let's worry about, you know, whatever knowledge or reasoning or thinking later. And it turns out right now, at least like the second approach works and the first approach doesn't work.

And I think there might be something deep about it. Does that answer your question? Partially, I think Apple Intelligence might change that. Can you explain? If this year is the year of multi-modal models, next year is like the on-device year. And Apple Intelligence basically has hot-swappable capabilities, right? Like they have like 50 LoRAs that they swap onto a base model that does different tasks.
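A toy sketch of that hot-swapping idea, with everything invented for illustration (the class names, the adapter registry, the tasks): one frozen base model plus many small task adapters that are loaded on demand, rather than many full copies of the model.

```python
import numpy as np

class BaseModel:
    """Stands in for a frozen on-device base model."""
    def __init__(self, dim: int = 8):
        self.W = np.random.randn(dim, dim)  # frozen weights

    def forward(self, x: np.ndarray, adapter: "LoRAAdapter | None" = None) -> np.ndarray:
        W = self.W if adapter is None else self.W + adapter.delta()
        return W @ x

class LoRAAdapter:
    """Low-rank update A @ B for one task; small enough to swap quickly."""
    def __init__(self, dim: int = 8, rank: int = 2):
        self.A = np.random.randn(dim, rank) * 0.01
        self.B = np.random.randn(rank, dim) * 0.01

    def delta(self) -> np.ndarray:
        return self.A @ self.B

# Registry of hot-swappable capabilities keyed by task name (hypothetical).
adapters = {"summarize": LoRAAdapter(), "rewrite": LoRAAdapter(), "classify": LoRAAdapter()}

base = BaseModel()
x = np.random.randn(8)
for task in ("summarize", "rewrite"):
    y = base.forward(x, adapters[task])  # swap the adapter, keep the base weights
    print(task, y[:2])
```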

And that's the first instance that we have of the separation of intelligence and knowledge. And I think that's a really interesting approach. Obviously, it's not exactly knowledge. It's just more about context. Yeah, it's more about context. It's like you can have the same model deployed to 10 million phones with 10 million contexts and see if...

For on-device deployment, I think it's super important. Like if you can boil it down, like I actually have most of my problems with AI News when the model thinks it knows more than it knows, because it combines knowledge and intelligence. I want it to have zero knowledge whatsoever. And it only has the ability to parse the things I tell it.

I kind of get what you mean. I feel like it's more like memorization versus kind of just generalization in some sense. Yeah, raw ability to understand things. You don't want it to know like facts like, you know, who is the president of the United States. They should be able to just call internet and use a tool to solve it.

Yes, because otherwise it's not going to call the tool if it thinks it knows. I kind of get what you mean. That's why it's valuable. Okay. So if that's the case, I guess my point is I don't think it's possible to fully separate them because like those kind of intelligence kind of emerges.

Like even for humans, you can't just operate in an intelligent mode without knowledge, right? Throughout the years, you learn how to do things and what things are. And it's very hard to separate those things, I would say. Yeah. But what if we could? As a meta strategy, I'm trying to keep a stack-ranked list of, like, what are the 10 most valuable questions in here.

You can think of knowledge as a cache of intelligence in some sense. If you have like wikihow.com saying you should tie a shoelace using the following steps, you can think of that piece of text as like a cache to intelligence, right? I guess that's kind of like reflection anyway, right?

It's like you're storing these things as memory, and then you put them back. So without the knowledge, you wouldn't have the intelligence to do it better. Right. Right. I had a couple of things. So we had Thomas Scialom from Meta to talk about Llama 3.1. They mentioned... Then he started talking about Llama 4.

Yeah. I was like, whoa, okay. Great. And he said it's going to be like really focused on agents. I know you talked before about, you know, is next token prediction enough to get to like problem solving. If you say you've got the perfect environment, they've got the terminal, they've got everything.

And if you were to now move down to the model level and say, I need to make a model that is better for like agentic workflow, where would you start? I think it's data. I think it's data because like changing architecture now is too hard and we don't have a good, better alternative solution now.

I think it's mostly about data and agent data is obviously hard because people just write down the final result on the internet. They don't write down how they like step by step, how they do the thing on the internet. Right. So naturally it's easier for models to learn chain of thought than tool call or whatever agent self reflection or search, right?

Like even if you do a search, you won't write down all the search processes on the internet. You would just write down the final result. And I think it's a great thing that Llama 4 is going to be more towards agents. That means, I mean, that should mean a lot for a lot of people.

Yeah. In terms of data, you think the right data looks like trajectories, basically, of a ReAct agent, or of... Yeah. I mean, I have a paper called FireAct. Do you still remember? No. Okay. Tell us. Okay. That's one of the not famous papers, I guess. It's not even on your website.

Are we supposed to find it? It's on Google Scholar. I've got it pulled up. Okay. It's been rejected a couple of times. No. But now it's on Latent Space. Yeah. Everybody will find it. Anyway, I think the idea is very simple.

Like you can try a lot of different agent methods, right? React, chain of thought, reflection or whatever. And the idea is very simple. You just have very diverse data, like tasks, and you try very diverse agent methods and you filter all the correct solutions and you train a model on all of that.

And then the benefit is that you should somehow learn, you know, how to use simpler methods for simpler tasks and harder methods for harder tasks. I guess the problem is we don't have diverse, high quality tasks. That's the bottleneck for. So it's going to be trained on all code.
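A rough sketch of that recipe, loosely in the spirit of what the FireAct idea describes (all function names here are hypothetical): run several agent methods over diverse tasks, keep only the trajectories that reach a correct answer, and use those as fine-tuning data.

```python
import json
import random

METHODS = ["react", "chain_of_thought", "reflection"]  # diverse agent methods

def run_agent(task: dict, method: str) -> dict:
    """Hypothetical: run one agent method on one task, return its trajectory and answer."""
    answer = random.choice([task["answer"], "wrong"])  # stand-in for a real rollout
    return {"method": method, "steps": [f"{method} step 1", f"{method} step 2"], "answer": answer}

def build_finetune_data(tasks: list[dict]) -> list[dict]:
    data = []
    for task in tasks:
        for method in METHODS:
            traj = run_agent(task, method)
            if traj["answer"] == task["answer"]:  # keep only correct trajectories
                data.append({"prompt": task["question"], "completion": json.dumps(traj)})
    return data

tasks = [{"question": "2 + 2 = ?", "answer": "4"}, {"question": "Capital of France?", "answer": "Paris"}]
print(len(build_finetune_data(tasks)), "training examples")
```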

Yeah. Let's hope we have more better benchmarks. Yeah. In school, that kind of pissed me off a little bit. When you're doing like a homework, like exercises for like calculus, like they give you the problem, then they give you the solution. Right. But there's no way without the professor or the TA to get like the steps.

Right. So actually how you got there. Right. And so I feel like because of how schools are structured, we never wrote this thing down. But I feel like if you went to every university and it's like, write down step by step the solution to every single problem in the set and make it available online, that's a start to make this dataset better.

I think it's also because, you know, it might be hard for you to write down your chain of thought, even when you're solving the same problem, because part of that is conscious and in language, but maybe even part of that is not in language. And okay. So a funny side story.

So when I wrote down the React thing, I would tell it to my Google manager, right? Like, you know, what we should do, we should just hire, you know, as many people as possible and let them use Google and write down exactly what they think, what they search on the internet.

and we train them all on that. But I think it's not, not trivial to, to write down your thoughts. Like if you're not trained to do that, if I tell you like, okay, write down what you're thinking right now, it's actually not as trivial a task as you might imagine.

It might be more of a diffusion process than the autoregressive process. But I think the problem is starting with the experts, you know, because there's so much like muscle memory and what you do once you've done it for so long, that's why we need to like get everybody to do it.

And then you can see it like separate knowledge and intelligence. The simplest way to achieve AGI is literally just record the reactions of every human being and just put them together. You know, like, what have you thought about? What have you done? Let's say on the computer, right?

Imagine, like, a thought experiment. Like you write down literally everything you think about and everything you do on the computer and you record them and you train on all the successful trajectories by some metric of success. I think that should just lead us to AGI. My first work of fiction in like 10 years was exploring that idea of what if you recorded everything and uploaded yourself?

I'm pretty science-based, like, you know, but probably the most spiritual woo-woo thing about me is I don't think that would lead to consciousness or AGI, just because there's something in there, like, there's a soul, you know, that is the unspeakable quality. That is, if it emerges through scale.

We can simulate that for sure. What do you think about the role of few-shot prompting for some of these like agent trajectories? That was a big part of the original react paper, I think. And as we talk about showing your, your work and how you think like. I feel like it's becoming less used than zero-shot prompting.

What's your observation? I'm pretty bullish on it, to be honest, for a few reasons. Like one, I think it can maybe help for more complex things, but then also two, like it's a form of prompting and prompting is just communicating with the model what you want it to do.

And sometimes it's easier to just show the model what you want it to do than write out detailed kind of like instructions. I think the practical reason it has become less used is because the agent kind of scaffold become more complex or the tasks you're trying to solve is becoming more complex.

It's harder to annotate few-shot examples, right? Like in the chain of thought era, you just write down three lines of things. It's very easy to write down a few-shot or whatever, but I feel like annotation difficulty has become harder. I think also one of the reasons that I'm bullish on it is because I think it's a really good way to achieve kind of like personalization.

Like if you can collect this through feedback automatically, you can then use that in the system at a user level or something like that. Again, the issue with that is, for more complex things it doesn't really work. Probably more useful is an automatic prompt, right? If you have some way to retrieve examples and put them into the prompt in an automatic pipeline.
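A minimal sketch of that automatic few-shot pipeline, with every name invented: retrieve the most similar past successful examples (for instance collected from feedback) and prepend them to the prompt at run time.

```python
from difflib import SequenceMatcher

# Hypothetical store of past (input, good output) pairs collected from feedback.
EXAMPLE_STORE = [
    {"input": "Summarize this meeting note", "output": "Bullet-point summary ..."},
    {"input": "Draft a reply declining the invite", "output": "Polite decline email ..."},
]

def retrieve_examples(query: str, k: int = 2) -> list[dict]:
    """Cheap similarity ranking; a real system would likely use embeddings instead."""
    scored = sorted(
        EXAMPLE_STORE,
        key=lambda ex: SequenceMatcher(None, query, ex["input"]).ratio(),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str) -> str:
    shots = retrieve_examples(query)
    shot_text = "\n\n".join(f"Input: {ex['input']}\nOutput: {ex['output']}" for ex in shots)
    return f"{shot_text}\n\nInput: {query}\nOutput:"

print(build_prompt("Draft a reply accepting the podcast invite"))
```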

But I feel like if you're a human, you're manually writing now, I feel like more people will try to use zero-shot. Yeah, but if you're doing a consumer product, you're probably not going to ask user-facing people to write a prompt or something like that. But I think the thing that you brought up is also really relevant here where you can collect feedback from a user, but it's usually at the top level.

And so then if you have three or four or five or however many LLM calls down below, how do you disperse that feedback to those? And I don't have an answer for that. There's another super popular paper that you authored called CoALA: Cognitive Architectures for Language Agents. I'm not sure if it's super popular.

People speak highly of it here within my circles. So shout out to Charles Frye, who told me about it. I think that was our most popular webinar. I think Harrison promoted the paper a lot. Thanks to him. I'll read what you wrote in here and then you can just kind of go take it wherever.

CoALA organizes agents along three key dimensions: their information storage divided into working and long-term memories, their action space divided into internal and external actions, and their decision-making procedure, which is structured as an interactive loop with planning and execution. By the way, I think your communication is very clear, so kudos on how you do these things.
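Reading that back as code, here is a loose, non-authoritative sketch of those three dimensions; the class and method names are mine, not anything from the paper: working and long-term memory, internal and external actions, and a simple plan-then-act decision loop.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Memory:
    working: list[str] = field(default_factory=list)          # current episode / context window
    long_term: dict[str, str] = field(default_factory=dict)   # persistent store (weights, code, vectors)

@dataclass
class Agent:
    memory: Memory
    internal_actions: dict[str, Callable]   # e.g. think, retrieve, write to memory
    external_actions: dict[str, Callable]   # e.g. call a tool, act on the environment

    def decide(self, observation: str) -> tuple[str, str]:
        """Decision-making loop: record the observation, then pick one action."""
        self.memory.working.append(f"obs: {observation}")
        # Toy policy: think first if we have little context, otherwise act externally.
        if len(self.memory.working) < 3:
            return ("internal", "think")
        return ("external", "search")

agent = Agent(
    memory=Memory(),
    internal_actions={"think": lambda: "reasoning step", "retrieve": lambda: "memory lookup"},
    external_actions={"search": lambda: "web results"},
)
print(agent.decide("user asked a question"))
```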

take us through the sort of three components. And you also have this development diagram, which I think is really cool. I think it's figure one on your paper for people reading along. Normally people have input, LLM, output. Then they develop into language agents that takes an action into an environment and has observations.

And then they go into the CoALA architecture. Shout out to my co-first author, Ted, who made figure one. He's like, you know, the figure is really good. You don't even need color. You just... One of the motivations of CoALA is we're seeing those agents become really complicated. I think my personal philosophy is to try to make things as simple as possible, but obviously this field has become more complex as a whole, and it's very hard to understand what's going on.

And I think CoALA provides a very good way to understand things in terms of those three dimensions. And I think they are pretty first principles, because I think this idea of memory is pretty first principle if you think about where memory, where information is stored. And you can even think of the weights of the neural network as some kind of long-term memory, because that's also part of the information that is stored.

I think a very first principle way of thinking of agents is pretty much just a neural network plus the code to call and use the neural network. Obviously also maybe plus some vector store or whatever other memory modules, right? And thinking through that, then you immediately realize that the kind of long-term memory or the persistent information is first, the neural network, and second, the code associated with the agent that calls the neural network, and maybe also some other vector stores.

But then there's obviously another kind of storage of information that's shorter horizon, right? Which is the context window or whatever episode that people are using. Like you're trying to solve this task and information happens there. But once this task is solved, the information is gone, right? So I think it's very systematic and first principle to think about where information is and thinking, organizing them through categories and time horizon, right?

So once you have those information stores, then obviously for agent, the next thing is, what kind of action can you do? And that leads to the concept of action space, right? And I think one of the fundamental difference between language agents and the previous agents is that for traditional agents, if you think about Atari or video game, they only have like a predefined action space by the environment.

They only have external actions, right? Because they don't have complicated memory or information and kind of devices to do internal thinking. I think the contribution of React is just to point out that we can also have internal actions called thinking. And obviously if you have long-term memory, then you also have retrieval or writing or whatever.

And then third, once you have those actions, which action should you do? That's the problem of decision-making. And, uh, and the three parts should just fully describe our agent. We solved it. We have defined agents. Yeah, it's done. Does anything that you normally say about agents not fit in that framework?

Because you also get asked this question a lot. Um, I think it's very aligned. Um, if we think about a lot of the stuff we do, I'm just thinking out loud now, but a lot of the stuff we do on agents now is through LangGraph. LangGraph we would view as kind of like the code part of what defines some of these things.

Also defines part of the decision-making procedure. That's what I was thinking actually. Yeah. Yeah. And actually one analogy that I like there is like some of the code in part of LangGraph, and I'm actually curious what you think about this, but like, sometimes I say that like the LLMs aren't great at planning yet.

So we can help them plan by telling them how to plan in code, because that's very explicit and that's a good way of communicating how they should plan and stuff like that. That's a Devin playbook as well. What do you mean by that? Like giving them like a DFS algorithm or?

No, something like much simpler. Like you could tell agent in a prompt like, "Hey, every time you do this, you need to also do this and make sure to check this." Or you could just put those as explicit checks in kind of like the decision-making procedure or something like that.

Right. And the more complex it gets, I think the more we see people encoding that in code. And another way that I say this is like, all of life really is communication, right? And so you can do that through prompts or you can do that through code. And code is great at communicating things.
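A small sketch of that "encode the plan in code" point: instead of only trusting a prompt like "every time you do X, also check Y", the check lives as explicit control flow around the LLM call. The llm_draft_reply and contains_pii names are hypothetical stand-ins.

```python
def llm_draft_reply(email: str) -> str:
    """Stand-in for an LLM call that drafts a reply."""
    return f"Draft reply to: {email[:40]}..."

def contains_pii(text: str) -> bool:
    """Stand-in for a policy check you always want to run."""
    return "ssn" in text.lower()

def handle_email(email: str) -> str:
    draft = llm_draft_reply(email)
    # The 'plan' is enforced in code, not left to the prompt:
    # every draft must pass the PII check before it can be sent.
    if contains_pii(draft):
        draft = llm_draft_reply(email + "\n(Remove any personal identifiers.)")
    return draft

print(handle_email("Hi, can you confirm my SSN on file?"))
```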

It really is. Is this the most philosophical? This is the cheapest I've ever had. Okay, okay. That's good. That's good. That's good. Yeah. We're talking about agents, you know? Yeah. I think the biggest thing that we're thinking a lot about is just the memory component. And we touched on it a little bit earlier in the, in the episode, but I think it's still like very unsolved.

I think clearly semantic memory and episodic memory are types of memory, but like where the boundaries are, are there other types, how to think about that? I think that to me is maybe one of the bigger unsolved things in terms of agents: just memory.

Like what, what does memory even mean? That's another top high value question. Is it a knowledge graph? I think that's one type of memory. Yeah. If you're using a knowledge graph as a hammer to hit a nail, it's, it's, it's not that. But I think, I think practically what we see is it's still so application specific, what relevant memory is.

And that also makes it really tough to answer generically, like what is memory? So like, it could be a knowledge graph. It could also be like, I don't know, a list of instructions that you just keep in a list. Yep. A meta point is I feel sometimes we underestimate some aspects where humans and agents are actually similar, and sometimes we overestimate the differences.

You know, I feel like one point that's shared by agents and humans is that we all have very different types of memories, right? Some people use Google Docs. Some people use Notion and some people use paper and pen. You can argue those are different types of long-term memories for people, right?

And each person develops their own way to maintain their long-term memory and diary or whatever. It's a very kind of individual kind of thing. And I feel like for agents, probably there's no single best solution. But what we can do is we can create as many good tools as possible, like the Google Docs or Notion equivalents of agent memory.

And we should just give the choice to the agent. Like, what do you want to use? And through learning, they should be able to come up with their own way to use the long-term memory. You know, or give the choice to the developer who's building the agents, because it might depend on the task.

Like I use, I think we want to control that one. Right now, I would agree with that for sure, because I think you need that level of control. I use linear for planning for code. I don't use that for my grocery list, right? Like depending on what I'm trying to do, I have different types of long form memory.

Maybe if you tried, you would have a gorgeous kitchen. Do you think our like tool making kind of progress is good or not good enough in terms of, you know, we have all sorts of different memory stores or retrieval methods or whatever? On the memory front in particular, I don't think it's very good.

I think there's a lot to still be done. What do you think are lacking? Yeah, you have a memory service. What's missing? The memory service we launched, I don't think really found product market fit. I think like, I mean, I think there's a bunch of different types of memory.

I'll probably write a blog. I mean, I have a blog that I published at some point on this, but I think like right off the bat, there's like procedural memory, which is like how you do things. I think this is basically episodic memory, like trajectories of correct things. But there's also, then I think a very different type is like personalization.

Like I like Italian food. It's kind of a semantic memory. That's kind of, maybe like a system prompt. Yeah, exactly. Exactly. Yeah. It could be a, it depends if it's semantic over like raw events or over reflections over events. Right. Again, semantic procedure, whatever, it's just like a categorization.

What really matters is the implementation, right? And so one of the things that we'll probably have released by the time this podcast comes out is, right now in LangGraph, LangGraph is very stateful. You define a state for your graph and basically a run of an agent operates on a thread.

It's very similar to threads in OpenAI's Assistants API, but you can define the state however you want. You can define whatever keys, whatever values you want. Right now, they're all persistent for a single thread. We're going to add the ability to persist that between threads. So then if you basically want to scope a memory to a user ID or to an assistant or to an organization, then you can do that.

And practically what that means is you can write to that channel whatever you want, and then that can be read in other threads. We're not making any kind of like claims around what the shape of memory is, right? You can write kind of like what you want there. I still think it's like so early on and we see people needing a lot of control over that.
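A very rough sketch of what scoping memory beyond a single thread could look like; this is not LangGraph's actual API, just a toy store keyed by (scope, key), where the scope could be a thread ID, a user ID, or an organization ID, so later threads for the same user can read earlier writes.

```python
from typing import Optional

class ScopedMemoryStore:
    """Toy key-value store where each value is scoped to a thread/user/org ID."""
    def __init__(self):
        self._store: dict[tuple[str, str], str] = {}

    def write(self, scope: str, key: str, value: str) -> None:
        self._store[(scope, key)] = value

    def read(self, scope: str, key: str) -> Optional[str]:
        return self._store.get((scope, key))

store = ScopedMemoryStore()
# One thread for user-42 learns a preference...
store.write(scope="user-42", key="food_preference", value="Italian")
# ...and a later, different thread for the same user can read it back.
print(store.read(scope="user-42", key="food_preference"))  # -> "Italian"
```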

And so I think this is our current best thought. This is what we're doing around memory at the moment. It's basically extending the state to beyond a thread level. I feel like there's a trade-off between you know, complexity and control, right? For example, like Notion is more complex than Google Docs, but if you use it well, then it gives you more capability, right?

And it's like different tools might suit different applications or scenarios or whatever. Yeah. We should make more good tools, I guess. My quick take is when I started writing about the AI engineer, this was kind of vaguely in my head, but like this is basically the job. Everything outside the LLM is the AI engineer's job, the stuff that the researcher is not going to do.

This basically maps to LLMOS. I would add in the code interpreter, the browser and the other stuff, but yeah, this is mostly it. Yeah, those are the I mean, those are the tools, yeah. Those are the external environment, which is a small box at the bottom. So then having this like reasonable level of confidence, like I know what things are, then I want to break it.

I want to be like, okay, like what's coming up that's going to blindside me completely. And it really is maybe like omni-model where like everything in, everything out. And like, does that change anything? Like if you scale up models like a hundred times more, does that change anything? That's actually a great, great question.

I think that's actually the last paragraph of the paper that's talking about this. I also got asked this question when I was interviewing with OpenAI. Please tell us how to pass OpenAI interviews. Is any of this still true if, you know, if you 100x everything, if we make the model much better?

My longer answer to this, you should just refer to the last paragraph of the paper, which is like a more prepared, longer answer. I think the short answer is understanding is always better. It's like a way of understanding things. Like the thought experiment that I write at the end of the paper is, imagine you have GPT-10, which is really good.

Like it doesn't even need a chain of thought, right? Just input, output, just stick to stick, right? It doesn't even need to do browsing or whatever, or maybe it still needs some tools, but let's say like, it's really powerful. Like then I think even at that point, I think something like CoALA is still useful if we want to do some neuroscience on GPT-10.

It's kind of like doing neuroscience on humans, right? Could the model actually be inspectable? Yeah. Like you want to inspect what is the episodic memory, what is the decision-making module, what is the... It's kind of like dissecting the human brain, right? And you need some kind of prior kind of framework to help you do this kind of discovery.

Cool. Just one thing I want to highlight from your work, we don't have to go into it, is TauBench. Oh yeah, we should definitely cover this. Yeah. I'm a big fan of simulative AI. We had a summer of simulative AI. Another term we're trying to coin that hasn't stuck, but I'm going to keep at it.

I'm really glad you covered my zero citation work. I'm really happy. No, zero citation work. Now it's one, now it's one. First citation. It's me, it's me right now. We just cited it here, so that counts. It's like one citation. Does it show on Google? We'll write a paper about this episode.

One citation, one citation. Let's go. Last time I checked, it's still zero. It's awesome. Okay. This one was funny because you have agents interact with an LLM-simulated person. So it's like actually just another agent, right? So it's like agents simulating with other agents. This has always been my thing with startups doing agents.
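A toy sketch of that agent-talks-to-simulated-user setup; this is not the TauBench code, and the chat_model function and the message format are placeholders. One LLM plays the customer with a hidden goal, another plays the customer-service agent following a policy, and the loop alternates until a turn limit is hit.

```python
def chat_model(system_prompt: str, history: list[str]) -> str:
    """Placeholder for an LLM call; returns a canned reply for illustration."""
    return "(model reply given: " + system_prompt[:30] + "...)"

def run_episode(scenario: str, policy: str, max_turns: int = 4) -> list[str]:
    user_system = f"You are a customer. Your hidden goal: {scenario}. Reply in one sentence."
    agent_system = f"You are a support agent. Follow this policy: {policy}."
    transcript: list[str] = []
    for _ in range(max_turns):
        user_msg = chat_model(user_system, transcript)
        transcript.append(f"user: {user_msg}")
        agent_msg = chat_model(agent_system, transcript)
        transcript.append(f"agent: {agent_msg}")
    return transcript

for line in run_episode("change flight to the second-cheapest option", "verify identity before any change"):
    print(line)
```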

I'm like, one day there's going to be training grounds for companies to train agents that they hire. Actually, Singapore is the first country to build the cyber range for cyber attack training. And I think you'll see more of that. So what was the inspiration there? Most of these models are bad at it, which is great.

You know, we have some room for... I think the best model is GPT-4o at like 48% average. So there's a lot of room to go. Yeah. Any fun stories from there, directions that you hope that people take? Yeah. First, I think shout out to Sierra, which is a very good startup, which was founded by Bret Taylor and Clay Bavor.

And Sierra is a startup doing conversational AI. So what they do is they they build agents for businesses. Like suppose you have a business and you have a customer service, we want to automate that part. And then it becomes very interesting because it's very different from coding or web agent or whatever people are doing, because it's more about how can you do simple things reliably.

It's not about, you know, can you sample a hundred times and find one good math proof or killer solution. It's more about: you chat with a hundred different users on very simple things. Can you be robust and solve them like 99% of the time? Right. And then we found there's no really good benchmark around this.

So that's one thing. I guess another thing is obviously this kind of customer service kind of domain. Previously, there were some benchmarks, but they all have their limitations. And I think you want the task to be kind of hard and you want the user simulation to be real. We didn't have that until LLMs.

So data sets from 10 years ago either just have trajectories of conversations with humans, or they have very fake kind of simulators. I think right now is a good opportunity: if you really just care about this task of customer service, then it's a good opportunity because now you have LLMs to simulate humans.

But I think a more general motivation is we don't have enough agent benchmarks that target this kind of robustness, reliability kind of standpoint. It's more about, you know, code or web. So this is a very good addition to the landscape. If you have a model that can simulate the persona, like the user, the right way, shouldn't the model also be able to accomplish the task, right?

If it has the knowledge of what the person will want, then it means... This is a great question. I think it really stems from the asymmetry of information, right? Because if you think about the customer service agent, it has information that you cannot access, right? Like the APIs it could call or, you know, the internal company policies, whatever.

And what I think is very interesting for TauBench is that it's kind of okay for the user to be kind of stupid. So you can imagine there are failure cases, right? But I think in our case, as long as the user specifies the need very clearly, then it's up to the agent to figure out, for example, what is the second cheapest flight from this to that under that constraint, with very complicated reasoning involved.

Like we shouldn't require users to be able to solve those things. They should just be able to clearly express their need. But then if the task failed, then it's up to the agent. That makes the evaluation much easier. Awesome. Anything else? I have one last question for Harrison, actually.

Oh, no, that's not this podcast. You can't do it. I mean, there are a lot of questions around AI right now, but I feel like perhaps the biggest question is application. Because if we have great application, we have super app, whatever, that keeps the whole thing going, right? Obviously, we have problems with infra, with chip, with transformer, with whatever, S4, a lot of stuff.

But I do think the biggest question is application. And I'm curious, like, from your perspective, like, is there any things that are actually already kind of working, but people don't know enough? Or like, is there any like promising application that you're seeing so far? Okay, so I think one big area where there's clearly been success is in customer support, both companies doing that as a service, but also larger enterprises doing that and building that functionality in inside, right?

There's a bunch of people doing coding stuff. We've already talked about that. I think that's a little bit, I wouldn't say that's a success yet, but there's a lot of excitement and stuff there. One thing that I've seen more of recently, I guess the general category would be like research style agents, specific things recently would be like, I've seen a few like AISDR companies pop up, where they basically do some data enrichment, they get a company name, they go out, find like, you know, funding.

What is SDR? Sales Development Rep. It's an entry level job title in B2B SaaS. Yeah. So I don't know, I know. The PhD mind cannot comprehend. And so I'd classify that under the general area of kind of like researchy style agents, I think like legal falls in this as well.

I think legal is, yeah, a pretty good domain for this. I wonder how well Harvey is doing. There was some debate, but they raised a lot of money. So who knows? I'd say those are a few of the categories that jump to mind. Like entry-type kind of research.

On the topic of applications though, the thing that I think is most interesting in this space right now is probably all the UXs around these apps and the different things besides chat that might come out. I think two that I'm really interested in, one for the idea of this AISDR.

I've seen a bunch of them do it in kind of like a spreadsheet-style view where you have like, you know, 10 different companies or hundreds of different companies and five different attributes you want to look up, and then each cell is an agent. And I guess the good thing about this is like, you can already use the first couple of rows of the spreadsheet as a few-shot example or whatever.

There's so many good things about it. Yeah. You can, you can test it out on a few. It's a great way for humans to run things in batch, which I don't, it's a great interface for that. It's still kind of elusive to do this kind of like PhD kind of research, but I think those kind of entry type research where it's more repetitive and it should be automated.

And then the other UX I'm really, really interested in is kind of like when you have agents running in the background, how can they, like ambient-style agents, how can they reach out to you? So I think like as an example of this, I have an email assistant, um, that runs in the background, it triages all my emails and it tries to respond to them.

And then when it needs my input, like, hey, do you want to do this podcast? It reaches out to me. It sends me a message. Oh, you actually have it. It is live. Yeah, yeah. I use it for all my emails.

Thank you, agent. Well, we did Twitter. I don't have that. Did you write it with LangChain? Yeah. I will open source it at some point. LangGraph or LangChain? Yeah. I want both. Yeah. Both. So at this point, LangGraph for the orchestration, LangChain for the integrations with the different models.

I'm curious how the low code kind of direction is going right now. Are people... we talked about this. Oh, yeah. It's not low code. LangGraph is not low code. So you can cut this. No, no, no, no, no. People will tune in just for this. Well, it actually has to do with UXs as well.

So it probably comes back to this idea of, I think what it means to build with AI is changing. Like I still really, really strongly believe that developers will be a core kind of like part of this, largely because we see like, you need a lot of control over these agents to get them to work reliably.

But there's also very clearly components that you don't need to be a developer for; prompting is kind of like the most obvious one. With LangGraph, one of the things that we added recently was LangGraph Studio.

So it's, we called it kind of like an IDE for agents. You point it to your code file where you have your graph defined in code. It spins up a representation of the graph. You can interact with it there. You can test it out. We hooked it up to kind of like a persistence layer.

So you can do time travel stuff, which I think is another really cool UX that I first saw in Devin and was, yeah. Devin's time travel is good. The UX for Devin in general, I think you said it, but that was the novel part. That was the best part. But to the low code, no code part, the way that I think about it is you probably want to have your cognitive architecture defined in code.

A decision making procedure. Yes. But then there's parts within that that are prompts or maybe configuration options, like something to do with RAG or something like that. We've seen that be a popular configuration option. So is it useful for programmers more or is it for like people who cannot program?

I guess if you cannot program, it's still very complicated for them. It's useful for both. I think like we see it being useful for developers right now, but then we also see like there's often teams building this, right? It's not one person. And so I think there's this handoff where the engineer might define the cognitive architecture.

They might do some initial prompt engineering. It's easier to communicate to the product manager. It's easier to show them what's going on and it's easier to let them control it and maybe they're doing the prompting. And so, yeah, I think what the TLDR is like what it means to build is changing.

And also like UX is UX in general is interesting, whether it's for how to build these agents or for how to use them as end consumers. And there might also be overlap as well. And it's so early on and no one knows anything. But I think UX is one of the most exciting spaces to be innovating in right now.

Let's do ACI. Yeah. Okay. Yeah. That's another theme that we cover on the pod. We had the first AI UX meetup and we're trying to get that going. It's not a job. It's just people just tinkering. Well, thank you guys so much for indulging us. Yeah, that was amazing.

Yeah, thank you. Harrison, you're amazing as a co-host. We'd love to have you back. Like, that was awesome. I just try to listen to you guys for inspiration and stuff. It's actually really scary to have you as a listener because I don't want to misrepresent. Like, I talk about 100 companies, right?

And God forbid I get one of them wrong and, you know. I'm sure all of them listen as well. Not to add pressure. Yeah, thank you so much. It's a pleasure to have you on. And you had one of the most impactful PhDs in this sort of AI wave.

So I don't know how you do it, but I'm excited to see what you do at OpenAI. Thank you. Thank you.